Multi-mode Unrolled Architectures for Polar Decoders by Giard, Pascal et al.
Pascal Giard, Gabi Sarkis, Claude Thibeault, and Warren J. Gross
Abstract—In this work, we present a family of architec-
tures for polar decoders using a reduced-complexity successive-
cancellation decoding algorithm that employs unrolling to achieve
extremely high throughput values while retaining moderate
implementation complexity. The resulting fully-unrolled, deeply-
pipelined architecture is capable of achieving a coded throughput
in excess of 1 Tbps on a 65 nm ASIC at 500 MHz—three orders of
magnitude greater than current state-of-the-art polar decoders.
However, unrolled decoders are built for a specific, fixed code.
Therefore we also present a new method to enable the use of
multiple code lengths and rates in a fully-unrolled polar decoder
architecture. This method leads to a length- and rate-flexible
decoder while retaining the very high speed typical of unrolled
decoders. The resulting decoders can decode a master polar code
of a given rate and length, and several shorter codes of different
rates and lengths. We present results for two versions of a multi-
mode decoder supporting eight and ten different polar codes,
respectively. Both are capable of a peak throughput of 25.6 Gbps.
For each decoder, the energy efficiency for the longest supported
polar code is shown to be 14.8 pJ/bit at 250 MHz and
8.8 pJ/bit at 500 MHz.
Index Terms—polar codes, ASIC, high throughput, multi-
mode, unrolled architecture
I. Introduction
Polar codes are gathering a lot of attention lately. They are error-correcting codes with an explicit construction
that provably achieve the symmetric capacity of memoryless
channels with a low-complexity decoding algorithm: succes-
sive cancellation (SC) [1]. As SC proceeds bit-by-bit, hardware
implementations suffered from low throughput and high la-
tency [2]–[5]. To overcome this, modified SC-based algorithms
were proposed [6]–[10]. The first hardware implementation
with a throughput greater than 1 Gbps was presented in [9].
In [11], a fully-unrolled deeply-pipelined hardware architec-
ture for polar decoders was proposed. Results showed a very
high throughput, greater than 200 Gbps on FPGA. However,
these architectures are built for a fixed polar code i.e. the
code length or rate cannot be configured after designing
the decoder. This is a major drawback for most modern
wireless communication applications that largely benefit from
the support of multiple code lengths and rates. Furthermore,
a deeply-pipelined architecture causes the area to grow very
fast with the frame size.
P. Giard, G. Sarkis and W. J. Gross are with the Department of Electrical
and Computer Engineering, McGill University, Montréal, Québec, Canada (e-mail:
{pascal.giard,gabi.sarkis}@mail.mcgill.ca, warren.gross@mcgill.ca).
C. Thibeault is with the Department of Electrical Engineering,
École de technologie supérieure, Montréal, Québec, Canada (e-mail:
claude.thibeault@etsmtl.ca).
The goal of this paper is twofold. First, it is to generalize the
unrolled architecture presented in [11] into a family of architectures
offering a flexible trade-off between throughput, area
and energy efficiency. The (1024, 512) fully-unrolled deeply-pipelined
polar decoder implementation of [11] is significantly
improved on all metrics. Second and most importantly, it
is to show how an unrolled decoder built specifically for
a polar code, of fixed length and rate, can be transformed
into a multi-mode decoder supporting many codes of various
lengths and rates. More specifically, we show how decoders
for moderate-length polar codes contain decoders for many
other shorter—but practical—polar codes of both high and low
rates. The required hardware modifications are detailed, and
ASIC synthesis and power estimations are provided for the
65 nm CMOS technology from TSMC. Results show a peak
information throughput greater than 15 Gbps at 250 MHz in
4.29 mm² or greater than 20 Gbps at 500 MHz in 1.71 mm².
Latency is 2 µs for the former and 650 ns for the latter.
The remainder of this paper starts with Section II by
briefly reviewing polar codes, their construction and their
representation. Section III provides the necessary background
on the Fast Simplified Successive-Cancellation (Fast-SSC)
decoding algorithm. Section IV describes the proposed family
of unrolled hardware architectures. The concept, hardware
modifications and other practical considerations related to the
proposed multi-mode decoder are presented in Section V.
Error-correction performance and implementation results for
both dedicated and multi-mode decoders are provided in
Section VI. Comparison against the fastest state-of-the-art
polar decoder implementations in the literature is carried out
in Section VI as well. Finally, a conclusion is drawn in
Section VII.
II. Polar Codes
A. Construction
Polar codes exploit the channel polarization phenomenon
by which the probability of correctly estimating codeword
bits tends to either 1 (completely reliable) or 0.5 (completely
unreliable). These probabilities get closer to their limit as the
code length increases when a recursive construction such as
the one shown in Fig. 1 is used, where ⊕ represents a modulo-
2 addition (XOR). Under successive-cancellation decoding,
polar codes were shown to achieve the symmetric capacity
of memoryless channels as their code length N → ∞ [1].
An (N, k) polar code has length N, carries k information
bits and is of rate R = k/N. The other N−k bits—frozen bits—
are set to a predetermined value—usually zero—during the
encoding process. The grayed ui’s where i ∈ {0, 1, 2, 4} on the
left hand side of Fig. 1 correspond to frozen bit locations of
a (16, 12) polar code.
arXiv:1505.01459v2 [cs.AR] 11 Jul 2016
Fig. 1: Graph representation of a (16, 12) polar code.
Depending on the type of channel and its conditions, the op-
timal location of the frozen bits varies and can be determined
using the method described in [12] for example.
Encoding schemes for polar codes can be either non-
systematic, as shown in Fig. 1, or systematic as discussed in
[13]. Systematic polar codes offer a better bit-error rate (BER)
than their non-systematic counterparts while maintaining the
same frame-error rate (FER). A low-complexity systematic
encoding method was presented in [9] and proven to be correct
in [14]. In this work, we use systematic polar codes.
Both encoding types use the same generator matrix, and as
this matrix is built recursively, so are polar codes i.e. a code
of length N is the concatenation of two codes of length N/2.
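The recursive structure can be illustrated with a minimal Python sketch. It assumes non-systematic encoding and the natural (non-bit-reversed) ordering of Fig. 1; the function name `polar_encode` is illustrative and does not come from the paper.

```python
def polar_encode(u):
    """Non-systematic polar encoding of the bit list u (length a power of 2).

    Assumption: natural bit ordering, so that a length-N codeword is
    [x_left XOR x_right, x_right], where x_left and x_right are the
    length-N/2 codewords obtained from the two halves of u.
    """
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    x_left = polar_encode(u[:half])
    x_right = polar_encode(u[half:])
    return [a ^ b for a, b in zip(x_left, x_right)] + x_right


# (16, 12) code of Fig. 1: u0, u1, u2 and u4 are frozen to zero.
frozen = {0, 1, 2, 4}
info_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # arbitrary example payload
payload = iter(info_bits)
u = [0 if i in frozen else next(payload) for i in range(16)]
x = polar_encode(u)
```

The two recursive calls correspond to the two constituent codes of length N/2, and the final XOR stage to their concatenation.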
B. Representation
Fig. 1 shows the graph representation of a (16, 12) polar
code where the blue-dashed-circled v represents a concatena-
tion of two codes of length 4, a (4, 1) polar code with a (4, 3)
one, yielding an (8, 4) polar code.
As polar codes are built recursively, it was proposed in
[6] to represent them as binary trees. Fig. 2a illustrates
such a representation, called decoder tree, equivalent to the
graph of Fig. 1. In the decoder tree, white and black leaves
represent frozen and information bits, respectively. Leaf nodes
correspond to individual bits denoted ui, where 0 ≤ i < N,
and where the largest position index i is on the right hand
side of the tree. Moving up in the decoder tree corresponds
to the concatenation of constituent codes. For example, the
concatenation operation circled in blue in Fig. 1 corresponds
to the node labeled v in Fig. 2a.
The left-hand-side (LHS) and right-hand-side (RHS) sub-
trees rooted in the top node are polar codes of length N/2. In
the remainder of this paper, we designate the polar code, of
length N, decoded by traversing the whole decoder tree as the
master code and the various codes of lengths smaller than N
as constituent codes.
By definition, and like the master code, a constituent code
of length N/2 is in turn the concatenation of two polar codes
of length N/4, and so on until the leaf nodes are reached. As
such, the decoding of a polar code of length N can be seen
as the decoding of two constituent codes of length N/2, or of
four constituent codes of length N/4, etc. For example, and as
shown in the graph representation of Fig. 1, but better seen
in the decoder tree representation of Fig. 2a, a master code
of length 16 is the concatenation of two constituent codes of
length 8, or of four constituent codes of length 4, or of eight
constituent codes of length 2.
Fig. 2: Decoder trees for SC (a) and Fast-SSC (b) decoding
of a (16, 12) polar code.
It should be noted that sibling constituent codes with the
same parent node share a special relation. Let us consider
the polar code (constituent code) of length Nv = 8 taking
root in v as illustrated in Fig. 2a, as the concatenation of two
constituent codes of length Nv/2 = 4. As that polar code gets
decoded, the estimated bits βl from its LHS constituent code
are required to compute the soft inputs αr required to decode
its RHS constituent code. Furthermore, once the estimated bits
βr are obtained by decoding the RHS constituent code, they
are combined with βl to form the bit-estimate vector βv for v.
III. The Fast-SSC Decoding Algorithm
As mentioned above, a polar code is the concatenation of
smaller constituent codes. Instead of using the SC algorithm
on all constituent codes, the location of the frozen bits can
be taken into account to use more efficient, lower complexity
algorithms on some of these constituent codes [6], [9].
Fig. 2b shows the decoder tree equivalent to Fig. 2a, but
when key parts of the Fast-SSC decoding algorithm [9] are
used. The black node represents a rate-1 constituent code
i.e. a polar code entirely composed of information bits. The
green striped and orange cross-hatched nodes are repetition
and single-parity-check (SPC) constituent codes, respectively.
Gray nodes are codes of rate 0 < R < 1. It can be seen that
Fast-SSC visits fewer nodes in the decoder tree, significantly
decreasing the latency and increasing the throughput. It pro-
vides the same codeword estimates as SC though, hence offers
the same error-correction performance.
While the proposed multi-mode unrolled decoders are in-
dependent of the decoding algorithm, we briefly go over the
decoding operations mentioned in this paper.
Decoding Operations
Three functions are inherited from the original SC algo-
rithm and log-likelihood ratios (LLRs) are used for the soft
messages. Going down a left edge—colored blue in Fig. 2—,
αl is calculated with the min-sum approximation [3]
$$\alpha_l[i] = \operatorname{sgn}(\alpha_v[i] \cdot \alpha_v[i + N_v/2]) \min(|\alpha_v[i]|, |\alpha_v[i + N_v/2]|), \tag{1}$$
for 0 ≤ i < Nv/2, where αv is the input to the node and Nv
the width of αv. Going down a right edge (colored red in
Fig. 2), αr is calculated with
$$\alpha_r[i] = \begin{cases} \alpha_v[i + N_v/2] + \alpha_v[i], & \text{when } \beta_l[i] = 0; \\ \alpha_v[i + N_v/2] - \alpha_v[i], & \text{otherwise,} \end{cases} \tag{2}$$
for 0 ≤ i < Nv/2, where βl is the bit estimate from the LHS
child.
Once a leaf node is reached, the bit estimate is set to zero
when it corresponds to a frozen bit location. Otherwise, it
is calculated by threshold detection on αv. Going back up a
RHS edge, the bit estimates from both children are combined
to generate the node’s bit-estimate vector
$$\beta_v[i] = \begin{cases} \beta_l[i] \oplus \beta_r[i], & \text{when } i < N_v/2; \\ \beta_r[i - N_v/2], & \text{when } N_v/2 \le i < N_v, \end{cases} \tag{3}$$
where ⊕ is modulo-2 addition (XOR).
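As an illustration, the three functions (1), (2) and (3) can be sketched in Python as follows. The function names are illustrative, LLRs are plain Python numbers rather than fixed-point values, and a zero product in (1) is treated as positive, which does not change the result since the minimum magnitude is then zero.

```python
def f_min_sum(alpha_v):
    """Eq. (1): LLRs passed to the left child (min-sum approximation)."""
    half = len(alpha_v) // 2
    alpha_l = []
    for a, b in zip(alpha_v[:half], alpha_v[half:]):
        sign = -1 if (a < 0) != (b < 0) else 1
        alpha_l.append(sign * min(abs(a), abs(b)))
    return alpha_l


def g(alpha_v, beta_l):
    """Eq. (2): LLRs passed to the right child, given the left bit estimates."""
    half = len(alpha_v) // 2
    return [b + a if bit == 0 else b - a
            for a, b, bit in zip(alpha_v[:half], alpha_v[half:], beta_l)]


def combine(beta_l, beta_r):
    """Eq. (3): bit-estimate vector of the parent node."""
    return [l ^ r for l, r in zip(beta_l, beta_r)] + list(beta_r)


alpha_v = [2, -3, -1, 4]
alpha_l = f_min_sum(alpha_v)     # inputs of the left child
alpha_r = g(alpha_v, [0, 1])     # inputs of the right child, for beta_l = [0, 1]
beta_v = combine([0, 1], [1, 1])
```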
In [6], the Simplified SC (SSC) algorithm is introduced
where decoder tree nodes are split into three categories: Rate-
0, Rate-1, and Rate-R nodes.
1) Rate-0 Nodes: are subtrees whose leaf nodes all cor-
respond to frozen bits. We do not need to use a decoding
algorithm on such a subtree as the exact decision, by definition,
is always the all-zero vector.
2) Rate-1 Nodes: are subtrees where all leaf nodes carry
information bits, none are frozen. The maximum-likelihood
decoding rule for these nodes is to take a hard decision on the
input LLRs:
$$\beta_v[i] = \begin{cases} 0, & \text{when } \alpha_v[i] \ge 0; \\ 1, & \text{otherwise,} \end{cases} \tag{4}$$
for 0 ≤ i < Nv. With a fixed-point representation, this
operation amounts to copying the most significant bit of the
input LLRs.
3) Rate-R Nodes: Lastly, Rate-R nodes, where 0 < R < 1,
are subtrees such that leaf nodes are a mix of information
and frozen bits. As shown in [9], instead of always using the
SC or SSC algorithm, some Rate-R nodes corresponding to
specific frozen-bit locations can be decoded using algorithms
with lower complexity and latency. The subset of nodes and
operations from [9] used in our proposed family of architec-
tures are briefly reviewed in the following.
4) F, G and G0R Operations: The F and G operations are
among the functions used in the conventional SC decoding
algorithm and are calculated using (1) and (2), respectively.
G0R is a special case of the G operation where the left child
is a frozen node i.e. βl is known a priori to be the all-zero
vector of length Nv/2.
5) Combine and C0R Operations: As defined by (3), the
Combine operation generates the bit estimate vector. A C0R
operation is a special case of the Combine operation where
the LHS constituent code is a Rate-0 node, i.e. βl is the all-zero vector.
6) Repetition Node: In this node, all leaf nodes are frozen
bits, with the exception of the node that corresponds to the
most RHS leaf in a tree. At encoding time, the only informa-
tion bit gets repeated over the Nv outputs. The information bit
can be estimated by using threshold detection over the sum of
the input LLRs αv:
$$\beta_v = \begin{cases} 0, & \text{when } \left(\sum_{i=0}^{N_v-1} \alpha_v[i]\right) \ge 0; \\ 1, & \text{otherwise,} \end{cases}$$
where βv gets replicated Nv times to create the bit-estimate
vector.
7) Single-parity-check (SPC) Node: An SPC node is a node
such that all leaf nodes are information bits with the exception
of the node at the least significant position (LHS leaf in a tree).
To decode an SPC code, we start by calculating the parity of
the input LLRs:
$$\text{parity} = \bigoplus_{i=0}^{N_v-1} \beta_v[i], \quad \text{where } \beta_v[i] = \begin{cases} 0, & \text{when } \alpha_v[i] \ge 0; \\ 1, & \text{otherwise.} \end{cases}$$
The estimated bit vector is then generated by reusing the
calculated βv above unless the parity constraint is not satisfied
i.e. is different than zero. In that case, the estimated bit
corresponding to the input with the smallest LLR magnitude
is flipped:
$$\beta_v[i] = \beta_v[i] \oplus 1, \quad \text{where } i = \arg\min_j (|\alpha_v[j]|).$$
Our proposed decoders borrow from the Fast-SSC algorithm
in that they use the specialized nodes and operations described
above to reduce the decoding latency. However, the family of
architectures we propose greatly differs from the processor-
like architecture of [9]. Moreover, [9] proposes hybrid node
types combining the ones above in order to further reduce the
decoding latency. With the exception of the RepSPC node—a
specialized node decoding a Repetition code concatenated with
an SPC code—that is used in one of the implementations, we
do not use those hybrid nodes in this paper.
IV. Unrolled Architectures
In an unrolled decoder, each and every operation required
is instantiated so that data can flow through the decoder with
minimal control.
The idea of fully unrolling a decoder has previously been
applied to decoders for other families of error-correcting
codes. Notably, in [15], [16], the authors propose a fully-
unrolled deeply-pipelined decoder for an LDPC code. Polar
codes are more suitable to unrolling as they do not feature a
complex interleaver like LDPC codes.
A. Deeply Pipelined
In a deeply-pipelined architecture, a new frame is loaded
into the decoder at every clock cycle. Therefore, a new
estimated codeword is output at each clock cycle as each
register is active at each rising edge of the clock (no enable
signal required). In that architecture, at any point in time,
there are as many frames being decoded as there are pipeline
stages. This leads to a very high throughput at the cost of
high memory requirements. Some pipeline stage paths do not
contain any processing logic, only memory. They are added to
ensure that the different messages remain synchronized. These
added memories yield register chains, or SRAM blocks.
Fig. 3: Fully-unrolled deeply-pipelined decoder for a (8, 4)
polar code. Clock signals omitted for clarity.
Fig. 3 shows a fully-unrolled and deeply-pipelined decoder
for a (8, 4) polar code. The α and β blocks illustrated in light
blue are registers storing LLRs or bit estimates, respectively.
White blocks are the functions described in Section III and
dotted registers are regular registers but will be referred to
in the next section. Among the registers, two are needed to
retain the channel LLRs, denoted αc in the figure, during the
2nd and 3rd clock cycles. Similarly, two registers have to be
added for the persistence of the hard-decision vector β1 over
the 4th and 5th clock cycles. Such unrolled architectures for
polar decoders were described in [11].
The information throughput can be defined as P f R bps,
where P is the width of the output bus in bits, f is the
execution frequency in Hz and R is the code rate. In this
paper, P is assumed to be equal to the code length N. The
decoding latency depends on the frozen bit locations and the
constrained maximum width for all processing nodes, but is
less than N log2 N. In our experiments, with the operations
and optimizations described below, the decoding latency never
exceeded N/2 clock cycles.
B. Partially Pipelined
In a deeply-pipelined architecture, a significant amount of
memory is required for data persistence. That memory quickly
increases with the code length N. Instead of loading a new
frame into the decoder and estimating a new codeword at every
cycle, we propose a compromise where the unrolled decoder
can be partially pipelined to reduce the required memory. Let
I be the initiation interval, where a new estimated codeword is
output every I clock cycles. The case where I = 1 translates
to a deeply-pipelined architecture. We note that the interval
only affects the memory, not the computational elements, in
the decoder.
Setting I > 1 leads to a significant reduction in the memory
requirements. An initiation interval of I translates to an
effective required register chain length of ⌈L/I⌉ instead of L,
where L is the length of the register chain. Using I = 2 leads
to a ∼50% reduction in the amount of memory required for
that section of the circuit. This reduction applies to all register
chains present in the decoder. A partially-pipelined decoder
with I = 2 can be obtained for a (8, 4) polar code by removing
the dotted registers in Fig. 3, leading to the decoder of Fig. 4.
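The effect of the initiation interval on a single register chain can be checked with a short Python sketch. The function name is illustrative, and the chain lengths used below are arbitrary examples rather than values taken from a specific decoder.

```python
def reduced_chain_length(chain_length, initiation_interval):
    """Registers needed for a chain of L stages with interval I: ceil(L / I)."""
    # Integer ceiling division, avoids floating-point rounding.
    return -(-chain_length // initiation_interval)


for length in (4, 5, 9):
    deeply = reduced_chain_length(length, 1)     # I = 1, deeply pipelined
    partially = reduced_chain_length(length, 2)  # I = 2, partially pipelined
    print(length, deeply, partially)
```

For odd chain lengths the reduction with I = 2 is slightly less than one half, which is why the saving is stated as approximate.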
Fig. 4: Fully-unrolled partially-pipelined decoder for a (8, 4)
polar code with I = 2. Clock signals omitted for clarity.
The initiation interval I can be increased further in order
to reduce the memory requirements, but only up to a certain
limit. We call that limit the maximum initiation interval Imax,
and its value depends on the decoder tree. By definition, the
longest register chain in a fully-unrolled decoder is used to
preserve the channel LLRs αc. Hence, the maximum initiation
interval corresponds to the number of clock cycles required for
the decoder to reach GN, the last operation in the decoder tree that
requires αc, i.e. the operation calculated when going down the
right edge linking the root node to its right-hand-side child.
Once that GN operation is completed, αc is no longer needed
and can be overwritten. As an example, consider the (8, 4)
polar decoder illustrated in Fig. 4. As soon as the switch to
the right-hand side of the decoder tree occurs, i.e. when G is
traversed, the register containing the channel LLRs αc can be
updated with the LLRs for the new frame without affecting the
remaining operations for the current frame. Thus the maximum
initiation interval, Imax, for that decoder is 3.
The resulting coded and information throughput are
$$T_C = \frac{N \cdot f}{I} \quad \text{and} \quad T_I = \frac{N \cdot f \cdot R}{I}, \tag{5}$$
respectively, where I is the initiation interval. Note that this
new definition can also be used for the deeply-pipelined archi-
tecture. The decoding latency remains unchanged compared to
the deeply-pipelined architecture.
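As a worked example of (5), the following Python sketch evaluates both throughputs. The values N = 1024, I = 20 and the (1024, 853) master code appear elsewhere in this paper; pairing them with a 500 MHz clock is an assumption of this sketch, and the function names are illustrative.

```python
def coded_throughput(n, f_hz, interval=1):
    """T_C = N * f / I, in bits per second."""
    return n * f_hz / interval


def information_throughput(n, k, f_hz, interval=1):
    """T_I = N * f * R / I with R = k / N, in bits per second."""
    return coded_throughput(n, f_hz, interval) * k / n


t_c = coded_throughput(1024, 500e6, interval=20)
t_i = information_throughput(1024, 853, 500e6, interval=20)
print(f"T_C = {t_c / 1e9:.3f} Gbps, T_I = {t_i / 1e9:.3f} Gbps")
```

Under these assumptions, the coded throughput evaluates to 25.6 Gbps, which matches the peak throughput quoted in the abstract, and the information throughput to about 21.3 Gbps.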
Fig. 5 shows a fully-unrolled partially-pipelined decoder
with an initiation interval I = 2 for the (16, 12) polar code of
Fig. 2b. Some control and routing logic was added to make
it multi-mode as detailed in the next section. The “&” blocks
are bit-vector joining operators.
The partially-pipelined architecture requires a more elab-
orate controller than the deeply-pipelined architecture. For
both fully- and partially-pipelined architectures, the controller
generates a done signal to indicate that a new estimated
codeword is available at the output. For the partially-pipelined
architecture, the controller also contains a counter with a maximum
value of (I − 1), which generates the I enable signals
for the registers. Each enable signal is asserted only when the
counter reaches its associated value in [0, I − 1]; otherwise, it remains
deasserted. Each register uses an enable signal corresponding
to its location in the pipeline modulo I. As an example, let
us consider the decoder of Fig. 5, i.e. I is set to 2. In that
example, two enable signals are created and a simple counter
alternates between 0 and 1. The registers storing the channel
LLRs αc are enabled when the counter is equal to 0 because
their input resides on the even (0, 2, 4 and 6) stages of the
pipeline. On the other hand, the two registers holding the α1
LLRs are enabled when the counter is equal to 1 because their
inputs are on odd (1 and 3) stages. The other registers follow
the same rule.
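The enable-signal rule can be summarized by a short Python sketch. It only models which pipeline stages are enabled for a given counter value; the function names and the stage range are illustrative assumptions.

```python
def enable_index(stage, interval):
    """Counter value at which a register whose input is on `stage` is enabled."""
    return stage % interval


def enabled_stages(counter, stages, interval):
    """Stages whose registers capture data when the counter equals `counter`."""
    return [s for s in stages if enable_index(s, interval) == counter]


interval = 2
for cycle in range(4):
    counter = cycle % interval  # simple counter alternating between 0 and 1
    print(cycle, counter, enabled_stages(counter, range(8), interval))
```

With I = 2, the registers fed from even stages are enabled when the counter is 0 and those fed from odd stages when it is 1, as in the example above.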
The required memory resources could be further reduced
by performing the decoding operations in a combinational
manner, i.e. by removing all the registers except the ones
labeled αc and βc, as in [17]. However, the resulting reachable
frequency is too low for the desired throughput level.
Fig. 5: Unrolled partially-pipelined decoder for a (16, 12) polar code with initiation interval I = 2. Clock, flip-flop enable and
multiplexer select signals are omitted for clarity.
C. Replacing Register Chains with SRAM Blocks
As the code length N grows, long register chains start to
appear in the decoder, especially with a smaller I. In order
to reduce the number of registers required, register chains can
be converted into SRAM blocks.
Consider the register chain of length 4 used for the persistence
of the channel LLRs αc in the fully-unrolled partially-pipelined
(16, 12) decoder shown in the top row of Fig. 5. Preserving
the first register, the remaining 3 registers in that chain
can be replaced by a dual-port SRAM block with a width of
16Q bits, where Q is the number of quantization bits, and a depth of 3,
along with a controller to generate the appropriate read and
write addresses. Similar to a circular buffer, if the addresses
are generated to increase every clock cycle, the write address
is set to be one position ahead of the read address.
SRAM blocks can replace register chains in a deeply-pipelined
architecture as well. In both architectures, the SRAM
block depth has to be equal to or greater than the register chain
length minus one.
V. Multi-mode Unrolled Decoders
It can be noted that an unrolled decoder for a polar code
of length N is composed of unrolled decoders for two polar
codes of length N/2, which are each composed of unrolled
decoders for two polar codes of length N/4, and so on. Thus,
by adding some control and routing logic, it is possible to
directly feed and read data from the unrolled decoders for
constituent codes of length smaller than N. The end result is a
multi-mode decoder supporting frames of various lengths and
code rates.
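To make this concrete, the constituent codes contained in a master code can be listed directly from its frozen-bit locations. The Python sketch below does so for the (16, 12) code of Fig. 1; the helper name `constituent_codes` and the boolean mask representation are assumptions made for illustration.

```python
def constituent_codes(frozen_mask, length):
    """List (start, length, k) for every constituent code of the given length.

    frozen_mask[i] is True when u_i is frozen; `length` must be a power of
    two that divides len(frozen_mask).
    """
    codes = []
    for start in range(0, len(frozen_mask), length):
        block = frozen_mask[start:start + length]
        k = sum(1 for is_frozen in block if not is_frozen)
        codes.append((start, length, k))
    return codes


# (16, 12) master code: u0, u1, u2 and u4 are frozen.
frozen = {0, 1, 2, 4}
mask = [i in frozen for i in range(16)]
for length in (8, 4):
    print(length, constituent_codes(mask, length))
```

The length-8 constituent codes are an (8, 4) code and a rate-1 (8, 8) code, while the length-4 ones are (4, 1), (4, 3) and two rate-1 (4, 4) codes.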
A. Hardware Modifications to the Unrolled Decoders
Consider the decoder tree shown in Fig. 2b along with its
unrolled implementation as illustrated in Fig. 5. In Fig. 2b, the
constituent code taking root in v is an (8, 4) polar code. Its
corresponding decoder can be directly employed by placing
the 8 channel LLRs into α_0^7 and by selecting the bottom
input of the multiplexer m1 illustrated in Fig. 5. Its estimated
codeword is retrieved by reading the output of the Combine
block feeding the β4 register, i.e. by selecting the top and
bottom inputs from m4 and m5, respectively, and by reading
the 8 least-significant bits from β_0^15. Similarly, still in Fig. 5,
the decoders for the repetition and SPC constituent codes
can be fed via the m2 and m3 multiplexers and their outputs
eventually recovered from the outputs of the Rep and SPC
blocks, respectively.
Although not illustrated in Figs. 3, 4 or 5, the proposed
unrolled decoders feature a minimal controller. While not
mandatory, the functionality of this controller is altered to
better accommodate the use of multiple polar codes. Two look-up
tables (LUTs) are added. One LUT stores the decoding
latency, in clock cycles, of each code. It serves as a stopping
criterion to generate the done signal. The other LUT stores the
clock cycle “value” istart at which the enable-signal generator
circuit should start. Each non-master code may start at a
value (istart mod I) ≠ 0. In such cases, using the unaltered
controller would result in the waste of (istart mod I) clock
cycles. This waste can be significant for short codes, especially with
large values of I. For example, without these changes, for the
implementation with a master code of length 1024 and I = 20
presented in Section VI below, the latency for the (128, 96)
polar code would increase by 20% as (istart mod I) = 17 and
the decoding latency is 82 clock cycles.
Lastly, the modified controller also generates the multiplexer
select signals, allowing proper data routing, based on the
selected mode.
B. On the Construction of the Master Code
Conventional approaches construct polar codes for a given
channel type and condition. In this work, many of the con-
stituent codes contained within a master code are not only used
internally to detect and correct errors, they are used separately
as well. Therefore, we propose to assemble a master code
using two optimized constituent codes in order to increase
the number of optimized polar codes available. Doing so, the
number of information bits, or the code rate, of the second
largest supported codes can be selected. In the following, a
master code of length 2048 is constructed by concatenating
two constituent codes of length 1024. The LHS and RHS
constituent codes are chosen to have a rate of 1/2 and of
5/6, respectively. As a result, the assembled master code has
rate 2/3. The location of the frozen bits in the master code
is dictated by its constituent codes. Note that the constituent
code with the lowest rate is put on the left, and the one with
the highest rate on the right, to minimize the coding loss
associated with a non-optimized polar code.
Fig. 6: Error-correction performance (FER and BER versus Eb/N0 in dB) of two (2048, 1365) polar
codes with different constructions: optimized with [12], and assembled.
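The assembly step amounts to concatenating the frozen-bit masks of the two constituent codes, as the following Python sketch illustrates. The masks used here are placeholders that only have the right number of frozen bits; the actual frozen-bit positions of the (1024, 512) and (1024, 853) codes are not reproduced, and the function names are illustrative.

```python
def assemble_master(frozen_left, frozen_right):
    """Frozen-bit mask of the master code: LHS mask followed by RHS mask."""
    if len(frozen_left) != len(frozen_right):
        raise ValueError("constituent codes must have the same length")
    return list(frozen_left) + list(frozen_right)


def code_parameters(frozen_mask):
    """Return (N, k) for a frozen-bit mask, where True marks a frozen bit."""
    n = len(frozen_mask)
    return n, n - sum(frozen_mask)


# Placeholder masks (arbitrary positions, correct counts only).
left = [True] * 512 + [False] * 512    # stands in for the (1024, 512) code
right = [True] * 171 + [False] * 853   # stands in for the (1024, 853) code
master = assemble_master(left, right)
n, k = code_parameters(master)
print(n, k, k / n)
```

The assembled code carries 512 + 853 = 1365 information bits over 2048 bits, i.e. a rate of approximately 2/3.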
Fig. 6 shows both the frame-error rate (left) and the bit-
error rate (right) of two different (2048, 1365) polar codes.
The black-solid curve is the performance of a polar code con-
structed using the method described in [12] for Eb/N0 = 4 dB.
The dashed-red curve is for the (2048, 1365) code constructed by
concatenating (assembling) a (1024, 512) polar code and a
(1024, 853) polar code. Both polar codes of length 1024 were
also constructed using the method of [12] for Eb/N0 values of
2.5 and 5 dB, respectively.
From the figure, it can be seen that constructing an optimized
polar code of length 2048 with rate 2/3 results in a
coding gain of approximately 0.17 dB at a FER of 10⁻³ (a
FER appropriate for certain applications) over one assembled
from two shorter polar codes of length 1024. The gap
increases with the signal-to-noise ratio, reaching 0.24 dB
at a FER of 10⁻⁴. Looking at the BER curves, it can be
observed that the gap is much narrower. Compared to that of
the assembled master code, the optimized polar code shows a
coding gain of 0.07 dB at a BER of 10⁻⁵.
C. About Constituent Codes: frozen bit locations, rate and
practicality
The location of the frozen bits in non-optimized constituent
codes is dictated by their parent code. In other words, if
the master code of length N has been assembled from two
optimized (constituent) polar codes of length N/2 as suggested
in the previous section, the shorter optimized codes of length
N/2 determine the location of the frozen bits in their respective
constituent codes of length < N/2. Otherwise, the master code
dictates the frozen bit locations for all constituent codes.
Assuming that the decoding algorithm takes advantage of
the a priori knowledge of these locations, the code rate and
frozen bit locations of constituent codes cannot be changed at
execution time. However, there are many constituent codes to
choose from and code shortening can be used [18] to create
more, e.g. in order to obtain a specific number of information
bits or code rate.
Fig. 7: Error-correction performance (FER versus Eb/N0 in dB) of the four constituent
codes of length 128 with a rate of approximately 5/6 contained
in the proposed (2048, 1365) master code: (128, 100), (128, 102), (128, 107) and (128, 108).
Because of the polarization phenomenon, given any two sib-
ling constituent codes, the code rate of the LHS one is always
lower than that of the RHS one for a properly constructed
polar code [14]. That property plays to our advantage as, in
many wireless applications, it is desirable to offer a variety of
codes of both high and low rates.
It should be noted that not all constituent codes within a
master code are of practical use e.g. codes of very high rate
offer negligible coding gain over an uncoded communication.
For example, among the four constituent codes of length 4
included in the (16, 12) polar code illustrated in Fig. 2a,
two of them are rate-1 constituent codes. Using them would
be equivalent to uncoded communication. Moreover, among
constituent codes of the same length, many codes may have
a similar number of information bits with little to no error-
correction performance difference in the region of interest.
Fig. 7 shows the frame-error rate of all four constituent
codes of length 128 with a rate of approximately 5/6 that are
contained within the proposed (2048, 1365) master code. It can
be seen that, even at such a short length, at a FER of 10⁻³ the
gap between both extremes is under 0.5 dB. Among those
constituent codes, only the (128, 108) was selected for the
implementation presented in Section VI. It is beneficial to limit
the number of codes supported in a practical implementation
of a multi-mode decoder in order to minimize routing circuitry.
D. Latency and Throughput Considerations
If a decoding algorithm taking advantage of the a priori
knowledge of the frozen bit locations is used in the unrolled
decoder, such as Fast-SSC [9], the latency will vary even
among constituent codes of the same length. However, the
coded throughput will not. The coded throughput of an unrolled
decoder for a polar code of length N will be twice that
of a constituent code of length N/2, which, in turn, is double that of
a constituent code of length N/4, and so on. The coded and
information throughput are defined by (5).
In wireless communication standards where multiple code
lengths and rates are supported, the peak information through-
put is typically achieved with the longest code that has both
the greatest latency and highest code rate. It is not mandatory
to reproduce this with our proposed method, but it can be done
if considered desirable. It is the example that we provide in
the implementation section of this paper.
Another possible scenario would be to use a low-rate master
code, e.g. R = 1/3, that is more powerful in terms of error-
correction performance. The resulting multi-mode decoder
would reach its peak information throughput with the longest
constituent code of length N/2 that has the highest code rate, a
code with a significantly lower decoding latency than that of
the master code.
VI. Implementation and Results
In this section, we start by presenting results for dedicated
unrolled decoders, showing the effect of the initiation interval,
the code length and the code rate on unrolled decoders. Then,
we present results for two implementations of our proposed
multi-mode unrolled decoders. For the latter, we had the
objective of building decoders with a throughput in the vicinity
of 20 Gbps.
The multi-mode decoder examples are built around
(1024, 853) and (2048, 1365) master codes. In the following,
the former is referred to as the decoder supporting a maximum
code length Nmax of 1024 and the latter as the decoder with
Nmax = 2048. A total of ten polar codes were selected for the
decoder supporting codes of lengths up to 2048. The other
decoder with Nmax = 1024 has eight modes corresponding to
a subset of the ten polar codes supported by the bigger decoder.
The master codes used in this section are the same as those
used in Section V-B.
For the decoder with Nmax = 1024, the Repetition and SPC
nodes were constrained to a maximum size Nv of 8 and 4,
respectively. For the decoder with Nmax = 2048, we found it
more beneficial to lower the execution frequency and increase
the maximum sizes of the Repetition and SPC nodes to 16 and
8, respectively. Additionally, the decoder with Nmax = 2048
also uses RepSPC [9] nodes to reduce latency.
A. Methodology
In our experiments, decoders are built with sufficient mem-
ory to accommodate storing an extra frame at the input, and
to preserve an estimated codeword at the output. As a result,
the next frame can be loaded while a frame is being decoded.
Similarly, an estimated codeword can be read while the next
frame is being decoded. We define decoding latency to include
the time required to load channel LLRs, decode a frame and
offload the estimated codeword.
The quantization used was determined by running fixed-
point simulations with bit-true models of the decoders. A
smaller number of bits is used to store the channel LLRs
than for the other LLRs used in the decoder. All
LLRs use 2's complement representation and share the same
number of fractional bits. We denote quantization as Qi.Qc.Qf,
where Qc is the total number of bits used to store a channel LLR,
Qi is the total number of bits used to store internal LLRs
and Qf is the number of fractional bits in both. Qi and Qc both
include the sign bit. Fig. 8 shows that, for a (1024, 512) polar
code modulated with BPSK and transmitted over an AWGN
channel, using Qi.Qc.Qf equal to 5.4.0 results in a 0.1 dB
performance degradation at a bit-error rate of 10^-6. Thus we
used that quantization for the hardware results.
[Plot: FER and BER versus Eb/N0 (dB); curves for Float, 6.5.1,
5.4.0 and 5.4.1 quantization.]
Fig. 8: Effect of quantization on the error-correction perfor-
mance of a (1024, 512) polar code.
TABLE I: Decoders for a (1024, 512) polar code with various
initiation intervals I. The clock is set to 500 MHz and the
latency is 728 ns.
  I   Tot. Area  Log. Area  Mem. Area  T/P     Power  Energy
      (mm^2)     (mm^2)     (mm^2)     (Gbps)  (mW)   (pJ/bit)
  1   12.369     0.60       11.75      512.0   3,830   7.5
  4    4.921     0.64        4.24      128.0   1,060   8.3
 50    1.232     0.65        0.56       10.2     107  10.5
167    0.998     0.63        0.34        3.1      62  20.0
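The Qi.Qc.Qf notation can be made concrete with a small fixed-point model. The sketch below is only an illustration: the function name `quantize_llr`, the round-to-nearest step and the asymmetric saturation range are assumptions made here, since the text specifies only 2's complement storage, the bit widths and the shared number of fractional bits.

```python
def quantize_llr(value: float, total_bits: int, frac_bits: int) -> float:
    """Round to the fixed-point grid and saturate to the 2's complement range.

    total_bits includes the sign bit, as Qi and Qc do in the text.
    Rounding mode and saturation (rather than wrap-around) are assumptions.
    """
    scale = 1 << frac_bits                 # number of steps per unit
    lo = -(1 << (total_bits - 1))          # most negative integer code
    hi = (1 << (total_bits - 1)) - 1       # most positive integer code
    code = round(value * scale)            # nearest step (ties go to even)
    code = max(lo, min(hi, code))          # saturate instead of wrapping
    return code / scale


# Qi.Qc.Qf = 5.4.0: 5-bit internal LLRs, 4-bit channel LLRs, no fractional bits.
Q_I, Q_C, Q_F = 5, 4, 0
channel = [quantize_llr(x, Q_C, Q_F) for x in (-12.3, -0.4, 2.6, 9.9)]
# An internal value (here a plain sum of two channel LLRs) gets one extra bit.
internal = quantize_llr(channel[0] + channel[3], Q_I, Q_F)
```

With 5.4.0, channel LLRs are clipped to the integers in [-8, 7] and internal LLRs to [-16, 15], which is why the extra internal bit leaves headroom for sums of channel values.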
ASIC synthesis results are for the 65 nm CMOS GP
technology from TSMC and are obtained with Cadence RTL
Compiler. Unless indicated otherwise, all results are for the
worst-case library at a supply voltage of 0.72 V with an operat-
ing temperature of 125 °C. Power consumption estimations are
also obtained from Cadence RTL Compiler, with switching
activity derived from simulation vectors. Only registers were
used for memory due to the lack of access to an SRAM compiler.
B. Dedicated Decoders: Effect of the Initiation Interval
In this section, we explore the effect of the initiation interval
on the implementation of the fully-unrolled architecture. The
decoders are built for the same (1024, 512) polar code used
in [11], although many improvements were made since the
publication of that work. Regardless of the initiation interval,
all decoders use 5.4.0 quantization and have a decoding latency
of 364 clock cycles.
Table I shows the results for various initiation intervals.
Besides the effect on throughput, increasing the initiation
interval causes a significant reduction in memory requirements
without significantly affecting combinational logic. Since area
is largely dominated by registers, increasing the initiation
interval has a great effect on the total area. For example, using
I = 50 results in an area that is more than 10 times smaller,
at the cost of a throughput that is 50 times lower. That table
also shows that reducing the area has a direct effect on the
estimated power consumption, which drops significantly as I
increases.
As expected, increasing the initiation interval I offers
diminishing returns as it gets closer to the maximum, 167 for
the example (1024, 512) code. Also, as I is increased, the
energy efficiency degrades: Table I shows the energy per bit
rising from 7.5 pJ/bit at I = 1 to 20.0 pJ/bit at I = 167.
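The throughput and energy columns of Table I can be recomputed from the code length, clock and power figures. The sketch below assumes the coded throughput is N·f/I, which is consistent with the tabulated values; the function names are illustrative only.

```python
N = 1024            # code length of the (1024, 512) code
F_CLK_HZ = 500e6    # clock frequency used for Table I
POWER_MW = {1: 3830, 4: 1060, 50: 107, 167: 62}   # power column of Table I


def coded_throughput_gbps(n: int, f_clk_hz: float, interval: int) -> float:
    """One frame of n bits leaves the pipeline every `interval` clock cycles."""
    return n * f_clk_hz / interval / 1e9


def energy_pj_per_bit(power_mw: float, throughput_gbps: float) -> float:
    """mW divided by Gbps is numerically equal to pJ/bit."""
    return power_mw / throughput_gbps


rows = {}
for interval, power in POWER_MW.items():
    tp = coded_throughput_gbps(N, F_CLK_HZ, interval)
    rows[interval] = (tp, energy_pj_per_bit(power, tp))

for interval, (tp, energy) in rows.items():
    print(f"I = {interval:3d}: {tp:6.1f} Gbps, {energy:5.1f} pJ/bit")
```

Small differences from the table in the last digit (for example at I = 167) come from the table rounding the throughput before dividing.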
C. Dedicated Decoders: Effect of the Code Length and Rate
Results for other polar codes are presented in this section
where we show the effect of the code length and rate on
performance and resource usage.
TABLE II: Deeply-pipelined decoders for polar codes of
various lengths with rate R = 1/2. The clock is set to 500 MHz.
N     Tot. Area  Log. Area  Mem. Area  Latency  T/P     Power   Energy
      (mm^2)     (mm^2)     (mm^2)     (ns)     (Gbps)  (mW)    (pJ/bit)
 128   0.349     0.05        0.29        152       64      105   1.6
 256   1.121     0.12        0.99        268      128      342   2.7
 512   3.413     0.27        3.14        408      256    1,050   4.0
1024  12.369     0.60       11.75        728      512    3,830   7.5
2048  43.541     1.32       42.16      1,304    1,024   13,526  13.2
Tables II and III show the effect of the code length on
area, decoding latency, coded throughput, power consumption,
and on energy efficiency for polar codes of short to moderate
lengths. Table II contains results for the fully-unrolled deeply-
pipelined architecture (I = 1) and the code rate R is fixed to
1/2 for all polar codes. Table III contains results for the fully-
unrolled partially-pipelined architecture where the maximum
initiation interval (Imax) is used and the code rate R is 5/6.
As shown in Table II, with a deeply-pipelined architecture,
logic area usage grows almost as N log2 N, whereas memory
area is closer to being quadratic in code length N. The
logic area required for a deeply-pipelined unrolled decoder
implemented in 65 nm ASIC technology can be approximated
with an accuracy greater than 98% using C · N log2 N, where
the constant C is set to 1/17,000. For comparison, the logic area
of tree-based SC decoders is O(N), while the other state-of-the-
art partially-parallel architectures have a fixed logic area that
does not depend on the code length.
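The approximation can be checked against the logic-area column of Table II. The sketch below is a simple comparison written for this purpose; only the constant C = 1/17,000 and the tabulated areas come from the text, and the tabulated values are rounded to two decimals, which limits how finely the quoted accuracy can be verified.

```python
import math

C = 1 / 17_000   # constant quoted in the text; area is in mm^2
TABLE_II_LOGIC_AREA = {128: 0.05, 256: 0.12, 512: 0.27, 1024: 0.60, 2048: 1.32}


def logic_area_mm2(n: int) -> float:
    """Approximate logic area of a deeply-pipelined unrolled decoder."""
    return C * n * math.log2(n)


deviation = {n: logic_area_mm2(n) - area
             for n, area in TABLE_II_LOGIC_AREA.items()}

for n, dev in deviation.items():
    print(f"N = {n:4d}: model {logic_area_mm2(n):.3f} mm^2, "
          f"table {TABLE_II_LOGIC_AREA[n]:.2f} mm^2, diff {dev:+.3f}")
```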
Curve fitting shows that the memory area is quadratic with
code length N. Let the memory area be defined by a + bN + cN^2;
setting a = −0.249, b = 2.466 × 10^-3 and c = 8.912 × 10^-6
results in a standard error of 0.1839.
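The quadratic fit can be evaluated against the memory-area column of Table II. Note the sign of the constant term: the quoted standard error is reproduced only with a negative a, so the sketch below takes a = −0.249. That reading is an inference from the table, and the residual standard error with two degrees of freedom (five points, three parameters) is assumed to be the error measure meant.

```python
import math

# Memory-area column of Table II (mm^2), indexed by code length N.
TABLE_II_MEM_AREA = {128: 0.29, 256: 0.99, 512: 3.14, 1024: 11.75, 2048: 42.16}

# Fit coefficients; A is taken as negative (an assumption, see lead-in).
A, B, C = -0.249, 2.466e-3, 8.912e-6


def memory_area_mm2(n: int) -> float:
    """Quadratic model a + b*N + c*N^2 for the register (memory) area."""
    return A + B * n + C * n * n


residuals = [area - memory_area_mm2(n) for n, area in TABLE_II_MEM_AREA.items()]
dof = len(residuals) - 3   # five data points minus three fitted parameters
standard_error = math.sqrt(sum(r * r for r in residuals) / dof)
print(f"residual standard error: {standard_error:.4f}")
```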
As shown in Table II, throughput exceeding 1 Tbps and
500 Gbps can be achieved with a deeply-pipelined decoder
for polar codes of length 2048 and 1024, respectively. As the
memory area grows quadratically with the code length, the
amount of energy required to decode a bit increases with the
code length. The decoder for the (4096, 2048) polar code could
not be synthesized on our server due to insufficient memory.
For a partially-pipelined architecture with Imax, both the
memory and total area scale linearly with N. The power
consumption is shown to scale almost linearly as well. The
results of Table III also show that it was possible to synthesize
ASIC decoders for larger code lengths than what was possible
with a deeply-pipelined architecture.
TABLE III: Partially-pipelined decoders with initiation interval
set to Imax for polar codes of various lengths with rate R = 5/6.
The clock is set to 500 MHz.
N     I    Tot. Area  Mem. Area  Latency  T/P     Power  Energy
           (mm^2)     (mm^2)     (µs)     (Gbps)  (mW)   (pJ/bit)
1024  206  0.793      0.28       0.646    2.5      51    20.5
2048  338  1.763      0.61       0.888    3.0     108    35.6
4096  665  4.248      1.44       1.732    3.1     251    81.5
TABLE IV: Deeply-pipelined decoders for polar codes of
length N = 1024 with common rates. The clock is set to
500 MHz and the throughput is 512 Gbps.
R    Tot. Area  Mem. Area  Latency  Latency  Power  Energy
     (mm^2)     (mm^2)     (CCs)    (ns)     (mW)   (pJ/bit)
1/2  12.369     11.75      364      727      3,830  7.5
2/3  13.049     12.45      326      651      4,041  6.2
3/4  15.676     15.05      373      745      4,865  6.5
5/6  14.657     14.05      323      645      4,549  7.1
The effect of using different code rates for a polar code of
length N = 1024 is shown in Table IV. We note that the higher-
rate codes do not have noticeably lower latency than
the rate-1/2 code, contrary to what was observed in [9]. This
is due to limiting the width of SPC nodes to NSPC = 4 in this
work, whereas it was left unbounded in that work. The result
is that long SPC codes are implemented as trees whose left-
most child is a width-4 SPC node and whose other children are
all rate-1 nodes. Thus, for each of the log2 Nv − log2 NSPC
additional stages of an SPC code of length Nv > NSPC, four
nodes with a total latency of 3 clock cycles are required: F, G
followed by I, and Combine. This brings the total latency of
decoding a long SPC code to $3(\log_2 N_v - \log_2 N_{SPC}) + 1$
clock cycles, compared to $\lceil N_v / P \rceil + 4$ in [9], where
P is the number of LLRs that can be read simultaneously
(256 was a typical value for P in [9]).
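The two latency expressions are easy to compare numerically. The following sketch simply evaluates both formulas; the function names are illustrative, and Nv and NSPC are assumed to be powers of two with Nv ≥ NSPC.

```python
import math


def spc_latency_unrolled(n_v: int, n_spc: int = 4) -> int:
    """Clock cycles for a length-n_v SPC code when SPC nodes are capped at n_spc.

    Implements 3 * (log2(n_v) - log2(n_spc)) + 1 from the text.
    """
    if n_v < n_spc:
        raise ValueError("formula applies to n_v >= n_spc only")
    # bit_length() - 1 is an exact log2 for powers of two.
    extra_stages = (n_v.bit_length() - 1) - (n_spc.bit_length() - 1)
    return 3 * extra_stages + 1


def spc_latency_ref9(n_v: int, p: int = 256) -> int:
    """Clock cycles quoted for the decoder of [9]: ceil(n_v / P) + 4."""
    return math.ceil(n_v / p) + 4


for n_v in (4, 16, 64, 512):
    print(n_v, spc_latency_unrolled(n_v), spc_latency_ref9(n_v))
```

For a length-64 SPC code, for example, the capped implementation takes 13 clock cycles against 5 for the unbounded one with P = 256, which illustrates why the higher-rate codes of Table IV do not show lower latency.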
From Table IV, it can be seen that varying the rate does
not affect the logic area, which remains almost constant at
approximately 0.61 mm^2. Memory, in the form of registers,
dominates the decoder area. Therefore, the estimated power
consumption scales according to the memory area.
D. Deeply-pipelined SC Decoders
To decode a frame, an SC decoder needs to load a frame,
visit all $\sum_{i=1}^{\log_2 N} 2^i$ edges of the decoder tree
twice and store
the estimated codeword. A deeply-pipelined SC decoder for
a (128, 64) polar code has an area of 2.17 mm2, a latency
of 510 clock cycles, and a power consumption of 677 mW.
These values are 6.2, 6.7, and 6.4 times as much as their
counterparts of the deeply-pipelined Fast-SSC decoder re-
ported in Table II. These results indicate that deeply-pipelined
SC decoders will be limited to very short polar codes, and
that alternative algorithms and architectures will yield more
practical implementations.
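As a check on the quoted latency, and assuming that loading, storing and each edge traversal take one clock cycle each (an assumption made here, which the text does not state explicitly), the edge count for N = 128 gives:

```latex
\begin{align}
\sum_{i=1}^{\log_2 N} 2^i &= 2^{\log_2 N + 1} - 2 = 2N - 2, \\
\text{latency} &= 1 + 2\,(2N - 2) + 1 = 4N - 2
  = 510 \text{ clock cycles for } N = 128 .
\end{align}
```

Under that assumption the latency of a deeply-pipelined SC decoder grows linearly with N, with no reduction from the frozen-bit pattern.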
[Plot: FER versus Eb/N0 (dB) for the (2048, 1365), (1024, 512),
(1024, 853), (512, 490), (512, 363), (256, 228), (256, 135),
(128, 108), (128, 96) and (128, 39) polar codes.]
Fig. 9: Error-correction performance of the polar codes.
E. Multi-mode Decoders: Error-correction Performance
Fig. 9 shows the frame-error rate performance of ten differ-
ent polar codes. The decoder with Nmax = 2048 supports all ten
illustrated polar codes whereas the decoder with Nmax = 1024
supports all polar codes but the two shown as dotted curves.
All simulations are generated using random codewords mod-
ulated with binary phase-shift keying and transmitted over an
additive white Gaussian noise channel.
It can be seen from the figure that the error-correction
performance of the supported polar codes varies greatly. As
expected, for codes of the same length, the codes with
the lowest code rates perform significantly better than their
higher-rate counterparts. For example, at a FER of 10^-4, the
performance of the (512, 363) polar code is almost 3 dB better
than that of the (512, 490) code.
While the error-correction performance plays a role in the
selection of a code, the latency and throughput are also
important considerations. As will be shown in the following
section, the ten selected polar codes differ considerably
in that regard as well.
F. Multi-mode Decoders: Latency and Throughput
Table V shows the latency and information throughput for
both decoders with Nmax ∈ {1024, 2048}. To reduce the area
and latency while retaining the same throughput, the initiation
interval I can be increased along with the clock frequency (5).
Table V assumes that both decoders have an initiation interval
of 20, as used in the section below, with clock frequencies of
500 MHz and 250 MHz for the decoders with Nmax = 1024
and Nmax = 2048, respectively. While their master codes differ,
both decoders feature a peak information throughput in the
vicinity of 20 Gbps. For the decoder with the smaller Nmax,
the seven other polar codes have an information throughput
in the multi-gigabit per second range, with the exception of
the shortest and lowest-rate constituent code. That (128, 39)
constituent code still has an information throughput close
to 1 Gbps. The decoder with Nmax = 2048 offers multi-
gigabit throughput for most of the supported polar codes. Its
minimum information throughput is also obtained with the
(128, 39) polar code, at approximately 500 Mbps.
TABLE V: Information throughput and latency for the multi-
mode unrolled polar decoders based on the (1024, 853) and
(2048, 1365) master codes, with an Nmax of 1024 and 2048,
respectively.
Code          Rate    Info. T/P (Gbps)   Latency (CCs)   Latency (ns)
(N, k)        (k/N)   Nmax = 1024  2048  1024   2048     1024   2048
(2048, 1365)  2/3     -            17.1  -      503      -      2,012
(1024, 853)   5/6     21.3         10.7  323    236      646    944
(1024, 512)   1/2     -            6.4   -      265      -      1,060
(512, 490)    19/20   12.3         6.2   95     75       190    300
(512, 363)    7/10    9.1          4.5   226    159      452    636
(256, 228)    9/10    5.7          2.6   86     61       172    244
(256, 135)    1/2     3.4          1.7   138    96       276    384
(128, 108)    5/6     2.7          1.4   54     40       108    160
(128, 96)     3/4     2.4          1.2   82     52       164    208
(128, 39)     1/3     0.98         0.49  54     42       108    168
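The entries of Table V follow from the initiation interval, the clock frequency and the latency in clock cycles. The sketch below recomputes a few of them, assuming the information throughput is k·f/I (consistent with the tabulated values); names such as `info_throughput_gbps` are illustrative only.

```python
I_INTERVAL = 20                            # initiation interval of both decoders
F_CLK_HZ = {1024: 500e6, 2048: 250e6}      # clock per decoder, keyed by Nmax


def info_throughput_gbps(k: int, n_max: int) -> float:
    """k information bits are delivered every I_INTERVAL clock cycles."""
    return k * F_CLK_HZ[n_max] / I_INTERVAL / 1e9


def latency_ns(clock_cycles: int, n_max: int) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    return clock_cycles / F_CLK_HZ[n_max] * 1e9


# (N, k) -> latency in clock cycles on the Nmax = 2048 decoder (from Table V).
LATENCY_CC_2048 = {(2048, 1365): 503, (1024, 853): 236, (128, 39): 42}

summary = {
    code: (round(info_throughput_gbps(code[1], 2048), 2),
           round(latency_ns(cc, 2048)))
    for code, cc in LATENCY_CC_2048.items()
}
for code, (tp, lat) in summary.items():
    print(code, f"{tp} Gbps", f"{lat} ns")
```

Because every supported code shares the same clock and initiation interval, the information throughput depends only on k, which is why the shortest, lowest-rate code sets the minimum.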
In terms of latency, the decoder with Nmax = 1024 requires
646 ns to decode its longest supported code. The latency for
all the other codes supported by that decoder is under 500 ns.
Even with its additional dedicated node and relaxed maximum
size constraint on the Repetition and SPC nodes, the decoder
with Nmax = 2048 has greater latency overall because of its
lower clock frequency. For example, its latency is 2.01 µs,
944 ns and 1.06 µs for the (2048, 1365), (1024, 853) and
(1024, 512) polar codes, respectively.
Using the same nodes and constraints as for Nmax = 1024,
the Nmax = 2048 decoder would allow for greater clock fre-
quencies. While 689 clock cycles would be required to decode
the longest polar code instead of 503, a clock of 500 MHz
would be achievable, effectively reducing the latency from
2.01 µs to 1.38 µs and doubling the throughput. However,
this reduction comes at the cost of much greater area and an
estimated power consumption close to 1 W.
G. Comparing with the State of the Art
Table VI shows the synthesis results along with power
consumption estimations for the two implementations of the
proposed multi-mode unrolled decoder. The work in the first
two columns is for the decoder with Nmax = 1024, based
on the (1024, 853) master code. It was synthesized for clock
frequencies of 500 MHz and 650 MHz, respectively, with
initiation intervals I of 20 and 26. Our work shown in the third
and fourth columns is for the decoders with Nmax = 2048, built
from the assembled (2048, 1365) polar code. These decoders
have an initiation interval I of 20 or 28, with lower clock
frequencies of 250 MHz and 350 MHz, respectively. For com-
parison with other works, the same table also includes results
for a dedicated partially-pipelined decoder for a (1024, 512)
polar code.
The four fastest polar decoder implementations from the
literature are also included for comparison along with nor-
malized area results. For consistency, only the largest polar
code supported by each of our proposed multi-mode unrolled
decoders is used and the coded throughput, as opposed to the
information one, is compared to match what was done in most
of the other works.
TABLE VI: Comparison with state-of-the-art polar decoders.
                        Multi-mode                                          Dedicated    [19]         [20]         [17]         [8]
Algorithm               Fast-SSC                                            Fast-SSC     Fast-SSC     BP           SC           2-bit SC
Technology              65 nm                                               65 nm        65 nm        65 nm        90 nm        45 nm
Nmax                    1024                      2048                      1024         1024         1024         1024         1024
Code                    (1024, 853)               (2048, 1365)              (1024, 512)  (1024, 512)  (1024, 512)  (1024, k)    (1024, 512)
Init. Interval (I)      20           26           20           28           20           -            -            -            -
Supply (V)              0.72         1.0          0.72         1.0          1.0          1.0          1.0          1.3          N/A
Oper. temp. (°C)        125          25           125          25           25           25           ≈ 25         N/A          N/A
Area (mm^2)             1.71         1.44         4.29         3.58         1.68         0.69         1.48         3.21         N/A
Area @65nm (mm^2)       1.71         1.44         4.29         3.58         1.68         0.69         1.48         1.68         0.4
Frequency (MHz)         500          650          250          350          500          600          300          2.5          750
Latency (µs)            0.65         0.50         2.01         1.44         0.73         0.27         50           0.39         1.02
Coded T/P (Gbps)        25.6         25.6         25.6         25.6         25.6         3.7          4.7 @ 4 dB   2.56         1.0
Sust. Coded T/P (Gbps)  25.6         25.6         25.6         25.6         25.6         3.7          2.0          2.56         1.0
Area Eff. (Gbps/mm^2)   15.42        17.75        5.97         7.16         15.27        5.40         3.18 @ 4 dB  0.80         N/A
Power (mW)              226          546          379          740          386          215          478          191          N/A
Energy (pJ/bit)         8.8          21.3         14.8         28.9         15.1         57.7         102.1        74.5         N/A
 Measurement results.
From Table VI, it can be seen that the area of the proposed
decoders with Nmax = 1024 is similar to that of the BP
decoder of [20] as well as the normalized area of the unrolled
SC decoder from [17]. However, their area is from 2.1 to 2.5
times greater than that of [19]. Comparing the multi-mode
decoders, the area of the decoder with Nmax = 2048 is over
twice that of the ones with Nmax = 1024; however, the master
code for the former has twice the length of the latter and
the decoder supports two more modes.
All proposed decoders have a coded throughput that is an
order of magnitude greater than the other works. Latency
is one to two orders of magnitude lower than that of the
BP decoder. Comparing against the SC decoder of [17], the
latency is 1.7 or 3.7 times greater for decoders with an Nmax
of 1024 and 2048, respectively. It should be noted that the
decoder of [17] supports codes of any rate, whereas the proposed
multi-mode decoders support a limited number of code rates.
The latency of the proposed decoders is higher than the
programmable Fast-SSC decoder of [19]. This is due to greater
limitations on the specialized repetition and SPC decoders.
The decoder in [19] limits repetition decoders to a maximum
length of 32, compared to 8 or 16 in this work, and does not
place limits on the SPC decoders.
Finally, among the decoders with Nmax = 1024 implemented
in 65 nm with a 1 V power supply and operating at 25 °C, our
proposed implementation offers the greatest area and energy
efficiency. The proposed multi-mode decoder exhibits 3.3 and
5.6 times better area efficiency than the decoders of [19] and
[20], respectively. The energy efficiency is estimated to be
2.7 and 4.8 times higher compared to that of the same two
decoders from the literature.
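The quoted ratios can be reproduced from the Table VI columns at 1.0 V and 25 °C. The sketch below recomputes area and energy efficiency from the tabulated throughput, area and power; it uses the 4.7 Gbps figure at 4 dB for [20], as the table does for area efficiency, and the rounded table entries mean the results match the quoted ratios only to about one decimal.

```python
# Table VI columns with Nmax = 1024 at 1.0 V and 25 °C (values copied from the table).
DECODERS = {
    "multi-mode": {"tp_gbps": 25.6, "area_mm2": 1.44, "power_mw": 546},
    "[19]":       {"tp_gbps": 3.7,  "area_mm2": 0.69, "power_mw": 215},
    "[20]":       {"tp_gbps": 4.7,  "area_mm2": 1.48, "power_mw": 478},
}


def area_efficiency(d: dict) -> float:
    """Coded throughput per unit area, in Gbps/mm^2."""
    return d["tp_gbps"] / d["area_mm2"]


def energy_pj_per_bit(d: dict) -> float:
    """Energy per coded bit; mW / Gbps is numerically pJ/bit."""
    return d["power_mw"] / d["tp_gbps"]


ours = DECODERS["multi-mode"]
area_gain = {name: area_efficiency(ours) / area_efficiency(d)
             for name, d in DECODERS.items() if name != "multi-mode"}
energy_gain = {name: energy_pj_per_bit(d) / energy_pj_per_bit(ours)
               for name, d in DECODERS.items() if name != "multi-mode"}

for name in area_gain:
    print(f"vs {name}: area eff. x{area_gain[name]:.1f}, "
          f"energy eff. x{energy_gain[name]:.1f}")
```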
Recently, a List-based multi-mode decoder was proposed in
[21], where the definition of the word “multi-mode” differs
greatly from ours: in our work, it indicates that
the decoder is capable of decoding codes of varying length
and rate, whereas in [21] a “mode” indicates the level of
parallelism in the decoder. The decoder of [21] is capable of
decoding 4 paths in parallel by implementing 4 processing
units. It can be configured to do either SC-based decoding of
4 frames or List-based decoding. For the latter, two list sizes
L are supported: if L = 2, 2 frames are decoded in parallel;
if L = 4, only 1 frame is decoded at a time.
H. I/O Bounded Decoding
The family of unrolled architectures that we proposed
requires tremendous throughput at the input of the decoder,
especially with a deeply-pipelined architecture. For example,
if a quantization of Qc = 4 bits is used for channel LLRs, for
every estimated bit, 4 times as many bits have to be loaded
into the decoder. In other words, the total data rate is 5 times
that of the output. This can be a significant challenge on both
FPGA and ASIC. If only for that reason, partially-pipelined
architectures are certainly more attractive.
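To put numbers on the I/O requirement, the sketch below computes input, output and total data rates from a coded throughput and the channel quantization Qc. It assumes one Qc-bit LLR is loaded and one hard-decision bit is offloaded per codeword bit; the function name `io_rates_gbps` is illustrative.

```python
def io_rates_gbps(coded_tp_gbps: float, q_c: int = 4) -> dict:
    """Input, output and total I/O rates for a given coded throughput."""
    rate_in = q_c * coded_tp_gbps      # q_c bits of channel LLR per codeword bit
    rate_out = coded_tp_gbps           # one estimated bit per codeword bit
    return {"in": rate_in, "out": rate_out, "total": rate_in + rate_out}


# Deeply-pipelined (1024, 512) decoder of Table II (I = 1, 512 Gbps coded).
deeply_pipelined = io_rates_gbps(512.0)
# Multi-mode decoders of Table VI (I = 20 or more, 25.6 Gbps coded).
partially_pipelined = io_rates_gbps(25.6)

print(deeply_pipelined)
print(partially_pipelined)
```

The deeply-pipelined decoder would need about 2.56 Tbps of aggregate I/O, against 128 Gbps for the partially-pipelined multi-mode decoders.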
VII. Conclusion
In this paper we presented a family of architectures for fully-
unrolled polar decoders. With an initiation interval that can be
adjusted, these architectures make it possible to find a trade-
off between area and achievable throughput without affecting
decoding latency. We showed that a fully-unrolled deeply-
pipelined decoder implemented on an ASIC could achieve a
throughput up to three orders of magnitude greater than the
state of the art. Furthermore, we presented a new method to
transform an unrolled architecture into a multi-mode decoder
supporting various polar code lengths and rates. We showed
that a master code can be assembled from two optimized
polar codes of smaller length, with desired code rates, without
sacrificing too much coding gain. We provided results for
two decoders, one built for a (1024, 853) master code and
the other for a longer (2048, 1365) polar code. These decoders
support seven and nine other practical codes, respectively. On 65 nm
ASIC, they were shown to have a peak throughput greater than
25 Gbps. One has a worst-case latency of 2 µs at 250 MHz
and an energy efficiency of 14.8 pJ/bit. The other has a worst-
case latency of 646 ns at 500 MHz and an energy efficiency
of 8.8 pJ/bit. Both implementation examples show that, with
their great throughput and support for codes of various lengths
and rates, multi-mode unrolled polar decoders are promising
candidates for future wireless communication standards.
ACKNOWLEDGEMENT
Claude Thibeault is a member of ReSMiQ. Warren J. Gross
is a member of ReSMiQ and SYTACom.
References
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-
achieving codes for symmetric binary-input memoryless channels,” IEEE
Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, 2009.
[2] A. Mishra, A. Raymond, L. Amaru, G. Sarkis, C. Leroux, P. Mein-
erzhagen, A. Burg, and W. Gross, “A successive cancellation decoder
ASIC for a 1024-bit polar code in 180nm CMOS,” in IEEE Asian Solid
State Circuits Conf. (A-SSCC), Nov 2012, pp. 205–208.
[3] C. Leroux, A. J. Raymond, G. Sarkis, I. Tal, A. Vardy, and W. J. Gross,
“Hardware implementation of successive-cancellation decoders for polar
codes,” J. Signal Process. Syst., vol. 69, no. 3, pp. 305–315, 2012.
[4] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal
Process., vol. 61, no. 2, pp. 289–299, Jan 2013.
[5] A. Raymond and W. Gross, “A scalable successive-cancellation decoder
for polar codes,” IEEE Trans. Signal Process., vol. 62, no. 20, pp. 5339–
5347, Oct 2014.
[6] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified successive-
cancellation decoder for polar codes,” IEEE Commun. Lett., vol. 15,
no. 12, pp. 1378–1380, 2011.
[7] A. Pamuk and E. Arikan, “A two phase successive cancellation decoder
architecture for polar codes,” in IEEE Int. Symp. on Inf. Theory Proc.
(ISIT), Jul 2013, pp. 957–961.
[8] B. Yuan and K. Parhi, “Low-latency successive-cancellation polar de-
coder architectures using 2-bit decoding,” IEEE Trans. Circuits Syst. I,
vol. 61, no. 4, pp. 1241–1254, Apr 2014.
[9] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast polar
decoders: Algorithm and implementation,” IEEE J. Sel. Areas Commun.,
vol. 32, no. 5, pp. 946–957, May 2014.
[10] B. Li, H. Shen, D. Tse, and W. Tong, “Low-latency polar codes via
hybrid decoding,” in Int. Symp. on Turbo Codes and Iterative Inf.
Process. (ISTC), Aug 2014, pp. 223–227.
[11] P. Giard, G. Sarkis, C. Thibeault, and W. J. Gross, “237 Gbit/s unrolled
hardware polar decoder,” IET Electron. Lett., vol. 51, no. 10, pp. 762–
763, 2015.
[12] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Trans. Inf.
Theory, vol. 59, no. 10, pp. 6562–6582, Oct 2013.
[13] E. Arıkan, “Systematic polar coding,” IEEE Commun. Lett., vol. 15,
no. 8, pp. 860–862, 2011.
[14] G. Sarkis, I. Tal, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross,
“Flexible and low-complexity encoding and decoding of systematic polar
codes,” IEEE Trans. Commun., vol. PP, no. 99, 2016.
[15] P. Schläfer, N. Wehn, M. Alles, and T. Lehnigk-Emden, “A new
dimension of parallelism in ultra high throughput LDPC decoding,” in
IEEE Workshop on Signal Process. Syst. (SiPS), 2013, pp. 153–158.
[16] N. Wehn, S. Scholl, P. Schläfer, T. Lehnigk-Emden, and M. Alles, “Chal-
lenges and limitations for very high throughput decoder architectures
for soft-decoding,” in Advanced Hardware Design for Error Correcting
Codes, C. Chavet and P. Coussy, Eds. Springer International Publishing,
2015, pp. 7–31.
[17] O. Dizdar and E. Arıkan, “A high-throughput energy-efficient imple-
mentation of successive-cancellation decoder for polar codes using
combinational logic,” IEEE Trans. Circuits Syst. I, vol. 63, no. 3, pp.
436–447, Mar 2016.
[18] Y. Li, H. Alhussien, E. Haratsch, and A. Jiang, “A study of polar codes
for MLC NAND flash memories,” in Int. Conf. on Comput., Netw. and
Commun. (ICNC), Feb 2015, pp. 608–612.
[19] P. Giard, A. Balatsoukas-Stimming, G. Sarkis, C. Thibeault, and
W. J. Gross, “Fast low-complexity decoders for low-rate polar
codes,” CoRR, vol. abs/1603.05273, Mar 2016. [Online]. Available:
http://arxiv.org/abs/1603.05273
[20] Y. S. Park, Y. Tao, S. Sun, and Z. Zhang, “A 4.68Gb/s belief propagation
polar decoder with bit-splitting register file,” in Symp. on VLSI Circuits
Dig. of Tech. Papers, Jun 2014, pp. 1–2.
[21] C. Xiong, J. Lin, and Z. Yan, “A multimode area-efficient SCL polar
decoder,” IEEE Trans. VLSI Syst., vol. PP, no. 99, pp. 1–14, 2016.
Pascal Giard received the B.Eng. and M.Eng. de-
gree in electrical engineering from École de tech-
nologie supérieure (ÉTS), Montreal, QC, Canada, in
2006 and 2009. From 2009 to 2010, he worked as
a research professional in the NSERC-Ultra Elec-
tronics Chair on ’Wireless Emergency and Tactical
Communication’ at ÉTS. He is currently working
toward the Ph.D. degree at McGill University. His
research interests are in the design and implemen-
tation of signal processing systems with a focus on
modern error-correcting codes.
Gabi Sarkis received the B.Sc. degree in electrical
engineering from Purdue University, West Lafayette,
Indiana, United States, in 2006 and the M.Eng. and
Ph.D. degrees from McGill University, Montreal,
Quebec, Canada, in 2009 and 2016, respectively.
His research interests are in the design of efficient
algorithms and implementations for decoding error-
correcting codes, in particular non-binary LDPC and
polar codes.
Claude Thibeault received his Ph.D. from Ecole
Polytechnique de Montreal, Canada. He is now
with the Electrical Engineering department of Ecole
de technologie superieure, where he serves as full
professor. His research interests include design and
verification methodologies targeting ASICs and FP-
GAs, defect and fault tolerance, radiation effects, as
well as IC and PCB test and diagnosis. He holds 13
US patents and has published more than 140 journal
and conference papers, which were cited more than
850 times. He co-authored the best paper award
at DVCON’05, verification category. He has been a member of different
conference program committees, including the VLSI Test Symposium, for
which he was program chair in 2010–2012, and general chair in 2014 and
2015.
Warren J. Gross received the B.A.Sc. degree in
electrical engineering from the University of Wa-
terloo, Waterloo, Ontario, Canada, in 1996, and the
M.A.Sc. and Ph.D. degrees from the University of
Toronto, Toronto, Ontario, Canada, in 1999 and
2003, respectively. Currently, he is an Associate
Professor with the Department of Electrical and
Computer Engineering, McGill University, Mon-
tréal, Québec, Canada. His research interests are in
the design and implementation of signal process-
ing systems and custom computer architectures. Dr.
Gross is currently Chair of the IEEE Signal Processing Society Technical
Committee on Design and Implementation of Signal Processing Systems.
He has served as Technical Program Co-Chair of the IEEE Workshop on
Signal Processing Systems (SiPS 2012) and as Chair of the IEEE ICC
2012 Workshop on Emerging Data Storage Technologies. Dr. Gross served
as Associate Editor for the IEEE Transactions on Signal Processing. He
has served on the Program Committees of the IEEE Workshop on Signal
Processing Systems, the IEEE Symposium on Field-Programmable Custom
Computing Machines, the International Conference on Field-Programmable
Logic and Applications and as the General Chair of the 6th Annual Analog
Decoding Workshop. Dr. Gross is a Senior Member of the IEEE and a licensed
Professional Engineer in the Province of Ontario.
