Auto-Generation of Pipelined Hardware Designs for Polar Encoder by Zhong, Zhiwei et al.
arXiv:1801.00472v1 [cs.AR] 1 Jan 2018
Auto-Generation of Pipelined Hardware Designs for
Polar Encoder
Zhiwei Zhong1,2, Xiaohu You2, and Chuan Zhang1,2,∗
1Lab of Efficient Architectures for Digital-communication and Signal-processing (LEADS)
2National Mobile Communications Research Laboratory, Southeast University, Nanjing, China
Email: {zwzhong, xhyu, chzhang}@seu.edu.cn
Abstract—This paper presents a general framework for auto-
generation of pipelined polar encoder architectures. The pro-
posed framework could be well represented by a general formula.
Given an arbitrary code length N and level of parallelism M, the
formula specifies the corresponding hardware architecture.
We have written a compiler which could read the formula
and then automatically generate its register-transfer level (RTL)
description suitable for FPGA or ASIC implementation. With
this hardware generation system, one could explore the design
space and make a trade-off between cost and performance. Our
experimental results have demonstrated the efficiency of this
auto-generator for polar encoder architectures.
Index Terms—Polar encoder, pipelined architecture, hardware
auto-generation, high-level synthesis.
I. INTRODUCTION
Polar code [1], the first channel code that can provably achieve
the capacity of binary-input discrete memoryless channels (BDMCs),
is regarded as a recent breakthrough in coding theory. Recently,
polar code has been adopted for the enhanced mobile broadband (eMBB)
control channels of the 5G NR interface. As pointed out in [1], the
code length must be sufficiently long for polar code to achieve good
error-correcting performance. However, the hardware complexity of a
fully parallel polar encoder grows with the code length. Therefore,
pipelined architectures should be introduced to reduce the hardware
cost. Using the folding transformation [2], [3] proposed both
feed-forward and feed-back polar encoders with 2-parallel processing,
and [4] proposed a pipelined polar encoder architecture with
4-parallel processing. Although [4] claimed that the folding
transformation can derive polar encoders with any level of
parallelism, no detailed framework was given.
In synthesizing hardware architectures for an N-bit polar encoder,
different levels of parallelism lead to different latency,
throughput, silicon area, and memory cost. Intuitively, the level of
parallelism M suitable for an N-bit polar encoder satisfies
2 ≤ M ≤ N/2, where M is a power of two. Thus, as the code length
increases, there are more choices of M and the design space becomes
wider. It is therefore laborious to choose the optimal values of N
and M by hand under different hardware constraints.
In order to fulfill the requirements of different applications, an
auto-generator that can conveniently output a polar encoder
architecture for a given code length N and parallelism M is highly
desirable. Such an auto-generator also frees hardware designers from
laborious case-by-case designs, hides the hardware details, and
exposes the design space in a more convenient way. Inspired by a fast
Fourier transform (FFT) generator [5], which automatically generates
FFT hardware architectures with arbitrary parallelism and reports
their hardware cost, this paper proposes an auto-generation system
that produces polar encoder hardware architectures with arbitrary
code length and arbitrary level of parallelism.
The remainder of this paper is organized as follows. In Sec-
tion II, the brief description of polar encoding is introduced.
In Section III, we propose the generation system of polar
encoder and an exemplary 32-bit polar encoder with 8-parallel
processing. In Section IV, the analysis of the performance of
the generation system is given. In Section V, we conclude and
remark on the entire paper.
II. PRELIMINARIES
A. Polar Encoder
In polar encoding, $u_0^{N-1}$ is regarded as the source word and
$x_0^{N-1}$ as the codeword. The encoding scheme is defined by
Eq. (1), where $G_N$ and $B_N$ are the generator matrix and the
bit-reversal permutation matrix, respectively, and $F^{\otimes n}$ is
the $n$-th Kronecker power of $F$, with $n = \log_2 N$ and
$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$.
$$x_0^{N-1} = u_0^{N-1} G_N = u_0^{N-1} B_N F^{\otimes n}. \tag{1}$$
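To make Eq. (1) concrete, the following minimal Python sketch evaluates it literally as a vector-matrix product over GF(2). It is an illustration only and not part of the proposed hardware; the function names `kron_power`, `bit_reverse`, and `polar_encode` are hypothetical.

```python
def kron_power(n):
    """Return the n-th Kronecker power of F = [[1, 0], [1, 1]] as a list of rows."""
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        # Kronecker product G (x) F: every entry of G scales a copy of F.
        G = [[g * f for g in g_row for f in f_row]
             for g_row in G for f_row in F]
    return G


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of the index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_encode(u):
    """Compute x = u * B_N * F^{(x)n} over GF(2) for a list of bits u."""
    N = len(u)
    n = N.bit_length() - 1
    assert N == 1 << n, "the code length must be a power of two"
    # B_N is a permutation matrix, so u * B_N only reorders the bits.
    v = [u[bit_reverse(j, n)] for j in range(N)]
    G = kron_power(n)
    return [sum(v[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]


print(polar_encode([0, 1, 0, 0]))  # [1, 0, 1, 0]
```

For N = 2 the sketch reduces to x0 = u0 + u1 and x1 = u1 over GF(2), which is the 2-input building block of the encoder.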
As proved in [3], the data-flow graph (DFG) of a polar encoder can be
derived from the DFG of an FFT processor by replacing all the
butterfly modules with XOR-and-PASS modules and all the twiddle
factors with 1's. Therefore, the proposed framework for polar
encoders has the potential to implement pipelined hardware
architectures for FFT by reversing the replacement. An exemplary DFG
of an 8-bit polar encoder is shown in Fig. 1. Note that this DFG is
similar to that of an 8-point radix-2 decimation-in-frequency (DIF)
FFT processor in the way mentioned above.
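The butterfly structure described above can be sketched in software as log2 N stages of XOR-and-PASS operations arranged like a radix-2 DIF FFT. The code below is a behavioral illustration under that assumption, not the pipelined hardware; the function name `xor_and_pass_network` is hypothetical. In natural output order, the network computes $u_0^{N-1} F^{\otimes n}$, i.e., the codeword of Eq. (1) in bit-reversed order.

```python
def xor_and_pass_network(u):
    """Run log2(N) stages of XOR-and-PASS butterflies on a list of bits."""
    x = list(u)
    N = len(x)
    assert N >= 1 and N & (N - 1) == 0, "N must be a power of two"
    span = N // 2                      # DIF order: widest butterflies first
    while span >= 1:
        for start in range(0, N, 2 * span):
            for j in range(start, start + span):
                # Upper output is the XOR of both inputs; lower output passes.
                x[j] ^= x[j + span]
        span //= 2
    return x


print(xor_and_pass_network([0, 1, 0, 0]))  # [1, 1, 0, 0]
```

Each stage performs N/2 XOR operations, so the fully parallel network uses (N/2) log2 N XORs in total.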
Fig. 1. The data-flow graph of an 8-bit polar encoder.
III. HARDWARE GENERATION
In this section, we introduce the general pipelined framework for
polar encoders with arbitrary code length N and arbitrary level of
parallelism M. The general framework can be denoted by a general
formula F(N, M). Then we show how to use an algorithm to derive a
specific formula $f_{N,M}$ from F(N, M) based on the values of N and
M. Finally, a compiler is employed to translate $f_{N,M}$ into an RTL
description. The hardware generation system is illustrated in Fig. 2.
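[Figure 2: flow diagram. The general framework yields the formula F(N, M); the algorithm takes the code length N and the parallelism M and produces the formula f_{N,M}; the compiler translates f_{N,M} into RTL Verilog.]
Fig. 2. The hardware generation system for polar encoder.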
A. From General Framework to Formula
Since the general framework is expected to implement polar encoders
with arbitrary code length and arbitrary level of parallelism, the
framework should be scalable, i.e., the number of stages and the
number of hardware modules in each stage should change with the
values of N and M. Such a scalable framework can be represented by
the formula F(N, M) shown in Eq. (2). Here the parameters N and M are
powers of 2, and 4 ≤ M ≤ N/2. Before going into the details of
F(N, M), we introduce all the symbols that may be used in F(N, M) and
$f_{N,M}$, as well as their corresponding hardware modules. Note that
the final hardware implementation of $f_{N,M}$ is the serial
connection of the individual modules denoted by the symbols. Fig. 3
illustrates all the exemplary modules, together with their symbols,
that may be used in our design; each of them takes $u_0^{N-1}$ as
input and produces $x_0^{N-1}$ as output.
[Figure 3: block diagrams of the modules XP (inputs u0, u1; outputs x0, x1), S_K (a switch with K/2 delay elements D on each side; inputs u0, u1; outputs x0, x1), (I_K ⊗ A) (K stacked copies of A), P_4 (inputs u0 to u3; outputs x0 to x3), and P_8 (inputs u0 to u7; outputs x0 to x7).]
Fig. 3. The symbols and corresponding hardware modules in the formula.
Symbol XP represents an XOR-and-PASS module that computes
x0 = u0 + u1 (in GF(2)) and x1 = u1. The number of inputs of XP is
fixed at 2 in our design.
Symbol $S_K$ (K is a power of 2, K > 1) represents a switch with K/2
delay elements (denoted by D) on each side. A (log2 K)-bit counter
controls the switch: when the most significant bit of the counter is
0, the data are transferred directly; when it is 1, the data are
transferred crosswise. The number of inputs of $S_K$ is fixed at 2.
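The control rule of the switch can be illustrated with a short Python sketch. It models only the select signal, not the delay elements, and it assumes that the counter starts at 0 and increments once per clock cycle; this start phase is our assumption, since a real pipeline stage may need an offset. The function name `switch_select` is hypothetical.

```python
def switch_select(cycle, K):
    """Return 0 (direct transfer) or 1 (cross transfer) for switch S_K."""
    assert K > 1 and K & (K - 1) == 0, "K must be a power of two greater than 1"
    bits = K.bit_length() - 1            # width of the counter: log2(K) bits
    counter = cycle % K                  # the counter wraps around every K cycles
    return (counter >> (bits - 1)) & 1   # most significant bit of the counter


print([switch_select(c, 4) for c in range(8)])  # [0, 0, 1, 1, 0, 0, 1, 1]
```

Under this assumption, $S_K$ stays in the direct position for K/2 cycles and in the cross position for the next K/2 cycles, which matches the K/2 delay elements on each side.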
Symbol $P_N$ (N is a power of 2, N > 2) denotes a permutation on an
N-dimensional vector. The detailed function of $P_N$ is given in
Algorithm 1. Intuitively, $P_N$ is built by duplicating $P_{N/2}$.
For example, $P_8$ can be viewed as the partial overlap of two $P_4$
modules, drawn with red wires and black wires, respectively.
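Algorithm 1 translates almost line by line into Python. The sketch below is a software illustration of the permutation only; the function name `permute` is hypothetical.

```python
def permute(u):
    """Apply the permutation P_N of Algorithm 1 to a list u of length N."""
    N = len(u)
    assert N > 2 and N & (N - 1) == 0, "N must be a power of two greater than 2"
    half = N // 2
    x = [None] * N
    for i in range(0, half, 2):        # even positions of the first half stay
        x[i] = u[i]
    for i in range(N - 1, half, -2):   # odd positions of the second half stay
        x[i] = u[i]
    for i in range(1, half, 2):        # remaining positions are swapped pairwise
        x[i] = u[i - 1 + half]
        x[i - 1 + half] = u[i]
    return x


print(permute(list(range(4))))  # [0, 2, 1, 3]
print(permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Because the third loop only swaps pairs of positions, applying `permute` twice returns the original vector.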
Symbol $(I_K \otimes A)$ (K is a power of 2, K > 0) is a Kronecker
product representing K parallel instances of module A, where A is an
abstract module that can be replaced by XP, $S_K$, or $P_N$. Note
that when K = 1, $(I_K \otimes A)$ equals A. If A has X inputs, the
number of inputs of $(I_K \otimes A)$ equals K × X.
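The parallel composition $(I_K \otimes A)$ can be mimicked in software by applying the same function to K consecutive chunks of the input, which corresponds to the block-diagonal structure of $I_K \otimes A$. The following sketch shows this for the XP module; the names `xp` and `parallel` are hypothetical and the code is purely behavioral.

```python
def xp(pair):
    """XOR-and-PASS module: x0 = u0 XOR u1, x1 = u1."""
    u0, u1 = pair
    return [u0 ^ u1, u1]


def parallel(K, module, width):
    """Build (I_K (x) A): K copies of `module`, each taking `width` inputs."""
    def apply(u):
        assert len(u) == K * width, "the stage expects K * width inputs"
        out = []
        for c in range(K):
            # Copy c processes inputs c*width .. (c+1)*width - 1.
            out.extend(module(u[c * width:(c + 1) * width]))
        return out
    return apply


stage = parallel(2, xp, 2)       # (I_2 (x) XP), 4 inputs
print(stage([1, 1, 0, 1]))       # [0, 1, 1, 1]
```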
The general formula F(N, M) is composed of the symbols mentioned
above, except that W in Eq. (2) is a variable module. When deriving
$f_{N,M}$ from F(N, M), symbol W is replaced by $P_N$ or $S_K$
according to its subscript. In Algorithm 2, once the code length and
the level of parallelism are given, the subscripts of all symbols in
F(N, M) are determined. Then each module $(I \otimes W)$ is replaced
by $(I \otimes P)$ or $(I \otimes S)$ based on the value of the
subscript of W. Finally, the formula $f_{N,M}$ is determined.
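The derivation of $f_{N,M}$ from F(N, M) can be sketched in Python as follows. This is a minimal illustration of Eq. (2) and Algorithm 2 under our reading of them, not the authors' tool; the function names and the tuple representation (copies, module, subscript) are hypothetical.

```python
from fractions import Fraction


def derive_formula(N, M):
    """Expand F(N, M) into f_{N,M} as a left-to-right list of stages.

    Each stage is (copies, module, subscript): (4, "S", 2) stands for
    (I_4 (x) S_2) and (1, "P", 8) stands for P_8.
    """
    n = N.bit_length() - 1
    assert N == 1 << n and M & (M - 1) == 0 and 4 <= M <= N // 2
    xp = (M // 2, "XP", None)
    stages = [xp, (M // 4, "P", 4)]
    for i in range(n - 2):                       # i = 0 .. log2(N) - 3
        sub = Fraction(N, (2 ** i) * M)          # subscript of W in Eq. (2)
        if sub <= 1:                             # sub = 1/k with k >= 1
            k = sub.denominator
            stages.append((k, "P", M // k))      # (I_k (x) P_{M/k})
        else:
            stages.append((M // 2, "S", sub.numerator))
        stages.append(xp)
    stages += [(M // 4, "P", 4), (M // 2, "S", N // M), xp]
    return stages


def count_resources(stages):
    """Count the XOR gates and delay elements implied by a stage list."""
    xors = sum(c for c, name, _ in stages if name == "XP")
    # Each S_K has K/2 delay elements on each of its two sides, K in total.
    delays = sum(c * k for c, name, k in stages if name == "S")
    return xors, delays


print(count_resources(derive_formula(32, 8)))  # (20, 40)
```

For N = 32 and M = 8 the sketch yields 11 stages, and the resource counts agree with the 20 XOR gates and 40 delay elements reported for the 32-bit 8-parallel example.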
B. Compiler
We have built a compiler in Python that takes $f_{N,M}$ as input and
automatically connects all the basic modules in $f_{N,M}$ in
left-to-right order. Specifically, once N and M are substituted into
F(N, M), the formula $f_{N,M}$ is determined and transformed into
register-transfer level (RTL) Verilog by the compiler. The details of
the compiler are beyond the scope of this paper; we only provide a
brief introduction here.
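$$(I_{M/2} \otimes \mathrm{XP})(I_{M/4} \otimes P_4)\left\{\prod_{i=0}^{\log_2 N - 3}\left[(I_{M/2} \otimes W_{N/(2^i M)})(I_{M/2} \otimes \mathrm{XP})\right]\right\}(I_{M/4} \otimes P_4)(I_{M/2} \otimes S_{N/M})(I_{M/2} \otimes \mathrm{XP}) \tag{2}$$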
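$$(I_4 \otimes \mathrm{XP})(I_2 \otimes P_4)\left\{(I_4 \otimes W_4)(I_4 \otimes \mathrm{XP})(I_4 \otimes W_2)(I_4 \otimes \mathrm{XP})(I_4 \otimes W_1)(I_4 \otimes \mathrm{XP})\right\}(I_2 \otimes P_4)(I_4 \otimes S_4)(I_4 \otimes \mathrm{XP}) \tag{5}$$
$$(I_4 \otimes \mathrm{XP})(I_2 \otimes P_4)\left\{(I_4 \otimes S_4)(I_4 \otimes \mathrm{XP})(I_4 \otimes S_2)(I_4 \otimes \mathrm{XP})P_8(I_4 \otimes \mathrm{XP})\right\}(I_2 \otimes P_4)(I_4 \otimes S_4)(I_4 \otimes \mathrm{XP}) \tag{6}$$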
There are three types of basic modules in the formula $f_{N,M}$: the
XOR-and-PASS module XP, the switch module $S_K$, and the permutation
module $P_N$. There are two ways to expand these modules. The first
is to use the symbol $I_K \otimes$ to lay out K copies of one module
in parallel. The other is to change the symbols' subscripts.
Therefore, the compiler needs to read each symbol of $f_{N,M}$ from
left to right and recognize $I_K \otimes$ as well as each symbol's
subscript. The compiler can then determine the specific hardware
architecture and print the Verilog files.
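The left-to-right reading step can be illustrated with a small parser. The sketch below is not the authors' compiler and stops short of emitting Verilog; it assumes a hypothetical plain-text syntax in which `(I4*XP)` stands for $(I_4 \otimes \mathrm{XP})$ and a bare `P8` stands for $P_8$. All names in it are our own.

```python
import re

# One stage: either "(I<K>*<module>)" or a bare "<module>" (meaning K = 1).
STAGE = re.compile(r"\(I(\d+)\*(XP|[SP]\d+)\)|(XP|[SP]\d+)")


def parse_formula(text):
    """Read a formula left to right into a list of (copies, module) pairs."""
    text = text.replace(" ", "")
    stages, pos = [], 0
    while pos < len(text):
        match = STAGE.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognized symbol at position {pos}")
        copies = int(match.group(1)) if match.group(1) else 1
        module = match.group(2) or match.group(3)
        stages.append((copies, module))
        pos = match.end()
    return stages


def stage_width(copies, module):
    """Number of parallel data lines occupied by one stage."""
    # P_N takes N inputs; XP and S_K always take 2 inputs.
    inputs = int(module[1:]) if module.startswith("P") else 2
    return copies * inputs


F_32_8 = ("(I4*XP)(I2*P4)(I4*S4)(I4*XP)(I4*S2)(I4*XP)"
          "P8(I4*XP)(I2*P4)(I4*S4)(I4*XP)")
stages = parse_formula(F_32_8)
print(len(stages), {stage_width(c, m) for c, m in stages})  # 11 {8}
```

A useful sanity check before generating any RTL is that every stage has the same width M, so that consecutive stages can be wired together directly.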
C. Input and Output Orders
The input and output data of this framework are in regular order.
Suppose the input data of $f_{N,M}$ is $u_0^{N-1}$. Since $f_{N,M}$
represents a pipelined architecture, $u_0^{N-1}$ is divided into N/M
M-dimensional vectors $V_{\mathrm{in}}(i)$, as given in Eq. (3),
where i = 0, 1, ..., (N/M) − 1. All the data in $V_{\mathrm{in}}(i)$
enter the encoder in parallel, and i indicates the position of the
input vector in the sequence, i.e., $V_{\mathrm{in}}(0)$ is the first
set of input data and $V_{\mathrm{in}}(N/M - 1)$ is the last. The
output data are in bit-reversal order. Specifically, suppose
$x_0^{N-1}$ is the theoretical codeword and $y_0^{N-1}$ is the
bit-reversed form of $x_0^{N-1}$. Then the i-th output vector
$V_{\mathrm{out}}(i)$ equals $y_{M \times i}^{(M \times i)+(M-1)}$,
where i = 0, 1, ..., (N/M) − 1.
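$$V_{\mathrm{in}}(i) = \begin{bmatrix}
u_{(M/2)\times i} \\
u_{(M/2)\times i+(N/2)} \\
u_{(M/2)\times i+1} \\
u_{(M/2)\times i+1+(N/2)} \\
u_{(M/2)\times i+2} \\
u_{(M/2)\times i+2+(N/2)} \\
\vdots \\
u_{(M/2)\times i+(M/2)-1} \\
u_{(M/2)\times i+(M/2)-1+(N/2)}
\end{bmatrix}. \tag{3}$$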
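The index patterns of the input and output vectors can be listed with a few lines of Python. The sketch below only enumerates indices according to Eq. (3) and the bit-reversal rule stated above; the function names are hypothetical.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of the index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def input_vector(i, N, M):
    """Indices of u that form V_in(i) according to Eq. (3)."""
    base = (M // 2) * i
    indices = []
    for j in range(M // 2):
        # Each pair feeds one XP module: u[base + j] and u[base + j + N/2].
        indices += [base + j, base + j + N // 2]
    return indices


def output_vector(i, N, M):
    """Indices of the codeword x that appear in V_out(i) = y[M*i .. M*i+M-1]."""
    n = N.bit_length() - 1
    return [bit_reverse(k, n) for k in range(M * i, M * i + M)]


print(input_vector(0, 32, 8))   # [0, 16, 1, 17, 2, 18, 3, 19]
print(output_vector(0, 32, 8))  # [0, 16, 8, 24, 4, 20, 12, 28]
```

Over all i = 0, 1, ..., (N/M) − 1, every input index and every output index appears exactly once.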
For the general framework, the processing latency (in clock cycles)
is $T_{\mathrm{latency}} = \frac{3N}{2M} - 1$. The numbers of XOR
gates and delay elements are:
$$\#\mathrm{XOR} = (M/2) \times \log_2 N; \qquad \#\mathrm{MEM} = (3N/2) - M. \tag{4}$$
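As a worked instance of these expressions, substituting N = 32 and M = 8 (the example of Section III-D) gives:

```latex
\begin{align*}
T_{\mathrm{latency}} &= \frac{3 \times 32}{2 \times 8} - 1 = 6 - 1 = 5 \text{ clock cycles}, \\
\#\mathrm{XOR} &= (8/2) \times \log_2 32 = 4 \times 5 = 20, \\
\#\mathrm{MEM} &= (3 \times 32 / 2) - 8 = 48 - 8 = 40.
\end{align*}
```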
D. A 32-Bit 8-Parallel Polar Encoder
According to Algorithm 2, given N = 32 and M = 8, the formulas
F(32, 8) and $f_{32,8}$ are obtained in Eq. (5) and Eq. (6),
respectively. The hardware architecture is illustrated in Fig. 4; it
consists of 20 XOR gates and 40 delay elements, in accordance with
Eq. (4). The architecture can be split into 11 columns, each of which
has its relevant symbol under the column. Note that Eq. (6) is
composed of exactly the symbols at the bottom of Fig. 4. The order of
the input data u (k = 0, 1, 2, 3) at the leftmost part of Fig. 4
conforms to the order described in Section III-C. The output data x
are in bit-reversal order.
Algorithm 1 The Permutation on an N-dimensional Vector
Require: The input vector $u_0^{N-1}$.
for (i = 0; i < N/2; i = i + 2) do
    x[i] = u[i]
end for
for (i = N − 1; i > N/2; i = i − 2) do
    x[i] = u[i]
end for
for (i = 1; i < N/2; i = i + 2) do
    x[i] = u[i − 1 + (N/2)]
    x[i − 1 + (N/2)] = u[i]
end for
Output $x_0^{N-1}$.
Algorithm 2 The Generation of Formula $f_{N,M}$
Require: The code length N and the level of parallelism M
(N and M are powers of 2, 4 ≤ M ≤ N/2).
Input N and M into the general formula F(N, M).
for each module $(I_{M/2} \otimes W_{1/k})$ in F(N, M) do
    if (k ≥ 1, k = 2^i, i ∈ Z) then
        $(I_{M/2} \otimes W_{1/k}) = (I_k \otimes P_{M/k})$
    else
        $(I_{M/2} \otimes W_{1/k}) = (I_{M/2} \otimes S_{1/k})$
    end if
end for
$f_{N,M}$ = F(N, M)
Output the formula $f_{N,M}$.
[Figure 4: schematic of the 8-parallel pipeline in 11 columns, labeled from left to right (I4 ⊗ XP), (I2 ⊗ P4), (I4 ⊗ S4), (I4 ⊗ XP), (I4 ⊗ S2), (I4 ⊗ XP), P8, (I4 ⊗ XP), (I2 ⊗ P4), (I4 ⊗ S4), (I4 ⊗ XP). Inputs: u4k to u4k+3 and u4k+16 to u4k+19; outputs: x in bit-reversal order.]
Fig. 4. The hardware architecture of polar encoder with N = 32, M = 8.
IV. PERFORMANCE AND COMPLEXITY
Some of the hardware designs derived from the auto-generation system
were implemented on the Xilinx Virtex-7 VC709 FPGA platform with a
Virtex-7 XC7VX690T device. All the design examples have the same code
length N = 1024 but different levels of parallelism. The synthesis
results are listed in Table I. From the table, it can be observed
that the throughput (T/P) and the number of Slice LUTs increase as
the value of M increases; the number of Slice Registers also grows,
except between M = 256 and M = 512. In an extreme case, the polar
encoder with M = 512 consumes more Slice LUTs than the polar encoder
with M = 4 by 5167% but achieves higher throughput by 8710%.
As mentioned in Section III, the value of M satisfies
4 ≤ M ≤ N/2. Then, given the code length N, the generation system can
implement (log2 N) − 2 designs with different M, covering a wide
cost/performance trade-off space. Therefore, one can choose the most
suitable polar encoder in the design space to fit the application.
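The design space can be enumerated directly from the closed-form expressions of Eq. (4) and the latency formula of Section III-C. The following Python sketch lists every admissible M for a given N together with its predicted cost; the function name `design_space` and the dictionary keys are hypothetical, and the figures are analytical counts rather than synthesis results.

```python
def design_space(N):
    """List all designs with 4 <= M <= N/2 (M a power of two) for code length N."""
    n = N.bit_length() - 1
    assert N == 1 << n and N >= 8, "N must be a power of two and at least 8"
    designs = []
    M = 4
    while M <= N // 2:
        designs.append({
            "M": M,
            "latency": 3 * N // (2 * M) - 1,   # clock cycles
            "xor": (M // 2) * n,               # number of XOR gates
            "mem": 3 * N // 2 - M,             # number of delay elements
        })
        M *= 2
    return designs


for design in design_space(1024):
    print(design)
```

For N = 1024 the sketch returns (log2 1024) − 2 = 8 designs, from M = 4 (latency 383 cycles, 20 XOR gates, 1532 delay elements) to M = 512 (latency 2 cycles, 2560 XOR gates, 1024 delay elements).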
TABLE I
IMPLEMENTATION OF THE HARDWARE DESIGNS DERIVED FROM THE AUTO-GENERATION SYSTEM ON THE XILINX VIRTEX-7 VC709 FPGA PLATFORM WITH VIRTEX-7 XC7VX690T.
N     M    Slice LUTs  Slice Registers  Max freq (MHz)  T/P (Gbps)
1024  4    148         82               519.535         2.07
1024  32   467         312              407.05          13.02
1024  128  1278        845              340.518         43.58
1024  256  1704        1194             348.712         89.27
1024  512  2628        1025             356.223         182.38
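The throughput column of Table I appears consistent with M bits being processed per clock cycle at the maximum frequency, i.e., T/P = M × f_max. This relation is our reading of the table and is not stated explicitly in the text; the short Python check below, with hypothetical names, recomputes the column under that assumption.

```python
# (M, maximum frequency in MHz, reported T/P in Gbps), copied from Table I.
TABLE_I = [
    (4, 519.535, 2.07),
    (32, 407.05, 13.02),
    (128, 340.518, 43.58),
    (256, 348.712, 89.27),
    (512, 356.223, 182.38),
]


def throughput_gbps(M, f_mhz):
    """Throughput in Gbps, assuming M bits are processed every clock cycle."""
    return M * f_mhz / 1000.0


for M, f_mhz, reported in TABLE_I:
    print(M, round(throughput_gbps(M, f_mhz), 3), reported)
```

Under this assumption, every recomputed value agrees with the reported one to within 0.01 Gbps.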
V. CONCLUSION
This paper proposes an auto-generation system for the
hardware architecture of polar encoder. The system could
offer users a wide range of design space so that the users
could make a trade-off between cost and performance to best
fit their applications. The essence of the generation system
lies in the formula-based expression of the general framework
for polar encoder that could achieve encoding with arbitrary
code length and arbitrary parallelism. This auto-generator helps
designers conveniently design polar encoders without touching
hardware details. The derived design space further helps to identify
the required design.
In this paper, we also introduce the scalable hardware mod-
ules associated with the formula, as well as the compiler that
could transform the formula into RTL Verilog files. Synthesis
results on FPGA have demonstrated the efficiency and the
large trade-off space of the auto-generated circuits.
Future work will be directed toward the auto-generation of successive
cancellation polar decoders and belief propagation decoders based on
our previous works [6–8], and toward design optimization based on the
design space.
REFERENCES
[1] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, 2009.
[2] K. K. Parhi, C.-Y. Wang, and A. P. Brown, “Synthesis of control circuits in folded pipelined DSP architectures,” IEEE Journal of Solid-State Circuits, vol. 27, no. 1, pp. 29–43, 1992.
[3] C. Zhang, J. Yang, X. You, and S. Xu, “Pipelined implementations of polar encoder and feed-back part for SC polar decoder,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2015, pp. 3032–3035.
[4] H. Yoo and I.-C. Park, “Partially parallel encoder architecture for long polar codes,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 3, pp. 306–310, 2015.
[5] P. Milder, F. Franchetti, J. C. Hoe, and M. Püschel, “Computer generation of hardware for linear digital signal processing transforms,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, p. 15, 2012.
[6] C. Zhang, B. Yuan, and K. K. Parhi, “Reduced-latency SC polar decoder architectures,” in Proc. IEEE International Conference on Communications (ICC), June 2012, pp. 3471–3475.
[7] C. Zhang and K. Parhi, “Low-latency sequential and overlapped architectures for successive cancellation polar decoder,” IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2429–2441, 2013.
[8] J. Yang, C. Zhang, H. Zhou, and X. You, “Pipelined belief propagation polar decoders,” in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2016, pp. 413–416.
