Reduced-Latency SC Polar Decoder Architectures by Zhang, Chuan et al.
 
Chuan Zhang, Bo Yuan, and Keshab K. Parhi 
Department of Electrical and Computer Engineering 
University of Minnesota, Twin Cities 
Minneapolis, MN 55455, USA 
{zhan0884, yuan0103, parhi}@umn.edu 
 
 
Abstract—Polar codes have become one of the most favorable 
capacity-achieving error correction codes (ECC), in part because of their simple encoding method. However, in the few existing successive cancellation (SC) polar decoder designs, the long code length required in practice makes the decoding latency high. In this paper, the conventional decoding algorithm is transformed with look-ahead techniques, which reduces the decoding latency by 50%. With pipelining and parallel processing schemes, a parallel SC polar decoder is proposed. A sub-structure sharing approach is employed to design the merged processing element (PE). Moreover, inspired by the real FFT architecture, this paper presents a novel input generating circuit (IGC) block that can generate additional input signals for merged PEs on-the-fly. Gate-level analysis shows that the proposed design achieves half the decoding latency and twice the throughput of the conventional one with similar hardware cost.
Keywords-Polar codes; successive cancellation; look-ahead; 
sub-structure sharing; on-the-fly. 
I.  INTRODUCTION 
Proposed by Arıkan [1], polar codes are regarded as the first “low complexity” scheme that provably achieves the capacity of a fairly wide array of channels. However, most related research has focused on code performance rather than efficient decoder design. Among the few publications on the latter topic, [1] proposed a straightforward implementation of the successive cancellation (SC) algorithm, whose complexity is $O(N \log_2 N)$. Compared with the belief propagation (BP) algorithm [1]-[2], the SC approach is more suitable for hardware design due to its lower complexity. Several revised SC polar decoders with complexity of $O(N)$ were presented in [3].
For these conventional polar decoders, decoding a code of length N requires 2(N-1) clock cycles. Since modern communication systems require code lengths greater than $2^{10}$, the resulting decoding delay is high. Also, restricted by the successive schedule, the highest hardware efficiency of an active stage can only be 50%. In order to achieve faster decoding and higher efficiency, the loop computation is reformulated with look-ahead techniques, which pre-calculate all possible values of the next code bit and then select the correct one with a multiplexer. This paper proposes a recursive time chart construction method that reduces the decoding latency by 50%. A parallel decoder example is proposed at gate level with VLSI-DSP techniques. A hardware-efficient merged processing element (PE) and the input generating circuit (IGC) block, which is well matched to the whole decoder, are also employed. Comparison results show that the proposed design achieves half the decoding latency and twice the throughput while maintaining hardware complexity comparable to that of the conventional one.
The remainder of this paper is organized as follows. A brief 
review of min-sum SC decoding algorithm is provided in 
Section II. In Section III, the systematic algorithm to construct 
the look-ahead scheduling scheme is given in a recursive 
manner. The parallel polar decoder architectures with gate-level design details are proposed in Section IV. The performance 
estimation and comparison with the state-of-the-art design are 
presented in Section V. Section VI concludes the paper. 
II. REVIEW OF MIN-SUM SC ALGORITHM 
Consider an arbitrary polar code with parameter $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ [1]. We denote the input vector as $u_1^N$, which consists of a random part $u_{\mathcal{A}}$ and a frozen part $u_{\mathcal{A}^c}$. The corresponding output vector through channel $W_N$ is $y_1^N$ with conditional probability $W_N(y_1^N \mid u_1^N)$. Define the likelihood ratio (LR) as
$$L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \triangleq \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 1)}. \tag{1}$$
The a posteriori decision scheme is given as follows. Here $\mathcal{A}^c$ denotes the index set of channels associated with frozen bits. LRs with even and odd indices can be generated by recursively applying Eq. (2) and (3), respectively.
A Posteriori Decision Scheme with Frozen Bits
1: if $i \in \mathcal{A}^c$ then $\hat{u}_i = u_i$;
2: else
3:         if $L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \geq 1$ then $\hat{u}_i = 0$;
4:         else $\hat{u}_i = 1$;
5:         endif
6: endif

$$L_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1}) = \left[L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right]^{1-2\hat{u}_{2i-1}} \cdot L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}), \tag{2}$$
$$L_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) = \frac{L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) \, L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}) + 1}{L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) + L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})}. \tag{3}$$
The decoding procedure of a polar code example with N = 
8 is illustrated in Fig. 1, where Type I and Type II PEs are in 
charge of Eq. (2) and (3), respectively. The label attached to 
each PE indicates the clock cycle index when it is activated. 
Figure 1:  SC decoding process of polar codes with length N = 8. 
 
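In the logarithm domain, Eq. (2) and (3) can be rewritten as:
$$\mathcal{L}_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1}) = (-1)^{\hat{u}_{2i-1}} \mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) + \mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}), \tag{4}$$
$$\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) = 2\,\mathrm{artanh}\left\{\tanh\left[\tfrac{1}{2}\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right] \cdot \tanh\left[\tfrac{1}{2}\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})\right]\right\}, \tag{5}$$
where the log-likelihood ratio (LLR) is defined as
$$\mathcal{L}_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \triangleq \ln L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}). \tag{6}$$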
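As a numerical sanity check of the relation between the LR-domain updates of Eq. (2)-(3) and the logarithm-domain updates of Eq. (4)-(5), the following minimal sketch evaluates both forms and compares them through Eq. (6). The function names and the sample values are illustrative assumptions, not part of the proposed design; `a` and `b` stand for the LRs of the $y_1^{N/2}$ and $y_{N/2+1}^{N}$ branches, and `alpha`, `beta` for their natural logarithms.

```python
import math

def type1_lr(a, b, u):
    """Eq. (2): LR-domain Type I update; u is the previous decision (0 or 1)."""
    return a ** (1 - 2 * u) * b

def type2_lr(a, b):
    """Eq. (3): LR-domain Type II update."""
    return (a * b + 1) / (a + b)

def type1_llr(alpha, beta, u):
    """Eq. (4): LLR-domain Type I update."""
    return (-1) ** u * alpha + beta

def type2_llr(alpha, beta):
    """Eq. (5): LLR-domain Type II update (arguments of tanh are halved LLRs)."""
    return 2 * math.atanh(math.tanh(alpha / 2) * math.tanh(beta / 2))

def domains_agree(alpha, beta):
    """True if ln(LR-domain result) matches the LLR-domain result, per Eq. (6)."""
    a, b = math.exp(alpha), math.exp(beta)
    ok = math.isclose(math.log(type2_lr(a, b)), type2_llr(alpha, beta), abs_tol=1e-9)
    for u in (0, 1):
        ok = ok and math.isclose(
            math.log(type1_lr(a, b, u)), type1_llr(alpha, beta, u), abs_tol=1e-9
        )
    return ok

if __name__ == "__main__":
    samples = [(x, y) for x in (-2.0, -0.3, 0.7, 1.5) for y in (-1.1, 0.4, 2.2)]
    print(all(domains_agree(x, y) for x, y in samples))
```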
Since a large look-up table (LUT) is required to implement Eq. (5), it is replaced by the sub-optimal min-sum update rule:
$$\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) \approx \mathrm{sgn}\left[\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right] \mathrm{sgn}\left[\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})\right] \cdot \min\left[\left|\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\right|, \left|\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})\right|\right]. \tag{7}$$
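To illustrate how the min-sum rule of Eq. (7) relates to the exact update, the sketch below evaluates both for a few LLR pairs. It assumes the standard identity in which the exact update is $2\,\mathrm{artanh}$ of the product of the tanh of the halved LLRs; the function names and sample values are hypothetical. The min-sum output always has the same sign as the exact one and a magnitude that is never smaller.

```python
import math

def type2_exact(alpha, beta):
    """Exact LLR-domain Type II update (LUT-based in hardware)."""
    return 2 * math.atanh(math.tanh(alpha / 2) * math.tanh(beta / 2))

def type2_min_sum(alpha, beta):
    """Min-sum approximation of Eq. (7): product of signs times the smaller magnitude."""
    sign = math.copysign(1.0, alpha) * math.copysign(1.0, beta)
    return sign * min(abs(alpha), abs(beta))

if __name__ == "__main__":
    for alpha, beta in [(4.0, -0.5), (-3.0, -2.5), (1.0, 1.2)]:
        exact = type2_exact(alpha, beta)
        approx = type2_min_sum(alpha, beta)
        print(f"inputs=({alpha}, {beta})  exact={exact:.3f}  min-sum={approx:.3f}")
```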
Simulation results have shown that the min-sum SC algorithm, which is LUT-free, strikes a balance between decoding performance and hardware efficiency [3], which is very attractive for VLSI designers. Therefore, only the min-sum SC decoding algorithm is considered in the following sections.
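The min-sum SC decoding procedure reviewed in this section can be sketched in software as follows. This is a minimal behavioral model that applies Eq. (4) and Eq. (7) by direct recursion and takes decisions in the LLR domain (LLR ≥ 0 corresponds to LR ≥ 1); it recomputes intermediate LLRs instead of storing them per stage, so it does not reflect the hardware schedule of Fig. 1. The function names, the frozen set, and the noiseless test channel are assumptions made only for illustration.

```python
def f_min_sum(a, b):
    """Type II PE, Eq. (7)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))

def g(a, b, u):
    """Type I PE, Eq. (4): u is the previously decided bit u_{2i-1}."""
    return (-a if u else a) + b

def polar_encode(u):
    """Recursive encoder matching Eq. (2)-(3): the first half of the codeword
    carries (u_odd XOR u_even), the second half carries u_even."""
    if len(u) == 1:
        return list(u)
    odd, even = u[0::2], u[1::2]          # 1-based odd / even positions
    mixed = [o ^ e for o, e in zip(odd, even)]
    return polar_encode(mixed) + polar_encode(even)

def llr(i, y, u_prev):
    """LLR of bit i (1-based) from channel LLRs y and decisions u_1 .. u_{i-1}."""
    n = len(y)
    if n == 1:
        return y[0]
    decided = u_prev if i % 2 else u_prev[:-1]   # u_1 .. u_{2k-2}
    odd, even = decided[0::2], decided[1::2]
    mixed = [o ^ e for o, e in zip(odd, even)]
    k = (i + 1) // 2
    a = llr(k, y[: n // 2], mixed)               # branch fed by y_1^{N/2}
    b = llr(k, y[n // 2:], even)                 # branch fed by y_{N/2+1}^N
    return f_min_sum(a, b) if i % 2 else g(a, b, u_prev[-1])

def sc_decode(y, frozen):
    """frozen maps a 1-based bit index to its known value."""
    u_hat = []
    for i in range(1, len(y) + 1):
        if i in frozen:
            u_hat.append(frozen[i])
        else:
            u_hat.append(0 if llr(i, y, u_hat) >= 0 else 1)
    return u_hat

if __name__ == "__main__":
    frozen = {1: 0, 2: 0, 3: 0, 5: 0}            # illustrative frozen set for N = 8
    u = [0, 0, 0, 1, 0, 1, 1, 0]
    x = polar_encode(u)
    y = [4.0 if bit == 0 else -4.0 for bit in x]  # noiseless channel LLRs
    print(sc_decode(y, frozen) == u)
```

For N = 4, `polar_encode` maps $(u_1, u_2, u_3, u_4)$ to $(u_1 \oplus u_2 \oplus u_3 \oplus u_4,\ u_3 \oplus u_4,\ u_2 \oplus u_4,\ u_4)$, which agrees with the partial-sum labels of Fig. 1.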
III. LATENCY-REDUCED UPDATING SCHEME 
However, in all of the algorithms above, the probabilities are updated according to the same data flow illustrated in Fig. 1, which is straightforward but not efficient. In this section, a high-performance scheme for the polar decoder, which needs only half the number of clock cycles to obtain the estimated information bits, is developed in a recursive manner. Careful investigation reveals that the time chart of the straightforward SC decoding process for N-bit polar codes can be constructed recursively as follows:
Recursive Construction of Conventional Time Chart
1: initialization TC = ∅;
2: for $i = \log_2 N, \log_2 N - 1, \ldots, 1$ do
3:         $j = 2^{\log_2 N - i}$;
4:         TC = {[j of Type I, TC], i};
5:         TC = [TC, TC];
6:         change the leftmost j of Type I with j of Type II;
7: endfor
8: output TC.

Here the notation TC = {[C, TC], s} denotes the left insertion of an array C into the previously arranged time chart TC at Stage s. Similarly, TC = [TC, TC] simply means duplicating the previous time chart to obtain the new one. i and j are iteration indices. “j of Type I” is short for “j copy/copies of Type I PE(s) is/are activated in that clock cycle”. The corresponding time chart is illustrated in Fig. 2 (a). Since Stage i is activated $2^i$ times during the whole decoding process, the total number of clock cycles required is
$$\sum_{i=1}^{\log_2 N} 2^i = 2 \sum_{i=0}^{\log_2 N - 1} 2^i = 2 \cdot \frac{2^{\log_2 N} - 1}{2 - 1} = 2(N-1). \tag{8}$$
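The recursive construction of the conventional time chart and the cycle count of Eq. (8) can be reproduced with a short script. The sketch below is an illustrative model, not the authors' tool: each clock cycle is represented by a hypothetical tuple (stage, number of active PEs, PE type), and N is assumed to be a power of 2.

```python
def conventional_time_chart(n):
    """Build the conventional SC time chart for code length n (a power of 2).

    Returns one tuple (stage, active_pes, pe_type) per clock cycle.
    """
    stages = n.bit_length() - 1            # log2(n)
    tc = []
    for i in range(stages, 0, -1):         # i = log2(n), ..., 1
        j = n >> i                          # 2**(log2(n) - i) PEs in Stage i
        tc = [(i, j, "I")] + tc             # left insertion at Stage i
        tc = tc + tc                        # duplicate the time chart
        tc[0] = (i, j, "II")                # leftmost Type I becomes Type II
    return tc

if __name__ == "__main__":
    chart = conventional_time_chart(8)
    for clock, (stage, count, pe_type) in enumerate(chart, start=1):
        print(f"clock {clock:2d}: Stage {stage}, {count} of Type {pe_type}")
    print(len(chart) == 2 * (8 - 1))
```

For N = 8 the resulting chart places the Type I activations of Stage 1 in Clock cycle 8, those of Stage 2 in Clock cycles 5 and 12, and those of Stage 3 in Clock cycles 4, 7, 11, and 14.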
 
Figure 2:  Conventional and look-ahead decoding time charts for polar codes with N = 8. (a) Conventional time chart: Clock cycles 1 to 14, with rows for Stage 1, Stage 2, Stage 3, and Output. (b) Look-ahead time chart: Clock cycles 1 to 7, with rows for Stage 1, Stage 2, Stage 3, and Output.
Figure 3:  Pipelined decoder architectures of polar codes with length N = 8. 
 
However, as mentioned previously, the conventional decoding approach is not suitable for real-time communication systems for two reasons. First, in order to achieve the required performance, the code length N is usually set between $2^{10}$ and $2^{20}$. An immediate consequence is that the latency of 2(N-1) clock cycles is too large. Second, during the whole decoding process the highest hardware utilization in a specific clock cycle is only 50% (Clock cycle 1). As the stage index increases, the hardware utilization goes down to as low as 1/N (Clock cycle $\log_2 N$), which can be lower than $2^{-10}$ for practical applications. Even for the pipelined tree architecture proposed by [3], shown in Fig. 3, the highest utilization is only 50% as well, which means that at least half of the PEs are idle during each clock cycle.
This dilemma is caused by the sequential decoding property of the SC algorithm. Note that if both LLR inputs of Eq. (4) are available, there are only two possible outputs, depending on the value of $\hat{u}_{2i-1}$. Therefore, for a Type I PE with both inputs determined, the look-ahead scheme only needs to pre-compute two output candidates, and the correct one is selected by a multiplexer afterwards. For instance, as shown in Fig. 1, all possible outputs of the Type I PEs labeled 8 in Stage 1 can be pre-calculated in Clock cycle 1. In other words, for Stage 1 the computation required in Clock cycle 8 can be incorporated into Clock cycle 1. Similarly, for Stage 2 the computations in Clock cycles 5 and 12 can be performed in Clock cycles 2 and 9, respectively. For Stage 3, the calculations in Clock cycles 4, 7, 11, and 14 can be re-scheduled into Clock cycles 3, 6, 10, and 13. As a result, only half of the clock cycles are required to implement the same decoding task with the proposed look-ahead schedule.
For the 8-bit polar decoder example shown in Fig. 2 (b), all PEs at Stage 1 are activated during Clock cycle 1 because both LLR inputs of each PE are determined by the channel outputs. However, in Clock cycle 2, only the PEs labeled 2 or 5 can be activated, because they are the only ones with determined inputs. For the PEs labeled 9 or 12, the inputs are generated by Type I PEs in Stage 1 and still have two possible values at this moment. To avoid propagating the uncertainty of pre-computed values to the next stage, those PEs stay idle during Clock cycle 2. Similar schemes apply to the rest of the decoding process. The required number of clock cycles is thus halved to N-1. The time chart construction of the proposed scheme is given as follows:
Recursive Construction of Look-Ahead Time Chart
1: initialization TC = ∅;
2: for $i = \log_2 N, \log_2 N - 1, \ldots, 1$ do
3:         $j = 2^{\log_2 N - i}$;
4:         TC = {[j of Type I & II, TC], i};
5:         if i = 1 then
6:                 break
7:         endif
8:         TC = [TC, TC];
9: endfor
10: output TC.

As indicated by Step 4, both types of PEs can work simultaneously in the same clock cycle, which not only shortens the decoding latency by 50% but also doubles the hardware utilization. Moreover, the proposed approach leads to a recursive construction method. To clarify the Russian Doll-like relationship between stages, the conventional and look-ahead construction processes are indicated with arrows in Fig. 2.
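The look-ahead construction can be modeled in the same way as the conventional one. In the minimal sketch below, each clock cycle is a hypothetical tuple (stage, number of active merged Type I & II PEs); the representation and the function name are illustrative assumptions, and N is assumed to be a power of 2.

```python
def look_ahead_time_chart(n):
    """Build the look-ahead time chart for code length n (a power of 2).

    Returns one tuple (stage, active_merged_pes) per clock cycle.
    """
    stages = n.bit_length() - 1            # log2(n)
    tc = []
    for i in range(stages, 0, -1):         # i = log2(n), ..., 1
        j = n >> i                          # 2**(log2(n) - i) PEs in Stage i
        tc = [(i, j)] + tc                  # Step 4: left insertion at Stage i
        if i == 1:                          # Steps 5-7: no duplication at Stage 1
            break
        tc = tc + tc                        # Step 8: duplicate the time chart
    return tc

if __name__ == "__main__":
    chart = look_ahead_time_chart(8)
    for clock, (stage, count) in enumerate(chart, start=1):
        print(f"clock {clock}: Stage {stage}, {count} of Type I & II")
    print(len(chart) == 8 - 1)
```

For N = 8 the numbers of active PEs per clock cycle are 4, 2, 1, 1, 2, 1, 1, which gives the N-1 = 7 clock cycles of the look-ahead schedule.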
IV. ARCHITECTURES FOR LOOK-AHEAD DECODER 
A. Design of Type I PE 
According to the look-ahead scheme, the Type I PE pre-computes the two possible outputs in parallel, so it is in fact an adder-subtractor. Suppose X and Y are the two operands, and $Z_{in}$ is the carry-in or borrow-in bit. For the full adder, the sum and carry-out bits are denoted by S and $C_{out}$. The difference and borrow-out produced by the full subtractor are denoted by D and $B_{out}$. The logic equations are as follows:
$$S = X \oplus Y \oplus Z_{in}; \tag{9}$$
$$C_{out} = X \cdot Y + (X \oplus Y) \cdot Z_{in}; \tag{10}$$
$$D = X \oplus Y \oplus Z_{in}; \tag{11}$$
$$B_{out} = \overline{X} \cdot Y + \overline{X \oplus Y} \cdot Z_{in}. \tag{12}$$
Figure 4:  Proposed 1-bit adder-subtractor architectures: (a) 1-bit full adder-subtractor; (b) 1-bit half adder-subtractor.
Notice that S and D are actually the same, and $\overline{X} \cdot Y$ is an intermediate term of $X \oplus Y$. $\overline{X \oplus Y} \cdot Z_{in}$ is also a byproduct of $X \oplus Y \oplus Z_{in}$. The resulting gate-sharing scheme not only implements parallel processing but also reduces the hardware consumption. The gate-level structures of the 1-bit full and half adder-subtractors are depicted in Fig. 4 (a) and (b), respectively. The proposed q-bit adder-subtractor, which is illustrated in Fig. 5, requires less than 57% of the hardware of the conventional one while achieving the same performance.
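The logic equations (9)-(12) can be checked with a bit-level model. The sketch below is a behavioral illustration only: it keeps separate carry and borrow chains, shares the term $X \oplus Y$ between the adder and the subtractor, and ripples q 1-bit cells to produce X+Y and X-Y in one pass. The function names and the ripple organization are assumptions and do not claim to match the gate-level structure of Fig. 4 and Fig. 5.

```python
def full_adder_subtractor(x, y, c_in, b_in):
    """1-bit cell producing sum/carry and difference/borrow from shared terms."""
    p = x ^ y                                   # shared term X xor Y
    s = p ^ c_in                                # Eq. (9)
    c_out = (x & y) | (p & c_in)                # Eq. (10)
    d = p ^ b_in                                # Eq. (11)
    b_out = ((1 - x) & y) | ((1 - p) & b_in)    # Eq. (12), with complemented terms
    return s, c_out, d, b_out

def add_sub(x, y, q):
    """q-bit ripple adder-subtractor: returns (x + y, x - y) mod 2**q
    together with the final carry-out and borrow-out."""
    total = diff = 0
    carry = borrow = 0
    for k in range(q):                          # least significant bit first
        xb, yb = (x >> k) & 1, (y >> k) & 1
        s, carry, d, borrow = full_adder_subtractor(xb, yb, carry, borrow)
        total |= s << k
        diff |= d << k
    return total, diff, carry, borrow

if __name__ == "__main__":
    q = 4
    ok = all(
        add_sub(x, y, q) == ((x + y) % 2**q, (x - y) % 2**q, (x + y) >> q, int(x < y))
        for x in range(2**q)
        for y in range(2**q)
    )
    print(ok)
```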
 
Figure 5: Proposed Type I PE architectures. 
B. Design of Type II PE 
Type II PE with the min-sum algorithm is shown in Fig. 6.
Figure 6:  Proposed architectures of Type II PE. Inputs: $\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})$ and $\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})$; output: $\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2})$. All signals are q bits wide; the blocks are labeled TtoS, sgn, mag, CMP, and StoT.
C. Design of Merged PEs 
Since the comparator in Type II PE is actually a q-bit 
subtractor, which is also employed by Type I PE, it is possible 
to merge Type I and Type II PEs together with the sub-
structure sharing scheme. The detailed structure is as follows: 
Figure 7:  Proposed structure of the Merged PE. Inputs: $\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})$ and $\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})$. Outputs: $\mathcal{L}_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1})$ for $\hat{u}_{2i-1} = 0$, $\mathcal{L}_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1})$ for $\hat{u}_{2i-1} = 1$, and $\mathcal{L}_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2})$.
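At the behavioral level, the merged PE can be summarized as a function that returns both Type I candidates of Eq. (4) together with the min-sum Type II output of Eq. (7), followed by a multiplexer that picks the valid Type I candidate once $\hat{u}_{2i-1}$ is known. The sketch below is an illustration under these assumptions; the argument names `a` (LLR of the $y_1^{N/2}$ branch) and `b` (LLR of the $y_{N/2+1}^{N}$ branch) and the function names are hypothetical, and the sign-magnitude conversions and sub-structure sharing of the actual circuit are not modeled.

```python
def merged_pe(a, b):
    """Return (Type I output for u=0, Type I output for u=1, Type II output)."""
    out_u0 = a + b                                  # Eq. (4) with u_{2i-1} = 0
    out_u1 = b - a                                  # Eq. (4) with u_{2i-1} = 1
    sign = -1 if (a < 0) != (b < 0) else 1
    out_type2 = sign * min(abs(a), abs(b))          # Eq. (7)
    return out_u0, out_u1, out_type2

def select_type1(out_u0, out_u1, u):
    """Multiplexer: choose the pre-computed candidate once u_{2i-1} is decided."""
    return out_u1 if u else out_u0

if __name__ == "__main__":
    out_u0, out_u1, out_type2 = merged_pe(3, -5)
    u = 0 if out_type2 >= 0 else 1                  # decision on the odd-indexed bit
    print(out_u0, out_u1, out_type2, select_type1(out_u0, out_u1, u))
```

Because both candidates are available in the same clock cycle as the Type II output, the Type I result needs only the multiplexer delay after the decision, which is what allows the two activations of a stage to be folded into one.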
D. Input Generating Circuit for Type I PEs 
As indicated in Eq. (4), in addition to $\mathcal{L}_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})$ and $\mathcal{L}_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})$, a third input $\hat{u}_{2i-1}$ is also required by the Type I PE. Moreover, for efficient execution of each Type I PE, the value of $\hat{u}_{2i-1}$ needs to be provided on-the-fly. However, even for the 8-bit decoder illustrated in Fig. 1, the complicated interleaving of odd and even indices makes the straightforward calculation of $\hat{u}_{2i-1}$ inconvenient. In order to solve this inherent problem, the input generating circuit (IGC) for Type I PEs is proposed in this section. Careful investigation has shown that it is possible to generate the required $\hat{u}_{2i-1}$ using a real FFT-like signal flow [4]. For instance, all $\hat{u}_{2i-1}$ for the 8-bit polar decoder can be generated with the in-place procedure in Fig. 8.
Figure 8: Flow graph of the proposed IGC. 
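The values that the IGC has to deliver can be modeled in software with XOR butterflies applied to blocks of already decided bits. The sketch below is a behavioral illustration only and assumes the partial-sum labels of Fig. 1; the function name is hypothetical, and the in-place, pipelined organization of the actual circuit is not modeled.

```python
def partial_sums(u_block):
    """XOR butterfly network over a block of decided bits (length a power of 2).

    For (u1, u2, u3, u4) it returns (u1^u2^u3^u4, u3^u4, u2^u4, u4).
    """
    if len(u_block) == 1:
        return list(u_block)
    odd, even = u_block[0::2], u_block[1::2]       # 1-based odd / even positions
    mixed = [o ^ e for o, e in zip(odd, even)]
    return partial_sums(mixed) + partial_sums(even)

if __name__ == "__main__":
    u_hat = [1, 1, 0, 1, 0, 1]                     # decisions u1 .. u6 of an 8-bit code
    print("Stage 1 inputs after u1..u4:", partial_sums(u_hat[0:4]))
    print("Stage 2 inputs after u1, u2:", partial_sums(u_hat[0:2]))
    print("Stage 2 inputs after u5, u6:", partial_sums(u_hat[4:6]))
```

In this model the Type I PEs of Stage 1 receive the four sums computed from $\hat{u}_1, \ldots, \hat{u}_4$, the Type I PEs of Stage 2 receive $(\hat{u}_1 \oplus \hat{u}_2, \hat{u}_2)$ or $(\hat{u}_5 \oplus \hat{u}_6, \hat{u}_6)$, and the Type I PE of Stage 3 receives $\hat{u}_{2i-1}$ directly.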
 
The pipelined architecture of the flow graph is illustrated in Fig. 9, where $U_i$ denotes the unit that consists of i stage(s).
Figure 9: Pipelined architecture for the flow graph in Fig. 8. 
 
In general, for an N-bit decoder, since the data structures of the IGC are defined recursively for powers of 2, the pipelined architecture can be constructed with a recurrence relationship. The recursion for the general case is shown in Fig. 10, where module $U_n$ is constructed from module $U_{n-1}$ and N/4 extra XOR-pass elements. For efficient design, memory banks are employed instead of flip-flops. Here, $n = \log_2 N - 1$. Control signal $c_n$ can be obtained by down-sampling $c_1$ by n.
Figure 10: Recursive construction of $U_n$ based on $U_{n-1}$ using RAMs.
E. Parallel Architecture of the Look-Ahead Decoder 
Using the blocks described above, the revised look-ahead decoder can be designed accordingly. Here we employ an 8-bit polar decoder as an example, shown in Fig. 11. However, although the hardware utilization of each active stage is 100%, the other stages still remain idle at the same time.
Figure 11:  Pipelined decoder of look-ahead polar codes with N = 8. 
 
Moreover, each codeword needs N-1 clock cycles to be decoded with the given approach, during which no new codeword can be processed. Therefore, a 2-parallel decoder is designed. Although one additional clock cycle is required, as shown in Table I, for large N it is negligible compared with the decoding latency. Also, twice the throughput can be achieved by the decoder.
TABLE I  NUMBER OF ACTIVE MERGED PES IN EACH CLOCK CYCLE
Input    Clock cycle
         1    2    3    4    5    6    7    8
C1       4    -    2    1    1    2    1    1
C2       -    4    2    1    1    2    1    1
V. COMPARISON OF LATENCY AND HARDWARE 
In this section, the proposed polar decoder is compared with the state-of-the-art reference. For the sake of fairness, both decoders have the same number of PEs. Since [3] does not provide details of the $\hat{u}_s$ computation block, which is the counterpart of the IGC, the comparison is conducted only for the remaining blocks.
TABLE II  COMPARISON OF DIFFERENT POLAR DECODERS
Different designs            Proposed design    Line design [3]
Hardware consumption (q-bit quantization)
# of Merged PEs              N/2                N/2
1 PE:    XOR                 9q                 11q-3
         REG                 0                  1
         MUX                 6q                 5q
# of IGCs                    2                  --
1 IGC:   XOR                 N/2-1              --
         RAM                 N/2-2              --
         MUX                 N/2-2              --
# of other REGs              q(9N/2+4)          q(N-1)
# of other MUXs              q(N+2)             3q(N/2-1)
Total:   XOR†                ~17qN/2            ~(19q-3)N/2
         REG                 ~9qN/2             ~(q+1/2)N
Decoding schedule
Latency                      N                  2(N-1)
Normalized throughput        2                  1
†MUX is converted to XOR with the standard proposed in [5].
Figure 12:  Parallel architectures for 8-bit polar decoder. 
 
According to Table II, the proposed design requires only half the latency of the reference while achieving twice the throughput, with a similar amount of hardware. The look-ahead approach is also applicable to other SC decoders.
VI. CONCLUSION 
A novel look-ahead SC decoding schedule for polar codes is proposed in this paper, which halves the decoding latency required by conventional approaches. For efficient hardware implementation, a merged PE and an IGC block are presented. Compared with its conventional counterpart, the parallel decoder example halves the decoding latency and doubles the throughput with similar hardware consumption.
REFERENCES 
[1] E. Arıkan, “Channel polarization: a method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” 
IEEE Trans. on Inf. Theory, vol. 55, no. 7, pp. 3051-3073, July 2009. 
[2] E. Arıkan, “A performance comparison of polar codes and Reed-Muller 
codes,” IEEE Commun. Lett., vol. 12, no. 6, pp. 447-449, June 2008. 
[3] C. Leroux, I. Tal, A. Vardy, and W. J. Gross, “Hardware architectures 
for successive cancellation decoding of polar codes,” in Proc. Int. Conf. 
Acoust., Speech, and Sig. Proc. (ICASSP), pp. 1665-1668, May 2011. 
[4] M. Garrido, K. K. Parhi, and J. Grajal, “A pipelined FFT architecture for 
real-valued signals,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 56, 
no. 12, pp. 2634-2643, Dec. 2009. 
[5] X. Zhang and F. Cai, “Efficient partial-parallel decoder 
architecture for quasi-cyclic nonbinary LDPC codes,” IEEE Trans. 
Circuits Syst. I: Reg. Papers, vol. 58, no. 2, pp. 402-414, Feb. 2011. 
