Partial Sums Computation In Polar Codes Decoding by Berhault, Guillaume et al.
Guillaume Berhault*, Camille Leroux, Christophe Jego, Dominique Dallet
Abstract—Polar codes are the first error-correcting codes to
provably achieve the channel capacity, but only with infinite
codelengths. For finite codelengths, the existing decoder architectures
are limited in working frequency by the partial sums computation
unit. We explain in this paper how the partial sums computation
can be seen as a matrix multiplication. An efficient hardware
implementation of this product, with reduced logic resources and
interconnections, is then investigated. Formalized architectures
to compute the partial sums and to generate the bits of the
generator matrix $\kappa^{\otimes n}$ are presented. The proposed
architecture removes the multiplexing resources used to assign
the required partial sums to each processing element.
Index Terms—FEC, polar codes, hardware architecture, suc-
cessive cancellation decoding
I. INTRODUCTION
POLAR codes [1] are a new class of error correction codes.
These linear block codes are proven to achieve the capacity
of any symmetric memoryless channel under successive
cancellation (SC) decoding [2]. Nevertheless, they require a
very large code length ($N = 2^n > 2^{20}$, [1]) in order to actually
approach the channel capacity. Consequently, the practical
interest of polar codes highly depends on the possibility of
designing efficient encoder and decoder architectures for large
codelengths.
When implemented in hardware ([3] and [4]), an SC decoder
is composed of three main units: the processing unit (PU), the
memory unit (MU) and the partial sums unit (PSU), as seen in
Fig. 1. The decoded bits, $\hat{u}_m$, are generated one after the other
by the PU, which needs (i) log-likelihood ratio (LLR) values
($\lambda$) stored in the MU, and (ii) partial sums ($S$) calculated in
the PSU. In SC decoding, the partial sums, which are used
to carry on the decoding, are combinations of the previously
decoded bits and are updated whenever a bit is decoded.
As shown in previous works [5] and [6], the hardware implementation
of SC decoders is constrained by the partial sums
computation unit, which occupies a major part of the area
and limits the maximum working frequency, especially as $N$
grows. In [7], a method to compute partial sums is proposed
but, to the best of our knowledge, it has not been implemented.
In [8], an efficient partial sum unit architecture was proposed
and experimentally validated. However, no formal description
of the concept was given. The purpose of this paper is
to bring some analytical contributions to [8]. The proposed
formalism could then be used with arbitrary kernels and
extended to the structure of [8].
The authors are with the IMS Research Lab., University of Bordeaux, IPB
ENSEIRB-MATMECA, 351 Cours de la Libération, 33405 Talence Cedex,
France (e-mail: firstname.lastname@ims-bordeaux.fr).
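[Figure: block diagram in which the PU receives the LLRs $\lambda$ from the MU and the partial sums $S$ from the PSU, and outputs the decoded bits $\hat{u}_m$.]
Fig. 1. Typical SC decoder structure.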
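[Figure: factor graph with decoded bits $\hat{u}_0$ to $\hat{u}_3$, LLRs $\lambda_{m,0}$ and $\lambda_{m,1}$ for $m = 0, \dots, 3$, and the labeled partial sums $S_{0,0} = \hat{u}_0$, $S_{2,0} = \hat{u}_2$, $S_{0,1} = \hat{u}_0 + \hat{u}_1$, $S_{1,1} = \hat{u}_1$, $S_{0,2} = \hat{u}_0 + \hat{u}_1 + \hat{u}_2 + \hat{u}_3$, $S_{1,2} = \hat{u}_1 + \hat{u}_3$, $S_{2,2} = \hat{u}_2 + \hat{u}_3$, $S_{3,2} = \hat{u}_3$.]
Fig. 2. Factor graph for N = 4 polar code.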
II. SUCCESSIVE CANCELLATION DECODING
For a code of length $N = 2^n$, the noisy version $Y$ of the
codeword $X$ is received after transmission over the channel.
Each sample $y_m$ is converted into log-likelihood
ratio (LLR) format. These LLRs are denoted $\lambda_m$, with
$0 \le m \le N - 1$. The decoder successively estimates every bit
$u_m$ based on the channel observation vector $\lambda_0^{N-1}$ and the
previously estimated bits $\hat{u}_0^{m-1}$. In order to estimate each bit
$u_m$, the decoder computes the following LLR value:
\[ \lambda_{m,0} = \log \frac{\Pr(y_0^{N-1}, \hat{u}_0^{m-1} \mid u_m = 0)}{\Pr(y_0^{N-1}, \hat{u}_0^{m-1} \mid u_m = 1)}. \tag{1} \]
The estimated bit $\hat{u}_m$ is calculated based on the following rule:
\[ \hat{u}_m = \begin{cases} 0 & \text{if } \lambda_{m,0} > 0 \\ 1 & \text{otherwise.} \end{cases} \tag{2} \]
As proposed by Arıkan in [1], the factor graph representation
of polar codes can be used to efficiently compute the $\lambda_{m,0}$. For
a code of length $N = 2^n$, the associated factor graph has $n$
columns and $N$ rows. SC decoding can be seen as an instance
of belief propagation decoding where LLRs are propagated on
the factor graph of the code with a particular scheduling. In SC
decoding, the bits $\hat{u}_m$ are processed sequentially and each decision
is then fed back into the graph for the decoding of subsequent
bits. In Fig. 3, the decoding on the factor graph of a simple
$N = 2$ polar code is represented. The graph is composed of
a check node (CN or ⊕) and a variable node (VN or =). In
general, the decoder successively estimates the bits $\hat{u}_m$ from
the computation of the LLRs of the indexed edges.
[Figure: (a) Decoding of $\hat{u}_0$: $\lambda_{0,0}$ is obtained from $\lambda_{0,1}$ and $\lambda_{1,1}$. (b) Decoding of $\hat{u}_1$: $\lambda_{1,0}$ is obtained from $\lambda_{0,1}$, $\lambda_{1,1}$ and $S_{0,0} = \hat{u}_0$.]
Fig. 3. N = 2 polar code decoding example.
arXiv:1310.1712v2 [cs.AR] 9 Jan 2015
The LLR of edge $(m, q)$ is computed as:
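\[ \lambda_{m,q} = \begin{cases} f(\lambda_{m,q+1}, \lambda_{m+2^q,q+1}) & \text{if } B(m,q) = 0 \\ g(\lambda_{m-2^q,q+1}, \lambda_{m,q+1}, S_{m-2^q,q}) & \text{if } B(m,q) = 1, \end{cases} \tag{3} \]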
with:
\[ \begin{cases} f(a, b) = \mathrm{sgn}(ab) \times \min(|a|, |b|) \\ g(a, b, s) = b + (-1)^s a, \end{cases} \tag{4} \]
where $B(m, q) \equiv \lfloor m / 2^q \rfloor \bmod 2$, $0 \le m < N$ and $0 \le q < n$.
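Equations (2) to (4) can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `decide` and `decode_n2` are hypothetical, and `g` is assumed to use ordinary real addition, $b + (-1)^s a$, which is the usual form of this update on LLRs.

```python
import math

def f(a: float, b: float) -> float:
    """Check-node update of equation (4): sgn(a*b) * min(|a|, |b|)."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return sign * min(abs(a), abs(b))

def g(a: float, b: float, s: int) -> float:
    """Variable-node update, assumed here to be b + (-1)^s * a (real addition)."""
    return b + (1 - 2 * s) * a

def B(m: int, q: int) -> int:
    """Selector of equation (3): floor(m / 2^q) mod 2."""
    return (m >> q) & 1

def decide(llr: float) -> int:
    """Hard decision of equation (2)."""
    return 0 if llr > 0 else 1

def decode_n2(lam_01: float, lam_11: float) -> tuple[int, int]:
    """Decode the N = 2 code of Fig. 3 from its two channel LLRs."""
    u0 = decide(f(lam_01, lam_11))        # B(0, 0) = 0, so f is used
    u1 = decide(g(lam_01, lam_11, u0))    # B(1, 0) = 1, so g is used with S_{0,0} = u0
    return u0, u1
```

For instance, `decode_n2(-2.0, 3.0)` first obtains $\lambda_{0,0} = -2$, hence $\hat{u}_0 = 1$, and then feeds this decision back as $S_{0,0}$ to compute $\lambda_{1,0}$.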
$S_{m,q}$ represents the partial sum located at the $m$th row and $q$th
column of the factor graph. It corresponds to the propagation
of decisions back into the factor graph. The partial sum set is
denoted
\[ \mathcal{S} = \{ S_{m,q} \mid m \in \llbracket 0; N-1 \rrbracket,\ q \in \llbracket 0; n \rrbracket \}. \]
Not all the elements of the partial sum set $\mathcal{S}$ are used during
SC decoding: only those such that $B(m, q) = 0$ are. For
example, $S_{2,1} = \hat{u}_2 + \hat{u}_3$ is updated twice, when $\hat{u}_2$ and
$\hat{u}_3$ are generated by the PU.
III. FROM MATRIX PRODUCT TO REGISTER-BASED
ARCHITECTURE
A. Matrix product representation
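We now wish to prove that the set composed of the bits of
$P_n(t)$, for $0 \le t < N$, contains all the elements of the set
$\mathcal{S}$. Here $P_n(t) = \hat{U}_n(t) \times \kappa^{\otimes n}$, where $\hat{U}_n(t)$ is the
$N$-bit vector holding the bits $\hat{u}_0, \dots, \hat{u}_t$ decoded up to time $t$,
padded with zeros, and the product is taken modulo 2.
Let us define the proposition $Q_n$: "All the partial sums
of the factor graph are included in the set that contains the
values of the vector $P_n(t)$, for $N = 2^n$ and $0 \le t < N$", for
all $n \in \mathbb{N}^*$.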
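Let us verify that $Q_1$ is true.
• When $t = 0$:
\[ P_1(0) = \hat{U}_1(0) \times \kappa^{\otimes 1} = [\hat{u}_0, 0] \times \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = [\hat{u}_0, 0] = [p_0(0), p_1(0)]. \]
One can notice that $p_0(0) = \hat{u}_0 = S_{0,0}$, as seen in Fig. 2.
• When $t = 1$:
\[ P_1(1) = \hat{U}_1(1) \times \kappa^{\otimes 1} = [\hat{u}_0, \hat{u}_1] \times \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = [\hat{u}_0 \oplus \hat{u}_1, \hat{u}_1] = [p_0(1), p_1(1)]. \]
One can notice that $p_0(1) = \hat{u}_0 \oplus \hat{u}_1 = S_{0,1}$ and
$p_1(1) = \hat{u}_1 = S_{1,0} = S_{1,1}$, as seen in Fig. 2.
The computations of $P_n(t)$ for $t = 0$ and $t = 1$ generate all
the partial sums required to decode a code of size $n = 1$.
Therefore $Q_1$ is true.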
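The base case can be reproduced numerically. The following Python sketch is an illustration only: the helper names `kron_power` and `partial_sum_vector` are hypothetical, and $\hat{U}_n(t)$ is assumed to be the decoded bits up to time $t$ padded with zeros, as in the two products above.

```python
def kron_power(n: int) -> list[list[int]]:
    """Return the n-th Kronecker power of [[1, 0], [1, 1]] as a 0/1 matrix."""
    mat = [[1]]
    for _ in range(n):
        size = len(mat)
        top = [row + [0] * size for row in mat]   # block row [K 0]
        bottom = [row + row for row in mat]       # block row [K K]
        mat = top + bottom
    return mat

def partial_sum_vector(u_hat: list[int], t: int, n: int) -> list[int]:
    """P_n(t): bits u_0..u_t, zero-padded to N bits, times kron_power(n) over GF(2)."""
    N = 1 << n
    K = kron_power(n)
    padded = [u_hat[l] if l <= t else 0 for l in range(N)]
    return [sum(padded[l] * K[l][j] for l in range(N)) % 2 for j in range(N)]
```

With `u_hat = [0, 1]` and `n = 1`, the call `partial_sum_vector(u_hat, 1, 1)` returns the pair $[\hat{u}_0 \oplus \hat{u}_1, \hat{u}_1]$ computed for $t = 1$.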
Assuming that $Q_n$ is true for some $n \in \mathbb{N}^*$, let us show that $Q_{n+1}$
is true as well. Let us define two $N$-bit vectors:
• $\hat{V}_n(t) = \hat{u}_0^t$ for $0 \le t < N$,
• $\hat{W}_n(t) = \hat{u}_N^t$ for $N \le t < 2N$,
such that $\hat{U}_{n+1}(t) = [\hat{V}_n(t), \hat{W}_n(t)]$. During the decoding of the
first $N$ bits, $\hat{U}_{n+1}(t)$ is equivalent to the concatenation of the two
$N$-bit vectors $\hat{V}_n(t)$ and $0_N$, such that $\hat{U}_{n+1}(t) = [\hat{V}_n(t), 0_N]$.
[Figure: a $\kappa^{\otimes n}$ generation unit supplies the coefficients $c_{t,0}$ to $c_{t,3}$; each register $R_j$ holds $p_j(t-1)$ and is updated to $p_j(t)$ using $\hat{u}_t$ and $c_{t,j}$. Register contents for $t = 0, \dots, 3$ (X = not yet used): $R_0$: $S_{0,0}$, $S_{0,1}$, $p_0(2)$, $S_{0,2}$; $R_1$: X, $S_{1,0} = S_{1,1}$, $p_1(2)$, $S_{1,2}$; $R_2$: X, X, $S_{2,0}$, $S_{2,1} = S_{2,2}$; $R_3$: X, X, X, $S_{3,0} = S_{3,1} = S_{3,2}$.]
Fig. 4. Register-based architecture for partial sums computation (N = 4).
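The matrix multiplication between $\hat{U}_{n+1}(t)$ and
$\kappa^{\otimes(n+1)} = \begin{bmatrix} \kappa^{\otimes n} & 0 \\ \kappa^{\otimes n} & \kappa^{\otimes n} \end{bmatrix}$, for $t < N$, becomes:
\[ P_{n+1}(t) = \hat{U}_{n+1}(t) \times \kappa^{\otimes(n+1)} = [\hat{V}_n(t) \times \kappa^{\otimes n}, 0_N]. \tag{5} \]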
Since $Q_n$ is assumed true, all partial sums of the first $N$ rows
and first $n$ columns of the factor graph are located in the $N$
leftmost bits of $P_{n+1}(t)$ ($0 \le t < N$): $\hat{V}_n(t) \times \kappa^{\otimes n}$.
Similarly, during the decoding of the last $N$ bits, $\hat{U}_{n+1}(t)$ is
equivalent to the concatenation of the two $N$-bit vectors $\hat{V}_n =
\hat{V}_0^{N-1}$ and $\hat{W}_n(t)$, such that $\hat{U}_{n+1}(t) = [\hat{V}_n, \hat{W}_n(t)]$. The
matrix product between $\hat{U}_{n+1}(t)$ and $\kappa^{\otimes(n+1)}$, for $t \ge N$,
becomes:
\[ P_{n+1}(t) = [(\hat{V}_n \oplus \hat{W}_n(t)) \times \kappa^{\otimes n}, \hat{W}_n(t) \times \kappa^{\otimes n}]. \tag{6} \]
Since $Q_n$ holds, every partial sum of the last $N$ rows and first $n$
columns of the factor graph is located in the $N$ rightmost
bits of the resulting vector ($N \le t < 2N$): $\hat{W}_n(t) \times \kappa^{\otimes n}$.
Finally, when $t = 2N - 1$, the resulting vector of the product
contains the partial sums of the last column of the factor graph.
Therefore, $Q_{n+1}$ is true. As a consequence, every partial sum
of the factor graph of a code of size $n$ is generated by
computing $P_n(t)$, for $0 \le t < N$.
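The block identities (5) and (6) used in the induction step can be checked numerically. The sketch below is only a sanity check on random bits, not a proof; the helper names (`kron_power`, `vec_mat`, `check_induction_step`) are hypothetical.

```python
import random

def kron_power(n: int) -> list[list[int]]:
    """n-th Kronecker power of [[1, 0], [1, 1]], built block by block."""
    mat = [[1]]
    for _ in range(n):
        size = len(mat)
        mat = [row + [0] * size for row in mat] + [row + row for row in mat]
    return mat

def vec_mat(v: list[int], K: list[list[int]]) -> list[int]:
    """Row vector times matrix over GF(2)."""
    return [sum(v[l] * K[l][j] for l in range(len(v))) % 2 for j in range(len(v))]

def check_induction_step(n: int, seed: int = 0) -> bool:
    """Compare P_{n+1}(t) with the right-hand sides of (5) and (6) for every t."""
    rng = random.Random(seed)
    N = 1 << n
    K_n, K_n1 = kron_power(n), kron_power(n + 1)
    u = [rng.randint(0, 1) for _ in range(2 * N)]
    for t in range(2 * N):
        U = [u[l] if l <= t else 0 for l in range(2 * N)]
        V, W = U[:N], U[N:]
        lhs = vec_mat(U, K_n1)
        if t < N:
            rhs = vec_mat(V, K_n) + [0] * N                 # equation (5)
        else:
            v_xor_w = [a ^ b for a, b in zip(V, W)]
            rhs = vec_mat(v_xor_w, K_n) + vec_mat(W, K_n)   # equation (6)
        if lhs != rhs:
            return False
    return True
```

The check relies on the block structure of $\kappa^{\otimes(n+1)}$: the left half of the product mixes both halves of $\hat{U}_{n+1}(t)$, while the right half only depends on $\hat{W}_n(t)$.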
B. Register-based structure
$P_n(t)$ is composed of $N$ bits $p_j(t)$, for $0 \le j < N$. Each
bit is the result of a matrix multiplication and can be rewritten
as
\[ p_j(t) = \sum_{l=0}^{t} (\hat{u}_l \times c_{l,j}) \pmod 2 \quad \forall (j, t) \in \llbracket 0; N-1 \rrbracket^2, \]
where $c_{i,j}$ are the elements of the matrix $\kappa^{\otimes n}$. This sum can
be split into two finite sums: the first one for $l \in \llbracket 0; t-1 \rrbracket$
and the second one for $l = t$, $l$ being the summation index of
$p_j(t)$. Therefore, the previous equation can be rewritten as:
\[ p_j(t) = p_j(t-1) \oplus \hat{u}_t \times c_{t,j}. \tag{7} \]
Equation (7) is a recurrence relation which can be implemented
by the register-based structure shown in Fig. 4 for $N = 4$.
Since $P_n(t)$ is an $N$-bit vector, an $N$-bit register is required
to store $p_j(t)$, for $0 \le t < N$, along with $N$ XOR gates and $N$
AND gates. Every $p_j(t)$, for $0 \le t < N$, is stored in
the $j$th D flip-flop (DFF), $R_j$. One can notice that the partial sums of the
$j$th row of the graph are computed by $p_j(t)$. Therefore, the
partial sums $S_{m,q}$ located on the $m$th row of the graph are
successively stored in $R_m$.
[Figure: factor graph for N = 8 with its nodes assigned to the processing elements PE(0,0), PE(0,1), PE(1,1), PE(0,2), PE(1,2), PE(2,2) and PE(3,2), and the required partial sums $S_{0,0}$, $S_{2,0}$, $S_{4,0}$, $S_{6,0}$, $S_{0,1}$, $S_{1,1}$, $S_{4,1}$, $S_{5,1}$, $S_{0,2}$, $S_{1,2}$, $S_{2,2}$, $S_{3,2}$.]
Fig. 5. Factor graph with identification to PE and the required partial sums
(N = 8).
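The recurrence (7) can be simulated in software. This is a behavioural sketch under stated assumptions, not the hardware description of the paper: the function names are hypothetical and the registers are modelled as a Python list updated once per decoded bit.

```python
def kron_power(n: int) -> list[list[int]]:
    """n-th Kronecker power of [[1, 0], [1, 1]]; its entries are the c_{i,j}."""
    mat = [[1]]
    for _ in range(n):
        size = len(mat)
        mat = [row + [0] * size for row in mat] + [row + row for row in mat]
    return mat

def register_psu(u_hat: list[int], n: int) -> list[list[int]]:
    """Return the register contents [p_0(t), ..., p_{N-1}(t)] for t = 0..N-1."""
    N = 1 << n
    K = kron_power(n)
    regs = [0] * N                      # all DFFs start cleared
    history = []
    for t in range(N):
        # Equation (7): one AND (u_t with c_{t,j}) and one XOR per register.
        regs = [regs[j] ^ (u_hat[t] & K[t][j]) for j in range(N)]
        history.append(regs[:])
    return history
```

For `u_hat = [1, 0, 1, 1]` and `n = 2`, the last entry of the history is `[1, 1, 0, 1]`, which corresponds to $S_{0,2}$, $S_{1,2}$, $S_{2,2}$ and $S_{3,2}$ of Fig. 2.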
IV. SHIFT-REGISTER BASED STRUCTURE
In a tree SC decoder [9], a processing element (PE) can be assigned
to the processing of one or more nodes in the graph. A PE is
identified as PE$(x, y)$ such that $0 \le y \le n - 1$ and
$0 \le x \le 2^y - 1$. For instance, in Fig. 5, the partial sums
$\{S_{0,0}, S_{2,0}, S_{4,0}, S_{6,0}\}$ are assigned to PE(0, 0). Moreover, in
the register-based architecture given in Fig. 4, the partial sum
$S_{m,q}$ is stored in the DFF $R_m$. This means that a PE may be
connected to multiple DFFs. Complex multiplexing resources
are then necessary to select the partial sums for a given PE.
The main purpose of this section is to modify the PSU
architecture detailed in Fig. 4 so that all the partial sums
required by a given PE are located in the same DFF. Such
a structure avoids any kind of multiplexing between a
PE and the DFFs containing the required partial sums.
A. Partial sum location
The proposed structure is derived from the regular architecture
depicted in Fig. 4. Instead of updating the current DFF
value and storing it back in the same DFF, it is possible to update
this value and store it in the next DFF, as shown in Fig. 6 for
$N = 4$.
The shift of the $p_m(t)$ values produces the exact same result
as long as the coefficients of $\kappa^{\otimes n}$ are shifted accordingly. In
this section we consider that the $c_{i,j}$ bits are shifted as well in
order to compute the same partial sums; the shifted bits are denoted $c'_{i,j}$.
Note that the generation of $\kappa^{\otimes n}$ is further detailed in Section
V.
As shown in the previous section, without shift, the $m$th DFF
contains the values $p_m(t)$, and hence the partial sums $S_{m,q}$. In the
proposed architecture, due to the shift, $p_m(t)$ is not necessarily
located in the $m$th DFF, and thus neither is $S_{m,q}$. For example, in
Fig. 6, at time $t = 0$, $p_0(0)$ is in $R_0$. At time $t = 1$, $p_1(1)$
is in $R_0$ and $p_0(1)$ is in $R_1$. More generally, at time $t$, $p_m(t)$
is in $R_{t-m}$. This means that $S_{m,q}$ needs to be located, that
is to say, one needs to determine the time of availability, $\tau$,
such that $p_m(\tau) = S_{m,q}$. In Appendix A it is shown that
the partial sum $S_{m,q}$ is generated at time:
\[ \tau = \left( \left\lfloor \frac{m}{2^q} \right\rfloor + 1 \right) \cdot 2^q - 1. \tag{8} \]
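[Figure: a $\kappa^{\otimes n}$ generation unit supplies the shifted coefficients $c'_{t,0}$ to $c'_{t,3}$; each register passes its updated value to the next one. Register contents for $t = 0, \dots, 3$ (X = not yet used): $R_0$: $p_0(0) = S_{0,0}$, $p_1(1) = S_{1,0} = S_{1,1}$, $p_2(2) = S_{2,0}$, $p_3(3) = S_{3,0} = S_{3,1} = S_{3,2}$; $R_1$: X, $p_0(1) = S_{0,1}$, $p_1(2)$, $p_2(3) = S_{2,1} = S_{2,2}$; $R_2$: X, X, $p_0(2)$, $p_1(3) = S_{1,2}$; $R_3$: X, X, X, $p_0(3) = S_{0,2}$.]
Fig. 6. Shift-register-based architecture for partial sums computation (N = 4).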
In other words, at time $\tau$, the partial sum $S_{m,q}$ is located in
the DFF $R_{\tau - m}$.
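The shifted structure can also be simulated. The sketch below is a behavioural model under two assumptions that are not spelled out at this point of the text: the coefficient seen by register $R_r$ at time $t$ is taken to be $c_{t,t-r}$, and $c_{i,j}$ is evaluated with the bitwise closed form "$c_{i,j} = 1$ when the binary digits of $j$ are a subset of those of $i$", which is equivalent to the Kronecker power of this kernel. All function names are hypothetical.

```python
def coeff(i: int, j: int) -> int:
    """Entry c_{i,j} of the Kronecker power of [[1, 0], [1, 1]] (bitwise closed form)."""
    return 1 if (i & j) == j else 0

def tau(m: int, q: int) -> int:
    """Time of availability of S_{m,q}, equation (8)."""
    return ((m >> q) + 1) * (1 << q) - 1

def locate(m: int, q: int) -> tuple[int, int]:
    """Return (time, DFF index) at which S_{m,q} is available."""
    t = tau(m, q)
    return t, t - m

def shift_register_psu(u_hat: list[int], n: int) -> list[list[int]]:
    """Register contents after each decoded bit; R_r holds p_{t-r}(t) for r <= t."""
    N = 1 << n
    regs = [0] * N
    history = []
    for t in range(N):
        shifted = [0] + regs[:-1]       # R_r receives the previous value of R_{r-1}
        regs = [shifted[r] ^ (u_hat[t] & coeff(t, t - r)) if r <= t else 0
                for r in range(N)]
        history.append(regs[:])
    return history
```

For example, `locate(0, 2)` returns `(3, 3)`, which matches Fig. 6, where $p_0(3) = S_{0,2}$ sits in $R_3$ at $t = 3$.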
B. DFF-PE direct connection
It is now possible to know where and when any needed
partial sum is located. However, the set of partial sums
required by a given PE still has to be identified in order to
show that its elements are generated in the same DFF.
In Fig. 5, a PE$(x, y)$ requires all the partial sums of the form
$S_{x + k \cdot 2^{y+1},\, y}$, with $k$ chosen such that $0 \le x + k \cdot 2^{y+1} \le N - 1$.
For instance, PE(0, 1) requires $S_{0,1}$ ($k = 0$) and $S_{4,1}$ ($k = 1$).
In the shift-register-based structure, the partial sum $S_{m,q}$ is
located in the DFF $R_{\tau - m}$. This means that the partial
sums required by PE$(x, y)$ are located in $R_{\tau - (x + k 2^{y+1})}$. By
substituting the expression of $\tau$ from (8), one can show that the
DFFs required by PE$(x, y)$ are indexed by the expression
$2^y - 1 - (x \bmod 2^y)$. This index is independent of $k$. In other
words, the partial sums required by PE$(x, y)$ are all located in
the same DFF.
Moreover, since $0 \le y \le n - 1$, we have $1 \le 2^y \le N/2$. With
these considerations, the previous expression of the DFF index
ranges from 0 to $N/2 - 1$ ($0 \le 2^y - 1 - (x \bmod 2^y) \le N/2 - 1$).
As a consequence, the first $N/2$ DFFs are sufficient to store
all the required partial sums during the decoding of a code of
length $N$.
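The independence from $k$ can be checked exhaustively for small codes. The following Python sketch is an illustration only, with hypothetical function names; it enumerates the partial sums $S_{x + k \cdot 2^{y+1}, y}$ of each PE$(x, y)$ and collects the DFF indices $\tau - m$ given by equation (8).

```python
def tau(m: int, q: int) -> int:
    """Time of availability of S_{m,q}, equation (8)."""
    return ((m >> q) + 1) * (1 << q) - 1

def dff_indices(x: int, y: int, n: int) -> set[int]:
    """DFF indices tau - m over all partial sums S_{x + k*2^(y+1), y} of PE(x, y)."""
    N = 1 << n
    step = 1 << (y + 1)
    return {tau(m, y) - m for m in range(x, N, step)}

def check_all_pes(n: int) -> bool:
    """True if every PE(x, y) reads the single DFF 2^y - 1 - (x mod 2^y)."""
    for y in range(n):
        for x in range(1 << y):
            expected = (1 << y) - 1 - (x % (1 << y))
            if dff_indices(x, y, n) != {expected}:
                return False
    return True
```

For $N = 8$, `dff_indices(0, 1, 3)` evaluates the two partial sums $S_{0,1}$ and $S_{4,1}$ of PE(0, 1) and returns the single index `{1}`.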
The proposed architecture can easily be applied to line SC
decoders [9] by grouping the PEs which are assigned to the same
DFFs. The shift-register-based architecture may also
be employed for a semi-parallel SC decoder architecture by
adding multiplexing.
V. κ⊗n MATRIX GENERATION UNIT
The partial sums calculation for a code of length $N = 2^n$
requires the values of $\kappa^{\otimes(n-1)}$ twice, as seen in equations
(5) and (6) of Section III (written there for a code size of $2^{n+1}$
instead of $2^n$): once to calculate the partial sums of
the first half of the rows in the graph, and once for the
remaining partial sums. The generation of the bits of the rows
of $\kappa^{\otimes n}$ can be seen as a finite state machine with as many
states as there are rows to generate. Each state represents a
row of the matrix. Every row could be stored in a ROM, but
this architectural solution would become impractical for code
lengths reaching $2^{20}$ bits.
[Figure: chain of DFFs $R_0$ to $R_3$ producing $c_{i,0}$ to $c_{i,3}$; the successive outputs are (1, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 1), (1, 0, 0, 0), ...]
Fig. 7. $\kappa^{\otimes 2}$ matrix generation for a code of size N = 8.
Another approach is to compute the
value of the next state from the current state value. To apply
this proposition, a quick observation of the matrix $\kappa^{\otimes(n-1)}$ is
necessary. Two main properties can be highlighted:
\[ c_{i,0} = 1 \quad \forall\, i \in \llbracket 0; \tfrac{N}{2} - 1 \rrbracket \]
\[ c_{i,j} = c_{i-1,j-1} \oplus c_{i-1,j} \quad \forall (i, j) \in \llbracket 1; \tfrac{N}{2} - 1 \rrbracket^2 \]
The first property means that the first bit is always one, which
is immediate from the definition of the Kronecker power. Therefore,
this bit does not need to be recalculated when changing state.
The second property is the most important because it is
exploited to compute the next state from the current state
bit values.
The rows of $\kappa^{\otimes(n-1)}$ are generated one after the other.
Therefore, in $c_{i,j}$, the index $i$ represents the time $t$, while
$j$ corresponds to the index of the DFF in which the value is stored.
The second property is the construction equation and
can be rewritten as $M_j(t) = M_{j-1}(t-1) \oplus M_j(t-1)$. To
implement such an equation, an XOR logic gate and a DFF
are sufficient. The $N/2$ DFFs are connected one to the other as
seen in Fig. 7.
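The row generator described by the two properties can be sketched in Python as follows. This is a software model only; the names `kron_rows` and `first_rows` are hypothetical, and the state list plays the role of the DFF chain of Fig. 7.

```python
from itertools import islice

def kron_rows(width: int):
    """Yield successive rows using c_{i,0} = 1 and c_{i,j} = c_{i-1,j-1} xor c_{i-1,j}."""
    state = [1] + [0] * (width - 1)     # row 0 of the matrix
    while True:
        yield state[:]
        # Bit 0 is constant; every other bit XORs its left neighbour with itself.
        state = [1] + [state[j - 1] ^ state[j] for j in range(1, width)]

def first_rows(width: int, count: int) -> list[list[int]]:
    """Collect the first `count` generated rows."""
    return list(islice(kron_rows(width), count))
```

Calling `first_rows(4, 5)` gives the rows (1, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 1) and then (1, 0, 0, 0) again, so the generator restarts without any explicit reset, which is convenient since the matrix is needed twice per codeword.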
The shift-register-based structure used to compute the partial
sums requires that the bits of $\kappa^{\otimes n}$ be shifted accordingly.
One can verify that the diagonals and the columns of the matrix are equal.
Therefore, the proposed structure generates bits of $\kappa^{\otimes n}$
that can be used as they are in the shift-register-based
architecture.
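One possible reading of this remark, stated here as an assumption, is that the shifted coefficient needed by register $R_r$ at time $t$, namely $c'_{t,r} = c_{t,t-r}$, coincides with the unshifted coefficient $c_{t,r}$. The Python sketch below checks this equality exhaustively for a given $n$; the function names are hypothetical and $c_{i,j}$ is evaluated with the bitwise closed form ($c_{i,j} = 1$ when the binary digits of $j$ are a subset of those of $i$).

```python
def coeff(i: int, j: int) -> int:
    """Entry c_{i,j} of the Kronecker power of [[1, 0], [1, 1]] (bitwise closed form)."""
    return 1 if (i & j) == j else 0

def shifted_coeff(t: int, r: int) -> int:
    """Assumed shifted coefficient c'_{t,r} = c_{t,t-r}; zero for unused registers."""
    return coeff(t, t - r) if r <= t else 0

def shifted_equals_unshifted(n: int) -> bool:
    """True if c'_{t,r} equals c_{t,r} for every time t and register r."""
    N = 1 << n
    return all(shifted_coeff(t, r) == coeff(t, r)
               for t in range(N) for r in range(N))
```

Under this reading, the equality is the mod-2 counterpart of the symmetry of binomial coefficients, which is why no extra permutation logic would be needed between the generation unit and the shift register.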
VI. CONCLUSION AND PERSPECTIVES
Designing efficient hardware decoders for polar codes
could lead to their inclusion in future digital
telecommunication standards. State-of-the-art works propose
efficient successive cancellation decoder hardware designs
whose limiting element is the partial sum unit. This paper
contributes a formalization of the structure proposed in
[8], which reduces the hardware complexity. The shift-register-based
architecture can be extended to line SC decoders. It can
also be applied to semi-parallel architectures by adding more
multiplexing resources.
The proposed computation method opens the way for additional
work, such as the extension of this architecture to kernels
of higher dimension or the enhancement of parallelism in this structure.
APPENDIX A
TIME OF AVAILABILITY FOR A PARTIAL SUM
Any partial sum $S_{m,q}$ can be seen as an element of a
sub-codeword $SCW(m, q)$. This sub-codeword is the encoded
version of the sub-code $\hat{u}_a^b \in \hat{U}$. All the elements of $SCW(m, q)$
are valid whenever all bits of $\hat{u}_a^b$ are valid. Since the bits are
decoded sequentially, a partial sum $S_{m,q}$ is valid when the bit
$\hat{u}_b$ is available, that is to say when $\tau = b$. The main purpose
is then to find the expression of $b$. We already know $q$, thus
$a$ is the only remaining variable to find before getting the
expression of $b$. The following equality comes from the length
of $SCW(m, q)$, which is the same as the length of $\hat{u}_a^b$:
\[ 2^q = b - a + 1. \tag{9} \]
Since $a$ is the starting index, it is a multiple of $2^q$. The
following expression returns the value of $a$:
\[ a = \left\lfloor \frac{m}{2^q} \right\rfloor \cdot 2^q. \tag{10} \]
Now the only remaining variable is $b$. Using equations (9) and
(10), it follows:
\[ b = 2^q - 1 + \left\lfloor \frac{m}{2^q} \right\rfloor \cdot 2^q. \]
Finally, $S_{m,q}$ is only valid when $\tau = b$, that is to say:
\[ \tau = b = \left( \left\lfloor \frac{m}{2^q} \right\rfloor + 1 \right) \cdot 2^q - 1. \]
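The derivation can be cross-checked numerically. The Python sketch below computes $a$ and $b$ from (9) and (10), then compares two ways of listing the decoded bits that enter $S_{m,q}$: from the register value $p_m(\tau)$ and from the encoding of the sub-block $\hat{u}_a^b$. It assumes that the sub-codeword is encoded with the same kernel (a length-$2^q$ Kronecker power) and uses the bitwise closed form of $c_{i,j}$; all function names are hypothetical.

```python
def coeff(i: int, j: int) -> int:
    """Entry c_{i,j} of the Kronecker power of [[1, 0], [1, 1]] (bitwise closed form)."""
    return 1 if (i & j) == j else 0

def sub_block(m: int, q: int) -> tuple[int, int]:
    """Bounds (a, b) of the sub-code containing row m, equations (9) and (10)."""
    a = (m >> q) << q
    b = a + (1 << q) - 1
    return a, b

def support_from_register(m: int, q: int) -> set[int]:
    """Indices l of the bits XORed into p_m(tau), with tau = b."""
    _, b = sub_block(m, q)
    return {l for l in range(b + 1) if coeff(l, m)}

def support_from_sub_codeword(m: int, q: int) -> set[int]:
    """Indices of the bits u_a..u_b XORed into position m - a of the sub-codeword."""
    a, b = sub_block(m, q)
    return {a + l for l in range(b - a + 1) if coeff(l, m - a)}
```

As an example, `sub_block(5, 1)` returns `(4, 5)`, so $S_{5,1}$ becomes available at $\tau = 5$, and `support_from_register(1, 2)` returns `{1, 3}`, in agreement with $S_{1,2} = \hat{u}_1 + \hat{u}_3$ in Fig. 2.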
REFERENCES
[1] E. Arikan, “Channel polarization: A method for constructing
capacity-achieving codes,” in Information Theory, 2008. ISIT 2008. IEEE
International Symposium on, Jul. 2008, pp. 1173–1177.
[2] E. Sasoglu, E. Telatar, and E. Arikan, “Polarization for arbitrary discrete
memoryless channels,” arXiv:0908.0302, Aug. 2009.
[3] A. J. Raymond and W. J. Gross, “Scalable successive-cancellation hard-
ware decoder for polar codes,” arXiv e-print 1306.3529, Jun. 2013.
[4] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast polar
decoders: Algorithm and implementation,” arXiv e-print 1307.7154, Jul.
2013, Submitted to the IEEE Journal on Selected Areas in Communica-
tions (JSAC) on May 15th, 2013.
[5] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” Signal Processing, IEEE
Transactions on, vol. PP, no. 99, pp. 289–299, 2012.
[6] A. Mishra, A. J. Raymond, L. G. Amaru, G. Sarkis, C. Leroux, P. Mein-
erzhagen, A. Burg, and W. Gross, “A successive cancellation decoder
ASIC for a 1024-bit polar code in 180nm CMOS,” in Asian Solid-State
Circuits Conference, Nov. 2012.
[7] C. Zhang and K. Parhi, “Low-latency sequential and overlapped archi-
tectures for successive cancellation polar decoder,” IEEE Transactions on
Signal Processing, pp. 1–1, 2013.
[8] G. Berhault, C. Leroux, C. Jego, and D. Dallet, “Partial sums generation
architecture for successive cancellation decoding of polar codes,” accepted
SIPS, IEEE Workshop on Signal Processing Systems, Oct. 2013.
[9] C. Leroux, I. Tal, A. Vardy, and W. Gross, “Hardware architectures
for successive cancellation decoding of polar codes,” in 2011 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), May 2011, pp. 1665–1668.
