Polar codes are the first error-correcting codes to provably achieve the channel capacity, but only with infinite codelengths. For finite codelengths, the existing decoder architectures are limited in working frequency by the partial sums computation unit. We explain in this paper how the partial sums computation can be seen as a matrix multiplication. Then, an efficient hardware implementation of this product is investigated. It has reduced logic resources and interconnections. Formalized architectures, to compute partial sums and to generate the bits of the generator matrix $\kappa^{\otimes n}$, are presented. The proposed architecture removes the multiplexing resources otherwise used to assign the required partial sums to each processing element.
I. INTRODUCTION
POLAR codes [1] are a new class of error correction codes. These linear block codes are proven to achieve the capacity of any symmetric memoryless channel under successive cancellation (SC) decoding [2]. Nevertheless, they require a very large code length ($N = 2^n > 2^{20}$, [1]) in order to actually approach the channel capacity. Consequently, the practical interest of polar codes highly depends on the possibility to design efficient encoder and decoder architectures for large codelengths. When implemented in hardware ([3] and [4]), an SC decoder is composed of three main units: the processing unit (PU), the memory unit (MU) and the partial sums unit (PSU), as seen in Fig. 1. The decoded bits, $\hat{u}_m$, are generated one after the other by the PU, which needs (i) log likelihood ratio (LLR) values ($\lambda$) stored in the MU, and (ii) partial sums ($S$) calculated in the PSU. In SC decoding, the partial sums, which are used to carry on the decoding, are a combination of the previously decoded bits and are updated whenever a bit is decoded. As shown in previous works [5] and [6], the hardware implementation of SC decoders is constrained by the partial sums computation unit, which occupies a major part of the area and limits the maximum working frequency, especially as $N$ grows. In [7], a method to compute partial sums is proposed but, to the best of our knowledge, it has not been implemented. In [8], an efficient partial sum unit architecture was proposed and experimentally validated. However, no formal description of the concept has been given. The purpose of this paper is to bring some analytical contributions to [8]. The proposed formalism could then be used with arbitrary kernels and extended to the structure of [8].
II. SUCCESSIVE CANCELLATION DECODING
For a code of length $N = 2^n$, after being sent over the transmission channel, the noisy version $Y$ of the codeword $X$ is received. Each sample $y_m$ is converted into log likelihood ratio (LLR) format. These LLRs are denoted $\lambda_m$, with $0 \le m \le N - 1$. The decoder successively estimates every bit $u_m$ based on the channel observation vector ($\lambda_0^{N-1}$) and the previously estimated bits ($\hat{u}_0^{m-1}$). In order to estimate each bit $u_m$, the decoder computes the following LLR value:
.
The estimated bit $\hat{u}_m$ is calculated based on the following rule:
As proposed by Arıkan in [1], the factor graph representation of polar codes can be used to efficiently compute the $\lambda_{m,0}$. For a code of length $N = 2^n$, the associated factor graph has $n$ columns and $N$ rows. SC decoding can be seen as an instance of belief propagation decoding where LLRs are propagated on the factor graph of the code with a particular scheduling. In SC decoding, bits $\hat{u}_m$ are processed sequentially and each decision is then fed back into the graph for the decoding of subsequent bits. In Fig. 3, the decoding on the factor graph of a simple $N = 2$ polar code is represented. The graph is composed of a check node (CN or ⊕) and a variable node (VN or =). In
where $B(m, q) \equiv \lfloor m / 2^q \rfloor \bmod 2$, $0 \le m < N$ and $0 \le q < n$. $S_{m,q}$ represents the partial sum located at the $m$-th row and $q$-th column of the factor graph. It corresponds to the propagation of decisions back into the factor graph. The partial sum set is denoted
The elements of the partial sum set $S$ are not all used during SC decoding; only those such that $B(m, q) = 0$ are. For example, $S_{2,1} = \hat{u}_2 + \hat{u}_3$ is updated twice, when $\hat{u}_2$ and $\hat{u}_3$ are generated by the PU.
III. FROM MATRIX PRODUCT TO REGISTER-BASED ARCHITECTURE

A. Matrix product representation
We now wish to prove that the set composed of the bits of $P_n(t)$, for $0 \le t < N$, contains all the elements of the set $S$. Let us define the proposition $Q_n$: "All the partial sums of the factor graph are included in the set that contains the values of the vector $P_n(t)$, for $N = 2^n$ and $0 \le t < N$", for all $n \in \mathbb{N}^*$. Let us verify that $Q_1$ is true.
One can notice that $p_0(0) = \hat{u}_0 = S_{0,0}$, as seen in Fig. 2.
One can notice that $p_0(1) = \hat{u}_0 \oplus \hat{u}_1 = S_{1,0} = S_{1,1}$ and $p_1(1) = \hat{u}_1 = S_{0,1}$, as seen in Fig. 2. The computations of $P_n(t)$ for $t = 0$ and $t = 1$ generate all the required partial sums to decode a code of size $n = 1$. Therefore $Q_1$ is true. Assuming that, for $n \in \mathbb{N}^*$, $Q_n$ is true, let us show that $Q_{n+1}$ is true as well. Let us define two $N$-bit vectors:
During the decoding of the first $N$ bits, $\hat{U}_{n+1}(t)$ is equivalent to the concatenation of two $N$-bit vectors, $\hat{V}_n(t)$ and $0_N$, such that $\hat{U}_{n+1}(t) = [\hat{V}_n(t) \;\; 0_N]$. The matrix multiplication between $\hat{U}_{n+1}(t)$ and $\kappa^{\otimes(n+1)}$, for $t < N$, becomes:
Since $Q_n$ is assumed true, all partial sums of the first $N$ rows and first $n$ columns of the factor graph are located in the $N$ leftmost bits of the resulting vector ($0 \le t < N$): $\hat{V}_n(t) \times \kappa^{\otimes n}$.
Since $Q_n$ holds, every partial sum of the last $N$ rows and first $n$ columns of the factor graph is located in the $N$ rightmost bits of the resulting vector ($N \le t < 2N$): $\hat{W}_n(t) \times \kappa^{\otimes n}$. Finally, when $t = 2N - 1$, the resulting vector of the product contains the partial sums of the last column of the factor graph. Therefore, $Q_{n+1}$ is true. As a consequence, all the partial sums of the factor graph of a code of size $n$ are generated by computing $P_n(t)$, for $0 \le t < N$.
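As a hedged illustration of the matrix-product view, the sketch below assumes the kernel $\kappa = [[1, 0], [1, 1]]$, which is consistent with the $N = 2$ values quoted above, and takes the input vector at time $t$ to be the decoded bits $\hat{u}_0, \dots, \hat{u}_t$ padded with zeros. The helper names `kron_power` and `partial_sum_vector` are illustrative and do not come from the paper.

```python
def kron_power(n):
    """n-th Kronecker power of the assumed kernel [[1, 0], [1, 1]] over GF(2)."""
    kernel = [[1, 0], [1, 1]]
    mat = [[1]]
    for _ in range(n):
        # Kronecker product mat (x) kernel: each entry of mat scales a copy of the kernel.
        mat = [[a & b for a in row_m for b in row_k]
               for row_m in mat for row_k in kernel]
    return mat


def partial_sum_vector(u_hat, t, n):
    """P_n(t): (u_0..u_t, 0, ..., 0) times the n-th Kronecker power, modulo 2."""
    N = 1 << n
    g = kron_power(n)
    # Bits that are not decoded yet at time t are assumed to contribute zero.
    padded = [u_hat[l] if l <= t else 0 for l in range(N)]
    return [sum(padded[l] & g[l][j] for l in range(N)) % 2 for j in range(N)]


if __name__ == "__main__":
    u_hat = [1, 0, 1, 1]  # arbitrary decoded bits, for illustration only
    for t in range(4):
        print(t, partial_sum_vector(u_hat, t, 2))
```

For $N = 2$ this reproduces $p_0(0) = \hat{u}_0$, $p_0(1) = \hat{u}_0 \oplus \hat{u}_1$ and $p_1(1) = \hat{u}_1$.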
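B. Register-based structure
$P_n(t)$ is composed of $N$ bits $p_j(t)$, for $0 \le j < N$. Each bit is the result of a matrix multiplication and can be rewritten as
$$p_j(t) = \sum_{l=0}^{t} \hat{u}_l \cdot c_{l,j},$$
where the sum is taken modulo 2 and the $c_{l,j}$ are the elements of the matrix $\kappa^{\otimes n}$. This sum can be split into two finite sums: the first one for $l \in [0; t-1]$, and the second one for $l = t$, $l$ being the index of the sum of $p_j(t)$. Therefore, the previous equation can be rewritten as:
$$p_j(t) = p_j(t-1) \oplus (\hat{u}_t \cdot c_{t,j}) \quad (7)$$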
Equation (7) is a recurrent series which can be implemented by the register-based structure shown in Fig. 4 for N = 4.
Since $P_n(t)$ is an $N$-bit vector, an $N$-bit register is required to store $p_j(t)$, for $0 \le t < N$, along with $N$ XOR and $N$ AND elements. Every $p_j(t)$, for $0 \le t < N$, is stored in the $j$-th DFF, $R_j$. One can notice that the partial sums of the $j$-th row of the graph are computed by $p_j(t)$. Therefore, the partial sums $S_{m,q}$ located on the $m$-th row of the graph are successively stored in $R_m$.
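The register-based structure can be mimicked in software as a rough sketch: one list entry per DFF, updated once per decoded bit with one AND and one XOR. The code assumes the kernel $[[1, 0], [1, 1]]$, for which the matrix entry $c_{i,j}$ equals 1 exactly when every set bit of $j$ is also set in $i$. The names `kron_entry` and `run_register_psu` are hypothetical.

```python
def kron_entry(i, j):
    """Entry c_{i,j} of the Kronecker power of the assumed kernel [[1, 0], [1, 1]].

    The entry is 1 exactly when every set bit of j is also set in i.
    """
    return 1 if (j & ~i) == 0 else 0


def run_register_psu(u_hat):
    """Feed one decoded bit per clock cycle; return the register contents after each cycle."""
    N = len(u_hat)
    regs = [0] * N  # R_0 .. R_{N-1}, assumed cleared before decoding starts
    history = []
    for t, u_t in enumerate(u_hat):
        # One AND and one XOR per register: p_j(t) = p_j(t-1) XOR (u_t AND c_{t,j})
        regs = [regs[j] ^ (u_t & kron_entry(t, j)) for j in range(N)]
        history.append(list(regs))
    return history


if __name__ == "__main__":
    for t, state in enumerate(run_register_psu([1, 0, 1, 1])):
        print(t, state)
```

After the last bit, the register contents equal the full product of the decoded vector with the matrix, as the recurrence implies.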
IV. SHIFT-REGISTER BASED STRUCTURE
In a tree SC decoder [9], a PE can be assigned to the processing of one or more nodes in the graph. A PE is identified as PE($x$, $y$) such that $0 \le y \le n - 1$ and $0 \le x \le 2^y - 1$. For instance, in Fig. 5, the partial sums $\{S_{0,0}, S_{2,0}, S_{4,0}, S_{6,0}\}$ are assigned to PE(0, 0). Moreover, in the register-based architecture given in Fig. 4, the partial sum $S_{m,q}$ is stored in the DFF $R_m$. This means that a PE may be connected to multiple DFFs. Complex multiplexing resources are then necessary to select the partial sums for a given PE. The main purpose of this section is to modify the PSU architecture detailed in Fig. 4 so that all the partial sums required by a given PE are located in the same DFF. Such a structure would avoid any kind of multiplexing between a PE and the DFFs containing the required partial sums.
A. Partial sum location
The proposed structure is derived from the regular architecture depicted in Fig. 4. Instead of updating the current DFF value and storing it back in the same DFF, it is possible to update this value and store it in the next DFF, as shown in Fig. 6 for $N = 4$. The shift of the $p_m(t)$ values produces the exact same result as long as the coefficients of $\kappa^{\otimes n}$ are shifted accordingly. In this section we consider that the $c_{i,j}$ bits are shifted as well, in order to compute the same partial sums, and are denoted $c'_{i,j}$. Note that the generation of $\kappa^{\otimes n}$ is further detailed in section V. As shown in the previous section, without shift, the $m$-th DFF contains the values $p_m(t)$, hence the partial sum $S_{m,q}$. In the proposed architecture, due to the shift, $p_m(t)$ is not necessarily located in the $m$-th DFF, thus neither is $S_{m,q}$. For example, in Fig. 6, at time $t = 0$, $p_0(0)$ is in $R_0$. At time $t = 1$, $p_1(1)$ is in $R_0$ and $p_0(1)$ is in $R_1$. More generally, at time $t$, $p_m(t)$ is in $R_{t-m}$. This means that $S_{m,q}$ needs to be located, that is to say, one needs to determine the time of availability, $\tau$, such that $p_m(\tau) = S_{m,q}$. In APPENDIX A it is shown that the partial sum $S_{m,q}$ is generated at time:
$$\tau = 2^q - 1 + \lfloor m / 2^q \rfloor \cdot 2^q$$
In other words, at time $\tau$, the partial sum $S_{m,q}$ is located in the DFF $R_{\tau - m}$.
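The short sketch below evaluates the time of availability using the expression derived in APPENDIX A, together with the DFF index $\tau - m$. It also recomputes $p_m(\tau)$ under the assumption that the kernel is $[[1, 0], [1, 1]]$, so that it can be compared with the example $S_{2,1} = \hat{u}_2 + \hat{u}_3$ from section II. All function names are illustrative.

```python
def availability_time(m, q):
    """tau = 2^q - 1 + floor(m / 2^q) * 2^q, the expression derived in Appendix A."""
    return (1 << q) - 1 + (m >> q) * (1 << q)


def dff_index(m, q):
    """Index of the DFF holding S_{m,q} at time tau in the shifted structure: tau - m."""
    return availability_time(m, q) - m


def p(m, t, u_hat):
    """p_m(t): XOR of u_l AND c_{l,m} for l = 0..t.

    Assumes the kernel [[1, 0], [1, 1]], for which c_{l,m} = 1 exactly when
    every set bit of m is also set in l.
    """
    acc = 0
    for l in range(t + 1):
        if (m & ~l) == 0:
            acc ^= u_hat[l]
    return acc


if __name__ == "__main__":
    u_hat = [1, 0, 1, 0]  # arbitrary decoded bits, for illustration only
    tau = availability_time(2, 1)
    print("tau =", tau, "DFF =", dff_index(2, 1))
    print("p_2(tau) =", p(2, tau, u_hat), "u_2 xor u_3 =", u_hat[2] ^ u_hat[3])
```

For $S_{2,1}$ this gives $\tau = 3$, the cycle in which $\hat{u}_3$ is decoded, and places the value in $R_1$.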
B. DFF-PE direct connection
It is now possible to know where and when any needed partial sum is located. However, the set of the partial sums that are required by a given PE has to be found in order to show that its elements are generated in the same DFF. In Fig. 5, a PE($x$, $y$) requires all the partial sums of the form $S_{x + k \cdot 2^{y+1},\, y}$, with $k$ chosen such that $0 \le x + k \cdot 2^{y+1} \le N - 1$. For instance, PE(0, 1) requires $S_{0,1}$ ($k = 0$) and $S_{4,1}$ ($k = 1$). In the shift-register-based structure, the partial sum $S_{m,q}$ is located in the DFF $R_{\tau - m}$. This means that the partial sums required by PE($x$, $y$) are located in $R_{\tau - (x + k \cdot 2^{y+1})}$. By replacing $\tau$ with its expression, one can show that the DFFs required by a PE($x$, $y$) are indexed by the expression $-(x \bmod 2^y) + 2^y - 1$. This index is independent of $k$. In other words, the partial sums required by PE($x$, $y$) are all located in the same DFF. Moreover, as $0 \le y \le n - 1$, we have $0 \le 2^y \le N/2$. With these considerations, the previous expression of the DFF index ranges from 0 to $N/2 - 1$ ($0 \le -(x \bmod 2^y) + 2^y - 1 \le N/2 - 1$). As a consequence, the first $N/2$ DFFs are sufficient to memorize all the required partial sums during the decoding of a code of length $N$. The proposed architecture can easily be applied to line SC decoders by grouping the PEs which are assigned to the same DFFs ([9]). The shift-register-based architecture may also be employed for a semi-parallel SC decoder architecture by adding multiplexing.
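The independence from $k$ can be checked numerically. The sketch below enumerates the partial sums required by each PE($x$, $y$), computes $\tau - m$ for each of them with the expression of $\tau$ from APPENDIX A, and collects the resulting DFF indices. The function names are hypothetical, and $n = 3$ is chosen only to match the $N = 8$ example of Fig. 5.

```python
def required_partial_sums(x, y, n):
    """Row indices m = x + k * 2^(y+1) of the partial sums S_{m,y} needed by PE(x, y)."""
    N = 1 << n
    return list(range(x, N, 1 << (y + 1)))


def dff_for_pe(x, y, n):
    """Set of DFF indices tau - m over all partial sums of PE(x, y)."""
    indices = set()
    for m in required_partial_sums(x, y, n):
        # Time of availability from Appendix A, with q = y.
        tau = (1 << y) - 1 + (m >> y) * (1 << y)
        indices.add(tau - m)
    return indices


if __name__ == "__main__":
    n = 3
    for y in range(n):
        for x in range(1 << y):
            print(f"PE({x}, {y}) -> rows {required_partial_sums(x, y, n)}, "
                  f"DFF indices {sorted(dff_for_pe(x, y, n))}")
```

Each PE maps to a set containing a single index, and no index exceeds $N/2 - 1$, in line with the claim that the first $N/2$ DFFs suffice.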
V. κ ⊗n MATRIX GENERATION UNIT
The partial sums calculations, for a code of length $N = 2^n$, require the values of $\kappa^{\otimes(n-1)}$ twice, as seen in equations (5) and (6) in section III, which are written for a code size of $2^{n+1}$ instead of $2^n$. The first time is to calculate the partial sums of the first half of the rows in the graph; the second time is for the remaining partial sums. The generation of the bits of the rows of $\kappa^{\otimes n}$ can be seen as a finite state machine with as many states as there are rows to generate. Each state represents a row of the matrix. Every row could be stored in a ROM, but this architectural solution would become impractical for code lengths reaching $2^{20}$ bits. Another approach is to compute the value of the future state using the current state value. To apply this proposition, a quick observation of the matrix $\kappa^{\otimes(n-1)}$ is necessary. Two main properties can be highlighted:
The first property means that the first bit is always one, which is immediate due to the Kronecker power definition. Therefore, this bit does not require recalculation when changing state. The second property is the most important because it is exploited to compute the future state from the current state bit values. The rows of $\kappa^{\otimes(n-1)}$ are generated one after the other. Therefore, in $c_{i,j}$, the index $i$ represents the time $t$, while $j$ corresponds to the index of the DFF in which the value is stored. The second property is the equation of the construction and can be rewritten as $M_j(t) = M_{j-1}(t-1) \oplus M_j(t-1)$. To implement such an equation, an AND logic gate and a DFF are sufficient. The $N/2$ DFFs are connected one to the other as seen in Fig. 7. The shift-register based structure, used to compute the partial sums, requires that the bits of $\kappa^{\otimes n}$ are shifted accordingly. One can verify that the diagonals and the columns are equal. Therefore, the proposed structure generates the bits of $\kappa^{\otimes n}$ that can be employed as they are for the shift-register based architecture.
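The state-update rule can be sketched as a software model of the row generator: each new row is obtained from the previous one with $M_j(t) = M_{j-1}(t-1) \oplus M_j(t-1)$, the first bit being held at one. The sketch starts from the row $(1, 0, \dots, 0)$, which is an assumption about the initial state, and generates a full $2^n \times 2^n$ matrix rather than the $N/2$-bit rows used by the hardware. The names `next_row` and `generate_rows` are illustrative.

```python
def next_row(row):
    """Next row from the current one: M_j(t) = M_{j-1}(t-1) XOR M_j(t-1).

    The first bit is kept as is (first property), so only j >= 1 is recomputed.
    """
    return [row[0]] + [row[j - 1] ^ row[j] for j in range(1, len(row))]


def generate_rows(n):
    """Rows 0 .. 2^n - 1, starting from the assumed initial state (1, 0, ..., 0)."""
    N = 1 << n
    row = [1] + [0] * (N - 1)
    rows = [row]
    for _ in range(N - 1):
        row = next_row(row)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate_rows(2):
        print(row)
```

For $n = 2$ the generated rows coincide with the second Kronecker power of the kernel $[[1, 0], [1, 1]]$, which is the kernel assumed here.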
VI. CONCLUSION AND PERSPECTIVES
Designing efficient hardware decoders for polar codes would favor their potential inclusion in future digital telecommunication standards. State-of-the-art works propose efficient successive cancellation decoder hardware designs whose limiting element is the partial sum unit. This paper contributes a formalization of the structure proposed in [8], which reduces the hardware complexity. The shift-register based architecture can be extended to line SC decoders. It can also be applied to semi-parallel architectures by adding more multiplexing resources. The proposed computation method opens the way for additional works such as the extension of this architecture to higher kernels or the enhancement of parallelism in this structure.
APPENDIX A TIME OF AVAILABILITY FOR A PARTIAL SUM
Any partial sum $S_{m,q}$ can be seen as an element of a sub-codeword $SCW(m, q)$. This sub-codeword is the encoded version of the sub-code $\hat{u}_a^b \in \hat{U}$. All the elements of $SCW(m, q)$ are valid whenever all bits of $\hat{u}_a^b$ are valid. Since the bits are decoded sequentially, a partial sum $S_{m,q}$ is valid when the bit $u_b$ is available, that is to say when $\tau = b$. The main purpose is then to find the expression of $b$. We already know $q$, thus $a$ is the only remaining variable to find before getting the expression of $b$. The following equality comes from the length of $SCW(m, q)$, which is the same as the length of $\hat{u}_a^b$:
$$2^q = b - a + 1 \quad (9)$$
Since $a$ is the starting index, it is a multiple of $2^q$. The following expression returns the value of $a$:
$$a = \lfloor m / 2^q \rfloor \cdot 2^q \quad (10)$$
Now the only remaining variable is $b$. Using equations (9) and (10), it follows that $b = 2^q - 1 + \lfloor m / 2^q \rfloor \cdot 2^q$. Finally, $S_{m,q}$ is only valid when $\tau = b$, that is to say:
$$\tau = 2^q - 1 + \lfloor m / 2^q \rfloor \cdot 2^q$$
