Multi-kernel polar codes have recently been proposed to construct polar codes of lengths different from powers of two. Decoder implementations for multi-kernel polar codes need to account for this feature, which becomes critical in memory management. We propose an efficient, generalized memory management framework for the implementation of successive-cancellation (SC) decoding of multi-kernel polar codes. It can be used on many types of hardware architectures and with different flavors of SC decoding algorithms. We illustrate the proposed solution for small kernel sizes, and give complexity estimates for various kernel combinations and code lengths.
I. INTRODUCTION
Polar codes [1] are a family of error-correcting codes with a capacity-achieving property over various classes of channels, providing excellent error rate performance for practical code lengths [2]. The construction of polar codes is based on the polarization effect of the Kronecker powers of the binary $2 \times 2$ kernel matrix $T_2 = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$. A major drawback of this construction is the restriction of achievable block lengths to powers of 2. Puncturing and shortening techniques can be used to adjust the code length, at the cost of a reduced bit polarization [3]. To overcome this limitation, multi-kernel polar codes have been introduced in [4]. By mixing binary kernels of different sizes in the construction of the code, many block lengths can be achieved while keeping the polarization effect.

Many software and hardware implementations of polar code decoders have been proposed in the literature. While software guarantees a higher degree of flexibility in terms of data structures, fast software decoders have to rely on efficient memory management [5], [6]. The importance of smart memory usage is even more evident in hardware implementations, where memory accounts for the majority of area occupation and power consumption, and heavily impacts decoder speed [7]-[9]. The memory structure first proposed in [10] for purely binary polar codes, and widely adopted in SC-based decoders [11], relies on the observation that memory requirements decrease as the decoding stage increases. We show how this trend continues in multi-kernel polar codes, proposing an efficient memory structure for SC-based polar decoders and providing functions for the evaluation of the overall memory requirements. This structure supports the decoding of codes constructed with any combination of kernel sizes, making it an ideal framework for multi-kernel decoder hardware implementations [12].
II. MULTI-KERNEL POLAR CODES
Multi-kernel polar codes generalize the construction of polar codes by mixing binary kernels of different sizes. Similarly to polar codes, an $(N, K)$ multi-kernel polar code is completely defined by an $N \times N$ transformation matrix $G_N$ and a frozen set $\mathcal{F}$, with $|\mathcal{F}| = N - K$. The transformation matrix has the form

$$G_N = T_{p_1} \otimes T_{p_2} \otimes \cdots \otimes T_{p_s}, \tag{1}$$

where $T_{p_i}$ is a $p_i \times p_i$ binary matrix, $i = 1, 2, \ldots, s$, denoting a polarizing kernel of size $p_i$, and $N = p_1 \cdot p_2 \cdot \ldots \cdot p_s$. Binary kernels of different sizes can be found in [13]. The transformation matrix $G_{12} = T_2 \otimes T_2 \otimes T_3$ is shown in Figure 1, where
and the recursive structure of the matrix is highlighted. The frozen set $\mathcal{F}$ indicates the $N - K$ bits to be frozen in the code construction, and can generally be designed according to bit reliabilities [4] or minimum distance [14]. Finally, the encoder is defined by $x = u \cdot G_N$, mapping the input vector $u \in \mathbb{F}_2^N$ to the codeword $x \in \mathbb{F}_2^N$, where $u_i = 0$ for $i \in \mathcal{F}$, and $u_i$, $i \notin \mathcal{F}$, stores the information bits. The complementary set $\mathcal{I} = \mathcal{F}^c$ is termed the information set.
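The encoding rule $x = u \cdot G_N$ can be illustrated with a minimal Python sketch. The helper names (`kron`, `transformation_matrix`, `encode`) are hypothetical, and the entries of the $3 \times 3$ kernel `T3` are an assumption made only for illustration, since its definition is not reproduced in this text.

```python
def kron(a, b):
    """Kronecker product of two binary matrices given as lists of rows."""
    return [[x & y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


T2 = [[1, 0],
      [1, 1]]

# ASSUMPTION: a 3x3 binary kernel chosen for illustration only;
# any other valid polarizing kernel of size 3 could be substituted.
T3 = [[1, 1, 1],
      [1, 0, 1],
      [0, 1, 1]]


def transformation_matrix(kernels):
    """Build G_N = T_{p_1} (x) ... (x) T_{p_s} from a list of kernels."""
    g = [[1]]
    for t in kernels:
        g = kron(g, t)
    return g


def encode(u, g):
    """Compute x = u . G_N over GF(2)."""
    n = len(g)
    return [sum(u[i] & g[i][j] for i in range(n)) % 2 for j in range(n)]


G12 = transformation_matrix([T2, T2, T3])
```

Frozen positions are handled outside this sketch: the caller is expected to set $u_i = 0$ for every $i \in \mathcal{F}$ before calling `encode`.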
The structure of multi-kernel polar codes can be better understood through the Tanner graph of the code; this consists of various $p_i \times p_i$ blocks $B_{p_i}$, corresponding to the different kernels $T_{p_i}$ used in the construction of the transformation matrix, connecting the input vector and the codeword. Each of the $s$ stages composing the graph is formed by $N_i = N/p_i$ kernel blocks $B_{p_i}$, performing the operations involving kernel $T_{p_i}$. The permutations $P_i$ between stages are described in [4]; an example of the Tanner graph for $G_{12}$ is given in Figure 2.
Multi-kernel polar codes can be decoded through successive cancellation (SC) decoding on the Tanner graph of the code, where log-likelihood ratios (LLRs) [15] are passed from the right to the left, while partial sums (PSs) based on hard decisions on the decoded bits are passed from the left to the right. LLRs and PSs are calculated in the kernel blocks, depicted as
Blocks in the same column belong to the same stage and can perform decoding operations in parallel. Roughly speaking, $l_i$ and $L_i$ represent the LLRs of the partial sums $u_i$ and $x_i$, respectively. However, PSs are calculated on the basis of the previously decoded bits, hence they may not match the connected LLRs. We indicate with $L_{i,(j-1)p_i}, \ldots, L_{i,jp_i-1}$ and $u_{i,(j-1)p_i}, \ldots, u_{i,jp_i-1}$ the input LLRs and PSs of the $j$-th block of stage $i$, respectively, with $j \le N_i = N/p_i$. The LLRs $L_{1,0}, \ldots, L_{1,N-1}$ correspond to channel LLRs, while $u_{s,0}, \ldots, u_{s,N-1}$ correspond to the decoded bits. An example of this labeling is given in Figure 2.
Given the binary input vector $u = (u_0, u_1, \ldots, u_{p-1})$, corresponding to the partial sums calculated from the decoded bits, the output vector $x = (x_0, x_1, \ldots, x_{p-1})$ is calculated as $x = u \cdot T_p$. If we call $T_p^i$ the $i$-th column of the kernel matrix $T_p$, the update rule for the PSs can be written as

$$x_i = u \cdot T_p^i, \qquad i = 0, \ldots, p-1. \tag{3}$$

The vector $x$ corresponds to the partial sums calculated by the kernel $T_p$, which will be used as input for the LLR calculations of other blocks. This update rule is performed from left to right, and can also be used for encoding. The output LLRs $l_0, \ldots, l_{p-1}$ are calculated sequentially using the input LLRs $L_0, \ldots, L_{p-1}$ coming from the previous stage and the PSs corresponding to the previously decoded bits, i.e.,

$$l_i = f^p_i(L_0, \ldots, L_{p-1}, u_0, \ldots, u_{i-1}), \tag{4}$$

with $l_0 = f^p_0(L_0, \ldots, L_{p-1})$. This update is performed from right to left, and corresponds to the successive cancellation principle. Rules for the derivation of LLR update functions for arbitrary binary kernels can be found in [16].
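As an illustration of the two kernel-block operations, the following Python sketch implements the PS update for a generic binary kernel and the two LLR update functions for the $2 \times 2$ kernel $T_2$. The function names are hypothetical, and the min-sum form of `f2_0` is a common hardware-friendly approximation assumed here rather than taken from the text; update functions for larger kernels are not covered.

```python
import math

T2 = [[1, 0],
      [1, 1]]


def ps_update(u, kernel):
    """Partial-sum update: x_i = u . (i-th column of the kernel) over GF(2)."""
    p = len(kernel)
    return [sum(u[r] & kernel[r][i] for r in range(p)) % 2 for i in range(p)]


def f2_0(L0, L1):
    """LLR of u_0 for T2 (ASSUMPTION: min-sum approximation)."""
    sign = math.copysign(1.0, L0) * math.copysign(1.0, L1)
    return sign * min(abs(L0), abs(L1))


def f2_1(L0, L1, u0):
    """LLR of u_1 for T2, given the already decoded partial sum u_0."""
    return (1 - 2 * u0) * L0 + L1
```

For example, with input LLRs `L0 = 2.0` and `L1 = -3.0`, the block first evaluates `f2_0(2.0, -3.0)`, and only after `u0` has been decided can it evaluate `f2_1(2.0, -3.0, u0)`, which is the sequential dependency that SC decoding imposes.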
III. MEMORY MANAGEMENT
Similarly to polar codes, it is possible to describe the SC decoding process of a multi-kernel polar code using a data flow graph, depicted in Figure 3 for the code generated by $G_{12}$. The data flow graph represents the memory dependencies arising during decoding, where circles and squares represent the memory needed to store LLRs and PSs, respectively, and black circles represent channel LLRs. In particular, circles and squares identify the need for new memory allocation, while horizontal lines determine the number of time steps for which the values need to be stored. Thick lines represent LLR updates, while dotted lines identify operations involving partial sums, i.e., LLR updates when they merge with thick lines and PS updates when they connect squares. The study of the data flow graph highlights the strong dependencies among data, along with a repetitive structure in the LLR update functions, and gives a precise order for the scheduling of the decoding operations. In general, the hardest constraint for the LLR update functions is given by the calculation of the necessary PSs. The memory usage patterns observed in Figure 3 can be found for codes of any length, and can be exploited to develop a memory management framework as follows.
A. Memory Structure
The memory structure of a generic SC decoder for the multi-kernel polar code defined by $G_{12}$ is presented in Figure 4, along with memory dependencies. We call $\Lambda$, $\Pi$ and $\Upsilon$ the data structures used to store LLRs, PSs and decoded bits, respectively. We denote by $Q$ the number of bits assigned to the representation of each internal LLR, while a partial sum and a decoded bit are, by definition, single-bit values. The proposed memory structure relies on the observation made in [10] that the memory requirements of polar code decoding decrease as the stage index increases; we show that this phenomenon extends to multi-kernel polar codes.
The memory structure of a multi-kernel polar code decoder depends on the order of the kernels defining the transformation matrix $G_N = T_{p_1} \otimes \cdots \otimes T_{p_s}$, where $s$ is the number of stages, i.e., the number of factors in the Kronecker product. LLRs are stored in $s + 1$ vectors $\Lambda_0, \ldots, \Lambda_s$ of $Q$-bit entries and of different lengths, i.e., with a different number of elements. The length of vector $\Lambda_s$ is always 1, and it stores the LLR of the currently decoded bit. The length of vector $\Lambda_i$ is given by the product of the last $s - i$ kernel sizes, i.e., $\Lambda_i$ has $p_{i+1} \cdot \ldots \cdot p_s$ entries.

PSs are stored in $s$ binary matrices $\Pi_1, \ldots, \Pi_s$ of different width and depth, depending on the decoding stage. The width of $\Pi_i$ is given by the kernel size $p_i$, while its depth is given by the product of the last $s - i$ kernel sizes, i.e., $p_{i+1} \cdot \ldots \cdot p_s$, similarly to the LLR vectors. The first matrix $\Pi_1$ is an exception, since it has width $p_1 - 1$. This is because the last column of $\Pi_1$ would be updated during the PS update phase of the decoding of the last bit. Since the PS update is executed right after the bit estimation, we skip this last PS update and do not need to store the last column of $\Pi_1$. Finally, the decoded bits are stored in the binary vector $\Upsilon$ of length $N$.

[Algorithm 1: SC decoding. For each bit index $i$: LLR update, $u_i$ estimation, PS update; return $u$.]

Algorithm 1 depicts the logical flow of operations required by SC decoding. In this section, we follow its schedule and first describe the update operations for the LLRs, then for the decoded bits $u$, and finally for the PSs. The memory update operations are performed by the kernel blocks. The LLR vector $\Lambda_0$ is initially filled with the $N$ LLRs extracted from the received symbols, while the rest of the memory is initialized to zero; we recall that the LLRs have to be permuted according to $P_1$ before insertion in $\Lambda_0$.
According to the SC algorithm, bits are decoded sequentially, hence some of the memory structures are updated at every bit estimation; this update process is illustrated for the decoding of a generic input bit $u_i$.
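The dimensions of the structures described in this subsection can be tabulated with a short Python sketch. The function name `memory_layout` and its argument `kernel_sizes` (the list $p_1, \ldots, p_s$) are hypothetical; the sketch assigns every PS matrix its nominal width $p_i$, so the one-column saving of the exceptional matrix discussed above is deliberately not modelled.

```python
from math import prod


def memory_layout(kernel_sizes):
    """Return the lengths of Lambda_0..Lambda_s and the nominal (depth, width)
    of Pi_1..Pi_s for the kernel order p_1, ..., p_s."""
    s = len(kernel_sizes)
    # Lambda_i has p_{i+1} * ... * p_s entries (the empty product is 1).
    llr_lengths = [prod(kernel_sizes[i:]) for i in range(s + 1)]
    # Pi_i has nominal width p_i and depth p_{i+1} * ... * p_s.
    ps_shapes = [(prod(kernel_sizes[i:]), kernel_sizes[i - 1])
                 for i in range(1, s + 1)]
    return llr_lengths, ps_shapes


llr_lengths, ps_shapes = memory_layout([2, 2, 3])
```

For the kernel order $(2, 2, 3)$ of $G_{12}$, the LLR vectors shrink from the $N = 12$ channel LLRs down to the single LLR of the bit being decoded, which is the decreasing-memory trend the structure exploits.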
B. Memory Update
LLR update: The update function (4) to be used in this phase depends on the index $i$ of the decoded bit $u_i$, and it is selected using the mixed radix representation of $i$ based on the kernels composing the transformation matrix of the code.

In a base-$p$ radix system, positive integers are represented as a finite sequence of digits smaller than $p$. A mixed radix system is a non-standard positional numeral system, generalizing the classic radix system, in which the numerical base depends on the digit position. A well-known example of a mixed radix numeral system is the one used to measure time in hours, minutes and seconds. We use the sequence $p_1, \ldots, p_s$ of the sizes of the kernels constructing the transformation matrix as the base of a finite mixed radix system representing the index $i$ of the decoded bit $u_i$. According to this representation, any integer $i < N$ can be expressed as a vector of $s$ digits

$$b^{(i)} = b^{(i)}_1 b^{(i)}_2 \cdots b^{(i)}_s, \qquad i = \sum_{j=1}^{s} b^{(i)}_j \prod_{k=j+1}^{s} p_k, \qquad 0 \le b^{(i)}_j < p_j. \tag{5}$$
The mixed radix representation of the decoded bit indices for $G_{12} = T_2 \otimes T_2 \otimes T_3$ is given by:

i        0    1    2    3    4    5    6    7    8    9    10   11
b^(i)    000  001  002  010  011  012  100  101  102  110  111  112

The update of the LLR vectors proceeds from right to left in Figure 4. Starting from $\Lambda_1$, all the vectors are updated using the previous LLR vector and the present PS matrix as input; in general, vector $\Lambda_j$ is updated using vector $\Lambda_{j-1}$ and the partial sums stored in $\Pi_j$. The LLR update rule to be used in the update of $\Lambda_j$ is selected using the mixed radix representation of $i$; more precisely, the LLR update rule $f^{p_j}_{b^{(i)}_j}$ is used.

This method is an extension of the method proposed in [10] for the scheduling of $f$ and $g$ functions in the decoding of polar codes. Each entry $\Lambda_j(k)$ of the LLR vector is calculated as

(6)

The update operations of a vector $\Lambda_j$ can be run in parallel to reduce latency, using up to $p_{j+1} \cdot \ldots \cdot p_s$ kernel blocks.
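The mixed radix representation that drives the selection of the LLR update rule can be computed as in the following Python sketch. The function name `mixed_radix` is hypothetical, and the sketch assumes that the last digit, associated with kernel size $p_s$, is the least significant one.

```python
def mixed_radix(i, kernel_sizes):
    """Digits [b_1, ..., b_s] of index i in the mixed radix system whose
    bases are the kernel sizes p_1, ..., p_s (b_s is least significant)."""
    digits = []
    for p in reversed(kernel_sizes):
        digits.append(i % p)   # digit associated with the current base
        i //= p                # carry the remainder to the next position
    return digits[::-1]


# Digits for every bit index of the length-12 code with kernel order (2, 2, 3).
table = [mixed_radix(i, [2, 2, 3]) for i in range(12)]
```

Under this convention, the $j$-th entry of `mixed_radix(i, kernel_sizes)` is the subscript of the update function applied at stage $j$ when decoding bit $u_i$.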
Using the proposed LLR update algorithm, $s$ LLR vectors, i.e., from $\Lambda_1$ to $\Lambda_s$, are updated for every decoded bit $u_i$. However, the data flow presented in Figure 3 shows that the number of vectors to be updated actually depends on the index $i$ of the decoded bit. A closer look at the mixed radix representation table suggests the reason for this scheduling: the mixed radix representations of two consecutive numbers differ only from the position of the rightmost nonzero digit of the second number onwards. In practice, if $z$ denotes the position of the rightmost nonzero digit of $b^{(i)}$, the first $z - 1$ digits of $b^{(i-1)}$ and $b^{(i)}$ coincide.

As a consequence, to decode the bit $u_i$ it is not necessary to update the vectors $\Lambda_j$ with indices $j < z$, and the update can be run starting from $\Lambda_z$. Of course, for the case $i = 0$ all the vectors have to be updated. This acceleration technique is a generalization of the one proposed in [15] for polar codes, and halves the number of vector updates.
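The starting point of the LLR update can be derived directly from the digits of $i$, as in the following Python sketch. The names `mixed_radix` and `first_llr_vector` are hypothetical, and stage indices are 1-based to match the notation $\Lambda_1, \ldots, \Lambda_s$.

```python
def mixed_radix(i, kernel_sizes):
    """Digits [b_1, ..., b_s] of i, with b_s as the least significant digit."""
    digits = []
    for p in reversed(kernel_sizes):
        digits.append(i % p)
        i //= p
    return digits[::-1]


def first_llr_vector(i, kernel_sizes):
    """1-based index z of the first LLR vector to update for bit u_i,
    i.e. the position of the rightmost nonzero digit of i."""
    if i == 0:
        return 1  # for the first bit, every vector has to be updated
    digits = mixed_radix(i, kernel_sizes)
    for j in range(len(digits), 0, -1):
        if digits[j - 1] != 0:
            return j


def total_vector_updates(kernel_sizes):
    """Number of LLR vector updates over a whole codeword when the
    vectors Lambda_j with j < z are skipped."""
    s = len(kernel_sizes)
    n = 1
    for p in kernel_sizes:
        n *= p
    return sum(s - first_llr_vector(i, kernel_sizes) + 1 for i in range(n))
```

For the kernel order $(2, 2, 3)$, `total_vector_updates` can be compared with the $s \cdot N = 36$ updates of the unaccelerated schedule to see the saving.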
This property allows a further simplification of the LLR update algorithm. We have seen that the LLR update is run starting from $\Lambda_z$, with $z$ such that $b^{(i)}_z \neq 0$ and $b^{(i)}_j = 0$ for all $j > z$. This means that it is only necessary to find the subscript of the first LLR update function, while all the other ones have subscript 0.

$u_i$ estimation: If $i \in \mathcal{F}$, i.e., the index belongs to the frozen set, the value of the bit is known to be zero, hence $\Upsilon(i) = 0$. Otherwise, i.e., if $i \notin \mathcal{F}$, the value of the decoded bit is decided by hard decision on its LLR. After the LLR update phase, the LLR of the bit $u_i$ is stored in $\Lambda_s$. In our implementation, negative LLRs represent the bit 1, while positive LLRs represent the bit 0. Through hard decision, we therefore set $\Upsilon(i) = \frac{1 - \mathrm{sgn}(\Lambda_s(0))}{2}$. To sum up, we have that

$$\Upsilon(i) = \begin{cases} 0 & \text{if } i \in \mathcal{F}, \\ \dfrac{1 - \mathrm{sgn}(\Lambda_s(0))}{2} & \text{otherwise.} \end{cases}$$
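The bit estimation step can be sketched in Python as follows. The function name `estimate_bit` is hypothetical, the frozen set is passed as a Python set of indices, and the treatment of an LLR exactly equal to zero (decided as bit 0) is an assumption, since the text does not specify it.

```python
def estimate_bit(i, llr, frozen_set):
    """Estimate bit u_i from the LLR stored in Lambda_s(0).

    Frozen bits are forced to 0; otherwise a negative LLR is decided
    as 1 and a positive LLR as 0.
    ASSUMPTION: an LLR equal to zero is decided as 0.
    """
    if i in frozen_set:
        return 0
    return 1 if llr < 0 else 0
```

For instance, `estimate_bit(4, -0.7, {0, 1, 2})` decides an information bit from a negative LLR, while `estimate_bit(1, -0.7, {0, 1, 2})` ignores the LLR because index 1 is frozen.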
PS update: PS matrices are updated in decreasing order starting from $\Pi_s$. Inside each matrix, the entries are updated column by column, in increasing order starting from the first column. When the last column of a matrix is filled, a column of the next matrix is updated. Similarly to LLRs, the update function depends on the mixed radix representation of index $i$; in particular, the number of matrices to be updated is given by the number of consecutive digits of the mixed radix representation of $i$ holding the highest symbol admitted by the radix, counting from the last digit.
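The digit count that governs the PS update can be obtained without building the full mixed radix representation, as in this Python sketch. The function name `trailing_max_digits` is hypothetical, and the last digit is again assumed to be the least significant one.

```python
def trailing_max_digits(i, kernel_sizes):
    """Count the consecutive digits of i, starting from the last one,
    that hold the highest symbol p_j - 1 admitted by their radix."""
    count = 0
    for p in reversed(kernel_sizes):
        if i % p != p - 1:
            break          # first digit that is not at its maximum
        count += 1
        i //= p            # move to the next, more significant digit
    return count
```

With kernel order $(2, 2, 3)$, index 5 has digits 012, so `trailing_max_digits(5, [2, 2, 3])` counts the two digits that trigger a propagation towards the preceding PS matrices.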
The update always starts from the last PS matrix $\Pi_s$, which is a row vector of width $p_s$. The value of the decoded bit $u_i$ is copied into column $b^{(i)}_s$ of $\Pi_s$. If $z$ is the smallest index such that $b^{(i)}_j = p_j - 1$ for all $j > z$, matrices $\Pi_s, \ldots, \Pi_z$ are going to be updated. When the last column of matrix $\Pi_j$ is filled, i.e., when $b^{(i)}_j = p_j - 1$, column $b^{(i)}_{j-1}$ of matrix $\Pi_{j-1}$ has to be updated. In this case, each row of $\Pi_j$ is used to update column $b^{(i)}_{j-1}$ of $\Pi_{j-1}$ through the product $[\Pi_j(k, 0), \ldots, \Pi_j(k, p_j - 1)] \cdot T_{p_j}$ for $k = 0, \ldots, p_{j+1} \cdot \ldots \cdot p_s - 1$. If we call $T^k_p$ the vector formed by the $k$-th column of the kernel matrix $T_p$, the update rule for the PSs can be rewritten as
for $k = 0, \ldots, p_{j+1} \cdot \ldots \cdot p_s$, where $\Pi_j(k, -)$ represents the $k$-th row of $\Pi_j$ and $c = (k \bmod p_{j+1}) + 1$. As an exception, the PS update step is not executed for the last decoded bit $u_{N-1}$, since its result would not be used by any subsequent decoding step.
IV. ANALYSIS AND CONCLUSIONS
The proposed memory structure limits the memory requirement of a multi-kernel polar decoder. In fact, a naïve memory management of the SC decoder for a multi-kernel polar code with transformation matrix $G_N = T_{p_1} \otimes \cdots \otimes T_{p_s}$ requires storing all the LLRs and PSs depicted in the Tanner graph of the code. As a consequence, $M_{LLR} = N \cdot (s + 1)$ LLRs and $M_{PS} = N \cdot s$ PSs, with $N = p_1 \cdot \ldots \cdot p_s$, have to be stored, with space complexity $O(sN)$. The memory requirement is hence linearly dependent on both the code length $N$ and the number of kernels $s$.
In the proposed memory structure, every LLR vector Λ i with i ≤ 1 stores 
By construction, we have that $M_{PS}^{\mathrm{prop}} \le N \le M_{LLR}^{\mathrm{prop}} < 2N$, hence the space complexity for both LLRs and PSs is reduced to $O(N)$. A comparison between the memory requirements of the proposed memory structure and of the naïve one, involving only kernels of sizes 2 and 3, is presented here:

The memory requirement reduction enabled by the proposed memory structure is remarkable. This shows that multi-kernel polar codes can be used as a valid alternative to punctured polar codes in terms of memory complexity. Given the similarities between polar codes and multi-kernel polar codes, it is straightforward to apply the proposed memory structure to list or simplified SC decoders. Finally, the proposed implementation can be easily transposed to hardware, reducing the complexity of a dedicated ASIC or FPGA architecture.
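The LLR part of this comparison can be reproduced from the vector lengths defined in Section III, as in the following Python sketch. The function name `llr_memory` is hypothetical, the figures it returns are counts of LLR entries computed from those definitions rather than the values of the original comparison, and the PS memory is left out.

```python
from math import prod


def llr_memory(kernel_sizes):
    """Return (N, naive LLR count, proposed LLR count) for kernels p_1..p_s."""
    s = len(kernel_sizes)
    n = prod(kernel_sizes)
    naive = n * (s + 1)  # every LLR of the Tanner graph is stored
    # Proposed structure: Lambda_i holds p_{i+1} * ... * p_s entries.
    proposed = sum(prod(kernel_sizes[i:]) for i in range(s + 1))
    return n, naive, proposed


# Print the counts for a few combinations of kernels of sizes 2 and 3.
for ks in ([2, 2, 3], [3, 2, 2], [2, 2, 2, 3], [3, 3, 2, 2]):
    n, naive, proposed = llr_memory(ks)
    print(ks, n, naive, proposed)
```

Multiplying each count by $Q$ gives the corresponding number of bits. Note that the proposed count depends on the order of the kernels, not only on the code length, so $(2, 2, 3)$ and $(3, 2, 2)$ give different totals for the same $N = 12$.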
