Polar codes have received increasing attention in the past decade, and have been selected for the next generation of the wireless communication standard. Most research on polar codes has focused on codes constructed from a 2×2 polarization matrix, called the binary kernel: codes constructed from binary kernels have code lengths that are bound to powers of 2. A few recent works have proposed construction methods based on multiple kernels of different dimensions, not only binary ones, allowing code lengths different from powers of 2. In this paper, we design and implement the first multi-kernel successive cancellation polar code decoder in the literature. It can decode any code constructed with binary and ternary kernels: the architecture, sized for a maximum code length N max, is fully flexible in terms of code length, code rate, and kernel sequence. The decoder can achieve a frequency of over 1 GHz in 65 nm CMOS technology, and a throughput of 615 Mb/s. The area occupation ranges between 0.11 mm² for N max = 256 and 2.01 mm² for N max = 4096. Implementation results show an unprecedented degree of flexibility: with N max = 4096, up to 55 code lengths can be decoded with the same hardware, along with any kernel sequence and code rate.
The majority of current research is focused on polar codes recursively constructed from a 2 × 2 polarization matrix, also called a binary kernel [1]. The code lengths of polar codes constructed from binary kernels are bound to powers of 2. This is a strong limitation that is currently overcome with rate-matching schemes [3], [4], whose performance and optimality are hard to evaluate a priori. A few recent works have proposed construction methods based on multiple kernels of different dimensions [5], [6], [7]. Multi-kernel polar codes can have block lengths different from powers of 2, at the cost of more complex update rules in the decoding algorithm. In [6], it has been shown that multi-kernel codes can outperform codes of the same length obtained through the application of state-of-the-art puncturing and shortening schemes. At a frame error rate (FER) of about 10⁻³, multi-kernel codes yield gains ranging from 0.1 dB to 1.1 dB.
Polar code decoder architectures in the literature focus mainly on design-time flexibility [8], [9], [10], with parametrized designs that can be implemented to decode a particular code. Some decoders guarantee online code-rate flexibility [11], [12], [13], [14]: while the decoder can decode a single code length, any code rate is supported with the same hardware. The decoder architectures presented in [15] and [16] target binary kernels only, and are online flexible in terms of both code rate and code length. However, a different decoding program must be stored for every considered combination of code length and rate, leading to a very large area occupation. The unrolled architecture presented in [17] can decode a small set of binary nested code lengths and rates.
In this work, we consider multi-kernel polar codes constructed from binary and ternary (3 × 3) kernels, and we propose a flexible decoder architecture. The presented design can decode any code constructed from any combination of binary and ternary kernels, up to a maximum code length defined at design time, and any code rate. It is the first multi-kernel decoder in the literature, yielding an unmatched degree of flexibility, with up to 55 supported code lengths in the considered case study. Implementation results in 65 nm CMOS technology show an achievable frequency of more than 1 GHz and 615 Mbps coded throughput.
The remainder of the paper is organized as follows. In Section II, we introduce polar code construction and decoding, while in Section III we show the error-correction performance of some multi-kernel codes. Section IV details the proposed decoder architecture, while implementation results are given in Section V, together with a comparison with the state of the art. Conclusions are drawn in Section VI.
II. PRELIMINARIES

A. Polar Codes
A polar code P(N, K) is a linear block code of length N and rate K/N that relies on a phenomenon called channel polarization [1]. When N tends to infinity, the symmetric capacity of each bit-channel tends towards either 0 or 1, thus identifying very reliable and very unreliable channels.
Let us assume $N = 2^n$, where $n \geq 1$, and let $u = (u_0, u_1, \ldots, u_{N-1})$ be the N-bit vector input to the encoder. The K information bits are assigned to the K most reliable channels of u, while the remaining N − K are fixed to a known value (usually 0), and are known as frozen bits. The ensemble of their indices is the frozen set F.
The encoding process can be represented through the linear transformation $x = uG$, where $G = T_2^{\otimes n}$ is the generator matrix, expressed through the $n$-th Kronecker power of the matrix $T_2$. The matrix $T_2$ is a binary polarization matrix, or kernel, defined as follows:

$$T_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$
From the definition of G, the recursive nature of the encoding process can be noticed: a polar code of length N can in fact be obtained as the concatenation of two N/2 polar codes. Polar code encoding can also be portrayed through a Tanner graph, as shown in Fig. 1 for an N = 8 code. Each stage depicts a Kronecker product, and the dashed boxes represent each T 2 operation. Between neighboring stages permutations are inserted, in which the bit-indices of the inputs are cyclically rotated to the right by one place [1] .
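As a minimal sketch of the encoding described above, the generator matrix can be built as a repeated Kronecker product and applied over GF(2). The sketch assumes the standard binary kernel T_2 = [[1, 0], [1, 1]] and natural index order, i.e. it leaves out the inter-stage permutations of the Tanner graph; the function names `kron`, `generator_matrix` and `encode` are illustrative only.

```python
# Assumed standard binary kernel (Arikan kernel).
T2 = [[1, 0],
      [1, 1]]


def kron(a, b):
    """Kronecker product of two binary matrices stored as lists of rows."""
    return [[x & y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def generator_matrix(n):
    """Return the n-th Kronecker power of T2 (natural order, no permutations)."""
    g = [[1]]
    for _ in range(n):
        g = kron(g, T2)
    return g


def encode(u):
    """Compute x = u G over GF(2) for an input whose length is a power of 2."""
    length = len(u)
    n = length.bit_length() - 1
    if length < 2 or (1 << n) != length:
        raise ValueError("input length must be a power of 2, at least 2")
    g = generator_matrix(n)
    return [sum(u[i] & g[i][j] for i in range(length)) % 2
            for j in range(length)]
```

For instance, `encode([1, 1])` combines both input bits on the first output and copies the second input to the second output, which is the single T_2 operation drawn in each dashed box of the Tanner graph.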
B. Successive Cancellation Decoding
In [1] a first successive cancellation (SC) decoding algorithm has been proposed. It can be represented as a binary search tree where all the nodes must be explored, with priority being given to left branches. An example of a P(8, 4) polar code SC decoding tree is shown in Fig. 2a: the leaf nodes at stage s = 0 can be either information bits (dark gray) or frozen bits (light gray). Let us call y = (y 0 , y 1 , . . . , y N−1 ) the vector of logarithmic likelihood ratios (LLRs) obtained at the channel output, and û the estimated vector output by the decoder. The decoding starts from the root node, and at each node information is passed from parent to child according to the scheme shown in Fig. 2b. The LLR value α is received and used to compute α l , then β l is obtained and used to compute α r . Once β r is available, β can be computed. Once a leaf node is reached, the value of û i is estimated. If index i ∈ F, its value is set to 0, otherwise a hard decision on the sign of α is performed. Calling N s the length of the polar code at stage s, we can define ∀i ∈ {0, 1, ..., N s /2 − 1}:
where ⊕ represents the XOR operation and ϕ() is a function returning the sign of the argument. In (1), both the exact and the approximate (hardware-friendly) computation, proposed in [8], are shown. At leaf nodes, β is initialized as û i (4), where i is the index identifying the current leaf node.
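The update rules (1)-(3) are not reproduced in this text, so the following is only a sketch under the assumption that they follow the standard SC formulation: an exact and a min-sum version of the left-child LLR, the right-child LLR conditioned on the left partial sum, and the XOR-based combination of partial sums. The function names are illustrative.

```python
import math


def f_exact(a, b):
    """Exact left-child LLR (assumed form of the exact rule in (1))."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def f_approx(a, b):
    """Hardware-friendly min-sum version: product of signs times min magnitude."""
    magnitude = min(abs(a), abs(b))
    return -magnitude if (a < 0) != (b < 0) else magnitude


def g_b(a, b, beta_l):
    """Right-child LLR given the left partial sum beta_l (assumed form of (2))."""
    return b + (1 - 2 * beta_l) * a


def comb_b(beta_l, beta_r):
    """Parent partial sums from the two children (assumed form of (3))."""
    return [l ^ r for l, r in zip(beta_l, beta_r)] + list(beta_r)
```

With a = 1.5 and b = −2.0, the approximate rule returns −1.5, while the exact rule returns a value with the same sign and a smaller magnitude; this is the usual behaviour of the min-sum approximation.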
C. Multi-Kernel construction
In [6] a generalized construction method for polar codes has been presented: together with T 2 , larger kernels have been investigated. Thus, the matrix G is composed of a series of Kronecker products between kernels of different sizes. Ternary kernels, i.e. kernels of dimensions 3×3, have been considered in [6] , where the proposed polarization matrix is indices. For each stage i > 1, the permutation matrix P i can be found as
where Q i is the so-called canonical permutation introduced in [6], $N_i = \prod_{j=1}^{i-1} n_j$, and n j × n j are the dimensions of the j-th kernel of the Kronecker product. Finally, P 1 is computed in order to re-align output indices with those relative to the encoder input, considering all the previous permutations.

Fig. 4a shows the SC decoding tree for the same code, and the message passing criterion in case of ternary nodes is shown in Fig. 4b. Defining (1) as f b , (2) as g b , and (3) as comb b , for a ternary node at stage s the decoding rules ∀i ∈ {0, 1, ..., N s /3 − 1} are:

Similarly to the binary kernel case, we define (5), (6), (7) and (8) as f t , g t 1 , g t 2 and comb t respectively.
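Since neither the ternary kernel nor rules (5)-(8) are reproduced in this text, the sketch below is built on explicit assumptions: the ternary kernel is taken to be T_3 = [[1, 1, 1], [1, 0, 1], [0, 1, 1]], the update rules are the min-sum rules that follow from that matrix, and indices are kept in natural order (the permutations P i are omitted). Under these assumptions it shows a recursive SC decoder that handles any sequence of binary and ternary kernels, listed from the root of the tree to the leaves. All names are hypothetical.

```python
T2 = [[1, 0], [1, 1]]
T3 = [[1, 1, 1], [1, 0, 1], [0, 1, 1]]  # ASSUMED ternary kernel, not given in this text
KERNELS = {2: T2, 3: T3}


def sign_min(*llrs):
    """Min-sum check update: product of the signs times the minimum magnitude."""
    magnitude = min(abs(v) for v in llrs)
    return -magnitude if sum(v < 0 for v in llrs) % 2 else magnitude


def encode(u, kernels):
    """x = u (T_k1 (x) T_k2 (x) ...) over GF(2), natural index order."""
    if not kernels:
        return list(u)
    t = KERNELS[kernels[0]]
    k, m = len(t), len(u) // len(t)
    subs = [encode(u[i * m:(i + 1) * m], kernels[1:]) for i in range(k)]
    return [sum(t[i][j] & subs[i][q] for i in range(k)) % 2
            for j in range(k) for q in range(m)]


def decode(alpha, frozen, kernels):
    """Return (u_hat, beta) for the sub-tree fed with the LLR vector alpha."""
    if not kernels:  # leaf: frozen bits are 0, otherwise hard decision on the sign
        bit = 0 if frozen[0] or alpha[0] >= 0 else 1
        return [bit], [bit]
    k, rest = kernels[0], kernels[1:]
    m = len(alpha) // k
    a = [alpha[j * m:(j + 1) * m] for j in range(k)]
    fz = [frozen[j * m:(j + 1) * m] for j in range(k)]
    if k == 2:
        u_l, b_l = decode([sign_min(a[0][i], a[1][i]) for i in range(m)], fz[0], rest)
        u_r, b_r = decode([a[1][i] + (1 - 2 * b_l[i]) * a[0][i] for i in range(m)],
                          fz[1], rest)
        return u_l + u_r, [b_l[i] ^ b_r[i] for i in range(m)] + b_r
    # Ternary node: left, centre and right children (assumed forms of f_t, g_t1, g_t2).
    u_l, b_l = decode([sign_min(a[0][i], a[1][i], a[2][i]) for i in range(m)],
                      fz[0], rest)
    u_c, b_c = decode([(1 - 2 * b_l[i]) * a[0][i] + sign_min(a[1][i], a[2][i])
                       for i in range(m)], fz[1], rest)
    u_r, b_r = decode([(1 - 2 * b_l[i]) * a[1][i]
                       + (1 - 2 * (b_l[i] ^ b_c[i])) * a[2][i] for i in range(m)],
                      fz[2], rest)
    beta = ([b_l[i] ^ b_c[i] for i in range(m)]
            + [b_l[i] ^ b_r[i] for i in range(m)]
            + [b_l[i] ^ b_c[i] ^ b_r[i] for i in range(m)])  # assumed comb_t
    return u_l + u_c + u_r, beta
```

On a noiseless channel (LLR +1 for a transmitted 0, −1 for a transmitted 1), decoding the output of `encode` with the same kernel list returns the original input vector, and the root β vector equals the codeword. This holds for both kernel orders (2, 3) and (3, 2) of a length-6 code.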
III. MULTI-KERNEL CODES
The multi-kernel code construction method proposed in [6] yields substantial error-correction performance gain with respect to puncturing and shortening schemes. Table I reports such gain when two multi-kernel codes are compared to codes obtained with the puncturing method in [18] and the shortening method in [19] , for SC decoding and list SC (SCL) [20] with a list size of 8. Depending on the target FER, the gain ranges from 0.1 to 1.1 dB.
Using the construction method described in [6], multi-kernel codes have been constructed. Their error-correction performance has been simulated through a binary-input additive white Gaussian noise (AWGN) channel with binary phase-shift keying modulation. The bit error rate (BER) and FER curves are shown in Fig. 5, obtained with SC decoding and LLRs represented in double-precision floating-point format. As discussed in [6] and [7], the Kronecker product is not commutative, and different kernel orders will result in different codes. However, there is currently no theoretical way to identify the best kernel multiplication order: thus, the different kernel orders need to be simulated to identify the one that gives the best error-correction performance. In the remainder of our work, we considered the following codes and kernel orders, obtained with the method described in [7] :
IV. DECODER ARCHITECTURE
We propose a multi-code semi-parallel SC decoder which supports purely-binary, purely-ternary and binary-ternary mixed construction polar codes. The architecture is sized with a maximum code length N max , and can support any code length N ≥ 2 that can be expressed as a combination of binary and ternary kernels, and any code rate. For mixed polar codes, the architecture can decode codes constructed with any kernel order, without knowledge of the code structure at design time.
The overall decoder architecture is shown in Figure 6 . It relies on P processing elements (PEs) implementing (1)-(8), and dedicated memories for channel and internal LLRs, β values and candidate codeword. Both channel and internal LLRs are represented on Q bits, Q f of which are assigned to the fractional part.
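The fixed-point format can be illustrated with a short sketch that maps a real-valued LLR onto Q bits with Q f fractional bits. It assumes the sign-and-magnitude representation adopted for all LLRs in this architecture, round-to-nearest quantization and saturation at the largest magnitude; the rounding mode and the function names are assumptions.

```python
import math


def quantize_llr(value, q_bits, q_frac):
    """Return (sign_bit, magnitude), the magnitude being in units of 2**-q_frac.

    One of the q_bits is assumed to hold the sign, so the magnitude
    saturates at 2**(q_bits - 1) - 1.
    """
    max_magnitude = (1 << (q_bits - 1)) - 1
    magnitude = int(math.floor(abs(value) * (1 << q_frac) + 0.5))
    return (1 if value < 0 else 0, min(magnitude, max_magnitude))


def dequantize_llr(sign_bit, magnitude, q_frac):
    """Map a quantized LLR back to its real value."""
    real = magnitude / (1 << q_frac)
    return -real if sign_bit else real
```

With Q = 7 and Q f = 3, for example, the resolution is 0.125 and the largest representable magnitude is 7.875, so `quantize_llr(-100.0, 7, 3)` saturates to `(1, 63)`.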
Together with the code length, the decoder receives as inputs the following parameters:
• information about binary and ternary stages;
• memory address offsets for both LLRs and β values, relative to the current code length;
• number of steps required by each stage to process all inputs given the number P of PEs. This is needed because the decoder has a semi-parallel architecture and, for stages where N s > 2P, the number of PEs is not sufficient to process all data in a single clock cycle.
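The last parameter can be illustrated with a small sketch. It assumes that each step consumes up to 2P LLRs, so that a stage of length N s needs ⌈N s / 2P⌉ steps, and that the kernel list is ordered from the root of the decoding tree down to the leaves; both conventions, and the function name, are assumptions made for the example.

```python
def stage_steps(kernels, p):
    """Return a list of (N_s, kernel_size, steps) tuples, from the root downwards.

    kernels: kernel sizes (2 or 3) ordered from the root stage to the leaf stage.
    p:       number of processing elements; 2*p LLRs are processed per step.
    """
    n_s = 1
    for k in kernels:
        n_s *= k
    result = []
    for k in kernels:
        steps = -(-n_s // (2 * p))  # ceiling division
        result.append((n_s, k, steps))
        n_s //= k
    return result
```

For a length-12 code with kernels (2, 2, 3) and P = 3, only the root stage exceeds 2P = 6 inputs and therefore needs two steps; every other stage completes in one.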
In order to simplify and reduce both memory accesses and routing, the architecture has been designed for bit-reversed polar codes [8].
A. Data Flow
The channel output y is initially stored in the Channel LLR RAM, while the frozen set F and the code parameters listed in the previous section are uploaded to their dedicated memories, respectively the Frozen Pattern RAM and a set of registers. For operations involving soft values, the Processing Unit receives as input either the channel or the internal LLRs, according to the current stage of the decoding tree. For comb operations (3) and (8), data read from the Internal β RAM are used. Results are stored either in the Internal LLR RAM or in the Internal β RAM, according to the performed operation. When a leaf node is reached and a hard decision (HD) is performed to decide the value of a bit (4), the result is stored in the Codeword RAM. The decoding phase ends when the bit associated with the rightmost leaf node is estimated: the decoded codeword û is thus output.
B. Processing Unit
The Processing Unit (PU) is the computational core of the decoder, where all the operations are performed, including f b (1), g b (2), comb b (3), f t (5), g t 1 (6), g t 2 (7) and comb t (8). It contains P processing elements (PEs) and P combine blocks (CBs) organized as follows:
• (2/3)P = P b/t binary-ternary mixed PEs, each of them able to compute any f or g operation, both binary and ternary;
• (1/3)P binary PEs, which support only f b and g b ;
• (2/3)P = P b/t binary-ternary mixed CBs, which perform both comb b and comb t ;
• (1/3)P binary CBs, which support only comb b .

Since binary and ternary operations share common computations, mixed PEs are used to increase resource sharing, at the cost of a multiplexing operation; additional purely-binary PEs are used to align the number of used inputs for binary and ternary operations. Thus the maximum number of processed soft inputs is fixed to 2P = 3P b/t , while the results are either P or P b/t LLRs: the number of operations simultaneously performed is in fact P in the binary case and P b/t in the ternary one. For binary operations the i-th PE processes the 2i-th and (2i + 1)-th LLR inputs, while for ternary ones the i-th mixed PE uses the LLRs corresponding to indices 3i, 3i + 1 and 3i + 2. The same holds for CBs. It follows that P must be a multiple of 3; an example of PU with P = 3 is shown in Fig. 7, where the multiplexers inserted before the PEs are used to align the correct LLRs in case of a binary or ternary stage.
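The input assignment described above can be sketched as follows: for a given P, the function lists which LLR indices of the 2P-wide input word feed each PE in a binary or in a ternary stage. The function name and the return format are illustrative.

```python
def pe_inputs(p, ternary):
    """Return the tuple of input LLR indices seen by each active PE.

    Binary stage:  all P PEs are active, PE i reads inputs 2i and 2i+1.
    Ternary stage: only the P_bt = 2P/3 mixed PEs are active, PE i reads
                   inputs 3i, 3i+1 and 3i+2.
    """
    if p <= 0 or p % 3 != 0:
        raise ValueError("P must be a positive multiple of 3")
    if ternary:
        p_bt = 2 * p // 3
        return [(3 * i, 3 * i + 1, 3 * i + 2) for i in range(p_bt)]
    return [(2 * i, 2 * i + 1) for i in range(p)]
```

With P = 3, a binary stage uses three PEs on the index pairs (0, 1), (2, 3), (4, 5), while a ternary stage uses the two mixed PEs on (0, 1, 2) and (3, 4, 5); in both cases all 2P = 6 inputs are consumed.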
Although there are situations in which not all PEs are performing useful computations, 2P inputs are nevertheless elaborated and stored in the corresponding memory. Unnecessary data are subsequently ignored in the final estimation: this happens for stages s where N s is not a multiple of 2P. The impact of two different LLR representations on the implementation cost of the PU has been evaluated: we have in fact designed PEs with both 2's complement and sign and magnitude representations. FPGA synthesis results have shown that the sign and magnitude binary PE has 14% lower resource requirements and 20% shorter critical path than the 2's complement one, while the sign and magnitude mixed PE has similar resource requirements and 23% shorter critical path than the 2's complement one. Thus, all LLRs in the proposed decoder are represented with sign and magnitude.
1) Binary Processing Elements: The architecture of binary PEs is the one proposed in [8] . Let us call α a and α b the input LLRs. For the hardware-friendly version of f b (1) operation the result computation is straightforward:
where α b f is the f b operation result. Analyzing the complete truth table both for sign ϕ(α b g ) and magnitude |α b g | of g b (2), its resulting equations are:

This architecture is shown in Figure 8. The rightmost multiplexers select the output values depending on the selected operation, while the internal multiplexers are used to select the correct result of the maximum and minimum identification, of |α b g |, and of ϕ(α b g ), based on the comparison between input LLRs and computed partial results. Adders and subtractors saturate their result if outside the available range.
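The sign-and-magnitude equations of the binary PE are not reproduced in this text. As a behavioural sketch only, and not the gate-level equations of [8], the two operations can be modelled on (sign bit, magnitude) pairs as follows, assuming a Q-bit word with one sign bit and saturation at the largest magnitude. All names are hypothetical.

```python
def f_b_sm(a, b):
    """Min-sum f on sign-magnitude pairs: XOR of the signs, minimum of the magnitudes."""
    (sign_a, mag_a), (sign_b, mag_b) = a, b
    return (sign_a ^ sign_b, min(mag_a, mag_b))


def g_b_sm(a, b, beta, q_bits):
    """g on sign-magnitude pairs: b + (1 - 2*beta) * a, with saturation."""
    max_magnitude = (1 << (q_bits - 1)) - 1
    (sign_a, mag_a), (sign_b, mag_b) = a, b
    sign_a ^= beta                      # the partial sum flips the sign of a
    if sign_a == sign_b:                # same sign: add magnitudes and saturate
        return (sign_b, min(mag_a + mag_b, max_magnitude))
    if mag_b >= mag_a:                  # opposite signs: subtract, keep larger sign
        return (sign_b, mag_b - mag_a)
    return (sign_a, mag_a - mag_b)


def to_int(x):
    """Convert a sign-magnitude pair to a signed integer (for checking only)."""
    sign, magnitude = x
    return -magnitude if sign else magnitude
```

The comparison between magnitudes in `g_b_sm` plays the role of the maximum and minimum identification mentioned above, and the `min(..., max_magnitude)` term models the saturating adder.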
2) Binary-Ternary Mixed Processing Elements: An analysis analogous to the binary case has been conducted on f t (5), g t 1 (6) and g t 2 (7) . The resulting equations are the following:
where
The circuit implementing these operations is shown in Figure 9, where again adders and subtractors can saturate the result. The multiplexers have the same role as those shown in Fig. 8: their number increases due to the higher number of input LLRs, of computations to be performed and of results to be computed, and to the concurrent binary/ternary datapath. The M block is a combination of pruned multiplexers selecting the minimum absolute value according to the already computed selection signals, which correspond to the most significant bits of the output of the subtractors.
Mixed PEs perform both binary and ternary operations, and need to select their input accordingly. Thus, LLR multiplexing logic is inserted at their input. This logic consists of two Q-bit multiplexers for each mixed PE.
3) Combine Blocks: Both binary and binary-ternary mixed CBs are composed of XOR gates implementing comb b and (comb b )sel + (comb t )sel respectively, where sel is the binary/ternary selector.
C. Memory System
While efficient in terms of resource usage, register-based approaches like [11] lead to excessive area occupation. Thus, this design foresees the usage of SRAM banks. The width of these memories is different from that of memories in a purely-binary decoder design, since they have to accommodate ternary operations and their concurrent input and output volume. Additionally, a three-bank solution has been implemented for the Internal LLR RAM, since ternary-kernel functions are supported: for purely-binary decoders two banks would have been sufficient.
1) Channel LLR RAM: This memory stores the LLRs coming from the channel. Each memory word is 2P · Q bits long, since for each operation involving LLRs 2P of them are required by the PU. Its depth is $D_{LLR_{ch}} = N_{max} / (2P)$. This memory uses two separate ports, one for reading and one for writing.
2) Internal LLR RAM: It contains the partial results of f and g operations. Similarly to the Channel LLR RAM, the parallelism must be 2P · Q. The computation of the depth $D_{LLR_{int}}$ takes into account that for each decoding stage only one LLR vector must be stored: once the node which took the computed LLRs as input has generated its output β, those soft values will no longer be used and can be overwritten. In addition, the result of stage s = 0 does not need to be stored, since the hard decision is performed in the same clock cycle.
The memory depth is computed as:
Also for this memory two separate ports for reading and writing are required. It is possible to rearrange the Internal LLR RAM with a bank structure. However, due to the variable number of data that needs to be written, depending on the stage being binary or ternary, four banks with two different widths should be implemented. This would incur significant control and addressing overhead, with no tangible advantage with respect to the proposed structure. More details on the handling of different result sizes are given in Section IV-D.
3) Internal β RAM: This memory stores all β values computed inside the decoding tree; it is organized in three banks, which share the same input writing bus:
• BANK0 for β 0 : it is equal to β l in both binary and ternary cases;
• BANK1 for β 1 : it is equal to β r for binary stages, while for ternary ones it represents β c ;
• BANK2 for β 2 : it corresponds to the ternary stages β r .

The bank organization is fundamental for parallel data reading in g t 2 , comb b and comb t operations. Each bank has a width of 2P since results of comb operations are on 2P bits, while their depths D β int are equal to:
4) Codeword RAM: It is used to store the decoder output û, composed of the HDs performed at the leaf nodes. Its width W cod is a design choice independent of all other parameters, while the depth is
5) Frozen Pattern RAM: It stores the frozen set, where each of the N max bits identifies whether the corresponding bit-channel is frozen or not. The memory width W frozen is an independent design choice, while the depth can be expressed as

Table II reports the breakdown of the memory requirements for the proposed decoder with various N max , P and Q combinations. To correctly evaluate the memory overhead brought by the multi-kernel approach, the memory sizes for purely binary polar decoders with similar parameters have been detailed as well. It can be seen that most of the additional memory bits can be found in the internal β memory.
D. Memory Interfaces
Two interfacing modules are required to adapt the inherent parallelism of the memories to that of the PU. 
1) Internal LLR Memory Interface: Fig. 10 shows the interface circuit. It is tasked with choosing, during write operations, which part of the memorized word has to be overwritten. In fact, the results of f and g operations are P or P b/t LLRs, for binary and ternary cases respectively, while the width of the LLR memories is 2P = 3P b/t . Each memory location takes two or three clock cycles to be overwritten with useful data. So, at tree stages where N s > 2P and the PU takes more than one clock cycle to process them, the following steps are performed:
• For binary stages:
1) The 2i-th operation result (P LLRs) is stored in the memory together with QP appended zeros;
2) The (2i + 1)-th operation result is stored after the QP most significant bits of the previously written word, so that the padding zeros are overwritten and the new stored word contains the P results of both the 2i-th and (2i + 1)-th operations.
• For ternary stages:
1) The 3i-th operation result (P b/t LLRs) is stored in the memory together with 2Q P b/t appended zeros;
2) The (3i + 1)-th operation result is stored after the Q P b/t most significant bits of the previously written word. The new word contains the P b/t results of both the 3i-th and (3i + 1)-th operations;
3) The (3i + 2)-th operation result is stored after the previously written 2Q P b/t bits, completing the 3P b/t = 2P LLR word.

To overwrite only parts of the previously written word, the bypass buffer output is used. When N s ≤ 2P, the results are stored in the first part of the word as usual; the remaining bits are not considered in subsequent operations.
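The write sequence above can be sketched at the granularity of LLR slots rather than bits: a memory word is modelled as a list of 2P slots, with index 0 standing for the most significant part of the word. The function name and this list-based model are assumptions made for illustration.

```python
def write_partial(word, results, part, ternary, p):
    """Return the memory word after writing one partial result.

    word:    current content of the location, a list of 2*p LLR slots.
    results: P LLRs (binary stage) or P_bt = 2P/3 LLRs (ternary stage).
    part:    0 for the first write to the location, then 1 (and 2 if ternary).
    """
    chunk = 2 * p // 3 if ternary else p
    if len(results) != chunk:
        raise ValueError("unexpected number of results for this stage type")
    if part >= (3 if ternary else 2):
        raise ValueError("part index out of range")
    # The first write pads the rest of the word with zeros; later writes keep
    # the previously written part (provided by the bypass buffer in hardware).
    new_word = [0] * (2 * p) if part == 0 else list(word)
    new_word[part * chunk:(part + 1) * chunk] = results
    return new_word
```

With P = 3, two binary writes of three LLRs each, or three ternary writes of two LLRs each, fill the same six-slot word.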
2) β Memory Interface: Figure 11 shows the interface architecture. It is used both for reading and writing from the Internal β RAM:
• Reading: operations involving β values need either P or P b/t bits per bank as input, while each word is composed of 2P bits. Thus, the relevant word parts are selected according to the actual number of processed LLRs for that node.
• Writing: the data is selected between the CB results and the HD for the leaf nodes.

Fig. 10. Internal LLR RAM interface circuit.
Fig. 11. β memory interface circuit, where r_data_0, r_data_1 and r_data_2 are the outputs of bank0, bank1 and bank2 respectively.
E. Bypass Registers
Two bypass registers must be used since the memory system is RAM-based and, if a result is computed and ready to be stored at the j-th clock cycle, it can be correctly read only from the (j + 2)-th cycle onwards, to avoid incurring conflicts. So, for all the nodes at stage $s \leq \log_2 2P$, bypass registers allow reading newly computed data already at the following clock cycle. A 2QP-bit register is used for the Internal LLR RAM, while a second 2P-bit register is necessary for the Internal β RAM.
F. Control Unit
The Control Unit provides all the memory addresses to the memories and control signals to the datapath. It has been designed as several hierarchically controlled finite state machines. The decoding process follows the same approach as the tree exploration by means of different counters, which keep track of the status and of the number of visited leaves. The decoding process ends when a number of leaves equal to the code length has been visited.
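The order in which the operations are issued during the tree exploration can be sketched with a small generator. It is a behavioural illustration only, not a model of the finite state machines or of their cycle counts: the kernel list is assumed to be ordered from the root to the leaves, stage numbers count up from the leaf stage s = 0, and the operation labels are illustrative.

```python
def sc_schedule(kernels):
    """Yield operation labels in SC order for a given root-to-leaf kernel list."""
    if not kernels:
        yield "HD"  # leaf node: hard decision (or frozen bit)
        return
    stage = len(kernels)
    if kernels[0] == 2:
        llr_ops, comb = ("f_b", "g_b"), "comb_b"
    elif kernels[0] == 3:
        llr_ops, comb = ("f_t", "g_t1", "g_t2"), "comb_t"
    else:
        raise ValueError("only binary and ternary kernels are supported")
    for op in llr_ops:
        yield f"{op}@{stage}"
        yield from sc_schedule(kernels[1:])  # explore the child sub-tree
    yield f"{comb}@{stage}"
```

Counting the `"HD"` labels gives the number of visited leaves, which equals the code length when the traversal terminates.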
G. Multi-Code Support
Memories are sized for a maximum code length N max , but any code length N ≤ N max that can be written as a product of powers of 2 and 3 is supported. Memory requirements are upper bounded by the largest combination of T 2 kernels leading to N max , since a higher number of stages is present in the decoding tree than in a mixed-kernel polar code with similar code length. The input code parameters identify when the leaf node stage is reached, and thus when the tree ascension has to start. The status counter in the CU uses foreknowledge of the number of kernels and their dimension to schedule the right operation at each stage: thus, any code rate and kernel order can be decoded without any change to the hardware. The total amount of bits required to store the code parameters for a code of length N is log 2 N + s m 2 + 2 log 2 N 2 P , where s m is the number of kernels composing the code. The PU has been designed independently of the code length.
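The set of supported code lengths can be enumerated directly, under the assumption that a length is supported exactly when it is a product of powers of 2 and 3, at least 2 and at most N max . The function name is illustrative.

```python
def supported_lengths(n_max):
    """Return all N = 2**a * 3**b with 2 <= N <= n_max, in increasing order."""
    lengths = []
    power_of_3 = 1
    while power_of_3 <= n_max:
        n = power_of_3
        while n <= n_max:
            if n >= 2:
                lengths.append(n)
            n *= 2
        power_of_3 *= 3
    return sorted(lengths)
```

For N max = 4096 this enumeration yields 55 lengths, which matches the number of code lengths reported for that decoder size.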
V. IMPLEMENTATION RESULTS
The decoder architecture illustrated in the previous Section has been described in VHDL, verified with ModelSim, and synthesized with Cadence RTL Compiler on TSMC 65nm CMOS technology node.
The choice of the number of LLR quantization bits Q influences a substantial part of the computational hardware and memory width. In Figure 12 the error-correction performance of a P(4096, 2048) polar code is shown: there is no significant difference between the Q = 7 and Q = 8 curves, while choosing Q = 6 leads to larger error rates with respect to floating-point precision. Although the number of fractional bits Q f does not influence the hardware architecture, a high Q f requires a higher Q. In Figure 12 we can notice that Q f = 3 yields only minor FER degradation. Thus, for N max = 4096 we chose Q = 7 and Q f = 3. Similar studies were performed for N max = 1024 and N max = 256, leading to Q = 6, Q f = 3 in the first case and to Q = 5, Q f = 2 in the second.

Table III reports synthesis results for three sets of decoder parameters. Along with the parameters, the number of supported code lengths N and the maximum achievable frequency f max are shown. All implementations can run at more than 1 GHz. A reg is the area occupation when all memories are synthesized as registers, while in A RAM all the memories

The latency of the decoding phase depends on the number P of PEs, on the number of kernels s m , on the kernel dimensions and on their order.
The decoding latency, measured in clock cycles (CCs) can be computed as:
In Table IV the timing performance of several polar codes is shown, where L is the decoding latency, f is the achievable frequency, and T the coded throughput. The results cover a wide range of code parameters over three different decoder implementations. Since the kernel order impacts the decoding latency, the dimension of each kernel has been reported, from left to right as in the Kronecker product. It is possible to see that the achievable frequency is consistently above 1 GHz, and that the coded throughput ranges from 350 to 615 Mbps.
In Table V the implementation results of the proposed decoder are compared to rate-flexible purely binary decoders in the state of the art, since to the best of our knowledge this is the first multi-kernel decoder in the literature. All decoders have been implemented in 65 nm CMOS technology, and target a code with N = 1024, which for our work corresponds to N max as well. Both in [8] and [10] semi-parallel architectures are proposed, supporting the SC algorithm and a single fixed code length. The reported results for [10] refer to their best devised architecture, called folded high performance partial sum network. It limits the number of processing elements by folding highly parallel operations and performing them in several clock cycles, thus increasing hardware utilization. Observing the bit-per-cycle (bpc) throughput in Table V, it can be noticed that both [8] and [10] outperform the proposed decoder for the considered purely binary codes. The reason can be found in the additional clock cycles required for comb operations in our architecture: since different kernel orders are supported, the sequence of comb operations (3) and (8) is not always the same. Thus, it is not possible to hardwire an XOR tree to compute the comb at all stages in one clock cycle, as in decoders supporting only binary kernels: separate clock cycles are spent to perform the comb operations according to the correct kernel order.
On the other hand, [8] and [10] consider only binary kernels and, by implementing a tree of comb operations and eventually selecting a partial result, compute β values in the same clock cycle, immediately after the g operation. This does not affect the critical path in a significant way, since only a few XOR gates are added. As shown in Table IV, codes constructed with higher-dimension kernels yield a higher throughput. When decoding a ternary node, due to the higher utilization factor of the PEs and the higher number of useful computations in each clock cycle, the number of clock cycles needed to decode a codeword is lower. Moreover, latency-reduction techniques like the ones presented in [21] and [9] can be easily adapted to the proposed architecture. The proposed decoder yields a higher area occupation than both [8] and [10]. This is mainly due to the higher quantization parameter Q and to the support for ternary functions. Mixed PEs require 2.57× the LUTs on FPGA and 2.10× the area occupation of purely binary ones. However, our decoder is completely code-length flexible and supports multiple kernel sizes, any code rate and any kernel order. Moreover, it can achieve the highest frequency among the considered works, and a higher throughput in Mbps than [8].
Semi-parallel SC-based decoders in the literature, while supporting only binary kernels and often being designed targeting a single code, share the basic multi-PE structure of our work. For the sake of completeness, in Table V we also consider [13] and [17]. These architectures are very different from semi-parallel decoders, but guarantee a certain degree of flexibility. The decoder in [17] can decode a fixed set of combinations of code lengths and code rates, while the architecture proposed in [13] is rate-flexible. Both architectures are able to achieve a higher throughput than the proposed decoder, at the cost of larger area occupation and a lower degree of flexibility.
VI. CONCLUSION
In this work, we have proposed the first polar code decoder architecture supporting kernels of different sizes. It implements the successive cancellation algorithm, and can support any code rate, any sequence of binary and ternary kernels and any code length N ≤ N max that can be expressed as a combination of binary and ternary kernels. The decoder can achieve a frequency of more than 1 GHz in 65 nm CMOS technology, and a throughput of 615 Mb/s. The area occupation ranges between 0.11 mm² for N max = 256 and 2.01 mm² for N max = 4096. Implementation results show an unprecedented degree of flexibility: with N max = 4096, up to 55 code lengths can be decoded with the same hardware, along with any kernel sequence and code rate.
