Abstract: Polar codes have gained extensive attention during the past few years and recently they have been selected for the next generation of wireless communications standards (5G). Successive-cancellation-based (SC-based) decoders, such as SC list (SCL) and SC flip (SCF), provide a reasonable error performance for polar codes at the cost of low decoding speed. Fast SC-based decoders, such as Fast-SSC, Fast-SSCL, and Fast-SSCF, identify the special constituent codes in a polar code graph off-line, produce a list of operations, store the list in memory, and feed the list to the decoder to decode the constituent codes in order efficiently, thus increasing the decoding speed. However, the list of operations depends on the code rate, and as the rate changes a new list must be produced, making fast SC-based decoders not rate-flexible. In this paper, we propose a completely rate-flexible fast SC-based decoder by creating the list of operations directly in hardware, with low implementation complexity. We further propose a hardware architecture implementing the proposed method and show that the area occupation of the rate-flexible fast SC-based decoder in this paper is only 38% of the total area of the memory-based baseline decoder when 5G code rates are supported.
successive-cancellation (SC) decoding algorithm [2]. However, this capacity-achieving property under SC decoding only holds as the code length tends towards infinity. For practical values of code length, SC decoding fails to provide a reasonable error-correction performance.
In order to improve the error-correction performance of SC decoding, SC list (SCL) [3] and SC flip (SCF) [4] decoders run multiple SC decoders in parallel and in series, respectively. Therefore, SCL improves the error-correction performance of SC at the cost of higher area occupation when implemented on hardware, while SCF improves the error-correction performance of SC at the cost of higher latency and lower throughput. With this error-correction performance improvement, polar codes were selected as a channel coding scheme for the enhanced mobile broadband (eMBB) control channel in the next generation of wireless communications standard (5G).
SC-based decoding algorithms such as SC, SCL, and SCF suffer from high latency and low throughput when implemented on hardware. This is due to the serial nature of SC decoding, in which the decoding proceeds bit by bit. In order to address this issue, polar codes were shown to be a concatenation of smaller constituent codes which can be decoded in parallel [5], [6]. These constituent codes are shown to add small implementation complexity overhead while keeping the error-correction performance of SC unchanged. In [7], more constituent codes were identified and low-complexity parallel decoders were designed to increase the throughput of SC decoders even further. It was shown in [8], [9] that the constituent codes can be decoded efficiently under SCL decoding while keeping the error-correction performance of the SCL decoder unaltered. The same approach was applied to the SCF decoder in [10].
The construction of polar codes is based on the identification of reliable bit-channels through which information bits are transmitted. The remaining bit-channels carry fixed values and are called frozen bits. The location of the frozen bits and of the information bits is known to the encoder and the decoder. In SC-based decoders, the frozen and information bit sequence can be either stored in a memory, or computed on-line given the bit-channel relative reliability vector and the desired code rate, as proposed in [11]. In fact, the latter approach is significantly more efficient in the case of multi-code decoders, and is facilitated by nested reliability vectors such as those selected for the 5G eMBB control channel [12]. Therefore, in 5G, the polar encoder and decoder are provided with a vector of bit indices in descending reliability order and an information length $K$, from which the encoder and the decoder should extract the frozen/information bit sequence. It should be noted that the number of information bits for polar codes in the 5G eMBB control channel can be any value between 12 and 1706 [13]. Thus, the encoder and the decoder should be able to support a vast range of code rates.
Fast SC-based decoders rely on the identification of the type and the length of constituent codes in a polar code. While the calculation of the frozen/information bit sequence is straightforward and can be performed by simply assigning information bits to the first K elements of the reliability vector, the direct calculation of the list of operations for fast SC-based decoders requires complicated controller logic [6] . Therefore, the identification of the type and the length of constituent codes is performed off-line and the decoding order is stored in a dedicated memory as a list of operations [6] , [8] , [9] . The decoder fetches the list of operations from memory to decode the constituent codes in order one by one. The main drawbacks of the aforementioned fast SC-based decoders are twofold: first, the list of operations requires high memory usage when implemented on hardware. Second, the list of operations is highly dependent on the rate of the polar code and as the rate changes, the list of operations changes too. Therefore, for 5G applications which require the support of multiple rates, multiple lists of operations need to be stored in memory. This in turn increases the hardware implementation overhead and renders fast SC-based decoders not rate-flexible.
In this paper, we propose completely rate-flexible fast SC-based decoders by introducing a method to infer the list of operations directly in hardware by using the bit-channel relative reliability vector and without the need to store it in memory. We show that the type and the length of a constituent code in a polar code can be identified with low hardware implementation complexity, by checking only a few bits of the constituent code. We further show that the list of operations adapts to the rate of the code, allowing the resulting fast SC-based decoder to be completely rate-flexible. We design and implement a hardware architecture for the proposed decoder and show that the memory required to store the list of operations can be completely removed, resulting in significantly lower decoder area occupation.
The remainder of this paper is organized as follows: Section II reviews polar codes, SC-based decoding algorithms, and their fast counterparts. We propose the rate-flexible fast decoder for polar codes in Section III. In Section IV, a hardware architecture to implement the proposed method is introduced. Section V provides the hardware implementation results and comparisons with state of the art. Finally, conclusions are drawn in Section VI.
II. PRELIMINARIES

A. Polar Codes
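A polar code of length $N = 2^n$ that carries $K$ information bits has a rate $R = K/N$ and can be represented as $\mathcal{P}(N, K)$. It can be constructed using a lower-triangular generator matrix $G$ as

$$x = u G^{\otimes n}, \qquad G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix},$$

where $u = \{u_0, u_1, \ldots, u_{N-1}\}$ is the vector of input bits, $x = \{x_0, x_1, \ldots, x_{N-1}\}$ is the vector of coded bits, and $G^{\otimes n}$ is the $n$-th Kronecker power of $G$.

Fig. 1. SC-based decoding on a binary tree for $\mathcal{P}(8, 4)$ and $v = \{7, 6, 5, 3, 4, 2, 1, 0\}$ ($s = \{0, 0, 0, 1, 0, 1, 1, 1\}$).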
As $N$ goes toward infinity, the polarization phenomenon creates bit-channels that are either completely noisy or completely noiseless, and the fraction of noiseless bit-channels equals the channel capacity. For finite practical code lengths, the polarization of bit-channels is incomplete, so there are bit-channels that are partially noisy. In principle, a bit-channel relative reliability vector $v = \{v_0, v_1, \ldots, v_{N-1}\}$, where $0 \le v_i < N$, is generated based on the polarization phenomenon and fed into the encoder and the decoder; it gives the rank of each bit-channel. Thus, $v$ is a vector of integers such that if $v_i < v_j$, then bit-channel $i$ is more reliable (less noisy) than bit-channel $j$. The polar encoding process consists of the classification of the bit-channels in $u$ into two groups based on $v$: the $K$ good (more reliable) bit-channels which carry the information bits, and the $N - K$ bad (less reliable) bit-channels that are fixed to a predefined value (usually 0). This classification can be represented as a sequence of binary values $s = \{s_0, s_1, \ldots, s_{N-1}\}$ where

$$s_i = \begin{cases} 0, & \text{if } v_i \ge K \text{ (frozen bit)}, \\ 1, & \text{if } v_i < K \text{ (information bit)}. \end{cases}$$
More formally, let $W$ be a BMS channel with input alphabet $\mathcal{X} = \{0, 1\}$ and output alphabet $\mathcal{Y}$, and let $\{W(y \mid x) : x \in \mathcal{X}, y \in \mathcal{Y}\}$ be the transition probabilities. In order to quantify the reliability of the channel $W$, we use the Bhattacharyya parameter $Z(W) \in [0, 1]$, that is defined as

$$Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y \mid 0)\, W(y \mid 1)}.$$
Hence, the good bit-channels are the ones that have the lowest Bhattacharyya parameter.
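The classification of bit-channels into frozen and information bits can be sketched in a few lines. The snippet below is only an illustration, assuming the convention suggested by the Fig. 1 caption (an entry of $s$ is 1 for an information bit, i.e., when $v_i < K$); the function name `frozen_info_sequence` is hypothetical.

```python
def frozen_info_sequence(v, K):
    """Return s with s[i] = 1 (information) if v[i] < K, else 0 (frozen)."""
    return [1 if rank < K else 0 for rank in v]


# Relative reliability vector quoted in the Fig. 1 caption for P(8, 4).
v = [7, 6, 5, 3, 4, 2, 1, 0]
print(frozen_info_sequence(v, K=4))  # [0, 0, 0, 1, 0, 1, 1, 1]
```

Changing only `K` yields the sequence for another rate from the same vector, which is what makes a single nested reliability vector sufficient for many code rates.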
B. SC-Based Decoding
SC-based decoding algorithms can be represented as a depth-first binary tree search with priority to the left branches, as depicted in Fig. 1. Two kinds of messages are passed between the nodes in the graph: the soft log-likelihood ratio (LLR) values $\alpha = \{\alpha_0, \alpha_1, \ldots, \alpha_{2T-1}\}$, which are passed from a parent node at level $\log_2(2T) = t + 1$ to the child nodes at level $\log_2(T) = t$, and the hard bit estimates $\beta = \{\beta_0, \beta_1, \ldots, \beta_{2T-1}\}$, which are passed from a child node at level $t$ to a parent node at level $t + 1$.
where
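Assume that the vector of relative reliabilities of bit-channels $v$ is stored in memory and is available to the decoder. In SC and SCF decoding algorithms, when a leaf node is reached, the $i$-th bit $\hat{u}_i$ can be estimated as

$$\hat{u}_i = \begin{cases} 0, & \text{if } v_i \ge K \text{ or } \alpha_i \ge 0, \\ 1, & \text{otherwise}, \end{cases} \tag{9}$$

where $\alpha_i$ denotes the LLR value at the leaf node associated with $u_i$, while in SCL decoding, at a leaf node we have

$$\hat{u}_i = \begin{cases} 0, & \text{if } v_i \ge K, \\ 0 \text{ or } 1, & \text{otherwise}. \end{cases} \tag{10}$$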
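As can be seen in (10), when an information bit is reached in SCL decoding, both of its possible values of 0 and 1 are considered. In order to limit the exponential growth in the complexity of the SCL decoder, at each bit estimation, only $L$ candidates are allowed to survive with the help of a path metric (PM) [14]. To this end, a sorter module is used to rank the PMs of the $2L$ generated candidates and select the $L$ of them with the best PMs. After the estimation of bits by (9) or (10), the hard bit estimates are passed to the parent node as

$$\beta_i = \begin{cases} \beta^l_i \oplus \beta^r_i, & \text{if } i < T, \\ \beta^r_{i-T}, & \text{otherwise}, \end{cases} \tag{11}$$

where $\beta^l$ and $\beta^r$ are the hard bit estimates received from the left and the right child nodes, and $\oplus$ is the bitwise XOR operation. The depth-first binary tree search of SC-based decoding algorithms can be represented by a list of operations. Let $b_i = \{b^{(n-1)}_i, \ldots, b^{(1)}_i, b^{(0)}_i\}$ represent the binary expansion of the integer $i$. The LLR value associated with $u_i$ can be calculated by a set of $F_t$ and $G_t$ operations as [15]:

$$\text{operation at level } t = \begin{cases} F_t, & \text{if } b^{(t)}_i = 0, \\ G_t, & \text{if } b^{(t)}_i = 1, \end{cases} \qquad t = n-1, \ldots, 0.$$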
For example, the LLR value associated with $u_0$ in Fig. 1 can be calculated by performing $F_2$, $F_1$, and $F_0$, respectively, and the LLR value associated with $u_1$ in Fig. 1 can be calculated by performing $F_2$, $F_1$, and $G_0$, respectively.
It should be noted that since the hard estimate operations of (9), (10), and (11) are performed right after the $F_t$ or $G_t$ functions at a leaf node and in the same time step, we do not include them in the list of operations. The list of operations for SC-based decoders can be generated directly on hardware by simple bitwise operations [14], [15].
It is worth mentioning that the list of operations for SC-based decoders is fixed for all rates and thus SC-based decoders are rate-flexible. However, the number of time steps required to finish the decoding process in SC-based decoders is at least 2N − 2.
This limits the latency and throughput of polar codes when decoded by SC-based decoders.
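The rate-independent list of operations of a conventional SC-based decoder can be illustrated with a short sketch. It assumes the usual convention, consistent with the $u_0$ example above, that a 0 at position $t$ of the binary expansion of $i$ selects $F_t$ and a 1 selects $G_t$; the function names are hypothetical and the code is not taken from the decoder described here.

```python
def sc_operations(n):
    """Depth-first list of F_t / G_t operations for a code of length N = 2**n."""
    ops = []

    def visit(level):
        if level == 0:  # leaf: the bit is estimated in the same time step
            return
        ops.append(f"F{level - 1}")  # LLRs for the left child
        visit(level - 1)
        ops.append(f"G{level - 1}")  # LLRs for the right child
        visit(level - 1)

    visit(n)
    return ops


def ops_for_bit(i, n):
    """Operations on the root-to-leaf path of u_i, read from the bits of i."""
    return [("G" if (i >> t) & 1 else "F") + str(t) for t in range(n - 1, -1, -1)]


print(ops_for_bit(0, 3))      # ['F2', 'F1', 'F0']
print(len(sc_operations(3)))  # 14, i.e. 2N - 2 for N = 8
```

The list never depends on $K$, which is why conventional SC-based decoders are rate-flexible, and its length of $2N - 2$ matches the minimum number of time steps quoted above.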
C. Fast SC-Based Decoding
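In order to reduce the latency and increase the throughput of SC-based decoders for polar codes, special node structures are identified and the decoding is performed based on the LLR values at the intermediate levels in the SC-based decoding tree without the need of traversing it. It was shown in [5], [6] that the following nodes can be decoded efficiently without traversing the decoding tree, where $v^t = \{v^t_0, v^t_1, \ldots, v^t_{T-1}\}$ denotes the part of $v$ that corresponds to a node of length $T = 2^t$:
• Rate-0 Node: This node consists of only frozen bits, i.e., $v^t_i \ge K$ for $i \in \{0, 1, \ldots, T-1\}$.
• Rate-1 Node: This node consists of only information bits, i.e., $v^t_i < K$ for $i \in \{0, 1, \ldots, T-1\}$.
• Repetition (Rep) Node: This node consists of frozen bits except for the last bit which is an information bit, i.e., $v^t_{T-1} < K$ and $v^t_i \ge K$ for $i \in \{0, 1, \ldots, T-2\}$.
• Single parity-check (SPC) Node: This node consists of information bits except for the first bit which is a frozen bit, i.e., $v^t_0 \ge K$ and $v^t_i < K$ for $i \in \{1, 2, \ldots, T-1\}$.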
It was shown in [8], [9] that these nodes can also be decoded efficiently in simplified SCL (SSCL), SSCL-SPC, fast SSCL (Fast-SSCL), and Fast-SSCL-SPC decoding without the need for traversing the tree. This is performed by estimating bits one by one at an intermediate level of the decoding tree, thus generating only $2L$ candidates and selecting the best $L$ from them, similar to the conventional SCL decoding process. This guarantees that the sorter module which selects the $L$ candidates out of $2L$ remains the same as in the conventional SCL decoder. The method was also applied to the SCF decoder, which resulted in the Fast-SSCF decoder in [10]. Recently, five new special nodes were identified in [7], and efficient decoders that can be used in SC decoding were designed for them. These nodes are:
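• Type-I Node: This node consists of frozen bits except for the last two bits which are information bits, i.e., $v^t_i \ge K$ for $i \in \{0, 1, \ldots, T-3\}$ and $v^t_i < K$ for $i \in \{T-2, T-1\}$.
• Type-II Node: This node consists of frozen bits except for the last three bits which are information bits, i.e., $v^t_i \ge K$ for $i \in \{0, 1, \ldots, T-4\}$ and $v^t_i < K$ for $i \in \{T-3, T-2, T-1\}$.
• Type-III Node: This node consists of information bits except for the first two bits which are frozen bits, i.e., $v^t_i \ge K$ for $i \in \{0, 1\}$ and $v^t_i < K$ for $i \in \{2, 3, \ldots, T-1\}$.
• Type-IV Node: This node consists of information bits except for the first three bits which are frozen bits, i.e., $v^t_i \ge K$ for $i \in \{0, 1, 2\}$ and $v^t_i < K$ for $i \in \{3, 4, \ldots, T-1\}$.
• Type-V Node: This node consists of frozen bits except for the bits $T-5$, $T-3$, $T-2$, and $T-1$ which are information bits, i.e., $v^t_i < K$ for $i \in \{T-5, T-3, T-2, T-1\}$ and $v^t_i \ge K$ otherwise.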
It was shown in [16] that these new nodes can be decoded efficiently to improve the speed of SCL decoding. However, the drawback of using these new nodes when implementing the decoder on hardware is that these nodes are based on multiple bit estimations at a time, thus producing more than $2L$ candidates in each decoding step. Therefore, a large sorter is required to select the final $L$ candidates, which adversely affects the hardware implementation complexity. In particular, at each decoding step, a Type-I node produces $4L$ candidates to account for all the cases for its two information bits, a Type-II node produces $8L$ candidates to account for all the cases for its three information bits, and a Type-V node produces $16L$ candidates to account for all the cases for its four information bits. Moreover, a Type-III node is decoded using two parallel SPC node decoders, and a Type-IV node starts by decoding a Rep node of length four followed by four parallel SPC node decoders [16].
The pruned decoding tree for the same example as in Fig. 1 is shown in Fig. 2. Without the new nodes, the list of operations is $\{F_2, \text{Rep}_2, G_2, \text{SPC}_2\}$. However, by considering the new nodes, the decoder can immediately decode the received vector by decoding the Type-V node. The corresponding list of operations would be $\{\text{Type-V}_3\}$, where $\text{Type-V}_t$ represents the decoding of Type-V nodes of length $T = 2^t$. The operations which are performed in fast SC-based decoders are summarized in Table I. Note that $F_t$ and $G_t$ operations are common between conventional SC-based and fast SC-based decoding algorithms. In the hardware implementation of fast SC-based decoders, this list of operations is stored in memory and is fed into the decoder to perform decoding [6], [8], [9].
Let us consider the example in Fig. 2. If the rate of the code changes from 1/2 to 5/8, the list of operations also changes, as shown in Fig. 3. Without using the new nodes, the list of operations becomes $\{F_2, \text{Rep}_2, G_2, \text{Rate-1}_2\}$, and by considering the new nodes it becomes $\{\text{Type-IV}_3\}$. Therefore, as the rate changes, the list of operations changes, and the resulting decoder is not rate-flexible. For applications that support codes with multiple rates, the list of operations has to be stored in memory for each rate to make the decoder flexible. However, this results in high memory usage when implemented on hardware.
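The rate dependence of the list of operations can be reproduced with a small off-line sketch. It is a minimal illustration, assuming that only Rate-0, Rate-1, Rep, and SPC nodes are used, that the whole frozen/information pattern $s$ of each node is inspected, and that operations are written as `(name, t)` pairs; the helper names are hypothetical and this is not the compiler used in [6], [8], [9].

```python
def node_type(s):
    """Classify a node from its full pattern s (1 = information, 0 = frozen)."""
    T, ones = len(s), sum(s)
    if ones == 0:
        return "Rate-0"
    if ones == T:
        return "Rate-1"
    if ones == 1 and s[-1] == 1:
        return "Rep"
    if ones == T - 1 and s[0] == 0:
        return "SPC"
    return None


def fast_operations(s):
    """Pruned depth-first traversal; special nodes are not expanded further."""
    ops = []

    def visit(lo, hi):
        t = (hi - lo).bit_length() - 1  # node length is 2**t
        kind = node_type(s[lo:hi])
        if kind is not None:
            ops.append((kind, t))
            return
        mid = (lo + hi) // 2
        ops.append(("F", t - 1))
        visit(lo, mid)
        ops.append(("G", t - 1))
        visit(mid, hi)

    visit(0, len(s))
    return ops


v = [7, 6, 5, 3, 4, 2, 1, 0]
for K in (4, 5):  # rates 1/2 and 5/8
    s = [1 if rank < K else 0 for rank in v]
    print(K, fast_operations(s))
```

For $K = 5$ the last entry is `("Rate-1", 2)`, matching the list quoted above, while for $K = 4$ it is `("SPC", 2)`: one extra information bit is enough to change the stored list.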
III. RATE-FLEXIBLE FAST POLAR DECODING
The high memory usage of storing the list of operations can be mitigated by generating the list of operations on hardware as the decoding proceeds. A rudimentary approach would be to generate the vector $s^t$ from $K$ and the vector $v^t$ using comparators, and check the pattern of information and frozen bits in $s^t$ for every encountered node. This is shown in Fig. 4 for determining Rate-0, Rate-1, Rep, and SPC nodes of length 8. It should be noted that the comparators in Fig. 4a have two inputs $A$ and $B$, and an output $C$ where
The problem with this approach is that for nodes of large length, there is a high hardware complexity overhead in generating $s^t$ from $K$ and $v^t$, and in determining the node types. Moreover, the module that generates the list of operations should account for the largest possible node, which is the root node in the decoding tree with size $N$. This results in a long critical path, which limits the operating frequency. In order to tackle this issue, the idea is to exploit the inherent order in the Bhattacharyya parameters of the bit-channels. Let $W_i$ and $W_j$ be the bit-channels corresponding to $u_i$ and $u_j$, and let $b_i$ and $b_j$ be the binary expansions of the integers $i$ and $j$, where $b^{(k)}_i$ denotes bit $k$ of $b_i$ and bit $n-1$ is the most significant one. In [17], [18] a partial order between the polarized bit-channels was introduced. In particular, it was proven that $W_i$ is stochastically degraded with respect to $W_j$, i.e., $W_i \prec W_j$, when one of the following two properties holds:
• Addition Property: The expansion $b_j$ has a 1 in every position in which $b_i$ has a 1, i.e.,
$$b^{(k)}_i \le b^{(k)}_j \quad \text{for all } k \in \{0, 1, \ldots, n-1\}. \tag{14}$$
• Left-Swap Property [19]: There exist $k, l \in \{0, 1, \ldots, n-1\}$ such that $l < k$ and
$$b^{(k)}_i = 0, \quad b^{(l)}_i = 1, \quad b^{(k)}_j = 1, \quad b^{(l)}_j = 0, \quad b^{(m)}_i = b^{(m)}_j \text{ for all } m \notin \{k, l\}. \tag{15}$$
Recall that, if W i ≺ W j , then all the reliability measures of W i are worse than those of W j , i.e., W i has smaller mutual information, larger Bhattacharyya parameter, and larger error probability. Consequently, if u j belongs to the frozen set, then also u i belongs to the frozen set. Furthermore, if u i belongs to the information set, then also u j belongs to the information set. By using the two properties above, it was shown in [19] that it suffices to compute the reliability of a sublinear fraction of channels in order to identify the frozen and the information sets.
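The partial order can be made concrete with a short sketch. It assumes the usual reading of the two properties (the addition property turns 0s of the binary expansion into 1s, and the left-swap property moves a 1 to a more significant position, with bit $n-1$ the most significant), and it uses the fact that the order generated by these two moves is equivalent to a prefix-count test. The function name `is_degraded` is hypothetical.

```python
def is_degraded(i, j, n):
    """True if W_i is degraded w.r.t. W_j (or i == j) by addition/left-swap moves.

    Scanning from the most significant bit, j must never have seen fewer 1s
    than i: every 1 of i can then be matched to a 1 of j at the same or a
    more significant position.
    """
    ones_i = ones_j = 0
    for pos in range(n - 1, -1, -1):
        ones_i += (i >> pos) & 1
        ones_j += (j >> pos) & 1
        if ones_i > ones_j:
            return False
    return True


print(is_degraded(5, 6, 3))                        # True: 101 -> 110 by one left-swap
print(is_degraded(3, 4, 3), is_degraded(4, 3, 3))  # False False: 011 and 100 are not comparable
```

Pairs such as indices 3 and 4 are not ordered by these two properties alone, so comparing them requires an additional argument on the Bhattacharyya parameters.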
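Another option to find an ordering between the Bhattacharyya parameters of the bit-channels can be described as follows. Consider the transmission over a BMS channel $W$ with Bhattacharyya parameter $Z(W)$ and define the synthetic channels $W^0$ and $W^1$ as

$$W^0(y_1, y_2 \mid x_1) = \sum_{x_2 \in \mathcal{X}} \frac{1}{2} W(y_1 \mid x_1 \oplus x_2)\, W(y_2 \mid x_2),$$
$$W^1(y_1, y_2, x_1 \mid x_2) = \frac{1}{2} W(y_1 \mid x_1 \oplus x_2)\, W(y_2 \mid x_2).$$

Then, the following inequalities between $Z(W^0)$, $Z(W^1)$ and $Z(W)$ hold:

$$Z(W)\sqrt{2 - Z(W)^2} \le Z(W^0) \le 2Z(W) - Z(W)^2, \qquad Z(W^1) = Z(W)^2, \tag{17}$$

which follow from Proposition 5 of [2] and from Exercise 4.62 of [20]. Furthermore, the bit-channel $W_i$ corresponding to $u_i$ is given by the recursive formula below:

$$W_i = \Big(\big((W^{b^{(n-1)}_i})^{b^{(n-2)}_i}\big)^{\cdots}\Big)^{b^{(0)}_i}.$$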
In what follows, we will denote by Z i the Bhattacharyya parameter of W i . At this point, we are ready to state and prove the first result of this paper, which concerns the identification of Rate-0, Rate-1, Rep, and SPC nodes.
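Theorem 1: Consider a node of length $T = 2^t$ in a polar code of length $N = 2^n$. Then, the following properties hold:
1) If $v^t_{T-1} \ge K$, then the node represents a Rate-0 node.
2) If $v^t_0 < K$, then the node represents a Rate-1 node.
3) If $v^t_{T-1} < K$ and $v^t_{T-2} \ge K$, then the node represents a Rep node.
4) If $v^t_0 \ge K$ and $v^t_1 < K$, then the node represents an SPC node.

Proof: 1) Note that $b_{T-1} = \{1, \ldots, 1\}$. By using the addition property (14), we obtain that $W_i \prec W_{T-1}$ for any $i \in \{0, 1, \ldots, T-2\}$. Hence, as $v^t_{T-1} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-2\}$. This means that the polar code consists of only frozen bits, i.e., it is a Rate-0 node.
2) Note that $b_0 = \{0, \ldots, 0\}$. By using the addition property (14), we obtain that $W_0 \prec W_i$ for any $i \in \{1, 2, \ldots, T-1\}$. Hence, as $v^t_0 < K$, $v^t_i < K$ for any $i \in \{1, 2, \ldots, T-1\}$. This means that the polar code consists of only information bits, i.e., it is a Rate-1 node.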
3) Note that $b_{T-2} = \{1, \ldots, 1, 0\}$. By using the addition property (14) and the left-swap property (15), we obtain that $W_i \prec W_{T-2}$ for any $i \in \{0, 1, \ldots, T-3\}$. Hence, as $v^t_{T-2} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-3\}$. As $v^t_{T-1} < K$, the polar code consists of frozen bits except for the last bit which is an information bit, i.e., it is a Rep node.
4) Note that $b_1 = \{0, \ldots, 0, 1\}$. By using the addition property (14) and the left-swap property (15), we obtain that $W_1 \prec W_i$ for any $i \in \{2, 3, \ldots, T-1\}$. Hence, as $v^t_1 < K$, $v^t_i < K$ for any $i \in \{2, 3, \ldots, T-1\}$. As $v^t_0 \ge K$, the polar code consists of information bits except for the first bit which is a frozen bit, i.e., it is an SPC node.

In the proof of Theorem 1, we used the fact that for any node of length $T = 2^t$ in a polar code of length $N = 2^n$, the $n$-bit binary expansions of the integers corresponding to the bit-channels in the node are equal in the bits $\{n-1, n-2, \ldots, t\}$, and are different in the bits $\{t-1, t-2, \ldots, 0\}$. An immediate consequence of Theorem 1 is that, by checking only one value, we can find out if a constituent node is either a Rate-0 or a Rate-1 node. Furthermore, by checking only two values, we can find out if a constituent node is either a Rep or an SPC node. This observation significantly reduces the hardware complexity associated with the on-line node identification. In addition, the proposed approach is independent of the node length, making it suitable for codes of any length and rate. Fig. 5 shows the circuit required to generate the list of operations on-line for any node of length $T$. It can be seen that the circuit consists of only four comparators, three NOT gates, and two AND gates.
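A software analogue of this identification can be sketched as follows. It is a minimal illustration, assuming that $v$ respects the partial order used in the proof, that the node occupies positions `lo` to `lo + T - 1` of $v$, and that operations are written as `(name, t)` pairs; the function names are hypothetical and the code is not a model of the circuit in Fig. 5.

```python
def identify(v, lo, T, K):
    """Node type from at most four comparisons with K, or None."""
    first, last = v[lo], v[lo + T - 1]
    if last >= K:
        return "Rate-0"
    if first < K:
        return "Rate-1"
    if T >= 2 and v[lo + T - 2] >= K:  # last bit is already known to be information
        return "Rep"
    if T >= 2 and v[lo + 1] < K:       # first bit is already known to be frozen
        return "SPC"
    return None


def online_operations(v, K):
    """List of operations derived during the traversal, with nothing stored per rate."""
    ops = []

    def visit(lo, T):
        t = T.bit_length() - 1
        kind = identify(v, lo, T, K)
        if kind is not None:
            ops.append((kind, t))
            return
        ops.append(("F", t - 1))
        visit(lo, T // 2)
        ops.append(("G", t - 1))
        visit(lo + T // 2, T // 2)

    visit(0, len(v))
    return ops


v = [7, 6, 5, 3, 4, 2, 1, 0]
print(online_operations(v, 4))
print(online_operations(v, 5))
```

Only `K` changes between the two calls, and the number of comparisons per node does not depend on the node length `T`.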
Let us now state and prove the second result of this paper, which concerns the identification of Type-I, Type-II, Type-III, Type-IV, and Type-V nodes.
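Theorem 2: Consider a node of length $T = 2^t$ in a polar code of length $N = 2^n$. Then, the following properties hold:
1) If $v^t_{T-1} < K$, $v^t_{T-2} < K$, and $v^t_{T-3} \ge K$, then the node represents a Type-I node.
2) If $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, and $v^t_{T-5} \ge K$, then the node represents a Type-II node.
3) If $v^t_0 \ge K$, $v^t_1 \ge K$, and $v^t_2 < K$, then the node represents a Type-III node.
4) If $v^t_0 \ge K$, $v^t_1 \ge K$, $v^t_2 \ge K$, and $v^t_4 < K$, then the node represents a Type-IV node.
5) If $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, $v^t_{T-5} < K$, $v^t_{T-4} \ge K$, and $v^t_{T-9} \ge K$, then the node represents a Type-V node.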
Proof: 1) Note that $b_{T-3} = \{1, \ldots, 1, 0, 1\}$. By using the addition property (14) and the left-swap property (15), we obtain that $W_i \prec W_{T-3}$ for any $i \in \{0, 1, \ldots, T-4\}$. Hence, as $v^t_{T-3} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-4\}$. As $v^t_{T-1} < K$ and $v^t_{T-2} < K$, the node consists of frozen bits except for the last two bits which are information bits, i.e., it is a Type-I node. Then, by using (17), we have that
It is easy to check that, for any z ∈ [0, 1],
which implies that $Z_{T-5} \le Z_{T-4}$.
Consequently, as $v^t_{T-5} \ge K$, $v^t_{T-4} \ge K$. As a result, since $v^t_{T-1} < K$, $v^t_{T-2} < K$, and $v^t_{T-3} < K$, the node consists of frozen bits except for the last three bits which are information bits, i.e., it is a Type-II node.
3) Note that $b_2 = \{0, \ldots, 0, 1, 0\}$. By using the addition property (14) and the left-swap property (15), we obtain that $W_2 \prec W_i$ for any $i \in \{3, 4, \ldots, T-1\}$. Hence, as $v^t_2 < K$, $v^t_i < K$ for any $i \in \{3, 4, \ldots, T-1\}$. As $v^t_0 \ge K$ and $v^t_1 \ge K$, the node consists of information bits except for the first two bits which are frozen bits, i.e., it is a Type-III node.
4) Note that $b_4 = \{0, \ldots, 0, 1, 0, 0\}$. By using the addition property (14) and the left-swap property (15), we obtain that $W_4 \prec W_i$ for any $i \in \{5, 6, \ldots, T-1\}$. Hence, as $v^t_4 < K$, $v^t_i < K$ for any $i \in \{5, 6, \ldots, T-1\}$. Furthermore, note that $b_3 = \{0, \ldots, 0, 0, 1, 1\}$. Let $W$ be the transmission channel and let $z$ be the Bhattacharyya parameter of the channel defined as
Then, by using (17), we have that
Since (19) holds for any $z \in [0, 1]$, we obtain that $Z_3 \le Z_4$.
Consequently, as $v^t_4 < K$, $v^t_3 < K$. As a result, since $v^t_0 \ge K$, $v^t_1 \ge K$, and $v^t_2 \ge K$, the node consists of information bits except for the first three bits which are frozen bits, i.e., it is a Type-IV node.
5) Note that $b_{T-9} = \{1, \ldots, 1, 0, 1, 1, 1\}$. By using the addition property (14) and the left-swap property (15), we obtain that $W_i \prec W_{T-9}$ for any $i \in \{0, 1, \ldots, T-10\}$. Hence, as $v^t_{T-9} \ge K$, $v^t_i \ge K$ for any $i \in \{0, 1, \ldots, T-10\}$. By using again the left-swap property (15), we obtain that $W_{T-6} \prec W_{T-4}$ and $W_{T-7} \prec W_{T-4}$. By using again the addition property (14), we obtain that $W_{T-8} \prec W_{T-4}$. Hence, as $v^t_{T-4} \ge K$, $v^t_i \ge K$ for any $i \in \{T-6, T-7, T-8\}$. As a result, since $v^t_{T-1} < K$, $v^t_{T-2} < K$, $v^t_{T-3} < K$, and $v^t_{T-5} < K$, the node is a Type-V node.

The proofs for the identification of Rate-0, Rep, SPC, Rate-1, Type-I, Type-III, and Type-V nodes are based on stochastic degradation arguments. Consequently, these proofs are general and do not depend on the fact that the frozen bits are determined according to the value of the Bhattacharyya parameter. On the contrary, the proofs for Type-II and Type-IV nodes use the inequalities (17), which are valid for Bhattacharyya parameters. However, let us point out that the strategy of the proof (use extremes of information combining bounds such as (17) in order to compare the reliability of specific channels) is general. In order to prove a similar statement for different reliability measures, one would need to find bounds of the form (17) for the desired reliability measure (e.g., mutual information, error probability). Let us further clarify that the proofs for Type-II and Type-IV nodes provide an ordering between the Bhattacharyya parameters of bit-channels. As such, they do not depend on the particular technique used to compute those Bhattacharyya parameters (Gaussian approximation [21], beta-expansion [22], Monte Carlo simulation [2], etc.).
Let us also note that the Bhattacharyya parameter represents the typical performance metric employed for code construction [23] [24] [25] .
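The kind of comparison used for the Type-II and Type-IV nodes can be illustrated numerically. The sketch below is an assumption-laden illustration, not necessarily the exact inequality of the proof: it takes the standard single-step bounds $Z\sqrt{2 - Z^2} \le Z(W^0) \le 2Z - Z^2$ and $Z(W^1) = Z^2$, applies them along the suffixes $\{0, 1, 1\}$ and $\{1, 0, 0\}$ (most significant bit first), and checks on a grid that the upper bound for the first suffix never exceeds the lower bound for the second. The function names are hypothetical.

```python
import math


def minus_lower(z):
    """Lower bound on Z(W^0) as a function of z = Z(W)."""
    return z * math.sqrt(2.0 - z * z)


def minus_upper(z):
    """Upper bound on Z(W^0) as a function of z = Z(W)."""
    return 2.0 * z - z * z


def plus(z):
    """Exact value of Z(W^1) as a function of z = Z(W)."""
    return z * z


def bounds_011_vs_100(z):
    """Upper bound for suffix 0,1,1 and lower bound for suffix 1,0,0.

    Composing the bounds is legitimate because all three maps are
    non-decreasing on [0, 1].
    """
    upper_011 = plus(plus(minus_upper(z)))
    lower_100 = minus_lower(minus_lower(plus(z)))
    return upper_011, lower_100


grid = [k / 1000 for k in range(1001)]
print(all(up <= low + 1e-12 for up, low in map(bounds_011_vs_100, grid)))
```

Under these assumptions, a bit-channel whose index ends in 0, 1, 1 has a Bhattacharyya parameter no larger than the one whose index ends in 1, 0, 0 with the same leading bits, which is the ordering needed to relate bits 3 and 4 (or $T-5$ and $T-4$) of a node.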
It is also worth mentioning that since every node in the SC-based decoding tree represents a polar code constructed for a different channel [2], the results in this section are valid for all the nodes in any polar code of any length. Fig. 6 shows the circuit required to generate the list of operations on-line for any node of length $T$, if Type-I, Type-II, Type-III, Type-IV, and Type-V nodes are considered in addition to Rate-0, Rep, SPC, and Rate-1 nodes. It can be seen that the circuit consists of ten comparators, nine NOT gates, and fourteen AND gates, in order to identify all the special nodes.
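The extended identification can be sketched in software as follows. This is an illustrative sketch, not a description of the circuit in Fig. 6: it assumes that $v$ respects the ordering established above, that overlapping patterns are resolved by testing node types in the order written in the code (a priority chosen here for illustration only), and that the function name `identify_extended` is hypothetical. The comparisons touch only the offsets $0, 1, 2, 4, T-9, T-5, T-4, T-3, T-2, T-1$ of the node.

```python
def identify_extended(v, lo, T, K):
    """Node type of the length-T node starting at position lo of v, or None."""

    def info(offset):
        # One comparator: is bit `offset` of the node an information bit?
        return v[lo + offset] < K

    if not info(T - 1):
        return "Rate-0"
    if info(0):
        return "Rate-1"
    if T >= 2 and not info(T - 2):
        return "Rep"
    if T >= 2 and info(1):
        return "SPC"
    # From here on: bits 0 and 1 are frozen, bits T-2 and T-1 carry information.
    if T >= 4 and not info(T - 3):
        return "Type-I"
    if T >= 8 and not info(T - 5):
        return "Type-II"
    if T >= 4 and info(2):
        return "Type-III"
    if T >= 8 and info(4):
        return "Type-IV"
    if T >= 8 and not info(T - 4) and (T < 16 or not info(T - 9)):
        return "Type-V"
    return None


v = [7, 6, 5, 3, 4, 2, 1, 0]
for K in range(9):
    print(K, identify_extended(v, 0, 8, K))
```

With the reliability vector of the Fig. 1 caption, the root node of length 8 is recognized as a special node for every $K$ from 0 to 8, including Type-V for $K = 4$ and Type-IV for $K = 5$ as in the examples above.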
IV. DECODER ARCHITECTURE
As a proof of concept, a decoder architecture implementing the proposed technique has been designed. It implements the layered partitioned SCL (LPSCL) decoding algorithm detailed in [26] and the Fast-SSCL-SPC algorithm introduced in [9] , along with the memory-reduction techniques proposed in [27] . The LPSCL decoder decreases the memory requirements of standard SCL decoding by dividing the SC decoding tree in different partitions; the bottom part of the SC decoding tree belonging to each partition is decoded with SCL with a list size L max . When information needs to be passed between partitions, i.e. at the top stages of the tree, only L t < L max candidate codewords are passed, with L t decreasing progressively as the stage t increases. The Fast-SSCL-SPC algorithm is applied to the lower stages of the tree, where L max candidates are considered. Fig. 7 shows the architecture of the proposed decoder. It is based on a semi-parallel SCL decoder architecture, where L max sets of N PE processing elements (PEs) are instantiated in parallel, implementing (7) and (8 (11) for all the tree stages as well, updating them every time that a bit is estimated. PMs, that identify the likelihood of a candidate codeword (or path) to be correct, are incremented every time a bit is estimated differently from the sign of the LLR value associated to it. They are sorted in PM memory before and after the estimation of an information bit, in order to identify the L max surviving paths out of the 2L max created. When none of the paths coming from the splitting of a particular candidate codeword survives, all stages of its LLR memory are overwritten, along with the bit estimate and PM memories.
This baseline architecture has been modified to implement the LPSCL decoder. The bottom stages of the SC decoding tree are left unchanged, and decoded with a list size $L_{max}$. Given the partitioning factor $P$, the top $\log_2 P$ stages rely on a smaller list size $L_t$, with $n - \log_2 P < t \le n$, and $L_t \ge L_{t+1}$. Consequently, only $L_t$ LLR memories are instantiated in the upper stages, reducing the LLR memory requirements for each upper stage by a factor of $L_{max}/L_t$, as shown in Fig. 8. Depending on the number of instantiated PEs and on the partitioning factor, the high and/or low stage memories might need to be separated into different memory structures, each part belonging to a different layer of LPSCL and thus instantiated a different number of times, depending on $L_t$. Since the number of surviving paths is reduced from $L_{max}$ to $L_t$ when ascending the decoding tree above stage $n - \log_2 P$, the $L_{max} - L_t$ candidate codewords with the highest PMs need to be discarded. In the baseline architecture, PMs are sorted only when an information bit is estimated, i.e., when the paths split. However, in the proposed architecture the PMs also need to be sorted when $i \bmod (N/P) = 0$, where $i$ is the index of the codeword bit that needs to be estimated, and mod represents the modulo operation. The decoding of a bit with such an index $i$ identifies the completion of the decoding of a subtree of size $N/P$, and the need to transfer information to the upper tree stages, where $L_t < L_{max}$. The sorting of PMs allows the most reliable paths, their LLR values, and their hard bit estimates to be transferred between partitions through the memory copy mechanism depicted in Fig. 8.
The implementation of the Fast-SSCL-SPC algorithm requires more substantial modifications, that have been detailed in [9] . The hard bit estimate memory and path memories are updated according to different values depending on the node type, along with PMs. This requires different parallel instantiations of the PM computation logic, as shown in Fig. 9 . More complex routing and selection logic are necessary to update memories, since multiple concurrent values need to be updated and propagated through the hard bit estimates memory structure. A sorter module for LLR values is needed in Rate-1 and SPC nodes, to identify the order with which bits are estimated: the disruption of the sequential bit estimation order that SC is based on leads to additional complexity in memory updates and control logic.
Aside from the logic needed to perform the calculations for special node PM update and bit estimations, the decoder needs to know at which point in the SC tree the special nodes are found, and what their type is. This information is used to identify the number of clock cycles needed for the decoding of a particular node, and which of the different parallel PM, path, and LLR updates is stored. In [9], the proposed decoder architecture relied on an off-line compiler to obtain the sequence of special nodes, their size, and the stage at which they are encountered. This information differs for every code supported by the decoder, and needs to be stored in a memory. Note that the frozen and information bit sequence can be either stored in a memory, as assumed by most decoder architectures in the literature, or computed on-line given the bit-channel relative reliability vector and the desired code rate, as proposed in [11]. The latter approach is significantly more efficient in the case of multi-code decoders, and is facilitated by nested reliability vectors such as those selected for the 5G eMBB control channel [12]. This is the approach taken in both the baseline and the modified architectures in this paper, by comparing each entry of the relative reliability vector $v$ to the desired $K$ in order to obtain $s$.
The control unit of the modified architecture implements the proposed special node on-line identification, based on the relative reliability vector v and K. Fig. 5 shows the simple logic needed to identify the considered special nodes. Given the low complexity of the node identification circuit, the structure is instantiated at every decoding tree stage t, separately at every partition identified by LPSCL, to reduce the amount of multiplexing needed at the inputs and the possible increase in the system critical path. The logic pictured in Fig. 5 is inserted within a finite state machine (FSM) in the decoder control unit to identify the correct decoding phase, through two main control signals, NodeType and NodeSize. A maximum NodeSize value for each NodeType is selected at design time, to limit the additional complexity and critical path degradation.
• While the general node type can be identified easily through the proposed identification, each special node comprises several decoding phases. Thus, NodeType includes subtypes for each special node. While the Rate-0 node is a standalone node type, the Rate-1 node is divided into three subtypes: one phase is assigned to the fetching and sorting of the LLR values, a second to the estimation of the bits associated with the least reliable LLR values, and a third to the hard decision on the remainder of the bits. The Rep node is divided into two subtypes, one for the frozen bits and one for the information bit. Finally, SPC nodes have four subtypes: one for the concurrent fetching and sorting of LLR values and frozen bit selection, one for the bit estimations, one for the hard decision on the remaining bits, and one for the parity correction. The NodeType signal is thus influenced not only by the result of the logic in Fig. 5, but also by the number of estimated bits within the special node, the stage t, and the current NodeType subtype.
• The control unit identifies the size of the special node, NodeSize, as $2^t$, given the current SC decoding tree stage t. This information is used to update the index i of the codeword bit to be estimated. The index i is usually updated once a leaf node has been reached and the corresponding bit estimated, but during the decoding of a special node it is kept fixed, pointing at the first bit of the node. Once the decoding of the node has terminated, the index is updated to i + NodeSize.
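As an illustration of how NodeType, NodeSize, and the index i interact, the following Python sketch derives a list of operations from the sequence s. It is only a software model under stated assumptions, not the circuit of Fig. 5. The node definitions are the standard ones from the fast SC decoding literature (Rate-0: all bits frozen; Rate-1: all bits information; Rep: only the last bit is information; SPC: only the first bit is frozen). In addition, s[i] = 1 is assumed to mark an information bit, the largest admissible node is assumed to be selected at each index, and the function names are hypothetical. The maximum node sizes are those given in Section V.

```python
# Maximum node sizes per type, as given in Section V.
MAX_SIZE = {"Rate-0": 16, "Rep": 16, "Rate-1": 64, "SPC": 64}


def classify(s_block):
    """Classify a block of s (1 = information, 0 = frozen); None if not special."""
    n, k = len(s_block), sum(s_block)
    if k == 0:
        return "Rate-0"
    if k == n:
        return "Rate-1"
    if k == 1 and s_block[-1] == 1:
        return "Rep"
    if k == n - 1 and s_block[0] == 0:
        return "SPC"
    return None


def list_of_operations(s):
    """Return (NodeType, t, i) tuples; len(s) must be a power of two."""
    N = len(s)
    top = N.bit_length() - 1          # root stage, N = 2**top
    ops, i = [], 0
    while i < N:
        # Try stages from the root downwards; a leaf (t = 0) always matches.
        for t in range(top, -1, -1):
            size = 1 << t             # NodeSize = 2**t
            if i % size != 0:         # no node of this size starts at index i
                continue
            node = classify(s[i:i + size])
            if node is not None and size <= MAX_SIZE[node]:
                break
        ops.append((node, t, i))
        i += size                     # index update: i + NodeSize
    return ops


print(list_of_operations([0, 0, 0, 1, 0, 1, 1, 1]))
```

For the length-8 toy sequence in the last line, the model reports a Rep node at stage 2 starting at index 0, followed by an SPC node at stage 2 starting at index 4. The index stays at the first bit of each node and advances by NodeSize only once the node has been processed.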
V. HARDWARE IMPLEMENTATION RESULTS
The proposed decoder architecture has been described in VHDL and synthesized in TSMC 65 nm CMOS technology, at the operating conditions defined by the NCCOM corner, i.e., 1.2 V core voltage and a temperature of 298 K. Two versions of the decoder have been implemented: one considering the proposed special node identification technique, and one based on the off-line identification and storage used in [9]. Both decoders target the 5G polar code with a code length N = 1024 [12], rely on a partitioning factor P = 4, and make use of 64 parallel PEs. The bottom part of the SC decoding tree is decoded with a list size $L_{\max} = 4$, while for the upper stages $L_{10} = L_9 = 2$. Fig. 10 shows the frame error rate (FER) and bit error rate (BER) performance of the LPSCL decoder used in this paper in comparison with SCL decoding with L = 4. The curves in Fig. 10 are provided for the code rates of { } without using a cyclic redundancy check (CRC). It can be seen that LPSCL decoding incurs negligible FER and BER performance loss with respect to SCL for all considered rates. It should be noted that the introduction of the proposed technique to infer the list of operations on the fly does not change the FER or BER performance of the decoder in comparison with the same memory-based decoder.
The channel LLR values are quantized with 4 bits and the internal LLR values with 6 bits, with 2 bits assigned to the fractional part, while PMs are quantized with 8 bits [27]. The maximum node size is set to 16 for Rate-0 and Rep nodes, and to 64 for Rate-1 and SPC nodes. Table II reports the area occupation and achievable frequency for the proposed decoder, and for the decoder based on the off-line identification technique, labelled as the memory-based decoder. The two decoders differ in their implementation of the control unit (CU): its area occupation $A_{\text{CU}}$ in the proposed decoder is 24% less than that of the memory-based decoder. This is due to the fact that the information computed off-line in the memory-based case, i.e., the equivalent of the NodeType signal, needs to be inserted in an FSM analogous to that used by the control unit of the proposed decoder. This FSM handles the node subtypes and the internal counters that determine when the decoding of a special node has terminated. Moreover, the memory-based case needs an additional piece of information, NodeStage, to identify at which SC decoding tree stage the special node is encountered: the NodeSize information is derived from it. The NodeStage signal is inserted in its own FSM, which adds substantial complexity to the control unit, resulting in a larger $A_{\text{CU}}$. The contribution of $A_{\text{CU}}$ to the total decoder area occupation $A_{\text{total}}$ is relatively small, with $A_{\text{total}} = 1.410$ mm$^2$ and $A_{\text{total}} = 1.454$ mm$^2$ for the proposed and the memory-based decoders, respectively. However, the NodeStage FSM influences signals in the NodeSize and NodeType FSM, lengthening their path so that it becomes the critical one. In particular, the state of NodeStage is combined with NodeType and NodeSize to determine the current and future node subtypes. This leads to a lower achievable frequency $f$, lower throughput $T_P$, and lower area efficiency $A_{\text{eff}}$ in the memory-based decoder in comparison with the proposed decoder, as shown in Table II. On the contrary, the additional operations in the proposed decoder are simple enough that they do not cause their signal path to become critical, and thus do not impact the achievable frequency, decoding latency, or throughput.
Fig. 11. Memory requirements to store the list of operations of the memory-based decoder for different values of K. The polar code of length 1024 adopted in 5G [12] is used.
The proposed decoder fetches the four required values of the relative reliability vector from memory, compares them with K, and identifies the node types efficiently. Table II also reports the external memory requirements $\text{Mem}_{\text{ext}}$ of the proposed decoder in comparison with the memory-based decoder, assuming that the 5G code rates are supported. For a code of length 1024, the vector of relative reliabilities v contains 1024 entries, each stored with 10 bits. Therefore, a total of 1024 × 10 = 10240 bits are stored in memory. For the memory-based decoder, the memory requirement varies with K, i.e., with the code rate. This is depicted in Fig. 11, where it can be seen that the list of operations is large for medium rates and becomes small as the rate becomes very high or very low. Note that the proposed decoder is capable of supporting any code rate for a given code length, which is also foreseen in 5G [13]. If the memory-based decoder is designed to support all the 5G code rates for a code of length 1024 (12 ≤ K ≤ 1024), its memory requirement, considering a 4-bit representation for each of the NodeType and NodeStage signals, is 128140 × 8 = 1025120 bits, more than 100 times the number of bits required for the proposed decoder.
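The memory figures above can be checked with a short calculation. The sketch below only reproduces the arithmetic given in the text. Reading 128140 as the total number of stored list entries over all supported values of K, and 8 bits per entry as the two 4-bit fields, is an interpretation of the text, and the variable names are hypothetical.

```python
N = 1024
bits_per_rank = 10                    # each entry of v is stored with 10 bits
proposed_bits = N * bits_per_rank     # memory for the reliability vector v

stored_entries = 128140               # figure from the text (assumed: all K in 12..1024)
bits_per_entry = 4 + 4                # 4-bit NodeType + 4-bit NodeStage
memory_based_bits = stored_entries * bits_per_entry

ratio = memory_based_bits / proposed_bits
print(proposed_bits, memory_based_bits, round(ratio, 2))
```

The ratio comes out slightly above 100, consistent with the statement that the memory-based list of operations needs more than 100 times as many bits as the reliability vector.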
The Artisan dual-port SRAM compiler was used for the implementation of the external memories. Table III shows the area occupation of the external memory for the proposed decoder in comparison with the memory-based decoder. While the proposed decoder supports all the code rates, the memory requirement of the memory-based decoder depends on the number of code rates it can support. Table III shows four cases of memory requirements for the memory-based decoder: when it supports 5 code rates of { }, and when it supports all the code rates considered in 5G, similar to the proposed decoder. It can be seen that the proposed decoder occupies a smaller area than the memory-based decoder even when the latter supports only 5 code rates. In fact, the area occupation of the memory-based decoder increases with the number of supported code rates, which in turn reduces its area efficiency, as can be seen in Table III. The area occupation of the proposed decoder is only 38% of that of the memory-based decoder when both decoders support all 5G code rates.
TABLE III. SRAM SYNTHESIS RESULTS FOR EXTERNAL MEMORIES
It is worth mentioning that the goal of this paper is to propose a low-complexity approach to generate the list of operations for fast SC-based decoders directly in hardware, thereby allowing the implementation of a fast and rate-flexible SC-based decoder. Our implementation results show that the proposed method incurs no area overhead or throughput loss in comparison with the memory-based decoder, while yielding a completely rate-flexible decoder.
The main advantage of the proposed approach is that, given the design code length, any code with the same N can be decoded using the Fast-SSCL-SPC algorithm without foreknowledge of the information/frozen bit sequence, regardless of rate and target $E_b/N_0$. On the contrary, the memory-based decoder needs to store the NodeType and NodeStage information for each considered code in an external memory of $\text{Mem}_{\text{ext}}$ bits. Table IV compares the proposed decoder to other state-of-the-art architectures that use 64 parallel PEs. Results are reported for P(1024, 512) and L = 4. The architectures presented in [9] and [8] are based on the Fast-SSCL-SPC and SSCL-SPC algorithms, respectively: it is possible to add the cost of the external memory directly to their area occupation and evaluate its impact on the area efficiency, assuming that all the code rates in 5G are supported. These modified results are reported within parentheses. It can be seen that the external memory increases $A_{\text{total}}$ by 131% in [9] and by 193% in [8]: the proposed special node identification technique is thus able to substantially limit the area occupation and increase the area efficiency in both architectures. The architecture presented in this work has higher $A_{\text{eff}}$ and lower $A_{\text{total}}$ than both [8] and [9]. Different design choices in terms of concurrent operations in the special nodes lead to a slightly lower $T_P$ than [9], together with a substantially lower $A_{\text{total}}$ and higher $A_{\text{eff}}$.
The architectures presented in [28]-[30] do not rely on a special-node-based decoding algorithm: thus, the throughput benefits and complexity savings of the proposed node identification technique cannot be directly evaluated. Moreover, the synthesis results of [28] were reported in 90 nm technology, although the synthesis was carried out in 65 nm technology. Therefore, a factor of 90/65 was used to convert the frequency, and a factor of $(65/90)^2$ was used to convert the area, of the decoder in [28] from 90 nm to 65 nm technology. The same conversion factors were used to convert to 65 nm technology the synthesis results of [29], [30], which were synthesized in a 90 nm node.
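The conversion described above amounts to a linear scaling of the frequency and a quadratic scaling of the area with the feature size. The following sketch applies the two factors given in the text; the function name and the numerical inputs are hypothetical and are not results from [28]-[30].

```python
def scale_to_65nm(freq_mhz, area_mm2, from_nm=90.0, to_nm=65.0):
    """Scale frequency by from_nm/to_nm and area by (to_nm/from_nm)**2."""
    k = from_nm / to_nm               # 90/65 for the conversion used in the text
    return freq_mhz * k, area_mm2 / (k * k)


# Hypothetical 90 nm figures, used only to exercise the formula.
freq_65, area_65 = scale_to_65nm(650.0, 8.1)
print(round(freq_65, 1), round(area_65, 3))
```

Such first-order scaling ignores effects like wire delay and supply voltage differences, so the converted figures are best read as approximate.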
Our work shows 31% higher throughput and 31% lower latency with respect to the multibit decision SCL decoder architecture of [28], while the smaller area occupation of [28] leads to a higher $A_{\text{eff}}$. The decoder in [29] shows lower area occupation than our work. However, the architecture proposed in this work achieves 122% higher throughput and 55% lower latency, leading to 11% higher area efficiency. The high-throughput SCL decoder architecture of [30] achieves higher throughput and lower latency than this work, at the cost of 38% higher area occupation and 7% lower $A_{\text{eff}}$. Moreover, [30] relies on tunable parameters that can lead to more than 0.2 dB error-correction performance loss. These parameters also reduce the flexibility of the decoder, since a different set of parameters needs to be used for each code rate. In contrast, the decoder proposed in this paper is designed to guarantee rate flexibility, making it suitable for 5G applications.
VI. CONCLUSION
The main drawback of fast successive-cancellation-based decoders for polar codes is that they need to store a list of operations for each code rate in a dedicated memory, in order to tell the decoder when a special node in the polar code graph is reached. In this paper, we tackled this issue by proposing a technique to generate the list of operations on the fly, directly in hardware. We proved that this technique can be applied to polar codes of any rate, thereby completely removing the memory needed to store the list of operations. We proposed a hardware architecture for the proposed technique and showed that the total area occupation of the proposed decoder is 38% of that of the base-line memory-based decoder when 5G code rates are considered.
