Abstract-Polar codes have gained a significant amount of attention during the past few years and have been selected as a coding scheme for the next-generation mobile broadband standard. Among decoding schemes, successive-cancellation list (SCL) decoding provides a reasonable trade-off between error-correction performance and hardware implementation complexity when used to decode polar codes, at the cost of limited throughput. The simplified SCL (SSCL) and its extension SSCL-SPC increase the speed of decoding by removing redundant calculations when encountering particular information and frozen bit patterns (rate one and single parity check codes), while keeping the error-correction performance unaltered. In this paper, we improve SSCL and SSCL-SPC by proving that the list size imposes a specific number of path splits required to decode rate one and single parity check codes. Thus, the number of splits can be limited while guaranteeing exactly the same error-correction performance as if the paths were forked at each bit estimation. We call the new decoding algorithms Fast-SSCL and Fast-SSCL-SPC. Moreover, we show that the number of path forks in a practical application can be tuned to achieve the desired speed, while keeping the error-correction performance almost unchanged. Hardware architectures implementing both algorithms are then described and implemented: it is shown that our design can achieve 1.86 Gb/s throughput, higher than the best state-of-the-art decoders.
I. INTRODUCTION
Polar codes are the first family of error-correcting codes with a provable capacity-achieving property and a low-complexity encoding and decoding process [2]. Successive-cancellation (SC) decoding is a low-complexity algorithm with which polar codes can achieve the capacity of a memoryless channel. However, there are two main drawbacks associated with SC. First, SC requires the decoding process to advance bit by bit. This results in high latency and low throughput when implemented in hardware [3]. Second, polar codes decoded with SC only achieve the channel capacity when the code length tends toward infinity. For practical polar codes of moderate length, SC falls short in providing a reasonable error-correction performance.
The first issue is a result of the serial nature of SC. In order to address this issue, the recursive structure of polar code construction and the location of information and parity (frozen) bits were utilized in [4], [5] to identify constituent polar codes. In particular, rate zero (Rate-0) codes with all frozen bits, rate one (Rate-1) codes with all information bits, repetition (Rep) codes with a single information bit in the most reliable position, and single parity-check (SPC) codes with a single frozen bit in the least reliable position, were shown to be capable of being decoded in parallel with low-complexity decoding algorithms. This in turn increased the throughput and reduced the latency significantly. Moreover, the simplifications in [4], [5] did not introduce any error-correction performance degradation with respect to conventional SC.

This work has been published in parts in the IEEE Wireless Communications and Networking Conference Workshops (WCNCW), 2017 [1]. S. A. Hashemi, C. Condo, and W. J. Gross are with the Department of Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada. e-mail: seyyed.hashemi@mail.mcgill.ca, carlo.condo@mail.mcgill.ca, warren.gross@mcgill.ca.
The second issue stems from the fact that SC is suboptimal with respect to maximum-likelihood (ML) decoding. The decoding of each bit is only dependent on the bits already decoded. SC is unable to use the information about the bits that are not decoded yet. In order to address this issue, SC list (SCL) decoding advances by estimating each bit as either 0 or 1. Therefore, the number of candidate codewords doubles at each bit estimation step. In order to limit the exponential increase in the number of candidates, only L candidate codewords are allowed to survive by employing a path metric (PM) [6]. The PMs are sorted and the L best candidates are kept for further processing. It should be noted that SCL was previously used to decode Reed-Muller codes [7]. SCL reduces the gap between SC and ML, and it was shown that when a cyclic redundancy check (CRC) code is concatenated with polar codes, SCL can make polar codes outperform state-of-the-art codes, to the extent that polar codes have been chosen for adoption in the next-generation mobile broadband standard [8].
The good error-correction performance of SCL comes at the cost of higher latency, lower throughput, and higher area occupation than SC when implemented in hardware [9]. It was identified in [10] that using the log-likelihood ratio (LLR) values results in an SCL decoder which is more area-efficient than the conventional SCL decoder with log-likelihood (LL) values. In order to reduce the latency and increase the throughput associated with SCL, several attempts have been made to reduce the number of required decoding time steps as defined in [2]. It should be noted that different time steps might entail different operations (e.g. a bit estimation or an LLR value update), and might thus last a different number of clock cycles. A group of M bits were allowed to be decoded together in [11], [12]. A high-throughput architecture based on a tree-pruning scheme was proposed in [13] and further extended to a multimode decoder in [14]. The throughput increase in [13] is based on code-based parameters which could degrade the error-correction performance significantly. Based on the idea in [5], a fast list decoder architecture for software implementation was proposed in [15] which was able to decode constituent codes in a polar code in parallel. This resulted in fewer time steps to finish the decoding process. However, the SCL decoder in [15] is based on an empirical approach to decode constituent Rate-1 and SPC codes and cannot guarantee the same error-correction performance as the conventional SCL decoder. Moreover, all the decoders in [13]-[15] require a large sorter to select the surviving candidate codewords. Since the sorter in the hardware implementation of SCL decoders has a long and dominant critical path which is dependent on the number of its inputs [10], increasing the number of PMs results in a longer critical path and a lower operating frequency.

arXiv:1703.08208v2 [cs.IT] 29 Aug 2017
Based on the idea of list sphere decoding in [16], a simplified SCL (SSCL) was proposed in [17] which identified and avoided the redundant calculations in SCL. Therefore, it required fewer time steps than SCL to decode a polar code. The advantage of SSCL is that it not only guarantees the preservation of error-correction performance, but also uses the same sorter as the conventional SCL algorithm. To further increase the throughput and reduce the latency of SSCL, the matrix reordering idea in [18] was used to develop the SSCL-SPC decoder in [19]. While SSCL-SPC uses the same sorter as the conventional SCL, it provides an exact reformulation for L = 2 and its approximations bring negligible error-correction performance loss with respect to SSCL.
While SSCL and SSCL-SPC are algorithms that can work with any list size, they fail to address the redundant path splitting associated with a specific list size. In this paper, we first prove that there is a specific number of path splits required for decoding the constituent codes in SSCL and SSCL-SPC for every list size to guarantee the preservation of error-correction performance. Any path splitting beyond that number is redundant, and any smaller number of path splits cannot provably preserve the error-correction performance. Since these decoders require fewer time steps than SSCL and SSCL-SPC, we name them Fast-SSCL and Fast-SSCL-SPC, respectively. We further show that in practical polar codes, we can achieve similar error-correction performance to SSCL and SSCL-SPC with even fewer path forks. Therefore, we can optimize Fast-SSCL and Fast-SSCL-SPC for speed. We propose hardware architectures to implement both new algorithms: implementation results yield the highest throughput in the state of the art with comparable area occupation.
This paper is an extension of our work in [1], in which the Fast-SSCL algorithm was proposed. Here, we propose the Fast-SSCL-SPC algorithm and prove that its error-correction performance is identical to that of SSCL-SPC. We further propose speed-up techniques for Fast-SSCL and Fast-SSCL-SPC which incur almost no error-correction performance loss. Finally, we propose hardware architectures implementing the aforementioned algorithms and show the effectiveness of the proposed techniques by comparing our designs with the state of the art.
The remainder of this paper is organized as follows: Section II provides a background on polar codes and their decoding algorithms. Section III introduces the proposed Fast-SSCL and Fast-SSCL-SPC algorithms and their speed optimization technique. A decoder architecture is proposed in Section IV and the implementation results are provided in Section V. Finally, Section VI draws the main conclusions of the paper.
II. PRELIMINARIES

A. Polar Codes
A polar code of length $N$ with $K$ information bits is represented by $\mathcal{P}(N, K)$ and can be constructed recursively from two polar codes of length $N/2$. The encoding process can be written as the matrix multiplication $\mathbf{x} = \mathbf{u}\mathbf{G}_N$, where $\mathbf{u} = \{u_0, u_1, \ldots, u_{N-1}\}$ is the sequence of input bits, $\mathbf{x} = \{x_0, x_1, \ldots, x_{N-1}\}$ is the sequence of coded bits, and $\mathbf{G}_N = \mathbf{B}_N \mathbf{G}^{\otimes n}$ is the generator matrix, formed as the product of the bit-reversal permutation matrix $\mathbf{B}_N$ and $\mathbf{G}^{\otimes n}$, the $n$-th Kronecker power of the polarizing matrix $\mathbf{G} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, with $n = \log_2 N$. The encoding process involves the determination of the $K$ bit-channels with the best channel characteristics and assigning the information bits to them. The remaining $N - K$ bit-channels are set to a value known at the decoder side. They are thus called frozen bits and form the set $\mathcal{F}$. Since the value of these bits does not have an impact on the error-correction performance of polar codes on a symmetric channel, they are usually set to 0. The codeword $\mathbf{x}$ is then modulated and sent through the channel. In this paper, we consider binary phase-shift keying (BPSK) modulation, which maps $\{0, 1\}$ to $\{+1, -1\}$.
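As an illustration of the encoding step, the following Python sketch applies the bit-reversal permutation and then the Kronecker power of the polarizing matrix as a butterfly of XORs over GF(2), followed by the BPSK mapping. The function names (`bit_reverse`, `polar_encode`, `bpsk`) are hypothetical, and the input is assumed to already contain the frozen bits (set to 0) in their positions.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def polar_encode(u):
    """Compute x = u * B_N * G^(kron n) over GF(2) for len(u) = N = 2^n."""
    N = len(u)
    n = N.bit_length() - 1
    assert N == 1 << n, "code length must be a power of two"
    # Bit-reversal permutation (B_N is its own inverse).
    x = [0] * N
    for i, bit in enumerate(u):
        x[bit_reverse(i, n)] = bit
    # Multiplication by the n-th Kronecker power of [[1, 0], [1, 1]],
    # carried out as n butterfly stages of XORs.
    h = 1
    while h < N:
        for start in range(0, N, 2 * h):
            for i in range(start, start + h):
                x[i] ^= x[i + h]
        h *= 2
    return x

def bpsk(x):
    """Map bits {0, 1} to symbols {+1, -1}."""
    return [1 - 2 * b for b in x]
```

For example, `polar_encode([0, 1, 0, 0])` selects the second row of the length-4 generator matrix.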
B. Successive-Cancellation Decoding
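The SC decoding process can be represented as a binary tree search, as shown in Fig. 1 for $\mathcal{P}(8, 4)$. Thanks to the recursive construction of polar codes, at each stage $s$ of the tree, each node can be interpreted as a polar code of length $N_s = 2^s$. Two kinds of messages are passed between the nodes, namely, soft LLR values $\alpha = \{\alpha_0, \alpha_1, \ldots, \alpha_{N_s-1}\}$, which are passed from parent to child nodes, and the hard bit estimates $\beta = \{\beta_0, \beta_1, \ldots, \beta_{N_s-1}\}$, which are passed from child nodes to the parent node. The left and right messages $\alpha^{\mathrm{l}}$ and $\alpha^{\mathrm{r}}$, each of length $2^{s-1}$, are calculated as

$$\alpha^{\mathrm{l}}_i = 2\,\operatorname{arctanh}\!\left(\tanh\!\left(\frac{\alpha_i}{2}\right)\tanh\!\left(\frac{\alpha_{i+2^{s-1}}}{2}\right)\right), \tag{1}$$

$$\alpha^{\mathrm{r}}_i = \alpha_{i+2^{s-1}} + \left(1 - 2\beta^{\mathrm{l}}_i\right)\alpha_i, \tag{2}$$

whereas $\beta$ is calculated as

$$\beta_i = \begin{cases} \beta^{\mathrm{l}}_i \oplus \beta^{\mathrm{r}}_i, & \text{if } i < 2^{s-1}, \\ \beta^{\mathrm{r}}_{i-2^{s-1}}, & \text{otherwise}, \end{cases} \tag{3}$$

where $\oplus$ is the bitwise XOR operation. At leaf nodes, the $i$-th bit $\hat{u}_i$ can be estimated as

$$\hat{u}_i = \begin{cases} 0, & \text{if } i \in \mathcal{F} \text{ or } \alpha_i \geq 0, \\ 1, & \text{otherwise}. \end{cases} \tag{4}$$

Equation (1) can be reformulated in a more hardware-friendly (HWF) version that was first proposed in [3]:

$$\alpha^{\mathrm{l}}_i = \operatorname{sgn}(\alpha_i)\,\operatorname{sgn}(\alpha_{i+2^{s-1}})\,\min\left(|\alpha_i|, |\alpha_{i+2^{s-1}}|\right). \tag{5}$$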
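The message updates at a single tree node can be sketched in Python as follows. This is a minimal illustration, assuming the usual SC convention that the LLR at position $i$ is paired with the LLR at position $i + 2^{s-1}$; the function names are hypothetical, and `f_exact` and `f_hwf` stand for (1) and its hardware-friendly form (5), `g` for (2), and `combine` for (3).

```python
import math

def f_exact(a, b):
    """Left-branch update (1). Very large |LLR| values can make tanh saturate
    to 1.0 and atanh raise ValueError; this sketch does not guard against that."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))

def f_hwf(a, b):
    """Hardware-friendly form (5): product of signs times the smaller magnitude."""
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    return sign * min(abs(a), abs(b))

def g(a, b, beta_left):
    """Right-branch update (2), given the left child's hard estimate."""
    return b + (1 - 2 * beta_left) * a

def left_llrs(alpha, f=f_hwf):
    """LLRs passed to the left child of a node with input LLRs alpha."""
    half = len(alpha) // 2
    return [f(alpha[i], alpha[i + half]) for i in range(half)]

def right_llrs(alpha, beta_left):
    """LLRs passed to the right child once the left child has been decoded."""
    half = len(alpha) // 2
    return [g(alpha[i], alpha[i + half], beta_left[i]) for i in range(half)]

def combine(beta_left, beta_right):
    """Hard estimates (3) returned to the parent node."""
    return [bl ^ br for bl, br in zip(beta_left, beta_right)] + list(beta_right)
```

The hardware-friendly form has the same sign as the exact update and never a smaller magnitude, which is why it can replace (1) with only sign logic and a comparator.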
C. Successive-Cancellation List Decoding
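The error-correction performance of SC when applied to codes of short to moderate length can be improved by the use of SCL-based decoding. The SCL algorithm estimates a bit considering both its possible values 0 and 1. At every estimation, the number of codeword candidates (paths) doubles: in order to limit the increase in the complexity of this algorithm, only a set of $L$ codeword candidates is memorized at all times. Thus, after every estimation, half of the paths are discarded. For this purpose, a PM is associated with each path and updated at every new estimation: it can be considered a cost function, and the $L$ paths with the lowest PMs are allowed to survive. In the LLR-based SCL [10], the PM can be computed as

$$\mathrm{PM}_{i_l} = \sum_{j=0}^{i} \ln\left(1 + e^{-(1 - 2\hat{u}_{j_l})\alpha_{j_l}}\right), \tag{6}$$

where $l$ is the path index and $\hat{u}_{j_l}$ is the estimate of bit $j$ at path $l$. A HWF version of Equation (6) has been proposed in [10]:

$$\mathrm{PM}_{-1_l} = 0, \qquad \mathrm{PM}_{i_l} = \begin{cases} \mathrm{PM}_{i-1_l} + |\alpha_{i_l}|, & \text{if } \hat{u}_{i_l} \neq \frac{1}{2}\left(1 - \operatorname{sgn}(\alpha_{i_l})\right), \\ \mathrm{PM}_{i-1_l}, & \text{otherwise}, \end{cases} \tag{7}$$

which can be rewritten as

$$\mathrm{PM}_{i_l} = \frac{1}{2}\sum_{j=0}^{i} \left(\operatorname{sgn}(\alpha_{j_l})\,\alpha_{j_l} - (1 - 2\hat{u}_{j_l})\,\alpha_{j_l}\right). \tag{8}$$

In case the hardware does not introduce bottlenecks and both (2) and (5) can be computed in a single time step, the number of time steps required to decode a code of length $N$ with $K$ information bits in SCL is [10]

$$T_{\mathrm{SCL}}(N, K) = 2N + K - 2. \tag{9}$$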
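One list-decoding step at an information bit can be sketched as follows: every surviving path is duplicated, the PM of each copy is updated, and only the $L$ candidates with the lowest PMs are kept. This is a simplified software illustration with hypothetical names; paths are represented as `(pm, bits)` pairs and the hardware-friendly update is assumed to add $|\alpha|$ only when the chosen bit contradicts the sign of the LLR.

```python
import math

def pm_update_exact(pm_prev, alpha, u_hat):
    """Add one term of the exact LLR-based path metric."""
    return pm_prev + math.log1p(math.exp(-(1 - 2 * u_hat) * alpha))

def pm_update_hwf(pm_prev, alpha, u_hat):
    """Hardware-friendly update: pay |alpha| only if u_hat contradicts the LLR sign."""
    hard_decision = 0 if alpha >= 0 else 1
    return pm_prev + abs(alpha) if u_hat != hard_decision else pm_prev

def fork_and_prune(paths, alphas, L):
    """Split every path on the current bit and keep the L lowest-PM candidates.

    paths:  list of (pm, bits) pairs, one per surviving path
    alphas: LLR of the current bit as seen by each path
    """
    candidates = []
    for (pm, bits), alpha in zip(paths, alphas):
        for u_hat in (0, 1):
            candidates.append((pm_update_hwf(pm, alpha, u_hat), bits + [u_hat]))
    candidates.sort(key=lambda c: c[0])  # lowest PM first
    return candidates[:L]
```

Starting from a single path, two information bits with $L = 2$ can be processed as `fork_and_prune(fork_and_prune([(0.0, [])], [1.5], 2), [-0.5, 2.0], 2)`.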
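D. Simplified Successive-Cancellation List Decoding
1) SSCL Decoding: The SSCL algorithm in [17] provides efficient decoders for Rate-0, Rep, and Rate-1 nodes in SCL without traversing the decoding tree, while guaranteeing the preservation of error-correction performance. For example, in Fig. 1, the black circles represent Rate-1 nodes, the white circles represent Rate-0 nodes, and the white triangles represent Rep nodes. The pruned decoding tree of SSCL for the example in Fig. 1 is shown in Fig. 2a, which consists of two Rep nodes and a Rate-1 node.
Let us consider that the vectors $\alpha_l$ and $\eta_l = 1 - 2\beta_l$ are relative to the top of a node in the decoding tree. Rate-0 nodes can be decoded as

$$\begin{aligned} \text{Exact:} \quad \mathrm{PM}_{{N_s-1}_l} &= \sum_{i=0}^{N_s-1} \ln\left(1 + e^{-\alpha_{i_l}}\right), \\ \text{HWF:} \quad \mathrm{PM}_{{N_s-1}_l} &= \frac{1}{2}\sum_{i=0}^{N_s-1} \left(\operatorname{sgn}(\alpha_{i_l})\,\alpha_{i_l} - \alpha_{i_l}\right). \end{aligned} \tag{10}$$

Rep nodes can be decoded as

$$\begin{aligned} \text{Exact:} \quad \mathrm{PM}_{{N_s-1}_l} &= \sum_{i=0}^{N_s-1} \ln\left(1 + e^{-\eta_{{N_s-1}_l}\alpha_{i_l}}\right), \\ \text{HWF:} \quad \mathrm{PM}_{{N_s-1}_l} &= \frac{1}{2}\sum_{i=0}^{N_s-1} \left(\operatorname{sgn}(\alpha_{i_l})\,\alpha_{i_l} - \eta_{{N_s-1}_l}\alpha_{i_l}\right), \end{aligned} \tag{11}$$

where $\eta_{{N_s-1}_l}$ represents the bit estimate of the information bit in the Rep node. Finally, Rate-1 nodes can be decoded as

$$\begin{aligned} \text{Exact:} \quad \mathrm{PM}_{i_l} &= \sum_{j=0}^{i} \ln\left(1 + e^{-\eta_{j_l}\alpha_{j_l}}\right), \\ \text{HWF:} \quad \mathrm{PM}_{i_l} &= \frac{1}{2}\sum_{j=0}^{i} \left(\operatorname{sgn}(\alpha_{j_l})\,\alpha_{j_l} - \eta_{j_l}\alpha_{j_l}\right). \end{aligned} \tag{12}$$

It was shown in [19] that the time step requirements of Rate-0, Rep, and Rate-1 nodes of length $N_s$ in SSCL decoding can be represented as

$$T_{\mathrm{SSCL}_{\text{Rate-0}}}(N_s, 0) = 1, \tag{13}$$

$$T_{\mathrm{SSCL}_{\text{Rep}}}(N_s, 1) = 2, \tag{14}$$

$$T_{\mathrm{SSCL}_{\text{Rate-1}}}(N_s, N_s) = N_s. \tag{15}$$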
While the SSCL algorithm reduces the number of required time steps to decode Rate-1 nodes by almost a factor of three, it fails to address the effect of list size on the maximum number of required path forks. In Section III, we prove that the number of required time steps to decode Rate-1 nodes depends on the list size and that the new Fast-SSCL algorithm is faster than both SCL and SSCL without incurring any error-correction performance degradation.
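The Rate-0 and Rep node updates in the hardware-friendly form can be sketched for a single path as follows. This is an illustrative sketch with hypothetical function names; it assumes the incoming PM of the path is added to the node contribution, and that $\eta = +1$ corresponds to bit 0 and $\eta = -1$ to bit 1.

```python
def node_penalty(alpha, eta):
    """HWF node contribution: each position pays |alpha_i| when eta_i
    disagrees with the sign of alpha_i, and nothing otherwise."""
    return 0.5 * sum(abs(a) - e * a for a, e in zip(alpha, eta))

def rate0_pm(pm_prev, alpha):
    """Rate-0 node: all bits are frozen to 0, so eta_i = +1 everywhere."""
    return pm_prev + node_penalty(alpha, [1] * len(alpha))

def rep_pms(pm_prev, alpha):
    """Rep node: the two candidates are the all-zero and all-one codewords.
    Returns (PM for information bit 0, PM for information bit 1)."""
    n_s = len(alpha)
    return (pm_prev + node_penalty(alpha, [1] * n_s),
            pm_prev + node_penalty(alpha, [-1] * n_s))
```

For the LLRs `[1.0, -2.0, 0.5, -0.25]`, the all-zero candidate pays for the two negative LLRs and the all-one candidate pays for the two positive ones.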
2) SSCL-SPC Decoding: In [19], a low-complexity approach was proposed to decode SPC nodes, which resulted in an exact reformulation for $L = 2$, while its approximations for other list sizes brought negligible error-correction performance degradation. The pruned tree of SSCL-SPC for the same example as in Fig. 1 is shown in Fig. 2b, which consists of a Rep node and an SPC node. The idea is to decode the frozen bit in SPC nodes in the first step of the decoding process. In order to do that, the PM calculations in the HWF formulation were carried out by only finding the LLR value of the least reliable bit and using the LLR values at the top of the polar code tree in the SCL decoding algorithm for the rest of the bits. The least reliable bit in an SPC node of length $N_s$ is found as

$$i_{\min} = \arg\min_{0 \leq i < N_s} \left(|\alpha_i|\right), \tag{16}$$

and the parity of the node is derived as

$$\gamma = \bigoplus_{i=0}^{N_s-1} \frac{1}{2}\left(1 - \operatorname{sgn}(\alpha_i)\right). \tag{17}$$

To satisfy the even-parity constraint, $\gamma$ is found for each path based on (17). The PMs are then initialized as

$$\mathrm{PM}_0 = \begin{cases} \mathrm{PM}_{-1} + |\alpha_{i_{\min}}|, & \text{if } \gamma = 1, \\ \mathrm{PM}_{-1}, & \text{otherwise}. \end{cases} \tag{18}$$

In this way, the least reliable bit, which corresponds to the even-parity constraint, is decoded first. For bits other than the least reliable bit, the PM is updated as

$$\mathrm{PM}_i = \begin{cases} \mathrm{PM}_{i-1} + |\alpha_i| + (1 - 2\gamma)\,|\alpha_{i_{\min}}|, & \text{if } \eta_i \neq \operatorname{sgn}(\alpha_i), \\ \mathrm{PM}_{i-1}, & \text{otherwise}. \end{cases} \tag{19}$$

Finally, when all the bits are estimated, the least reliable bit is set to preserve the even-parity constraint as

$$\beta_{i_{\min}} = \bigoplus_{\substack{i=0 \\ i \neq i_{\min}}}^{N_s-1} \beta_i. \tag{20}$$

In [19], the time step requirement of SPC nodes of length $N_s$ in SSCL-SPC decoding was shown to be

$$T_{\mathrm{SSCL\text{-}SPC}_{\text{SPC}}}(N_s, N_s - 1) = N_s + 1, \tag{21}$$

which consists of one time step for (18), $N_s - 1$ time steps for (19), and one time step for (20). The SSCL-SPC algorithm reduces the number of required time steps to decode SPC nodes by almost a factor of three, but as in the case of Rate-1 nodes, it fails to address the effect of list size on the maximum number of required path forks. In Section III, we prove that the number of required time steps to decode SPC nodes depends on the list size and that the new Fast-SSCL-SPC algorithm is faster than SSCL-SPC without incurring any error-correction performance degradation.
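The first and last steps of SPC node decoding, namely locating the least reliable bit, computing the parity, initializing the PM, and restoring even parity at the end, can be sketched for a single path without any path splitting. This is a deliberately reduced illustration (it omits the list-decoding forks on the remaining bits), and the function name and return layout are hypothetical.

```python
def spc_hard_decode(alpha, pm_prev=0.0):
    """Single-path SPC decoding: hard decisions plus an even-parity fix.

    Returns (beta, pm, i_min, gamma).
    """
    n_s = len(alpha)
    # Least reliable position: smallest LLR magnitude.
    i_min = min(range(n_s), key=lambda i: abs(alpha[i]))
    # Hard decisions on all LLRs and their overall parity.
    beta = [0 if a >= 0 else 1 for a in alpha]
    gamma = 0
    for b in beta:
        gamma ^= b
    # PM initialization: a violated parity costs the least reliable magnitude.
    pm = pm_prev + abs(alpha[i_min]) if gamma == 1 else pm_prev
    # The least reliable bit is set last so that the overall parity is even.
    parity_of_others = 0
    for i, b in enumerate(beta):
        if i != i_min:
            parity_of_others ^= b
    beta[i_min] = parity_of_others
    return beta, pm, i_min, gamma
```

With the LLRs `[2.0, -0.3, 1.5, 4.0]`, the hard decisions have odd parity, so the bit at the least reliable position is flipped and the PM grows by 0.3.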
III. FAST-SSCL DECODING
In this section, we propose a fast decoding approach for Rate-1 nodes and use it to develop Fast-SSCL. We further propose a fast decoding approach for SPC nodes in SSCL-SPC and use it to develop Fast-SSCL-SPC. To this end, we provide the exact number of path forks in Rate-1 and SPC nodes needed to guarantee the preservation of error-correction performance. Any path splitting beyond that number is redundant, and any smaller number of path splits cannot guarantee the preservation of error-correction performance. We further show that in practical applications, this number can be reduced with almost no error-correction performance loss. We use this phenomenon to optimize Fast-SSCL and Fast-SSCL-SPC for speed.
A. Guaranteed Error-Correction Performance Preservation
The fast Rate-1 and SPC decoders can be summarized by the following theorems.

Theorem 1. In SSCL decoding with list size $L$, the number of path splits in a Rate-1 node of length $N_s$ required to get the exact same results as the conventional SSCL decoder is

$$\min(L - 1, N_s). \tag{22}$$

The proposed technique results in

$$T_{\text{Fast-SSCL}_{\text{Rate-1}}}(N_s, N_s) = \min(L - 1, N_s),$$

which improves the required number of time steps to decode Rate-1 nodes when $L - 1 < N_s$. Every bit after the $(L-1)$-th can be obtained through hard decision on the LLR as

$$\beta_{i_l} = \begin{cases} 0, & \text{if } \alpha_{i_l} \geq 0, \\ 1, & \text{otherwise}, \end{cases} \tag{23}$$

without the need for path splitting. On the other hand, in case $\min(L - 1, N_s) = N_s$, all bits of the node need to be estimated and the decoding automatically reverts to the process described in [17]. The proof of the theorem is nevertheless valid for both $L - 1 < N_s$ and $L - 1 \geq N_s$ and is provided in [1]. The proposed theorem also remains valid for the HWF formulation, which can be written as

$$\mathrm{PM}_{i_l} = \begin{cases} \mathrm{PM}_{i-1_l} + |\alpha_{i_l}|, & \text{if } \eta_{i_l} \neq \operatorname{sgn}(\alpha_{i_l}), \\ \mathrm{PM}_{i-1_l}, & \text{otherwise}. \end{cases} \tag{24}$$

The proof of the theorem in the HWF formulation case is also presented in [1]. The result of Theorem 1 provides an exact number of path forks in Rate-1 nodes for each list size in SCL decoding in order to guarantee the preservation of error-correction performance. The Rate-1 node decoder of [15] empirically states that two path forks are required to preserve the error-correction performance. The following remarks are the direct results of Theorem 1.
Remark 1. The Rate-1 node decoder of [15] for L = 2 is redundant.
Theorem 1 states that for a Rate-1 node of length $N_s$ when $L = 2$, the number of path splits is $\min(L - 1, N_s) = 1$. Therefore, there is no need to split the path after the least reliable bit is estimated. The decoder of [15] for $L = 2$ is thus redundant.

Fig. 3: FER and BER performance comparison of SSCL [17] and the empirical method of [15] for $\mathcal{P}(1024, 860)$ when $L = 128$. The CRC length is 32.
Remark 2. The Rate-1 node decoder of [15] falls short in preserving the error-correction performance for higher rates and larger list sizes.
For codes of higher rates, the number of Rate-1 nodes of larger length increases [19]. Therefore, when the list size is also large,

$$\min(L - 1, N_s) \gg 2.$$

The gap between the empirical method of [15] and the result of Theorem 1 can introduce significant error-correction performance loss. Fig. 3 provides the frame error rate (FER) and bit error rate (BER) of decoding a $\mathcal{P}(1024, 860)$ code with SSCL of [17] and the empirical method of [15] when the list size is 128. It can be seen that the error-correction performance loss reaches 0.25 dB at a FER of $10^{-5}$. In Section III-B, we show that the number of path forks can be tuned for each list size to find a good trade-off between the error-correction performance and the speed of decoding.
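The statement of Theorem 1 can be checked numerically on small Rate-1 nodes. The sketch below is an illustration under stated assumptions and not the proof of [1]: it uses the hardware-friendly PM (a candidate pays $|\alpha_i|$ for every bit that disagrees with the hard decision), it forks each path on its bits in order of increasing reliability, and it compares only the resulting PM values against an exhaustive search over all $2^{N_s}$ candidates per path. All function names are hypothetical.

```python
import itertools
import random

def exhaustive_pms(paths, L):
    """L smallest PMs when every bit of the node is allowed to fork.

    paths: list of (incoming PM, LLR list) pairs, one per surviving path.
    """
    pms = []
    for pm_in, alpha in paths:
        mags = [abs(a) for a in alpha]
        for flips in itertools.product((0, 1), repeat=len(mags)):
            pms.append(pm_in + sum(m for m, f in zip(mags, flips) if f))
    return sorted(pms)[:L]

def fast_rate1_pms(paths, L):
    """L smallest PMs when only min(L - 1, N_s) forks are performed."""
    n_s = len(paths[0][1])
    # Each live path carries its PM and its LLR magnitudes, least reliable first.
    live = [(pm_in, sorted(abs(a) for a in alpha)) for pm_in, alpha in paths]
    for t in range(min(L - 1, n_s)):
        candidates = []
        for pm, mags in live:
            candidates.append((pm, mags))            # follow the hard decision
            candidates.append((pm + mags[t], mags))  # flip the t-th least reliable bit
        candidates.sort(key=lambda c: c[0])
        live = candidates[:L]                        # keep the L best paths
    # The remaining bits take their hard decision, which adds no penalty.
    return sorted(pm for pm, _ in live)

def check_theorem1(trials=200, seed=0):
    """Compare both procedures on random nodes; True if the PMs always agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        L = rng.choice((1, 2, 4, 8))
        n_s = rng.choice((2, 4, 8))
        paths = [(rng.uniform(0.0, 3.0), [rng.gauss(0.0, 2.0) for _ in range(n_s)])
                 for _ in range(rng.randint(1, L))]
        ref, fast = exhaustive_pms(paths, L), fast_rate1_pms(paths, L)
        if len(ref) != len(fast) or any(abs(a - b) > 1e-9 for a, b in zip(ref, fast)):
            return False
    return True
```

The agreement follows from a counting argument: a candidate that flips a bit beyond the $L - 1$ least reliable ones of its path is never better than the $L$ candidates of the same path that flip at most one of those $L - 1$ bits.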
Theorem 2. In SSCL-SPC decoding with list size $L$, the number of path forks in an SPC node of length $N_s$ required to get the exact same results as the conventional SSCL-SPC decoder is

$$\min(L, N_s). \tag{25}$$

Following the time step calculation of SSCL-SPC, the proposed technique in Theorem 2 results in $T_{\text{Fast-SSCL-SPC}_{\text{SPC}}}(N_s, N_s - 1) = \min(L, N_s) + 1$, which improves the required number of time steps to decode SPC nodes when $L < N_s$. Every bit after the $L$-th can be obtained through hard decision on the LLR as in (23) without the need for path splitting. In case $\min(L, N_s) = N_s$, the paths need to be split for all bits of the node and the decoding automatically reverts to the process described in [19]. The proof of the theorem is nevertheless valid for both $L < N_s$ and $L \geq N_s$. We defer the proof to Appendix A.
The effectiveness of hard decision decoding after the $\min(L - 1, N_s)$-th bit in Rate-1 nodes and the $\min(L, N_s)$-th bit in SPC nodes is due to the fact that the bits with high absolute LLR values are more reliable and less likely to incur path splitting. However, whether path splitting must occur or not depends on the list size $L$. The proposed Rate-1 node decoder is used in the Fast-SSCL and Fast-SSCL-SPC algorithms and the proposed SPC node decoder is used in Fast-SSCL-SPC, while the decoders for Rate-0 and Rep nodes remain similar to those used in SSCL [17], such that

$$T_{\text{Fast-SSCL-SPC}_{\text{Rate-0}}}(N_s, 0) = T_{\text{Fast-SSCL}_{\text{Rate-0}}}(N_s, 0) = T_{\mathrm{SSCL}_{\text{Rate-0}}}(N_s, 0) = 1,$$

$$T_{\text{Fast-SSCL-SPC}_{\text{Rep}}}(N_s, 1) = T_{\text{Fast-SSCL}_{\text{Rep}}}(N_s, 1) = T_{\mathrm{SSCL}_{\text{Rep}}}(N_s, 1) = 2.$$

It should be noted that the number of path forks is directly related to the number of time steps required in the decoding process [10]. Therefore, when $L < N_s$, the time step requirement of SPC nodes based on Theorem 2 is two time steps more than the time step requirement of Rate-1 nodes as in Theorem 1. However, if SPC nodes are not taken into account, as in Fast-SSCL decoding, the polar code tree needs to be traversed to find Rep nodes and Rate-1 nodes, as shown in Fig. 2a. For an SPC node of length $N_s$, this results in additional time step requirements. Table I summarizes the number of time steps required to decode each node with different decoding algorithms. In practical polar codes, there are many instances where $L - 1 < N_s$ for Rate-1 nodes, and using the Fast-SSCL algorithm can significantly reduce the number of required decoding time steps with respect to SSCL. Similarly, there are many instances where $L < N_s$ for SPC nodes, and using the Fast-SSCL-SPC algorithm can significantly reduce the number of required decoding time steps with respect to SSCL-SPC. Fig. 4 shows the savings in time step requirements of a polar code with three different rates. It should be noted that as the rate increases, the number of Rate-1 and SPC nodes increases. This consequently results in more savings by going from SSCL (SSCL-SPC) to Fast-SSCL (Fast-SSCL-SPC).
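The per-node time step counts discussed above can be tabulated with a short sketch. The function names are hypothetical; the counts follow the statements in the text that a Rate-1 node needs one time step per path fork, and that an SPC node needs one extra time step on top of its forks and hard-decision phase.

```python
def rate1_time_steps(n_s, L, fast):
    """Time steps for a Rate-1 node of length n_s:
    n_s forks in SSCL, min(L - 1, n_s) forks in Fast-SSCL."""
    return min(L - 1, n_s) if fast else n_s

def spc_time_steps(n_s, L, fast):
    """Time steps for an SPC node of length n_s:
    n_s + 1 in SSCL-SPC, min(L, n_s) + 1 in Fast-SSCL-SPC."""
    return (min(L, n_s) if fast else n_s) + 1

if __name__ == "__main__":
    for L in (2, 4, 8):
        for n_s in (4, 16, 64):
            print(L, n_s,
                  rate1_time_steps(n_s, L, False), rate1_time_steps(n_s, L, True),
                  spc_time_steps(n_s, L, False), spc_time_steps(n_s, L, True))
```

For a node of length 64 and $L = 4$, the Rate-1 count drops from 64 to 3 and the SPC count from 65 to 5, which also shows the two-time-step difference between the two node types when $L < N_s$.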
B. Speed Optimization
The analysis in Section III-A provides exact reformulations of SSCL and SSCL-SPC decoders without introducing any error-correction performance loss. However, in practical polar codes, fewer path forks than those required by Theorems 1 and 2 are often sufficient for Fast-SSCL and Fast-SSCL-SPC to match the error-correction performance of SSCL and SSCL-SPC, respectively.
Without loss of generality, let us consider $L - 1 < N_s$ for Rate-1 nodes and $L < N_s$ for SPC nodes, such that Fast-SSCL and Fast-SSCL-SPC result in higher decoding speeds than SSCL and SSCL-SPC, respectively. Let $S_{\text{Rate-1}}$ be the number of path forks in a Rate-1 node of length $N_s$, and $S_{\text{SPC}}$ be the number of path forks in an SPC node of length $N_s$, where $S_{\text{Rate-1}} \leq L - 1$ and $S_{\text{SPC}} \leq L$.

[Table I: number of time steps required to decode Rate-0, Rep, Rate-1, and SPC nodes with each decoding algorithm.]

The definition of the parameters $S_{\text{Rate-1}}$ and $S_{\text{SPC}}$ provides a trade-off between error-correction performance and speed of Fast-SSCL and Fast-SSCL-SPC. Let us consider CRC-aided Fast-SSCL decoding of $\mathcal{P}(1024, 512)$ with CRC length 16. Fig. 5 shows that for $L = 2$, choosing $S_{\text{Rate-1}} = 0$ results in significant FER and BER degradation. Therefore, when $L = 2$, the optimal value of $S_{\text{Rate-1}} = 1$ is used for Fast-SSCL. The optimal value of $S_{\text{Rate-1}}$ for $L = 4$ is 3. However, as shown in Fig. 6, $S_{\text{Rate-1}} = 1$ results in almost the same FER and BER performance as the optimal value of $S_{\text{Rate-1}} = 3$. For $L = 8$, the selection of $S_{\text{Rate-1}} = 1$ results in ∼0.1 dB of error-correction performance degradation at FER = $10^{-5}$, as shown in Fig. 7. However, selecting $S_{\text{Rate-1}} = 2$ removes the error-correction performance gap to the optimal value of $S_{\text{Rate-1}} = 7$. In the case of CRC-aided Fast-SSCL-SPC decoding of $\mathcal{P}(1024, 512)$ with 16 bits of CRC, selecting $S_{\text{Rate-1}} = 1$ and $S_{\text{SPC}} = 3$ for $L = 4$ results in almost the same FER and BER performance as the optimal values of $S_{\text{Rate-1}} = 3$ and $S_{\text{SPC}} = 4$, as shown in Fig. 8. As illustrated in Fig. 9 for $L = 8$, the selection of $S_{\text{Rate-1}} = 2$ and $S_{\text{SPC}} = 4$ provides similar FER and BER performance to the optimal values of $S_{\text{Rate-1}} = 7$ and $S_{\text{SPC}} = 8$.
IV. DECODER ARCHITECTURE
To evaluate the impact of the proposed techniques on a practical case, an SCL-based polar code decoder architecture implementing Fast-SSCL and Fast-SSCL-SPC has been designed. Its basic structure is inspired by the decoders presented in [9], [19], and it is portrayed in Fig. 10. The decoding flow follows the one described in Section II-C for a list size $L$. This means that the majority of the datapath and of the memory are replicated $L$ times, and work concurrently on different candidate codewords and the associated LLR values.
Starting from the tree root, the tree is descended by recursively computing (5) and (2) on left and right branches, respectively, at each tree stage $s$, with a left-first rule. The computations are performed by $L$ sets of $P$ processing elements (PEs), where each set can be considered a standalone SC decoder, and $P$ is a power of 2. In case $2^s > 2P$, (5) and (2) require $2^s/(2P)$ time steps to be completed; otherwise, a single time step suffices. The updated LLR values are stored in dedicated memories. The internal structure of PEs is shown in Fig. 11. Each PE receives two LLR values as input and outputs one. The computations for both (5) and (2) are performed concurrently, and the output is selected according to $i_s$, which represents the $s$-th bit of the index $i$, where $0 \leq i < N$. The index $i$ is represented with $s_{\max} = \log_2 N$ bits and identifies the next leaf node to be estimated. It can be composed by observing the path from the root node to the leaf node: from stage $s_{\max}$ down to 0, the corresponding bit of $i$ is set to 0 for every left branch and to 1 for every right branch.
When a leaf node is reached, the controller checks Node Sequence, identifying the leaf node as an information bit or a frozen bit. In case of a frozen bit, the paths are not split, and the bit is estimated only as 0. All the L path memories are updated with the same bit value, as are the LLR memories and the β memories. On the other hand, in case of an information bit, both 0 and 1 are considered. The paths are duplicated and the PMs are calculated for the 2L candidates according to (8) . They are subsequently filtered through the sorter module, designed for minimum latency. Every PM is compared to every other in parallel: dedicated control logic uses the resulting signals to return the values of the PMs of the surviving paths and the newly estimated bits they are associated with. The latter are used to update the LLR memories, the β memories and the path memories, while also being sent to the CRC calculation module to update the remainder.
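The all-to-all comparison performed by the sorter can be modeled in software as a rank computation: the rank of each candidate is the number of candidates that beat it, and the candidates with rank below $L$ survive. The sketch below is a behavioral illustration only; the function name is hypothetical, and the tie-breaking rule (the lower index wins between equal PMs) is an assumption, not a detail taken from the architecture.

```python
def select_survivors(pms, L):
    """Return the indices of the L candidates with the lowest PMs, best first.

    Every PM is compared with every other one; in hardware these comparisons
    can run in parallel, here they are two nested loops.
    """
    n = len(pms)
    ranks = []
    for i in range(n):
        # Number of candidates that beat candidate i (ties go to the lower index).
        rank = sum(1 for j in range(n)
                   if pms[j] < pms[i] or (pms[j] == pms[i] and j < i))
        ranks.append(rank)
    survivors = [None] * min(L, n)
    for i, rank in enumerate(ranks):
        if rank < L:
            survivors[rank] = i
    return survivors
```

Because the ranks form a permutation of the candidate indices, each of the $L$ output positions is driven by exactly one candidate.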
All memories in the decoder are implemented as registers: this allows the LLR and β values to be read, updated by the PEs, and written back in a single clock cycle. At the same time, the paths are either updated, or split and updated (depending on the constituent code), and the new PMs computed. In the following clock cycle, in case the paths were split, the PMs are sorted, paths are discarded and the CRC value updated. In case paths were not split, the PMs are not sorted, and the CRC update occurs in parallel with the following operation.
A. Memory Structure
The decoding flow described above relies on a number of memories that are shown in Fig. 12. The channel memory stores the $N$ LLR values received from the channel at the beginning of the decoding process. Each LLR value is quantized with $Q_{\mathrm{LLR}}$ bits and represented with sign and magnitude. The high and low stage memories store the intermediate $\alpha$ values. The number of PEs limits the number of concurrent (5) or (2) computations that can be performed: for a node in stage $s$, where $2^s > 2P$, a total of $2^s/(2P)$ time steps are needed to descend to the lower tree level. The depth of the high stage memory is thus $\sum_{j=\log_2 P + 1}^{\log_2 N - 1} 2^j/P = N/P - 2$, while its width is $Q_{\mathrm{LLR}} \times P$. On the other hand, the low stage memory stores the LLR values for stages where $2^s \leq 2P$: the width of this memory is $Q_{\mathrm{LLR}}$, while its depth is defined as $\sum_{j=0}^{\log_2 P - 1} P/2^j = 2P - 2$. Both high and low stage memory words are reused by nodes belonging to the same stage $s$, since once a subtree has been completely decoded, its LLR values are not needed anymore. While high and low stage memories are different for each path, the channel LLR values are shared among the $L$ datapaths. Table II summarizes
been completed, the β memories are reused for the right half. Finally, the PM memories store the L PM values computed in (8) .
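The closed forms quoted for the stage memory depths can be verified by evaluating the sums directly. The sketch below assumes that $N$ and $P$ are powers of two with $N > 2P$, and that the summation limits are $\log_2 P + 1 \leq j \leq \log_2 N - 1$ for the high stage memory and $0 \leq j \leq \log_2 P - 1$ for the low stage memory; the function names are hypothetical.

```python
def high_stage_depth(N, P):
    """Sum of 2^j / P over stages log2(P) + 1 <= j <= log2(N) - 1."""
    n = N.bit_length() - 1  # log2(N) for a power of two
    p = P.bit_length() - 1  # log2(P) for a power of two
    return sum((1 << j) // P for j in range(p + 1, n))

def low_stage_depth(P):
    """Sum of P / 2^j over 0 <= j <= log2(P) - 1."""
    p = P.bit_length() - 1
    return sum(P >> j for j in range(p))

if __name__ == "__main__":
    for N, P in ((1024, 64), (1024, 16), (256, 32)):
        print(N, P, high_stage_depth(N, P), N // P - 2,
              low_stage_depth(P), 2 * P - 2)
```

For $N = 1024$ and $P = 64$, the high stage memory has 14 words of $64\,Q_{\mathrm{LLR}}$ bits and the low stage memory has 126 words of $Q_{\mathrm{LLR}}$ bits.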
B. Special Nodes
The decoding flow and memory structure described before implement the standard SCL decoding algorithm. The SSCL, SSCL-SPC and the proposed Fast-SSCL and Fast-SSCL-SPC algorithms demand modifications in the datapath to accommodate the simplified computations for Rate-0, Rate-1, Rep and SPC nodes.
As with standard SCL, the pattern of frozen and information bits is known a priori given a polar code structure, and the same can be said for special nodes. In the modified architecture, the Node Sequence input in the controller (see Fig. 10) is not limited to the frozen/information bit pattern, but includes the type of encountered nodes, their size and the tree stage in which they are encountered. Table III summarizes the content of Node Sequence depending on the type of node for SSCL and SSCL-SPC, while in case of Fast-SSCL and Fast-SSCL-SPC Node Sequence is detailed in Table IV. The node stage allows the decoder to stop the tree exploration at the right level, and the node type identifies the operations to be performed. Each of the four node types is represented with one or more decoding phases, each of which involves a certain number of codeword bits, identified by the node size parameter. Finally, the frozen bit parameter identifies a bit or set of bits as frozen or not. To limit the decoder complexity, the maximum node stage for special nodes is limited to $s = \log_2 P$, thus the maximum node size is $P$. If the code structure identifies special nodes with node size larger than $P$, they are considered as composed of a set of $P$-size special nodes.
• Rate-0 nodes are identified in the Node Sequence with a single decoding phase. No path splitting occurs, and all the $2^s$ node bits are set to 0. The PM update requires a single time step, as discussed in [19].
• Rate-1 nodes are composed of a single phase in both SSCL and SSCL-SPC, in which paths are split $2^s$ times. In case of Fast-SSCL and Fast-SSCL-SPC, each Rate-1 node is divided into two phases. The first takes care of the $\min(S_{\text{Rate-1}}, 2^s)$ path forks, requiring as many time steps, while the second sets the remaining $2^s - \min(S_{\text{Rate-1}}, 2^s)$ bits according to (23) and updates the PM according to (24). This second phase takes a single time step.
• Rep nodes are identified by two phases in the Node Sequence, the first of which takes care of the $2^s - 1$ frozen bits similarly to Rate-0 nodes, and the second estimates the single information bit. Each of these two phases lasts a single time step.
• SPC nodes are split in three phases in the original SSCL-SPC formulation. The first phase takes care of the frozen bit, and computes both (16) and (17), initializing the PM as in (18) in a single time step. The extraction of the least reliable bit in (16) is performed through a comparison tree that carries over both the index and the value of the LLR. The second phase estimates the $2^s - 1$ information bits, splitting the path as many times in as many time steps. During this phase, each time a bit is estimated, it is XORed with the previous $\beta$ values: this operation is useful to compute (20). The update of $\beta_{i_{\min}}$ is finally performed in the third phase, which takes a single time step. Moving to Fast-SSCL-SPC, the second SPC phase is split in two, similarly to what happens to the Rate-1 node.
• Descend is a non-existing node type that is inserted for one clock cycle in Node Sequence for control purposes after every special node. The node size and stage associated with this label are those of the following node. The Descend node type is used by the controller module.
• Leaf nodes identify all nodes that can be found at s = 0, for which the standard SCL algorithm applies.
The decoding of special nodes requires a few major changes in the decoder architecture.
• Path Memory: each path memory is an array of N registers, granting concurrent access to all bits with a 1-bit granularity. In SCL, the path update is based on the combination of a write enable signal, the codeword bit index i that acts as a memory address, and the value of the estimated bit after the PMs have been sorted and the surviving paths identified. Fig. 13 shows the path memory access architecture for Fast-SSCL-SPC, whose inputs are the address, the input data, the write enable and the node type. Unlike SCL, the path memory is not always updated with the estimated bit û. Thus, the SCL datapath is bypassed according to the node type. When Node Sequence identifies RATE0, REP1 and SPC1 nodes, which consider frozen bits, the path memory is updated with 0 values. The estimated bit û is chosen as input for RATE1-1, REP2, SPC2-1 and LEAF nodes, where the path is split. RATE1-2 and SPC2-2 nodes estimate the bits through hard decision on the LLR values, while in the SPC3 case the update considers the result of (20). At the same time, whenever more than one bit is estimated, the corresponding bits in the path memory must be updated concurrently. Thus, the address becomes a range of addresses for RATE0, RATE1-2, REP1 and SPC2-2.
• β Memory: the update of this memory depends on the value of the estimated bit. In order to limit the latency cost of these computations, concurrently to the estimation of û, the updated values of all the bits of the β memory are computed assuming both û = 0 and û = 1. The actual value of û is used as a selection signal to choose between the two alternatives. The β memory in SCL, unlike the path memory, already foresees the concurrent update of multiple entries that are selected based on the bit index i. Given an estimated leaf node, the β values of all the stages that it affects are updated: in fact, since as shown in (3) the update of β values is at most a series of XORs, it is possible to distribute this operation in time. The same can be said of multi-bit (3) updates. To implement Fast-SSCL-SPC, the β update selection logic must be modified to foresee the special nodes, similar to that portrayed in Fig. 13 for the path memory. For RATE0, REP1, and SPC1, the û = 0 update is always selected. RATE1-1, REP2, SPC2-1 and LEAF nodes maintain the standard SCL selection based on the actual value of û. The update for the SPC3 case is based on $\beta_{i_{\min}}$. For RATE1-2 and SPC2-2, the selection is based on the XORed sign bits of the LLR values read from the memory.
• PM Calculation: in the original SCL architecture, and for leaf nodes in general, this operation is performed according to (8). The paths and associated PMs are split and sorted every time an information bit is estimated, while PMs are updated without sorting when frozen bits are encountered. While the sorting architecture remains the same, the implementation of the proposed algorithm requires a different PM update structure for each special node. Unlike with leaf nodes, the LLR values needed for the PM update in special nodes are not the output of PEs, and are read directly from the LLR memories. Additional bypass logic is thus needed. For RATE0 and REP1, (10b) and (11b) require a summation over up to P values, while SPC1 nodes need to perform the minimum α search (16): these operations are tackled through adder and comparator trees. RATE1-1, REP2 and SPC2-1 PM updates are handled similarly to the leaf node case, since a single bit at a time is being estimated. RATE1-2, SPC2-2 and SPC3 do not require any PM to be updated.
• CRC Calculation: the standard SCL architecture foresees the estimation of a single bit at a time. Thus, the CRC is computed sequentially. However, Rate-0 and Rep nodes in SSCL and SSCL-SPC estimate up to P and P − 1 bits concurrently. Thus, for the CRC operation not to become a latency bottleneck, the CRC calculation must be parallelized by updating the remainder. Following the idea presented in [20] , it is possible to allow for variable input sizes with a high degree of resource sharing and limited overall complexity. The circuit is further simplified by the fact that both Rate-0 and Rep nodes guarantee that the estimated bit values are all 0. Fig. 14 shows the modified CRC calculation module in case P = 64, where N CRC represents the number of concurrently estimated bits: the estimated bit can be different from 0 only in case of leaf nodes and s = 1 Rep nodes, for which a single bit is estimated in any case. The Fast-SSCL and Fast-SSCL-SPC architectures follow the same idea, but require additional logic. RATE1-2 and SPC2-2 nodes introduce new degrees of parallelism, as up to P − S Rate-1 and P − S SPC bits are updated at the same time. Moreover, it is not possible to assume that these bits are 0 as with RATE0 and REP1. The value of the estimated bit must be taken into account, leading to increased complexity.
• Controller: this module in the SCL architecture is tasked with the generation of memory write enables, the update of the codeword bit index i and the stage tracker s, along with the LLR memory selection signals according to Table II and the path enable signals. Supporting the special nodes demands that most of the control signal generation logic be modified. Of particular importance is the fact that, in the SCL architecture, the update of i is bound to having reached a leaf node, i.e. s = 0. In Fast-SSCL-SPC, it is instead linked to s being equal to the special node stage. Moreover, the index i is incremented by the number of bits estimated in a single time step, which depends on the type of node. Memory write enables are also bound to having reached the special node stage, and not only to s = 0.
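The node-dependent selection of the path memory input described above can be summarized in a few lines. The sketch below is a behavioural model only, not the VHDL of the paper: the function names and arguments are hypothetical, and the hard-decision convention (a non-negative LLR maps to bit 0) is an assumption.

```python
ZERO_PHASES = {"RATE0", "REP1", "SPC1"}              # frozen bits: write zeros
SPLIT_PHASES = {"RATE1-1", "REP2", "SPC2-1", "LEAF"} # path split: write u_hat
HARD_PHASES = {"RATE1-2", "SPC2-2"}                  # hard decision on LLRs


def hard_decision(llr: float) -> int:
    # Assumed convention: non-negative LLR maps to bit 0.
    return 0 if llr >= 0 else 1


def path_memory_data(phase: str, n_bits: int = 1, u_hat: int = 0,
                     llrs=(), parity: int = 0) -> list:
    """Bits written to the path memory in one time step for a node phase."""
    if phase in ZERO_PHASES:
        return [0] * n_bits                  # a range of addresses may be written
    if phase in SPLIT_PHASES:
        return [u_hat]                       # one bit per time step
    if phase in HARD_PHASES:
        return [hard_decision(a) for a in llrs]  # several bits at once
    if phase == "SPC3":
        return [parity]                      # value obtained from (20)
    raise ValueError(f"unknown node phase: {phase}")


if __name__ == "__main__":
    print(path_memory_data("REP1", n_bits=7))
    print(path_memory_data("RATE1-2", llrs=[1.5, -0.25, 3.0]))
```

The length of the returned list also models the amount by which the controller advances the bit index i in that time step.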
V. RESULTS
A. Hardware Implementation
The architecture designed in Section IV has been described in the VHDL language and synthesized in TSMC 65 nm CMOS technology. Implementation results are provided in Table V for different decoders: along with the Fast-SSCL and Fast-SSCL-SPC described in this work, the SCL, SSCL and SSCL-SPC decoders proposed in [19] are presented as well. Each decoder has been synthesized with three list sizes (L = 2, 4, 8), while the Fast-SSCL and Fast-SSCL-SPC architectures have been synthesized considering different combinations of $S_{\text{Rate-1}}$ and $S_{\text{SPC}}$, as portrayed in Section III-B. Quantization values are the same as those used in [19], i.e. 6 bits for LLR values and 8 bits for PMs, with two fractional bits each. All memory elements have been implemented through registers and the area results include both net area and cell area. The reported throughput is coded.
All Fast-SSCL and Fast-SSCL-SPC decoders, regardless of the value of $S_{\text{Rate-1}}$ and $S_{\text{SPC}}$, show a substantial increase in area occupation with respect to SSCL and SSCL-SPC. Three main factors contribute to the additional area overhead:
• In SSCL and SSCL-SPC, the CRC computation needs to be parallelized, since in Rep and Rate-0 nodes multiple bits are updated at the same time. However, the bit value is known at design time, since they are frozen bits. This, along with the fact that 0 is neutral in the XOR operations required by CRC calculation, limits the required additional area overhead. On the contrary, in Fast-SSCL and Fast-SSCL-SPC, Rate-1 and SPC nodes update multiple bits within the same time step (SPC2-2 and RATE1-2 stages). In these cases, however, they are information bits, whose values cannot be known at design time: the resulting parallel CRC tree is substantially wider and deeper than the ones for Rate-0 and Rep nodes. Moreover, with increasing number of CRC trees, the selection logic becomes more cumbersome.
• A similar situation is encountered for the β memory update signal. As described in the previous section, the β memory update values are computed assuming both estimated values, and the actual value of û is used as a selection signal. In SSCL and SSCL-SPC the multiple-bit update does not constitute a problem since all the estimated bits are 0 and the β memory content does not need to be changed. On the contrary, in Fast-SSCL and Fast-SSCL-SPC, the value of the estimated information bits might change the content of the β memory. Moreover, since β is computed as (3), the update of β bits depends on previous bits as well as the newly estimated ones. Thus, an XOR tree is necessary to compute the right selection signal for every information bit estimated in SPC2-2 and RATE1-2 stages.
• The aforementioned modifications considerably lengthen the system critical path. In case of large code length, small list size, or large P, the critical path starts in the controller module, in particular in the high stage memory addressing logic, goes through the multiplexing structure that routes LLR values to the PEs, and ends after the PM update. In case of large list sizes or short code length, the critical path passes through the PM sorting and path selection logic, and through the parallel CRC computation. Thus, pipeline registers have been inserted to lower the impact of the critical path, at the cost of additional area occupation.
Fast-SSCL and Fast-SSCL-SPC implementations show consistent throughput improvements with respect to previously proposed architectures. The gain is lower than what is shown to be theoretically achievable in Fig. 4. This is due to the aforementioned pipeline stages, which increase the number of steps needed to complete the decoding of component codes.
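The difference between the CRC logic for known-zero bits and for information bits can be illustrated with a bit-serial CRC model. The sketch below is not the circuit of Fig. 14: the function names are hypothetical and the 8-bit polynomial 0x07 is an arbitrary choice made for illustration. It relies on the linearity of the CRC over GF(2): a multi-bit update splits into a term that depends only on the current remainder and a term that depends only on the data bits. When all bits are 0 the second term vanishes, whereas for RATE1-2 and SPC2-2 it has to be computed from the estimated bits.

```python
def crc_step(rem: int, bit: int, poly: int = 0x07, width: int = 8) -> int:
    """One serial CRC step (MSB-first LFSR, zero initial value, no final XOR)."""
    msb = (rem >> (width - 1)) & 1
    rem = (rem << 1) & ((1 << width) - 1)
    if msb ^ bit:
        rem ^= poly
    return rem


def crc_update(rem: int, bits, poly: int = 0x07, width: int = 8) -> int:
    """Sequential update: one step per estimated bit."""
    for b in bits:
        rem = crc_step(rem, b, poly, width)
    return rem


def crc_update_split(rem: int, bits, poly: int = 0x07, width: int = 8) -> int:
    """Same update written as two independent terms XORed together."""
    bits = list(bits)
    state_part = crc_update(rem, [0] * len(bits), poly, width)  # remainder only
    data_part = crc_update(0, bits, poly, width)                # data bits only
    return state_part ^ data_part


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    print(crc_update(0xA5, bits) == crc_update_split(0xA5, bits))
```

In hardware terms, `state_part` corresponds to the logic already needed for Rate-0 and Rep nodes, while `data_part` corresponds to the additional XOR tree over the estimated information bits.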
B. Comparison with Previous Works
The Fast-SSCL-SPC hardware implementation presented in this paper for P(1024, 512) and P = 64 is compared with the state-of-the-art architectures in [11]-[14], [19] and the results are provided in Table VI. Results reported in 90 nm technology are converted to 65 nm technology using a factor of 90/65 for the frequency and a factor of $(65/90)^2$ for the area. The synthesis results in [11] were carried out in 65 nm technology but reported in 90 nm technology. Therefore, a reverse conversion was applied to convert the results back to 65 nm technology.
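The technology scaling mentioned above is a simple first-order rule and can be written out as follows. This is a minimal sketch: the function name is hypothetical and the input values in the usage line are made-up numbers, not results from any of the cited works.

```python
def scale_90nm_to_65nm(freq_mhz: float, area_mm2: float):
    """Scale 90 nm results to 65 nm: frequency by 90/65, area by (65/90)^2."""
    freq_scaled = freq_mhz * 90.0 / 65.0
    area_scaled = area_mm2 * (65.0 / 90.0) ** 2
    return freq_scaled, area_scaled


if __name__ == "__main__":
    # Illustrative inputs only (not taken from Table VI).
    f65, a65 = scale_90nm_to_65nm(650.0, 2.0)
    print(f"{f65:.1f} MHz, {a65:.3f} mm^2")
```

Since throughput is proportional to frequency, the same 90/65 factor applies to a scaled throughput figure, and latency scales by its inverse.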
The architecture in this paper shows 72% higher throughput and 42% lower latency with respect to the multibit decision SCL decoder architecture of [11] for L = 4. However, the area occupation of [11] is smaller, leading to a higher area efficiency than the design in this paper.
The symbol-decision SCL decoder architecture of [12] shows lower area occupation than the design in this paper for L = 4, but it comes at the cost of lower throughput and higher latency. Our decoder architecture achieves 192% higher throughput and 66% lower latency than [12], resulting in 17% higher area efficiency.
The high throughput SCL decoder architecture of [13] for L = 2 requires lower area occupation than our design, but it comes at the expense of lower throughput and higher latency. Moreover, the design in [13] relies on parameters that need to be tuned for each code, and it is shown in [13] that a change of code can result in more than 0.2 dB error-correction performance loss. For L = 4, our decoder not only achieves higher throughput and lower latency than [13], but also occupies a smaller area. This in turn yields a 12% increase in the area efficiency in comparison with [13].
The multimode SCL decoder in [14] relies on a higher number of PEs than our design: nevertheless, it yields lower throughput and higher latency than the architecture proposed in this paper for L = 4. It should be noted that [14] is based on the design presented in [13] , whose code-specific parameters may lead to substantial error-correction performance degradation. On the contrary, the design in this paper is targeted for speed and flexibility and can be used to decode any polar code of any length.
Compared to our previous work [19], which has the same degree of flexibility as the proposed design, this decoder achieves 51% higher throughput and 34% lower latency for L = 2, and 40% higher throughput and 28% lower latency for L = 4. However, the higher area occupation of the new design yields lower area efficiencies than [19] for L = {2, 4}. For L = 8, the proposed design has 39% higher throughput and 29% lower latency than [19], which results in a 9% increase in area efficiency. The reason is that for L = 8, the sorter is quite large and falls on the critical path. Consequently, the maximum achievable frequency for the proposed design is limited by the sorter and not by Rate-1 and SPC nodes, as opposed to the L = {2, 4} case. This results in the same maximum achievable frequency for both designs and, hence, higher throughput and area efficiency.
Fig. 15 plots the area occupation against the decoding latency for all the decoders considered in Table VI. For each value of L, the designs proposed in this work have the shortest latency, shown by their leftmost position on the graph.
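The relation between the throughput and latency figures quoted in this comparison can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in this section: the coded throughput is taken to be inversely proportional to the decoding latency for a fixed code length, and area efficiency is taken to be throughput divided by area. The function names are hypothetical.

```python
def latency_reduction_pct(throughput_gain_pct: float) -> float:
    """Latency reduction implied by a throughput gain, assuming T = N / latency."""
    ratio = 1.0 + throughput_gain_pct / 100.0
    return (1.0 - 1.0 / ratio) * 100.0


def area_efficiency(throughput_gbps: float, area_mm2: float) -> float:
    """Assumed definition: throughput per unit area, in Gb/s/mm^2."""
    return throughput_gbps / area_mm2


if __name__ == "__main__":
    for gain in (51, 72, 192):
        print(gain, round(latency_reduction_pct(gain), 1))
    # Throughput and area of the proposed decoder for L = 2, as reported in this paper.
    print(round(area_efficiency(1.861, 1.048), 3))
```

Under these assumptions, throughput gains of 51%, 72% and 192% correspond to latency reductions of about 34%, 42% and 66%, which agrees with the pairs quoted in the text.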
VI. CONCLUSION
In this work, we have proven that the list size in polar decoders sets a limit to the useful number of path forks in Rate-1 and SPC nodes. We thus propose Fast-SSCL and Fast-SSCL-SPC polar code decoding algorithms that, depending on L and the number of performed path forks, can reduce the number of required time steps by more than 75% at no error-correction performance cost. Hardware architectures for the proposed algorithms have been described and implemented in CMOS 65 nm technology. They have a very high degree of flexibility and can decode any polar code, regardless of its rate. The proposed decoder is the fastest SCL-based decoder in literature: sized for N = 1024 and L = 2, it yields a 1.861 Gb/s throughput with an area occupation of 1.048 mm². The same design, sized for L = 4 and L = 8, leads to throughputs of 1.608 Gb/s and 1.198 Gb/s, and areas of 1.822 mm² and 3.975 mm², respectively.
[Fig. 15: area occupation versus latency [µs] for this work and the decoders of [11], [12], [13], [14] and [19].]
APPENDIX A PROOF OF THEOREM 2
Proof. In order to prove Theorem 2, we note that the first step is to initialize the PMs based on (18). Therefore, the least reliable bit needs to be estimated first. For the bits other than the least reliable bit, the PMs are updated based on (19). However, the term $(1 - 2\gamma)|\alpha_{i_{\min}}|$ is constant for all the bit estimations in the same path. Therefore, we can define a new set of $N_s - 1$ LLR values as
$$\alpha_{i_m} = \alpha_i + \operatorname{sgn}(\alpha_i)(1 - 2\gamma)|\alpha_{i_{\min}}|$$
for $i \neq i_{\min}$ and $0 \le i_m < N_s - 1$, which results in
$$|\alpha_{i_m}| = |\alpha_i| + (1 - 2\gamma)|\alpha_{i_{\min}}|.$$
The problem is now reduced to a Rate-1 node of length N s −
