We consider practical hardware implementation of polar decoders. To reduce the latency caused by the serial nature of successive cancellation, existing optimizations improve parallelism with two approaches: multi-bit decision and reduced path splitting. In this paper, we combine the two procedures into one with an error-pattern-based architecture, which simultaneously generates a set of candidate paths for multiple bits from pre-stored patterns. For rate-1 (R1) and single parity-check (SPC) nodes, we prove that a small number of deterministic patterns suffices to guarantee performance preservation. For general nodes, low-weight error patterns are indexed by syndrome in a look-up table and retrieved in O(1) time. The proposed flip-syndrome-list decoder fully parallelizes all constituent code blocks without sacrificing performance, and is thus suitable for ultra-low-latency applications. In addition, two code construction optimizations are presented to further reduce complexity and improve performance.
Polar codes [1], [2] have been selected for the fifth generation (5G) wireless standard. With state-of-the-art code construction techniques [3]-[5] and the successive-cancellation (SC) list (SCL) decoding algorithm [6]-[13], Polar codes demonstrate competitive performance over LDPC and Turbo codes in terms of block error rate (BLER). Beyond 5G, ultra-low decoding latency emerges as a key requirement for applications such as autonomous driving and virtual reality. The latency of practical Polar decoders, e.g., an SCL decoder with list size L = 8, is relatively long due to the serial nature of the processing.
The fundamental SC decoding can be represented as a binary tree search, i.e., decoding a length-N code (a parent node) is recursively decomposed into decoding two length-N/2 codes (child nodes), then four length-N/4 codes, and so on. The nodes (codes after decomposition) are categorized by their code rates:
• Those with no information bit are called rate-0 (R0) nodes;
• Those with only 1 information bit are called repetition (Rep) nodes;
• Those with no frozen bit are called rate-1 (R1) nodes;
• Those with only 1 frozen bit are called single parity-check (SPC) nodes;
• The nodes satisfying none of the above are general (Gen) nodes.
Based on this ''tree-search representation'', continuous efforts [7]-[13] have been made to significantly reduce decoding latency. Among them, we are particularly interested in hardware implementations, which dominate real-world products owing to their better power and area efficiency. This implies that the maximal latency, rather than the average latency targeted by some software optimizations, is considered. According to our cross-validation, three approaches are cost-effective and incur no or negligible performance loss compared to the original SCL decoder, as summarized below:
1) Pruning the SC decoding tree [7] (parallelizing constituent code blocks with multi-bit decision)
• R0 and Rep nodes [8], [9].
• Gen nodes comprised of certain consecutive bits [10], [11].
2) Reducing the number of path splitting
• R1 and SPC nodes [9].
• Do not split at the most reliable bits [12], [13].
3) Reducing the latency of list pruning
• Adopt bitonic sort [14] for efficient pruning.
• Quick list pruning [15].
B. MOTIVATION AND OUR CONTRIBUTIONS
It is well known that an SC decoder requires 2N − 2 time steps for a length-N code [1] . The SC decoding factor graph reveals that, the main source of latency is the left hand side (LHS, or information bit side) of the graph. In contrast, the right hand side (RHS, or codeword side) of the graph consists of independent code blocks and already supports parallel decoding. With the above observations, the key to low-latency decoding is to parallelize LHS processing.
An SCL decoder [6] runs L instances of SC decoders in parallel and keeps the L most likely decoding paths, i.e., those with the smallest path metrics (PMs). Hardware implementations of SC/SCL decoders were pioneered by Alamdar-Yazdi and Kschischang [7], Hashemi et al. [8], [9], Sarkis et al. [10], and Yuan and Parhi [11], which prune the SC decoding tree by not traversing the four special nodes [7]. For SCL decoders, the corresponding PMs are directly updated at the parent node [8]. Even so, there is still room for further optimization:
• The processing of an R1/SPC node is not fully parallel (e.g., a number of sequential path extension & pruning are still required [9] ). A higher degree of parallelism can be exploited to further reduce latency.
• Optimizations (e.g., parallel processing) are applied only to some special nodes (e.g., R0/Rep/SPC/R1), and the length of such blocks, denoted by $B$, is often short due to insufficient polarization. According to our measurement under typical code lengths, the main source of latency is now the general nodes, whose constituent code rates are between $2/B$ and $(B-2)/B$.
Motivated by Alamdar-Yazdi and Kschischang [7], Hashemi et al. [8], [9], Sarkis et al. [10], and Yuan and Parhi [11], and thanks to the recent advances in efficient list pruning [14], [15], we find it profitable to further improve parallelism for ultra-low-latency applications. Our contributions are summarized below:
1) We propose to fully parallelize the processing of R1/SPC nodes via multi-bit hard estimation and flipping at intermediate stages. Only one-time path extension/pruning per node is required, by applying a small number of flipping patterns to the raw hard estimation. Such simplification is proven to preserve performance.
2) For general nodes, we apply flip-syndrome-list (FSL) decoding to constituent code blocks. Specifically, a small set of low-weight error patterns is pre-stored in a table indexed by syndrome. During decoding, the syndrome is calculated per constituent code block. Its associated error patterns are retrieved from the syndrome table and used for bit-flip-based sub-path generation. Similar to R1/SPC nodes, the FSL decoder narrows down the candidates for path extension and enjoys the simplicity of a hard-input decoder. The proposed optimization is shown to incur negligible performance loss.
3) The complexity of an FSL decoder is mainly incurred by constituent code blocks with medium rates. We propose to re-adjust the distribution of information bits in order to avoid certain constituent code rates, such that decoder complexity can be significantly reduced. We show that the performance loss can be negligible.
4) With the FSL decoder's capability to decode arbitrary linear outer constituent codes, not necessarily Polar codes, we propose to adopt hybrid outer codes with optimized weight spectrum. The hybrid-Polar codes demonstrate better performance than the original Polar codes.
The paper is organized as follows. Section II introduces the fundamentals of Polar SCL decoding. Section III provides the details of the FSL decoder, including R1/SPC nodes, general nodes, latency analysis and BLER performance. Section IV proposes two improved code construction methods that benefit from the FSL decoder architecture. Section V concludes the paper.
II. POLAR CODES AND SCL DECODING
A binary Polar code of mother code length $N = 2^n$ can be defined by $\boldsymbol{c} = \boldsymbol{u}G$ and a set of information sub-channel indices $\mathcal{I}$. The information bits are assigned to the sub-channels with indices in $\mathcal{I}$, i.e., $\boldsymbol{u}_{\mathcal{I}}$, and the frozen bits (zero-valued by default) are assigned to the rest of the sub-channels. The Polar kernel matrix is $G = F^{\otimes n}$, where $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$ is the kernel, $\otimes$ denotes the Kronecker power, and $\boldsymbol{c}$ is the codeword. The BPSK symbols $\boldsymbol{x} = 1 - 2\boldsymbol{c}$ are transmitted over an additive white Gaussian noise (AWGN) channel, and the received vector is $\boldsymbol{y}$.
For completeness, the original SCL decoder [6] is briefly revisited. The SC decoding factor graph of a length-$N$ Polar code consists of $N \times (\log_2 N + 1)$ nodes, as shown in Fig. 1. The row indices $i \in \{0, 1, \cdots, N-1\}$ denote the $N$ bit indices. The column indices $s \in \{0, 1, \cdots, \log_2 N\}$ denote decoding stages, with $s = 0$ labeling the information bit side and $s = \log_2 N$ labeling the input LLR side (or codeword side). Each node in the factor graph can be indexed by an $(s, i)$ pair, and is associated with a soft LLR value $\alpha_{s,i}$, which is initialized by $\alpha_{\log_2 N, i} = y_i$, and a hard estimate $\beta_{s,i}$.
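The encoding relation $\boldsymbol{c} = \boldsymbol{u}F^{\otimes n}$ and the BPSK mapping above can be illustrated with a short sketch. It uses plain Python lists over GF(2); the helper names `kron_power`, `polar_encode` and `bpsk` are illustrative assumptions, and the dense matrix product is chosen for clarity rather than efficiency.

```python
def kron_power(n):
    """Return F^{(x)n} over GF(2) as a list of rows, with F = [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(n):
        # One Kronecker step with F: the new matrix is [[G, 0], [G, G]].
        top = [row + [0] * len(row) for row in g]
        bottom = [row + row for row in g]
        g = top + bottom
    return g


def polar_encode(u):
    """Compute c = u * F^{(x)n} over GF(2) for a length-2^n bit list u."""
    size = len(u)
    n = size.bit_length() - 1
    if size != 1 << n:
        raise ValueError("length of u must be a power of 2")
    g = kron_power(n)
    return [sum(u[i] & g[i][j] for i in range(size)) % 2 for j in range(size)]


def bpsk(c):
    """Map bits to BPSK symbols x = 1 - 2c (bit 0 -> +1, bit 1 -> -1)."""
    return [1 - 2 * b for b in c]


if __name__ == "__main__":
    u = [1, 0, 1, 1]
    c = polar_encode(u)
    print(c, bpsk(c))
    print(polar_encode(c) == u)  # F^{(x)n} is its own inverse over GF(2)
```

Because the kernel matrix is an involution over GF(2), the same routine also maps a codeword back to its input vector, which is the property the block decoders rely on later when recovering information bits from a hard codeword.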
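For all $s$ and $i$ satisfying $(i \bmod 2^{s+1}) < 2^s$, a hardware-friendly right-to-left updating rule for $\alpha$ is:
$$\alpha_{s,i} = \mathrm{sgn}(\alpha_{s+1,i})\,\mathrm{sgn}(\alpha_{s+1,i+2^s})\,\min(|\alpha_{s+1,i}|, |\alpha_{s+1,i+2^s}|),$$
$$\alpha_{s,i+2^s} = (1 - 2\beta_{s,i})\,\alpha_{s+1,i} + \alpha_{s+1,i+2^s}.$$
The hard estimate of the $i$-th bit is $\beta_{0,i} = \frac{1 - \mathrm{sgn}(\alpha_{0,i})}{2}$. The corresponding left-to-right updating rule for $\beta$ is:
$$\beta_{s+1,i} = \beta_{s,i} \oplus \beta_{s,i+2^s}, \qquad \beta_{s+1,i+2^s} = \beta_{s,i+2^s}.$$
An SCL decoder with list size $L$ executes a path split upon each information bit, and preserves the $L$ paths with the smallest PMs. Given the $l$-th path with $\hat{u}^l_i$ as the $i$-th hard output bit, a hardware-friendly PM updating rule [16] is
$$\mathrm{PM}^l_i = \begin{cases} \mathrm{PM}^l_{i-1}, & \text{if } \hat{u}^l_i = \beta^l_{0,i}, \\ \mathrm{PM}^l_{i-1} + |\alpha^l_{0,i}|, & \text{otherwise}, \end{cases}$$
and $\mathrm{PM}^l_0 = 0$, where $\mathrm{PM}^l_i$ denotes the PM of the $l$-th path at bit index $i$, and $\alpha^l_{0,i}$ and $\beta^l_{0,i}$ denote its corresponding soft LLR and hard estimation, respectively.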
After decoding the last bit, the first path¹ is selected as the decoding output.
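The per-butterfly operations of SC/SCL decoding can be sketched as follows. This is a minimal illustration under stated assumptions: the min-sum approximation is used for the LLR update, a non-negative LLR is decided as bit 0 (consistent with $\boldsymbol{x} = 1 - 2\boldsymbol{c}$), and the PM grows by $|\alpha|$ only when the chosen bit disagrees with the hard estimate. The function names are hypothetical.

```python
def sgn(x):
    """Sign with sgn(0) = +1, so that a zero LLR is decided as bit 0."""
    return -1.0 if x < 0 else 1.0


def f_update(a_up, a_low):
    """Min-sum LLR for the upper branch of a butterfly (assumed approximation)."""
    return sgn(a_up) * sgn(a_low) * min(abs(a_up), abs(a_low))


def g_update(a_up, a_low, beta_up):
    """LLR for the lower branch, given the hard estimate of the upper branch."""
    return (1 - 2 * beta_up) * a_up + a_low


def hard(alpha):
    """Hard estimate beta = (1 - sgn(alpha)) / 2."""
    return 0 if alpha >= 0 else 1


def pm_update(pm_prev, alpha, u_hat):
    """Path metric after deciding u_hat: penalize only disagreement with hard(alpha)."""
    return pm_prev if u_hat == hard(alpha) else pm_prev + abs(alpha)


if __name__ == "__main__":
    print(f_update(2.0, -3.0))        # -2.0
    print(g_update(2.0, -3.0, 1))     # -5.0
    print(pm_update(1.0, -0.5, 0))    # 1.5
```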
III. FLIP-SYNDROME-LIST (FSL) DECODING
SC-based decoding of a length-$N$ Polar code requires $\log_2 N + 1$ stages to propagate the received signal ($s = \log_2 N$) to the information bits ($s = 0$). The degree of parallelism is $2^s$, i.e., it is halved after each decoding stage.
To increase parallelism, we propose to terminate the LLR propagation at the intermediate stage $s = \log_2 B$, and process all length-$B$ constituent code blocks with a hard-input decoder.² The design is detailed throughout this section, where the differences from existing works mainly include (i) fully parallelized processing for $B$ bits and $L$ paths, and (ii) support for arbitrary-rate blocks rather than only special ones (e.g., R0/Rep/SPC/R1).
A. MULTI-BIT HARD DECISION AT INTERMEDIATE STAGE
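The indices of a constituent code block are denoted by $\mathcal{B} \triangleq \{i, i+1, \cdots, i+B-1\}$. Once the soft LLRs at the $s$-th stage are obtained, where $s = \log_2 B$, a raw hard estimation is immediately obtained by
$$\beta_{s,j} = \frac{1 - \mathrm{sgn}(\alpha_{s,j})}{2}, \quad \forall j \in \mathcal{B}.$$
In contrast to SCL, which uses the soft LLRs $\boldsymbol{\alpha}_{s,\mathcal{B}}$, a constituent block decoder takes $\boldsymbol{\beta}_{s,\mathcal{B}}$ as its hard input, and directly generates a hard codeword $\hat{\boldsymbol{\beta}}_{s,\mathcal{B}}$ as decoded output.
The hard-input decoders for R1, SPC and general nodes are described in Sections III-B and III-C. For now, we assume such a decoder outputs a hard codeword $\hat{\boldsymbol{\beta}}_{s,\mathcal{B}}$ for each candidate path, and recovers the corresponding information vector by
$$\hat{\boldsymbol{u}}_{\mathcal{B}} = \hat{\boldsymbol{\beta}}_{s,\mathcal{B}}\, F^{\otimes \log_2 B}.$$
Given the soft LLRs $\boldsymbol{\alpha}_{s,\mathcal{B}}$ and the recovered codeword $\hat{\boldsymbol{\beta}}_{s,\mathcal{B}}$, the multi-bit version of the PM updating rule [8] is
$$\mathrm{PM}^l_{i+B-1} = \mathrm{PM}^l_{i-1} + \sum_{j \in \mathcal{B}} \left|\hat{\beta}^l_{s,j} - \beta^l_{s,j}\right| \cdot \left|\alpha^l_{s,j}\right|. \tag{3}$$
The remaining updating of $\alpha$ and $\beta$ is based on the hard decision $\hat{\boldsymbol{\beta}}_{s,\mathcal{B}}$ rather than the raw estimation $\boldsymbol{\beta}_{s,\mathcal{B}}$.
¹ For CRC-aided Polar codes, the first path that passes the CRC check is selected.
² In the following, we define $s = \log_2 B$ as the ''hard-decision stage'' unless otherwise specified.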
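The three block-level operations of this subsection (raw hard decision, PM update against a chosen codeword, and recovery of the information vector) can be sketched together. The sketch assumes that the block PM adds $|\alpha_{s,j}|$ at every position where the chosen codeword differs from the raw hard decision, and that the information vector is obtained by re-applying the length-$B$ polar transform; the function names are illustrative.

```python
def hard_decision(alpha_block):
    """Raw hard estimation of a whole block: bit 0 for non-negative LLRs."""
    return [0 if a >= 0 else 1 for a in alpha_block]


def block_pm(pm_prev, alpha_block, beta_hat):
    """Multi-bit PM update: penalize |alpha| wherever beta_hat flips the raw decision."""
    raw = hard_decision(alpha_block)
    penalty = sum(abs(a) for a, b, bh in zip(alpha_block, raw, beta_hat) if b != bh)
    return pm_prev + penalty


def recover_info(beta_hat):
    """Compute u = beta_hat * F^{(x)log2(B)} over GF(2) with an in-place butterfly."""
    u = list(beta_hat)
    step = 1
    while step < len(u):
        for start in range(0, len(u), 2 * step):
            for j in range(start, start + step):
                u[j] ^= u[j + step]   # upper output = upper XOR lower
        step *= 2
    return u


if __name__ == "__main__":
    alpha = [1.5, -0.2, 0.7, -2.0]
    print(hard_decision(alpha))                 # [0, 1, 0, 1]
    print(block_pm(0.0, alpha, [0, 0, 0, 1]))   # one flip at the weakest position
    print(recover_info([1, 1, 0, 1]))           # [1, 0, 1, 1]
```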
B. PARALLELIZED PATH EXTENSION VIA BIT FLIPPING
1) RATE-1 NODES
For an R1 node, the state-of-the-art decoding method [9] requires $\min(L-1, B)$ path extensions. First, the input soft LLRs $\boldsymbol{\alpha}^l_{s,\mathcal{B}}$ of each list path are sorted. Then, path extensions are performed only on the $\min(L-1, B)$ least reliable LLR positions to reduce complexity. Such simplification incurs no performance loss since additional path extensions are proven to be redundant [9]. The search space becomes $L \times 2^{\min(L-1,B)}$, much smaller than the $L \times 2^B$ of conventional SCL [6] and SSCL [8]. Another work [17] also proposes to reduce the search space for R1 nodes, but its candidate path generation is LLR-dependent and is thus better suited to software implementation, as suggested in [17].
In this paper, we focus on hardware implementation and propose a parallel path extension based on pre-stored error patterns. As shown in Fig. 2, only one-time path extension and pruning is required for a constituent block. The optimization exploits the deterministic partial ordering of the incremental PMs within a block. Accordingly, the search for surviving paths can be narrowed down to a limited set, which is pre-stored in the form of error patterns in a look-up table (LUT). The LUT is shown to be very small for a practical list size L = 8. As such, the advantages are:
• B bits are decoded in parallel.
• Sub-paths are generated in parallel.
• The above two procedures are combined into one.
Notation 1 (Soft/Hard Vectors): The soft LLR input of a constituent block is indexed in ascending reliability order, i.e., $\boldsymbol{\alpha}^l_{s,\mathcal{B}}$ is such that $|\alpha^l_{s,0}| < |\alpha^l_{s,1}| < \cdots < |\alpha^l_{s,B-1}|$ for each list path. The corresponding raw hard estimation is denoted by $\boldsymbol{\beta}^l_{s,\mathcal{B}} \triangleq [\beta^l_{s,0}, \beta^l_{s,1}, \cdots, \beta^l_{s,B-1}]$.
Notation 2 (Sub-Paths Extension): For a constituent block with indices $\mathcal{B}$, a sub-path that extends from the $i$-th bit to the $(i+B-1)$-th bit is well defined by the blockwise decoding output. For example, the $t$-th sub-path of the $l$-th path is denoted by the vector $\hat{\boldsymbol{\beta}}^{l,t}_{s,\mathcal{B}}$, which is generated by flipping $\boldsymbol{\beta}^l_{s,\mathcal{B}}$ based on an error pattern $\boldsymbol{e}$. A single-bit-error pattern is denoted by $\boldsymbol{e}_p$ if it has a one at the $p$-th bit position ($p = 0, 1, \cdots$) and zeros elsewhere.
For $L = 8$, we narrow down the search space per list path from $2^{\min(L-1,B)}$ to 13 by the following error patterns.
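Patterns-R1: For each path in an SCL decoder with $L = 8$, its $L$ maximum-likelihood sub-paths (i.e., those with minimum incremental PMs) fall into a deterministic set of size 13. These sub-paths can be obtained by bit flipping the original hard estimation of each list path based on the following error patterns:
$$\hat{\boldsymbol{\beta}}^{l,t}_{s,\mathcal{B}} = \begin{cases} \boldsymbol{\beta}^l_{s,\mathcal{B}}, & t = 0, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_{t-1}, & t = 1, \cdots, 7, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_1, & t = 8, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_2, & t = 9, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_2, & t = 10, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_3, & t = 11, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_2, & t = 12. \end{cases} \tag{4}$$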
Proof: To survive among the sub-paths of all $L$ paths, a sub-path must first survive among the sub-paths of its own parent path. That means for each parent path, we only need to consider its $L$ maximum-likelihood sub-paths. Altogether, there are at most $L^2$ sub-paths to be considered.
According to (3), the PM penalty is incurred only on the flipped positions. For each sub-path and its associated error pattern $\boldsymbol{e}$, the incremental PM is computed by
$$\Delta\mathrm{PM}^{l,t} = \sum_{j:\, e_j = 1} \left|\alpha^l_{s,j}\right|. \tag{5}$$
Since the indices of the soft LLRs $|\boldsymbol{\alpha}^l_{s,\mathcal{B}}|$ are ordered according to Notation 1, the incremental PMs also satisfy a partial order, as shown in Table 1, where the right and lower cells are always larger than the left and upper ones, which can be easily verified. Alternatively, the table can be viewed as a tree, where each cell is a ''node'' whose ''child nodes'' are its right and lower cells. A parent node always has a smaller PM than its child nodes, and the one with the smallest PM is the root node ''0''.
We prove Patterns-R1 with the above table. Any node with a minimum distance to the root node ''0'' larger than L = 8 cannot survive path pruning.
First, if the 8-th smallest incremental PM is caused by a single bit error, then it cannot be $|\alpha^l_{s,7}|$ or larger; otherwise there would be more than 8 sub-paths with PMs smaller than the 8-th one, which contradicts the assumption. The argument holds since there are already 8 parent nodes of $|\alpha^l_{s,7}|$ in the tree.
Similarly, the 8-th smallest incremental PM caused by two bit errors cannot be equal to or larger than $|\alpha^l_{s,1}| + |\alpha^l_{s,3}|$, because there are already more than 8 parent nodes with smaller PMs in the tree.
Finally, the sub-path with incremental PM $|\alpha^l_{s,0}| + |\alpha^l_{s,1}| + |\alpha^l_{s,2}|$ also has 8 nodes in its upstream (including itself), and any error pattern with a larger incremental PM (including the 4-bit patterns) leads to a contradiction if it is included in the surviving set.
Thus, we can reduce the tested error patterns per path to 13 with only one-time path extension without any performance loss.
Remark 1: The bit-flipping-based path extension mainly consists of binary and LUT operations. The 13 error patterns are pre-stored. The resulting PMs for all error patterns can be computed in parallel according to (3) or (5).
The path extension and pruning are summarized by ''(13 → 8 → 64 → 8) × 1'', explained as follows. For each path, the 13 error patterns lead to 13 sub-paths, among which the 8 with the smallest PMs are pre-selected (13 → 8). Altogether, there are 8 × L = 64 extended paths (8 → 64) for L = 8, which are pruned back to 8 (64 → 8). These procedures are executed only once. In contrast, the fast-SSCL decoder [9] requires L − 1 = 7 rounds of path extension and pruning, i.e., (8 → 16 → 8) × 7. According to Section III-D, the minimum number of ''cycles'' reduces from 49 to 14 for a length-16 R1 block. To avoid any misunderstanding, the ''cycles'' here capture implementation details of our fabricated ASIC [19], and should thus be distinguished from the ''time steps'' concept in [9].
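The ''(13 → 8 → 64 → 8) × 1'' flow can be sketched in software and checked against an exhaustive search. The pattern list below is an assumption: it encodes the no-flip case, the seven single flips at the least reliable positions, and the multi-bit flips listed for t = 8 to 12, all expressed as positions in ascending-reliability order. Function names and the tuple layout are illustrative, and the code is sequential whereas the hardware evaluates all patterns in parallel.

```python
import heapq
import random
from itertools import product

# Flip positions in ascending-reliability order (Notation 1); () means "no flip".
PATTERNS_R1 = [(), (0,), (1,), (2,), (3,), (4,), (5,), (6,),
               (0, 1), (0, 2), (1, 2), (0, 3), (0, 1, 2)]


def extend_r1(paths, list_size=8):
    """One-shot extension and pruning of an R1 block.

    paths: list of (pm, alpha) pairs, alpha being the block LLRs of one list path.
    Returns the list_size best (pm, parent_index, codeword) tuples.
    """
    merged = []
    for l, (pm, alpha) in enumerate(paths):
        order = sorted(range(len(alpha)), key=lambda j: abs(alpha[j]))
        raw = [0 if a >= 0 else 1 for a in alpha]
        subs = []
        for pattern in PATTERNS_R1:
            if pattern and pattern[-1] >= len(alpha):
                continue  # pattern does not fit in a very short block
            word = list(raw)
            penalty = 0.0
            for p in pattern:
                j = order[p]           # p-th least reliable position
                word[j] ^= 1
                penalty += abs(alpha[j])
            subs.append((pm + penalty, l, tuple(word)))
        merged.extend(heapq.nsmallest(list_size, subs))   # 13 -> 8 per path
    return heapq.nsmallest(list_size, merged)              # 64 -> 8 overall


def brute_force_r1(paths, list_size=8):
    """Reference search over all 2^B flips of every path (for checking only)."""
    every = []
    for l, (pm, alpha) in enumerate(paths):
        raw = [0 if a >= 0 else 1 for a in alpha]
        for flips in product([0, 1], repeat=len(alpha)):
            penalty = sum(abs(a) for a, e in zip(alpha, flips) if e)
            word = tuple(b ^ e for b, e in zip(raw, flips))
            every.append((pm + penalty, l, word))
    return heapq.nsmallest(list_size, every)


if __name__ == "__main__":
    random.seed(1)
    paths = [(random.uniform(0, 2), [random.uniform(-4, 4) for _ in range(8)])
             for _ in range(8)]
    fast, ref = extend_r1(paths), brute_force_r1(paths)
    print({p[1:] for p in fast} == {p[1:] for p in ref})
```

With randomly drawn LLRs of length 8, the survivors of the pattern-based extension coincide with those of the exhaustive search, which is the behavior the proof above predicts when all LLR magnitudes within a block are distinct.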
Remark 2: The error patterns already cover list sizes L ≤ 8. The idea is worth extending to larger list sizes, for which the identification of the corresponding error patterns is a good problem for future work. Among practical list sizes, L = 8 is particularly important since it was widely accepted by the industry during the 5G standardization process [18]. That conclusion was drawn after extensive evaluations of the tradeoff among BLER, latency, throughput and power consumption, in which decoders with L = 8 achieve the best overall efficiency. The tradeoff in real hardware is further verified in our implemented decoder ASIC in [19].
2) SPC NODES
For an SPC node, the state-of-the-art decoding method [9] requires $\min(L, B)$ path extensions. In this work, we propose only one-time path extension and reduce the search space from $2^{\min(L,B)}$ to 13 as follows.
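Patterns-SPC: For SCL with $L = 8$, following Notation 1, if the checksum of $\boldsymbol{\beta}^l_{s,\mathcal{B}}$ is even, i.e., $\bigoplus_{j \in \mathcal{B}} \beta^l_{s,j} = 0$, then the $L$ surviving paths can be obtained by bit flipping each list path based on the following 13 even-weight error patterns:
$$\hat{\boldsymbol{\beta}}^{l,t}_{s,\mathcal{B}} = \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}, \quad \boldsymbol{e} \in \left\{ \boldsymbol{0},\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_1,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_2,\ \boldsymbol{e}_1 \oplus \boldsymbol{e}_2,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_3,\ \boldsymbol{e}_1 \oplus \boldsymbol{e}_3,\ \boldsymbol{e}_2 \oplus \boldsymbol{e}_3,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_4,\ \boldsymbol{e}_1 \oplus \boldsymbol{e}_4,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_5,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_6,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_7,\ \boldsymbol{e}_0 \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_2 \oplus \boldsymbol{e}_3 \right\}.$$
Otherwise, if the checksum of $\boldsymbol{\beta}^l_{s,\mathcal{B}}$ is odd, i.e., $\bigoplus_{j \in \mathcal{B}} \beta^l_{s,j} = 1$, then the $L$ surviving paths can be obtained by bit flipping each list path based on the following 13 odd-weight error patterns:
$$\hat{\boldsymbol{\beta}}^{l,t}_{s,\mathcal{B}} = \begin{cases} \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_t, & t = 0, \cdots, 7, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_2, & t = 8, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_3, & t = 9, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_2 \oplus \boldsymbol{e}_3, & t = 10, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_2 \oplus \boldsymbol{e}_3, & t = 11, \\ \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0 \oplus \boldsymbol{e}_1 \oplus \boldsymbol{e}_4, & t = 12. \end{cases}$$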
Proof: The proof follows that of Patterns-R1. As shown in Table 2, the right and lower cells are always larger than the left and upper ones. As seen, any error pattern other than those given in Patterns-SPC leads to more than 8 surviving paths with PMs smaller than that of the 8-th path, which contradicts the assumption.
Remark 3: According to Section III-D, the latency (cycles) reduction from [9] is 56 → 15 under L = 8. 
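The counting argument used in both proofs can be mechanized: a flip pattern can be among the $L$ best sub-paths of its parent only if fewer than $L$ admissible patterns are guaranteed to have a smaller incremental PM. The sketch below enumerates such patterns for the R1 case (no parity constraint) and for the two SPC cases (even or odd flip weight). It is an independent check derived from the ordering in Notation 1, not a transcription of Tables 1 and 2; the function names are hypothetical.

```python
from itertools import combinations


def always_cheaper(a, b):
    """True if flip set a is guaranteed a smaller incremental PM than flip set b
    whenever |alpha_0| < |alpha_1| < ... (positions are reliability ranks).

    a and b are ascending tuples of flip positions. The guarantee holds when the
    k-th largest position of a never exceeds the k-th largest position of b.
    """
    if a == b or len(a) > len(b):
        return False
    return all(x <= y for x, y in zip(a[::-1], b[::-1]))


def surviving_patterns(block_len, list_size, parity=None):
    """Patterns that fewer than list_size admissible patterns are sure to beat.

    parity=None: any flip weight (R1); parity=0 / 1: even / odd weight (SPC).
    """
    pats = [c for w in range(block_len + 1)
            for c in combinations(range(block_len), w)
            if parity is None or w % 2 == parity]
    return [b for b in pats
            if sum(always_cheaper(a, b) for a in pats) < list_size]


if __name__ == "__main__":
    for name, parity in (("R1", None), ("SPC even", 0), ("SPC odd", 1)):
        pats = surviving_patterns(9, 8, parity)
        print(name, len(pats), pats)
```

For $L = 8$ this enumeration returns 13 patterns in each of the three cases, and in the odd-checksum case the multi-bit patterns agree with those listed for t = 8 to 12.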
C. ERROR PATTERN IDENTIFICATION VIA SYNDROME DECODING
Existing optimizations operate on special rates, e.g., R0/R1/SPC/Rep nodes. In this work, we suggest a parallelization method for arbitrary nodes with larger sizes (e.g., B = 8, 16, · · · ).
For general nodes, it is not easy to identify all possible error patterns as in R1/SPC nodes. However, it is possible to quickly narrow down to a subset of highly likely error patterns for parallelized path extension. Syndrome decoding is particularly suitable here for two reasons: (i) blockwise syndrome calculation is simple and reuses the Kronecker product module, and (ii) multiple error patterns (a coset) can be pre-stored and retrieved in parallel.
1) GENERAL NODES
As shown in Fig. 3, we first obtain a set of input vectors via multi-bit hard decision and bit flipping. The flipped positions are chosen from the flipping set $\mathcal{T}$, i.e., the $T$ indices in $\boldsymbol{\alpha}_{s,\mathcal{B}}$ with the smallest LLR magnitudes. Based on the hard estimation $\boldsymbol{\beta}^l_{s,\mathcal{B}}$, we flip within $\mathcal{T}$ to generate $2^T$ input vectors, denoted by $\boldsymbol{\beta}^{l,t}_{s,\mathcal{B}}$, $t \in \{0, \cdots, 2^T - 1\}$. For example, if the $t$-th flipping pattern is $\boldsymbol{e}_i \oplus \boldsymbol{e}_j \oplus \boldsymbol{e}_k$ (note that $\{i, j, k\} \subseteq \mathcal{T}$), then $\boldsymbol{\beta}^{l,t}_{s,\mathcal{B}} = \boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_i \oplus \boldsymbol{e}_j \oplus \boldsymbol{e}_k$.
Given the flipping pattern, the syndrome-decoding-based parallel path extension is illustrated in Fig. 3. The key steps, i.e., syndrome calculation and error pattern retrieval, are hardware-friendly binary operations and LUT accesses.
Denote by $G_B \triangleq F^{\otimes \log_2 B}$ the kernel of the general node indexed by $\mathcal{B}$, and by $\mathcal{F}_B$ its frozen set. The parity-check matrix $H_B$ is obtained by extracting the columns with indices in $\mathcal{F}_B$ from $G_B$. Assuming the node has $K_B$ information bits, the syndrome of the vector $\boldsymbol{\beta}^{l,t}_{s,\mathcal{B}}$ contains $B - K_B$ bits and is calculated by
$$\boldsymbol{d}^{l,t} = \boldsymbol{\beta}^{l,t}_{s,\mathcal{B}}\, H_B. \tag{9}$$
For each syndrome, its associated error patterns are computed offline [20] and pre-stored in ascending weight order in the LUT. Since a low-weight error pattern is more likely than a high-weight one, we only need to store a small number of lowest-weight patterns to reduce memory. In practice, a program enumerates all error patterns $\boldsymbol{\beta}$ from low to high weight, and stores each of them in the LUT position indexed by its syndrome $\boldsymbol{d}$ calculated by (9), until all LUT elements are filled. The procedure needs to be performed only once, offline.
There are $2^{B-K_B}$ different syndromes for a $(B, K_B)$ constituent code block, where $K_B$ is the number of information bits within the block. As a result, the size of a syndrome table is $2^{B-K_B} \times L_{sd}$, where $L_{sd}$ is a constant number of error patterns pre-stored for each syndrome.
For example, the syndrome table for a general node with $B = 8$, $K_B = 6$ and $L_{sd} = 4$, which has size $4 \times 4$, is obtained by the above method and given in Table 3, where the error patterns are represented in hexadecimal form.
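The offline table construction described above can be sketched as follows. The frozen set (0, 1) used for the (8, 6) example is an assumption made purely for illustration, as is the hexadecimal convention (bit $p$ of the integer marks a flip at position $p$), so the printed entries are not claimed to reproduce Table 3. Error patterns are stored as tuples of flipped positions.

```python
from itertools import combinations


def kernel_entry(i, j):
    """Entry (i, j) of F^{(x)n} with F = [[1, 0], [1, 1]]: 1 iff bits of j are within bits of i."""
    return 1 if (j & ~i) == 0 else 0


def syndrome(positions, frozen):
    """Syndrome of an error pattern: XOR of the rows of H_B at the flipped positions."""
    return tuple(sum(kernel_entry(i, j) for i in positions) % 2 for j in frozen)


def build_syndrome_table(block_len, frozen, l_sd):
    """Fill each of the 2^(B-K_B) syndromes with its l_sd lowest-weight error patterns."""
    table = {}
    n_syndromes = 1 << len(frozen)
    for weight in range(block_len + 1):            # low weight first
        for positions in combinations(range(block_len), weight):
            entry = table.setdefault(syndrome(positions, frozen), [])
            if len(entry) < l_sd:
                entry.append(positions)
        if len(table) == n_syndromes and all(len(v) == l_sd for v in table.values()):
            break                                  # every LUT element is filled
    return table


def to_hex(positions):
    """Render a pattern as hex, with bit p set for a flip at position p (assumed convention)."""
    return format(sum(1 << p for p in positions), "02X")


if __name__ == "__main__":
    table = build_syndrome_table(8, frozen=(0, 1), l_sd=4)   # assumed frozen set
    for d, patterns in sorted(table.items()):
        print(d, [to_hex(p) for p in patterns])
```

The resulting dictionary has $2^{B-K_B} = 4$ keys with $L_{sd} = 4$ patterns each, matching the $4 \times 4$ size quoted above.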
The error patterns retrieved from the LUT are used to simultaneously generate a set of candidate sub-paths, each obtained by applying one retrieved error pattern to a flipped input vector $\boldsymbol{\beta}^{l,t}_{s,\mathcal{B}}$. For each list path, we have $2^T \times L_{sd}$ extended sub-paths. The PMs are updated according to (3), except that the $T$ smallest LLRs are modified to a large value, i.e., $\alpha^{l,t}_{s,j} \to (-1)^{\hat{\beta}^{l,t}_{s,j}} \times \infty$, $\forall j \in \mathcal{T}$, where $\hat{\beta}^{l,t}_{s,j}$ is the $j$-th hard bit after flipping. This procedure ensures at most one flip for each bit position, so that no duplicate paths survive, which is crucial to the overall performance. Similar to R1/SPC nodes, the path extension and pruning is performed only once for each block to keep $L$ surviving paths, i.e., $(L \to L \times 2^T \times L_{sd} \to L)$.
Remark 4: For small $K_B$, an exhaustive-search-based path extension is more convenient since it generates $2^{K_B}$ paths [11]. For $K_B > T + \log_2 L_{sd}$, it is more efficient to extend paths by the proposed flip-syndrome method. Therefore, we recommend switching between exhaustive-search-based and syndrome-based path extension depending on the constituent code rate. As such, the maximum path extension is $\min(2^{K_B}, 2^T \times L_{sd})$.
Remark 5: For a practical list size $L = 8$, we can set $B = 8$, $T \le 2$, $L_{sd} \le 4$ for 8-bit parallel decoding, or $B = 16$, $T \le 3$, $L_{sd} \le 8$ for 16-bit parallel decoding to achieve a good tradeoff between complexity and latency, with negligible performance loss.
D. LATENCY ANALYSIS
The minimum number of cycles is analyzed with the assumption that independent operations can be executed in parallel. In reality, the latency will be different depending on the number of processing elements available per implementation. However, the minimum cycle analysis represents the number of logical steps and provides a hardware-independent latency evaluation.
For an R1 node, the 13 error patterns in (4) are retrieved from a pre-stored table, among which 8 are pre-selected according to PM. The 13 → 8 path sorting and pruning logic is shown in Fig. 4. For simplicity, $|\alpha^l_{s,t}|$ is abbreviated as $\alpha_t$. All relevant LLR pairs are compared in cycle 1. Among them, the first 3 pre-selected paths are $\boldsymbol{\beta}^l_{s,\mathcal{B}}$, $\boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_0$ and $\boldsymbol{\beta}^l_{s,\mathcal{B}} \oplus \boldsymbol{e}_1$. The remaining paths are sequentially selected according to the comparison results and the preceding selection choices. Finally, the 8 candidate paths are pre-selected and sorted in ascending order. The process requires only 5 cycles.
Combining all sub-paths in an L = 8 list decoder, there are 8 × 8 = 64 paths for another round of pruning. Since the 8 sub-paths of each list path are already ordered, the pruning requires an additional 9 cycles to identify the 8 surviving paths [14]. The numbers of cycles are thus 14 and 15 for an R1 and an SPC node, respectively.
For comparison, fast-SSCL [9] requires 7 and 8 rounds of path extension and pruning for a Rate-1 and an SPC node, respectively. Each round takes a minimum of 7 cycles with bitonic sort [14] . Overall, a minimum of 7 × 7 = 49 and 7 × 8 = 56 cycles are required.
For general nodes, the proposed FSL decoder also has lower latency since more bits are processed in parallel. The overall latency is influenced by two factors: (i) the number of leaf nodes in the SC decoding tree, and (ii) the degree of parallelism within a leaf node.
For a rough estimation, the number of leaf nodes of an N = 1024, K = 512 Polar code is summarized in Table 4. The code is constructed by Polarization Weight (PW) [4]. For all schemes, the frozen bits before the first information bit are skipped. For R0/Rep/SPC/R1 nodes, the maximum length of a parallel processing block is $B_{max} = 32$. For general nodes, the parallel processing length is 8 bits or 16 bits, denoted by 8b FSL and 16b FSL, respectively. As seen, 16b FSL needs to visit only about half as many nodes to traverse the SC decoding tree.
To determine the real latency, we synthesized the proposed decoders in TSMC 16nm CMOS at a frequency of 1 GHz. The maximum supported code length is $N_{max} = 16384$, with LLRs and 6 bits. The number of processing elements is 128. The decoding latency of the 4b multi-bit [11], Fast-SSCL [9], 8b FSL and 16b FSL decoders is measured at a code rate of 1/3. For N = 1024, the latency is 1258 ns, 1079 ns, 870 ns and 697 ns, respectively. For N = 4096, the latency is 5134 ns, 4239 ns, 3640 ns and 3003 ns, respectively. With the 16b FSL decoder, the latency reduction relative to [9] and [11] is thus about 35% and 45% for N = 1024, and about 29% and 42% for N = 4096. As seen, even compared with the most advanced SCL decoders [9], [11] in the literature, the proposed 8b and 16b FSL decoders further reduce latency. A detailed latency comparison is given in Table 5.
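The relative latency reductions follow directly from the measured values quoted above. The short sketch below recomputes them; the dictionary layout and the function name `reduction_pct` are illustrative choices, while the numbers are those reported in the text.

```python
# Measured decoding latency in ns at code rate 1/3, as quoted in the text.
LATENCY_NS = {
    1024: {"4b multi-bit [11]": 1258, "Fast-SSCL [9]": 1079, "8b FSL": 870, "16b FSL": 697},
    4096: {"4b multi-bit [11]": 5134, "Fast-SSCL [9]": 4239, "8b FSL": 3640, "16b FSL": 3003},
}


def reduction_pct(baseline_ns, proposed_ns):
    """Latency reduction of the proposed decoder relative to a baseline, in percent."""
    return 100.0 * (baseline_ns - proposed_ns) / baseline_ns


if __name__ == "__main__":
    for n, row in LATENCY_NS.items():
        for proposed in ("8b FSL", "16b FSL"):
            for baseline in ("Fast-SSCL [9]", "4b multi-bit [11]"):
                pct = reduction_pct(row[baseline], row[proposed])
                print(f"N={n}: {proposed} vs {baseline}: {pct:.1f}%")
```

The 8b FSL decoder yields smaller but still positive reductions against both baselines at both code lengths.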
E. BLER PERFORMANCE
The BLER performance of an FSL decoder is simulated and compared with that of its SCL decoder counterpart. For FSL, we adopt 16-bit parallel processing with $B = 16$, $T \le 3$, $L_{sd} \le 8$. We simulated a wide range of code rates and lengths, and observed negligible performance loss. In the interest of space, only code rates {1/2, 1/3} and lengths {1024, 4096, 16384} are plotted in Fig. 5. Throughout the paper, 16 CRC bits are appended to, but not included in, the K payload bits. The code rate is calculated by K/N.
IV. IMPROVED CODE CONSTRUCTION
Based on the proposed FSL decoder, we propose two code construction methods to further (i) reduce complexity and (ii) improve performance. The first method re-adjusts the information bit positions to avoid certain high-complexity constituent code blocks. The second one replaces outer constituent codes with optimized block codes to improve BLER performance. Note that for both improved constructions, the decoding latency stays as low as with the original construction, as long as the degree of parallelism B remains unchanged.
A. ADJUSTED POLAR CODES
The complexity of an FSL decoder mainly arises from the size of the syndrome tables; therefore, its optimization also aims at reducing the storage for the LUT. According to Section III-C, the size of a syndrome table is $2^{B-K_B} \times L_{sd}$ for a constituent code block with $K_B$ information bits and $L_{sd}$ error patterns per syndrome. According to Remark 4, a rate-dependent path extension is adopted, where the maximum path extension is $\min(2^{K_B}, 2^T \times L_{sd})$. In other words, high-complexity operations are incurred by medium-rate blocks, while high-rate or low-rate blocks can be processed with low complexity.
Thanks to the polarization effect, most blocks will diverge to high or low rates as code length increases, which is helpful. In the following, we show that, even for finite-length codes with insufficient polarization, we can deliberately eliminate some of the medium-rate blocks by re-adjusting their information bit positions.
For example, a 16-bit parallel FSL decoder with $B = 16$, $T = 3$ and $L_{sd} = 8$ is used to decode an N = 2048, K = 1024, CRC16 Polar code. The original block rate distribution is shown on the left side of Fig. 6. As seen, many code blocks have already polarized to either high rate or low rate. Among the medium-rate blocks, those with $K_B = 6$ are responsible for the majority of the decoding complexity (syndrome table size 1024). However, there are only 3 such blocks. On the right side of Fig. 6, we eliminate blocks with $K_B = 6$, 7 and 8 by re-allocating their information bits to blocks with lower and higher rates. Although the adjusted Polar codes deviate from the actual polarization, which implies performance loss, they demand much lower decoding complexity. In particular, the largest syndrome table size reduces from 1024 to 128 with the information re-adjustment in Fig. 6.
Once the rate of a block is adjusted, another block has to change its rate accordingly to ensure that the overall code rate remains unchanged.
[Algorithm: information bit re-adjustment. Output: $\mathcal{I}_{adj}$. 1) Re-adjust to eliminate medium-rate blocks. 2) Balance the overall rate when necessary.]
In this example, blocks with $K_B = 6$, 7 and 8 are eliminated. The syndrome table size thus reduces from 1024 to 128. The BLER loss due to information bit re-adjustment is only 0.02 dB at BLER 1%. The same experiment is conducted for N = 8192, K = 4096, where the performance loss becomes negligible, as shown in Fig. 8. This can be explained as follows: the number of medium-rate blocks decreases as polarization increases with code length, thus requiring less re-adjustment and incurring less performance loss.
The proposed construction allows us to trade some performance for significant complexity reduction, thus bears practical importance.
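The table-size figures quoted in this subsection follow from the $2^{B-K_B}$ syndrome count. The sketch below reproduces them under two stated assumptions: blocks with $K_B \ge B - 1$ (SPC and R1) use the pre-stored flip patterns instead of a syndrome table, and a block uses the syndrome table when $2^{K_B} \ge 2^T \times L_{sd}$, with the tie at $K_B = 6$ resolved toward the syndrome table so as to match the quoted size of 1024. The function names are hypothetical.

```python
def uses_syndrome_table(block_len, k_b, t, l_sd):
    """Assumed switching rule between exhaustive search and syndrome-based extension."""
    if k_b >= block_len - 1:
        return False                       # SPC / R1 blocks use fixed flip patterns
    return (1 << k_b) >= (1 << t) * l_sd   # exhaustive search is cheaper below this


def largest_syndrome_table(block_len, t, l_sd, present_k):
    """Largest number of syndromes, 2^(B - K_B), over the block rates that occur."""
    sizes = [1 << (block_len - k) for k in present_k
             if uses_syndrome_table(block_len, k, t, l_sd)]
    return max(sizes, default=0)


if __name__ == "__main__":
    all_rates = range(17)                                   # K_B = 0..16 for B = 16
    adjusted = [k for k in all_rates if k not in (6, 7, 8)]
    print(largest_syndrome_table(16, 3, 8, all_rates))      # 1024
    print(largest_syndrome_table(16, 3, 8, adjusted))       # 128
```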
B. OPTIMIZED OUTER CODES
Observe that the proposed hard-input decoder for outer block codes is no longer an SC decoder, but similar to an ML decoder. However, the default Polar outer codes have poor minimum distance and may not be suitable for the proposed decoder. To obtain a better performance, a straightforward idea is to adopt outer codes with optimized weight spectrum.
Note that the error-pattern-based decoders do not need to change at all. As long as the generator/parity-check matrices are defined, the outer decoders only need to update the error patterns according to that specific code. The implications are two-fold: (i) any linear block codes fit well into the FSL decoding framework, offering full freedom to optimize the outer codes; (ii) the decoding complexity remains unchanged.
For $B = 16$, we present a specific outer code design for each $K_B$. For example, the $K_B = 2$ simplex code repeated to length 16 has a minimum distance of 10, which is larger than the minimum distance of 8 of the (16, 2) Polar code. Following this idea, we individually optimize each $(B, K_B)$ outer code with respect to code distance. Below are some examples.
For $K_B = 2, 3, 4$, repetition over simplex codes always yields higher code distance than the corresponding Polar codes. [Generator matrices $G_{K_B}$ for $K_B = 2, 3, 4$.]
For $K_B = 6, 7$, extended BCH (eBCH) codes also yield a better weight spectrum than the corresponding Polar codes.
For K B = 8, 9, the dual of eBCH codes are adopted; for K B = 12, 13, 14, the dual of simplex codes are adopted. For the remaining rates, the original Polar codes are adopted.
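The minimum-distance comparison for $K_B = 2$ can be checked by brute force. In the sketch below, the repeated simplex generator is built by cycling through all nonzero 2-bit columns until length 16 is reached, and the (16, 2) Polar code is assumed to use rows 14 and 15 of $F^{\otimes 4}$ as information rows; both constructions and the function names are illustrative assumptions rather than the exact matrices of the paper.

```python
from itertools import product


def min_distance(gen):
    """Minimum Hamming weight over all nonzero codewords of a binary generator matrix."""
    k, n = len(gen), len(gen[0])
    best = n
    for msg in product([0, 1], repeat=k):
        if not any(msg):
            continue
        weight = sum(sum(m & gen[i][j] for i, m in enumerate(msg)) % 2 for j in range(n))
        best = min(best, weight)
    return best


def simplex_repeated(k, length):
    """Generator whose columns cycle through all nonzero k-bit vectors (assumed layout)."""
    period = (1 << k) - 1
    cols = [(j % period) + 1 for j in range(length)]
    return [[(c >> i) & 1 for c in cols] for i in range(k)]


def polar_rows(n_stages, rows):
    """Selected rows of F^{(x)n}: entry (i, j) is 1 iff the bits of j lie within those of i."""
    size = 1 << n_stages
    return [[1 if (j & ~i) == 0 else 0 for j in range(size)] for i in rows]


if __name__ == "__main__":
    print(min_distance(simplex_repeated(2, 16)))   # 10
    print(min_distance(polar_rows(4, (14, 15))))   # 8
```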
Depending on $K_B$, the outer codes are a combination of different codes, i.e., hybrid outer codes. The resulting concatenated codes are thus called hybrid-Polar codes. Note that the lengths of the outer codes are not necessarily a power of 2, making the concatenated codes length-compatible.
The encoding steps are shown in Fig. 9 , and explained as follows:
1) First, an original (N, K) Polar code is constructed in order to determine the rate of each $(B, K_B)$ outer code.
2) Second, each block is individually encoded, i.e., a length-$K_B$ information vector is multiplied by the corresponding generator matrix.
3) Third, the outer codewords are concatenated into a long intermediate vector, upon which inner polarization is performed to obtain a single codeword.
The proposed outer codes have a better weight spectrum than Polar codes. The code weights {w} are enumerated in Table 6. The numbers of codewords having a specific weight are displayed, and those of minimum weight are highlighted in boldface. As seen, the weight spectrum of the hybrid codes improves upon Polar codes with the same $K_B$ in two ways:
• The minimum distance increases, e.g., $K_B = 2$ (10 versus 8, as noted above).
• The minimum distance remains the same, but the number of minimum-weight codewords reduces, e.g., $K_B = 3, 4, 9, 10, 12, 13, 14$.
Fig. 10 and Fig. 11 show the performance of N = 256 and N = 1024 Polar codes, respectively, along with hybrid-Polar codes of the same length and rate. As seen, a performance gain of 0.1 to 0.2 dB is demonstrated. Note that hybrid-Polar codes cannot be decoded by an SCL decoder: once the decoding stage reaches $s = \log_2 B$, the length-$B$ outer codes are no longer Polar codes but arbitrary block codes, and SC-based de-polarization cannot continue. Therefore, we only compare hybrid-Polar codes under FSL with Polar codes under SCL in Fig. 10 and Fig. 11.
Since this BLER improvement comes at no additional cost with the FSL decoder, hybrid-Polar codes are considered worthwhile in practical implementations.
V. CONCLUSIONS
In this work, we propose the hardware architecture of a flip-syndrome-list decoder to reduce decoding latency with improved parallelism. A limited number of error patterns are pre-stored, and simultaneously retrieved for bit-flippingbased path extension. For R1 and SPC nodes, only 13 error patterns are pre-stored with no performance loss under list size L = 8; for general nodes, we may further reduce latency with a syndrome table to quickly identify a set of highly likely error patterns. Based on the decoder, two code construction optimizations are proposed to either further reduce complexity or improve performance. The proposed decoder architecture and code construction are designed particularly for applications with low-latency requirements.
