Abstract As the first error correction codes provably achieving the symmetric capacity of binary-input discrete memoryless channels (B-DMCs), polar codes have recently been chosen by 3GPP for the eMBB control channel. Among existing algorithms, CRC-aided successive cancellation list (CA-SCL) decoding is favorable due to its good performance, where the CRC is placed at the end of the decoding and helps to eliminate invalid candidates before final selection. However, the good performance is obtained with a complexity increase that is linear in the list size L. In this paper, tailored CRC-aided SCL (TCA-SCL) decoding is proposed to balance performance and complexity. An analysis of how to choose the proper CRC for a given segment is given with the help of virtual transform and virtual length. For further performance improvement, a hybrid automatic repeat request (HARQ) scheme is incorporated. Numerical results show that, with complexity similar to the state of the art, the proposed TCA-SCL and HARQ-TCA-SCL schemes achieve 0.1 dB and 0.25 dB performance gain at frame error rate FER = $10^{-2}$, respectively. Finally, an efficient TCA-SCL decoder is implemented on FPGA, demonstrating its advantages over the CA-SCL decoder.
Introduction
Polar codes, proposed by Arıkan [1, 2], are considered a breakthrough in coding theory. It is shown that polar codes can provably achieve the symmetric capacity of binary-input discrete memoryless channels (B-DMCs) [2]. Besides the capacity-achieving performance, the asset of polar coding compared to the state-of-the-art (SOA) is its corresponding low-complexity decoding algorithms. Therefore, polar codes have been adopted by 3GPP for eMBB control channels.
Though the linear programming (LP) decoder [3], successive cancellation (SC) decoder, and belief propagation (BP) decoder [4, 5] have been proposed for polar codes, their performance is not comparable with that of the maximum likelihood (ML) decoder. Thus, the breadth-first SC decoder, named the SC list (SCL) decoder, was proposed in [6, 7]. Cyclic redundancy check (CRC), widely adopted for error detection, has proved to be a simple and effective enabler for further performance improvement with respect to the SCL decoder. Numerical results have shown that the CRC-aided SCL (CA-SCL) decoder [8] performs at least as well as the SOA turbo and low-density parity-check (LDPC) decoders [9]. Usually, the CRC is placed at the end of decoding to eliminate invalid candidates before the final decision. The disadvantages are: 1) Though it performs better than the SCL decoder, the CA-SCL decoder still suffers from time and space complexity that grow with the list size L. 2) Decoding errors cannot be detected in time until the decoding end is reached, and the computation afterward is in vain.
To address the complexity and redundancy, [10] proposed a segmented CA-SCL (SCA-SCL) decoder. At the same time, [11] independently proposed a partitioned CA-SCL (PSCL) decoder, which is similar to the SCA-SCL decoder but uses a different partition method. Both decoders divide code bits into segments and insert CRC bits in between, to rule out invalid candidates per segment rather than wait until the decoding ends. Thus, they can reduce redundancy while keeping performance comparable to CA-SCL decoders. However, existing decoders usually apply the same CRC length to the same number of information or code bits. Though convenient, those straightforward schemes fail to take the code construction into consideration. It is not clear whether the existing uniform partition schemes are optimal and whether better performance can be achieved with the same number of CRC bits.
To the best of our knowledge, no existing literature has discussed the CRC distribution for SCA-SCL decoding or its hardware implementation. By analysing the CRC requirement of unequal-length segments and introducing the concepts of virtual transform and virtual length, this paper develops a tailored CA-SCL (TCA-SCL) decoding with improved performance and lower complexity than the SOA. An HARQ-TCA-SCL decoding is proposed for further performance improvement. Contributions of this paper are: 1) Efficient CRC distribution is proposed for the first time, showing a performance advantage over the SOA. 2) This paper does not limit itself to a specific decoder design, but proposes a formal TCA methodology, which can be readily applied to any existing SCA-SCL decoder.
3) The efficient implementation methodology is also proposed and verified with FPGA implementations.
The remainder of the paper is organized as follows. Section 2 reviews the preliminaries. Section 3 analyzes the SCA-SCL decoders for possible refinement. The TCA-SCL decoding is given in Section 4. The HARQ-TCA-SCL decoding is given in Section 5. Section 6 gives the performance and complexity analysis of the proposed decoding schemes. Section 7 proposes a hardware architecture for TCA-SCL decoding. FPGA implementations are given in the same section. Finally, Section 8 concludes the entire paper.
Preliminaries

2.A Polar Codes
Denote the input alphabet, output alphabet, and transition probabilities of a B-DMC by X , Y, and W (y|x).
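With block length $N = 2^n$, the information vector, encoded vector, and received vector are $u_1^N = (u_1, \ldots, u_N)$, $x_1^N = (x_1, \ldots, x_N)$, and $y_1^N = (y_1, \ldots, y_N)$. The polar encoding is given by
$$x_1^N = u_1^N G_N = u_1^N B_N F^{\otimes n},$$
where $G_N$ and $B_N$ are the generator matrix and bit-reversal permutation matrix respectively, and $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Transmitting channels between $x_1^N$ and $y_1^N$ are $N$ independent copies of $W$. Following [2], channel combining
$$W_N(y_1^N \mid u_1^N) = W^N(y_1^N \mid u_1^N G_N)$$
and channel splitting
$$W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) = \sum_{u_{i+1}^N} \frac{1}{2^{N-1}} W_N(y_1^N \mid u_1^N)$$
yield the $N$ polarized channels $W_N^{(i)}$. In $(N, K)$ codes, the $K$ most reliable channels with indices in information set $\mathcal{A}$ are chosen to transmit the $K$ information bits in $u_1^N$; whereas the others, with indices in frozen set $\mathcal{A}^c$, transmit the $(N - K)$ frozen bits.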
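As a minimal sketch of the encoding step, the following Python applies the bit-reversal permutation and then multiplies by the n-fold Kronecker power of F over GF(2). The function names and the list-of-bits representation are illustrative assumptions and do not come from the paper.

```python
def bit_reverse_permute(v):
    """Apply the bit-reversal permutation B_N to a row vector of length N = 2**n."""
    n = len(v).bit_length() - 1
    out = [0] * len(v)
    for i, bit in enumerate(v):
        # Reverse the n-bit binary representation of index i.
        j = int(format(i, "0{}b".format(n))[::-1], 2) if n else 0
        out[j] = bit
    return out


def kron_power_transform(v):
    """Multiply row vector v by F^{(x)n}, F = [[1, 0], [1, 1]], over GF(2)."""
    x = list(v)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]  # butterfly: upper output = upper XOR lower
        step *= 2
    return x


def polar_encode(u):
    """Compute x = u * B_N * F^{(x)n} for a bit list u whose length is a power of 2."""
    return kron_power_transform(bit_reverse_permute(u))
```

For instance, `polar_encode([0, 0, 0, 1])` gives the all-ones row of $G_4$, and encoding twice returns the input because $G_N$ is its own inverse over GF(2).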
2.B SC and SCL Polar Decoders
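The SC polar decoding tree is a full binary tree. Fig. 1 shows a toy example for N = 8. For each node at the n-th level, two possible choices are 0 and 1. Each set consisting of all the leaf nodes is associated with a unique estimated codeword $\hat{u}_1^N$. If $i \in \mathcal{A}^c$, $\hat{u}_i$ is set to the known frozen bit.
Otherwise, the SC decoder computes its log-likelihood ratio (LLR):
$$L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) = \ln \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 1)}$$
and generates its decision as
$$\hat{u}_i = \begin{cases} 0, & L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \ge 0, \\ 1, & \text{otherwise.} \end{cases}$$
The LLR updating is conducted based on the two equations listed in Eq. (6). $\max^*$ denotes the Jacobi logarithm:
$$\max{}^*(a, b) = \ln(e^a + e^b) = \max(a, b) + \ln\left(1 + e^{-|a - b|}\right).$$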
This recursive process starts from each (sub-)tree's root and always traverses the left branch before the right (Fig. 1) . When the leaf level is reached, hard decision is made and returned to the parent node.
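The Jacobi logarithm and the two LLR update rules can be sketched in Python as follows. This uses the standard LLR-domain form of the SC recursion, which is assumed (not confirmed) to correspond to the two equations of Eq. (6); the names `f_update` and `g_update` are illustrative.

```python
import math


def max_star(a, b):
    """Jacobi logarithm: ln(e^a + e^b), computed in a numerically stable way."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def f_update(la, lb):
    """LLR of the upper (odd-indexed) branch from the two child LLRs la and lb."""
    # ln((1 + e^(la+lb)) / (e^la + e^lb)) written with the Jacobi logarithm.
    return max_star(la + lb, 0.0) - max_star(la, lb)


def g_update(la, lb, u_hat):
    """LLR of the lower (even-indexed) branch given the upper decision u_hat."""
    return (1 - 2 * u_hat) * la + lb
```

The upper-branch rule agrees with the familiar `2 * atanh(tanh(la / 2) * tanh(lb / 2))` expression; the max* form avoids the hyperbolic functions.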
As a greedy search algorithm, SC decoding keeps only one path based on step-wise decisions, with complexity of $O(N \log N)$. However, this single-candidate method only guarantees local optimality and may produce an incorrect result. To this end, SCL decoding, which keeps a list of L survivors, was proposed independently in [6, 7]. Fig. 2 illustrates the difference between the SC and SCL algorithms. The complexity of the SCL decoder is $O(LN \log N)$. At the i-th step, if $i \in \mathcal{A}$, the SCL decoder splits each current path into two paths with $\hat{u}_i = 0$ and $\hat{u}_i = 1$. Out of the 2L paths, only the L best ones are kept. Finally, the decoder chooses the best path at the end of the decoding process.
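One SCL step (split each path at an information bit, then keep the L best of the resulting candidates) can be sketched as below. The path metric used here is an assumption: the accumulated log-probability of the chosen bits, with larger values being better. The callback `llr_of` is a hypothetical stand-in for the SC LLR computation, and frozen bits are assumed to be zero.

```python
import math


def scl_step(paths, llr_of, is_info, list_size):
    """Extend every path by one bit and keep the `list_size` best candidates.

    paths:   list of (bits_tuple, metric); a larger metric is better (assumption).
    llr_of:  hypothetical callback mapping bits_tuple to the LLR ln(P0/P1) of the next bit.
    is_info: True for an information bit (split), False for a frozen bit (forced to 0).
    """
    candidates = []
    for bits, metric in paths:
        llr = llr_of(bits)
        for b in ((0, 1) if is_info else (0,)):
            # -ln P(bit = b) for LLR = ln(P0/P1): ln(1 + e^(-llr)) if b == 0, else ln(1 + e^(llr)).
            penalty = math.log1p(math.exp(-llr if b == 0 else llr))
            candidates.append((bits + (b,), metric - penalty))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:list_size]
```

With `list_size = 1` this reduces to the greedy SC decision at each information bit.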
2.C CA-SCL Polar Decoder
For further improvement, CA-SCL decoder introduces CRC as a detection tool at the end of decoding [8] . Illustrated in Fig. 3 , CRC detector helps to decide which candidates are possibly correct before metric comparison. Here, m denotes the number of CRC bits. The CRC-passed candidate with the largest metric value is chosen as the final result. If no candidate passes the CRC detection, a decoding failure is claimed.
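The final selection rule of CA-SCL can be sketched as a small Python function. The predicate `crc_ok` is a hypothetical placeholder for the actual CRC detector, and returning `None` stands for the declared decoding failure.

```python
def select_with_crc(candidates, crc_ok):
    """Return the CRC-passing candidate with the largest metric, or None on failure.

    candidates: list of (bits_tuple, metric) produced by the list decoder.
    crc_ok:     hypothetical predicate that returns True when the bits pass the CRC.
    """
    passed = [c for c in candidates if crc_ok(c[0])]
    if not passed:
        return None  # no candidate passes: decoding failure is claimed
    return max(passed, key=lambda c: c[1])[0]


# Toy usage: an even-parity check stands in for a real CRC detector.
toy_candidates = [((1, 0, 0), -0.1), ((1, 0, 1), -0.2), ((0, 1, 1), -1.5)]
toy_choice = select_with_crc(toy_candidates, lambda bits: sum(bits) % 2 == 0)
```

In the toy usage the candidate with the best metric fails the check, so the second-best candidate is selected instead.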
Segmented CA-SCL Decoding Schemes
In this section, we first introduce two SCA-SCL decoding schemes, then propose a refined version. Without loss of generality, the (1024, 512) code [2] is employed as a running example, whose polarization is shown in Fig. 4. Here W is a BEC with erasure probability $\epsilon = 0.5$,
$$I(W_N^{(2i-1)}) = I(W_{N/2}^{(i)})^2, \qquad I(W_N^{(2i)}) = 2 I(W_{N/2}^{(i)}) - I(W_{N/2}^{(i)})^2,$$
with $I(W_1^{(1)}) = 1 - \epsilon$. The blue stars in Fig. 4 denote the information bits, whereas the red points denote the frozen bits.
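The polarization of a BEC can be reproduced numerically with the standard recursion from [2]: each channel of capacity I splits into one of capacity I² and one of capacity 2I − I². The sketch below computes all N capacities and picks the most reliable indices; the function names and the 0-based indexing are assumptions made for illustration.

```python
def bec_capacities(n, eps):
    """Symmetric capacities of the N = 2**n polarized channels of a BEC(eps)."""
    caps = [1.0 - eps]
    for _ in range(n):
        nxt = []
        for c in caps:
            nxt.append(c * c)          # degraded channel, odd index (1-based)
            nxt.append(2 * c - c * c)  # upgraded channel, even index (1-based)
        caps = nxt
    return caps


def information_set(caps, k):
    """0-based indices of the k most reliable channels, in ascending order."""
    order = sorted(range(len(caps)), key=lambda i: caps[i], reverse=True)
    return sorted(order[:k])


# Running example: N = 1024 with erasure probability 0.5, 512 + 32 selected positions.
caps_1024 = bec_capacities(10, 0.5)
selected = information_set(caps_1024, 512 + 32)
```

The sum of the capacities is preserved by the recursion, which gives a quick sanity check on the computation.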
3.A Comparison of Different Segmented Schemes
To the authors' best knowledge, there are two segmented CRC-aided SCL methods. The PSCL scheme proposed in [11] aims to reduce memory consumption, and applies uniform partitions to code bits for implementation convenience. The hardware reduction comes at the cost of some performance loss compared to the conventional CA-SCL algorithm, because the scheme always forces the number of candidate paths to 1 after each CRC. The SCA-SCL scheme proposed in [10] aims to reduce both the time and space complexity. Uniform segments are applied to information bits and the CRC is employed as a tool to eliminate decoding redundancy without harming the performance. Let P denote the number of segments. For the PSCL decoder, the index set of Segment-i is $\mathcal{T}_i$ ($1 \le i \le P$). $|\cdot|$ denotes the cardinality of a set. We have
For the SCA-SCL decoder, the index set of Segment-i is $\mathcal{S}_i$ ($1 \le i \le P$). We have
One simple example with P = 4 is illustrated in Fig. 5. Conceptually, the two schemes are similar and differ only in the partition method. PSCL decoding employs a uniform code bit partition, which is implementation friendly. However, since only one candidate can survive after each CRC, a small performance degradation is expected, especially in the low SNR region. SCA-SCL decoding employs a uniform information bit partition, which can keep the same performance as CA-SCL decoding while reducing the space and time complexity. This advantage comes from the decoding flexibility. However, the flexibility makes the implementation more complicated.
3.B PSCL with Early Termination
The first observation is that both schemes apply the same CRC to the uniformly partitioned segments. Without looking into the symmetric capacity of each binary channel, this straightforward scheme may not be optimal. The second observation is that both schemes have their own merits, so it would be sensible to merge them. In other words, we expect that a new approach can be devised which is both implementation friendly and adaptive.
One simple mixture of both schemes is to introduce early termination to PSCL decoding. However, this simple combination may not be reasonable in certain cases. Fig. 6 gives an example with P = 4. As shown in Fig. 7, for the (1024, 512) code with m = 32, the information lengths of the four segments are 20, 123, 156, and 245, respectively. If uniform CRC bits are employed, each segment carries an 8-bit CRC, so the first segment has 20 − 8 = 12 information bits and the last segment has 245 − 8 = 237 information bits. It is unreasonable to use the same 8-bit CRC for both the 12-bit and 237-bit segments. To this end, the TCA-SCL decoding is proposed in the following section.
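The imbalance in this example can be checked with a few lines of Python. The helper below is purely illustrative: it splits the m CRC bits evenly and reports how many payload bits each segment keeps and the CRC-to-payload ratio.

```python
def payload_bits(segment_info_lengths, total_crc_bits):
    """Payload bits left in each segment when the CRC bits are split uniformly."""
    per_segment_crc = total_crc_bits // len(segment_info_lengths)
    return [length - per_segment_crc for length in segment_info_lengths]


# Segment lengths (information plus CRC positions) quoted for the (1024, 512) code, m = 32.
lengths = [20, 123, 156, 245]
payload = payload_bits(lengths, 32)
crc_to_payload = [8 / p for p in payload]
```

The first segment spends 8 CRC bits on 12 payload bits (a ratio of about 0.67), whereas the last spends the same 8 bits on 237 payload bits (about 0.03).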
CA-SCL Decoding with Tailored CRC
In this section, we first discuss how to measure the requirement of CRC bits for different segments. Then the concepts of virtual transform and virtual length are introduced. A visualization method of polarized channel's symmetric capacity is also proposed. The detailed TCA-SCL decoding is finally proposed. It should be noted that though the TCA-SCL decoding is based on uniform partition of code bits, it can be readily applied to other uniform or nonuniform partition schemes.
4.A Requirement of CRC Length for Polar Codes
Assume a total of m CRC bits are available and are divided among P segments as $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_P$. It may not be suitable to set $|\mathcal{C}_1| = |\mathcal{C}_2| = \ldots = |\mathcal{C}_P|$ for P segments with different lengths. How to measure the required CRC length for each segment is critical. To the authors' best knowledge, no literature has addressed this specific problem. To maintain the same error detection capability over independent channels, it is concluded that a longer sequence requires more CRC bits [12]. However, this conclusion does not suit polar codes, because the reliabilities of different channels differ. A reasonable measure of the required CRC length should take both sequence length and symmetric capacity into account. In the following, the concepts of virtual transform and virtual length are proposed to this end.
4.B Virtual Transform and Virtual Length
[Figure: The CRC allocation of PSCL (N = 1024, K = 512).]
Including CRC bits, we always pick the K + m most reliable bits out of N based on the symmetric capacity $I(W_N^{(i)})$ with $i \in \mathcal{A}'$. $\mathcal{A}'$ is the new information set including CRC bits, and $|\mathcal{A}'| = K + m$. Calculate $\bar{I}$ as follows:
Definition 1 Virtual Transform To operate the virtual transform, we first calculate I (i):
The virtual value of the channel is
Definition 2 Virtual Length The summation of J(i) in the k-th segment is its virtual length:
$$vl_k = \sum_{i \in \{\mathcal{T}_k \cap \mathcal{A}'\}} J(i).$$
The CRC allocation is given by
where adjust(·) is a function which adjusts the allocation results to near integers and takes the following steps: 1) find an unmarked k which has minimum |ROUND(
|, then mark k and set
); 2) repeat step 1) for (P − 2) times; 3) Assume the left unmarked index is k . Set
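The rounding procedure can be sketched in Python under two stated assumptions: the ideal share of segment k is taken to be m·vl_k / Σ vl (allocation in proportion to the virtual lengths), and the last unmarked segment receives whatever is left so that the total equals m. The virtual lengths in the usage line are hypothetical values chosen only to exercise the function.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with halves rounded up."""
    return int(math.floor(x + 0.5))


def allocate_crc(virtual_lengths, m):
    """Split m CRC bits over the segments in proportion to their virtual lengths."""
    total = sum(virtual_lengths)
    shares = [m * v / total for v in virtual_lengths]  # ideal real-valued shares (assumption)
    alloc = [0] * len(shares)
    unmarked = set(range(len(shares)))
    # Steps 1) and 2): P - 1 times, fix the unmarked share closest to an integer.
    for _ in range(len(shares) - 1):
        k = min(unmarked, key=lambda j: (abs(round_half_up(shares[j]) - shares[j]), j))
        alloc[k] = round_half_up(shares[k])
        unmarked.remove(k)
    # Step 3): the remaining segment takes the leftover bits (assumption).
    last = unmarked.pop()
    alloc[last] = m - sum(alloc)
    return alloc


# Hypothetical virtual lengths, for illustration only.
example_alloc = allocate_crc([29, 102, 106, 83], 32)
```

Fixing the shares nearest to integers first concentrates the rounding error in a single segment, which keeps the total at exactly m bits.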
4.C Visualization of Channel Symmetric Capacity
Before we give more details of the proposed TCA-SCL decoding, a visualization method for symmetric capacity is proposed for ease of understanding and illustration. In this visualization, a color gradient across the spectrum is used to show the symmetric capacity of each channel. According to the legend, the closer the symmetric capacity is to 1 (0), the more bathochromic (hypsochromic) the color. Fig. 8(a) shows the visualization for polar codes with N = 64.
Example 1 For (1024, 512) polar codes with 32 CRC bits, visualization of code bits is given in Fig. 8(b) . The visualization of 512 information bits is given in Fig. 8(c) . For TCA-SCL decoding, set P = 4. According to Definition 2, the ratio of virtual lengths is: 
which is illustrated by Fig. 8(d). Then the CRC allocation is obtained according to Eq. (15):
$$(|\mathcal{C}_1|, |\mathcal{C}_2|, |\mathcal{C}_3|, |\mathcal{C}_4|) = (3, 10, 11, 8).$$
The refined CRC allocation based on virtual length is given in Fig. 8 (e).
Remark 1 Generally speaking, the hypsochromic part in the visualization chart is the main contributor to the virtual length. A more hypsochromic segment requires more CRC bits.
This refined SCA-SCL decoding based on virtual length is named TCA-SCL decoding. Details of TCA-SCL decoding are given as follows. The corresponding performance and implementation are discussed in Section 6 and Section 7.
4.D Tailored CA-SCL Decoding
The detailed tailored CA-SCL decoding is given in this subsection. For TCA-SCL encoding, we set P segments and perform the virtual transform to obtain the corresponding virtual lengths. Then we allocate the CRC bits according to the ratio of virtual lengths before polar encoding.
[Fig. 8(e): The CRC allocation of TCA-SCL (N = 1024, K = 512).]
Here, addCRC(·) is a function which performs Eq. (15). Function encoder(·) performs conventional polar encoding. For TCA-SCL decoding, SCL decoding with early termination is performed as follows. Here, the function SCL(·) is the SCL decoding for Segment-j. Define $U_j$ as the set of output paths of SCL(·) in the j-th segment. Define passCRC(·) as the function which checks whether at least one path of $U_j$ passes the CRC. If one or more paths pass the CRC, the one with the largest metric among them is chosen to refresh $\hat{u}_1^N$.
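The segment-by-segment decoding loop with early termination can be sketched as follows. The callbacks `scl_segment` and `pass_crc` are hypothetical stand-ins for SCL(·) and the per-path CRC check; this is a control-flow sketch under those assumptions, not the paper's listing.

```python
def tca_scl_decode(num_segments, scl_segment, pass_crc):
    """Decode segment by segment and stop at the first segment whose CRC fails.

    scl_segment(j, decoded) -> list of (bits_tuple, metric) for Segment j (hypothetical).
    pass_crc(j, bits)       -> True if `bits` passes the CRC of Segment j (hypothetical).
    Returns (decoded_bits or None, number of segments actually decoded).
    """
    decoded = []
    for j in range(num_segments):
        paths = scl_segment(j, decoded)
        passed = [p for p in paths if pass_crc(j, p[0])]
        if not passed:
            return None, j + 1  # early termination: later segments are never decoded
        best_bits = max(passed, key=lambda p: p[1])[0]
        decoded.extend(best_bits)  # a single survivor seeds the next segment
    return decoded, num_segments


# Toy stand-ins: two candidate paths per segment, the CRC accepts only all-zero paths.
def toy_scl(j, decoded):
    return [((0, 0), -0.5), ((0, 1), -0.1)]


def toy_crc(j, bits):
    return sum(bits) == 0


toy_result = tca_scl_decode(3, toy_scl, toy_crc)
```

Returning the number of decoded segments makes the saved work visible: on a failure in Segment j, the remaining segments cost nothing.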
TCA-SCL Decoding with HARQ
Besides early termination, the proposed TCA-SCL decoding can also work in a HARQ way when a segmented CRC fails. HARQ has been widely used in delay-insensitive communication systems for a capacity-approaching throughput [13] [14] [15]. Recently, HARQ has been considered for polar decoding. [16] introduced a HARQ scheme based on a class of rate-compatible polar codes constructed by performing punctures and repetitions using punctured polar coding [17]. An incremental redundancy HARQ (IR-HARQ) scheme via puncturing and extending of polar codes is proposed in [18]. Both algorithms use puncturing patterns to suit different rates. However, puncturing causes a performance loss and needs high-complexity hybrid decoding schemes to remedy it. Moreover, the IR-HARQ scheme needs to retransmit frozen bits one by one after transmitting the K information bits. Therefore, the decoding complexity of IR-HARQ is $O(N^2 \log N)$, which is high for a large N.
[Algorithm (TCA-SCL encoding, partial): 11: for k = 1; k <= P; k++ do 12: $vl_k = \sum_{i \in \{\mathcal{T}_k \cap \mathcal{A}'\}} J(i)$; 13: end for 14: addCRC(u ...]
To overcome this issue, we give a HARQ-TCA-SCL scheme based on TCA-SCL decoding. When a segment decoding failure occurs, the system resends the specific segment and merges the new information bits with the old ones by maximum ratio combining (MRC). For retransmissions of a segment sharing the same SNR, the decoder can apply linear superposition to obtain the average value. As the number of segment retransmissions goes up, the noise power of the averaged signal converges to zero, which helps to improve the performance effectively. The proposed HARQ-TCA-SCL scheme is illustrated in Fig. 9. Let i denote the current number of transmission attempts, T denote the maximum number of retransmissions, and j (≤ P) denote the current position of the segments. The details of the HARQ-TCA-SCL scheme are listed as follows:
We initialize i = 1 for the HARQ-TCA-SCL decoding, then perform the SCL decoding for Segment-j (function SCL(·)) and obtain CRC results on each surviving path at the end of the segment SCL decoding. If at least one path passes the CRC, we save the path with the highest probability and move to the next segment. Otherwise, we update i to i + 1, retransmit Segment-j, combine the retransmitted part with the old ones, and redo the TCA-SCL decoding. The algorithm terminates with a decoding failure if i = T.
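The retransmit-and-combine loop can be sketched as follows. Several points are assumptions of this sketch rather than statements from the paper: the transmission budget is counted per segment, combining is done by averaging the received soft values (the equal-SNR case of MRC), and the three callbacks are hypothetical stand-ins for the channel, SCL(·), and the CRC check.

```python
def harq_tca_scl(num_segments, max_tx, receive_segment, scl_segment, pass_crc):
    """Segment-wise HARQ: on a CRC failure, request the segment again and average.

    receive_segment(j, attempt) -> soft values of Segment j for this attempt (hypothetical).
    scl_segment(j, soft, decoded) -> list of (bits_tuple, metric) (hypothetical).
    pass_crc(j, bits) -> True if `bits` passes the CRC of Segment j (hypothetical).
    Returns (decoded_bits or None, transmissions used per successfully decoded segment).
    """
    decoded, tx_counts = [], []
    for j in range(num_segments):
        soft_sum, chosen = None, None
        for attempt in range(1, max_tx + 1):
            rx = receive_segment(j, attempt)
            soft_sum = list(rx) if soft_sum is None else [a + b for a, b in zip(soft_sum, rx)]
            combined = [v / attempt for v in soft_sum]  # average of all copies so far
            passed = [p for p in scl_segment(j, combined, decoded) if pass_crc(j, p[0])]
            if passed:
                chosen = max(passed, key=lambda p: p[1])[0]
                break
        if chosen is None:
            return None, tx_counts  # budget exhausted: decoding failure
        decoded.extend(chosen)
        tx_counts.append(attempt)
    return decoded, tx_counts


# Toy stand-ins: the first copy is too noisy, the second one repairs the average.
def toy_rx(j, attempt):
    return [1.0, 1.0] if attempt == 1 else [1.0, -3.0]


def toy_scl(j, soft, decoded):
    return [(tuple(0 if v >= 0 else 1 for v in soft), 0.0)]


def toy_crc(j, bits):
    return bits == (0, 1)
```

In the toy setup, `harq_tca_scl(1, 2, toy_rx, toy_scl, toy_crc)` succeeds on the second transmission, while a budget of one transmission ends in a failure.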
Performance and Complexity Analysis
6.A Performance Analysis
In this subsection, a performance comparison between different algorithms is given over binary-input additive white Gaussian noise channels (BI-AWGNCs). Different code lengths, rates, and partition schemes are considered: for Fig. 10(a), we have N = 64, K = 36, m = 8, and P = 2; for Fig. 10(b), we have N = 1024, K = 512, m = 32, and P = 4. The information set $\mathcal{A}$ is selected according to [2, 19]. We use the corresponding hex value to represent each CRC polynomial. For example, a CRC-4 detector with polynomial $g(D) = D^4 + D + 1$ is described as CRC-4 (0x9) in this paper (the '+1' is implicit in the hex value). For the (64, 36) code, we set 2 copies of CRC-4 (0x9) for the (HARQ-)PSCL scheme, and CRC-5 (0x12) and CRC-3 (0x5) for the (HARQ-)TCA-SCL scheme. For the (1024, 512) code, we set 4 copies of CRC-8 (0xA6) for the (HARQ-)PSCL scheme, and CRC-3 (0x5), CRC-10 (0x327), CRC-11 (0x583), and CRC-8 (0xA6) for the (HARQ-)TCA-SCL scheme. All the CRC detectors use the best CRC generator polynomials suggested by [12].
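The hex notation with the implicit '+1' can be made concrete with a short Python sketch: shifting the hex value left by one bit and setting the lowest bit recovers the full generator polynomial, which can then drive a bitwise long division. The function names are illustrative, and the bit-serial loop is a plain software model rather than the paper's hardware.

```python
def implicit_to_full(poly_hex):
    """Append the implicit '+1' term: 0x9 (D^4 + D) becomes 0x13 (D^4 + D + 1)."""
    return (poly_hex << 1) | 1


def crc_remainder(bits, full_poly):
    """Remainder of bits(D) * D^degree divided by the generator, as a list of bits."""
    degree = full_poly.bit_length() - 1
    reg = 0
    for b in list(bits) + [0] * degree:
        reg = (reg << 1) | b
        if reg >> degree:     # leading term present: subtract (XOR) the generator
            reg ^= full_poly
    return [(reg >> k) & 1 for k in range(degree - 1, -1, -1)]


def passes_crc(codeword, full_poly):
    """A codeword (message followed by its CRC bits) passes if the remainder is zero."""
    return not any(crc_remainder(codeword, full_poly))


# CRC-4 (0x9) applied to a short, arbitrary message.
g4 = implicit_to_full(0x9)
message = [1, 0, 1, 1, 0, 0, 1]
codeword = message + crc_remainder(message, g4)
```

Flipping any single bit of `codeword` makes `passes_crc` return `False`, since a generator with more than one term detects all single-bit errors.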
According to Fig. 10, compared with the PSCL scheme, the proposed TCA-SCL scheme has a 0.1 dB performance gain at FER = $10^{-2}$ for both (64, 36) and (1024, 512) codes. The HARQ-TCA-SCL (T = 3) scheme introduces a 0.25 dB and 0.13 dB gain over the HARQ-PSCL scheme at frame error rate FER = $10^{-2}$ for (64, 36) and (1024, 512) codes, respectively.
6.B Complexity Analysis
Define the product of the actual decoding length and list size as the average list size. Since the average computational complexity is proportional to the average list size, here we analyze the average list sizes of the TCA-SCL and HARQ-TCA-SCL decoders, denoted by $\bar{L}_T$ and $\bar{L}_H$, respectively. Assume the total frame number is F, and the decoder ends at the $P_i$-th segment of the i-th frame. For the TCA-SCL decoder, $\bar{L}_T$ can be calculated as
Suppose the i-th frame is retransmitted
For low SNR, thanks to the early termination, $\bar{L}_T$ is small due to the high error rate. On the other hand, a larger number of retransmissions leads to a higher $\bar{L}_H$ for the HARQ-TCA-SCL decoder. As SNR increases, $\bar{L}_T$ and $\bar{L}_H$ converge to L: 1) the TCA-SCL decoder is more likely to finish the decoding process, and 2) the number of retransmissions of the HARQ-TCA-SCL decoder converges to 0. It should be noted that, according to Eqs. (18) and (19) and Fig. 11, the (HARQ-)TCA-SCL scheme has the same complexity as the (HARQ-)PSCL scheme. The HARQ-TCA-SCL scheme has 50.3% and 38.5% higher complexity than the PSCL scheme at SNR = 1.5 dB for (64, 36) and (1024, 512) codes, respectively. As SNR goes up, the complexity of the HARQ-TCA-SCL scheme asymptotically approaches that of the PSCL scheme, with better performance.
Efficient TCA-SCL Decoder Architectures
To facilitate the application of the proposed TCA-SCL decoder, efficient architectures and FPGA implementations are given in this section to demonstrate its merits. Since hardware consumption and decoding latency are two main concerns of SCL family decoders, the proposed architecture aims to achieve a good balance between them. The HARQ-TCA-SCL decoder can be designed similarly.
7.A Hardware Consumption Analysis
7.A.1 Full Module TCA-SCL Architecture
In this subsection, a full module TCA-SCL architecture is proposed, which is mainly based on the conventional folded SC architecture proposed in [20] . The architecture for full module TCA-SCL decoder is illustrated in Fig. 12 . It divides all mixed node modules (MNs) into n = log 2 N stages, and each MN implements two types of calculations mentioned in Eq. (6) . According to the conclusions in [20] , for an N -bit SC decoder, (N − 1) MNs are required. For an N -bit CA-SCL decoder, L(N − 1) MNs are employed.
Theorem 1 For one P -segmented TCA-SCL decoder with list L, the total number of MNs is
Proof For the given decoder, its MNs can be categorized into two parts. The first part includes Stages 1 to $\log_2 P$. The second part includes Stages ($\log_2 P + 1$) to n. It should be noted that, since N is a power of 2 and is partitioned uniformly, P is also a power of 2, so $\log_2 P$ is always an integer. Since each segment outputs only one candidate, the first part obeys the SC decoding rule, and list size L is not necessary. The number of MNs is
The second part obeys the CA-SCL decoding rule without considering fine-grained scheduling. The number of MNs is
Since the memory block corresponds to MNs, the memory complexity is as follows.
Corollary 1 Assume the quantization length for the LLR message is q. The memory bits required are
The list core (LC) module in Fig. 12 mainly implements the sorting operation. In order to reduce both the sorting latency and complexity, the efficient distributed sorting (DS) proposed in [21] is employed here.
7.A.2 Folded Module TCA-SCL Architecture
Thanks to the early termination scheme, the proposed full module architecture for TCA-SCL decoding is memory efficient compared to conventional CA-SCL decoding. However, the hardware utilization ratio (HUR) of MNs is very low. Borrowing the fine-folding idea proposed in [21, 22], this paper then proposes the folded module TCA-SCL architecture for higher HUR. We set up a sub-decoder with $(2^{n/2} - 1)L$ MNs for Stage 1 to n/2. Stage (n/2 + 1) to n can also be implemented by this sub-decoder in a time-multiplexing manner. Fig. 13 gives an example for an even n and also denotes the current folding order. However, the characteristics in Section 7.A.1 which help to reduce the complexity of the first $\log_2 P$ stages cannot be employed here, because the folding technique is based on uniform hardware. The complexity is as follows.
Theorem 2 For one folded module TCA-SCL decoder with list L, the total number of MNs is $(2^{n/2} - 1)L$.
Proof When implementing Stage 1 to n/2 , all the input and output multiplexers choose mode '0'. 2 n/2 +1 executions are required to output 2 n/2 +1 L LLRs for Stage n/2 . For P -segmented decoder, if log 2 P ≥ n/2 , (2 n/2 − 1)(L − 1) MNs are idle during this decoding stage. Otherwise, according to Eq. When implementing Stage ( n/2 + 1) to n, all the input and output multiplexers choose mode '1'. Since 2 n/2 +1 L LLRs become the input of the sub-decoder, no MN is idle during this stage. Therefore, the total number of MNs is (2 n/2 − 1)L. 
Proof The folded design only reduces the complexity of MNs. However, the memory complexity stays the same as the full module TCA-SCL architecture. Table 2 gives FPGA results in accordance with Theorem 3.
7.B Timing Analysis
7.B.1 Single Frame Scheme
As shown in Fig. 12, the decoding process for TCA-SCL has the following steps: 1) In Segment j, MNs complete the main decoding in Eq. (6). The 2L LLRs correspond to $\hat{u}_i$ for each path. 2) The 2L LLRs are input to the LC module. The DS method [21] is employed to select the best L paths.
3) The memory is updated and partial sum vectorû sum is calculated forû i+1 . 4) We repeat the above steps to get the L paths forû
is directly chosen as '0' or '1' for each path without decoding. After that, we input the information bits in $\hat{u}$ for the 2L paths to CRC j to pick the only path for Segment j + 1. The CRC is implemented with a linear feedback shift register (LFSR) [23], whose XOR taps are determined by the generator polynomial. As shown in Fig. 12, P CRC modules are employed. It should be noted that here CRC j processes the 2L paths serially. Admittedly, designers can process the 2L paths with parallel CRCs. Considering the simplicity of the CRC and its short processing time, the serial manner is employed here. The scheduling of this single frame (SF) scheme is shown in Fig. 14(a). The latency of the SF TCA-SCL decoder is as follows.
Theorem 4 Assume the latency of CA-SCL is $T_{CA}$ clock cycles and the latency for CRC i is $T_i$. For one SF P-segmented TCA-SCL decoder with list L, the decoding latency is $T_{CA} + 2L(T_1 + \cdots + T_{P-1})$.
Proof After checking all 2L paths of Segment i, the decoder selects one path and begins to decode Segment (i+1). In SF scheme, segmented CRC scheme increases latency for serial checking of Segment i
In SF scheme, checking Segment P of Frame 1 and decoding Segment 1 of Frame 2 can be done at the same time. Since the checking time is shorter than decoding time, the latency increase is
Now the proof is immediate.
Folded module TCA-SCL decoder can also work in the proposed SF scheme.
Corollary 2 Assume the folding technique introduces
F extra clock cycles per frame, the latency of SF folded module TCA-SCL decoder is
7.B.2 Double Frame Scheme
SF decoding introduces $2L(T_1 + \cdots + T_{P-1})$ extra clock cycles per frame. During CRC detection, all MNs are idle and the HUR is therefore low. To this end, the double frame (DF) scheme is proposed. The main idea of DF is shown in Fig. 14(b). Two frames are decoded simultaneously in an interleaved manner: when Frame 1 checks (decodes) its Segment i, Frame 2 decodes (checks) its Segment i (i − 1). Since both frames share the same architecture, every time a new segment is decoded, all LLRs in memory belong to the other frame. If we keep the decoding latency of each frame the same as the CA-SCL decoder, Stage 1 to ($\log_2 P - 1$) need an extra memory block of q(
Theorem 5 For one DF P-segmented TCA-SCL decoder with list L, the decoding latency is
Proof For the interleaved manner in Fig. 14(b) , the latency of each segment is
where T deci denotes the SCL decoding latency for Segment i, which includes SC decoding and DS. According to [21] , the DS latency for Segment i is approximately 2LT i , therefore
It is believed that there are $2^i$ segments which could have started from Stage (i + 1) but now start from Stage 1, introducing a latency of $i \cdot 2^i$. Therefore, the increased decoding latency is
Folded module TCA-SCL decoder can also work in the proposed DF scheme with the following latency.
Corollary 3 Assume the folding technique introduces
F extra clock cycles per frame, the latency of DF folded module TCA-SCL decoder is
Table 1 shows a comparison between five different schemes: the CA-SCL decoder, SF (DF) full module TCA-SCL decoders, and SF (DF) folded module TCA-SCL decoders. According to Section 4.C, the CRC allocation is $(|\mathcal{C}_1|, |\mathcal{C}_2|, |\mathcal{C}_3|, |\mathcal{C}_4|) = (3, 10, 11, 8)$. Data in red show the example of N = 1024, K = 512, P = 4, and L = 2.
7.C FPGA Implementation Results
To better demonstrate the advantages of the proposed TCA-SCL decoders, FPGA implementations based on Altera Stratix V are given as well. In accordance with Table 2, five decoders have been implemented. The same parameters as in the aforementioned example are employed here: N = 1024, K = 512, m = 32, P = 4, and L = 2. All five decoders employ the same LLR quantization scheme of 1 sign bit, 6 integer bits, and 1 fractional bit. In Fig. 15, the FER performance comparison of floating-point SC and quantized SC with q = 8 bits indicates the validity of the quantization scheme.
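The 8-bit LLR format (1 sign bit, 6 integer bits, 1 fractional bit) can be modeled in Python as below. The sketch assumes a sign-magnitude representation with round-to-nearest and symmetric saturation at ±63.5; the paper does not state these details, so they are illustrative choices.

```python
import math


def quantize_llr(llr, int_bits=6, frac_bits=1):
    """Quantize an LLR to sign-magnitude fixed point (assumed format), with saturation."""
    step = 2.0 ** -frac_bits               # resolution: 0.5 for one fractional bit
    max_mag = 2 ** int_bits - step         # largest magnitude: 63.5
    magnitude = math.floor(abs(llr) / step + 0.5) * step  # round to nearest step
    magnitude = min(magnitude, max_mag)    # saturate instead of wrapping around
    return -magnitude if llr < 0 else magnitude
```

With these defaults, `quantize_llr(1.3)` yields 1.5 and any input beyond the representable range saturates at ±63.5.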
The implementation results are compared in terms of adaptive logic modules (ALMs), registers, and memory bits. It is shown that, compared to the CA-SCL decoder, the TCA-SCL (SF or DF) decoder achieves an 18.8% or 15.0% ALM reduction. For further ALM reduction, with the help of the folding technique, the FTCA-SCL (SF or DF) decoder consumes 40.11% or 42.9% of the ALMs of TCA-SCL (SF or DF) with slightly increased latency, as analyzed in [22]. It is also observed that the reduction in ALMs is not as large as the reduction in MNs listed in Table 1. This is because Table 1 does not consider the comparison part, which accounts for a major part of the ALM consumption and stays the same across the different architectures.
For implementation convenience, memory has been employed by both folded decoders. Therefore, we consider the sum of registers and memory bits as the total memory consumption. It is observed that the TCA-SCL (SF or DF) decoder requires 77.03% or 82.1% of the memory of the CA-SCL decoder. Also, the introduction of the folding technique does not affect the memory cost, as indicated by Theorem 3. Comparing the FTCA-SCL (DF) decoder and the FTCA-SCL (SF) decoder, when the DF scheme is employed, the latency can be reduced by 15.90% at the cost of 11.99% more ALMs.
For the latency issue, since the critical paths of all designs are determined by the critical path of the same SC decoding kernel, we believe it is safe to compare in terms of clock cycles. First, the segmented CRC decoders introduce more latency due to more serial CRC operations. Second, the DF scheme is more time efficient. Third, the folded versions come at the cost of higher latency.
In general, the four proposed architectures for TCA-SCL decoding reduce the hardware consumption compared to the CA-SCL decoder. Designers can choose the suitable one according to different application requirements.
Conclusions
In this paper, a segmented SCL polar decoding with tailored CRC is proposed. A method for choosing the proper CRC for a given segment is proposed with the help of the concepts of virtual transform and virtual length. Numerical results have shown that the proposed TCA-SCL decoder can achieve better performance and lower complexity than the conventional CA-SCL decoder. Thanks to the more reasonable CRC partition scheme, the TCA-SCL decoder can also outperform the PSCL decoder. For further performance improvement, the HARQ-TCA-SCL scheme is proposed at the cost of increased complexity. Efficient architectures and FPGA implementations are also proposed for a good balance between hardware consumption and decoding latency.
