For polar codes with short-to-medium code length, list successive cancellation decoding is used to achieve a good error-correcting performance. However, list pruning in the current list decoding is based on the sorting strategy and its timing complexity is high. This results in a long decoding latency for large list size. In this work, aiming at a low-latency list decoding implementation, a double thresholding algorithm is proposed for a fast list pruning. As a result, with a negligible performance degradation, the list pruning delay is greatly reduced. Based on the double thresholding, a low-latency list decoding architecture is proposed and implemented using a UMC 90nm CMOS technology. Synthesis results show that, even for a large list size of 16, the proposed low-latency architecture achieves a decoding throughput of 220 Mbps at a frequency of 641 MHz.
INTRODUCTION
Successive cancellation decoding (SCD) is proposed in [1] for decoding polar codes, and its hardware implementation is extensively studied in [2]-[12]. However, for polar codes with short-to-medium code length, the error-correcting performance of the SCD is unsatisfactory. To improve the performance, SCDs with multiple codeword candidates are proposed, namely the list decoding [13], [14] and its variants [15]-[17]. For a better performance, a cyclic redundancy check (CRC) code is serially concatenated with polar codes and the CRC bits are used to choose the valid codeword from the list candidates [13], [18], [19]. As a result, the list decoding of polar codes achieves or even exceeds the performance of Turbo codes [20] and LDPC codes [13]. However, this performance improvement comes at the cost of a larger list size (e.g., 16 or 32), and the increased complexity calls for an efficient list decoding architecture. In this work, the efficient and low-latency implementation of the list decoding is explored, aiming at promoting polar codes as a competitive coding candidate in both error-correcting and implementation aspects.
The first list decoding architecture for polar codes is proposed in [21]. In [22], the pre-computation look-ahead technique [6] is used in the list decoding for a lower latency, while its memory size is tripled. In [21] and [22], small list sizes of 4 and 2 are used, respectively. When the list decoder decodes an information bit, the number of codeword candidates is doubled. To maintain a reasonable decoding complexity, once the number of candidates exceeds the specified list size L, some of the codeword candidates have to be pruned. The common pruning strategy is to sort the codeword candidates based on their metrics and keep the L best of them. However, the sorting operation incurs a large hardware and timing complexity, especially when L is large. In [23], a list decoding architecture with a list size of 8 is proposed, and a Bitonic sorting network is customized for efficient sorting. Nevertheless, up to three pipeline stages are used by the sorting architecture. Consequently, to implement the list decoding with a large list size in hardware, the list pruning architecture is critical, especially for achieving a low decoding latency.
In this work, the list pruning architecture is optimized at both the algorithmic and architectural levels. Recently, instead of using the log-likelihood (LL) to capture the metric of the list candidates, the LL ratio (LLR) representation is used for the list decoding [24], [25]. Benefiting from the numerical accuracy and stability of the LLR, a small and regular architecture of the memory and processing element (PE) can be used for the list decoding [24]. Therefore, in this work, the LLR is used in the design of the low-latency pruning architecture of the list decoder. Very recently in [26]-[28], borrowing some advanced techniques used in the SCD implementation [2], the special constituent codes of polar codes are utilized to reduce the latency of the list decoding. However, conventional sorting strategies are still used for their list pruning, and this limits the latency reduction.
RELATION TO PRIOR WORK
The main contributions of this work are outlined as follows:
1. Different from the previous works on list decoding [13]-[28], a double thresholding strategy (DTS) is proposed to replace the sorting strategy for list pruning.
2. At the architectural level, the architectures for the DTS and the threshold value update are proposed. As a result, even for a large list size, the logic delay of list pruning is very small.
3. A low-latency list decoding architecture for a large list size, i.e., 16, is implemented in the UMC 90nm CMOS technology. Its decoding latency is even smaller than that of the list size of 8 in [23].
LIST DECODING OF POLAR CODES
A length $N = 2^n$ polar code with rate $R = K/N$ is specified by the generator matrix $G_N$ and a frozen set $A^c \subset \{0, 1, \ldots, N-1\}$ of cardinality $|A^c| = N - K$. A source word of polar codes is denoted as $u_N$, and $u_N \in \{0, 1\}^N$. It consists of $K$ information bits $u_i$ ($i \notin A^c$) and $N - K$ frozen bits $u_i$ ($i \in A^c$). The information bit is used to deliver the data, while the frozen bit is set to a value, e.g., 0, pre-known by the decoder. If an $r$-bit CRC is used, the last $r$ information bits take the CRC of the previous $K - r$ bits. In the encoder, the codeword $x_N \in \{0, 1\}^N$ is generated as $x_N^T = u_N^T G_N$ and sent over the physical channels.
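The encoding step can be illustrated with a minimal Python sketch. It assumes that the generator matrix is the n-fold Kronecker power of the 2x2 kernel [[1, 0], [1, 1]] without a bit-reversal permutation, which this section does not specify, and it omits the CRC. The names `polar_generator` and `polar_encode` are hypothetical.

```python
import numpy as np

def polar_generator(n):
    """Assumed generator: n-fold Kronecker power of the kernel F (no bit reversal)."""
    F = np.array([[1, 0], [1, 1]], dtype=np.int64)
    G = np.array([[1]], dtype=np.int64)
    for _ in range(n):
        G = np.kron(G, F)
    return G

def polar_encode(info_bits, frozen_set, n):
    """Place info bits outside the frozen set (frozen bits are 0) and compute x = u G mod 2."""
    N = 1 << n
    info_positions = [i for i in range(N) if i not in frozen_set]
    if len(info_bits) != len(info_positions):
        raise ValueError("number of info bits must equal N - |frozen set|")
    u = np.zeros(N, dtype=np.int64)
    u[info_positions] = info_bits
    x = (u @ polar_generator(n)) % 2
    return u, x

# N = 4 with frozen set {0}, as in the example of Fig. 1
u, x = polar_encode([1, 0, 1], {0}, 2)
```

Here `u` holds the source word with the frozen position fixed to 0, and `x` is the corresponding codeword under the assumed generator.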
Let $y$ be the noise-corrupted signal of $x_N$ at the receiver. The LLRs input to the decoder are given as
$$L^0_i = \ln \frac{\Pr(y_i \mid x_i = 0)}{\Pr(y_i \mid x_i = 1)}, \quad 0 \le i < N. \tag{1}$$
Fig. 1. (a) Decoding tree and decoding path. (b) Scheduling tree for SCD.
Fig. 1 shows an example of the decoding tree and the scheduling tree for $N = 4$. The decoding tree of a length-$N$ polar code is a depth-$N$ binary tree, with $u_i$ mapped to the nodes at depth $i + 1$. Its root node represents a null state. A path from the root node to a depth-$i$ node represents a subvector $[u_0, u_1, \ldots, u_{i-1}]$ of the source word $u_N$, and is named the decoding path $p^i$. Specifically, a path from the root node to a leaf node of the decoding tree represents a source word $u_N$ of polar codes, and the value of each bit of $u_N$ is shown in the corresponding node lying on this decoding path. Notice that, if $u_i$ is a frozen bit, it only assumes 0. Hence, the right sub-tree rooted at the depth-$(i + 1)$ node can be pruned, as the source words included in it are not valid. For example, if $A^c = \{0\}$, the gray sub-tree in Fig. 1(a) is pruned. Decoding the polar codes can be treated as a search problem in the pruned decoding tree. The conventional SCD performs a depth-first search. Given a partial decoding path $[u_0, u_1, \ldots, u_{i-1}]$, the SCD generates the LLR of bit $u_i$, denoted as $L^n_i$.
If $u_i$ is a frozen bit ($i \in A^c$), $\hat{u}_i$ is set to its known value 0. Otherwise, an ML decision is made for the information bit $u_i$ ($i \notin A^c$), and is given by
$$\hat{u}_i = \begin{cases} 0, & L^n_i \ge 0, \\ 1, & \text{otherwise}. \end{cases} \tag{2}$$
Based on the decision rule in (2), a single decoding path from the root to the leaf is obtained in the decoding tree, e.g., the red path in Fig. 1(a), and it is the source word $\hat{u}_N$ decoded by the SCD. For a better error-correcting performance, a breadth-first search is performed by the list decoding. To constrain the search complexity, a list size $L$ is set. Let the $L$ decoding paths at depth $i$ of the decoding tree be denoted as $p^i_l$, $0 \le l < L$.
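For each path candidate $p^i_l$, a path metric is associated with it and denoted as $pm^i_l$. When decoding the information bit $u_i$, the $L$ decoding paths are extended to $2L$ paths. From [24], the path metrics of the two extensions of the path $p^i_l$ are given by
$$pm^{i+1}_l(u_i) = pm^i_l + \ln\left(1 + e^{-(1 - 2u_i) L^n_i}\right), \tag{3}$$
where $u_i$ assumes 0 and 1, corresponding to the left and right extensions of $p^i_l$. In the hardware [24], (3) is approximated by
$$pm^{i+1}_l(u_i) = \begin{cases} pm^i_l, & u_i = \frac{1}{2}\left(1 - \mathrm{sign}(L^n_i)\right), \\ pm^i_l + |L^n_i|, & \text{otherwise}. \end{cases} \tag{4}$$
The operation in (4) is denoted as path metric update (PMU). Based on the $2L$ pms from the PMU, the $L$ extended paths with the smallest pms are chosen and they are the paths at depth $i + 1$, i.e., $p^{i+1}_l$, $0 \le l < L$.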
This operation is named as the list pruning operation (LPO).
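The PMU followed by a sorting-based LPO can be sketched in Python as follows. The sketch assumes the hardware-friendly LLR-based update of [24], in which the extension that agrees with the sign of the LLR keeps the metric and the other extension is penalized by the LLR magnitude, and it assumes that a non-negative LLR favors bit 0. The function names are hypothetical.

```python
def path_metric_update(pms, llrs):
    """Extend L paths to 2L candidates (metric, parent index l, bit u_i).

    Assumed update: no penalty if u_i agrees with the LLR sign,
    otherwise the metric grows by |LLR|.
    """
    candidates = []
    for l, (pm, llr) in enumerate(zip(pms, llrs)):
        likely_bit = 0 if llr >= 0 else 1
        for u in (0, 1):
            penalty = 0.0 if u == likely_bit else abs(llr)
            candidates.append((pm + penalty, l, u))
    return candidates

def prune_by_sorting(candidates, L):
    """Conventional LPO: sort all 2L candidates and keep the L smallest metrics."""
    return sorted(candidates, key=lambda c: c[0])[:L]

# Two paths (L = 2) with metrics 0.0 and 1.0 and bit LLRs 2.0 and -0.5
survivors = prune_by_sorting(path_metric_update([0.0, 1.0], [2.0, -0.5]), 2)
```

The full sort over 2L candidates is the step whose delay grows with L, which is what the double thresholding is intended to avoid.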
From (3) or (4), it can be seen that the PMU needs the knowledge of $L^n_i$, which is generated by the SCD. The SCD operation can be described by the scheduling tree shown in Fig. 1(b). The scheduling tree of a length-$N$ polar code is a depth-$n$ balanced binary tree. It consists of two kinds of nodes: f nodes and g nodes. The functions included in one node can be evaluated in one clock cycle. The PMU and the LPO are performed in addition to these SCD operations. Therefore, the delay of the PMU and the LPO results in an increased latency of the list decoding.
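The recursion behind the scheduling tree can be sketched in Python. The f and g functions below are the min-sum forms commonly used in the SCD literature and are an assumption here, since this section does not spell them out. The sketch also assumes natural bit ordering and frozen bits equal to 0, and it processes one path only (no list). The names `f_node`, `g_node` and `sc_decode` are hypothetical.

```python
import math

def f_node(a, b):
    """Assumed min-sum f function: sign(a) * sign(b) * min(|a|, |b|)."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))

def g_node(a, b, s):
    """Assumed g function: b + (1 - 2s) * a, with s the partial-sum bit."""
    return b + (1 - 2 * s) * a

def sc_decode(llrs, frozen):
    """Recursive SCD; returns (decoded source bits, re-encoded partial sums)."""
    N = len(llrs)
    if N == 1:
        u = 0 if (frozen[0] or llrs[0] >= 0) else 1
        return [u], [u]
    half = N // 2
    a, b = llrs[:half], llrs[half:]
    # Left child: f nodes, decoded first
    u_left, x_left = sc_decode([f_node(p, q) for p, q in zip(a, b)], frozen[:half])
    # Right child: g nodes, using the partial sums of the left child
    u_right, x_right = sc_decode(
        [g_node(p, q, s) for p, q, s in zip(a, b, x_left)], frozen[half:]
    )
    x = [l ^ r for l, r in zip(x_left, x_right)] + x_right
    return u_left + u_right, x

# N = 4, frozen set {0}, noiseless LLRs of the codeword [0, 0, 1, 1]
u_hat, x_hat = sc_decode([4.0, 4.0, -4.0, -4.0], [True, False, False, False])
```

Each level of the recursion corresponds to one depth of the scheduling tree. In a list decoder, the hard decision at the leaf would be replaced by the PMU and the LPO.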
From (4), the PMU is implemented with an adder array and its logic delay is small. However, a sorting strategy is used for the LPO in the conventional list decoding architectures [21]-[24]. For a shorter delay, a parallel sorting architecture [29] is used in [21] and [24]. However, its hardware complexity is $O(L^2)$, and hence it becomes inefficient for large $L$. On the other hand, a Bitonic sorting network is used in [23], and its delay also scales with $L$. Next, to achieve a short-delay LPO and hence a low-latency decoder, a double thresholding algorithm and its corresponding architecture are proposed.
LOW-LATENCY LIST DECODING IMPLEMENTATION

Double Thresholding Strategy
In this sub-section, a list pruning strategy with a small logic delay is introduced. Based on the $2L$ path metrics from the PMU, it approximately finds the $L$ smallest pms and their corresponding path extensions. To this end, the properties of the $2L$ path metrics in (4) are first studied and presented in the following proposition.
Proposition 1. Assume the $L$ path metrics at depth $i$ of the decoding tree are sorted and
$$pm^i_0 \le pm^i_1 \le \cdots \le pm^i_{L-1}, \tag{5}$$
and they are extended to $2L$ path metrics with (4). If the subset of $pm^{i+1}_l(u_i)$s smaller than $T$ is defined as
$$\Omega(T) = \left\{ pm^{i+1}_l(u_i) \;\middle|\; pm^{i+1}_l(u_i) < T,\ 0 \le l < L,\ u_i \in \{0, 1\} \right\}, \tag{6}$$
then, the cardinality of Ω (T ) for T = pm
Due to the space limitation, the proof of Proposition 1 is not shown. Based on Proposition 1, the Double Thresholding Strategy (DTS) for list pruning is given as follows.
Double Thresholding Strategy. Assume the $L$ path metrics at depth $i$ of the decoding tree follow (5), and the two thresholds AT and RT are set to the median and the maximum of these path metrics, respectively:
$$AT = pm^i_{L/2}, \qquad RT = pm^i_{L-1}. \tag{8}$$
The path extensions at depth $i + 1$ obey the following pruning rule:
1. if $pm^{i+1}_l(u_i) < AT$, the path extension is kept;
2. if $pm^{i+1}_l(u_i) > RT$, the path extension is pruned;
3. for path extensions with $AT \le pm^{i+1}_l(u_i) \le RT$, they are randomly chosen such that the list size remains $L$.
Fig. 2(a) illustrates the DTS for the LPO. Assume the $2L$ extended pms are sorted and the top path extension has the smallest path metric. If the list is exactly pruned, the top $L$ path extensions will be the decoding paths at depth $i + 1$. However, when the DTS is used, the shaded paths are reserved for depth $i + 1$. As shown in Fig. 2(a), from Proposition 1, DTS.1 ensures that at least $L/2$ best decoding paths are kept. Moreover, the number of the reserved paths does not exceed $L$. On the other hand, since $|\Omega(pm^i_{L-1})| \ge L - 1$, DTS.2 efficiently excludes the path extensions that are definitely not in the set of the $L$ best paths. Finally, when the number of the paths kept by DTS.1 is smaller than $L$, DTS.3 fills up the $L$ path candidates. Notice that the number of the paths pruned by DTS.2 is no greater than $L$. Therefore, DTS.3 is always able to fill up the decoding list.
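The counting properties used in this discussion can be checked numerically with a short Python sketch. It rests on assumptions that are not confirmed by the surviving text: the sorted path metrics are distinct, AT and RT are taken as the sorted metrics at indices L/2 and L-1, and each path has one extension that keeps its metric and one that adds the LLR magnitude. All function names are hypothetical.

```python
import random

def extended_metrics(pms, llrs):
    """Return the 2L extended metrics under the assumed approximate update."""
    out = []
    for pm, llr in zip(pms, llrs):
        out.append(pm)             # extension agreeing with the LLR sign
        out.append(pm + abs(llr))  # extension disagreeing with the LLR sign
    return out

def count_below(values, threshold):
    """Number of values strictly smaller than the threshold."""
    return sum(1 for v in values if v < threshold)

def check_counts(L, trials=1000, seed=0):
    """Check L/2 <= |below AT| <= L and |below RT| >= L - 1 on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        pms, acc = [], 0.0
        for _ in range(L):
            acc += rng.uniform(0.1, 1.0)   # strictly increasing, hence distinct
            pms.append(acc)
        llrs = [rng.uniform(-8.0, 8.0) for _ in range(L)]
        ext = extended_metrics(pms, llrs)
        below_at = count_below(ext, pms[L // 2])
        below_rt = count_below(ext, pms[L - 1])
        if not (L // 2 <= below_at <= L and below_rt >= L - 1):
            return False
    return True

ok = check_counts(16)
```

The check reflects a simple counting argument under these assumptions: the k paths below the k-th sorted metric each contribute at least one and at most two extensions below it, while no extension of a path at or above that metric can fall below it.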
From Fig. 2(a) , the performance degradation of the DTS is due to DTS.3. If RT is loose, some decoding path belongs to the L best paths may not be chosen by DTS.3. To alleviate this, a tighter (smaller) RT can be assumed. For example, the value of RT in (8) can be replaced by pm i k (k < L − 1). As shown in Fig. 2(b) , by doing so, the number of the candidates that DTS.3 can choose decreases, and hence the probability that the chosen decoding path belongs to the L best paths increases. However, from (7), when RT = pm i k (k < L − 1), it is possible that more than L decoding paths will be pruned by DTS.2. As a result, DTS.3 is not always able to fill up the L path candidates, as depicted in Fig. 2(c) . Hence, if RT value is too small, the performance will become poor, as the decoding paths are aggressively pruned. Therefore, an optimal value of RT exits.
Finally, from the hardware implementation perspective, the complexity of the DTS is much smaller than that of the conventional sorting strategy. To implement DTS.1 and DTS.2, 4L comparators are sufficient and all the comparison operations can be executed in parallel as the pms are compared with the same fixed threshold values. To implement DTS.3, the circuits based on the priority encoder are used. Most importantly, due to the parallel nature of the DTS, the logic delay of the DTS is much shorter than that of the full sorting strategy. As a result, the PMU together with the DTS can be finished in one clock cycle. 
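The three DTS rules can be sketched in Python as below. The sketch assumes that AT is the sorted path metric at index L/2 and that RT is the one at index L-1 unless a tighter index is given. It replaces the random choice of DTS.3 by a fixed first-come priority, loosely mimicking a priority encoder, and it uses the same assumed path metric update as the hardware approximation of [24]. The function name `dts_prune` and the parameter `rt_index` are hypothetical.

```python
def dts_prune(pms_sorted, llrs, rt_index=None):
    """Double-threshold pruning sketch.

    pms_sorted: the L path metrics in ascending order.
    llrs:       the bit LLR of each path, in the same order.
    rt_index:   optional index k of a tighter RT = pms_sorted[k].
    Returns the kept candidates as (metric, parent index, bit) tuples.
    """
    L = len(pms_sorted)
    at = pms_sorted[L // 2]                                   # assumed AT (median)
    rt = pms_sorted[L - 1 if rt_index is None else rt_index]  # assumed RT

    candidates = []
    for l, (pm, llr) in enumerate(zip(pms_sorted, llrs)):
        likely_bit = 0 if llr >= 0 else 1
        for u in (0, 1):
            penalty = 0.0 if u == likely_bit else abs(llr)
            candidates.append((pm + penalty, l, u))

    # Each comparison is independent of the others, so all can run in parallel.
    kept = [c for c in candidates if c[0] < at]           # DTS.1: always keep
    middle = [c for c in candidates if at <= c[0] <= rt]  # DTS.3 pool
    # DTS.2 is implicit: candidates above RT are never selected.
    kept += middle[: L - len(kept)]                       # fill up to L by priority
    return kept

survivors = dts_prune([0.0, 1.0, 2.0, 3.0], [5.0, -0.5, 0.2, -4.0])
```

In this small example the survivors happen to coincide with the four smallest extended metrics. With a very small `rt_index`, the pool for DTS.3 can become too small and fewer than L paths survive, which is the over-pruning case discussed for a tight RT.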
Threshold Tracking Architecture
To support the DTS block, the values of AT and RT are needed. These values are calculated by the Threshold Tracking Architecture (TTA) shown in Fig. 3 . From Section 4.1, AT and RT used at depth i + 1 of the decoding tree depend on the path metric at depth i. Therefore, the TTA can be executed in parallel with the list decoding in extending the path from depth i to i + 1. This leads to a relaxed timing budget for TTA, and it can be executed in multiple cycles.
From (8) , the TTA finds the median and the maximum values of the L input numbers. Finding the median is more complicated and its implementation is based on the following property of the medians.
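Proposition 2. Assume $W$ numbers $\{w_0, w_1, \ldots, w_{W-1}\}$ satisfy the following properties:
$$w_0 \le w_1 \le \cdots \le w_{W/2-1}, \qquad w_{W/2} \le w_{W/2+1} \le \cdots \le w_{W-1}, \tag{9}$$
where $w_{m_0}$ and $w_{m_1}$ are the medians of $w_0, \ldots, w_{W/2-1}$ and $w_{W/2}, \ldots, w_{W-1}$, respectively. If the median of $\{w_0, w_1, \ldots, w_{W-1}\}$ is denoted as $w_m$, then
$$\begin{cases} w_m \in \{w_{m_0}, \ldots, w_{W/2-1}, w_{W/2}, \ldots, w_{m_1}\}, & w_{m_0} < w_{m_1}, \\ w_m \in \{w_{m_1}, \ldots, w_{W-1}, w_0, \ldots, w_{m_0}\}, & w_{m_0} > w_{m_1}, \\ w_m = w_{m_0} = w_{m_1}, & w_{m_0} = w_{m_1}. \end{cases} \tag{10}$$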
Proposition 2 can be recursively used to find the value of AT. Fig. 3(a) shows the corresponding architecture for $L = 8$. It consists of two radix-$L/2$ sorters [29], $L - 1$ MUXes, and $\log_2 L$ comparators. As shown in Fig. 3(a), the $L$ path metrics are evenly divided into two groups and each group is passed through a radix-$L/2$ sorter. As a result, the metrics in each group are sorted, as required by Proposition 2. By comparing the medians of the two groups, the size of the median candidate set is halved. In Fig. 3(a), the comparison result controls the MUXes that select the remaining candidates, which again form two sorted groups. Hence, similar comparison and MUX architectures can be used for the following stages. As a result, after $\log_2 L$ stages, the median of the inputs, i.e., AT for the next depth, is obtained.
To find the value of RT, the architecture is simpler. If the maximum path metric is adopted as RT as in (8), RT is the larger of the maxima of the two sorted groups in Fig. 3(a). If the second maximum path metric is taken for a tighter RT, it can be found by the architecture in Fig. 3(b).
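A software analogue of the threshold tracking can be sketched in Python. It is not the circuit of Fig. 3: instead of the comparator and MUX stages, it uses a standard selection routine over two sorted groups that discards part of one group per step, so the number of steps grows logarithmically with the rank. It assumes that AT is the element of rank L/2 (counting from 0) and that RT is the overall maximum. The function names are hypothetical.

```python
def kth_of_two_sorted(a, b, k):
    """Return the element of rank k (0-indexed) in the union of sorted lists a and b."""
    i = j = 0
    while True:
        if i == len(a):
            return b[j + k]
        if j == len(b):
            return a[i + k]
        if k == 0:
            return min(a[i], b[j])
        half = (k + 1) // 2                    # how many elements to try to discard
        ni = min(i + half, len(a)) - 1
        nj = min(j + half, len(b)) - 1
        if a[ni] <= b[nj]:
            k -= ni - i + 1                    # a[i..ni] cannot contain rank k
            i = ni + 1
        else:
            k -= nj - j + 1                    # b[j..nj] cannot contain rank k
            j = nj + 1

def track_thresholds(group0, group1):
    """Compute (AT, RT) from two sorted groups of L/2 path metrics each."""
    L = len(group0) + len(group1)
    at = kth_of_two_sorted(group0, group1, L // 2)  # assumed AT: rank L/2
    rt = max(group0[-1], group1[-1])                # assumed RT: overall maximum
    return at, rt

at, rt = track_thresholds([1, 4, 6, 9], [2, 3, 7, 8])
```

As in the recursive use of Proposition 2, each step compares one element from each sorted group and removes candidates that cannot be the target, so no full sort of the L metrics is needed.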
List Decoding Architecture
The top-level architecture of the proposed low-latency list decoding is shown in Fig. 4. It contains $L$ SCDs, and each SCD is implemented with a semi-parallel architecture of $M < N/2$ processing elements (PEs) [10]. Based on the $L^n_i$s output from the SCDs, the PMU generates $2L$ pms from the $L$ stored pms with (4). Out of these $2L$ pms, $L$ pms are chosen by the DTS and stored in the register of the PMU. Based on the registered pms, the TTA computes the AT and RT used in decoding $u_{i+1}$. After the DTS, the memory contents related to the SCD need to be copied as in [21]. As shown in Fig. 4, the lazy copy (LCP) block generates the control logic for them. Finally, when the list decoding reaches the leaf node of the decoding tree, the contents of the path memory are passed to the CRC check block. The source word that satisfies the CRC check is the decoding result $\hat{u}_N$.
Fig. 5 shows the timing diagram of the proposed low-latency list decoding architecture, using the example in Fig. 1 for illustration. For simplicity, assume there are already $L$ decoding paths in the list in the beginning. From Fig. 5, different from the conventional SCD, two additional clock cycles are inserted after each leaf node SCD operation of the scheduling tree. As depicted in Fig. 5, they are used for the list pruning by the DTS and the memory manipulation by the LCP, respectively. As a result, the latency of decoding one codeword in terms of clock cycle number is given by
Finally, Fig. 5 also shows that the TTA is not on the critical path of the list decoding. At least 2 clock cycles are available for the TTA.
Further Latency Reduction
In this sub-section, frozen siblings are used to reduce the decoding latency. They are defined as [u2j, u2j+1] with {2j, 2j + 1} ⊂ A c . With 0 ≤ j < N/2, a frozen sibling corresponds to a leaf sibling in the scheduling tree. For a general sibling, as shown in 
It can be proven that (12) is equivalent to (4) for the frozen sibling, and 5 clock cycles can be saved by (12). For example, if $[u_0, u_1]$ in Fig. 1 is a frozen sibling, the timing diagram of the list decoding is shown in Fig. 6. All the decoding operations related to the frozen sibling shrink to one PMU operation (12) in one clock cycle. Therefore, the latency of the proposed list decoding is reduced to
where $FS$ is the number of frozen siblings in the given polar codes.
EXPERIMENTAL RESULTS
Polar codes are simulated for the conventional list decoding with the sorting strategy [24]. As a reference, the performances of the SCD and the (N, R) = (2304, 1/2) WiMAX LDPC code [30] are also presented. Here, 25 iterations are used for LDPC decoding. It can be seen that the performance of polar codes is better than that of the LDPC code when $L = 16$ list decoding is used. Finally, three DTSs are used for $L = 16$, with different path metrics $pm^i_k$ assumed as their RTs. Fig. 7 indicates that $pm^i_{14}$ is the optimal value of RT, and the performance degradation of the resulting low-latency list decoding is smaller than 0.02 dB.
The architecture shown in Fig. 4 is implemented for $L = 16$ to decode (N, R) = (1024, 1/2) polar codes. The quantization scheme of [24] is used, i.e., 6 bits for the channel LLR $L^0_i$ and 8 bits for the path metric. The design is synthesized using a UMC 90 nm CMOS technology, and Table I summarizes the synthesis results. Due to the large list size, the area of the LLR memory in our implementation is large and equals 4.5 mm$^2$. For the target polar codes, $FS = 231$ and the decoding throughput can be obtained from (13). From the table, the proposed architecture achieves a decoding throughput of 220 Mbps, which is even greater than that of the list size of 8 in [23]. The results in Table I demonstrate the effectiveness of the proposed low-latency list decoding architecture with double thresholding.
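The number of frozen siblings follows directly from the frozen set, using the definition of a frozen sibling as a pair of source bits with indices 2j and 2j+1 that are both frozen. A minimal Python sketch is given below. The function name and the example frozen set are hypothetical and are not the code construction used for the reported results.

```python
def count_frozen_siblings(frozen_set, N):
    """Count pairs (u_2j, u_2j+1), 0 <= j < N/2, in which both bits are frozen."""
    return sum(
        1
        for j in range(N // 2)
        if 2 * j in frozen_set and 2 * j + 1 in frozen_set
    )

# Illustrative frozen set for N = 8: only the pairs (0, 1) and (4, 5) are fully frozen
fs = count_frozen_siblings({0, 1, 2, 4, 5}, 8)
```

Applied to the frozen set of the target code, such a count would give the value of FS that enters the reduced latency.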
CONCLUSION
For a low-latency list decoding, a double thresholding strategy (DTS) is proposed for fast list pruning. With a negligible performance degradation, the DTS greatly reduces the pruning logic delay. Based on the DTS, the low-latency list decoding architecture is proposed. Comparison results demonstrate that the proposed architecture achieves a much lower latency for a large list size.
