Abstract-Due to their provably capacity-achieving performance, polar codes have attracted a lot of research interest recently. For a good error-correcting performance, list successive-cancellation decoding (LSCD) with large list size is used to decode polar codes. However, as the complexity and delay of the list management operation rapidly increase with the list size, the overall latency of LSCD becomes large and limits the applicability of polar codes in high-throughput and latency-sensitive applications. Therefore, in this work, the low-latency implementation for LSCD with large list size is studied. Specifically, at the system level, a selective expansion method is proposed such that some of the reliable bits are not expanded to reduce the computation and latency. At the algorithmic level, a double thresholding scheme is proposed as a fast approximate-sorting method for the list management operation to reduce the LSCD latency for large list size. A VLSI architecture of the LSCD implementing the selective expansion and double thresholding scheme is then developed, and implemented using a UMC 90 nm CMOS technology. Experimental results show that, even for a large list size of 16, the proposed LSCD achieves a decoding throughput of 460 Mbps at a clock frequency of 658 MHz.
I. INTRODUCTION
As the first family of error-correcting codes provably achieving the channel capacity with explicit construction, polar codes are a major breakthrough in coding theory [1]. Due to their low encoding and decoding complexities, polar codes have drawn a lot of research interest recently [2]-[16].
Successive-cancellation decoding (SCD) was proposed in [1] for decoding polar codes. It was shown that SCD asymptotically achieves the channel capacity when the code length $N$ is large [1]. Moreover, the computational complexity of the SCD algorithm is low, in the order of $N \log_2 N$ [1]. Therefore, the SCD algorithm and its hardware implementation have been extensively studied recently [17]-[28]. However, for polar codes with short-to-medium code length, the error-correcting performance of SCD is unsatisfactory. For example, as shown in [29], compared with the low-density parity-check (LDPC) code with similar code length and code rate, the SNR penalty of SCD for $N = 2048$ polar codes is greater than 1 dB for a bit error rate of $10^{-5}$. Hence, to improve the performance of polar codes with short-to-medium code length, SCDs generating multiple codeword candidates were proposed. They are list successive-cancellation decoding (LSCD) [29], [30] and its variants [31]-[33].
During the decoding of one codeword, LSCD generates $L$ codeword candidates, where $L$ is called the list size. The value of $L$ determines the trade-off between the error-correcting performance and the computational complexity. From [29], the LSCD approaches the maximum likelihood decoding (MLD) performance of polar codes with a moderate list size. However, this performance is still not comparable with that of advanced error-correcting codes such as Turbo codes and LDPC codes. To further improve the error-correcting performance, a cyclic redundancy check (CRC) code is serially concatenated with the polar codes, and the CRC bits are used to choose the valid codeword from the candidates of the LSCD [29], [34], [35]. With the help of the CRC code, the LSCD of polar codes achieves or even exceeds the error-correcting performance of Turbo codes [36] and LDPC codes [29]. However, this performance improvement comes at the cost of a larger list size (e.g., $L = 16$ or 32), and hence the complexity of the corresponding LSCD becomes high. The high computational complexity also results in an LSCD architecture with high decoding latency and low throughput. This limits the applicability of polar codes in high-throughput and latency-sensitive applications. In this work, a low-latency LSCD architecture is explored, aiming at promoting polar codes as a competitive coding candidate in both the error-correcting and hardware implementation aspects.
LSCD mainly consists of two classes of operations: 1) SCD operations for generating each of the $L$ codeword candidates, and 2) list management (LM) operations for maintaining the $L$ (locally) best codeword candidates in the list. SCD operations are serial in nature and hence affect the decoding latency. LM operations involve finding the best $L$ out of $2L$ candidates and maintaining the copies of the candidates. This requires sorting and copying operations whose complexity increases rapidly with $L$. To achieve a low latency, existing LSCD architectures apply optimizations at either the algorithmic or architectural level. As the first work on LSCD, lazy copy was proposed in [29] to reduce the data copying complexity and hence the latency of the LM operation. The corresponding gate-level implementation was detailed in [37]. In [38] and [39], the operand of the SCD operation was changed from the log-likelihood (LL) value to the log-likelihood ratio (LLR), resulting in a simplified data path and improved clock frequency as well as a smaller memory. To reduce the latency introduced by the SCD operation, multiple bits of a codeword were decoded at the same time in [40]-[44]. In [45], the pre-computation lookahead technique was used to reduce the SCD latency by half, at the cost of a larger memory. However, all these LSCD architectures [37]-[45] were designed for a small list size ($L \le 4$). With the increase of the list size, both the computational complexity and the logic delay of the LM operation become larger. Therefore, to support LSCD for $L = 8$ with a reasonable clock frequency, up to three pipeline stages were inserted in the LM operation and three cycles were needed for each LM operation in [46]. This resulted in a long decoding latency. In [47], the serial sorting operation in the LM operation was parallelized at the architectural level [48], and the latency of the resulting LSCD architecture was reduced for $L = 8$.
However, as shown in [49] , even using a parallel architecture, the logic delay of the LM operation keeps increasing with the list size, and it deteriorates the clock frequency of the overall LSCD architecture for a larger list size (L > 8). Therefore, in this work, we concentrate on reducing the latency introduced by LM operations, especially for a large list size L.
This work achieves low-latency LSCD implementation by performing optimizations at the system, algorithmic, and architectural levels, as depicted in Fig. 1. At the system level, a method called selective expansion (SE) is proposed based on the properties of polar codes. From [1], each source word bit of a polar code corresponds to a synthetic channel, and different synthetic channels have different reliabilities. In the SE method, only those bits associated with the less reliable synthetic channels are decoded with the LSCD, while the more reliable bits are decoded by the SCD [50]. As a result, the LM operation (and its associated latency) is not needed for the reliable bits. To implement the SE method on the LSCD architecture, an optimization problem is formulated to determine which bits are decoded by the LSCD, such that the latency saving is maximized for a given error-correcting performance requirement of the system. We note that, similar to SE, a concurrent work [51] was proposed to reduce the complexity of the LSCD by utilizing the synthetic channel characteristics. However, the methodology and the goal of that work differ from ours. At the algorithmic level, an approximated LM operation called the double thresholding scheme (DTS) is proposed. Instead of exactly maintaining the $L$ (locally) best codeword candidates, the DTS keeps the $L$ almost-the-best codeword candidates in the list such that the performance degradation introduced is negligible [52]. Compared with the original LM operation, the DTS is parallel in nature and its logic delay is independent of the list size. Hence, the latency of the LM operation is not increased, even for a large list size. Finally, at the architectural level, an efficient LSCD architecture based on the DTS is proposed. By optimizing the schedule and logic of the blocks related to the LM operation, a low-latency LSCD implementation is achieved, even for $L = 16$.
The remainder of this paper is organized as follows. The construction of polar codes and the algorithm of LSCD are reviewed in Section II. Section III presents the proposed SE method for reducing the latency of LSCD. The DTS is detailed in Section IV and Section V presents the LSCD architecture with a low decoding latency. In Section VI, the simulation results of the error-correcting performance of the proposed low-latency LSCD architecture are presented. The ASIC implementation results of the proposed architecture are also shown. Finally, Section VII concludes the work.
II. PRELIMINARIES
In this section, the channel polarization phenomenon [1] discovered by Arıkan is first reviewed, as it is fundamental to the SE method discussed in Section III. After that, the construction of polar codes and the algorithm of LSCD are reviewed.
A. Channel Polarization Phenomenon
Consider a binary-input discrete memoryless channel, denoted as $W : \mathcal{X} \to \mathcal{Y}$, with an input alphabet $\mathcal{X} = \{0, 1\}$ and an output alphabet $\mathcal{Y}$. Channel $W$ is specified by the channel transition probabilities $W(y|x)$ with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.
Let $W_N : \mathcal{X}^N \to \mathcal{Y}^N$ denote $N$ independent copies of channel $W$, where $N = 2^n$ and $n \in \mathbb{N}$. Channel $W_N$ is described by the channel transition probabilities
$$W_N(y_N | x_N) = \prod_{i=0}^{N-1} W(y_i | x_i), \tag{1}$$
where $x_N \in \mathcal{X}^N$ and $y_N \in \mathcal{Y}^N$ are the input and the output of $W_N$, respectively.
Let $u_N \in \mathcal{X}^N$ be a binary vector one-to-one mapped to $x_N$ by the following relation:
where $x^T$ is the transpose of $x$, and $F^{\otimes n}$ is the $n$-th Kronecker power of the kernel matrix $F$.
where $x_N$ and $u_N$ are related by (2). $x_a^b$ denotes the sub-vector of $x$ with starting index $a$ and ending index $b$. From (3), the input of a synthetic channel $W_N^i$ is a binary bit $u_i \in \mathcal{X}$, and its output includes the $W_N$ output $y_N$ and the side information of the $i$ preceding bits $u_0^{i-1}$. To evaluate the performance of the synthetic channels, a probability of error $P_e(i)$ is associated with each channel $W_N^i$. Under maximum likelihood decoding (MLD), $P_e(i)$ is given as
where $u_0^{i-1} \in \mathcal{X}^i$ and $y_N \in \mathcal{Y}^N$, and $u_i$ takes each value in $\mathcal{X}$ with equal probability. For any given $N$, the values of $P_e(i)$ can be found efficiently by the density evolution techniques presented in [9]-[13].
Arıkan's Channel Polarization Theorem studies the behavior of the synthetic channel $W_N^i$ [1]. One key observation of the theorem is that when $N \to \infty$, the performance of the synthetic channels $W_N^i$ is polarized; i.e., except for a vanishing fraction of the $W_N^i$s, the rest are either almost noise-free ($P_e(i) \to 0$) or almost useless ($P_e(i) \to 0.5$). For a finite value of $N$, the $P_e(i)$s of the synthetic channels move close to either 0 or 0.5, and they differ from one $W_N^i$ to another [9]-[13].
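As an illustration of polarization at finite $N$, the sketch below tracks the erasure probability of every synthetic channel of a binary erasure channel (BEC), for which the one-step recursion is known in closed form. The BEC, the erasure probability 0.5, and the function name `bec_synthetic_erasure` are assumptions chosen for illustration only; for general channels the $P_e(i)$ values are obtained by density evolution [9]-[13], which is not reproduced here.

```python
def bec_synthetic_erasure(n, eps):
    """Erasure probabilities of the N = 2**n synthetic channels of a BEC(eps).

    Index i follows the natural order: child 2j is the degraded ("minus")
    transform of channel j, child 2j+1 is the upgraded ("plus") transform.
    """
    z = [eps]
    for _ in range(n):
        nxt = []
        for p in z:
            nxt.append(2 * p - p * p)  # minus transform: less reliable
            nxt.append(p * p)          # plus transform: more reliable
        z = nxt
    return z


if __name__ == "__main__":
    z = bec_synthetic_erasure(10, 0.5)
    extreme = sum(1 for p in z if p < 0.01 or p > 0.99)
    print(f"{extreme} of {len(z)} channels are within 0.01 of 0 or 1")
```

The average erasure probability stays equal to `eps` at every level, while individual channels drift toward 0 or 1 as `n` grows, which is the finite-length behavior described above.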
B. Construction of Polar Codes
Based on the channel polarization phenomenon, the construction of polar codes is simple. In a polar coding scheme, (2) represents the encoding operation of a length-$N$ polar code. Vectors $u_N$ and $x_N$ are the source word and codeword, respectively. A rate $R = K/N$ polar code is specified by the frozen set $\mathcal{A}^c \subset \{0, 1, \ldots, N-1\}$ of cardinality $|\mathcal{A}^c| = N - K$ and the information set $\mathcal{A}$ defined as $\mathcal{A} = \{0, 1, \ldots, N-1\} \setminus \mathcal{A}^c$. The $K$ source word bits $u_i$ ($i \in \mathcal{A}$) deliver the information bits, and the remaining $N - K$ bits $u_i$ ($i \in \mathcal{A}^c$) are the frozen bits. Since the frozen bits are set to a value, e.g. 0, known to both the encoder and the decoder, the block-error probability $P_b$ of polar codes is bounded by [1],
$$P_b \le \sum_{i \in \mathcal{A}} P_e(i). \tag{5}$$
From (5), choosing the K indices with the smallest P e (i)s in A minimizes the block-error probability P b . From the discussion in Section II-A, if K is not greater than the number of the almost noise-free synthetic channels, a reliable communication is achieved by the polar codes.
If an $r$-bit CRC code is used in polar codes, to maintain a fixed code rate $R$, the information set $\mathcal{A}$ is extended such that $|\mathcal{A}| = NR + r$ by switching the $r$ most reliable frozen bits to information bits. These extended bits deliver the CRC code bits of the original $NR$ information bits. In the LSCD, only the codeword candidate passing the CRC check is output as the decoding result.
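The construction and encoding steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the error probabilities `pe` are assumed to be given (e.g., from density evolution), and the encoder assumes the row-vector convention $x = u F^{\otimes n}$ over GF(2) with kernel $F = [[1, 0], [1, 1]]$ and no bit-reversal permutation, since the exact form of (2) is not reproduced in this text.

```python
def build_information_set(pe, K, r=0):
    """Pick the K + r most reliable indices as the information set.

    pe[i] is the error probability of synthetic channel i (assumed given).
    Taking K + r indices is equivalent to first choosing K and then moving
    the r most reliable frozen indices into the information set.
    """
    order = sorted(range(len(pe)), key=lambda i: pe[i])  # most reliable first
    info = sorted(order[:K + r])
    frozen = sorted(order[K + r:])
    return info, frozen


def polar_encode(u):
    """Compute x = u * F^{(x)n} over GF(2) with an in-place butterfly (assumed convention)."""
    x = list(u)
    N = len(x)
    step = 1
    while step < N:
        for start in range(0, N, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]  # upper branch gets the XOR of both inputs
        step *= 2
    return x


if __name__ == "__main__":
    info, frozen = build_information_set([0.4, 0.2, 0.1, 0.01], K=2)
    u = [0, 0, 1, 1]  # frozen positions hold 0, information positions carry data
    print(info, frozen, polar_encode(u))
```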
C. List Successive-Cancellation Decoding
The decoding process of polar codes can be treated as a search problem in the decoding tree. As an example, Fig. 2 shows the decoding tree for an $N = 4$ polar code. In general, the decoding tree of a length-$N$ polar code is a depth-$N$ binary tree, with $u_i$ mapped to the nodes at depth $i + 1$. As shown in Fig. 2, its root node represents a null state, and the left and right children at depth $i + 1$ represent $u_i = 0$ and $u_i = 1$, respectively. Therefore, a path from the root node to a depth-$i$ node represents a sub-vector $u_0^{i-1} \in \mathcal{X}^i$, and it is called a decoding path. Specifically, a complete decoding path is a path from the root node to a leaf node, and it represents a vector $u_N \in \mathcal{X}^N$. The value of each bit of $u_N$ is shown in the corresponding node lying on this decoding path. If $u_i$ is a frozen bit, it only assumes a preset value, e.g. 0. Consequently, the right-hand sub-tree rooted at the depth-$(i + 1)$ node is pruned, as the $u_N$s included in this sub-tree are not valid source words. For example, if $\mathcal{A}^c = \{0\}$, the gray sub-tree in Fig. 2 is pruned. As a result, each complete decoding path in the pruned decoding tree corresponds one-to-one to a valid source word of the polar code, and the set of valid source words is denoted as $\mathcal{U} = \{u_N | u_i = 0, i \in \mathcal{A}^c\}$. In the subsequent discussion, let $u_N \in \mathcal{U}$ be the transmitted source word, and the task of the decoder is to find a complete decoding path $\hat{u}_N \in \mathcal{U}$ to decode $u_N$.
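The pruned decoding tree can be made concrete with a small enumeration. The sketch below lists every complete decoding path that survives pruning for a given frozen set; the function name is hypothetical and the frozen value is assumed to be 0, as in the example above.

```python
from itertools import product


def valid_source_words(N, frozen):
    """Enumerate all source words whose frozen positions are 0 (the set U)."""
    frozen = set(frozen)
    free = [i for i in range(N) if i not in frozen]
    words = []
    for bits in product((0, 1), repeat=len(free)):
        u = [0] * N
        for i, b in zip(free, bits):
            u[i] = b
        words.append(tuple(u))
    return words


if __name__ == "__main__":
    # N = 4 with frozen set {0}: the sub-tree with u_0 = 1 is pruned.
    for word in valid_source_words(4, {0}):
        print(word)
```

For $K$ information bits the enumeration has $2^K$ entries, which is why an exhaustive search over complete paths quickly becomes impractical.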
The MLD of polar codes exhaustively searches all the complete decoding paths in the decoding tree and generates the likelihood $\Pr(y_N | \hat{u}_N)$ for each complete decoding path $\hat{u}_N \in \mathcal{U}$, where
The decoding path $\hat{u}_N^{\mathrm{MLD}}$ with the maximum $\Pr(y_N | \hat{u}_N)$ is output as the decoding result. ... To ease the implementation, the likelihood $\Pr(y_N | \hat{u}_N)$ is represented by the path metric $\gamma_N(\hat{u}_N)$, which is given by [38]:
For a given channel observation $y_N$, the second term in (7) is the same for all source words $\hat{u}_N$. Therefore, the MLD of polar codes is described by
Recently, [38] and [39] showed that the path metric $\gamma_N(\hat{u}_N)$ can be expressed as
where $\hat{u}_i$ is the $i$-th bit of the decoding path $\hat{u}_N$, and $\Lambda_i$ denotes the output LLR of the synthetic channel $W_N^i$, which is given as
From ( . Using the alternative form of the path metric expressed in (9) enables the use of LLR-based SCD in LSCD which leads to a lower logic delay and memory requirement over its LL-based counterpart [38] - [39] .
Similarly, a path metric $\gamma^i$ is associated with the decoding path $\hat{u}_0^{i-1}$, and is given as
where $\gamma^0 = 0$. Considering all the decoding path $\hat{u}$ of each path is available. When the decoding path is extended to the next depth, the path metric of $\hat{u}_0^i$ is updated as
where the decoding path $\hat{u}$ the MLD can be regarded as a breadth-first search in the decoding tree.
Since there are $2^K$ complete decoding paths in the pruned decoding tree, the MLD complexity is as large as $O(2^K)$. To achieve a reasonable decoding complexity, LSCD was proposed to obtain a decoding performance close to that of MLD with a much smaller complexity. For an LSCD with a list size of $L$, at most $L$ decoding paths are maintained at each depth of the decoding tree. Therefore, after decoding $\log_2 L$ information bits $u_i$, the decoding list has $L$ decoding paths. In the subsequent decoding, if $u_i$ is a frozen bit, $L$ decoding paths $\hat{u}$ are dropped in the subsequent discussion, and the $L$ path metrics (and output LLRs) are indexed by the subscript $l = 0, 1, \ldots, L-1$. In this work, as depicted in Fig. 3, the LPO together with the PMU is denoted as the LM operation.
In the PMU operation specified by (12), the output LLR $\Lambda_i$ of each decoding path $\hat{u}_0^{i-1}$ is required, and it is generated by the SCD. The SCD operation for a length-$N$ polar code can be represented by a depth-$n$ balanced binary tree, called the scheduling tree [25]. Fig. 4 shows an example of the scheduling tree for an $N = 4$ polar code. Its root node provides the input LLRs $L_i^n$ from the channel observation $y_N$ as follows:
where i = 0, 1, . . . , N − 1. The non-root nodes in the scheduling tree are categorized into two types: the f node at the left-hand child and the g node at the right-hand child.
The f node at stage t executes the following f function,
and the g node executes the following g function,
where $j = 0, 1, \ldots, 2^t - 1$ and the $L_j^t$s are the output LLRs at stage $t$. From (14) and (15), each function of the SCD has two LLRs as inputs and one LLR as output. A node at stage $t$ of the scheduling tree includes $2^t$ functions, and they can be executed in parallel. As a result, $2^t$ LLRs $L_j^t$ are output by a node at stage $t$, and they are the inputs of its two children in the next stage.
Algorithm 1. LSCD with list size $L$. Line 15: return the $\hat{u}_N$ that passed the CRC check.
The variable $s_j$ in (15) is known as the partial-sum in [20] and [25]. The partial-sum $s = [s_0, s_1, \ldots, s_{2^t - 1}]$ is calculated from the previous decoding path $\tilde{u} = \hat{u}$
Due to the data dependency introduced by the partial-sum, the decoding schedule of the SCD follows the depth-first traversal of the scheduling tree. As shown in Fig. 4 , the i th leaf node of the scheduling tree outputs the LLR of the synthetic channel
and hence Λ i s are serially generated. Based on Λ i , if i ∈ A, the MLD of u i is given by
where Θ Λ i is the hard-decision function based on the value of Λ i . The probability of error
Algorithm 1 summarizes the procedure of an LSCD with list size $L$. Line 4 indicates that the LSCD consists of $L$ SCDs. They are executed in parallel until a leaf node of the scheduling tree is reached. With the $L$ output LLRs $\Lambda_l^i$, the decoding paths $\hat{u}_0^{i-1}$ are extended to the next depth of the decoding tree and the path metrics are updated by the PMU. If the number of extended paths is greater than $L$, the LPO is executed. Note that the SCD operation has to be stalled until the LPO is finished, because the subsequent SCD operation needs the knowledge of the previous path $\hat{u}_0^i$, as discussed in (16). As a result, the decoding schedule of the LSCD can also be represented by the depth-first traversal of the scheduling tree, except that the LM operation (Lines 5-14 in Algorithm 1) has to be executed at each leaf node of the scheduling tree. Hence, the decoding latency of the LSCD depends on the latency of both the SCD and LM operations.
Finally, it is noted that the PMU in (12) and the f function in (14) are non-linear functions. To simplify the hardware implementation, the PMU is approximated as follows [38] - [39] , [47] :
where $l = 0, 1, \ldots, L-1$, and $\gamma_{2l}^{i+1}$ and $\gamma_{2l+1}^{i+1}$ denote the path metrics of the two path extensions from the $l$-th decoding path $\hat{u}_0^{i-1}$, respectively. Here, $\bar{x}$ is the complement of the binary variable $x$. Similarly, the $f$ function is usually approximated as [20]-[28]
(19)
where $\mathrm{sgn}(\cdot)$ and $|\cdot|$ represent the sign bit and the magnitude of a variable, respectively. As hardware implementation is discussed in this work, (18) and (19) will be used for the corresponding calculation unless otherwise stated.
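The hardware-friendly operations described above can be sketched in a few lines. Everything here is an assumption made for illustration: the function names are hypothetical, the LLR convention is taken as $\ln(W(y|0)/W(y|1))$ so that a non-negative LLR decides 0, the $g$ function uses its standard form $(1 - 2s)a + b$, the $f$ function uses the common min-sum form, and the approximated PMU leaves the metric unchanged for the extension that agrees with the hard decision (index $2l$) and adds $|\Lambda|$ to the other (index $2l+1$).

```python
def f_minsum(a, b):
    """Min-sum approximation of the f function: sign product times smaller magnitude."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g_func(a, b, s):
    """Standard g function with partial-sum bit s in {0, 1}."""
    return (1 - 2 * s) * a + b


def hard_decision(llr):
    """Theta(LLR): decide 0 for a non-negative LLR, 1 otherwise (assumed convention)."""
    return 0 if llr >= 0 else 1


def pmu_approx(gammas, llrs):
    """Expand L path metrics into 2L metrics using the approximated update.

    Output index 2l keeps gamma_l (bit equal to the hard decision);
    index 2l+1 is penalized by |LLR_l| (bit equal to the complement).
    """
    out = []
    for gamma, llr in zip(gammas, llrs):
        out.append(gamma)
        out.append(gamma + abs(llr))
    return out


if __name__ == "__main__":
    print(f_minsum(2.0, -3.0), g_func(2.0, 3.0, 1))
    print(pmu_approx([0.0, 1.0], [2.0, -0.5]))
```

Only additions, comparisons, and sign manipulations remain, which is the motivation for using the approximated forms in hardware.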
III. SELECTIVE EXPANSION

A. Selective Expansion Scheme
From the discussion in Section II-C, additional latency is introduced by an LM operation, when L decoding paths are expanded into 2L paths for an information bit u i (i ∈ A) in the LSCD. In this section, we present a selective expansion (SE) scheme where the path expansion for some of the information bits is not executed; i.e., L decoding paths are only extended into L paths for those bits. As a result, the list pruning operation (LPO) is not needed and the associated latency will not be added to the overall latency.
When an information bit $u_i$ ($i \in \mathcal{A}$) is decoded, there are $L$ surviving decoding paths $\hat{u}_0^{i-1}$ available due to the decoding of the previous $i$ bits. Assuming that ultimately the LSCD will correctly decode the source word, there exists one path $u_0^{i-1}$ in the list that agrees with the transmitted source word, and the probability of error for $\hat{u}_i = \Theta(\Lambda_i)$ is $P_e(i)$. Therefore, if the decoding path $u_0^{i-1}$ is only extended into a single path taking $\hat{u}_i$ as $\Theta(\Lambda_i)$, the probability of this path extension leading to an incorrect decoding of the transmitted source word $u_N$ is then $P_e(i)$. From the discussion in Section II-A, even inside the information set, different bits have different $P_e(i)$s. If $u_i$ corresponds to a very reliable channel with a very low $P_e(i)$, the probability of $u_0^i$ not being in the candidate list after extending the path into a single path with $\hat{u}_i = \Theta(\Lambda_i)$ is small, and the performance degradation introduced is negligible.
Based on the above discussion, the SE method is proposed. It divides the information set A into two subsets: the reliable set and the unreliable set, denoted by A r and A u , respectively. Only for those bits inside A u are the L decoding paths expanded into 2L paths. If u i is in A r , each of the L decoding paths is extended into a single path by takingû i = Θ Λ i . Consequently, the LPO and the associated latency are saved for those bits inside A r . Moreover, from (18) , sinceû i is taken to be Θ Λ i , no PMU operation is required. Next, the method of determining the set A r is discussed.
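The three cases of path extension can be summarized in a short sketch. This is an illustration under stated assumptions rather than the decoder of this work: the function name and the string labels are hypothetical, a non-negative LLR is assumed to decide 0, frozen bits are assumed to be 0 and are penalized by $|\Lambda|$ when the hard decision disagrees (following the approximated update), and pruning back to $L$ paths is left to a separate step.

```python
def extend_paths(paths, gammas, llrs, kind):
    """Extend every surviving path by one bit.

    kind: "frozen", "reliable" (bit in A_r) or "unreliable" (bit in A_u).
    Returns (new_paths, new_gammas); only the "unreliable" case doubles the list.
    """
    new_paths, new_gammas = [], []
    for path, gamma, llr in zip(paths, gammas, llrs):
        hd = 0 if llr >= 0 else 1  # hard decision Theta(LLR), assumed convention
        if kind == "frozen":
            penalty = abs(llr) if hd != 0 else 0.0
            new_paths.append(path + [0])
            new_gammas.append(gamma + penalty)
        elif kind == "reliable":
            # Selective expansion: single extension, metric unchanged, no pruning.
            new_paths.append(path + [hd])
            new_gammas.append(gamma)
        elif kind == "unreliable":
            new_paths.append(path + [hd])
            new_gammas.append(gamma)
            new_paths.append(path + [1 - hd])
            new_gammas.append(gamma + abs(llr))
        else:
            raise ValueError(f"unknown bit kind: {kind}")
    return new_paths, new_gammas


if __name__ == "__main__":
    paths, gammas, llrs = [[0], [1]], [0.0, 1.0], [2.0, -3.0]
    print(extend_paths(paths, gammas, llrs, "reliable"))
    print(extend_paths(paths, gammas, llrs, "unreliable"))
```

In the "reliable" case the list size stays at $L$ and the metrics are untouched, so neither the PMU nor the LPO is exercised for that bit.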
B. Reliable Set for Selective Expansion
To determine the reliable set $\mathcal{A}_r$ (or equivalently $\mathcal{A}_u = \mathcal{A} \setminus \mathcal{A}_r$), the performance of LSCD using the SE method is first analyzed. Let $M_{\mathrm{LSCD}}$ and $M_{\mathrm{SE}}$ denote the candidate lists output from the conventional LSCD and the LSCD using the SE method, respectively. We are mainly interested in the block-error event that the transmitted source word $u_N$ is not in $M_{\mathrm{LSCD}}$ or $M_{\mathrm{SE}}$. The block-error event of the SE method $E_{\mathrm{SE}} \triangleq \{u_N \notin M_{\mathrm{SE}}\}$ is given by (5), the probability of event $B$ satisfies
From the above discussion, the block-error probability of the LSCD using the SE method, i.e., $P_b^{\mathrm{SE}} \triangleq \Pr(E_{\mathrm{SE}})$, is upper bounded by
where $P_b^{\mathrm{LSCD}} \triangleq \Pr(E_{\mathrm{LSCD}})$ denotes the block-error probability of the conventional LSCD. 4 Furthermore, to simplify the calculation of (22), the $P_e(i)$s are approximated by the error probabilities $P_e^d(i)$ of their degraded channels [13], where
4 It is assumed in this work that the value of $P_b^{\mathrm{LSCD}}$ is already available and it can be obtained from the simulation. We leave the theoretical analysis of $P_b^{\mathrm{LSCD}}$ to our future works.
Based on (23), we define the upper bound of the block-error probability degradation η introduced by the SE method as
and the block-error probability of the LSCD using SE is no greater than $(1 + \eta) P_b^{\mathrm{LSCD}}$. From the above performance analysis result, we formulate an optimization problem given a constraint on the tolerable error-correcting performance degradation $\epsilon$ as follows:
The solution of (25) is the optimal set $\mathcal{A}_r$, as the objective function $|\mathcal{A}_r|$, reflecting the latency saving achieved by the SE method, is maximized. The optimal solution to problem (25) can be obtained by sorting the information set $\mathcal{A}$ by $P_e^d(i)$ ($i \in \mathcal{A}$) in ascending order and taking the first $k$ elements in the sorted $\mathcal{A}$ such that the corresponding $\eta$ of this $k$-element set is just smaller than $\epsilon$. For an information set $\mathcal{A}$ of polar codes with a given $\epsilon$, the reliable set $\mathcal{A}_r$ of (25) can be found offline accordingly.
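The offline selection just described is a greedy prefix search over the sorted information set, sketched below. The function and parameter names are hypothetical, and the degradation is assumed to take the form $\eta = \sum_{i \in \mathcal{A}_r} P_e^d(i) / P_b^{\mathrm{LSCD}}$, which is consistent with the $(1 + \eta) P_b^{\mathrm{LSCD}}$ bound stated above but is not copied from the equations of this section.

```python
def select_reliable_set(info_set, pe_d, p_lscd, eps):
    """Return (A_r, eta): the largest reliable set whose eta stays below eps.

    info_set: indices of the information bits.
    pe_d:     mapping from index to degraded-channel error probability.
    p_lscd:   block-error probability of the conventional LSCD (e.g., simulated).
    eps:      tolerable relative degradation.
    """
    ordered = sorted(info_set, key=lambda i: pe_d[i])  # most reliable first
    reliable, total = [], 0.0
    for i in ordered:
        if (total + pe_d[i]) / p_lscd >= eps:
            break  # adding this bit would violate the constraint
        total += pe_d[i]
        reliable.append(i)
    return sorted(reliable), total / p_lscd


if __name__ == "__main__":
    pe_d = {3: 1e-2, 5: 1e-3, 6: 1e-4, 7: 1e-6}
    print(select_reliable_set([3, 5, 6, 7], pe_d, p_lscd=1e-3, eps=0.5))
```

Because the error probabilities are non-negative, stopping at the first violating element of the sorted list yields the largest feasible set under this assumed form of $\eta$.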
IV. DOUBLE THRESHOLDING SCHEME
For the SE method, the LM operation still has to be executed for the unreliable information bits. The LPO needs to find the smallest $L$ path metrics from the $2L$ candidate inputs, and a sorting method is required. However, the sorting operation introduces a large latency, particularly when the list size is large. To reduce the latency, parallel sorting can be used, but the computational complexity will be very high for a large list size. Therefore, a low-complexity sorting operation is needed. In this section, a Double Thresholding Scheme (DTS) is proposed at the algorithmic level as a good approximation of the conventional sorting method. Low-complexity parallel comparisons are executed in the DTS to find the surviving paths, and the latency of the LPO is greatly reduced for a large list size $L$.
A. Properties of the Path Metric
From Section II-C, the inputs to the LPO of bit u i are 2L path metrics γ i+1 k (k = 0, 1, . . . , 2L − 1) generated from the PMU as stated in (18) . To approximate the LPO, the properties of the input path metrics are first studied. Specifically, we are interested in the number of the path metrics that are smaller than a certain value T , i.e., the cardinality of the set Ω (T ) which is defined as
The properties related to the cardinality |Ω (T )| are stated as follows. 
The cardinality of Ω (T ),
Proof: From (18) and (27),
, and hence the left-hand part of (28) is proved. On the other hand, γ (28) is proved.
B. Double Thresholding Scheme
Based on the path metric properties presented in Proposition 1, the DTS is proposed for a fast LPO. It finds the L approximately smallest path metrics from the 2L inputs to form the surviving path metric set Ψ. Double Thresholding Scheme: Assuming the L path metrics γ i l (l = 0, 1, . . . , L − 1) input to the PMU satisfy (27) , two threshold values, one the acceptance threshold (AT ) and the other the rejection threshold (RT ), can be determined, and they are given as
The LPO for γ i+1 k (k = 0, 1, . . . , 2L − 1) is then summarized as follows:
Finally, the path extensions with the path metrics γ i+1 k s that are inside Ψ are kept and the rest of the path extensions are pruned.
The operation of the DTS is illustrated in Fig. 5 . Assuming the 2L path metrics γ i+1 k (k = 0, 1, . . . , 2L − 1) are sorted in ascending order, the top L path metrics are the smallest. Hence, they are the elements of Ψ if an exact sorting method is used for the LPO. On the other hand, when the DTS is used, the shaded path metrics are the elements of Ψ.
From Proposition 1, DTS.1 ensures that at least L/2 path metrics are picked and they are the smallest among all 2L path metrics. So these path metrics are in the original exactlysorted Ψ. Therefore, based on DTS.1, the performance of the resulting LSCD with list size L would not be worse than that of the LSCD with a list size L/2 based on the exact sorting method.
From Proposition 1, |Ω (RT )| ≥ L − 1, and (18) implies γ i+1 2L−2 = RT . Hence, at least L γ i+1 k s are less than or equal to RT . It also means that at most L path metrics are greater than RT . Therefore, DTS.2 efficiently excludes at most the L largest path metrics and these are surely not in the original exactly-sorted Ψ. Finally, as shown in Fig. 5(a) , when the number of path metrics picked by DTS.1 is smaller than L, DTS.3 randomly chooses the metrics from the remaining γ i+1 k s to fill up the decoding list such that |Ψ| = L.
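The three DTS rules discussed above can be sketched as follows. This is a minimal software illustration under loudly stated assumptions: the function name is hypothetical, the thresholds are taken as $AT = \gamma_{L/2}^i$ and $RT = \gamma_{L-1}^i$ of the ascending-sorted input metrics (inferred from the properties quoted in the text, since the threshold equations are not reproduced here), DTS.1 accepts metrics strictly below $AT$, DTS.2 rejects metrics strictly above $RT$, and DTS.3 fills the remaining slots at random from the undecided metrics.

```python
import random


def dts_prune(gamma_prev, gamma_ext, rng=random):
    """Pick L surviving indices out of the 2L extended metrics.

    gamma_prev: the L path metrics before expansion.
    gamma_ext:  the 2L metrics after the approximated PMU.
    rng:        object with a sample(population, k) method (assumed interface).
    """
    L = len(gamma_prev)
    ordered = sorted(gamma_prev)
    at, rt = ordered[L // 2], ordered[L - 1]  # assumed threshold choice
    accepted = [k for k, g in enumerate(gamma_ext) if g < at]           # DTS.1
    undecided = [k for k, g in enumerate(gamma_ext) if at <= g <= rt]   # not hit by DTS.2
    fill = rng.sample(undecided, L - len(accepted))                     # DTS.3
    return sorted(accepted + fill)


if __name__ == "__main__":
    gamma_prev = [0.0, 1.0, 2.0, 3.0]
    gamma_ext = [0.0, 0.5, 1.0, 5.0, 2.0, 2.5, 3.0, 9.0]
    print(dts_prune(gamma_prev, gamma_ext, rng=random.Random(0)))
```

In the usage example an exact sort would keep indices 0, 1, 2, and 4, whereas this sketch always keeps 0, 1, 2 and then one of 4, 5, or 6, which is the kind of approximation the scheme trades for comparisons against fixed thresholds.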
Compared with the exact-sorting method, the performance of the DTS is potentially degraded due to DTS.3. As shown in Fig. 5(a), some of the larger ones among the $L$ smallest path metrics may not be chosen by DTS.3, and this happens when the number of path metrics accepted by DTS.1 and the number excluded by DTS.2 are both fewer than $L$. Therefore, to improve the performance of the DTS, a larger $AT$ or a smaller $RT$ can be used. If the $AT$ is increased, it is possible that more than $L$ path metrics are accepted by DTS.1. Also, as will be discussed in the next section, in order to reduce the number of comparisons, our proposed architecture does not explicitly generate the $AT$ value for comparison. Hence, in this work, a smaller $RT$, e.g.,
, is used to improve the performance. As indicated in Fig. 5(b) , a smaller RT excludes more path metrics, and hence the path metric chosen by DTS.3 is more likely to be one of the L smallest metrics. On the other hand, with a smaller RT , it is possible that more than L path metrics will be excluded by DTS.2. As shown in Fig. 5(c) , this results in a list size smaller than L. Hence, if the RT is reduced by too much, the performance of the LSCD will also be degraded.
In the next section, we propose an architecture that can use a smaller RT value while guaranteeing to generate a list with size L. The overall procedure of the proposed low-latency LSCD based on SE and DTS is summarized in Algorithm 2. Lines 8-14 execute the SE method discussed in Section III and Lines 15-18 describe the DTS. From the hardware implementation perspective, since now we only need to compare the 2L input path metric values with fixed threshold values, the DTS can be executed in parallel, without a large increase in computation complexity. Therefore, the logic delay is much smaller than that of the exact sorting method and the overall latency of the LPO is reduced. In the next section, a VLSI architecture implementing Algorithm 2 will be discussed in detail.
V. LOW-LATENCY LSCD ARCHITECTURE
The top-level architecture of the proposed LSCD is shown in Fig. 6 . It mainly consists of five modules: the SCD module, the state memory module, the LM module, the CRC check unit, and the control unit. The SCD module is composed of L independent semi-parallel SCDs, each using M (M < N/2) processing elements (PEs) for the f and g function evaluation [20] , [25] . The CRC check unit contains L bit-serial units computing the CRC check of each decoding path. As shown in [47] , the latency of the CRC check unit is masked by that of the LSCD and hence can be neglected. A 2N bit ROM is used to store the flags to indicate whether u i is a frozen bit, a reliable information bit, or an unreliable information bit, and this is used by the control unit to generate the corresponding control signals to each block. In the rest of this section, the state memory module and the LM module are discussed in detail.
A. State Memory Module
Similar to the architecture in [37] , the state memory module is composed of three memories: the LLR memory, storing the intermediate L t j s (0 ≤ j < 2 t , 0 ≤ t ≤ n) of each SCD; the partial-sum memory, storing the partial-sums of each SCD [25] ; and the path memory, storing the L decoding paths.
As discussed in [20] and [25], a semi-parallel SCD with $M = 2^m$ processing elements uses a dual-port SRAM to store the intermediate LLR operands at every decoding stage. It consists of $2(N/M + m)$ words with $MQ$ bits each (i.e., an overall size of $2(N + mM)Q$ bits), where $Q$ is the number of quantization bits for the LLR values. In every cycle, two words are needed for the corresponding $f$ and $g$ node execution, and one word of $M$ LLR values is generated and stored back. $NQ$ bits of memory are used to store the channel input LLRs $L_i^n$ ($0 \le i < N$), and the remaining $(N + 2mM)Q$ bits are used for the intermediate output LLRs $L_j^t$ ($0 \le j < 2^t$, $0 \le t < n$). To support the operation of $L$ parallel SCDs, $L$ SRAMs are needed for the LLR memory. Since the channel inputs $L_i^n$ are the same for all $L$ SCDs, they can be stored in the first SRAM, while the size of the other SRAMs is reduced to $(N + 2mM)Q$ bits each. As a result, the overall size of the LLR memory is $[(L + 1)N + 2LmM]Q$ bits. As shown in [25], $N/2$ bits of partial-sums are stored for the $g$ function evaluation for one SCD. Hence, the size of the partial-sum memory in the LSCD is $LN/2$ bits. The size of the path memory is $LK$ bits, as each of the $L$ decoding paths has $K$ information bits (the values of the $N - K$ frozen bits are pre-known and need not be stored). Since the sizes of the partial-sum memory and the path memory are much smaller than that of the LLR memory, they are implemented using registers and organized into $L$ register blocks of equal size, as shown in Fig. 6.
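The memory sizing above is simple arithmetic, and the sketch below evaluates it for a given configuration. The function name and the example parameter values ($N = 1024$, $K = 512$, $L = 16$, $M = 64$, $Q = 6$) are assumptions chosen for illustration and are not taken from the implementation results of this work.

```python
def state_memory_bits(N, K, L, M, Q):
    """Bit counts of the three state memories, following the expressions above."""
    m = M.bit_length() - 1
    if 1 << m != M:
        raise ValueError("M must be a power of two (M = 2**m)")
    first_sram = 2 * (N + m * M) * Q      # holds channel LLRs plus intermediates
    other_sram = (N + 2 * m * M) * Q      # intermediates only
    llr = first_sram + (L - 1) * other_sram
    assert llr == ((L + 1) * N + 2 * L * m * M) * Q  # closed form from the text
    return {
        "llr": llr,
        "partial_sum": L * N // 2,
        "path": L * K,
    }


if __name__ == "__main__":
    for name, bits in state_memory_bits(1024, 512, 16, 64, 6).items():
        print(f"{name}: {bits} bits")
```

For these example values the LLR memory is more than twenty times larger than either of the other two memories, which matches the rationale for keeping it in SRAM and the rest in registers.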
For LSCD, each SCD expands a decoding path into two when an information bit is decoded. The two paths can both be kept in or excluded from the surviving candidate list. That means an SCD used for the decoding of a path stored in a certain SRAM in this decoding cycle may be assigned to decode another path stored in another SRAM in the next decoding cycle. Therefore, we need to re-align the connection between the state memory and the SCD in each decoding cycle. As shown in Fig. 6, for the partial-sum memory and the path memory, $L \times L$ crossbars are used for moving the data for the alignment. For the LLR memory, since the size is very large and moving the contents has a large timing and power overhead, the lazy copy method, which uses a pointer to manipulate the alignment instead of physically moving the data content, is introduced in [29] and [37]. As shown in Fig. 6, an $L \times L$ crossbar with port width $2MQ$ bits is used to direct the memory contents to the corresponding SCD hardware. The control signals of this crossbar are generated by the pointer memory updated by the LM module, and the details of the updating logic have been presented in [37] and [47]. The size of the pointer memory is $L \times (n - 1) \times \log_2 L$ bits, and the memory is implemented with registers.
Figure 7. The data path of the LM module using the DTS.
B. List Management Module
The LM module implements the LM operation shown in Fig. 3. Fig. 7 shows the data path when the DTS is used for the LPO. It mainly consists of four components: the threshold-tracking architecture (TTA), the PMU block, the DTS block, and the lazy copy (LC) block. Specifically, the PMU block executes the PMU operation in Fig. 3, and the DTS block together with the LC block implements the LPO shown in Fig. 3. The TTA calculates the thresholds to support the operation of the DTS block. As shown in Fig. 7, after decoding $u_{i-1}$, the path metrics of the $L$ surviving decoding paths are the $\gamma^i_l$s, and the PMU block generates the path metrics of the $2L$ extended paths. After this, the DTS block finds the $L$ almost-the-best path metrics $\gamma^{i+1}_l$ ($l = 0, 1, \ldots, L - 1$) and their corresponding decoding paths. Based on the information on path removal and survival, the LC block manipulates the memory contents in the state memory module, and its logic has been discussed in [37] and [47]. Running in parallel with the LC block, the TTA block calculates the values of AT and RT from the surviving $\gamma^{i+1}_l$s, and they will be used by the DTS block for the decoding of the next bit. In the following, the architectures for the PMU, TTA, and DTS blocks are presented in detail.
1) PMU Block in the List Management Module:
The PMU block expands and updates the path metrics based on (18). Its $2L$ outputs $\gamma^{i+1}_l$ ($l = 0, 1, \ldots, 2L - 1$) are divided into two groups: path metrics with an even index (PME), i.e. $\gamma^{i+1}_j$ ($j = 0, 2, \ldots, 2L - 2$), and path metrics with an odd index (PMO), i.e. $\gamma^{i+1}_k$ ($k = 1, 3, \ldots, 2L - 1$). From (18), no extra hardware is required to generate the path metrics in the PME as $\gamma^{i+1}_{2l} = \gamma^i_l$.
2) TTA in the List Management Module:
The TTA is responsible for calculating the acceptance threshold AT and the rejection threshold RT for the DTS to work. The AT and RT values for decoding bit $u_i$ are generated from $\gamma^i_l$ ($l = 0, 1, \ldots, L - 1$), which are the $L$ surviving path metrics at bit $u_{i-1}$, as shown in Fig. 7. The architecture of the TTA is shown in Fig. 8. In addition to the generation of AT and RT, as shown in Fig. 8, the TTA also outputs the partially-sorted $\gamma^i_l$s. The smallest $L/2$ path metrics are on the top and the largest $L/2$ path metrics, which are exactly sorted, are at the bottom. The details of the TTA operations are as follows.
The L input path metrics are evenly divided into two groups. Each group is then sorted by a radix-L/2 sorter [48].
Figure 8. Threshold-tracking architecture.
Therefore, their outputs d [46] , L/2 comparing-and-swapping (C&S) elements take pairs of the output values of the sorters d
. . , L/2 − 1, as their inputs and direct the smaller value to the upper output and the larger value to the lower output. As a result, the outputs of the C&S array are partially sorted, where the top L/2 outputs are guaranteed to be smaller than or equal to the lower L/2 outputs. For an easier implementation of the DTS architecture, the lower L/2 outputs are further exactly sorted by another radix-L/2 sorter. The reason for this will be discussed in the next sub-section. From the discussion in Section IV-B, the first element in the lower L/2 sorted output path metric γ i L/2 in Fig. 8 is AT . In fact, we do not need to know the value of AT . The group of path metrics that satisfies the AT check can be directly obtained from the top L/2 outputs of the TTA. This will be discussed in more detail in the next sub-section. Moreover, RT can be chosen from the lower L/2 sorted output path metrics of the TTA. For example, the RT used in (29) is the last output path metric of the TTA. The TTA requires an exact sorting of L/2 elements. For other LSCD architectures that use exact sorting for list pruning, the input size of the sorter is 2L instead of L/2. So the complexity of the proposed TTA is much smaller. In addition, the TTA is executed in parallel with the execution of f or g nodes and the PMU for the decoding of the next bit, and hence the latency is hidden and no extra cycle is added to the overall latency.
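The TTA data flow described above can be sketched in software. The pairing used by the C&S array is not recoverable from the text here, so this sketch assumes that the k-th smallest output of one sorter is compared with the k-th largest output of the other, which is one pairing that yields the stated guarantee that every top output is no larger than every lower output. The function name `tta` and its return layout are illustrative.

```python
def tta(metrics):
    """Partially sort L path metrics and return (top, low, AT).

    top: the L/2 smallest metrics, in no particular order
    low: the L/2 largest metrics, exactly sorted in ascending order
    AT:  the first element of `low`
    """
    L = len(metrics)
    h = L // 2
    a = sorted(metrics[:h])              # first radix-L/2 sorter
    b = sorted(metrics[h:])              # second radix-L/2 sorter
    top, low = [], []
    for k in range(h):                   # C&S array (assumed pairing)
        x, y = a[k], b[h - 1 - k]
        top.append(min(x, y))            # smaller value to the upper output
        low.append(max(x, y))            # larger value to the lower output
    low.sort()                           # third radix-L/2 sorter
    return top, low, low[0]


top, low, at = tta([7, 3, 9, 1, 4, 8, 2, 6])
print(top, low, at)   # RT would be picked from `low`, e.g. low[-1] as in (29)
```

Only sorters of size L/2 appear in this flow, which is the source of the complexity advantage over exact sorting of 2L candidates.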
3) DTS Block in the List Management Module:
As shown in Fig. 7 , when we decode bit u i , the DTS takes the two groups of path metrics (PME and PMO) output from the PMU block as input. The AT and RT values obtained from the TTA are used as the threshold values for the DTS operation.
As shown in Fig. 6, the path metrics in the PME and PMO are first passed to two permutation networks (PNs), respectively. Since each partially-sorted path metric output of the TTA corresponds to the generation of one path metric element in the PME and another in the PMO, according to (18), the elements in the PME and PMO are permutated based on the sorted order of their parent path metrics in the TTA output. For example, if the orders of the outputs of the TTA are γ …
Figure 9. Architecture of the pruning and copying (PC) block.
… outputs of the TTA are smaller than AT and $\gamma^{i+1}_{2l} = \gamma^i_l$, the first $L/2$ elements in the permutated PME are all smaller than AT. Similarly, as AT is the smallest value among the last $L/2$ outputs of the TTA, the last $L/2$ elements in the permutated PME are all greater than or equal to AT.
After permutation, the elements of the PME and PMO are passed to the pruning and copying (PC) block to determine the L surviving paths. The architecture of the PC is shown in Fig. 9 . From the above discussion, the first L/2 elements in the PME are definitely smaller than AT and hence will be included in the surviving set Ψ. To fill up the remaining L/2 elements in Ψ, as discussed in Section IV, we need to compare the last L/2 elements in the PME and the elements in the PMO with AT and RT . Random inclusion or exclusion has to be done if the number of elements passing the two threshold checks is not exactly equal to L/2. To reduce the number of comparisons and also avoid the random inclusion/exclusion, which will complicate the hardware implementation, we propose a different method to select the remaining L/2 elements in Ψ. We temporarily accept the last L/2 elements in the PME first. We then compare the elements in the PMO with a fixed RT value using L comparators. A flag equal to 1 is generated if the corresponding path metric is not greater than RT . Note that this RT value is smaller than that stated in (29) in order to prune out more paths with larger metric values. All the flags are then added up by an accumulator to decide how many path metrics are not greater than RT . Carry-save adders and adder tree are used to reduce the delay of the accumulator. Let k be the output of the accumulator. Then the largest k elements of the last L/2 elements in the PME are replaced by the k path metrics in the PMO that are not greater than RT . Note that since the last L/2 elements in the PME are exact-sorted in order, we simply pick the last k elements in the set for replacement. If k is larger than L/2, we just take the first L/2 elements in the PMO that pass the RT test to replace the last L/2 elements in the PME in Ψ.
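The selection rule of the PC block can be written as a few lines of code. This is a behavioural sketch under the assumptions stated in the text: the PME and PMO have already been permutated, the first L/2 PME elements pass the AT check, and the last L/2 PME elements are in ascending order. The function name and the `("PME", index)` labels are illustrative only.

```python
def prune_and_copy(pme, pmo, rt):
    """Return the L survivors as (group, index) pairs.

    pme, pmo: permutated path-metric lists of length L each
    rt:       fixed rejection threshold
    """
    L = len(pme)
    h = L // 2
    flags = [1 if x <= rt else 0 for x in pmo]       # L comparators
    k = min(sum(flags), h)                           # accumulator, capped at L/2
    # The first k PMO elements passing the RT test replace the last
    # (largest) k elements of the PME.
    passing = [("PMO", j) for j, f in enumerate(flags) if f][:k]
    kept = [("PME", j) for j in range(L - k)]
    return kept + passing


print(prune_and_copy([1, 2, 5, 9], [3, 10, 4, 12], rt=6))
print(prune_and_copy([1, 2, 5, 9], [3, 10, 4, 12], rt=0))   # worst case: PME kept
```

Because exactly k PME entries are dropped for k PMO entries added, the list always holds L paths regardless of how small RT is chosen.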
The DTS architecture presented in Fig. 9 has two advantages over the DTS operation discussed in Section IV-B. Firstly, a much smaller RT can be used to exclude more paths with large metric values. Even when a smaller RT is used, we can still guarantee at any time that the candidate list of the LSCD has $L$ decoding paths. In the worst case, when all the path metrics in the PMO are greater than RT, we will keep the last $L/2$ elements in the PME in the surviving path list. Secondly, since the last $L/2$ elements in the PME are already sorted by the TTA, we always replace the worst elements in the PME. This is better than randomly selecting a path to replace, as the probability that the last few elements of the PME belong to the actual surviving path set is low. As a result, the error-correcting performance of the DTS is improved by using the architecture shown in Fig. 9, and we denote this as DTS-Advance.
Fig. 10(a) shows the timing diagram of decoding $u_0$ and $u_1$ in the scheduling tree of Fig. 4 using a single SCD. When LSCD is used, additional cycles are required for the path metric updating and list pruning. Fig. 10(b) shows the timing diagram of decoding $u_0$ and $u_1$ with the proposed LSCD architecture, where the detailed timing of the list management (LM) component is also shown. Specifically, $\gamma^i_l$ in the PMU and DTS denotes the generation of $2L$ path metrics output from $L$ input path metrics in the PMU block and finding the $L$ surviving path metrics from the $2L$ path metric candidates in the DTS block, respectively. Compared with the architecture presented in [52], the processing element data path is optimized and the PMU block is executed in the same clock cycle as the leaf f/g node execution of the SCD operation. Moreover, the LPO, implemented by the DTS and the lazy copy (LC) blocks, is done in the same clock cycle.
Due to the data dependency, the TTA operation for finding the threshold values for the next bit is executed when the DTS for the current bit is finished and it is hidden in the cycle where the leaf f /g nodes are executed. As a result, by using the DTS for the LPO, only one additional cycle is introduced for each LM operation.
C. Decoding Latency of the Proposed LSCD Architecture
From [20], the decoding latency (i.e., the time to traverse the scheduling tree) of a semi-parallel SCD using $M$ PEs is equal to $2N + \frac{N}{M} \log_2 \frac{N}{4M}$ clock cycles. Hence, the overall latency of the LSCD architecture is
As discussed in Section III, when the SE method is used, if u i is a reliable bit, i.e. i ∈ A r , the operation of the PMU and the LPO after the decoding of bit u i are not required. Moreover, the LPO for the frozen bit is not executed either. Hence, the latency in (30) can be reduced. The latency is further reduced by considering two source bits at a time. A source-bit couple is defined as (u 2i , u 2i+1 ), with i ∈ {0, 1, . . . , N/2 − 1}. Based on the types of bits of u 2i and u 2i+1 , the source-bit couples can be categorized into six cases, which are summarized in Table I , where a f and a r denote the number of frozen bits and reliable bits in a sourcebit couple, respectively. Without loss of generality, we use the couple (u 0 , u 1 ) and its decoding timing diagram in Fig. 10 for illustration in the following discussion.
1) Case I: Both u 0 and u 1 are reliable information bits. Hence, the LM operation after decoding each bit is saved. Moreover, since the PMU operation is not needed, the output LLRs Λ 0 and Λ 1 are not needed, and hence the leaf nodes of the scheduling tree, f Fig. 4 . Based on the above discussion, the operations in cycles 0 to 3 of Fig. 10(b) are saved for Case I. Moreover, as part of the LM operation, the TTA in cycle 4 is also not needed. As a result, four clock cycles are saved for the Case I source-bit couple.
2) Case II: Bit u 0 is a frozen bit and u 1 is a reliable information bit. The LPOs for both bits and the PMU operation for bit u 1 are not executed. However, the PMU for the frozen bit u 0 still has to be executed, and it can be combined with the SCD operation as follows:
where $l = 0, 1, \ldots, L - 1$. Similar to Case I, L
3) Case III: $u_0$ is an unreliable information bit and $u_1$ is a reliable bit. In this case, the operations of the PMU, the LPO, and the TTA after decoding $u_1$ are not needed. Hence, one clock cycle (i.e., cycle 3 in Fig. 10(b)) is saved.
4) Case IV: Both u 0 and u 1 are frozen bits. The LPOs for both bits are saved, and the PMU operations of the two bits are combined and simplified as [52] 
where l = 0, 1, . . . , L − 1, and L (32) . This PMU operation is retimed and it is executed in the same cycle with f 1 0 . Hence, similar to Case II, four clock cycles are saved.
5) Case V: u 0 is a frozen bit and u 1 is an unreliable information bit. This case is different from Case II, because the LM operation is needed for u 1 . Hence, only the LPO for u 0 can be eliminated and one cycle is saved.
6) Case VI: Both $u_0$ and $u_1$ are unreliable information bits. Fig. 10(b) depicts the timing of this case, and no latency reduction is achieved.
Table II summarizes the latency reduction achieved by different source-bit couple cases. As a result, the decoding latency of the proposed LSCD architecture is given as (33), where $N_\alpha$ denotes the number of source-bit couples for Case $\alpha$ found in the polar codes. These values depend on the frozen set $A^c$ and the reliable set $A_r$. To achieve the timing specified in (33), the PMU block shown in Figs. 6 and 7 has to support the operation of (31) and (32), and it is easily achieved with additional comparators and adders.
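The bookkeeping behind the six cases can be sketched as follows. The per-case savings are the cycle counts stated in the case descriptions above (four cycles for Cases I, II, and IV, one cycle for Cases III and V, none for Case VI). The single-letter encoding of bit types ("F" frozen, "R" reliable information, "U" unreliable information) and all function names are assumptions for illustration, and couples outside the six listed cases are rejected rather than guessed at.

```python
CASE_OF = {
    ("R", "R"): "I", ("F", "R"): "II", ("U", "R"): "III",
    ("F", "F"): "IV", ("F", "U"): "V", ("U", "U"): "VI",
}
CYCLES_SAVED = {"I": 4, "II": 4, "III": 1, "IV": 4, "V": 1, "VI": 0}


def scd_latency(N, M):
    """Semi-parallel SCD latency from [20]: 2N + (N/M) log2(N/(4M)) cycles."""
    return 2 * N + (N // M) * ((N // (4 * M)).bit_length() - 1)


def cycles_saved(bit_types):
    """Sum the savings over all source-bit couples (u_2i, u_2i+1)."""
    total = 0
    for i in range(0, len(bit_types), 2):
        couple = (bit_types[i], bit_types[i + 1])
        if couple not in CASE_OF:
            raise ValueError(f"couple {couple} is not one of the six cases")
        total += CYCLES_SAVED[CASE_OF[couple]]
    return total


print(scd_latency(1024, 64))            # 2080
print(cycles_saved("FFFRFUURUURR"))     # 14
```

Subtracting the accumulated savings from the baseline LSCD latency of (30) would then give a cycle count of the kind expressed by (33).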
VI. EXPERIMENTAL RESULTS
In this section, to demonstrate the error-correcting performances of the proposed SE method and DTS algorithm, an (N, R, r) = (1024, 1/2, 16) polar code is simulated over a binary-input AWGN channel. Then, we present the implementation results of the proposed LSCD architecture and compare them with those of other existing works.
… in (23) and (24), is shown. The BLERs of the proposed SE method with different sizes of the reliable set $A_r$ are also shown. The size of $A_r$ depends on the tolerable performance degradation parameter ε. In the simulation, we use different ε values, ranging from 0.3 to 9 at $E_b/N_0$ = 2.25 dB.
From Fig. 11, it can be seen that, for each given ε, the degradation in BLER of the LSCD using the SE method is close to the upper bound predicted by (23) and (24). This indicates that the performance analysis in (24) estimates well the performance degradation introduced by the SE method for a given reliable set $A_r$. To investigate the relationship between the latency reduction and the performance degradation of the SE method, Table III summarizes the cardinality of $A_r$ for different values of ε. Moreover, based on $A^c$ and the corresponding $A_r$, Table IV presents the number of different source-bit couples for each ε value. Assuming that the LSCD architecture proposed in Section V is used and each SCD uses M = 64 PEs, the last row of Table III compares the decoding latency ($D_{\mathrm{LSCD}}$) for different values of ε, based on (33). From Table III, we can see that for ε = 0.3, more than 72% of the information bits are included in set $A_r$ and hence more than 72% of the LPOs are saved by the corresponding LSCD with SE. From Fig. 11, it is also shown that the performance degradation introduced by the SE method with ε = 0.3 is negligible compared with that of the conventional LSCD. If a larger ε is used, Table III shows that $|A_r|$ is only slightly increased, while the performance of the corresponding LSCD is degraded significantly, as shown in Fig. 11. For example, when ε = 9, the decoding latency is only reduced by 9% compared with that of ε = 0.3. Therefore, ε = 0.3 is used in the SE method for our low-latency LSCD implementation.
To verify the effectiveness of the method proposed in Section III in finding set $A_r$, we randomly choose 72.35% of the information bits in $A$ to compose set $A_r$. Fig. 11 shows its BLER using the SE method. The performance is greatly degraded compared with that obtained using the $A_r$ generated by our proposed method.
B. Error-correcting Performance of the DTS
Next, the error-correcting performance of LSCD using the DTS to replace exact sorting in the LPO is investigated. Simulations for the polar code used in the previous sub-section are carried out. Fig. 12 shows the BLERs of different LSCDs, including those using the DTS discussed in Section IV and the DTS-Advance discussed in Section V. Comparisons of the BLERs of the DTS using different RT values are also shown. Compared with the LSCD using the exact sorting method, when $\gamma^i_{L-1}$ is used as RT, as stated in (29), the LSCD using the DTS introduces an SNR penalty of around 0.2 dB when the BLER is $10^{-4}$. For the DTS-Advance discussed in Section V-B, the SNR loss is only around 0.1 dB. Moreover, when a smaller RT value is used, such as $\gamma^i_{11}$ shown in Fig. 8, the performance degradation of the DTS-Advance is negligible. However, when the same RT is used for the DTS, a performance loss of around 0.1 dB is recorded. This is because fewer decoding paths are chosen by the DTS and the candidate list is not full for most of the time. As a result, the DTS-Advance with RT = $\gamma^i_{11}$ is used for a low-latency LPO in our LSCD implementation.
C. Implementation Results of the Low-latency LSCD
The LSCD architecture proposed in Fig. 6 is designed and implemented for an (N, R, r) = (1024, 1/2, 16) polar code with list size L = 16. M = 64 PEs are used for each SCD. From the simulation results, the SE method with ε = 0.3 and the DTS-Advance with RT = $\gamma^i_{11}$ introduce negligible degradation in the error-correcting performance, and hence they are used for the hardware implementation. Fig. 13 compares our implementation's error-correcting performance with those of the conventional LSCD with different list sizes. It can be seen that our LSCD architecture has a very similar BLER performance to the conventional LSCD. As a reference, the performances of SCD and an (N, R) = (1152, 1/2) LDPC code used in the WiMAX standard [53] are also shown in Fig. 13. Here, 40 iterations are used for the LDPC decoding. It can be seen that polar codes have better performance when LSCD with a larger list size L is used. When LSCD with L = 16 is used, the BLER performance of polar codes is comparable to that of the LDPC code.
The design is synthesized with a UMC 90 nm CMOS process, using Synopsys Design Compiler. For a fair comparison, the quantization scheme in [47] is used, i.e., the LLR and the path metric are represented in 6 bits and 8 bits, respectively. Table V summarizes the synthesis results and compares them with those of the existing architectures. Compared with the state-of-the-art architectures, our proposed LSCD architecture supports a much larger list size, which results in an error-correcting performance comparable with that of other advanced error-correcting codes. Moreover, from Table III, the proposed LSCD architecture requires 1462 clock cycles to decode one codeword, and hence it achieves a decoding throughput of 460 Mbps at a clock frequency of 658 MHz. Compared with [46] and [47], both the decoding throughput and the list size are doubled. The chip area presented in Table V is mainly due to the state memory module. The SCD module only occupies 0.53 mm$^2$ and the area of the LM module is smaller than 0.1 mm$^2$.
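The quoted throughput can be reproduced from the cycle count and the clock frequency. This short check assumes that throughput is counted in coded bits, i.e. N = 1024 bits per codeword, which is the reading that matches the reported 460 Mbps; the function name is illustrative.

```python
def throughput_mbps(bits_per_codeword, cycles_per_codeword, clock_mhz):
    """Bits delivered per microsecond, i.e. Mbps, for one codeword per pass."""
    return bits_per_codeword * clock_mhz / cycles_per_codeword


# N = 1024 coded bits, 1462 cycles per codeword, 658 MHz clock.
rate = throughput_mbps(1024, 1462, 658)
print(round(rate, 1))
```

Counting only the K information bits instead would scale the figure by the code rate.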
VII. CONCLUSION
In this work, a low-latency LSCD architecture is presented, which is optimized at the system, algorithmic, and architectural levels. At the system level, a selective expansion method is proposed such that the amount of LM operations and the associated latency of the reliable information bits are reduced. At the algorithmic level, a double thresholding scheme is proposed as an approximate sorting method for the list pruning operation and its logic delay is greatly reduced for a large list size. Finally, an optimized VLSI architecture for the LM operation is presented. Experimental results show that both the decoding throughput and the list size are doubled when compared with the state-of-the-art architectures.
