Abstract-As the first kind of forward error correction (FEC) codes that achieve channel capacity, polar codes have attracted much research interest recently. Compared with other popular FEC codes, polar codes decoded by list successive cancellation decoding (LSCD) with a large list size have better error correction performance. However, due to the serial decoding nature of LSCD and the high complexity of list management (LM), the decoding latency is high, which limits the usage of polar codes in practical applications that require low latency and high throughput. In this work, we study the high-throughput implementation of LSCD with a large list size. Specifically, at the algorithmic level, to achieve a low decoding latency with moderate hardware complexity, two decoding schemes, a multi-bit double thresholding scheme and a partial G-node look-ahead scheme, are proposed. Then, a high-throughput VLSI architecture implementing the proposed algorithms is developed with optimizations on different computation modules. From the implementation results on UMC 90 nm CMOS technology, the proposed architecture achieves decoding throughputs of 1.103 Gbps, 977 Mbps and 827 Mbps when the list sizes are 8, 16 and 32, respectively.
I. INTRODUCTION
Polar codes are the first kind of forward error correction (FEC) codes that provably achieve channel capacity [1], [2]. Their regular and low-complexity decoding algorithm is hardware-friendly and hence polar codes have attracted much research interest recently.
Successive cancellation decoding (SCD) [3] - [10] and belief propagation decoding (BPD) [11] - [13] are the two main kinds of decoding schemes for polar codes. The complexity of SCD is low but the decoding is sequential in nature, and therefore it is a challenge to achieve a low latency and a high throughput. Recently, research work has been done to improve the latency of SCD [7] - [10] . BPD uses message passing among the nodes of the factor graph of the polar codes to carry out decoding. It is parallel in nature and consequently has a high decoding throughput. However, its hardware cost is large and more importantly, its error correction performance is not as good as that of SCD. In this work, we mainly focus on SCD-based decoding schemes of polar codes.
When the error correction performance of SCD on polar codes is compared with that of state-of-the-art FEC codes, such as low-density parity-check (LDPC) codes [14] or turbo codes [15], polar codes with a short code length are inferior. One method to improve the error correction performance is to use a long code length, because the channel polarization phenomenon increases the ratio of almost-lossless channels for long-length codes [1], [4], [10]. However, the decoding latency, which is related to the code length, is then high, and so are the computation and memory overheads [4], [6].
Another method is using list successive cancellation decoding (LSCD) [16] , [17] in which L SCDs are executed in parallel for decoding one codeword. The L best decoding paths are kept during the decoding and finally the path that satisfies the cyclic redundancy check (CRC) [17] - [19] is selected as the output. From the results shown in [17] - [20] , with a large list size (L ≥ 16), CRC-concatenated polar codes using LSCD out-perform other state-of-the-art FEC codes. However, LSCD with a large list size has significant latency and complexity overhead on hardware. Thus, special algorithmic and architectural optimization techniques are required to reduce the latency and complexity, particularly for latency-sensitive applications such as the next generation communication systems [21] .
The first hardware architecture of LSCD was proposed in [22] in which log-likelihood (LL) values are used for computing the decoding messages. In [23] - [26] , log-likelihood ratios (LLRs) are used instead of LLs to reduce the computational circuit complexity and memory usage. Most of the LSCD architectures have two main processing modules. The first is the list management (LM) module, which maintains the list when the expanded list size exceeds L; and the second module consists of L parallel SCD cores that calculate the messages of each path simultaneously. To achieve a low-latency and high-throughput design, both modules need to be optimized.
During the decoding of every bit, L survival paths are expanded to 2L paths and the LM module is responsible for selecting the best L paths to keep. To reduce the decoding latency, two classes of decoding schemes originally proposed for single SCD [9] , [10] , which decode multiple bits at the same time, are adopted for LSCD.
• The first is multi-bit decoding (MBD) [27]-[30], extended from [9]. Here, the number of bits that are decoded together in each MBD, M, is a fixed value. To reduce the complexity of the sorter, by which at most $2^M \cdot L$ numbers need to be sorted, a two-stage sorting strategy was proposed in [30], where a local sorting is done in each path at the first stage and only a few best local paths are sent to the second stage for global sorting and selection.
• The second class of decoding schemes [31]-[37] is based on the concept of fast simplified successive cancellation (fast-SSC) decoding [10]. For simplicity, we call these fast-SSC-based algorithms. Here, a block of code is divided into several kinds of special sub-codes that can be decoded by simplified decoding algorithms. Unlike MBD, the number of bits that can be decoded together is not fixed. In each step, at most 2L paths are expanded, and consequently the complexity of the sorter is restricted.
The complexity of the sorter also increases dramatically with L and hence will incur a large delay overhead when L is large. In [22] and [25], a parallel radix-2L sorter is used to sort and select the best L paths to reduce the logic delay. However, from the results in [34], [38], when L ≥ 8, the logic delay of the sorter becomes critical and dictates the clock rate. Consequently, most of the existing ASIC implementations only present results for LSCD with L ≤ 8 [25]-[36], and these architectures are not suitable for LSCD with a large list size. In our previous work [39], [40], a double thresholding scheme (DTS) was proposed, in which an approximate sorting method is used with the help of two run-time generated threshold values. By doing this, the sorting and hence the selection of the best L paths do not scale with L and the scheme is suitable for LSCD with a large list size. Implementation results show that the architecture of LSCD with L = 16 using the DTS doubled the decoding throughput and the list size when compared with the state-of-the-art architectures at that time. For LSCD with L = 32, only CPU-based [37] and FPGA-based [41] architectures have been proposed and no ASIC architecture has been reported. Further optimization methods that selectively expand the paths at each bit and eliminate unnecessary executions of LM operations [36], [40]-[42] were proposed to reduce the overall latency.
For the SCD cores, optimization methods used for traditional SCD are still applicable to the LSCD architecture. A hardware multiplexing scheme was first proposed in [8] for SCD and later was adapted to LSCD in [43]. With deep pipelining, the LSCD can decode different paths of a frame sequentially instead of in parallel. Theoretically, only one SCD core is needed and the hardware complexity is only about 1/L of that of the traditional LSCD. However, the latency is very long, so this approach is not suitable for latency-sensitive applications. Another family of ideas is pre-computation. In [7], a G-node look-ahead (GLAH) schedule was used, which pre-computes the intermediate LLRs in SCD to reduce the latency of the decoding process, and the corresponding register-based architecture was proposed. In [30], a pre-computation memory-saving (PCMS) scheme was proposed to save the memory for storing the channel LLRs.
In this work, we focus on the high-throughput design of LSCD with a large list size due to its extraordinary error correction performance. Two algorithmic level techniques are proposed to reduce the decoding latency of LSCD and the corresponding hardware architecture is designed, of which the critical path delay is optimized. Specifically, our main contributions are as follows.
• To reduce the latency of the LM module, a method called multi-bit double thresholding scheme (MB-DTS) is proposed, which combines the idea of MBD with DTS and selective expansion [40]. The original DTS is executed bit by bit and the whole scheduling tree needs to be traversed, which results in a large decoding latency. The proposed method executes DTS for all the bits in a sub-tree simultaneously to avoid the full tree traversal. Thus, a much lower latency is achieved while the low-complexity characteristic of DTS is maintained.
• For the SCD module, the idea of partial G-node look-ahead (P-GLAH) is proposed and adopted in a semi-parallel architecture [25], [40]. The original GLAH for a single SCD can save half of the clock cycles for decoding a block of code. However, it uses three times as much memory as the conventional semi-parallel architecture. In this work, we make a careful tradeoff between the decoding latency and the hardware usage of GLAH. The proposed P-GLAH technique has a latency similar to that of GLAH, while its hardware overhead relative to the counterpart in the traditional LSCD architecture is minimized.
• A high-throughput architecture that is suitable for LSCD with a large list size is proposed based on the proposed algorithms. The structures of the programmable processing element and the path metric update block used in the architecture are carefully designed for a short logic delay. A critical path delay optimization scheme is proposed to further reduce the decoding latency. Experimental results show that high-throughput LSCD is achievable even with a list size of 32.
The rest of this paper is organized as follows. In Section II, the background and the existing decoding methods of polar codes will be reviewed. In Section III, the algorithm and the detailed decoding schedule of the proposed MB-DTS will be discussed. In Section IV, the P-GLAH technique for the SCD cores will be presented. In Section V, the high-throughput architecture for LSCD with a large list size will be introduced. Finally, the experimental results of the implementation of LSCDs with large list sizes will be presented in Section VI, and conclusions will be given in Section VII.
II. PRELIMINARIES
A. Construction of Polar Codes
Polar codes are a family of block codes [1]. Let $N = 2^n$ be the code length of polar codes, and let $u^N$ and $x^N$ (each an N-bit vector) be the input source word and the output codeword, respectively. The encoding of polar codes is given by
$$ x^N = u^N \cdot F^{\otimes n}, \tag{1} $$
where $F^{\otimes n}$ is the n-th Kronecker power of $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$.
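The encoding rule can be illustrated with a minimal Python sketch. It assumes the plain Kronecker-power form stated above (no bit-reversal permutation) and uses the usual butterfly recursion instead of building the full matrix; the function name `polar_encode` is ours and is only illustrative.

```python
def polar_encode(u):
    """Return x = u * F^{(x)n} over GF(2), with F = [[1, 0], [1, 1]].

    Assumption: no bit-reversal permutation is applied.
    """
    x = list(u)
    n_bits = len(x)
    assert n_bits > 0 and n_bits & (n_bits - 1) == 0, "length must be a power of two"
    step = 1
    while step < n_bits:
        # Each butterfly adds the lower branch onto the upper branch (mod 2).
        for start in range(0, n_bits, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x
```

For N = 4, the unit vectors are mapped to the rows of $F^{\otimes 2}$, and applying the function twice returns the original word because $F^{\otimes n}$ is its own inverse over GF(2).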
Due to the polarization effect, some of the N bits are more reliable and hence are used to transmit information and are called the information bits, while the other not-so-reliable bits are set to 0 and are called the frozen bits. $A$ and $A^c$ denote the sets of all the indices of the information and frozen bits, respectively. $R = K/N$ is defined as the code rate of polar codes, where K is the cardinality of $A$. If an r-bit CRC code is used, the last r information bits are used to transmit the checksum generated from the other K − r information bits¹.
B. Successive Cancellation Decoding
SCD is the most popular decoding method of polar codes due to its low complexity, which is $O(N \log N)$. The decoding process can be represented by a scheduling tree. The scheduling tree for N = 4 polar codes is shown in Fig. 1 as an example. It has n + 1 stages with descending indices from the root to the leaves. The number of nodes at stage s of the scheduling tree is $2^{n-s}$, while the number of functions executed in each node is $2^s$, where $s \in [0, n]$. In each stage, there are N output LLR operands. We hereby denote the i-th output LLR at stage s as $L^s_i$, where $i \in [0, N-1]$.
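At the root node of the scheduling tree, N channel LLRs are input to the tree and are expressed as
$$ L^n_i = \ln \frac{\Pr(y_i \mid x_i = 0)}{\Pr(y_i \mid x_i = 1)}, \tag{2} $$
where y is the channel output corresponding to x. At the lowest stage, $\Lambda_i = L^0_i$ is the LLR corresponding to bit $u_i$. For $i \in A^c$, the decoded value of $u_i$, denoted by $\hat{u}_i$, is always equal to 0; while for $i \in A$, $\hat{u}_i$ is evaluated according to
$$ \hat{u}_i = \begin{cases} 0, & \Lambda_i \ge 0, \\ 1, & \text{otherwise}. \end{cases} \tag{3} $$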
1 It is noted that the r parity check bits do not truly transmit information and the code rate is correspondingly modified to (K − r)/N. However, as they are treated in the same way as information bits from the perspective of the polar code decoder, we still regard them as information bits in this paper.
To calculate the $\Lambda_i$ values, LLR calculations are executed in each node of the scheduling tree. Specifically, there are two kinds of nodes: F-nodes and G-nodes, which are represented by the white and black circles in Fig. 1, respectively. During the decoding process, the following calculations are executed in the F- and G-nodes, respectively:
$$ L_F(L_a, L_b) = 2 \tanh^{-1}\left( \tanh\frac{L_a}{2} \cdot \tanh\frac{L_b}{2} \right), \tag{4} $$
$$ L_G(L_a, L_b, \widehat{ps}) = (-1)^{\widehat{ps}} L_a + L_b, \tag{5} $$
where $L_a$ and $L_b$ are the two input LLRs to the node and $\widehat{ps}$ is the partial-sum. The partial-sums at stage s are obtained by
where j ∈ [0, 2 n−s−1 ]. It can be seen that the inputps value for a particular G-node depends on the values of the bits that are already decoded. Therefore, the G-node cannot be computed until all the corresponding leaf nodes have been visited. Thus, the decoding process of SCD can be represented by a depth-first traversal of the scheduling tree.
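The depth-first schedule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the F-node uses the common min-sum approximation, the code follows the non-bit-reversed encoding $x = u \cdot F^{\otimes n}$, and the names `f_minsum`, `g_node` and `sc_decode` are hypothetical.

```python
def f_minsum(la, lb):
    """Min-sum approximation of the F-node."""
    sign = -1.0 if (la < 0) != (lb < 0) else 1.0
    return sign * min(abs(la), abs(lb))


def g_node(la, lb, ps):
    """G-node: (-1)^ps * la + lb, where ps is the partial-sum bit."""
    return lb - la if ps else lb + la


def sc_decode(llrs, frozen):
    """Depth-first SCD; returns (decoded bits, partial sums of this sub-tree)."""
    n = len(llrs)
    if n == 1:
        # Leaf: frozen bits are 0, information bits follow the LLR sign.
        bit = 0 if frozen[0] or llrs[0] >= 0 else 1
        return [bit], [bit]
    half = n // 2
    # F-nodes first (left child), which need no partial-sums.
    left_llr = [f_minsum(llrs[j], llrs[j + half]) for j in range(half)]
    u_left, x_left = sc_decode(left_llr, frozen[:half])
    # G-nodes can only run once the left sub-tree has produced its partial-sums.
    right_llr = [g_node(llrs[j], llrs[j + half], x_left[j]) for j in range(half)]
    u_right, x_right = sc_decode(right_llr, frozen[half:])
    x = [a ^ b for a, b in zip(x_left, x_right)] + x_right
    return u_left + u_right, x
```

For example, with N = 4, the first bit frozen, and noiseless LLRs `[2.0, 2.0, -2.0, -2.0]` for the codeword `[0, 0, 1, 1]`, the sketch recovers the source word `[0, 1, 0, 1]`.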
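To simplify the hardware implementation, usually (4) is approximated using a min-sum calculation which is given by
$$ L_F(L_a, L_b) \approx \operatorname{sign}(L_a) \cdot \operatorname{sign}(L_b) \cdot \min(|L_a|, |L_b|). \tag{7} $$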
C. List Successive Cancellation Decoding
To improve the error correction performance of SCD, list successive cancellation decoding was proposed [16] , [17] . LSCD keeps L paths of decoded bits during decoding, where each path is denoted asû
and i is the index of the bit just decoded. To decode L paths in parallel, L copies of SCD are used.
At the beginning of decoding, only one path is valid. When the decoding process reaches a leaf node corresponding to an information bit u i (i ∈ A), a path is expanded to two by keeping both possible values of the bit, and the number of valid paths in the list doubles. This number increases exponentially with respect to the number of decoded information bits, and the list becomes full after log L information bits are decoded. In the subsequent decoding, an LM operation is required to keep the list size to L based on the path metrics (PMs) of all the paths. Letû to the current PM value and k is set as 2l + 1. For i ∈ A, both of the expressions in (10) will be computed. After that, list pruning will be executed, where the PMs of the 2L expanded paths will be sorted and the L paths with the smaller PMs will be kept in the list. For i ∈ A c ,û i ≡ 0, hence only one of (10) will be computed and list pruning is not needed as the list size is still equal to L.
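The path expansion and list pruning for one information bit can be sketched as follows. This is a minimal illustration that assumes the usual LLR-based penalty rule from the LSCD literature (a path that disagrees with the hard decision of its leaf LLR is penalized by the LLR magnitude) and a full sort for pruning; the function name and the data layout are hypothetical.

```python
def expand_and_prune(paths, llrs, list_size):
    """Expand every path on one information bit and keep the best list_size paths.

    paths: list of (pm, bits) tuples, one per surviving path.
    llrs:  llrs[l] is the leaf LLR of path l for the bit being decoded.
    Assumption: a smaller PM is better; the penalty is |LLR| on disagreement.
    """
    expanded = []
    for (pm, bits), llr in zip(paths, llrs):
        hard = 0 if llr >= 0 else 1
        for bit in (0, 1):
            penalty = 0.0 if bit == hard else abs(llr)
            expanded.append((pm + penalty, bits + [bit]))
    # Conventional list pruning: sort the 2L PMs and keep the L smallest.
    expanded.sort(key=lambda p: p[0])
    return expanded[:list_size]
```

The sort over 2L candidates in the last step is the operation whose cost grows quickly with L in hardware.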
Usually, LSCD with a larger list size has a better error correction performance. Figure 2 shows the simulation results of LSCDs with different list sizes for the CRC-concatenated polar code of (N, K, r) = (1024, 512, 24) under an additive white Gaussian noise (AWGN) channel³. As a reference, the error correction performances of rate-1/2 LDPC codes in the IEEE 802.16e standard [44] and turbo codes in the LTE standard [20] with different code lengths are also presented. It can be seen that the block error rates (BLERs) of LSCDs with L = 16 and 32 out-perform those of LDPC codes with similar code lengths. However, as discussed in Section I, the computational load of LM and hence the decoding latency of LSCD are greatly increased when a large list size is used. In the next sub-section, we review some existing low-latency decoding algorithms for LSCD.
D. Review of Some of the Existing Low-latency Decoding Algorithms for LSCD
1) Double Thresholding Scheme: First, we briefly review the basic idea of DTS proposed in [40]. As mentioned in Section II-C, the number of expanded paths is 2L after an information bit is decoded. For list pruning, a 2L-to-L sorter is needed to sort all the PMs and select the L paths with the smallest PMs. The hardware complexity of a parallel sorter is $O(L^2)$ and becomes very large when a large list size L is used. The critical path delay is also increased due to the high complexity and thus limits the throughput of the decoder.
3 CRC-24-Radix-64 with generator polynomial equal to 0x1864cfb.
To remove the 2L-to-L sorter, DTS was proposed for list pruning. Suppose that the L PM values of the original paths, $\gamma^l_i$ with $l \in [0, L-1]$, are already sorted in ascending order. Then two threshold values, an acceptance threshold AT and a rejection threshold RT, can be obtained as follows:
$$ AT = \gamma^{L/2}_i, \qquad RT = \gamma^{L-1}_i. \tag{11} $$
It is proved in [40] that any path that has a PM value smaller than AT is a survival path after pruning and no survival path has a PM value larger than RT. Using these two thresholds, the L survival PMs $\gamma^l_{i+1}$ are selected from the 2L expanded PMs $\gamma'^k_{i+1}$ based on the following criteria without the need of sorting the expanded PMs:
• If $\gamma'^k_{i+1} < AT$, the corresponding path is kept;
• If $\gamma'^k_{i+1} > RT$, the corresponding path is pruned;
• The paths with $AT < \gamma'^k_{i+1} < RT$ are randomly chosen such that the list is filled with L paths.
From the simulation results in [40] , the DTS has negligible performance degradation, while only O(L) comparisons with two threshold values are needed to select the path without using a sorting operation. This results in low logic delay and allows the DTS to be finished in one cycle. It is noted that to extract the two thresholds for the next DTS, the survival PMs according to the above criteria need to be sorted. However, this sorting is performed on fewer PMs than that in traditional LM, so it can be executed in parallel with the SCD core computation [40] and finished before the next DTS operation. Thus the decoding latency will not be affected.
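The selection rule can be sketched in Python as below. Two assumptions are made for this illustration and should not be read as the exact scheme of [40]: the thresholds are taken as the parent PM at zero-based sorted index L/2 (AT) and the largest parent PM (RT), and the middle band is filled in arrival order rather than randomly so that the result is reproducible. The function name `dts_prune` is hypothetical.

```python
def dts_prune(sorted_pms, expanded):
    """Approximate 2L-to-L pruning with two thresholds instead of a sorter.

    sorted_pms: ascending PMs of the L parent paths (zero-based indices).
    expanded:   list of (pm, tag) tuples for the 2L expanded paths.
    """
    list_size = len(sorted_pms)
    at = sorted_pms[list_size // 2]   # assumed acceptance threshold
    rt = sorted_pms[list_size - 1]    # assumed rejection threshold
    # Paths below AT are always kept; each needs one comparison only.
    kept = [p for p in expanded if p[0] < at]
    # Fill the remaining slots from the band between AT and RT.
    for p in expanded:
        if len(kept) == list_size:
            break
        if at <= p[0] <= rt:
            kept.append(p)
    return kept, at, rt
```

Every expanded path is compared against two constants only, so the work per path does not depend on L. Paths above RT are never selected, and the list can always be filled because each parent contributes one child whose PM does not exceed RT.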
2) Selective Expansion: Suppose all the bits in a codeword are sorted according to their reliability with respect to the bit error rate (BER). The construction of the polar codes uses the K bits with high reliability as the information bits, and the rest as the frozen bits. However, the reliability of these K information bits is not the same and the reliability of some information bits is higher than that of the others. Based on this property, a low-latency LM scheme called selective expansion is proposed in [40]. With selective expansion, the information set $A$ is further divided into two sub-sets: a reliable set, denoted as $A_r$, and an unreliable set, denoted as $A_u$. The bits in $A_u$ are those with relatively low reliability among all of the information bits. The path expansion and PMU for these bits are the same as those described in (8) and (10). However, for the bits in $A_r$, which have higher reliability, we only keep the path with the extended bit that matches the hard decision result of the decoding, and hence the path expansion is simplified as
$$ \hat{u}^l_i = \begin{cases} 0, & (\Lambda_i)^l \ge 0, \\ 1, & \text{otherwise}, \end{cases} \tag{12} $$
and the PMs are not updated, i.e.,
$$ \gamma^l_{i+1} = \gamma^l_i. \tag{13} $$
Effectively, path expansion and PMU can be omitted. Thus, the complexity of decoding the reliable bits in LSCD is similar to that of the frozen bits, as summarized in Table I. It is demonstrated in [40] that a carefully chosen $A_r$ can bring a great reduction in the number of LM operations executed, while the error correction performance degradation is negligible.
3) Multi-bit Decoding and Fast-SSC-based Methods:
To reduce the decoding latency, multiple bits in a block of polar code can be decoded at the same time. MBD and fast-SSC-based methods are the two most popular classes of decoding algorithms in the literature that are based on this idea.
In MBD [27]-[30], suppose that M bits are decoded simultaneously, so that the list is only updated once during this process. The corresponding leaf nodes of the M bits in the scheduling tree have the same root node at stage m, where $m = \log_2 M$. Considering the worst case where all the M bits are information bits, $2^M$ paths are expanded from each of the L existing paths and the PM of one of these $2^M$ paths is updated as
where 
, which is actually encoded from the M evaluated bits rooted at this sub-tree and is given by
and
is a sub-vector of the decoded bits of the expanded path, which is given bŷ
Since we have L original paths, $L \cdot 2^M$ expanded paths are generated, and (14) is executed $L \cdot 2^M$ times correspondingly. To keep the list size as L, list pruning is executed in the same way as that of the traditional LSCD.
If an M-bit pattern contains $M_{frz}$ frozen bits, the number of expanded paths will be reduced to $2^{M_{inf}}$, where
$$ M_{inf} = M - M_{frz}. $$
For such a pattern, the computational load of (14), (15) and (16) can be reduced. However, the hardware has to cater for the worst case and hence an $L \cdot 2^M$-to-$L$ sorter is required to execute list pruning, which leads to a much higher hardware complexity than traditional LM. This is the main drawback of MBD for LSCD.
Different from MBD, fast-SSC-based methods [31] - [37] divide a block of polar code into four kinds of special sub-codes: rate-0 code, repetition code (the last bit is an information bit and all the others are frozen bits), single parity check (SPC) code (the first bit is a frozen bit and all the others are information bits), and rate-1 code, with variable length. A simplified LM algorithm for each kind of sub-code was proposed. For rate-0 code and repetition code, only one or two paths are expanded from each survival path, so the best L paths can still be picked by a 2L-input sorter. The corresponding PMU algorithms are given in [31] , [32] . For SPC code and rate-1 code, mathematical proof has been given in [33] , [34] that the number of path expansions for which no performance degradation is incurred is min(M, L − 1), which means fewer path expansions are needed when the list size is smaller than the actual size of an SPC or a rate-1 code.
III. MULTI-BIT DOUBLE THRESHOLDING SCHEME
To reduce the decoding latency of LSCD with a large list size, we propose to combine double thresholding scheme, selective expansion and multi-bit decoding together. However, the original DTS requires bit-wise expansions and each leaf node in the scheduling tree has to be traversed. To apply MBD, special consideration has to be made. We first show a method which uses DTS for a multi-bit tuple instead of a single bit. Then we will present a low-latency LSCD algorithm called multi-bit double thresholding scheme. The error correction performance and complexity will then be discussed and compared with the traditional DTS.
A. Decoding Multi-bit Pattern with Single Unreliable Bit
The original DTS selects the L best paths out of the 2L expanded paths when one information bit is expanded. With selective expansion, we only need to consider the unreliable bits. If we want to use DTS for decoding two unreliable bits, the number of expanded paths will be 4L. In this case, γ L/4 i will be used as AT 4 , meaning that only L/4 PMs are guaranteed to be better than AT in this condition. The other 3L/4 paths will be selected randomly. Intuitively, the lower bound of the performance of this method is that of an LSCD with an equivalent list size of L/4. If we consider more unreliable bits in an MBD, the error correction performance will be further degraded. In this sub-section, we will propose a method to extend the original DTS to solve this issue.
We first consider the decoding of a T-bit tuple whose bits are the leaf nodes of a sub-tree of the scheduling tree and the tuple contains exactly one unreliable bit. We call this a single-unreliable-bit tuple (SUBT). Later we will extend the idea to tuples that have different combinations of bits. For a SUBT, other than the unreliable bit, each bit is either a frozen bit or a reliable bit. As only the unreliable bit needs to be expanded, there are only two expanded paths for each SUBT and hence the PMU equations in (14) are modified as
where the L t j s are the LLR inputs at stage t (t = log 2 T ), and α ... ... sub-vectors V at stage t when the only unreliable bit in the SUBT is assumed to be 0 and 1, respectively. It can be seen that non-zero penalties, denoted as ∆ A and ∆ B (the second terms at the right hand side of (17)), are added for the updating of both PMs, and it is different from the PMU of the traditional LSCD in (10) 
where V0 and V1 are two sets which include all the possible combinations of V when the unreliable bit in the SUBT is assumed to be 0 and 1, respectively. Maximum-likelihood detections (MLDs) on V0 and V1 are required to ensure that the selected $A_l$ and $B_l$ add the smallest $\Delta_A$ and $\Delta_B$ to the original PMs. Correspondingly, the T-bit decoded vectors of the two expanded paths can be obtained by $A_l \cdot F^{\otimes t}$ and $B_l \cdot F^{\otimes t}$, respectively. As the path expansion according to (17) and (18) generates 2L expanded paths, DTS can be used for a SUBT. Compared with single-bit decoding using the original DTS and selective expansion, this method does not degrade the error correction performance. This is because the MLD in (18) guarantees that any survival path kept by the DTS for a SUBT will not have a larger PM than the one obtained by the original DTS if they are from the same path before the expansion and have an identical value for the unreliable bit.
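The two MLDs for a SUBT can be illustrated with a brute-force Python sketch. It assumes that the penalty of a candidate vector V is the sum of the LLR magnitudes at the positions where V disagrees with the hard decision of the stage-t LLRs, which is the usual LLR-based penalty; the label strings ('F' frozen, 'R' reliable, 'U' unreliable) and all function names are hypothetical.

```python
from itertools import product


def encode(u):
    """x = u * F^{(x)t} over GF(2), no bit reversal (butterfly form)."""
    x, step = list(u), 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


def subt_expand(llrs, kinds):
    """Exhaustive MLD over V0 and V1 for a tuple with exactly one 'U' bit.

    Returns {unreliable_bit: (penalty, decoded_bits, encoded_vector)}.
    """
    free = [i for i, k in enumerate(kinds) if k == 'R']
    u_pos = kinds.index('U')
    hard = [0 if l >= 0 else 1 for l in llrs]
    best = {}
    for u_bit in (0, 1):
        candidates = []
        for values in product((0, 1), repeat=len(free)):
            u = [0] * len(kinds)          # frozen bits stay 0
            u[u_pos] = u_bit
            for i, v in zip(free, values):
                u[i] = v
            v_vec = encode(u)
            penalty = sum(abs(l) for l, h, b in zip(llrs, hard, v_vec) if b != h)
            candidates.append((penalty, u, v_vec))
        best[u_bit] = min(candidates, key=lambda c: c[0])
    return best
```

The exhaustive search grows as $2^{T_R}$ with the number $T_R$ of reliable bits in the tuple, which shows why the MLD dominates the logic delay of the PMU for large tuples.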
B. Latency Analysis of DTS for a SUBT
For hardware implementation, we need multiple cycles to execute DTS for a SUBT. The block diagram of the datapath is shown in Fig. 3 . Each block represents a step in the DTS for a SUBT and is assumed to take one cycle. First, the PMU is calculated based on (17) and (18) . The inputs are PMs of the existing survival paths, γ i+1 , so θ l i+T and hence the threshold values cannot be computed until the PMU generates the result in the first cycle and sorting is needed in the second cycle to obtain the threshold values. Finally, a DTS operation is executed to select the paths to be kept based on the extracted AT and RT in the last cycle. In summary, the latency of one DTS for a SUBT is three cycles. Of the three operations, the time delay due to the MLDs in the PMU block is the highest and dictates the clock frequency. We will discuss how to solve this problem in Section V.
C. Special Multi-bit Patterns
In some special multi-bit patterns, the three-cycle latency can be reduced because some of the steps are not required as described below. For simplicity, we denote these special patterns as SP1 and SP2, respectively.
A SUBT with no frozen bit (SP1): If a SUBT does not include any frozen bit, all the bits apart from the unreliable bit are reliable and all $2^T$ combinations of V are possible candidates of the encoded tuple in (18), causing a large computational load for the MLDs. According to the properties of polar codes and selective expansion [40], [45], the unreliable bit in this kind of tuple has the lowest reliability and it is always the first bit of such a tuple. Based on this property, we propose the following scheme to decode such a tuple. Specifically, we use the original DTS [40] to decode the first bit. The magnitude of the LLR output at the first bit is computed through a series of F-functions and hence its value $|(L^0_0)^l|$ is equal to the minimum of all the LLR magnitudes at the root of the tuple at stage t, which is denoted as
$$ |(L^t_k)^l| = \min_{j \in [0, T-1]} |(L^t_j)^l|. \tag{19} $$
Hence the penalty values for the two paths expanded from this unreliable bit are 0 and $|(L^t_k)^l|$, respectively. The rest of the reliable bits do not contribute any penalties to the PMs as no path will be expanded from them.
Next we need to find the decoded sub-vector of the reliable bits to fill in the rest of the bits of the two expanded paths for this SUBT. To decode these reliable bits, we can treat each of the two expanded paths as a single-parity-check code [10] in the traditional SCD with the unreliable bit as its parity bit (one path with the parity bit equal to 0 and the other with the parity bit equal to 1). As discussed in [10] , the parity bits of A l and B l are calculated as
respectively, and then A l and B l can be obtained as l |, so the error correction performance is guaranteed to be not worse than the results get by (17) and (18) . The computational complexity is significantly reduced as we do not need to execute MLD to obtain A l and B l . Moreover, as one of the two expanded paths does not have a penalty and keeps its original PM value, the threshold values can be pre-extracted similar to the original DTS and the extra sorting cycle can be hidden into the SCD computation cycle. Thus one cycle is saved as shown in Fig. 4(b) .
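The SP1 shortcut can be sketched in Python as below. The sketch assumes the standard single-parity-check procedure (take the hard decisions, and flip the least reliable position to change the parity), together with the fact that, without bit reversal, the first source bit equals the XOR of all the bits of the encoded tuple; the function name `sp1_expand` is hypothetical.

```python
def sp1_expand(llrs):
    """Two expanded paths of a tuple without frozen bits, no MLD needed.

    Returns {first_bit_value: (penalty, encoded_vector)}.
    """
    hard = [0 if l >= 0 else 1 for l in llrs]
    parity = sum(hard) % 2                    # value of the first (unreliable) bit
    k = min(range(len(llrs)), key=lambda j: abs(llrs[j]))
    flipped = list(hard)
    flipped[k] ^= 1                           # flipping one bit toggles the parity
    return {parity: (0.0, hard), parity ^ 1: (abs(llrs[k]), flipped)}
```

One path keeps its PM unchanged and the other is penalized by the smallest LLR magnitude, in agreement with the penalties 0 and $|(L^t_k)^l|$ stated above.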
A tuple with only frozen bits or only reliable bits (SP2): In these two multi-bit patterns, there will be no path expansions and hence no DTS is needed. According to Table I , for a tuple with only frozen bits, a PMU needs to be executed and the decoded bits are all zero; for a tuple with only reliable bits, the penalty values of all the paths are zero and the decoded
Consequently, two cycles are saved for both patterns, as shown in Fig. 4(c) .
These special patterns have at most only one unreliable bit so they are treated as SUBTs in the following. It is also noted that these special patterns are similar to the four kinds of special codes in the fast-SSC-based algorithms. Specifically, an SP2 with only frozen bits is a rate-0 code. An SP1 and an SP2 with only reliable bits are rate-1 codes. However, the decoding algorithm of rate-1 code still requires a few path expansions, while our method only requires at most one path expansion and thus fewer clock cycles are needed for LM.
D. Multi-bit Double Thresholding Scheme
We have discussed how to apply the DTS for decoding multi-bit patterns. For a block of polar code that contains many unreliable bits, we first present a tuple dividing scheme to separate a block of code into multiple SUBTs. Based on this, an MB-DTS is proposed and its decoding latency and complexity are discussed.
The tuple dividing scheme is summarized in Algorithm 1. The main function "TUPLE DIV" takes a tree as input and returns a set of all the divided SUBTs and its cardinality, N T . It is recursively applied on the scheduling tree to divide it into SUBTs.
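A recursive division of this kind can be sketched in Python as follows. This is only a plausible reading of the scheme, assuming that a sub-tree is accepted as a SUBT as soon as it contains at most one unreliable bit and is otherwise split into its two children; the label strings ('F', 'R', 'U') and the function name `tuple_div` are hypothetical and the actual Algorithm 1 may differ in detail.

```python
def tuple_div(kinds):
    """Split the leaf labels of a sub-tree into tuples with at most one 'U'.

    kinds: string over {'F', 'R', 'U'} whose length is a power of two.
    Returns the list of tuples in decoding order; its length plays the role of N_T.
    """
    if kinds.count('U') <= 1 or len(kinds) == 1:
        return [kinds]
    half = len(kinds) // 2
    # More than one unreliable bit: descend into both child sub-trees.
    return tuple_div(kinds[:half]) + tuple_div(kinds[half:])
```

For instance, `tuple_div("FFFUFURR")` returns `["FFFU", "FURR"]`, so the eight leaves are covered by two SUBTs instead of eight single-bit steps.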
The proposed MB-DTS first divides the whole scheduling tree according to the tuple dividing scheme and then applies DTS for the leaf nodes of the trimmed tree, which correspond to the SUBTs. An example is shown in Fig. 5(a). Each leaf node in the trimmed scheduling tree represents a SUBT which can be decoded by the DTS in three cycles. If the tuple is an SP1 or SP2, one or two cycles can be saved, respectively. The LLR calculations in the trimmed scheduling tree are executed according to the depth-first traversal schedule, and the latency of each node is one cycle. Consequently, the total latency of traversing such a trimmed scheduling tree is given by
$$ D_{\text{MB-DTS}} = N_{node} + 3 N_{leaf} - N_{SP1} - 2 N_{SP2}, $$
where $N_{node}$ is the total number of the nodes (except the root node), $N_{leaf}$ is the number of leaf nodes in this tree, and $N_{SP1}$ and $N_{SP2}$ are the numbers of leaf nodes corresponding to SP1 and SP2, respectively. In the example shown in Fig. 5(a), $N_{node} = 6$, $N_{leaf} = 4$, $N_{SP1} = 2$ (a single unreliable bit is regarded as an SP1 here) and $N_{SP2} = 1$, so the total latency is $D_{\text{MB-DTS}} = 6 + 12 - 2 - 2 = 14$ cycles. It can be seen that the latency is greatly influenced by $N_{leaf}$, which depends on the number of unreliable bits. Thus the selective expansion algorithm helps to greatly reduce the latency by reducing the number of bits which need path expansions.
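The cycle count can be checked with a small Python helper. It assumes that the costs stated above simply add up: one cycle per non-root node for the LLR calculation, three cycles of DTS per leaf, minus one cycle per SP1 leaf and two cycles per SP2 leaf. The function name is hypothetical.

```python
def mb_dts_latency(n_node, n_leaf, n_sp1, n_sp2):
    """Cycles to traverse a trimmed scheduling tree under the stated assumptions."""
    llr_cycles = n_node                 # one cycle per node, root excluded
    dts_cycles = 3 * n_leaf             # PMU, threshold sorting, DTS selection
    savings = n_sp1 + 2 * n_sp2         # SP1 saves one cycle, SP2 saves two
    return llr_cycles + dts_cycles - savings
```

With the values of the example in Fig. 5(a), `mb_dts_latency(6, 4, 2, 1)` evaluates to 14 cycles.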
One of the issues of the MB-DTS for LSCD with a large list size is that the size of the multi-bit tuples is not fixed, which is not favorable for a regular hardware implementation. Also, if the tuple size is too large, the PMU operations will be too complex and the corresponding critical path delay will be high. To solve these problems, we divide the scheduling tree into two parts at stage m and only apply MB-DTS to sub-trees rooted at stage m, i.e., DTS is only applied for leaf nodes at the stages lower than m in a trimmed tree. This restricts the size of the tuple to a maximum length of $T_{max} = 2^m = M$ so that the computational complexity and critical path delay can be bounded to a reasonable value. For stages higher than or equal to m, LLR calculations are executed according to the traditional LSCD schedule. Fig. 5(b) shows an example of an MB-DTS with M = 2. MB-DTS is only applied to the 2-bit sub-trees rooted at stage 1. The tuple dividing scheme is only applied to node 1, which includes two unreliable bits.
IV. PARTIAL G-NODE LOOK-AHEAD
In this section, we will present a partial G-node look-ahead scheme to reduce the latency of the SCD computation for the LSCD with a large list size. First, the G-node look-ahead scheme proposed in [7] for the conventional SCD will be reviewed. Then we will show how to modify and apply this scheme to a semi-parallel LSCD architecture to reduce the latency while keeping the hardware overhead to a minimum.
A. Review of G-node Look-ahead
In traditional SCD computation, G-node calculations are executed according to (5) after the partial-sums are generated, which takes one clock cycle. If the dependency on the partial-sum is removed, the G-nodes can be calculated at the same time as the F-nodes at the same stage and the extra cycle is saved. In the GLAH scheme [7], the dependency is removed by unconditionally computing the G-node twice, assuming the partial-sum to be 0 and 1, respectively. Both results are stored temporarily and the correct one is selected directly when the actual partial-sum is generated later. As half of the nodes in the scheduling tree are G-nodes, the overall latency of SCD can be reduced by half. The saving in latency comes at the cost of extra computations and memory storage. For every two input LLRs, three output LLRs need to be calculated and stored at the same time, one for the F-node and two for the pre-computed G-node. The memory usage is hence about three times that of the traditional SCD [7], and the cost is particularly high for LSCD with a large list size as there are $L$ SCDs.
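The look-ahead idea can be illustrated with a minimal Python sketch. It assumes the common min-sum F-function and the standard G-function $L_b + (1 - 2u)L_a$; the exact forms of the paper's equations are not reproduced in this section, so these are assumptions, and all function names are hypothetical.

```python
def f_node(la, lb):
    """Min-sum F-function (assumed form): sign(la) * sign(lb) * min(|la|, |lb|)."""
    sign = -1 if (la < 0) != (lb < 0) else 1
    return sign * min(abs(la), abs(lb))


def g_node(la, lb, u):
    """G-function (assumed form) for partial-sum bit u in {0, 1}."""
    return lb + (1 - 2 * u) * la


def glah(la, lb):
    """Look-ahead: compute F and both G candidates before u is known."""
    return f_node(la, lb), g_node(la, lb, 0), g_node(la, lb, 1)


def select_g(outputs, u):
    """Pick the stored G candidate once the actual partial-sum u arrives."""
    return outputs[1 + u]


stored = glah(3, -5)          # three LLRs stored for two input LLRs
print(stored, select_g(stored, 1))
```

The three values returned by `glah` correspond to the three output LLRs per input pair mentioned above, which is where the roughly threefold memory cost comes from.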
In the rest of this section, we discuss how to use GLAH in LSCD to obtain most of the latency saving while keeping the hardware and memory overhead to a minimum.
B. Partial G-node Look-ahead
For an efficient hardware implementation, a semi-parallel architecture [4] is usually used in a traditional LSCD architecture [25], [40]. The block diagram of the SCD computation in the LSCD architecture is shown in Fig. 6, which contains $L$ blocks of LLR memory, $L$ processing element (PE) arrays and one $L \times L$ crossbar. The memories are used to store the intermediate LLRs at each stage. Each PE array uses $P = 2^p$ ($\ll N$) PEs for the SCD computation. After an LM, the LLR values of some paths need to be copied to the LLR memories of other SCDs to continue the subsequent decoding operations if both of the expanded paths are kept. To eliminate the costly data movement between the memories, a pointer-based updating mechanism is usually used instead [17]. To support this pointer-based operation, the crossbar is used to align the data stored in the memories with the PE arrays for correct operation [22], [25]. The bit-width of the read port of each memory and of the crossbar is $2PQ_{\text{LLR}}$ bits, where $Q_{\text{LLR}}$ is the number of quantization bits for the LLR operands. If GLAH is used in this architecture, similar to the architecture presented in [7], the size of the memories is tripled and the data moved from the memories to the PE arrays are doubled to $4PQ_{\text{LLR}}$ bits because each LLR input of the G-function has two candidates. This increases the complexity of the memories and the crossbar. Table II summarizes how the computational parallelism (in terms of the number of F- or G-functions that are executed in parallel), the latency, the memory storage requirement, the data bandwidth requirement and the number of memory words depend on the stage index for a semi-parallel LSCD architecture that has $P$ PEs for each path. At stage $s$, the decoding parallelism is $2^s$.
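The pointer-based update mentioned above can be sketched in Python as follows. This is a simplified illustration of the general lazy-copy idea, not the mechanism of [17] or [22] in detail: it assumes one pointer per path and per stage that names the physical memory block holding that path's LLRs, and the class and method names are hypothetical.

```python
class LazyCopyPointers:
    """Per-path, per-stage pointers to physical LLR memory blocks (sketch)."""

    def __init__(self, list_size, n_stages):
        # Initially, path l reads every stage from its own memory block l.
        self.ptr = [[l] * n_stages for l in range(list_size)]

    def fork(self, parents):
        """After an LM, new path l continues old path parents[l].

        Only the pointers are copied; no LLR data moves between memories.
        """
        self.ptr = [list(self.ptr[p]) for p in parents]

    def write(self, path, stage):
        """Path writes new LLRs at this stage into its own memory block."""
        self.ptr[path][stage] = path

    def source_block(self, path, stage):
        """Memory block the crossbar must route to this path's PE array."""
        return self.ptr[path][stage]


ptrs = LazyCopyPointers(list_size=2, n_stages=3)
ptrs.fork([0, 0])      # both surviving paths descend from old path 0
ptrs.write(1, 0)       # path 1 recomputes stage 0 into its own block
print(ptrs.source_block(1, 0), ptrs.source_block(1, 2))
```

In this sketch the `source_block` lookup plays the role of the crossbar selection: a forked path keeps reading its parent's memory at the stages it has not yet overwritten.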
According to these relationships, we can separate all the stages into two groups: 1) Stages with indices smaller than or equal to p are called fully-parallel stages, as their parallelism is not larger than P and the computations can be finished in one cycle. 2) Stages with indices larger than p are called semi-parallel stages, as their parallelism is larger than P and the computations take multiple cycles. Based on the above analysis, we propose a partial G-node look-ahead scheme. Specifically, for the fully-parallel stages, we use the GLAH scheme; and for the semi-parallel stages, we just use the traditional decoding scheme without the GLAH.
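The grouping of stages can be expressed in a few lines of Python. The sketch assumes, as stated above, that stage $s$ has parallelism $2^s$ and that $P = 2^p$ PEs are available per path; the helper names are illustrative only.

```python
def stage_cycles(s, n_pe):
    """Cycles needed for one node at stage s with n_pe PEs (parallelism 2**s)."""
    return max(1, -(-(2 ** s) // n_pe))  # ceiling division, at least one cycle


def stage_group(s, p):
    """Fully-parallel if the stage index does not exceed p, else semi-parallel."""
    return "fully-parallel" if s <= p else "semi-parallel"


# Example setting: N = 1024 (stages 0..9) and P = 64 (p = 6).
for s in range(10):
    print(s, stage_group(s, 6), stage_cycles(s, 64))
```

For this setting, stages 0 to 6 finish in one cycle and use GLAH under the proposed scheme, while stages 7 to 9 take 2, 4 and 8 cycles and keep the traditional schedule.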
Though the data bandwidth at the fully-parallel stages (except stage $p$, which gets $2P$ LLR operands from the semi-parallel stages) is doubled because of the GLAH calculation, it does not exceed the maximum bandwidth requirement ($2PQ_{\text{LLR}}$ bits), so this maximum is unchanged. The overall decoding latency is nearly halved because most of the nodes in the scheduling tree belong to the fully-parallel stages. Specifically, the latency of P-GLAH is given by
$$D_{\text{P-GLAH}} = D_{\text{trd}} - \Delta D, \tag{24}$$
where $D_{\text{trd}}$ is the latency of the traditional semi-parallel architecture without GLAH computation [4] and $\Delta D$ is the latency saving, which equals the number of G-nodes in the fully-parallel stages. For example, if $N = 1024$ and $P = 64$, the latency of P-GLAH is about 51.2% of that of the traditional architecture.

Next, we analyze the memory usage of the semi-parallel LSCD architecture with P-GLAH. For the fully-parallel stages, the memory usage is $4PQ_{\text{LLR}}$ bits for each path, which equals the sum of the memory bits of all the stages shown in Table II. Compared with the $2PQ_{\text{LLR}}$-bit memory usage in these stages of the traditional LSCD, the required memory bits are only doubled instead of tripled. This is because the calculation results of the F-nodes are used in the subsequent cycle and can be stored in an extra $PQ_{\text{LLR}}$-bit bypass register bank [4]. This also guarantees that the $P$ pairs of GLAH calculations executed at stage $p$ have a write data bandwidth of $2PQ_{\text{LLR}}$ bits instead of $3PQ_{\text{LLR}}$ bits. For the semi-parallel stages, as we do not use the GLAH scheme, the memory usage is the same as that of the traditional architecture, which is $\left(\frac{N}{2P} - 1\right) \cdot 2PQ_{\text{LLR}}$ bits for each path. To support $L$ SCDs executed in parallel, $L$ copies of memories are needed for the fully-parallel stages, the semi-parallel stages and the bypass memories. Besides, $NQ_{\text{LLR}}$ memory bits are used to store the channel LLRs, which can be shared by all the paths. So, the overall size of the LLR memory for an LSCD with list size $L$ is $[(L + 1)N + 3LP] \cdot Q_{\text{LLR}}$ bits. The size of the LLR memory is similar to that of a traditional LSCD architecture such as [40]. Specifically, for the example mentioned in the last paragraph, the memory overhead is only about 18% for a list size of 32, which is much smaller than the overhead of the original GLAH [7].
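The two numerical claims above (about 51.2% latency and about 18% memory overhead) can be reproduced with a short Python sketch. It rests on two assumptions that are not stated in this section: the traditional semi-parallel SCD latency is taken as $2N + \frac{N}{P}\log_2\frac{N}{4P}$ cycles, and the traditional LLR memory is taken as $(L+1)N \cdot Q_{\text{LLR}}$ bits, i.e., the per-path figures quoted above without the GLAH and bypass storage. Function names are hypothetical.

```python
from math import log2


def d_trd(n, p_pe):
    """Assumed latency (cycles) of a traditional semi-parallel SCD."""
    return 2 * n + (n // p_pe) * int(log2(n // (4 * p_pe)))


def delta_d(n, p_pe):
    """Number of G-nodes in the fully-parallel stages 0..p (stage s has N / 2**s nodes)."""
    p = int(log2(p_pe))
    return sum(n // 2 ** (s + 1) for s in range(p + 1))


def llr_memory_bits(n, p_pe, list_size, q_llr, p_glah=True):
    """LLR memory size: [(L + 1) N + 3 L P] Q with P-GLAH, (L + 1) N Q without (assumed)."""
    extra = 3 * list_size * p_pe if p_glah else 0
    return ((list_size + 1) * n + extra) * q_llr


N, P, L, Q = 1024, 64, 32, 6
ratio = (d_trd(N, P) - delta_d(N, P)) / d_trd(N, P)
overhead = llr_memory_bits(N, P, L, Q) / llr_memory_bits(N, P, L, Q, p_glah=False) - 1
print(f"latency ratio: {ratio:.3f}, memory overhead: {overhead:.3f}")
```

Under these assumptions the latency ratio evaluates to roughly 0.512 and the memory overhead to roughly 0.18, in line with the figures quoted in the text.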
V. HIGH-THROUGHPUT ARCHITECTURE FOR LSCD WITH A LARGE LIST SIZE
In this section, we first present the overall architecture of the proposed LSCD and analyze its hardware complexity. Then, we discuss the details of the main sub-modules.
A. Overall architecture
The structure of the proposed architecture is shown in Fig. 7 . It consists of several modules: the SCD cores, the LM module, the partial-sum network, the path memory and the control unit.
To support an LSCD with a list size equal to $L$, most of the blocks in Fig. 7 are duplicated $L$ times, and they are drawn with double-lined frames. The SCD cores are used to execute the SCD computations. The input channel LLRs and the intermediate LLRs at each stage of SCD are stored in the channel buffer and the LLR memories, respectively. Both memories are implemented in words of $2PQ_{\text{LLR}}$ bits. To reduce the memory usage, the precomputation memory-saving (PCMS) scheme proposed in [30] is applied, which means that the GLAH calculation is executed at the top stage $n - 1$. These calculation results are shared by all the $L$ paths and hence the channel LLRs do not need to be stored once they are used in the first computation. Thus only $3N/2$, instead of $LN/2$, LLRs need to be stored at stage $n - 1$. The crossbar selects the LLRs from either the LLR memory or the channel buffer as input and sends the data to the PE arrays for F- and G-node computations according to the pointer-based lazy copy scheme [22]. The PEs are designed to support both GLAH and non-GLAH computations, and the detailed structure will be presented in Section V-B. As MB-DTS is used, after the LLR calculations of stage $m$ are finished, $M$ LLRs are sent to the LM module. Here, the schedule of the SCD computations at stage $m$ and the LM operations is re-designed to further reduce the latency, which will be detailed in Section V-D.
The LM module directly implements the block diagram shown in Fig. 3. The PMU block is used to calculate the PMs of the expanded paths and the details will be presented in Section V-C. The DTS block is used to realize the DTS algorithm, and a structure similar to that presented in [40] is used. The sorter is used to sort the PMs of the $L$ surviving paths in order to obtain the AT and RT values for the DTS for the decoding of the next SUBT. These three blocks are mapped and connected according to the schedule of decoding a SUBT as discussed in Sections III-B and III-C.

The partial-sum network is used to update and store the partial-sums required for the G-node computations. A folded partial-sum network for a semi-parallel SCD architecture, similar to that proposed in [6], is used for each path. The partial-sum memory and the partial-sum update block are duplicated $L$ times to support $L$ paths. Each copy of the partial-sum memory contains $N$ memory bits. Among these bits, $P$ bits are stored in registers while the others are stored in an SRAM whose port width is $P$ bits. After MB-DTS for an $M$-bit tuple is finished, $L \cdot P$ bits of partial-sums from the $L$ paths are sent to the crossbar for permutation according to the LM results. The permuted partial-sums are updated with the $L \cdot M$ decoded bits from the LM module and then stored back to the partial-sum memory, while at the same time being sent to the PE arrays for G-node computations. More details of the partial-sum network architecture can be found in [6].
The path memory is used to store and update the partial decoded vectors of each path. Its structure is similar to that of the partial-sum network. L copies of path memories are used to store the decoded bits of L paths and a crossbar is used to update each path according to the LM results. The newly decoded bits are appended to the partial decoded vectors and the updating block is simply implemented using shifters. When the decoding of a code block is finished, all the paths are checked with the CRC unit and the one that passes the checking is selected as the output word.
The control unit generates the control signal for the decoder. The frozen set A c and the reliable set A r are stored in a small ROM in this unit. Based on these sets, the size of the tuples and the other related control signals are generated online according to Algorithm 1. For the decoder flexibility, we can simply change the contents in the ROM when the code sets are changed and the hardware architecture does not need to be modified.
B. Programmable Processing Element
As the P-GLAH scheme is used in the implementation of the LSCD, the processing elements have to execute the GLAH and normal non-GLAH computations at the fully-parallel and semi-parallel stages, respectively, and hence a programmable processing element is required.
The structure of the programmable PE is shown in Fig. 8 . Sign-magnitude representation is used to represent the LLRs. The datapath contains an input stage, a calculating stage and an output stage which are described as below.
• The input stage is configured according to the number of input candidates. If there are only two input LLRs, i.e. when pre-computation is not used, only two ports, I.0 and I.2, are activated and selected. Otherwise, L a and L b are selected from the four pre-computed values by the partial-sums.
• The calculating stage is used to generate the candidate values for the output stage according to (5) and (7). The datapath for the magnitudes mainly consists of three adders which are used to calculate |L a |+|L b |, |L a |−|L b | and |L b | − |L a |. The overflow bit of |L a | − |L b |, marked as χ, will be used to select the minimum of the two magnitudes min ab as well as the absolute value of the difference, dif ab , from ±|L a | ∓ |L b |. The sum ab , dif ab and min ab are used as the candidates at the output stage. Similarly, in the datapath for the sign bit, three sign bits are generated.
• The output stage is configured according to whether GLAH is used or not. There are three outputs when GLAH is used and one output when it is not. O.1 is sent to the memories under both conditions.

The delay and complexity of the programmable PE are mainly due to the adders, and the critical path includes only one stage of adder and some multiplexers, which is highlighted with the dashed line in Fig. 8. We compare the proposed programmable PE structure with the existing ones from the literature and the results are summarized in Table III. All the PE structures listed here use sign-magnitude representation for the LLR operands. The PE structure in [4] uses one fewer adder than ours; however, this comes at the cost of a larger logic delay. Moreover, it cannot support the GLAH calculation. The PE in [7] uses an adder-subtractor to calculate multiple expressions for GLAH. However, for such a structure, the representation of all the input and output operands must be converted between 2's complement and sign-magnitude. As a result, there are in total one adder and five converters (for representation conversion), and the critical path includes one stage of adder and two stages of converters. It can be seen that the programmable PE structure can realize the partial G-node look-ahead scheme with low logic delay and moderate hardware complexity.

Fig. 9. Structure of the PMU block for an M-bit tuple in the MB-DTS (M = 4). A double arrow means the signal includes two updated PMs.
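The calculating stage of the programmable PE described above can be modelled behaviourally in Python. This is a sketch under assumptions: the F-function is taken as min-sum and the G-function as $L_b + (1 - 2u)L_a$, saturation and the input and output multiplexers are not modelled, and the function names are hypothetical rather than taken from Fig. 8.

```python
def pe_glah(sa, ma, sb, mb):
    """Sign-magnitude PE model. Inputs are (sign bit, magnitude) of La and Lb.

    Returns (F, G for u = 0, G for u = 1), each as (sign bit, magnitude).
    """
    total = ma + mb            # |La| + |Lb|
    d_ab = ma - mb             # |La| - |Lb|
    d_ba = mb - ma             # |Lb| - |La|
    chi = d_ab < 0             # borrow of |La| - |Lb|: set when |La| < |Lb|
    min_ab = ma if chi else mb
    dif_ab = d_ba if chi else d_ab

    def g(u):
        sa_eff = sa ^ u        # sign of the La term after applying the partial-sum
        if sa_eff == sb:       # same signs: magnitudes add
            return sb, total
        # opposite signs: magnitude is the difference, sign follows the larger operand
        return (sb if chi else sa_eff), dif_ab

    return (sa ^ sb, min_ab), g(0), g(1)


def to_int(sm):
    """Convert (sign bit, magnitude) to a Python integer."""
    sign, mag = sm
    return -mag if sign else mag


f, g0, g1 = pe_glah(0, 3, 1, 5)   # La = +3, Lb = -5
print(to_int(f), to_int(g0), to_int(g1))
```

Only three magnitude results (sum, difference and minimum) are needed to form all three outputs, which is consistent with the three-adder datapath described in the text.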
C. PMU Block in the LM Module
To implement the MB-DTS algorithm on hardware, the PMU block shown in Fig. 9 is used in the LM module, which directly implements the algorithm introduced in Section III-D.
Let $m$ be the highest stage in the scheduling tree at which MB-DTS can be used. With the tuple dividing algorithm, the size $T$ of the tuple decoded using MB-DTS ranges from 1 to $2^m$, indicating that the PMU block needs to update the PM with at most $M$ LLRs from one path. In the example shown in Fig. 9, $M = 4$, and it can be used to execute the PMU for tuples with $T = 1$, 2 or 4. The inputs to the PMU are $M$ LLRs from stage $m$. If $T < M$, the PMU cannot be executed immediately with the $M$ input LLRs. According to Section III-D, $m - t$ stages of LLR calculations are executed in $m - t$ cycles to obtain the valid LLRs for the PMU of a $T$-bit tuple. Therefore, $m$ stages of programmable PEs are needed to execute these LLR calculations and registers are used to save the outputs at each stage. If $T = 1$ or the tuple is an SP1, the bit can be decoded with the original DTS. Consequently, the magnitude of the last stage of LLR calculation, $|(L_i^0)_l|$, which is the output of the last PE in Fig. 9, is used as the penalty value $\Lambda_i^l$ in (10) and the PM values can be updated accordingly. Otherwise, if $T > 1$, after $m - t$ stages of LLR calculations, a $T$-input sub-PMU block shown in Fig. 10 is used to update the PM values for the $T$-bit tuple.
The $T$-input sub-PMU block is used to compute (17) and (18). To implement the MLD in (18), we need to update up to $2^T$ PMs with penalties and find the two minimum values out of two groups of $2^{T-1}$ values, which requires $2^T$ $T$-input adders and two groups of $T - 1$ stages of comparators, respectively. To reduce the hardware complexity and the delay of the datapath, the actual sub-PMU block does not execute MLD at all. Instead, during the formation of the tuples, we further restrict the tuple to be one of the following special patterns of SUBT: an SP1, an SP2 or a rate-1 $T$-bit tuple. As discussed in Section III, none of these tuple patterns requires an MLD, so the complex MLD calculation is avoided.
The structure of the $T$-input sub-PMU block is shown in Fig. 10 and it consists of two identical sub-blocks, which are used to calculate (17) in parallel. Both sub-blocks are activated when a rate-1 $T$-bit tuple is decoded, while only one of them is activated when an SP2 is decoded. There are three stages in the datapath, which are divided by the dotted lines in Fig. 10. The first stage consists of two $T$-input adders to calculate the two penalty values, $\Delta_A$ and $\Delta_B$ in (17), which are then sent to the second stage to be added to the PM value of the current path. The last stage selects the smaller of the two updated PMs, $\theta^l_{i+T}$, for the sorting process that may be executed in the next cycle.

Fig. 10. Structure of a T-input sub-PMU block.
The delay of the $T$-input sub-PMU block is mainly due to the adders and comparators. A $t$-stage adder tree is required to implement the $T$-input adder. Thus, the critical path delay of the whole PMU block lies in the $M$-input sub-PMU block and equals that of $m + 1$ stages of adders. As mentioned in Section III-D, $M$ should not be too large so as to bound this critical path delay to a moderate value. The datapaths for the other $T$-bit tuples include $m - t$ stages of programmable PEs (each with a delay of about one adder) and $t + 1$ stages of adders for the $T$-input sub-PMU block, so the total delay is also $m + 1$ stages of adders. Thus, the LLR calculations and the PMU of any $T$-bit tuple can be merged and executed in one cycle without increasing the critical path delay. If we consider the latency to traverse a sub-tree at stage $m$ and below, the first term in (21), $N_{\text{node}}$, can be removed. We denote this modified MB-DTS with the optimized schedule as simplified MB-DTS (SMB-DTS), and its latency in clock cycles can be expressed as
$$D_{\text{SMB-DTS}} = 3N_{\text{leaf}} - N_{\text{SP1}} - 2N_{\text{SP2}}. \tag{25}$$
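Both observations above, the balanced critical path and the shortened latency, can be checked with a small Python sketch. It assumes that the MB-DTS latency is $N_{\text{node}} + 3N_{\text{leaf}} - N_{\text{SP1}} - 2N_{\text{SP2}}$, so that dropping the $N_{\text{node}}$ term gives the SMB-DTS latency; the function names are illustrative, and the leaf counts of Fig. 5(a) are reused purely as sample inputs.

```python
def adder_stages(m, t):
    """Adder-equivalent stages on the datapath of a T-bit tuple, T = 2**t.

    (m - t) programmable-PE stages plus (t + 1) adder stages in the sub-PMU.
    """
    return (m - t) + (t + 1)


def smb_dts_latency(n_leaf, n_sp1, n_sp2):
    """Assumed SMB-DTS latency: MB-DTS latency without the N_node term."""
    return 3 * n_leaf - n_sp1 - 2 * n_sp2


m = 3  # M = 8
print([adder_stages(m, t) for t in range(1, m + 1)])
print(smb_dts_latency(n_leaf=4, n_sp1=2, n_sp2=1))
```

Every tuple size yields the same $m + 1$ adder stages, which is why merging the LLR calculations with the PMU does not lengthen the critical path.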
D. Latency Fine-tuning by Datapath Optimization
In this sub-section, a critical path optimization scheme is proposed to further reduce the latency. Let us first take a snapshot of the decoding process at the nodes near stage $m$. A sub-tree rooted at stage $m + 1$ is shown in Fig. 11(a) and its decoding schedule is shown in Fig. 11(b), in which each dotted line represents a pipeline stage. In cycle 1, two LLR calculations at stage $m$, m.F and m.G, are executed in parallel using GLAH. The PMU of LM.F (the LM after m.F is finished) is executed in the second cycle. After LM.F is finished, the pre-computed m.G results are selected as the inputs of LM.G (the LM after m.G is finished) and the PMU of LM.G is calculated immediately in cycle 3. An architecture that directly maps the above operations to hardware is shown in Fig. 12(a). To reduce the decoding latency, we can remove the pipeline stage between cycles 1 and 2 (the grey dotted line in Fig. 11(b)). However, direct de-pipelining makes the delay of this datapath (the sum of the delays of the crossbar, the PE array and the PMU) much longer than that of the other blocks of the decoder, such as the DTS block, and degrades the overall clock frequency. By carefully analyzing the data dependency, we can optimize the critical path of the de-pipelined datapath to fine-tune the latency under the following two situations.
• If the root node at stage m + 1 is a G-node, the inputs of m.F are already pre-computed using GLAH several cycles earlier. This means that we can also pre-compute the F-functions in m.F some cycles before and we do not need to compute the LLR values in this cycle. As each input LLR of m.F has two candidate values, there are four possible combinations of the inputs and hence four possible outputs of the m.F. We can calculate all these four F-functions when LM operations of the previous leaf nodes are being executed. To do so, we can re-use the programmable PEs in the SCD cores without adding extra hardware as they are idle during the LM operation. A group of 4-1 mux arrays is needed to select the correct LLRs, as shown in Fig. 12(b) , and the total delay is the sum of the delay of the crossbar, the mux and the PMU (highlighted with the dashed line).
• If the root node at stage $m + 1$ is an F-node, the inputs of m.F are calculated in the previous cycle. When an F-node is executed, the data in the corresponding LLR memory of a PE array are always used as its own input LLRs for m.F. Thus the crossbar is not needed in this situation and can be bypassed. Hence, the architecture in Fig. 12(a) can be de-pipelined as shown in Fig. 12(b), and the total delay is the sum of the delays of the mux, the PE array and the PMU (highlighted with the dotted line).

With the proposed latency fine-tuning scheme and the optimized datapath shown in Fig. 12(b), the F-function at stage $m$ and the PMU of its following LM can be executed in the same cycle with a small critical path delay overhead. As there are $\frac{N}{2M}$ F-nodes at stage $m$, $\frac{N}{2M}$ cycles can be saved in total.
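The saving from latency fine-tuning follows from a simple node count, sketched below in Python. The sketch assumes that stage $m$ holds $N/M$ nodes (each covering $M = 2^m$ bits), half of which are F-nodes; the function name is hypothetical.

```python
def fine_tuning_saving(n, m_bits):
    """Cycles saved by merging each stage-m F-node with the PMU that follows it.

    Stage m holds n / m_bits nodes and half of them are F-nodes.
    """
    nodes_at_stage_m = n // m_bits
    return nodes_at_stage_m // 2


for m_bits in (2, 4, 8):
    print(m_bits, fine_tuning_saving(1024, m_bits))
```

For $N = 1024$ the saving shrinks as $M$ grows (256, 128 and 64 cycles for $M = 2$, 4 and 8), because fewer sub-trees are rooted at stage $m$.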
E. Decoding Latency of the Proposed LSCD Architecture
Based on the above discussions, the overall latency of the LSCD architecture with all the proposed schemes for decoding one frame is given by
$$D = \sum D_{\text{SMB-DTS}} + D_{\text{P-GLAH}} - D_{\text{fine}} - D_{\text{zero}}. \tag{26}$$
Specifically, the first term represents the cycles required for decoding the nodes at the stages lower than $m$, which equals the sum of the cycles required for all the $M$-bit tuples decoded by SMB-DTS. $D_{\text{SMB-DTS}}$ and $D_{\text{P-GLAH}}$ can be calculated according to (25) and (24), respectively. $D_{\text{fine}} = \frac{N}{2M}$ is the saving due to the technique discussed in Section V-D. The last latency-saving term, $D_{\text{zero}}$, is due to the structure of the polar code and is described as follows.
For a block of polar code, the leaf node corresponding to the first information bit in the scheduling tree is already known at the beginning of the decoding. Specifically, all the bits before the first information bit are frozen bits and the only partially decoded vector is all-zero. Thus, we can find the path from the root to the leaf node corresponding to this information bit and calculate only the nodes on this path. All the partial-sums of the G-nodes on this path are zero, so there is no data dependency. For example, in Fig. 5(b), node 1 and hence the first information bit $u_2$ can be decoded at the beginning of the decoding, as $u_0$ and $u_1$ are frozen and the partial-sums of node 1 are zeros. The decoding process for the all-zero vector is not needed and the latency saved is denoted as $D_{\text{zero}}$.
For a conventional semi-parallel LSCD [25] , the baseline decoding latency is 2N cycles for the L parallel SCDs and K cycles for LM. As the latency of our SMB-DTS highly depends on the setting of the code, we will show the latency saving by a numerical example in Section VI-B.
VI. EXPERIMENTAL RESULTS

A. Error Correction Performance of the Proposed Schemes
To demonstrate the error correction performance of the proposed schemes, simulations are done on the polar code of (N, K, r) = (1024, 512, 24) over an AWGN channel. that of the LSCD with L = 32 at E b /N 0 = 2.5 dB. For the DTS, we use the modified version, DTS-advance, presented in [40] , which is more suitable for hardware implementation due to its lower computational complexity. The rejection thresholds of the DTS-advance are γ Table IV . It can be seen even when M = 8, out of the 128 eight-bit tuples, 90 tuples are those for which all the eight bits can be decoded at the same time, indicating a great latency-saving can be achieved, which will be shown in the next sub-section.
B. Latency-Saving Achieved by the Proposed Schemes
The overall decoding latency of LSCD with different $M$ values for the 1024-bit polar code is obtained based on (26) and is summarized in Table V. We assume that $P = 64$ PEs are used in each SCD, which is the same as in [25], [32], [34], [36], [40], [46]. The parameters of SMB-DTS are the same as those presented in Section VI-A. When more bits are merged, the latency of LM using SMB-DTS and the overall latency both decrease. The latency of SCD using P-GLAH also decreases, as more stages in the LSCD are calculated by the LM module instead of the SCD cores. The first information bit is $u_{127}$ and $D_{\text{zero}}$ is obtained accordingly.
We also compare the latency with those of the state-of-the-art LSCDs [25], [40]. For a fair comparison, the latency is re-calculated using the same code setting mentioned in Section VI-A. When $M = 8$, the latency without the two further saving refinement schemes is 598 cycles, which is 77% and 60% less than the latency of [25] and [40], respectively. It may seem that most of the latency saving is achieved by the P-GLAH used in the SCD part while the latency of LM is barely reduced. However, the proposed SMB-DTS executes the LM and the SCD calculations below stage $m$ as a whole. Thus, for a fair comparison, the latency of SMB-DTS with $M = 8$ should be compared with the sum of the latencies of the LM and the SCD calculations below stage 3, which are 2304 and 1190 clock cycles in [25] and [40], respectively. This means the latency reduction achieved by SMB-DTS is 82% and 64% compared with [25] and [40], respectively. For the remaining SCD calculations at stages equal to or higher than stage 3, the latency saving achieved by P-GLAH is 42%, which is consistent with the theoretical analysis in Section IV-B. When the two further saving refinement schemes are used, an even higher latency reduction is achieved.
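The reference figures for [25] used in this comparison can be reproduced approximately with a Python sketch. It assumes the conventional baseline quoted earlier ($2N$ cycles for SCD plus $K$ cycles for LM, with $K = 512$) and one cycle per node at stages 0 to 2; the exact cycle accounting in [25] may differ, and the function names are hypothetical.

```python
def baseline_latency(n, k):
    """Conventional semi-parallel LSCD baseline: 2N cycles of SCD plus K cycles of LM."""
    return 2 * n + k


def below_stage_latency(n, k, m):
    """LM cycles plus SCD cycles at stages 0..m-1 (stage s has n / 2**s nodes)."""
    return k + sum(n // 2 ** s for s in range(m))


def reduction(new, old):
    """Fractional latency reduction of `new` relative to `old`."""
    return 1 - new / old


N, K = 1024, 512
print(baseline_latency(N, K))                  # whole-frame baseline
print(below_stage_latency(N, K, 3))            # LM plus SCD below stage 3
print(round(100 * reduction(598, baseline_latency(N, K))))
```

Under these assumptions the part below stage 3 comes to 2304 cycles, matching the figure quoted for [25], and the 598-cycle latency corresponds to a reduction of roughly 77%.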
It can be seen that the more bits are merged, the lower the latency of SMB-DTS is. However, a larger $M$ means that a more complex PMU block has to be used, which incurs a larger area and a longer critical path. Moreover, the programmable PE used in P-GLAH and the de-pipelining for the fine-tuning optimization also increase the logic delay. Therefore, a careful tradeoff should be made between a higher clock frequency and fewer decoding cycles to achieve an optimal decoding throughput, according to the implementation results shown in the next sub-section.
C. Implementation Results of the Proposed Architecture
The proposed LSCD architecture for the polar code of $(N, K, r) = (1024, 512, 24)$ with the settings (the number of PEs used and the parameters of SMB-DTS) presented in Sections VI-A and VI-B is implemented. For a fair comparison, the same quantization schemes as in [25], [32], [34], [36], [40], [46], i.e., $Q_{\text{LLR}} = 6$ and $Q_{\text{PM}} = 8$, are used in our implementation. The proposed LSCD is synthesized with a UMC 90 nm CMOS process using Synopsys Design Compiler. The reported throughputs are coded throughputs. The reported area includes both cell and net area. Table VI shows the synthesis results with different list sizes and different numbers of merged bits. The maximum throughput of LSCD with $L = 32$ is about 25% lower than that of $L = 8$. The throughputs of the proposed LSCD architectures with $M = 4$ and 8 are similar and are much higher than that of $M = 2$, although the clock frequency is slightly reduced when a larger $M$ is used. The critical paths of the implementations with $M = 4$ and 8 both lie in the datapath shown in Fig. 12(b), which is mainly due to the logic delay of the PMU block. The critical path of the implementation with $M = 2$ lies in the DTS block, as the complexity of the PMU block is small.
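The tradeoff between clock frequency and cycle count can be made concrete with a one-line model in Python. It assumes the usual definition of coded throughput, $N \cdot f_{\text{clk}} / D$ for a frame of $N$ coded bits decoded in $D$ cycles; the numbers in the example are hypothetical and are not taken from Table VI.

```python
def coded_throughput_bps(n_bits, f_clk_hz, latency_cycles):
    """Coded throughput in bit/s: N coded bits delivered every D clock cycles."""
    return n_bits * f_clk_hz / latency_cycles


# Hypothetical operating point: 1024-bit frame, 400 MHz clock, 512 cycles per frame.
print(coded_throughput_bps(1024, 400e6, 512) / 1e6, "Mbps")
```

In this model, a larger $M$ improves throughput only if the relative reduction in cycles outweighs the relative drop in clock frequency.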
The area of the proposed architecture increases greatly when a large list size is used. This is mainly due to the crossbar in the SCD cores. According to the area breakdown shown in Table VII, the area of the crossbar for $L = 32$ is four times that of the crossbar for $L = 16$. The complex interconnection of the crossbar also leads to a large routing area. The complexity of the PMU block increases greatly with $M$ according to the discussion in Section V-C. However, its area is only slightly increased for a large $M$ because the PMU block contributes less than 5% of the total area. In contrast, the SCD cores contribute about 65% of the total area, with the LLR memory as the main contributor. The area saving achieved by PCMS is 12.8%, 18.1% and 15.6% for LSCD with $L = 8$, 16 and 32, respectively, which are larger than the 8% saving for LSCD with $L = 4$ reported in [30]. This shows that PCMS saves area more effectively for a large list size.

Table VIII compares the performance of our architecture with some state-of-the-art LSCD architectures [25], [32], [34], [36], [40], [46]. It can be seen that the proposed architecture can support LSCDs with larger list sizes, and both the decoding throughput and the area efficiency are higher than those of the state-of-the-art LSCDs with the same list sizes. These higher throughputs are achieved at a similar clock frequency, which means that our architecture reduces the decoding latency significantly. Compared with [40], the proposed LSCDs with $L = 16$ have a similar area. The area of the PE array and the PMU is much larger than that of their counterparts in [40], which are 0.53 mm² for the PE arrays and less than 0.1 mm² for the whole LM module for an LSCD with $L = 16$, indicating that the high throughput is achieved at the cost of increased hardware complexity of the programmable PE and the PMU block.
Fortunately, the total area is not increased much, as the area saving brought by the PCMS offsets the area overhead of the combinational logic. Compared with [25], although the area of our design is larger, a 3.5 times higher area efficiency is achieved due to the shorter decoding latency, as stated in Section VI-B. Compared with the fast-SSC-based architectures [32], [34], fewer path expansions are required for SP1 and SP2 in SMB-DTS than for the rate-1 codes, leading to a higher throughput and area efficiency.
VII. CONCLUSION
In this work, a high-throughput architecture for the LSCD with a large list size is proposed. First, two kinds of low-latency decoding algorithms are proposed. For the list management module, a multi-bit double thresholding scheme is proposed so that the double thresholding scheme can work with multi-bit decoding to reduce the latency. For the SCD cores, a partial G-node look-ahead scheme is proposed by making a tradeoff between the complexity and the latency. A high-performance VLSI architecture is then developed based on the proposed algorithms. Experimental results show that LSCDs with $L = 8$, 16 and 32 implemented by the proposed architecture provide much higher throughputs than the state-of-the-art architectures with a good BLER performance.

[34], [32] and [46] are based on TSMC 65 nm technology and are scaled to a 90 nm technology.
⋄ The synthesis results in [25] and [36] are based on TSMC 90 nm technology.
△ The cardinality |A| is re-calculated according to the definition in this paper.
