Abstract-Polar codes are the first class of capacity-achieving forward error correction (FEC) codes. They have been selected as one of the coding schemes for 5G communication systems due to their excellent error correction performance when successive cancellation list (SCL) decoding with cyclic redundancy check (CRC) is used. A large list size is necessary for SCL decoding to achieve a low error rate. However, a large list size also makes the computational complexity very high, which prevents SCL decoding from achieving a high throughput. In this paper, we propose a two-staged adaptive SCL (TA-SCL) decoding scheme and the corresponding hardware architecture to accelerate SCL decoding with a large list size. TA-SCL decoding supports a constant system latency and data rate. To analyse the decoding performance of TA-SCL, an accurate mathematical model based on a Markov chain is derived, which can be used to determine the parameters for practical designs. A VLSI architecture implementing TA-SCL decoding is then proposed and implemented using UMC 90nm technology. Experimental results show that TA-SCL can achieve throughputs of 3.00 and 2.35 Gbps when the list sizes are 8 and 32, respectively, which are nearly three times those of the state-of-the-art SCL decoding architectures, with negligible performance degradation over a wide signal-to-noise ratio (SNR) range and small hardware overhead.
I. INTRODUCTION
Polar codes [1] have attracted much research interest due to their excellent error correction performance. Polar codes decoded by successive cancellation (SC) decoding provably achieve channel capacity for symmetric binary-input discrete memoryless channels (B-DMC) as the code length approaches infinity [2]. However, as the source word is recovered bit by bit in the SC decoding process, the decoding latency for long polar codes is large [3], [4]. Numerous fast SC architectures have been proposed to reduce the decoding latency [5], [6].
On the other hand, the error correction performance of SC decoding is not satisfactory when it is used for practical polar codes with short to medium code lengths [7], such as the channel codes for 5G communication systems [8]. Thus, successive cancellation list (SCL) decoding [7], [9] has been proposed to improve the error correction performance of polar codes. However, it has a large computational complexity and latency overhead.
In SCL decoding, L concurrently executed SC decodings are used to keep L candidates of decoded vectors, where L is called the list size. Compared with SC decoding, SCL decoding has better error correction performance because the correct source word can remain in the list even when a decoding error happens. Moreover, by concatenating polar codes with cyclic redundancy check (CRC) codes [10], [11], the valid output vector is selected according to the CRC checksums after decoding. Consequently, SCL decoding significantly outperforms SC decoding for polar codes in error correction performance. Polar codes using SCL decoding with L ≥ 16 even outperform low-density parity-check (LDPC) [12] and turbo [13] codes using iterative decoding [14], [15], and hence short polar codes have been selected as one of the coding schemes in the coming 5G enhanced mobile broadband (eMBB) standard [8].
Aiming at increasing the decoding throughput of polar codes, VLSI architectures for SCL decoding have become a popular research topic [16]-[27]. Compared with single SC decoding, SCL decoding has a latency overhead because list management (LM) must be executed [16]. During the decoding of a bit, the L survival paths are expanded to 2L paths, as all of them are possible candidates of the partial decoded vectors. LM is then executed to select the L best paths to keep. Basically, LM needs to solve a radix-2L sorting problem, which has a computational complexity of $O(L^2)$ [28]. To minimize the latency overhead brought by LM, the most popular optimization in state-of-the-art hardware architectures is to decode multi-bit sub-codes at the same time so that fewer LM operations are needed. A sub-code can either have a fixed length [18]-[22] or match a special code pattern with variable length [23]-[27]. Besides, the sorting algorithm itself can be simplified [28]-[30]. An approximate sorting algorithm called the double thresholding scheme (DTS) was proposed in our previous work [15], [21], [22]. It reduces the sorting complexity to O(L) with the help of two thresholds generated at run time. The corresponding VLSI architecture supports a list size of up to 32 [22] so as to achieve an excellent decoding performance. However, as shown in Fig. 1, state-of-the-art SCL decoding architectures suffer from a severe throughput degradation when the list size is increased. This is because a larger list size increases the computational complexity of LM, and hence the critical path delay of the SCL decoding architectures.
In the iterative decoding of LDPC codes, the number of iterations needed for each frame to converge is not fixed, so the decoding speed can be increased by adaptively assigning different numbers of decoding iterations to different frames [31]-[33]. Similarly, to increase the decoding speed of polar codes so that it is comparable with those of LDPC and turbo codes, adaptive SCL (A-SCL) decoding was proposed in [10], in which the list size is adaptive. Specifically, a codeword is first decoded by a single SC decoding. If the decoded vector cannot pass CRC, the codeword is decoded by SCL with a doubled list size. This process is iterated until a valid output is obtained or a predefined maximal list size $L_{max}$ is reached. Experimental results in [10] show that A-SCL with $L_{max}$ has an error correction performance equivalent to that of SCL decoding with $L = L_{max}$, and that the average list size $\bar{L} \ll L_{max}$, as most of the codewords can be decoded by SCL with $L \ll L_{max}$. According to the relationship between list size and throughput shown in Fig. 1, the reduction in $\bar{L}$ increases the average throughput of a hardware polar decoder. Nevertheless, the decoding latency of each codeword in A-SCL is different. This is not an issue for a software decoder such as the one in [26]. However, a directly-mapped A-SCL hardware architecture cannot support applications that need a constant system latency and transmission data rate, such as the digital baseband in a communication system.
In this work, we show how to accelerate polar decoding in hardware with the help of A-SCL decoding. This paper is an extension of our previous work [34]. The main contributions of this work are summarized as follows:
• A hardware-friendly two-staged adaptive SCL (TA-SCL) algorithm is proposed, which can achieve a high throughput with a constant transmission data rate and system latency. To analyse the error correction performance of TA-SCL, a mathematical model for B-DMC is developed based on a Markov chain. The model extends the one in our previous work [34] such that the speed gain achieved by TA-SCL is not restricted to an integer multiple but can be any rational multiple.
• The relationships between the error correction performance and the design parameters are studied, and a method for selecting the design parameters is introduced.
• A hardware architecture of TA-SCL is developed based on the proposed model. The memory usage is analysed and the corresponding timing schedule is presented. A low-latency SCL (LL-SCL) architecture combining several existing low-latency decoding schemes is introduced to satisfy the high requirement of a component SCL decoder in the TA-SCL decoder architecture.
• Experimental results show that the throughput is about three times that of the state-of-the-art SCL decoding architecture [22], with negligible performance degradation and small hardware overhead.
The rest of this paper is organized as follows. In Section II, the background knowledge of polar codes and the decoding algorithms will be reviewed. In Section III, the algorithm of TA-SCL and its analytical model will be introduced. The relationships between the error correction performance and design parameters will also be analysed. In Section IV, the hardware architecture of the TA-SCL decoder will be introduced. Finally, simulation and implementation results of the hardware-based TA-SCL will be presented in Section V, and conclusions will be given in Section VI.
II. PRELIMINARIES
A. Polar Codes
Polar codes [1] are a class of linear block codes of length N. Without loss of generality, we assume $N = 2^n$ in this work, where n is an integer. Let $u^N$ and $x^N$ be the input source word and the output codeword of an N-bit binary frame, respectively. The encoding process can be simply expressed by $x^N = u^N \cdot F^{\otimes n}$, in which $F^{\otimes n}$ is called the generator matrix and equals the n th Kronecker power of the polarization matrix $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Due to the polarization effect, each bit in $u^N$ has a different reliability. An information set A is determined by finding the K most reliable bits. These K bits, called information bits, are used to transmit information. The complement of A, $A^c$, is defined as the frozen set, in which the bits are called the frozen bits and are set to 0. If an r-bit CRC code is used, the last r information bits are used to transmit the checksum generated from the other K-r information bits, and the code rate of the polar code is $R = \frac{K-r}{N}$.
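As a minimal sketch of the encoding rule described above, the Python function below applies the Kronecker structure of the generator matrix as n butterfly stages over GF(2), without forming the N × N matrix explicitly. The name `polar_encode` and the list-of-bits representation are illustrative assumptions; the selection of the information set A and the CRC attachment are left out.

```python
def polar_encode(u):
    """Return x = u * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]].

    u is a list of 0/1 values whose length N is a power of two.
    No bit-reversal permutation is applied.
    """
    n_bits = len(u)
    if n_bits == 0 or n_bits & (n_bits - 1):
        raise ValueError("the code length must be a power of two")
    x = list(u)
    step = 1
    while step < n_bits:
        # One Kronecker stage: the upper half of each block of size
        # 2*step absorbs (XOR) the lower half of the same block.
        for start in range(0, n_bits, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x
```

For N=4, a source word with only the last bit set, `[0, 0, 0, 1]`, maps to the all-ones codeword, because the last row of the second Kronecker power of F is all ones. Applying the function twice returns the original word, since the generator matrix is its own inverse over GF(2).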
B. Successive Cancellation Decoding
Successive cancellation decoding is a basic decoding algorithm of polar codes and has a low computational complexity of O(N log N). Its decoding process is usually represented by a scheduling tree. An example for an N=4 polar code is shown in Fig. 2. It is a full binary tree with n+1 stages. The operands of SC decoding are the log-likelihood ratios (LLRs). The i th LLR at stage s is denoted as $L^s_i$. Each source bit is decided from the corresponding LLR at stage 0 as
$$\hat{u}_i = \begin{cases} 0, & \text{if } i \in A^c \text{ or } \Lambda_i \ge 0, \\ 1, & \text{otherwise}, \end{cases} \quad (1)$$
where $\Lambda_i = L^0_i$. To obtain these $\Lambda_i$s, the nodes in the scheduling tree are calculated as follows. A pair of sibling nodes at stage s share $2^{s+1}$ LLR inputs from stage s+1 and both of them execute $2^s$ calculations in parallel. The left and right sibling nodes (denoted as F- and G-nodes) calculate the following F- and G-functions, respectively:
$$F(L_a, L_b) = \mathrm{sgn}(L_a)\,\mathrm{sgn}(L_b)\,\min(|L_a|, |L_b|), \quad (2)$$
$$G(L_a, L_b, ps) = (-1)^{ps} L_a + L_b, \quad (3)$$
where $L_a$ and $L_b$ are the two input LLRs and $ps$ for the G-function is a binary bit called the partial-sum. (2) is a hardware-friendly version of the F-function proposed in [3]. For a G-node at stage s, its partial-sums are obtained by
where $\hat{u}_j$ is the last decoded bit. According to (4), the partial-sums of a G-node have a data dependency on the $2^s$ decoded bits rooted at its sibling F-node. Thus, it can be seen that the decoding process of SC decoding follows a depth-first traversal of the scheduling tree.
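The depth-first schedule described above can be sketched in Python as follows. The sketch assumes the widely used min-sum F-function and the G-function $(-1)^{ps} L_a + L_b$, an LLR convention in which a non-negative value favours bit 0, and an encoder without bit-reversal permutation. The function names are illustrative, and the recursion only stands in for the scheduling tree; it does not model any hardware schedule.

```python
def f_node(la, lb):
    """Min-sum F-function: sign(la) * sign(lb) * min(|la|, |lb|)."""
    sign = -1.0 if (la < 0) != (lb < 0) else 1.0
    return sign * min(abs(la), abs(lb))


def g_node(la, lb, ps):
    """G-function: (-1)^ps * la + lb, where ps is the partial-sum bit."""
    return lb - la if ps else lb + la


def sc_decode(llrs, frozen):
    """Depth-first SC decoding of one (sub-)code.

    llrs   : input LLRs of this node
    frozen : list of booleans, True where the source bit is frozen to 0
    Returns (decoded source bits, re-encoded bits used as partial-sums).
    """
    if len(llrs) == 1:
        bit = 0 if frozen[0] or llrs[0] >= 0 else 1
        return [bit], [bit]
    half = len(llrs) // 2
    # The F-node (left child) is visited first.
    left_llrs = [f_node(llrs[i], llrs[i + half]) for i in range(half)]
    u_left, ps = sc_decode(left_llrs, frozen[:half])
    # The G-node (right child) needs the partial-sums from its sibling.
    right_llrs = [g_node(llrs[i], llrs[i + half], ps[i]) for i in range(half)]
    u_right, x_right = sc_decode(right_llrs, frozen[half:])
    # Re-encode so that the parent can use the result as partial-sums.
    x = [a ^ b for a, b in zip(ps, x_right)] + x_right
    return u_left + u_right, x
```

With noiseless LLRs of ±4 for the codeword `[0, 1, 0, 1]` and the first two bits frozen, `sc_decode([4, -4, 4, -4], [True, True, False, False])` recovers the source word `[0, 0, 1, 1]`.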
C. Successive Cancellation List Decoding
SCL decoding was proposed in [7] , [9] . It has a significant performance gain over single SC decoding. Its decoding process can be regarded as a search problem on a binary decoding tree of depth-N. Fig. 3a shows a decoding tree for a polar code of N =4. The i th source bit u i , which corresponds to the i th leaf node in the scheduling tree, is mapped to the nodes at depth i+1 in the decoding tree. A path from the root node to a leaf node represents a candidate of decoded vector. For a parent node at depth i, its left and right children correspond to two different expansions of the partial decoded path with u i =0 and 1, respectively. For example, the paths marked with single and double crosslines in Fig. 3a represent decoding vectors 0010 and 0100, respectively.
The decoding process begins from the root node. When the decoding process reaches a frozen bit u i , such as u 0 in Fig.  3a , the sub-tree rooted at the right child is pruned (marked with light colour) as it does not contain any valid path, and the number of valid paths is unchanged. Otherwise, if u i is an information bit, the valid decoded paths are expanded to both sibling nodes and the number of valid paths in the list doubles. The number of the path candidates increases exponentially with respect to the number of decoded information bits. When a practical code length is used, the computational complexity will be too high to be implemented after a few bits are decoded. To limit the computational complexity, an LM operation is executed at each new depth to keep the number of survival paths to a predefined value L which is called the list size. In Fig. 3a , the lines with dark colours represent the paths that have been expanded during LM and those with crosslines represent the paths kept after LM.
The criterion for selecting survival paths during LM is their reliability, measured by path metrics (PM). We denote the path metric of a path l ($l \in [0, L-1]$) after the decoding of bit $u_i$ as $\gamma^k_{i+1}$, where $k \in \{2l, 2l+1\}$. The PM is initialized as $\gamma^0_0 = 0$ and is updated based on bit-wise accumulation as [16]
Similar to (2), (5) is a hardware-friendly version of the PM update. For a frozen bit, only the equation in (5) that satisfies $\hat{u}_i = 0$ is computed and the number of paths remains L. As mentioned above, only the L left children nodes are kept. Otherwise, both equations in (5) are computed and the number of paths is doubled. After that, a list pruning operation is executed, in which all the 2L PMs are sorted and the L paths with the smallest PMs are kept in the list.
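The path expansion and list pruning steps can be illustrated with the short Python sketch below. It assumes the commonly used hardware-friendly update rule in which a path is penalised by the LLR magnitude only when the chosen bit disagrees with the sign of its LLR; this rule is a stand-in for (5), and the function names and the tuple layout are illustrative. A plain full sort is used for pruning, not any simplified sorter.

```python
def extend_paths(path_metrics, llrs, frozen):
    """Expand every survival path by one bit.

    path_metrics : current PM of each path
    llrs         : stage-0 LLR of the current bit for each path
    frozen       : True if the current bit is frozen (only u = 0 is valid)
    Returns a list of (new PM, parent path index, bit) candidates.
    """
    candidates = []
    for parent, (pm, llr) in enumerate(zip(path_metrics, llrs)):
        hard = 0 if llr >= 0 else 1          # bit favoured by the LLR
        for bit in ((0,) if frozen else (0, 1)):
            penalty = 0.0 if bit == hard else abs(llr)
            candidates.append((pm + penalty, parent, bit))
    return candidates


def prune(candidates, list_size):
    """List management: keep the list_size candidates with the smallest PM."""
    return sorted(candidates, key=lambda c: c[0])[:list_size]
```

For example, two paths with PMs 0.0 and 1.0 and LLRs 2.0 and -3.0 expand to four candidates with PMs 0.0, 2.0, 4.0 and 1.0, and pruning with L=2 keeps the candidates with PMs 0.0 and 1.0.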
Recently, a variety of algorithms have been proposed to reduce the decoding latency of SCL by decoding multiple bits and executing their LM operation only once. The algorithms in the literature can be divided into two classes: multi-bit decoding (MBD) [18]-[20], [22] and special node decoding (SND) [23]-[27]. MBD decodes $M = 2^m$ bits simultaneously, where M is a fixed and predefined value. The decoding tree of MBD is modified to a full $2^M$-ary tree as shown in Fig. 3b, in which M=2. LM is still executed at each depth of this tree for each group of M bits. SND, on the other hand, runs simplified LM algorithms for variable-length sub-codes that match special code patterns. Fig. 3c shows the decoding tree modified for SND, in which a rate-1 sub-code, i.e., a sub-code with only information bits, is used to simplify the decoding. Fewer paths are expanded from each survival path in SND and hence the computational complexity is reduced.
D. Adaptive SCL Decoding for Polar Codes
Adaptive SCL with CRC was proposed in [10] and the algorithm is summarised in Algorithm 1. Each time, a new codeword which contains N LLRs is input for decoding. A-SCL starts with an SCL of L=1, i.e., a single SC. If some of the decoded vectors pass CRC at the end of decoding, the one with the highest reliability is chosen as the output. Otherwise, the list size is doubled and the codeword is decoded again by an SCL with the new list size. Usually, a predefined $L_{max}$ is used to limit the computational complexity, that is, after the decoding using an SCL with $L_{max}$, the decoding terminates even when there is no valid candidate. From [10], the error correction performance of A-SCL is the same as that of an SCL with $L_{max}$. At the same time, as most of the valid decoded vectors can be obtained using SCL with smaller list sizes, the average list size $\bar{L}$ of A-SCL is much smaller than $L_{max}$. An example for a polar code of (N, K, r)=(1024, 512, 24) is shown in Table I, in which $\bar{L} \ll L_{max} = 32$. As SCL with a smaller list size has a higher decoding speed, the average decoding speed of A-SCL is much higher than that of SCL.
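The list-size doubling loop described above can be sketched as follows. Here `scl_decode` and `crc_ok` are hypothetical callables supplied by the caller (an SCL decoder that returns its candidates ordered from most to least reliable, and a CRC check), and returning `None` when no candidate passes at the maximal list size is an assumption of this sketch, since the text does not state which vector is output in that case.

```python
def adaptive_scl(llrs, scl_decode, crc_ok, l_max):
    """Adaptive SCL: double the list size until CRC passes or l_max is hit.

    llrs       : the N channel LLRs of one codeword
    scl_decode : hypothetical callable (llrs, list_size) -> candidates,
                 ordered from most to least reliable
    crc_ok     : hypothetical callable candidate -> bool
    l_max      : predefined maximal list size
    Returns (decoded vector or None, list size used in the last attempt).
    """
    list_size = 1                      # L = 1 is a single SC decoding
    while True:
        candidates = scl_decode(llrs, list_size)
        valid = [c for c in candidates if crc_ok(c)]
        if valid:
            return valid[0], list_size  # most reliable CRC-passing vector
        if list_size >= l_max:
            return None, list_size      # give up: no valid candidate
        list_size *= 2
```

The second return value makes it easy to accumulate the average list size over many codewords when the loop is used in a simulation.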
However, if we directly implement the A-SCL algorithm in hardware, the following issues need to be addressed:
• In an A-SCL decoding, different codewords may be decoded by SCL with different list sizes and SCL with a larger list size has a much larger latency. The decoding latency varies from frame to frame, so the system latency is not fixed. Because of that, a directly-mapped architecture may not be able to support applications that need to have a constant transmission data rate and latency, such as the channel coding blocks in communication systems.
• A directly-mapped architecture is required to support multiple SCL decodings with list sizes L ranging from 1 to $L_{max}$. This increases the design effort and also the hardware complexity. A simplified A-SCL decoding was proposed for accelerating software polar decoding [26], in which only a single SC and one SCL with $L_{max}$ are used. However, as shown in Table I, its $\bar{L}$ is larger than that of the original A-SCL, which means that the achievable throughput gain is much smaller than that of the original A-SCL decoding. Moreover, simplified A-SCL does not support a constant transmission data rate either.
• The throughput of A-SCL is not constant under different channel conditions. As shown in Table I, the average list size of A-SCL decoding increases when the channel condition deteriorates, as more frames need to be decoded multiple times, which indicates that the throughput could be affected accordingly. It is also shown in [26] that a software-based simplified A-SCL decoder suffers from a 20x throughput reduction when the signal-to-noise ratio (SNR) is reduced by 1 dB. To adjust the data rate according to the channel condition, the transmitter side in a real system needs to know the channel SNR in real time, which increases the difficulty of system implementation.
To design a hardware-friendly A-SCL algorithm, we take reference from variable-iteration decoders for LDPC codes [31]-[33]. An additional buffer is employed at the input of the iterative decoder, in which a newly received codeword can be stored temporarily until the current decoding is finished. By doing so, the iterative decoder can use different numbers of decoding iterations to achieve decoding convergence and support a constant transmission data rate at the same time. In the next section, we will propose a two-staged A-SCL decoding which solves the problems mentioned above with the help of some buffers and is hence suitable for hardware implementation.
III. TWO-STAGED ADAPTIVE SUCCESSIVE CANCELLATION DECODING

A. Algorithm of TA-SCL Decoding
As mentioned in Section II-D, the average list size and the error correction performance of the A-SCL algorithm follow those of a single SC and of an SCL with $L_{max}$, respectively. Based on this observation, we propose a hardware-friendly TA-SCL, whose algorithm is described below.
The block diagram of TA-SCL and its timing schedule are shown in Fig. 4 and Fig. 5a, respectively. Basically, it includes two SCL decodings: an SCL decoding with a small list size (not necessarily 1), denoted as $D_s$, and an SCL decoding with a large list size $L_{max}$, denoted as $D_l$. Each input codeword from the channel is first decoded by $D_s$. If none of the candidates in the list passes CRC after this decoding, e.g. fr.1 in Fig. 5a, the current codeword is decoded again by $D_l$. This decoding usually takes longer than decoding using $D_s$. Nevertheless, $D_l$ runs concurrently with $D_s$, so that $D_s$ starts decoding the next codeword immediately. Most of the time, $D_s$ can decode the input codewords correctly and $D_l$ becomes idle when the current decoding process finishes. However, if the channel is subject to burst errors, it is possible that a new codeword cannot be correctly decoded by $D_s$ while the decoding in $D_l$ has not finished yet. To deal with this, an LLR buffer is needed to store the LLRs of the codeword from $D_s$ temporarily, such as fr.2, fr.3, fr.4 and fr.6 shown in Fig. 5a. An output buffer is also needed to re-order the decoded vectors, as the codewords may be decoded out of order. For example, fr.7 to fr.13 are stored in the output buffer until the decoding of fr.6 finishes.
The major difference between TA-SCL and A-SCL, either the original one or the simplified one, is that, with the help of the LLR buffer, $D_s$ can decode the next codeword from the channel input immediately instead of waiting for $D_l$ to finish decoding the current codeword. The continuous running of $D_s$ and the storage of LLRs in the buffer permit the data to be transmitted at a constant data rate, which is equal to the decoding throughput of $D_s$ regardless of the SNR of the channel, while the decoding performance is guaranteed by $D_l$. TA-SCL also reduces the hardware complexity, as only two SCL decoders need to be implemented in hardware. The issues mentioned in Section II-D are hence solved.
B. Performance Bound of TA-SCL Decoding
If we have unlimited buffer resources, the decoding performance of TA-SCL will be the same as that of $D_l$. However, in an actual hardware implementation, the buffer size is limited and buffer overflow will happen, as shown in Fig. 5b. It happens when a new codeword needs to be stored in the LLR buffer but the buffer is full and the decoding in $D_l$ has not finished yet. To deal with buffer overflow, either the codeword in $D_s$ or the one in $D_l$ is thrown away, and the incorrect decoding result from $D_s$ is used as the final output decoded vector of the corresponding codeword. Thus, the block error rate (BLER) of $D_{TA}$, denoted as $\epsilon_{D_{TA}}$, is bounded by
$$\epsilon_l \le \epsilon_{D_{TA}} \le \epsilon_l + \Pr(\text{Overflow}), \quad (6)$$
where the upper limit is the sum of the BLER of D l and the probability of buffer overflow. Obviously, it is important to prevent buffer overflow in order to reduce performance loss which is defined as
In summary, the benefits of TA-SCL in hardware come at the cost of a loss in error correction performance. To obtain the best tradeoff among performance, hardware usage and throughput, an analytical model of $D_{TA}$ is introduced in the next sub-section to obtain the relationship between Pr(Overflow) and the parameters of $D_{TA}$. Before that, we define the design parameters of TA-SCL as follows.
• $L_s$/$L_l$: list size of $D_s$/$D_l$.
• $\epsilon_s$/$\epsilon_l$: BLER of $D_s$/$D_l$.
• $t_s$/$t_l$: decoding time of each codeword using $D_s$/$D_l$.
• β: speed gain, i.e., $t_l / t_s$. In this work, β is not limited to integer values and can be any rational number, which is given by $\beta = \beta_n / \beta_d$, where $\beta_n, \beta_d \in \mathbb{Z}^+$ and $\beta_n \perp \beta_d$, i.e., $\beta_n$ and $\beta_d$ are co-prime.
• ζ: size of the LLR buffer, which equals to the number of codewords that can be stored in the buffer. In this work, we assume ζ ≥ 1 and ζ ∈ Z + .
We also denote a TA-SCL decoding whose speed gain is β and buffer size is ζ as D TA (β, ζ). The TA-SCL decoding in the example shown in Fig. 5a and 5b hence can be described as D TA (3, ∞) and D TA (3, 1), respectively, and the corresponding D l needs 3t s to decode a codeword.
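To make the roles of β and ζ concrete, the following Python sketch steps through the codewords one by one and reports which of them hit a buffer overflow. It is only one plausible reading of the scheme: time is counted in units of $t_s / \beta_d$, $D_l$ works off $\beta_d$ units of backlog while $D_s$ decodes one codeword, a codeword that fails in $D_s$ adds $\beta_n$ units, and a codeword that overflows is simply dropped. These rules, as well as the function and variable names, are assumptions rather than the exact behaviour shown in Fig. 5.

```python
def simulate_ta_scl(ds_failed, beta_n, beta_d, zeta):
    """Slot-by-slot backlog simulation of D_TA(beta_n / beta_d, zeta).

    ds_failed : sequence of booleans, True when D_s fails CRC on a frame
    beta_n, beta_d : numerator and denominator of the speed gain beta
    zeta      : LLR buffer size in codewords
    Returns the indices of the frames lost to buffer overflow.
    """
    # One frame inside D_l plus zeta frames in the buffer (assumed).
    capacity = beta_n * (zeta + 1)
    backlog = 0                      # work left for D_l, in units of t_s/beta_d
    overflowed = []
    for frame, failed in enumerate(ds_failed):
        # While D_s decodes this frame, D_l works for one slot of length t_s.
        backlog = max(backlog - beta_d, 0)
        if not failed:
            continue
        if backlog + beta_n <= capacity:
            backlog += beta_n        # frame goes to D_l or to the LLR buffer
        else:
            overflowed.append(frame)  # buffer full: assumed to be dropped
    return overflowed
```

Under these assumptions, three consecutive $D_s$ failures overflow a decoder with β=3 and ζ=1 at the third frame, whereas the same burst is absorbed when the buffer is large.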
C. Analytical Model of TA-SCL Based on Markov Chain
In this sub-section, we model the behavior of $D_{TA}(\beta, \zeta)$ on B-DMC. Without loss of generality, we assume the channel is an additive white Gaussian noise (AWGN) channel. We first introduce the states in which the decoder can operate. We define the number of codewords currently stored in the LLR buffer as $i_\zeta$ and the remaining time required to finish the decoding of the current codeword in $D_l$ as $i_\beta$, which is an integer multiple of $t_s / \beta_d$. Each codeword in the LLR buffer needs $\beta t_s$ to decode. Then, the state of TA-SCL is defined as
which is actually the total time required to clear the buffer in terms of $t_s$. For a $D_{TA}(\beta, \zeta)$, there are $S = \beta_n \zeta + \beta_n + 1$ states in total. The S states can be categorized into the following groups according to whether buffer overflow will happen.
• (3, 1) in Fig. 6a, where the black and white arrows represent the probabilities $\epsilon_s$ and $\epsilon'_s = 1 - \epsilon_s$, respectively. The first three columns show $i_\beta$, $i_\zeta$ and $X_\tau$, respectively. Typical transitions from hazard, safe and idle states are marked with "H", "S" and "I" in the figure, respectively. The number of these states and their transitions are summarized in Fig. 6b. After defining these states, we introduce the modeling of TA-SCL by Proposition 1.
Proposition 1. Decoding with D TA is a Markov process and can be modeled with a Markov chain.
Proof: An input codeword is independent of the codewords input at other times, as all the LLRs are independent and identically distributed (IID) random variables. Thus, the decoding correctness of $D_s$ only depends on the inputs at that time. It actually follows a Bernoulli distribution which takes the value 1 (error happens) with probability $\epsilon_s$. As all the state transitions depend only on $\epsilon_s$, the next state of $D_{TA}(\beta, \zeta)$ only depends on the current state and not on any earlier state. Hence, Proposition 1 is proved.
From Proposition 1, the state diagram of a D TA (β, ζ) can be easily obtained by finding out all the possible state transitions in Fig. 6a . The state diagrams of D TA (3, 1) and D TA ( Fig. 7a , some states (the light gray ones in Fig. 7a ) can never be accessed and are redundant in the model. These states can be removed and the model is the same with the one with β n =3 and β d =1. So we only consider the situations that β n ⊥β d .
For further mathematical analysis, we map the state diagram to a transition matrix P whose size is S × S. An element $P_{x,y} \in P$ ($x, y \in [0, S-1]$) corresponds to the transition probability from state x to state y, i.e.,
$$P_{x,y} = \Pr(X_{\tau+1} = y \mid X_\tau = x), \quad (9)$$
where $X_\tau$ is the current state and $X_{\tau+1}$ is the next state. The transition matrix of $D_{TA}(3, 1)$ mapped from the state diagram is
The transition matrix P is time-independent according to Proposition 1. To do steady-state analysis for TA-SCL, the following proposition of the Markov chain model is introduced.
Proof: The proof is given in Appendix A.
From Proposition 2, the chain converges to the stationary distribution regardless of its initial distribution. Suppose that the decoding begins with $D_{TA}$ at state 0, i.e., the initial distribution $\lambda_0$ puts all of its probability on state 0. After a sufficiently large number of transitions, the probability distribution becomes
$$\pi = \lambda_0 P^{\infty}.$$
Actually, all the rows of $P^{\infty}$ are the same, and so the stationary distribution is irrespective of the initial state $\lambda_0$ of $D_{TA}$.
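Buffer overflow happens when $D_{TA}$ is in any hazard state and $D_s$ cannot decode the next codeword correctly. Thus, the probability of buffer overflow is expressed as
$$\Pr(\text{Overflow}) = \epsilon_s \sum_{i \,\in\, \text{hazard states}} \pi_i,$$
where $\pi_i$ is the stationary probability of state i.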
This probability of overflow bounds $\epsilon_{D_{TA}}$ in (6). It is a function of the BLER of $D_s$, $\epsilon_s$, the speed gain β and the buffer size ζ, i.e., $\Pr(\text{Overflow}) = f(\epsilon_s, \beta, \zeta)$. We will use this to analyse the error correction performance of TA-SCL next.
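A numerical version of this steady-state analysis can be sketched in pure Python as follows. The transition rules are assumptions, not the matrix in (10): the state is the backlog of $D_l$ in units of $t_s / \beta_d$, it decreases by $\beta_d$ per slot, a $D_s$ failure adds $\beta_n$, and a codeword that would exceed the largest state is dropped, which marks the current state as a hazard state. The stationary distribution is approximated by repeatedly multiplying the initial distribution (all mass on state 0) by P. All function names are illustrative.

```python
def build_chain(eps_s, beta_n, beta_d, zeta):
    """Transition matrix of D_TA(beta_n / beta_d, zeta) under assumed rules.

    Returns (P, hazard), where P[x][y] is the probability of moving from
    state x to state y and hazard lists the states in which a D_s failure
    causes a buffer overflow.
    """
    size = beta_n * zeta + beta_n + 1            # S = beta_n*zeta + beta_n + 1
    matrix = [[0.0] * size for _ in range(size)]
    hazard = []
    for x in range(size):
        after = max(x - beta_d, 0)               # D_l works during one slot
        matrix[x][after] += 1.0 - eps_s          # D_s passes CRC
        if after + beta_n < size:
            matrix[x][after + beta_n] += eps_s   # failed frame is queued
        else:
            matrix[x][after] += eps_s            # assumed: frame is dropped
            hazard.append(x)
    return matrix, hazard


def stationary(matrix, iterations=5000):
    """Approximate the stationary distribution by power iteration."""
    size = len(matrix)
    dist = [1.0] + [0.0] * (size - 1)            # decoding starts in state 0
    for _ in range(iterations):
        dist = [sum(dist[x] * matrix[x][y] for x in range(size))
                for y in range(size)]
    return dist


def overflow_probability(eps_s, beta_n, beta_d, zeta):
    """Pr(Overflow) = eps_s * (stationary probability of the hazard states)."""
    matrix, hazard = build_chain(eps_s, beta_n, beta_d, zeta)
    pi = stationary(matrix)
    return eps_s * sum(pi[x] for x in hazard)
```

Calling `overflow_probability` for a fixed $\epsilon_s$ and increasing ζ gives a quick feel for how fast the overflow probability falls as the buffer grows, under the assumed transition rules.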
D. Error Correction Performance of TA-SCL
In this sub-section, we study the relationships between the error correction performance of TA-SCL decoding and its design parameters based on the proposed model. Simulation results are presented to verify the analysis numerically. A polar code of (N, K, r)=(1024, 512, 24) is used for simulation over an AWGN channel, which is the same as that used in [22]. The low-latency hardware-friendly decoding algorithm proposed in [22], multi-bit DTS (MB-DTS), is used for $D_l$ with $L_l = 32$, as TA-SCL targets hardware-based applications.
Instead of directly studying the relationships of the design parameters with $\epsilon_{D_{TA}}$, we first study their relationships with Pr(Overflow) with the help of the derived model.
Proof: As shown in (10), all the elements in a transition matrix are linear combinations of 1 and $\epsilon_s$. Thus, each element $\pi_i$ in the stationary distribution π, and hence Pr(Overflow), is a polynomial of $\epsilon_s$, and the proposition is proved.
Proposition 3 indicates that TA-SCL decoding should have a better error correction performance at a higher SNR or if a larger $L_s$ is used. To verify this, we simulate the error correction performance of the proposed TA-SCL decoding with different $L_s$ but the same β and ζ, and the results are shown in Fig. 8a. We assume that β=3, which is a reasonable estimate, as will be shown later in Section V. The solid lines show the simulation results of $\epsilon_{D_{TA}}$ and the dashed lines show the upper bound of $\epsilon_{D_{TA}}$ calculated by the proposed model. It can be observed that TA-SCL with $L_s = 2$ has almost negligible error correction performance degradation compared to $D_l$ over a wide SNR range. Also, the performance degradation gradually disappears when the SNR increases and $\epsilon_s$ decreases. In contrast, TA-SCL with $L_s = 1$ has much poorer performance in the low SNR range. This is because more codewords cannot be correctly decoded by $D_s$ and more operations of $D_l$ are needed, hence Pr(Overflow) increases. Nevertheless, in the high SNR range, the performance degradation is still negligible. Considering that a smaller $L_s$ usually indicates a lower decoding latency, a larger speed gain can be achieved. Moreover, it can be observed in Fig. 8a that the simulation results are almost the same as the upper bounds obtained from the analysis, i.e., $\epsilon_{D_{TA}} \approx \epsilon_l + \Pr(\text{Overflow})$. Thus, the upper bound of the derived model can be used to estimate the error correction performance of $D_{TA}$. Next, we study the relationships between Pr(Overflow) and the other two design parameters, β and ζ.
• When ζ increases, more codewords can be stored in the LLR buffer for $D_l$ decoding. As discussed in Section III-A, if we have infinite buffer resources, buffer overflow never happens.
• When β decreases, $D_l$ can decode the codewords in the LLR buffer sooner after they are stored. When β ≤ 1, $D_l$ decodes a codeword at least as fast as $D_s$ and buffer overflow never happens.
Fig. 8b and 8c show the performance of TA-SCL with different β and ζ, respectively. It can be seen that, in the low SNR range, both increasing the buffer size and decreasing the speed gain β lead to a lower Pr(Overflow) and hence a lower $\epsilon_{D_{TA}}$, which is in accordance with the discussion above. In the high SNR range, the curves of all the $D_{TA}(\beta, \zeta)$ overlap, which means that the error correction performance is good even when a small buffer is used or a high speed gain is required.
From the simulation results shown in Fig. 8 , the performance loss of TA-SCL decoding δ is larger at low SNR range. In practical applications, the decision to select which D s to use depends on how much performance loss we can tolerate for a certain BLER. For example, in Fig. 8a , the performance loss is 20% at a BLER of 2·10 −2 for D TA (3, 3) with L s =2 while that for L s =1 is much larger. Also it can be seen that δ gradually approaches 0 when SNR increases. Based on the relationships between the error correction performance and the design parameters, we summarise the following steps of designing TA-SCL decoding for a specific polar code.
1) Running simulations of the target code using D s and D l .
2) Calculating t s , t l and corresponding β.
3) Gradually increasing the buffer size from ζ = 1. Calculating and checking whether the performance loss δ is satisfied at the target BLER by using the Markov model.
4) If ζ reaches a predefined buffer resource constraint and δ is still not satisfied, adding idle time to $t_s$ to decrease β and redoing step 3.
5) Running simulations of the target code using the designed $D_{TA}(\beta, \zeta)$ for verification.
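Steps 3 and 4 of the procedure above amount to a small search over ζ and β, which can be sketched as follows. The callable `loss_for` is hypothetical: it stands for whatever evaluation of the performance loss δ at the target BLER is available (for instance one based on the Markov model), and the list of candidate speed gains, ordered from the value without idle time downwards, is also an assumption of this sketch.

```python
def design_ta_scl(loss_for, betas, max_zeta, target_loss):
    """Search for a (beta, zeta) pair that meets the performance-loss target.

    loss_for    : hypothetical callable (beta, zeta) -> predicted loss delta
    betas       : candidate speed gains, largest first; later entries
                  correspond to adding idle time to t_s (step 4)
    max_zeta    : buffer resource constraint, in codewords
    target_loss : largest tolerable performance loss at the target BLER
    Returns the first (beta, zeta) that satisfies the target, or None.
    """
    for beta in betas:
        # Step 3: grow the buffer from zeta = 1 up to the constraint.
        for zeta in range(1, max_zeta + 1):
            if loss_for(beta, zeta) <= target_loss:
                return beta, zeta
        # Step 4: constraint reached without success, try a smaller beta.
    return None
```

The returned pair still needs to be verified by simulation of the target code, as required by step 5.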
IV. HARDWARE ARCHITECTURE FOR TA-SCL
In this section, we first present the overall architecture of TA-SCL decoder and analyze its memory usage and timing schedule. Then, we introduce a low-latency architecture for D s . We denote the number of clock cycles required for D s and D l to decode one frame as C s and C l , respectively, then
In the rest of the paper, we will use C s and C l to represent the latency instead of t s and t l .
A. Overall Architecture
The proposed architecture of the TA-SCL decoder is shown in Fig. 9. It consists of four major sub-blocks: two constituent SCL decoders for D s and D l , an LLR buffer and an output buffer. The data width of each connection is also marked in Fig. 9, where P is the parallelism factor, i.e., the maximum number of F- or G-nodes that can be executed at the same time in D l , and Q is the number of quantization bits for the LLR values. The details of each sub-block are introduced below.
To support a high speed gain β, the architecture of D s should have a very low decoding latency. Empirically, a D s with L s ≤ 2 provides error correction performance that is good enough to achieve a very low Pr(Overflow) and ε DTA , and a larger L s brings little performance gain but a large overhead in timing and hardware complexity. Hence, a low-latency SCL decoding scheme targeting L s =2 is used in the proposed TA-SCL architecture. It will be introduced in detail in Section IV-C. The memory usage of D s is shown in Fig. 9. We only show the major memories which dominate the overall memory usage. The LLR memory stores both the channel LLRs and the calculated LLRs during decoding. The channel LLR memory has N Q bits. It is also re-used for the storage of the calculated LLRs from the G-node at stage n-1. The calculated LLRs from the G-nodes at the other stages require about L s · 2 · N Q/2 bits for storage in total [22] . The calculated LLRs of F-nodes at different stages can share an N Q/2-bit memory as they will be used only once in the next clock cycle directly and do not need to be stored afterwards [3] , [22] . The partial-sum memory and path memory require L s · N/2 bits and L s · N bits, respectively. This architecture can also be configured for L s =1, and the corresponding memory usage is slightly larger than half of that of D s with L s =2.
For D l , we use the architecture proposed in our previous work [22] and the details will not be discussed here. The sizes of LLR memory, partial-sum memory and path memory are
bits and L l · N bits, respectively. The number of processing elements (PEs) for each path in D l is equal to P .
The LLR buffer is implemented by a simple one-read one-write dual-port SRAM. The incoming channel LLR values will be written to the channel LLR memory in D s and the LLR buffer at the same time. If D s cannot decode the frame correctly, the frame data will be kept in the LLR buffer for D l decoding. Otherwise, it will be overwritten by the next frame. Thus, the total size of the LLR buffer is ζ+1 frames, where ζ is the number of buffers required for storing the frame data for D l as discussed in the analytical model, and the additional buffer holds the current frame data being decoded by D s . The size of the LLR buffer is hence N Q · (ζ+1) bits. The width of each port is 2P Q bits. This parallelism matches that of the I/O ports of D l [22] .
The output buffer is implemented by a true dual-port SRAM of which both ports can be used for reading or writing. It is used to align the output order to be the same as the input sequence because the decoding of the frames can be out of order. All the decoding results from D s will be temporarily stored in this buffer. The results will be overwritten if the corresponding codeword is decoded by D l . Considering the worst case in which TA-SCL reaches the maximum state, the frame that has just been stored into the LLR buffer and could not be decoded by D s correctly will be decoded by D l after (βζ+β)·C s clock cycles. All the decoding results of the codewords inputted after this frame need to be temporarily stored in the output buffer. So the output buffer needs to accommodate at most ⌈βζ + β⌉ frames theoretically. In the real design, the output buffer needs to store ⌊βζ + β + 1⌋ frames, i.e., one more frame of the decoded vectors needs to be stored if βζ+β is an integer, and the reason will be explained in Section IV-B. The size of the output buffer is hence N · ⌊βζ + β + 1⌋ bits. The memory usage of TA-SCL is summarized in Table II according to the analysis above. The hardware complexity overhead of TA-SCL over the traditional architecture for D l comes from D s and the two buffers, which is dominated by the memory overhead. As an example, the memory overhead of D TA (3, 3) (the one in Fig. 8a) is also shown in Table II and the overhead is around 20%. More accurate experimental results on hardware usage will be presented in Section V.
[TABLE II: memory usage (bits) of the sub-blocks, with D TA (3, 3) as an example, assuming that N =1024, L l =32, P =64 and Q=6. The table body did not survive extraction; the only remaining fragments are two entries of 288,768 bits and a row labelled "Others".]
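The two buffer-size expressions above, N Q · (ζ+1) bits for the LLR buffer and N · ⌊βζ + β + 1⌋ bits for the output buffer, can be evaluated with a few lines of Python. The function names are illustrative only, and β is represented as an exact fraction (an implementation choice of this sketch) so that the floor is not affected by rounding.

```python
import math
from fractions import Fraction


def llr_buffer_bits(n: int, q: int, zeta: int) -> int:
    """LLR buffer size: (zeta + 1) frames of N Q-bit LLRs each."""
    return n * q * (zeta + 1)


def output_buffer_frames(beta: Fraction, zeta: int) -> int:
    """Frames held by the output buffer in the real design: floor(beta*zeta + beta + 1)."""
    return math.floor(beta * zeta + beta + 1)


def output_buffer_bits(n: int, beta: Fraction, zeta: int) -> int:
    """Output buffer size: N decoded bits per stored frame."""
    return n * output_buffer_frames(beta, zeta)


if __name__ == "__main__":
    # D_TA(3, 3) with N = 1024 and Q = 6, the assumptions stated for Table II.
    print(llr_buffer_bits(1024, 6, 3))              # 24576
    print(output_buffer_bits(1024, Fraction(3), 3))  # 13312
```

For a non-integer βζ+β, such as β = 5/2 and ζ = 1, the floor of βζ+β+1 coincides with the theoretical bound ⌈βζ+β⌉ plus one only when βζ+β is an integer, as the text notes.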
B. Timing Schedule
The timing schedule of TA-SCL architecture is illustrated in Fig. 10 with an example for D TA ( 
TABLE III SUMMARIZATION OF SPECIAL NODES USED AT LOW STAGES
Name # frz. # inf. Name # frz. # inf. Name # frz. # inf.
are used to load the input channel LLRs. The periods filled with "//", "\\" and "X" stripes represent the LLR loading time for D s , for D l and for both, respectively. The second and fourth rows represent the operations of sending the decoding results from D s and D l to the output buffer, respectively. These two operations are executed concurrently with the loading operations of the next codeword to be decoded. The fifth row shows when the final output results are generated from the output buffer. From the timing schedule, it can be seen that the final decoding results of any codeword will be available C s +C s ·(βζ+β)+C rw clock cycles after the corresponding LLRs are input to the decoder. The first term is caused by D s . The second term is dictated by the worst case discussed in Section IV-A. The third term is added to avoid the potential memory collision circled in Fig. 10: the decoded data is read out from the output buffer only after the two decoders send the decoded results to the output buffer. This is also the reason why we need space for one more frame as mentioned in Section IV-A. As an example, the system latency of D TA (5/2, 1) in Fig. 10 is 6C s +C rw clock cycles.
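As a quick check of the latency expression C s +C s ·(βζ+β)+C rw , the sketch below evaluates it with exact fractions. The function name and the numeric values of C s and C rw in the usage lines are placeholders chosen for illustration, not values from the paper.

```python
from fractions import Fraction


def system_latency_cycles(c_s: int, beta: Fraction, zeta: int, c_rw: int) -> Fraction:
    """Constant system latency in clock cycles: C_s + C_s*(beta*zeta + beta) + C_rw."""
    return c_s + c_s * (beta * zeta + beta) + c_rw


if __name__ == "__main__":
    # beta = 5/2 and zeta = 1 give beta*zeta + beta = 5, i.e. 6*C_s + C_rw in total.
    c_s, c_rw = 100, 8  # placeholder cycle counts
    print(system_latency_cycles(c_s, Fraction(5, 2), 1, c_rw))  # 608
```

Because βζ+β = 5 for β = 5/2 and ζ = 1, the result is 6C s +C rw for any C s and C rw , matching the example in the text.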
C. Low-Latency SCL Decoding Scheme
In this section, we introduce a low-latency SCL (LL-SCL) decoding scheme customized for D s with L s ≤2 such that D s can support a large speed gain. It combines several state-of-the-art low-latency decoding schemes for SCL decoding, including the G-node look-ahead scheme [6] , multi-bit decoding (MBD) [19] and special node decoding (SND) [24] , [25] . Specifically, LL-SCL can be divided into the following two parts.
• SC calculations at stages not lower than stage m = log2 M s (M s is the number of merged bits for MBD in D s ) are performed by the normal SC algorithm with full parallelism, i.e., N/2 PEs are used and any node in the scheduling tree takes only one clock cycle. Moreover, the G-node look-ahead scheme [6] is used so that each pair of sibling nodes is calculated simultaneously and half of the latency is saved. Thus, the decoding latency of this part is N/M s − 1 clock cycles.
• SC calculations at the low stages are replaced with MBD, which decodes M s bits in the same sub-tree rooted at stage m simultaneously. According to [19] , given a certain number of frozen bits, there is only one code pattern for the M s -bit sub-codes. So there are only M s +1 different code patterns in total. To reduce the decoding latency, the decoding scheme for each code pattern is designed as follows. These M s +1 code patterns are divided into multiple special nodes as shown in Table III, where T is the number of bits in each special node. The corresponding numbers of information bits and frozen bits are also listed. Rate-0, rate-1, repetition (Rep.) and single parity check (SPC) nodes are decoded according to the schemes presented in [24]-[26] . Rep2/SPC2 nodes can be divided into two Rep./SPC nodes with half of the length, which can be decoded concurrently. Thus, each special node can be decoded by SND within one clock cycle. The number of paths is doubled after a special node is decoded. Different from the traditional SCL decoding, all the expanded paths are kept temporarily. List pruning is not executed until the end of each M s -bit sub-code, which takes another clock cycle. Thus, an M s -bit sub-code that can be divided into M SN special nodes requires M SN +1 clock cycles to decode, except for M s -bit rate-0 and rate-1 nodes, which do not need a sorting operation and only take 1 clock cycle. The total decoding latency of the LL-SCL decoding scheme is
where C rw is the LLR loading latency, F i is the number of frozen bits in the i-th M s -bit sub-code, and M SN (F i ) and C sort (F i ) are the decoding clock cycles for SND and sorting of the code pattern corresponding to F i , respectively. For L s =2,
For L s =1, the sorting operation for list pruning is unnecessary. Hence, C sort (F i )=0 for L s =1, which indicates the throughput can be increased by using conventional SC decoding as D s . The top-level architecture of LL-SCL is shown in Fig. 11. For high stages, the SCL architecture [16] uses a PE array with N/4 PEs for each path. As the calculations at the highest stage n-1, which require N/2 PEs to compute, are the same for both paths, they can be carried out using the two PE arrays from the two paths. For low stages, max(M SN ) stages of SND blocks and a radix-2^(max(M SN )+1) sorter are implemented in a feedforward manner, which are directly mapped from the decoding schemes introduced above. The architecture can be configured for L s =1 by simply disabling some of the blocks as shown in Fig. 11.
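The cycle counting described in this sub-section can be illustrated with a small Python sketch. It is not a transcription of the latency equation, which is not reproduced here; it assumes that the total is simply the sum of the LLR loading latency C rw , the N/M s − 1 cycles of the high stages, and the per-sub-code cycles (M SN plus one sorting cycle for L s =2, no sorting cycle for L s =1, and a single cycle for M s -bit rate-0 and rate-1 sub-codes). The function names and the input encoding are hypothetical.

```python
from typing import Iterable, Tuple


def subcode_cycles(m_sn: int, is_rate0_or_rate1: bool, l_s: int) -> int:
    """Cycles for one M_s-bit sub-code split into m_sn special nodes."""
    if is_rate0_or_rate1:
        return 1  # no sorting needed, decoded in a single cycle
    c_sort = 0 if l_s == 1 else 1  # list pruning is only needed when L_s = 2
    return m_sn + c_sort


def ll_scl_cycles(
    n: int,
    m_s: int,
    subcodes: Iterable[Tuple[int, bool]],
    l_s: int,
    c_rw: int = 0,
) -> int:
    """Assumed total: C_rw + (N/M_s - 1) + sum of per-sub-code cycles.

    subcodes holds one (m_sn, is_rate0_or_rate1) pair per M_s-bit sub-code.
    """
    subcodes = list(subcodes)
    if len(subcodes) != n // m_s:
        raise ValueError("expected one entry per M_s-bit sub-code")
    high_stage = n // m_s - 1  # full parallelism with G-node look-ahead
    low_stage = sum(subcode_cycles(m, r, l_s) for m, r in subcodes)
    return c_rw + high_stage + low_stage
```

For instance, a made-up N=1024 code with M s =16 whose 64 sub-codes are half rate-0/rate-1 and half three-special-node patterns would take 63 + 32 + 32·4 = 223 cycles for L s =2 and 63 + 32 + 32·3 = 191 cycles for L s =1 under these assumptions, excluding C rw .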
V. EXPERIMENTAL RESULTS
A. Decoding Latency and Error Correction Performance of TA-SCL Decoding for Practical Polar Codes
In this sub-section, we demonstrate the speed gain and the error correction performance of the proposed architectures. We apply TA-SCL decoding on several polar codes with different code lengths and code rates as shown in Table IV for illustration. These codes are similar to those chosen for the 5G eMBB control channel [8] . P 1 is the same as the one used in [22] and similar to those presented in many existing works [23] , [24] . P 2 and P 3 have a different code rate and a different code length, respectively, and are used as examples to show the flexibility of TA-SCL decoding. Table IV summarizes the code characteristics and the number of cycles required for the corresponding D s and D l . |A r | is the number of reliable bits for the MB-DTS in the corresponding D l [22] . For D s , MBD is applied to each 16-bit sub-code, i.e., M s =16, which means there are 17 code patterns in total. The number of special nodes in each code pattern and hence the number of cycles required for decoding is 
As max(M SN )=3, the decoding latency of a sub-code varies from one to four clock cycles according to (19) and (20). As shown in part II of Table IV, these code patterns are divided into four different groups according to their decoding latency as shown in the brackets. The numbers of sub-codes in each group are then shown in the table, with which the decoding latency of D s can be calculated according to (18) and presented in part III. For D l , MB-DTS is applied to sub-codes with a maximum length of M l =4 and P =64 PEs are used for each path. Its total latency is calculated according to [22] and the results are shown in part IV. With the selected M s and M l , the critical path delays of D s and D l are similar (both require 4 to 5 stages of adder delay) and the decoding latency can be minimized.
Based on the parameters shown in Table IV, we can design the TA-SCL decoders for P 1 to P 3 to meet different requirements of error correction performance and speed gain. Some designs are listed in Table V. If we target a TA-SCL decoder that has good decoding performance across a wide SNR range, as shown in Fig. 8, L s =2 should be used. D 1 , D 2 and D 3 shown in Table V are the designs for P 1 , P 2 and P 3 , respectively, with L s =2 and L l =32. To determine the values of β and ζ for each design, we set the maximum performance loss δ at a BLER of 10^−2 to be 30%. β is then calculated according to (16) and ζ is obtained by using the design flow presented in Section III-D. It can be observed from Table V that the speed gain is 3x to 3.5x for these decoders. The error correction performance of these decoders on the AWGN channel is simulated using both floating-point and fixed-point numbers and the results are shown in Fig. 12a-12c. The quantization schemes for D s are Q s,LLR =7 and Q s,PM =8 so that the performance loss due to quantization error is reduced to the minimum, which is essential to avoid performance loss for the TA-SCL decoding as analysed in Section III. The quantization schemes for D l are Q l,LLR =6 and Q l,PM =8 to achieve a balance between performance loss and hardware complexity. It can be seen that the fixed-point simulation results of TA-SCL decoding have negligible performance degradation (<0.05dB) compared Fig. 12d. As ε s and hence Pr(Overflow) is large at a lower SNR, we set the maximum performance loss δ at a lower BLER, i.e., a higher SNR. In this case, δ is set to be 30% at a BLER of 10^−3 and the LLR buffer size ζ is equal to 6 to achieve this performance.
For some applications that can trade error correction performance for lower decoding complexity, L l can be smaller than 32 [22] . D 5 shown in Table V is an example for P 1 . This design has the same design parameters, β and ζ, as those of D 1 . The hardware complexity of D 5 is much smaller than that of D 1 and the results will be shown later. The error correction performance of D 5 is shown in Fig. 12e.
The simulation results show that by just using a small number of buffers, e.g. ζ=2, the maximum speed gain β can be achieved for all the designs. To show the actual throughput gain achieved by TA-SCL on hardware, we realize the design of the TA-SCL decoders D 1 , D 4 and D 5 for P 1 and obtain their throughputs. The results will be presented in the next sub-section and compared with the results of the state-of-the-art polar decoders [22]-[24] .
B. Implementation Results of the Proposed Architecture for TA-SCL Decoding
The proposed architecture is synthesized with a UMC 90nm CMOS process using Synopsys Design Compiler. The quantization schemes and the number of PEs for D l are the same as those presented in [22]-[24] for a fair comparison. The reported throughputs are in terms of coded bits and the reported area includes both cell and net area. [22]-[24] are also shown for comparison. When L l =32, the critical path delay of D l is larger than that of D s so the clock frequency is determined by D l . Hence, the clock frequencies of D 1 and D 4 are the same and lower than that of D 5 . The decoding throughput of D 4 is higher than that of D 1 because L s =1 is used and fewer clock cycles are required for each frame in D 4 . When L l =8, the critical datapath of D l is shorter so the clock frequency of D 5 follows that of D s . The corresponding throughput is the highest due to the high clock frequency. It is noted that M l =4, rather than M l =8 which is used to maximize the throughputs as reported in [22] , is used for all the D l . This is because the critical path delay for M l =8 is larger than that for M l =4 and also that of D s . The clock frequency and hence the throughput of D TA would be lower if M l =8 were used.
From the area breakdown shown in Table VII [24] . Compared with the results in [22] in which M l =8 is used, the throughput gains achieved by D 1 and D 5 are 2.83x and 2.72x for L l =32 and 8, respectively, which are slightly lower than the theoretical speed gains due to the clock rate issue discussed above. For applications targeting only the high SNR range, D 4 can achieve a throughput gain of up to 3.39x. Comparing D 5 with the decoders in [23] , [24] with L l =8, the area is similar while the throughput is nearly 4 times higher. The implementation results show that the proposed TA-SCL architecture can significantly improve the decoding throughput with a small hardware overhead and negligible error correction performance degradation over a wide SNR range.
VI. CONCLUSION
In this work, a two-staged adaptive SCL (TA-SCL) decoding scheme is proposed, which significantly increases the throughput of polar decoding on hardware. To analyse the error correction performance of TA-SCL decoding, a mathematical model based on a Markov chain is proposed. With a proper selection of the design parameters, the performance loss is negligible over a wide SNR range. A high-performance VLSI architecture is then developed for the proposed TA-SCL decoding. Experimental results show that the throughput of TA-SCL decoding implemented by the proposed architecture is about three times as high as that of the state-of-the-art architectures.
APPENDIX A
PROOF OF PROPOSITION 2
We first review Theorem 1, which is necessary to prove Proposition 2.
More generally, the integers of the form ax+by are exactly the multiples of d. Next, the proof of Proposition 2 is given as below.
Proof:
We prove the irreducibility of the Markov chain model first. First, we consider the safe states only and try to prove that any safe state j is accessible from any other safe state i. Let a=β n and b=β d . As β n ⊥ β d , d=1, and (21) can be rewritten as is also accessible as (23) is still valid when its right-hand side is -1 according to Theorem 1. Thus, any safe state j can be accessed from state i by repeating this procedure. The accessibility of an idle/hazard state from/to a safe state is obvious. Thus, it is possible to get to any state from any state in this model and the irreducibility is proved.
An irreducible Markov chain is aperiodic if any state is aperiodic. As state 0 is aperiodic, the aperiodicity is proved.
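The number-theoretic fact used in the proof, that ax+by can be made equal to d=gcd(a, b) and hence to ±1 when a and b are coprime, can be checked constructively with the extended Euclidean algorithm. The sketch below is a standard textbook routine rather than part of the proposed model; the example values a=5 and b=2 assume that β=5/2 is written as β n /β d in lowest terms.

```python
from typing import Tuple


def extended_gcd(a: int, b: int) -> Tuple[int, int, int]:
    """Return (d, x, y) such that a*x + b*y == d == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r  # remainder sequence of Euclid's algorithm
        old_x, x = x, old_x - q * x  # matching coefficients of a
        old_y, y = y, old_y - q * y  # matching coefficients of b
    return old_r, old_x, old_y


if __name__ == "__main__":
    d, x, y = extended_gcd(5, 2)  # beta_n = 5, beta_d = 2 (assumed example)
    print(d, x, y)                # 1 1 -2, since 5*1 + 2*(-2) = 1
    print(5 * -x + 2 * -y)        # -1, the sign-flipped case used in the proof
```

Negating both coefficients gives the right-hand side -1, and scaling them by any integer k gives k·d, which is why the integers of the form ax+by are exactly the multiples of d.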
