Polar codes achieve outstanding error correction performance when using successive cancellation list (SCL) decoding with cyclic redundancy check. A larger list size brings better decoding performance and is essential for practical applications such as 5G communication networks. However, the decoding speed of SCL decreases with increased list size. Adaptive SCL (A-SCL) decoding can greatly enhance the decoding speed, but the decoding latency for each codeword is different so A-SCL is not a good choice for hardware-based applications. In this paper, a hardware-friendly two-staged adaptive SCL (TA-SCL) decoding algorithm is proposed such that a constant input data rate is supported even if the list size for each codeword is different. A mathematical model based on Markov chain is derived to explore the bounds of its decoding performance. Simulation results show that the throughput of TA-SCL is tripled for good channel conditions with negligible performance degradation and hardware overhead.
Index Terms-Polar codes, Successive cancellation list decoding, Adaptive decoding, Markov chain, Hardware-friendly algorithm
I. INTRODUCTION
To improve the error correction performance of polar codes [1], successive cancellation list (SCL) decoding [2], [3] is the most popular decoding choice. To decode a polar codeword, L successive cancellation (SC) decodings [4], [5] are executed concurrently, where L is called the list size, and L candidate decoded vectors are kept during decoding [2], [3]. Compared with SC decoding, SCL decoding improves the error correction performance because the probability that one of the L candidates is the correct decoded vector is higher, and a larger list size brings better error correction performance. In [6], cyclic redundancy check (CRC) codes are concatenated as outer codes with polar codes, and the CRC is applied to all the candidates to check whether any candidate is a valid decoding output. According to the experimental results presented in [7], [8], CRC-aided polar codes decoded by SCL with a sufficiently large list size (≥16) outperform LDPC codes and turbo codes.
Due to the extraordinary error correction performance of CRC-aided SCL decoding, its hardware implementation has attracted much research interest recently. Several different VLSI architectures [8]-[16] have been proposed for SCL. The decoding throughputs achieved by the state-of-the-art architectures are shown in Fig. 1. It can be observed that the decoding throughputs of all the architectures degrade as the list size increases. This is mainly because the critical path delay of some of the critical functional modules [17]-[19] in these architectures increases rapidly with the list size. Although efforts have been made to optimize these modules as well as the overall architecture, the throughput is still reduced due to the high decoding complexity.
To increase the decoding speed so that it can match that of LDPC or turbo code architectures, adaptive SCL (A-SCL) decoding was proposed in [20] and a corresponding software decoder was implemented on CPU in [21]. This algorithm first uses a single SC to decode a codeword. If the decoded vector cannot pass CRC, the list size is doubled and the decoding repeats. This process is iterated until a valid vector is obtained or a pre-defined L_max is reached. Experimental results [20] show that A-SCL significantly reduces the average list size L̄ required to achieve error correction performance equivalent to that of SCL decoding with L = L_max. The average throughput of executing A-SCL on hardware can benefit from the reduction in L̄. However, if the algorithm is directly mapped to hardware, the decoding latency of each codeword is different, so it may not support applications that need a constant input transmission data rate. Also, the hardware complexity is high as multiple SCL modules are needed.
The main contributions of this work are outlined as follows:
1) We simplify the algorithm of A-SCL [20] and propose a two-staged adaptive SCL (TA-SCL) decoding. Different from [20], [21], TA-SCL is more hardware-friendly as it is able to achieve a high throughput for applications that require a fixed input transmission data rate.
2) An analytical model of TA-SCL is developed based on Markov chain to analyse its error correction performance. Its accuracy is verified by simulation, and it can be used for the optimization of the VLSI architecture for TA-SCL.
3) Simulation results show that the throughput of TA-SCL with L_max = 32 is two times higher than that of the SCL decoder with L = 32 [8] for good channel conditions with negligible performance degradation. The throughput is also higher than those of SCL decoders with smaller list sizes [12], [13].
II. MISCELLANEOUS
A. Introduction of Polar Codes and SCL
Polar codes are a family of block codes [1] characterised by an N × N binary generator matrix F_N, where N is the code length. The source word u_N and codeword x_N of an N-bit frame are both binary vectors, and the encoding can be expressed as x_N = u_N · F_N. Among all the N bits in a frame, only K bits are used to send information and the rest are frozen bits which are set to 0. The last r information bits are used to transmit the checksum of the CRC code.
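The encoding x_N = u_N · F_N can be illustrated with a minimal sketch. It assumes the standard construction from [1] in which F_N is the Kronecker power of the 2 × 2 kernel [[1, 0], [1, 1]] and omits any bit-reversal permutation; the function names `generator_matrix` and `polar_encode` are illustrative and do not come from the paper.

```python
import numpy as np


def generator_matrix(code_length: int) -> np.ndarray:
    """Build F_N as a Kronecker power of the 2x2 polar kernel.

    Assumption: no bit-reversal permutation is applied.
    """
    if code_length < 1 or code_length & (code_length - 1):
        raise ValueError("code length must be a power of two")
    kernel = np.array([[1, 0], [1, 1]], dtype=int)
    f_n = np.array([[1]], dtype=int)
    # log2(N) Kronecker products give the N x N matrix.
    for _ in range(code_length.bit_length() - 1):
        f_n = np.kron(f_n, kernel)
    return f_n


def polar_encode(source_word) -> np.ndarray:
    """Compute x_N = u_N . F_N over GF(2)."""
    u = np.asarray(source_word, dtype=int)
    return (u @ generator_matrix(u.size)) % 2


if __name__ == "__main__":
    # Source word with frozen bits already set to 0 (positions are hypothetical).
    print(polar_encode([0, 0, 1, 0]))
```

In a real encoder the positions of the K information bits would be selected by a code construction method, which this sketch does not model.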
SCL decoding of polar codes decodes a codeword bit-by-bit in a serial order, and the decoding process is similar to a search problem on a binary decoding tree whose depth is N + 1. A decoding tree with N = 4 is shown in Fig. 2. The i-th source bit u_i is mapped to the nodes of the decoding tree at level i + 1. A path from the root node to a leaf node represents a candidate decoded vector. For a parent at level i, its left and right children at level i + 1 correspond to the expansions of the decoding path with u_i = 0 and 1, respectively. In the example, the path marked with single crosslines represents the decoded vector 0010. If a bit, such as u_0 in Fig. 2, is a frozen bit, the sub-tree rooted at the right child does not contain any valid candidate and hence is pruned. Therefore, the total number of possible candidates in a decoding tree is 2^K, which is too large for an exhaustive search of the decoding tree to find the correct decoded vector when a practical code length is used.
To limit the computational complexity, each SCL decoding has a pre-defined list size L. If the number of paths at a certain level exceeds L, a list management operation is used to select and keep the best L surviving paths and discard the rest. The example in Fig. 2 maintains a list with L = 2, so another path, marked by double crosslines and representing the decoded vector 0100, is also kept in the list. At the end of the decoding, the path in the list that passes CRC is selected as the output vector.
B. Adaptive SCL with CRC
Adaptive SCL with CRC was proposed in [20] and its operation is summarised in Algorithm 1. Each time, a new codeword which contains N log-likelihood ratios (LLRs) of the input values is sent for decoding. A-SCL starts from an SCL with L = 1, i.e. a single SC. If at least one decoded vector passes CRC at the end of decoding, the one with the highest reliability is chosen as the output. Otherwise, the list size is doubled and the codeword is decoded again by an SCL with the new list size. Usually, a pre-defined L_max is used to limit the computational complexity, that is, after the decoding using an SCL with L_max, the decoding terminates even when there is no valid candidate. According to [20], the error correction performance of A-SCL is the same as that of an SCL with L_max. At the same time, as most of the valid decoded vectors can be obtained using SCL with smaller list sizes, the average list size L̄ of A-SCL is much smaller than L_max and its average decoding speed is much higher than that of SCL with L_max.
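The control flow of A-SCL described above can be sketched as follows. This is only an illustration of the list-size doubling loop, not a reproduction of Algorithm 1 in [20]: the callbacks `scl_decode` and `crc_ok`, and the `(reliability, bits)` candidate format, are hypothetical placeholders for a real SCL decoder and CRC checker.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical candidate format: (reliability metric, decoded bits),
# where a larger metric means a more reliable path.
Candidate = Tuple[float, List[int]]


def adaptive_scl(
    llrs: Sequence[float],
    l_max: int,
    scl_decode: Callable[[Sequence[float], int], List[Candidate]],
    crc_ok: Callable[[List[int]], bool],
) -> Tuple[Optional[List[int]], int]:
    """Return (decoded bits or None, last list size tried)."""
    list_size = 1  # start from a single SC decoding
    while True:
        candidates = scl_decode(llrs, list_size)
        valid = [c for c in candidates if crc_ok(c[1])]
        if valid:
            # Output the CRC-valid candidate with the highest reliability.
            best = max(valid, key=lambda c: c[0])
            return best[1], list_size
        if list_size >= l_max:
            # No valid candidate even with L_max: decoding terminates.
            return None, list_size
        list_size *= 2  # double the list size and decode again
```

The second return value makes the variable latency visible: codewords that need a larger list size go through more decoding passes.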
C. Problems of Implementing A-SCL on Hardware
If the A-SCL algorithm is implemented on hardware, the throughput will be much higher than that of a traditional SCL. However, direct mapping of the A-SCL algorithm onto a VLSI architecture requires the architecture to support multiple SCL decodings with all L ∈ {1, 2, 4, ..., L_max/2, L_max}. This increases the design effort and also the hardware complexity. Moreover, different codewords may need SCL with different list sizes, and SCL with a larger list size has a much higher latency. When a codeword needs a longer latency to decode, the input has to be interrupted until the decoding of the current frame is finished. Because of that, a directly-mapped architecture may not be able to support applications that need a constant input data rate, such as the channel coding blocks in communication networks.
To improve the decoding speed on hardware, a CPU-based software A-SCL decoder was proposed in [21] in which A-SCL was simplified by only using a single SC and an SCL with L_max. However, the variable decoding latency issue has not yet been addressed. Moreover, the overall latency is very large as the latency for the data movement between the memory and the computing resources is dominant. Hence, neither the original A-SCL nor the simplified A-SCL in [21] is a good choice for high-throughput VLSI implementations. To solve these issues and map A-SCL to a high-speed and efficient VLSI architecture, we propose a two-staged adaptive SCL which will be presented in the next section.
III. TWO-STAGED ADAPTIVE SCL
A. Algorithm of TA-SCL
As mentioned above, the average list size L̄ ≪ L_max in A-SCL. Actually, L̄ ≈ 1 in a high signal-to-noise ratio (SNR) operation region [20], indicating that most codewords are decoded by the single SC correctly. At the same time, the error correction performance follows that of SCL with L_max. Based on these observations, we propose a hardware-friendly two-staged adaptive SCL.
The block diagram of TA-SCL and its timing schedule are shown in Fig. 3 and Fig. 4, respectively. It includes two SCL decodings: an SCL decoding with a small list size (not necessarily 1), denoted as D_s, and an SCL decoding with a large list size, denoted as D_l. Each codeword from the channel is first decoded by D_s. Most of the time, the codeword is decoded correctly. If none of the candidates in the list passes CRC after this decoding, e.g. fr.1 in Fig. 4, the current codeword will be decoded again by D_l. This decoding usually takes a longer time than decoding using D_s. However, different from A-SCL, D_s will bring in and decode the next codeword from the channel input immediately instead of waiting for D_l to finish decoding the current codeword. The continuous running of D_s permits the data to be transmitted at a constant data rate which is equal to the decoding speed of D_s, while the decoding performance is guaranteed by D_l. Also, the hardware complexity of TA-SCL is effectively reduced as only two SCL decoders are needed.
If the channel is subject to burst errors, it is possible that a new codeword cannot be correctly decoded by D_s while the decoding in D_l has not finished yet. To deal with this, an LLR buffer is needed to temporarily store the LLRs of the codewords that D_s fails to decode, such as fr.3 and fr.4 shown in Fig. 4. An output buffer is also needed to re-order the decoded vectors as the codewords may be decoded out of order. For example, fr.7∼fr.9 are stored in the output buffer until the decoding of fr.6 finishes.
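The schedule described above can be sketched as a slot-based simulation. This is a simplified model under stated assumptions: one slot equals the decoding time of D_s, D_l needs `beta` slots per codeword, the LLR buffer holds `zeta` codewords, a codeword that finds the buffer full is dropped, and the output re-ordering buffer is not modelled. The function name `simulate_ta_scl` and the failure pattern in the example are hypothetical and do not reproduce Fig. 4.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def simulate_ta_scl(
    ds_fail: Sequence[bool], beta: int, zeta: int
) -> Tuple[List[int], Dict[int, int]]:
    """Simulate the two-stage schedule.

    ds_fail[k] is True when D_s cannot decode frame k.
    Returns (frames dropped by buffer overflow,
             {frame: slot at whose end D_l finishes it}).
    """
    remaining = 0           # slots D_l still needs for its current frame
    current = None          # frame currently in D_l
    llr_buffer = deque()    # frames waiting for D_l
    overflowed: List[int] = []
    finished: Dict[int, int] = {}

    for slot, failed in enumerate(ds_fail):
        # D_l advances by one slot in parallel with D_s.
        if current is not None:
            remaining -= 1
            if remaining == 0:
                finished[current] = slot
                current = None
        # An idle D_l immediately takes the oldest buffered frame.
        if current is None and llr_buffer:
            current, remaining = llr_buffer.popleft(), beta
        # At the end of the slot the D_s result for this frame is known.
        if failed:
            if current is None:
                current, remaining = slot, beta
            elif len(llr_buffer) < zeta:
                llr_buffer.append(slot)
            else:
                overflowed.append(slot)  # buffer overflow
    return overflowed, finished


if __name__ == "__main__":
    pattern = [True, True, True, False, False, False, False]
    print(simulate_ta_scl(pattern, beta=3, zeta=1))
```

With three consecutive failures, a speed gain of 3 and a single-entry buffer, the third failing frame finds both D_l and the buffer occupied, which is the overflow situation that the analysis in the following sub-sections quantifies.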
B. Error Correction Performance of TA-SCL
To analyse the error correction performance of the TA-SCL decoding, we define its parameters as follows.
• t_s / t_l: decoding time of each codeword using D_s / D_l.
• β: speed gain, which is defined as t_l / t_s. Without loss of generality, we assume β ∈ Z+.
• ζ: size of the LLR buffer, which equals the number of codewords that can be stored in the buffer.
We also denote a TA-SCL decoding whose speed gain is β and buffer size is ζ as D_TA(β, ζ). The TA-SCL decoding in the example shown in Fig. 4 can hence be described as D_TA(3, 1), and the corresponding D_l needs 3t_s to decode a codeword. When a new codeword needs to be stored in the LLR buffer but the buffer is full and decoding in D_l has not finished yet, buffer overflow happens, which will lead to performance degradation for D_TA. An example of buffer overflow is marked in Fig. 4. Thus, the BLER of D_TA, denoted as ε_DTA, is bounded by

$$\epsilon_l \le \epsilon_{D_{TA}} < \epsilon_l + \Pr(\text{Overflow}), \tag{1}$$

where ε_l denotes the BLER of D_l.
Obviously, it is important to prevent buffer overflow in order to reduce ε_DTA. A large buffer size ζ certainly helps as more codewords can be stored, and a smaller speed gain β means that D_l has relatively more time to decode the codewords accumulated in the buffer. To obtain the best tradeoff among performance, hardware usage and throughput, an analytical model of D_TA will be introduced in the next sub-section to derive the relationship between Pr(Overflow) and the parameters of D_TA.
C. Analytical Model of TA-SCL based on Markov Chain
To model the behavior of D_TA(β, ζ), we first introduce the states that the decoder can operate at. In particular, these states reflect whether buffer overflow will happen. We define the number of codewords stored in the LLR buffer as i_ζ and the remaining time required to finish the decoding of D_l (in terms of t_s) as i_β. Each codeword in the LLR buffer needs βt_s to decode. Then, the state of TA-SCL is defined as

$$X_\tau = \beta \cdot i_\zeta + i_\beta, \tag{2}$$

which equals the total time (in terms of t_s) required to clear the buffer. For a D_TA(β, ζ), there are in total S = βζ + β + 1 states. All the S states can be divided into two groups.
• Hazard states: The states in which the LLR buffer is full and the codeword currently decoded by D_l cannot be finished within t_s, which means i_ζ = ζ and i_β > 1. Buffer overflow will occur if D_s cannot decode the next codeword correctly.
• Safe states: In contrast with the hazard states, these states do not have an overflow hazard as the LLR buffer has enough space for a codeword that cannot be correctly decoded by D_s.
We show an example for D_TA(3, 1) in Fig. 5(a), where the black and white arrows represent the probabilities ε_s and ε'_s = 1 − ε_s, respectively. The first three columns show i_β, i_ζ and X_τ, respectively. Typical transitions from hazard and safe states are marked with "H" and "S" in the figure, respectively. Note that the transition from state 0 is a little different as D_l is idle. Suppose that the decoding failures of D_s are independent and identically distributed (IID) with probability ε_s (the BLER of D_s). Then, the state transitions only depend on the current state of D_TA(β, ζ) and ε_s. Hence, decoding with D_TA is a Markov process and can be modeled with a Markov chain. The state diagram of a D_TA(β, ζ) can be easily obtained by finding out all the possible state transitions in Fig. 5(a). Fig. 5(b) shows the state diagram of D_TA(3, 1). For further mathematical analysis, we map the state diagram to a transition matrix P whose size is S × S. An element P_{x,y} ∈ P (x, y ∈ [0, S−1]) corresponds to the transition probability from state x to state y, i.e.,

$$P_{x,y} = \Pr(X_{\tau+1} = y \mid X_\tau = x), \tag{3}$$
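where X_τ is the current state and X_{τ+1} is the next state. The transition matrix of D_TA(3, 1) mapped from the state diagram is

$$P = \begin{bmatrix}
\epsilon'_s & 0 & 0 & \epsilon_s & 0 & 0 & 0 \\
\epsilon'_s & 0 & 0 & \epsilon_s & 0 & 0 & 0 \\
0 & \epsilon'_s & 0 & 0 & \epsilon_s & 0 & 0 \\
0 & 0 & \epsilon'_s & 0 & 0 & \epsilon_s & 0 \\
0 & 0 & 0 & \epsilon'_s & 0 & 0 & \epsilon_s \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}. \tag{4}$$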
With the transition matrix P, we can do steady-state analysis for D_TA. Suppose that the decoding begins with D_TA at state 0, i.e., the state probability λ_0 = [1, 0, ..., 0]. After k · t_s (k ∈ Z+), the state probability becomes λ_k = λ_0 · P^k. Define P_∞ = lim_{k→∞} P^k, then the steady-state distribution λ_∞ of D_TA is

$$\lambda_\infty = \lambda_0 \cdot P_\infty. \tag{5}$$
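In fact, all the rows of P_∞ are the same, which means the steady-state distribution is independent of the initial state λ_0 of D_TA. Buffer overflow happens when D_TA is in any hazard state and D_s cannot decode the next codeword correctly, and the probability of buffer overflow is then expressed as

$$\Pr(\text{Overflow}) = \epsilon_s \cdot \Pr(i_\zeta = \zeta \text{ and } i_\beta > 1) \tag{6}$$
$$= \epsilon_s \cdot \Pr(X_\tau > \beta\zeta + 1) \tag{7}$$
$$= \epsilon_s \cdot \sum_{i=\beta\zeta+2}^{\beta\zeta+\beta} \lambda_{\infty,i}, \tag{8}$$

where λ_{∞,i} is the steady-state probability of state i.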
This probability of overflow bounds ε_DTA in (1). It is a function of the error correction performance ε_s, speed gain β and buffer size ζ, i.e., Pr(Overflow) = f(ε_s, β, ζ). If β and ζ are fixed, the Σ term and hence Pr(Overflow) is monotonically increasing with respect to ε_s. The proof is omitted due to page limitation and will be given in our future work. The monotonicity indicates we can either increase L_s or the SNR to get a better error correction performance. We will show the accuracy of the proposed model by simulation results in the next section. We will also show that TA-SCL can improve the decoding throughput with a small hardware overhead.
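The model can be evaluated numerically with a short sketch. The transition rules coded here are an interpretation of the description above and should be read as assumptions: from state 0 a failure of D_s moves the chain to state β, from any other safe state x a failure moves it to x − 1 + β, a success moves it to x − 1 (or keeps it at 0), and a hazard state always moves to x − 1 because a failing codeword is dropped. The function names are illustrative, and the stationary distribution is obtained by solving λP = λ with Σλ_i = 1 instead of taking the limit of P^k.

```python
import numpy as np


def transition_matrix(eps_s: float, beta: int, zeta: int) -> np.ndarray:
    """Transition matrix of D_TA(beta, zeta) under the assumed rules."""
    n_states = beta * zeta + beta + 1      # S = beta*zeta + beta + 1
    first_hazard = beta * zeta + 2         # hazard states: X > beta*zeta + 1
    p = np.zeros((n_states, n_states))
    for x in range(n_states):
        if x == 0:
            p[0, 0] += 1.0 - eps_s         # D_l stays idle
            p[0, beta] += eps_s            # D_l starts a new codeword
        elif x >= first_hazard:
            p[x, x - 1] = 1.0              # a failing codeword overflows
        else:
            p[x, x - 1] += 1.0 - eps_s
            p[x, x - 1 + beta] += eps_s    # failing codeword adds beta slots
    return p


def steady_state(p: np.ndarray) -> np.ndarray:
    """Solve lambda * P = lambda with sum(lambda) = 1 by least squares."""
    n = p.shape[0]
    a = np.vstack([p.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    lam, *_ = np.linalg.lstsq(a, b, rcond=None)
    return lam


def overflow_probability(eps_s: float, beta: int, zeta: int) -> float:
    """Pr(Overflow) = eps_s * sum of steady-state hazard probabilities."""
    lam = steady_state(transition_matrix(eps_s, beta, zeta))
    return float(eps_s * lam[beta * zeta + 2:].sum())


if __name__ == "__main__":
    for eps in (0.01, 0.05, 0.1):
        print(eps, overflow_probability(eps, beta=3, zeta=1))
```

Sweeping `beta` and `zeta` with such a function is one way to explore the tradeoff among speed gain, buffer size and the upper bound in (1) before committing to an architecture.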
IV. EXPERIMENTAL RESULTS

A. Accuracy of the Proposed Model
To verify the accuracy of the proposed model, we run simulations for a polar code with {N, K, r} = {1024, 512, 24} under AWGN channel conditions. The list sizes of the two component SCL decoders are L_s = 1 and L_l = 32, respectively. The simulated BLER results of D_TA under different speed gains and buffer sizes are obtained at an SNR of 2 dB and are compared with the upper bounds calculated using (1) and (8). Fig. 6 summarizes the performance loss with respect to the speed gain β when different ζ are used. Here, the performance loss is calculated by
The solid lines and the dashed lines show the calculated and simulated results, respectively. It can be seen that the two sets of lines almost overlap, indicating that ε_DTA is approximately equal to its upper bound derived from (1) and (8). The proposed model can thus be used to estimate the error correction performance of a D_TA accurately. The results also show that a larger buffer size enables the decoding to run at a higher speed gain under the same performance loss constraint.
B. Analysis of Hardware Gain
In this sub-section, we show the improvement of hardware performance achieved by the proposed TA-SCL decoder. We use a polar code with {N, K, r} = {1024, 512, 24} and L = 32 as an example. The hardware performance of some VLSI architectures of SCL decoders in the literature [8], [22] is shown in Table I. They are used as the component SCL decoders in the TA-SCL decoder. Fig. 7 shows the error correction performance of D_TA with buffer size ζ = 6. When the target β is 3, there is almost no performance degradation in the high SNR range (≥1.6 dB) compared with the baseline of L = 32. The degradation is obvious in the low SNR range. If the target β is reduced to 2, the decoder can work in a wider SNR range, down to 1.4 dB. All these observations are consistent with the intuitions mentioned in Section IV. It is noted that the throughput of D_TA is lower than D_s in both cases, so the speed gain β of up to 3x is achievable. The overall throughput of TA-SCL is also higher than that of the SCL decoders with smaller list sizes as shown in Fig. 1 [12], [13].
The area of the proposed architecture is shown in Table I, which is estimated based on the results reported in the literature [8], [22]. It equals the sum of the areas of the two SCL modules, the LLR buffer and the output buffer. As the area of the D_l module is dominant, the proposed D_TA only has an 18% area overhead. Moreover, due to the throughput improvement, the area efficiency of D_TA is also much higher than that of D_l.
V. CONCLUSION
In this work, a two-staged adaptive SCL is proposed. This algorithm can support data input at a fixed data rate and has a low hardware complexity. To analyse its error correction performance, an analytical model is also proposed and its accuracy is then verified by simulations.
With a good selection of the parameters of TA-SCL using the proposed analytical model, an optimal tradeoff between speed gain, error correction performance loss and hardware overhead can be obtained for designing the VLSI architecture.
