Polar codes are of great interest, since they are the first provably capacity-achieving forward error correction codes. To improve throughput and to reduce the decoding latency of polar decoders, maximum likelihood (ML) decoding units are used by successive cancellation list (SCL) decoders as well as successive cancellation (SC) decoders. This paper first proposes an approximate ML (AML) decoding unit for SCL decoders. In particular, we investigate the distribution of frozen bits of polar codes designed for both the binary erasure and additive white Gaussian noise channels, and take advantage of the distribution to reduce the complexity of the AML decoding unit, improving the throughput-area efficiency of the SCL decoders. Furthermore, a multimode (MM) SCL decoder with variable list sizes and parallelism is proposed. If high throughput or small latency is required, the decoder decodes multiple received words in parallel with a small list size. However, if error performance is of higher priority, the MM-SCL decoder switches to a serial mode with a larger list size. Therefore, the MM-SCL decoder provides a flexible tradeoff between latency, throughput, and error performance at the expense of a small overhead. Hardware implementation and synthesis results show that our polar decoders not only have a better throughput-area efficiency but also easily adapt to different communication channels and applications.
algorithm [3], [5] were proposed. A key drawback of the SC, SCL, and CA-SCL algorithms is their long decoding latency and low decoding throughput, as these algorithms deal with only one bit at a time.
To reduce decoding latency and improve the throughput of an SC polar decoder, several algorithms [13], [15], [21], [22] were proposed to deal with several bits at a time by using ML decoding units, which calculate symbolwise channel transition probabilities and make hard decisions for several bits at a time. Based on the SC algorithm, the parallel SC [21], hybrid ML-SC [22], ML simplified SC (ML-SSC) [15], and fast ML-SSC [13] algorithms were proposed. The basic difference among the ML decoding units of these four algorithms is that the hybrid ML-SC [22] and the fast ML-SSC [13] take advantage of the distribution of frozen bits to reduce complexity, whereas neither the parallel SC [21] nor the ML-SSC [15] algorithm does so.
The ML decoding units in [19], [20], and [23]–[26] are also used to improve the throughput of SCL-based decoders and to reduce decoding latency. Instead of making hard decisions as in SC-based algorithms, an ML decoding unit for SCL-based algorithms calculates symbolwise channel transition probabilities and performs path expansion and pruning. None of these SCL-based algorithms takes advantage of the distribution of frozen bits to reduce the complexity of ML decoding units. Therefore, the ML decoding units in these SCL-based algorithms have high complexities. For example, when the list size is four and the symbol size is eight, the ML decoding unit accounts for 27% of the overall decoder area in [24]. In [26], when the code length is 1024, the ML decoding unit takes up as much as 62% of the overall decoder area.
In this paper, we first propose a low-complexity approximate ML (AML) decoding unit by utilizing the distribution of frozen bits of polar codes and then propose a multimode SCL (MM-SCL) polar decoder to support variable throughput and latency. Our main contributions are as follows.
1) The divide-and-conquer method in [22] is applied in the probability domain to simplify the ML unit for the SC-based algorithms. By extending this idea and considering the distribution of frozen bits, a divide-and-conquer AML decoding unit for the SCL-based algorithms is proposed. It has a much smaller computational complexity than the existing ML decoding units for the SCL-based algorithms, and has negligible performance loss when properly configured.
2) The distribution of frozen bits of polar codes is analyzed. We show that there are only a small number of frozen-location patterns for polar codes constructed by the method proposed in [27] and the method in [28].
3) Since only a small number of frozen-location patterns exist in polar codes, the divide-and-conquer AML decoding unit for the SCL-based algorithms is simplified further. A low-complexity hardware implementation of the simplified divide-and-conquer AML decoding unit, the low-complexity (LC) AML decoding unit, is proposed. Synthesis results show that, by taking advantage of the small number of frozen-location patterns, our CA-SCL decoder with the LC-AML unit has a better throughput-area efficiency than the existing SCL decoders, while working for all channel conditions.
4) An MM-SCL polar decoder is also proposed. This decoder supports the SCL algorithms with different list sizes and parallelism. When a high throughput or small latency is needed, the MM-SCL decoder decodes multiple received words in parallel with a small list size. If error performance is of higher priority, the MM-SCL decoder switches to a mode with a larger list size. Therefore, the MM-SCL polar decoder provides a flexible tradeoff between latency, throughput, and performance at the expense of a small overhead.
Our proposed divide-and-conquer AML decoding unit for the SCL-based algorithms is a nontrivial extension of the method for the SC-based algorithms in [22]. Moreover, by investigating the distribution of frozen bits of polar codes, we reduce the complexity of the ML decoding unit further. The existing ML decoding units for the SCL decoders [19], [23]–[26] perform list pruning after all the symbolwise channel transition probabilities are calculated, whereas the proposed LC-AML decoding unit sorts intermediate results generated by the recursive channel combination (RCC) method [19], which reduces the number of symbolwise channel transition probabilities handled by list pruning. Hence, the proposed LC-AML decoding unit has a much smaller complexity. The performance degradation due to the proposed LC-AML decoding unit is the same as that in [19] and [24]. Although the ML decoding unit in [23] has no performance degradation, its complexity grows quickly as the list size and the symbol size increase. The performance degradation of the ML decoding unit in [25] and [26] depends on its design parameters.
Many applications, such as modern wireless or wireline communication systems, require variable data rate transmission and have stringent latency requirements. Since polar codes are a potential forward error correction technique for future communication systems, a polar decoder supporting variable data rate and variable decoding latency is desired. Unfortunately, the existing polar decoders provide only fixed latency and throughput (data rate). To the best of our knowledge, the proposed MM-SCL decoder is the first polar decoder with variable throughput and decoding latency given a polar code.
The rest of this paper is organized as follows. In Section II, polar codes as well as construction methods for polar codes and existing ML decoding units are reviewed. In Section III, the divide-and-conquer method is first applied to the ML unit of the SC-based algorithms in the probability domain. Then, the divide-and-conquer AML decoding unit for the SCL-based algorithms is proposed, and its computational complexity is also analyzed. In Section IV, the frozen-location patterns for polar codes are investigated. In Section V, a hardware design of the LC-AML unit is proposed, and an area-efficient CA-SCL decoder with the LC-AML unit is implemented as well. The hardware implementation and synthesis results for the area-efficient SCL decoder are also presented in Section V. In Section VI, the MM-SCL decoder, its hardware implementation, and synthesis results are presented. Finally, some conclusions are provided in Section VII.
II. PRELIMINARIES

A. Notations
Suppose $u$ represents a binary vector $(u_1, u_2, \ldots, u_N)$. $u_a^b$ denotes a binary subvector $(u_a, u_{a+1}, \ldots, u_b)$, and $u_{a,o}^b$ and $u_{a,e}^b$ represent the subvectors of $u_a^b$ with odd and even indices, respectively. For an index set $A \subseteq \{1, 2, \ldots, N\}$, its complement is denoted by $A^c$. The subvector of $u$ restricted to $A$ is denoted by $u_A$.
B. Polar Codes
For an $(N, K)$ polar code, the code length $N$ is a power of two, i.e., $N = 2^n$ for $n > 0$, and $0 < K < N$. The data bit sequence, represented by $u = u_1^N$, is divided into two parts: a $K$-element part $u_A$, which carries information bits, and $u_{A^c}$, whose elements (called frozen bits) are set to zero. The corresponding encoded bit sequence $x = x_1^N$ is generated by $x = u B_N F^{\otimes n}$, where $B_N$ is the $N \times N$ bit-reversal permutation matrix, $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, and $F^{\otimes n}$ is the $n$th Kronecker power of $F$ [1].
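As an illustration of the encoding rule $x = u B_N F^{\otimes n}$, the following minimal sketch builds the Kronecker power explicitly and applies the bit-reversal permutation. The function names are ours and are not taken from [1]; the matrix-based form is chosen for readability, not efficiency.

```python
def kronecker_power(n):
    """Return the n-th Kronecker power of F = [[1, 0], [1, 1]] as a list of rows."""
    g = [[1]]
    for _ in range(n):
        # F (x) G = [[G, 0], [G, G]]
        g = [row + [0] * len(row) for row in g] + [row + row for row in g]
    return g


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of the integer i."""
    return int(format(i, "0{}b".format(n))[::-1], 2)


def polar_encode(u):
    """Compute x = u B_N F^{(x)n} over GF(2); u is a 0-indexed list of bits."""
    size = len(u)
    n = size.bit_length() - 1
    assert n > 0 and size == 1 << n, "code length must be a power of two"
    # B_N is a symmetric permutation matrix, so (u B_N)_j = u_{rev(j)}.
    permuted = [u[bit_reverse(j, n)] for j in range(size)]
    g = kronecker_power(n)
    return [sum(permuted[k] * g[k][j] for k in range(size)) % 2
            for j in range(size)]
```

For $N = 4$ this gives $x = (u_1 + u_2 + u_3 + u_4,\ u_3 + u_4,\ u_2 + u_4,\ u_4)$, so setting only $u_4 = 1$ yields the all-one word.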
C. Construction Methods of Polar Codes
An essential problem for constructing a polar code is to determine the locations of frozen bits (the elements of $A^c$). For the binary erasure channel (BEC) with an erasure probability $\epsilon$ ($0 < \epsilon < 1$), assuming $z_{0,1} = \epsilon$, the following recursions [27] are used to construct an $(N, K)$ polar code:
For the additive white Gaussian noise (AWGN) channel with a given noise variance $\sigma^2$, let $z_{0,1} = 2/\sigma^2$. The following recursive method [28] based on Gaussian approximation is used for $1 \le i \le n$:
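The BEC construction can be sketched as follows. Since the recursions themselves are not reproduced here, the sketch assumes the standard erasure-probability recursions $z_{i,2j-1} = 2 z_{i-1,j} - z_{i-1,j}^2$ and $z_{i,2j} = z_{i-1,j}^2$; this is an assumption on our part, as are the function names. The frozen set is taken as the $N - K$ indices with the largest $z_{n,j}$.

```python
def bec_reliabilities(eps, n):
    """Return [z_{n,1}, ..., z_{n,N}] for a BEC with erasure probability eps.

    Assumed recursions (standard for the BEC, not quoted from the text):
        z_{i,2j-1} = 2 z_{i-1,j} - z_{i-1,j}^2,   z_{i,2j} = z_{i-1,j}^2.
    """
    z = [eps]
    for _ in range(n):
        nxt = []
        for w in z:
            nxt.append(2 * w - w * w)  # z_{i,2j-1}: the degraded channel
            nxt.append(w * w)          # z_{i,2j}: the upgraded channel
        z = nxt
    return z


def frozen_set(z, k):
    """Return the 1-based indices of the N - K largest values in z."""
    order = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
    return sorted(j + 1 for j in order[:len(z) - k])
```

With `eps = 0.5` and `n = 3`, freezing four bits selects $u_1$, $u_2$, $u_3$, and $u_5$.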
where
In this case, $A^c$ is chosen such that $\sum_{j \in A^c} z_{n,j}$ is minimal and $|A^c| = N - K$.
D. Existing ML Decoding Units for Polar Decoders
When $x = u B_N F^{\otimes n}$ is transmitted, suppose the received word is $y = y_1^N$ and the symbol size is $M = 2^m$. A symbol-decision [19] ML decoding unit first calculates the symbolwise channel transition probabilities $\Pr(y, \hat{u}_1^{jM} \mid u_{jM+1}^{jM+M})$ ($0 \le j < N/M$), and then makes a symbolwise ML decision for the SC-based decoders or chooses the $L$ most reliable paths for the SCL-based decoders. Here, $\hat{u}_1^{jM}$ denotes the previously estimated bits. There are three methods to calculate the symbolwise channel transition probabilities, and none of them takes advantage of the distribution of frozen bits. The first [15], [21], [23], [25] is based on an $M$-element product of bitwise channel transition probabilities, called the directing mapping method (DMM)
is the previously estimated bit vector of w
The second [19] , called the RCC method, is based on a product of symbolwise channel transition probabilities recursively
where 1 ≤ φ ≤ m, 0 ≤ λ < n, = 2 λ , and = 2 φ . The third [26] is a hybrid method by applying the DMM first and then the RCC method, referred to as the DMM-RCC Hybrid (DRH) method.
Based on the distribution of frozen bits, some data symbols in [13] are treated as special constituent codes, such as repetition codes and single-parity-check codes. Different methods were proposed to deal with different constituent codes.
Furthermore, an ML decoding unit in [22] with the divide-and-conquer method was proposed for the SC algorithms based on an empirical assumption [22] .
Assumption 1: For a well-designed polar code, there is no case in which $u_{2i-1}$ is an information bit and $u_{2i}$ is a frozen bit, for any $1 \le i \le N/2$.
Based on this assumption and the divide-and-conquer method, a simplified ML unit was proposed in [22]. Moreover, a recursive form of the divide-and-conquer method was proposed in [22], but it is not suitable for hardware implementation, since it targets large symbol sizes, which incur a very high hardware complexity.
How to take advantage of frozen-location patterns to reduce the complexity of ML decoding units has been discussed in [13] and [22] for the SC-based algorithms, but it has not been investigated yet for the SCL-based algorithms.
III. DIVIDE-AND-CONQUER AML DECODING UNIT
The simplified ML unit in [22] is based on the Euclidean distance, since an AWGN channel is assumed. Here, we first apply the divide-and-conquer method in the probability domain and reformulate the ML unit for the SC-based algorithms. This simplified ML unit in the probability domain is slightly more general than that in [22] , because it is applicable to both the AWGN channels and other channels. We then extend the simplified ML unit in the probability domain to SCL-based algorithms.
A. Reformulation of the Divide-and-Conquer ML Unit for SC-Based Algorithms [22] in the Probability Domain
For the ease of discussion, a string vector j M+1 ) needs to be found. Based on the RCC method [19] , 11 , since v ( j M/2)+i and u j M+2i are independent, max(T ) = max(T 1 ) max(T 2 ). Then, the maximum of 2 | ( j ) 01 | values generated in the previous step is found. Therefore, if
Under Assumption 1, considering (8), if ( j ) 01 = ∅, the maximal value of T is just a product of the maximal value 
B. Divide-and-Conquer AML Decoding Unit for SCL-Based Algorithms
Extending the idea in (8) and (9), we propose a divide-and-conquer AML decoding method for the SCL-based algorithms under Assumption 1. For the SC-based algorithms, only the maximal value of Pr(y,û
In contrast, for the SCL-based algorithms with the list size $L$, the $L$ maximal values of $\Pr(y, \hat{u}_1^{jM} \mid u_{jM+1}^{jM+M})$ are needed. A simple way to understand our method is that the max(Pr(ρ)) function is replaced by a function finding the L maximal values of Pr(ρ), denoted by [Pr(ρ 1 ), . . . ,
The path expansion-and-pruning procedure of the SCL-based algorithms is divided into two stages. In the first stage, the $q$ most reliable paths are selected for each list by calculating and comparing path metrics. In the second stage, the $L$ most reliable paths are selected among the $qL$ surviving paths generated in the first stage. This two-stage approach was proposed in our prior work [19]; the novelty herein is that we use the divide-and-conquer method to reduce the complexity of the first stage. The second stage has been described in [19], and we omit its discussion.
Assuming | ( j )
The first stage includes the following steps.
Step 0: The RCC method [19] is applied to calculate both T 1 and T 2 .
Step 1: Given any β j -bit binary vector
We find the $\min(q, 2^{\gamma_j})$ maximal values among the $2^{\gamma_j}$ values of $T_1$, and the $\min(q, 2^{\gamma_j})$ maximal values among the $2^{\gamma_j}$ values of $T_2$.
Step 2: For $B^{(j)}$, there are $(\min(q, 2^{\gamma_j}))^2$ values of $T$, each of which is a product of values of $T_1$ and $T_2$ generated in Step 1.
Step 3: The $q$ maximal values are selected from the $(\min(q, 2^{\gamma_j}))^2 2^{\beta_j}$ values of $T$ generated by Step 2, because there are $2^{\beta_j}$ possible values for $B^{(j)}$.
If ( j ) 01 = ∅ and $\beta_j = 0$, we still use the aforementioned four steps to find the $q$ most reliable paths for each list, except that $B^{(j)}$ is considered as a void binary vector, which is the only value for $B^{(j)}$ when $\beta_j = 0$. Fig. 1(d) and (e) show two examples for the frozen-location patterns DDDD and FDDD, respectively, when $M = 4$, $N = 4$, and $q = 2$. After these four steps are carried out for each list, there are $qL$ values of $T$ left, which are sorted to choose the $L$ maximal values in the second stage.
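The first stage described above can be sketched in software and checked against an exhaustive search. This is only an illustrative model under our own assumptions: the tables `t1` and `t2` stand in for the outputs of Step 0 (the values of $T_1$ and $T_2$ indexed by the bit vectors $v$ and the even-indexed bits, respectively), a pattern is written as a string of F and D, and all function names are hypothetical. Here the FD pairs correspond to the $\beta_j$ bits of $B^{(j)}$ and the DD pairs to the $\gamma_j$ free bits.

```python
import heapq
import itertools


def _top_q(table, base, free_pos, q):
    """Step 1: the min(q, 2^gamma) largest table values with FD/FF bits fixed."""
    values = []
    for bits in itertools.product((0, 1), repeat=len(free_pos)):
        vec = list(base)
        for pos, bit in zip(free_pos, bits):
            vec[pos] = bit
        values.append(table[tuple(vec)])
    return heapq.nlargest(q, values)


def first_stage(pattern, t1, t2, q):
    """Return the q largest values of T = T1 * T2 for one list."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    assert "DF" not in pairs, "pattern violates Assumption 1"
    fd = [i for i, p in enumerate(pairs) if p == "FD"]  # beta_j positions
    dd = [i for i, p in enumerate(pairs) if p == "DD"]  # gamma_j positions
    candidates = []
    for b in itertools.product((0, 1), repeat=len(fd)):  # every value of B
        base = [0] * len(pairs)
        for pos, bit in zip(fd, b):
            base[pos] = bit  # on an FD pair, v_i equals u_{2i}
        best1 = _top_q(t1, base, dd, q)
        best2 = _top_q(t2, base, dd, q)
        candidates += [a * c for a in best1 for c in best2]  # Step 2
    return heapq.nlargest(q, candidates)                     # Step 3


def brute_force(pattern, t1, t2, q):
    """Reference: evaluate T for every valid symbol, then keep the q largest."""
    free = [i for i, c in enumerate(pattern) if c == "D"]
    values = []
    for bits in itertools.product((0, 1), repeat=len(free)):
        u = [0] * len(pattern)
        for pos, bit in zip(free, bits):
            u[pos] = bit
        v = tuple(u[2 * i] ^ u[2 * i + 1] for i in range(len(u) // 2))
        e = tuple(u[2 * i + 1] for i in range(len(u) // 2))
        values.append(t1[v] * t2[e])
    return heapq.nlargest(q, values)
```

For nonnegative table entries, both functions return the same $q$ values, which mirrors the statement that the first stage loses nothing relative to pruning after all symbolwise probabilities are computed.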
The proposed divide-and-conquer AML decoding unit has a lower computational complexity. It reduces the number of symbolwise channel transition probabilities handled by the list pruning function by sorting the intermediate calculation results generated by the RCC method [19], whereas the DMM, RCC, and DRH methods perform list pruning after all the symbolwise channel transition probabilities are calculated. For example, in Fig. 1(d), the DMM, RCC, and DRH methods perform $\max_2(\Pr(y_1^4 \mid u_1^4))$ after all 16 values of $\Pr(y_1^4 \mid u_1^4)$ are calculated. The proposed divide-and-conquer method performs $\max_2(\Pr(y_1^2 \mid v_1^2))$ and $\max_2(\Pr(y_3^4 \mid u_2, u_4))$ first. Then, it finds the two maximal values out of the four products generated by $\max_2(\Pr(y_1^2 \mid v_1^2)) \max_2(\Pr(y_3^4 \mid u_2, u_4))$. The output of the proposed AML decoding unit is the same as those of other ML decoding units if they have the same input.
Given an M-bit symbol u
), the first stage using the divide-and-conquer decoding unit needs $2^{\beta_j+1}$ $2^{\gamma_j}$-to-$\min(q, 2^{\gamma_j})$ sorts, one $(\min(q, 2^{\gamma_j}))^2 2^{\beta_j}$-to-$q$ sort, and $(\min(q, 2^{\gamma_j}))^2 2^{\beta_j}$ multiplications per list, whereas the ML decoding unit in [19] needs $2^{\beta_j+2\gamma_j}$ multiplications and a $2^{\beta_j+2\gamma_j}$-to-$q$ sort per list. By examining all possible values of $\beta_j$ and $\gamma_j$, we can find the worst-case computational complexity.
We demonstrate the advantage of the proposed divide-and-conquer AML unit in computational complexity over other ML decoding units with an example of $M = 8$ and $q = 4$. Henceforth, we only discuss the computational complexity per list needed to accomplish the first stage of the proposed method. Table I lists the worst-case computational complexities of different methods, and shows that the proposed method has the smallest computational complexity when all 81 8-bit frozen-location patterns under Assumption 1 need to be dealt with.
Our proposed method has the same performance degradation as in [24]. If $q \ge L$, our method does not introduce any performance degradation for the SCL-based algorithms. If $q < L$, the performance degradation depends on $q$ and $L$, and is usually negligible when $q$ and $L$ are small. Fig. 2 shows the frame error rate (FER) and bit error rate of a (2048, 1433) polar code with a 32-bit CRC for CA-SCL decoders with the LC-AML decoding unit when $M = 8$ and $L = 8$. When $q = 8$, the proposed algorithm has no performance loss compared with the CA-SCL decoder in [3] and the hybrid SC-ML-LIST algorithm in [22]. When $q = 4$, the performance loss is negligible compared with that of $q = 8$. However, $q = 2$ leads to a performance loss of about 0.1 dB at an FER of $10^{-3}$.
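The operation counts stated above can be tabulated with a small helper. The following sketch simply evaluates those expressions for given $\beta_j$, $\gamma_j$, and $q$; the function and key names are ours.

```python
def first_stage_cost(beta, gamma, q):
    """Per-list operation counts of the divide-and-conquer first stage."""
    keep = min(q, 2 ** gamma)
    products = keep * keep * 2 ** beta
    return {
        "partial_sorts": 2 ** (beta + 1),     # each is 2^gamma-to-keep
        "partial_sort_inputs": 2 ** gamma,
        "final_sort_inputs": products,        # one products-to-q sort
        "multiplications": products,
    }


def ml_cost_ref19(beta, gamma):
    """Per-list counts when pruning only after all products are formed [19]."""
    total = 2 ** (beta + 2 * gamma)
    return {"final_sort_inputs": total, "multiplications": total}
```

For the pattern DDDDDDDD ($\beta_j = 0$, $\gamma_j = 4$) with $q = 4$, the first function reports 16 multiplications and a 16-to-4 final sort, against 256 multiplications and a 256-to-4 sort for the second.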
IV. FROZEN-LOCATION PATTERNS FOR POLAR CODES
Considering the hardware implementation for the divide-and-conquer AML unit, a uniform hardware design for all frozen-location patterns is preferred rather than different dedicated designs for various frozen-location patterns. For M = 8 and M = 16, there are 81 and 6561 possible frozen-location patterns satisfying Assumption 1, respectively. Actually, some of them may never exist in a polar code. Therefore, we want to know the exact number of frozen-location patterns in a polar code, since the number of frozen-location patterns impacts the complexity of the divide-and-conquer AML decoding unit for the SCL-based algorithms: the more frozen-location patterns, the higher complexity the divide-and-conquer AML decoding unit has.
A. Polar Codes for the BEC
For polar codes constructed for the BEC with an erasure probability $\epsilon$ ($0 < \epsilon < 1$), (1) and (2) are used in [27]. In order to examine frozen-location patterns in these polar codes, we have the following results regarding the ordering of $z_{i,j}$ for $i \ge 1$ and $1 \le j \le 2^i$. This ordering determines the possible frozen-location patterns in a polar code.
Proposition 1: Assuming $z_{0,1} = \epsilon \in (0, 1)$, given any $i \ge 1$ and $1 \le j \le 2^i$, $z_{i,j}$ is calculated by (1) or (2). We have the following.
Now, let us explain how the ordering of $z_{n,j}$ determines the $2^m$-bit ($1 \le m \le 3$) frozen-location patterns in an $(N, K)$ polar code over the BEC. First, to choose the elements of $A^c$ for an $(N, K)$ polar code over the BEC, $A^c$ is chosen such that $\sum_{j \in A^c} z_{n,j}$ is maximal and $|A^c| = N - K$, where $N = 2^n$. Then, if there are $k_j$ frozen bits in a symbol $u_{jM+1}^{jM+M}$, the index set $A_j^c$ consisting of the indexes of these $k_j$ frozen bits must be chosen such that $\sum_{t \in A_j^c} z_{n,t}$ is maximal, while $|A_j^c| = k_j$. For example, assuming there are four frozen bits in $u_1^8$ in a (16, 12) polar code, by Proposition 1.4), $z_{4,1} > z_{4,2} > z_{4,3} > z_{4,5} > z_{4,4} > z_{4,6} > z_{4,7} > z_{4,8}$. Hence, $u_1$, $u_2$, $u_3$, and $u_5$ will be frozen bits and the frozen-location pattern for $u_1^8$ will be FFFDFDDD. Therefore, for polar codes constructed by the method in [27], by Proposition 1.2), there are three 2-bit frozen-location patterns: DD, FD, and FF. We note that the implication of Proposition 1.2) is the counterpart over the BEC of [22, Assumption 1]. By Proposition 1.3), there are five 4-bit frozen-location patterns: DDDD, FDDD, FFDD, FFFD, and FFFF. By Proposition 1.4), there are nine 8-bit frozen-location patterns: DDDDDDDD, FDDDDDDD, FFDDDDDD, FFFDDDDD, FFFDFDDD, FFFFFDDD, FFFFFFDD, FFFFFFFD, and FFFFFFFF.
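The nine 8-bit patterns can also be observed numerically. The sketch below is a check under stated assumptions, not part of the proof: it assumes the standard BEC recursions $z_{i,2j-1} = 2 z_{i-1,j} - z_{i-1,j}^2$ and $z_{i,2j} = z_{i-1,j}^2$, uses exact rational arithmetic to avoid spurious ties, freezes the $N - K$ indices with the largest $z_{n,j}$, and collects the distinct patterns of all $M$-bit symbols. The function names are ours.

```python
from fractions import Fraction


def bec_reliabilities(eps, n):
    """Exact z_{n,1..N} under the assumed standard BEC recursions."""
    z = [Fraction(eps)]
    for _ in range(n):
        nxt = []
        for w in z:
            nxt.append(2 * w - w * w)  # z_{i,2j-1}
            nxt.append(w * w)          # z_{i,2j}
        z = nxt
    return z


def symbol_patterns(z, k, m):
    """Distinct m-bit frozen-location patterns of the (N, k) code built from z."""
    size = len(z)
    order = sorted(range(size), key=lambda j: z[j], reverse=True)
    frozen = set(order[:size - k])  # the N - k least reliable positions
    labels = "".join("F" if j in frozen else "D" for j in range(size))
    return {labels[s:s + m] for s in range(0, size, m)}


NINE_PATTERNS = {
    "DDDDDDDD", "FDDDDDDD", "FFDDDDDD", "FFFDDDDD", "FFFDFDDD",
    "FFFFFDDD", "FFFFFFDD", "FFFFFFFD", "FFFFFFFF",
}
```

Sweeping $K$ from 0 to 8 for a single 8-bit symbol ($n = 3$, $\epsilon = 1/2$) reproduces exactly the nine patterns listed above.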
For a larger symbol size, it is hard to get the ordering of z i, j by an analytical method. A numerical method can be used. For example, the symbol size is 16 
Moreover, for 0 < z 0,1 = < 1, z 4,4 − z 4,9 = 2 2 ( − 1) 4 ( 10 − 4 9 + 34 8 Because of the recursive calculation of z i, j , for i ≥ 4 and 1 ≤ j ≤ 2 i−4 , we have
Thus, there are only 17 frozen-location patterns for 16-bit symbols.
It is not meaningful to consider the symbol size greater than 16, because this will incur very high complexity for hardware implementations.
B. Polar Codes for the AWGN Channel
For the construction method introduced in [28] for the AWGN channel, it is difficult to analyze the relationship between the $z_{3,i}$'s for $1 \le i \le 8$ based on (3) and (5). Instead, we examine eight polar codes constructed with the method in [28], which have code lengths from $2^{10}$ to $2^{13}$ and code rates of 0.5 and 0.8, to identify 8-bit frozen-location patterns. By examining all 8-bit symbols of these polar codes, we found that there are only nine 8-bit frozen-location patterns in these codes, which are the same as those for polar codes constructed for the BEC, listed in Section IV-A. Our observation is consistent with [22, Assumption 1].
C. Computational Complexity of the Divide-and-Conquer AML Decoding Unit
When it needs to deal with only the frozen-location patterns mentioned in Sections IV-A and IV-B, the divide-and-conquer AML decoding unit has a smaller complexity. If M = 8 and q = 4, it needs 80 multiplications, two 16-to-4 sorts, and a 32-to-4 sort, as listed in Table I . It saves 32 multiplications, a 32-to-4 sort, and an 8-to-4 sort compared with the divide-and-conquer AML decoding unit, which deals with all 81 frozen-location patterns following Assumption 1, since a 64-to-4 sort consists of two 32-to-4 sorts and an 8-to-4 sort.
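The effect of restricting the pattern set on the final sort can be reproduced with a short enumeration. In this sketch (names are ours), each pattern is reduced to its number of FD pairs ($\beta_j$) and DD pairs ($\gamma_j$), and the worst-case number of products entering the final sort, $(\min(q, 2^{\gamma_j}))^2 2^{\beta_j}$, is taken over a set of patterns.

```python
import itertools

NINE_PATTERNS = [
    "DDDDDDDD", "FDDDDDDD", "FFDDDDDD", "FFFDDDDD", "FFFDFDDD",
    "FFFFFDDD", "FFFFFFDD", "FFFFFFFD", "FFFFFFFF",
]


def assumption1_patterns(m):
    """All m-bit patterns without a DF pair, i.e., satisfying Assumption 1."""
    return ["".join(pairs)
            for pairs in itertools.product(("FF", "FD", "DD"), repeat=m // 2)]


def beta_gamma(pattern):
    """Return (number of FD pairs, number of DD pairs) of a pattern."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    assert "DF" not in pairs, "pattern violates Assumption 1"
    return pairs.count("FD"), pairs.count("DD")


def worst_case_products(patterns, q):
    """Largest number of products fed to the final sort over all patterns."""
    worst = 0
    for pattern in patterns:
        beta, gamma = beta_gamma(pattern)
        keep = min(q, 2 ** gamma)
        worst = max(worst, keep * keep * 2 ** beta)
    return worst
```

With $q = 4$, the 81 patterns for $M = 8$ lead to a 64-input final sort, whereas the nine patterns need only 32 inputs, which is consistent with the 64-to-4 and 32-to-4 sorts discussed above.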
If $M = 16$ and $q = 4$, to deal with all $3^8 = 6561$ 16-bit frozen-location patterns satisfying Assumption 1, the first stage of the proposed ML decoding unit needs 1632 multiplications, two 256-to-4 sorts, and a 1024-to-4 sort. However, to deal with only the 17 16-bit frozen-location patterns discussed in Section IV-A, the simplified divide-and-conquer AML decoding unit needs 736 multiplications, two 256-to-4 sorts, and a 128-to-4 sort.
V. LOW-COMPLEXITY AML DECODING UNIT
For convenience, we implement the proposed divide-and-conquer AML decoding unit, assuming M = 8 henceforth. Our implementation can be readily extended to other values of M. To further reduce complexity and latency, we do not use the divide-and-conquer method to deal with patterns DDDDDDDD, FFFFFFFD, and FFFFFFFF, which will be described in Section V-B. Then, the divide-and-conquer AML decoding unit can be simplified further by dealing with only the remaining six 8-bit frozen-location patterns. This simplified divide-and-conquer AML decoding unit, referred to as the LC-AML decoding unit, needs 80 multiplications, four 8-to-4 sorts, and a 32-to-4 sort. It saves two 8-to-4 sorts compared with the divide-and-conquer AML decoding unit dealing with nine patterns, since a 16-to-4 sort consists of three 8-to-4 sorts. This also leads to a shorter critical path in our design than the divide-and-conquer AML decoding unit.
A. Hardware Design for the LC-AML Decoding Unit
The SCL-based polar decoders in the literature can be divided into two categories: the log-likelihood (LL)-based decoders [11] , [24] , [29] and the LL-ratio (LLR)-based decoders [10] , [26] . Although our proposed algorithm in Section III is described in the probability domain, it can be easily adapted for both the LL-based decoder and the LLR-based decoder. We focus on the LLR-based polar decoder, because in general, the LLR-based decoder has a better throughput-area efficiency than the LL-based decoder.
First, we adapt the proposed LC-AML decoding unit to the LLR-based SCL decoder. Suppose the path metrics $PM_k^{(t)}$ of the $L$ list survivors are given and $u_t$ is the last bit processed by the decoder, where $1 \le k \le L$, $1 \le t \le N$, and $t$ is a multiple of $M$. Suppose $\alpha_{j,l}$ ($0 \le j < M$) represents the LLR of Pr(y [10] . Otherwise, $m_j = 1$. Then, our goal is to calculate $PM_{k,p}^{(t+M)}$ and to select the $L$ minimum values of $PM_{k,p}^{(t+M)}$.
Fig. 3 shows the top architecture of our low-complexity implementation for the LC-AML decoding unit. MLD_S1 calculates the path metrics and selects the $q$ minimum values for each list. FrzInfVec is an $M$-bit frozen-bit indication vector $(f_1, f_2, \ldots, f_M)$ for $u_{t+1}^{t+M}$. For $1 \le j \le M$, if $u_{t+j}$ is a frozen bit, $f_j = 1$; otherwise, $f_j = 0$. LLRInV_l is the vector $(\alpha_{0,l}, \alpha_{1,l}, \ldots, \alpha_{M-1,l})$ for $1 \le l \le L$. Fig. 4(a) shows the design for MLD_S1_q4 when $M = 8$ and $q = 4$. Here, we focus on the data path for calculating path metrics. The circuitry to generate the symbol values associated with path metrics is simple, consists of XORs, and is therefore omitted. The data paths corresponding to the different steps described in Section III are labeled as well.
In Step 0, two RCC blocks, shown in Fig. 4(b 
is calculated by the left RCC block. Here, (i ) 2 represents the binary string of integer i . 16-ADDER contains 16 adders to calculate path metrics, as shown in Fig. 4(c) .
In Step 1, for different frozen-location patterns, path metrics go through different data paths selected by 16 2-to-1 multiplexers. Their control words are 1 if the frozen-location pattern is FDDDDDDD or FFDDDDDD; otherwise, they are 0.
In Step 2, results from Step 1 are combined to calculate $\sum_{j=0}^{7} m_j |\alpha_{j,l}|$.
In Step 3, there are 32 path metrics going through a 32-to-4 sorter. However, for some frozen-location patterns, the number of valid symbol values is smaller than 32, because the number of frozen bits can be larger than three. Therefore, the path metrics associated with those invalid symbol values need to be set to the maximal positive value, so that the four minimum path metrics belong to valid symbol values. The message-screening (MSNG) block accomplishes this job with FrzInfVec, which contains the frozen-location pattern information.
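The role of message screening in Step 3 can be illustrated with a behavioral sketch. It is not a description of the MSNG circuit: the validity mask is assumed to be derived elsewhere from FrzInfVec, infinity stands in for the maximal positive value of a fixed-point metric, and the function name is ours.

```python
def screen_and_select(metrics, valid, q=4, max_metric=float("inf")):
    """Mask invalid candidates, then return the indices of the q smallest metrics.

    metrics : candidate path metrics (smaller is more reliable)
    valid   : booleans, False where the symbol value conflicts with frozen bits
    """
    screened = [m if ok else max_metric for m, ok in zip(metrics, valid)]
    order = sorted(range(len(screened)), key=lambda i: screened[i])
    return order[:q]
```

Because screened entries carry the largest possible metric, they can only be selected if fewer than $q$ valid candidates exist.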
Different sorters used in our design are shown in Figs. 4(d) and 5(e) and (f). S8TO4 finds the minimum four values of eight values. S4 sorts the four inputs and outputs them in decreasing order and has a shorter critical path of two comparators and one 4-to-1 multiplexer than a four-input bitonic sorter [30] , which has a critical path of three comparators. S32To4 consists of seven S8TO4 units in a binary tree structure.
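The tree composition of S32To4 from seven S8TO4 units can be modeled as follows. This is a functional sketch only (a software sort replaces the comparator network, and the function names are ours); it shows why a tree of 8-to-4 selectors returns the four smallest of 32 values.

```python
def s8to4(values):
    """Functional model of S8TO4: the four smallest of eight values, ascending."""
    assert len(values) == 8
    return sorted(values)[:4]


def s32to4(values):
    """Functional model of S32To4 built from 4 + 2 + 1 = 7 S8TO4 units."""
    assert len(values) == 32
    # First level: four units, each reducing 8 inputs to 4.
    level1 = [s8to4(values[i:i + 8]) for i in range(0, 32, 8)]
    # Second level: two units, each merging two 4-element results.
    level2 = [s8to4(level1[0] + level1[1]), s8to4(level1[2] + level1[3])]
    # Third level: one unit producing the final four minima.
    return s8to4(level2[0] + level2[1])
```

Any of the four global minima survives every level, since at most three other values in its group can be smaller.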
Although MLD_S1_q4 is designed for six 8-bit frozen-location patterns, it can also deal with other frozen-location patterns, such as all frozen-location patterns satisfying the following two conditions. First, the frozen-location pattern has at least three Fs. Second, two frozen bits are located at the first two bits of the data symbol.
B. Area-Efficient SCL Decoder
To examine the advantage of our proposed design, we incorporate MLD_S1_q4 into CA-SCL polar decoders with the list size L = 4. Architecturewise, our decoder, referred to as the area-efficient (AE) SCL decoder, is almost the same as the architecture of the tree-based reduced-latency SCL polar decoder in [26] , which performs the CA-SCL decoding algorithm on a binary tree representation of a polar code, except that our AE-SCL decoder uses the LC-AML decoding unit instead of the DRH ML decoding unit used in [26] .
Leaf nodes of the decoding tree for our decoder are divided into four categories.
1) Rate-0 Node: Its frozen-location pattern contains only F, i.e., the node contains only frozen bits.
2) Rate-1 Node: Its frozen-location pattern contains only D, i.e., the node contains only information bits.
3) Repetition Node [13]: Its frozen-location pattern is either FFFFFFFF_FFFFFFFD or FFFFFFFD.
4) Rate-R-2 Node: Its frozen-location pattern is one of the six 8-bit frozen-location patterns.
Rate-0 and Rate-1 nodes are decoded with the same methods as in [26] . The main difference between our proposed decoder, here, and the tree-based reduced-latency SCL polar decoder is how to deal with the repetition nodes and the Rate-R-2 nodes. For repetition nodes, a binary tree of adders is used to calculate the LLRs in order to reduce the decoding latency [13] . Rate-R-2 nodes are dealt with by MLD_S1_q4, which reduces the area of AE-SCL decoders.
C. Synthesis Results
The AE-SCL decoders with $L = 4$ are implemented for three polar codes: a (1024, 512) code, an (8192, 4096) code, and a (32768, 29504) code. The first two codes are constructed for the BEC with $\epsilon = 0.5$ and the third code is for the AWGN channel with a noise variance of 0.1757. These three codes use a 32-bit CRC, whose generator polynomial is 0x1EDC6F41. The decoder for $N = 1024$ has 256 processing units. For the other two codes, the decoder has 512 processing units. The channel LLRs are quantized to 5 bits. The synthesis tool is Cadence RTL Compiler. The process technology is TSMC 90-nm CMOS technology. Here, four stages of pipeline registers are used in the LC-AML decoding unit. Areas of different ML decoding units for the (1024, 512) polar code are listed in Table II. The area of our proposed LC-AML decoding unit is only one fourth of that of the ML decoding unit in [26]. By considering fewer patterns, the area of the LC-AML decoding unit is 67% of that of the divide-and-conquer AML design, which deals with all 81 8-bit frozen-location patterns following Assumption 1.
The synthesis results of the three entire decoders (AE-SCLs) are also listed in Tables IV-VI, respectively. Here, NIT means the net information throughput. Compared with [26], the SCL decoder architecture with the best throughput-area efficiency to our knowledge, the AE-SCL decoders have smaller areas, because the proposed LC-AML decoding unit is applied. The AE-SCL decoder has a slightly larger decoding latency than that in [26], because the proposed LC-AML decoding unit deals with only 8-bit frozen-location patterns, whereas the ML decoding unit in [26] can deal with some 16-bit frozen-location patterns. Since the extra decoding cycles needed by the AE-SCL decoders are a very small fraction of the entire decoding cycles, the proposed AE-SCL decoders still achieve better throughput-area efficiency than the decoders in [26]. For example, for the (1024, 512) polar code, the throughput-area efficiency of the AE-SCL decoder is 1.93 times that of the decoder in [26]. As the code length increases, the advantage in throughput-area efficiency diminishes, because the ML decoding unit occupies a smaller fraction of the entire decoder if the code is longer. Compared with the symbol-decision SCL decoders in [10], [23], and [24], the advantage of our decoders in throughput-area efficiency is more significant. The throughput-area efficiency of the AE-SCL decoder is 3.32, 8.25, and 3.17 times that of the decoders in [10], [23], and [24], respectively, for the (1024, 512) polar code.
VI. MULTIMODE SCL DECODER
All existing SCL polar decoders in the literature provide fixed throughput and decoding latency given a polar code. These SCL decoders cannot adapt to variable communication channels and applications. In order to adapt to different throughput and latency requirements, we propose an MM-SCL decoder with $n_d$ decoding paths, which can decode $P$ received words with the list size $L$ in parallel, where $1 \le P, L \le n_d$ and $n_d \ge P \times L$. For simplicity, we use the number of the received words decoded simultaneously as the mode index and call this mode-$P$. This MM feature requires the decoder to perform SCL decoding algorithms with different list sizes (the SC decoding algorithm is a special case of the SCL decoding algorithm with the list size $L = 1$).
A. Architecture Description
Assuming $n_d = 4$, the top architecture of the MM-SCL decoder is shown in Fig. 5. It has four blocks of channel memory, CMEM$_i$ ($1 \le i \le 4$), to store four received words from Chn_LLR, since the decoder under mode-4 deals with four received words simultaneously. Block DCD$_i$ ($1 \le i \le 4$) contains processing units to calculate the LLRs, and partial-sum units to update the partial sums for each list. The intermediate LLRs calculated by DCD$_i$ are stored in the list message memory (LMEM). Designs for the processing units, the partial-sum units, and the interface between the processing units and the LMEM adopt blocks of the reduced-latency tree-based SCL decoder in [26].
The control block (CNTL), designed based on the instruction RAM-based methodology in [13] , includes two parts shown in Fig. 6 . The first part is the control ROM (CROM) which has N words. Here, N is the number of the leaf nodes of the decoding tree [26] , which is determined by the frozen-bit distribution of the polar code and parameters of the decoder. For a different polar code, the CROM needs to be reprogrammed. Each word of the CROM corresponds to a leaf node and contains the following information: 1) the layer index, the node type, and the size of the current leaf node; 2) the frozen-location pattern of the current leaf node; 3) the indexes of LLR vectors, which will be updated for the current leaf node; 4) the indexes of partial sums, which will be updated for the current leaf node; 5) the number of clock cycles needed for the current leaf node. The other part is four finite-state machines (FSMs). Each of them is associated with a decoding path and reads a word from the CROM and generates control signals for the proposed decoder, such as read/write addresses and enable signals for memory blocks, and control signals for decoding paths.
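To make the CROM word layout concrete, the five fields listed above can be modeled as a small record. This is only an illustrative sketch: the class name, field names, field types, and the example values (including the node-type label) are assumptions, since the text does not specify bit widths or encodings.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CromWord:
    """One CROM word, i.e., the control information for one leaf node.

    All names and types here are hypothetical; the hardware encoding
    is not given in the text.
    """
    layer_index: int                      # 1) layer index of the leaf node
    node_type: str                        # 1) node type (label is an assumption)
    node_size: int                        # 1) size of the leaf node in bits
    frozen_pattern: str                   # 2) frozen-location pattern, e.g. "FFDDDDDD"
    llr_indexes: Tuple[int, ...]          # 3) LLR vectors updated for this node
    partial_sum_indexes: Tuple[int, ...]  # 4) partial sums updated for this node
    cycles: int                           # 5) clock cycles needed for this node


# A made-up two-word CROM, only to show how an FSM could step through it.
crom = [
    CromWord(3, "ML", 8, "FFDDDDDD", (0, 1), (0,), 4),
    CromWord(3, "ML", 8, "FDDDDDDD", (1,), (0, 1), 5),
]

# Sum of the per-leaf cycle fields (not claimed to equal the decoder latency).
leaf_cycles = sum(word.cycles for word in crom)
```

In this model, each FSM would read one `CromWord` at a time and derive its memory addresses and enable signals from the fields.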
We focus on the additional logic to support the MM features. Mode_Sel is a 2-bit control word to select the decoding mode of the MM-SCL decoder: 00, 01, and 10 for mode-1, mode-2, and mode-4, respectively. MC_Flag indicates that a mode change happens within a decoding process: at the beginning of a decoding process, MC_Flag is reset to 0; when a mode change happens within a decoding process, MC_Flag is set to 1 until the end of the decoding process.
Under mode-1, all four DCD$_i$ ($1 \le i \le 4$) access channel information of CMEM$_1$ and perform a list decoding algorithm with $L = 4$. Under mode-2, our MM-SCL decoder simultaneously decodes two received words with $L = 2$. DCD$_1$ and DCD$_2$ are used to decode the received word located in CMEM$_1$; DCD$_3$ and DCD$_4$ use the channel information from CMEM$_3$. Under mode-4, four received words are simultaneously decoded with $L = 1$: the received word in CMEM$_i$ is decoded by DCD$_i$ for $1 \le i \le 4$.
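The mapping between decoding paths and channel memories described above can be summarized in a few lines. The sketch below is a behavioral illustration only, assuming four decoding paths and the Mode_Sel encoding given earlier; the function and variable names are hypothetical and do not correspond to signals in the actual design.

```python
# Mode_Sel encoding from the text: 00 -> mode-1, 01 -> mode-2, 10 -> mode-4.
MODE_SEL_TO_MODE = {0b00: 1, 0b01: 2, 0b10: 4}


def dcd_to_cmem(mode_sel: int, n_d: int = 4) -> dict:
    """Return {DCD index: CMEM index} (both 1-based) for a given Mode_Sel."""
    p = MODE_SEL_TO_MODE[mode_sel]   # number of received words decoded in parallel
    list_size = n_d // p             # list size L used for each received word
    mapping = {}
    for i in range(1, n_d + 1):
        # Paths are grouped in blocks of `list_size`; each block reads the
        # channel memory whose index equals the first path of the block.
        mapping[i] = ((i - 1) // list_size) * list_size + 1
    return mapping


print(dcd_to_cmem(0b00))  # mode-1: every DCD reads CMEM 1
print(dcd_to_cmem(0b01))  # mode-2: DCD 1, 2 read CMEM 1; DCD 3, 4 read CMEM 3
print(dcd_to_cmem(0b10))  # mode-4: DCD i reads CMEM i
```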
Block MM-LC-AML performs the LC-AML decoding function for different types of leaf nodes and is supposed to output the most reliable list candidate for mode-4, the two most reliable list candidates for mode-2, and the four most reliable list candidates for mode-1. The architecture in Fig. 3 is for a fixed list size only. Here, we propose an MM-LC-AML unit (we take $n_d = 4$ and $M = 8$ as an example) shown in Fig. 7 to support the MM features. Under mode-1, all of SCLO$_1$, SCLO$_2$, SCLO$_3$, and SCLO$_4$ are used to decode a received word. Under mode-2, SCLO$_1$ and SCLO$_2$ are used for one of two received words; SCLO$_3$ and SCLO$_4$ are used for the other received word. Under mode-4, each of SCLO$_i$ ($1 \le i \le 4$) is used by an individual received word.
MM_MLD_S1 performs the same function as MLD_S1_q4 in Fig. 4 , except that MM_MLD_S1 supports MMs. This can be accomplished by simply adding an S4 block between S32TO4 and the adder at the bottom right of Fig. 4(a) . This implementation leads to a slightly longer critical path due to the extra block in the data path and, hence, a larger decoding latency. To address this issue, we redesign MLD_S1 for mode-4 and mode-2, respectively, called MLD_S1_q1 and MLD_S1_q2, shown in Fig. 8(a) and (b) . Symbol values for Z and F are 4-bit vectors 0000 and 1111, respectively. Hence, the symbol value calculated from Z and F is 11111111, which is guaranteed to be an invalid symbol value for our designs. If mode-2 is used, the control words for patterns FFDDDDDD, FDDDDDDD, and FFFDDDDD are 0, 1, and 2, respectively; for the remaining patterns, the control words are 3. If mode-4 is used, the control words for patterns FFDDDDDD, FDDDDDDD, and FFFDDDDD are 0, 0, and 1, respectively; for the remaining patterns, the control words are 2.
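The control-word assignment for mode-2 and mode-4 stated above amounts to a small lookup table. The following sketch restates it in code; the function name and the dictionary layout are illustrative assumptions, and only the pattern-to-word values come from the text.

```python
# Control words for the three special 8-bit frozen-location patterns.
CONTROL_WORDS = {
    2: {"FFDDDDDD": 0, "FDDDDDDD": 1, "FFFDDDDD": 2},  # mode-2
    4: {"FFDDDDDD": 0, "FDDDDDDD": 0, "FFFDDDDD": 1},  # mode-4
}

# Control word used for all remaining patterns in each mode.
DEFAULT_WORD = {2: 3, 4: 2}


def control_word(pattern: str, mode: int) -> int:
    """Return the control word for a frozen-location pattern under mode-2 or mode-4."""
    if mode not in CONTROL_WORDS:
        raise ValueError("only mode-2 and mode-4 are tabulated in the text")
    return CONTROL_WORDS[mode].get(pattern, DEFAULT_WORD[mode])
```

For instance, `control_word("FDDDDDDD", 2)` gives 1, while the same pattern under mode-4 shares control word 0 with FFDDDDDD.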
Actually, MLD_S1_q4, MLD_S1_q2, and MLD_S1_q1 are integrated together instead of three individual blocks in MM_MLD_S1, since they have the same circuitry for Step 0. Furthermore, the sorting units of the top row of Step 1 in these three designs can also be reused, because S8TO4 contains several S4 blocks and S2 blocks. The hardware sharing reduces the additional area for supporting MMs and improves throughput-area efficiency without increasing the critical path delay.
Compared with the AE-SCL decoder in Section V-B, to support MMs, the MM-SCL decoder needs additional hardware, including the additional three blocks of channel memories and the additional circuitry for mode-2 and mode-4 in MM_MLD_S1.
The FERs of the SC and CA-SCL algorithms and different modes for the MM-SCL decoder to decode all three codes are shown in Fig. 9. Fig. 9 shows that the smaller the mode index, the greater the list size and the better the FER. The CA-SCL-$i$ algorithm is the CA-SCL decoding algorithm in [3] with the list size $i$. The performance differences between our decoder and prior decoding algorithms with the same list size (mode-1 versus CA-SCL-4, mode-2 versus CA-SCL-2, mode-4 versus SC) are very small.
B. Simplified Modes and Mode Changes
The three modes described earlier provide three possible combinations of throughput and latency. To provide a wider range of throughput and latency, we consider simplified modes and mode changes by changing Mode_Sel and MC_Flag on the fly.
Simplified modes are motivated by the computational complexity and latency caused by the expansion-and-pruning process when it involves $L$ paths for a list size $L$. The idea for simplified modes is to make a switch at some point, so that while all $L$ paths are kept, the expansion-and-pruning process is carried out among every $L_s$ paths, where $L_s \mid L$. The computational complexity and the latency of the expansion-and-pruning process are reduced, since $L_s < L$, while the performance degradation is negligible when an appropriate switching point $\theta$ is used. Let us consider an $L_s$-simplified mode-$P$ (referred to as mode-$P$S$L_s$) with a switching point $\theta$. The first $\theta$ bits, $u_1^{\theta}$, are decoded with the list size $n_d/P$. Then, each list is divided into $n_d/(P L_s)$ groups, each of which has $L_s$ survivors. For the remaining bits, the expansion-and-pruning process happens only among the $L_s$ survivors of each group. For example, for a (32 768, 29 504) code, under mode-1S1 with $\theta = 21\,000$, $u_1^{21\,000}$ are decoded with $L = 4$. Then, four survivors are divided into four groups and each group has one survivor, i.e., $L_s = 1$. For the remaining 11 768 bits, $u_{21\,001}^{32\,768}$, the expansion-and-pruning process happens only for one survivor. In this case, there is no observed performance loss compared with mode-1, while the decoding latency of 6530 cycles is slightly shorter than that of mode-1, 6718 cycles. To reduce the decoding latency further, a smaller switching point can be used at the expense of small performance loss. With $\theta = 10\,000$, mode-1S1 has a decoding latency of 6206 cycles and has a performance loss of ∼0.03 dB compared with mode-1, as shown in Fig. 9, but still has a better performance and a slightly shorter decoding latency than mode-2. Hence, simplified modes provide a different tradeoff between latency (throughput) and performance.
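The difference between the full and the simplified expansion-and-pruning steps can be illustrated with a toy software model. The sketch below makes several assumptions that are not taken from the text: each path is expanded by one bit at a time (the actual decoder processes multi-bit leaf nodes), a smaller path metric means a more reliable path, and the names `expand_and_prune`, `decode_step`, and `penalty` are hypothetical.

```python
from typing import Callable, List, Tuple

# A path is (path metric, decoded bits); lower metric = more reliable (assumption).
Path = Tuple[float, Tuple[int, ...]]


def expand_and_prune(paths: List[Path],
                     penalty: Callable[[Tuple[int, ...], int], float],
                     keep: int) -> List[Path]:
    """Extend every path by bit 0 and bit 1, then keep the `keep` best candidates."""
    candidates = [(metric + penalty(bits, b), bits + (b,))
                  for metric, bits in paths for b in (0, 1)]
    candidates.sort(key=lambda c: c[0])
    return candidates[:keep]


def decode_step(survivors: List[Path],
                penalty: Callable[[Tuple[int, ...], int], float],
                l_s: int) -> List[Path]:
    """One decoding step in which pruning is confined to groups of l_s survivors.

    With l_s equal to the list size this is the ordinary list step; with a
    smaller l_s each group is sorted on its own, which is the idea behind the
    simplified modes.
    """
    result: List[Path] = []
    for start in range(0, len(survivors), l_s):
        group = survivors[start:start + l_s]
        result.extend(expand_and_prune(group, penalty, l_s))
    return result


# Toy data: four survivors and a penalty that makes bit 1 cost 2.0.
survivors = [(0.0, (0, 0)), (1.0, (0, 1)), (5.0, (1, 0)), (6.0, (1, 1))]
toy_penalty = lambda bits, b: 0.0 if b == 0 else 2.0

full = decode_step(survivors, toy_penalty, 4)        # sorts 8 candidates together
simplified = decode_step(survivors, toy_penalty, 1)  # four independent 2-way choices
```

In this toy run, the full step discards both paths starting with bit 1, whereas the simplified step with `l_s = 1` keeps one extension of every survivor and never compares metrics across groups.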
Under simplified modes, the number of simultaneously decoded received words is not changed within the decoding process. To support a wider range of throughput, it is also possible to perform a mode change, i.e., the number of received words simultaneously decoded can be changed within a decoding process. Here, we use mode-$P$C$P'$ with $\theta$ to represent that in the decoding process of $u_1^{\theta}$, $P$ received words $(y_1, y_2, \ldots, y_P)$ are decoded simultaneously with the list size $n_d/P$; in the decoding process of $u_{\theta+1}^{N}$, $P'$ ($P' > P$) received words $(y_1, y_2, \ldots, y_{P'})$ are decoded simultaneously with the list size $n_d/P'$. More specifically, for each of $y_1, y_2, \ldots, y_P$, $u_1^{\theta}$ are decoded with the list size $n_d/P$, then the $n_d/P'$ most reliable survivors are kept, and $u_{\theta+1}^{N}$ are decoded with the list size $n_d/P'$. After the switching point, only $n_d P/P'$ decoding paths are used for $y_1, y_2, \ldots, y_P$, and the remaining $n_d (P' - P)/P'$ decoding paths are used for $y_{P+1}, y_{P+2}, \ldots, y_{P'}$. For example, let us consider mode-1C4 with four decoding paths and $\theta = 10\,000$ for a (32 768, 29 504) polar code. For the first 10 000 bits of $y_1$, the SCL algorithm with $L = 4$ is used, and four survivors are generated at this point. Then, the most reliable survivor is selected by comparing the path metrics of these four survivors. Based on the knowledge of this most reliable survivor associated with $u_1^{10\,000}$ of $y_1$, the remaining 22 768 bits of $y_1$ are decoded with the SC algorithm by one decoding path, DCD$_1$. Meanwhile, after the switching point, three received words ($y_2$, $y_3$, and $y_4$) are fed into the other three decoding paths (DCD$_2$, DCD$_3$, and DCD$_4$) to be decoded with the SC algorithm. Therefore, mode changes provide a wider range of throughput than the simplified modes. Fig. 10 shows the decoding schedules for different modes and their control words. In terms of the throughput, mode-4 > mode-1C4 > mode-1S1 > mode-1.
The range of the throughput of mode-1S1 is from the throughput of mode-1 up to a quarter of that of mode-4. The range of the throughput of mode-1C4 is wider than that of mode-1S1, from the throughput of mode-1 to that of mode-4.
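The bookkeeping of a mode change, in which the number of simultaneously decoded words grows from P before the switching point to a larger value after it, can be sketched as follows. The function name, the dictionary keys, and the requirement that both mode indexes divide the number of decoding paths are assumptions made for this illustration.

```python
def mode_change_allocation(n_d: int, p: int, p_new: int) -> dict:
    """Path allocation when switching from mode-p to mode-p_new (p_new > p).

    n_d is the number of decoding paths; p and p_new are the numbers of
    received words decoded in parallel before and after the switching point.
    """
    if not (1 <= p < p_new <= n_d):
        raise ValueError("need 1 <= p < p_new <= n_d")
    if n_d % p or n_d % p_new:
        raise ValueError("p and p_new are assumed to divide n_d")
    return {
        "list_size_before": n_d // p,                      # n_d / P
        "list_size_after": n_d // p_new,                   # n_d / P'
        "paths_for_old_words": n_d * p // p_new,           # n_d * P / P'
        "paths_for_new_words": n_d * (p_new - p) // p_new, # n_d * (P' - P) / P'
        "new_words_admitted": p_new - p,
    }


# mode-1C4 with four decoding paths, as in the example in the text.
alloc = mode_change_allocation(n_d=4, p=1, p_new=4)
print(alloc)
```

For mode-1C4 this reproduces the example in the text: one path continues with the first received word using SC decoding, and three paths are released for three new received words.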
Therefore, simplified modes and mode changes allow the MM-SCL decoder to further reduce decoding latency without noticeable performance loss and to further improve throughput-area efficiency. They can also be used when decoding needs to finish as soon as possible for external reasons, such as buffer overflow.
In terms of the control of the decoding process, the FSMs of different decoding paths of simplified modes are synchronous on the decoding tree, because all the decoding paths are working on the same part of the decoding tree. However, these FSMs are not synchronous on the decoding tree when a mode change happens within a decoding process. Therefore, if the feature of mode changes within a decoding process is not needed, only one FSM for all decoding paths in CNTL is enough, but the hardware saving is very small, because the area of the control circuitry is a very small fraction of that of the entire decoder. For example, the area of FSMs of the MM-SCL decoder for the (1024, 512) code is 0.011 mm$^2$, 0.47% of the area of the whole decoder.
C. Synthesis Results
The MM-SCL decoder is implemented for the aforementioned three codes. For the (1024, 512) code, the areas of the channel memory and the ML decoding unit are listed in Table III . It shows that the increased area of the MM-SCL decoder over the AE-SCL decoder is dominated by the area of the three additional blocks of channel memory. Due to hardware sharing, the increased area of the ML decoding unit is small. Synthesis results of MM-SCL decoders for different polar codes are listed in Tables IV-VI. The decoding latency of mode-2 is smaller than that of mode-1, and the decoder has the smallest decoding latency under mode-4. This is because MLD_S1_q2 and MLD_S1_q1 have shorter data paths. Therefore, in MM-LC-AML, three stages and two stages of pipeline registers are used by the circuitry for mode-2 and mode-4, respectively. Mode-1S1 can have a smaller latency than mode-1 and mode-2.
Compared with the AE-SCL decoder, the MM-SCL decoder under mode-1 has a smaller throughput-area efficiency due to the additional circuitry for supporting MMs. However, the MM-SCL decoder provides multiple choices of output throughput and decoding latency, which is more suitable for variable communication channels and applications.
Compared with the decoder in [26], for N = 1024 and N = 8192, the MM-SCL decoder has a smaller area and a better throughput-area efficiency. For N = 32 768, the area of the MM-SCL decoder is bigger than that of the decoder in [26], because the additional circuitry to support MMs is larger than the saving due to the low-complexity ML decoding unit in the MM-SCL decoder. For the (1024, 512) code, under mode-1, mode-2, and mode-4, the throughput-area efficiency of the MM-SCL decoder is 1.58, 3.5, and 8.36 times that of the decoder in [26], respectively.
Compared with the decoders in [10], [23], and [24], the advantage in throughput-area efficiency of the MM-SCL decoder is more significant. This advantage comes from two aspects. The first is that the tree-based low-latency SCL architecture in [26] is adopted for the MM-SCL decoder, which helps to reduce the decoding latency. The second is the low-complexity AML decoding unit. For N = 1024, the throughput-area efficiency of the MM-SCL decoder under mode-1 is 2.73, 6.71, and 2.59 times those of the SCL decoders in [10], [23], and [24], respectively. When mode-4 is used, the ratios of the throughput-area efficiency of the MM-SCL decoder over those of the SCL decoders in [10], [23], and [24] and that of the semiparallel SC decoder in [12] are 14.38, 35.43, 13.65, and 8.32, respectively.
For N = 32 768, decoding latencies and throughputs with respect to different switching points of mode-1S1 are also provided. A smaller switching point leads to a smaller latency. When the switching point is 10 000, the latency of mode-1S1 is even smaller than that of mode-2. Compared with mode-1, the improvements in throughput and latency are ∼8%.
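The ∼8% figure can be cross-checked against the cycle counts quoted in Section VI-B for the (32 768, 29 504) code: 6718 cycles for mode-1 and 6206 cycles for mode-1S1 with a switching point of 10 000. The short calculation below assumes that both modes run at the same clock frequency and that throughput is inversely proportional to the number of decoding cycles; these assumptions and the variable names are ours.

```python
cycles_mode1 = 6718        # mode-1 latency in cycles (from the text)
cycles_mode1s1 = 6206      # mode-1S1 latency with switching point 10 000 (from the text)

# Relative latency reduction of mode-1S1 with respect to mode-1.
latency_reduction = (cycles_mode1 - cycles_mode1s1) / cycles_mode1

# Relative throughput gain, assuming throughput ~ 1 / cycles at a fixed clock.
throughput_gain = cycles_mode1 / cycles_mode1s1 - 1

print(f"latency reduction: {latency_reduction:.1%}")
print(f"throughput gain:   {throughput_gain:.1%}")
```

Under these assumptions the latency drops by roughly 7.6% and the throughput rises by roughly 8.3%, both consistent with the stated ∼8%.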
VII. CONCLUSION
In this paper, the divide-and-conquer method is applied to the SC-based algorithms in the probability domain. By extending this idea, a divide-and-conquer AML decoding unit for the SCL-based polar decoder is proposed. By examining frozen-location patterns of polar codes, an efficient hardware design for a simplified divide-and-conquer AML decoding unit is developed. To adapt to different throughput and latency requirements, the MM-SCL polar decoder is proposed in this paper. Synthesis results show that our implementations for our MM-SCL decoder and SCL decoder with the LC-AML unit achieve better area efficiencies than the existing SCL polar decoders.
APPENDIX
PROOF OF PROPOSITION 1
1) First, $0 < z_{1,1} = 2\epsilon - \epsilon^2 = 1 - (1 - \epsilon)^2 < 1$. Second, $0 < z_{1,2} = \epsilon^2 < 1$. Then, by induction, for $i \ge 1$ and $1 \le j \le 2^i$, $0 < z_{i,j} < 1$ is satisfied.
2) For any $i \ge 1$ and $1 \le j \le 2^i$, $z_{i,2j-1} - z_{i,2j} = 2z_{i-1,j} - z_{i-1,j}^2 - z_{i-1,j}^2 = 2z_{i-1,j}(1 - z_{i-1,j})$. By Proposition 1.1), $z_{i,2j-1} - z_{i,2j} > 0 \Rightarrow z_{i,2j-1} > z_{i,2j}$.
3) By Proposition 1.2), $z_{i,4j-3} > z_{i,4j-2}$ and $z_{i,4j-1} > z_{i,4j}$. Moreover, $z_{i,4j-2} - z_{i,4j-1} = 2z_{i-2,j}^2 (1 - z_{i-2,j})^2$. By Proposition 1.1), $z_{i,4j-2} - z_{i,4j-1} > 0 \Rightarrow z_{i,4j-2} > z_{i,4j-1}$. Therefore, $z_{i,4j-3} > z_{i,4j-2} > z_{i,4j-1} > z_{i,4j}$.
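The three claims of the proof can be checked numerically for small depths. The sketch below assumes the recursion implied by the proof, namely that each value $z$ at one level produces the ordered pair $(2z - z^2,\ z^2)$ at the next level, starting from a single value equal to the erasure probability of the binary erasure channel. The names `bec_z`, `check_proposition`, and `eps` are chosen for illustration, and exact rational arithmetic is used to avoid rounding effects.

```python
from fractions import Fraction
from typing import List


def bec_z(eps: Fraction, levels: int) -> List[Fraction]:
    """Return the list z_{levels,1..2^levels} for erasure probability eps."""
    z = [eps]
    for _ in range(levels):
        nxt = []
        for v in z:
            nxt.append(2 * v - v * v)  # odd-indexed child, z_{i,2j-1}
            nxt.append(v * v)          # even-indexed child, z_{i,2j}
        z = nxt
    return z


def check_proposition(eps: Fraction, levels: int) -> bool:
    """Check parts 1) to 3) of the proof at the given level (levels >= 2)."""
    z = bec_z(eps, levels)
    in_range = all(0 < v < 1 for v in z)                       # part 1)
    pairs = all(z[k] > z[k + 1] for k in range(0, len(z), 2))  # part 2)
    quads = all(z[k] > z[k + 1] > z[k + 2] > z[k + 3]
                for k in range(0, len(z), 4))                  # part 3)
    return in_range and pairs and quads


print(bec_z(Fraction(1, 2), 2))
print(check_proposition(Fraction(1, 2), 5))
```

Such a check is no substitute for the proof, but it can catch indexing mistakes when the recursion is implemented in a code-construction script.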
