Polar codes, as the first provably capacity-achieving error-correcting codes, have received much attention in recent years. However, the decoding performance of polar codes with the traditional successive-cancellation (SC) algorithm cannot match that of low-density parity-check or Turbo codes. Because the SC list (SCL) decoding algorithm can significantly improve the error-correcting performance of polar codes, the design of SCL decoders is important for polar codes to be deployed in practical applications. However, because the prior latency reduction approaches for SC decoders are not applicable to SCL decoders, these list decoders suffer from a long-latency bottleneck. In this paper, we propose a multibit-decision approach that can significantly reduce the latency of SCL decoders. First, we present a reformulated SCL algorithm that can perform intermediate decoding of 2 bits together. The proposed approach, referred to as the 2-bit reformulated SCL (2b-rSCL) algorithm, can reduce the latency of the SCL decoder from $3n - 2$ to $2n - 2$ clock cycles without any performance loss. Then, we extend the idea of 2-bit decision to the general case, and propose a general decoding scheme that can perform intermediate decoding of any $2^K$ bits simultaneously. This general approach, referred to as the $2^K$-bit reformulated SCL ($2^K$b-rSCL) algorithm, can reduce the overall decoding latency to as short as $n/2^{K-2} - 2$ cycles. Furthermore, on the basis of the proposed algorithms, very large-scale integration architectures for 2b-rSCL and 4b-rSCL decoders are synthesized. Compared with a prior SCL decoder, the proposed (1024, 512) 2b-rSCL and 4b-rSCL decoders can achieve 21% and 60% reduction in latency, 1.66 and 2.77 times increase in coded throughput with list size 2, and 2.11 and 3.23 times increase in coded throughput with list size 4, respectively.
I. INTRODUCTION
Polar codes are the first provably capacity-achieving codes [1]. Because of their explicit structure and regular encoding/decoding architectures, polar codes have received much attention in recent years. To date, many works have addressed theoretical analysis [1]-[6], [20] and hardware implementation [7]-[17], [21] of polar codes.
Although it has been proved that polar codes can achieve channel capacity asymptotically, the decoding performance of polar codes with the successive-cancellation (SC) algorithm [1] is inferior to that of low-density parity-check (LDPC) or Turbo codes. To improve the decoding performance, an SC list (SCL) decoding algorithm was presented in [4]. Simulation results show that polar codes using the SCL algorithm combined with a simple CRC check and systematic encoding can outperform LDPC codes of the same length and rate [4]. As a result, the SCL algorithm is believed to be the key for decoding of polar codes to be applicable in practical systems. However, because of the inherent serial nature of SC computation, SCL decoders suffer from long-latency and low-throughput problems similar to early SC decoders. Many techniques [3], [7], [12]-[15], [21] have been proposed to reduce the latency of SC decoders; however, these approaches cannot be directly used to reduce the latency of SCL decoders. As a result, to date the known very large-scale integration (VLSI) designs of SCL decoders [16], [17] still incur a decoding latency of $3n - 2$ clock cycles.
This paper presents multibit-decision approaches that can reduce the latency of SCL decoders. First, the 2-bit reformulated SCL (2b-rSCL) algorithm, which can perform intermediate decoding of 2 bits simultaneously, is presented to reduce the overall latency from $3n - 2$ cycles to $2n - 2$ cycles. Then, by generalizing the 2-bit-decision idea, we propose a general $2^K$-bit reformulated SCL ($2^K$b-rSCL) algorithm. By performing intermediate decoding of $2^K$ bits together, the proposed $2^K$b-rSCL decoder has latency as short as $n/2^{K-2} - 2$ cycles. To demonstrate the advantage of the proposed approaches, VLSI architectures of 2b-rSCL and 4b-rSCL decoders are synthesized.
Compared with the prior SCL decoder, the proposed (1024, 512) 2b-rSCL and 4b-rSCL decoders can achieve 21% and 60% reduction in latency, 1.66 and 2.77 times higher in coded throughput with list size 2, and 2.11 and 3.23 times higher in coded throughput with list size 4, respectively.
The rest of this paper is organized as follows. Section II gives a brief review of polar codes and the SCL algorithm. The proposed 2b-rSCL and $2^K$b-rSCL algorithms are presented in Section III. Section IV presents the hardware architectures of the 2b-rSCL and 4b-rSCL decoders. Hardware analysis and comparisons are discussed in Section V. Section VI draws the conclusions.
II. REVIEW OF POLAR CODES AND SCL ALGORITHM

A. Encoding Process of Polar Codes
Different from other block codes, an (n, k) polar code is generated in two steps. First, the k-bit source message is extended to an n-bit message $u = (u_1, u_2, \ldots, u_n)$ by padding $(n - k)$ 0 bits. Notice that because the postdecoding reliability of the n bit positions of u can be precomputed as in [1], the k most reliable positions of u are assigned the k information bits and the other $(n - k)$ least reliable positions are forced to be 0. Then, the n-bit message u is multiplied with an $n \times n$ generator matrix G to generate the transmitted codeword $x = (x_1, x_2, \ldots, x_n)$. Fig. 1 shows the implementation of a polar code encoder with n = 4.
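The two encoding steps can be illustrated with a short Python sketch. It assumes that G is the m-fold Kronecker power of the 2 × 2 kernel with no bit-reversal permutation, so that x = uG reduces to repeated application of the basic unit (out_1 = in_1 ⊕ in_2, out_2 = in_2). The function names and the example information positions are hypothetical; a real design would take the reliable positions from the precomputation in [1].

```python
def build_message(info_bits, info_positions, n):
    """Step 1: place the k information bits at the given 0-based positions.

    All remaining (n - k) positions are frozen to 0. The choice of
    info_positions is an assumption of this sketch, not a reliability ranking.
    """
    if len(info_bits) != len(info_positions):
        raise ValueError("need exactly one position per information bit")
    u = [0] * n
    for bit, pos in zip(info_bits, info_positions):
        u[pos] = bit
    return u


def polar_encode(u):
    """Step 2: compute x = u * G over GF(2) with XOR butterflies.

    Assumes G is the Kronecker power of [[1, 0], [1, 1]] (no bit reversal).
    """
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("code length must be a power of two")
    x = list(u)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]  # out_1 = in_1 XOR in_2, out_2 = in_2
        step *= 2
    return x


# Example: a (4, 2) code with hypothetical information positions 1 and 3.
u = build_message([1, 1], [1, 3], 4)
x = polar_encode(u)
```

Under this assumption the transform is its own inverse over GF(2), so applying `polar_encode` twice returns the original message.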
B. SC Decoding Algorithm
At the receiver end, because of the corruption from transmission noise, the transmitted codeword x changes to the received codeword $y = (y_1, y_2, \ldots, y_n)$. Because the required information bits are contained in u, a polar code decoder is needed to recover u from y. Arıkan [1] proposed an SC decoder to perform this recovery. Fig. 2 shows the example decoding procedure of this SC decoder for an n = 4 polar code based on the likelihood form. As seen in this figure, the SC decoder consists of $m = \log_2 n = 2$ stages, where each stage consists of two types of four-input-two-output units, referred to as the f unit and the g unit, respectively. In addition, a two-input-one-output hard-decision unit denoted as h is used at the last stage of the SC decoder (stage-2) to determine the estimate of $u_i$, referred to as $\hat{u}_i$. Besides, each f or g unit is labeled with a number to indicate the clock cycle index when it is activated. This labeling system reveals the inherent serial nature of the SC decoding algorithm. For example, in Fig. 2 the decoded bits are output at cycles 2, 3, 5, and 6.
In addition, the functions of the f and g units can be derived via the analogy between the polar code encoder and decoder. Figs. 1(b) and 2(b) show the general basic unit in the polar encoder and decoder, respectively. The basic unit of the encoder [Fig. 1(b)] performs a left-to-right transformation from $\mathrm{in}_1$ and $\mathrm{in}_2$ to $\mathrm{out}_1$ and $\mathrm{out}_2$. Hence, the transformation equations shown in Fig. 1(b) are
$$\mathrm{out}_1 = \mathrm{in}_1 \oplus \mathrm{in}_2, \quad \mathrm{out}_2 = \mathrm{in}_2 \tag{1}$$
where $\oplus$ represents the exclusive-or operation.
On the other hand, for the basic unit of the decoder shown in Fig. 2(b), as indicated in [1], its function is just a right-to-left estimation from the likelihoods of $\mathrm{out}_1$ and $\mathrm{out}_2$ to the likelihoods of $\mathrm{in}_1$ and $\mathrm{in}_2$. Therefore, according to the left-to-right transformation (1), the expected relationship from the estimates of $\mathrm{out}_1$ and $\mathrm{out}_2$ to the estimates of $\mathrm{in}_1$ and $\mathrm{in}_2$ can be derived as
$$\widehat{\mathrm{in}}_1 = \widehat{\mathrm{out}}_1 \oplus \widehat{\mathrm{out}}_2, \quad \widehat{\mathrm{in}}_2 = \widehat{\mathrm{out}}_2. \tag{2}$$
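With the help of the preceding guideline (2), we can now develop the functions of the f and g units shown in Fig. 2(b). First, we assume the previously decoded bits $\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_{i-1}$ have been determined as binary values $z_1, z_2, \ldots, z_{i-1}$, respectively. For simplicity, this event is denoted as $\hat{u}_1^{i-1} = z_1^{i-1}$. Then, the two outputs of the f unit, referred to as $c(0)$ and $c(1)$, are the likelihoods of $\mathrm{in}_1$ being 0 and 1, and according to (2) they can be derived as
$$c(0) = a(0)b(0) + a(1)b(1) \tag{3}$$
$$c(1) = a(0)b(1) + a(1)b(0) \tag{4}$$
where $a(0)$, $a(1)$, $b(0)$, and $b(1)$ are the inputs of the f or g unit, that is, the likelihoods of $\mathrm{out}_1$ and $\mathrm{out}_2$ being 0 or 1. Owing to the successive property of the SC algorithm, $d(0)$ and $d(1)$, as the outputs of the g unit, are determined by the estimate of $\mathrm{in}_1$. When it is 0, according to (2), we have
$$d(0) = a(0)b(0) \tag{5}$$
$$d(1) = a(1)b(1). \tag{6}$$
Similarly, when $\widehat{\mathrm{in}}_1$ is 1, according to (2), we have
$$d(0) = a(1)b(0) \tag{7}$$
$$d(1) = a(0)b(1). \tag{8}$$
As a result, by summarizing (5)-(8), we can obtain the unified function for the g unit
$$d(0) = a(\widehat{\mathrm{in}}_1)\,b(0) \tag{9}$$
$$d(1) = a(1 - \widehat{\mathrm{in}}_1)\,b(1). \tag{10}$$
Besides, for the h unit, because it is the hard-decision unit, we can obtain its function as follows:
$$\hat{u}_i = \begin{cases} 0, & \text{if } a(0) \ge a(1) \text{ or } u_i \text{ is a frozen bit} \\ 1, & \text{if } a(0) < a(1) \text{ and } u_i \text{ is a free bit.} \end{cases} \tag{11}$$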
In general, (3), (4), and (9)-(11) describe the likelihoodbased SC algorithm.
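The following Python sketch illustrates how the f, g, and h units could be derived from guideline (2) in the likelihood domain. It assumes that a and b are pairs holding the likelihoods of out_1 and out_2 being 0 or 1; the function names and the two-bit driver `sc_decode_pair` are hypothetical and only cover one basic unit.

```python
def f_unit(a, b):
    """Likelihoods of in_1 = 0 and in_1 = 1, using in_1 = out_1 XOR out_2."""
    return (a[0] * b[0] + a[1] * b[1],
            a[0] * b[1] + a[1] * b[0])


def g_unit(a, b, in1_hat):
    """Likelihoods of in_2 = 0 and in_2 = 1 once in_1 has been decided."""
    return (a[in1_hat] * b[0],
            a[1 - in1_hat] * b[1])


def h_unit(a, frozen):
    """Hard decision: frozen bits are forced to 0."""
    return 0 if frozen or a[0] >= a[1] else 1


def sc_decode_pair(a, b, frozen=(False, False)):
    """Decode the two bits of one basic unit successively (f, h, g, h)."""
    u1 = h_unit(f_unit(a, b), frozen[0])
    u2 = h_unit(g_unit(a, b, u1), frozen[1])
    return u1, u2


# out_1 is probably 0 and out_2 is probably 1, so in_2 = 1 and in_1 = 0 XOR 1 = 1.
print(sc_decode_pair((0.9, 0.1), (0.2, 0.8)))
```

The g unit cannot run until the h unit has fixed the first bit, which is the serial dependence that the cycle labels in Fig. 2 reflect.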
On the other hand, from the view of the code tree, the SC algorithm can be described as a path searching process. Fig. 3 shows an example SC decoding procedure over the code tree for n = 4 and k = 4. This n = 4 code tree consists of four levels, where each level represents a decoded bit. The value associated with each node is the likelihood-based metric for the decoding path from the root node to the current node. For example, 0.33 on the leftmost side indicates that for the path $\hat{u}_1 = 0$ and $\hat{u}_2 = 0$, denoted as the length-2 path (00), its metric is given by $\Pr(\hat{u}_1 = 0, \hat{u}_2 = 0) = 0.33$. The 0.12 on the rightmost side indicates that the metric for the path $\hat{u}_1 = 1$, $\hat{u}_2 = 1$, and $\hat{u}_3 = 1$, denoted as the length-3 path (111), is given by $\Pr(\hat{u}_1 = 1, \hat{u}_2 = 1, \hat{u}_3 = 1) = 0.12$. In particular, the path metrics associated with the nodes at the lowest level (level 4) represent the likelihoods for the different combinations of $(\hat{u}_1 \hat{u}_2 \hat{u}_3 \hat{u}_4)$. The valid output of this n = 4 SC polar decoder should be the length-4 path which has the largest metric at the lowest level. In this example it is (0010) with path metric $\Pr(\hat{u}_1 = 0, \hat{u}_2 = 0, \hat{u}_3 = 1, \hat{u}_4 = 0) = 0.19$. Notice that the aforementioned path metrics are calculated by the f or g units in the last stage of the SC decoder [e.g., stage-2 in Fig. 2(a)]. For the length-i path, its path metric is computed by the f or g unit associated with $\hat{u}_i$. For example, for the n = 4 polar code, the path metric for path $(\hat{u}_1 \hat{u}_2)$ is computed by the index-3 g unit shown in Fig. 2(a). Similarly, the path metric for path $(\hat{u}_1 \hat{u}_2 \hat{u}_3)$ is computed by the index-5 f unit shown in Fig. 2(a).
To find the decoding path with the largest metric, the SC algorithm adopts a locally optimal searching strategy. As shown in Fig. 3, the arrows represent the survival decoding path of the SC decoder. In the ith level, the SC decoder first visits the two child nodes (striped nodes in Fig. 3) that are connected to the current survival length-$(i - 1)$ path. Because the metrics of the length-i paths are associated with these child nodes, the SC decoder can then obtain the metrics of the length-i paths. After comparing the metrics, the SC decoder selects only the length-i path with the larger metric as the updated survival path; the path with the smaller metric will never be explored in the future. With this searching strategy, as shown in Fig. 3, the length-4 path (0010) with metric 0.19 is selected as the output of the SC decoder. In this example, the SC decoder works well because it finds the valid length-4 path with the largest metric.
C. SCL Decoding Algorithm
An essential drawback of the SC algorithm is that its searching strategy over the code tree is only locally optimal, but not globally optimal. As a result, in many cases the (n, k) SC decoder cannot find the length-n path with the largest metric. For example, if we apply the SC decoding approach to the code tree shown in Fig. 4, its output is (0010) with metric 0.19; however, the valid length-4 path with the largest metric should be (1000) with metric 0.23.
The reason for the inefficiency of the SC algorithm in this example is that sometimes the unexplored path, instead of the chosen survival path, has the larger path metric. On the basis of this observation, the SCL algorithm [4] was proposed to perform the searching process along multiple survival paths at the same time. Here the maximum number of survival paths is referred to as the list size (L). Fig. 4 shows an example for the n = 4 and k = 4 SCL decoder with L = 2. As shown in Fig. 4, at the ith level, the SCL decoder visits all the 2L child nodes (striped nodes in Fig. 4) that are connected to the length-$(i - 1)$ survival paths. After calculating all the 2L new path metrics associated with these child nodes, the SCL decoder selects the L length-i paths which have the larger metrics as the updated survival paths. From Fig. 4, it can be seen that the valid decoding path (1000), which could not be traced by the SC decoder before, can now be found by the SCL decoder.
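The difference between the two searching strategies can be sketched in Python as a search that keeps the `list_size` best paths at every level, where `list_size = 1` corresponds to SC. The leaf probabilities below are hypothetical values chosen for illustration on a three-level tree (they are not the metrics of Fig. 3 or Fig. 4), and the metric of a partial path is assumed to be the total probability of the leaves below it.

```python
# Hypothetical joint probabilities of all length-3 paths (they sum to 1).
LEAF_PROB = {
    (0, 0, 0): 0.22, (0, 0, 1): 0.18, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.40, (1, 0, 1): 0.02, (1, 1, 0): 0.02, (1, 1, 1): 0.01,
}


def path_metric(prefix):
    """Metric of a partial path: probability mass of all leaves below it."""
    return sum(p for leaf, p in LEAF_PROB.items() if leaf[:len(prefix)] == prefix)


def list_search(n, list_size):
    """Keep the list_size best paths at each level (list_size = 1 is SC)."""
    survivors = [()]
    for _ in range(n):
        # Each survivor is extended by 0 and by 1: 2L candidates per level.
        candidates = [path + (bit,) for path in survivors for bit in (0, 1)]
        candidates.sort(key=path_metric, reverse=True)
        survivors = candidates[:list_size]
    return survivors


sc_result = list_search(3, 1)[0]    # greedy choice at every level
scl_result = list_search(3, 2)[0]   # best of the two final survivors
print(sc_result, path_metric(sc_result))
print(scl_result, path_metric(scl_result))
```

With these values the greedy search commits to the prefix (0) because 0.55 > 0.45 and ends at (000) with metric 0.22, while the list search with L = 2 keeps the prefix (1) alive and reaches (100) with metric 0.40.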
III. PROPOSED REFORMULATED SCL ALGORITHMS
A. Long-Latency Problem of Original SCL Decoder
In general, the SCL algorithm can improve decoding performance significantly over the SC algorithm [4]. However, one of the major challenges for the practical use of the SCL decoder is the long-latency problem. Because an L-size (n, k) SCL decoder can be viewed as the combination of L copies of (n, k) SC component decoders (Fig. 5), an (n, k) SCL decoder needs the same $(2n - 2)$ cycles to process its f and g units as its SC component decoders do. In addition, since SCL decoders need to sort 2L path metrics and select the L largest metrics for each decoded bit (Fig. 5), an extra n cycles are needed to carry out the sorting and selecting function to avoid a long critical path [16]. Therefore, the latency of an (n, k) SCL decoder is $3n - 2$ cycles. As discussed in Section I, although some methods have been proposed to reduce the latency of SC decoders, these approaches cannot be directly applied to the SCL decoder. As a result, the latency of currently known SCL decoders [16], [17] is still very long. Table I shows an example decoding scheme of the conventional SCL decoder for an n = 4 polar code. In Table I, the symbols f and g represent the f and g units in each SC component decoder as shown in Fig. 2, respectively. Besides, the symbol s represents the path metric sorting and selecting operation for each intermediately decoded bit.
B. 2-b Reformulated SC List (2b-rSCL) Algorithm
As seen in Table I, more than 60% of the latency of the SCL decoder is due to the computation of f, g, and s in the last stage (stage-2 in Table I). This implies that reducing the latency of the last stage can lead to a significant reduction of the overall latency of the SCL decoder. Therefore, in this section, we propose to reformulate the original computation of the last stage. This reformulated computation in the last stage can save many clock cycles without any performance loss. Table I shows that the computation of the last stage can be viewed as multiple (f, s, g, s) functions that perform intermediate decoding of two consecutive bits $u_{2i-1}$ and $u_{2i}$. Because the f/g units and s in the last stage contribute to path metric calculation and selection, respectively, the goal of our reformulation of the last stage is to find a simplified method that can compute path metrics and sort/select them to perform intermediate decoding of $u_{2i-1}$ and $u_{2i}$ more quickly. Fig. 6(a) and (b) shows the block diagram of the original and reformulated SC component decoder for SCL decoding, respectively. From these two figures it can be seen that the reformulated SC decoder replaces the original stage-m with two new units, referred to as the metric computation unit (MCU) and the zero-forcing unit (ZFU), respectively. Besides, as shown in Fig. 5, a sorting block (the s symbol in Table I) is also needed to sort the path metrics output from all the L SC component decoders. Because the sorting block is an individual block that does not belong to any SC component decoder, in this subsection we do not discuss the sorting block but focus on the functions of the MCU and ZFU. The architecture of the sorting block will be presented in Section IV.
1) Metric Computation Unit: As shown in Fig. 7, the MCU calculates the likelihoods for the different combinations of $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$ with the use of the messages $a(0)$, $a(1)$, $b(0)$, and $b(1)$ output from stage-$(m - 1)$. The principle of this calculation can be derived from (5) to (8). Because $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$ are, respectively, the estimates of $\mathrm{in}_1$ and $\mathrm{in}_2$ in the last stage of the SC component decoder, (5)-(8) give
$$\begin{aligned}
\Pr(\hat{u}_{2i-1} = 0, \hat{u}_{2i} = 0, \hat{u}_1^{2i-2} = z_1^{2i-2}) &= a(0)b(0) \\
\Pr(\hat{u}_{2i-1} = 0, \hat{u}_{2i} = 1, \hat{u}_1^{2i-2} = z_1^{2i-2}) &= a(1)b(1) \\
\Pr(\hat{u}_{2i-1} = 1, \hat{u}_{2i} = 0, \hat{u}_1^{2i-2} = z_1^{2i-2}) &= a(1)b(0) \\
\Pr(\hat{u}_{2i-1} = 1, \hat{u}_{2i} = 1, \hat{u}_1^{2i-2} = z_1^{2i-2}) &= a(0)b(1)
\end{aligned} \tag{12}$$
where $\hat{u}_1^{2i-2} = z_1^{2i-2}$ denotes that the previously decoded bits $\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_{2i-2}$ are assumed to have been determined as $z_1, z_2, \ldots, z_{2i-2}$, respectively.
Equation (12) gives the joint likelihoods of $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$. Now we show that these joint likelihoods are just the actual metrics of the length-2i paths. Consider one of the current length-$(2i - 2)$ survival paths in the code tree as $(\hat{u}_1, \ldots, \hat{u}_{2i-2}) = (z_1, \ldots, z_{2i-2})$. As shown in Fig. 8, with the different combinations of $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$, this length-$(2i - 2)$ path can be extended to four length-2i paths as $(\hat{u}_1, \ldots, \hat{u}_{2i-2}, \hat{u}_{2i-1}, \hat{u}_{2i}) = (z_1, \ldots, z_{2i-2}, p, q)$, where p and q are binary 0 or 1. According to the definition of the path metric, with the four combinations of p and q, $\Pr(\hat{u}_{2i-1} = p, \hat{u}_{2i} = q, \hat{u}_1^{2i-2} = z_1^{2i-2})$ in (12) are just the actual metrics of the previous four extended length-2i paths. As a result, according to (12), with the knowledge of $a(0)$, $a(1)$, $b(0)$, and $b(1)$ output from stage-$(m - 1)$, we can directly obtain the actual path metrics of the four length-2i paths.
2) Zero-Forcing Unit: Although (12) provides a fast approach to compute the actual metrics of the length-2i paths, a postprocessing operation is still needed before inputting those calculated metrics into the sorting block, because the values of $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$ depend not only on the corresponding path metrics, but also on whether they are frozen bits or not. Notice that when the current decoded bit $u_i$ is a frozen bit, the paths with $\hat{u}_i = 1$ are not qualified and should never be selected even if they have larger metrics than their counterparts. As a result, to avoid selecting those unqualified paths, we need a ZFU to force the metrics of those unqualified paths to 0. The reason for this zero-forcing operation is that, since the SCL decoder only selects the L survival paths with larger metrics for each $\hat{u}_i$, the unqualified paths with metric value 0 will never be classified into the group of L paths with larger metrics. As a result, the validity of the function is guaranteed.
Equations (12) and (13) describe the reformulated function of the last stage of the SC component decoder. With the help of this reformulation, $\hat{u}_{2i-1}$ and $\hat{u}_{2i}$, as two successive decoded bits, can now be intermediately decoded simultaneously. Fig. 9(a) and (b) shows the decoding procedure of the original SCL decoding and the proposed reformulated approach with list size L, respectively. In the conventional SCL algorithm, by comparing their metrics, the L length-$(2i - 1)$ survival paths are selected from 2L candidates each time, and each selection can only perform intermediate decoding of one decoded bit [Fig. 9(a)]. Instead, in the reformulated approach, the L length-2i survival paths are selected from 4L candidates each time. As a result, the two successive bits can now be intermediately decoded simultaneously in each selection [Fig. 9(b)].
Considering that the proposed reformulation allows two bits to be intermediately decoded at the same time, this new SCL algorithm is referred to as the 2-bit reformulated SCL (2b-rSCL) algorithm and is described in Algorithm 1.
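A likelihood-domain sketch of one 2-bit decision step is given below in Python. It assumes the joint metric of the pair (p, q) follows from guideline (2) as a(p ⊕ q)·b(q), and that the zero-forcing step simply sets the metric of every path with a 1 in a frozen position to 0. The function names and the (path, metric) data layout are hypothetical.

```python
def mcu_2b(a, b):
    """Joint metrics of (u_{2i-1}, u_{2i}) = (p, q) from the stage-(m-1) messages."""
    return {(p, q): a[p ^ q] * b[q] for p in (0, 1) for q in (0, 1)}


def zfu_2b(metrics, first_frozen, second_frozen):
    """Force the metric of paths that set a frozen bit to 1 down to 0."""
    return {
        (p, q): 0.0 if (first_frozen and p == 1) or (second_frozen and q == 1) else m
        for (p, q), m in metrics.items()
    }


def extend_and_select(survivors, messages, frozen_pair, list_size):
    """One 2-bit step: build 4 candidates per survivor, keep the best list_size.

    survivors: list of bit tuples; messages: one (a, b) pair per survivor.
    """
    candidates = []
    for path, (a, b) in zip(survivors, messages):
        metrics = zfu_2b(mcu_2b(a, b), *frozen_pair)
        candidates.extend((path + pq, m) for pq, m in metrics.items())
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:list_size]


# Two survivors (L = 2) give 4L = 8 candidates; only the second bit is frozen.
best = extend_and_select(
    survivors=[(0, 0), (1, 0)],
    messages=[((0.30, 0.10), (0.25, 0.05)), ((0.20, 0.40), (0.10, 0.50))],
    frozen_pair=(False, True),
    list_size=2,
)
print(best)
```

Each selection here ranks 4L candidates at once, in contrast to the two separate 2L selections of the conventional bit-by-bit procedure.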
The proposed 2b-rSCL algorithm can greatly reduce the latency of the original SCL decoder. Recall that in the original searching procedure (Fig. 4), the SCL decoder needs to compute the path metrics associated with the striped nodes in each level of the code tree. On the other hand, because the 2b-rSCL decoder only needs to compute the metrics for length-2i paths, the metric computation for length-$(2i - 1)$ paths is totally avoided (Fig. 8). As a result, for the same code tree the 2b-rSCL decoder only needs to visit the striped nodes at even levels instead of at all the levels. For example, by comparing Figs. 4 and 10, it can be found that the reformulated SCL decoder does not need to visit the nodes at levels 1 and 3 anymore. As a result, this new decoding scheme leads to an immediate saving in clock cycles. Table II shows the example decoding scheme of the proposed 2b-rSCL decoder with n = 4. Here mc&zf in Table II denotes the metric computation and zero-forcing operations, which are described in detail in lines 10 to 18 of Algorithm 1. Compared with the scheme of the conventional SCL decoder (Table I), it can be seen that the reformulation at the last stage (stage-2 in this example) leads to a significant reduction in clock cycles. For the intermediate decoding of each two successive bits $u_{2i-1}$ and $u_{2i}$, the original SCL decoder (Table I) needs four cycles (f, s, g, s), whereas the 2b-rSCL decoder in Table II needs only two cycles (mc&zf, s). In general, for an (n, k) polar code, the overall latency of the 2b-rSCL decoder is reduced from $3n - 2$ to $2n - 2$ clock cycles.
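The cycle counts quoted above can be checked with a small Python sketch that lists one operation per clock cycle. It assumes the usual depth-first SC ordering (at each stage: f, recurse, g, recurse) and that each last-stage f or g is followed by its own sorting cycle s; the function name, the `K` parameter, and the labels such as `f1` (f unit at stage 1) are hypothetical, and the actual Tables I and II may arrange the same operations differently.

```python
def decoding_schedule(n, K=0):
    """Return the per-cycle operation list of an SCL decoder for length n.

    K = 0 models the conventional decoder; K >= 1 replaces the last K stages
    by one metric-computation/zero-forcing cycle followed by one sorting cycle.
    """
    m = n.bit_length() - 1
    if n < 2 or n != 1 << m or not 0 <= K <= m:
        raise ValueError("n must be a power of two and 0 <= K <= log2(n)")
    last_pe_stage = m - K
    leaf_ops = ["s"] if K == 0 else ["mc&zf", "s"]

    def visit(stage):
        if stage > last_pe_stage:
            return list(leaf_ops)
        ops = []
        for unit in ("f", "g"):
            ops.append(f"{unit}{stage}")
            ops.extend(visit(stage + 1))
        return ops

    return visit(1)


print(decoding_schedule(4, K=0))  # conventional SCL, 3n - 2 = 10 cycles
print(decoding_schedule(4, K=1))  # 2b-rSCL, 2n - 2 = 6 cycles
```

For n = 4 the conventional schedule has 10 entries and the reformulated one has 6, in line with $3n - 2$ and $2n - 2$.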
C. $2^K$-b Reformulated SC List ($2^K$b-rSCL) Algorithm
In Section III-B, we presented the 2b-rSCL algorithm, which can simultaneously perform intermediate decoding of 2 bits. In this section, we extend the prior approach to a more general case, and propose a new algorithm, referred to as the $2^K$-bit reformulated SC list ($2^K$b-rSCL) algorithm, which can perform intermediate decoding of $2^K$ bits simultaneously.
As shown in Fig. 11, the $2^K$b-rSCL decoder reformulates the last K stages of the original SCL decoder. Similar to the case of the 2b-rSCL decoder, the reformulated part of the $2^K$b-rSCL decoder consists of an MCU and a ZFU as well.
1) Metric Computation Unit: The function of the MCU in the $2^K$b-rSCL decoder is to compute the joint probabilities of the $2^K$ successive bits $\hat{u}_{2^K(i-1)+1}, \hat{u}_{2^K(i-1)+2}, \ldots, \hat{u}_{2^K i}$. Similar to the discussion in the 2-bit-decision case, we first investigate the transformation of $u_{2^K(i-1)+1}, \ldots, u_{2^K i}$.
As shown in Fig. 12, the transformation of $2^K$ successive bits can be viewed as multiplication with a matrix U, where U is the $2^K \times 2^K$ generator matrix G.
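Denote $\vec{u}_{2^K,i} \triangleq (u_{2^K(i-1)+1}, u_{2^K(i-1)+2}, \ldots, u_{2^K i})$ and $\overrightarrow{\mathrm{out}}_{2^K} \triangleq (\mathrm{out}_1, \mathrm{out}_2, \ldots, \mathrm{out}_{2^K})$; then we have
$$\overrightarrow{\mathrm{out}}_{2^K} = \vec{u}_{2^K,i}\, U. \tag{14}$$
In particular, if we denote the jth column vector of U as $U(j)$, then according to (14) we have
$$\mathrm{out}_j = \vec{u}_{2^K,i}\, U(j). \tag{15}$$
Equation (15) describes the left-to-right transformation of $u_{2^K(i-1)+1}, u_{2^K(i-1)+2}, \ldots, u_{2^K i}$ in the encoding phase.
Then, based on (15), the right-to-left guideline in the decoding procedure should be
$$\hat{\vec{u}}_{2^K,i} = \widehat{\overrightarrow{\mathrm{out}}}_{2^K}\, U^{-1} = \widehat{\overrightarrow{\mathrm{out}}}_{2^K}\, U. \tag{16}$$
According to (15) and (16), we have
$$\widehat{\mathrm{out}}_j = \hat{\vec{u}}_{2^K,i}\, U(j) \quad \text{and} \quad \hat{u}_{2^K(i-1)+j} = \widehat{\overrightarrow{\mathrm{out}}}_{2^K}\, U(j). \tag{17}$$
Note that in (16) we use the special property that $U^{-1} = U$. As shown in Fig. 13, the inputs of the MCU are $a_1(0), a_1(1), \ldots, a_{2^{K-1}}(0), a_{2^{K-1}}(1)$ and $b_1(0), b_1(1), \ldots, b_{2^{K-1}}(0), b_{2^{K-1}}(1)$. With the help of (17), we can obtain the joint probabilities of $\hat{u}_{2^K(i-1)+1}, \hat{u}_{2^K(i-1)+2}, \ldots, \hat{u}_{2^K i}$ as follows:
$$P(\alpha_1 \alpha_2 \cdots \alpha_{2^K}) = \prod_{j=1}^{2^{K-1}} a_j\big(\vec{\alpha}_{2^K}\, U(j)\big)\, b_j\big(\vec{\alpha}_{2^K}\, U(2^{K-1} + j)\big) \tag{18}$$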
where $\vec{\alpha}_{2^K} \triangleq (\alpha_1, \alpha_2, \ldots, \alpha_{2^K})$ is a vector consisting of $2^K$ binary values 0 or 1. According to (18), because $P(\alpha_1 \alpha_2 \cdots \alpha_{2^K})$ is the joint probability of $\hat{u}_{2^K(i-1)+1} = \alpha_1$, $\hat{u}_{2^K(i-1)+2} = \alpha_2$, ..., $\hat{u}_{2^K i} = \alpha_{2^K}$ and $\hat{u}_1^{2^K(i-1)} = z_1^{2^K(i-1)}$, it is just the metric of the length-$2^K i$ path $(z_1^{2^K(i-1)} \alpha_1 \alpha_2 \cdots \alpha_{2^K})$. Therefore, with $a_1(0), a_1(1), \ldots, a_{2^{K-1}}(0), a_{2^{K-1}}(1)$ and $b_1(0), b_1(1), \ldots, b_{2^{K-1}}(0), b_{2^{K-1}}(1)$ output from stage-$(m - K)$ and (18), the MCU can directly output the actual metrics of the $2^{2^K}$ length-$2^K i$ paths.
[Algorithm 2: $2^K$-bit SCL Decoding ($2^K$b-rSCL) With List Size L for (n, k) Polar Codes]
2) Zero-Forcing Unit: Similar to the 2-bit-decision case, the function of the ZFU in the $2^K$-bit-decision scenario is also to force the metrics of unqualified length-$2^K i$ paths to 0. Therefore, we can derive the function of the ZFU for the $2^K$b-rSCL decoder as follows.
Assign $M(\alpha_1 \alpha_2 \cdots \alpha_{2^K}) = P(\alpha_1 \alpha_2 \cdots \alpha_{2^K})$ for path $(z_1^{2^K(i-1)} \alpha_1 \alpha_2 \cdots \alpha_{2^K})$ with $\alpha_1, \alpha_2, \ldots, \alpha_{2^K} \in \{0, 1\}$.
If $u_{2^K(i-1)+1}$ is frozen, then reassign all $M(1 \alpha_2 \alpha_3 \cdots \alpha_{2^K}) = 0$.
If $u_{2^K(i-1)+2}$ is frozen, then reassign all $M(\alpha_1 1 \alpha_3 \cdots \alpha_{2^K}) = 0$.
...
If $u_{2^K i}$ is frozen, then reassign all $M(\alpha_1 \alpha_2 \cdots \alpha_{2^K - 1} 1) = 0$. (19)
With the MCU in (18) and the ZFU in (19), we can develop a general $2^K$b-rSCL decoding algorithm as shown in Algorithm 2. Fig. 14 shows the decoding procedure of the $2^K$b-rSCL algorithm with list size L. It can be seen that during the decoding procedure $2^{2^K} L$ metrics of candidate paths are compared each time, and the L paths with larger $M(\alpha_1 \alpha_2 \cdots \alpha_{2^K})$ metrics are selected as the survival paths. As a result, $2^K$ successive bits can be determined simultaneously. Table III lists the latency of the $2^K$b-rSCL decoder with different values of K for (n, k) polar codes. From this table, it can be seen that the 2b-rSCL decoder in Section III-B can be viewed as the specific case of $2^K$b-rSCL with K = 1. For a general $2^K$b-rSCL decoder, the latency is $n/2^{K-2} - 2$ clock cycles. Therefore, as K increases, the overall latency is reduced. In the extreme case, when K reaches $m = \log_2 n$, the $2^K$b-rSCL decoder becomes a maximum likelihood (ML) decoder with a latency of only two cycles.
Although increasing K reduces the latency, K cannot be set too large for hardware implementation. That is because when K increases, the number of candidate paths, $2^{2^K}$, increases rapidly. As a result, a large K causes a large number of path candidates and hence significantly increases the overall complexity of metric computation and path metric comparison. For example, when $K = m = \log_2 n$ (ML decoder), the number of path candidates is $2^n$. For (1024, 512) polar codes, that means $2^{1024}$ path metrics need to be computed and compared. The implementation of these extensive operations would require an extremely large silicon area and an extremely long critical path. As a result, for practical implementation K is suggested to be no more than 3, which can achieve a good tradeoff between latency reduction and computation overhead.
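The general MCU and ZFU can be sketched in Python as follows. The sketch assumes that U is the K-fold Kronecker power of the 2 × 2 kernel, that the metric of a candidate is the product of the input likelihoods evaluated at the re-encoded bits (α·U(j) over GF(2)), and that the a inputs feed the first half of the outputs and the b inputs the second half. All function names are hypothetical.

```python
from itertools import product


def kernel_power(K):
    """U = K-fold Kronecker power of [[1, 0], [1, 1]] over GF(2)."""
    U = [[1]]
    for _ in range(K):
        size = len(U)
        top = [row + [0] * size for row in U]
        bottom = [row + row for row in U]
        U = top + bottom
    return U


def mcu(a, b, K):
    """Metrics of all 2^(2^K) candidate extensions in the likelihood domain.

    a and b are lists of 2^(K-1) pairs (likelihood of 0, likelihood of 1).
    """
    size = 1 << K
    if len(a) != size // 2 or len(b) != size // 2:
        raise ValueError("a and b must each hold 2^(K-1) likelihood pairs")
    U = kernel_power(K)
    inputs = list(a) + list(b)
    metrics = {}
    for alpha in product((0, 1), repeat=size):
        metric = 1.0
        for j in range(size):
            # Re-encoded bit out_j = alpha * U(j) over GF(2).
            out_j = sum(alpha[r] & U[r][j] for r in range(size)) % 2
            metric *= inputs[j][out_j]
        metrics[alpha] = metric
    return metrics


def zfu(metrics, frozen):
    """Zero the metric of every candidate that sets a frozen position to 1."""
    return {
        alpha: 0.0 if any(fz and bit for fz, bit in zip(frozen, alpha)) else m
        for alpha, m in metrics.items()
    }


# K = 2: four bits are decided together from a_1, a_2, b_1, b_2.
a = [(0.9, 0.1), (0.6, 0.4)]
b = [(0.7, 0.3), (0.2, 0.8)]
masked = zfu(mcu(a, b, K=2), frozen=(True, False, False, False))
```

For K = 2 this reproduces the product of $a_1$, $a_2$, $b_1$, and $b_2$ terms quoted for the 4b-rSCL decoder in Section IV-B, and for K = 1 it reduces to the 2-bit case.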
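The tradeoff between latency and the number of candidates can be tabulated with a few lines of Python. The sketch uses the latency expressions stated in the text ($3n - 2$ for the conventional decoder, $n/2^{K-2} - 2$ for the reformulated one) and the candidate count $2^{2^K}$ per survival path; the function names are hypothetical and the printed list size is only an example.

```python
def latency_cycles(n, K):
    """Clock cycles per codeword: 3n - 2 for K = 0, n / 2^(K-2) - 2 otherwise."""
    if K == 0:
        return 3 * n - 2
    return (4 * n) // (1 << K) - 2


def candidates_per_path(K):
    """Each survival path is extended to 2^(2^K) candidates per decision."""
    return 1 << (1 << K)


n, list_size = 1024, 4
for K in range(0, 4):
    print(K, latency_cycles(n, K), candidates_per_path(K) * list_size)
```

Going from K = 2 to K = 3 halves the cycle count from 1022 to 510 for n = 1024, but multiplies the number of metrics to sort per decision by 16.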
D. Simulation Results
Because the proposed reformulated SCL decoding algorithms only avoid the unnecessary metric computations but do not change the accuracy of metric computation, there is no performance loss for the reformulated SCL algorithms over the original SCL algorithm. This is consistent with the simulation results shown in Fig. 15 .
IV. PROPOSED REFORMULATED SCL ARCHITECTURE
In this section, the hardware architectures of the reformulated SCL ($2^K$b-rSCL) decoders are presented. Different values of K correspond to different $2^K$b-rSCL decoders. For simplicity, in this section we focus on the K = 1 and K = 2 cases, which correspond to the 2b-rSCL decoder and the 4b-rSCL decoder. Architectures for other values of K can be developed in a similar way.
As shown in Fig. 11 , the difference between SC component decoder of 2b-rSCL or 4b-rSCL decoders and that of original SCL decoder is on the last one or two stages. Therefore, the other stages (f/g units) of original SC decoder are still used in the reformulated SCL decoders. As a result, in this section we focus on the architecture design of f/g units in the SC component decoder, MCU/ZFU in the reformulated stage, and metric sorting block, respectively.
A. Processing Element for f/g Units
As indicated in Section II, the likelihood-based functions of the f and g units are described in (3), (4), (9), and (10). However, these equations contain multiplications, which are costly in hardware. As a result, to simplify computation, log-likelihood-based f and g units are used in our design. In this case, the likelihood-based (3), (4), (9), and (10) are reformulated to the following equations, in which a, b, c, and d now denote log-likelihoods:
$$c(0) = \max{}^*\big(a(0) + b(0),\; a(1) + b(1)\big) \tag{20}$$
$$c(1) = \max{}^*\big(a(0) + b(1),\; a(1) + b(0)\big) \tag{21}$$
$$d(0) = a(\widehat{\mathrm{in}}_1) + b(0) \tag{22}$$
$$d(1) = a(1 - \widehat{\mathrm{in}}_1) + b(1) \tag{23}$$
where $\max^*(x, y) = \max(x, y) + \ln(1 + e^{-|x - y|})$ represents the Jacobian logarithm. Notice that (20) and (21) contain a logarithmic operation ($\ln(\cdot)$), which needs to be implemented using a complex lookup table with a long critical path. Fortunately, in [16] it was shown that the logarithmic term can be ignored with negligible performance loss. As a result, (20) and (21) can be further simplified as
$$c(0) = \max\big(a(0) + b(0),\; a(1) + b(1)\big) \tag{24}$$
$$c(1) = \max\big(a(0) + b(1),\; a(1) + b(0)\big). \tag{25}$$
In general, (22)-(25) describe the log-likelihood version of the f and g units. With these equations, the basic processing element (PE) of the SC component decoder, which contains an f unit and a g unit, is developed and is shown in Fig. 16.
Here, C&S unit represents the combined comparator and 2-to-1 selector. In addition, ctrl signal is the control signal that indicates whether the PE functions as an f unit or a g unit.
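A behavioral Python sketch of such a processing element is shown below. It assumes the log-domain f unit combines the two sums with either the exact Jacobian logarithm or a plain max (the simplification attributed to [16]), and that the g unit becomes two additions selected by the earlier decision. The function names, the `exact` flag, and the string-valued `ctrl` argument are hypothetical stand-ins for the hardware control signal.

```python
import math


def max_star(x, y):
    """Jacobian logarithm: ln(e^x + e^y) = max(x, y) + ln(1 + e^-|x - y|)."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def processing_element(a, b, ctrl, in1_hat=0, exact=False):
    """Log-likelihood PE: ctrl == "f" selects the f unit, otherwise the g unit.

    a and b are pairs of log-likelihoods; in1_hat is only used by the g unit.
    """
    if ctrl == "f":
        combine = max_star if exact else max  # plain max drops the ln term
        return (combine(a[0] + b[0], a[1] + b[1]),
                combine(a[0] + b[1], a[1] + b[0]))
    return (a[in1_hat] + b[0], a[1 - in1_hat] + b[1])


a = (math.log(0.9), math.log(0.1))
b = (math.log(0.8), math.log(0.2))
print(processing_element(a, b, "f", exact=True))   # exact log of (0.74, 0.26)
print(processing_element(a, b, "f"))               # max approximation
print(processing_element(a, b, "g", in1_hat=1))
```

In this example the approximation returns ln 0.72 instead of ln 0.74 for the first output, which gives a feel for the size of the dropped correction term.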
B. MCU and ZFU
As shown in Fig. 11, the MCU and ZFU are the two essential parts that allow $2^K$b-rSCL decoders to decide multiple bits. Similar to the case in Section IV-A, the likelihood-based functions of the MCU and ZFU need to be transformed to a log-likelihood version as well.
For the K = 1 case, which corresponds to the 2b-rSCL decoding algorithm, the likelihood-based functions of the MCU and ZFU have been described in Algorithm 1 (lines 10-18). For the MCU, according to the transformation principle in Section IV-A, $P(pq) = a(p \oplus q)\,b(q)$ in lines 12 and 13 of Algorithm 1 is transformed to $a(p \oplus q) + b(q)$. In addition, since $\ln 0$ is negative infinity, $M(pq) = 0$ (lines 17 and 18 in Algorithm 1), as the likelihood-based function of the ZFU, is reformulated to $M(pq) = -\mathrm{Inf}$, where $-\mathrm{Inf}$ represents negative infinity. As a result, the hardware architecture of the MCU and ZFU for the 2b-rSCL decoder is developed as shown in Fig. 17(a). Here ctrl1 and ctrl2 shown in Fig. 17(a) are the two control signals that indicate whether $u_{2i-1}$ and $u_{2i}$ are information bits or not.
For the K = 2 case, which corresponds to the 4b-rSCL decoding algorithm, the likelihood-based functions of the MCU and ZFU can be derived from Algorithm 2 (lines 10-19). For the function of the MCU (lines 12 and 13), in the K = 2 case it is $P(\alpha_1 \alpha_2 \alpha_3 \alpha_4) = a_1(\alpha_1 \oplus \alpha_2 \oplus \alpha_3 \oplus \alpha_4)\, a_2(\alpha_2 \oplus \alpha_4)\, b_1(\alpha_3 \oplus \alpha_4)\, b_2(\alpha_4)$. Then, with the likelihood-to-log-likelihood transformation, it is reformulated as $P(\alpha_1 \alpha_2 \alpha_3 \alpha_4) = a_1(\alpha_1 \oplus \alpha_2 \oplus \alpha_3 \oplus \alpha_4) + a_2(\alpha_2 \oplus \alpha_4) + b_1(\alpha_3 \oplus \alpha_4) + b_2(\alpha_4)$. For the function of the ZFU (lines 16-19), in the K = 2 case, it is $M(\alpha_1 \alpha_2 \alpha_3 \alpha_4) = 0$. Therefore, its log-likelihood version is $M(\alpha_1 \alpha_2 \alpha_3 \alpha_4) = -\mathrm{Inf}$. As a result, the architecture of the MCU and ZFU for 4b-rSCL decoders is developed as shown in Fig. 17(b). Here ctrl1, ctrl2, ctrl3, and ctrl4 shown in Fig. 17(b) are the four control signals that indicate whether $u_{4i-3}$, $u_{4i-2}$, $u_{4i-1}$, and $u_{4i}$ are information bits or not.
C. Metric Sorting Block
After the MCU and ZFU generate the metrics for the different paths, a sorting block is needed to compare those $2^{2^K} L$ metrics and select the L paths with larger metrics. In the proposed designs, we use the bitonic sorting algorithm [19] to find the L larger metrics. Fig. 18 shows an example architecture of the proposed eight-input four-output metric sorting block.
It contains a 4 × 4 increasing-order bitonic sorter and a 4 × 4 decreasing-order bitonic sorter. Each bitonic sorter is constructed from 2 × 2 increasing-order sorters (IOS) and 2 × 2 decreasing-order sorters (DOS). With the help of the two 4 × 4 bitonic sorters, $\mathrm{in}_1$-$\mathrm{in}_4$ are sorted into an array with increasing order ($i_1 \le i_2 \le i_3 \le i_4$) while $\mathrm{in}_5$-$\mathrm{in}_8$ are sorted into an array with decreasing order ($d_1 \ge d_2 \ge d_3 \ge d_4$). Then, these two presorted arrays are sent to a stage of C&S units. At the output of these C&S units, the four larger elements among $\mathrm{in}_1$-$\mathrm{in}_8$ are found as $\mathrm{out}_j = \max(i_j, d_j)$, where $j = 1, \ldots, 4$. For the details of the bitonic sorter, the reader is referred to [19].
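The selection principle of this block can be checked with a short Python sketch. Python's built-in `sorted` stands in for the two half-size bitonic sorters, which is an assumption made for brevity, and the function name is hypothetical; only the final compare-and-select stage is modeled faithfully.

```python
def select_larger_half(values):
    """Return the larger half of the inputs (in no particular order).

    The first half is sorted in increasing order, the second half in
    decreasing order, and one stage of compare-and-select units keeps
    max(i_j, d_j) for each position j.
    """
    n = len(values)
    if n < 2 or n % 2:
        raise ValueError("need an even number of inputs")
    half = n // 2
    increasing = sorted(values[:half])                 # stand-in for the IOS-based sorter
    decreasing = sorted(values[half:], reverse=True)   # stand-in for the DOS-based sorter
    return [max(i, d) for i, d in zip(increasing, decreasing)]


metrics = [5, 1, 7, 3, 2, 8, 6, 4]
print(select_larger_half(metrics))
```

The outputs are the four largest inputs but are not themselves sorted, which is enough when the decoder only needs to know which paths survive. Metrics that were forced to negative infinity by the ZFU are handled by the same comparisons.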
As indicated in [19], the critical path delay of a $2^s \times 2^s$ bitonic sorter is $(1 + 2 + \cdots + s)\,T_{C\&S} = \frac{s(s+1)}{2}\,T_{C\&S}$, where $T_{C\&S}$ is the critical path delay of a C&S unit. Therefore, a general $2^s$-input $2^{s-1}$-output metric sorting block consisting of two $2^{s-1} \times 2^{s-1}$ bitonic sorters and a stage of C&S units has an overall critical path delay of $\big(1 + 2 + \cdots + (s - 1) + 1\big)\,T_{C\&S} = \big(1 + \frac{(s-1)s}{2}\big)\,T_{C\&S}$.
Notice that the metric sorting block is a $2^s$-input $2^{s-1}$-output ($2^s \times 2^{s-1}$) sorter that can find the $2^{s-1}$ largest elements among the $2^s$ inputs. The proposed L-size $2^K$b-rSCL decoder only needs to find the L largest metrics among $2^{2^K} L$ candidates; hence the $2^s \times 2^{s-1}$ sorter is sufficient for this sorting task and the full-size ($2^s \times 2^s$) sorting function is not needed.
D. Data Path Balancing
As discussed in Section IV-C, the critical path delay of the $2^s$-input $2^{s-1}$-output metric sorting block is $\left(1 + \frac{(s-1)s}{2}\right) T_{C\&S}$. This is much larger than the critical path delay of the PE or MCU/ZFU. For example, for a 4b-rSCL decoder with $L = 2$, $s = \log_2(16 \times 2) = 5$. Then the critical path delay of the metric sorting block is $11\,T_{C\&S}$, whereas the critical path delays of the PE and MCU/ZFU are less than $3\,T_{C\&S}$. Because the clock speed is upper-bounded by the critical path delay, the throughput of the reformulated SCL decoder is limited by the long critical path of the metric sorting block.
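The delay expressions above are easy to evaluate numerically. The sketch below is a plain arithmetic helper under the stated formulas; the function names are hypothetical and delays are expressed in units of $T_{C\&S}$.

```python
def bitonic_sorter_delay(s):
    """Critical path of a 2^s x 2^s bitonic sorter, in units of T_C&S.

    Implements 1 + 2 + ... + s = s(s + 1) / 2.
    """
    return s * (s + 1) // 2


def metric_sorting_delay(num_candidates):
    """Critical path of a 2^s-input, 2^(s-1)-output metric sorting block.

    Two 2^(s-1) x 2^(s-1) bitonic sorters work in parallel and are followed
    by one C&S stage, giving 1 + (s - 1)s / 2 in units of T_C&S.
    """
    s = num_candidates.bit_length() - 1
    if num_candidates < 2 or (1 << s) != num_candidates:
        raise ValueError("number of candidates must be a power of two >= 2")
    return bitonic_sorter_delay(s - 1) + 1


if __name__ == "__main__":
    # 4b-rSCL decoder with L = 2: 16 * 2 = 32 candidates, so s = 5.
    print(metric_sorting_delay(16 * 2))  # 11
```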
Considering the unbalanced data paths between the metric sorting block and the other parts of the reformulated SCL decoder, we propose to repipeline those data paths to reduce the critical path delay. Fig. 19(a) shows the original pipelining of the 4b-rSCL decoder. Here, the register arrays for pipelining are inserted between the different blocks of the SCL decoder. As a result, because of the unbalanced data paths between different blocks, the clock cycles for processing the PEs and MCU/ZFU are not fully utilized [Fig. 19(b)]. Fig. 19(c) shows the proposed repipelining strategy applied to the same 4b-rSCL decoder. It can be seen that the original registers between stage-$(m-2)$, the MCU/ZFU, and the metric sorting block are moved into the metric sorting block. Fig. 19(d) shows the corresponding timing chart after repipelining. It can be seen that the data path in each clock cycle is balanced. More importantly, since the metric sorting block is deeply pipelined, the overall critical path delay is reduced significantly. Notice that in Fig. 19(c) the metric sorting block is two-stage pipelined. If deeper pipelining is needed, the registers between other PE stages must also be moved into the metric sorting block. For example, to pipeline the metric sorting block into three stages, the registers between stage-$(m-3)$ and stage-$(m-2)$ shown in Fig. 19(c) must be moved into the metric sorting block as well.
The proposed data path balancing strategy is very useful for high-speed polar list decoder design. For practical use of polar codes, a large list size $L$ is required to achieve error-correcting performance comparable to that of LDPC or Turbo codes of similar code length. For example, Tal and Vardy [4] reported that length-2048 polar codes can exceed LDPC performance when $L = 32$. In that case, for the conventional SCL decoder, the sorting block has $s = \log_2(2 \times 32) = 6$. As a result, even if the proposed metric sorting block is used, the critical path delay is still very large, $\left(1 + \frac{(s-1)s}{2}\right) T_{C\&S} = 16\,T_{C\&S}$, which impedes the application of polar codes in high-speed systems. Notice that this problem becomes even more severe for the $2^K$b-rSCL decoder. For example, for the 4b-rSCL decoder with $L = 32$, the number of path metric candidates is $32 \times 16 = 512$, which corresponds to $s = \log_2 512 = 9$. As a result, the critical path delay of the metric sorting block increases to $\left(1 + \frac{(s-1)s}{2}\right) T_{C\&S} = 37\,T_{C\&S}$. However, if we apply the proposed data path balancing technique to this case, the critical path delay can be significantly reduced. For example, in the case of length-2048 polar codes with $L = 32$, balancing the data paths of the metric sorting block, the MCU/ZFU block, and all the PE stages (stages 1-9) yields a critical path delay for the 4b-rSCL decoder of less than $(37 + 3 + 3 \times 9)/11 \approx 6.1\,T_{C\&S}$. This new critical path delay is four times less than that without data path balancing, and it is even 1.5 times less than that of the original SCL decoder. As a result, the proposed data path balancing strategy enables high-speed polar list decoder designs.
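The balanced-delay estimate for the $L = 32$ example can be reproduced with a short calculation. This is a minimal sketch under an assumption about where the numbers come from: the divisor 11 is read as nine PE stages plus one MCU/ZFU stage plus one sorting stage, with the total combinational delay spread evenly over them. The function name and parameter names are hypothetical.

```python
def balanced_delay_bound(sort_delay, mcu_zfu_delay, pe_delay, num_pe_stages):
    """Per-cycle delay (in units of T_C&S) after ideal data path balancing.

    Assumption: the total delay of the PE stages, the MCU/ZFU block, and the
    metric sorting block is divided evenly over (num_pe_stages + 2) cycles.
    """
    total_delay = sort_delay + mcu_zfu_delay + pe_delay * num_pe_stages
    num_cycles = num_pe_stages + 2  # PE stages + MCU/ZFU + sorting block
    return total_delay / num_cycles


if __name__ == "__main__":
    # 4b-rSCL, n = 2048, L = 32: sorting delay 37, with PE and MCU/ZFU
    # delays each bounded by 3 (all in units of T_C&S).
    bound = balanced_delay_bound(37, 3, 3, 9)
    print(f"{bound:.1f} T_C&S")  # 6.1 T_C&S
```

Because the PE and MCU/ZFU delays are upper bounds, the result is itself an upper bound on the balanced critical path delay.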
E. Quantization Scheme
Similar to the case of SCL decoders, the architecture of $2^K$b-rSCL decoders contains multiple stages of PEs. As a result, to avoid the saturation problem pointed out in [16], the quantization schemes for the different PE stages are different. If we assume the log-likelihood (LL) information from the channel is quantized with $Q_{ch}$ bits, then for stage-$i$ of the $2^K$b-rSCL decoder the corresponding bit-width is $Q_{ch} + i$. In addition, the MCU/ZFU and metric sorting blocks are quantized with $Q_{ch} + m$ bits. Notice that because the LL information in different stages has different bit-widths, the corresponding memories that store the LL information have different bit-widths as well.
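The nonuniform bit-width rule can be tabulated with a few lines of code. The helper names below are hypothetical, and the values $Q_{ch} = 3$ and $m = 10$ in the usage example are chosen only for illustration.

```python
def pe_stage_bit_width(q_ch, stage):
    """Bit-width of the LL values at PE stage `stage` (stage >= 1)."""
    if stage < 1:
        raise ValueError("stage index starts at 1")
    return q_ch + stage


def metric_bit_width(q_ch, m):
    """Bit-width used by the MCU/ZFU and metric sorting blocks."""
    return q_ch + m


if __name__ == "__main__":
    q_ch, m = 3, 10  # illustrative values, with m = log2(n) for n = 1024
    for stage in range(1, 4):
        print(f"stage-{stage}: {pe_stage_bit_width(q_ch, stage)} bits")
    print(f"MCU/ZFU and sorting: {metric_bit_width(q_ch, m)} bits")
```

Each stage gains one bit because combining two LL values can double the magnitude of the result, so one extra bit per stage avoids saturation.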
F. Memory Requirement
Besides the aforementioned blocks, memory banks make up a large portion of the $2^K$b-rSCL decoders. Similar to SCL decoders [16], multibit-width memory banks in the proposed design store the LL information from the channel as well as the LL information processed by each stage. As discussed in Section IV-E, the quantization scheme for the LL information is nonuniform and varies with the stage; therefore, the memory banks for different stages have different bit-widths. In addition, 1-b-width memory banks are needed to store the updated survival paths and the partial sum bits $u_{sum}$.
Notice that, compared to [16], the memory requirement of the proposed $2^K$b-rSCL decoder is larger. This is because the number of path metric candidates increases in the proposed decoders. As a result, more memory is required for storing the metrics calculated by the MCU/ZFU block. For example, with $L = 32$ and $K = 2$, $32 \times 16 = 512$ LL messages for metrics need to be stored, while the SCL decoder only needs to store 64 LL messages for metrics. Considering that these metrics are always quantized to more than 10 b, the extra memory requirement of the $2^K$b-rSCL decoder causes inevitable area overhead, especially in the case of large $L$ or $K$.
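The growth in metric storage can be made concrete with a small calculation. The sketch below assumes that a decoder deciding $2^K$ bits at a time produces $2^{2^K}$ candidates per surviving path, with $K = 0$ modeling the conventional SCL decoder; the function names and the 11-bit metric width in the usage example are hypothetical.

```python
def metric_candidates(list_size, k):
    """Number of path metric candidates per decision step.

    Each of the `list_size` paths branches into 2^(2^k) candidates when
    2^k bits are decided together; k = 0 models the conventional SCL case.
    """
    bits_per_step = 2 ** k
    return list_size * 2 ** bits_per_step


def metric_storage_bits(list_size, k, metric_width):
    """Total bits needed to hold all candidate metrics of one decision step."""
    return metric_candidates(list_size, k) * metric_width


if __name__ == "__main__":
    width = 11  # hypothetical metric width, consistent with "more than 10 b"
    print(metric_candidates(32, 2), metric_candidates(32, 0))  # 512 64
    print(metric_storage_bits(32, 2, width), metric_storage_bits(32, 0, width))
```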
G. Overall Architecture
With the aforementioned PE, MCU/ZFU, and metric sorting blocks, the overall architecture of a reformulated SCL decoder with list size $L$ can be developed as shown in Fig. 20. Besides the previously presented blocks, the decoder needs an LL memory bank to store and update the log-likelihood information processed by the $L$ SC component decoders. In addition, a survival path bank is needed to store and update the $L$ survival paths during the list decoding procedure. Furthermore, the reformulated SCL decoder needs a polar-encoder-like partial sum generator (PSG) to compute $u_{sum}$ for the corresponding SC component decoder. The architecture of the PSG is similar to the polar encoder shown in Fig. 1.
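Since the PSG is described as polar-encoder-like, a software model of the polar transform gives a rough idea of the XOR network involved. The sketch below is an illustration only: it assumes the natural-order transform $x = u F^{\otimes m}$ with $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$ and no bit-reversal permutation, which may differ from the convention of Fig. 1, and the function name `polar_transform` is hypothetical.

```python
def polar_transform(bits):
    """Apply the natural-order polar transform over GF(2).

    Assumption: no bit-reversal permutation is applied. The input length
    must be a power of two.
    """
    n = len(bits)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    x = list(bits)
    stride = 1
    while stride < n:
        # One layer of XOR butterflies, like one stage of the encoder.
        for start in range(0, n, 2 * stride):
            for j in range(start, start + stride):
                x[j] ^= x[j + stride]
        stride *= 2
    return x


if __name__ == "__main__":
    print(polar_transform([0, 1, 0, 0]))  # [1, 1, 0, 0]
```

In a decoder the PSG would apply this kind of XOR network to the already decided bits of a path to form the partial sums used by the PEs.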
V. HARDWARE ANALYSIS AND COMPARISON
In this section, the hardware performance characteristics of the proposed reformulated SCL decoding architectures are analyzed. Table IV shows the hardware performance of different SCL decoders with list sizes $L = 2$ and $L = 4$ for the polar (1024, 512) code. Here, the 2b-rSCL and 4b-rSCL decoder designs are synthesized by Synopsys Design Compiler with an ST CMOS 65 nm library. Notice that the proposed designs use a 3-b quantization scheme for the LL information output from the channel, which is the same as in [16]. With the quantization scheme described in Section IV-E, the bit-width of stage-$i$ is $3 + i$. The MCU/ZFU block and the metric sorting block are quantized to $3 + m = 13$ bits.
From Table IV it can be seen that, compared with the prior LL-based SC list decoder design [16], the proposed 2b-rSCL and 4b-rSCL decoders achieve 21.0% and 60.5% reductions in latency, respectively. Notice that these reductions are smaller than those in the analysis of Table III. This is because the latency listed in Table IV is calculated based on (12) in [16], where the code rate $R = k/n$ is considered, while the analysis in Table III discusses the general case without considering a specific code rate or distribution of frozen bit positions. In general, as the code rate increases, the proposed reformulated SCL decoders save more clock cycles relative to the original one in [16]. For example, for an $R = 1$ polar code, the 2b-rSCL and 4b-rSCL decoders achieve 33% and 66% less latency than the original SCL decoder, respectively.
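The $R = 1$ figures can be cross-checked against the cycle counts quoted in the abstract. The sketch below assumes $3n - 2$ cycles for the original SCL decoder and reads the general expression for the $2^K$b-rSCL decoder as $n/2^{K-2} - 2$, which gives $2n - 2$ for $K = 1$; the function names are hypothetical.

```python
def scl_latency(n):
    """Latency of the original SCL decoder in clock cycles (3n - 2)."""
    return 3 * n - 2


def rscl_latency(n, k):
    """Latency of the 2^K b-rSCL decoder in clock cycles.

    Assumed reading of the abstract: n / 2^(K - 2) - 2 = 4n / 2^K - 2.
    """
    return (4 * n) // (2 ** k) - 2


def latency_reduction(n, k):
    """Fraction of clock cycles saved relative to the original SCL decoder."""
    return 1 - rscl_latency(n, k) / scl_latency(n)


if __name__ == "__main__":
    n = 1024
    for k in (1, 2):
        print(f"{2 ** k}b-rSCL: {rscl_latency(n, k)} cycles, "
              f"{100 * latency_reduction(n, k):.1f}% fewer than SCL")
```

For $n = 1024$ this gives roughly one third and two thirds fewer cycles, in line with the 33% and 66% figures for an $R = 1$ code.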
With the data path balancing technique of Section IV-D, the proposed 2b-rSCL and 4b-rSCL designs can achieve a high clock frequency. Therefore, as seen in Table IV, the coded throughputs of the 2b-rSCL and 4b-rSCL decoders with $L = 2$ are 1.66 and 3.45 times that of the original SCL decoder, respectively. In addition, when $L = 4$, the coded throughputs of the 2b-rSCL and 4b-rSCL decoders are 2.11 and 3.23 times that of the original SCL decoder, respectively. The hardware efficiency of our designs, defined as the ratio of throughput to area, increases as well. When $L = 2$, the hardware efficiencies of the 2b-rSCL and 4b-rSCL decoders are 1.36 and 2.08 times that of the original SCL decoder; when $L = 4$, they are 1.87 and 2.66 times that of the original SCL decoder. Recently, a log-likelihood-ratio (LLR)-based SCL decoder was proposed in [17], which requires a much smaller bit-width than the LL-based decoder. As a result, the overall area and critical path delay can be significantly reduced. Owing to the generality of the LLR-based scheme in [17], it can also be applied to our proposed $2^K$b-rSCL decoders. In that case, the hardware complexity and critical path of our designs can be further reduced while retaining the same short latency.
VI. CONCLUSION
In this paper, we have presented reformulated SC list decoding algorithms. These reformulated algorithms can reduce the latency significantly without any performance loss. Based on the proposed algorithms, we then develop the corresponding latency-reducing hardware architectures for SCL decoders. Hardware analysis shows that the proposed 2b-rSCL and 4b-rSCL decoders achieve significant improvements in throughput and hardware efficiency.
