Abstract-Due to their capacity-achieving property, polar codes have become one of the most attractive channel codes. To date, the successive cancellation list (SCL) decoding algorithm is the primary approach that can guarantee outstanding error-correcting performance of polar codes. However, the hardware designs of the original SCL decoder have large silicon area and long decoding latency. Although some recent efforts can reduce either the area or latency of SCL decoders, these two metrics still cannot be optimized at the same time. This paper, for the first time, proposes a general log-likelihood-ratio (LLR)-based SCL decoding algorithm with multi-bit decision. This new algorithm, referred as LLR-2 K b-SCL, can determine 2 K bits simultaneously for arbitrary K with the use of LLR messages. In addition, a reduced-data-width scheme is presented to reduce the critical path of the sorting block. Then, based on the proposed algorithm, a VLSI architecture of the new SCL decoder is developed. Synthesis results show that for an example (1024, 512) polar code with list size 4, the proposed LLR-2 K b-SCL decoders achieve significant reduction in both area and latency as compared to prior works. As a result, the hardware efficiency of the proposed designs with K=2 and 3 are 2.33 times and 3.32 times of that of the state-of-the-art works, respectively.
I. INTRODUCTION
OLAR codes [1] have emerged as one of the most attractive forward error correction (FEC) codes in recent years. Due to their unique capacity-achieving property, polar codes provide outstanding error-correcting capability that would be very useful for digital transmission.
However, to date polar codes suffer from inferior finite length error-correcting performance. In particular, in the region of short or medium code length, polar codes are not comparable to the LDPC codes in terms of coding gain. To solve this problem, successive-cancellation list (SCL) decoding algorithm was proposed in [2] to improve the coding gain of the polar codes. In [2] , it was shown that with the use of SCL algorithm, polar codes can outperform the WiMAX LDPC codes even for a shorter code-length.
Although SCL algorithm can help polar codes achieve
Manuscript received June 19 th beyond-LDPC performance, this approach suffers from high complexity and long latency. Recently, some efforts were proposed to address these problems. An LLR-based SCL algorithm was proposed in [3] [4] to reduce the amount of combinational logic and memory. In [5] [6] [11], low-latency SCL algorithms were presented to reduce the required number of decoding cycles. However, these prior works only focused on either reducing area or latency, but not on optimizing these two metrics at the same time. This paper, for the first time, proposes a general reduced-latency LLR-based SCL decoding algorithm. This new algorithm, namely LLR-2 K b-SCL, can determine 2 K bits in one cycle for arbitrary K with the use of LLR messages. As a result, it can achieve both low complexity and short latency. In addition, a reduced-data-width scheme is presented to reduce the critical path of the sorting block. Based on the proposed algorithm, a VLSI architecture of the new SCL decoder is developed. Synthesis results show that for an example (1024, 512) polar code with list size 4, the proposed LLR-2 K b-SCL decoder achieves great reduction in both area and latency as compared to the prior works, respectively. As a result, the hardware efficiency of the proposed designs with K=2 and 3 are 2.33 times and 3.32 times of that of the state-of-the-art works, respectively. Notice that this paper is a generalized version of our prior work [10] .
The rest of the paper is organized as below. Section II gives a brief review of polar codes. The proposed LLR-2 K b-SCL algorithm and the reduced-bit-width scheme for sorting block are presented in Section III. Section IV develops the hardware architecture of the new algorithm. Hardware performance is analyzed and discussed in Section V. Section VI draws the conclusions.
II. REVIEW OF POLAR CODES

A. Encoding of Polar Codes
The encoding procedure of (n, p) polar codes consists of two steps. First, the p-bit source message is extended to an n-bit message u=(u 1 , u 2 ,…u n ) with padding n-p "0" bits. Here those padded "0" bits are referred as frozen bits and their positions over the u, namely frozen positions, are known to both the transmitter and receiver. Then, u is multiplied with an n-by-n generator matrix G to obtain the transmitted codeword x=(x 1 , x 2 ,…,x n )=uG. Fig. 1 shows the example of an n=4 polar codes encoder. For details on encoding of polar codes, the reader is referred to [1] . 
B. Successive-Cancellation (SC) Decoding Algorithm
At the receiver end, the original transmitted codeword x is corrupted by noise to the received codeword y=(y 1 , y 2 ,…,y n ). An LLR-based successive-cancellation (SC) decoder can be used to recover u from y. Fig. 2 shows an example SC decoding procedure for n=4 polar codes. It can be seen that the SC decoder consists of two types of computation units, namely f and g, respectively. Here each f or g unit is labelled with a number, which indicates the time index when the corresponding unit is activated for computation. It can be seen that at clock cycle 2, 3, 5 and 6, the f or g unit in the last stage (stage 2 in this example) sends its LLR output to the hard-decision h unit to determine the decoded bit. Notice that the example SC decoder is based on LLR form. In this case, the function of f and g units can be approximated as (1) and (2): Fig. 2 . Example LLR-based SC decoding scheme for n=4, cited from [7] .
C. Successive-Cancellation List (SCL) Algorithm
The decoding process of polar codes can also be interpreted from the view of code tree. Fig. 3 shows an example code tree for n=4 polar codes. Here each level represents a decoded bit. The value that is associated with each node represents the probability (metric) of the corresponding decoding path. For example, 0.23 at level 4 represents the probability for the length-4 path (1000) is Pr(
.23. Based on this representation, the objective of a successful decoding procedure is to find the length-n path that is the corrected codeword. To achieve this goal, the SC decoder first visits the children nodes that are associated with the current survival path at each level. Then it selects the new path that has the larger metric as the updated survival path. Because this searching strategy is only locally optimal, the performance of the SC decoder is limited.
Different from the SC decoder that only selects a single path, the L-size SCL algorithm utilizes L different searching paths. Therefore, it is more likely for the SCL algorithm to find the desired path than the SC algorithm. For example, SCL decoder with L=2 is able to trace the valid path (1000) in Fig. 3 while the SC decoder fails to find it. 
III. THE PROPOSED LLR-BASED MULTI-BIT DECODING
From Fig. 3 it can be seen that the SCL algorithm needs to visit the nodes at each level of the code tree. This traversing searching style leads to a very long decoding latency. To address this problem, a reduced-latency algorithm that can determine 2 K bits simultaneously was proposed in [5] . However, the approach in [5] is based on the likelihood messages, which requires much larger computation and memory complexity than the commonly used LLR-based decoder in [3] [4] . On the other hand, those LLR-based SCL decoders in [3] [4] are only able to determine one bit in one cycle; hence they have much longer latency than the decoder in [5] . As a result, to date those prior SCL decoders in [3] [4] [5] are not able to reduce the latency and area at the same time.
A. LLR-based Multi-bit SCL Decoding
This subsection presents an LLR-based SCL decoding algorithm with 2 K bits decision, namely LLR-2 K b-SCL. Notice that [5] also proposed a reduced-latency decoder that can determine multiple bits in one cycle. However, the SCL decoder in [5] is based on the likelihood form while the design in this paper is based on the LLR form. This difference in the fundamental representation of the proposed algorithm leads to much less computation complexity than the approach in [5] .
Next, we show how to determine each successive 2 K bits as
at the same time in the LLR form. From the view of code tree (see Fig. 3 ), this means the SCL decoder is able to directly calculate the metrics of length-(2 K i) paths from the metrics of length-(2 K (i-1)) paths. In general, such direct computation is performed by an LLR-based metric computation unit (MCU), which replaces the original last K stages of each LLR-based SC decoder (see Fig. 2 ). In the following paragraph, we show how to derive the function of the LLR-based MCU.
Assume that the previously decoded 2
respectively. This event is denoted as
Therefore, in the logarithmic domain the length-(2 K i) path metric can be represented as: 
where
that is the set of the current 2 K decoded bits. In addition, (3) contains the probabilistic information of the current 2 K decoded bits, which is unknown during the decoding procedure. To address this problem, we need to further represent the logarithmic path metrics with the LLR messages that are input to the MCU. Such reformulation is based on the fact that the polar decoding procedure is inherently "guided" by its encoding procedure [8] . Simultaneous right-to-left decoding procedure of the successive 2 K bits (see Fig. 4(a) ) involves the estimation of the left-to-right encoding procedure (see Fig.   4(b) ). Hence if  ( ) Notice that the determination of  1 out ,…,  2 K out are independent. In addition, if we denote the j-th column vector of U as U(j), then we have
As a result, (4) can be further derived as below: 
is the logarithmic length-(2 K (i-1)) path metric. Recall that each SC component decoder is based on LLR form. In that case, the j-th input to the MCU block is:
As a result, we can obtain the elements of the first item in (5):
Substituting (6) into (5), we have:
(7) describes the LLR-based update principle for path metrics. Once the MCU block receives the 2 K input LLR messages s j and the previous metric of length-(2 K (i-1)) path, it can immediately calculate the new metric of length-(2 K i) path with the use of (7), which corresponds to the simultaneous decision for 2 K bits. Notice that (7) contains exponential and logarithmic functions, which require long critical paths in hardware design. Therefore, (7) needs to be simplified for feasible VLSI implementation.
Consider ln(1+e x )≈x for large x; otherwise 0 for small x. (7) can be further approximated as below: Fig. 2 ), where m=log 2 n. At this time, the (m-K) stage outputs 2 K LLR messages s j (j=1, 2,…2 K ) to the MCU block. Then, the MCU block in each SC component decoder calculates the new path metrics with the use of (8) . After that, all of the updated path metrics from the L SC component decoders are compared and L largest are selected as the survival paths metrics. The above entire procedure is repeated for every 2 K bits until all the n bits are determined. Note that similar to [5] , a simple zero-forcing unit (ZFU) is needed after the computation of (8), which helps to drop the unqualified paths that violate the frozen conditions. 
( ... ,
11: Metric Computation:
12:
13: Forcing Zero: 14: 
B. Reduced-Data-Width Scheme for Sorting Block
Typically, Q=6 bit quantization scheme is sufficient for the fixed-point implementations of LLR-based SCL decoder. However, as indicated in [3] , the representation of path metrics needs more bits since the path metrics have larger data range than the propagated LLR messages. In [3] , it showed that M=8 bit for path metrics can avoid significant performance degradation in terms of frame error rate (FER). Since the overall critical path of the SCL decoder is in the sorting block that sorts those path metrics [3] [5], the escalating data-width of path metrics inevitably causes significant increase in criticial path delay. To address this challenge, we propose a reduced-bit-width scheme for sorting block. The key idea is to only utilize S=M-1 bits to represent the path metrics for sorting, while the representations of path metrics for updating and storing are still based on M bits. This approach is derived from the following observation: In SCL decoder the sorting block does not require as high precision as MCUs, since the function of the soring block is just to rank the path metrics without changing their values, while the MCUs have to adopt larger data-width to guarantee accurate calculation. Therefore, a reduced-data-width for sorting block does not cause significant performance degradation but enables reduction in critical path delay. Fig. 5 shows the fixed-point simulation results for the proposed LLR-2 K b-SCL algorithm with reduced-data-with for sorting block. Here the simulation environment is AWGN channel with BPSK modulation and the code parameters are n=1024, r=0.5. From the figure it can be seen that, compared to the original scheme using S=8 bits for sorting the path metric, the proposed scheme with S=7 bits only has negligible performance loss for different values of K and L. Fig. 6 shows the overall hardware architecture of the proposed L-size LLR-2 K b-SCL algorithm decoder. The data path of the entire decoder contains L LLR-based SC decoders plus a metric sorting block. For each SC component decoder, it is reformulated from the LLR-based decoder in [8] , which retains the (m-K) stages but the last K stages are replaced by the LLR-based MCU and ZFU. In addition, the required memory resource of the entire decoder consists of register arrays, bulk memory and buffer for survival paths, path metrics, propagating LLR messages and channel outputs, respectively.
IV. HARDWARE ARCHITECTURE
A. Overall Architecture
In this section, the hardware design of the sorting block is very straightforward and similar to the approaches in [3] [5] . The only difference is that the data-width of comparators is reduced from M bits to M-1 bits and the least significant bits (LSBs) of all the input path metrics are dropped to be consistent with the data-width of comparators. Therefore, in this section we focus on other parts of the data path and memory resource.
Notice that since ZFU can be easily implemented with multiplexers; the analysis of this block is omitted as well. 
B. LLR-based Metric Computation Unit (MCU)
In Section III (8) describes the function of MCU. Since this function depends on K, the hardware design of MCU varies with different choices of K. Fig. 7 illustrates the inner architecture of MCU for K= 3. Here δ(•) block can be simply implemented with a multiplexer. In addition, StoC and CtoS blocks represent the components that perform the conversion between sign-magnitude and 2's complement forms. 
C. LLR-based Processing Element (PE)
As shown in line 5 -line 7 in Scheme-A, the input LLR messages of MCU are calculated from the first (m-K) stages of LLR-based SC decoder in [8] . In general, these stages consist of f and g nodes in Fig. 2 , which can be implemented in hardware as the following processing element (PE) in Fig. 8 . Notice here the addition and subtraction in (2) is designed as a unified adder and subtractor to save area. 
D. Memory Resource
For the proposed decoder, different types of memory resource are needed for different types of data. Because the L survival paths and their metrics need to be updated simultaneously each time, they are stored in registers. In addition, for the propagating LLR messages that are processed in the PEs, L bulk memory banks are used to store the corresponding messages in the L different SC component decoders. This memory partition approach can also avoid the potential memory access conflict. Besides, a specific buffer is used to store the initial LLR inputs to the decoder. Table I shows the hardware performance of different SCL decoders with list size L=4 for the same (1024, 512) polar codes. For the designs with different technology libraries, the results on area and frequency are scaled to the same 90nm nodes. From this table it can be seen that; the proposed design with K=3 can achieve much higher throughput than the same design with K=2 with a slight increase in area and critical path delay (lower clock frequency). As a result, the hardware efficiency of the design with K=3 can achieve 572Mbps/mm 2 , which is the highest among the listed works. 
V. PERFORMANCE AND DISCUSSION
A. Hardware Performance Comparison
