Polar codes are a class of capacity-achieving error-correcting codes that has recently been selected for the next generation of wireless communication standards (5G). Polar code decoding algorithms have evolved in various directions, striking different balances between error-correction performance, speed and complexity. Successive-cancellation list (SCL) and its incarnations constitute a powerful, well-studied set of algorithms, in constant improvement. At the same time, different implementation approaches provide a wide range of area occupations and latency results. 5G puts a focus on improved error-correction performance, high throughput and low power consumption: a comprehensive study considering all these metrics is currently lacking in the literature. In this work, we evaluate SCL-based decoding algorithms in terms of error-correction performance and compare them to low-density parity-check (LDPC) codes. Moreover, we consider various decoder implementations, for both polar and LDPC codes, and compare their area occupation and power and energy consumption when targeting short code lengths and rates. Our work shows that among SCL-based decoders, the partitioned SCL (PSCL) provides the lowest area occupation and power consumption, whereas fast simplified SCL (Fast-SSCL) yields the lowest energy consumption. Compared to LDPC decoder architectures, different SCL implementations occupy up to 17.1× less area, dissipate up to 7.35× less power, and consume up to 26× less energy.
I. INTRODUCTION
Polar codes, introduced by Arıkan in [1], are a class of error-correcting codes that can provably achieve channel capacity on a memoryless channel when the code length N tends to infinity. They have been selected for the next generation of wireless communication standards [2].
The 5G standardization process is putting a particular focus on improved error-correction performance, lower power consumption and higher throughput. For example, machine-to-machine communications in 5G target massive connectivity among a high number of devices, on a scale higher than the most bandwidth-demanding applications in 3G and 4G [3], with a limited power budget. Therefore, reliable and efficient encoding and decoding methods need to be designed.
In [1], the successive-cancellation (SC) decoding algorithm is proposed for polar codes: it can be represented as a binary tree search. While optimal with infinite code length, this approach suffers from long decoding latency and mediocre error-correction performance at moderate code lengths. To improve the error-correction performance of SC, the SC List (SCL) decoding algorithm was proposed in [4], which relies on a list of L codeword candidates. A cyclic-redundancy check (CRC) is also concatenated to the polar code, to help in the selection of the correct candidate at the end of the SCL decoding process. The improved error-correction performance of CRC-aided SCL comes at the cost of additional computational complexity and latency. A hardware implementation for SCL using logarithmic likelihood ratio (LLR) values was presented in [5]. In order to reduce latency and increase throughput, the simplified SCL (SSCL) [6] and Fast-SSCL [7] decoding algorithms were proposed. They rely on the identification of bit patterns to prune the SC decoding tree and reduce the number of required bit estimations, with minor or no error-correction performance degradation. Compared to the conventional SCL, SSCL and Fast-SSCL can reduce the number of time steps required to decode one codeword by up to 88% [7]. To address the high implementation complexity of SCL decoders, a partitioned SCL (PSCL) decoder was proposed in [8]: it shows substantial area occupation reduction and negligible error-correction performance loss with respect to conventional SCL decoders.
SCL-based decoders are currently one of the best candidates to meet 5G error-correction performance requirements and throughput. While most recent decoder architectures for polar codes focus on improving throughput and area occupation, little work has been done in terms of power consumption [9], [10]. A large part of machine-to-machine connected devices are mobile-end platforms that use batteries and small-scale energy harvesting electronics: ultra-low power/energy consumption for these devices is crucial [11].
This work provides an extensive study on polar code SCL-based decoders in terms of frame error rate (FER) performance, area occupation, and power/energy consumption. We focus on short to medium code lengths, similar to those chosen for the eMBB control channel [12]. For rates 1/2 and 2/3, SCL-based decoders are compared against low-density parity-check (LDPC) codes from the IEEE 802.16e (WiMAX) standard with variable maximum iteration number. Then, we address power consumption of polar code decoders based on SCL, SSCL, Fast-SSCL and PSCL, and compare them against LDPC codes.
The rest of this work is organized as follows. In Section II, polar codes are briefly introduced along with various SCL-based decoding algorithms. Hardware implementations of polar code decoders are discussed in Section III. In Section IV, the error-correction performance of polar codes is analyzed and compared to that of LDPC codes from communications standards. Section V presents synthesis results for a wide variety of decoder architectures, and compares them to LDPC decoders in literature. Conclusions are drawn in Section VI.
Fig. 1. Polar code encoding for PC(8, 4). Gray indices indicate frozen bits while black indices represent information bits.
II. BACKGROUND

A. Polar Codes
Polar codes are able to achieve channel capacity through channel polarization, which splits N channel utilizations into K reliable ones, through which information bits are sent, and N − K unreliable ones, used for frozen bits. A polar code, represented as PC(N, K), is a linear block code of length $N = 2^n$ and rate R = K/N. Encoding of a polar code can be represented by a matrix multiplication:
$$x_0^{N-1} = u_0^{N-1} G^{\otimes n}, \tag{1}$$
where $u_0^{N-1} = \{u_0, u_1, \ldots, u_{N-1}\}$ is the input vector, $x_0^{N-1} = \{x_0, x_1, \ldots, x_{N-1}\}$ is the encoded vector, and the generator matrix $G^{\otimes n}$ is the n-th Kronecker product of the polar code matrix $G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. A polar code of length N is composed of two concatenated polar codes of length N/2; Fig. 1 depicts the encoding process for PC(8, 4).
In [1], it was shown that as N → ∞, encoded bits become either completely unreliable or completely reliable. For a polar code of rate R = K/N, the N − K most unreliable bits are fixed to a constant that is known by the decoder, usually zero; the remaining K reliable locations are used to transmit the information bits. For the PC(8, 4) code in Fig. 1, bits $u_0$, $u_1$, $u_2$, and $u_4$ are located on the least reliable indices, thus are frozen and indicated with set Φ (gray indices in the figure), while bits $u_3$, $u_5$, $u_6$, and $u_7$ are located on the most reliable indices, which carry the information bits (black indices in the figure).
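As an illustration of the encoding step, the following minimal Python sketch multiplies an input vector by the n-th Kronecker power of G using in-place XOR butterflies. It assumes natural (non bit-reversed) index order and zero-valued frozen bits; the helper names `polar_encode` and `build_input` and the example information bits are hypothetical, while the frozen set is the one given above for PC(8, 4).

```python
def polar_encode(u):
    """Compute x = u * G^(kron n) over GF(2) with G = [[1, 0], [1, 1]].

    Assumes natural bit order (no bit-reversal permutation).
    """
    x = list(u)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    step = 1
    while step < n:
        # Each pass combines two length-`step` codes into one of length 2*step.
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]
        step *= 2
    return x


def build_input(info_bits, frozen, n):
    """Place information bits on non-frozen indices; frozen bits are set to 0."""
    it = iter(info_bits)
    return [0 if i in frozen else next(it) for i in range(n)]


FROZEN_PC84 = {0, 1, 2, 4}              # frozen set of PC(8, 4) from the text
u = build_input([1, 0, 1, 1], FROZEN_PC84, 8)   # example information bits
x = polar_encode(u)
print(u, x)
```

Since G squared is the identity over GF(2), applying `polar_encode` to `x` returns `u`, which gives a quick self-check of the sketch.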
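By its serial nature, SC decoding estimates a bit $\hat{u}_i$ according to the channel output $y_0^{N-1}$ and the previously estimated bits $\hat{u}_0^{i-1}$:
$$\hat{u}_i = \begin{cases} 0, & \text{if } i \in \Phi \text{ or } \ln \dfrac{\Pr[y_0^{N-1}, \hat{u}_0^{i-1} \mid \hat{u}_i = 0]}{\Pr[y_0^{N-1}, \hat{u}_0^{i-1} \mid \hat{u}_i = 1]} \geq 0, \\ 1, & \text{otherwise.} \end{cases} \tag{2}$$
SC decoding traverses the polar code tree in Fig. 2 starting from the root node, and advances recursively from left to right. Each parent node at stage S contains soft information (LLR values) $\alpha = \{\alpha_0, \alpha_1, \ldots, \alpha_{2^S-1}\}$, and passes this soft information to its left and right children. Hard decision estimates $\beta = \{\beta_0, \beta_1, \ldots, \beta_{2^S-1}\}$ are passed from child nodes to their parent nodes. From a parent node at stage S, the soft information passed to the left child $\alpha^l = \{\alpha_0^l, \alpha_1^l, \ldots, \alpha_{2^{S-1}-1}^l\}$ and the right child $\alpha^r = \{\alpha_0^r, \alpha_1^r, \ldots, \alpha_{2^{S-1}-1}^r\}$ can be approximated as
$$\alpha_i^l = \operatorname{sgn}(\alpha_i)\operatorname{sgn}(\alpha_{i+2^{S-1}}) \min(|\alpha_i|, |\alpha_{i+2^{S-1}}|), \tag{3}$$
$$\alpha_i^r = \alpha_{i+2^{S-1}} + (1 - 2\beta_i^l)\alpha_i, \tag{4}$$
with $0 \leq i < 2^{S-1}$.
The hard decision estimates β are calculated at each stage S via the left and right messages received from child nodes,
$$\beta_i = \begin{cases} \beta_i^l \oplus \beta_i^r, & \text{if } i < 2^{S-1}, \\ \beta_{i-2^{S-1}}^r, & \text{otherwise,} \end{cases} \tag{5}$$
where ⊕ denotes the bitwise XOR operation, and $0 \leq i < 2^S$. At the leaf nodes, β values are hard decisions computed by (2). The computational complexity of SC decoding is $O(N \log_2 N)$.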
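The tree traversal described above can be sketched as a short recursive Python routine. This is a minimal illustration, assuming the min-sum approximation for the left-child message, LLRs whose positive sign favors bit 0, natural bit order, and frozen bits equal to zero; the function names `f`, `g` and `sc_decode` are hypothetical and do not refer to any particular implementation.

```python
def f(a, b):
    """Left-child LLR: min-sum approximation of the check-node update."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, beta):
    """Right-child LLR, given the hard decision beta returned by the left child."""
    return b + (1 - 2 * beta) * a


def sc_decode(alpha, frozen):
    """Decode the subtree whose root receives the LLR vector alpha.

    frozen: list of booleans, one per leaf of this subtree.
    Returns (u_hat, beta): estimated input bits and the hard decisions
    passed back to the parent node.
    """
    n = len(alpha)
    if n == 1:
        # Leaf: frozen bits are 0, otherwise take the sign of the LLR.
        bit = 0 if frozen[0] or alpha[0] >= 0 else 1
        return [bit], [bit]
    half = n // 2
    left_llr = [f(alpha[i], alpha[i + half]) for i in range(half)]
    u_l, beta_l = sc_decode(left_llr, frozen[:half])
    right_llr = [g(alpha[i], alpha[i + half], beta_l[i]) for i in range(half)]
    u_r, beta_r = sc_decode(right_llr, frozen[half:])
    # Combine: first half is the XOR of both children, second half is the right child.
    beta = [beta_l[i] ^ beta_r[i] for i in range(half)] + beta_r
    return u_l + u_r, beta


# Noiseless example for PC(8, 4) with frozen indices {0, 1, 2, 4}.
frozen_mask = [i in {0, 1, 2, 4} for i in range(8)]
codeword = [1, 0, 1, 0, 0, 1, 0, 1]
llrs = [4.0 * (1 - 2 * b) for b in codeword]   # +4 for bit 0, -4 for bit 1
u_hat, beta_root = sc_decode(llrs, frozen_mask)
print(u_hat, beta_root)
```

In the noiseless case the hard decisions returned at the root reproduce the codeword, and `u_hat` carries zeros on the frozen positions.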
B. Successive-Cancellation List (SCL) Decoding
In SCL decoding [4], when a bit is to be estimated, the decoding process splits into two paths; one path estimates the bit as a '0', and the other as a '1'. Therefore, at each bit estimation, the number of codeword paths doubles, until a list size L is reached. In this context, SC can be considered as an SCL with list size L = 1. Each path carries information on the likelihood of the path being the correct codeword, which is defined as a path metric (PM). When estimating another bit in the sequence doubles the number of paths to 2L, the L least likely paths are dropped based on their PM information, and the list is updated. Compared to SC, SCL decoding yields a better error-correction performance. Fig. 3 depicts the parallel decoding process with list size L = 2 for PC(4, 3): $\hat{u}_0$ is a frozen bit and as a result no path splitting occurs. Estimating $\hat{u}_1$ creates two paths with associated path reliability values. When $\hat{u}_2$ is estimated, out of four possible paths, the two with the least reliable PMs are discarded. The PM is initialized as 0, and at each bit estimation, the PM is updated as
$$\mathrm{PM}_{i_l} = \begin{cases} \mathrm{PM}_{i-1_l} + |\alpha_{i_l}|, & \text{if } \hat{u}_{i_l} \neq \frac{1}{2}\left(1 - \operatorname{sgn}(\alpha_{i_l})\right), \\ \mathrm{PM}_{i-1_l}, & \text{otherwise,} \end{cases} \tag{6}$$
where l is the path index (0 ≤ l < L), and i is the estimated bit index.
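The path splitting and pruning step can be illustrated with a small Python sketch. It assumes an LLR-based metric in which a path is penalized by the magnitude of its decision LLR whenever the chosen bit disagrees with the sign of that LLR, so that a smaller PM means a more likely path. The decision LLRs are supplied directly as hypothetical values rather than computed from the decoding tree, and the names `update_pm` and `extend_paths` are illustrative only.

```python
def update_pm(pm, llr, bit):
    """Add |llr| to the path metric when `bit` contradicts the LLR's hard decision."""
    hard = 0 if llr >= 0 else 1
    return pm + abs(llr) if bit != hard else pm


def extend_paths(paths, llrs, is_frozen, list_size):
    """Extend every path by one bit and keep the `list_size` best candidates.

    paths: list of (pm, bits) tuples; llrs: one decision LLR per path.
    Frozen bits are forced to 0, so no path splitting occurs for them.
    """
    candidates = []
    for (pm, bits), llr in zip(paths, llrs):
        for bit in ((0,) if is_frozen else (0, 1)):
            candidates.append((update_pm(pm, llr, bit), bits + [bit]))
    candidates.sort(key=lambda c: c[0])      # smallest PM first
    return candidates[:list_size]


# Three bit estimations with L = 2; the LLR values are made up for illustration.
paths = [(0.0, [])]
paths = extend_paths(paths, [-1.5], is_frozen=True, list_size=2)
paths = extend_paths(paths, [2.0], is_frozen=False, list_size=2)
paths = extend_paths(paths, [-1.0, 0.5], is_frozen=False, list_size=2)
print(paths)
```

In the last step four candidate paths exist and the two with the largest metrics are dropped, mirroring the pruning described for L = 2.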
In [4], it was observed that SCL decoding could pick a wrong codeword out of the final candidates if they are evaluated only by their PM, even when the correct codeword is present in the final list. Thus, a CRC is added as an outer decoding process to aid SCL decoding, which improves the error-correction performance significantly. On the other hand, SCL decoding suffers from long decoding latency and a higher computational complexity of $O(LN \log_2 N)$.
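A minimal Python sketch of the CRC-aided selection among the final candidates is given below. The toy generator polynomial $x^3 + x + 1$, the fallback to the lowest-PM candidate when no candidate passes the check, and the names `crc_remainder` and `select_candidate` are assumptions made for illustration; they are not taken from the decoders evaluated in this work.

```python
def crc_remainder(bits, poly):
    """Remainder of bits(x) * x^r divided by poly(x) over GF(2).

    poly lists the r + 1 generator coefficients, highest degree first.
    """
    r = len(poly) - 1
    reg = list(bits) + [0] * r
    for i in range(len(bits)):
        if reg[i]:
            for j, p in enumerate(poly):
                reg[i + j] ^= p
    return reg[-r:]


def select_candidate(candidates, poly):
    """Return the lowest-PM candidate whose CRC checks.

    candidates: list of (pm, bits), where bits = message followed by its CRC.
    If no candidate passes, fall back to the lowest-PM one (an assumption).
    """
    r = len(poly) - 1
    for pm, bits in sorted(candidates, key=lambda c: c[0]):
        if crc_remainder(bits[:-r], poly) == bits[-r:]:
            return bits
    return min(candidates, key=lambda c: c[0])[1]


TOY_POLY = [1, 0, 1, 1]                      # x^3 + x + 1, illustration only
message = [1, 1, 0, 1]
good = message + crc_remainder(message, TOY_POLY)
bad = good[:]
bad[2] ^= 1                                  # single bit error, fails the CRC
chosen = select_candidate([(0.5, bad), (1.2, good)], TOY_POLY)
print(chosen)
```

Here the candidate with the better PM contains a bit error, so the CRC steers the selection to the second candidate.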
C. Simplified SCL and Fast-SSCL Decoding
The throughput of SC can be improved by an order of magnitude when applying the fast decoding techniques proposed in [13] and [14] . These techniques identify particular information and frozen bit patterns, reducing the decoding latency of SC with no error-correction performance degradation. Such special patterns are associated to nodes in the decoding tree: Rate-0 nodes (with no information bits), Rate-1 nodes (with no frozen bits), repetition (Rep) nodes (with a single information bit) and single parity-check (SPC) nodes (with a single frozen bit). In Fig. 2 the left and right child of the root node are examples of a Rep node and an SPC node, respectively. In [15] , it was shown that adaptation of these special nodes is applicable to SCL, yielding significant reduction in latency at the cost of error-correction performance loss.
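The identification of special nodes from the frozen-bit pattern can be sketched in a few lines of Python. The sketch assumes the usual convention that the single information bit of a Rep node sits at the last index and the single frozen bit of an SPC node at the first index; the function name `classify_node` and the "Generic" label for all other patterns are hypothetical.

```python
def classify_node(frozen):
    """Classify a subtree from its leaf pattern.

    frozen: list of booleans (True = frozen bit), in natural bit order.
    """
    n = len(frozen)
    k = frozen.count(False)          # number of information bits
    if k == 0:
        return "Rate-0"
    if k == n:
        return "Rate-1"
    if k == 1 and not frozen[-1]:
        return "Rep"                 # only the last bit carries information
    if k == n - 1 and frozen[0]:
        return "SPC"                 # only the first bit is frozen
    return "Generic"


# PC(8, 4) with frozen indices {0, 1, 2, 4}: children of the root node.
mask = [i in {0, 1, 2, 4} for i in range(8)]
left_kind = classify_node(mask[:4])
right_kind = classify_node(mask[4:])
print(left_kind, right_kind)
```

For a node of length two with pattern (frozen, information), both the Rep and SPC conditions hold; the sketch reports it as a Rep node because that test comes first.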
The SSCL algorithm from [6] proposes an efficient decoding technique by proving that Rate-0, Rate-1 and Rep nodes need not be traversed to update the PM, while guaranteeing that error-correction performance is preserved. This approach reduces the number of decoding steps for a node of length $N_v$ from $3N_v - 2$ to $N_v$ for Rate-1 nodes, to 1 for Rate-0 nodes, and to 2 for Rep nodes [16].
Fast-SSCL decoding [17] proposes an enhanced method to further reduce the decoding latency of Rate-1 nodes, down to $\min(L - 1, N_v)$ time steps with zero error-correction performance degradation. It was shown that, when splitting the paths over the Rate-1 node, the path split that does not match the sign of the LLR will always be discarded after the $(L - 1)$-th step.
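The step counts quoted above for SCL, SSCL and Fast-SSCL can be collected in a small Python helper. This is only a bookkeeping sketch: it assumes L ≥ 2, treats every node that is not Rate-0, Rate-1 or Rep as requiring the conventional $3N_v - 2$ steps, and uses hypothetical names (`node_time_steps`, the algorithm strings).

```python
def node_time_steps(kind, n_v, list_size, algorithm):
    """Time steps to decode one node of length n_v.

    kind: "Rate-0", "Rate-1", "Rep" or anything else (treated as unsimplified).
    algorithm: "SCL", "SSCL" or "Fast-SSCL".
    """
    conventional = 3 * n_v - 2
    if algorithm == "SCL":
        return conventional
    if kind == "Rate-0":
        return 1
    if kind == "Rep":
        return 2
    if kind == "Rate-1":
        if algorithm == "Fast-SSCL":
            return min(list_size - 1, n_v)
        return n_v
    return conventional


# Rate-1 node of length 16 with list size L = 4.
steps = {alg: node_time_steps("Rate-1", 16, 4, alg)
         for alg in ("SCL", "SSCL", "Fast-SSCL")}
print(steps)
```

For this node the count drops from 46 steps (SCL) to 16 (SSCL) and 3 (Fast-SSCL), which shows why the benefit of Fast-SSCL grows with node length for a fixed list size.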
D. Partitioned SCL Decoding
PSCL decoding divides the polar code into P constituent sub-trees of length N/P, while every partition is decoded by the CRC-aided SCL algorithm [8]. Each partition has its own CRC, thus only one candidate is passed at the end of each partition to the next, using standard SC rules [1]. This approach helps reduce the memory requirements, since instead of storing L copies of the complete tree, L copies of a single partition are required. In addition, the same physical memory can be reused for different partitions. As a result, the memory requirements decrease exponentially with P. Fig. 4 depicts a generic PSCL decoder tree with P = 4 partitions.
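To give a rough feeling for the memory saving, the following Python sketch counts stored LLR values under a deliberately simplified model, which is an assumption of this example and not the memory organization of any specific architecture: one node per tree stage is stored, the stages at and above the partition roots are stored once (SC rules), and the stages inside a partition are stored once per path. The function name `llr_words` is hypothetical.

```python
def llr_words(n, list_size, partitions=1):
    """Stored LLR values for code length n under the simplified model.

    partitions = 1 corresponds to conventional SCL.
    """
    part_len = n // partitions
    # Channel LLRs plus all stages down to the partition roots, stored once:
    # sum of 2^s for s = log2(part_len) .. log2(n)  ->  2n - part_len.
    shared = 2 * n - part_len
    # Stages strictly inside one partition, stored per path:
    # sum of 2^s for s = 0 .. log2(part_len) - 1  ->  part_len - 1.
    per_path = part_len - 1
    return shared + list_size * per_path


for p in (1, 2, 4):
    print(p, llr_words(512, 8, p))
```

Under this model the per-path portion shrinks in proportion to N/P, while the shared portion grows only slightly, so the total decreases as P increases.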
The reduced memory in PSCL comes at the cost of error-correction performance degradation compared to the conventional SCL decoding. As the number of partitions increases, the error-correction performance decays towards that of SC decoding. It was shown in [18] that a careful code construction and CRC selection can improve the error-correction performance of PSCL.
III. HARDWARE ARCHITECTURES FOR SCL-BASED DECODERS
A. SCL Decoder
The architecture of the SCL decoder follows the one described in [5]. It consists of five components: memory units, metric computation unit (MCU), metric sorting unit (MSU), address translation unit, and a controller. The MCU employs L parallel SC decoders performing (3), (4), and (5), one for each candidate codeword in the list. It also calculates the PM values whenever L decision LLR values are calculated according to (6). It then takes one clock cycle to update and sort the PMs using the MSU. PMs are stored in a register-based memory architecture for each candidate, and are passed to a compute/swap unit at the end of each bit estimation. LLR and β memory units have L banks each, one for each parallel decoder unit. Considering there are $P_e$ processing elements available, each bank is itself divided into two parts, one handling the top stages of the decoding tree, where stage $S > \log_2(P_e)$, and one for the lower stages.
B. SSCL & Fast-SSCL Decoder
The architectures of SSCL and Fast-SSCL decoders are based on the SCL architecture described in Section III-A: they however expand MCU to perform Rate-0, Rate-1 and Rep node calculations. Size and position of special nodes in the decoder tree are computed offline and used by the decoder as inputs. For Rate-0, no path splitting occurs, and a single step is used to update the PM list. Rate-1 nodes are computed in two stages: first the portion of the information bits (all of them in SSCL) that are subject to path splitting is calculated. Then, in case of Fast-SSCL, the remaining bits are estimated in a single step, and their LLR values are used to update the PM according to (6) . Computations for Rep nodes are similar to those of Rate-0 nodes: the frozen bits are treated as in Rate-0 nodes, while an additional step estimates the single information bit.
Both SSCL and Fast-SSCL architectures employ an L-parallel CRC computation unit that updates the CRC as soon as a bit is estimated by the SC decoders. They include different degrees of parallelism to accommodate the single-step estimation of multiple bits in Rate-0, Rate-1 and Rep nodes.
C. PSCL Decoder
The PSCL decoder modifies the SCL decoder by reducing the size of the LLR and β memories to fit the partition size. A single memory takes care of the top of the tree, where the SC rules are applied, and ad-hoc routing to processing elements is performed depending on the tree stage.
IV. ERROR CORRECTION PERFORMANCE COMPARISON
In this section, the error-correction performance of the SCL-based decoding algorithms described in Section II is evaluated; the algorithms are compared against each other, and against LDPC codes taken from the IEEE 802.16e standard where applicable. For polar codes, we consider code lengths of 256 and 512: these lengths are included in the 5G eMBB control channel [2]. The LDPC code with N = 576 has instead been selected from the WiMAX standard, being the only one of length comparable to that of polar codes. Our simulation environment considers an additive white Gaussian noise (AWGN) channel and BPSK modulation.
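The channel model used in such simulations can be sketched in Python as follows. The sketch assumes unit-energy BPSK with the mapping 0 → +1 and 1 → −1, a noise variance of $\sigma^2 = 1/(2R\,E_b/N_0)$, and LLRs defined so that positive values favor bit 0; the function names `noise_sigma` and `channel_llrs` and the example codeword are hypothetical.

```python
import math
import random


def noise_sigma(ebn0_db, rate):
    """Noise standard deviation for unit-energy BPSK at Eb/N0 (dB) and code rate."""
    ebn0 = 10 ** (ebn0_db / 10)
    return math.sqrt(1 / (2 * rate * ebn0))


def channel_llrs(codeword, ebn0_db, rate, rng):
    """Transmit a codeword over the AWGN channel and return the channel LLRs."""
    sigma = noise_sigma(ebn0_db, rate)
    llrs = []
    for bit in codeword:
        y = (1 - 2 * bit) + rng.gauss(0, sigma)   # BPSK symbol plus noise
        llrs.append(2 * y / sigma ** 2)           # ln P(y | 0) / P(y | 1)
    return llrs


rng = random.Random(0)                            # fixed seed for repeatability
example_codeword = [1, 0, 1, 0, 0, 1, 0, 1]
example_llrs = channel_llrs(example_codeword, 20.0, 0.5, rng)
print(example_llrs)
```

An FER point is then estimated by repeating encode, transmit and decode over many frames and dividing the number of frames in error by the number of frames simulated.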
The error-correction performance of SCL, SSCL and Fast-SSCL is identical. Therefore, they are referred to with the notation SCL$L$-CRC$C$, where $L$ and $C$ denote the list size and the CRC length, respectively. PSCL decoders are referred to as PSCL($P$,$L$)-CRC($c_0$,$c_1$,…,$c_{P-1}$), where $P$ denotes the number of partitions and $c_p$ represents the CRC length of partition $p$. For LDPC codes, T denotes the maximum number of iterations, and the normalized min-sum algorithm is used for decoding [19], together with layered scheduling [20].
The target code rates are R ∈ {1/6, 1/3, 1/2, 2/3} for polar codes, having been investigated in 5G discussions [12]. Among these rates, WiMAX LDPC codes allow for R ∈ {1/2, 2/3}. A CRC of length 8 is selected for polar codes. For PSCL, the CRC selection criterion from [18] was adopted. For a target $E_b/N_0$ value, a simulation sweeps the error-correction performance of PSCL with different CRC lengths. Only CRC polynomials of degrees which are multiples of four are considered, to reduce the algorithm complexity. Then, for each code length and rate, the CRC lengths that provide the best error-correction performance are selected.
Fig. 5, 6, 7, and 8 present the FER for SCL and PSCL algorithms with list sizes of L ∈ {4, 8}, and code lengths of N ∈ {256, 512} for code rates 1/6, 1/3, 1/2, and 2/3, respectively. A consistent improvement in FER can be seen when the list size is increased for all rates and lengths when the SCL decoder is used. For a target FER of $10^{-4}$, this improvement reaches 1 dB when a polar code of length 512 with rate 1/3 is used. Similar observations can be made in terms of PSCL, with a peak improvement of 0.25 dB for PC(512, 256). In all cases, SCL decoders provide better error-correction performance than their PSCL counterparts.
Fig. 9 and 10 present the FER of polar codes with N = 512 against LDPC WiMAX codes with N = 576, for R ∈ {1/2, 2/3}. The maximum number of iterations considered for LDPC decoding is T ∈ {5, 10, 20}. For R = 1/2 codes, the SCL algorithm with L = 2 outperforms LDPC with T = 20, while at FER = $10^{-4}$, PSCL(2,2)-CRC(8,8) has the same FER. For R = 2/3 codes in Fig. 10, LDPC with T = 10 matches the error-correction performance of SCL8-CRC8. Based on these results, in the following section we compare decoder architectures that target codes with matching FER. Thus, LDPC decoders are compared to PSCL for R = 2/3, while SCL, SSCL and Fast-SSCL are used in case of R = 1/2.
V. ASIC IMPLEMENTATION RESULTS
In this section, synthesis results for SCL, SSCL, Fast-SSCL and PSCL for N ∈ {256, 512}, R ∈ {1/2, 2/3}, and L ∈ {2, 4, 8} are presented. For each architecture, the number of parallel processing elements is $P_e = 32$. Based on simulations, PM quantization is selected as 8 bits. For channel LLR and internal LLR values, quantization is 4 and 6 bits respectively, two of which are assigned to the fractional part. All memories have been synthesized as registers.
The architectures are synthesized with TSMC 65 nm CMOS technology, targeting a frequency of f = 800 MHz. Table I compares the total area, power and energy consumption per codeword for all four SCL-based decoder implementations under the aforementioned design parameters.
The SCL decoder yields lower area occupation and power consumption compared to SSCL and Fast-SSCL, with all the considered code lengths, rates and list sizes. This is due to the fact that the special node computations in SSCL and Fast-SSCL add substantial logic complexity. Additional complexity is also caused by the parallel CRC units necessary to update Rate-0 and Rep nodes in SSCL, and also Rate-1 nodes in Fast-SSCL. In terms of energy consumption, Fast-SSCL provides the best results compared to its predecessors: although the power consumption is higher, the number of time steps needed to decode a codeword is reduced dramatically, yielding the lowest energy per frame.
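The trade-off between power and decoding time can be made concrete with a short Python sketch. It assumes that one time step corresponds to one clock cycle and that energy per codeword is average power multiplied by decoding latency; the power and time-step values below are hypothetical and are not taken from Table I, and the function name `energy_per_codeword_nj` is illustrative.

```python
def energy_per_codeword_nj(power_mw, time_steps, freq_mhz):
    """Energy per codeword in nJ: E = P * t with t = time_steps / f.

    mW multiplied by microseconds gives nJ.
    """
    latency_us = time_steps / freq_mhz
    return power_mw * latency_us


# Hypothetical decoders at f = 800 MHz (values chosen for illustration only).
slow_low_power = energy_per_codeword_nj(power_mw=50, time_steps=2000, freq_mhz=800)
fast_high_power = energy_per_codeword_nj(power_mw=80, time_steps=400, freq_mhz=800)
print(slow_low_power, fast_high_power)
```

In this made-up case the second decoder draws 60% more power but needs five times fewer time steps, so its energy per codeword is roughly three times lower.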
In SCL-based implementations, memory is a major contributor to both area occupation and power consumption, and this contribution decreases exponentially with the partitioning factor of PSCL. Thus PSCL, with its reduced memory requirements, has the smallest area occupation and power consumption. In this context, with a minor degradation in performance, PSCL provides the best results for area- and power-efficient implementations.
For all considered rates in Table I when N = 256, the energy consumption per codeword of PSCL follows a trend close to that of SSCL when L = 2. SCL, with its long decoding process, has the worst energy consumption, while Fast-SSCL has the lowest one. As L increases, PSCL energy dissipation becomes comparable to that of Fast-SSCL. For N = 512, energy consumption for PSCL sits between that of SCL and SSCL for L ∈ {2, 4}. For L = 8, the energy consumption of PSCL is lower than that of SSCL and higher than that of Fast-SSCL. This is due to the nonlinear increment in power consumption that both SSCL and Fast-SSCL experience with increasing L.
Table II compares power, energy, and area of the considered polar code decoders against architectures for LDPC 802.16e codes taken from [21], [22], and [23]. Polar code decoders are selected based on the observations from Fig. 9 and 10. Note that the LDPC decoder architectures from [21] and [23] support both considered rates R ∈ {1/2, 2/3}. Energy consumption for the LDPC architectures in Table II is calculated with the number of iterations T required to match the FER of polar codes from Section IV.
In Table II, the area occupation for the LDPC decoders is scaled to 65 nm technology for a fair comparison. For rate R = 1/2, the total area of polar code decoders is between 7.7× (Fast-SSCL vs. [21]) and 17.1× (PSCL vs. [23]) smaller than that of LDPC WiMAX implementations. For rate R = 2/3, the advantage of polar decoders over LDPC decoders is lower, with a minimum of 2.46× less area occupation. Comparing power and energy consumption of architectures implemented with different technology nodes is not desirable, since power scaling leads to wildly inaccurate figures. However, with the current scheme, SCL decoders consume up to 8.75× less power, and up to 26.8× less energy per frame in case of R = 1/2. For R = 2/3, polar codes yield 2.6× less power consumption and 3.9× less energy dissipation per frame.
According to these results, SCL-based polar code decoder implementations offer good solutions for 5G applications that require low area, power or energy consumption. For communication devices that require low power and energy, SCL, Fast-SSCL, and PSCL offer better figures than LDPC codes at comparable FER, code lengths and rates. Considering area occupation along with power consumption, PSCL provides a very favorable solution with negligible loss in error-correction performance.
VI. CONCLUSION
In this work, we evaluate SCL-based polar code decoder implementations in terms of error-correction performance, area occupation, power consumption, and energy consumption, for a code set case study. SCL, SSCL and Fast-SSCL have the same error-correction performance, while PSCL suffers minor FER loss. We show that the considered polar code decoders have error-correction performance comparable to that of WiMAX LDPC codes. We also show and compare the area, power and energy consumption for all four decoder implementations, and discuss their trade-offs. Comparing selected SCL-based decoder implementations against WiMAX LDPC architectures shows that polar code decoders have reduced area, power and energy consumption, which makes them more suitable for potential 5G communications.
