Abstract: We show that successive cancellation list decoding can be formulated exclusively using log-likelihood ratios. In addition to numerical stability, the log-likelihood ratio based formulation has useful properties that simplify the sorting step involved in successive cancellation list decoding. We propose a hardware architecture of the successive cancellation list decoder in the log-likelihood ratio domain which, compared with a log-likelihood domain implementation, requires less irregular and smaller memories. This simplification, together with the gains in the metric sorter, leads to higher throughput per unit area than other recently proposed architectures. We then evaluate the empirical performance of the CRC-aided successive cancellation list decoder at different list sizes using different CRCs and conclude that it is important to adapt the CRC length to the list size in order to achieve the best error-rate performance of concatenated polar codes. Finally, we synthesize conventional successive cancellation decoders at large block-lengths with the same block-error probability as our proposed CRC-aided successive cancellation list decoders to demonstrate that, while our decoders have slightly lower throughput and larger area, they have a significantly smaller decoding latency.
I. INTRODUCTION

In his seminal work [1], Arıkan constructed the first class of error correcting codes that provably achieve the capacity of any symmetric binary-input discrete memoryless channel (B-DMC) with efficient encoding and decoding algorithms based on channel polarization. In particular, Arıkan proposed a low-complexity successive cancellation (SC) decoder and proved that the block-error probability of polar codes under SC decoding vanishes as their block-length increases. The SC decoder is attractive from an implementation perspective due to its highly structured nature. Several hardware architectures for SC decoding of polar codes have recently been presented in the literature [2]-[8], the first SC decoder ASIC was presented in [9], and simplifications of Arıkan's original SC decoding algorithm are studied in [10]-[13].
Even though the block-error probability of polar codes under SC decoding decays roughly like $O(2^{-\sqrt{N}})$ as a function of the block-length $N$ [14], they do not perform well at low-to-moderate block-lengths. This is to a certain extent due to the sub-optimality of the SC decoding algorithm. To partially compensate for this sub-optimality, Tal and Vardy proposed the successive cancellation list (SCL) decoder, whose computational complexity is shown to scale identically to that of the SC decoder with respect to the block-length [15].
SCL decoding not only improves the block-error probability of polar codes, but also enables one to use modified polar codes [16] , [17] constructed by concatenating a polar code with a cyclic redundancy check (CRC) code as an outer code. Adding the CRC increases neither the computational complexity of the encoder nor that of the decoder by a notable amount, while reducing the block-error probability significantly, making the error-rate performance of the modified polar codes under SCL decoding comparable to the state-of-the-art LDPC codes [16] . In [18] an adaptive variant of the CRC-aided SCL decoder is proposed in order to further improve the block-error probability of modified polar codes while maintaining the average decoding complexity at a moderate level.
The SCL decoding algorithm in [15] is described in terms of likelihoods. Unfortunately, computations with likelihoods are numerically unstable as they are prone to underflows. In recent hardware implementations of the SCL decoder [19] - [23] the stability problem was solved by using log-likelihoods (LLs). However, the use of LLs creates other important problems, such as an irregular memory with varying number of bits per word, as well as large processing elements, making these decoders still inefficient in terms of area and throughput.
Contributions and Paper Outline: After a background review of polar codes and SCL decoding in Section II, in Section III we prove that the SCL decoding algorithm can be formulated exclusively in the log-likelihood ratio (LLR) domain, thus enabling area-efficient and numerically stable implementation of SCL decoding. We discuss our SCL decoder hardware architecture in Section IV and leverage some useful properties of the LLR-based formulation in order to prune the radix-2L sorter (implementing the sorting step of SCL decoding) used in [19], [24] by avoiding unnecessary comparisons in Section V. Next, in Section VI we see that the LLR-based implementation leads to a significant reduction of the size of our previous hardware architecture [19], as well as to an increase of its maximum operating frequency. We also compare our decoder with the recent SCL decoder architectures of [22], [23] and show that our decoder can have more than 100% higher throughput per unit area than those architectures. Besides the implementation gains, it is noteworthy that most processing blocks in practical receivers process the data in the form of LLRs. Therefore, the LLR-based SCL decoder can readily be incorporated into existing systems while the LL-based decoders would require extra processing stages to convert the channel LLRs into LLs. In fairness, we note that one particular advantage of LL-based SCL decoders is that the algorithmic simplifications of [10]-[13] can immediately be applied to the SCL decoder [25], while in order to apply those simplifications to an LLR-based SCL decoder one has to rely on approximations [26].
1053-587X © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Finally, we show that a CRC-aided SCL decoder can be implemented by incorporating a CRC unit into our decoder, with almost no additional hardware cost, in order to achieve significantly lower block-error probabilities. As we will see, for a fixed information rate, the choice of CRC length is critical in the design of the modified polar code to be decoded by a CRC-aided SCL decoder. In Section VI-E we provide simulation results showing that for small list sizes a short CRC will improve the performance of SCL decoder while larger CRCs will even degrade its performance compared to a standard polar code. As the list size gets larger, one can increase the length of the CRC in order to achieve considerably lower block-error probabilities.
An interesting question, which is, to the best of our knowledge, still unaddressed in the literature, is whether it is better to use SC decoding with long polar codes or SCL decoding with short polar codes. In Section VIII we study two examples of long polar codes that have the same block-error probability under SC decoding as our ( ) modified polar codes under CRC-aided SCL decoding. By comparing the synthesis results of the corresponding decoders, we observe that, while the SCL decoders have a lower throughput due to the sorting step, they also have a significantly lower decoding latency than the SC decoders. 
II. BACKGROUND
Notation
Arıkan shows that, as the block length grows, these synthetic channels polarize to 'easy-to-use' B-DMCs [1, Theorem 1]. That is, all except a vanishing fraction of them will be either almost-noiseless channels (whose output is almost a deterministic function of the input) or useless channels (whose output is almost statistically independent of the input). Furthermore, the fraction of almost-noiseless channels is equal to the symmetric capacity of the underlying channel, that is, the highest rate at which reliable communication is possible through the channel when the input letters are used with equal frequency [27].
A. Polar Codes and Successive Cancellation Decoding
Having transformed identical copies of a 'moderate' B-DMC into 'extremal' B-DMCs , Arıkan constructs capacity-achieving polar codes by exploiting the almost-noiseless channels to communicate information bits.
1) Polar Coding:
In order to construct a polar code of rate and block length for a channel , the indices of the least noisy synthetic channels are selected as the information indices denoted by . The sub-vector will be set to the data bits to be sent to the receiver and , where , is fixed to some frozen vector which is known to the receiver. The vector is then encoded to the codeword through (1) using binary additions (cf. [1, Section VII]) and transmitted via independent uses of the channel .
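The encoding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the paper: the information set `info_idx` is simply supplied by the caller, the frozen vector is all-zero, and the transform is taken to be the n-fold Kronecker power of the 2x2 kernel [[1, 0], [1, 1]] in natural order (some formulations additionally apply a bit-reversal permutation, which is omitted here). The names `polar_transform` and `polar_encode` are hypothetical.

```python
def polar_transform(u):
    """Compute x = u * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]].

    Assumption: natural index order, no bit-reversal permutation.
    """
    x = list(u)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of 2"
    step = 1
    while step < n:
        # One butterfly stage: add the upper half of each block onto the lower half.
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]  # binary addition (XOR)
        step *= 2
    return x


def polar_encode(data_bits, info_idx, n, frozen_value=0):
    """Put data bits on the information indices, freeze the rest, then transform."""
    assert len(data_bits) == len(info_idx)
    u = [frozen_value] * n  # assumption: constant frozen vector known to the receiver
    for bit, idx in zip(data_bits, sorted(info_idx)):
        u[idx] = bit
    return polar_transform(u)
```

For instance, `polar_encode([1, 1], [2, 3], 4)` returns `[0, 1, 0, 1]`. Only XOR operations are used, which mirrors the remark that the encoder needs nothing beyond binary additions.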
The receiver observes the channel output vector and estimates the elements of the successively as follows: Suppose the information indices are ordered as (where ). Having the channel output, the receiver has all the required information to decode the input of the synthetic channel as , as, in particular, is a part of the known sub-vector . Since this synthetic channel is assumed to be almost-noiseless by construction, with high probability. Subsequently, the decoder can proceed to index as the information required for decoding the input of is now available. Once again, this estimation is with high probability error-free. As detailed in Algorithm 1, this process is continued until all the information bits have been estimated.
1 Let and be two length vectors and index their elements using binary sequences of length , . Then for .
2 Following the convention in probability theory, we denote the realizations of the random vectors , , and as , , and respectively.
2) SC Decoding as a Greedy Tree Search Algorithm: Let (5) denote the set of possible length-vectors that the transmitter can send. The elements of are in one-to-one correspondence with leaves of a binary tree of height : the leaves are constrained to be reached from the root by following the direction at all levels . Therefore, any decoding procedure is essentially equivalent to picking a path from the root to one of these leaves on the binary tree.
In particular, an optimal ML decoder, associates each path with its likelihood (or any other path metric which is a monotone function of the likelihood) and picks the path that maximizes this metric by exploring all possible paths: (6) Clearly such an optimization problem is computationally infeasible as the number of paths, , grows exponentially with the block-length .
The SC decoder, in contrast, finds a sub-optimal solution by maximizing the likelihood via a greedy one-time-pass through the tree: starting from the root, at each level , the decoder extends the existing path by picking the child that maximizes the partial likelihood .
3) Decoding Complexity: The computational task of the SC decoder is to calculate the pairs of likelihoods needed for the decisions in line 5 of Algorithm 1. Since the decisions are binary, it is sufficient to compute the decision log-likelihood ratios (LLRs),
It can be shown (see [1, Section VII] and [2]) that the decision LLRs (7) can be computed via the recursions,
One of the two update functions in these recursions involves exponentiations and logarithms. For a hardware implementation of the SC decoder, this update rule is replaced by an approximation that is 'hardware-friendly', as it involves only the easy-to-implement minimum operation. Such an approximation is called the "min-sum approximation" of the decoder.
Fig. 1
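The two LLR update rules and the min-sum approximation can be sketched as below. This is an illustration that assumes the standard forms of the updates: the exact rule ln((1 + e^(a+b)) / (e^a + e^b)), its min-sum replacement sign(a) sign(b) min(|a|, |b|), and the partial-sum-dependent rule (-1)^u a + b. The function names `f_exact`, `f_minsum` and `g_update` are hypothetical.

```python
import math


def f_minsum(a, b):
    """Min-sum approximation: only sign handling and a minimum are needed."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def f_exact(a, b):
    """Exact update ln((1 + e^(a+b)) / (e^a + e^b)), written as the min-sum
    term plus a correction so that no large exponent is ever evaluated."""
    correction = (math.log1p(math.exp(-abs(a + b)))
                  - math.log1p(math.exp(-abs(a - b))))
    return f_minsum(a, b) + correction


def g_update(a, b, u):
    """Update that uses the partial sum u in {0, 1}: (-1)^u * a + b."""
    return (1 - 2 * u) * a + b
```

The correction term in `f_exact` is what the min-sum approximation drops; it is bounded in magnitude by ln 2, which is why the approximation is usually acceptable in hardware.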
B. Successive Cancellation List Decoding
The successive cancellation list (SCL) decoding algorithm, introduced in [15], converts the greedy one-time-pass search of SC decoding into a breadth-first search under a complexity constraint in the following way: At each level , instead of extending the path in only one direction, the decoder is duplicated in two parallel decoding threads continuing in either possible direction. However, in order to avoid the exponential growth of the number of decoding threads, as soon as the number of parallel decoding threads reaches the list size $L$, at each step only the $L$ threads corresponding to the $L$ most likely paths (out of $2L$ tentatives) are retained. 3 The decoder eventually finishes with a list of $L$ candidates, corresponding to $L$ paths on the binary tree, and declares the most likely of them as the final estimate. This procedure is formalized in Algorithm 2. Simulation results in [15] show that for a ( ) polar code, a relatively small list size of is sufficient to have a close-to-ML block-error probability.
While a naive implementation of SCL decoder would have a decoding complexity of at least (due to duplications of data structures of size in lines 8 and 18 of Algorithm 2), a clever choice of data structures together with the recursive nature of computations enables the authors of [15] to use a copy-on-write mechanism and implement the decoder in complexity.
C. CRC-Aided Successive Cancellation List Decoder
In an extended version of their work [16] , Tal and Vardy observe that when the SCL decoder fails, in most of the cases, the correct path (corresponding to ) is among the paths the decoder has ended up with. The decoding error happens since there exists another more likely path which is selected in line 19 of Algorithm 2 (note that in such situations the ML decoder would also fail). They, hence, conclude that the performance of polar codes would be significantly improved if the decoder were assisted for its final choice.
Such an assistance can be realized by adding more non-frozen bits (i.e., creating a polar code of rate instead of rate ) to the underlying polar code and then setting the last non-frozen bits to an -bit CRC of the first information bits (note that the effective information rate of the code is unchanged). The SCL decoder, at line 19, first discards the paths that do not pass the CRC and then chooses the most likely path among the remaining ones. Since the CRC can be computed efficiently [28, Chapter 7], this does not notably increase the computational complexity of the decoder. The empirical results of [16] show that a ( ) concatenated polar code (with a -bit CRC) decoded using a list decoder with list size of , outperforms the existing state-of-the-art WiMAX ( ) LDPC code [29].
Remark: According to [30], the empirical results of [16] on the CRC-aided successive cancellation list decoder (CA-SCLD) are obtained using a ( ) (inner) polar code with the last unfrozen bits being the CRC of the first information bits and the results on the non-CRC aided (standard) SCL decoder are obtained using a ( ) polar code, both having an effective information rate of . In [17], [20], [23] the CA-SCLD is realized by keeping the number of non-frozen bits fixed and setting the last of them to the CRC of the preceding information bits. This reduces the effective information rate of the code and makes the comparison between the SCLD and the CA-SCLD unfair. 4
3 Although it is not necessary, the list size is normally a power of 2.
4 In [18] this discrepancy is not clarified. However, this work focuses only on CA-SCLD without comparing the performances of SCLD and CA-SCLD.
III. LLR-BASED PATH METRIC COMPUTATION
Algorithms 1 and 2 are both valid high-level descriptions of SC and SCL decoding, respectively. However, for implementing these algorithms, the stability of the computations is crucial. Both algorithms summarized in Section II are described in terms of likelihoods which are not safe quantities to work with; a decoder implemented using the likelihoods is prone to underflow errors as they are typically tiny numbers. 5
Considering the binary tree picture that we provided in Section II-A2, the decision LLRs (7) summarize all the necessary information for choosing the most likely child among two children of the same parent at level . In Section II-A3 we saw that having this type of decisions in the conventional SC decoder allows us to implement the computations in the LLR domain using numerically stable operations. However, the SCL decoder, in lines 10-18 of Algorithm 2, has to choose the $L$ most likely children out of $2L$ children of different parents (see [31, Fig. 3] for an illustration). For these comparisons the decision log-likelihood ratios alone are not sufficient. Consequently, the software implementation of the decoder in [15] implements the decoder in the likelihood domain by rewriting the recursions of Section II-A3 for computing pairs of likelihoods , from pairs of channel likelihoods . To avoid underflows, at each intermediate step of the updates the likelihoods are scaled by a common factor such that in line 10 of Algorithm 2 is proportional to [16]. Alternatively, such a normalization step can be avoided by performing the computations in the log-likelihood (LL) domain, i.e., by computing the pairs , for as a function of channel log-likelihood pairs , , [19]. Log-likelihoods provide some numerical stability, but still involve some issues compared to the log-likelihood ratios as we discuss in Section IV.
5 As noticed in [16], it is not difficult to see that .
Luckily, we shall see that the decoding paths can still be ordered according to their likelihoods using all of the past decision LLRs , and the trajectory of each path as summarized in the following theorem.
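Theorem 1: For each path $\ell$ and each level $i$ let the path-metric be defined as:
$$\mathrm{PM}_\ell^{(i)} := \sum_{j=0}^{i} \ln\left(1 + e^{-(1 - 2\hat{u}_j[\ell]) \cdot L_n^{(j)}[\ell]}\right), \quad (10)$$
where $L_n^{(i)}[\ell]$ is the log-likelihood ratio of bit $u_i$ given the channel output and the past trajectory of the path $\hat{u}_0^{i-1}[\ell]$. If all the information bits are uniformly distributed in $\{0, 1\}$, for any pair of paths $\ell_1$, $\ell_2$, the likelihood of path $\ell_1$ is smaller than the likelihood of path $\ell_2$ if and only if $\mathrm{PM}_{\ell_1}^{(i)} > \mathrm{PM}_{\ell_2}^{(i)}$.
In view of Theorem 1, one can implement the SCL decoder using $L$ parallel low-complexity and stable LLR-based SC decoders as the underlying building blocks and, in addition, keep track of $L$ path-metrics. The metrics can be updated successively as the decoder proceeds by setting
$$\mathrm{PM}_\ell^{(i)} = \phi\left(\mathrm{PM}_\ell^{(i-1)}, L_n^{(i)}[\ell], \hat{u}_i[\ell]\right),$$
where the function $\phi$ is defined as
$$\phi(\mu, \lambda, u) := \mu + \ln\left(1 + e^{-(1 - 2u)\lambda}\right).$$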
As shown in Algorithm 3, the paths can be compared based on their likelihood using the values of the associated path metrics. 
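The successive path-metric update can be sketched as follows. The sketch assumes the metric increment has the form ln(1 + exp(-(1 - 2u) * llr)) for a path that extends with bit u, and that the hardware-friendly variant adds |llr| only when u disagrees with the sign of the LLR. The names `phi_exact` and `phi_approx` are hypothetical.

```python
import math


def phi_exact(mu, llr, u):
    """Exact update: mu + ln(1 + exp(-(1 - 2u) * llr)), evaluated stably."""
    x = -(1 - 2 * u) * llr
    # ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|)) avoids overflow for large |x|.
    return mu + max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def phi_approx(mu, llr, u):
    """Approximate update: penalize by |llr| only if u contradicts the LLR."""
    sc_decision = 0 if llr >= 0 else 1  # direction the LLR suggests
    return mu if u == sc_decision else mu + abs(llr)
```

Because ln(1 + e^x) is never smaller than max(x, 0), the approximate metric is a lower bound on the exact one, and the two agree closely when the LLR magnitudes are large. A smaller metric corresponds to a more likely path.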
We also note that is the direction that the LLR (given the past trajectory ) suggests. This is the same decision that a SC decoder would have taken if it was to estimate the value of at step given the past set of decisions (cf. line 5 in Algorithm 1). Equation (12) shows that if at step the th path does not follow the direction suggested by it will be penalized by an amount . Having such an interpretation, one might immediately conclude that the path that SC decoder would follow will always have the lowest penalty hence is always declared as the output of the SCL decoder. So why should the SCL decoder exhibit a better performance compared to the SC decoder? The answer is that such a reasoning is correct only if all the elements of are information bits. As soon as the decoder encounters a frozen bit, the path metric is updated based on the likelihood of that frozen bit, given the past trajectory of the path and the a-priori known value of that bit (cf. line 6 in Algorithm 3). This can penalize the SC path by a considerable amount, if the value of that frozen bit does not agree with the LLR given the past trajectory (which is an indication of a preceding erroneous decision), while keeping some other paths unpenalized. We devote the rest of this section to the proof of Theorem 1. 
Having shown (13) , Theorem 1 will follow as an immediate corollary to Lemma 1 (since the channel output is fixed for all decoding paths). Since the path index is fixed on both sides of (10) we will drop it in the sequel. Let (the last equality follows since ), and observe that showing (13) is equivalent to proving (14) Since (15) Repeated application of (15) (for ) yields
Dividing both sides by proves (14) .
IV. SCL DECODER HARDWARE ARCHITECTURE
In this section, we show how the LLR-based path metric which we derived in the previous section can be exploited in order to derive a very efficient LLR-based SCL decoder hardware architecture. More specifically, we give a detailed description of each unit of our LLR-based SCL decoder architecture, which essentially consists of parallel SC decoders along with a path management unit which coordinates the tree search. Moreover, we highlight the advantages over our previous LL-based architecture described in [19] . Our SCL decoder consists of five units: the memories unit, the metric computation unit (MCU), the metric sorting unit, the address translation unit, and the control unit. An overview of the SCL decoder is shown in Fig. 2(a) .
A. LLR and Path Metric Quantization
All LLRs are quantized using a $Q$-bit signed uniform quantizer with step size $\Delta = 1$. The path metrics are unsigned numbers which are quantized using $M$ bits. Since the path metrics are initialized to 0 and, in the worst case, they are incremented by $2^{Q-1} - 1$ for each bit index $i$, the maximum possible value of a path metric is $N(2^{Q-1} - 1) = 2^{n+Q-1} - 2^n$, where $N = 2^n$ is the block length. Hence, at most $M = n + Q - 1$ bits are sufficient to ensure that there will be no overflows in the path metric. In practice, any path that gets continuously harshly penalized will most likely be discarded. Therefore, as we will see in Section VI, far fewer bits are sufficient in practice for the quantization of the path metrics.
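The worst-case bound on the path-metric width can be checked numerically. The sketch below assumes a block length of 2^n, an LLR quantizer with q bits whose largest magnitude is 2^(q-1) - 1 (symmetric saturation, unit step size), and a metric that grows by at most that magnitude per bit index. The function name and parameters are hypothetical.

```python
def worst_case_metric_bits(n, q):
    """Return (largest possible path metric, bits needed to store it).

    Assumptions: block length 2**n, q-bit signed LLRs saturating at
    +/-(2**(q-1) - 1), and one worst-case increment per bit index.
    """
    block_length = 1 << n
    max_increment = (1 << (q - 1)) - 1
    max_metric = block_length * max_increment
    return max_metric, max(1, max_metric.bit_length())
```

For example, `worst_case_metric_bits(10, 6)` gives a maximum metric of 31744, which fits in 15 = 10 + 6 - 1 bits.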
B. Metric Computation Unit
The computation of the LLRs (line 4 of Algorithm 3) can be fully parallelized. Consequently, the MCU consists of $L$ parallel SC decoder cores which implement the SC decoding update rules and compute the decision LLRs using the semi-parallel SC decoder architecture of [5] with processing elements (PEs). These decision LLRs are required to update the path metrics. Whenever the decision LLRs have been computed, the MCUs wait for one clock cycle. During this single clock cycle, the path metrics are updated and sorted. Moreover, based on the result of metric sorting, the partial sum, path, and pointer memories are also updated in the same clock cycle, as described in the sequel.
Each decoder core reads its input LLRs from one of the physical LLR memory banks based on an address translation performed by the pointer memory (described in more detail in Section IV-D).
C. Memory Unit
1) LLR Memory:
The channel LLRs are fixed during the decoding process of a given codeword, meaning that an SCL decoder requires only one copy of the channel LLRs. These are stored in a memory which is words deep and bits wide. On the other hand, the internal LLRs of the intermediate stages of the SC decoding (metric computation) process are different for each path . Hence we require physical LLR memory banks with memory positions per bank. All LLR memories have two read ports, so that all PEs can read their two -bit input LLRs simultaneously. Here, register-based storage cells are used to implement all the memories.
2) Path Memory: The path memory consists of -bit registers, denoted by . When a path needs to be duplicated, the contents of are copied to , where corresponds to an inactive path (cf. line 25 of Algorithm 3). The decoder is stalled for one clock cycle in order to perform the required copy operations by means of crossbars which connect each with all other . The copy mechanism is presented in detail in Fig. 2(b) , where we show how each memory bit-cell is controlled based on the results of the metric sorter. After path has been duplicated, one copy is extended with the bit value , while the other is updated with (cf. lines 26 and 27 of Algorithm 3).
3) Partial Sum Memory:
The partial sum memory consists of $L$ partial sum networks (PSNs), where each PSN is implemented as in [5]. When a path needs to be duplicated, the contents of the PSN are copied to another PSN , where corresponds to an inactive path (cf. line 25 of Algorithm 3). Copying is performed in parallel with the copy of the path memory in a single clock cycle by using crossbars which connect each PSN with all other PSNs . If PSN was duplicated, one copy is updated with the bit value , while the other copy is updated with . If a single copy of PSN was kept, then this copy is updated with the value of that corresponds to the surviving path.
D. Address Translation Unit
The copy-on-write mechanism used in [15] (which is fully applicable to LLRs) is sufficient to ensure that the decoding complexity is , but it is not ideal for a hardware implementation as, due to the recursive implementation of the computations, it still requires copying the internal LLRs which is costly in terms of power, decoding latency, and silicon area.
On the other hand, a sequential implementation of the computations enables a more hardware-friendly solution [19] , where each path has its own virtual internal LLR memory, the contents of which are physically spread across all of the LLR memory banks. The translation from virtual memory to physical memory is done using a small pointer memory. When a path needs to be duplicated, as with the partial sum memory, the contents of row of the pointer memory are copied to some row corresponding to a discarded path through the use of crossbars.
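The pointer-based address translation can be illustrated with a small software model. This is only one possible realization, sketched under assumptions that are not taken from [19]: each path always writes freshly computed LLRs of a stage into its own physical bank, and a per-path pointer row records which bank currently holds each stage. The class and method names are hypothetical.

```python
class PointerMemory:
    """Toy model of a virtual-to-physical LLR bank translation table."""

    def __init__(self, num_paths, num_stages):
        # ptr[path][stage] = physical bank holding that stage's LLRs for the path.
        self.ptr = [[path] * num_stages for path in range(num_paths)]

    def duplicate(self, src, dst):
        """Fork path src into slot dst by copying one pointer row, not the LLRs."""
        self.ptr[dst] = list(self.ptr[src])

    def read_bank(self, path, stage):
        """Physical bank from which the path reads the LLRs of the given stage."""
        return self.ptr[path][stage]

    def write_bank(self, path, stage):
        """A path writes new LLRs into its own bank and repoints that stage."""
        self.ptr[path][stage] = path
        return path
```

In this model a duplication moves only one small row of bank indices, while the LLR words themselves stay where they are; the forked path keeps reading the shared stages from the parent's bank until it recomputes them.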
E. Metric Sorting Unit
The metric sorting unit contains a path metric memory and a path metric sorter. The path metric memory stores the path metrics using bits of quantization for each metric. In order to find the median at each bit index (line 13 of Algorithm 3), the path metric sorter sorts the candidate path metrics , , (line 8 of Algorithm 3). The path metric sorter takes the path metrics as an input and produces the sorted path metrics, as well as the path indices and bit values which correspond to the sorted path metrics as an output. Since decoding cannot continue before the surviving paths have been selected, the metric sorter is a crucial component of the SCL decoder. Hence, we will discuss the sorter architecture in more detail in Section V.
F. Control Unit
The control unit generates all memory read and write addresses as in [5] . Moreover, the control unit contains the codeword selection unit and the optional CRC unit.
The CRC unit contains -bit CRC memories, where is the number of CRC bits. A bit-serial implementation of a CRC computation unit is very efficient in terms of area and path delay, but it requires a large number of clock cycles to produce the checksum. However, this computation delay is masked by the bit-serial nature of the SCL decoder itself and, thus, has no impact on the number of clock cycles required to decode each codeword. Before decoding each codeword, all CRC memories are initialized to -bit all-zero vectors. For each , the CRC unit is activated to update the CRC values. When decoding finishes, the CRC unit declares which paths pass the CRC. When a path is duplicated the corresponding CRC memory is copied by means of crossbars (like the partial sums and the path memory).
If the CRC unit is present, the codeword selection unit selects the most likely path (i.e., the path with the lowest metric) out of the paths that pass the CRC. Otherwise, the codeword selection unit simply chooses the most likely path.
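The interplay of the bit-serial CRC unit and the codeword selection unit can be sketched in software. The sketch makes several assumptions that the text does not fix: the CRC register starts at zero and is updated once per decoded information bit, the generator polynomial is passed in by the caller, and, if no path passes the CRC, the selection falls back to the path with the lowest metric. All names are hypothetical.

```python
def crc_update(state, bit, poly, r):
    """One bit-serial CRC step on an r-bit register.

    `poly` holds the generator polynomial coefficients below the leading
    term x^r (e.g. 0b0011 for x^4 + x + 1).
    """
    feedback = ((state >> (r - 1)) & 1) ^ bit
    state = (state << 1) & ((1 << r) - 1)
    if feedback:
        state ^= poly
    return state


def crc_bits(bits, poly, r):
    """CRC of a bit sequence, most significant register bit first."""
    state = 0  # all-zero initial register
    for b in bits:
        state = crc_update(state, b, poly, r)
    return [(state >> (r - 1 - k)) & 1 for k in range(r)]


def select_codeword(paths, metrics, num_info, poly, r):
    """Index of the lowest-metric path among those passing the CRC.

    Each path holds num_info information bits followed by r CRC bits.
    Assumption: if no path passes, fall back to the lowest metric overall.
    """
    passing = [i for i, p in enumerate(paths)
               if crc_bits(p[:num_info], poly, r) == p[num_info:num_info + r]]
    candidates = passing if passing else range(len(paths))
    return min(candidates, key=lambda i: metrics[i])
```

Since `crc_update` consumes one bit per call, it can run alongside the bit-by-bit decoding, which matches the observation that the CRC latency is hidden by the serial nature of the decoder.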
G. Clock Cycles Per Codeword
Let the total number of cycles required for metric sorting at all information indices be denoted by . As we will see in Section V-C, the sorting latency depends on the number of information bits and may depend on the pattern of frozen and information bits as well (both of these parameters can be deduced given ). Then, our SCL decoder requires (16) cycles to decode each codeword.
H. Advantages Over LL-Based SCL Decoder Implementation
The LLs in the SCL decoders of [19] - [23] are all positive numbers and the corresponding LL-domain update rules involve only additions and comparisons. This means that, as decoding progresses through the decoding stages, the dynamic range of the LLs is increased. Thus, in order to avoid catastrophic overflows, all LLs in stage are quantized using bits. In the LLR-based implementation of this paper, the LLRs of all stages can be quantized using the same number of bits since the update rules involve both addition and subtraction and the dynamic range of the LLRs in different stages is smaller than that of the LLs. This leads to a regular memory where all elements have the same bit-width. Hence, as we will see in Section VI, using LLRs significantly reduces the total size of the decoder. In addition, the PEs in the LL-based SCL decoder architectures of [19] , [20] must support computations with a much larger bit-width than the ones in our LLR-based SCL decoder architecture. Moreover, it turns out that the path metric in the LLR-based decoder can be quantized using much fewer bits than in the LL-based decoder, hence decreasing the delay and the size of the comparators in the metric sorting unit. Finally, the LLR-based formulation enables us to significantly simplify the metric sorter, as explained in the following section.
V. SIMPLIFIED SORTER
For large list sizes, the maximum (critical) delay path passes through the metric sorter, thus reducing the maximum operating frequency of the decoder in [19], [24]. It turns out that the LLR-based path metric we introduced in Theorem 1 has some properties (which the LL-based path metric lacks) that can be used to simplify the sorting task.
To this end, we note that the $2L$ real numbers that have to be sorted in line 13 of Algorithm 3 are not arbitrary; half of them are the previously existing path-metrics (which can be assumed to be already sorted as a result of decoding the preceding information bit) and the rest are obtained by adding positive real values (the absolute value of the corresponding LLRs) to the existing path metrics. Moreover, we do not need to sort all these $2L$ potential path metrics; a sorted list of the $L$ smallest path metrics is sufficient. Hence, the sorting task of the SCL decoder can be formalized as follows: Given a sorted list of $L$ numbers $\mu_0 \le \mu_1 \le \dots \le \mu_{L-1}$, a list of size $2L$, $m = [m_0, m_1, \dots, m_{2L-1}]$, is created by setting $m_{2\ell} := \mu_\ell$ and $m_{2\ell+1} := \mu_\ell + a_\ell$, where $a_\ell \ge 0$, for $\ell = 0, 1, \dots, L-1$. The problem is to find a sorted list of the $L$ smallest elements of $m$ when the elements of $m$ have the following two properties: for $\ell \in \{0, 1, \dots, L-2\}$,
$$m_{2\ell} \le m_{2(\ell+1)}, \quad (17a)$$
$$m_{2\ell} \le m_{2\ell+1}. \quad (17b)$$
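The structure of the list to be sorted can be made concrete with a short sketch. It assumes 0-based indexing in which even positions hold the existing (already sorted) path metrics and odd positions hold the same metrics increased by a non-negative penalty; the reference selection simply uses a full software sort. The function names are hypothetical.

```python
def build_candidates(mu, a):
    """Interleave the sorted metrics mu with the penalized metrics mu + a."""
    assert all(mu[i] <= mu[i + 1] for i in range(len(mu) - 1)), "mu must be sorted"
    assert all(x >= 0 for x in a), "penalties must be non-negative"
    m = []
    for mu_l, a_l in zip(mu, a):
        m.append(mu_l)        # even index: path follows the LLR, metric unchanged
        m.append(mu_l + a_l)  # odd index: path contradicts the LLR, penalized
    return m


def satisfies_properties(m):
    """Check the two structural properties of the candidate list."""
    half = len(m) // 2
    evens_sorted = all(m[2 * l] <= m[2 * l + 2] for l in range(half - 1))
    pairs_ordered = all(m[2 * l] <= m[2 * l + 1] for l in range(half))
    return evens_sorted and pairs_ordered


def smallest_half(m):
    """Reference answer: the len(m)/2 smallest candidates in sorted order."""
    return sorted(m)[:len(m) // 2]
```

With `mu = [0, 1, 3]` and `a = [2, 0.5, 1]` the candidate list is `[0, 2, 1, 1.5, 3, 4]` and the surviving metrics are `[0, 1, 1.5]`.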
A. Full Radix-2L Sorter
The most straightforward way to solve our problem is to sort the list up to the $L$-th element. This can be done using a simple extension of the radix-2L sorter described in [32], which blindly compares every pair of elements and then combines the results to find the first $L$ smallest elements. This is the solution we used in [19], which requires $\binom{2L}{2} = L(2L-1)$ comparators together with $L$ $2L$-to-1 multiplexers (see Fig. 3(a)). The sorting logic combines the results of all comparators in order to generate the control signal for the multiplexers (cf. [32] for details). The maximum path delay of the radix-2L sorter is mainly determined by the complexity of the sorting logic, which in turn depends on the number of comparator results that need to be processed.
B. Pruned Radix-2L Sorter
The pruned radix-2L sorter presented in this section reduces the complexity of the sorting logic of the radix-2L sorter and, thus, the maximum path delay, by eliminating some pairwise comparisons whose results are either already known or irrelevant.
Proposition 1: It is sufficient to use a pruned radix-2L sorter that involves only $(L-1)^2$ comparators to find the $L$ smallest elements of $m$. This sorter is obtained by (a) removing the comparisons between every even-indexed element of $m$ and all following elements, and (b) removing the comparisons between $m_{2L-1}$ and all other elements of $m$.
Proof: Properties (17a) and (17b) imply $m_{2\ell} \le m_j$ for $j > 2\ell$. Hence, the outputs of these comparators are known. Furthermore, as we only need the first $L$ elements of the list sorted and $m_{2L-1}$ is never among the $L$ smallest elements of $m$, we can always replace $m_{2L-1}$ by $+\infty$ (pretending the result of the comparisons involving $m_{2L-1}$ is known) without affecting the output of the sorter.
In step (a) we have removed $\sum_{\ell=0}^{L-1} (2L - 1 - 2\ell) = L^2$ comparators and in step (b) $L - 1$ comparators (note that in the full sorter $m_{2L-1}$ is compared to all $2L - 1$ preceding elements but $L$ of them correspond to even-indexed elements whose corresponding comparators have already been removed in step (a)). Hence we have $L(2L-1) - L^2 - (L-1) = (L-1)^2$ comparators.
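The pruning argument can be checked with a small software model of a rank-order sorter. The model below is a sketch, not the hardware sorting logic: it assumes 0-based indexing with the existing metrics at even positions, hard-wires the outcome of every comparison removed in steps (a) and (b), and places each element at the position given by the number of elements that precede it. The function names are hypothetical.

```python
def needs_comparator(i, j, L):
    """True if the pair (i, j), with i < j, keeps a physical comparator."""
    return i % 2 == 1 and j != 2 * L - 1


def pruned_comparator_pairs(L):
    """All index pairs that still require a comparator after pruning."""
    return [(i, j) for i in range(2 * L) for j in range(i + 1, 2 * L)
            if needs_comparator(i, j, L)]


def pruned_sort_smallest(m):
    """Return the L smallest elements of m (length 2L) in sorted order."""
    n = len(m)
    L = n // 2
    rank = [0] * n  # rank[k] = number of elements placed before element k
    for i in range(n):
        for j in range(i + 1, n):
            if needs_comparator(i, j, L):
                i_first = m[i] <= m[j]  # result of a physical comparator
            else:
                i_first = True          # known or irrelevant, hard-wired
            if i_first:
                rank[j] += 1
            else:
                rank[i] += 1
    out = [None] * n
    for idx, r in enumerate(rank):
        out[r] = m[idx]
    return out[:L]
```

Only the first L outputs are meaningful: the last element is always treated as the largest, so the tail of the list is not guaranteed to be sorted. Counting the pairs returned by `pruned_comparator_pairs(L)` reproduces the comparator count derived in the text.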
Besides the comparators, the pruned radix-2L sorter requires -to- multiplexers (see Fig. 3(b)). The pruned radix-2L sorter is derived based on the assumption that the existing path metrics are already sorted. This assumption is violated when the decoder reaches the first frozen bit after the first cluster of information bits; at each frozen index, some of the path-metrics are unchanged and some are increased by an amount equal to the absolute value of the LLR. In order for the assumption to hold when the decoder reaches the next cluster of information bits, the existing path metrics have to be sorted before the decoding of this cluster starts. The existing pruned radix-2L sorter can be used for sorting arbitrary positive numbers as follows.
Proposition 2: Let be non-negative numbers. Create a list of size as
Feeding this list to the pruned radix-sorter will result in an output list of the form zeros where is the ordered permutation of . Proof: It is clear that the assumptions (17a) and (17b) hold for . The proof of Proposition 1 shows if the last element of the list is additionally known to be the largest element, the pruned radix-sorter sorts the entire list.
Note that while the same comparator network of a pruned radix-sorter is used for sorting numbers, separate -to-1 multiplexers are required to output the sorted list.
C. Latency of Metric Sorting
We assume that the sorting procedure is carried out in a single clock cycle. A decoder based on the full radix-2L sorter only needs to sort the path metrics for the information indices; hence, the total sorting latency of such an implementation, given by (18), is one cycle per information index. Using the pruned radix-2L sorter, additional sorting steps are required at the end of each contiguous set of frozen indices. Counting the clusters of frozen bits for a given information set (see footnote 6), the metric sorting latency using the pruned radix-2L sorter, given by (19), is then the latency of (18) plus one cycle per cluster of frozen bits.
6 More precisely, we assume that the set of frozen indices is partitioned into subsets such that (i) the subsets are pairwise disjoint and their union is the whole set of frozen indices; (ii) every subset is a contiguous set of indices; and (iii) the union of any two distinct subsets is not contiguous. It can be easily checked that such a partition always exists and is unique.
VI. IMPLEMENTATION RESULTS
In this section, we present synthesis results for our SCL decoder architecture. For fair comparison with [23], we use a TSMC 90 nm technology with a typical timing library ( supply voltage, C operating temperature), and our decoder of [19] is re-synthesized using this technology. All synthesis runs are performed with timing constraints that are not achievable, in order to assess the maximum achievable operating frequency of each design, as reported by the synthesis tool. For our synthesis results, we have used PEs per SC decoder core, as in [5], [19]. The hardware efficiency is defined as the throughput per unit area and it is measured in Mbps/mm^2. The decoding throughput of all decoders is given by (20) as a function of the operating frequency of the decoder.
We first compare the LLR-based decoder of this work with our previous LL-based decoder [19], in order to demonstrate the improvements obtained by moving to an LLR-based formulation of SCL decoding. Then, we examine the effect of using the pruned radix-2L sorter on our LLR-based SCL decoder. Next, we compare our LLR-based decoder with the LL-based decoders of [23] and [22] (since [23] is an improved version of [20], we do not compare directly with [20]). A direct comparison with the SCL decoders of [21], [26] is unfortunately not possible, as the authors do not report their synthesis results in terms of . Finally, we provide some discussion on the effectiveness of a CA-SCLD.
A. Quantization Parameters
In Fig. 4 , we present the FER of floating-point and fixed-point implementations of an LL-based and an LLR-based SCL decoder for a ( ) polar code as a function of SNR. 7 For the floating-point simulations we have used the exact implementation of the decoder, i.e., for computing the LLRs the update rule of (8a) is used and the path metric is iteratively updated according to (11) . In contrast, for the fixed-point simulations we have used the min-sum approximation of the decoder (i.e., replaced with as in (9)) and the approximated path metric update rule of (12) .
We observe that the LL-based and the LLR-based SCL decoders have practically indistinguishable FER performance when quantizing the channel LLs and the channel LLRs with bits and bits respectively. Moreover, in our simulations we observe that the performance of the LL and the LLR-based SCL decoder is degraded significantly when and , respectively. As discussed in Section IV-A, metric quantization requires at most bits. However, in practice, much fewer bits turn out to be sufficient. For example, in our simulations for and , setting leads to the same performance as the worst-case , while setting results in a significant performance degradation due to metric saturation. Thus, all synthesis results of this section are obtained for for the LL-based decoder of [19] , and and for the LLR-based decoder for a fair (i.e., iso-FER) comparison.
The authors of [22] do not provide the FER curves for their fixed-point implementation of SCLD and the authors of [23] only provide the FERs for a CA-SCLD [23, Fig. 2] . Nevertheless, we assume their quantization schemes will not result in a better FER performance for a standard SCLD than that of [19] since they both implement exactly the same algorithm as in [19] (using a different architecture than [19] ).
B. Gains due to LLR-based Formulation of SCL Decoding
Our previous LL-based architecture of [19] and the LLR-based architecture with a radix-2L sorter presented in this paper are identical except that the former uses LLs while the latter uses LLRs. Therefore, by comparing these two architectures we can specifically identify the improvements in terms of area and decoding throughput that arise directly from the reformulation of SCL decoding in the LLR domain.
The cycle count for our SCL decoder using the radix-2L sorter when decoding a ( ) polar code is cycles (see (16) and (18)). From Table I, we see that our LLR-based SCL decoder occupies , , and smaller area than our LL-based SCL decoder of [19] for , , and , respectively. We present the area breakdown of the LL-based and the LLR-based decoders in Table II in order to identify where the area reduction mainly comes from and why the relative reduction in area decreases with increasing list size . The memory area corresponds to the combined area of the LLR (or LL) memory, the partial sum memory, and the path memory. We observe that, in absolute terms, the most significant savings in terms of area come from the memory for and from the MCU for . On the other hand, in relative terms, the biggest savings in terms of area always come from the MCU with an average area reduction of .
The relative reduction in the memory area decreases with increasing list size . This happens because each bit-cell of the partial sum memory and the path memory contains -to-crossbars, whose size grows quadratically with , while the LL (and LLR) memory grows only linearly in size with . Thus, the size of the partial sum memory and the path memory, which are not affected by the LLR-based reformulation, becomes more significant as the list size is increased, and the relative reduction due to the LLR-based formulation is decreased. Similarly, the relative reduction in the metric sorter area decreases with increasing , because the LLR-based formulation only decreases the bit-width of the comparators of the radix-2L sorter but it does not affect the size of the sorting logic, which dominates the sorter area as the list size is increased.
TABLE I
COMPARISON WITH LL-BASED IMPLEMENTATION
TABLE II
CELL AREA BREAKDOWN FOR THE LL-BASED AND THE RADIX-2L LLR-BASED SCL DECODERS
From Table I, we observe that the operating frequency (and, hence, the throughput) of our LLR-based decoder is , , and higher than that of our LL-based SCL decoder of [19] for , , and , respectively. Due to the aforementioned improvements in area and decoding throughput, the LLR-based reformulation of SCL decoding leads to hardware decoders with , , and better hardware efficiency than the corresponding LL-based decoders of [19], for , , and , respectively.
C. Radix-Sorter Versus Pruned Radix-Sorter
One may expect the pruned radix-2L sorter to always outperform the radix-2L sorter. However, the decoder equipped with the pruned radix-2L sorter needs to stall slightly more often to perform the additional sorting steps after groups of frozen bits. In particular, a ( ) polar code contains groups of frozen bits. Therefore, the total sorting latency for the pruned radix-2L sorter is (see (19)). Thus, we have , which is an increase of approximately compared to the decoder equipped with a full radix-2L sorter. Therefore, if using the pruned radix-2L sorter does not lead to a more than higher clock frequency, the decoding throughput will actually be reduced.
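The trade-off can be made concrete with a short calculation. The sketch below counts the clusters of frozen bits in a frozen/information pattern and the resulting number of sorting cycles for both sorters, using the one-cycle-per-sort assumption stated earlier. The toy pattern, the helper names, and the `sc_cycles` argument (standing in for the SC decoding cycle count of (16), which is not reproduced here) are illustrative assumptions, not values from our synthesis results.

```python
from fractions import Fraction
from itertools import groupby

def sorting_cycles(is_info):
    """Return (full, pruned) sorting cycle counts for a frozen/information pattern.

    is_info[i] is True for an information index and False for a frozen index.
    """
    num_info = sum(is_info)
    # A cluster is a maximal run of consecutive frozen indices.
    frozen_clusters = sum(1 for info, _ in groupby(is_info) if not info)
    return num_info, num_info + frozen_clusters

def break_even_speedup(is_info, sc_cycles):
    """Clock-frequency ratio the pruned sorter needs just to match the full sorter."""
    full, pruned = sorting_cycles(is_info)
    return Fraction(sc_cycles + pruned, sc_cycles + full)

# Toy pattern of length 16 (not a real polar code design).
pattern = [False, False, False, True, False, True, True, True,
           False, False, True, True, False, True, True, True]
full, pruned = sorting_cycles(pattern)
ratio = break_even_speedup(pattern, sc_cycles=40)   # 40 is an arbitrary placeholder
```

Since throughput is proportional to the clock frequency divided by the total cycle count, the pruned sorter only pays off when its clock-frequency gain exceeds `ratio`.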
As can be observed in Table III, this is exactly the case for , where the LLR-based SCL decoder with the pruned radix-2L sorter has a lower throughput than the LLR-based SCL decoder with the full radix-2L sorter. However, for the metric sorter starts to lie on the critical path of the decoder; therefore, using the pruned radix-2L sorter results in a significant increase in throughput of up to for .
To provide more insight into the effect of the metric sorter on our SCL decoder, in Table IV we present the metric sorter delay and the critical path start and end points of each decoder of Table III. The critical paths for and are also annotated in Fig. 2(a) with green dashed lines and red dotted lines, respectively. We denote the register of the controller which stores the internal LLR memory read address by . Moreover, let and denote a register of the partial sum memory and the metric memory, respectively. From Table IV, we observe that, for , the radix-2L sorter does not lie on the critical path of the decoder, which explains why using the pruned radix-2L sorter does not improve the operating frequency of the decoder. For the metric sorter does lie on the critical path of the decoder and using the pruned radix-2L sorter results in a significant increase in the operating frequency of up to . It is interesting to note that using the pruned radix-2L sorter eliminates the metric sorter completely from the critical path of the decoder for . For , even the pruned radix-2L sorter lies on the critical path of the decoder, but the delay through the sorter is reduced by .
D. Comparison With LL-Based SCL Decoders
In Table V, we compare our LLR-based decoder with the LL-based decoders of [23] and [22] along with our LL-based decoder of [19]. For the comparisons, we choose our SCL decoder with the best hardware efficiency for each list size, i.e., for we pick the SCL decoder with the radix-2L sorter, while for we pick the decoder with the pruned radix-2L sorter. Moreover, we pick the decoders with the best hardware efficiency from [22], i.e., the -rSCL decoders.
1) Comparison With [23]: From Table V we observe that our LLR-based SCL decoder has an approximately smaller area than the LL-based SCL decoder of [23] for all list sizes. Moreover, the throughput of our LLR-based SCL decoder is up to higher than the throughput achieved by the LL-based SCL decoder of [23], leading to a , , and better hardware efficiency for , and , respectively.
2) Comparison With [22]: The synthesis results of [22] are given for a nm technology, which makes a fair comparison difficult. Nevertheless, in order to enable as fair a comparison as possible, we scale the area and the frequency to a nm technology in Table V (we have also included the original results for completeness). Moreover, the authors of [22] only provide synthesis results for and . In terms of area, we observe that our decoder is approximately smaller than the decoder of [22] for all list sizes. We also observe that for our decoder has a lower throughput than the decoder of [22], but for the throughput of our decoder is higher than that of [22]. Overall, the hardware efficiency of our LLR-based SCL decoder is and better than that of [22] for and , respectively.
E. CRC-Aided SCL Decoder
As discussed in Section II-C, the performance of the SCL decoder can be significantly improved if its final choice is assisted by a CRC that rejects some incorrect codewords from the final set of candidates. However, there is a trade-off between the length of the CRC and the performance gain. A longer CRC rejects more incorrect codewords but, at the same time, degrades the performance of the inner polar code by increasing its rate. Hence, the CRC improves the overall performance only if the performance degradation of the inner polar code is compensated by rejecting the incorrect codewords in the final list.
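A minimal sketch of the CRC-aided final choice follows. It assumes that the decoder returns its candidates together with their path metrics (smaller is better), that the CRC bits are appended to the information bits, and that the decoder falls back to the best-metric candidate when no candidate passes the check. The generator polynomial x^4 + x + 1 is a toy choice, not one of the polynomials in (21a)-(21c), and all function names are hypothetical.

```python
def crc_remainder(bits, poly):
    """Remainder of bits(x) * x^r divided by poly(x) over GF(2); poly is MSB first."""
    r = len(poly) - 1
    reg = list(bits) + [0] * r
    for i in range(len(bits)):
        if reg[i]:
            for j, p in enumerate(poly):
                reg[i + j] ^= p
    return reg[-r:]

def passes_crc(word, poly):
    """Check that the last r bits of word are the CRC of the preceding bits."""
    r = len(poly) - 1
    return crc_remainder(word[:-r], poly) == list(word[-r:])

def ca_scl_select(candidates, poly):
    """candidates: list of (path_metric, bits). Pick the best-metric word that passes the CRC."""
    ranked = sorted(candidates, key=lambda c: c[0])
    for _, word in ranked:
        if passes_crc(word, poly):
            return word
    return ranked[0][1]   # assumption: fall back to the most likely path

POLY = [1, 0, 0, 1, 1]                        # x^4 + x + 1 (toy example)
message = [1, 0, 1, 1, 0, 0, 1, 0]
sent = message + crc_remainder(message, POLY)
corrupted = sent.copy()
corrupted[2] ^= 1                             # single-bit error, always detected
chosen = ca_scl_select([(0.4, corrupted), (1.1, sent)], POLY)
```

In this toy case the corrupted word has the better path metric, so a standard SCL decoder would output it, whereas the CRC check steers the choice to the transmitted word.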
1) Choice of CRC: We picked three different CRCs of lengths , and from [33], with the generator polynomials given in (21a), (21b), and (21c), respectively, and evaluated the empirical performance of the SCL decoders of list sizes of , , , aided by each of these three CRCs in the regime of dB to dB.
For , using either the CRC- or the CRC- (represented by generator polynomials (21a) and (21b), respectively) improves the performance of the standard SCL decoder. In contrast, using the CRC- makes the performance degradation of the inner polar code dominant at dB, causing the CA-SCLD to perform worse than the standard SCL decoder. At higher SNRs the performance of the CA-SCLD with CRC- is better than that of a standard SCL decoder but not better than that of a CA-SCLD with shorter CRCs. The CRC-aided SCL decoders with CRC- and CRC- have almost the same block-error probability (the block-error probability of the CA-SCLD with CRC- is only marginally better than that of the CA-SCLD with CRC- at dB). Given this observation and the fact that increasing the length of the CRC decreases the throughput of the decoder (see Section VI-E2), we conclude that the CRC- of (21a) is a reasonable choice for a CA-SCLD with list size .
For , allocating bits for the CRC of (21b) turns out to be the most beneficial option. CRC- and CRC- lead to almost identical FER at dB, while CRC- improves the FER significantly more than CRC- at higher SNRs. Furthermore, CRC- leads to the same performance as CRC- at high SNRs and worse performance than CRC- in the low-SNR regime.
Finally, for we observe that CRC- of (21c) is the best candidate among the three different CRCs in the sense that the performance of the CA-SCLD which uses this CRC is significantly better than that of the decoders using CRC- or CRC- for dB, while all three decoders have almost the same FER at lower SNRs (and they all perform better than a standard SCL decoder).
In Fig. 5 , we compare the FER of the SCL decoder with that of the CA-SCLD for list sizes of , and , using the above-mentioned CRCs.
2) Throughput Reduction: Adding bits of CRC increases the number of information bits by , while reducing the number of groups of frozen channels by at most . As a result, the sorting latency is generally increased, resulting in a decrease in the throughput of the decoder. In Table VI we have computed this decrease in the throughput for different decoders and we see that the CRC-aided SCL decoders have slightly (at most ) reduced throughput. For this table, we have picked the best decoder at each list size in terms of hardware efficiency from Table III.
3) Effectiveness of CRC: The area of the CRC unit for all synthesized decoders is less than m for the employed TSMC 90 nm technology. Moreover, the CRC unit does not lie on the critical path of the decoder. Therefore, it does not affect the maximum achievable operating frequency. Thus, the incorporation of a CRC unit is a highly effective method of improving the performance of an SCL decoder. For example, it is interesting to note that the CA-SCLD with has a somewhat lower FER than the standard SCL decoder with (in both floating-point and fixed-point versions) in the regime of dB. Therefore, if a FER in the range of to is required by the application, using a CA-SCLD with list size is preferable to a standard SCL decoder with list size as the former has more than five times higher hardware efficiency.
VII. DISCUSSION
A. SC Decoding or SCL Decoding?
Modern communication standards sometimes allow the usage of very long block-lengths. The error-rate performance of polar codes under conventional SC decoding is significantly improved if the block-length is increased. However, a long block-length implies long decoding latency and large decoders. Thus, an interesting question is whether it is better to use a long polar code with SC decoding or a shorter one with SCL decoding, for a given target block-error probability. In order to answer this question and carry out a fair comparison, we first need to find some pairs of short and long polar codes which have approximately the same block-error probability under SCL and SC decoding, respectively.
In Fig. 6 (a) we see that a ( ) polar code has almost the same block-error probability under SC decoding as a ( ) modified polar code under CA-SCLD with list size and CRC-of (21a). Similarly, in Fig. 6(b) we see that a ( ) polar code has almost the same block-error probability under SC decoding as an ( ) modified polar code decoded under CA-SCLD with list size and CRC-of (21b).
As mentioned earlier, our SCL decoder architecture is based on the SC decoder of [5]. In Table VII we present the synthesis results for the SC decoder of [5] at block lengths and and compare them with those of our LLR-based SCL decoder, when using the same TSMC 90 nm technology and identical operating conditions. For all decoders, we use PEs per path and bits for the quantization of the LLRs.
First, we see that the SCL decoders occupy an approximately larger area than their SC decoder counterparts. This may seem surprising, as it can be verified that an SC decoder for a code of length requires more memory (LLR and partial sum) than the memory (LLR, partial sum, and path) required by an SCL decoder with list size for a code of length , and we know that the memory occupies the largest fraction of both decoders. This discrepancy is due to the fact that the copying mechanism for the partial sum memory and the path memory still uses crossbars, which occupy significant area. It is an interesting open problem to develop an architecture that eliminates the need for these crossbars.
Moreover, we observe that both SC decoders can achieve a slightly higher operating frequency than their corresponding SCL decoders, although the difference is less than . However, the per-bit latency of the SC decoders is about smaller than that of the SCL decoders, due to the sorting step involved in SCL decoding. The smaller per-bit latency of the SC decoders, combined with their slightly higher operating frequency, gives the SC decoders an almost higher throughput than their corresponding SCL decoders.
However, from Table VII we see that the SCL decoders have a significantly lower per-codeword latency. More specifically, the SCL decoder with and has a lower per-codeword latency than the SC decoder with , and the SCL decoder with and has a lower per-codeword latency than the SC decoder with . Thus, for a fixed FER, our LLR-based SCL decoders provide a way of reducing the per-codeword latency at a small cost in terms of area, rendering them more suitable for low-latency applications than their corresponding SC decoders.
B. Simplified SC and SCL Decoders
There has been significant work done to reduce the latency of SC decoders [10] - [13] by pruning the decoding graph, resulting in simplified SC (SSC) decoders. The SC decoder architecture of [5] , used in our comparison above, does not employ any of these techniques. Since our SCL decoder uses SC decoders, it seems evident that any architectural and algorithmic improvements made to the SC decoder itself will be beneficial to the LLR-based SCL decoder as well. However, the family of SSC decoders does not seem to be directly applicable to our LLR-based SCL decoder. This happens because, in order to keep the path metric updated, we need to calculate the LLRs even for the frozen bits. As discussed in Section III, it is exactly these LLRs that lead to the improved performance of the SCL decoder with respect to the SC decoder. However, alternative and promising pruning approaches which have been recently introduced in the context of LL-based SCL decoding [22] , [34] , are fully applicable to LLR-based SCL decoding.
VIII. CONCLUSION
In this work, we introduced an LLR-based path metric for SCL decoding of polar codes, which enables the implementation of a numerically stable LLR-based SCL decoder. Moreover, we showed that we can simplify the sorting task of the SCL decoder by using a pruned radix-2L sorter which exploits the properties of the LLR-based path metric. The LLR-based path metric is not specific to SCL decoding and can be applied to any other tree-search based decoder (e.g., stack SC decoding [31]).
We implemented a hardware architecture for an LLR-based SCL decoder and we presented synthesis results for various list sizes. Our synthesis results clearly show that our LLR-based SCL decoder has a significantly higher throughput and lower area than all existing decoders in the literature, leading to a substantial increase in hardware efficiency of up to . Finally, we showed that adding the CRC unit to the decoder and using CA-SCLD is an easy way of increasing the hardware efficiency of our SCL decoder at a given block-error probability as the list size can be decreased. Specifically, our CA-SCLD at list size has somewhat lower block-error probability and more than five times better hardware efficiency than our standard SCLD at list size .
