This brief introduces a hardware complexity reduction method for successive cancellation list (SCL) decoders. Specifically, we propose a sorting scheme in which the L paths with the smallest path metrics are also sorted according to their path indexes for path pruning. We prove that such a sorting scheme reduces the number of multiplexer inputs in any hardware implementation of SCL decoding from L to (L/2 + 1) without any change in the decoding latency. Field programmable gate array (FPGA) implementations show that the proposed method achieves significant savings in hardware consumption, especially for large list sizes and block lengths.
I. INTRODUCTION
SUCCESSIVE-CANCELLATION (SC) is the first decoding algorithm proposed for polar codes by Arıkan in [1]. Being a low-complexity algorithm, SC brings a penalty in the achievable error performance. In [2], successive cancellation list (SCL) decoding was proposed to improve the error performance, following similar ideas in [3] developed for Reed-Muller (RM) codes.
A common problem of SCL decoder implementations is the high hardware complexity, which is mostly due to memory elements and large multiplexers in the designs. Memory elements in SCL decoder implementations store calculated log-likelihood ratio (LLR) values, decoded bits, partial-sums and pointers for each path. Multiplexers are used to copy the memory contents between decoding paths after path pruning stages. L-to-1 multiplexers are used for this purpose for each of the L paths, a structure which is commonly referred to as a crossbar. Input widths of these multiplexers are equal to the widths of the corresponding memory elements. For example, in SCL decoder architectures [4]-[15], L-to-1 multiplexers with input widths of O(N) bits (e.g., N bits, N/2 bits, etc.) are used to copy the contents of registers storing decoded bits and/or partial-sums between paths. In the cases where random access memory (RAM) blocks are used for storage, which is the conventional approach for storing calculated LLR values, pointer memories are used [5]. In such cases, L-to-1 multiplexers with input widths equal to the width of pointer registers are required for each path, the widths being $O(\log_2 N \log_2 L)$ bits. Similar arguments are valid for designs [16]-[21], where decoded bits and/or partial-sums are stored in RAM blocks or partly in registers and partly in RAM blocks.
In this brief, we propose a method to reduce the hardware complexity of SCL decoder implementations. We achieve this reduction by limiting the number of possible interactions between decoding paths by applying a novel sorting mechanism to determine the surviving paths at decision making stages of SCL decoding. Specifically, L surviving paths, which are chosen out of 2L candidate paths, are obtained as sorted with respect to their path indexes and path copying operations are performed according to the result of this sorting mechanism. We prove that the proposed method reduces the input number of memory copying multiplexers from L to (L/2 + 1). We also describe sorter architectures to enable the proposed method in an SCL decoder.
The rest of this brief is organized as follows. We give background information on SCL decoding in Section II. The proposed complexity reduction method and sorter architectures are described in Section III. Section IV gives the implementation results. Section V concludes this brief.
II. SUCCESSIVE-CANCELLATION LIST DECODING
Vectors are denoted by bold lowercase letters. We use $c_i^k$ to denote the vector $(c_i, c_{i+1}, \ldots, c_k)$. For any set $S \subseteq \{0, 1, \ldots, N-1\}$, $S^c$ denotes its complement.
A high level description of the SCL decoding algorithm is given in Algorithm 1. The indexes of information bits in the uncoded bit vector u of length-N are chosen from A, which is the set of indexes of K polarized channels with smallest Bhattacharyya parameters [1] . The probability
is defined as the decision probability for the i-th bit of the k-th decoding path for the bit value u ∈ {0, 1}. SCL decoders keep L decoded bit sequences during the decoding process in order to enhance the error performance. Decoding paths are formed during the decision making stages of SC decoding for information bits. An existing path is split into two candidate paths for the bit decision values of 0 and 1 for the decoded information bit.
In order to limit the exponential growth of decoding paths, path pruning is performed to limit the number of paths to the maximum list size L. When the number of candidate paths exceeds L, an SCL decoder chooses L paths as the surviving paths according to their path metrics. After path pruning, new paths are formed from existing paths. In hardware implementations, memory copying operations are required to perform this task, which requires L-to-1 multiplexers for each memory element. In the next section, we describe a sorting method to reduce the number of inputs of such multiplexers down to L/2 + 1.
III. COMPLEXITY REDUCTION BASED ON MULTIPLEXERS IN SCL DECODERS
As mentioned in the previous section, an SCL decoder forms candidate paths from existing paths for each information bit. In this brief, we assume that this operation is performed by forming candidate paths with indexes 2l − 1 and 2l from the existing path with index l, 1 ≤ l ≤ L. Path metrics of the candidate paths can be calculated as in [4] or [22].
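The splitting convention above, followed by conventional pruning, can be illustrated with a minimal behavioral sketch. The function name `split_and_prune` and the per-path `penalties` argument are hypothetical and introduced only for illustration; the actual path metric updates follow [4] or [22] and are not modeled here. Smaller metrics are assumed to be better, as in the text.

```python
def split_and_prune(metrics, penalties, L):
    """Behavioral model of one decision stage (illustrative only).

    metrics   : path metrics of the existing paths l = 1, 2, ..., len(metrics)
    penalties : hypothetical (pen0, pen1) metric increments per existing path
                for the bit decisions 0 and 1
    L         : maximum list size

    Returns the surviving candidates as (candidate_index, parent_index, metric)
    tuples, ordered by metric (the conventional, metric-only ordering).
    """
    candidates = []
    for l, (pm, (pen0, pen1)) in enumerate(zip(metrics, penalties), start=1):
        # Existing path l produces candidates 2l - 1 (bit 0) and 2l (bit 1).
        candidates.append((2 * l - 1, l, pm + pen0))
        candidates.append((2 * l, l, pm + pen1))

    if len(candidates) <= L:
        return candidates  # list not yet full, no pruning needed

    # Keep the L candidates with the smallest path metrics.
    return sorted(candidates, key=lambda c: c[2])[:L]


if __name__ == "__main__":
    survivors = split_and_prune([0.0, 1.0], [(0.0, 2.0), (0.5, 0.1)], L=2)
    print(survivors)
```

In this model the parent index of each survivor tells which dedicated memory must be copied, which is the operation the crossbars implement in hardware.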
A generic SCL decoder architecture is given in Fig. 1 . In conventional hardware implementations of the SCL algorithm, decoding paths are assigned to dedicated hardware elements, i.e., circuitry for calculations and memory elements. The dedicated circuitry of a decoding path consists of processing elements (represented by PE block in the SC module) and partial-sum logic (represented by PS logic in the PS unit). The dedicated memory elements of a decoding path store the calculated LLR values, partial-sum bits, previously decoded bits and/or pointers to specify the RAM blocks which the particular path should access for decoding calculations.
After path pruning, the dedicated hardware of each decoding path is assigned to a new surviving path to continue with the decoding operations. The memory contents for each new surviving path are copied from the corresponding dedicated hardware of an existing path according to an ordering. The ordering is determined by a sorter or a module which finds the L candidate paths with the smallest path metrics.
The copying operations explained above are performed by crossbars consisting of L-to-1 multiplexers for each of the L paths, as shown in Fig. 1. The input widths of such multiplexers are required to be O(N) bits if they are employed to copy partial-sum and decoded bit registers. Calculated LLR values are conventionally stored in RAM blocks, as the number of bits to be stored is higher than those of partial-sums and decoded bits. In this case, pointer memories with input widths of $O(\log_2 N \log_2 L)$ bits are employed to map calculated LLR RAM blocks to paths [5] and are copied by crossbars with equal widths.
A. Proposed Method
We achieve a reduction in the number of inputs of the crossbar multiplexers in an SCL decoder. The reduction is achieved by ordering the indexes of surviving paths, so that the dedicated memory elements of each decoding path can get memory contents from a limited set of other dedicated paths. We prove that the number of elements in such a set is (L/2 + 1), so that the employed multiplexers are (L/2 + 1)-to-1 instead of L-to-1.
Proposition: In an SCL decoder implementation, the number of paths that can have their memory contents copied to a specific path after path pruning is (L/2 + 1) instead of L, if the L surviving paths are ordered according to their indexes.
Proof: We index the candidate paths by $i$, $1 \le i \le 2L$. A candidate path $i$ is formed from the existing path $\lfloor (i-1)/2 \rfloor + 1$ according to our path splitting definition. After path pruning, $L$ candidate paths with indexes $i_1, i_2, \ldots, i_L$ survive, where $1 \le i_k \le 2L$ for all $k \in \{1, 2, \ldots, L\}$. The surviving path $i_k$ is assigned to path $k$ to continue with the decoding operations, and the memory contents of the existing path $\lfloor (i_k-1)/2 \rfloor + 1$ are copied to the memories of path $k$. When the surviving path indexes are sorted, they satisfy the expression

$$i_1 < i_2 < \cdots < i_L. \quad (1)$$
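Since $1 \le i_k \le 2L$, expression (1) implies that surviving path indexes are limited by specific minimum and maximum values: $k - 1$ distinct indexes lie below $i_k$ and $L - k$ distinct indexes lie above it. For any $i_k$, the minimum and maximum values are found as

$$k \le i_k \le L + k. \quad (2)$$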
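In order to find all possible indexes of existing paths that the $k$-th surviving path can originate from, we use (2) to write

$$\left\lfloor \frac{k-1}{2} \right\rfloor + 1 \le \left\lfloor \frac{i_k-1}{2} \right\rfloor + 1 \le \left\lfloor \frac{L+k-1}{2} \right\rfloor + 1. \quad (3)$$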
Expression (3) shows that the dedicated memories of the k-th surviving path can get memory contents from the dedicated memories of paths with indexes specified by the limits. It is straightforward to show that there are (L/2 + 1) elements in the interval (3) for L being an even number, which completes the proof.
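The proposition can also be checked numerically for small list sizes. The following sketch is not part of the proof; it enumerates every possible set of L survivors out of 2L candidates, takes them in increasing index order, and records which existing paths can feed each path k. The helper names `parent` and `source_sets` are hypothetical.

```python
from itertools import combinations


def parent(i):
    """Existing path that candidate i (1-based) is split from: floor((i-1)/2) + 1."""
    return (i - 1) // 2 + 1


def source_sets(L):
    """For each path k, collect every existing path whose memory could be
    copied to path k when survivors are assigned in increasing index order."""
    sources = {k: set() for k in range(1, L + 1)}
    # combinations() yields each survivor set as an increasing tuple,
    # which is exactly the index-sorted assignment i_1 < i_2 < ... < i_L.
    for survivors in combinations(range(1, 2 * L + 1), L):
        for k, i_k in enumerate(survivors, start=1):
            sources[k].add(parent(i_k))
    return sources


if __name__ == "__main__":
    for L in (2, 4, 8):
        sizes = {len(s) for s in source_sets(L).values()}
        print(L, sizes)  # every path should have L/2 + 1 possible sources
```

The exhaustive search is only practical for small L (there are C(2L, L) survivor sets), but it confirms the (L/2 + 1) bound for the list sizes considered in this brief.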
The proposed sorting method does not affect the latency or the error performance of SCL decoding. We demonstrate the error performances of SCL decoding with conventional and proposed sorting methods in Fig. 2.
B. Sorter Design for the Proposed Method
In this section, we give sorter designs for the proposed method. A generic sorter architecture for the method is given in Fig. 3. The architecture takes the path metrics and path indexes of 2L candidate paths as inputs, denoted by m_in_k and i_in_k, respectively. We use a two-stage sorter. The first stage finds the surviving paths according to their path metrics. The second stage sorts the surviving paths according to their indexes. The architecture outputs the path metrics and path indexes of the L surviving paths, denoted by m_out_k and i_out_k, and quantized by q and p bits, respectively.
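The two-stage behavior described above can be modeled in software as follows. This is a functional sketch only, assuming smaller metrics are better; it uses Python's built-in sort in place of the radix-2L, bitonic or MVF hardware structures, and the function name `two_stage_sort` is hypothetical. The argument names mirror the signal names m_in_k and i_in_k.

```python
def two_stage_sort(m_in, i_in, L):
    """Functional model of the two-stage sorter (not a hardware description).

    m_in : path metrics of the 2L candidate paths
    i_in : path indexes of the 2L candidate paths
    L    : list size

    Returns (m_out, i_out): metrics and indexes of the L survivors,
    ordered by increasing path index.
    """
    candidates = list(zip(m_in, i_in))

    # Stage 1: keep the L candidates with the smallest path metrics.
    survivors = sorted(candidates, key=lambda c: c[0])[:L]

    # Stage 2: order the survivors by their path indexes.
    survivors.sort(key=lambda c: c[1])

    m_out = [m for m, _ in survivors]
    i_out = [i for _, i in survivors]
    return m_out, i_out


if __name__ == "__main__":
    m_out, i_out = two_stage_sort([5, 1, 7, 3, 2, 8, 6, 4],
                                  [1, 2, 3, 4, 5, 6, 7, 8], L=4)
    print(m_out, i_out)
```

Since the first stage only has to select the L best candidates and not order them, a selection structure without final ordering stages is sufficient there, which is the role the MVF plays in the text.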
We propose three different sorter designs. Our aim is to offer a trade-off between hardware complexity and throughput with the proposed designs. One of the sorter architectures we employ in our designs is the radix-2L sorter [5], [23]. Other sorting methods we consider are the bitonic sorter proposed in [24] and the maximum values filter (MVF) proposed in [20]. MVF is a bitonic sorter without the final sorting stages, so that it extracts the L paths with the smallest path metrics out of 2L candidate paths without any ordering. Table I summarizes the proposed designs and their complexity and delay characteristics with respect to each other.
IV. IMPLEMENTATION RESULTS
In this section, we verify the proposed sorting method by field programmable gate array (FPGA) implementations. We use Xilinx-XCZU9EG-FFVB1156-2-i in the implementations and the methods in [25] for obtaining multiplexers whose number of inputs is different from $2^m$. Table II presents the look-up table (LUT) numbers for conventional crossbars for different list sizes and bit widths. One can observe that significant numbers of LUTs are required for a single crossbar when the bit width is large. Table III gives the implementation results for state-of-the-art sorters and the sorter designs explained in the previous section. We consider the simplified bubble sorter (SBS) [26], radix-2L sorter [5], MVF [20] and simplified odd-even sorter (OES) [27] for comparison. From the results in Tables II and III, one can directly observe that the crossbar complexities are significantly higher than those of the sorters. Therefore, the increase in complexity due to the employed sorter design is expected to be negligible compared to the gains obtained from crossbars with the proposed method.
Comparing the state-of-the-art and the proposed sorters, one can observe that the proposed sorting method does not necessarily imply higher sorter complexity or delay. For example, the hardware consumption of a radix-2L sorter is higher than those of Designs 1 and 2 for L > 4 and L > 8, respectively. Design 3 achieves higher throughput than those of SBS, MVF and OES for L > 4. Furthermore, there are large variations in hardware consumption and operating frequency among the state-of-the-art sorters as well. We can conclude that the proposed sorting method can be implemented with lower hardware complexity or higher throughput than those of state-of-the-art sorting methods. Table IV presents the estimated LUT gains from the crossbar implementations when the proposed sorting method is used. Comparing Tables III and IV, one can observe that the expected hardware consumption gains are much larger than possible hardware consumption increases due to the sorter designs in a decoder.

TABLE III: SORTER IMPLEMENTATION RESULTS
TABLE IV: ESTIMATED LUT GAINS FROM CROSSBARS
TABLE V: DECODER IMPLEMENTATION RESULTS
In order to validate the estimated gains in Table IV, we implement SCL decoders with and without the proposed method. We use the semi-parallel architecture in [28] with P = 32 processing elements and the LLR-based metric calculation in [4]. Decoded bits are stored in registers of N bits. Calculated LLRs are stored in RAM blocks, and pointer registers of $(\log_2 N - 1) \log_2 L$ bits are used for pointer-based copying. Partial-sums are calculated by the partial-sum network in [29] and stored in registers of P (parallel part) and N/2 (serial part) bits. LLRs are represented in the conventional sign-magnitude form; however, different LLR representation methods, such as [30], can also be employed. All registers in the implementations are copied by crossbars with the corresponding input widths. The implemented decoders are flexible in terms of code rate. Conventional decoders without the proposed sorting method employ the simplified OES sorter listed in Table III. Implementation results are given in Table V.
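As a rough arithmetic illustration of the register widths listed above, the sketch below computes the widths for a given N, L and P and counts the data inputs of a crossbar as (number of paths) × (inputs per multiplexer) × (bit width). This input count is an assumption made here for illustration; it is not a LUT estimate and does not reproduce the values in Tables II, IV or V. The function names are hypothetical, and N and L are assumed to be powers of two.

```python
def register_widths(N, L, P=32):
    """Widths (in bits) of the per-path registers described in the text.
    N and L are assumed to be powers of two."""
    log2_N = N.bit_length() - 1
    log2_L = L.bit_length() - 1
    return {
        "decoded_bits": N,                    # decoded bit register
        "pointers": (log2_N - 1) * log2_L,    # pointer register
        "partial_sum_parallel": P,            # parallel part of partial-sums
        "partial_sum_serial": N // 2,         # serial part of partial-sums
    }


def mux_inputs(L, index_sorted):
    """Inputs per copy multiplexer: L conventionally, L/2 + 1 when the
    survivors are sorted by path index."""
    return L // 2 + 1 if index_sorted else L


def crossbar_data_inputs(L, width, index_sorted):
    """Illustrative size measure: L multiplexers, each 'width' bits wide."""
    return L * mux_inputs(L, index_sorted) * width


if __name__ == "__main__":
    N = 4096
    for L in (4, 8):
        widths = register_widths(N, L)
        conventional = crossbar_data_inputs(L, widths["decoded_bits"], False)
        proposed = crossbar_data_inputs(L, widths["decoded_bits"], True)
        print(L, widths["pointers"], conventional, proposed)
```

Under this simple measure the number of multiplexer data inputs scales by (L/2 + 1)/L, so the relative saving grows with the list size, which is consistent with the trend reported for larger L.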
The results show that the proposed method achieves significant hardware consumption gain for SCL decoders. For the considered decoder architecture and block lengths, we obtain hardware consumption gains of approximately 9.41% for L = 4 and 23.3% for L = 8 with the proposed method. The complexity gains can be verified from the crossbar implementation results in Table IV for a 4096-bit crossbar to copy decoded bit registers and a 2048-bit crossbar to copy partial-sum registers.
The sorting method we propose does not increase the decoding latency, so the throughputs of the compared architectures depend on their maximum achievable operating frequencies in the same way. The maximum delay path of the decoder passes through the sorter, as pointed out in [4]. We observe that the throughput of the decoders using the proposed sorting method is higher than that of the conventional decoders, owing to the smaller delay of the Design 3 sorter with respect to the delay of the OES sorter.
Finally, we compare a decoder implementation employing the proposed sorting method with state-of-the-art SCL decoders. Table VI gives the implementation results. We optimize our decoder for code rate 0.5 for a fair comparison with the presented decoders. For this purpose, we apply certain simplifications of simplified successive cancellation list (SSCL) decoding [31]. Specifically, we perform rate-0 and repetition code simplifications for constituent codes of length up to 16. For a polar code with N = 4096 and K = 2048, this reduces the decoding latency down to 7297 with a similar hardware complexity, as also stated in [32]. We note that the throughput can be further improved by the optimizations in [33] without any change in the sorter architecture.
Compared with the decoders in [19] and [34] , the decoder with the proposed sorting method has a significant advantage in terms of hardware complexity. The decoder in [35] achieves hardware complexity reduction by completely eliminating decoded bit and partial-sum crossbars at the expense of increased latency. Therefore, the decoder in [35] has a higher latency even though it benefits from multibit decoding (MBD) [6] and switches to parallel decoding at low decoding stages to achieve latency reduction. As seen from the implementation results, the simplified decoder with the proposed sorting method can achieve larger throughput than that of [35] with significantly lower memory consumption. On the other hand, the decoder in [35] can operate with larger list sizes owing to the lack of crossbars with large input widths. The results show that the proposed method offers a balanced decoder design with reasonable hardware complexity and throughput, especially for large block lengths.
We note that the presented sorters and decoder are example implementations to verify the gains obtained by the proposed sorting method.
V. CONCLUSION
In this brief, a hardware complexity reduction method for SCL decoder implementations is proposed. The method comprises applying a sorting mechanism to candidate path metrics so that surviving paths are sorted according to their path indexes. With the proposed method, (L/2 + 1)-to-1 multiplexers can be employed instead of L-to-1 multiplexers for memory copying operations after path pruning. Implementation results show that significant reduction in hardware consumption is achievable with the proposed method without any penalty in decoding latency. Future work includes novel sorter designs and ASIC implementations for the proposed method.
