Abstract: Long polar codes achieve the capacity of binary-input discrete memoryless channels when decoded with a successive cancelation (SC) algorithm. For polar codes with short or moderate length, the decoding performance of the SC algorithm is inferior, and the cyclic redundancy check (CRC) aided successive cancelation list (SCL) algorithm achieves significantly improved performance. In this paper, we propose an efficient list decoder architecture for the CRC aided SCL algorithm. Three list decoders with list sizes L = 2, 4 and 8, respectively, are implemented in a 90 nm CMOS technology. Compared to list decoders with L = 2 and 4 in the literature, the proposed list decoders achieve 1.42 and 2.84 times higher hardware efficiency, respectively. The implementation with list size L = 8 demonstrates that our decoder architecture works for large list sizes.
I. INTRODUCTION
Polar codes, recently introduced by Arıkan [1], are a significant breakthrough in coding theory. It is proved in [1] that polar codes achieve the channel capacity of binary-input symmetric memoryless channels. Besides, they can be efficiently decoded by the successive cancelation (SC) algorithm [1] with complexity O(N log N), where N is the code block length. Polar codes with very large code block length ($N > 2^{20}$ [2]) approach the capacity when decoded with the SC algorithm. The error performance of polar codes with short or moderate length, when decoded with the SC algorithm, is much worse than that of turbo or low-density parity-check (LDPC) codes [3]. An SC list (SCL) algorithm was recently proposed [3] to improve the performance of polar codes with short or moderate length. Since the SCL algorithm results in little performance improvement for list size L > 2 in the moderate to high SNR region [3], a CRC [4] aided SCL (CA-SCL) algorithm was proposed [3] to improve the performance of polar codes with short or moderate length for a large list size.
This work was supported by NSF under Grant ECCS-1055877.
Existing polar decoder architectures in the literature mainly focus on SC decoders. A semi-parallel polar decoder with reduced hardware complexity was proposed in [2]. A reduced-latency SC decoder architecture was proposed in [5]. The simplified SC decoder for polar codes proposed in [6] reduces the decoding latency by more than 88% for a rate-0.7 polar code with length $2^{18}$. Though increasing the code length can improve the error performance of an SC decoder, it is not always feasible due to latency requirements. Thus, it is necessary to investigate list decoder architectures for polar codes with short or moderate length. The tree search list decoder architecture for the SCL algorithm proposed in [7] is the only list decoder architecture for polar codes in the literature to the best of our knowledge. The main contributions of this paper are:
1) An efficient list decoder architecture for the CA-SCL algorithm is proposed, and it is implemented for L = 2, 4 and 8, respectively, in a 90 nm CMOS technology. Our list decoder architecture is based on a novel path pruning unit (PPU) that applies to larger list sizes. Hence, our list decoder architecture supports larger list sizes, where the CA-SCL algorithm leads to a performance gain over the SCL algorithm in the moderate to high SNR region. The implementation with list size L = 8 demonstrates that the proposed architecture works for large list sizes.
2) Compared with the list decoders with the same list sizes in [7], the proposed list decoders for list sizes L = 2 and 4 achieve 1.42 and 2.84 times higher hardware efficiency (throughput normalized by area), respectively.
II. POLAR CODES AND THE CA-SCL ALGORITHM
A generator matrix G of a polar code is an N × N matrix ($N = 2^n$) given by $G = B_N F^{\otimes n}$, where $B_N$ is the bit-reversal permutation matrix, $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, and $F^{\otimes n}$ is the n-th Kronecker power of F.
In [3], the list decoding technique was applied to the SC algorithm, and the resulting SCL algorithm outperforms the original SC algorithm. The SCL algorithm keeps at most L decoding paths during the decoding of a codeword and outputs a list of L possible decoded codewords (L is called the list size). For $l = 0, 1, \cdots, L-1$ and $\lambda = 1, 2, \cdots, n-1$, let $P_{l,\lambda}$ be a probability message array which contains $2^{n-\lambda}$ elements.
For the CA-SCL algorithm, some information bits are replaced with the cyclic redundancy check (CRC) checksum of the rest of the information bits. When the information part of a decoded codeword passes the CRC, it is chosen as the final output codeword. The CA-SCL algorithm can be implemented in the logarithmic domain, where the basic computations of the path metric computing algorithm in [3, Alg. 3] are given in (1) and (2), where max(x, y) is a simplification of $\max{}^*(x, y) = \max(x, y) + \log(1 + e^{-|x-y|})$. Each probability message is denoted as a log-likelihood probability message (LLM).
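To make the log-domain computations concrete, the following Python sketch shows one plausible form of the two LLM updates, obtained by taking logarithms of the probability recursions in [3, Alg. 3] and dropping constant normalization terms. It is an illustration under stated assumptions, not necessarily the exact form of (1) and (2): the names `f_update` and `g_update`, the pair layout `[LLM for bit 0, LLM for bit 1]`, and the `exact` switch are all hypothetical.

```python
import math

def max_star(x, y):
    """Exact log-domain addition: log(e^x + e^y)."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def f_update(a, b, exact=False):
    """Assumed form of the first update.

    a, b: LLM pairs [LLM(bit = 0), LLM(bit = 1)] from the previous stage.
    Returns the LLM pair of the XOR of the two bits. With exact=False,
    max* is replaced by its simplification max.
    """
    combine = max_star if exact else max
    return [combine(a[u] + b[0], a[u ^ 1] + b[1]) for u in (0, 1)]

def g_update(a, b, partial_sum):
    """Assumed form of the second update, steered by a binary partial sum."""
    return [a[partial_sum ^ u] + b[u] for u in (0, 1)]
```

With `exact=True`, `f_update` reproduces the probability-domain sum exactly; with the default `max`, it keeps only the dominant term, which is what makes the simplified version cheap in hardware.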
III. EFFICIENT LIST DECODER ARCHITECTURE

A. Top Architecture
For a polar code of length $N = 2^n$, the proposed list decoder architecture, shown in Fig. 1, consists of T basic processing units (PUs) that perform the computations in (1) and (2). The proposed list decoder architecture also contains n LLM register arrays (LRAs).
The partial sum computation unit $\mathrm{PSU}_l$ calculates the binary partial sums used in the computation in (2). The CRC checksum (CRCC) module tests whether the information bits of a decoded codeword pass the CRC.
Suppose
n are updated during the path metric computation process of decoding step i.
Once the LLM register $R_{l,n}$, which stores two path metrics, is updated, if $u_i$ is an information bit, each current decoding path l is split into two decoding paths with the associated path metrics stored in $R_{l,n}[0][0]$ and $R_{l,n}[0][1]$, respectively. The corresponding decoded bits of $u_i$ for the two split decoding paths are 0 and 1, respectively.
For the proposed list decoder architecture, instead of using the lazy copying method in [3], a direct copying scheme is employed to simplify the control of the decoder. During the path pruning process, a decoding path l may be processed in one of the following three ways:
1) Deleted: The associated LLMs and partial sums are no longer kept.
2) Copied: Part of the contents in $P_{l,\lambda}$ and $C_{l,\lambda}$ are copied to $P_{l',\lambda}$ and $C_{l',\lambda}$, respectively, where l' is the index of a decoding path that will be killed. Once the copying is finished, the partial sum computation for the decoding path l carries on with decoded code bit $\hat{u}_i = 0$, while the partial sum computation for the decoding path l' carries on with decoded code bit $\hat{u}_i = 1$.
3) Reserved: The partial sum computation for the decoding path l carries on with the decoded code bit of the decoding path l.
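The three outcomes above can be illustrated with a short behavioral sketch in Python. It assumes that all L paths are active, that a larger log-domain metric is better, and that survivors are chosen with a plain software sort; the function name `prune_paths` and the returned dictionary layout are hypothetical and do not describe the hardware control signals.

```python
def prune_paths(metrics):
    """Classify each of the L paths after splitting.

    metrics[l] = (metric if u_i = 0, metric if u_i = 1) for path l.
    Returns {l: ("deleted", None) | ("reserved", bit) | ("copied", l_prime)}.
    """
    L = len(metrics)
    candidates = [(metrics[l][b], l, b) for l in range(L) for b in (0, 1)]
    survivors = sorted(candidates, reverse=True)[:L]  # L best of 2L candidates

    kept_bits = {l: [] for l in range(L)}
    for _, l, b in survivors:
        kept_bits[l].append(b)

    # Paths with no surviving candidate free their storage for copies.
    free_slots = iter(l for l in range(L) if not kept_bits[l])
    actions = {}
    for l in range(L):
        if len(kept_bits[l]) == 2:
            # Both extensions survive: l continues with bit 0, and the
            # overwritten slot l_prime continues with bit 1.
            actions[l] = ("copied", next(free_slots))
        elif len(kept_bits[l]) == 1:
            actions[l] = ("reserved", kept_bits[l][0])
        else:
            actions[l] = ("deleted", None)
    return actions
```

Because exactly L of the 2L candidates survive, the number of copied paths always equals the number of deleted paths, so every copy finds a free slot.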
The path pruning unit (PPU) in Fig. 1, which consists of the proposed maximum value filter (MVF) and control signal generator (CSG) modules, generates the control signals that guide the copying of LLMs and partial sums. The partial sums associated with the decoding path l will be overwritten with those from the decoding path $a_l$. When $l = a_l$, the path metric computation of the decoding path l will be performed with messages stored in the LLM registers associated with the decoding path l. The partial sum computation is performed based on the partial sums stored in $\mathrm{PSU}_l$. The copying of the LLMs happens within an LRA module shown in Fig. 1. The crossbar (CB) is a data multiplexing network, and its L data inputs and L data outputs are denoted as $CI_0$, $CI_1$
The CRC checksum computation unit $\mathrm{CRCC}_l$ can be implemented based on the well-known serial CRC generation circuit. The partial sum computation unit $\mathrm{PSU}_l$ can be implemented based on [3, Alg. 4]. The details of $\mathrm{CRCC}_l$ and $\mathrm{PSU}_l$ are omitted here due to limited space.
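As a reminder of how a serial CRC circuit behaves, the following Python sketch models a bit-serial linear feedback shift register that consumes one bit per step. The generator polynomial `0x04C11DB7`, the all-zero initial state, and the MSB-first bit order are assumptions made for illustration (the text only states that a CRC-32 scheme is used), and the function names are hypothetical.

```python
CRC32_POLY = 0x04C11DB7  # assumed generator (top term implicit)

def serial_crc(bits, poly=CRC32_POLY, width=32):
    """Bit-serial CRC: one register shift per input bit, as in an LFSR."""
    reg = 0
    mask = (1 << width) - 1
    for b in bits:
        feedback = ((reg >> (width - 1)) & 1) ^ b
        reg = (reg << 1) & mask
        if feedback:
            reg ^= poly
    return reg

def checksum_bits(info_bits, poly=CRC32_POLY, width=32):
    """Checksum of the information bits, most significant bit first."""
    crc = serial_crc(info_bits, poly, width)
    return [(crc >> k) & 1 for k in range(width - 1, -1, -1)]

def passes_crc(bits, poly=CRC32_POLY, width=32):
    """Information bits followed by their checksum leave a zero register."""
    return serial_crc(bits, poly, width) == 0
```

A decoded path would be accepted when `passes_crc` holds for its information bits followed by the checksum bits; a single flipped bit leaves a nonzero register.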
If $T < 2^{n-j} L$, similar to the approach in [2], the LLM registers within $\mathrm{LRA}_j$ will be updated in a partial-parallel way. During each clock cycle, only T LLM registers are updated. The T switch networks (SW) in Fig. 1
where P = T/L is a power of 2 and R is the code rate.
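The partial-parallel update of an LRA can be quantified with a small calculation. The sketch below assumes that $\mathrm{LRA}_j$ holds $2^{n-j} L$ LLM registers (read off the condition $T < 2^{n-j} L$) and that T of them are updated per clock cycle; it only counts cycles for a single array update and is not the overall decoding latency. The function name `update_cycles` is hypothetical.

```python
def update_cycles(n, j, L, T):
    """Clock cycles needed to update all LLM registers of LRA_j once."""
    registers = (1 << (n - j)) * L   # assumed size of LRA_j
    return -(-registers // T)        # ceiling division

# Example with N = 1024 (n = 10) and T = 512 PUs.
cycles_l4 = update_cycles(10, 1, 4, 512)
cycles_l8 = update_cycles(10, 1, 8, 512)
```

For arrays with at most T registers, the update finishes in a single cycle, so only the first few arrays are affected by the partial-parallel schedule.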
B. Maximum Values Filter
Taking list size L = 8 as an example, the proposed MVF shown in Fig. 2 consists of a bitonic sequence generator [8] (BSG) and a stage of compare-and-select (CAS) modules. Since there are at most 2L = 16 split decoding paths and thus 16 path metrics, the BSG in Fig. 2 needs 16 inputs and outputs to sort the input path metrics in the bitonic format [8]. Each input and output of the BSG has three parts: a path metric, the corresponding path index and the corresponding decoded bit value. The increasing order sorter (IS) shown in Fig. 2 sorts its inputs so that the path metric of its first output is no greater than that of the second. The decreasing order sorter (DS) shown in Fig. 2 sorts its inputs so that the path metric of its first output is no smaller than that of the second. Each CAS module compares the path metrics of its two inputs and selects the list index and bit value whose associated path metric is larger.
Let $LS_r$ ($r = 0, 1, \cdots, 15$) denote the path metric associated with the BSG output $S_r$ in Fig. 2. The path metrics of the 16 outputs form a bitonic sequence [8]. It is proved in [8] that the 8 maximum values among the $LS_r$'s are $\max(LS_r, LS_{8+r})$ for $r = 0, 1, \cdots, 7$.
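The selection rule can be checked with a behavioral Python sketch. It assumes a bitonic sequence whose first half is non-decreasing and whose second half is non-increasing, and it replaces the BSG sorting network by two library sorts, so it models only the input-output behavior of the MVF, not its hardware structure. The function name `max_value_filter` and the tuple layout are hypothetical.

```python
def max_value_filter(entries):
    """Select the L entries with the largest metrics out of 2L.

    entries: list of 2L tuples (path_metric, path_index, decoded_bit).
    """
    half = len(entries) // 2
    # Stand-in for the BSG: ascending first half, descending second half.
    inc = sorted(entries[:half], key=lambda e: e[0])
    dec = sorted(entries[half:], key=lambda e: e[0], reverse=True)
    # CAS stage: compare output r with output r + L and keep the larger one.
    return [max(inc[r], dec[r], key=lambda e: e[0]) for r in range(half)]
```

Note that the selected entries are the L largest but are not themselves sorted, which is why a single CAS stage suffices after the BSG instead of a full sort.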
As a result, based on the outputs of the BSG, the outputs of the CAS modules are the decoding path index set $I_L$ and the corresponding decoded code bit set $B_L$. As shown in Fig. 2,
C. Control Signal Generator
The control signal generator (CSG) module generates L list indices $a_0$, $a_1$
D. Fixed-Point Implementation of the CA-SCL Algorithm
For the fixed-point implementation of the CA-SCL decoding algorithm, a non-uniform quantization method should be employed. Suppose each channel LLM in $P_{l,0}$ is quantized to t bits; then each LLM in $P_{l,\lambda}$ should be quantized with $t + \lambda$ bits ($1 \le \lambda < n$). Otherwise, all LLMs in $P_{l,n}$ are saturated and thus the list decoder is unable to pick the L most reliable decoding paths. As shown in Fig. 3, for a rate-
In Fig. 3, CS2-max denotes the floating-point CA-SCL algorithm with L = 2 and the F computation in (1). CS2-map is the floating-point L = 2 CA-SCL algorithm employing the max*(x, y) function. CSi-max-j denotes the fixed-point CA-SCL algorithm with L = i and a non-uniform quantization scheme where each channel LLM is quantized to t = j bits. Si-max-j denotes the fixed-point SCL decoding algorithm with L = i and a non-uniform quantization scheme with t = j. For all CA-SCL algorithms, a CRC-32 scheme is used. Several conclusions can be drawn from the simulation results:
1) For the fixed-point CA-SCL algorithm, each channel LLM should be quantized with at least 4 bits to ensure good decoding performance.
2) A better decoding performance can be achieved with a larger list size for L > 2. Hence, it is desirable to design list decoder architectures suitable for large list sizes.
Footnote 1: Since 32 information bits are replaced with the CRC checksum, there are only 480 information bits for the simulated rate
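The need for the non-uniform quantization with $t + \lambda$ bits can be seen in a small worst-case experiment. The Python sketch below assumes signed two's-complement LLMs and that an LLM at stage $\lambda$ is the sum of two LLMs from stage $\lambda - 1$, so its magnitude can double per stage; the function names `saturate` and `worst_case_llm` are hypothetical.

```python
def saturate(value, bits):
    """Clamp value to the signed two's-complement range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def worst_case_llm(t, stages, grow):
    """Propagate the most negative t-bit channel LLM through the stages.

    grow=True uses t + lam bits at stage lam (non-uniform quantization);
    grow=False keeps t bits at every stage (uniform quantization).
    """
    v = -(1 << (t - 1))
    for lam in range(1, stages + 1):
        bits = t + lam if grow else t
        v = saturate(v + v, bits)  # sum of two equal worst-case LLMs
    return v

non_uniform = worst_case_llm(4, 3, grow=True)   # keeps growing with lam
uniform = worst_case_llm(4, 3, grow=False)      # stuck at the 4-bit floor
```

Under uniform quantization every strongly negative metric collapses to the same saturated value, so the metrics of different paths become indistinguishable, which is the failure mode described in the text.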
IV. IMPLEMENTATION RESULTS
In this paper, three list decoders using T = 512 PUs are implemented for L = 2, 4 and 8, respectively, for a polar code with N = 1024. Each channel LLM is quantized to 3 bits for the decoders with L = 2 and 4 in order to make fair comparisons with the decoders in [7]. Each channel LLM of the list decoder with L = 8 is quantized to 4 bits. These decoders are synthesized in a 90 nm CMOS technology. For our list decoders with L = 2, 4 and 8, one, two and three stages of pipeline registers are used, respectively. The implementation results are shown in Table I. Compared with the list decoders with the same list sizes in [7], the proposed list decoders for list sizes L = 2 and 4 achieve 1.42 and 2.84 times higher hardware efficiency, respectively, where hardware efficiency denotes throughput normalized by the corresponding area. Besides, the proposed list decoders achieve better FER performance. The proposed decoders also achieve higher clock frequencies than the decoders in [7], since the proposed MVF module has a shorter latency than the metric sorter in [7].
The implementation with L = 8 demonstrates that our architecture works for large list sizes, which lead to further decoding performance improvement. There are no implementation results of a list decoder of list size 8 in the literature. 
