Long polar codes can achieve the symmetric capacity of arbitrary binary-input discrete memoryless channels under a low-complexity successive cancelation (SC) decoding algorithm. However, for polar codes with short and moderate code lengths, the decoding performance of the SC algorithm is inferior. The cyclic-redundancy-check (CRC)-aided SC-list (SCL)-decoding algorithm has better error performance than the SC algorithm for short or moderate polar codes. In this paper, we propose an efficient list decoder architecture for the CRC-aided SCL algorithm, based on both algorithmic reformulations and architectural techniques. In particular, an area efficient message memory architecture is proposed to reduce the area of the proposed decoder architecture. An efficient path pruning unit suitable for large list size is also proposed. For a polar code of length 1024 and rate 1/2, when list size L = 2 and 4, the proposed list decoder architecture is implemented under a Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS technology. Compared with the list decoders in the literature, our decoder achieves 1.24-1.83 times the area efficiency.
I. INTRODUCTION
POLAR codes, recently introduced by Arıkan [1], are a significant breakthrough in coding theory. It is proved that polar codes can achieve the channel capacity of any discrete or continuous memoryless channel [1], [2]. Polar codes can be efficiently decoded by the low-complexity successive cancelation (SC) decoding algorithm [1] with a complexity of $O(N \log N)$, where $N$ is the block length. To approach the channel capacity using the SC algorithm, polar codes require a very large code block length (for example, $N > 2^{20}$ [3]), which is impractical in many applications. For short or moderate lengths, the error performance of polar codes under the SC algorithm is worse than that of Turbo or low-density parity-check codes [4].
Considerable effort [4]-[11] has already been devoted to improving the error-correction performance of polar codes with short or moderate lengths. A successive cancelation list (SCL) decoding algorithm was proposed recently in [4], which performs better than the SC algorithm and almost the same as a maximum-likelihood decoder [4]. In [5]-[7], the cyclic redundancy check (CRC) is used to pick the output codeword from L candidates, where L is the list size. The CRC-aided SCL (CA-SCL) algorithm performs much better than the SCL algorithm at the expense of a negligible loss in code rate.
In terms of hardware implementations of the SC algorithm, an efficient semiparallel SC decoder was proposed in [3], where resource sharing and semiparallel processing were used to reduce the hardware complexity. An overlapped computation method and a precomputation method were proposed in [12] to improve the throughput and to reduce the decoding latency of SC decoders. Compared with the semiparallel decoder architecture in [3], the precomputation-based decoder architecture [12] can double the throughput. A simplified SC decoder for polar codes, proposed in [13], reduces the decoding latency by more than 88% for a rate 0.7 polar code with length $2^{18}$.
The investigation of efficient list decoder architectures for polar codes is motivated by improved error performance of the SCL and CA-SCL algorithms, especially for polar codes with short or moderate lengths. The tree search list decoder architecture for the SCL algorithm proposed in [14] is the first list decoder architecture for polar codes in the literature to the best of our knowledge. In this paper, we propose the first hardware implementation of the CA-SCL algorithm to the best of our knowledge. Based on both algorithmic and architectural improvements, our decoder architecture achieves better error performance and higher area efficiency compared with the decoder architecture in [14] . Specifically, the major contributions of this paper are as follows.
1) Message memories account for a significant fraction of the area of an SC or SCL decoder [3], [14]. In this paper, an area efficient message memory architecture is proposed. Besides, a new compression method for the channel messages is used to reduce the area of the proposed decoder architecture.
2) An efficient processing unit (PU) is proposed. For the proposed list decoder architecture, a fine grained PU profiling (FPP) algorithm is proposed to determine the minimum quantization size of each input message for each PU so that there is no message overflow. Using the quantization size generated by the FPP algorithm for each PU, the overall area of all PUs is reduced.
3) An efficient scalable path pruning unit (PPU) is proposed to control the copying of decoding paths. Based on the proposed memory architecture and the scalable PPU, our list decoder architecture is suitable for large list sizes.
4) A low-complexity direct selection (DS) scheme is proposed for the CA-SCL algorithm when a strong CRC is used (e.g., CRC32). The proposed DS scheme simplifies the selection of the final output data word.
5) For a (1024, 512) rate-1/2 polar code, the proposed list decoder architecture is implemented for list sizes L = 2 and 4, respectively, under a 90-nm CMOS technology. Compared with the decoder architecture in [14] synthesized under the same technology, our decoder achieves 1.24-1.83 times the area efficiency (throughput normalized by area). Besides, the proposed CA-SCL decoder has better error performance compared with the SCL decoder in [14].
The rest of this paper is organized as follows. In Section II, polar codes as well as the SCL and CA-SCL algorithms are briefly reviewed. Two improvements of the CA-SCL algorithm are discussed in Section III. The proposed list decoder architecture is described in Section IV. Section V shows the implementation and comparison results of the proposed list decoder architecture. The conclusion is drawn in Section VI.
II. POLAR CODES AND ITS CA-SCL ALGORITHM

A. Polar Codes
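A generator matrix of a polar code is an $N \times N$ matrix $G = B_N F^{\otimes n}$, where $N = 2^n$, $B_N$ is the bit reversal permutation matrix [1], and $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Here $\otimes n$ denotes the $n$th Kronecker power. The indices of the data bits $u_0^{N-1} = (u_0, u_1, \ldots, u_{N-1})$ are divided into two sets: the information bits set $\mathcal{A}$ contains $K$ indices and the frozen bits set $\mathcal{A}^c$ contains $N - K$ indices.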
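As an illustration of the definition $G = B_N F^{\otimes n}$, the following minimal Python sketch builds the generator matrix for small $n$ and encodes a data word over GF(2). The function names (`kron`, `bit_reverse`, `generator_matrix`, `encode`) are illustrative choices and not part of the paper; the sketch is intended only for small block lengths.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def generator_matrix(n):
    """Return G = B_N F^{(x)n} for N = 2**n as a list of rows over GF(2)."""
    F = [[1, 0], [1, 1]]
    Fn = [[1]]
    for _ in range(n):
        Fn = kron(Fn, F)
    # Left-multiplying by the bit reversal permutation B_N reorders the rows.
    return [Fn[bit_reverse(i, n)] for i in range(1 << n)]


def encode(u, G):
    """Compute the codeword x = u G over GF(2)."""
    N = len(G)
    return [sum(u[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]
```

For example, `generator_matrix(2)` gives the 4 × 4 matrix for N = 4, and `encode(u, G)` maps a length-N data word (information and frozen bits) to a codeword.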
B. SCL and CA-SCL Algorithms
List decoding was applied to the SC algorithm in [4] and the resulting SCL algorithm outperforms the SC algorithm. For a list size L, the SCL algorithm keeps at most L decoding paths and outputs L possible data words $\hat{u}_{l,0}^{N-1} = (\hat{u}_{l,0}, \hat{u}_{l,1}, \ldots, \hat{u}_{l,N-1})$ for $l = 0, 1, \ldots, L-1$. A low-complexity state copying scheme was proposed in [14] to simplify the copying process when a decoding path needs to be duplicated.
For $l = 0, 1, \ldots, L-1$ and $\lambda = 0, 1, \ldots, n$, let $P_{l,\lambda}$ be an array with $2^{n-\lambda}$ elements: $P_{l,\lambda}[j]$ contains two messages $P_{l,\lambda}[j][0]$ and $P_{l,\lambda}[j][1]$ for $j = 0, 1, \ldots, 2^{n-\lambda}-1$. $C_{l,\lambda}$ has the same structure as $P_{l,\lambda}$: $C_{l,\lambda}[j]$ contains two binary partial sums $C_{l,\lambda}[j][0]$ and $C_{l,\lambda}[j][1]$ for $j = 0, 1, \ldots, 2^{n-\lambda}-1$. The SCL algorithm with low complexity state copying [14] is reformulated in Algorithm 1. For the decoding of $u_i$, the SCL algorithm can be divided into the following parts.
1) For each surviving decoding path l, compute the path metrics P l,n [0][0] and P l,n [0] [1] using the recursive function metricComp(l, i ) shown in Algorithm 2.
Based on the recursive algorithm for computing path metric in [4] and the low complexity state copying algorithm in [14] , the path metric computation is formulated in a nonrecursive way in Algorithm 2, where r l = (r l [n − 1], r l [n − 2], . . . , r l [0]) is the message updating reference index array for decoding path l. For decoding path l, r l [0] ≡ 0, while all other elements are initialized with 0. Two types of basic operations, denoted as F and G operations, respectively, are employed in Algorithm 2.
2) If $u_i$ is a frozen bit, for each decoding path $l$, the decoded code bit is $\hat{u}_{l,i} = 0$, and decoding path $l$ carries on with $\hat{u}_{l,i} = 0$. If $u_i$ is an information bit, decoding path $l$ ($l = 0, 1, \ldots, L-1$) splits into two decoding paths with corresponding path metrics $P_{l,n}[0][0]$ and $P_{l,n}[0][1]$, respectively. There are at most 2L paths after splitting, and 2L associated path metrics. The pathPruning function in Algorithm 1 finds the L most reliable decoding paths based on their corresponding path metrics.
3) For each of the L surviving decoding paths, the pUpdate(l, n, i) function shown in Algorithm 3 [4] updates the partial sum matrices that will be used in the following path metric computation.
We make the following observations about the path metric computation.
1) When $i = 0$, $P_{l,1}, \ldots, P_{l,n}$ are updated serially, and only the F computation is employed.
2) For $i > 0$, $P_{l,\phi(i)}, \ldots, P_{l,n}$ are updated serially. The G computation is used when computing $P_{l,\phi(i)}$, while the F computation is used for the other probability message arrays.
3) The computation of $P_{l,\phi(i)}$ uses the G operation and therefore depends on the partial sums as well as on the messages of the preceding array, while the computation of $P_{l,\lambda}$ ($\lambda > \phi(i)$) is based on $P_{l,\lambda-1}$.
The path pruning function, pathPruning, finds the L most reliable paths, $a_0, a_1, \ldots, a_{L-1}$, and their corresponding decoded bits, $c_0, c_1, \ldots, c_{L-1}$, based on the path metrics. The path metrics of the surviving L decoding paths are the L largest ones among the 2L input metrics. Once the surviving decoding paths are found, the partial sums and the reference index array of decoding path $l$ are copied from decoding path $a_l$. The partial sum computation of decoding path $l$ is carried on with the binary input $c_l$.
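The selection rule described above can be sketched in a few lines of Python. This is a behavioral model only, assuming that all L paths are active and that each path supplies its two metrics as a pair; the name `path_pruning` and the data layout are illustrative assumptions and do not reproduce Algorithm 1.

```python
def path_pruning(metrics):
    """Keep the L largest of 2L path metrics.

    metrics[l] is the pair (P_{l,n}[0][0], P_{l,n}[0][1]) of current path l.
    Returns (a, c): a[k] is the parent path index and c[k] the decoded bit
    of the k-th surviving path.
    """
    L = len(metrics)
    # Expand every path into its two children: (metric, parent index, bit).
    candidates = [(metrics[l][b], l, b) for l in range(L) for b in (0, 1)]
    # Larger metric means more reliable, so sort in decreasing order.
    candidates.sort(key=lambda cand: cand[0], reverse=True)
    survivors = candidates[:L]
    a = [parent for _, parent, _ in survivors]
    c = [bit for _, _, bit in survivors]
    return a, c
```

A full sort is used here for clarity; the hardware only needs the set of the L largest metrics, not their order.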
The pruning scheme in this paper and the path pruning scheme in [11] both try to eliminate decoding paths that are less reliable. However, there are still some differences such as the following.
1) The pruning scheme in [11] is used for the SC stack (SCS) decoding algorithm as well as the successive cancelation hybrid decoding algorithm, which is a hybrid of the SCL and SCS decoding algorithms, whereas our pruning scheme is used for the SCL algorithm.
2) For the SCL algorithm, suppose there are L decoding paths before the decoding of $u_i$; then the metrics of 2L expanded decoding paths are computed. The pruning scheme in this paper finds the L largest metrics out of the 2L metrics and keeps their corresponding decoding paths. For the pruning scheme in [11], a path is deleted if its path metric is smaller than a dynamic threshold, $a_i - \ln(\tau)$, where $a_i$ is the largest metric of the candidate paths and $\tau$ is a configuration parameter.
3) For the path pruning scheme in [11], the number of deleted paths is not fixed and depends on the configuration parameter $\tau$, while the number of deleted paths is always L for the pruning scheme in this paper.
The F and G operations in Algorithm 2 are in the probability domain. They can also be performed in the logarithm domain [6]. For $u \in \{0, 1\}$, the resulting logarithm domain G and F computations are shown in (1) and (2), respectively, where $\max^*(x, y) = \max(x, y) + \log(1 + e^{-|x-y|})$. $\max^*(x, y)$ can also be approximated with $\max(x, y)$, resulting in the approximated F computation in (3).
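To make the $\max^*$ operator and its approximation concrete, the following Python sketch evaluates both. It covers only the operator itself, not the full F and G computations in (1)-(3); the function names are illustrative.

```python
import math


def max_star(x, y):
    """Exact max*(x, y) = max(x, y) + log(1 + exp(-|x - y|)) = log(e^x + e^y)."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_approx(x, y):
    """Approximation that drops the correction term."""
    return max(x, y)
```

The dropped correction term lies between 0 and $\log 2$ and shrinks quickly as $|x - y|$ grows, which is consistent with the small loss reported for the approximated F computation.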
In [5] , the performance of the SCL algorithm is further improved by the adoption of CRC, which helps to pick the right path from the L possible decoded data words. In terms of the fixed point implementation, the CA-SCL algorithm is quite sensitive to saturation. For two decoding paths, it is hard to decide which is better if the metrics of both paths are saturated. To avoid message saturation, a nonuniform quantization scheme is proposed in [14] . If the channel messages (P l,0 ) are all quantized with t bits, all the log-likelihood messages (LLMs) of P l,λ need to be quantized with t + λ bits to avoid saturation.
III. TWO IMPROVEMENTS OF THE CA-SCL ALGORITHM
In this paper, two improvements of the CA-SCL algorithm are proposed. First, for the $i$th received bit $y_i$, there are two likelihoods, $\Pr\{y_i|0\}$ and $\Pr\{y_i|1\}$. Suppose $\Pr\{y_i|m\}$ ($m \in \{0, 1\}$) is the smaller one among the two likelihoods. For $j \in \{0, 1\}$, two LLMs are defined as
$$\mathrm{LLM}_i[j] = \log \frac{\Pr\{y_i|j\}}{\Pr\{y_i|m\}}.$$
Thus, one of the LLMs is always zero, and the other is always nonnegative. For the proposed list decoder, only the nonnegative LLM and its corresponding binary index s are stored. As shown in Fig. 1, Msg denotes the stored nonnegative LLM, and its corresponding bit index is s. When s = 0, the LLM for bit 0 equals Msg and the LLM for bit 1 is zero; when s = 1, the roles are reversed.
If t bits are needed to quantize a channel LLM, it takes t + 1 bits to represent the two LLMs corresponding to a received bit $y_i$, while it takes 2t bits to store the two LLMs directly.
Second, at the end of the CA-SCL decoding algorithm, the candidate data word that passes the CRC and has the greatest path metric is the output data word, which incurs additional comparisons. In this paper, a simple DS scheme is proposed: we first calculate all L checksums in parallel and then scan from the checksum of data word zero to the checksum of data word L − 1; if a data word passes the CRC, the scan is terminated and the corresponding candidate data word is the final output. When all L CRC checks fail, since the CRC checksum itself could be corrupted, a decoding failure is announced if retransmission is possible; otherwise, a data word is picked at random and output.
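The channel message compression described in this section can be sketched as follows. The sketch assumes the two inputs are log-likelihoods of the received bit under hypotheses 0 and 1 and that normalization subtracts the smaller one; `compress` and `decompress` are illustrative names, and the decompression step stands in for the role the paper assigns to the deComp unit.

```python
def compress(loglik0, loglik1):
    """Normalize by the smaller log-likelihood and keep one value plus an index.

    Returns (s, msg): msg >= 0 is the only nonzero LLM and s is the bit it
    belongs to.
    """
    if loglik0 >= loglik1:
        return 0, loglik0 - loglik1
    return 1, loglik1 - loglik0


def decompress(s, msg):
    """Recover the LLM pair (LLM for bit 0, LLM for bit 1)."""
    return (msg, 0) if s == 0 else (0, msg)


def storage_bits(t):
    """Bits per received symbol with and without compression."""
    return {"compressed": t + 1, "direct": 2 * t}
```

With t = 4, for instance, each received bit needs 5 stored bits instead of 8.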
The DS scheme reduces computational complexity at the expense of possible performance degradation. In this paper, we give an estimation of the frame error rate (FER) degradation. Let w denote the number of the detectable errors for our CRC. Assume all the bits of the final L candidate data words are independently subject to a bit error probability, p b . We calculate the increase in FER, P e , caused by the DS scheme instead of the ideal selection (IS) scheme, which always selects the transmitted data word if it is within the final L candidates. For each candidate data word, there are three probabilities.
1) The probability that the candidate data word is the same as the transmitted one is given by $p_1 = (1 - p_b)^K$.
2) The probability that the candidate fails the CRC is denoted as $p_2$.
3) The probability that the CRC identifies the candidate as the transmitted data word by mistake is denoted as $p_3$, and $p_3 = 1 - p_1 - p_2$.
Note that $p_b$ depends on the signal-to-noise ratio (SNR) and the list size L. For a specific SNR, to simplify our analysis, we can use $p_{b,\mathrm{SC}}$ to approximate $p_b$, where $p_{b,\mathrm{SC}}$ denotes the bit error probability of the SC algorithm. The probabilities $p_2$ and $p_3$ are also approximated. Though approximated probabilities are employed when calculating $P_e$, the order of magnitude of $P_e$ still helps us determine whether our DS scheme is applicable. For instance, when a strong CRC is used, i.e., large w, $p_3$ is small, leading to a small $P_e$. On the other hand, a higher data rate leads to a greater K and hence a greater $P_e$.
A. Numerical Results
For a rate 1/2 polar code with N = 1024, the FERs of the SC, SCL, and CA-SCL algorithms are shown in Fig. 2 , where SC denotes the floating-point SC algorithm. CS2-max and CS2-map denote the floating-point CA-SCL algorithm with L = 2 and the approximated F computation shown in (3) and the F computation shown in (2), respectively. CSi -max-j denotes the fixed-point CA-SCL algorithm with L = i and nonuniform quantization scheme with t = j , where t is the number of quantization bits for channel probability message. Si -max-j denotes the fixed-point SCL algorithm with L = i and nonuniform quantization scheme with t = j . For all simulated CA-SCL algorithms, a CRC scheme with a generator polynomial 0x1EDC6F41 is employed, and the DS scheme is employed to pick the final output codeword from L possible candidates.
The simulated results show the following.
1) For the CA-SCL algorithm, the approximated F computation in (3) results in negligible performance degradation.
2) When each channel LLM is quantized with 4 bits, the employment of the nonuniform quantization scheme leads to negligible performance degradation. When each channel LLM is quantized with 3 bits, the resulting FER performance is roughly 0.2 dB worse than that using 4-bit quantization.
3) Using a larger list size leads to obvious performance improvement for the CA-SCL algorithm, whereas the SCL algorithm with L = 2, 4 has nearly the same performance, especially in the high SNR region. For polar codes with moderate block length (e.g., $N = 2^{11}, 2^{12}, 2^{13}$), similar phenomena have been observed in [5].
More simulation results on the proposed DS scheme are provided. There are three selection schemes employed in our simulations.
1) The proposed DS scheme, which outputs the first data word that passes the CRC.
2) The IS scheme, which always outputs the correct data word if it exists in the final list.
3) The metric based selection (MS) scheme [5], which outputs the data word that has the maximal path metric among all data words that have passed the CRC.
In Figs. 3 and 4, DSk, ISk, and MSk denote the CA-SCL algorithms with list size L = k under the DS, IS, and MS schemes, respectively. The generator polynomial of the CRC16 used in our simulations is 0x1021.
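The three selection schemes compared above differ only in how the output is chosen from the final list, which the following Python sketch makes explicit. The function names and the `passes_crc` callback are illustrative assumptions; returning `None` models the case in which no candidate qualifies, and the fallback behavior (announcing failure or picking a word at random) is left to the caller.

```python
def direct_selection(candidates, passes_crc):
    """DS: scan from data word 0 upward and stop at the first CRC pass."""
    for word in candidates:
        if passes_crc(word):
            return word
    return None


def metric_selection(candidates, metrics, passes_crc):
    """MS: among CRC-passing words, return the one with the largest metric."""
    passing = [(m, k) for k, (w, m) in enumerate(zip(candidates, metrics))
               if passes_crc(w)]
    if not passing:
        return None
    _, best = max(passing, key=lambda pair: pair[0])
    return candidates[best]


def ideal_selection(candidates, transmitted):
    """IS: genie-aided reference that outputs the transmitted word if listed."""
    return transmitted if transmitted in candidates else None
```

DS needs no metric comparison at all, which is the source of its lower complexity; MS additionally compares the metrics of all CRC-passing candidates.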
As shown in Fig. 3, when the code rate is 0.75 and CRC16 is used, the proposed DS scheme introduces an early error floor for all simulated list sizes, while the MS scheme performs nearly the same as the IS scheme. When the code rate is 0.5, as shown in Fig. 4, the DS scheme performs nearly the same as the IS scheme with list size L = 2. When list size L = 4, 8, 16, the proposed DS scheme shows some performance degradation compared with the IS scheme, while the MS scheme has little performance degradation. When CRC32 is used, the proposed DS scheme performs nearly the same as the IS scheme for both code rates 0.5 and 0.75 [15].
We also calculate the bound on the FER degradation for all simulated cases. We choose SNR = 3.6 dB, since DS4, DS8, and DS16 begin to show an error floor at this SNR in Fig. 3. For the length 1024 polar code, the bit error probability $p_b$ of the SC algorithm is $6.28 \times 10^{-4}$ and $3.04 \times 10^{-6}$ for rates 0.75 and 0.5, respectively. For CRC16 and CRC32, w = 2 [16] and 4 [17], respectively. When CRC16 is used, for each simulated list size, the bound is around $10^{-2}$ and $10^{-10}$ for rates 0.75 and 0.5, respectively. When CRC32 is used, for each simulated list size, the bound is $10^{-4}$ and $10^{-17}$ for rates 0.75 and 0.5, respectively. It is found that the error degradation caused by our DS scheme is large when the corresponding $P_e$ is large (e.g., $10^{-2}$). On the other hand, when $P_e$ is quite small (e.g., $10^{-17}$), our DS scheme leads to little performance degradation.
Based on our calculation results, for a given CRC and code rate, P e increases with the list size L. This observation indicates that the potential performance degradation caused by the DS scheme will increase when L increases. This is consistent with the simulation results shown in Figs. 3 and 4 .
IV. EFFICIENT LIST DECODER ARCHITECTURE
For the CA-SCL algorithm, we propose an efficient partial parallel list decoder architecture shown in Fig. 5 . The proposed list decoder architecture mainly consists of the channel message memory (C-MEM), the internal LLM memory (L-MEM), L PU arrays (PUAs) (PUA 0 , PUA 1 , . . . , PUA L−1 ), the PPU and the CRC checksum unit. These components are described in detail in the following sections.
A. Message Memory Architecture
The L-MEM stores all the inner LLMs used for metric computation. Since all the LLMs in $P_{l,\lambda}$ need to be quantized with $t + \lambda$ bits for $\lambda \ge 1$, the variable-size LLMs make the L-MEM architecture for the proposed list decoder nontrivial. In this paper, an area efficient scalable memory architecture for L-MEM is proposed based on the nonuniform quantization, with the following properties.
1) For $\lambda = 1, 2, \ldots, n$, since each LLM within $P_\lambda = (P_{0,\lambda}, P_{1,\lambda}, \ldots, P_{L-1,\lambda})$ is quantized with $t + \lambda$ bits, a regular submemory is created for storing the LLMs in $P_\lambda$.
2) All n submemories are combined into a single memory.
3) Due to the nonuniform quantization, the width of each submemory may be different. As a result, the concatenated L-MEM is an irregular memory with varying width within its address space. For the proposed memory architecture, the irregular L-MEM is split into several regular memories to fit current memory generation tools.
The proposed L-MEM is a mix of different types of memories, including static RAM (SRAM), register files (RFs), and registers. Since SRAM and RF are more area efficient than registers, the proposed L-MEM architecture is better than the register-based L-MEM in [14]. Suppose there are T PUs in each PUA shown in Fig. 5; the L PUAs then consume at most 4LT LLMs in one round of computation. For $\lambda = 1, 2, \ldots, n$, we store all the LLMs within $P_\lambda = (P_{0,\lambda}, P_{1,\lambda}, \ldots, P_{L-1,\lambda})$ in a single memory as follows.
1) When $2^{n-\lambda+1}L > 4LT$, it takes a submemory of $2^{n-\lambda-1}/T$ words, where each word has $4LT(t + \lambda)$ bits.
2) When $2^{n-\lambda+1}L \le 4LT$, it takes a submemory with only a single word, which has $2^{n-\lambda+1}(t + \lambda)L$ bits.
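The two storage rules above can be tabulated with a short Python sketch. It assumes T is a power of two and models only the logical submemories (one per $\lambda$), not the later split into regular memory instances; the function name `lmem_layout` is illustrative.

```python
def lmem_layout(n, T, t, L):
    """Return a list of (lambda, words, bits_per_word) for lambda = 1..n."""
    layout = []
    for lam in range(1, n + 1):
        llms = (1 << (n - lam + 1)) * L          # number of LLMs in P_lambda
        if llms > 4 * L * T:
            words = (1 << (n - lam - 1)) // T    # rule 1: several full words
            width = 4 * L * T * (t + lam)
        else:
            words = 1                            # rule 2: one (narrower) word
            width = llms * (t + lam)
        layout.append((lam, words, width))
    return layout
```

For n = 10, T = 8, t = 4, and L = 2, the first submemory has 32 words of 320 bits, and every submemory from $\lambda = 6$ onward collapses to a single word.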
An example of the concatenation of n = 6 submemories, $(S_1, S_2, \ldots, S_6)$, is shown in Fig. 6(a). With current memory compilers, it is hard to generate an irregular single memory instance such as the one in Fig. 6(a). For the proposed L-MEM architecture, the concatenated irregular memory is split into several regular memory instances, as shown in Fig. 6(b), where additional dummy memories are added so that each instance is regular. In general, the irregular memory is divided into $\lambda_o = n - \log_2 T - 1$ regular instances. Depending on the number of words, each memory instance could be implemented with SRAM, RF, or registers. Compared with the register-based L-MEM, the proposed L-MEM architecture is more area efficient for the following reasons.
1) Some submemory instances can be implemented with SRAM or RF, which is denser than register-based memory.
2) As shown in Fig. 6(b), most of the LLMs are stored in the largest memory instance $M_1$, which contains $N_w$ words, where each word has $4LT(t + 1)$ bits.
As shown in (6), $N_w$ is inversely related to the number of PUs, T, within a PUA. As a result, the area of the proposed L-MEM depends on T for a fixed block length $N = 2^n$ and t. Taking RF as an example, we show the comparison of area efficiency of RFs with different depths in Table I, where area per bit (APB) denotes the total area of a memory normalized by the total number of stored bits. The total areas shown in Table I are from a memory compiler associated with a 90-nm technology. As shown in Table I, an RF with a larger depth has a smaller APB. Hence, given the same number of bits, it takes a smaller area if those bits can be stored in an RF with a larger depth. For SRAM, the same phenomenon has been observed.
The C-MEM can be implemented with a simple regular memory with $N/(2T)$ words, where each word has $2T(t + 1)$ bits. Due to the compression of the channel message, each compressed channel message is decompressed into two LLMs by the deComp unit in Fig. 5 before being fed to the PUs. The deComp unit can be implemented with multiplexers.
B. Processing Unit Array
The G and approximated F computations shown in (1) and (3), respectively, are used in the metric computation. These two types of basic operation can be performed with a PU [14] , [15] . The hardware complexity of the proposed PU is determined by p, which is the width of an input LLM.
Due to the nonuniform quantization of the LLMs belonging to different message arrays, for each PU, the number of quantization bits, $p$, for each input LLM should be large enough so that no overflow will happen. According to the fixed point implementation of the CA-SCL algorithm, the quantization of $P_{l,n}$ ($l = 0, 1, \ldots, L-1$) needs the most binary bits, which is $t + n$. For each PUA, it is unnecessary to employ T PUs with $p + 1 = t + n$. In this paper, an FPP algorithm, shown in Algorithm 4, is proposed to decide $p$ for each PU. For the $j$th PU of PUA$_l$ ($l = 0, 1, \ldots, L-1$), each LLM input is quantized with $p[j]$ bits. The proposed FPP algorithm is based on the observation that only $2^{n-\lambda} < T$ PUs are needed when computing the updated $P_{l,\lambda}$ with $\lambda > \lambda_o$. Thus, in the proposed PUA$_l$, only PU$_{l,0}$, PU$_{l,1}$, ..., PU$_{l,2^{n-\lambda}-1}$ are enabled for the computing of $P_{l,\lambda}$. Based on the proposed FPP algorithm, each PUA can finish the metric computation without any overflow while consuming less area. As shown in Algorithm 4, the bit width of the LLM inputs of a PU is determined by n, T, and t. One example is shown in Table II, where n = 10, T = 8, and t = 4.
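The idea behind the profiling can be illustrated with the following Python sketch. It is not a transcription of Algorithm 4, which is not reproduced here; it is a reconstruction under two stated assumptions: the inputs used to compute $P_{l,\lambda}$ are $t + \lambda - 1$ bits wide, and at layer $\lambda$ only the first $\min(T, 2^{n-\lambda})$ PUs are enabled. The function name `pu_input_widths` is hypothetical, and the resulting widths may differ from those listed in Table II.

```python
def pu_input_widths(n, T, t):
    """Assumed per-PU input width p[j]: the widest input PU j ever processes."""
    p = [0] * T
    for lam in range(1, n + 1):
        enabled = min(T, 1 << (n - lam))   # PUs active when computing P_lambda
        in_width = t + lam - 1             # assumed width of the inputs
        for j in range(enabled):
            p[j] = max(p[j], in_width)
    return p
```

Under these assumptions, only PU 0 needs the full $t + n - 1$ input bits, and the required width drops for PUs with larger indices, which is where the area saving comes from.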
The area saving due to the proposed fine grained profiling algorithm also depends on T, n, and t. For the proposed list decoder architecture, there are L identical PU arrays, where each array contains T PUs. In Table III, we compare the area of a regular PU array with that of an array where the input message width of each PU is determined by the fine grained profiling algorithm. As shown in Table III, the area of PU arrays is reduced by 30%-55% depending on the number of PUs within an array and the block length $N = 2^n$. Here, each channel message is quantized with t = 4 bits.
1) Metric Computation Schedule: For the proposed L-MEM, each data word is capable of storing 4T L LLMs. Moreover, each word is equally divided into L consecutive parts, where the lth part stores the LLMs corresponding to decoding path l. The metric computation schedule is almost the same as that of the partial parallel SC decoder in [3] except that L PUAs work simultaneously for L decoding paths, respectively.
When a data word needs to be updated, the write mismatch would happen since L PUAs generate only 2LT updated LLMs during one clock cycle. These L PUAs need to read two consecutive data words from L-MEM to generate 4T L LLMs. For the proposed list decoder architecture, as shown in Fig. 5 , L write buffers are employed to store half of 4T L LLMs generated by L PUAs. Once the remaining LLMs are computed, the output selection (OSel) module formats these LLMs in the way that these LLMs are stored in the L-MEM.
Since all the LLMs belonging to P λ = (P 0,λ , P 1,λ , . . . , P L−1,λ ) with λ > λ o are stored in a single data word in L-MEM and the computing of LLMs belonging to P λ+1 can only take place once P λ are updated, an additional clock cycle is needed to read out the LLMs within P λ that have been just written into the L-MEM. This will increase the delay and decrease the throughput of the proposed list decoder. As shown in [3] , the bypass buffer, register buffer, is used to temporarily store the messages written into the L-MEM and eliminate the extra read cycle.
C. Path Pruning Unit
For the CA-SCL algorithm, once the path metric computation of decoding step i is finished, each current decoding path splits into two sub decoding paths. However, the list decoder keeps at most L decoding paths. For the proposed list decoder architecture, a PPU is proposed to prune the split decoding paths in an efficient way. As shown in Fig. 5 , the proposed PPU contains two sub modules, the maximum value filter (MVF) and the crossbar control signal generator (CCG). The MVF generates L path indices a 0 , a 1 , . . . , a L−1 and L associated decoded bits c 0 , c 1 , . . . , c L−1 . For a current decoding path l, both the path metric and partial sum computations will be based on the LLMs and partial sums within decoding path a l , and the decoded data bit for u l,i is c l . a l and c l for l = 0, 1, . . . , L − 1 are used to control the copying of partial sums and checksums.
1) Maximum Values Filter: Taking list size L = 8 as an example, the proposed MVF architecture, shown in Fig. 7, consists of a bitonic sequence generator (BSG) and a stage of compare and select (CAS) modules. The BSG has 16 inputs $(D_0, D_1, \ldots, D_{15})$ and 16 outputs $(S_0, S_1, \ldots, S_{15})$. Each input or output consists of three parts: the path metric, the associated list index, and the decoded bit. The width of each input and output is $z = x_1 + x_2 + 1$, where $x_1 = t + n$ is the number of bits used to quantize a path metric and $x_2 = \log_2 L$ is the number of bits used to represent a list index.
Each stage of the BSG consists of L/2 increased-order sorters (ISs) and L/2 decreased-order sorters (DSs), which are shown in Fig. 8(a) and (b), respectively. Here IS and DS denote the sorters, not the selection schemes of Section III. Both the IS and the DS have two inputs and two outputs. For k = 0, 1, the inputs are $SI_k = (LR_k, l_k, b_k)$ and the outputs are $SO_k = (LR'_k, l'_k, b'_k)$, where each triple contains a path metric and its corresponding list index and decoded bit. The IS reorders the inputs such that the output path metrics satisfy $LR'_0 \le LR'_1$. The output of the comp-max module is 1 when $LR_0 > LR_1$. The DS reorders the inputs such that $LR'_0 \ge LR'_1$, and the output of the comp-min module is 1 when $LR_0 < LR_1$.
The BSG reorders the inputs based on the magnitude of path metrics. Let $LS_r$ ($r = 0, 1, \ldots, 15$) denote the associated path metric of output $S_r$. The path metrics of the 16 outputs form a bitonic sequence, that is, they satisfy
$$LS_0 \le LS_1 \le \cdots \le LS_7 \quad \text{and} \quad LS_8 \ge LS_9 \ge \cdots \ge LS_{15}. \tag{7}$$
TABLE III: AREA COMPARISON BETWEEN THE FINE-GRAINED PU ARRAY AND REGULAR PU USING TSMC 90-nm
It is proved in [18] that the eight maximum values among the $LS_r$'s are $\max(LS_r, LS_{8+r})$ for $r = 0, 1, \ldots, 7$. Hence, a stage of CAS modules is appended at the outputs of the BSG shown in Fig. 7, where CAS$_r$ takes $S_r$ and $S_{r+8}$ as inputs. This stage of CAS modules produces the outputs $O_l = (a_l, c_l)$ for $l = 0, 1, \ldots, L-1$. As shown in Fig. 8(c), the CAS module compares the path metrics of its two inputs and selects the list index and bit value whose associated path metric is larger.
The metric sorter in [14] has the same function as that of the proposed MVF. We compare the proposed bitonic sorter-based MVF module with the metric sorter [14] in terms of area and critical path delay (CPD) under different list sizes when both modules are synthesized under the TSMC 90-nm CMOS technology. As shown in Table IV, the proposed MVF module is more suitable for large list sizes. For list sizes L = 2 to 32, the proposed MVF achieves 8%-77% area saving. The proposed MVF architecture achieves area saving because comparators dominate the area of both the metric sorter and the MVF module. For list size L, the metric sorter needs $N_{MS} = L(2L - 1)$ comparators, while the proposed MVF module needs $N_{MVF} = L(1 + 2 + \cdots + \log_2 L) + L = \frac{L}{2}\left((\log_2 L)^2 + \log_2 L + 2\right)$ comparators. When L is large, $N_{MS}/N_{MVF} \approx 4L/(\log_2 L)^2$. Clearly, our MVF module needs fewer comparators.
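The selection property and the comparator counts above can be checked with a small Python model. It assumes the BSG output is already bitonic (first L metrics ascending, last L descending) and models only the final CAS stage; the function names are illustrative.

```python
def cas_stage(metrics):
    """Final CAS stage: pick max(LS_r, LS_{L+r}) for r = 0..L-1.

    metrics holds 2L values, the first L in ascending order and the last L
    in descending order, as produced by the bitonic sequence generator.
    """
    L = len(metrics) // 2
    return [max(metrics[r], metrics[L + r]) for r in range(L)]


def comparators_metric_sorter(L):
    """Comparator count of the metric sorter: L(2L - 1)."""
    return L * (2 * L - 1)


def comparators_mvf(L):
    """Comparator count of the MVF for L a power of two."""
    k = L.bit_length() - 1                    # log2(L)
    return L * sum(range(1, k + 1)) + L       # BSG stages plus the CAS stage
```

For L = 8 the counts are 120 and 56, and for L = 32 they are 2016 and 512, so the gap widens with the list size.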
When L = 2, 4, 8, compared with the metric sorter, the proposed MVF has longer CPD while achieving area saving. However, the longer delay for the MVF is inconsequential because it is not in the critical path for the decoder architecture when L ≤ 8. When L = 16, 32, the proposed MVF is better than the metric sorter in terms of both area and CPD. Thus, the proposed MVF is more suitable for large list sizes.
2) Crossbar Control Signal Generator: Due to the lazy copy method [14], when decoding path $l$ needs to be copied to decoding path $l'$, instead of copying LLMs from path $l$ to path $l'$, the index references ($r_l = (r_l[n-1], \ldots, r_l[0])$ shown in Algorithm 2) to the LLMs of path $l$ are copied to path $l'$. For decoding path $l$, when PUA$_l$ is computing updated LLMs in $P_{l,\lambda}$, the crossbar (CB) module shown in Fig. 5 selects input LLMs from decoding path $r_l[\lambda - 1]$. The CB can be implemented with L-to-1 multiplexers.
The CCG computes the control signals of the CB, $cc_0, cc_1, \ldots, cc_{L-1}$, where the $l$th output of the CB is connected to the $cc_l$th input. Since the CCG is a direct implementation of the lazy copy scheme in [14], the details are omitted and can be found in [15].
D. Partial Sum Update Unit and the CRC Unit
In this paper, a parallel partial sum update unit (PSU) is proposed to provide the partial sum inputs to the L PUAs when performing the G computation. Compared with the PSU in [3] and [14], which needs $N - 1$ single-bit registers for a decoding path, our PSU needs only $N/2 - 1$ single-bit registers.
Taking $N = 2^3$ as an example, the architecture of PSU$_l$, which computes the partial sums for decoding path $l$, is shown in Fig. 9, where stage$_3$ and stage$_2$ have one and two elementary update units (EUs), respectively. The registers $r_{l,3,0}$, $r_{l,2,0}$, and $r_{l,2,1}$ shown in Fig. 9 are single-bit registers, and $c_l = \hat{u}_{l,i}$ is the binary input of PSU$_l$. There are three partial sum outputs, $b_{l,3}$, $b_{l,2}$, and $b_{l,1}$, with widths of 1, 2, and 4 bits, respectively. When the LLMs in $P_{l,\lambda}$ need to be updated with the G computation, $b_{l,\lambda}$ is the corresponding partial sum input. The architectures of PSU$_l$ for other code lengths can be derived from the architecture in Fig. 9. For a polar code with length $N = 2^n$, the corresponding PSU$_l$ contains $n-1$ stages, stage$_n$, stage$_{n-1}$, $\ldots$, stage$_2$, where stage$_j$ has $2^{n-j}$ EUs for $n \ge j \ge 2$.
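As a quick consistency check on the register count, assume that each EU holds exactly one single-bit register, as in the $N = 2^3$ example of Fig. 9 (extending this to general $N$ is our assumption). Summing the EUs over all stages then reproduces the $N/2 - 1$ registers per decoding path:

```latex
\sum_{j=2}^{n} 2^{\,n-j} \;=\; \sum_{k=0}^{n-2} 2^{k} \;=\; 2^{\,n-1} - 1 \;=\; \frac{N}{2} - 1 .
```

For $N = 8$ this gives $1 + 2 = 3$ registers, matching the three registers listed for Fig. 9.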
When bit index $i$ is even, $c_l$ is stored in $r_{l,n,0}$ and the other registers keep their current values. When bit index $i$ is odd, the bit registers in stage$_n$, stage$_{n-1}$, $\ldots$, stage$_{\phi(i+1)}$ are updated with their corresponding inputs. When the decoding path index $l \neq a_l$, the updated partial sums of decoding path $l$ should be computed based on the bit registers in PSU$_{a_l}$. The switch network (SW) shown in Fig. 9 selects the corresponding bit register value from PSU$_{a_l}$. The width of the input signal $B_{l,j,k} = \{r_{0,j,k}, r_{1,j,k}, \ldots, r_{L-1,j,k}\} \setminus \{r_{l,j,k}\}$ is $L-1$ bits.
The CRC unit (CRCU) checks whether a data word passes the CRC. Suppose that an $h$-bit CRC checksum is used. The architecture of CRCU$_l$ for decoding path $l$ is shown in Fig. 10, where the generator polynomial for the CRC checksum computation is $p(x) = x^h + p_{h-1}x^{h-1} + \cdots + p_1 x + 1$. The proposed CRCU$_l$ is based on a well-known serial CRC computation architecture [19]. If the polynomial coefficient $p_k = 0$, the corresponding XOR gate and multiplexer are removed. During the decoding of the first $K-h$ information bits, the control signal shift$_l = 0$ and CRCU$_l$ computes the $h$-bit checksum of these information bits. The checksum is stored in the bit registers $d_{l,0}, d_{l,1}, \ldots, d_{l,h-1}$ shown in Fig. 10. When a frozen bit is being decoded, $d_{l,0}, d_{l,1}, \ldots, d_{l,h-1}$ are not updated. Once the checksum computation is finished, the control signal shift$_l$ is set to 1 and the checksum is compared bit by bit with the remaining $h$ decoded information bits. The comparison result is stored in the register cs$_l$. The decoded codeword for decoding path $l$ passes the CRC only if cs$_l = 0$. The SW module shown in Fig. 10 is the same as that used in the partial sum update unit PSU$_l$. When $l \neq a_l$, the SW module selects $d_{a_l,k}$ for $k = 0, 1, \ldots, h-1$.
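The behavior of a serial CRC check of this kind can be sketched in a few lines. The model below is a generic bit-serial LFSR, not a transcription of Fig. 10: the register update order, the convention that the checksum is appended with $d_{h-1}$ first, and all function names are our assumptions. Only information bits are fed in, which mirrors the statement that frozen bits leave the registers unchanged.

```python
def crc_checksum(bits, taps):
    """Bit-serial CRC over `bits`, processed first bit first.

    taps = [p_0, p_1, ..., p_{h-1}] are the low-order coefficients of
    p(x) = x^h + p_{h-1} x^{h-1} + ... + p_1 x + p_0, with p_0 = 1.
    Returns the register contents [d_0, ..., d_{h-1}].
    """
    h = len(taps)
    d = [0] * h
    for bit in bits:
        feedback = d[h - 1] ^ bit
        # Shift by one position; XOR the feedback in wherever p_k = 1.
        d = [feedback & taps[0]] + [
            d[k - 1] ^ (feedback & taps[k]) for k in range(1, h)
        ]
    return d


def append_crc(data_bits, taps):
    """Assumed convention: the checksum is appended with d_{h-1} first."""
    return list(data_bits) + crc_checksum(data_bits, taps)[::-1]


def passes_crc(info_bits, taps):
    """Compare the last h bits with the checksum of the first K - h bits."""
    h = len(taps)
    expected = crc_checksum(info_bits[:-h], taps)[::-1]
    cs = 0
    for e, r in zip(expected, info_bits[-h:]):
        cs |= e ^ r  # cs stays 0 only if every bit matches
    return cs == 0
```

A word built with `append_crc` passes the check, and flipping any single bit of it makes the check fail, which is the property the CA-SCL decoder relies on when choosing among the $L$ candidates.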
E. Decoding Cycles
For the proposed list decoder, pipeline registers can be inserted in the paths that pass through the MVF. Let N C denote the number of cycles spent on the decoding of one data word. For list decoder architectures based on partial parallel processing [3] 
where $N$, $T$, $n_p$, and $R$ denote the block length, the number of PUs per decoding path, the number of pipeline registers inserted in the PPU, and the code rate, respectively. The corresponding throughput is $T_P = f N R / N_C$, where $f$ is the clock frequency of the list decoder. The latency is $T_D = N_C / f$.
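The throughput and latency expressions are straightforward to evaluate once the cycle count is known. The helper below takes $N_C$ as a given input rather than computing it; the function names and the numbers in the usage note are hypothetical and are not results from this paper.

```python
def throughput_bps(freq_hz, block_length, code_rate, cycles_per_word):
    """T_P = f * N * R / N_C, in information bits per second."""
    return freq_hz * block_length * code_rate / cycles_per_word


def latency_s(freq_hz, cycles_per_word):
    """T_D = N_C / f, in seconds."""
    return cycles_per_word / freq_hz
```

For example, with a hypothetical $f = 400$ MHz, $N = 1024$, $R = 1/2$, and $N_C = 2560$ cycles, the expressions give $T_P = 80$ Mb/s and $T_D = 6.4\ \mu$s. Any pipelining that raises $f$ but also raises $N_C$ has to be judged on both quantities.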
F. Scalability of the Proposed List Decoder Architecture
In terms of error performance, it is desirable for our list decoders to use large list sizes, since a larger $L$ yields a larger performance gain for the CA-SCL decoding algorithm. For the existing list decoder architecture in [14], two problems arise when $L$ increases.
1) The message memories of the list decoder in [14] are built with registers due to the nonuniform quantization of the logarithm-domain messages. Besides, the message memories dominate the whole decoder area. As a result, the memory area of the list decoder is linearly proportional to the list size $L$. For a larger list size, the list decoder architecture in [14] will suffer from large area and high power consumption due to its register-based memory.
2) As shown in Table IV, when the list size grows, the metric sorter suffers from large area and a long critical path delay, which results in a lower clock frequency for the list decoder. If multiple pipeline stages are inserted in the metric sorter, the number of cycles for decoding one codeword also increases, as shown in (9).
For our list decoder architecture, these two issues are solved as follows.
1) The proposed memory architecture is more area efficient than register-based memory. Besides, it offers a tradeoff between data throughput and memory area. The area of the register-based memory [14] remains almost unchanged when the number of PUs changes. However, for the proposed memory architecture, the number of PUs affects the depth-width ratio of the message memories. Hence, the area of the message memory can be tuned by varying the number of PUs. Reducing the number of PUs increases the depth of the message memories, which is more area efficient. On the other hand, reducing the number of PUs also increases the number of cycles needed to decode one codeword and decreases the data throughput.
2) When the list size increases, the proposed MVF is more area efficient and has a shorter CPD than the metric sorter [14].
As shown in (6), the depth of the largest L-MEM instance increases when $N$ increases. Hence, the area efficiency improves when $N$ increases. As a result, our list decoder architecture is more suitable for large block lengths $N$.
V. IMPLEMENTATION RESULTS
In this paper, our list decoder architecture has been implemented with list sizes $L = 2$ and 4 for a rate-1/2 polar code with $N = 1024$. For each list size, two list decoders, with $T = 8$ and $T = 16$ PUs, respectively, are implemented and synthesized under a TSMC 90-nm CMOS technology. For the L-MEM within each of our list decoders, each submemory is compiled with a memory compiler if its depth is large enough; otherwise, the submemory is built with registers. For all implemented decoders, each channel LLM is quantized with 4 bits to achieve near floating-point decoding performance. For our list decoders with $L = 2$ and 4, one stage of pipeline registers is used. Since the synthesis results in [14] were based on a United Microelectronics Corporation (UMC) 90-nm technology, Balatsoukas-Stimming et al. [14] have generously resynthesized their decoder architecture using the TSMC 90-nm technology. We list both the synthesis results from [14] and the resynthesized results. To make a fair comparison, we focus on the resynthesized results.
The implementation results in Table V show the following.
1) The decoder architecture in [14] has a higher throughput than our list decoder architecture. The reason is that the decoder architecture in [14] employs register-based memory while the proposed list decoder architecture employs RF-based memories. The read and write delays of an RF are larger than those of a register-based memory.
2) On the other hand, our list decoder architecture is more area efficient. For list decoders with the same $L$ and $T$ values, our list decoder architecture achieves 1.24-1.83 times the area efficiency of the decoder in [14].
Our list decoder is implemented for the $N = 1024$ polar code because the same block length is used in [14]. For larger block lengths or larger list sizes, our advantage in area efficiency is expected to be greater due to the more area-efficient L-MEM.
Since the CA-SCL algorithm helps to select the correct codeword from the $L$ candidate decoded codewords [5], the decoding performance of the CA-SCL algorithm is better than that of the SCL algorithm with the same list size in [14]. As shown in Fig. 2, the proposed CA-SCL decoders in Table V outperform the SCL decoders in Table V. We note that the number of PUs has no impact on the error performance of the SCL and CA-SCL decoders.
As shown in Fig. 2 , for the CA-SCL algorithm, increasing the list size results in notable decoding gain according to our simulations. As shown in [4, Fig. 1 ], increasing the list size of the SCL algorithm leads to negligible decoding gain, especially in the high SNR region. For the CA-SCL algorithm, the choice of list size L depends on the tradeoff between error performance and decoding complexity. Better error performance can be achieved by increasing the list size L.
For the SCL algorithm, we need to find the threshold value L T , where little further decoding gain is achieved by employing a list size L > L T . For the SCL algorithm, the feasible list size should be no greater than L T and satisfy the error performance requirement.
Due to the serial nature of the SC method, SC-based decoders and their list variants suffer from long decoding latency. The throughput of SC-based decoders is expected to be lower than that of BP-based decoders, since the BP algorithm for polar codes has much higher parallelism. On the other hand, the BP algorithm for polar codes still suffers from inferior finite-length error performance [9], [20]. Current simulation results [9] show that the error performance of the BP algorithm for polar codes is similar to that of the SC algorithm, but worse than those of the SCL and CA-SCL algorithms.
VI. CONCLUSION
In this paper, an efficient list decoder architecture has been proposed for polar codes. The proposed decoder architecture achieves higher area efficiency and better error performance than the previous list decoder architectures.
