Abstract-The cyclic redundancy check (CRC) aided successive cancelation list (SCL) decoding algorithm has better error performance than the successive cancelation (SC) decoding algorithm for short or moderate polar codes. However, the CRC aided SCL (CA-SCL) decoding algorithm still suffers from long decoding latency. In this paper, a reduced latency list decoding (RLLD) algorithm for polar codes is proposed. In the proposed RLLD algorithm, all rate-0 nodes and part of the rate-1 nodes are decoded instantly without traversing the corresponding subtree. A list maximum-likelihood decoding (LMLD) algorithm is proposed to decode the maximum likelihood (ML) nodes and the remaining rate-1 nodes. Moreover, a simplified LMLD (SLMLD) algorithm is also proposed to reduce the computational complexity of the LMLD algorithm. Assuming a partial parallel list decoder architecture with list size L = 4, for an (8192, 4096) polar code, the proposed RLLD algorithm reduces the number of decoding clock cycles and the decoding latency by 6.97 and 6.77 times, respectively.
I. INTRODUCTION
Polar codes [1] are a significant breakthrough in coding theory, since it has been proved that they achieve the channel capacity of binary-input symmetric memoryless channels [1] and of any discrete or continuous channel [2]. Polar codes can be efficiently decoded by the low-complexity successive cancelation (SC) decoding algorithm [1] with a complexity of O(N log N), where N is the block length.
Considerable effort [3], [4] has already been devoted to improving the error-correction performance of polar codes with short or moderate lengths. A successive cancelation list (SCL) decoding algorithm was recently proposed in [3]; it performs better than the SC decoding algorithm and almost the same as a maximum-likelihood (ML) decoder [3]. In [4], the cyclic redundancy check (CRC) is used to pick the output codeword from L candidates, where L is the list size. The CRC-aided SCL decoding algorithm performs much better than the SCL decoding algorithm at the expense of a negligible loss in code rate. For example, it was shown in [4] that the CRC-aided SCL decoding algorithm outperforms the SC decoding algorithm by more than 1 dB when the bit error rate (BER) is on the order of $10^{-5}$ for a polar code of length 2048. Many research efforts [5]-[9] have been devoted to reducing the decoding latency of the SC decoding algorithm. The simplified successive cancelation (SSC) and the ML-SSC decoding algorithms were proposed in [5] and [7], respectively. Both the SSC and ML-SSC decoding algorithms can reduce the decoding latency of an SC decoder significantly.
However, reduced latency list decoding algorithms have rarely been discussed in the open literature.
In this paper, the algorithms that reduce the latency of list polar decoders are investigated. The main contributions are shown as follows.
1) A reduced latency list decoding (RLLD) algorithm over the LLR domain for polar codes is proposed. The proposed RLLD algorithm deals with rate-0 nodes and part of the rate-1 nodes in the same way as the SSC decoding algorithm.
2) A list ML decoding (LMLD) algorithm is proposed to decode the ML nodes and the remaining rate-1 nodes. For list size L ≤ 8, a hardware friendly simplified LMLD (SLMLD) algorithm is also proposed.
3) For list size L = 4, an efficient hardware architecture for the proposed SLMLD algorithm is presented. Under a TSMC 90nm technology, at the cost of 1.07 million standard NAND gates, the proposed architecture achieves a frequency of 400MHz with 4 pipeline stages.
4) For a partial parallel decoder architecture with L = 4, it is shown that the RLLD algorithm with the SLMLD algorithm reduces the decoding cycles and latency by 6.97 and 6.77 times, respectively.
II. PRELIMINARIES
A. Polar Codes Encoding
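The generator matrix of a polar code is an N × N matrix $G = B_N F^{\otimes n}$, where $N = 2^n$, $B_N$ is the bit reversal permutation matrix, and

$$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}.$$

Here $\otimes n$ denotes the n-th Kronecker power. A codeword is obtained as $x_0^{N-1} = u_0^{N-1} G$, where the indices of the source bits $u_0^{N-1}$ are divided into two sets: the information bits set $A$ contains K indices and the frozen bits set $A^c$ contains N − K indices. $u_A$ are the information bits, whose indices all come from $A$. $u_{A^c}$ are the frozen bits, whose indices all come from $A^c$. The encoding graph of a polar code with N = 8 is shown in Fig. 1.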
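The encoding rule can be illustrated with a minimal sketch. It assumes the standard kernel F = [[1, 0], [1, 1]], frozen bits fixed to 0, and computes x = u B_N F^{⊗n} with butterfly stages instead of an explicit matrix; the function names `bit_reverse`, `polar_transform` and `polar_encode` are hypothetical and not taken from the paper.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_transform(v):
    """Multiply the bit vector v by the n-th Kronecker power of F over GF(2)."""
    x = list(v)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]  # upper branch accumulates the lower one
        step *= 2
    return x


def polar_encode(info_bits, info_set, N):
    """Encode with G = B_N * F^{(x)n}; frozen bits are assumed to be 0."""
    n = N.bit_length() - 1
    u = [0] * N
    for idx, bit in zip(sorted(info_set), info_bits):
        u[idx] = bit
    permuted = [u[bit_reverse(i, n)] for i in range(N)]  # u * B_N
    return polar_transform(permuted)
```

For instance, `polar_encode([1], {1}, 4)` selects the row of G with index 1, which is row 2 of F^{⊗2} after the bit reversal.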
B. SSC and ML-SSC Decoding Algorithms
A polar code of length $N = 2^n$ can also be represented by a full binary tree of depth n [5], where each node of the tree is associated with a constituent code. The binary tree representation of an (8, 3) polar code is shown in Fig. 2, where the black and white leaf nodes correspond to information and frozen bits, respectively. In order to show the connection between the tree representation and the direct encoding graph in Fig. 1, the constituent code associated with each tree node is also shown in Fig. 2. There are three types of nodes in a binary tree representation of a polar code: rate-0, rate-1 and arbitrary rate nodes. The leaf nodes of rate-0 and rate-1 nodes are associated with only frozen bits and only information bits, respectively. The leaf nodes of an arbitrary rate node are associated with both information and frozen bits. For example, the rate-0, rate-1 and arbitrary rate nodes in Fig. 2 are represented by circles in white, black and gray, respectively.

The SC decoding algorithm can also be mapped onto a binary tree, where each node acts as a decoder for its constituent code. As shown in Fig. 2, the decoder at node v receives a soft information vector $\alpha_v$ and returns its corresponding constituent code $\beta_v$. The SC decoding algorithm is initialized by feeding the root node with the channel LLRs, $(L_0, L_1, \cdots, L_{N-1})$, where $L_i = \log(\Pr(y_i|x_i = 0)/\Pr(y_i|x_i = 1))$. When an internal node v is activated, it calculates the soft information vector $\alpha_v^l$ sent to its left child, where

$$\alpha_v^l[i] = f(\alpha_v[2i], \alpha_v[2i+1]), \quad 0 \le i < 2^{n-t}, \tag{1}$$

$f(a, b) = 2\tanh^{-1}(\tanh(a/2)\tanh(b/2))$, and t is the layer index of the child node. $f(a, b)$ can be approximated as:

$$f(a, b) \approx \mathrm{sign}(a)\,\mathrm{sign}(b)\,\min(|a|, |b|). \tag{2}$$

Node v then waits until it receives the constituent code $\beta_v^l$. The soft information vector $\alpha_v^r$ sent to its right child is then computed as:

$$\alpha_v^r[i] = (1 - 2\beta_v^l[i])\,\alpha_v[2i] + \alpha_v[2i+1]. \tag{3}$$

Once the right child returns its constituent code $\beta_v^r$, node v computes its constituent code $\beta_v$ as:

$$(\beta_v[2i], \beta_v[2i+1]) = (\beta_v^l[i] \oplus \beta_v^r[i],\ \beta_v^r[i]), \tag{4}$$

where $0 \le i < 2^{n-t}$ and ⊕ is modulo-2 addition. When a leaf node v is activated, its constituent code $\beta_v$ is set to 0 if leaf node v is associated with a frozen bit. Otherwise, $\beta_v$ is calculated from $\alpha_v$ with the threshold detection:

$$\beta_v = \begin{cases} 0, & \alpha_v \ge 0, \\ 1, & \text{otherwise}. \end{cases} \tag{5}$$

Starting from the root node, all nodes in the tree are activated recursively for the SC decoding. Once $\beta_v$ for the last leaf node is generated, the codeword $x_0^{N-1}$ can be obtained by combining and propagating $\beta_v$ up to the root node.
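The recursive schedule described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes the min-sum approximation of f, assumes that a node pairs adjacent LLRs (indices 2i and 2i+1, which matches the bit-reversed generator matrix), and uses hypothetical names `f`, `g` and `sc_decode`.

```python
def f(a, b):
    """Min-sum approximation of 2*atanh(tanh(a/2)*tanh(b/2))."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(s, a, b):
    """LLR for the right child once the left partial sum s is known."""
    return (1 - 2 * s) * a + b


def sc_decode(alpha, frozen):
    """Decode one tree node.

    alpha  : LLR vector received by the node
    frozen : list of booleans for the node's leaves, left to right
    Returns (beta, u_hat): the constituent code and the decoded source bits.
    """
    if len(alpha) == 1:
        # Leaf: frozen bits are 0, information bits use threshold detection.
        bit = 0 if frozen[0] or alpha[0] >= 0 else 1
        return [bit], [bit]
    half = len(alpha) // 2
    alpha_l = [f(alpha[2 * i], alpha[2 * i + 1]) for i in range(half)]
    beta_l, u_l = sc_decode(alpha_l, frozen[:half])
    alpha_r = [g(beta_l[i], alpha[2 * i], alpha[2 * i + 1]) for i in range(half)]
    beta_r, u_r = sc_decode(alpha_r, frozen[half:])
    beta = []
    for i in range(half):
        beta += [beta_l[i] ^ beta_r[i], beta_r[i]]  # combine partial sums
    return beta, u_l + u_r
```

Calling `sc_decode(channel_llrs, frozen)` at the root returns both the estimated codeword and the estimated source bits; positive LLRs favor bit 0.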
The SSC decoding algorithm in [5] simplifies the decoding of rate-0 and rate-1 nodes. Once a rate-0 node is activated, it immediately returns its constituent code, which is an all-zero vector. Once a rate-1 node is activated, its constituent code is directly calculated from the received soft information vector with the threshold detection rule shown in Eq. (5). The ML-SSC decoding algorithm [7] simplifies the SSC decoding algorithm further by performing exhaustive-search ML decoding on some resource constrained arbitrary rate nodes, which are called ML nodes in [7]. For an ML node with layer index t, the associated constituent code is estimated according to:

$$\beta_v = \arg\max_{x \in C} \sum_{i=0}^{2^{n-t}-1} (1 - 2x[i])\,\alpha_v[i], \tag{6}$$

where C is the set of possible constituent codes for the ML node. The binary tree representations of the example (8, 3) polar code under the SSC and ML-SSC decoding algorithms are shown in Fig. 3 (a) and (b), respectively. It is observed that the SSC decoding algorithm reduces the number of nodes to be activated. This number is further reduced by the ML-SSC decoding algorithm, which introduces ML nodes. All the child nodes of a rate-0 node are still rate-0 nodes, and all the child nodes of a rate-1 node are still rate-1 nodes. During the reduction of the binary tree, a rate-0 (rate-1) node is kept only if its parent node is not a rate-0 (rate-1) node. For an arbitrary rate node v, let $n_v$ and $d_v$ denote the number of leaf nodes and the number of leaf nodes that correspond to information bits, respectively. In [7], an arbitrary rate node is labeled as an ML node only if its $n_v$ and $d_v$ do not exceed predefined values.
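The exhaustive-search ML rule for an ML node can be illustrated as follows. The sketch assumes the correlation form of the ML rule (maximize the sum of (1 − 2x[i])·α[i] over the candidate set) and takes the candidate set as an explicit list; `ml_decode` is a hypothetical helper name.

```python
def ml_decode(alpha, codebook):
    """Return the candidate constituent code with the largest correlation
    against the received LLR vector alpha (positive LLR favors bit 0)."""
    def correlation(x):
        return sum((1 - 2 * xi) * ai for xi, ai in zip(x, alpha))
    return max(codebook, key=correlation)


# Example: a length-4 node whose only information bit is the last leaf,
# so its constituent code is a repetition code.
repetition = [(0, 0, 0, 0), (1, 1, 1, 1)]
decoded = ml_decode([1.5, -0.5, -2.0, -0.25], repetition)
```

The cost grows with the size of the candidate set, which is why only nodes with few information bits are treated this way.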
C. LLR Based List Decoding Algorithms
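In the first several works [3], [10], [11] on list decoding of polar codes, the list decoding algorithm is performed in either the probability or the logarithmic likelihood (LL) domain. In [12], an LLR based list decoding algorithm is proposed to reduce the message memory requirement and the computational complexity of the LL based list decoding algorithm. The LLR based list decoding algorithm employs a novel path metric $\mathrm{PM}_l^{(i)}$ for decoding path l, which is computed as:

$$\mathrm{PM}_l^{(i)} = \sum_{k=0}^{i} \ln\left(1 + e^{-(1 - 2\hat{u}_k[l]) \cdot L_n^{(k)}[l]}\right), \tag{7}$$

where

$$L_n^{(k)}[l] = \ln \frac{W_n^{(k)}(\mathbf{y}, \hat{u}_0^{k-1}[l] \mid 0)}{W_n^{(k)}(\mathbf{y}, \hat{u}_0^{k-1}[l] \mid 1)}, \tag{8}$$

$\hat{u}_k[l]$ is the estimate of bit $u_k$ on decoding path l, and $\mathbf{y}$ is the received channel soft information vector.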
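A per-bit update of an LLR based path metric can be sketched as below. The sketch assumes the metric form of [12], in which each decoded bit adds ln(1 + exp(−(1 − 2û)·LLR)) to the metric, together with its common hardware-friendly approximation that adds |LLR| only when the bit decision disagrees with the sign of the LLR; both function names are hypothetical.

```python
import math


def pm_update_exact(pm, llr, u_hat):
    """Add the exact per-bit penalty ln(1 + exp(-(1 - 2*u_hat) * llr))."""
    return pm + math.log1p(math.exp(-(1 - 2 * u_hat) * llr))


def pm_update_approx(pm, llr, u_hat):
    """Approximate update: penalize |llr| only if u_hat contradicts the LLR sign."""
    hard = 0 if llr >= 0 else 1
    return pm if u_hat == hard else pm + abs(llr)
```

With this convention a smaller path metric indicates a more reliable decoding path, and the approximation avoids the logarithm and exponential in hardware.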
III. THE PROPOSED RLLD ALGORITHM
Although existing list decoding algorithms for polar codes improve the performance of SC decoders significantly, they still suffer from long decoding latency. During the decoding of each information bit, the current decoding paths are doubled and at most L most reliable decoding paths are kept, where L is the list size. The extra cycles spent on path pruning increase the overall number of decoding cycles [10]. In this paper, a reduced latency list decoding (RLLD) algorithm for polar codes is proposed. Let $W_v$ and $I_v$ denote the number of leaf nodes and the number of leaf nodes associated with information bits of a node v in a binary tree, respectively. Let $W_T$ be a predefined threshold value. The general framework of the proposed RLLD algorithm is as follows:
1) For a binary tree representation of a polar code, label all the rate-0, rate-1 and ML nodes. For rate-1 nodes, $I_v = W_v$. Moreover, two types of nodes are defined: $T_0$ and $T_1$. $T_0$ nodes include rate-1 nodes with $I_v > W_T$ and all rate-0 nodes. $T_1$ nodes include rate-1 nodes with $I_v \le W_T$ and all ML nodes. For all ML nodes, $W_v \le W_{ML}$, where $W_{ML}$ is also a predefined threshold value.
2) For each decoding path, perform the SC decoding algorithm on the corresponding pruned binary tree. If a $T_0$ node is activated, the corresponding constituent code is decoded immediately and sent to its parent node. Moreover, it is unnecessary to compute the LLR vector sent to a rate-0 node, since the constituent code of a rate-0 node is always a zero vector.
3) If a $T_1$ node is activated, compute $2^{I_v}$ path metrics for each current decoding path, where each path metric corresponds to the reliability of a possible decoding path. Find at most L most reliable decoding paths and continue their corresponding SC decoding. Since only rate-1 nodes with $I_v \le W_T$ are involved in the list decoding, the value of $W_T$ should be chosen by numerical simulation.
4) Once all $T_0$ and $T_1$ nodes have been activated and all the SC decoding procedures on each decoding path are finished, perform the cyclic redundancy check (CRC) on the information bits of each candidate codeword. The output codeword is the one that passes the CRC.

In terms of software or hardware implementation, the proposed RLLD algorithm can be performed over L LLR matrices and L bit matrices. For $l = 0, 1, \cdots, L-1$ and $t = 1, 2, \cdots, n$, let $P_{l,t}$ be an LLR message array of $2^{n-t}$ elements: $P_{l,t}[j]$ stores an LLR message for $j = 0, 1, \cdots, 2^{n-t} - 1$. The received channel LLRs are stored in $P_{0,0}$, which has $N = 2^n$ elements. $C_{l,t}$ has a similar structure to $P_{l,t}$, with each element storing a bit. When a node v with layer index $t_v$ is activated, the LLR vector $P_{l,t_v}$ that it receives on decoding path l is computed with Algorithm 1, using the updates in Eq. (1) and Eq. (3). If node v is a rate-0 node, as mentioned before, it is unnecessary to compute the received LLR vector. Under this circumstance, $t_v$ is decreased by 1. When a decoding path l needs to be copied to decoding path l', the lazy copy approach in [10] is applied. Instead of copying LLR matrices, only the references to the LLR arrays are copied.
For decoding path l, during the computation of $P_{l,t_v}$, the LLR arrays $P_{l,I_s}, \cdots, P_{l,t_v}$ need to be updated serially, where $I_s$ is a pre-computed layer index. For the tree representation of a polar code, suppose all leaf nodes are indexed from 0 to N − 1, from left to right. Let the indices of the leftmost and rightmost leaf nodes of the subtree of node v be $IDX_0$ and $IDX_1$, respectively. $I_s$ is computed based on $IDX_0$ as shown in Algorithm 2, where the function dec2bin computes the binary representation of its input, and $B_{n-1}$ and $B_0$ are the most and least significant bits, respectively.
Once the constituent code $C_l^v$ sent from node v for decoding path l is computed, it is stored in the bit array $C_{l,t_v}$, which has $2^{n-t_v}$ elements. If the contents of decoding path l need to be copied to decoding path l', the partial sums in decoding path l are copied to the corresponding locations in decoding path l'. If node v is the right child of its parent node, then the partial sum computation for path l is performed as shown in Algorithm 3. The input $I_e$ is a layer index and can be obtained by applying Algorithm 2 with $IDX_0$ and $I_s$ replaced by $IDX_1$ and $I_e$, respectively.
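Step 1 of the framework, the labeling of $T_0$ and $T_1$ nodes on the pruned tree, can be sketched as follows. This is a hedged illustration: the function name `label_nodes`, the label strings and the tuple layout are hypothetical, and the defaults (at most 16 leaves and at most 8 information bits per ML node, rate-1 nodes wider than 8 split further) follow the limits stated later in the paper for the LMLD algorithm.

```python
def label_nodes(frozen, w_t, w_ml=16, i_max=8):
    """Return (label, first_leaf, width) for each node of the pruned tree.

    frozen : list of booleans, True for a frozen leaf, in left-to-right order
    w_t    : threshold W_T separating instantly decoded rate-1 nodes (T0)
             from list-decoded rate-1 nodes (T1)
    """
    assert i_max >= 1
    nodes = []

    def visit(start, width):
        info = width - sum(frozen[start:start + width])
        if info == 0:
            nodes.append(("T0:rate-0", start, width))
        elif info == width and width > w_t:
            nodes.append(("T0:rate-1", start, width))
        elif info == width and width <= i_max:
            nodes.append(("T1:rate-1", start, width))
        elif info < width and width <= w_ml and info <= i_max:
            nodes.append(("T1:ML", start, width))
        else:
            # Arbitrary rate node, or a rate-1 node too wide for list ML
            # decoding: descend to the two children.
            half = width // 2
            visit(start, half)
            visit(start + half, half)

    visit(0, len(frozen))
    return nodes


# Hypothetical (8, 3) frozen pattern with information bits at leaves 5, 6, 7.
example = label_nodes([True] * 5 + [False] * 3, w_t=2, w_ml=2)
```

Raising `w_t` moves more rate-1 nodes from the instantly decoded set into the list-decoded set, which is the trade-off examined in the simulation results.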
A. LMLD Algorithms
When a $T_1$ node is activated, the current decoding paths will expand, and at most L most reliable decoding paths are kept.

[Algorithm 1: llrComp(l, α); input: $I_s$, $t_v$; output: $P_{l,t_v}$]
In this paper, a list ML decoding (LMLD) algorithm is proposed to find at most L most reliable decoding paths. For a $T_1$ node v, there are $2^{I_v}$ candidate output constituent codes, since the number of information bits associated with the leaf nodes of node v is $I_v$. Therefore, for each decoding path l, the proposed LMLD algorithm computes $2^{I_v}$ extended path metrics $\mathrm{PM}_l^j$ for $j = 0, 1, \cdots, 2^{I_v} - 1$ based on the current path metric $\mathrm{PM}_l$. Finding the L most reliable surviving decoding paths is equivalent to finding the L most reliable constituent codes among all candidates. Here, several conclusions are made on path metrics and extended path metrics:
• For each decoding path l, the path metric $\mathrm{PM}_l$ is initialized with 0. The extended path metrics are computed only when a $T_1$ node is activated.
• The smaller an extended path metric is, the more reliable the corresponding constituent code is.
Based on the previous conclusions, the proposed LMLD algorithm finds the L most reliable constituent codes by selecting the L minimum metrics among $2^{I_v} L$ metrics. Let the set $S = \{(l, j)_r \mid r = 0, 1, \cdots, L - 1\}$, where $(l, j)_r$ is the index of a candidate constituent code. The proposed LMLD algorithm is then given by Eq. (9),

$$S = \underset{l \in \{0, \cdots, L-1\},\ j \in \{0, \cdots, 2^{I_v}-1\}}{\operatorname{argmin-L}}\ \mathrm{PM}_l^j, \tag{9}$$

where argmin−L finds the associated indices of the L minimum metrics among all input metrics. The current L path metrics are updated with the L minimum extended path metrics.
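The selection in Eq. (9) can be sketched as an exhaustive search over all (path, candidate) pairs. The node metric used here is an assumption, not taken from the paper: it sums |LLR| over the positions where a candidate bit disagrees with the hard decision of the received LLR, in the spirit of the LLR based path metric of [12]. The names `node_metric` and `lmld` are hypothetical.

```python
import heapq


def node_metric(alpha, code):
    """Assumed penalty of a candidate: sum of |LLR| at disagreeing positions."""
    return sum(abs(a) for a, c in zip(alpha, code) if c != (0 if a >= 0 else 1))


def lmld(path_metrics, alphas, candidates, L):
    """Return the L smallest (extended_metric, l, j) triples.

    path_metrics : current metric PM_l of each decoding path
    alphas       : LLR vector received by the node on each decoding path
    candidates   : candidate constituent codes of the node (shared by all paths)
    """
    extended = []
    for l, (pm, alpha) in enumerate(zip(path_metrics, alphas)):
        for j, code in enumerate(candidates):
            extended.append((pm + node_metric(alpha, code), l, j))
    return heapq.nsmallest(L, extended)


survivors = lmld([0.0, 1.0], [[2.0, -1.0], [0.5, 0.5]],
                 [(0, 0), (0, 1), (1, 0), (1, 1)], L=2)
```

Each surviving triple identifies the parent path l and the chosen candidate j, and its metric becomes the new path metric of that survivor.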
As shown in Eq. (9), the computational complexity of the proposed LMLD algorithm is exponential in $I_v$, the number of leaf nodes associated with information bits for node v. As a result, the maximum value of $I_v$ should be limited for a practical implementation of the proposed LMLD algorithm. In this paper, the maximum value of $I_v$ is set to 8, and the maximum number of leaf nodes of an ML node is set to $W_{ML} = 16$. If $W_T$ is greater than 8, each rate-1 node with $8 < W_v \le W_T$ is split into several rate-1 nodes with $W_v = 8$, and the other nodes generated by the split are viewed as arbitrary rate nodes. Taking a rate-1 node with $W_v = 32$ as an example, the split is shown in Fig. 4, where 4 rate-1 nodes with $W_v = 8$ are generated while the other generated nodes are deemed arbitrary rate nodes. Note that $W_v$ for a rate-1 node can only be a power of 2.
B. SLMLD Algorithms
The computational complexity of the proposed LMLD algorithm is still high when $I_v$ is close to 8. In this paper, for L ≤ 8, a simplified list ML decoding (SLMLD) algorithm suitable for parallel hardware implementation is proposed to further reduce the computational complexity of the proposed LMLD algorithm. Here, L is assumed to be a power of 2. The proposed SLMLD algorithm divides the selection in Eq. (9) into two major steps:
1) For each current decoding path l, find its L most reliable constituent codes based on node metrics. Since only the L most reliable constituent codes are needed in the end and at most L constituent codes come from the same decoding path l, it is enough to find the L most reliable constituent codes for each decoding path l.
2) Compute the extended path metrics based on the surviving node metrics from the previous step, and find the final L most reliable constituent codes among these L × L extended path metrics.

Depending on the value of $I_v$, the first step can be simplified further. If $2^{I_v} \le L$, nothing needs to be done. If $2^{I_v} = 2L$, the minimum L extended path metrics and their corresponding l and j indices are computed with a bitonic sequence [13] based sorter (BBS) [11], where the BBS first transforms the inputs into a bitonic sequence and then generates the L minimum metrics among all inputs. When $2^{I_v} > 2L$, the minimum L node metrics are computed as follows:
• The $2^{I_v}$ node metrics are divided into L groups.
• The minimum two metrics of each group are then computed.
• Among the resulting 2L extended path metrics, the minimum L extended path metrics and their corresponding l and j indices are computed with a BBS.

When list size L = 2, for any value of $I_v$, the first step is just finding the minimum two extended path metrics and their corresponding index pairs (l, j).
The second step of the proposed SLMLD algorithm repeatedly employs the 2L-L BBS, a sorter with 2L inputs and L outputs, to generate the L final extended path metrics and their associated path indices. Taking L = 4 as an example, there are L × L = 4L extended path metrics $\mathrm{PM}_l^j$. They are split into two sets of 2L metrics, which are applied to two 2L-L BBSs, respectively, so that 2L metrics in total are selected. The 2L-L BBS is then employed again to generate the final L minimum extended path metrics.
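The two SLMLD steps can be sketched in software as below. Two points are assumptions rather than facts from the paper: the groups in step 1 are taken to be contiguous blocks of candidate indices, and the BBS stages are replaced by `heapq.nsmallest`, which returns the same minima as a cascade of exact 2L-to-L selections. The names `slmld_step1` and `slmld` are hypothetical, and the number of node metrics is assumed to be a power of 2.

```python
import heapq


def slmld_step1(node_metrics, L):
    """Per-path pre-selection: return up to L (metric, j) pairs."""
    indexed = [(m, j) for j, m in enumerate(node_metrics)]
    if len(indexed) <= L:
        return indexed                      # nothing to prune
    if len(indexed) == 2 * L:
        return heapq.nsmallest(L, indexed)  # one 2L-to-L selection
    # More than 2L candidates: L groups (assumed contiguous), two minima each.
    size = len(indexed) // L
    kept = []
    for grp in range(L):
        kept += heapq.nsmallest(2, indexed[grp * size:(grp + 1) * size])
    return heapq.nsmallest(L, kept)


def slmld(path_metrics, node_metrics_per_path, L):
    """Step 2: merge at most L*L extended metrics and keep the L smallest."""
    extended = []
    for l, pm in enumerate(path_metrics):
        for m, j in slmld_step1(node_metrics_per_path[l], L):
            extended.append((pm + m, l, j))
    return heapq.nsmallest(L, extended)


# Grouping is not exact: with L = 4, metrics 3 and 10 are lost because their
# group may keep only two candidates.
approx = slmld_step1([1, 2, 3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], 4)
```

The grouping keeps the comparator network small at the cost of occasionally discarding a candidate that exhaustive selection would keep, which is consistent with the performance gap between the simplified and exhaustive variants reported in the simulation results.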
C. Simulation Results
For an (8192, 4096) polar code, the frame error rate (FER) performance of the proposed RLLD algorithm is shown in Fig. 5, under the AWGN channel with BPSK modulation.
In Fig. 5, CSi denotes the CRC aided SC list decoding algorithm [3] with list size L = i over the LLR domain, and RS(i, ω) denotes the proposed RLLD algorithm with the SLMLD algorithm when list size L = i and $W_T$ = ω. For both the CSi and RS(i, ω) algorithms, 32 information bits are replaced with a 32-bit CRC checksum.
For simplicity, the FER performance of the proposed RLLD algorithm with the LMLD algorithm (RL) is not shown in this paper, since the FER performance of the RL algorithm is the same as that of the CS algorithm with the same list size. Based on the simulation results, the following conclusions are made:
• The performance of the proposed RS algorithm is affected by the list size L. For the (8192, 4096) polar code, the FER performance of RS(2, 8) is close to that of CS2. However, RS(4, 8) and RS(8, 8) show performance degradation when the FER is below $10^{-4}$.
• In order to achieve good error correction performance, the threshold value $W_T$ of the proposed RS algorithm should be large enough. A larger $W_T$ turns more rate-1 nodes into $T_1$ nodes, which in turn increases the chance that the correct codeword appears in the final list. For the (8192, 4096) polar code, RS(4, 8) and RS(8, 8) perform worse than RS(4, 32) and RS(8, 32), respectively, when the SNR is large.
• The side effect of increasing $W_T$ is that both the decoding complexity and the latency increase, since more $T_1$ nodes are generated. Based on the simulation results shown in Fig. 5, a dynamic $W_T$ can be adopted for the proposed RS algorithm in order to achieve the largest latency reduction in different SNR regions while maintaining the error correction performance.
D. Hardware implementation of the proposed SLMLD
In this paper, an efficient hardware implementation of the proposed SLMLD algorithm is shown in Fig. 6 for list size L = 4; the architectures for other values of L can be inferred. As shown in Fig. 6, the node metric generation (NMG) unit finds the L minimum node metrics and their corresponding constituent codes for each decoding path. For decoding path l, the extended path metrics $\mathrm{PM}_l^j$ are obtained by adding the node metrics to the path metric $\mathrm{PM}_l$, which is stored in registers and initialized with 0. $\mathrm{BBS}_{8-4}$ in Fig. 6 denotes a 2L-L BBS with 8 inputs and 4 outputs.

The hardware architecture of the NMG unit is shown in Fig. 7. Since the maximum value of $I_v$ is 8 for any $T_1$ node, there are at most $2^8 = 256$ candidate constituent codes for a $T_1$ node v. Each Enc unit in Fig. 7 is responsible for generating a candidate constituent code based on the encoding of polar codes. For $j = 0, 1, \cdots, 2^{I_v} - 1$, the LLR selection unit, $\mathrm{LS}_j$, and the summation unit, $\mathrm{SUM}_j$, work together to compute the node metric of the j-th candidate constituent code.

In this paper, the proposed architecture for the SLMLD algorithm is synthesized under a TSMC 90nm CMOS technology. With 4 stages of pipeline registers, it achieves a frequency of 400MHz and consumes 1.07 million standard NAND gates.
In our implementation, when a $T_1$ node is activated, it takes 4 clock cycles to find the surviving constituent codes and decoding paths. The area of the architecture for the SLMLD algorithm is almost the same as that of an LLR based list decoder with L = 4.
E. Comparisons of decoding clock cycles and latency
Since the exact number of decoding cycles of a list decoder depends on the detailed hardware architecture, the decoding latency comparison in this paper assumes that the partial parallel list architecture [10] is employed, with P = 128 processing units for each decoding path. Let $N_R$ denote the number of clock cycles used to decode a codeword by decoders with the proposed RS algorithm. Then $N_R = N_L + N_P$, where $N_L$ and $N_P$ are the cycles spent on the LLR computation and path pruning, respectively. Moreover, $N_P = N_a N_s$, where $N_a$ is the number of times that $T_1$ nodes are activated and $N_s$ is the number of pipeline stages inserted in the implementation of the SLMLD algorithm. Let $N_C$ denote the number of clock cycles used to decode a codeword by decoders with the CS algorithm. Then $N_C = 2N + $ …

Under the UMC 90nm CMOS technology, the (8192, 4096) list polar decoder can achieve a frequency of 412MHz [12] when list size L = 4. Since a list decoder with the proposed RS decoding algorithm only changes the path pruning part, its clock frequency is limited by the SLMLD architecture to 400MHz under a 90nm technology. Thus, the decoding latency is reduced by about 6.77 times due to the proposed RS decoding algorithm when L = 4.
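The relation between the reported cycle reduction and the latency reduction can be checked with a short calculation. Latency is the cycle count divided by the clock frequency, so the latency ratio equals the cycle ratio scaled by the ratio of the two clock frequencies; the helper name `latency_speedup` is hypothetical.

```python
def latency_speedup(cycle_speedup, f_new_mhz, f_ref_mhz):
    """Latency ratio (reference / new) = cycle ratio * (f_new / f_ref)."""
    return cycle_speedup * f_new_mhz / f_ref_mhz


# 6.97x fewer cycles at 400 MHz, compared with a 412 MHz reference decoder.
speedup = round(latency_speedup(6.97, 400, 412), 2)
```

The slightly lower clock frequency of the proposed decoder is what reduces the 6.97 times cycle saving to the reported latency saving of about 6.77 times.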
IV. CONCLUSION
In this paper, a reduced latency list decoding algorithm for polar codes is proposed. The hardware implementation of the SLMLD algorithm is also discussed. Future work includes studying the performance of the proposed RLLD algorithm when the FER is below $10^{-10}$. In addition, more efficient implementations of the proposed SLMLD algorithm for large list sizes will be investigated.
