Abstract-While long polar codes can achieve the capacity of arbitrary binary-input discrete memoryless channels when decoded by a low complexity successive cancelation (SC) algorithm, the error performance of the SC algorithm is inferior for polar codes with finite block lengths. The cyclic redundancy check (CRC) aided successive cancelation list (SCL) decoding algorithm has better error performance than the SC algorithm. However, current CRC aided SCL (CA-SCL) decoders still suffer from long decoding latency and limited throughput. In this paper, a reduced latency list decoding (RLLD) algorithm for polar codes is proposed. Our RLLD algorithm performs the list decoding on a binary tree, whose leaves correspond to the bits of a polar code. In existing SCL decoding algorithms, all the nodes in the tree are traversed and all possibilities of the information bits are considered. Instead, our RLLD algorithm visits far fewer nodes in the tree and considers fewer possibilities of the information bits. When configured properly, our RLLD algorithm significantly reduces the decoding latency and hence improves throughput, while introducing little performance degradation. Based on our RLLD algorithm, we also propose a high throughput list decoder architecture, which is suitable for larger block lengths due to its scalable partial sum computation unit. Our decoder architecture has been implemented for different block lengths and list sizes using the TSMC 90nm CMOS technology. The implementation results demonstrate that our decoders achieve significant latency reduction and area efficiency improvement compared with other list polar decoders in the literature.
Index Terms-polar codes, successive cancelation decoding, list decoding, hardware implementation, low latency decoding
I. INTRODUCTION
Polar codes [3] are a significant breakthrough in coding theory, since they can achieve the channel capacity of binary-input symmetric memoryless channels [3] and arbitrary discrete memoryless channels [4]. Polar codes of block length N can be efficiently decoded by a successive cancelation (SC) algorithm [3] with a complexity of O(N log N). While polar codes of very large block length ($N > 2^{20}$ [5]) approach the capacity of underlying channels under the SC algorithm, for short or moderate polar codes, the error performance of the SC algorithm is worse than that of turbo or LDPC codes [6].
Considerable effort [6] [7] [8] has already been devoted to improving the error performance of polar codes with short or moderate lengths. An SC list (SCL) decoding algorithm [6] performs better than the SC algorithm. In [6] [7] [8], the cyclic redundancy check (CRC) is used to pick the output codeword from L candidates, where L is the list size. The CRC-aided SCL (CA-SCL) decoding algorithm performs much better than the SCL decoding algorithm at the expense of a negligible loss in code rate.
Despite their significantly improved error performance, the hardware implementations of SC based list decoders [9] [10] [11] [12] [13] still suffer from long decoding latency and limited throughput due to the serial decoding schedule. In order to reduce the decoding latency of an SC based list decoder, M (M > 1) bits are decoded in parallel in [14] [15] [16], where the decoding speed can ideally be improved by M times. However, for the hardware implementations of the algorithms in [14] [15] [16], the actual decoding speed improvement is less than M times due to the extra decoding cycles spent on finding the L most reliable paths among $2^M L$ candidates. A software adaptive SSC-list-CRC decoder was proposed in [17]. For a (2048, 1723) polar+CRC-32 code, the SSC-list-CRC decoder with L = 32 was shown to be about 7 times faster than an SC based list decoder. However, it is unclear whether the list decoder in [17] is suitable for hardware implementation.
In this paper, a tree based reduced latency list decoding algorithm and its corresponding high throughput architecture are proposed for polar codes. The main contributions are:
• A tree based reduced latency list decoding (RLLD) algorithm over the log-likelihood ratio (LLR) domain is proposed for polar codes. Inspired by the simplified successive cancelation (SSC) [18] decoding algorithm and the ML-SSC algorithm [19], our RLLD algorithm performs the SC based list decoding on a binary tree. Previous SCL decoding algorithms visit all the nodes in the tree and consider all possibilities of the information bits, while our RLLD algorithm visits far fewer nodes in the tree and considers fewer possibilities of the information bits. When configured properly, our RLLD algorithm significantly reduces the decoding latency and hence improves throughput, while introducing little performance degradation.
• Based on our RLLD algorithm, a high throughput list decoder architecture is proposed for polar codes. Compared with the state-of-the-art SCL decoders in [10] , [12] , [15] , our list decoder achieves lower decoding latency and higher area efficiency (throughput normalized by area).
arXiv:1510.02574v1 [cs.AR] 9 Oct 2015
More specifically, the major innovations of the proposed decoder architecture are:
• An index based partial sum computation (IPC) algorithm is proposed to avoid copying partial sums directly when one decoding path needs to be copied to another. Compared with the lazy copy algorithm in [6], our IPC algorithm is more hardware friendly since it copies only path indices, while the lazy copy algorithm needs more complex index computation.
• Based on our IPC algorithm, a hybrid partial sum unit (Hyb-PSU) is proposed so that our list decoder is suitable for larger block lengths. The Hyb-PSU is able to store most of the partial sums in area efficient memories such as register file (RF) or SRAM, while the partial sum units (PSUs) in [9], [10], [12] store partial sums in registers, which need much larger area when the block length N is larger. Compared with the PSU of [10], our Hyb-PSU achieves an area saving of 23% and 63% for block length $N = 2^{13}$ and $2^{15}$, respectively, under the TSMC 90nm CMOS technology.
• For our RLLD algorithm, when certain types of nodes are visited, each current decoding path splits into multiple ones, among which the L most reliable paths are kept. In this paper, an efficient path pruning unit (PPU) is proposed to find the L most reliable decoding paths among the split ones. For our high throughput list decoder architecture, the proposed PPU is the key to the implementation of our RLLD algorithm.
• For the fixed-point implementation of our RLLD algorithm, a memory efficient quantization (MEQ) scheme is used to reduce the number of stored bits. Compared with the conventional quantization scheme, our MEQ scheme reduces the number of stored bits by 17%, 25% and 27% for block length $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively, at the cost of slight error performance degradation.
Note that the SSC and ML-SSC algorithms reduce the decoding latency by first performing the SC decoding on a binary tree and then pruning the binary tree. Inspired by this idea, our RLLD algorithm performs the SC based list decoding algorithm on a binary tree. The low-latency list decoding algorithm [17] also performs the list decoding algorithm on a binary tree. Our work [1] and the decoding algorithm in [17] were developed independently. While both our RLLD algorithm and the low-latency list decoding algorithm in [17] visit fewer nodes in the binary tree so as to reduce the decoding latency, there are some differences:
• Compared with the decoding algorithm in [17], our RLLD algorithm visits fewer nodes. Inspired by the ML-SSC algorithm, our RLLD algorithm processes certain arbitrary rate nodes [18] in a fast way.
• When a rate-1 node [18] is visited, our RLLD algorithm employs a less complex and more hardware friendly algorithm to compute the returned constituent codewords.
• Our RLLD algorithm is based on LLR messages, while the algorithm in [17] is based on log-likelihood (LL) messages, which require a larger memory to store.
In terms of hardware implementations, compared with state-of-the-art SC list decoders [9], [10], [12], [13], [15], [16], our high throughput list decoder architecture shows advantages in various aspects:
• Our high throughput list decoder architecture employs LLR messages, while LL messages were used in [9], [10], [15], [16]. LL messages require more quantization bits and a larger memory to store. The area efficient memory architecture in [10] is employed to store all LLR messages. LLR messages were also employed in [12], [13]. However, the register based memories in [12], [13] suffer from excessive area and power consumption when N is large.
• Our list decoder architecture employs a Hyb-PSU, which is scalable for polar codes of large block lengths. The register based PSUs of the list decoders in [9], [10], [12] suffer from area overhead when the block length is large. Instead of copying partial sums directly, our scalable PSU copies only decoding path indices, which avoids additional energy consumption.
The proposed high throughput list decoder architecture has been implemented for several block lengths and list sizes under the TSMC 90nm CMOS technology. The implementation results show that our decoders outperform existing SCL decoders in both decoding latency and area efficiency. For example, compared with the decoders of [12], the area efficiency and decoding latency of our decoders are 1.59 to 32.5 times and 3.4 to 6.8 times better, respectively.
For our RLLD algorithm and the corresponding decoder architecture, when computing the returned constituent codewords from an FP node or a rate-1 node, the returned L constituent codewords may not be the L most reliable ones among all candidates. This kind of approximation leads to a more efficient hardware implementation of our list decoding algorithm at the cost of certain performance degradation. In contrast, existing SC list decoders in [6], [13] usually select the L most reliable candidates.
The rest of the paper is organized as follows. Related preliminaries are reviewed in Section II. The proposed RLLD algorithm is presented in Section III. The high throughput list decoder architecture is presented in Section IV. In Section V, the implementation and comparison results are shown. Finally, the conclusion is drawn in Section VI.
II. PRELIMINARIES

A. Polar Codes
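Let $u_0^{N-1} = (u_0, u_1, \cdots, u_{N-1})$ denote the data bit sequence and $x_0^{N-1} = (x_0, x_1, \cdots, x_{N-1})$ the corresponding encoded bit sequence, where $N = 2^n$. Under the polar encoding, $x_0^{N-1} = u_0^{N-1} B_N F^{\otimes n}$, where $B_N$ is the bit reversal permutation matrix, and $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Here $\otimes n$ denotes the $n$th Kronecker power, $F^{\otimes n} = F \otimes F^{\otimes (n-1)}$ and $F^{\otimes 0} = 1$. For $i = 0, 1, \cdots, N-1$, $u_i$ is either an information bit or a frozen bit, which is usually set to zero. For an (N, K) polar code, there are a total of K information bits within $u_0^{N-1}$.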
The encoding graph of a polar code with N = 8 is shown in Fig. 1 .
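As a hedged illustration of the encoding map in this subsection, the following Python sketch computes $x = u B_N F^{\otimes n}$ over GF(2), assuming the kernel $F = [[1, 0], [1, 1]]$. The function names are hypothetical and do not come from the paper, and the Kronecker power is applied with XOR butterflies instead of forming the $N \times N$ matrix explicitly.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_encode(u):
    """Return x = u * B_N * F^{(x)n} over GF(2) for len(u) = 2^n.

    Assumption: F = [[1, 0], [1, 1]] and B_N is the bit reversal permutation.
    """
    N = len(u)
    n = N.bit_length() - 1
    assert N == 1 << n, "block length must be a power of two"

    # v = u * B_N: B_N is a symmetric permutation, so v[i] = u[bitrev(i)].
    x = [u[bit_reverse(i, n)] for i in range(N)]

    # x = v * F^{(x)n}: one XOR butterfly stage per Kronecker factor.
    step = 1
    while step < N:
        for start in range(0, N, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


# Example with N = 8: only the last input bit is set, so the output is
# the last row of F^{(x)3}, which is the all-one vector.
codeword = polar_encode([0, 0, 0, 0, 0, 0, 0, 1])
```

The butterfly form needs (N/2) log2(N) XOR operations, which is one concrete way to see the O(N log N) structure that the encoding graph depicts.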
B. Prior Tree-Based SC Algorithms
A polar code of block length $N = 2^n$ can also be represented by a full binary tree $G_n$ of depth n [18], where each node of the tree is associated with a constituent code. For example, for node 1 shown in Fig. 2, … The binary tree representation of an (8, 3) polar code is shown in Fig. 2, where the black and white leaf nodes correspond to information and frozen bits, respectively. There are three types of nodes in a binary tree representation of a polar code: rate-0, rate-1 and arbitrary rate nodes. The leaf nodes of rate-0 and rate-1 nodes correspond to only frozen and only information bits, respectively. The leaf nodes of an arbitrary rate node are associated with both information and frozen bits. The rate-0, rate-1 and arbitrary rate nodes in Fig. 2 are represented by circles in white, black and gray, respectively.
The SC algorithm can be mapped on $G_n$, where each node acts as a decoder for its constituent code. The SC algorithm is initialized by feeding the root node with the channel LLRs, $(Y_0, Y_1, \cdots, Y_{N-1})$, where $Y_i = \log(\Pr(y_i | x_i = 0)/\Pr(y_i | x_i = 1))$ and $(y_0, y_1, \cdots, y_{N-1})$ is the received channel message vector. As shown in Fig. 2, the decoder at node v receives a soft information vector $\alpha_v$ and returns a constituent codeword $\beta_v$. When a non-leaf node v is activated by receiving an LLR vector $\alpha_v$, it calculates a soft information vector $\alpha$ …
From the root node, all nodes in the tree are activated in a recursive way for the SC algorithm. Once $\beta_v$ for the last leaf node is generated, the codeword $x_0^{N-1}$ can be obtained by combining and propagating $\beta_v$ up to the root node.
The SSC decoding algorithm in [18] simplifies the processing of both rate-0 and rate-1 nodes. Once a rate-0 node is activated, it immediately returns the all zero vector. Once a rate-1 node is activated, a constituent codeword is directly calculated by making hard decisions on the received soft information vector as shown in Eq. (1). The ML-SSC decoding algorithm [19] further accelerates the SSC decoding algorithm by performing the exhaustive-search ML decoding on some resource constrained arbitrary rate nodes, which are called ML nodes in [19]. For an ML node v with layer index t, the constituent codeword passed to the parent node $p_v$ is
$$\beta_v = \arg\max_{x \in C} \sum_{i=0}^{2^{n-t}-1} (1 - 2x[i])\,\alpha_v[i],$$
where C is the constituent code associated with node v.
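The node shortcuts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hard decision maps a non-negative LLR to 0, the ML rule is the usual LLR-domain correlation maximization $\sum_i (1 - 2x[i])\alpha_v[i]$, and the constituent code is passed in as an explicit list of codewords. All function names are hypothetical.

```python
def rate0_node(alpha):
    """Rate-0 node: all leaves are frozen, so return the all-zero vector."""
    return [0] * len(alpha)


def rate1_node(alpha):
    """Rate-1 node: hard decision on each LLR (assumed: LLR >= 0 maps to bit 0)."""
    return [0 if a >= 0 else 1 for a in alpha]


def ml_node(alpha, codebook):
    """ML node: exhaustive search over the constituent code.

    Picks the codeword maximizing sum_i (1 - 2*c[i]) * alpha[i].
    """
    def correlation(c):
        return sum((1 - 2 * bit) * llr for bit, llr in zip(c, alpha))
    return max(codebook, key=correlation)


# Example: a length-4 node whose constituent code is the repetition code.
alpha = [1.0, -0.5, -0.2, 0.4]
repetition = [(0, 0, 0, 0), (1, 1, 1, 1)]
beta = ml_node(alpha, repetition)  # the all-zero word has the larger correlation
```

In this example the per-bit hard decision would give (0, 1, 1, 0), which is not a codeword of the repetition code, whereas the exhaustive search always returns a valid constituent codeword.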
C. LLR Based List Decoding Algorithms
For SCL decoding algorithms [6], [9], [13], when decoding an information bit $u_i$, each decoding path splits into two paths with $\hat{u}_i$ being 0 and 1, respectively. Thus 2L path metrics are computed and the L paths corresponding to the L minimum path metrics are kept. The list decoding algorithms in [6], [9] are performed in either the probability or the log-likelihood (LL) domain. In [13], an LLR based list decoding algorithm was proposed to reduce the message memory requirement and the computational complexity of the LL based list decoding algorithm. For decoding path l (l = 0, 1, · · · , L − 1), the LLR based list decoding algorithm employs a novel approximated path metric
where
is the received channel message vector.
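A minimal Python sketch of the path splitting step at one information bit is given below. It assumes the commonly used hardware-friendly form of the LLR based metric, in which a path is penalized by the LLR magnitude whenever the chosen bit disagrees with the hard decision of its LLR; the function and variable names are hypothetical and not taken from the paper or from [13].

```python
def split_and_prune(path_metrics, leaf_llrs, list_size):
    """Split every path on one information bit and keep the best list_size paths.

    path_metrics[l] is the current metric of path l (smaller is better) and
    leaf_llrs[l] is the LLR of the bit being decoded on path l.
    Returns a list of (new_metric, parent_path, bit) tuples.
    """
    candidates = []
    for l, (pm, llr) in enumerate(zip(path_metrics, leaf_llrs)):
        hard = 0 if llr >= 0 else 1
        for bit in (0, 1):
            # Assumed approximation: no penalty when the bit agrees with the
            # hard decision, |LLR| penalty otherwise.
            penalty = 0.0 if bit == hard else abs(llr)
            candidates.append((pm + penalty, l, bit))
    candidates.sort(key=lambda c: c[0])
    return candidates[:list_size]


# Example with L = 2: four expanded metrics are formed and two are kept.
survivors = split_and_prune([0.0, 1.0], [2.0, -0.3], list_size=2)
```

Each surviving tuple records which parent path it was copied from, which is the information a decoder needs to copy partial sums or path indices afterwards.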
III. REDUCED LATENCY LIST DECODING ALGORITHM
A. SCL Decoding on a Tree
Similar to the SSC decoding algorithm, we also perform the SC based list decoding algorithms [6], [9] on a full binary tree $G_n$ [1], [17]. The SCL decoding is initiated by sending the received channel LLR vector to the root node of $G_n$. As shown in Fig. 3, without loss of generality, each internal node v in $G_n$ is activated by receiving L LLR vectors, $\alpha_{v,0}, \alpha_{v,1}, \cdots, \alpha_{v,L-1}$, from its parent node $v_p$ and is responsible for producing L constituent codewords, $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$, where $\alpha_{v,l}$ and $\beta_{v,l}$ correspond to decoding path l for $l = 0, 1, \cdots, L-1$. Suppose the layer index of node v is t; then $\alpha_{v,l}$ and $\beta_{v,l}$ have $2^{n-t}$ LLR messages and binary bits, respectively, for …
for $0 \le i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$. Here $f(a, b) = 2\tanh^{-1}(\tanh(a/2)\tanh(b/2))$ and can be approximated as $f(a, b) \approx \mathrm{sign}(a)\,\mathrm{sign}(b)\min(|a|, |b|)$.
Node v then waits until it receives L codewords, … and passes them to its right child node $v_R$, where … for $0 \le i < 2^{n-t-1}$ and … and passes them to its parent node $v_p$, where …
For $l = 0, 1, \cdots, L-1$, $\mathrm{PM}_l$ is the path metric associated with decoding path l and is initialized to 0. When a leaf node v associated with an information bit is activated, decoding path l splits into two paths with $\beta_{v,l}$ being 0 and 1, respectively. Note that the layer index of a leaf node is n, hence $\alpha_{v,l}$ and $\beta_{v,l}$ have only one LLR and one binary bit, respectively, when node v is a leaf node. For the SCL decoding, 2L expanded path metrics are computed, where
for j = 0, 1 and …
Decoding path $a_l$ will be copied to decoding path l before further partial sum and LLR vector computations. For each decoding path l, the path metric is also updated with $\mathrm{PM}_l = \mathrm{PM}$ … When a leaf node v associated with a frozen bit is activated, …
The SCL algorithm on a tree described above is equivalent to the SCL algorithms in [6] , [9] .
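The exact f function given in this subsection and its usual min-sum approximation, $\mathrm{sign}(a)\,\mathrm{sign}(b)\min(|a|, |b|)$, can be compared numerically with the short Python sketch below. The function names are hypothetical, and the clamping constant is an implementation detail added here only to keep `math.atanh` inside its domain for large LLRs.

```python
import math


def f_exact(a, b):
    """f(a, b) = 2 * atanh(tanh(a/2) * tanh(b/2))."""
    t = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    # Clamp so that atanh never sees exactly +1 or -1 when the LLRs are large.
    t = max(min(t, 1.0 - 1e-15), -1.0 + 1e-15)
    return 2.0 * math.atanh(t)


def f_min_sum(a, b):
    """Min-sum approximation: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return sign * min(abs(a), abs(b))


# The approximation keeps the sign and never underestimates the magnitude.
exact = f_exact(1.5, -0.5)
approx = f_min_sum(1.5, -0.5)
```

The approximation needs only a comparison and a sign computation per output LLR, which is why it is attractive for the processing units of a hardware decoder.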
B. Proposed RLLD Algorithm
In this paper, a reduced latency list decoding (RLLD) algorithm is proposed to reduce the decoding latency of SC list decoding for polar codes. For a node v, let $I_v$ denote the total number of its leaf nodes that are associated with information bits. Let $X_{th}$ be a predefined threshold value and $X_0$ and $X_1$ be predefined parameters. Our RLLD algorithm performs the SC based list decoding on $G_n$ and follows the node activation schedule in Section III-A, except when certain types of nodes are activated. These nodes calculate and return the codewords to their parent nodes while updating the decoding paths and their metrics, without activating their child nodes. Specifically:
• When a rate-0 node v is activated, $\beta_{v,l}$ is a zero vector for $l = 0, 1, \cdots, L-1$.
• When a rate-1 node v with $I_v > X_{th}$ is activated, $\beta_{v,l}$ is the hard decision of $\alpha_{v,l}$ for $l = 0, 1, \cdots, L-1$. For polar codes constructed in [20], [21], we observe that the polarized channel capacities of the information bits corresponding to rate-1 nodes with $I_v > X_{th}$ are greater than those of the other information bits. Hence, for rate-1 nodes with $I_v > X_{th}$, our RLLD algorithm considers only the most reliable candidate codeword for each decoding path due to a more reliable channel.
• When a rate-1 node v with $I_v \le X_{th}$ is activated, the returned codewords are calculated by a candidate generation (CG) algorithm, which is proposed later.
• Let t denote the layer index of node v. When an arbitrary rate node v with $I_v \le X_0$ and $2^{n-t} \le X_1$ is activated, each decoding path splits into $2^{I_v}$ paths. From now on, such an arbitrary rate node is called a fast processing (FP) node. A metric based search (MBS) algorithm, which is proposed later, is used to calculate the returned codewords.
Moreover, our RLLD algorithm works on a pruned tree. As a result, our RLLD algorithm visits fewer nodes than the SCL algorithms in [6], [9]. The full binary tree is pruned in the following ways:
• Starting from the complete tree representation of a polar code, label all FP nodes such that the parent node of each of them is not an FP node. Note that an FP node v is an arbitrary rate node with $I_v \le X_0$ and $2^{n-t} \le X_1$. For each labeled FP node, remove all its child nodes.
• Based on the pruned tree from the previous step, label all rate-0 and rate-1 nodes such that the parent node of each of these rate-0 and rate-1 nodes is not a rate-0 and rate-1 node, respectively. Next, remove all child nodes of each labeled rate-0 and rate-1 node.
The leaf nodes of the pruned tree from the above two steps consist of rate-0, rate-1 and FP nodes. The non-leaf nodes of the pruned tree are arbitrary rate nodes.
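The node classification and pruning rules above can be sketched in Python as a top-down traversal that stops at the first rate-0, rate-1 or FP node on each branch. The sketch makes several assumptions that are not stated in the text: nodes are numbered breadth-first (children of node k are 2k+1 and 2k+2), the FP conditions are read as $I_v \le X_0$ and $2^{n-t} \le X_1$, and the (8, 3) example code has information bits at leaf positions 5, 6 and 7. All names are hypothetical.

```python
def classify(frozen, lo, hi, x0, x1):
    """Classify the node covering leaves lo..hi-1 (frozen[i] is True for frozen bits)."""
    size = hi - lo
    info = size - sum(frozen[lo:hi])  # I_v: number of information leaves
    if info == 0:
        return "rate-0"
    if info == size:
        return "rate-1"
    if info <= x0 and size <= x1:
        return "FP"
    return "arbitrary"


def activated_nodes(frozen, x0, x1):
    """Return (node index, node type) for every node of the pruned tree."""
    visited = []

    def visit(index, lo, hi):
        kind = classify(frozen, lo, hi, x0, x1)
        visited.append((index, kind))
        if kind == "arbitrary":  # only arbitrary rate nodes activate children
            mid = (lo + hi) // 2
            visit(2 * index + 1, lo, mid)
            visit(2 * index + 2, mid, hi)

    visit(0, 0, len(frozen))
    return visited


# Assumed (8, 3) code: leaves 0..4 frozen, leaves 5, 6, 7 carry information.
frozen = [True, True, True, True, True, False, False, False]
nodes = activated_nodes(frozen, x0=1, x1=2)
```

Under these assumptions the traversal visits nodes 0, 1, 2, 5 and 6, which agrees with the five activated nodes quoted for the (8, 3) example with $X_0 = 1$ and $X_1 = 2$ in Section III-D, compared with 15 nodes in the unpruned tree.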
When a rate-1 node with $I_v > X_{th}$ or a rate-0 node is activated, ideally $\mathrm{PM}_l$ is updated with …
For each rate-1 node with $I_v > X_{th}$, $\Delta_{v,l} = 0$ since $\beta_{v,l}$ is the hard decision of $\alpha_{v,l}$. However, for a rate-0 node, $\Delta_{v,l}$ could have a non-zero value. For our RLLD algorithm, $\Delta_{v,l}$ is also set to 0 for each rate-0 node, since the resulting performance degradation is negligible. By setting $\Delta_{v,l}$ to 0, we no longer need to calculate the $\alpha_{v,l}$ sent to a rate-0 node.
1) Proposed CG Algorithm: When a rate-1 node with $I_v \le X_{th}$ is activated, instead of considering $2^{I_v}$ candidate codewords for each decoding path, since there are at most L codewords from the same decoding path that could be passed to the parent node, it is enough to find only the L most reliable codewords among the $2^{I_v}$ candidates for each decoding path. When $I_v$ is large (e.g. $I_v \ge 32$), finding the L most reliable codewords is computationally intensive and lacks efficient hardware implementations. For our RLLD algorithm, we consider only the W (W < L) most reliable codewords among the $2^{I_v}$ candidates for each decoding path. In this paper, W is set to 2, since it results in efficient hardware implementations at the cost of negligible performance loss.
When W = 2, the proposed CG algorithm, shown in Alg. 1, is used to calculate the codewords passed to the parent node. Besides, the CG algorithm also outputs L list indices, $a_0, a_1, \cdots, a_{L-1}$, which indicate that decoding path $a_l$ needs to be copied to path l. Suppose the layer index of such a rate-1 node v is t. For each decoding path l, there are $2^{I_v} = 2^{2^{n-t}}$ candidate codewords that could be passed to the parent node $v_p$. However, our CG algorithm considers only the most reliable codeword $C_{v,l,0}$ and the second most reliable codeword $C_{v,l,1}$. In order to find these two codewords, each candidate codeword $C_{v,l,j}$ is associated with a node metric
$$\mathrm{NM}_l^j = \sum_{k=0}^{2^{n-t}-1} m_k\,|\alpha_{v,l}[k]|, \quad (9)$$
where $m_k = 0$ if $C_{v,l,j}[k] = h(\alpha_{v,l}[k])$ and 1 otherwise. As a result, the smaller a node metric is, the more reliable the corresponding candidate codeword is. Based on Eq. (9), $C_{v,l,0} = h(\alpha_{v,l})$ is the hard decision of the received LLR vector $\alpha_{v,l}$. $C_{v,l,1}$ is obtained by flipping the $k_{M,l}$-th bit of $C_{v,l,0}$, where $k_{M,l}$ is the index of the LLR element with the smallest absolute value among $\alpha_{v,l}$.
Each decoding path splits into two paths and has two associated candidate codewords. Alg. 1 calculates 2L expanded path metrics $\mathrm{PM}_l^j$ for $l = 0, 1, \cdots, L-1$ and j = 0, 1 to select the L codewords passed to the parent node. The $\min_L$ function in Alg. 1 finds the L smallest values among the 2L input expanded path metrics. Once $\beta_{v,l}$ for $l = 0, 1, \cdots, L-1$ are computed, decoding path $a_l$ is copied to decoding path l before further operations.
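The candidate generation step for W = 2 can be sketched in Python as follows. Alg. 1 itself is not reproduced in this text, so this is a reading of the prose description rather than the authors' listing: the hard-decision candidate is assumed to add nothing to the path metric, and the flipped candidate is assumed to add the smallest LLR magnitude. The function name and return layout are hypothetical.

```python
def candidate_generation(alphas, path_metrics):
    """Sketch of CG with W = 2 for a rate-1 node.

    alphas[l] is the LLR vector of path l and path_metrics[l] its metric.
    Returns (betas, indices, metrics): the L returned codewords, the list
    indices a_l (path a_l is copied to path l) and the updated metrics.
    """
    L = len(alphas)
    expanded = []
    for l, (alpha, pm) in enumerate(zip(alphas, path_metrics)):
        # Most reliable candidate: hard decision, assumed node metric 0.
        hard = [0 if a >= 0 else 1 for a in alpha]
        # Second candidate: flip the least reliable position.
        k_min = min(range(len(alpha)), key=lambda k: abs(alpha[k]))
        flipped = list(hard)
        flipped[k_min] ^= 1
        expanded.append((pm, l, hard))
        expanded.append((pm + abs(alpha[k_min]), l, flipped))

    # min_L: keep the L smallest of the 2L expanded path metrics.
    expanded.sort(key=lambda e: e[0])
    survivors = expanded[:L]
    betas = [codeword for _, _, codeword in survivors]
    indices = [parent for _, parent, _ in survivors]
    metrics = [metric for metric, _, _ in survivors]
    return betas, indices, metrics


# Example with L = 2: path 0 is much more reliable, so both survivors
# are copies of path 0.
betas, indices, metrics = candidate_generation(
    [[1.0, -2.0], [0.2, 3.0]], [0.0, 5.0]
)
```

A software sort is used here for brevity; in hardware the selection of the L smallest metrics would be done by a sorter such as the BBS mentioned later in the paper.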
2) Proposed MBS Algorithm: When an FP node is activated, each current decoding path expands to $2^{I_v}$ paths, each of which is associated with a candidate codeword. Similar to the CG algorithm, the proposed MBS algorithm calculates L codewords passed to the parent node and L path indices, $a_0, a_1, \cdots, a_{L-1}$. The returned codewords are calculated as follows.
• For each candidate codeword $C_{v,l}^j$, calculate its corresponding node metric NM …
• Find L expanded path metrics among the $2^{I_v}L$ ones. The corresponding candidate codewords are passed to the parent node $v_p$.
To calculate the node metric, we propose a new method with low computational complexity. In the literature, two methods can be used: the direct-mapping method (DMM) shown in Eq. (9) and the recursive channel combination (RCC) [16]. In terms of computational complexity, the former needs $2^{I_v}(2^{n-t} - 1)L$ additions, where $N = 2^n$ and t is the layer index of an FP node v. The RCC needs ( … Compared to the DMM, the RCC approach needs fewer additions. For our RLLD algorithm, we want to compute these $2^{I_v}$ node metrics in parallel. However, the parallel hardware implementations of the DMM and RCC algorithms require a large area. This will be discussed in more detail in Section IV-C.
In this paper, a hardware efficient node metric computation method, which takes advantage of both the DMM and the RCC, is proposed. The proposed method, referred to as the DR-Hybrid (DRH) method, is shown in Alg. 2, where … , and r ($0 \le r \le 3$) is represented by a binary tuple of length two, i.e. $r = r_0 + 2r_1$. In our method, the RCC approach is used to calculate $\theta_{l,i}$ first. Then, the DMM is carried out.
Algorithm 2: DR-Hybrid method
The DRH method needs $4 \times 2^{n-t-1} + 2^{I_v}(2^{n-t-1} - 1)$ additions. Taking $X_0 = 8$ and $X_1 = 16$ as an example, the DMM, RCC and DRH methods need 3840, 864 and 1824 additions, respectively. Though our DRH method needs more additions than the RCC, it results in a more area efficient hardware implementation when all $2^{I_v}$ node metrics are computed in parallel, since the RCC method needs more complex multiplexors.
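The quoted addition counts for the DMM and DRH methods can be checked with a few lines of Python, taking the largest FP node allowed by $X_0 = 8$ and $X_1 = 16$, i.e. $I_v = 8$ and $2^{n-t} = 16$, and counting additions for a single decoding path (an assumption, since the DMM formula in the text carries a factor L while the quoted figure corresponds to L = 1). The RCC count is not reproduced because its formula is not available in this text. Function names are hypothetical.

```python
def dmm_additions(i_v, node_size, list_size=1):
    """DMM: 2^{I_v} * (2^{n-t} - 1) * L additions."""
    return (2 ** i_v) * (node_size - 1) * list_size


def drh_additions(i_v, node_size):
    """DRH: 4 * 2^{n-t-1} + 2^{I_v} * (2^{n-t-1} - 1) additions."""
    half = node_size // 2  # 2^{n-t-1}
    return 4 * half + (2 ** i_v) * (half - 1)


# Largest FP node for X_0 = 8, X_1 = 16: I_v = 8 and 2^{n-t} = 16.
dmm = dmm_additions(8, 16)  # 256 * 15
drh = drh_additions(8, 16)  # 4 * 8 + 256 * 7
```

Both values reproduce the figures stated in the text (3840 and 1824), which supports reading the example as a per-path count for the largest admissible FP node.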
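Once we have $2^{I_v}L$ node metrics and corresponding candidate codewords, the $2^{I_v}L$ expanded path metrics $\mathrm{PM}_l^j = \mathrm{PM}_l + \mathrm{NM}_l^j$ for $l = 0, 1, \cdots, L-1$ and $j = 0, 1, \cdots, 2^{I_v} - 1$ can be computed. The next step is selecting L returned codewords and their corresponding expanded path metrics.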
Directly finding the L minimum values from $2^{I_v}L$ ones is computationally intensive and lacks efficient hardware implementations. A bitonic sequence based sorter [10] (BBS) with $2^{I_v}L$ inputs is able to fulfill this task. Such a BBS takes $2^{I_v-1}L($ … $2^{I_v-2}L$ compare-and-switch (CS) units [10], where each of them has one comparator and two 2-to-1 multiplexors and $s = \log_2(2^{I_v}L)$. In order to simplify the hardware implementation, a two-stage sorting scheme was proposed in [16], where the first stage selects the q (q < L) smallest node metrics from the $2^{I_v}$ ones for each decoding path. The second stage selects the L smallest metrics from the Lq expanded path metrics produced by the first stage. Compared with the direct sorting scheme [10], [15], the hardware implementation of the two-stage sorting scheme is more efficient at the cost of certain error performance degradation.
In this paper, our MBS algorithm employs the two-stage sorting scheme and improves the first stage in the following two aspects:
• Instead of using a fixed q, our MBS algorithm employs a dynamic $q_{I_v,L}$ ($q_{I_v,L} \le L$), which is a power of 2 and depends on both $I_v$ and L.
• An approximated sorting (ASort) method, which leads to an efficient hardware implementation, is used to select $q_{I_v,L}$ metrics from the $2^{I_v}$ ones, though the selected metrics are not always the $q_{I_v,L}$ smallest ones.
Our ASort method is illustrated as follows:
• When $2^{I_v} \le 2L$, the BBS with 2L inputs and L outputs is used to select the $q_{I_v,L}$ minimum node metrics from the $2^{I_v}$ ones.
• When $2^{I_v} > 2L$, all $2^{I_v}$ node metrics are divided into $q_{I_v,L}$ groups: … The two minimum node metrics of each group are first computed. The BBS then computes the minimum $q_{I_v,L}$ node metrics among the $2q_{I_v,L}$ ones.
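The two cases of the ASort method can be sketched in Python as below. The exact composition of the groups is not recoverable from this text, so the sketch assumes equally sized groups of consecutive candidates; this grouping, the function name and the (index, metric) return format are all assumptions made for illustration.

```python
def asort(node_metrics, q, list_size):
    """Approximate selection of q small metrics out of 2^{I_v} node metrics.

    Returns a list of (candidate index, metric) pairs.
    """
    indexed = list(enumerate(node_metrics))
    by_metric = lambda pair: pair[1]

    if len(node_metrics) <= 2 * list_size:
        # Small node: an exact selection of the q minimum metrics.
        return sorted(indexed, key=by_metric)[:q]

    # Large node: split into q groups (assumed contiguous), keep the two
    # minima of each group, then pick the q minimum among those 2q values.
    group_size = len(node_metrics) // q
    finalists = []
    for g in range(q):
        group = indexed[g * group_size:(g + 1) * group_size]
        finalists.extend(sorted(group, key=by_metric)[:2])
    return sorted(finalists, key=by_metric)[:q]


# Example with 2^{I_v} = 16, L = 4 and q = 4. The three smallest metrics
# (0, 1, 2) fall in the same group, so only two of them can survive.
metrics = [0, 1, 2, 9, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17]
selected = asort(metrics, q=4, list_size=4)
```

In this example the exact four smallest metrics are 0, 1, 2 and 5, while the sketch returns 0, 1, 5 and 6. This illustrates why the selected metrics are not always the smallest ones: at most two candidates per group can be kept.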
After the first stage of sorting, the number of expanded path metrics $N_e$ could be 2L, 4L, · · · , L × L. The second stage of sorting is the same as that in [16]. A binary tree of 2L-L BBSs is employed to find the final L minimum expanded path metrics. Taking $N_e = 4L$ as an example, there are 4L expanded path metrics: PM …
are applied to two 2L-L BBSs, respectively. Thus, 2L metrics are selected. Then the 2L-L BBS is employed again to generate the final L minimum expanded path metrics: PM …
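The second sorting stage can be sketched in Python as a binary tree of "2L inputs, L outputs" selectors. Each selector is modeled here with a plain software sort rather than a bitonic network, and the input is assumed to be organized as blocks of L expanded path metrics whose count is a power of 2; the function names are hypothetical.

```python
def select_2l_to_l(block_a, block_b, list_size):
    """Model of one 2L-L selector: keep the L smallest of 2L metrics."""
    return sorted(block_a + block_b)[:list_size]


def second_stage(blocks, list_size):
    """Reduce N_e = len(blocks) * L metrics to L with a tree of 2L-L selectors."""
    while len(blocks) > 1:
        blocks = [
            select_2l_to_l(blocks[i], blocks[i + 1], list_size)
            for i in range(0, len(blocks), 2)
        ]
    return blocks[0]


# Example with L = 2 and N_e = 4L = 8: two selectors work in parallel in the
# first level and one selector produces the final L metrics.
final = second_stage([[3, 1], [4, 2], [0, 9], [5, 6]], list_size=2)
```

Unlike the approximate first stage, this tree is exact for the metrics it receives: any metric among the overall L smallest is also among the L smallest of every subset that contains it, so it survives every level.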
C. Parameters of Our RLLD Algorithm
For our RLLD algorithm, the returned codewords from rate-1 nodes with $I_v > X_{th}$ are obtained by making hard decisions on the received LLR vectors. The other rate-1 nodes are processed by our CG algorithm. Note that both the hard decision approach and our CG algorithm could cause potential error performance degradation, since ideally we should consider $2^{I_v}$ candidate codewords for each decoding path. With more rate-1 nodes (decreasing $X_{th}$) being processed by the hard decision approach, the decoding latency could be reduced at the cost of more error performance degradation. Besides, in order to save computations, path metrics remain unchanged when a rate-0 node is activated, which may cause error performance degradation.
The choices of $X_0$ and $X_1$ are tradeoffs between implementation complexity and achieved decoding latency reduction. Ideally, we want $X_0$ and $X_1$ to be as large as possible so that more data bits could be decoded in parallel. Since the number of adders needed by Alg. 2 is proportional to $2^{X_0} X_1$, the values of $X_0$ and $X_1$ are limited by hardware implementations.
For the two-stage sorting scheme of our MBS algorithm, we want $q_{I_v,L}$ to be as small as possible so that the sorting complexity could be minimized. However, reducing $q_{I_v,L}$ could degrade the resulting error performance, since ideally we need to consider the L most reliable candidate codewords for each decoding path. As a result, the selections of $q_{I_v,L}$ are tradeoffs between sorting complexity and error performance.
D. Comparison with Related Algorithms
If we perform the SC based list decoding algorithms [6], [9] on a tree, then all 2N − 1 nodes of the tree will be activated. For our RLLD algorithm, let $n_a$ denote the number of activated nodes. Then we have $n_a < 2N - 1$, where $n_a$ is determined by the block length N, the code rate, the locations of frozen bits and the parameters $X_0$ and $X_1$. $X_0$ and $X_1$ are used to identify all FP nodes. The reduction of the number of activated nodes translates into reduced decoding latency and increased throughput. Taking the (8, 3) polar code in Fig. 2 as an example, suppose $X_0 = 1$ and $X_1 = 2$; then only 5 nodes (nodes 0, 1, 2, 5, and 6) need to be activated by our RLLD algorithm, whereas the algorithms in [6], [9] need to activate all 15 nodes.
The CA-SCL decoding algorithm was also performed on a binary tree in [17]. Compared with the low-latency list decoding algorithm [17], our RLLD algorithm employs the proposed MBS algorithm to process FP nodes, while FP nodes were processed by activating their child nodes in [17]. Our MBS algorithm results in decreased decoding latency at the cost of potential error performance loss. Besides, our RLLD algorithm takes a simpler approach when a rate-1 node is activated. When a rate-1 node is activated, a Chase-like algorithm was used to calculate the L codewords passed to the parent node in [17]. Compared to the Chase-like algorithm, our CG algorithm has lower computational complexity and is more suitable for hardware implementation because:
(1) The Chase-like algorithm in [17] was performed over the log-likelihood (LL) domain while our method is performed over the LLR domain. Compared with our LLR based method, it takes more additions to calculate related metrics for the Chase-like algorithm.
(2) … [17]. In contrast, our method considers only two constituent codewords, which leads to simpler hardware implementations.
(3) In order to find the L best decoding paths and their constituent codewords, the Chase-like algorithm creates a candidate path list. The final L candidates are determined by inserting and removing elements from the list. The Chase-like algorithm is suitable for software implementations. However, the hardware implementations of the Chase-like algorithm have not been discussed in [17]. On the other hand, with a bitonic based sorter [10] (BBS), the L most reliable decoding paths can be decided in parallel for our CG algorithm.
E. Simulation Results
For an (8192, 4096) polar code, the bit error rate (BER) performance of the proposed RLLD algorithm as well as other algorithms is shown in Fig. 4. In Fig. 4, CSx denotes the CA-SCL decoding algorithm with L = x, where CRC-32 is used. Rx-y denotes our RLLD algorithm with L = x and $X_{th} = y$. The values of $q_{I_v,L}$ under different list sizes and values of $I_v$ are shown in Table I. For all simulated algorithms, the additive white Gaussian noise (AWGN) channel and binary phase-shift keying (BPSK) modulation are used. For all simulated RLLD algorithms, $X_0 = 8$ and $X_1 = 16$. Based on the simulation results shown in Fig. 4, we observe that R2-8 performs nearly the same as CS2 and R2-64. When the list size increases, compared with CS4, R4-8 shows obvious error performance degradation when the BER is below $10^{-7}$. The degradation is reduced by increasing $X_{th}$ to 128, as we observe that R4-128 performs nearly the same as CS4. When the list size further increases (e.g. L = 16 and 32), at low BER levels, the error performance degradation exists even when $X_{th} = 256$. As shown in Fig. 4, depending on the specific list size, it seems that our RLLD algorithm has performance degradation compared to the CA-SCL algorithm at certain BER values even when all rate-1 nodes are processed by the proposed CG algorithm. There are several reasons for the error performance degradation:
(1) For our RLLD algorithm, when a rate-1 node with $I_v \le X_{th}$ is activated, only the two most reliable constituent codewords are kept. When the list size $L$ is large, there may not be enough candidate codewords to include the correct codeword, since our CG algorithm could miss certain good candidate codewords.
(2) When a rate-1 node with $I_v > X_{th}$ is activated, only the most reliable candidate codeword is considered for each decoding path, which could also cause error performance degradation.
(3) During the first sorting stage of our MBS algorithm, when $2^{I_v} \ge L$, $q_{I_v,L}$ is selected to be no greater than $L$ for certain $I_v$ values for efficient hardware implementation. As a result, we may lose certain good candidate codewords due to the limitation on $q_{I_v,L}$.
IV. HIGH THROUGHPUT LIST POLAR DECODER ARCHITECTURE
A. Top Decoder Architecture
In this paper, based on the proposed RLLD algorithm, a high throughput list decoder architecture for polar codes, shown in Fig. 5, is proposed. In Fig. 5, the channel message memory (CMEM) stores the received channel LLRs, and the internal LLR message memory (IMEM) stores the LLRs generated during the SC computation process. With the concatenation and split method in our prior work [10], the IMEM is implemented with area efficient memories, such as a register file (RF) or SRAM. The proposed architecture has $L$ groups of processing unit arrays (PUAs), each of which contains $T$ processing units (PUs) [5] and is capable of performing either the $f$ or the $g$ computation in Eqs. (5) and (6), respectively. The hybrid partial sum unit (Hyb-PSU) in Fig. 5 consists of $L$ computation units, $CU_0, CU_1, \cdots, CU_{L-1}$, which are responsible for updating the partial sums of the $L$ decoding paths, respectively. The path pruning unit (PPU) in Fig. 5 finds the list indices and corresponding constituent codewords for the $L$ surviving decoding paths. The control of our decoder architecture can be designed based on the instruction RAM based methodology in [22].
Both our high throughput list decoder architecture in Fig. 5 and that in [10] employ a partial parallel processing method. Besides, both architectures contain a channel message memory and internal message memory. However, compared to the architecture in [10] , the major improvements of our list decoder architecture are:
(a) Instead of LL messages, our high throughput list decoder architecture employs LLR messages, which result in more area efficient internal and channel message memories.
(b) The PPU in Fig. 5 implements our CG and MBS algorithms, while the PPU in [10] is just a sorter which selects L values among 2L ones. Due to the proposed PPU, our decoder architecture achieves much higher throughput than that in [10] .
(c) Our list decoder architecture employs a novel Hyb-PSU, which is more area and energy efficient than that in [10]. Our Hyb-PSU is based on the proposed index based partial sum computation algorithm. When a decoding path needs to be copied to another one, instead of copying partial sums directly, our Hyb-PSU copies only decoding path indices. In contrast, the PSU in [10] copies partial sums directly, which incurs additional energy consumption. Our Hyb-PSU stores most of the partial sums in area efficient memories, while the PSU in [10] stores all the partial sums in area demanding registers. Hence, our Hyb-PSU is scalable for larger block lengths.
B. Memory Efficient Quantization Scheme
For an SC or SCL decoder, the message memory occupies a large part of the overall decoder area [5], [10]. An SCL decoder needs a channel message memory and an internal message memory. For an LLR based SCL decoder, the channel memory stores $N$ channel LLR messages. The internal message memory stores $Ln$ LLR matrices: $P_{l,t}$ for $l = 0, 1, \cdots, L-1$ and $t = 1, 2, \cdots, n$, where $P_{l,t}$ has $2^{n-t}$ LLR messages.
For a fixed point implementation of our RLLD algorithm, it is straightforward to quantize all LLRs in the internal memory with $Q$ bits. In this paper, a memory efficient quantization (MEQ) scheme is proposed to reduce the size of the internal memory. $f(a, b)$ in Eq. (5) has the same magnitude range as those of $a$ and $b$, while the magnitude range of $g(a, b, s)$ in Eq. (6) is at most twice those of $a$ and $b$ ($s$ is either 0 or 1). Since $P_{0,t}, P_{1,t}, \cdots, P_{L-1,t}$ are computed based on $P_{0,t-1}, P_{1,t-1}, \cdots, P_{L-1,t-1}$, for a decoding path $l$, the LLRs in $P_{l,t_1}$ may need a greater magnitude range than that of the LLRs in $P_{l,t_2}$, where $t_1 > t_2$. Suppose each channel LLR is quantized with $Q_c$ bits. The proposed MEQ scheme is as follows:
(1) Suppose all LLRs within the internal memory are quantized with $Q_m$ bits. Determine the minimal $Q_m$ such that the error performance degradation of the fixed point implementation is negligible.
(2) Let $t_1, t_2, \cdots, t_r$ be $r$ integers, where $t_1 \le t_2 \le \cdots \le t_r \le n$ and $r = Q_m - Q_c$. Denote $P_t = (P_{0,t}, P_{1,t}, \cdots, P_{L-1,t})$. Suppose the LLRs associated with $P_1, P_2, \cdots, P_{t_1}$ are quantized with $Q_c$ bits and the remaining LLRs are quantized with $Q_m$ bits. Decide the maximal $t_1$ such that the resulting fixed point error performance degradation is negligible. Once $t_1$ is decided, suppose the LLRs within $P_{t_1+1}, P_{t_1+2}, \cdots, P_{t_2}$ are all quantized with $Q_c + 1$ bits, and find the maximal $t_2$ such that the corresponding error performance degradation is negligible. In this way, $t_3, \cdots, t_r$ are decided in a serial manner so that $P_{t_i+1}, P_{t_i+2}, \cdots, P_{t_{i+1}}$ are quantized with $Q_c + i$ bits for $1 \le i \le r-1$, and $P_j$ is quantized with $Q_m$ bits for $j > t_r$.
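The magnitude-range observation that motivates the MEQ scheme can be checked numerically. The sketch below assumes the common min-sum form of $f$ and the usual LLR form of $g$; Eqs. (5) and (6) are not reproduced in this section, so both function bodies are assumptions, as is the sign-magnitude input range used in `max_magnitudes`.

```python
def f(a: int, b: int) -> int:
    """Assumed min-sum form of Eq. (5): sign(a) * sign(b) * min(|a|, |b|)."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))


def g(a: int, b: int, s: int) -> int:
    """Assumed form of Eq. (6): (1 - 2s) * a + b, with s a partial sum bit."""
    return (1 - 2 * s) * a + b


def max_magnitudes(q_bits: int) -> tuple:
    """Largest |f| and |g| over all inputs that fit in q_bits (sign-magnitude)."""
    m = 2 ** (q_bits - 1) - 1          # largest input magnitude
    values = range(-m, m + 1)
    max_f = max(abs(f(a, b)) for a in values for b in values)
    max_g = max(abs(g(a, b, s)) for a in values for b in values for s in (0, 1))
    return max_f, max_g
```

Under these assumptions, the output of $f$ never exceeds the input range, while the output of $g$ can need one extra bit, which is why deeper layers may need wider LLRs.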
With the proposed MEQ scheme, the number of bits saved for the internal memory, relative to quantizing all internal LLRs with $Q_m$ bits, is

$$L \sum_{i=0}^{r} (Q_m - Q_c - i) \sum_{t=t_i+1}^{t_{i+1}} 2^{n-t},$$

where $t_0 = 0$ and $t_{r+1} = n$ are introduced for convenience.
In order to show the effectiveness of our MEQ scheme, the error performances of our RLLD algorithm with the proposed MEQ scheme are shown in Fig. 6, where the RLLD algorithm with our MEQ scheme is compared with the floating-point CA-SCL decoding algorithm, the floating-point RLLD algorithm, and the RLLD algorithm with a uniform quantization scheme for three different polar codes, (1024, 512), (8192, 4096) and (32768, 29504), with $X_{th} = 32$, 128 and 1024, respectively. For all fixed-point decoders, each channel LLR is quantized with $Q_c = 5$ bits. For the RLLD algorithm with uniform quantization, each LLR in the internal memory is quantized with $Q_m = 6$ bits for the length $2^{10}$ and $2^{13}$ polar codes. For the polar code with a length of $2^{15}$, the uniform quantization takes 7 bits. For our MEQ scheme, $Q_m = 7$. Since $Q_m - Q_c = 2$, we need to determine two integers, $t_1$ and $t_2$, for our MEQ scheme. When $N = 2^{10}$, $2^{13}$ and $2^{15}$, $(t_1, t_2) = (1, 2)$, $(3, 4)$ and $(4, 5)$, respectively. As shown in Fig. 6, the performance degradation caused by our MEQ scheme is small. Compared with the uniform quantization, the proposed MEQ scheme reduces the number of stored bits by 4.5%, 13.5% and 27.2% for $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively. For all the simulation results shown in Fig. 6, the list size is $L = 4$.
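The reported memory reductions can be approximately reproduced by counting stored bits layer by layer. The sketch below assumes that the integer pairs listed above are the layer thresholds of the MEQ scheme, that layers $t_i + 1$ to $t_{i+1}$ use $Q_c + i$ bits (with $t_0 = 0$ and $t_{r+1} = n$), and that layer $t$ holds $2^{n-t}$ LLRs per decoding path. The function names are hypothetical. The list size is a common factor and cancels in the percentage.

```python
def meq_bits(n, q_c, q_m, thresholds):
    """Internal-memory bits per decoding path under the MEQ scheme."""
    bounds = [0] + list(thresholds) + [n]   # t_0 = 0, t_1..t_r, t_{r+1} = n
    total = 0
    for i in range(len(bounds) - 1):
        width = min(q_c + i, q_m)           # Q_c + i bits, capped at Q_m
        for t in range(bounds[i] + 1, bounds[i + 1] + 1):
            total += width * 2 ** (n - t)   # layer t stores 2^(n-t) LLRs
    return total


def uniform_bits(n, q):
    """Internal-memory bits per decoding path with q bits for every LLR."""
    return q * sum(2 ** (n - t) for t in range(1, n + 1))


def saving_percent(n, q_uniform, q_c, q_m, thresholds):
    """Percentage of stored bits saved by MEQ versus uniform quantization."""
    u = uniform_bits(n, q_uniform)
    return 100.0 * (u - meq_bits(n, q_c, q_m, thresholds)) / u
```

With $Q_c = 5$ and $Q_m = 7$, `saving_percent(13, 6, 5, 7, (3, 4))` and `saving_percent(15, 7, 5, 7, (4, 5))` evaluate to about 13.5 and 27.2, in line with the reported figures. For $N = 2^{10}$ this counting gives about 4.2 rather than the reported 4.5, so the exact accounting used for that code may differ from the assumption made here.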
C. Proposed path pruning unit
When a rate-1 node with $I_v \le X_{th}$ or an FP node is activated, each decoding path splits into multiple ones and only the $L$ most reliable paths are kept. The PPU in Fig. 5 implements our CG and MBS algorithms, and is responsible for calculating the $L$ returned codewords, $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$, and the $L$ list indices, $a_0, a_1, \cdots, a_{L-1}$, where decoding path $l$ copies from decoding path $a_l$ before further decoding steps.
Taking $L = 4$ as an example, the proposed PPU is shown in Fig. 7; it can be easily adapted to other values of $L$. Our PPU in Fig. 7 has two types of node metric generation (NG) units, NG-I and NG-II, which compute the node metrics for a rate-1 node and an FP node, respectively. NG-I$_l$ and NG-II$_l$ correspond to decoding path $l$. For decoding path $l$, the expanded path metrics $PM_l^j$ are obtained by adding the node metrics to the path metric $PM_l$, which is stored in the path metric registers (PMR) and initialized with 0.
When a rate-1 node is activated, NG-I$_l$ outputs two node metrics for $l = 0, 1, \cdots, L-1$. After the $2L$ expanded path metrics are computed, a stage of metric sorter (MS$_{2L-L}$) selects the $L$ minimum metrics and their corresponding codewords from the $2L$ ones. The metric sorter MS$_{2L-L}$ implements the $\min_L$ function in Alg. 1 and can be constructed with a BBS. When an FP node is activated, the $L$ NG-II modules implement the first part of our two-stage sorting scheme. For each decoding path, $q_{I_v,L}$ node metrics and their corresponding codewords are computed. The tree of metric sorters selects the $L$ minimum metrics among the $q_{I_v,L} L$ ones. This is achieved by $\log_2 q_{I_v,L}$ stages of metric sorters, where $q_{I_v,L}$ is a power of 2. The output expanded path metrics of the last stage of metric sorters are saved in the PMR. The codewords corresponding to the selected $L$ expanded path metrics are also chosen. The related circuitry is omitted for brevity.
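The selection performed by the PPU can be sketched in software. In the sketch below, Python's `sorted` stands in for the bitonic based MS$_{2L-L}$ sorter, and the way the $q_{I_v,L} L$ candidates are grouped at the inputs of the sorter tree is an assumption, not the wiring of Fig. 7. All function names are hypothetical.

```python
def ms_2l_to_l(candidates, L):
    """Stand-in for the MS_{2L-L} metric sorter: keep the L smallest metrics."""
    return sorted(candidates, key=lambda c: c[0])[:L]


def prune_rate1(path_metrics, node_metrics):
    """Rate-1 node: each path l contributes two expanded metrics PM_l + NM."""
    L = len(path_metrics)
    expanded = [(pm + nm, l, j)                      # (metric, source path, candidate)
                for l, pm in enumerate(path_metrics)
                for j, nm in enumerate(node_metrics[l])]
    return ms_2l_to_l(expanded, L)


def prune_fp(path_metrics, node_metrics):
    """FP node: q metrics per path, reduced by log2(q) stages of 2L-to-L sorters."""
    L = len(path_metrics)
    q = len(node_metrics[0])                         # assumed to be a power of 2
    # groups[j] holds the j-th candidate of every path (assumed input grouping)
    groups = [[(path_metrics[l] + node_metrics[l][j], l, j) for l in range(L)]
              for j in range(q)]
    while len(groups) > 1:                           # one loop pass per sorter stage
        groups = [ms_2l_to_l(groups[k] + groups[k + 1], L)
                  for k in range(0, len(groups), 2)]
    return sorted(groups[0], key=lambda c: c[0])
```

Each returned tuple carries the source path index, which plays the role of the list index $a_l$, and the candidate index, which identifies the selected constituent codeword.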
The micro architecture of NG-I$_l$ is shown in Fig. 8. The most complex part of NG-I$_l$ is finding the minimum LLR magnitude and its corresponding index among the LLR vector $\alpha_{v,l}$. Since the node metric of the most reliable candidate codeword is always 0, we only need to compute the node metric NM in Fig. 8, which is the node metric of the second most reliable candidate codeword. When $I_v > T$, suppose $T$ is a power of 2; then $I_v$ is divisible by $T$. During each clock cycle, only $T$ LLRs are fed to NG-I$_l$, and the minimum value and its corresponding index are computed in a partial parallel way. The minimum value and associated index of the first $T$ inputs are stored in mLR and mIR, respectively. The minimum value of the second group of $T$ inputs is compared with the current value stored in mLR, and is stored in mLR if it is smaller than the current value of mLR. This repeats until the whole LLR vector $\alpha_{v,l}$ is processed. At last, the minimum value of $|\alpha_{v,l}|$ and its index are stored in mLR and mIR, respectively. The hard decoding of $\alpha_{v,l}$ is stored in the hard decoded constituent codeword memory (HCM0), and is copied to HCM1 when the second most reliable constituent codeword is computed.
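The partial parallel minimum search described above can be sketched as follows. The sketch assumes that the second most reliable candidate is the hard decision with its least reliable bit flipped, that its node metric is the minimum LLR magnitude, and that a non-negative LLR decodes to bit 0. None of these three assumptions is stated explicitly in this section. The function name is hypothetical.

```python
def ng1_rate1(alpha, T):
    """Return (hcm0, hcm1, nm) for one rate-1 node of one decoding path."""
    assert len(alpha) <= T or len(alpha) % T == 0
    m_lr, m_ir = None, None                      # running minimum magnitude and index
    for start in range(0, len(alpha), T):        # one pass models one clock cycle
        chunk = alpha[start:start + T]
        idx = min(range(len(chunk)), key=lambda k: abs(chunk[k]))
        if m_lr is None or abs(chunk[idx]) < m_lr:
            m_lr, m_ir = abs(chunk[idx]), start + idx
    hcm0 = [0 if a >= 0 else 1 for a in alpha]   # hard decision (assumed sign convention)
    hcm1 = list(hcm0)                            # copy HCM0 into HCM1
    hcm1[m_ir] ^= 1                              # flip the least reliable bit (assumption)
    return hcm0, hcm1, m_lr
```

For a vector of length $I_v > T$, the loop runs $I_v / T$ times, which mirrors the extra clock cycles incurred by the partial parallel search.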
The micro-architecture of NG-II$_l$ under $X_0 = 8$ and $X_1 = 16$ is shown in Fig. 9, where the block MUX4T256 includes 256 4-to-1 multiplexers. Our NG-II$_l$ consists of two parts: the first part calculates $2^{I_v}$ node metrics, $NM_l^0, NM_l^1, \cdots, NM_l^{2^{I_v}-1}$, based on Alg. 2, and the second part implements the first stage sorting of our MBS algorithm. For $L = 4$, when $2^{I_v} > 2L$, the $2^{I_v}$ metrics are first divided into four groups. The Min-2 block [23] is modified slightly to find the two minimum node metrics and their associated indices for each metric group. The MS$_{8-4}$ block calculates the final output metrics. When $2^{I_v} = 2L = 8$, the MS$_{8-4}$ block works directly on the $2L = 8$ expanded path metrics. When $2^{I_v} \le L$, the expanded path metrics are output directly.
As shown in Figs. 7 to 9, our PPU has a long critical path delay, since there are many levels of logic from the inputs to the outputs. Pipelines should be used to improve the overall decoder frequency.
Based on the DMM method in Eq. (9), the node metric computation part needs $2^{I_v}(2^{n-t}-1)L$ adders and $2^{I_v} 2^{n-t} L$ 2-to-1 multiplexers, where $N = 2^n$ and $t$ is the layer index of an FP node $v$. Based on the RCC method, it takes ( . . . to-1 multiplexers and $4 \times 2^{n-t-1} L$ 2-to-1 multiplexers. In contrast, based on our DRH method, it takes $4 \times 2^{n-t-1} + 2^{I_v}(2^{n-t-1}-1)$ adders, $2^{I_v} 2^{n-t-1}$ 4-to-1 multiplexers and $4 \times 2^{n-t-1}$ 2-to-1 multiplexers. Table II compares the hardware resources needed by the DMM, RCC and DRH methods when $X_0 = 8$, $X_1 = 16$, and $\alpha_{v,l}[j]$ ($0 \le j < 2^{n-t}$) is a 6-bit LLR. As shown in Table II, the DRH method requires the smallest total area. Besides, the implementations based on DMM, RCC and DRH have roughly the same critical path delay.
D. Proposed hybrid partial sum unit
For the list decoder architectures in [9], [10], all partial sums are stored in registers, and the partial sums of decoding path $l'$ are copied to decoding path $l$ when decoding path $l'$ needs to be copied to decoding path $l$. The PSUs in [9] and [10] need $L(N-1)$ and $L(N/2-1)$ single bit registers to store all partial sums, respectively. Thus, for large $N$, the register based PSU architectures in [9], [10] are inefficient for two reasons. First, the area of the PSU is linearly proportional to $N$. For large $N$ (e.g., $N > 2^{15}$), the area of the PSU is large since registers are usually area demanding. Second, the power dissipation due to the copying of partial sums between different decoding paths is high when $N$ is large.
1) Proposed Index Based Partial Sum Computation Algorithm: In order to avoid copying partial sums directly, an index based partial sum computation (IPC) algorithm is proposed in Algorithm 3, where $p_l[z]$ ($l = 0, 1, \cdots, L-1$ and $z = 0, 1, \cdots, n$) is a list index reference. $C_{l,z}$ for $l = 0, 1, \cdots, L-1$ and $z = 0, 1, \cdots, n$ are partial sum matrices [6], [10]. $C_{l,z}$ has $2^{n-z}$ elements, each of which stores two binary bits.
For our RLLD algorithm, once a rate-0, rate-1 or FP node sends $L$ codewords to its parent node, the partial sum computation is performed after decoding path pruning. Let $t$ denote the layer index of such a node $v$. Let $(B_{n-1}, B_{n-2}, \cdots, B_0)$ denote the binary representation of the index of the last leaf node belonging to node $v$, where $B_{n-1}$ is the most significant bit. Let $t_e = n - j$, where $j$ is the smallest integer such that $B_j = 0$. If $B_j = 1$ for $j = 0, 1, \cdots, n-1$, then $t_e = 0$. Once $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$ are calculated, decoding path $l'$ may need to be copied to path $l$ before the following partial sum computation. Under this circumstance, the index references are first copied, where
The lazy copy algorithm was proposed in [6] to avoid copying partial sums directly. However, the lazy copy algorithm is not suitable for hardware implementation due to its complex index computation. The PSU in [10] copies all partial sums belonging to one decoding path to the corresponding locations of another decoding path.
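The stage index $t_e$ defined above for the IPC algorithm depends only on the binary representation of the index of the last leaf node of $v$. A minimal sketch of that rule follows; the function name is hypothetical.

```python
def stage_te(last_leaf_index: int, n: int) -> int:
    """Return t_e = n - j, with j the smallest bit position where B_j = 0.

    If all n bits B_0..B_{n-1} are 1, t_e = 0.
    """
    for j in range(n):
        if (last_leaf_index >> j) & 1 == 0:   # B_j is bit j of the index
            return n - j
    return 0
```

For example, with $n = 3$, a last leaf index of 3 (binary 011) gives $j = 2$ and hence $t_e = 1$, while a last leaf index of 7 (binary 111) gives $t_e = 0$.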
Algorithm 3: Index based partial sum computation (IPC) algorithm.
2) Micro Architecture of the Proposed Hybrid Partial Sum Unit: Based on our IPC algorithm, a Hyb-PSU is proposed with two improvements. First, some partial sums are stored in . . .
(b) Two types of PEs are used in the PE tree in Fig. 10(a). Suppose the maximal length of a constituent codeword that is returned from a rate-0, rate-1 or FP node is $2^{\mu}$; then stage $z$ ($z \ge n - \mu$) employs only the type-I PEs. The remaining stages in the PE tree employ the type-II PEs.
(c) Compared with the type-II PE, the type-I PE has an extra data load unit (DLU). For PE$_{l,z,j}$ within stage $z$ ($j = 0, 1, \cdots, 2^{n-z}-1$), the binary outputs, $o_{l,z,2j}$ and $o_{l,z,2j+1}$, are connected to $b_{l,z-1,2j}$ and $b_{l,z-1,2j+1}$, respectively. The wired connections are not shown in Fig. 10(a).
(d) . . . $T$ is the number of processing elements belonging to a decoding path in a partial parallel list decoder. For our memory compiler, if $c_{w,z}$ is greater than a threshold value, then BM$_{l,z}$ is implemented with an RF. If $c_{w,z}$ is greater than a second, larger threshold value, then BM$_{l,z}$ is implemented with an SRAM.
(e) The connector module (CN) has two $T$-bit inputs and two $T$-bit outputs. The connections between the outputs and inputs are . . . $T/2 \le j < T$.
3) Computation Schedule of Our Hybrid Partial Sum Unit: Once the returned $L$ codewords $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$ are computed, the path pruning unit also outputs $L$ indices $a_0, a_1, \cdots, a_{L-1}$, where decoding path $a_l$ needs to be copied to decoding path $l$. For $l = 0, 1, \cdots, L-1$, $\beta_{v,l}$ is first loaded into stage $t$ by the DLU in Fig. 10(b), and the output partial sums in Alg. 3 come out from stage $t_e$. For stage $t$, if $\beta_{v,l}$ is sent from a rate-0 node, then the control signal $LZ_t$ is 0, since $\beta_{v,l}$ is a zero vector. Otherwise, $LD_t = 0$ and $LZ_t = 1$. For the other stages, $LD_z = 1$ and $LZ_z = 1$ ($z \ne t$).
For all partial sums within the partial sum matrix $C_{l,z}$, we divide them into two sets: C . . . When $t_e \ge m$, $C^0_{l,t_e}$ is computed in one clock cycle and is output from stage $t_e$, where $C_{l,t_e}[j][0]$ is set to $s_{l,t_e,j}$ produced by the type-I and type-II PEs for $j = 0, 1, \cdots, 2^{n-t_e}-1$. When $t_e < m$, $C^0_{l,t_e}$ is computed in $2^{n-t_e}/T$ cycles, and $T$ updated partial sums are computed in each clock cycle. Since decoding path $a_l$ needs to be copied to path $l$, for $z = t, t-1, \cdots, t_e+1$, the computation of C . . .
4) Comparisons with Related Works: Compared to the partial sum computation architectures in [9], [10], the proposed Hyb-PSU architecture has advantages in the following two aspects.
(1) The proposed Hyb-PSU is a scalable architecture. The PSU architectures in [9], [10] require $L(N-1)$ and $L(N/2-1)$ single bit registers, respectively, where $N = 2^n$ is the block length. Hence, they suffer from excessive area overhead when the block length $N$ is large. In contrast, the proposed Hyb-PSU stores $L(N-1)$ bits, and most of these bits are stored in RFs or SRAMs, which are more area efficient than registers.
(2) The architectures in [9], [10] copy the partial sums of a decoding path to another decoding path when needed, while our Hyb-PSU copies only index references. We define the copying of a single bit from one register to another as a single copy operation. When decoding path $l'$ needs to be copied to path $l$, the PSU in [10] requires $N_1 = 2^{n-1} - 1$ copy operations, while our Hyb-PSU needs only $N_2 = (n+1)\log_2 L$ copy operations. Since the value of $L$ for practical hardware implementation is small, our lazy copy needs far fewer copy operations than direct copy.
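To give a sense of scale for the two copy counts above, the sketch below evaluates $N_1 = 2^{n-1} - 1$ and $N_2 = (n+1)\log_2 L$ for given $n$ and $L$. The function names are hypothetical, and $L$ is assumed to be a power of 2.

```python
def direct_copy_ops(n: int) -> int:
    """N_1: single-bit copy operations when partial sums are copied directly."""
    return 2 ** (n - 1) - 1


def index_copy_ops(n: int, L: int) -> int:
    """N_2: single-bit copy operations when only index references are copied."""
    log2_L = L.bit_length() - 1        # exact log2 for a power-of-2 list size
    assert 2 ** log2_L == L, "L is assumed to be a power of 2"
    return (n + 1) * log2_L
```

For $L = 4$, this gives 4095 versus 28 copy operations when $n = 13$, and 16383 versus 32 when $n = 15$.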
In this paper, when $L = 4$ and $T = 128$, for $N = 2^{13}$ and $2^{15}$, the proposed hybrid partial sum unit architecture is implemented with $m = 3$ and $m = 5$, respectively, under a TSMC 90nm CMOS technology. Our partial sum computation unit consumes an area of 0.779 mm$^2$ and 1.31 mm$^2$ for $N = 2^{13}$ and $N = 2^{15}$, respectively. To the best of our knowledge, the decoder architectures in [9], [10], [15], [24] are the only ones for SC based list decoding algorithms of polar codes. However, in [9], [15], [24], the partial sum computation unit architecture was not discussed in detail and implementation results for the PSU alone are not shown. Hence, we compare our proposed Hyb-PSU with that in [10]. When $L = 4$, the partial sum unit architecture in [10] for $N = 2^{13}$ and $2^{15}$ consumes an area of 1.011 mm$^2$ and 3.63 mm$^2$, respectively, under the same CMOS technology. All PSUs are synthesized under a frequency of 500 MHz. Our Hyb-PSU achieves an area saving of 23% and 63% for block lengths $2^{13}$ and $2^{15}$, respectively.
E. Latency and Throughput
For the proposed high throughput decoder architecture, the number of clock cycles, $N_D$, used on the decoding of a codeword depends on the block length, the code rate and the positions of frozen bits. For our RLLD algorithm, let $N_V$ be the number of nodes (except the root node) visited in $G_n$. Let $S_V$ denote the set of indices of visited nodes (except the root node). Let $S'_V$ be a subset of $S_V$, where $S'_V$ consists of rate-1 nodes with $I_v \le X_{th}$ and all FP nodes. For $v_i \in S_V$, let $t_i$ be the layer index of node $v_i$ for $i = 0, 1, \cdots, N_V - 1$. Then . . . , where $\lceil 2^{n-t_i}/T \rceil$ is the number of clock cycles needed to calculate the LLR vectors sent to node $v_i$. $N_P^{(i)}$ is the number of clock cycles used by our PPU when $v_i$ is activated. Note that a decoding path splits only if node $v_i$ is a rate-1 node with $I_v \le X_{th}$ or an FP node. Hence, $N_P^{(i)} = 0$ if $v_i \notin S'_V$; otherwise, $N_P^{(i)}$ depends on the node type, $X_{th}$, $q_{I_v,L}$, $T$, $L$ and the number of pipeline stages in our PPU. This will be discussed in more detail in Section V.
Since our list decoder outputs $x$, we need to obtain $u$ before calculating the CRC checksum of the information bits. A partial-parallel polar encoder [25] can be used, and the corresponding latency is $N/T$ clock cycles when $T$ bits are fed to the encoder in parallel. For the computation of the CRC, a partial parallel CRC unit [26] can be used, and the corresponding latency is also $N/T$. As a result, $2N/T$ is the number of clock cycles due to encoding and CRC checksum computation.
The latency of our decoder is $T_L = N_D / f$, where $f$ is the decoder frequency. Since a CRC is used in producing the final output data word, we calculate the net information throughput (NIT) of our decoder, where NIT = . . . , where $h$ is the CRC checksum length. Here, the latency due to the CRC checksum computation does not affect our decoder throughput, since our decoder can work on the next frame once our Hyb-PSU begins to output decoded codewords for the current frame.
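The relation between cycle count, latency and throughput can be illustrated with a short sketch. The latency expression $T_L = N_D / f$ is taken from the text. The NIT expression used below, $(K - h)/T_L$ with $K$ the number of non-frozen bits including the CRC, is an assumption made for illustration only, since the exact expression is not available in this section. The cycle count in the usage note is hypothetical and is not a reported result.

```python
def latency_seconds(n_d: int, f_hz: float) -> float:
    """Decoding latency T_L = N_D / f."""
    return n_d / f_hz


def net_information_throughput(k: int, h: int, n_d: int, f_hz: float) -> float:
    """Assumed NIT in bits per second: (K - h) payload bits per decoding latency."""
    return (k - h) / latency_seconds(n_d, f_hz)
```

For instance, with a hypothetical $N_D = 1000$ cycles at $f = 500$ MHz, the latency is 2 microseconds. For $K = 4096$ and $h = 32$, the assumed NIT is then about 2.03 Gb/s.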
V. IMPLEMENTATION RESULTS AND COMPARISONS
To compare with prior works, we implement our high throughput list decoder architecture for three polar codes with lengths of $2^{10}$, $2^{13}$ and $2^{15}$ and rates of 0.5, 0.5 and 0.9, respectively. The last polar code is intended for storage applications. For each code, three different list sizes are considered: $L = 2, 4, 8$. All our decoders are synthesized under the TSMC 90nm CMOS technology using the Cadence RTL compiler. The area efficiency (AE) of a partly parallel decoder architecture depends on the number of PUs. In order to make a fair comparison with prior works in [10], [12], [16], the number of PUs for each decoding path of our implemented decoders is selected to be 64 when $N = 2^{10}$. When $N = 2^{13}$ and $2^{15}$, the number of PUs per decoding path is 128 for our decoders. The list decoders in [27] are based on a line architecture, which requires $N/2$ PUs.
A total of 3, 4 and 6 pipeline stages are inserted in the PPU for decoders with $L = 2$, 4 and 8, respectively. The number of pipeline stages needed for our PPU is determined by the longest data path. For each $v_i \in S'_V$, if node $v_i$ is a rate-1 node with $I_v \le X_{th}$, $N_P^{(i)}$ depends on the number of PUs in a decoding path: when $I_v \le T$, $N_P^{(i)} = 2$ for all our implemented decoders; otherwise, $N_P^{(i)} = 4$ for all our decoders, since the minimum value of a received LLR vector is calculated in a partial parallel way, which incurs extra clock cycles. When node $v_i$ is an FP node, $N_P^{(i)}$ relates to $q_{I_v,L}$. Depending on the value of $q_{I_v,L}$, we may use different data paths when computing the $L$ minimum expanded path metrics. The locations of all pipelines are arranged so that fewer clock cycles are needed when $q_{I_v,L}$ is smaller. In Table VI, we list the detailed value of $N_P^{(i)}$ with respect to $I_v$ and $L$.
The selection of $X_{th}$ is a trade-off between AE and error performance. When $X_{th}$ is increased, more rate-1 nodes are processed by our CG algorithm. Hence, $N_D$ increases and the resulting NIT decreases. Meanwhile, the corresponding error performance is better, especially in the high SNR region. Our high throughput list decoder architecture supports all $X_{th}$ values. For all our implemented decoders, $X_{th}$ is large enough so that all rate-1 nodes are processed by our CG algorithm. In this setup, for each implemented decoder, $N_D$ is maximized with respect to $X_{th}$, and hence the throughput of our decoder architecture in Tables III, IV and V is the minimum achieved by our decoders. For each code, the corresponding error performance is better than that of the RLLD with the MEQ in Fig. 6.
The implementation results are shown in Tables III, IV and V. They show that our decoders outperform existing SCL decoders [10], [12], [15] in both decoding latency and area efficiency. Compared with the decoders of [12], the area efficiency and decoding latency of our decoders are 1.59 to 32.5 times and 3.4 to 6.8 times better, respectively. The area efficiency and decoding latency of our decoders are 3.9 to 21.5 times and 5.5 to 13 times better, respectively, than those of the decoders of [10]. Compared with the decoders of [16], our decoders improve the area efficiency and decoding latency by 1.12 to 12 times and 2.8 to 9 times, respectively.
Notes to Tables III, IV and V: [10] has been re-synthesized under the TSMC 90nm CMOS technology. * These are the original implementation results based on a 65nm CMOS technology. † These are the scaled results under the TSMC 90nm CMOS technology. [10], [16] have been re-synthesized under the TSMC 90nm CMOS technology. The number of PUs per decoding path is 128.
When $N = 2^{10}$, the area efficiency and decoding latency of our decoders are 3.8 to 4.2 times and 3.58 to 3.84 times better, respectively, than those of the decoders of [15]. Compared with the decoders of [15], our decoders would show more significant improvements in area efficiency and decoding latency when $N$ is larger.
Based on the implementation results shown in Tables III, IV and V, it is observed that when the block length is fixed, as the list size L increases, the area efficiency and decoding latency will decrease and increase, respectively, because:
• It takes more memory to store internal LLRs when L increases.
• The number of pipeline stages within our PPU will increase when L increases, which in turn increases the overall decoding clock cycles.
The latency reduction and area efficiency improvement of our decoders are due to the reduced number of nodes activated in the decoding. However, the area and frequency overhead of the proposed PPU somewhat dilutes the effect of the reduction in decoding clock cycles. For example, our decoder reduces the number of decoding cycles to approximately 1/7 of that of the decoders in [12] for $L = 2$, 4 and 8. However, the reduction in decoding cycles does not fully translate into an improvement in decoding latency and area efficiency. Based on our implementation results, taking $L = 2$ as an example, the PPU occupies 61.99%, 40.16% and 25.40% of the area of the whole decoder for $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively. Compared with the decoders with $N = 2^{10}$ and $2^{13}$, the effect of the PPU area overhead on the area efficiency is smaller for decoders with $N = 2^{15}$. Keeping $T$ unchanged, as $N$ increases, the area of the PPU increases very slowly while the total area of all LLR memories is proportional to $N$. Hence, for larger $N$, the PPU occupies a smaller percentage of the total area of a whole decoder. When the list size $L$ is fixed, as $N$ increases, the latency reduction and area efficiency improvement compared with other decoders in the literature will be greater.
VI. CONCLUSION
In this paper, a reduced latency list decoding algorithm is proposed for polar codes. The proposed list decoding algorithm results in a high throughput list decoder architecture for polar codes. A memory efficient quantization method is also proposed to reduce the size of message memories. The proposed list decoder architecture can be adapted to large block lengths due to our hybrid partial sum unit, which is area efficient. The implementation results of our high throughput list decoder demonstrate significant advantages over current state-of-the-art SCL decoders.
