Abstract-Polar codes are of great interest because they provably achieve the capacity of both discrete and continuous memoryless channels while having an explicit construction. Most existing decoding algorithms of polar codes are based on bit-wise hard or soft decisions. In this paper, we propose symbol-decision successive cancellation (SC) and successive cancellation list (SCL) decoders for polar codes, which use symbol-wise hard or soft decisions for higher throughput or better error performance. First, we propose to use a recursive channel combination to calculate symbol-wise channel transition probabilities, which lead to symbol decisions. Our proposed recursive channel combination also has a lower complexity than simply combining bit-wise channel transition probabilities. The similarity between our proposed method and Arikan's channel transformations also helps to share hardware resources between calculating bit- and symbol-wise channel transition probabilities. Second, a two-stage list pruning network is proposed to provide a trade-off between the error performance and the complexity of the symbol-decision SCL decoder. Third, since memory is a significant part of SCL decoders, we propose a pre-computation memory-saving technique to reduce the memory requirement of an SCL decoder. Finally, to evaluate the throughput advantage of our symbol-decision decoders, we design an architecture based on a semi-parallel successive cancellation list decoder. In this architecture, different symbol sizes, sorting implementations, and message scheduling schemes are considered. Our synthesis results show that in terms of area efficiency, our symbol-decision SCL decoders outperform existing bit- and symbol-decision SCL decoders.
I. INTRODUCTION
Polar codes, a groundbreaking finding by Arikan [1] in 2009, have ignited a spark of research interest in the fields of communication and coding theory, because they can provably achieve the capacity of both discrete [1] and continuous [2] memoryless channels. The second reason why polar codes are attractive is their low encoding and decoding complexity. For example, a polar code of length N can be decoded by the successive cancellation (SC) algorithm [1] with a complexity of O(N log N). However, with the SC algorithm, polar codes approach capacity only when the code length is large enough ($N > 2^{20}$ [3]). For short or moderate code lengths, in terms of the error performance, polar codes with the SC algorithm are inferior to Turbo codes or low-density parity-check (LDPC) codes [4], [5].
Since the debut of polar codes, considerable effort has been made to improve the error performance of short polar codes. Systematic polar codes [6] were proposed to reduce the bit error rate (BER) while guaranteeing the same frame error rate (FER) as their non-systematic counterparts. Although a Viterbi algorithm [7], a sphere decoding algorithm [8], and a stack sphere decoding algorithm [9] can provide maximum likelihood (ML) decoding of polar codes, they are considered infeasible, especially for long polar codes, due to their much higher complexity than the SC algorithm. Recently, an SC list (SCL) algorithm for polar codes was proposed in [10] to bridge the performance gap between the SC algorithm and ML algorithms at the cost of a complexity of O(LN log N), where L is the list size. Moreover, the concatenation of polar codes with cyclic redundancy check (CRC) codes was introduced in [4], [11]. To decode the CRC-concatenated polar codes, a CRC detector is used in the SCL algorithm to help select the output codeword. The combination of an SCL algorithm and a CRC detector is called the CRC-aided SCL (CA-SCL) algorithm. It is shown in [11] that with the CA-SCL algorithm, the error performance of a (2048, 1024) CRC-concatenated polar code is better than that of a (2304, 1152) LDPC code, which is used in the WiMax standard [12].
Several architectures have been proposed for the SC algorithm. Arikan [1] showed that a fully parallel SC decoder has a latency of 2N − 1 clock cycles. A tree SC decoder and a line SC decoder with a complexity of O(N) were proposed in [13]. These two decoders have the same latency as the fully parallel SC decoder. To reduce complexity further, Leroux et al. [3] proposed a semi-parallel SC decoder for polar codes by taking advantage of the recursive structure of polar codes to reuse processing resources. Assuming that the number of processing elements (PEs) is P ($P = 2^p \le N$), the latency of the semi-parallel SC decoder is $2N + \frac{N}{P}\log_2\left(\frac{N}{4P}\right)$ clock cycles. To reduce the latency, a simplified SC (SSC) polar decoder was introduced in [14] and further analyzed in [15]. In the SSC polar decoder, a polar code is converted to a binary tree including three types of nodes: rate-one, rate-zero, and rate-R nodes. Based on the SSC polar decoder, the ML-SSC decoder in [16] makes use of the ML algorithm to deal with part of the rate-R nodes. However, the SSC and ML-SSC polar decoders depend on the positions of information bits and frozen bits, and are consequently code-specific. In [17], a pre-computation look-ahead technique was proposed to reduce the latency of the tree SC decoder by half. For the SCL polar decoder, the semi-parallel architecture was adopted in [18]. In [19], Balatsoukas-Stimming et al. proposed an architecture with L = 4 that achieves a throughput of 124 Mbps and a latency of 8.25 µs when decoding a (1024, 512) polar code. In [20], Lin and Yan designed an SCL polar decoder with a throughput of 182 Mbps and a latency of 5.63 µs. To reduce the memory requirement, log-likelihood ratio (LLR) messages are used in [21]. The throughput of existing polar decoders is still not high enough for high-speed applications.
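As a quick numerical check of the semi-parallel latency expression quoted from [3], the following Python sketch evaluates $2N + (N/P)\log_2(N/(4P))$ for power-of-two N and P. The function and argument names are ours and purely illustrative.

```python
def semi_parallel_sc_latency(code_length: int, num_pes: int) -> int:
    """Clock cycles of a semi-parallel SC decoder: 2N + (N/P) * log2(N / (4P)).

    Both arguments are assumed to be powers of two with P <= N.
    """
    for value in (code_length, num_pes):
        if value < 1 or value & (value - 1):
            raise ValueError("N and P must be powers of two")
    if num_pes > code_length:
        raise ValueError("P must not exceed N")
    # Exact base-2 logarithms of powers of two via bit_length.
    log2_n = code_length.bit_length() - 1
    log2_4p = (4 * num_pes).bit_length() - 1
    return 2 * code_length + (code_length // num_pes) * (log2_n - log2_4p)


if __name__ == "__main__":
    # N = 1024 with P = 64 processing elements: 2048 + 16 * 2 cycles
    print(semi_parallel_sc_latency(1024, 64))
```

For N = 1024 and P = 64 the expression gives 2080 clock cycles, i.e. only 32 cycles more than the 2N term alone.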
arXiv:1501.04705v1 [cs.IT] 20 Jan 2015
Since the low throughput (or long latency) of the SC decoder is due to its serial nature, several previous works attempt to improve the throughput (or latency). In [22], the data bits of a polar code are split into several streams, which are decoded simultaneously. This idea of parallel processing is extended in [23], where the SC decoder is transformed into a concatenated decoder in which all the inner SC decoders are carried out in parallel. Yuan and Parhi proposed a multi-bit SCL decoder [24].
In this paper, we address the throughput/latency issue by proposing symbol-decision SC and SCL decoders, which are based on symbol-wise hard or soft decisions. Since each symbol consists of M bits, when M > 1 the symbol-decision decoders achieve higher throughput as well as better error performance. The proposed symbol-decision decoders are natural generalizations of their bit-wise counterparts, and reduce to existing bit-wise decoders when the symbol size is one bit. The main contributions of this paper are:
• We propose a novel recursive channel combination to calculate the symbol-wise channel transition probabilities, which enable symbol decisions in SC and SCL algorithms. The proposed recursive channel combination also has a lower complexity than simply combining bit-wise channel transition probabilities. The similarity between Arikan's recursive channel transformation and our symbol-wise recursive channel combination helps to share hardware resources to calculate the bit- and symbol-based channel transition probabilities.
• An M-bit symbol-decision SCL decoder needs to find the L most reliable candidates out of $2^M L$ list candidates. We propose a two-stage list pruning network to perform this sorting function. This pruning network also provides a trade-off between performance and complexity.
• By adopting the pre-computation technique [25], we develop a pre-computation memory-saving (PCMS) technique to reduce the memory requirement of the SCL decoder. Specifically, the channel information memory can be eliminated when using the PCMS technique. Moreover, this technique also helps to improve throughput slightly.
• To evaluate the throughput of symbol-decision SC decoders, we propose an area-efficient architecture for symbol-decision SCL decoders. (We focus on the SCL decoder because the SC decoder can be considered as an SCL decoder with a list size of one.) In our architecture, to save area, adders in processing units are reused to calculate the symbol-wise channel transition probability. We propose two scheduling schemes for sharing hardware resources. We also propose two list pruning networks for designs with different symbol sizes.
• We design two-, four-, and eight-bit symbol-decision SCL decoders for a (1024, 480) CRC32-concatenated polar code with a list size of four. Synthesis results show that in terms of area efficiency, our symbol-decision SCL decoder outperforms all existing state-of-the-art SCL decoders in [19]-[21], [24]. For example, the area efficiency of our four-bit symbol-decision SCL decoder is 259.2 Mb/s/mm$^2$, which is 1.51 times that of [21]. Our implementation results also demonstrate that the symbol-decision SCL decoder can provide a range of tradeoffs between area, throughput, and area efficiency.
Our symbol-decision decoding algorithms assume that the underlying channel has a binary input, and our symbol-wise channel transformation is virtual and introduced for decoding only. Hence, our work is different from those assuming a q-ary (q > 2) channel (see, for example, [26]).
The decoding schedule (bit sequence) of our symbol-decision decoding algorithms is actually the same as those in [22]-[24], but our symbol-decision decoding algorithms are different from those in [22]-[24] in two aspects. First, our symbol-wise recursive channel transition is different from how transition probabilities are derived in [22]-[24]. Second, the symbol-decision perspective allows us to prove that the symbol-decision algorithms have better frame error rates (FERs) than their bit-decision counterparts [27], while only simulation results are provided in [22], [24] and error performance is not investigated in [23]. There are additional differences between our decoding algorithms/architectures and those in [22]-[24]. For instance, all the bits within a symbol are estimated jointly in our symbol-decision SC algorithm, whereas some bits are decoded independently for the decoder with parallelism two in [22]. Also, while our symbol-decision decoding is introduced on the algorithmic level, the multi-bit decoder is introduced on the level of decoding operations [24]. Finally, for our symbol-decision SCL decoders, we use the semi-parallel architecture because it is more area efficient than the tree architecture and the line architecture [13].
The rest of our paper is organized as follows. Section II briefly reviews polar codes and existing decoding algorithms for polar codes. In Section III, the symbol-based recursive channel combination is proposed to calculate the symbol-based channel transition probability. Moreover, to simplify the selection of the list candidates, a two-stage list pruning network is proposed. In Section IV, we introduce a method to reduce the memory requirement of list decoders of polar codes using the pre-computation technique. In Section V, we describe the hardware architecture for symbol-decision SCL decoders. Two scheduling schemes for hardware sharing are discussed. We also propose two list pruning networks for different designs: a folded sorting implementation and a tree sorting implementation. A discussion on the latency of our architecture and synthesis results for our implementations are provided in this section as well. Finally, we draw some conclusions in Section VI.
II. POLAR CODES AND EXISTING DECODING ALGORITHMS

A. Preliminaries
We follow the notation for vectors in [1] , namely u 
) and $x_{(j-1)M+1}^{jM}$ denote the output and input of W
B. Polar Codes
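Polar codes are linear block codes, and their block lengths are restricted to powers of two, denoted by $N = 2^n$ for n ≥ 2. Let $u = (u_1, u_2, \cdots, u_N)$ be the data bit sequence and $x = (x_1, x_2, \cdots, x_N)$ the corresponding encoded bit sequence. Then
$$x = u B_N F^{\otimes n}, \quad (1)$$
where $B_N$ is the N × N bit-reversal permutation matrix and $F^{\otimes n}$ denotes the n-th Kronecker power of $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$.
For any index set $A \subseteq \{1, 2, \cdots, N\}$, $u_A$ is the sub-sequence of u restricted to A. For an (N, K) polar code, the data bit sequence is grouped into two parts: a K-element part $u_A$, which carries the information bits, and $u_{A^c}$, whose elements are predefined frozen bits, where $A^c$ is the complement of A. For convenience, frozen bits are set to zero.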
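The encoding with the bit-reversal permutation $B_N$ and the Kronecker power $F^{\otimes n}$ can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard kernel F = [[1, 0], [1, 1]] of [1] and arithmetic over GF(2); the function names and the 0-based information set in the example are ours, and the information set is arbitrary rather than an optimized polar code construction.

```python
def bit_reversal_permute(u):
    """Return u * B_N: entry i moves to the index whose n-bit binary form is reversed."""
    n = len(u).bit_length() - 1
    out = [0] * len(u)
    for i, bit in enumerate(u):
        out[int(format(i, f"0{n}b")[::-1], 2)] = bit
    return out


def apply_kronecker_power(v):
    """Return v * F^{(x)n} over GF(2) using log2(N) butterfly passes."""
    x = list(v)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]  # upper entry absorbs the lower entry
        step *= 2
    return x


def polar_encode(u):
    """Encode a length-N data bit sequence (N a power of two, N >= 2)."""
    if len(u) < 2 or len(u) & (len(u) - 1):
        raise ValueError("block length must be a power of two")
    return apply_kronecker_power(bit_reversal_permute(u))


def place_information_bits(info_bits, info_set, block_length):
    """Build u with information bits on info_set (0-based) and zero frozen bits elsewhere."""
    u = [0] * block_length
    for index, bit in zip(sorted(info_set), info_bits):
        u[index] = bit
    return u


if __name__ == "__main__":
    u = place_information_bits([1, 1], {2, 3}, 4)
    print(u, "->", polar_encode(u))
```

Encoding a unit vector returns the corresponding row of $B_N F^{\otimes n}$, which is a convenient way to check the sketch against the generator matrix for small N.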
C. SC Algorithm for Polar Codes
Given a transmitted codeword x and the corresponding received word y, the SC algorithm for an (N, K) polar code estimates the encoding bit sequence u successively as shown in Alg. 1. Here, $\hat{u} = (\hat{u}_1, \hat{u}_2, \cdots, \hat{u}_N)$ represents the estimated value for u.
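To calculate the bit-wise channel transition probabilities, Arikan's recursive channel transformations [1] are applied:
$$W_{2\Lambda}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2} \mid u_{2i-1}) = \frac{1}{2} \sum_{u_{2i}} W_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) \cdot W_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} \mid u_{2i}) \quad (2)$$
and
$$W_{2\Lambda}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1} \mid u_{2i}) = \frac{1}{2} W_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) \cdot W_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} \mid u_{2i}), \quad (3)$$
where $0 < i \le \Lambda = 2^{\lambda} < N$, and $0 \le \lambda < n$. Expressed in log-likelihood (LL), Eqs. (2) and (3) can be approximated as [4]:
$$\log W_{2\Lambda}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2} \mid u_{2i-1}) \approx \max_{u_{2i}} \left[ \log W_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) + \log W_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} \mid u_{2i}) \right] - \log 2, \quad (4)$$
$$\log W_{2\Lambda}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1} \mid u_{2i}) = \log W_{\Lambda}^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) + \log W_{\Lambda}^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} \mid u_{2i}) - \log 2. \quad (5)$$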
To simplify the calculation, the constants in Eqs. (4) and (5) can be discarded since this global offset for all LLs does not affect the decoding decision. The SC algorithm makes a hard decision for only one bit at a time, as shown in Fig. 1(a). We call it the bit-decision decoding algorithm. A parallel SC decoder [22]-[24] makes a hard decision for M bits instead of only one bit at a time, as shown in Fig. 1(b).
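With the constant offsets dropped, the LL-domain updates of Eqs. (4) and (5) reduce to a max-plus step and a plain addition. The sketch below illustrates this for a single pair of nodes; it assumes each LL message is a pair indexed by the conditional bit (0 or 1), with `ll_a` taken from the upper sub-channel and `ll_b` from the lower one. These names and the tuple layout are our own conventions, not the paper's.

```python
def ll_odd(ll_a, ll_b):
    """LL pair for u_{2i-1}: maximize over the unknown u_{2i} (max approximation)."""
    return tuple(
        max(ll_a[u1 ^ u2] + ll_b[u2] for u2 in (0, 1))
        for u1 in (0, 1)
    )


def ll_even(ll_a, ll_b, u1_hat):
    """LL pair for u_{2i}, given the already decided bit u1_hat for u_{2i-1}."""
    return tuple(ll_a[u1_hat ^ u2] + ll_b[u2] for u2 in (0, 1))


if __name__ == "__main__":
    ll_a = (-0.25, -2.0)   # (LL given bit 0, LL given bit 1), upper sub-channel
    ll_b = (-0.5, -1.0)    # same layout, lower sub-channel
    odd = ll_odd(ll_a, ll_b)
    u1_hat = 0 if odd[0] >= odd[1] else 1   # bit-wise hard decision
    print(odd, u1_hat, ll_even(ll_a, ll_b, u1_hat))
```

Because the same offset is removed from every LL at a stage, the ordering of the two entries in each pair, and hence the hard decision, is unchanged.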
Without loss of generality, assume M is a power of two,
Given
where $|\mathcal{AM}_j|$ represents the cardinality of $\mathcal{AM}_j$. If M = N, this decoding algorithm is exactly a maximum-likelihood sequence decoding algorithm.
sortPDecrement(S);
E. SCL and CA-SCL Algorithms for Polar Codes
Instead of making a hard decision for each information bit of u as in the SC algorithm, the SCL algorithm creates two paths in which the information bit is assumed to be 0 and 1, respectively. If the number of paths is greater than the list size L, the L most reliable paths are selected. At the end of the decoding procedure, the most reliable path is chosen as $\hat{u}$. The SCL algorithm is formally described in Alg. 2. Without loss of generality, we assume L to be a power of two, i.e.
to represent the i-th list vector, where 0 < i ≤ L. S is a structure type array with size 2L. Each element of S has three members: P, L, and U. The function sortPDecrement sorts the array S by decreasing order of P. c=conc(a,b) attaches a bit sequence b at the end of a bit sequence a, and the length of the output bit sequence c is the sum of lengths of a and b.
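The path expansion and pruning performed for one information bit can be sketched as follows. The meaning of the three members is our reading of the description (P as the path metric, L as the index of the parent list entry, U as the hypothesized bit), and the function and argument names are hypothetical; the sketch does not reproduce Alg. 2 itself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    P: float  # reliability (LL) of the extended path (assumed meaning)
    L: int    # index of the parent path in the current list (assumed meaning)
    U: int    # hypothesized value of the information bit (assumed meaning)


def scl_information_bit_step(paths, bit_ll, list_size):
    """Expand every path with bit 0 and bit 1, then keep the list_size best.

    paths:  list of tuples of already decided bits
    bit_ll: bit_ll[l][u] is the LL of path l extended with bit u
    """
    S = [Candidate(P=bit_ll[l][u], L=l, U=u)
         for l in range(len(paths)) for u in (0, 1)]
    S.sort(key=lambda c: c.P, reverse=True)                 # role of sortPDecrement
    return [paths[c.L] + (c.U,) for c in S[:list_size]]     # role of conc


if __name__ == "__main__":
    survivors = scl_information_bit_step(
        paths=[(0,), (1,)],
        bit_ll=[(-1.0, -3.0), (-2.0, -0.5)],
        list_size=2,
    )
    print(survivors)
```

A real decoder would carry the metric of each surviving path forward as well; it is omitted here to keep the expansion and pruning step in focus.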
The CA-SCL algorithm is used for CRC-concatenated polar codes. The difference between the CA-SCL [11] and SCL algorithms is how the final decision for $\hat{u}$ is made. If there is at least one path satisfying the CRC constraint, the most reliable CRC-valid path is chosen for $\hat{u}$. Otherwise, the decision rule of the SCL algorithm is used for the CA-SCL algorithm.
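The final decision rule of the CA-SCL algorithm can be sketched as below. The CRC check is passed in as a callable `crc_ok`, which is a placeholder of ours; the even-parity check used in the example merely stands in for a real CRC such as CRC32.

```python
def ca_scl_select(paths, metrics, crc_ok):
    """Return the most reliable CRC-valid path, or the most reliable path if none is valid."""
    ranked = sorted(zip(metrics, paths), key=lambda item: item[0], reverse=True)
    for _, path in ranked:
        if crc_ok(path):
            return path
    return ranked[0][1]  # fall back to the plain SCL decision


if __name__ == "__main__":
    def even_parity(path):
        # Toy stand-in for a CRC detector (assumption for illustration only).
        return sum(path) % 2 == 0

    paths = [(1, 0, 0), (1, 1, 0), (0, 1, 1)]
    metrics = [-0.5, -1.0, -2.0]
    print(ca_scl_select(paths, metrics, even_parity))
```

In the example the most reliable path fails the check, so the second most reliable path is returned instead.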
III. M-BIT SYMBOL-DECISION DECODING ALGORITHMS FOR POLAR CODES
A. M-bit Symbol-Decision SC Algorithm
Here, we propose a symbol-decision SC algorithm, which treats M-bit data as a symbol and decodes a symbol at a time. Let $\mathcal{Z}$ represent the alphabet of all M-bit symbols. The symbol-decision SC algorithm deals with the virtual channel $W_{N,M}^{(j)}$ if we consider $\mathcal{X}^M$ as the binary vector representation of $\mathcal{Z}$. Therefore, the symbol-decision SC algorithm has the same schedule as the parallel SC algorithm in [22]-[24]. However, our symbol-decision SC algorithm has a different approach, called symbol-based recursive channel combination, to compute the symbol-based channel transition probabilities of $W_{N,M}^{(j)}$, which is our main focus.
In [22]-[24], the calculation of the symbol-based channel transition probability is based on Eq. (7), referred to as direct-mapping calculation, in which each bit-wise channel transition probability is calculated by Arikan's recursive channel transformations.
Actually, a symbol-based recursive channel combination described in Proposition 1 can be used to calculate W
Assume that all bits of u are independent and each bit has an equal probability of being a 0 or 1. Given
2Λ,Φ is obtained by a single-step combination of two independent copies of a
where the channel transition probability satisfies,
Similar to the SC algorithm, with the help of the symbol-based recursive channel combination, an M-bit symbol-decision SC algorithm can also be represented by a message flow graph (MFG), where a channel transition probability is referred to as a message for the sake of convenience. This MFG is referred to as the SR-MFG. If the code length of a polar code is N, the SR-MFG can be divided into (n + 1) stages ($S_0, S_1, \cdots, S_n$) from right to left: one initial stage $S_0$ and n calculation stages. For the SC algorithm, all calculation stages carry out Arikan's recursive channel transformation. However, for the M-bit symbol-decision SC algorithm, symbol-based channel combinations are carried out in the left-most m calculation stages ($S_n, \cdots, S_{n-m+1}$), called S-COMBS stages. For the remaining (n − m) calculation stages ($S_{n-m}, \cdots, S_1$), called B-TRANS stages, Arikan's recursive channel transformations are performed. The S-COMBS stages use the outputs of the B-TRANS stages to calculate symbol-based messages.
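One step of an S-COMBS stage can be sketched in the LL domain as follows. The sketch rests on an assumption of ours about the surviving text: a 2Φ-bit symbol is split into its odd- and even-indexed bits, the upper sub-channel is indexed by their bit-wise XOR and the lower sub-channel by the even-indexed bits, mirroring Arikan's transformation, and the two LLs are simply added. The dictionary representation, the bit ordering, and all names are hypothetical.

```python
from itertools import product


def combine_symbol_ll(ll_upper, ll_lower, phi):
    """Combine two phi-bit symbol LL tables into one 2*phi-bit symbol LL table.

    ll_upper, ll_lower: dicts mapping a tuple of phi bits to an LL value.
    Returns a dict mapping each tuple of 2*phi bits to its LL (one addition each).
    """
    combined = {}
    for symbol in product((0, 1), repeat=2 * phi):
        odd = symbol[0::2]    # u_1, u_3, ... in 1-based indexing
        even = symbol[1::2]   # u_2, u_4, ...
        upper_index = tuple(a ^ b for a, b in zip(odd, even))
        combined[symbol] = ll_upper[upper_index] + ll_lower[even]
    return combined


if __name__ == "__main__":
    ll_upper = {(0,): -0.25, (1,): -2.0}
    ll_lower = {(0,): -0.5, (1,): -1.0}
    two_bit = combine_symbol_ll(ll_upper, ll_lower, phi=1)
    best = max(two_bit, key=two_bit.get)   # symbol-wise hard decision
    print(two_bit, best)
```

Each output message costs exactly one addition, and no maximization or summation over unknown bits is involved, which is what allows the S-COMBS stages to reuse the adders of the B-TRANS stages.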
For [22]-[24], we refer to the MFG as the DM-MFG, which also consists of two parts: B-TRANS and DM-CAL. The B-TRANS part of the DM-MFG is the same as that of the SR-MFG. However, there is only one stage in the DM-CAL part of the DM-MFG, which performs the direct-mapping calculation.
For example, as shown in Fig. 2, the SR-MFG of a four-bit symbol-decision SC algorithm for a polar code with N = 8 has four stages. Messages of the initial stage ($S_0$) come from the channel directly. Messages of the first stage ($S_1$) are calculated with Arikan's transformations. Messages of the second and third stages ($S_2$ and $S_3$) are calculated with Eq. (9). Stages in the left gray box are the S-COMBS stages. Stages in the right gray box are the B-TRANS stages. Fig. 3 shows the DM-MFG when the direct-mapping calculation is used to calculate the symbol-based channel transition probability.
Fig. 2. The message flow graph of a four-bit symbol-decision SC algorithm for a polar code with a code length of eight by using the proposed symbol-based recursive channel combination.
For the direct-mapping calculation, Eq. (7) M +i−n messages. One addition is needed to compute each LL message according to Eq. (9) . Hence, the number of additions needed by the S-COMBS stages to calculate W
In a hardware implementation, the worst case, in which all bits of a symbol are information bits, must be considered. Therefore, the recursive symbol-based channel combination can be used to reduce the complexity of calculating the symbol-based channel transition probability.
For the example shown in Fig. 2, Eq. (7) needs $2^4(4-1) = 48$ additions to calculate the symbol-based LL messages. Table I lists the numbers of additions needed by our recursive method and the direct-mapping calculation [22]-[24] when all M bits of a symbol are information bits. When M = 8, the number of additions needed by our proposed method is 17% of that needed by the direct-mapping calculation. The other advantage of the proposed method to calculate the symbol-based channel transition probability is that it reveals the similarity between Arikan's recursive channel transformation and the symbol-based recursive channel combination. We will take advantage of this similarity to reuse adders and to save area when computing the bit- and symbol-based channel transition probabilities in our proposed architecture. In [24], additional dedicated adders are used to calculate the symbol-based channel transition probability, which is not area efficient.
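The addition counts can be reproduced with a short script. The direct-mapping count $2^M(M-1)$ follows the example in the text. The count for the recursive method is our reading of the S-COMBS structure: the stage that forms $2^i$-bit symbols has $M/2^i$ nodes, each producing $2^{2^i}$ messages with one addition per message. This is an assumption rather than a formula taken from the paper, although it does reproduce the 17% figure quoted for M = 8.

```python
def direct_mapping_additions(symbol_bits: int) -> int:
    """Additions for the direct-mapping calculation: 2^M * (M - 1)."""
    return 2 ** symbol_bits * (symbol_bits - 1)


def recursive_additions(symbol_bits: int) -> int:
    """Additions for the S-COMBS stages under the per-stage model stated above (assumed)."""
    m = symbol_bits.bit_length() - 1  # M = 2^m
    return sum(2 ** (m - i) * 2 ** (2 ** i) for i in range(1, m + 1))


if __name__ == "__main__":
    for M in (2, 4, 8):
        direct = direct_mapping_additions(M)
        recursive = recursive_additions(M)
        print(f"M={M}: direct={direct}, recursive={recursive}, "
              f"ratio={recursive / direct:.2f}")
```

Under this model the saving grows with the symbol size, because the direct-mapping cost scales with M − 1 additions per symbol value while the recursive cost is dominated by the single addition per symbol value in the last stage.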
In terms of the error performance, the symbol-decision SC algorithm is not worse than the bit-decision SC algorithm [27]. Fig. 4 shows the BERs and FERs of symbol-decision SC algorithms for a (1024, 512) polar code. SDSC-i denotes the i-bit symbol-decision SC algorithm. When M = 2 and 4, the FER performance is the same as that of the bit-decision SC algorithm. When M = 8, the FER performance is slightly better.
C. Generalized Symbol-Decision SCL Decoding Algorithm
Similarly, the symbol-based recursive channel combination is also useful for the SCL algorithm. The symbol-decision SCL algorithm is more complicated than the SCL algorithm, since the path expansion coefficient is no longer a constant. In the SCL algorithm, the path expansion coefficient is two for each information bit. For the M-bit symbol-decision SCL algorithm, however, the path expansion coefficient is $2^{|\mathcal{AM}_j|}$, which depends on the number of information bits in an M-bit symbol. The M-bit symbol-decision SCL algorithm is formally described in Alg. 3. Without any ambiguity, 0 represents a zero vector whose bit-width is determined by the left-hand operator. The function dec2bin(d, b) converts a decimal number d to a b-bit binary vector. Eq. (9) is used to calculate the symbol-based channel transition probability corresponding to each list, i.e., $W_{N,M}^{(j+1)}$.
$u_{\mathcal{AM}_j}$ = dec2bin(k, $|\mathcal{AM}_j|$);
D. Two-Stage List Pruning Network for Symbol-Decision SCL algorithm
For the M-bit symbol-decision SCL algorithm, the maximum path expansion coefficient is $2^M$, i.e., each existing path generates $2^M$ paths. Therefore, in the worst-case scenario, the L most reliable paths should be selected out of $2^M L$ paths. To simplify this sorting, we propose a two-stage list pruning network. In the first stage, the q most reliable paths are selected from the up to $2^M$ paths that come from the expansion of each existing path. Therefore, qL paths are left. In the second stage, the L most reliable paths are selected from the qL paths generated by the first stage. The message flow of a two-stage list pruning network is illustrated in Fig. 6. If q ≥ L, the L paths found by the two-stage list pruning network are exactly the L most reliable paths among the $2^M L$ paths. When q < L, the probability that the L paths found by the two-stage list pruning network are exactly the L most reliable paths among the $2^M L$ paths decreases as q decreases. This may cause some performance loss, but a smaller q leads to a two-stage list pruning network with lower complexity. Therefore, the two-stage list pruning network uses an additional parameter q to introduce different trade-offs between error performance and complexity.
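The two-stage pruning can be sketched in software as follows, with an exhaustive selection included for comparison. This is a functional illustration only and says nothing about the hardware sorting network; the data layout (a list of (metric, symbol) pairs per existing path) and the function names are ours.

```python
import heapq


def two_stage_prune(children, list_size, q):
    """Stage 1: keep the q best children of each path. Stage 2: keep the L best overall.

    children[l] is a list of (metric, symbol) pairs for existing path l.
    Returns a list of (metric, parent_index, symbol), most reliable first.
    """
    survivors = []
    for parent, candidates in enumerate(children):
        for metric, symbol in heapq.nlargest(q, candidates, key=lambda c: c[0]):
            survivors.append((metric, parent, symbol))
    return heapq.nlargest(list_size, survivors, key=lambda c: c[0])


def exhaustive_prune(children, list_size):
    """Reference: pick the L best among all expanded paths at once."""
    flat = [(metric, parent, symbol)
            for parent, candidates in enumerate(children)
            for metric, symbol in candidates]
    return heapq.nlargest(list_size, flat, key=lambda c: c[0])


if __name__ == "__main__":
    # L = 2 existing paths, M = 2 bits per symbol, so 4 children per path.
    children = [
        [(-1.0, 0), (-1.2, 1), (-3.0, 2), (-9.0, 3)],
        [(-1.5, 0), (-8.0, 1), (-8.5, 2), (-9.5, 3)],
    ]
    print(exhaustive_prune(children, 2))
    print(two_stage_prune(children, 2, q=2))  # q >= L: same result
    print(two_stage_prune(children, 2, q=1))  # q < L: may differ
```

In this example the two most reliable candidates descend from the same parent, so q = 1 discards one of them in the first stage, which illustrates the possible performance loss when q < L.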
IV. PRE-COMPUTATION MEMORY-SAVING TECHNIQUE
The pre-computation technique was first proposed in [25] and can be used to improve the processing rate when the number of possible outputs is finite. In [17], the pre-computation technique is used to improve the throughput of the line SC decoder at the cost of increased area. Here, our main purpose is to use the pre-computation technique to reduce the memory required by list decoders, because the memory an SCL decoder needs to store the channel transition probabilities becomes a big challenge as the list size and code length increase. Henceforth, this memory-saving technique is called the pre-computation memory-saving (PCMS) technique. It is worth noting that this memory-saving technique is independent of the decoder architecture and the message representation of SCL decoders.
Let us take the MFG shown in Fig. 2 as an example. For stages S 0 and S 1 , the numbers of pairs of LLs stored by the list decoder are 8 and 4L, respectively. Actually, the outgoing message W Generally speaking, the PCMS technique takes advantage of the relationship between messages of S 0 (channel LLs), and outgoing messages of S 1 . By storing only all possible outgoing messages of S 1 , the PCMS technique helps list decoders save memory.
Let us evaluate the memory saving of the PCMS technique, assuming the LL representation is used for the channel transition probability. Without the PCMS technique, a list decoder with a list size of L for a polar code of length N stores $(N-2)L + N$ LL pairs. Each pair contains two messages, which are associated with the conditional bit being zero or one. The total number of bits used for LL storage is
where $Q_{ch}$ denotes the number of bits used for the quantization of the channel LLs.
With the PCMS technique, the total number of LL pairs needed by a list decoder is
The total number of bits needed for LL storage is:
Therefore, when the LL representation is used for messages, the PCMS technique saves $N(LQ_{ch} + L - Q_{ch} - 3)$ bits of memory. The saving is linear in both N and L. Consider a polar code with N = 1024 and a list decoder with L = 4 and $Q_{ch} = 4$. Without the PCMS technique, $B_{LL} = 57104$. With the PCMS technique, $B_{PCMS} = 43792$. The PCMS technique saves 13312 bits of memory, which is 23% of $B_{LL}$.
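The quoted memory figures can be reproduced with the following sketch. The storage model is an assumption of ours, not taken from the text: stage $S_i$ holds $N/2^i$ LL pairs per list entry, each message at stage $S_i$ uses $Q_{ch} + i$ bits, and with PCMS each of the N/2 nodes of $S_1$ stores three pre-computed pairs (one for the odd-indexed output and two for the even-indexed output, one per value of the preceding bit) in place of the channel LLs. Under this model the script returns exactly the numbers stated above.

```python
def ll_bits_without_pcms(code_length: int, list_size: int, q_ch: int) -> int:
    """LL storage in bits without PCMS, under the assumed per-stage word lengths."""
    n = code_length.bit_length() - 1
    channel = 2 * code_length * q_ch  # N pairs of channel LLs
    internal = sum(2 * list_size * (code_length >> i) * (q_ch + i)
                   for i in range(1, n))
    return channel + internal


def ll_bits_with_pcms(code_length: int, list_size: int, q_ch: int) -> int:
    """LL storage in bits with PCMS: S_0 is removed and S_1 is pre-computed and shared."""
    n = code_length.bit_length() - 1
    stage_one = 3 * code_length * (q_ch + 1)  # 3 pairs for each of the N/2 nodes
    internal = sum(2 * list_size * (code_length >> i) * (q_ch + i)
                   for i in range(2, n))
    return stage_one + internal


def pcms_saving_bits(code_length: int, list_size: int, q_ch: int) -> int:
    """Closed-form saving stated in the text: N * (L*Q_ch + L - Q_ch - 3)."""
    return code_length * (list_size * q_ch + list_size - q_ch - 3)


if __name__ == "__main__":
    without = ll_bits_without_pcms(1024, 4, 4)
    with_pcms = ll_bits_with_pcms(1024, 4, 4)
    print(without, with_pcms, without - with_pcms, pcms_saving_bits(1024, 4, 4))
```

The difference between the two storage functions agrees with the closed-form saving, which suggests that the assumed model is at least consistent with the figures in the text.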
The other advantage of the PCMS technique is that it improves the throughput slightly because the messages of $S_1$ are already in the memory and do not need to be calculated from the channel messages. For example, for a bit-decision semi-parallel SCL decoder with a list size of L, if the code length is N and the number of processing units is P, the latency saving due to the PCMS technique is
An MPU block calculates messages for the B-TRANS and S-COMBS stages and updates the partial-sum network by adopting blocks of the SCL decoder in [20]. The additions of the S-COMBS stages are carried out by reusing the same hardware resources that are used to calculate the messages of the B-TRANS stages, which reduces the area. Compared with the SCL decoder in [20], the MPU has neither a path pruning unit nor a CRC checker. The other improvement in the MPU is that the PCMS technique is used. The architecture of an MPU is shown in Fig. 10.
We take the MFG of Fig. 2 as an example to illustrate the function of block MSEL. For node f 21 of path l, {W
, belonging to path l. The detailed information of other blocks in Fig. 10 can be found in [20] and will not be discussed in this paper.
Message passing in the MFG of a polar code is serial: the calculation of a stage depends on the output of its previous stage. The PUs in [20] only carry out the B-TRANS additions. On the other hand, the S-COMBS stages need only additions, and a processing unit has four adders. Therefore, to save hardware resources, the adders in the processing units are reused to calculate the symbol-based channel transition probability after these processing units finish the calculations for the B-TRANS stages. In other words, the additions of both the B-TRANS and the S-COMBS stages are folded onto the same adders in the processing units. As shown in Fig. 11, c[0] and c[1] are outputs for the B-TRANS stages; d[0], d[1], d[2], and d[3] are outputs for the S-COMBS stages.
Block MBG provides a mask bit for each path. If there are f ($f \ge 0$) frozen bits in the M-bit symbol, the number of expanded paths will be $2^{M-f}$. For hardware implementations, we need to consider the worst case, so all messages corresponding to the $2^M$ possible paths are calculated. Each path is associated with a mask bit. When some paths are not needed due to frozen bits, they are turned off by their mask bits. Fig. 12 shows how to generate the mask bit for path i. When the frozen bits make it impossible for $u_{jM+1}^{jM+M}$ to be i, the message corresponding to $u_{jM+1}^{jM+M} = i$ is set to 0 in block MSNG.
Block LPN receives $2^M L$ messages from block MSNG, finds the L most reliable paths, and feeds the decision results back to the MPUs. Here, we use two different sorting implementations, a folded sorting implementation and a tree sorting implementation, for different designs. The basic unit for these two implementations is a bitonic sorter [28], which outputs the L maximum values out of 2L inputs. It is referred to as BS_L. The folded sorting implementation has a smaller area than the tree sorting implementation. However, pipelining can be applied to the tree sorting implementation by inserting registers between layers to improve its throughput.
For the two-stage list pruning network proposed in Sec. III-D, either the folded sorting implementation or the tree sorting implementation can be used for the $2^M$-to-q sorting function and the qL-to-L sorting function.
Block CNTL provides control signals to schedule the hardware sharing for MPUs and decides when to start pruning paths. The signal frz_flag is an indicator which is one when a frozen vector appears. When frz_flag is one, all MPUs use zero to update the partial sums instead of the outputs of the LPN. In this case, the LPN, the MSNG, and the calculation of the S-COMBS stages are bypassed. The OLG stores the output paths. The CRCC checks if a path satisfies the CRC constraint.
B. Message Scheduling and Latency Analysis
To improve area efficiency, different scheduling schemes are needed for different numbers of PUs. To reuse the adders of the processing units, the additions of the S-COMBS stages in the MFG must be scheduled properly. Assume the number of processing units is P. The total number of adders provided by the processing units is 4P. If $2^M L \le 4P$, we use a serial scheduling, which means that the processing units and the LPN do not overlap in operation time, as shown in Fig. 15.
[Fig. 15: serial scheduling. Processing units: B-TRANS ($S_1, S_2, \cdots, S_{n-m}$), then S-TRANS ($S_{n-m+1}, S_{n-m+2}, \cdots, S_n$); then LPN.]
Suppose each addition takes one clock cycle. Then each S-COMBS stage takes one clock cycle to compute messages. Therefore, it takes m clock cycles for the S-COMBS stages to output messages to the LPN. To save area, the folded sorting implementation is applied for the serial scheduling.
When $2^M L > 4P$, $\lceil 2^M L / (4P) \rceil$ clock cycles are needed. In each cycle, 4P messages are calculated. To reduce the latency, the overlapping scheduling shown in Fig. 16 is used. In clock cycle $c_0$, the first 4P messages come out. In clock cycle $c_1$, the LPN starts to work. Therefore, the MPUs and the LPN are working simultaneously for
Here, the LPN works in a pipelined fashion. Hence, the tree sorting implementation is deployed for the overlapping scheduling, and a BS_L is connected at the end of the tree sorting implementation as shown in Fig. 17, where the number on a line represents the number of messages transmitted through the line.
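The choice between the two scheduling schemes can be summarized in a small helper. The rule (serial if $2^M L \le 4P$, overlapping otherwise) follows the text. The second return value, the number of clock cycles the 4P adders need to produce the $2^M L$ messages of the last S-COMBS stage, is our interpretation of the statement that 4P messages are calculated per cycle. The function name is hypothetical.

```python
import math


def choose_schedule(symbol_bits: int, list_size: int, num_pus: int):
    """Return (scheme, cycles for the final S-COMBS stage) for an M-bit SDSCL decoder."""
    messages = (2 ** symbol_bits) * list_size  # worst case: all bits are information bits
    adders = 4 * num_pus                       # four adders per processing unit
    if messages <= adders:
        return "serial", 1
    return "overlapping", math.ceil(messages / adders)


if __name__ == "__main__":
    for M in (2, 4, 8):
        print(M, choose_schedule(M, list_size=4, num_pus=64))
```

With L = 4 and 64 processing units, the helper selects the serial scheduling for M = 2 and M = 4 and the overlapping scheduling for M = 8, in line with the configurations reported for the implementations.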
[Fig. 16: overlapping scheduling. Processing units: B-TRANS, then S-TRANS ($S_{n-m+1}, S_{n-m+2}, \cdots, S_n$); LPN. Legend: clock cycles when the processing units are busy; clock cycles when the LPN is busy; clock cycles when both the processing units and the LPN are busy.]
The latency of an M-bit symbol-decision SCL decoder consists of: the latency for calculating messages of the B-TRANS stages, the latency for calculating messages of the
where the third term, − N L/M P/M , is the latency saving by using the PCMS technique. $T_S$ represents the number of clock cycles for the calculations of the S-COMBS stages per symbol. $T_N$ represents the number of extra clock cycles per symbol needed by the LPN to finish the list pruning after all messages of the stage $S_n$ are calculated. If $2^M L \le 4P$, the number of clock cycles used to calculate messages for the S-COMBS stages is $T_S = m$.
$T_N$ is determined by the detailed implementation. Hence, the latency of the symbol-decision SCL decoder is:
where γ is the ratio of the number of frozen vectors to $N/M$. Table II shows the latencies (in clock cycles) for different decoders to decode a (1024, 480) CRC32-concatenated polar code with 64 processing units and L = 4. We assume a BS_L needs one clock cycle to find the four maximum values out of eight values. For M = 2 and M = 4, a folded sorting implementation and the serial scheduling are used. For M = 8, a pipelined tree sorting implementation and the overlapping scheduling are applied. For M = 8 and q = 2, the basic unit in the tree sorting implementation finds the two maximum values out of eight values, which needs one clock cycle. Therefore, $T_N = 4$ when M = 8 and q = 2. It is claimed in [22] that the M-bit SDSCL decoder could be M times faster than the bit-decision SCL decoder, which is much better than our implementation results. Let us review Eq. (12)
To be exactly
should be satisfied, which means that the calculation of the symbol-based channel transition probability and the list pruning procedure do not take any clock cycles. This is impractical, because $T_S$ and $T_N$ cannot be zero in a practical design. If $NL < 8P$ and $P \le NL$, to achieve M times faster,
. Therefore, the decoding speed gain claimed in [22] is too idealistic to be achieved in practice, because a practical implementation needs extra cycles to calculate the symbol-based channel transition probability and to perform the list pruning.
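The argument above can be illustrated with a toy latency model. This is a sketch under simplifying assumptions and is not Eq. (12): the function name `toy_speedup`, the parameter `bit_latency`, and the numeric values are hypothetical. The model assumes that the symbol-decision decoder needs 1/M of the bit-decision latency plus T_S + T_N extra cycles for each of the N/M symbols, and it ignores frozen vectors and the PCMS saving.

```python
def toy_speedup(n, m, bit_latency, t_s, t_n):
    """Speedup of an M-bit symbol-decision decoder under a toy model.

    Assumption (not Eq. (12)): the symbol-decision latency equals
    bit_latency / m plus (t_s + t_n) extra cycles per symbol,
    with n / m symbols per codeword.
    """
    symbol_latency = bit_latency / m + (n / m) * (t_s + t_n)
    return bit_latency / symbol_latency


# Hypothetical numbers, chosen only for illustration.
ideal = toy_speedup(n=1024, m=8, bit_latency=4096, t_s=0, t_n=0)
practical = toy_speedup(n=1024, m=8, bit_latency=4096, t_s=1, t_n=4)
print(ideal, practical)
```

In this model the speedup equals M only when T_S = T_N = 0. Any nonzero per-symbol overhead pushes it below M, which matches the qualitative conclusion above.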
C. Synthesis results
To implement the proposed symbol-decision SCL decoder, we consider only M = 2, 4, and 8. For M ≥ 16, it is impractical to build list pruning networks. For example, in the worst case for M = 16, where all bits of a symbol are information bits, there are 2^16 L = 65536L candidate paths. Even if L = 1, finding the maximum value among 65536 values still needs a huge amount of hardware resources and leads to a huge latency.
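The growth of the worst-case candidate count with the symbol size can be checked with a short calculation. The helper name `candidate_paths` is hypothetical, and the sketch only evaluates the expression 2^M L given above.

```python
def candidate_paths(symbol_bits: int, list_size: int) -> int:
    """Worst-case number of candidate paths, 2^M * L, when all M bits
    of a symbol are information bits."""
    return (2 ** symbol_bits) * list_size


# L = 4 is the list size used in the implementations of this paper.
for m in (2, 4, 8, 16):
    print(f"M = {m:2d}: {candidate_paths(m, 4)} candidates")
```

The count grows exponentially in M, so the list pruning network has to handle 1024 candidates for M = 8 and L = 4, but 262144 for M = 16 and L = 4.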
In our implementations, L = 4. Each implementation has 64 processing units. LL messages are used in our designs. The channel LL messages are quantized with 4 bits. A (1024, 480) CRC32-concatenated polar code is used. The synthesis tool is Cadence RTL Compiler. The process technology is TSMC 90nm CMOS technology. Our proposed architectures are compared with the state-of-the-art SCL architectures in [19]-[21], [24], which cover both bit- and symbol-decision algorithms. The synthesis results in [21] and [20] are also based on a TSMC 90nm CMOS technology. The original synthesis results of [19] and [24] are based on UMC 90nm and ST 65nm CMOS technologies, respectively.
The synthesis results shown in Table III demonstrate that our symbol-decision SCL polar decoders have higher area efficiencies than the SCL decoders in [19], [20], [24], and [21]. The SCL decoders in [19], [21], [24] have higher clock rates than our designs because they use registers as storage units, whereas our designs use register files.
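For readers comparing entries of this kind, the sketch below shows how throughput and area efficiency are commonly derived from the clock frequency, the latency in clock cycles, and the area. It assumes that throughput is counted in coded bits, which may differ from the convention used in Table III. The function names and the numbers are hypothetical and are not taken from Table III.

```python
def throughput_mbps(code_length, clock_mhz, latency_cycles):
    """Coded throughput in Mb/s: N bits are decoded every latency_cycles
    clock cycles at clock_mhz MHz (assumed convention)."""
    return code_length * clock_mhz / latency_cycles


def area_efficiency(throughput, area_mm2):
    """Area efficiency in Mb/s per mm^2."""
    return throughput / area_mm2


# Hypothetical values, for illustration only.
tp = throughput_mbps(code_length=1024, clock_mhz=400, latency_cycles=1000)
print(tp, area_efficiency(tp, area_mm2=2.0))
```

Under this convention, a design with a lower clock rate can still reach a higher area efficiency if its latency in cycles or its area is sufficiently smaller.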
The SDSCL-8 decoders provide a higher throughput and a smaller latency than the SDSCL-2 and SDSCL-4 decoders, but occupy larger areas. However, the improvements in throughput and latency are not linear in the symbol size.
Compared with the SCL decoder in [20], the area increase of the symbol-decision SCL decoders is mainly due to the sorting networks, because the adders of the processing units are reused to calculate both the bit- and symbol-based channel transition probabilities. For the SDSCL-4 decoder, the sorting network occupies only 0.073 mm^2, so there is no need to shrink q further. For the SDSCL-8 decoders, when q = 4, the area of the sorting network is 0.454 mm^2. However, when q = 2, the sorting network occupies 0.196 mm^2, which is less than half of that for q = 4. A smaller q does help the SDSCL-8 decoder achieve a higher throughput, a smaller latency, a smaller area, and a higher area efficiency, but it also introduces an FER performance loss of 0.25 dB at an FER level of 10^-3, as shown in Fig. 7.
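The trade-off controlled by q can be illustrated with a small software model of two-stage list pruning. This is a sketch, not the hardware sorting network: it assumes that the first stage keeps the q best of the 2^M extensions of each surviving path and that the second stage keeps the L best of the resulting qL candidates. The function names and the metric values are hypothetical, and larger LL metrics are assumed to be better.

```python
import heapq


def two_stage_prune(metrics, list_size, q):
    """metrics[i][s] is the metric of extending surviving path i with
    symbol value s (larger is better, an assumption of this sketch)."""
    stage1 = []
    for path, row in enumerate(metrics):
        # Stage 1: keep the q best symbol values of this path.
        best = heapq.nlargest(q, range(len(row)), key=row.__getitem__)
        stage1.extend((row[s], path, s) for s in best)
    # Stage 2: keep the L best of at most q * L candidates.
    return heapq.nlargest(list_size, stage1)


def full_prune(metrics, list_size):
    """Reference: pick the L best of all 2^M * L candidates directly."""
    candidates = [(v, p, s) for p, row in enumerate(metrics)
                  for s, v in enumerate(row)]
    return heapq.nlargest(list_size, candidates)


# Hypothetical metrics for L = 2 surviving paths and M = 2 (4 symbol values).
metrics = [[-1.0, -2.0, -3.0, -9.0],
           [-8.0, -7.0, -9.5, -10.0]]
print(full_prune(metrics, 2))          # both survivors extend path 0
print(two_stage_prune(metrics, 2, 2))  # q = L: same result
print(two_stage_prune(metrics, 2, 1))  # q < L: second-best candidate is lost
```

Barring ties, the two-stage result matches the exhaustive selection whenever q ≥ L, because at most L of the L best candidates can share a parent path. With q < L, a good candidate can be discarded in the first stage, which is consistent with the FER loss reported for q = 2.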
Moreover, we also provide synthesis results for SDSCL-8 decoders without the PCMS technique. The PCMS technique helps the SDSCL-8 decoders gain an area saving of about 0.12 mm^2.
As mentioned above, LL messages are used in our designs. If LLR messages [21] are used, symbol-decision SCL decoders can have better area efficiencies than our current designs, because the memory requirement for LLR messages is smaller than that for LL messages [21].
VI. CONCLUSION
In this paper, we use the symbol-based recursive channel combination to calculate the symbol-based channel transition probability. We show that, based on the LL representation of the transition probability, this recursive procedure needs fewer additions than the method used in [22], [24]. Furthermore, a two-stage list pruning network is proposed to simplify the L-path finding problem. We use the PCMS technique to reduce the memory requirement for list decoders. By applying the PCMS technique, we design an efficient architecture for symbol-decision SCL decoders. Specifically, we introduce two scheduling schemes to perform the hardware sharing. A folded sorting implementation and a tree sorting implementation are also discussed. We also implement symbol-decision SCL polar decoders for two-bit, four-bit, and eight-bit symbols with a list size of four. Our synthesis results show that symbol-decision SCL polar decoders outperform existing SCL polar decoders in terms of area efficiency. Our proposed methods and architecture provide a range of tradeoffs between area, throughput, and area efficiency.
ACKNOWLEDGMENT
We would like to thank the authors of [19] for providing the synthesis results using the TSMC 90nm technology in Table III.
Because all bits of u are independent and each bit has an equal probability of being a 0 or 1, Then, by equations (15) ∼ (18), Eq. (9) is obtained.
