We present an LLR-based implementation of the successive cancellation list (SCL) decoder. To this end, we associate each decoding path with a metric which (i) is a monotone function of the path's likelihood and (ii) can be computed efficiently from the channel LLRs. The LLR-based formulation leads to a more efficient hardware implementation of the decoder compared to the known log-likelihood based implementation. Synthesis results for an SCL decoder with block-length of N = 1024 and list sizes of L = 2 and L = 4 confirm that the LLR-based decoder has considerable area and operating frequency advantages, on the order of 50% and 30%, respectively.
INTRODUCTION
Polar codes are a class of capacity-achieving error correcting codes with low-complexity encoding and decoding algorithms [1]. Specifically, the successive cancellation (SC) decoder has a structured nature which makes its hardware implementation attractive [2, 3, 4]. Moreover, the SC decoder can be implemented in a numerically accurate and stable way by representing the involved transition probabilities as log-likelihood ratios (LLRs).
Successive cancellation list (SCL) decoding was introduced in [5] to improve the finite block-length performance of polar codes. However, the original description of the SCL decoder is given in terms of the path likelihoods, which are not suitable for practical implementation. Hence, the first hardware architecture for SCL decoding [6] uses log-likelihoods (LLs) to partially overcome the numerical stability issues. Unfortunately, the decoder still requires very large and irregular memory elements and processing elements that support large bit-widths, which induce a high cost in terms of hardware resources.
Contribution and Paper Outline: In this paper, we propose an LLR-based formulation of the SCL decoder and show that such a formulation can significantly improve the hardware architecture of [6] by solving all the aforementioned implementation problems. In Section 2, after a brief review of the SCL decoding algorithm, we introduce a path-metric which is iteratively updated as a function of the LLR of the bit being decoded given the past trajectory of the decoding path and its (tentative) value. We then prove that this metric is a monotone function of the path's likelihood, which yields the LLR-based formulation of the SCL decoder. In Section 3, we provide a short review of the SCL decoder hardware architecture of [6] and compare the synthesis results for the LL- and LLR-based decoders.
Notation: Throughout this paper the boldface letters denote vectors. The elements of a vector $\mathbf{x}$ are denoted as $x_i$. By $\mathbf{x}^m$ we mean the sub-vector $[x_0, x_1, \dots, x_m]^T$ if $m \ge 0$ and the null vector otherwise. If $\mathcal{I} = \{i_1, i_2, \dots\}$ is a set of indices (such that $i_1 < i_2 < \dots$), then $\mathbf{x}_{\mathcal{I}}$ denotes the sub-vector $[x_{i_1}, x_{i_2}, \dots]^T$.
SC LIST DECODING OF POLAR CODES
A polar code of rate $R < 1$ and block length $N = 2^n$ is constructed by 'appropriately' choosing a subset $\mathcal{A} \subset \{0, 1, \dots, N-1\}$ of cardinality $|\mathcal{A}| = NR$ called the information indices. The transmitter then constructs the vector $\mathbf{u} \in \{0, 1\}^N$ by putting the $NR$ data bits on $\mathbf{u}_{\mathcal{A}}$ and fixing $\mathbf{u}_{\mathcal{F}}$, where $\mathcal{F} \triangleq \{0, 1, \dots, N-1\} \setminus \mathcal{A}$, to frozen bits known to the receiver. Subsequently, a codeword $\mathbf{x} = \mathbf{G}\mathbf{u}$ is computed and sent over the channel. The receiver observes a noisy version of $\mathbf{x}$, denoted as $\mathbf{y}$, and has to decode $\mathbf{x}$ or, equivalently, $\mathbf{u}$. Since the sub-vector $\mathbf{u}_{\mathcal{F}}$ is already known, the decoder's task reduces to estimating $\mathbf{u}_{\mathcal{A}}$. To this end, Arıkan proposes the SC decoding procedure summarized in Algorithm 1. In line 5 of Algorithm 1, $W_n^{(i)}(\mathbf{y}, \mathbf{u}^{i-1} \mid u_i)$ represents the likelihood of $u_i$ given the channel output $\mathbf{y}$ and $\mathbf{u}^{i-1}$, considering $u_{i+1}, u_{i+2}, \dots, u_{N-1}$ as unknown bits.
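The construction of $\mathbf{u}$ and the codeword $\mathbf{x} = \mathbf{G}\mathbf{u}$ described above can be sketched as follows. This is a minimal illustration only: the helper names `build_u` and `polar_transform` are hypothetical, the frozen bits are assumed to be zero, and $\mathbf{G}$ is assumed to be the $n$-fold Kronecker power of the kernel [[1, 1], [0, 1]] without a bit-reversal permutation, which the text does not specify.

```python
def build_u(data_bits, info_indices, block_length, frozen_value=0):
    """Place the data bits on the information indices; freeze the rest.

    Assumption: all frozen bits share the same known value (0 by default).
    """
    u = [frozen_value] * block_length
    for bit, idx in zip(data_bits, sorted(info_indices)):
        u[idx] = bit
    return u


def polar_transform(u):
    """Compute x = G u over GF(2) with butterfly stages.

    Assumption: G is the Kronecker power of [[1, 1], [0, 1]] (no bit reversal),
    so each stage computes x[k] = x[k] XOR x[k + step].
    """
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]
        step *= 2
    return x


# Toy example with N = 4 and a hypothetical information set {1, 3}.
u = build_u([1, 1], {1, 3}, 4)
x = polar_transform(u)
```

Under this assumption the transform is its own inverse over GF(2), so applying `polar_transform` twice returns the original vector.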
Algorithm 1: Successive Cancellation Decoding [1].
SC decoding is sub-optimal since at each step $i \in \mathcal{A}$ the decoder ignores the information it possesses about the future frozen bits $\{u_j : j > i, j \in \mathcal{F}\}$. In return for this sub-optimality, the likelihoods $W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1} \mid u_i)$ can be computed efficiently and the decoding complexity scales like $O(N \log N)$. Furthermore, Arıkan shows that, as long as the code rate is below the channel capacity, there exists $\mathcal{A} \subset \{0, 1, \dots, N-1\}$ for which the block-error probability of this scheme vanishes as $N$ increases [1, 7].
SC List Decoding
Unfortunately, the sub-optimality of SC decoding is still significant at the small-to-moderate block-lengths used in practice. A successive cancellation list (SCL) decoder has been proposed in [5] to partially compensate for this sub-optimality.
In short, SCL decoding is carried out as follows: at each decoding step i ∈ A, instead of fixing a decision on ui (line 5 of Algorithm 1), two decoding paths corresponding to either possible value of ui are created and decoding is continued in two parallel decoding threads. In order to avoid the exponential growth of the number of decoding paths, at each step only the L most likely paths are retained. Finally, the decoder ends up with a list of L candidates for uA, out of which the most likely one is declared as the final estimate, ûA. This procedure is described in Algorithm 2.
Algorithm 2: SC List Decoding [5].
Simulation results show that, with relatively small list sizes, the SCL decoder's performance is very close to that of the optimal ML decoder. More importantly, with a clever choice of the data structures, SCL decoding can be done in O(L N log N) time complexity [5].
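The fork-and-prune step at the heart of SCL decoding (duplicate every path for both values of ui, then keep the L most likely) can be sketched as below. This is an illustration under stated assumptions, not the data structure of [5]: `fork_and_prune` and `toy_likelihood` are hypothetical names, and the per-bit probabilities in the toy likelihood are made-up numbers standing in for the true path likelihoods.

```python
def fork_and_prune(paths, likelihood_fn, list_size):
    """Extend every path by 0 and by 1, then keep the most likely extensions.

    `likelihood_fn` is a hypothetical callback returning the likelihood of a
    (partial) path given as a list of bits.
    """
    candidates = []
    for bits in paths:
        for u_i in (0, 1):
            extended = bits + [u_i]
            candidates.append((likelihood_fn(extended), extended))
    # Sort by decreasing likelihood and retain at most `list_size` paths.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [extended for _, extended in candidates[:list_size]]


# Toy likelihood: independent bits with made-up probabilities of being 0.
PROB_ZERO = [0.9, 0.6, 0.7]


def toy_likelihood(bits):
    value = 1.0
    for j, b in enumerate(bits):
        value *= PROB_ZERO[j] if b == 0 else 1.0 - PROB_ZERO[j]
    return value


paths = [[]]
for _ in range(2):
    paths = fork_and_prune(paths, toy_likelihood, list_size=2)
# After two steps the survivors are [0, 0] and [0, 1].
```

A practical decoder would only fork at information indices and would extend each path with the known value at frozen indices; that distinction is omitted here for brevity.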
LLR-based Path Metric Computation
Algorithms 1 and 2 are both valid high-level descriptions of SC and SCL decoding, respectively. However, implementing the decoders using the likelihoods directly is risky as likelihoods become very small numbers and the decoder will be prone to underflow errors.
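The underflow risk is easy to reproduce in double-precision arithmetic. The sketch below multiplies 1024 per-step probabilities directly and then accumulates their logarithms instead; the value 0.01 per step is an arbitrary illustrative number, not taken from any decoder.

```python
import math


def path_likelihood_direct(step_probs):
    """Multiply the per-step probabilities directly (prone to underflow)."""
    product = 1.0
    for p in step_probs:
        product *= p
    return product


def path_log_likelihood(step_probs):
    """Accumulate logarithms instead of multiplying probabilities."""
    return sum(math.log(p) for p in step_probs)


step_probs = [1e-2] * 1024  # illustrative values only
print(path_likelihood_direct(step_probs))  # 0.0: the product underflows
print(path_log_likelihood(step_probs))     # a finite negative number
```

Once every candidate path evaluates to 0.0, the paths can no longer be ranked, whereas the log-domain values remain distinct and comparable.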
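A practical SC decoder, therefore, receives the channel output in the form of LLRs ($\ln \frac{W(y_i \mid 0)}{W(y_i \mid 1)}$, $i \in \{0, 1, \dots, N-1\}$, where $W(y \mid x)$, $x \in \{0, 1\}$, is the channel transition probability) and computes the decision LLRs,
$$L_n^{(i)} \triangleq \ln \frac{W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1} \mid 0)}{W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1} \mid 1)}, \quad i \in \{0, 1, \dots, N-1\},$$
which are sufficient statistics for the decisions in line 5 of Algorithm 1. Furthermore, the computations are numerically stable and the decoding involves $O(N \log N)$ arithmetic operations in total [1, Section VIII].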
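To make the LLR-based SC recursion concrete, the following is a minimal recursive sketch. It makes several assumptions that the text does not fix: the generator matrix is the Kronecker power of [[1, 1], [0, 1]] without bit reversal, frozen bits are zero, ties at an LLR of exactly zero are resolved towards 0, and the check-node update uses the min-sum approximation rather than the exact rule. The function names are hypothetical.

```python
def polar_transform(u):
    """x = G u over GF(2); assumes G = [[1, 1], [0, 1]] Kronecker power."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]
        step *= 2
    return x


def f_min_sum(a, b):
    """Min-sum approximation of the check-node LLR update."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, partial_sum):
    """LLR update once the partial sum of the upper branch is known."""
    return b + (1 - 2 * partial_sum) * a


def sc_decode(llr, frozen):
    """Return (u_hat, x_hat) for channel LLRs `llr` and a frozen-bit mask."""
    if len(llr) == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    half = len(llr) // 2
    upper, lower = llr[:half], llr[half:]
    # Decode the first half of u from the combined LLRs.
    u_a, x_a = sc_decode([f_min_sum(p, q) for p, q in zip(upper, lower)],
                         frozen[:half])
    # Decode the second half using the re-encoded first half as partial sums.
    u_b, x_b = sc_decode([g(p, q, s) for p, q, s in zip(upper, lower, x_a)],
                         frozen[half:])
    x_hat = [s ^ t for s, t in zip(x_a, x_b)] + x_b
    return u_a + u_b, x_hat


# Noiseless toy example with N = 8 and a hypothetical frozen set {0, 1, 2, 4}.
frozen = [True, True, True, False, True, False, False, False]
u = [0, 0, 0, 1, 0, 1, 0, 1]
x = polar_transform(u)
channel_llr = [4.0 if bit == 0 else -4.0 for bit in x]
u_hat, x_hat = sc_decode(channel_llr, frozen)
```

The recursion touches each of the n stages once per bit position, which reflects the O(N log N) operation count quoted above.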
For the SCL decoder, however, the LLRs seem to be insufficient for choosing the L most likely paths in line 8 of Algorithm 2. In [5] the decoder, therefore, computes a scaled version of the pair of likelihoods $W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1} \mid u_i)$, $u_i \in \{0, 1\}$, assuming the channel output is provided in the form of pairs of likelihoods ($W(y_i \mid x_i)$, $x_i \in \{0, 1\}$, $i \in \{0, 1, \dots, N-1\}$). In order to avoid underflows, all likelihoods are scaled by a common factor at each intermediate step of the computations. This normalization step is circumvented in the hardware implementation of [6] by performing the computations in the log-likelihood domain.
Luckily, it turns out that the decoding paths can also be ordered according to their likelihoods using only the decision LLRs and the past trajectory of each path as we shall demonstrate in the following.
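Theorem 1. For each path $\ell \in \{0, 1, \dots, L-1\}$ and each step $i \in \{0, 1, \dots, N-1\}$ let the path-metric be defined as:
$$PM_\ell^{(i)} \triangleq \sum_{j=0}^{i} \ln\left(1 + e^{-(1 - 2\hat{u}_j[\ell]) \cdot L_n^{(j)}[\ell]}\right), \tag{1}$$
where
$$L_n^{(i)}[\ell] = \ln \frac{W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1}[\ell] \mid 0)}{W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1}[\ell] \mid 1)}$$
is the LLR of bit $u_i$ given the channel output $\mathbf{y}$ and the past trajectory of the path, $\hat{\mathbf{u}}^{i-1}[\ell]$.
If all the information bits are uniformly distributed in $\{0, 1\}$, for any pair of paths $\ell, \ell' \in \{0, 1, \dots, L-1\}$,
$$W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1}[\ell] \mid \hat{u}_i[\ell]) < W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1}[\ell'] \mid \hat{u}_i[\ell']) \quad \text{if and only if} \quad PM_\ell^{(i)} > PM_{\ell'}^{(i)}.$$
In view of Theorem 1, one can implement the SCL decoder using L parallel low-complexity and stable LLR-based SC decoders as the underlying building blocks and, in addition, keep track of L path-metrics. The metrics can be updated iteratively as the decoder proceeds according to
$$PM_\ell^{(i)} = PM_\ell^{(i-1)} + \ln\left(1 + e^{-(1 - 2\hat{u}_i[\ell]) \cdot L_n^{(i)}[\ell]}\right). \tag{2}$$
Any comparison of the likelihoods of the paths can be done equivalently using the values of the path-metrics.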
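Recall that the SC decoder's decisions (in line 5 of an LLR-based implementation of Algorithm 1) are taken as $\hat{u}_i = \delta(L_n^{(i)})$, where $\delta(x) \triangleq \frac{1}{2}(1 - \mathrm{sign}(x))$. Moreover, $\ln(1 + e^{x}) \approx 0$ for $x \ll 0$ and $\ln(1 + e^{x}) \approx x$ for $x \gg 0$, so the update rule (2) is well-approximated as
$$PM_\ell^{(i)} = \begin{cases} PM_\ell^{(i-1)} & \text{if } \hat{u}_i[\ell] = \delta(L_n^{(i)}[\ell]), \\ PM_\ell^{(i-1)} + |L_n^{(i)}[\ell]| & \text{otherwise.} \end{cases} \tag{3}$$
Hence, our metric has a natural interpretation: if at step $i$ the $\ell$-th path does not follow the direction suggested by $L_n^{(i)}[\ell]$, i.e., if $\hat{u}_i[\ell] \ne \delta(L_n^{(i)}[\ell])$, it will be penalized by an amount of $\approx |L_n^{(i)}[\ell]|$.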
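The exact update (2) and its hardware-friendly approximation (3) can be compared with a short sketch. The function names are hypothetical, and the tie-breaking choice (an LLR of exactly zero is decided as 0) is an assumption made for the illustration.

```python
import math


def hard_decision(llr):
    """SC hard decision; a zero LLR is resolved to 0 (an assumption)."""
    return 0 if llr >= 0 else 1


def pm_update_exact(pm_prev, llr, u_hat):
    """Update (2): add ln(1 + exp(-(1 - 2 u_hat) * llr)), computed stably."""
    x = -(1 - 2 * u_hat) * llr
    return pm_prev + max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pm_update_approx(pm_prev, llr, u_hat):
    """Update (3): penalize by |llr| only when the path disagrees with the LLR."""
    if u_hat == hard_decision(llr):
        return pm_prev
    return pm_prev + abs(llr)


# A path that disagrees with a fairly reliable LLR of 3.5 is penalized by
# roughly 3.5; a path that agrees is left (almost) unchanged.
disagree_exact = pm_update_exact(0.0, 3.5, 1)
disagree_approx = pm_update_approx(0.0, 3.5, 1)
agree_exact = pm_update_exact(0.0, 3.5, 0)
agree_approx = pm_update_approx(0.0, 3.5, 0)
```

The two rules differ by at most ln 2 per step, with the largest gap at an LLR of zero, where the decision is least reliable.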
We devote the rest of this section to proving Theorem 1.
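Lemma 2. If $U_i$ is uniformly distributed in $\{0, 1\}$, then,
$$\frac{W_n^{(i)}(\mathbf{y}, \mathbf{u}^{i-1} \mid u_i)}{P[\mathbf{Y} = \mathbf{y}]} = 2\, P[\mathbf{U}^i = \mathbf{u}^i \mid \mathbf{Y} = \mathbf{y}].$$
Proof. Since $P[U_i = u_i] = \frac{1}{2}$ for $\forall u_i \in \{0, 1\}$,
$$\frac{W_n^{(i)}(\mathbf{y}, \mathbf{u}^{i-1} \mid u_i)}{P[\mathbf{Y} = \mathbf{y}]} = \frac{P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^i = \mathbf{u}^i]}{P[U_i = u_i]\, P[\mathbf{Y} = \mathbf{y}]} = 2\, P[\mathbf{U}^i = \mathbf{u}^i \mid \mathbf{Y} = \mathbf{y}].$$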
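Proof of Theorem 1. We show that
$$PM_\ell^{(i)} = -\ln P[\mathbf{U}^i = \hat{\mathbf{u}}^i[\ell] \mid \mathbf{Y} = \mathbf{y}]. \tag{4}$$
Having shown (4), Theorem 1 will follow as an immediate corollary to Lemma 2 (since the channel output $\mathbf{y}$ is fixed for all decoding paths). Since the path index $\ell$ is fixed on both sides of (1) we will drop it in the sequel. Let $\mu(u) \triangleq 1 - 2u$ and
$$\Lambda_n^{(i)} \triangleq \frac{W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1} \mid 0)}{W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1} \mid 1)} = \frac{P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^{i-1} = \hat{\mathbf{u}}^{i-1}, U_i = 0]}{P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^{i-1} = \hat{\mathbf{u}}^{i-1}, U_i = 1]}$$
(the last equality follows since $P[U_i = 0] = P[U_i = 1]$), and observe that showing (4) is equivalent to proving
$$P[\mathbf{U}^i = \hat{\mathbf{u}}^i \mid \mathbf{Y} = \mathbf{y}] = \prod_{j=0}^{i} \left(1 + \left(\Lambda_n^{(j)}\right)^{-\mu(\hat{u}_j)}\right)^{-1}. \tag{5}$$
Now we have
$$P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^{i-1} = \hat{\mathbf{u}}^{i-1}] = \sum_{u_i \in \{0, 1\}} P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^{i-1} = \hat{\mathbf{u}}^{i-1}, U_i = u_i] = P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^i = \hat{\mathbf{u}}^i] \left(1 + \left(\Lambda_n^{(i)}\right)^{-\mu(\hat{u}_i)}\right).$$
Therefore,
$$P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^i = \hat{\mathbf{u}}^i] = \left(1 + \left(\Lambda_n^{(i)}\right)^{-\mu(\hat{u}_i)}\right)^{-1} P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^{i-1} = \hat{\mathbf{u}}^{i-1}]. \tag{6}$$
Repeated application of (6) (for $i-1, i-2, \dots, 0$) yields
$$P[\mathbf{Y} = \mathbf{y}, \mathbf{U}^i = \hat{\mathbf{u}}^i] = \prod_{j=0}^{i} \left(1 + \left(\Lambda_n^{(j)}\right)^{-\mu(\hat{u}_j)}\right)^{-1} P[\mathbf{Y} = \mathbf{y}].$$
Dividing both sides by $P[\mathbf{Y} = \mathbf{y}]$ proves (5).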
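The relation between the path metric and the path's posterior probability can also be checked numerically by brute force on a tiny code. The sketch below is an assumption-laden sanity check, not part of the proof: it uses N = 4, a binary symmetric channel with a hypothetical crossover probability of 0.1, treats all bits as uniform, and assumes a generator matrix equal to the Kronecker power of [[1, 1], [0, 1]].

```python
import itertools
import math


def polar_transform(u):
    """x = G u over GF(2); assumes G = [[1, 1], [0, 1]] Kronecker power."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]
        step *= 2
    return x


def bsc_likelihood(y, x, p):
    """P[Y = y | X = x] for a binary symmetric channel with crossover p."""
    flips = sum(a != b for a, b in zip(y, x))
    return p ** flips * (1 - p) ** (len(y) - flips)


def joint(y, prefix, p):
    """P[Y = y, U^i = prefix], summing over all unknown future bits."""
    total = 0.0
    for tail in itertools.product((0, 1), repeat=len(y) - len(prefix)):
        total += bsc_likelihood(y, polar_transform(list(prefix) + list(tail)), p)
    return total / 2 ** len(y)


def path_metric(y, path, p):
    """Accumulate ln(1 + exp(-(1 - 2 u_i) L_i)) along the path."""
    pm = 0.0
    for i, u_i in enumerate(path):
        prefix = list(path[:i])
        llr = math.log(joint(y, prefix + [0], p) / joint(y, prefix + [1], p))
        pm += math.log1p(math.exp(-(1 - 2 * u_i) * llr))
    return pm


def neg_log_posterior(y, path, p):
    """-ln P[U^i = path | Y = y], computed directly by enumeration."""
    return -math.log(joint(y, list(path), p) / joint(y, [], p))


y = [1, 0, 1, 1]
max_gap = max(abs(path_metric(y, path, 0.1) - neg_log_posterior(y, path, 0.1))
              for path in itertools.product((0, 1), repeat=4))
```

For every one of the 16 full-length paths the two quantities agree up to floating-point rounding, so ranking paths by increasing metric is the same as ranking them by decreasing posterior probability.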
SCL DECODER HARDWARE ARCHITECTURE
In the SCL decoder (SCLD) hardware architecture of [6] the SC decoder computations are implemented using pairs of log-likelihoods (LLs), $\ln W_n^{(i)}(\mathbf{y}, \mathbf{u}^{i-1} \mid u)$, $u \in \{0, 1\}$. LLs provide some numerical stability and reduce the dynamic range of the involved quantities so that a fixed-point implementation is feasible, although a large number of quantization bits is still required for good performance. In this section, we present area and timing results for an LLR-based implementation of [6] in order to highlight the significant area savings and throughput gains that can be achieved by exploiting Theorem 1.
LL-based SCLD Hardware Architecture
The LL-based SCLD hardware architecture presented in [6] mainly comprises three components, namely the metric computation unit (MCU), the path selection component, and the state memories component. The MCU consists of L arrays of processing elements (PEs), which implement L parallel LL-based SC decoders. The state memories store the LLs, the paths $\hat{\mathbf{u}}[\ell]$, as well as the partial sums required by the SC decoders. The path selection component is responsible for sorting the 2L LL values, $\ln W_n^{(i)}(\mathbf{y}, \hat{\mathbf{u}}^{i-1}[\ell] \mid u_i)$, $\forall \ell \in \{0, \dots, L-1\}$, $\forall u_i \in \{0, 1\}$, and choosing the L most likely paths to follow in each step.
For a low-complexity hardware implementation, the LLs used by the SC decoders have to be quantized. Specifically, in [6] the channel LLs are quantized using an unsigned fixed-point representation with $Q_i^{LL}$ integer and $Q_f^{LL}$ fractional bits. The LL-based SC update rules involve additions of LLs [6], which all have the same sign. Thus, in order to prevent catastrophic overflows, at each SC decoding stage the number of integer bits is increased by one. Therefore, the LLs at each stage $s = 0, \dots, n$ must be represented using $Q_i^{LL} + s$ integer bits. This necessity leads to large and very irregular LL memories, which are not well-suited for hardware implementation. Moreover, the PEs and the metric sorter need to support computations with the maximum bit-width, i.e., $Q^{LL} = Q_i^{LL} + Q_f^{LL} + n$ bits. The total number of bits required for storing the LLs in this scheme, Equation (7), is derived in [6].
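The growth of the LL word size across decoding stages can be tabulated with a few lines. This is only an illustration of the rule stated above (one extra integer bit per stage); the function name is hypothetical, and the split into 4 integer and 0 fractional bits for the channel LLs is an assumed example, chosen so that the widest word has 14 bits for n = 10.

```python
def ll_word_widths(q_int, q_frac, n_stages):
    """Word width (in bits) of the LLs at each stage s = 0, ..., n_stages."""
    return [q_int + s + q_frac for s in range(n_stages + 1)]


# N = 1024 gives n = 10 stages; q_int = 4 and q_frac = 0 are assumed values.
widths = ll_word_widths(q_int=4, q_frac=0, n_stages=10)
widest = max(widths)  # word size the PEs and the metric sorter must support
```

Because every stage needs a different word size, the LL memory cannot be built from a single regular array, which is the irregularity referred to above.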
LLR-based SCLD Hardware Architecture
For the LLR-based implementation, the MCUs are modified to implement the Min-Sum LLR-based SC update rules [3, Section VI]. Moreover, the path selection unit implements the approximated path-metric update rule of (3). Since the rule is iterative, the path selection unit contains a memory with L storage locations to store the path metrics $PM_\ell^{(i-1)}$. The LLRs are quantized using a signed fixed-point representation with $Q^{LLR}$ bits in total, of which $Q_i^{LLR}$ are integer and $Q_f^{LLR}$ are fractional bits, and each path metric is stored using $Q^M$ bits. Since the LLR-based SC update rules involve both additions and subtractions, the dynamic range of the LLRs used in the LLR-based SCLD is generally smaller than that of the LLs used in the LL-based SCLD. Thus, one intuitively expects overflows to happen less frequently and that there is no need to increase the word size by one bit per decoding stage. This intuition is confirmed by our simulation results. This leads to an LLR memory with a fixed word size, which is more suitable for hardware implementation than the irregular memory required by the LL-based decoder. We can guarantee that there will be no overflows in the path metrics by using an unsigned representation with $Q_i^{LLR} + n$ integer and $Q_f^{LLR}$ fractional bits, i.e., $Q_i^{LLR} + Q_f^{LLR} + n$ bits per path metric in total. It turns out that, in practice, far fewer bits are sufficient. Using an approach identical to that of [6, Section IV.C] one can verify that $(N + (N-1)L)\,Q^{LLR}$ bits in total are needed to store the LLRs. Adding the $L\,Q^M$ bits for storing the path metrics, we see that this implementation requires
$$(N + (N-1)L)\,Q^{LLR} + L\,Q^M \tag{8}$$
bits for the storage of LLRs and path metrics in total.
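The storage count for the LLR-based decoder is simple enough to evaluate directly. The helper below is a hypothetical convenience function that adds the LLR storage and the path-metric storage as described above; the example values N = 1024, a 6-bit LLR word, and an 8-bit path metric are those quoted for the memory comparison, and the printed numbers are computed here rather than copied from any table.

```python
def llr_scl_memory_bits(block_length, list_size, q_llr, q_metric):
    """Bits needed to store the LLRs and the path metrics."""
    llr_bits = (block_length + (block_length - 1) * list_size) * q_llr
    metric_bits = list_size * q_metric
    return llr_bits + metric_bits


for list_size in (2, 4):
    print(list_size, llr_scl_memory_bits(1024, list_size, 6, 8))
# prints:
# 2 18436
# 4 30728
```

The channel LLRs contribute a term that does not depend on L, while the remaining terms grow linearly with the list size.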
Implementation Results
In Fig. 1 the frame error rate (FER) of an LLR-based floating-point implementation of an SCL decoder with exact SC decoding and metric update rules is compared to that of a fixed-point implementation of an SCL decoder with approximated SC and metric update rules for an N = 1024 polar code of rate 1/2 over a BAWGN channel. The fixed-point LLR-based decoder uses Q … Q^M = 8 bits are sufficient; the simulated performance for Q^M = 8 and Q^M = 15 is the same (while setting Q^M = 7 degrades the performance). We observe that, using the aforementioned parameters, the performance loss of the fixed-point implementation with approximated update rules is minimal with respect to that of the floating-point implementation with exact update rules. Moreover, the FER of this fixed-point implementation can be seen to be exactly equal to that of an LL-based fixed-point implementation of SCL decoding in [6] … + 1 = 6 bits suffice. Moreover, the sorting metric in the former implementation is 14 bits wide, while in the latter it is only Q^M = 8 bits wide.
We have compared the memory requirements of the two implementations (Equations (7) and (8)) as a function of the list size L in Table 1. We observe that the LLR-based representation is advantageous in terms of the memory requirements, in particular as the list size increases. (Our simulation results confirm that the same quantization scheme leads to a fair comparison for L = 8 and L = 16 as well.)
Table 1. Memory requirement for LL- and LLR-based decoders for N = 1024, Q^LL = 4, Q^LLR = 6 and Q^M = 8.
In Table 2 we compare synthesis area results for LL- and LLR-based SCL decoders with L = 2 and L = 4. All designs were synthesized using the same UMC 90 nm library in the typical corner. We observe that by using LLRs the total area is reduced by 41% and 53% for an SCL decoder with L = 2 and L = 4, respectively. In absolute numbers, the largest gain in both cases comes from the reduced size of the state memories, which are the largest components of both the LL- and the LLR-based decoders. In relative numbers, the transition to LLRs is most beneficial for the MCU, which is reduced in size by approximately 60% in both cases. This number is in line with the approximately 60% reduction in bit-width of the quantities involved in the computations (i.e., from 14 bits to 6 bits). The path selection component benefits from the 40% bit-width reduction of the path metrics with an average area reduction of the same order.
Table 2. Cell area breakdown of the LL- and LLR-based SCL decoders for an N = 1024 polar code and that of the SC decoder implementation of [3] for comparison.
Cell Area | LLR-based | LL-based | Reduction
List Size L = 1 ([3, Table IV], scaled to 90 nm)
Total | 0.592 mm^2 | n/a | n/a
Memory | 0.554 mm^2 | n/a | n/a
MCU | 0.034 mm^2 | n/a | n/a
Other | 0.004 mm^2 | n/a | n/a
The corresponding post-synthesis timing results are presented in Table 3. We observe that for L = 2, the LLR-based decoder can achieve a 31% higher clock frequency than the corresponding LL-based decoder. This significant improvement comes not only from the bit-width reduction of the MCUs, but also from the greatly reduced size of the LLR storage memory. In the L = 4 LLR-based decoder, the signal path with the highest delay in the hardware goes through the metric sorter contained in the path selection component and not through the MCU. The 40% bit-width reduction of the metrics helps the comparators used in the radix-2L sorter [6], but it does not improve the logic gate tree that follows these comparators and combines their results. Thus, for L = 4 the increase in clock frequency is only 7%.
Table 3. Operating frequency results for the LL- and LLR-based SCL decoders for an N = 1024 polar code and that of the SC decoder implementation of [3] for comparison.
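The observation that narrower metrics speed up the comparators but not the logic that combines their results can be illustrated with a behavioural model of a fully parallel sorter. The sketch below is an assumption, not the circuit of [6]: it compares every pair of the 2L candidate metrics (the part whose delay depends on the bit-width) and then derives each candidate's rank from the comparison outcomes (the part that depends only on L). The function name and the tie-breaking rule are hypothetical.

```python
def select_survivors(metrics, list_size):
    """Return the indices of the `list_size` smallest metrics, best first.

    Smaller LLR-based path metrics correspond to more likely paths.
    """
    count = len(metrics)
    # Stage 1: all pairwise comparisons (done in parallel in hardware).
    # Ties are broken in favour of the lower index (an assumption).
    smaller = [[metrics[j] < metrics[i] or (metrics[j] == metrics[i] and j < i)
                for j in range(count)]
               for i in range(count)]
    # Stage 2: combine the comparison bits into a rank for each candidate.
    ranks = [sum(row) for row in smaller]
    survivors = [None] * list_size
    for index, rank in enumerate(ranks):
        if rank < list_size:
            survivors[rank] = index
    return survivors


# 2L = 8 candidate metrics for L = 4 (made-up values).
candidates = [3, 1, 4, 1, 5, 9, 2, 6]
kept = select_survivors(candidates, list_size=4)
```

In this model the number of comparison bits feeding the second stage grows with the square of 2L regardless of the metric width, which is consistent with the smaller frequency gain reported for L = 4.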
Remark. The transition to LLRs can significantly reduce the area and increase the operating frequency of an implementation of SCL decoding. To be fair, we mention one disadvantage of the iterative metric update, namely that the simplified SC decoding proposed in [8] can no longer be taken full advantage of, since the LLRs for all bits are required to keep the metric updated. It can still be applied to the first group of consecutive frozen bits, which is usually large.
CONCLUSION
In this paper, we derived an LLR-based implementation of the successive cancellation list decoder using a path metric based on which the paths can be ranked according to their likelihoods. This metric can also be used for any other tree-search decoding algorithm that compares the paths according to their likelihoods, such as SC stack decoding [9] . Moreover, we demonstrated the advantages of our implementation by comparing synthesis results for an LL-and an LLR-based hardware successive cancellation list decoder architecture. Specifically, the decoder area was reduced by up to 53% and the clock frequency was increased by up to 31%.
In addition to the gains in hardware cost, most processing blocks in practical receivers, such as channel equalizers and demodulators, process data in the form of LLRs. Hence, the presented implementation of the decoder can readily be integrated into existing systems, while the LL-based decoder would require an extra pre-processing stage to convert the channel LLRs to LLs.
