Abstract-Polar codes are able to achieve the capacity of memoryless channels under successive cancellation (SC) decoding. Soft Cancellation (SCAN) is a soft-output decoder based on the SC schedule, useful in iterative decoding and concatenation of polar codes. However, the sequential nature of this decoder leads to high decoding latency compared to state-of-the-art codes. To reduce the latency of SCAN, in this paper we identify special nodes in the decoding tree, corresponding to specific frozen-bit sequences, and propose dedicated low-latency decoding approaches for each of them. The resulting fast-SCAN decoder does not alter the soft output compared to the standard SCAN while dramatically reducing the decoding latency and yielding the same error-correction performance.
I. INTRODUCTION
Polar codes are a class of linear block codes introduced in [1]. They rely on the polarization effect to identify reliable and unreliable bit-channels, freezing the unreliable ones and transmitting information through the reliable ones. Polar codes can achieve capacity under successive cancellation (SC) decoding for infinite code length. However, for finite block lengths, SC yields poor error-correction performance and long decoding latency, due to its serial nature. On the one hand, to improve the performance of SC at minimal latency cost, SC list (SCL) decoding and its evolution aided by a cyclic redundancy check (CRC) have been proposed in [2] and [3], respectively. On the other hand, to reduce the latency of SC without any performance degradation, constituent codes that can be easily decoded have been investigated [4]-[6]. In particular, [4] introduces a simplified-SC (SSC) decoder that can efficiently decode rate-0 and rate-1 nodes. Fast-SSC [5] improves SSC by implementing fast parallel decoders for single-parity-check nodes, repetition nodes and their mergers. More recently, [6] identified 5 new types of nodes and provided efficient decoders for them.
SCAN decoding [7] is a low-complexity soft-output decoding algorithm based on the SC schedule that can be effectively used in concatenation schemes and iterative decoding. Consequently, SCAN decoders also suffer from poor latency due to the serial nature of the SC schedule. Moreover, SCAN allows iterative decoding, so its latency grows with the number of iterations. In order to speed up the decoding process, [8] introduces simplifications for rate-0 nodes (all frozen bits) and rate-1 nodes (all information bits).
In this work, we detail fast decoders for several constituent codes under SCAN decoding, improving the latency of SCAN without modifying its soft output, thus yielding the same extrinsic values for iterative message passing. An analysis of the decoding latency of the fast-SCAN decoder reveals up to 94.5% reduction with respect to SCAN, while simulation results show that both decoders have the same error-correction performance.
II. PRELIMINARIES
In this section we introduce the basic concepts on polar codes, together with the SC and SCAN decoding algorithms.
A. Polar codes
A polar code $(N, K)$ of length $N = 2^n$ and dimension $K$ is a block code relying on the polarization effect of the kernel matrix $G_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. The polarization effect defined by

$$G_N = G_2^{\otimes n}$$

creates $N$ virtual bit-channels, each one having a different reliability. The reliability of each channel may be computed through Density Evolution using Gaussian Approximation, the Bhattacharyya parameter or through Monte Carlo simulation [9]. The $K$ information bits are assigned to the $K$ most reliable bit-channels, while the remaining $N - K$ bit-channels are set to a known value, usually 0, and represent the frozen set $\mathcal{F}$ of the code. The $N$-bit codeword $x$ is generated as $x = u G_N$, where $u$ is the input vector of the code, with $u_i = 0$ for $i \in \mathcal{F}$.
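The encoding rule $x = u G_N$ can be illustrated with a short recursive sketch. It relies on the block structure of the Kronecker power, under which a length-$N$ codeword is built from the codewords of the two halves of $u$. The function names `polar_encode` and `build_input` are hypothetical helpers introduced for illustration, and the frozen set in the usage note is an arbitrary example, not a reliability-based design.

```python
def polar_encode(u):
    """Return x = u * G_N over GF(2), with G_N the n-fold Kronecker power of [[1, 0], [1, 1]].

    Uses G_N = [[G_{N/2}, 0], [G_{N/2}, G_{N/2}]], so that
    x = (x_a XOR x_b, x_b), where x_a and x_b encode the two halves of u.
    """
    length = len(u)
    if length == 1:
        return [u[0] & 1]
    half = length // 2
    left = polar_encode(u[:half])
    right = polar_encode(u[half:])
    return [a ^ b for a, b in zip(left, right)] + right


def build_input(length, frozen, info_bits):
    """Place information bits on the non-frozen positions; frozen positions are set to 0."""
    bits = iter(info_bits)
    return [0 if i in frozen else next(bits) for i in range(length)]


# Example: only the last bit-channel carries information (illustrative frozen set).
u = build_input(4, {0, 1, 2}, [1])
x = polar_encode(u)
```

With this frozen set the single information bit is repeated on every codeword position, which is the repetition structure discussed later for REP nodes.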
B. SC-based decoding
SC has been proposed in [1] as the native polar code decoding algorithm. As shown in Figure 1, it can be described as a binary tree search, where the tree is traversed depth-first starting from the left branch. We consider the soft information received as input from the channel to be in the form of log-likelihood ratios (LLRs). A node at stage $t$ represents a constituent code of length $2^t$, receiving from its parent node the soft information vector $\lambda$ of length $2^t$. This is used to compute the $2^{t-1}$-element soft information vector $\lambda_\ell$ for its left child. The left child eventually returns its constituent codeword $\beta_\ell$, which is used to compute the soft information $\lambda_r$ to be sent to the right child. After receiving the second constituent codeword $\beta_r$ from the right child, the node feeds back the $2^t$-element estimated codeword $\beta$ to its parent node. In order to reduce the latency of SC-based decoders, the SSC decoder [4] is able to decode sub-trees constituted of information bits (rate-1) or frozen bits (rate-0) without having to explore them. Later, single-parity-check (SPC) and repetition (REP) nodes have been decoded efficiently in [5] to further speed up SC decoding. Finally, five new nodes were identified in [6], while generalized and expanded versions have been proposed in [10].

Fig. 2. Soft message exchange between decoding tree stages.
C. SCAN decoding
SCAN decoding [7] relies on the SC schedule to exchange soft information through the decoding tree in both directions. This allows the soft information to be refined both at the root and at the leaves of the tree by iterating the decoding process. Compared to SC, SCAN returns soft values instead of hard decisions, at the cost of slightly higher decoding complexity and latency.
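Stage $t$ of the decoding tree is constituted of $2^{n-t}$ nodes of size $2^t$, where $0 \le t \le n$. The $i$-th node at stage $t$ receives the vector $\lambda_t^i$, including $2^t$ LLRs, from its parent node, and performs update operations to feed back the $2^t$-element soft vector $\beta_t^i$. As with SC, the vector $\lambda_n^0$ is initialized with the channel LLRs. Moreover, the decoder can exploit a priori information coming from the frozen set $\mathcal{F}$; the message $\beta_0^i$ fed back from the leaves, corresponding to the estimated vector $\hat{u}$, is set to

$$\beta_0^i = \begin{cases} \infty & \text{if } i \in \mathcal{F}, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The other messages are initialized to 0 since no further a priori information is available. It is worth noting that the message sets $\beta_0^i$ and $\lambda_n^0$ are not updated during decoding.

Message passing through the decoding tree follows the SC scheduling described in Section II-B. Figure 2 represents a node at stage $t$, i.e. a constituent code of length $2^t$. First, the soft message vector to be sent to the left child is computed as

$$\lambda_{t-1}^{2i}[k] = \tilde{f}\left(\lambda_t^i[k],\ \lambda_t^i[k+2^{t-1}] + \beta_{t-1}^{2i+1}[k]\right) \tag{2}$$

for $k = 0, \dots, 2^{t-1}-1$. Then, the node receives the soft message vector $\beta_{t-1}^{2i}$ from its left child and computes the message $\lambda_{t-1}^{2i+1}$ sent to its right child as

$$\lambda_{t-1}^{2i+1}[k] = \tilde{f}\left(\lambda_t^i[k],\ \beta_{t-1}^{2i}[k]\right) + \lambda_t^i[k+2^{t-1}]. \tag{3}$$

As soon as the soft message vector $\beta_{t-1}^{2i+1}$ is received from the right child node, the feedback soft message vector $\beta_t^i$ of length $2^t$ is calculated as

$$\begin{aligned} \beta_t^i[k] &= \tilde{f}\left(\beta_{t-1}^{2i}[k],\ \beta_{t-1}^{2i+1}[k] + \lambda_t^i[k+2^{t-1}]\right), \\ \beta_t^i[k+2^{t-1}] &= \tilde{f}\left(\beta_{t-1}^{2i}[k],\ \lambda_t^i[k]\right) + \beta_{t-1}^{2i+1}[k] \end{aligned} \tag{4}$$

and sent to the parent node. Function $\tilde{f} : \mathbb{R}^2 \to \mathbb{R}$ is the box-plus operator

$$\tilde{f}(a, b) = 2 \tanh^{-1}\left(\tanh\frac{a}{2} \tanh\frac{b}{2}\right), \tag{5}$$

whose hardware-friendly implementation is given by

$$\tilde{f}(a, b) \approx \operatorname{sign}(a) \operatorname{sign}(b) \min(|a|, |b|). \tag{6}$$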
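As an illustration, the update rules (2)-(4) with the min-sum form of the box-plus operator can be written as one recursive pass over the decoding tree. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `f_min` and `scan_pass` are hypothetical, only the first iteration is covered (the right-child feedback used in (2) is still at its initial value 0), and frozen leaves are assumed to feed back $+\infty$ while information leaves feed back 0.

```python
import math


def f_min(a, b):
    """Min-sum approximation of box-plus: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def scan_pass(lam, frozen):
    """One SCAN pass on a node; lam holds its input LLRs, frozen flags its leaves.

    Returns the soft feedback vector beta of the node (same length as lam).
    """
    size = len(lam)
    if size == 1:
        # Leaf: a priori knowledge only (assumed +inf for frozen, 0 for information).
        return [math.inf if frozen[0] else 0.0]
    half = size // 2
    # Rule (2): message to the left child; right feedback is 0 in the first pass.
    lam_left = [f_min(lam[k], lam[k + half] + 0.0) for k in range(half)]
    beta_left = scan_pass(lam_left, frozen[:half])
    # Rule (3): message to the right child, using the left feedback.
    lam_right = [f_min(lam[k], beta_left[k]) + lam[k + half] for k in range(half)]
    beta_right = scan_pass(lam_right, frozen[half:])
    # Rule (4): feedback to the parent, first half then second half.
    upper = [f_min(beta_left[k], beta_right[k] + lam[k + half]) for k in range(half)]
    lower = [f_min(beta_left[k], lam[k]) + beta_right[k] for k in range(half)]
    return upper + lower


# Example: a size-4 node whose first leaf is frozen and the others carry information.
beta = scan_pass([1.5, -0.5, 2.0, -3.0], [True, False, False, False])
```

A full decoder would keep the $\beta$ messages between iterations so that (2) can reuse the right-child feedback of the previous pass; the sketch deliberately omits that state.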
III. FAST-SCAN DECODING
In this section we introduce SCAN message update rules for several special nodes used in fast SC decoders. Fast-SCAN provides the same soft output as SCAN while considerably reducing the decoding latency. Figure 3 depicts the pruned decoding tree explored by fast-SCAN decoding for the (256, 239) polar code designed according to the 5G standard [11]. The full tree would be composed of 511 nodes, which are reduced to 17 through fast decoding of constituent codes.
Simplified SCAN (SSCAN) decoding has been proposed in [8], where decoders for rate-0 and rate-1 nodes are examined. A rate-0 node always returns $\beta_t^i = \{\infty\}$, while a rate-1 node always returns $\beta_t^i = \{0\}$.
A. SPC nodes
In an SPC node the first leaf is frozen while all the other leaves represent information bits. SPC nodes are more likely to occur in high-rate polar codes [5], which are used in constructions such as product polar codes [12]. The SPC node imposes an even parity constraint on its bits; the Fast-SSC [5] algorithm decodes SPC nodes with Wagner decoding [13], i.e. by flipping the least reliable bit if the overall parity is not satisfied.
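When we decode an SPC node with SCAN, the $\beta$ term in (2) is equal to 0 for $t \ge 2$; hence, entry $k$ of $\lambda_{t-1}^{2i}$ reduces to

$$\lambda_{t-1}^{2i}[k] = \tilde{f}\left(\lambda_t^i[k],\ \lambda_t^i[k+2^{t-1}]\right). \tag{7}$$

The structure of SPC nodes guarantees that the feedback from the right subtree is always the all-zero vector; given (4), entry $k$ of $\beta_t^i$ is therefore the box-plus of all the entries of $\lambda_t^i$ excluding $k$:

$$\beta_t^i[k] = \underset{j \ne k}{\boxplus}\ \lambda_t^i[j], \tag{8}$$

where $\boxplus$ denotes the repeated application of $\tilde{f}$. We can thus decode SPC nodes without traversing the tree. We can rewrite (8) as

$$\beta_t^i[k] = (-1)^{P \oplus h[k]} \cdot \begin{cases} \left|\lambda_t^i[k_2]\right| & \text{if } k = k_1, \\ \left|\lambda_t^i[k_1]\right| & \text{otherwise,} \end{cases} \tag{9}$$

where $P$ is the overall parity

$$P = \bigoplus_{k=0}^{2^t-1} h[k], \tag{10}$$

$h$ is the hard decision taken on the LLRs in $\lambda_t^i$, and $k_1$ and $k_2$ are the indices of the least and second least reliable values of $\lambda_t^i$. It is worth noticing that (9) does not change the parity of the input LLRs, whereas Wagner decoding forces the even parity constraint. When an SPC node feeds back a vector having a wrong parity, SCAN decoding may fail due to the extrinsic nature of its output. In order to force the parity condition while keeping the expected LLR distribution, (9) can be modified as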
However, the output will no longer be extrinsic, and will differ from that of SCAN. In the following, we propose to use (9), considering that a key application of fast-SCAN is the speedup of iterative decoding of polar-based code constructions.
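The extrinsic SPC rule can be sketched as follows. The sketch assumes the min-sum form of the box-plus operator and takes the SPC feedback for entry $k$ to be the box-plus of all the other input LLRs, which is what the SCAN update rules give when the right-subtree feedback is all-zero. Under those assumptions the same values follow from the overall parity and the two least reliable LLRs. The names `spc_beta_reference` and `spc_beta_fast` are hypothetical, and ties or zero-valued LLRs are not treated specially.

```python
from functools import reduce


def f_min(a, b):
    """Min-sum approximation of box-plus: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def spc_beta_reference(lam):
    """Entry k is the min-sum box-plus of every input LLR except lam[k]."""
    return [reduce(f_min, lam[:k] + lam[k + 1:]) for k in range(len(lam))]


def spc_beta_fast(lam):
    """Same values from the parity and the two least reliable positions."""
    hard = [1 if x < 0 else 0 for x in lam]      # hard decisions on the LLRs
    parity = sum(hard) % 2                        # overall parity P
    order = sorted(range(len(lam)), key=lambda j: abs(lam[j]))
    k1, k2 = order[0], order[1]                   # least and second least reliable
    beta = []
    for k in range(len(lam)):
        magnitude = abs(lam[k2]) if k == k1 else abs(lam[k1])
        sign = -1.0 if parity ^ hard[k] else 1.0  # parity of all entries but k
        beta.append(sign * magnitude)
    return beta


lam = [1.5, -0.5, 2.0, -3.0]
beta = spc_beta_fast(lam)
```

The fast version needs a single search for the two smallest magnitudes instead of a tree traversal, which is the source of the latency gain discussed in Section IV.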
B. REP nodes
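A REP node occurs when all the leaves of a node are frozen except the rightmost one. As a result, the value of the $2^t$ bits at the root of the REP node is equal to the information bit. Fast-SSC decodes REP nodes through hard decision on the sum of all the elements in $\lambda_t^i$. The result is then replicated $2^t$ times in the feedback. Concerning the SCAN decoder, the $\beta$ term in (3) representing the feedback from the left subtree is always a vector of infinities for $t \ge 2$, leading to

$$\lambda_{t-1}^{2i+1}[k] = \lambda_t^i[k] + \lambda_t^i[k+2^{t-1}].$$

Applying this recursively down to stage 1, entries $[0]$ and $[1]$ of the vector received by the rightmost stage-1 node are respectively the sums of all even-indexed and odd-indexed entries of $\lambda_t^i$. The information leaf feeds back $\beta = \{0\}$, and thus each of the two entries fed back by that stage-1 node equals the other entry of its input. Finally, each entry $k$ of $\beta_t^i$ can be computed without traversing the tree as the sum of all the entries of $\lambda_t^i$ excluding $k$:

$$\beta_t^i[k] = \sum_{\substack{j=0 \\ j \ne k}}^{2^t-1} \lambda_t^i[j]. \tag{12}$$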
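The REP rule, in which each feedback entry is the sum of all input LLRs except its own, can be sketched by computing the total once and subtracting one value per entry. The function names `rep_beta_fast` and `rep_hard_decision` are hypothetical, and the convention that a negative LLR maps to bit 1 is an assumption of this sketch.

```python
def rep_beta_fast(lam):
    """Soft feedback of a REP node: entry k is the sum of all LLRs except lam[k]."""
    total = sum(lam)                 # one shared accumulation
    return [total - x for x in lam]  # one subtraction per entry


def rep_hard_decision(lam):
    """Hard decision on the repeated bit (assumed mapping: negative LLR -> bit 1)."""
    return 1 if sum(lam) < 0 else 0


lam = [1.0, -0.5, 2.0, 0.25]
beta = rep_beta_fast(lam)
bit = rep_hard_decision(lam)
```

Sharing the total across entries mirrors the latency argument of Section IV, where the sum and the removal of one LLR value are each counted as one clock cycle.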
C. Type-X nodes
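In [6], the authors provide 5 new nodes, named Type-I to Type-V. A Type-I node, also known as REP-II node, has all leaves frozen except the last two. It can be decoded as two separate REP nodes identified by the even-indexed and odd-indexed bits. Consequently, fast-SCAN computes the elements of $\beta_t^i$, with $k = 0, \dots, 2^{t-1}-1$, as

$$\beta_t^i[2k] = \sum_{\substack{j=0 \\ j \ne k}}^{2^{t-1}-1} \lambda_t^i[2j], \qquad \beta_t^i[2k+1] = \sum_{\substack{j=0 \\ j \ne k}}^{2^{t-1}-1} \lambda_t^i[2j+1], \tag{13}$$

without traversing the Type-I tree.

Type-III nodes have 2 frozen bits located on the first two bit-channels, while the other bit-channels are unfrozen. They may be decoded as two separate SPC nodes composed of the even-indexed and odd-indexed values. We denote by $k_{0,e}$ and $k_{0,o}$ the least reliable indices corresponding to the sets of even-indexed and odd-indexed bits, with $k_{1,e}$ and $k_{1,o}$ being the second least reliable index of each set. The overall parities $P_e$ and $P_o$ are computed in order to calculate the soft message feedback as in (9):

$$\beta_t^i[k] = \begin{cases} (-1)^{P_e \oplus h[k]} \left|\lambda_t^i[k_{1,e}]\right| & \text{if } k \text{ is even and } k = k_{0,e}, \\ (-1)^{P_e \oplus h[k]} \left|\lambda_t^i[k_{0,e}]\right| & \text{if } k \text{ is even and } k \ne k_{0,e}, \\ (-1)^{P_o \oplus h[k]} \left|\lambda_t^i[k_{1,o}]\right| & \text{if } k \text{ is odd and } k = k_{0,o}, \\ (-1)^{P_o \oplus h[k]} \left|\lambda_t^i[k_{0,o}]\right| & \text{otherwise.} \end{cases} \tag{14}$$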
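The even/odd decomposition described for Type-I and Type-III nodes can be sketched by de-interleaving the input LLRs, applying a REP or SPC rule to each half, and re-interleaving the results. The helper names (`rep_beta`, `spc_beta`, `decode_interleaved`, `type_i_beta`, `type_iii_beta`) are hypothetical, and the SPC rule assumes the min-sum box-plus approximation.

```python
def rep_beta(lam):
    """REP rule: entry k is the sum of all LLRs except lam[k]."""
    total = sum(lam)
    return [total - x for x in lam]


def spc_beta(lam):
    """Extrinsic SPC rule (min-sum): parity sign times the smallest other magnitude."""
    hard = [1 if x < 0 else 0 for x in lam]
    parity = sum(hard) % 2
    k_least, k_second = sorted(range(len(lam)), key=lambda j: abs(lam[j]))[:2]
    beta = []
    for k in range(len(lam)):
        magnitude = abs(lam[k_second]) if k == k_least else abs(lam[k_least])
        beta.append((-1.0 if parity ^ hard[k] else 1.0) * magnitude)
    return beta


def decode_interleaved(node_rule, lam):
    """Apply node_rule separately to the even-indexed and odd-indexed LLRs."""
    beta = [0.0] * len(lam)
    beta[0::2] = node_rule(lam[0::2])  # even-indexed sub-node
    beta[1::2] = node_rule(lam[1::2])  # odd-indexed sub-node
    return beta


def type_i_beta(lam):
    """Type-I node treated as two interleaved REP nodes."""
    return decode_interleaved(rep_beta, lam)


def type_iii_beta(lam):
    """Type-III node treated as two interleaved SPC nodes."""
    return decode_interleaved(spc_beta, lam)


lam = [1.5, -0.5, 2.0, -3.0, 0.75, 4.0, -1.0, 2.5]
beta_i = type_i_beta(lam)
beta_iii = type_iii_beta(lam)
```

The two sub-nodes share no data, so they can be processed in parallel; for a size-4 node both rules return the same vector, since Type-I and Type-III then describe the same frozen pattern.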
The other Type-X nodes presented in [6] can be decoded without traversing the tree as well. For Type-II and Type-IV nodes, entry $k$ of $\beta_t^i$ is computed as:
i.e. a combination of sum and box-plus operations among values of $\lambda_t^i$ selected with modulo-4 indexing. While these equations do not alter the soft output of SCAN, their computational complexity is substantially higher than that of the other identified nodes, and they will not be considered in the following sections. Finally, the structure of Type-V nodes implies a frozen bit embedded in a series of information bits. As a consequence, aside from its high complexity, the mathematical expression for the computation of $\beta_t^i$ at the root is only valid for the first iteration. Thus, Type-V nodes are not considered in the remainder of the paper either.
IV. DECODING LATENCY ANALYSIS
In this section we evaluate the decoding latency of the proposed fast-SCAN decoder and compare it to the latency of the standard SCAN decoder. Similarly to [4], [6], we suppose that hard decisions on LLRs and bit operations are executed instantaneously, while operations involving real numbers (additions, comparisons) and Wagner decoding require one clock cycle. With this assumption, one SCAN update rule consumes 2 clock cycles, since it is composed of a box-plus operation followed by an addition, as discussed in Section II-C. are not updated through the decoding. Finally, the root has a latency of 2 corresponding to the computation of the vector $\beta_n^i$. The overall SCAN latency for a node at stage $t$ is then $L = 4(N_v - 2) + 2N_v + 2 = 6(2^t - 1)$, where the last equality holds for $N_v = 2^t$. As seen in Figure 4, the latency of SSCAN to decode an SPC node of size $2^t$ is $L = 4(t - 1) + 2$, since the constant values of $\beta_t^i$ allow for instantaneous decoding of rate-0 and rate-1 nodes. For fast-SCAN, hard decisions and (10) are executed instantaneously as well, hence only (9) with the search of the two least reliable LLRs is taken into account. As in [6], we assume that the minimum search operations need one clock cycle; thus, decoding an SPC node requires $L = 2$ clock cycles.
The latency computation for SSCAN in the case of REP nodes is symmetrical to that of SPC nodes, leading to the same latency $L = 4(t - 1) + 2$. For fast-SCAN, the sum of LLRs in (12) takes 1 clock cycle [6], while removing one LLR value takes an additional clock cycle, resulting in $L = 2$ clock cycles.
Type-I nodes have only 2 information bits in the last channels; a node of size $2^t$ is mostly composed of zero-latency rate-0 nodes. Hence, Type-I is very similar to the REP node also in terms of latency, the only difference being the edge connecting the size-2 rate-1 node to its parent node. Consequently, SSCAN requires $L = 4(t - 1) - 4 + 2 = 4(t - 2) + 2$ clock cycles. In fast-SCAN, the even-indexed and odd-indexed bits are interpreted as two independent REP nodes that can be decoded in parallel, reducing the latency to $L = 2$.
Similarly, a Type-III node of size $2^t$ can be seen as the juxtaposition of two SPC nodes, one involving the even-indexed LLRs, the other involving the odd-indexed LLRs. For SSCAN, the latency is reduced to $L = 4(t - 2) + 2$ as for Type-I, while fast-SCAN decodes the two SPC nodes in parallel, reducing the latency to $L = 2$ clock cycles.
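The per-node latency expressions of this section can be collected in a small sketch that compares the three decoders for a node of size $2^t$. The function names and the string labels for node types are hypothetical; the formulas are the ones stated in the text, under its clock-cycle assumptions.

```python
FAST_SCAN_NODE_LATENCY = 2  # clock cycles for SPC, REP, Type-I and Type-III nodes


def scan_latency(t):
    """Standard SCAN on a node of size 2**t: L = 6 * (2**t - 1)."""
    return 6 * (2 ** t - 1)


def sscan_latency(node_type, t):
    """SSCAN latency of a special node of size 2**t."""
    if node_type in ("SPC", "REP"):
        return 4 * (t - 1) + 2
    if node_type in ("Type-I", "Type-III"):
        return 4 * (t - 2) + 2
    raise ValueError(f"unsupported node type: {node_type}")


def reduction(before, after):
    """Relative latency reduction when going from `before` to `after` cycles."""
    return 1.0 - after / before


# Example: a size-32 SPC node (t = 5).
t = 5
cycles = (scan_latency(t), sscan_latency("SPC", t), FAST_SCAN_NODE_LATENCY)
```

The SCAN latency grows exponentially with $t$ and the SSCAN latency linearly, while the fast-SCAN latency of these nodes stays constant, so the gain is largest on large special nodes.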
V. RESULTS
In this Section, we consider polar codes defined in the 5G standard [11] ; we analyze the frequency of occurrence of the identified special nodes, and provide the consequent speedup with respect to standard SCAN decoding. Figure 5 shows the number and the size of the identified nodes for a given polar code of length N = 1024 and dimensions K = 128, 512, 768, 896. The nodes are counted considering the maximum size possible. For instance, an SPC node of length 128 followed by a rate-1 node of length 128 is counted as an SPC node of length 256. Low-rate polar codes are more likely to have long rate-0, REP, and Type-I nodes. For example, the (1024,128) polar code has a REP node of length 128 and a rate-0 node of length 256. Around rate 1/2 the nodes are more evenly distributed, with many nodes having size ≤ 16. In high-rate polar codes, the longer nodes are rate-1, SPC and Type-III nodes. According to the four rates depicted in Figure 5 , type-I and type-III are the least likely to occur.
The decoding latency of a sample of 5G-NR polar codes under SCAN and fast-SCAN is presented in Table I. It can be seen that fast-SCAN can reduce the latency of SCAN by more than 80% at code rate 1/2, regardless of the code length; moreover, at high and low code rates, where special nodes of larger sizes are present, the gain can surpass 94%.
In Figure 6, we consider an additive white Gaussian noise channel with binary phase-shift keying modulation and provide simulation results for an $N = 256^2$, $K = 239^2$ polar code under SC, SCL (with and without CRC), SCAN and fast-SCAN. We also provide simulation results of the product polar code scheme presented in [12], where the component codes of the product codes are polar codes. Being an iterative concatenated scheme, this scenario can benefit from the soft-in soft-out capabilities of SCAN-based decoding algorithms. Simulations decode the component polar codes with SCL, fast-SCAN and SCAN decoders; these decoders exchange soft information between iterations, following the scheme detailed in [12] in the case of SCL. SCAN and fast-SCAN consider one internal iteration. The component code is the $N = 256$, $K = 239$ code shown in Figure 3. We can see that SCAN and fast-SCAN yield the same BLER for both standard and product polar codes, showing that the proposed fast decoding techniques are exact and do not incur any performance degradation. SCAN and fast-SCAN slightly improve on the performance of SC for standard polar codes: this is because the hard decision is taken on the a-posteriori information, i.e. the combination of the soft output and the channel LLRs, as a turbo decoder would. Fast-SCAN outperforms SCL decoding of product polar codes, yielding a gain of almost 1.5 dB at BLER $= 10^{-3}$. It also outperforms standard polar decoding using SC and non-CRC-aided SCL, while it approaches the BLER of CRC-aided SCL.
For a single internal iteration, SCAN decodes the component code in $6 \cdot (2^8 - 1) = 1530$ clock cycles, while fast-SCAN needs 58 clock cycles. The latency of the product polar code multiplies the latency of the component decoders by the number of row/column half-iterations, since row and column decoding cannot be run in parallel to enable the exchange of information [12]. Hence, the maximum latency results in $8 \cdot (58 + 58) = 928$ clock cycles with fast-SCAN and $8 \cdot (1530 + 1530) = 24480$ clock cycles for SCAN. The standard polar code is decoded in $2N - 2 = 131070$ clock cycles with SC and $2N - 2 + K = 188191$ with SCL, while fast-SCAN requires 12190 clock cycles.
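The latency figures quoted above can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers given in the text; the variable and function names are hypothetical, and the factor 8 is taken as the multiplier used in the text's own computation.

```python
COMPONENT_SCAN = 6 * (2 ** 8 - 1)  # SCAN on the N = 256 component code
COMPONENT_FAST_SCAN = 58           # fast-SCAN on the same code (value from the text)
ROUNDS = 8                         # multiplier applied in the text


def product_code_latency(component_cycles, rounds=ROUNDS):
    """Row decoding followed by column decoding, repeated `rounds` times."""
    return rounds * (component_cycles + component_cycles)


# Standard (non-product) polar code with N = 256^2 and K = 239^2.
N = 256 ** 2
K = 239 ** 2
sc_cycles = 2 * N - 2
scl_cycles = 2 * N - 2 + K

summary = {
    "product, SCAN": product_code_latency(COMPONENT_SCAN),
    "product, fast-SCAN": product_code_latency(COMPONENT_FAST_SCAN),
    "standard, SC": sc_cycles,
    "standard, SCL": scl_cycles,
}
```

Because the product-code latency is a fixed multiple of the component latency, the relative gain of fast-SCAN over SCAN is the same for the product code as for a single component code.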
VI. CONCLUSIONS
In this paper, we have proposed fast-SCAN, a reduced-latency decoder based on the SCAN decoding algorithm. Fast-SCAN decodes constituent nodes exactly without the need to explore the decoding tree, and thus has no impact on the error-correction performance of SCAN. At the same time, it can reduce the decoding latency of SCAN by up to 94.5%.
