Abstract: Polar codes are widely considered one of the most exciting recent discoveries in channel coding. For short to moderate block lengths, their error-correction performance under list decoding can outperform that of other modern error-correcting codes. However, high-speed list-based decoders with moderate complexity are challenging to implement. Successive-cancellation (SC)-flip decoding was shown to be capable of a competitive error-correction performance compared to that of list decoding at a fraction of the complexity, but it suffers from a variable execution time and a higher worst-case latency. In this work, we show how to modify the state-of-the-art high-speed SC decoding algorithm to incorporate the SC-flip ideas. The algorithmic improvements are presented, as well as average execution-time results tailored to a hardware implementation. The results show that the proposed fast-SSC-flip algorithm has a decoding speed close to an order of magnitude better than that of previous works while retaining a comparable error-correction performance.
I. Introduction
Polar codes made it into 3GPP's next-generation mobile-communication standard (5G) due to their excellent error-correction performance under successive-cancellation list (SCL) decoding [1]. However, implementing high-throughput SCL decoders while retaining a moderate complexity is challenging, since the decoder speculatively explores multiple candidate solutions in parallel, most of which are abandoned in the end. As an alternative, Afisiadis et al. proposed the low-complexity successive-cancellation flip (SCF) decoding algorithm [2], which explores candidate solutions sequentially and therefore avoids many unnecessary computations. They showed that SCF could match the error-correction performance of SCL decoding under some conditions. However, because SCF decoding is sequential in nature, its instantaneous decoding throughput is variable and its average throughput depends on the channel signal-to-noise ratio. Nevertheless, it was shown in [3] that, under reasonable conditions, SCF decoding has a much better average throughput and a significantly lower average decoding complexity, in terms of the area-time product, than SCL decoding.
Contributions: In this paper, we show how to merge the SCF decoding algorithm with the state-of-the-art high-speed successive-cancellation (SC)-based decoding algorithm that decomposes polar codes into constituent codes. We introduce decision-log-likelihood-ratio (LLR) calculations, required for SCF decoding, tailored to the various constituent-code types. We show that the new decoding algorithm has an error-correction performance that is either virtually the same as or very close to that of the original SCF algorithm. Hardware implementation considerations are discussed, and the execution time of the proposed algorithm is compared with that of the only SCF-decoder implementation from the literature.
Outline: The remainder of this paper starts with Section II which provides background about polar-code construction and encoding, polar-code representations and decomposition in constituent codes, and the decoding algorithms on which this work is built. Section III describes the proposed new algorithm and the decision-LLR equations for the various constituent-code types. Hardware considerations are discussed in Section IV, where an average execution-time comparison against the state of the art is also presented. Finally, Section V concludes this paper.
II. Background

A. Construction and Encoding
Polar codes are linear block codes, i.e., the encoding process implies the linear transformation of a vector of bits. This transformation is structured in a way that results in a polarization effect, as its length tends to infinity, where some of the encoded bits can be decoded perfectly while the others become completely unreliable.
In particular, in matrix form, a polar code of length $N$ can be obtained as

$$x = u F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix},$$

where $n = \log_2 N$, $F^{\otimes n}$ is the $n$-th Kronecker power of $F$, and $u$ is the vector of bits to be encoded. To obtain an $(N, k)$ polar code of rate $R = k/N$, the $k$ most-reliable bit locations in $u$ hold the information bits while the other $N - k$ bits, called frozen bits, are set to a predetermined value (usually 0). The bit-location reliabilities depend on the channel type and condition. Many methods have been proposed to calculate these reliabilities; we use that of Tal and Vardy [4].
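As an illustration of the encoding step, the following minimal sketch applies the transformation $u F^{\otimes n}$ over GF(2) with in-place butterflies instead of an explicit matrix product. The function names `polar_encode` and `build_u`, and the choice of no bit-reversal permutation, are assumptions made for this example only.

```python
def build_u(info_bits, info_positions, n_bits):
    """Place information bits at the given positions; all other (frozen) bits are 0."""
    u = [0] * n_bits
    for bit, pos in zip(info_bits, info_positions):
        u[pos] = bit
    return u


def polar_encode(u):
    """Return x = u * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]]."""
    n_bits = len(u)
    if n_bits == 0 or n_bits & (n_bits - 1):
        raise ValueError("the code length must be a power of two")
    x = list(u)
    step = 1
    while step < n_bits:
        # One stage of XOR butterflies: the upper branch absorbs the lower one.
        for start in range(0, n_bits, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]
        step *= 2
    return x


# (8, 5) example of Fig. 1a: information bits at locations {2, 3, 5, 6, 7}.
u = build_u([1, 0, 1, 1, 0], [2, 3, 5, 6, 7], 8)
x = polar_encode(u)
```

Since $F^{\otimes n}$ is its own inverse over GF(2), applying `polar_encode` twice returns the original vector, which is a convenient sanity check.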
B. Representations and Constituent Codes
In addition to the matrix form, polar codes can be represented as a graph. Fig. 1a shows such a representation for an (8, 5) polar code, where + denotes modulo-2 addition, the grayed $u_i$'s with $i \in \{0, 1, 4\}$ hold frozen bits, and the black $u_i$'s with $i \in \{2, 3, 5, 6, 7\}$ hold the information bits. Encoding is done by propagating the vector $u$ through that graph from left to right.
Also from Fig. 1a, notice how polar codes are built recursively: the first half ($x_0^3$) and the second half ($x_4^7$) of the codeword are themselves polar codes of length 4, called constituent codes. Taking into consideration the frozen-bit locations, many of these constituent codes can be considered as block codes with a special structure rather than as general polar codes. As briefly recalled in Section II-C below, this structure can then be exploited to use dedicated decoding algorithms that are more efficient than the generic decoding algorithm for polar codes.
Alternatively to the graph representation, it was shown in [5] that polar codes can also be represented as binary trees, or decoder trees, where the white and black leaf nodes correspond to frozen-bit and information-bit locations, respectively. 
C. Fast Simplified Successive-Cancellation Decoding
The SC decoding algorithm as initially proposed [6] proceeds by visiting the decoder-tree representation (e.g., Fig. 1b) sequentially, from top to bottom and from left to right, successively estimating $\hat{u}$ at the leaf nodes from the noisy channel values. Alamdar-Yazdi and Kschischang proposed the simplified SC (SSC) algorithm, where the subtrees solely composed of either frozen nodes (rate-0 codes) or information nodes (rate-1 codes) are not fully traversed [5]. Recognizing more types of constituent codes, specialized algorithms and corresponding dedicated decoders were proposed in [7]. The algorithm described in [7], and later extended in [8] and [9], is referred to as the fast simplified SC (fast-SSC) algorithm. Fast-SSC decoders have a throughput that depends on the frozen-bit locations; however, it is typically an order of magnitude higher than that of other SC-based decoders. Fig. 1c shows a decoder tree for the same (8, 5) polar code as above where the fast-SSC algorithm is applied, i.e., where the tree is pruned by recognizing that the left-hand-side subtree of Fig. 1b corresponds to the ML node of [7] (Type-I node in [9]) and the right-hand-side subtree to a single-parity-check (SPC) node. In Fig. 1c, the former subtree is replaced by a purple-striped node and the latter by an orange-hatched node.
D. Successive-Cancellation Flip Decoding
The SCF decoding algorithm builds upon a slightly modified SC decoding algorithm. It starts by going through a regular SC-decoding pass (first trial). In parallel with the decoding process, a list of (absolute) decision LLRs associated with each estimated information bit is built. Once SC decoding has completed, the embedded cyclic redundancy check (CRC) is verified. If it matches, decoding stops and the estimated codeword is output. Otherwise, another SC-decoding pass (second trial) is launched. This time, once the location of the information bit that corresponds to the least-reliable decision LLR (from the first trial) is reached, that estimated bit is flipped before resuming SC decoding. Once SC decoding has completed, the CRC is verified again. If it matches, the estimated codeword is output; otherwise, the process is restarted, and the bit corresponding to the second-least-reliable decision LLR is flipped instead. This process goes on until the maximum number of trials $T_{\max}$ is reached. Note that setting $T_{\max} = 1$ corresponds to regular SC decoding.
By nature, the latency of SCF decoding is a multiple of that of the underlying SC decoder. With up to $T_{\max}$ trials, it is at most

$$L_{\mathrm{SCF}} = T_{\max} \, L_{\mathrm{SC}},$$

where $L_{\mathrm{SC}}$ is the latency of the underlying SC decoder. Furthermore, both the average and worst-case throughput are functions of the throughput of that same SC decoder. Thus, improving the speed of the underlying SC decoder also improves the throughput of the SCF decoder.
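The trial procedure described in this subsection can be sketched as a control loop around an arbitrary SC decoder. This is only an outline under stated assumptions: the callables `sc_decode` and `crc_ok`, their signatures, and the choice to return the first-trial estimate when every trial fails are hypothetical and not taken from [2] or [3].

```python
def scf_decode(channel_llrs, sc_decode, crc_ok, t_max):
    """SCF control loop around a caller-supplied SC decoder.

    Assumed (hypothetical) interfaces:
      sc_decode(channel_llrs, flip_index) -> (info_bits, decision_llrs)
          flip_index is None for a regular pass, otherwise the index of the
          information bit to flip when it is reached.
      crc_ok(info_bits) -> bool
    Returns the estimated information bits and the number of trials used.
    """
    # First trial: regular SC decoding, which also yields the decision LLRs.
    bits, decision_llrs = sc_decode(channel_llrs, None)
    trials = 1
    if t_max == 1 or crc_ok(bits):
        return bits, trials

    # Flip candidates, from least to most reliable, based on the first trial only.
    order = sorted(range(len(decision_llrs)), key=lambda i: abs(decision_llrs[i]))
    for flip_index in order[: t_max - 1]:
        candidate, _ = sc_decode(channel_llrs, flip_index)
        trials += 1
        if crc_ok(candidate):
            return candidate, trials

    # Assumption: fall back to the first-trial estimate when no trial passes the CRC.
    return bits, trials
```

The returned trial count makes the variable execution time explicit: it ranges from 1 to `t_max`, each trial costing one pass of the underlying decoder.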
III. Fast-SSC-Flip Decoding
During the first trial, the SCF decoding algorithm builds a list that contains the decision LLRs corresponding to each information bit. That list, $\lambda$, is used to determine which information bit to flip in subsequent trials. In the following, we show how the fast-SSC and SCF algorithms can be merged.
The fast-SSC algorithm uses dedicated decoders to estimate multiple information bits at a time. These decoders need to be modified to calculate the decision LLRs required by the SCF algorithm, and to add support for the bit-flipping procedure. These modifications are described in the following subsections for the essential leaf-node types (constituent codes) that contain information bits. The other leaf-node types that may be encountered can all be expressed as node combinations that include the types covered below. 
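To make the leaf-node types concrete, the sketch below classifies a constituent code from its frozen-bit pattern, using the node definitions given in the subsections of this section. The function name `classify_node` and the boolean-list representation are illustrative assumptions; placing the single information bit of a repetition node in the last location follows the usual convention in the fast-SSC literature rather than a statement in this text.

```python
def classify_node(frozen):
    """Classify a constituent code from its frozen-bit pattern.

    frozen[i] is True when bit location i of the node is frozen.
    """
    n = len(frozen)
    k = n - sum(frozen)  # number of information bits in the node
    if k == 0:
        return "rate-0"
    if k == n:
        return "rate-1"
    if k == 1 and not frozen[-1]:
        return "repetition"
    if k == 2 and not frozen[-1] and not frozen[-2]:
        return "birepetition"
    if k == n - 1 and frozen[0]:
        return "spc"
    return "other"  # to be decomposed further into smaller nodes


# (8, 5) code with frozen locations {0, 1, 4}, split into its two halves.
frozen = [i in {0, 1, 4} for i in range(8)]
left, right = classify_node(frozen[:4]), classify_node(frozen[4:])
```

For this example the left half is recognized as a length-4 birepetition code and the right half as an SPC code, which matches the pruned tree of Fig. 1c once the ML node is replaced by a birepetition node.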
A. Information Nodes
In case the decoder has already passed the first trial, and the index of the information bit to be flipped falls within an information node, after decoding the rate-1 code, the bit that corresponds to that index is flipped.
B. Repetition Nodes
Repetition nodes protect a single information bit ($k_v = 1$) by repeating it $N_v$ times at encoding time. Thus, fast-SSC decoding takes a hard decision on the sum of the $N_v$ node input LLRs. For the decision LLR, the original SCF algorithm uses the absolute value of the LLR that corresponds to the information bit. For a repetition node, both the SC and fast-SSC algorithms effectively calculate that LLR as the sum of all node input LLRs. Thus, it is proposed that the decision LLR be calculated in the same way, i.e., as the absolute value of the sum of the $N_v$ node input LLRs.
After the first trial, if the index of the information bit to be flipped corresponds to that protected by a repetition code, that bit is flipped after decoding.
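A minimal sketch of the repetition-node handling follows. It assumes the common convention that a non-negative LLR maps to bit 0, and it returns the node's estimated codeword (the decided bit repeated) together with the decision LLR; the function name and return format are choices made for this example.

```python
def decode_repetition(input_llrs, flip=False):
    """Decode a repetition node and return (codeword_estimate, decision_llr).

    Assumption: LLR >= 0 is decided as bit 0.
    """
    total = sum(input_llrs)
    bit = 0 if total >= 0 else 1
    if flip:
        # Later trial: the bit protected by this node is the one to flip.
        bit ^= 1
    # The decision LLR is the absolute value of the summed input LLRs.
    return [bit] * len(input_llrs), abs(total)
```

Note that the decision LLR does not depend on `flip`: it is only recorded during the first trial, where no bit is flipped.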
C. Birepetition Nodes
We define a birepetition node as a node where all bit locations are frozen with the exception of the two most-significant positions, which carry information bits ($k_v = 2$). In the original fast-SSC algorithm [7], birepetition codes of length $N_v = 4$ were decoded by the ML node, whereas longer birepetition codes were decomposed. It was shown in [9] that they can be efficiently decoded by recognizing the similarity with repetition nodes, i.e., they can be decoded as two independent repetition codes. The first repetition code is composed of the even-indexed locations and the second one of the odd-indexed locations. Therefore, the two decision LLRs are calculated as the absolute values of the sums of the even-indexed and of the odd-indexed node input LLRs, respectively.
Similarly to the repetition node, if the index of the information bit to be flipped corresponds to one of the two bits protected by a birepetition code, the corresponding information bit (even or odd indexed) is flipped after decoding.
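The even/odd split can be sketched in the same way as for the repetition node. As before, a non-negative LLR is assumed to map to bit 0, and the function returns the node's estimated codeword along with the two decision LLRs; the `flip` argument taking the values `"even"` or `"odd"` is an interface invented for this example.

```python
def decode_birepetition(input_llrs, flip=None):
    """Decode a birepetition node as two interleaved repetition codes.

    flip is None, "even", or "odd" and selects which of the two bits to flip.
    Returns (codeword_estimate, (even_decision_llr, odd_decision_llr)).
    """
    even_sum = sum(input_llrs[0::2])
    odd_sum = sum(input_llrs[1::2])
    even_bit = 0 if even_sum >= 0 else 1
    odd_bit = 0 if odd_sum >= 0 else 1
    if flip == "even":
        even_bit ^= 1
    elif flip == "odd":
        odd_bit ^= 1
    # Even locations repeat one bit, odd locations repeat the other.
    codeword = [even_bit if i % 2 == 0 else odd_bit for i in range(len(input_llrs))]
    return codeword, (abs(even_sum), abs(odd_sum))
```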
D. SPC Nodes
SPC nodes are a special type of $(N_v, k_v)$ polar codes with $k_v = N_v - 1$, where the only frozen bit is in the first location. A maximum-likelihood decoding algorithm for SPC codes, when a single estimated-bit vector is to be retained, can be summarized as flipping the information bit that corresponds to the least-reliable input LLR when the parity check is not satisfied [7], [10]. The challenge in adapting this type of node to SCF decoding is to calculate, with low complexity, meaningful alternative decision LLRs that take the parity constraint into account. Calculating the exact decision LLRs as in the original SCF algorithm would involve too many calculations. Thus, we propose an approximation similar to the detection-metric update rule for SPC nodes proposed in [11]. To this end, we define the decision LLRs as
where $s$ is a scaling factor, and $p$ is the parity calculated on all $N_v$ input LLRs, i.e., the modulo-2 sum of the hard decisions taken on those LLRs.
The approximations on the decision LLRs incur an error-correction performance loss. As will be shown in the next section, this loss can be partially compensated for by using the scaling factor $s$ but, more importantly, it becomes negligible as the maximum number of trials $T_{\max}$ is increased.
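The bit-level behavior of the SPC node can be sketched as follows: the hard decision with the single-bit parity correction of [7], [10], and the paired flip that keeps the parity constraint satisfied when a flip is requested inside this node. The decision-LLR approximation with the scaling factor $s$ is deliberately left out of the sketch. The function name, the convention that a non-negative LLR maps to bit 0, and the interpretation of `flip_index` as a position within the node are assumptions of this example.

```python
def decode_spc(input_llrs, flip_index=None):
    """Decode an SPC node (length >= 2) and return (codeword_estimate, parity).

    parity is the modulo-2 sum of the raw hard decisions (1 means violated).
    flip_index, if given, is the position in the node of the bit to flip.
    """
    bits = [0 if llr >= 0 else 1 for llr in input_llrs]
    order = sorted(range(len(input_llrs)), key=lambda i: abs(input_llrs[i]))
    i_min1, i_min2 = order[0], order[1]  # least- and second-least-reliable inputs

    parity = sum(bits) % 2
    if parity:
        # Parity violated: flip the least-reliable bit estimate.
        bits[i_min1] ^= 1

    if flip_index is not None:
        # Two estimates are flipped together so that the parity stays even.
        partner = i_min2 if flip_index == i_min1 else i_min1
        bits[flip_index] ^= 1
        bits[partner] ^= 1
    return bits, parity
```

Flipping in pairs means that every codeword estimate produced by this node, in any trial, has even parity.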
In case a bit flip is required, this node is more involved than the others, as two bit estimates need to be flipped simultaneously for the parity constraint to remain satisfied. Two cases can be distinguished. Let $i_{\mathrm{flip}}$ denote the location of the initial bit to be flipped, and let $i_{\min 1}$ and $i_{\min 2}$ be the indices of the least- and second-least-reliable input LLRs, respectively. If $i_{\mathrm{flip}} = i_{\min 1}$, both $i_{\mathrm{flip}}$ and $i_{\min 2}$ get flipped; otherwise, both $i_{\mathrm{flip}}$ and $i_{\min 1}$ get flipped.
E. Error-correction Performance
Fig. 2 compares the error-correction performance, in terms of frame-error rate (FER), of the new decoding algorithm against that of the original SCF algorithm as proposed in [2], where $T_{\max}$ is the maximum number of trials. A curve for the new algorithm where SPC nodes are not used is also included for reference. Without SPC nodes, SPC codes of length $N_v = 4$ (e.g., the node $v$ of Fig. 1c) are decomposed as a length-2 repetition code combined with a length-2 rate-1 code. Longer SPC codes generate a chain of rate-R and rate-1 nodes that terminates with a length-2 repetition code combined with a length-2 rate-1 code. A short (512, 128) polar code is used with a 16-bit CRC to be representative of what could be used in next-generation mobile-communication systems [1]. These simulation results are for BPSK-modulated random codewords transmitted over an AWGN channel.
Comparing the SCF curve (black with markers) against that of the proposed algorithm without SPC nodes (cyan with markers), either for $T_{\max} = 8$ (left) or $T_{\max} = 16$ (right), it can be seen that the FER is virtually the same.
However, as mentioned in the previous section, the use of (approximate) SPC nodes incurs a small performance loss. Looking at the results for $T_{\max} = 8$ (left side of Fig. 2), it can be seen that the loss for using SPC nodes with a scaling factor $s = 1$ (purple curve with markers) amounts to slightly more than 0.15 dB at a FER of 10 
IV. Hardware Implementation Considerations
As in [3], the list of decision LLRs $\lambda$ can be kept sorted with an insert-sort unit, running in parallel with the decoding process, capable of handling a maximum of $\min(P - 1, T_{\max} - 1)$ input LLRs, where $P$ is the maximum number of LLRs that specialized decoders can simultaneously access from memory. By keeping the list sorted, its size can be constrained to $T_{\max} - 1$. Thus, a memory of $Q_\lambda (T_{\max} - 1)$ bits is sufficient to store the decision-LLR list, where $Q_\lambda$ is the number of quantization bits used to represent decision LLRs. Alongside, a list of the corresponding indices requires a memory of $(T_{\max} - 1) \log_2 k$ bits.
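A software model of the bounded, sorted decision-LLR list may help to see why $T_{\max} - 1$ entries suffice. This is a behavioral sketch only, not a description of the insert-sort hardware of [3]; the class and function names are hypothetical, and rounding $\log_2 k$ up to an integer number of bits is an assumption made for values of $k$ that are not powers of two.

```python
import bisect
import math


class DecisionLlrList:
    """Keeps the (t_max - 1) smallest decision LLRs, sorted, with their bit indices."""

    def __init__(self, t_max):
        self.capacity = t_max - 1
        self.entries = []  # sorted list of (abs_llr, bit_index) tuples

    def insert(self, abs_llr, bit_index):
        if self.capacity <= 0:
            return
        if len(self.entries) == self.capacity and abs_llr >= self.entries[-1][0]:
            return  # not among the least-reliable candidates, discard
        bisect.insort(self.entries, (abs_llr, bit_index))
        del self.entries[self.capacity:]  # drop the entry pushed out, if any

    def flip_order(self):
        """Bit indices to flip in trials 2, 3, ..., from least to most reliable."""
        return [index for _, index in self.entries]


def list_memory_bits(t_max, q_lambda, k):
    """Memory, in bits, for the decision-LLR list and for the index list."""
    llr_bits = q_lambda * (t_max - 1)
    index_bits = (t_max - 1) * math.ceil(math.log2(k))
    return llr_bits, index_bits
```

For instance, with the hypothetical values `t_max = 8`, `q_lambda = 6`, and `k = 128`, `list_memory_bits` gives 42 bits for the LLRs and 49 bits for the indices.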
Starting with the implementation of the fast-SSC algorithm as described in [7], the ML unit (an unrolled generic SC decoder for length-4-only birepetition codes) would be replaced with a second copy of the unit implementing the repetition node in order to implement the Birepetition node. If $P > T_{\max} - 1$, the SPC unit would need an LLR sorter to retain only the $T_{\max} - 1$ smallest decision LLRs. Some bit-flipping circuitry needs to be added to all units handling information bits. These modifications are expected to have little impact on the critical path, since even the most involved modification (SPC unit) only appends bit flips; the remainder of the calculations can occur in parallel.
[Figure: axes labeled "Frame-error rate" and "Average Execution Time (CCs)"; legend entries "[3]" and "this work", repeated for $T_{\max} = 16$.]
Latency Comparison
The only reported hardware implementation of an SCF decoder [3] is built upon a slightly improved semi-parallel SC decoder with a latency, in clock cycles (CCs), defined as
where N is the polar-code length, and b is the location of the first information bit. The latency of the fast-SSC algorithm cannot be expressed in compact closed form as it heavily depends on the frozen-bit locations, and node types and constraints. However, numerical evaluations show that it is roughly an order of magnitude lower than that of SC decoding for all relevant code rates.
To get a grasp of the improvements under reasonable conditions, Fig. 3 compares the average execution time, in clock cycles, of the SCF decoder of [3] with an estimate of that of the proposed fast-SSC-flip decoding algorithm. For this work, all the original nodes from [7] were used except the ML node, which has been replaced with the more efficient Birepetition node. The Repetition, Birepetition, and SPC nodes are constrained to a maximum size of 32, 64, and 64, respectively. The value of $P$ is set to 64 to match that of [3]. All curves are for the same code used in Fig. 2. The scaling factor $s$ of (6) was set to 0.5. Fig. 3 confirms that the average execution time of the proposed fast-SSC-flip algorithm is close to an order of magnitude lower than that of the other work, both for $T_{\max} = 8$ (solid curves) and $T_{\max} = 16$ (dashed curves). Furthermore, it can be seen that the worst-case execution time for this work with $T_{\max} = 8$ (solid red bottom-most line) is only slightly worse than the best-case execution time of [3].
V. Conclusion
In this paper, we showed how to merge the state-of-the-art high-speed SC-based decoding algorithm, fast-SSC, with the SCF algorithm. The resulting algorithm was shown to have a significantly higher speed than the SCF-decoder implementation from the literature while retaining an error-correction performance very close to that of the original SCF algorithm. The key ingredients are the new decision-LLR calculations and bit-flipping procedures introduced to SCF decoding for the multi-bit dedicated decoders used in the fast-SSC algorithm.
The proposed decoding algorithm is promising for applications that can handle a variable execution time. Its error-correction performance can match that of list-based decoding with a small list, and its speed tends to that of the fastest low-complexity SC decoders in practical operating conditions.
