The successive cancellation list decoding algorithm for polar codes yields near-optimal decoding performance at the cost of high implementation complexity. The successive cancellation stack algorithm has been shown to provide similar decoding performance at a much lower computational complexity, but software implementations report sub-par throughput (T/P). In this technical report, the benefits of the fast simplified successive cancellation list decoder are extended to the stack algorithm, resulting in a throughput increase of two orders of magnitude over the traditional stack decoder.
II Background

II-A Polar codes
Polar codes asymptotically achieve the symmetric channel capacity of a binary-input discrete memoryless channel (B-DMC) W by considering a set of N independent copies of W and recursively applying a polarizing transform $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$ to the inputs of the channels, resulting in a second set of N polarized channels. A polar code of length N and message bit length K shall be denoted by PC(N, K). The encoding is given by $x = u F^{\otimes n}$, where $n = \log_2 N$ and $F^{\otimes n}$ is the n-th Kronecker power of the kernel F, and can be represented by the XOR tree shown in Figure 1. The tree has n stages, and the variable λ ∈ [0, n] is used to denote the current stage in the tree. Given a stage λ, there are $2^{n-\lambda}$ branches denoted by φ ∈ [0, $2^{n-\lambda} - 1$], and the size of each branch is $\Lambda = 2^{\lambda}$.
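The XOR tree can be sketched in C as an in-place butterfly. This is a minimal illustration, assuming one bit per `uint8_t` and natural (non bit-reversed) index order; the function name `polar_encode` is hypothetical and not taken from the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place polar transform x = u * F^{(x)n} over GF(2) for N = 2^n bits.
 * Each pass applies the kernel F = [1 0; 1 1] to pairs of elements that
 * lie `half` positions apart: (a, b) -> (a XOR b, b). */
void polar_encode(uint8_t *u, size_t N)
{
    assert(N != 0 && (N & (N - 1)) == 0);   /* N must be a power of two */
    for (size_t half = 1; half < N; half <<= 1) {
        for (size_t start = 0; start < N; start += 2 * half) {
            for (size_t i = start; i < start + half; i++) {
                u[i] ^= u[i + half];
            }
        }
    }
}
```

Because the Kronecker power of F is its own inverse over GF(2), applying the same routine twice returns the original vector.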
Given an information bit set
II-B Decoding algorithms
All decoding algorithms in this section are described in the LLR domain.
Successive cancellation
The successive cancellation (SC) decoding algorithm [1] operates on the encoding tree, propagating the channel values LLR($y_i$) from stage n to produce the decision LLRs at stage 0, according to the min-sum approximation [12] in Figure 2 and Equation (2).
The estimate $\hat{u}_i$ can then be made following Equation (3).
Figure 3 - SC decoding tree and schedule for PC(8, 4).
The XOR encoding tree in Figure 1 is reinterpreted as a binary tree as shown in Figure 3, and the stage λ and branch φ are used to identify each node, denoted by (λ, φ). The LLRs of a node v = (λ, φ) are calculated from the LLRs of its parent node and the bit estimates of the previous branch u = (λ, φ − 1) according to Equation (4). The LLR $\alpha_{(0,i)}$ calculated for a leaf node is the desired decision LLR for $\hat{u}_i$.
The bit estimates of the leaf nodes at (0, i) correspond to $\hat{u}_i$, and are obtained via a hard decision on their LLRs. The bit estimates of parent nodes v = (λ, φ) are calculated by propagating those of both the child nodes l = (λ − 1, 2φ) and r = (λ − 1, 2φ + 1), as shown in Equation (5).
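The node operations described above can be sketched as follows. This is an illustration under stated assumptions rather than the report's Equations (2) to (5): it uses the commonly cited min-sum forms, assumes the convention LLR = ln(P(0)/P(1)) so that a negative LLR maps to bit 1, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Min-sum update for a left child: sign(a) * sign(b) * min(|a|, |b|). */
float f_minsum(float a, float b)
{
    float mag = absf(a) < absf(b) ? absf(a) : absf(b);
    return ((a < 0.0f) != (b < 0.0f)) ? -mag : mag;
}

/* Update for a right child, given the bit estimate u of the previous branch. */
float g_update(float a, float b, uint8_t u)
{
    assert(u <= 1);
    return u ? (b - a) : (b + a);
}

/* Hard decision on an LLR: 0 if non-negative, 1 otherwise. */
uint8_t hard_decision(float llr) { return llr < 0.0f; }

/* Propagate the estimates of children l and r (each of length half)
 * into the parent v (length 2 * half). Buffers must not overlap. */
void beta_combine(const uint8_t *beta_l, const uint8_t *beta_r,
                  uint8_t *beta_v, size_t half)
{
    for (size_t i = 0; i < half; i++) {
        beta_v[i] = beta_l[i] ^ beta_r[i];
        beta_v[i + half] = beta_r[i];
    }
}
```

For a frozen leaf the estimate is fixed to 0 regardless of the hard decision; that check is left to the caller in this sketch.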
Fast simplified successive cancellation
The fast simplified successive cancellation (FSSC) decoding algorithm [4] improves upon the computational complexity of the SC decoding algorithm by recognizing constituent codes in the SC decoding tree and pruning the nodes. The four nodes considered are:
• Rate-0
Rate-0 (R-0) nodes are the nodes in the SC tree below which all the leaf nodes correspond to frozen bits. For an R-0 node at v = (λ, φ) in the decoding tree, no further traversal is needed and the bit estimates for the stage can be updated as follows:
• Repetition
Repetition (REP) nodes contain only a single information bit at the rightmost leaf node. The bit estimates for a REP node at v = (λ, φ) in the tree can therefore only be all 0's or all 1's, and the decision is made using an efficient ML decoding by:
• Rate-1
Rate-1 (R-1) nodes are the nodes in the SC tree below which all the leaf nodes correspond to information bits. Similar to R-0 nodes, an R-1 node at v = (λ, φ) requires no further traversal and the bit estimates for the stage can be updated by taking a hard decision on the stage LLRs:
• Single parity check
Single parity check (SPC) nodes contain only a single frozen bit at the leftmost leaf node. The bit estimates for an SPC node at v = (λ, φ) in the tree therefore have to satisfy a parity constraint such that the XOR of all the estimates is 0. This can be achieved by computing the parity of the hard decisions of the SPC node LLRs, and then flipping the least reliable estimate if the parity is 1:
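The four node rules listed above can be sketched in C as follows. This is an illustration that follows the textual descriptions only; it assumes a negative LLR maps to bit 1, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* R-0: every estimate is a frozen 0. */
void decode_r0(uint8_t *beta, size_t len)
{
    for (size_t i = 0; i < len; i++) beta[i] = 0;
}

/* REP: all estimates equal the hard decision on the sum of the LLRs. */
void decode_rep(const float *alpha, uint8_t *beta, size_t len)
{
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) sum += alpha[i];
    uint8_t bit = sum < 0.0f;
    for (size_t i = 0; i < len; i++) beta[i] = bit;
}

/* R-1: independent hard decision on each LLR. */
void decode_r1(const float *alpha, uint8_t *beta, size_t len)
{
    for (size_t i = 0; i < len; i++) beta[i] = alpha[i] < 0.0f;
}

/* SPC: hard decisions, then flip the least reliable one if parity is 1. */
void decode_spc(const float *alpha, uint8_t *beta, size_t len)
{
    assert(len > 0);
    uint8_t parity = 0;
    size_t min_idx = 0;
    float min_mag = alpha[0] < 0.0f ? -alpha[0] : alpha[0];
    for (size_t i = 0; i < len; i++) {
        float mag = alpha[i] < 0.0f ? -alpha[i] : alpha[i];
        beta[i] = alpha[i] < 0.0f;
        parity ^= beta[i];
        if (mag < min_mag) { min_mag = mag; min_idx = i; }
    }
    beta[min_idx] ^= parity;   /* no change when the parity is already 0 */
}
```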
Since the FSSC scheme does not traverse the decoding tree down to the leaf nodes, the bit estimates $\hat{u}_i$ are not readily available. With non-systematic encoding, $\hat{u}_i$ can be obtained by re-encoding the estimated codeword present in the bit estimates at the root node of the tree. With systematic encoding [13, 14], $\hat{u}_i$ is directly available.
Successive cancellation list
When the SC decoding algorithm encounters an information bit, an immediate decision is made and half the potential remaining paths are discarded from consideration. In contrast, the successive cancellation list (SCL) decoder considers both candidates for each information bit and retains up to L paths in a list. In order to ascertain which paths should remain in the list and which should be discarded, each path is associated with a path metric (PM) that is updated using the LLRs when a decision is made at the leaf nodes for bit index i, as shown in Equation (6) [15].
After the SCL decoder has estimated all N bits, the path with the best PM is returned as the decoding output. Results in [3] show a significant improvement in error correction performance by appending a small cyclic redundancy check (CRC) code to the message bits to aid the SCL decoder in choosing the correct path from the final candidates in the list.
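A leaf-level PM update can be sketched as below. The sketch assumes the widely used approximate LLR-domain metric, in which a path is penalized by |LLR| whenever its bit disagrees with the hard decision, and it assumes that a lower PM is better; whether this matches Equation (6) exactly is not confirmed by the text here. The name `pm_update` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate LLR-domain path metric update for one leaf decision.
 * Assumption: lower PM means a more likely path. */
float pm_update(float pm, float llr, uint8_t u)
{
    assert(u <= 1);
    uint8_t hd = llr < 0.0f;                 /* hard decision on the LLR */
    float mag = llr < 0.0f ? -llr : llr;     /* reliability |LLR| */
    return (u == hd) ? pm : pm + mag;
}
```

At an information bit, calling the function once with u = 0 and once with u = 1 yields the metrics of the two candidate paths; at a frozen bit only u = 0 is evaluated.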
Fast simplified successive cancellation list
The FSSC scheme is applied to SCL decoding in [5], by defining the path creation and PM update for an FSSC node v located at (λ, φ) as follows:
• Rate-0
An R-0 node creates no new paths, and the PMs and node bit estimates are updated for every path l in the list.
• Repetition
REP nodes create only two candidate paths for each path in the list, each with its own bit estimates and PM update.
The update equations for the SPC node are omitted for the sake of brevity, and can be referenced from [5].
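The R-0 and REP path metric updates can be illustrated as follows. This is a sketch and not the equations of [5]: it assumes the approximate LLR-domain metric in which each node LLR whose hard decision disagrees with the chosen bit adds its magnitude to the PM, with a lower PM being better. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of |alpha[i]| over positions whose hard decision differs from bit. */
static float node_penalty(const float *alpha, size_t len, uint8_t bit)
{
    assert(bit <= 1);
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        uint8_t hd = alpha[i] < 0.0f;
        if (hd != bit) sum += alpha[i] < 0.0f ? -alpha[i] : alpha[i];
    }
    return sum;
}

/* R-0: no new path, all estimates are 0, only the PM changes. */
float pm_rate0(float pm, const float *alpha, size_t len)
{
    return pm + node_penalty(alpha, len, 0);
}

/* REP: two candidates per path, all-zero (index 0) and all-one (index 1). */
void pm_rep(float pm, const float *alpha, size_t len, float pm_out[2])
{
    pm_out[0] = pm + node_penalty(alpha, len, 0);
    pm_out[1] = pm + node_penalty(alpha, len, 1);
}
```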
Successive cancellation stack
The SCL decoder considers L candidate paths for each bit estimate in the codeword, resulting in a total search space of LN paths. At this point, the term iteration is defined as a decoder making a leaf node bit estimate for a candidate path. The SC decoder therefore takes N iterations to produce the decoding result, while the SCL decoder takes LN iterations.
The successive cancellation stack (SCS) algorithm [9] is a sequential traversal through the same search space as the SCL decoder. The algorithm begins by extending an initial path following the SC procedure, and updating its PM following Equation (6) . At the time of estimating information bits, both candidates are considered and the less reliable path is stored in a stack of size D that is assumed to be sufficiently large. As the algorithm proceeds, the number of candidates in the stack grows, and in each iteration only the path with the winning PM is extended. If the winning path has a length of N, its bit estimates are returned as the decoding result and the algorithm terminates. Alternatively, the CRC-aided scheme in SCL can be applied to validate the decoded result [10] . If the CRC check fails, the path is removed from the stack and the algorithm continues. By nature of the algorithm, if L paths fail the final CRC check, then all paths are removed from the stack and the algorithm terminates. 
III Implementation Details
This section introduces the memory layout and decoding schedule implementation for the successive cancellation family of polar decoders, which is then extended to incorporate list decoding. Finally, the stack decoder implementation is discussed.
III-A Successive cancellation decoders
The SC and FSSC algorithms make use of the α and β memory tree structures shown in Figure 4 to store intermediate LLR calculations and bit propagations. The memory is structured according to the space-efficient scheme outlined in [3], and has a space complexity of O(N) that scales linearly with the code length N.
Each stage λ in the α memory is only given Λ slots of memory, enough to store the LLRs of a single branch φ. This is possible because, upon observing the SC schedule, one can see that when calculating the LLRs $\alpha_v[i]$ at a node v = (λ, φ), the LLRs for all branches φ′ < φ in the same stage λ will not be used again and can be safely overwritten.
Each stage in the β memory is given 2Λ slots of memory. This is because a stage must store the bit estimates from two child branches in order to update the parent node in the stage above. Once the bit estimates have been propagated, the values can be safely overwritten by subsequent nodes in the stage.
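The linear memory footprint follows from summing the per-stage sizes. The sketch below assumes, purely for illustration, that all stages are packed back to back in one flat array; the actual layout of Figure 4 may differ, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of stage lambda in a flat alpha array with 2^k slots for each
 * stage k: sum over k < lambda of 2^k = 2^lambda - 1. */
size_t alpha_offset(unsigned lambda)
{
    assert(lambda < 8 * sizeof(size_t));
    return ((size_t)1 << lambda) - 1;
}

/* Offset of stage lambda in a flat beta array with 2 * 2^k slots per stage. */
size_t beta_offset(unsigned lambda)
{
    return 2 * alpha_offset(lambda);
}

/* Total slots for a code of length N = 2^n (stages 0 to n inclusive). */
size_t alpha_total(unsigned n) { return alpha_offset(n + 1); }  /* 2N - 1 */
size_t beta_total(unsigned n)  { return beta_offset(n + 1); }   /* 4N - 2 */
```

Under this packing the α memory needs 2N − 1 slots and the β memory 4N − 2 slots, both O(N).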
The schedule of operations in the SC decoder is realized at run time using the index i of the current bit $\hat{u}_i$ being estimated, by implementing Equations (4) and (5) according to Algorithms 1 and 2 [3].
Computing the FSSC schedule at run time incurs a significant computational penalty since the entire decoding tree has to be traversed to identify the FSSC nodes. The FSSC schedule is therefore created and stored as the decoder is instantiated, which the decoder can then load and loop through for each decoding run. The schedule is stored as operations and the nodes in the tree at which they are performed. Figure 5 illustrates an example of an FSSC schedule created for the SC decoding tree in Figure 3. By abuse of notation, the operations that implement Equations (4) and (5) are denoted by α and β respectively, and the operations R-0, R-1, REP and SPC implement the equation for the corresponding node.
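One way to precompute such a schedule is a recursive walk over the frozen-bit pattern, sketched below. The types `op_t` and `step_t`, the function names, and the convention that node (λ, φ) covers leaves φ·2^λ to (φ + 1)·2^λ − 1 are assumptions made for illustration; the caller is assumed to provide a buffer of at least 4N entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_ALPHA, OP_BETA, OP_R0, OP_R1, OP_REP, OP_SPC } op_t;
typedef struct { op_t op; unsigned lambda; size_t phi; } step_t;

/* Returns 1 and sets *out if the leaves [first, first + len) form a
 * special node; frozen[i] is 1 for a frozen bit and 0 otherwise. */
static int classify(const uint8_t *frozen, size_t first, size_t len, op_t *out)
{
    size_t n_frozen = 0;
    for (size_t i = 0; i < len; i++) n_frozen += frozen[first + i] != 0;
    if (n_frozen == len) { *out = OP_R0; return 1; }
    if (n_frozen == 0)   { *out = OP_R1; return 1; }
    if (n_frozen == len - 1 && !frozen[first + len - 1]) { *out = OP_REP; return 1; }
    if (n_frozen == 1 && frozen[first]) { *out = OP_SPC; return 1; }
    return 0;
}

/* Appends the steps for node (lambda, phi) to sched and returns the new
 * step count. */
size_t build_schedule(const uint8_t *frozen, unsigned lambda, size_t phi,
                      step_t *sched, size_t count)
{
    size_t len = (size_t)1 << lambda;
    op_t op;
    if (classify(frozen, phi * len, len, &op)) {
        sched[count++] = (step_t){ op, lambda, phi };
        return count;
    }
    assert(lambda > 0);   /* a single leaf is always R-0 or R-1 */
    sched[count++] = (step_t){ OP_ALPHA, lambda - 1, 2 * phi };
    count = build_schedule(frozen, lambda - 1, 2 * phi, sched, count);
    sched[count++] = (step_t){ OP_ALPHA, lambda - 1, 2 * phi + 1 };
    count = build_schedule(frozen, lambda - 1, 2 * phi + 1, sched, count);
    sched[count++] = (step_t){ OP_BETA, lambda, phi };
    return count;
}
```

Calling `build_schedule(frozen, n, 0, sched, 0)` once at instantiation yields a flat list that each decoding run can loop through without re-examining the frozen set.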
III-B List decoders
The naive approach to using these memory trees is to duplicate the α and β values for new candidate paths, which results in wasted memory operations for paths that are killed before the values are used.
The authors in [3] propose a lazy-copy scheme in which α and β memory is allocated stage by stage, rather than the tree as a whole, and new candidate paths point to the memory of the parent path that created them. Memory duplication now only occurs when a path needs to modify a stage in memory pointed to by multiple paths, and only that stage is duplicated. The SCL and FSSCL decoders in this work implement a minor modification to the lazy-copy scheme of [3] to support decoding in the LLR domain.
The decoding schedule for SCL and FSSCL is realized in the same manner as outlined for their counterparts SC and FSSC.
III-C Stack decoder
A candidate path that is placed on the stack must store its:
• path metric (PM)
• path length (PL)
• intermediate α and β values
The stack is implemented as arrays of length D of these data structures, and when a path is placed on the stack, it is assigned an index at which to store its values in these arrays. The winning path for each iteration is determined through a linear search on the PM array.
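The linear search for the winning path can be sketched as follows. The `in_use` occupancy flags and the function name are hypothetical additions for illustration, and a lower PM is assumed to be better.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the occupied stack slot with the lowest PM,
 * or depth if the stack holds no paths. */
size_t winning_path(const float *pm, const uint8_t *in_use, size_t depth)
{
    assert(pm != NULL && in_use != NULL);
    size_t best = depth;
    for (size_t d = 0; d < depth; d++) {
        if (in_use[d] && (best == depth || pm[d] < pm[best])) {
            best = d;
        }
    }
    return best;
}
```

The search costs O(D) per iteration; a priority queue would reduce this at the expense of extra bookkeeping when paths are inserted and removed.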
The PM and PL arrays are one-dimensional with a space complexity of O(D), while the bit estimates array is two dimensional with a complexity of O(DN). The data-structure for the α and β values follows the same structure as in Figure 4 , resulting in a memory complexity of O(DN). Its usage is also governed by the lazy copy scheme in [3] . Finally, the schedule for SCS is realized following the same Algorithms 1 and 2 as in the SC decoder.
Based on the observation that the SCS decoder extends only one path at a time, a reduced memory scheme (SCS-RM) is proposed in [11] in which only a single copy of the α and β memory is instantiated. The initial path is created, and as long as there is no path switch, intermediate α and β values remain valid and the path can continue to be extended. Potential candidates that are created store only their PM, PL and leaf node bit estimates $\hat{u}_0$ through $\hat{u}_{i-1}$, where i is the current length of the path.
Algorithm 3: Populating β memory with a new path p
A path switch renders the α and β memory values invalid, which now have to be recalculated for the new path. This is achieved by first populating the β memory with the 
IV Fast simplified stack decoding
The FSSCL scheme of [5] can readily be applied to the SCS decoder. The key difference is that the FSSCL decoder has all candidate paths available at a given node, and is able to prune paths and pick the survivors immediately. In contrast, the FSSCS decoder creates all the candidate paths for the node and places them on the stack, and the paths are either further extended or killed at a later point following the SCS algorithm rules.
Two key implementation details are highlighted, the first of which is that the FSSCS decoder switches between paths at different points in the FSSC schedule. While the path length alone can be used to determine the coordinates of the current node (λ, φ) in the decoding tree, it is not sufficient to determine which FSSC operation (α, β, R-0, R-1, REP or SPC) must be performed. To this end, when a path is placed on the stack, it stores an additional parameter: its current progress in the FSSC schedule.
The second detail involves applying the SCS-RM scheme of [11] to the FSSCS decoder, referred to as FSSCS-RM. Since the FSSC scheme does not necessarily traverse down to the leaf nodes to make bit estimates, it is impossible for the FSSCS-RM decoder to repopulate the β memory via the SCS-RM Algorithm 3. This hurdle is overcome by changing the structure of the β memory of the FSSCS-RM decoder. The work in [19] presents an efficient scheme to compute and store the β values in the context of VLSI design, which is adapted to software in this work.
The β memory is now organized as an array of N bits as shown in Figure 6a. When a node's bit estimates are made following the FSSCL equations, the estimates are stored directly in the β memory array beginning at the index corresponding to the length of the path. In the case of a β propagation operation, the bits are XOR-ed in place. Figure 6 shows the usage of the β memory for the FSSC schedule of Figure 5. In Figures 6b and 6c, the four bits corresponding to the REP and SPC node respectively are stored at the correct locations, following which Figure 6d shows the β operation performed in place.
The final content of the β memory is the estimated codeword at the root node of the tree, and by using systematic encoding, the estimated message bits are readily available.
Figure 6 - β memory usage for the FSSC schedule of Figure 5: (a) β memory structure; (b) β memory contents following REP node; (c) β memory contents following SPC node; (d) β memory contents following the β operation at (3,0).
This leads to the observation that each path on the stack can store the β array directly instead of the bit estimates $\hat{u}_i$. An additional advantage is that a path switch in the FSSCS-RM scheme does not need to re-populate the β array, since the propagations are already correctly stored in place.
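The flat β array can be sketched as follows, assuming one bit per `uint8_t` and that a node's estimates occupy a contiguous span starting at the path length. The function names and the example bit values in the note below are illustrative and not taken from Figure 6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the len estimates of a decoded node at index first, which is the
 * length of the path before the node was decoded. */
void beta_store(uint8_t *beta, size_t first, const uint8_t *est, size_t len)
{
    memcpy(beta + first, est, len);
}

/* In-place beta operation for a parent covering beta[first .. first+len-1]:
 * the left half becomes left XOR right, the right half stays unchanged. */
void beta_combine_inplace(uint8_t *beta, size_t first, size_t len)
{
    assert(len >= 2 && len % 2 == 0);
    size_t half = len / 2;
    for (size_t i = 0; i < half; i++) {
        beta[first + i] ^= beta[first + half + i];
    }
}
```

For a length-8 code with a REP node followed by an SPC node, one would store four bits at index 0, four bits at index 4, and then call `beta_combine_inplace(beta, 0, 8)`. Copying a path then amounts to copying one N-entry array.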
V Results and discussion
Simulations are performed for PC(1024, 512), and the set of information bit indices A is obtained from the polar code sequence listed in the 3GPP technical specification for the 5G standard [7]. The CRC used in all variants of the SCL and SCS decoders is the 24-bit CRC-24C with a polynomial of 0xB2B117, also provided in [7]. The list parameter L is set to 8 and the stack size D is set to the maximum size NL = 8192 for all decoders.
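A bit-serial CRC with this polynomial can be sketched as below. The sketch assumes a zero initial register, no final XOR, and one message bit per `uint8_t`; any additional processing applied in [7] or in the report's simulations is not modeled, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC24C_POLY 0xB2B117u   /* low 24 bits of the generator polynomial */

/* Returns the 24-bit remainder of the message polynomial times x^24. */
uint32_t crc24c(const uint8_t *bits, size_t len)
{
    uint32_t reg = 0;
    for (size_t i = 0; i < len; i++) {
        assert(bits[i] <= 1);
        uint32_t feedback = ((reg >> 23) & 1u) ^ bits[i];
        reg = (reg << 1) & 0xFFFFFFu;
        if (feedback) reg ^= CRC24C_POLY;
    }
    return reg;
}
```

A candidate path passes the check when the CRC computed over its message bits followed by the 24 received CRC bits (most significant bit first) is zero.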
All code is written in C and compiled with GCC version 6.3.0 using the -Ofast, -march=native, -funroll-loops and -finline-functions compile flags. The α and β values are implemented using 32-bit floating point numbers and 8-bit unsigned integers respectively. Simulations are run using 6 threads on an AMD Ryzen 5 1600 6-Core CPU clocked at 3.2 GHz. The T/P of the decoder is reported as an average per thread, considering only information bits.
Figure 7a shows that the FER performance is maintained for all variants of the stack and list decoders. The slight FER performance degradation in the fast simplified decoders is attributed to the Chase-II approximation used [5]. Figure 7b shows that the baseline T/P of the SCS decoder is, at best, 9 Kbps at an $E_b/N_0$ of 3 dB, which is more than an order of magnitude lower than the SCL T/P of 314 Kbps. The SCS-RM scheme is able to improve the SCS throughput by more than an order of magnitude to 232 Kbps. The FSSCL decoder reports a T/P of 1.22 Mbps, which is four times the T/P of SCL. Finally, applying the fast simplified scheme to SCS decoding results in similar throughput gains as observed with SCL. At an $E_b/N_0$ of 3 dB, FSSCS-RM provides a T/P of 930 Kbps, which is four times the T/P of SCS-RM and two orders of magnitude more than the T/P of the baseline SCS.
VI Conclusion
This report outlines a procedure for applying the fast simplified scheme [5] to the reduced memory stack decoder [11]. Results show that the T/P of the FSSCS-RM decoder is improved by two orders of magnitude over the baseline SCS decoder, from 9 Kbps to 930 Kbps. The FSSCS-RM decoder using the largest stack size achieves the T/P of the FSSCL decoder at practical SNRs.
