Polar codes have attracted increasing attention in recent years because of their capacity-achieving property. However, their error-correction performance under successive cancellation (SC) decoding is inferior to that of other modern channel codes at short or moderate blocklengths. The SC-Flip (SCF) decoding algorithm outperforms SC decoding by identifying possibly erroneous decisions made in the initial SC decoding and flipping them in subsequent decoding attempts. However, it performs poorly when a codeword contains more than one erroneous decision. In this paper, we propose a path metric aided bit-flipping decoding algorithm to identify and correct more errors efficiently. In this algorithm, the bit-flipping list is generated based on both the log-likelihood ratio (LLR) based path metric and the bit-flipping metric. The path metric is used to verify the effectiveness of bit-flipping. To reduce the decoding latency and computational complexity, a corresponding pipeline architecture is designed. With this decoding algorithm and pipeline architecture, the error-correction performance improves by up to 0.25 dB compared with SCF decoding at a frame error rate of 10^{-4}, with low average decoding latency.
I. INTRODUCTION
Polar codes [1] are the first channel codes proven to achieve the capacity of various communication channels and have been selected for the control channel in the 5G enhanced Mobile Broadband (eMBB) scenario [2]. However, for short to moderate blocklengths, the performance of successive cancellation (SC) decoding is worse than that of Turbo codes or low-density parity-check (LDPC) codes. To overcome this limitation, SC list (SCL) decoding [3] was introduced to improve the performance at the cost of increased computational complexity and decoding latency.
In recent work, the successive cancellation flip (SCF) decoding proposed in [4] was shown to provide error-correction performance close to that of SCL decoding with a small list size, while keeping the computational complexity close to that of SC. The idea of the SCF decoder is to allow multiple subsequent decoding attempts that opportunistically correct the erroneous decision made in the initial SC decoding by flipping the most unreliable bit. Modifications to SCF decoding are proposed in [5] [6] [7] to reduce the decoding latency and implementation complexity. However, these decoding methods focus on correcting the first error and cannot identify more than one erroneous decision, which limits their error-correction performance.
This work was supported in part by the NSF of China (Grant No. 61874140).
To enhance the performance of SCF decoding, several improvements have been proposed to correct more erroneous decisions [8] [9] [10] [11] [12]. In [8], the codeword is subdivided into several partitions, on which SCF is run individually. However, this method can correct multiple errors only if the erroneous bits happen to be spread evenly over different partitions, which limits its correction capability. In [9], the distribution of the first erroneous bit is investigated and the search scope of flipping bits is restricted to a subset of the information bits. By iteratively modifying the subset, this method can identify multiple incorrect bits. However, erroneous decisions are caused not only by the transmission capability of the subchannel itself, but also by the current channel noise. Therefore, the positions of the flipping bits cannot be determined by considering the codeword alone.
The dynamic SCF (DSCF) decoding algorithm proposed in [12] shows a promising way to identify multiple erroneous decisions, but it does not correct them efficiently. In contrast, our method generates the bit-flipping list based on both the path metric and the bit-flipping metric. The path metric of each decoding attempt is used as feedback to verify the effectiveness of the bit-flipping attempt. Starting from an effective bit-flipping attempt, we can identify multiple erroneous decisions step by step. To reduce the decoding latency and implementation complexity, a pipeline decoding architecture is designed.
The remainder of this work is organized as follows: in Section II, an overview of polar codes, SCF decoding, and DSCF decoding is presented. In Section III, the proposed decoding algorithm and its corresponding pipeline architecture are detailed. Section IV reports the simulation results, and then conclusions are drawn in Section V.
II. PRELIMINARY

A. Polar Codes
Polar codes characterized by (N, K, I) can achieve channel capacity via the phenomenon of channel polarization [1]. The channel polarization theorem states that, as the blocklength N goes to infinity, a polarized subchannel becomes either a noiseless channel or a pure noise channel. By transmitting information bits over the reliable subchannels and transmitting frozen bits, which are known by both transmitter and receiver, over the unreliable subchannels, polar codes can achieve the channel capacity. Hence, constructing a polar code is equivalent to finding the K most reliable subchannels over which the information bits are transmitted, with a set I indicating the locations of these subchannels. Many construction methods have been proposed to calculate the reliability of subchannels. In our work, we use the Gaussian approximation (GA) based density evolution method proposed in [13], which is popular for its good tradeoff between complexity and performance.
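The encoding process of a polar code can be represented as a matrix multiplication:
$$x_1^N = u_1^N G_N, \qquad G_N = B F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad n = \log_2 N.$$
The vector $u_1^N$, which holds the information bits, denotes the source codeword to be encoded, the vector $x_1^N$ denotes the encoded codeword, and $G_N$ is the generator matrix; $\otimes$ denotes the Kronecker product and $B$ is a bit-reversal permutation matrix. For decoding, we denote by $y$ the data received from the channel, which serve as the decoder input. The decoder's output is denoted by $\hat{u}_1^N$, where $\hat{u}_i$ is the hard-decision estimate of the bit $u_i$. This hard decision is made according to the log-likelihood ratio (LLR) $L_i = \log \frac{\Pr(y, \hat{u}_1^{i-1} \mid u_i = 0)}{\Pr(y, \hat{u}_1^{i-1} \mid u_i = 1)}$ by using the hard decision function $h$:
$$\hat{u}_i = h(L_i) = \begin{cases} u_i, & i \notin I, \\ \dfrac{1 - \mathrm{sign}(L_i)}{2}, & i \in I, \end{cases}$$
where $\mathrm{sign}(L_i) = \pm 1$. The LLRs at the different calculation stages $l$ are computed recursively with the functions $f$ and $g$, where $\hat{s}$ denotes the partial sum of $\hat{u}_1^{i-1}$. In the LLR domain, $f$ and $g$ perform the following calculations for given input LLRs $L_a$ and $L_b$:
$$f(L_a, L_b) = 2 \tanh^{-1}\!\left( \tanh\frac{L_a}{2} \, \tanh\frac{L_b}{2} \right), \qquad g(L_a, L_b, \hat{s}) = (-1)^{\hat{s}} L_a + L_b.$$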
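The encoding and SC decoding steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the bit-reversal permutation B is omitted (natural bit order), frozen bits are assumed to be zero, f uses the common min-sum approximation instead of the exact tanh form, and the names `polar_encode` and `sc_decode` are hypothetical.

```python
import math


def polar_encode(u):
    """Compute x = u * F^{(x)n} over GF(2) in natural bit order (no bit reversal)."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]  # butterfly: upper branch accumulates lower branch
        step *= 2
    return x


def f(a, b):
    """Min-sum approximation of 2*atanh(tanh(a/2)*tanh(b/2))."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g(a, b, s):
    """g(L_a, L_b, s) = (-1)^s * L_a + L_b, with partial sum s in {0, 1}."""
    return (1 - 2 * s) * a + b


def h(llr):
    """Hard decision: 0 for a non-negative LLR, 1 otherwise."""
    return 0 if llr >= 0 else 1


def sc_decode(llr, frozen):
    """Recursive SC decoding; returns (u_hat, x_hat) for this sub-block.

    frozen[i] is True when bit i is frozen (assumed to carry the value 0).
    """
    n = len(llr)
    if n == 1:
        bit = 0 if frozen[0] else h(llr[0])
        return [bit], [bit]
    half = n // 2
    # Left child: combine the two halves with f.
    left = [f(llr[k], llr[k + half]) for k in range(half)]
    u_left, x_left = sc_decode(left, frozen[:half])
    # Right child: use the partial sums of the left child in g.
    right = [g(llr[k], llr[k + half], x_left[k]) for k in range(half)]
    u_right, x_right = sc_decode(right, frozen[half:])
    x_hat = [x_left[k] ^ x_right[k] for k in range(half)] + x_right
    return u_left + u_right, x_hat
```

The recursion mirrors the structure of F^{⊗n}: the first half of the codeword is the XOR of the two sub-codewords, which is why f is applied first and g reuses the re-encoded left half as the partial sum.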
B. Successive Cancellation Flip Decoding
SCF decoding is a slightly modified SC decoding algorithm, characterized by a number of extra decoding attempts in which unreliable bits from the initial SC decoding are flipped. The decoding procedure is as follows. After the first SC decoding pass, the concatenated cyclic redundancy check (CRC) is verified. If it matches, the decoding procedure stops and the estimated $\hat{u}_1^N$ is output. Otherwise, a list of positions of the least reliable estimated bits is built and another SC decoding pass is launched. In this pass, once the location of the information bit that corresponds to the least reliable bit is reached, that estimated bit is flipped before the subsequent SC decoding continues. Once an SC decoding pass has finished, the CRC is verified again. This procedure is repeated until the CRC passes or a predetermined maximum number T of decoding attempts is reached. However, since the concatenated CRC cannot indicate the number of erroneous decisions or their positions, the performance of SCF decoding is bounded by that of a hypothetical decoder, called the SC-oracle decoder [4], which can accurately avoid all first wrong decisions.
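The control flow described above can be sketched as follows. This is a hedged illustration rather than a reference implementation: the SC decoder is abstracted as a hypothetical callable `sc_pass(flip_index)` that returns the estimated bits and the decision LLRs, `crc_ok` is a hypothetical CRC predicate, and T is assumed here to count the extra attempts after the initial pass.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: sc_pass(None) runs plain SC; sc_pass(i) flips bit i.
SCPass = Callable[[Optional[int]], Tuple[List[int], List[float]]]


def scf_decode(
    sc_pass: SCPass,
    crc_ok: Callable[[List[int]], bool],
    info_set: Sequence[int],
    max_flips: int,
) -> Tuple[List[int], int]:
    """Return (estimate, number of SC passes used)."""
    u_hat, llrs = sc_pass(None)  # initial SC decoding pass
    passes = 1
    if crc_ok(u_hat) or max_flips < 1:
        return u_hat, passes
    # Flip list: information positions with the smallest |LLR| first.
    flip_list = sorted(info_set, key=lambda i: abs(llrs[i]))[:max_flips]
    for position in flip_list:
        trial, _ = sc_pass(position)  # rerun SC with one decision flipped
        passes += 1
        if crc_ok(trial):
            return trial, passes
    return u_hat, passes  # no attempt passed the CRC; keep the initial estimate
```

Passing the decoder as a callable keeps the sketch independent of any particular SC implementation; only the ordering of the flip list and the stopping rule are shown.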
C. Dynamic SCF Decoding
DSCF decoding, which aims to correct multiple erroneous bits, was proposed in [12]. This decoding method is characterized by a bit-flipping list $L_{flip}$, which is updated after every decoding attempt. It contains T bit-flipping sets {E_1, ..., E_ω} with the highest probability of correcting the trajectory of the SC decoding. Here ω denotes the maximum number of bits that can be corrected by this list. It is called the noise order in [12] and indicates the correction capability of the bit-flipping list. Under this definition, SCF decoding is of order 1.
The DSCF decoding algorithm builds the bit-flipping list by using a new bit-flipping metric M α [11] , which takes into account the serial nature of the SC decoder. This metric has a much higher probability to find the first error that occurred during the sequential decoding process than the absolute value of LLRs. The method of calculating the probability of a flip set E ω with ω flipping bits to correct the trajectory of SC is close to that used to calculate P ( [13] . By using this method, the probability P (E ω ) can be computed by the following expression:
However, since the computation of $p_e(\hat{u}[E_{\omega-1}]_j)$ is a hard task, it can be approximately replaced by $q_e(\hat{u}[E_{\omega-1}]_j) = \frac{1}{1 + \exp(|L[E_{\omega-1}]_j|)}$. In this way, they defined their bit-flipping metric as:
Specially for the initial SC decoding pass, the M α (i) of each information bit can be calculated as:
The procedures of SCF decoding and DSCF decoding differ in how the bit-flipping list is updated. In DSCF decoding, after each decoding attempt, a new flipping bit is added to the current flip set and the corresponding M_α is computed. If this M_α is greater than the smallest one in the list, the new bit-flipping set is inserted into the list.
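The metric computation for the initial SC pass and the list update can be illustrated with a short sketch. The product form used below is an assumption: it multiplies the approximate error probability q_e of bit i by the probabilities (1 − q_e) that all earlier information bits were decided correctly, which reflects the serial nature of SC decoding. The scaling parameter `alpha` and its placement inside the exponent are also assumptions, as are the function names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def q_err(abs_llr: float, alpha: float = 1.0) -> float:
    """Approximate error probability 1 / (1 + exp(alpha * |L|)), computed stably."""
    e = math.exp(-alpha * abs_llr)
    return e / (1.0 + e)


def first_flip_metrics(
    llrs: Sequence[float], info_set: Sequence[int], alpha: float = 1.0
) -> Dict[int, float]:
    """Assumed form: M(i) = q(i) * prod over earlier info bits j of (1 - q(j))."""
    metrics: Dict[int, float] = {}
    no_earlier_error = 1.0
    for i in sorted(info_set):
        q = q_err(abs(llrs[i]), alpha)
        metrics[i] = q * no_earlier_error
        no_earlier_error *= 1.0 - q
    return metrics


def update_flip_list(
    flip_list: List[Tuple[float, Tuple[int, ...]]],
    new_set: Tuple[int, ...],
    new_metric: float,
    max_size: int,
) -> List[Tuple[float, Tuple[int, ...]]]:
    """Insert a flip set and keep only the max_size sets with the largest metric."""
    flip_list.append((new_metric, new_set))
    flip_list.sort(key=lambda entry: entry[0], reverse=True)
    del flip_list[max_size:]
    return flip_list
```

With this form, a set is kept only if its metric exceeds the smallest metric currently in a full list, which matches the insertion rule described above.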
III. IMPROVED SUCCESSIVE CANCELLATION FLIP DECODING
In this section, we propose a path metric aided bit-flipping decoding algorithm that improves the ability of the SCF decoder to correct multiple erroneous decisions. A corresponding pipeline architecture is designed to reduce the decoding latency.
A. Path Metric Aided SCF Decoding Algorithm
Comparing the bit-flipping decoding algorithms proposed in the recent literature, we find that for many wrongly estimated codewords the first error is in the initial bit-flipping list. Nevertheless, the correct codeword is not obtained in the end, either because the channel noise causes more than one error in the codeword or because the decoding algorithm cannot find all of the errors within the limited number of attempts. Based on this observation, we run simulations to evaluate the correct ratio when the first error bit is in the initial bit-flipping list built with the bit-flipping metric proposed in [12].
From Fig. 1, we can observe that the correct ratio is unsatisfactory in the low E_b/N_0 regime and that the performance of DSCF decoding depends on the rank of the first error in the initial bit-flipping list. A bit-flipping set at the top of the list is more likely to have more than one bit-flipping position tried within the limited number of decoding attempts. When the bit-flipping set containing the correct flipping bits lies at the bottom of the list, it may be removed from the list in a subsequent list update, or it may not get enough attempts to find all of the errors before the predetermined T is reached.
In the above simulations, we also found that if the first error position is in the initial bit-flipping list, the corresponding codeword almost always has the smallest LLR-based path metric [14]. As shown in Fig. 1, the set containing the correct flipping bit ranks at the top of the list sorted by LLR-based path metric. Hence, we introduce the LLR-based path metric as feedback for generating the bit-flipping list.
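As an illustration of the quantity used as feedback, the following sketch computes an LLR-based path metric for one decoding attempt. It assumes the usual definition from LLR-based list decoding, PM = Σ_i ln(1 + exp(−(1 − 2û_i) L_i)), together with its common approximation that adds |L_i| whenever the chosen bit disagrees with the sign of its LLR. The function name and the `exact` flag are hypothetical.

```python
import math
from typing import Sequence


def path_metric(u_hat: Sequence[int], llrs: Sequence[float], exact: bool = False) -> float:
    """LLR-based path metric of one decoding attempt (smaller means more likely)."""
    pm = 0.0
    for bit, llr in zip(u_hat, llrs):
        if exact:
            z = -(1 - 2 * bit) * llr
            # Numerically stable ln(1 + exp(z)).
            pm += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        else:
            hard = 0 if llr >= 0 else 1
            if bit != hard:  # penalize decisions that go against the LLR
                pm += abs(llr)
    return pm
```

Under the approximation, a plain SC pass accumulates a penalty only at frozen positions whose LLR disagrees with the frozen value, while each flipped information bit adds the magnitude of the LLR it overrides.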
The procedure of our Path Metric Aided SCF (PMA-SCF) decoding algorithm is as follows. After the first SC decoding pass, the initial bit-flipping list is built based on the bit-flipping metric M_α, and the bit-flipping sets are attempted one by one until the CRC matches or all bit-flipping sets have been attempted. During these decoding attempts, the LLR-based path metric of each attempt is calculated, and a new bit-flipping list of order 2 is built by extending the bit-flipping set of each attempt. This list is sorted first by the path metric and then by the M_α value. That is, the path metric determines whether a set has priority as a starting point for correcting multiple errors, while the M_α value determines whether a bit should be added to the bit-flipping set. A new round of decoding attempts is then launched according to this new list. The details of this decoding algorithm are described in Algorithms 1, 2 and 3.
Algorithm 1 Path Metric Aided SCF Decoding Algorithm
procedure PMA-SCF(y_1^N, T, I)
  ...
  if T > 1 and CRC(û_1^N) = failure then
    ...
    while L ≠ ∅ do
      ...

  T ← sizeof(L)
  for j ← 1 to T do
    ...

In Algorithm 2, P_m denotes the path metric of the decoding pass whose bit-flipping set the current decoding attempt extends. The function Sort first sorts the lists L_j by their path metrics PM_j, and then sorts the bit-flipping sets E within each list L_j by M_α(E). Together, the sorted lists L_j constitute the new bit-flipping list L.
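The two-level sort described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm listing: a smaller path metric is assumed to rank first, a larger M_α is assumed to rank first within one parent, each parent set is extended only by positions after its last flipped bit, and the names `Candidate`, `extend` and `sort_flip_list` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    flip_set: Tuple[int, ...]  # positions to flip in the next attempt
    parent_pm: float           # path metric of the attempt this set extends
    m_alpha: float             # bit-flipping metric of the extended set


def extend(
    parent: Tuple[int, ...],
    parent_pm: float,
    metrics: Dict[int, float],
    per_parent: int,
) -> List[Candidate]:
    """Extend one attempted flip set by the best later positions (largest metric)."""
    later = sorted(
        ((m, j) for j, m in metrics.items() if j > parent[-1]), reverse=True
    )
    return [Candidate(parent + (j,), parent_pm, m) for m, j in later[:per_parent]]


def sort_flip_list(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by parent path metric (ascending), then by M_alpha (descending)."""
    return sorted(candidates, key=lambda c: (c.parent_pm, -c.m_alpha))
```

For example, if the attempt that flipped bit 5 ended with a smaller path metric than the attempt that flipped bit 3, every order-2 set extending {5} is tried before any set extending {3}, regardless of the individual M_α values.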
B. Pipeline Architecture for Bit-flipping Decoding
Using the path metric as feedback inevitably increases the decoding latency if the SC decoding architecture is kept unchanged. To reduce the decoding latency, we design a pipeline architecture that decodes different attempts in parallel. Unlike the parallel architecture adopted by the SCL decoder, our pipeline architecture does not need many processing elements to calculate the LLR data of different decoding attempts at the same time, since there is no data dependency between different attempts. As a result, the utilization of the processing elements in our pipeline is much higher than that of the SCL decoder.
The pipeline decoding is realized by separating the command stream from its corresponding data. Since the different decoding attempts follow the same decoding schedule, a single set of commands with several pointers suffices to control the different decoding attempts. As shown in Fig. 2, there are four pointers, each indicating the current decoding stage of its corresponding decoding attempt. Each command in the command stream contains the decoding stage information and the type of decoding function (f or g).
According to the pointers, the commands are fetched into the command FIFO, while the corresponding data of the different attempts are fetched into the data FIFO. The processing elements, controlled by the current commands, then process the data in the data FIFO. Each processing element can execute the f or g calculation and the hard decision function h, according to the stage and function type. Since the data buffered in the data FIFO may not be processed completely in one calculation cycle, the remaining data stay in the FIFO. Meanwhile, new data are pushed into the FIFO once the data of the former command have been processed, which leads to a high peak-to-average ratio of the data FIFO usage. To reduce the size and the peak-to-average ratio of the data FIFO, launch intervals are inserted between sequential decoding attempts so that large blocks of data are not pushed into the data FIFO at the same time.
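The effect of launch intervals on the data FIFO can be illustrated with a toy model. The model below is a deliberate simplification and not the architecture of Fig. 2: it assumes that every attempt advances its pointer by one command per cycle, that a command at stage l pushes 2^l LLR values, and that the load of a cycle is the sum over all active attempts. The function names are hypothetical.

```python
from typing import List, Tuple


def sc_commands(n: int) -> List[Tuple[int, str]]:
    """Shared SC command stream for N = 2^n: a list of (stage, function) pairs."""
    commands: List[Tuple[int, str]] = []

    def visit(stage: int) -> None:
        if stage < 0:
            return
        commands.append((stage, "f"))  # compute LLRs for the left sub-block
        visit(stage - 1)
        commands.append((stage, "g"))  # compute LLRs for the right sub-block
        visit(stage - 1)

    visit(n - 1)
    return commands


def fifo_load(n: int, attempts: int, interval: int) -> List[int]:
    """Per-cycle number of LLR values pushed when attempts start `interval` cycles apart."""
    commands = sc_commands(n)
    load = [0] * (len(commands) + interval * (attempts - 1))
    for attempt in range(attempts):
        start = attempt * interval  # launch time of this attempt
        for pointer, (stage, _function) in enumerate(commands):
            load[start + pointer] += 1 << stage
    return load
```

In this model, four attempts of a length-8 code launched simultaneously all issue their largest commands in the same cycle, whereas a launch interval of one cycle spreads the same total amount of data over more cycles and lowers the peak.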
The calculation results of the processing elements are sent to the corresponding internal LLR memory, while the hard decision results are sent to the path memory. Based on the hard decisions, the partial sums are calculated and then fetched into the partial sum FIFO. An insertion sorter is adopted to calculate and sort the bit-flipping metric of each bit. Meanwhile, the path metrics and the CRC are computed on-the-fly from the hard decisions. Based on the bit-flipping metrics and the path metrics, the flip list is generated.
In addition, by applying the latency saving technique proposed in [15], the decoding latency can be further reduced, since the different decoding attempts have the same starting point in the decoding command stream. Due to page limitations, the trade-off among decoding latency, utilization of processing elements and memory requirements is omitted here. A more detailed presentation of this architecture will be given in the full version of this paper.
IV. SIMULATION RESULTS
In this section, the frame error rate (FER) performance and the decoding latency of the proposed PMA-SCF decoding algorithm are evaluated via Monte-Carlo simulations. The simulations are based on the AFF3CT [16] software, which we extended with our decoding algorithm. Specifically, the transmissions use binary phase-shift keying (BPSK) modulation over an additive white Gaussian noise (AWGN) channel. All polar codes are constructed targeting an E_b/N_0 of 3.0 dB. All CRC-aided polar codes are concatenated with a 16-bit CRC with generator polynomial g(D) = D^16 + D^12 + D^5 + 1. Accordingly, the coding rate of these polar codes is R = (K + 16)/N.
In Fig. 3, we compare the performance of the PMA-SCF decoder for polar codes with different blocklengths and rates. One can observe that polar codes with a low code rate perform much better than those with a high code rate, especially in the low E_b/N_0 regime. The performance gap between different code rates narrows in the high E_b/N_0 regime, which demonstrates the effectiveness of the proposed decoding algorithm in correcting erroneous decisions. However, the performance gap between the R = 1/2 curve and the R = 2/3 curve does not narrow much, since a codeword of code rate R = 2/3 contains too many erroneous bits, which is beyond the correction capability of the PMA-SCF decoder. In addition, the error-correction performance improves as the blocklength increases. It can also be observed that the gaps between different blocklengths narrow quickly in the high E_b/N_0 regime, since more wrongly estimated codewords can be corrected by PMA-SCF decoding.
Fig. 4 depicts the FER performance of our proposed decoding algorithm for the polar code PC(1024-528) against other bit-flipping decoding algorithms, where PSCF-P2 denotes the partitioned SCF decoding [8] with P = 2 partitions. To keep the same code rate, the concatenated CRC for PSCF-P2 is an 8-bit CRC, and that for PSCF-P4 is a 4-bit CRC.
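The simulation setup described above (16-bit CRC, BPSK over AWGN) can be sketched in a few lines. This is not the AFF3CT code used for the reported results, only a minimal stand-alone illustration. It assumes a zero initial CRC register, the BPSK mapping 0 → +1 and 1 → −1 with unit symbol energy, and the noise variance σ² = 1/(2 R E_b/N_0); all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

CRC16_POLY = 0x1021  # g(D) = D^16 + D^12 + D^5 + 1, leading term implicit


def poly_remainder(bits: Sequence[int]) -> int:
    """Remainder of the bit sequence (MSB first) modulo g(D), as a 16-bit integer."""
    reg = 0
    for bit in bits:
        overflow = (reg >> 15) & 1
        reg = ((reg << 1) & 0xFFFF) | bit
        if overflow:
            reg ^= CRC16_POLY
    return reg


def crc16_append(message: Sequence[int]) -> List[int]:
    """Append 16 CRC bits so that the whole word is divisible by g(D)."""
    rem = poly_remainder(list(message) + [0] * 16)
    return list(message) + [(rem >> (15 - k)) & 1 for k in range(16)]


def crc_ok(word: Sequence[int]) -> bool:
    """A word passes the check when its remainder modulo g(D) is zero."""
    return poly_remainder(word) == 0


def awgn_llrs(
    codeword: Sequence[int], ebn0_db: float, rate: float, rng: random.Random
) -> List[float]:
    """Channel LLRs for BPSK over AWGN at the given E_b/N_0 (in dB) and code rate."""
    sigma2 = 1.0 / (2.0 * rate * 10.0 ** (ebn0_db / 10.0))
    sigma = math.sqrt(sigma2)
    llrs = []
    for bit in codeword:
        y = (1.0 - 2.0 * bit) + rng.gauss(0.0, sigma)  # received sample
        llrs.append(2.0 * y / sigma2)                  # LLR = ln P(y|0) / P(y|1)
    return llrs
```

Because g(D) has a nonzero constant term, any single-bit error changes the remainder, so `crc_ok` rejects every word that differs from a valid one in exactly one position.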
In Fig.4 , all incarnations of SCF decoding have the same predetermined maximum attempts T = 10. The FER performance of oracleassisted SC-oracle decoder (SCO-O1) and SC decoder are used as the baseline for comparison, where SCO-Ok means that it can always correct the first k erroneous decisions met by SC decoder, but no more errors can be corrected. It can be observed that our PMA-SCF decoding algorithm performs slightly better than SCO-O1 decoder in almost all cases, since its ability to correct higher-order errors. However, the performance gap between PMA-SCF and SCO-O1 curves narrows at high E b /N 0 regime and SCO-O1 outperforms PMA-SCF at E b /N 0 = 3.5dB. This is due to the fact that the number of multiple errors in one codeword decreases as the E b /N 0 increases and PMA-SCF could not accurately identify all first errors. Besides, one can observe that our proposed decoding algorithm outperforms the SCF decoding 0.25dB at FER of 10 −4 with the same T value and performs better than DSCF decoding at all E b /N 0 conditions. In order to evaluate the performance of PMA-SCF decoder with different predetermined maximum attempts T , they are compared with CRC-aided SCL decoding algorithm with list size L ∈ {2, 4, 8} and 16-CRC for PC(1024-528). As shown in Fig.5 , the FER performance of oracle-assisted SC-oracle decoder with order 1 (SCO-O1) and order 2 (SCO-O2) are used as the baseline for comparison. One can observe that the FER performance of PMA-SCF decoding algorithm with different predetermined maximum attempts are all between the SCO-O1 and SCO-2 curves, while only the CRC aided SCL decoding with list size L = 2 (CA-SCL-L2) is worse than SCO-O1 and the CA-SCL-L8 is better than SCO-O2. The performance of PMA-SCF decoding with T = 20 is similar to that of CA-SCL-L4 at almost all cases. Considering their decoding latency shown in Fig.6 and implementation complexity, the PMA-SCF decoding is more efficient than CA- SCL-L4 to get equivalent FER. 
In Fig.6 , the decoding latency of our proposed decoding algorithm is evaluated, with respect to that of SCF, PSCF and DSCF. In this comparison, we use the clock cycles as the measurement, instead of the average number of attempts, since the adopting of pipeline architecture. The decoding latency of the SC decoder and the SCL decoder measured as that does in [15] are portrayed as the reference line. The average decoding latency at each E b /N 0 point is obtained by simulating 1 × 10 8 frames. It can be observed that the decoding latency of SCF is the highest among all SCF based decoders at low E b /N 0 regime, since the absolute value of LLR is not efficient to be used as the bit-flipping metric. Compared with SCF decoding, the PSCF decoding has much lower decoding latency, since it may stop decoding when a partition fails before T iterations. Our proposed PMA-SCF is about 24% above that of PSCF with P = 4 at the worst E b /N 0 = 1dB, while it is up to 1.8× faster than that of DSCF. It can also be observed that the decoding latency of our algorithm degrades quickly as the E b /N 0 increases, and approaches that of SCL at moderate E b /N 0 . At higher E b /N 0 , all SCF based decoding converge to the decoding latency of SC algorithm.
V. CONCLUSION
In this paper, we propose the PMA-SCF decoding algorithm, which generates the bit-flipping list according to both the bit-flipping metric and the path metric and thereby provides an effective starting point for correcting more erroneous decisions. A corresponding pipeline architecture is designed to reduce the decoding latency. We show that the average latency is much lower than that of current bit-flipping decoding methods, at the cost of increased memory. The simulation results show that our decoding algorithm provides a performance improvement of up to 0.25 dB at an FER of 10^{-4} compared with SCF decoding, while decoding up to 1.8× faster than DSCF decoding at E_b/N_0 = 1 dB.
