Abstract: In theory, Polar codes do not exhibit an error floor under successive-cancellation (SC) decoding. In practice, a frame error rate (FER) down to $10^{-12}$ has not been reported with real SC list (SCL) decoder hardware. This paper presents an asymmetric adaptive SCL (A2SCL) decoder, implemented in real hardware, for high-throughput and ultra-reliable communications. We propose to concatenate multiple SC decoders with an SCL decoder, in which the numbers of SC and SCL decoders are balanced with respect to their area and latency. In addition, a novel unequal-quantization technique is adopted. These two optimizations are crucial for improving SCL throughput within limited chip area. As an application, we build a link-level FPGA emulation platform to measure ultra-low FERs of 3GPP NR Polar codes (with parity-check and CRC bits). The platform supports all list sizes up to 8, code lengths up to 1024, and arbitrary code rates. With the proposed hardware, decoding is about 7000 times faster than on a single CPU core. For the first time, an FER as low as $10^{-12}$ is measured and the quantization effect is analyzed.
I. INTRODUCTION
Polar codes, proposed by Arikan [1], have been selected by the 5G standards. Polar codes with successive-cancellation (SC) decoding theoretically achieve channel capacity in the asymptotic sense. To improve error-correction performance at short or moderate lengths, SC list (SCL) decoding was proposed, which keeps L codeword candidates. When the code is concatenated with cyclic redundancy check (CRC) [2] or parity-check (PC) [3] bits, the error-correction performance can be further improved.
One advantage of Polar codes is that they do not exhibit an error floor when decoded by the SC and SCL algorithms. This makes Polar codes suitable for applications with stringent error-performance requirements. For some industrial and medical applications, the FER is required to be smaller than $10^{-10}$. However, an efficient hardware solution designed for this purpose has not been reported yet.
Achieving this goal efficiently is not easy, because both decoding latency and throughput must be highly optimized within limited chip area. To the best of our knowledge, an FER below $10^{-10}$ has not been reported from real hardware. Although many efforts have been made to optimize the decoder hardware of Polar codes [4]-[11], the lowest FER reported from real hardware is ≈ $10^{-6}$, which does not fulfill the < $10^{-10}$ requirement. An FPGA emulation platform has been designed for ultra-reliable communications [12], but no hardware-measured FER results are presented for it.
A longer version is available at "arxiv.org/pdf/1904.02327" [15] .
A. Motivation and Contribution
To achieve ultra-reliable and high-throughput decoding, we adopt the adaptive SCL decoder framework in [13]. To further improve throughput, we propose an asymmetric adaptive SCL (A2SCL) decoder, based on the observation that SC and SCL decoders differ greatly in area, latency, and required quantization precision. A2SCL mainly adopts the following two techniques: 1) Asymmetric deployment: the ratio of SC decoders to SCL decoders is no longer 1:1 as in the original design, but is carefully chosen to reflect their significant difference in area and latency. 2) Asymmetric quantization: the different demands for data precision between SC and SCL decoders are exploited to pack as many SC decoders as possible for parallel decoding, without FER loss.
In addition, we provide a reference design through an efficient emulation platform in an FPGA, and evaluate the ultra-low FER performance of Polar codes to demonstrate its practical value. The proposed A2SCL decoder not only achieves FER ≈ $10^{-12}$, but also supports list sizes 1, 2, 4, 8 with maximum code length $N_{max} = 2^{10}$. The emulation platform has the following features:
• Integrity: All modules in the link-level emulation such as source vector generator, encoder, modulator, AWGN channel and decoder are executed in the FPGA. The server is only responsible for the code lengths/rate configuration and results collection.
• Efficiency: The emulation platform dramatically improves evaluation speed. One FPGA board can be up to 7000 times faster than a CPU core.
• Flexibility: The emulation platform supports CA-Polar (up to 24 CRC bits), PC-Polar [3] (as specified by 3GPP), various rate-matching schemes, list sizes, code lengths and code rates. All these can be configured by the server on the fly.
• Scalability: A server can manage one or more FPGAs to speed up the emulation. Servers can also form a cluster to further speed up the emulation.
With the emulation platform, ultra-low FER performance of Polar codes is measured and the error-correction performance of 3GPP NR Polar codes is evaluated.
II. POLAR CODES
An (N, K) Polar code has N coded bits and K information bits. The code rate is R = K/N. The information bits are assigned to the K most reliable sub-channels, and frozen bits, typically zeros, are assigned to the remaining ones. The encoding of a Polar code is $c = uF^{\otimes n}$, where $u$ is the information vector (including information and frozen bits), $F^{\otimes n} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes n}$ is the transformation matrix, $\otimes n$ denotes the $n$-th Kronecker power, and $n = \log_2 N$.
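As a minimal sketch of the encoding rule $c = uF^{\otimes n}$ with the standard kernel F = [[1, 0], [1, 1]], the Kronecker power can be applied as $\log_2 N$ stages of XOR butterflies instead of an explicit matrix product. The function name `polar_encode` is hypothetical, and the sketch omits the sub-channel selection, CRC/PC attachment and rate matching that a 3GPP NR encoder would add.

```python
def polar_encode(u):
    """Return c = u * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]].

    u is a list of 0/1 values (information and frozen bits already placed).
    """
    n_bits = len(u)
    assert n_bits > 0 and n_bits & (n_bits - 1) == 0, "length must be a power of two"
    c = list(u)
    half = 1
    # Each pass applies one Kronecker factor: the upper half of every block
    # is XORed with the lower half, the lower half is passed through.
    while half < n_bits:
        for start in range(0, n_bits, 2 * half):
            for j in range(start, start + half):
                c[j] ^= c[j + half]
        half *= 2
    return c
```

For N = 4, the input `[0, 0, 0, 1]` selects the last row of $F^{\otimes 2}$ and encodes to `[1, 1, 1, 1]`. Because $F^{\otimes n}$ is its own inverse over GF(2), encoding twice returns the original vector.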
A. SC-based Decoders
An SC decoder can be represented by a factor graph, in which soft bits propagate from right to left and hard bits propagate from left to right. The information vector u is decoded sequentially from top to bottom. A hardware-friendly version of the soft-value update is carried out in the log-likelihood ratio (LLR) domain [8]. Two incoming LLRs ($L_{in1}$ and $L_{in2}$) are combined to produce $L_{out}$ with the following f-function
$$L_{out} = \mathrm{sign}(L_{in1})\,\mathrm{sign}(L_{in2})\,\min(|L_{in1}|, |L_{in2}|)$$
or g-function
$$L_{out} = (1 - 2\hat{s})\,L_{in1} + L_{in2},$$
where $\hat{s}$ is the modulo-2 sum of previously decoded bits and is called the partial sum (PS).
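The two LLR update rules can be sketched as follows, assuming the common min-sum form of the f-function and the sign-controlled sum for the g-function used by LLR-domain decoders. The function and argument names are illustrative, and floating-point values are used here rather than the quantized LLRs of real hardware.

```python
def f_function(l_in1, l_in2):
    """Min-sum f-update: sign product times the smaller magnitude."""
    sign = -1.0 if (l_in1 < 0) != (l_in2 < 0) else 1.0
    return sign * min(abs(l_in1), abs(l_in2))


def g_function(l_in1, l_in2, partial_sum):
    """g-update: partial_sum (0 or 1) decides whether l_in1 is added or subtracted."""
    return (1 - 2 * partial_sum) * l_in1 + l_in2
```

For example, `f_function(3.0, -5.0)` gives `-3.0`, while `g_function(3.0, -5.0, 0)` gives `-2.0` and `g_function(3.0, -5.0, 1)` gives `-8.0`.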
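For an SCL decoder, the decoding process is similar to that of an SC decoder, except that it keeps L paths. When making a hard decision for each bit, the L paths split into 2L paths, and the L paths with the smallest path metrics (PMs) are kept. For the l-th path and bit $u_i$, the LLR at stage 0 is denoted by $L^l_{0,i}$ and its hard decision by $\beta^l_i$. The PMs are updated according to
$$PM^l_i = \begin{cases} PM^l_{i-1}, & \text{if } \beta^l_i = \frac{1}{2}\left(1 - \mathrm{sign}(L^l_{0,i})\right), \\ PM^l_{i-1} + |L^l_{0,i}|, & \text{otherwise.} \end{cases}$$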
After all bits are decoded, the path with the smallest PM is selected as the decoding output. For CRC-aided SCL (CA-SCL), the most reliable path that passes the CRC check is selected as the decoding output. For parity-check SCL (PC-SCL), each parity bit is decided by its parity function rather than by its LLR. A PC-CA-SCL decoder combines the features of both when both CRC bits and PC bits are employed. Throughout this work, we implement CA-SCL and PC-CA-SCL decoders.
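The split-and-prune step of list decoding can be sketched as below. This assumes the widely used hardware-friendly penalty rule, in which a path pays $|L|$ whenever its bit disagrees with the sign of its LLR; the helper names and the `(pm, bits)` tuple layout are hypothetical, and frozen bits, CRC and PC handling are left out.

```python
def update_path_metric(pm, llr, bit):
    """Add |llr| to the metric only if `bit` contradicts the LLR's hard decision."""
    hard_decision = 0 if llr >= 0 else 1
    return pm if bit == hard_decision else pm + abs(llr)


def split_and_prune(paths, llrs, list_size):
    """Extend every path with bit 0 and bit 1, then keep the best `list_size`.

    paths: list of (pm, bits) tuples; llrs: one stage-0 LLR per path.
    """
    candidates = []
    for (pm, bits), llr in zip(paths, llrs):
        for bit in (0, 1):
            candidates.append((update_path_metric(pm, llr, bit), bits + [bit]))
    candidates.sort(key=lambda cand: cand[0])  # smaller metric means more reliable
    return candidates[:list_size]
```

Starting from a single empty path with LLR 2.5 and L = 2, the two survivors are `(0.0, [0])` and `(2.5, [1])`. Real decoders typically replace the full sort with a partial sorting network.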
III. ASYMMETRIC ADAPTIVE SCL (A2SCL) DECODER
The original adaptive SCL decoder [13] progressively increases the list size until a packet is successfully decoded or a maximum list size $L_{max}$ is reached. Our implementation is built upon a simplified version of [13] that has only two decoders, i.e., an SC decoder and an SCL decoder with a given list size. The algorithm is described in Algorithm 1.
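A minimal software sketch of this two-decoder flow is given below, assuming that a CRC check decides whether the SC result is accepted. The callables `sc_decode`, `scl_decode` and `crc_ok` are placeholders supplied by the caller, not interfaces defined in this paper.

```python
def adaptive_decode(llrs, sc_decode, scl_decode, crc_ok, list_size=8):
    """Two-stage adaptive decoding: SC first, SCL only if the SC result fails CRC.

    Returns the decoded bits and a label naming the decoder that produced them.
    """
    candidate = sc_decode(llrs)
    if crc_ok(candidate):
        return candidate, "SC"
    # Fall back to the stronger but slower list decoder.
    return scl_decode(llrs, list_size), "SCL"
```

In software the fallback is a simple branch. In hardware the two stages run on separate decoder instances, which is why their relative area, latency and workload matter.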
Although the software implementation of Algorithm 1 is rather straightforward, its hardware implementation is not. One has to take into account the large differences between an SC decoder and an SCL decoder in terms of hardware area and latency, as shown in Table I.

Algorithm 1 Simplified Adaptive SCL Decoder:
(1) Try to decode the incoming packet using SC.

The (normalized) measurements in Table I are based on our reference ASIC implementations in [11], with both the SC and SCL decoders optimized for efficiency (see details in [11]). According to the measurements, both the area and the latency of an SCL decoder (L = 8) are up to 6 times those of an SC decoder with the same quantization and code rate. If we implemented many SCL decoders with different list sizes, both the area efficiency and the time efficiency would be very low.
The workloads of an SC decoder and an SCL decoder are compared through a case study of (N = 1024, K = 512, 24 CRC bits) Polar codes. The SNR required for CA-SCL with L = 8 to achieve ultra-reliable communications (FER ≤ $10^{-8}$) is around 3.5 dB. In such a high-SNR region, an SC decoder already exhibits a very small FER (∼ $10^{-4}$), i.e., it loses only one or two packets in 10,000. This means that, while the SC decoder needs to process all packets, only a small fraction of the packets needs to be processed by the SCL decoder. This is a large difference in workload.
Considering the above, a direct implementation of [13] would incur very low hardware utilization. To address this, we propose an asymmetric adaptive SCL (A2SCL) decoder.
A. Asymmetric deployment
To increase throughput, an A2SCL decoder deploys as many SC decoders as possible. To improve efficiency, A2SCL implements only one SCL decoder (e.g., $L_{max} = 8$), instead of many SCL decoders with different list sizes (e.g., L = 2, 4, 8).
A scheduler with a MUX is used to collect the CRC-failed packets from the SC decoders and send them to the SCL decoder. Fig. 1 shows the hardware architecture of the A2SCL decoder. We refer to the unequal numbers of SC decoders and SCL decoders as "asymmetric deployment". The SC and SCL decoder cores adopt some state-of-the-art optimizations for SC and SCL decoders [11]. Both decoders store intermediate LLRs only for every two neighboring stages in the factor graph. The "double-packet mode" and "decoded-bit recovery" features [11] are enabled to reduce the number of LUT/BRAM/FF modules. The hardware-friendly "syndrome-check" [14] and "decision-aided" [9] approaches are adopted to increase the throughput of the SC and SCL decoders, respectively.
Assume the target is FER < $10^{-9}$. In almost all cases, the SC decoder's FER is below $10^{-3}$ at the target SNR, so fewer than one in 1000 packets is forwarded to the SCL decoder. According to simulation results and real hardware test results [11], the throughput ratio of the SC decoder to the SCL decoder is 5:1, i.e., the SCL decoder processes one packet in the time an SC decoder processes five. Thus, one SCL decoder can process the failed packets of up to 1000/5 = 200 SC decoders at FER < $10^{-9}$. The LLR buffer of the SCL decoder should be larger than those of the SC decoders, in case several SC decoders produce failed packets at the same time. The following formula evaluates the probability that the SC decoders fail e packets during one SCL decoding:
$$P(e) = \binom{c}{e}\,\mathrm{FER}_{SC}^{\,e}\,(1 - \mathrm{FER}_{SC})^{c-e},$$
where $\mathrm{FER}_{SC}$ is the FER of the SC decoders, $c = N_{SC} \times 2 \times T_{SCL}/T_{SC}$ is the total number of packets processed by the SC decoders during one SCL decoding, $N_{SC}$ is the number of SC decoders in the A2SCL decoder, and $T_{SCL}$ and $T_{SC}$ are the decoding times of the SCL and SC decoders, respectively.
In our final design, $N_{SC} = 18$ SC decoders are implemented in the A2SCL decoder. As mentioned above, $T_{SCL}/T_{SC} = 5$. Assuming the SC decoders work at FER ≈ $10^{-3}$, Table II shows the probabilities as the number of SC-failed packets e increases from 0 to 4. According to the table, the probability that e < 3 is 99.9%. Thus, we set the LLR buffer size of the SCL decoder to 2048 (two packets at maximum), although larger sizes are also allowed.
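The buffer-sizing argument can be checked numerically. The sketch below assumes that packet failures are independent, so that the number of SC-failed packets e follows a binomial distribution over the c packets handled during one SCL decoding; the function name `sc_failure_pmf` is hypothetical.

```python
from math import comb


def sc_failure_pmf(num_sc, t_ratio, fer_sc, max_e):
    """Probability of e SC failures, e = 0..max_e, during one SCL decoding.

    c = num_sc * 2 * t_ratio packets are processed by the SC decoders in that time.
    """
    c = num_sc * 2 * t_ratio
    return [comb(c, e) * fer_sc**e * (1 - fer_sc) ** (c - e) for e in range(max_e + 1)]


# Design point from the text: 18 SC decoders, T_SCL / T_SC = 5, SC FER = 1e-3.
pmf = sc_failure_pmf(num_sc=18, t_ratio=5, fer_sc=1e-3, max_e=4)
prob_at_most_two = sum(pmf[:3])  # probability that e < 3
```

Under these assumptions `prob_at_most_two` evaluates to roughly 0.999, in line with the 99.9% figure quoted for a buffer holding two packets.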
B. Asymmetric quantization
In real hardware, all LLRs are quantized. More quantization bits improve decoding performance but require extra hardware resources.
First, an SC decoder should be as fast as possible. For an A2SCL decoder, the SC decoding performance can be relaxed to some extent, because the SCL decoder takes care of the failed packets. Typically, longer codes require more quantization bits than shorter ones. As shown in Fig. 2, the FER curves of N = 1024 Polar codes with code rates K/N = [1/8, 7/8] and quantization bits = [6, 8, 12] are almost the same under SC decoding. Accordingly, 6-bit or 8-bit quantization is sufficient for SC decoders.
Second, the SCL decoder should yield almost the same performance as a floating-point decoder. We plot the FER curves of N = 1024 Polar codes with code rates K/N = [1/8, 7/8] under SCL decoding (L = 8) as a reference to show the influence of different numbers of quantization bits. As shown in Fig. 3, 8-bit quantization incurs a loss of at most 0.1 dB, and 12-bit quantization yields the same performance as a floating-point decoder. Accordingly, we adopt 12-bit quantization for the SCL decoder.
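To illustrate what a change in bit width means for an LLR, the sketch below uses a uniform, symmetric, saturating quantizer. The step size and the saturation rule are assumptions made for illustration; the quantizer used in the actual decoders is not specified here.

```python
def quantize_llr(llr, num_bits, step=1.0):
    """Map a real LLR to a signed integer level with symmetric saturation.

    Levels lie in [-(2**(num_bits-1) - 1), 2**(num_bits-1) - 1].
    """
    max_level = 2 ** (num_bits - 1) - 1
    level = round(llr / step)  # Python rounds exact halves to the nearest even integer
    return max(-max_level, min(max_level, level))
```

With `step=0.5`, an LLR of 100.0 saturates at level 31 for 6 bits and 127 for 8 bits, but is represented without clipping (level 200) with 12 bits. The wider format therefore preserves large magnitudes, such as accumulated path metrics in list decoding, that narrower formats clip.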
IV. EMULATION PLATFORM
An overview of our platform is shown in Fig. 4. A server can manage one or more FPGA boards via its PCI-E slots. When multiple FPGA boards (limited by the number of PCI-E slots) are employed, the decoding throughput can be further increased. Each FPGA board integrates a Xilinx xc7vx690t. The server is the controller of the platform. The code construction (information sub-channel positions) can be configured by the server to evaluate different code constructions. In addition, code and channel parameters are also configured at the server. According to these configurations, frames are generated, encoded, passed through the AWGN channel, and decoded in the A2SCL decoder. The numbers of decoded frames and frame errors are counted in the FPGA and collected by the server. Finally, the FER curve is displayed on the server.
A. Run time balancing
In the link-level emulation, different modules need different run times to process one packet. Balancing the run times among the modules improves the overall operating efficiency. Table III shows the running cycles required by each module in the link-level emulation platform for different (N, K) cases. The running cycles of the encoder and AWGN modules depend on the code length N, whereas those of the SC/SCL decoders depend on both N and K. According to the numbers of running cycles, we integrate the same number of encoder and AWGN modules, and set the ratio of encoders to SC decoders to 1:2.
Thanks to the asymmetric architecture, we can integrate more modules within the limited FPGA resources. Our FPGA chip integrates 9 encoders, 9 AWGN channel modules, and one A2SCL decoder, which includes 18 SC decoders with 8-bit quantization and one SCL decoder with 12-bit quantization. The resource utilization of each module is shown in Table IV.
B. Hardware vs software implementations
To justify the A2SCL hardware platform, its simulation speed and FER performance are compared with those of a software counterpart. The hardware platform uses only one FPGA board. The software implementation is written in C and runs on a server that contains 4 Intel Xeon(R) E5-4627 v2 @ 3.30 GHz CPUs with 12 cores and 256 GB RAM. The speed ratio (SR) is defined as the emulation time of 12 CPU cores divided by that of one FPGA board; it is plotted in Fig. 5. The highest SR is 611, which means that one FPGA board is 611 times faster than 12 CPU cores. Converted to a single CPU core, one FPGA board is 611 × 12 = 7332 times faster. As shown, the emulation platform greatly reduces emulation time.
2) FER performance: Based on 5G Polar codes with lengths N = 1024 and N = 800, we compare the floating-point results from software with the fixed-point results from the hardware platform.
V. PERFORMANCE OF 5G POLAR CODES IN 5G eMBB
With the A2SCL platform, we can now efficiently evaluate the error-correction performance of 5G Polar codes at FERs below $10^{-11}$. The typical cases of downlink control information (DCI) are evaluated with QPSK modulation over an AWGN channel. For K = 64 and PDCCH aggregation levels [1, 2, 4, 8] (footnote 6), the measured FER results are shown in Fig. 7. We also measured K = [96, 128, 164] with PDCCH aggregation levels [1, 2, 4, 8]. Additional simulation results can be found in [15].
VI. CONCLUSION
In this paper, we present an asymmetric adaptive SCL decoder in real hardware. The decoder can provide much higher decoding throughput in a resource-limited FPGA/ASIC chip. The A2SCL algorithm, along with all the required link-level modules, is implemented in an FPGA platform. The platform is efficient, flexible and scalable. The emulation speed of one FPGA board is 7332 times that of one CPU core. An FER as low as $10^{-12}$ is measured for 5G Polar codes for the first time in real hardware.
6 For K = 64 and aggregation level 1, the rate matching is shortening; for aggregation levels 2 and 4, the rate matching is puncturing; for aggregation level 8, the rate matching is repetition.
