This paper proposes an architecture and a specific implementation for high-throughput polar coding based on a Majority Logic (MJL) aided Successive Cancellation (SC) decoding algorithm. The SC-MJL algorithm exploits the low complexity of SC decoding and the low latency of MJL decoding. To reduce the complexity of SC-MJL decoding, an adaptive quantization scheme is developed that represents the internal log-likelihood ratios (LLRs) with 1 to 5 bits. The bit allocation is based on maximizing the mutual information between the input and output LLRs of the quantizer. This scheme causes a negligible (< 0.1 dB) performance loss when the code block length is N = 1024 and the number of information bits is K = 854. The decoder is implemented in 45nm ASIC technology using a deeply-pipelined, unrolled hardware architecture with register balancing. The pipeline depth is kept at 40 clock cycles by merging consecutive decoding stages implemented as combinational logic. The ASIC synthesis results show that the SC-MJL decoder achieves 427 Gb/s throughput in 45nm technology. When the implementation results are scaled to the 7nm technology node, the throughput reaches 1 Tb/s with under 10 mm² chip area and 0.37 W power dissipation.
I. INTRODUCTION
It is foreseen that within the next decade there will be demand for forward error correction (FEC) codes operating at Terabit-per-second (Tb/s) data rates for certain beyond-5G applications [1]. The demand for higher data rates can be seen by looking at recent standardization activities. For wired connections, the IEEE 802.3ba Ethernet standard specifies 100 Gigabit-per-second (Gb/s) throughput over optical media [2]. In the wireless domain, the IEEE 802.15.3d standard ratified in 2017 defines a 100 Gb/s system using frequencies in the 252-322 Gigahertz (GHz) range [3]. The 2018 Ethernet Roadmap [4] foresees demand for Tb/s data rates for 2020 and beyond.
This paper studies the feasibility of achieving Tb/s data rates using polar codes. Part of the challenge of reaching Tb/s with polar codes is generic, common to all FEC schemes, and stems from limitations of VLSI technology. A second set of difficulties is specific to polar codes and arises from the inherently sequential nature of polar decoding. We investigate both aspects of the challenge and propose solutions. We begin by giving an overview of the problem.
A. VLSI technology challenges for Tb/s FEC
For several decades, FEC data rates could be increased by advances in VLSI technology, in accordance with the technology forecasts known as Moore's law and Dennard's scaling law [5]. Although transistor dimensions continue to shrink in accordance with Moore's law, transistor switching speeds (clock frequencies) can no longer keep increasing because of power density constraints [6]. With the clock frequency reaching practical limits at around 1-5 GHz, implementing Tb/s FEC schemes in VLSI requires highly parallel and deeply pipelined implementation architectures. This in turn moves implementation issues, such as chip area and power density, to the forefront as major design parameters, along with traditional measures of FEC performance such as coding gain or gap-to-capacity. The design and implementation of Tb/s FEC codes involve a complex tradeoff among a large set of parameters.
In the Tb/s regime, the I/O bottleneck and excessive memory usage emerge as two important generic problems. To see the scale of the I/O problem, consider as an example an FEC system with a coding rate R = K/N carrying K bits of information in code blocks of length N bits. Suppose the receiver front-end provides the decoder with soft information in the form of log-likelihood ratios (LLRs) at a rate of γ/R LLRs-per-second with a precision of Q bits-per-LLR, where γ is the throughput in b/s. Let $f_c$ be the clock frequency for the interface between the decoder and the receiver front-end and let P be the number of spatially parallel decoders connected to the front-end. The interconnect bus at this interface will then have to contain at least
$$W = \frac{\gamma Q}{R f_c}$$
wires, assuming that each wire in the bus carries a binary signal. For example, with γ = 1 Tb/s, $f_c$ = 1 GHz, Q = 3 bits, and R = 1/2, we have W = 6000. For the given W, a set of (N, P) values can be (512, 4), (1024, 2), (2048, 1). This example clearly shows the difficulty of increasing γ while $f_c$ is held fixed. In order to alleviate the I/O bottleneck, we consider in this paper a relatively high rate code with R = 5/6, and try to minimize Q by using a quantization scheme that is information-theoretically as efficient as possible, as suggested in [7].
In order to illustrate the memory problem mentioned above, suppose that the decoder in the preceding example is implemented in a deeply-pipelined fashion, using D pipeline stages, where D is the decoder latency measured in number of clock cycles. Thus, we are assuming that there are PD codewords inside the decoder at any moment, spread over the successive stages of decoding in an assembly-line fashion. The memory requirement for this architecture may be estimated as
$$M_{\mathrm{Req}} = N P D \bar{Q}$$
where $\bar{Q}$ is the average number of bits per LLR value inside the decoder. The product NPD emerges as a significant parameter for controlling $M_{\mathrm{Req}}$. The number of pipeline stages D is related to N in a manner that is specific to the code family and to the decoder type within that code family. For example, for the basic successive cancellation (SC) decoding method for polar codes, the smallest value of D is 2N − 2 (achieved by using a fully parallel implementation), making the product NPD quadratic in N. Such quadratic growth in $M_{\mathrm{Req}}$ as a function of N severely limits the length of codes that can be used, leading to inferior coding gains. In this paper we seek a remedy to this problem by introducing a hybrid decoding algorithm that has a lower latency D than the SC decoding algorithm. The hybrid algorithm combines SC decoding with Majority Logic (MJL) decoding, as discussed below. As a further measure to reduce $M_{\mathrm{Req}}$, we implement a variable-length quantization scheme inside the decoder so as to minimize $\bar{Q}$ for a given performance.
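The I/O and memory estimates above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the bus width is W = γQ/(R f_c) (this reproduces the W = 6000 figure quoted in the example) and the pipeline memory is the product N·P·D·Q̄ with D = 2N − 2 for plain SC decoding. The function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

def bus_width(throughput_bps, rate, q_bits, clock_hz):
    """Wires needed at the decoder input, assuming W = gamma * Q / (R * f_c)."""
    return Fraction(throughput_bps) * q_bits / (Fraction(rate) * clock_hz)

def sc_latency(n):
    """Smallest SC decoding latency in clock cycles quoted in the text: 2N - 2."""
    return 2 * n - 2

def pipeline_memory_bits(n, p, d, q_avg):
    """Assumed estimate M_Req = N * P * D * Q_avg (bits held in the pipeline)."""
    return n * p * d * q_avg

# Example from the text: 1 Tb/s, R = 1/2, Q = 3 bits, f_c = 1 GHz.
w = bus_width(10**12, Fraction(1, 2), 3, 10**9)
print("bus width:", w)

# The three (N, P) pairs from the text; memory grows with N even though N*P is fixed.
for n, p in [(512, 4), (1024, 2), (2048, 1)]:
    d = sc_latency(n)
    print(n, p, d, pipeline_memory_bits(n, p, d, 3))
```

With N·P held constant, the memory estimate still roughly doubles each time N doubles because D grows linearly in N, which is the quadratic growth the text refers to.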
B. Relation to previous work
Polar codes were introduced in [8]. Polar codes are closely related to Reed-Muller (RM) codes [9], [10]. Many existing decoding algorithms for polar codes were originally devised for RM codes [11], [12]. This is true for the two decoding algorithms of interest in this paper, namely, SC decoding and MJL decoding. In fact, MJL decoding was the original decoding method for RM codes [10]. The distinctive feature of MJL decoding is its inherently parallel nature. The SC decoding method provides better coding gain at the expense of being serial in nature (increased latency). In this paper we combine the best features of SC decoding and MJL decoding. We use a soft-decision version of MJL decoding [13], [14]. The implementation presented below takes advantage of specific techniques for speeding up the SC decoder. These include methods that recognize specific constituent codes of the given polar code and decode them quickly, as described in [15], [16], [17], [18], [19], and [20].
A hybrid SC-MJL decoder implementation for polar codes was reported in [21] . That design relied on using combinational logic and aimed to provide a flexible architecture that could operate at various different coding rates. Unlike [21] , here we focus on throughput only and use a fully unrolled and pipelined SC-MJL architecture to decode particular code segments faster. Similar to [16] and [17] , the repetition (REP) and single parity-check (SPC) code segments are decoded by MAP [22] and Wagner [23] decoders respectively.
The outline of this paper is as follows. Section II gives a short review of polar coding and introduces the SC-MJL decoding with adaptive quantization. Section III presents the unrolled SC-MJL decoder architecture. Section IV presents the communication performance and ASIC implementation results of the SC-MJL decoder. Finally, Section V summarizes the main results with a brief conclusion.
II. POLAR CODES AND SC-MJL DECODING
This section starts with a short review of polar codes. Then, in Section II-B, the proposed SC-MJL decoding algorithm is introduced. Finally, in Section II-C, the adaptive quantization scheme used in this paper is presented.
A. Review of polar codes
Polar codes are a class of linear codes. Here, we consider only polar codes over the binary field $\mathbb{F}_2$. For every n ≥ 1, there exists such a code with block length $N = 2^n$ and a transform matrix $G_N = G^{\otimes n}$, where $G^{\otimes n}$ is the $n$th Kronecker power of the kernel matrix $G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. In polar coding, the user data $d_1^K$ is first embedded in a transform input vector $u_1^N$ and the codeword is obtained as
$$x_1^N = u_1^N G_N.$$
We write $u_{\mathcal{A}}$ to denote the data-carrying part of $u_1^N$. The remaining part of $u_1^N$ is denoted $u_{\mathcal{A}^c}$ and is frozen to zero. We write $u_{\mathcal{A}} = d_1^K$ and $u_{\mathcal{A}^c} = 0$ to indicate the composition of the transform input $u_1^N$. For a description of the details of polar coding, we refer to [8].
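The encoding step described above can be sketched as follows, assuming the standard polar transform in which the codeword is the transform input multiplied by the n-th Kronecker power of the 2×2 kernel over the binary field. The information set used in the example is hypothetical and chosen only for illustration; the paper does not list its index set here.

```python
def kron(a, b):
    """Kronecker product of two 0/1 matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def transform_matrix(n):
    """n-th Kronecker power of the kernel G = [[1, 0], [1, 1]]."""
    g = [[1, 0], [1, 1]]
    m = [[1]]
    for _ in range(n):
        m = kron(g, m)
    return m

def polar_transform(u):
    """Compute u * G^(kron n) over F_2 with the recursive butterfly structure."""
    if len(u) == 1:
        return list(u)
    h = len(u) // 2
    ua, ub = u[:h], u[h:]
    return polar_transform([a ^ b for a, b in zip(ua, ub)]) + polar_transform(ub)

def embed(data, info_set, n_len):
    """Place data bits on the information coordinates; frozen coordinates stay 0."""
    u = [0] * n_len
    for bit, idx in zip(data, sorted(info_set)):
        u[idx] = bit
    return u

# Hypothetical (8, 4) example; indices are 0-based.
u = embed([1, 0, 1, 1], {3, 5, 6, 7}, 8)
x = polar_transform(u)
print(u, x)
```

The recursive form avoids building the N×N matrix and is the same structure that an unrolled hardware encoder would follow stage by stage.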
B. The proposed SC-MJL decoding
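The proposed SC-MJL decoding is given in Algorithm 1. Initially, the recursive block length parameter is M = N and $\ell_1^N$ is the channel log-likelihood ratio (LLR) vector with
$$\ell_i = \ln \frac{W(y_i \mid 0)}{W(y_i \mid 1)}, \quad i = 1, \ldots, N,$$
where W(y|x) is the channel transition probability density function. $v_1^N$ is an indicator vector of the frozen coordinates defined as
The building blocks of the decoder are the f, g and d functions. The function f(ℓ, ℓ′) for any two LLR values ℓ and ℓ′ is defined as
$$f(\ell, \ell') = 2 \tanh^{-1}\!\left( \tanh\frac{\ell}{2} \, \tanh\frac{\ell'}{2} \right),$$
which can be approximated [24] as
$$f(\ell, \ell') \approx \operatorname{sign}(\ell) \operatorname{sign}(\ell') \min(|\ell|, |\ell'|).$$
The function g(ℓ, ℓ′, α) for any ℓ and ℓ′ and any α ∈ {0, 1} is defined as
$$g(\ell, \ell', \alpha) = (1 - 2\alpha)\,\ell + \ell'.$$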
The function d(ℓ, v) for any ℓ and frozen bit indicator v is defined as
Algorithm 1 combines the SC decoder with certain shortcuts such as MJL decoding, Wagner decoding, etc. For details of SC decoding we refer to [8], and to [13] for MJL decoding. A precise statement of the MJL decoder as used here is given as Algorithm 2 with a generic block length $N_{\mathrm{MJL}}$. The algorithm has $\log N_{\mathrm{MJL}} + 1$ stages. In the $i$th stage, the MJL algorithm decodes $\binom{\log M}{i}$ bits in parallel. For each bit, the algorithm calculates a final LLR value $\ell_j$ using the given f (3) and g (4) functions. After all $\hat{x}_1^M$ bits are decoded, the encoded $\hat{u}_1^M$ sequence is computed by using the bit-reversal permutation matrix $B_M$ and the generator matrix
The flowchart representation of Algorithm 1 is shown in 
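To make the role of the f and g building blocks concrete, the following is a minimal sketch of plain recursive SC decoding (without the MJL, Wagner or other shortcuts of Algorithm 1). It assumes the standard definitions of f (exact and min-sum forms) and g, an LLR convention in which a non-negative LLR favors bit 0, and the natural-order transform in which position i is paired with position i + M/2; the paper's hardware uses a bit-reversed ordering that pairs neighboring LLRs instead. The `frozen` flag convention (truthy means frozen) is an assumption of this sketch, not the paper's definition of the indicator vector.

```python
import math

def f_exact(a, b):
    """Exact check-node update: 2 * atanh(tanh(a/2) * tanh(b/2))."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))

def f_minsum(a, b):
    """Min-sum approximation: sign(a) * sign(b) * min(|a|, |b|)."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))

def g(a, b, alpha):
    """Variable-node update given the partial-sum bit alpha."""
    return b + (1 - 2 * alpha) * a

def polar_transform(u):
    """Natural-order polar transform over F_2 (used only for the demo below)."""
    if len(u) == 1:
        return list(u)
    h = len(u) // 2
    return polar_transform([a ^ b for a, b in zip(u[:h], u[h:])]) + polar_transform(u[h:])

def sc_decode(llr, frozen):
    """Return (u_hat, x_hat) for one block; frozen[i] truthy means u_i is fixed to 0."""
    if len(llr) == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    h = len(llr) // 2
    la, lb = llr[:h], llr[h:]
    # Step 1: f combines the two halves; decode the first sub-block.
    u1, z = sc_decode([f_minsum(a, b) for a, b in zip(la, lb)], frozen[:h])
    # Step 2: g uses the re-encoded first sub-block; decode the second sub-block.
    u2, w = sc_decode([g(a, b, s) for a, b, s in zip(la, lb, z)], frozen[h:])
    # Step 3: partial-sum update rebuilds the codeword estimate of this block.
    return u1 + u2, [zi ^ wi for zi, wi in zip(z, w)] + w

# Noiseless demo with a hypothetical (8, 4) frozen pattern.
frozen = [1, 1, 1, 0, 1, 0, 0, 0]
u = [0, 0, 0, 1, 0, 1, 1, 0]
llr = [4.0 if bit == 0 else -4.0 for bit in polar_transform(u)]
print(sc_decode(llr, frozen))
```

The four-step pattern (f, first sub-decoder, g, second sub-decoder) is what the unrolled architecture lays out in hardware; the shortcuts in Algorithm 1 replace whole sub-trees of this recursion with faster dedicated decoders.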
C. Adaptive quantization of the LLRs
The chip area of the SC decoder is dominated by the memory and the register chains in the deeply-pipelined architecture [16]. Implementation practice shows that using 5 or 6-bit precision for each LLR value causes a tolerable performance loss [25]. We propose to reduce the LLR precision even further (to between 1 and 5 bits per LLR) using an adaptive quantization technique. The bit allocation is based on maximizing the mutual information between the input and output LLRs of the quantizer. Unlike approaches based on lookup tables [26], here we use the regular f and g functions with a custom input data width. The data width or, in other words, the number of required quantization bits is optimized using the input LLR distribution of each constituent polar code. For example, a rate-1 polar code segment with an arbitrary block length can be represented with one bit (the sign bit). Since polarization takes place, a large number of bits is not necessary for the polarized code segments. In this way, the LLRs located on those paths can have adaptive quantization levels.
Applying adaptive quantization to the (1024,854) polar code yields the internal LLR bit precision shown in Fig. 2. The number of quantization bits is indicated on each line. For example, the second half of the (1024,854) polar code uses one fewer quantization bit by dropping the redundant least significant bit. The adaptive quantization method has a significant impact on reducing the chip area as well as the power dissipation of the SC-MJL decoder, as shown in Section IV-B.
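The mutual-information criterion described above can be illustrated with a small numerical sketch. This is not the authors' optimizer: it assumes a symmetric Gaussian LLR model (LLR distributed as N(±μ, 2μ), as for BPSK over AWGN), a uniform quantizer with 2^Q cells whose step size is searched over a fixed grid, and a hypothetical rule that picks the smallest Q in 1..5 retaining 99% of the mutual information of the 5-bit quantizer. All names and the tolerance are illustrative.

```python
import math

def gauss_cdf(x, mean, var):
    """CDF of a Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def quantizer_mi(q_bits, step, mu):
    """I(X; Q(L)) in bits for an equiprobable bit X and L | X ~ N(+-mu, 2*mu)."""
    half = 2 ** (q_bits - 1)
    edges = [-math.inf] + [k * step for k in range(-(half - 1), half)] + [math.inf]
    mi = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        p0 = gauss_cdf(hi, mu, 2.0 * mu) - gauss_cdf(lo, mu, 2.0 * mu)
        p1 = gauss_cdf(hi, -mu, 2.0 * mu) - gauss_cdf(lo, -mu, 2.0 * mu)
        for p in (p0, p1):
            if p > 0.0:
                mi += 0.5 * p * math.log2(2.0 * p / (p0 + p1))
    return mi

# Candidate quantizer step sizes (an arbitrary search grid for this sketch).
STEPS = [0.25 * k for k in range(1, 41)]

def best_mi(q_bits, mu):
    """Largest mutual information over the step grid for a Q-bit quantizer."""
    return max(quantizer_mi(q_bits, s, mu) for s in STEPS)

def allocate_bits(mu, max_bits=5, tolerance=0.01):
    """Smallest Q whose mutual information is within `tolerance` of the max_bits case."""
    target = (1.0 - tolerance) * best_mi(max_bits, mu)
    for q in range(1, max_bits + 1):
        if best_mi(q, mu) >= target:
            return q
    return max_bits

for mu in (2.0, 8.0, 50.0):
    print(mu, allocate_bits(mu))
```

Under this model, strongly polarized (very reliable) LLRs need only the sign bit, which is in line with the observation above that a rate-1 segment can be represented with one bit, while weakly polarized LLRs need more bits to preserve the same fraction of information.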
III. UNROLLED SC-MJL DECODER ARCHITECTURE
We propose an unrolled and deeply pipelined SC-MJL decoder architecture with fully-parallel processing units. We take advantage of bit-reversal decoding to operate on neighboring LLRs. The SC decoder, denoted as SC(N, K), consists of two sub-decoders that have the same block length N/2 but different payloads $K_i = \frac{N}{2} R_i$. In general, SC(N, K) is decoded in four steps: f, SC(N/2, $K_1$), g and SC(N/2, $K_2$). As a small example, the architecture of SC(16, 9) is shown in Fig. 3 (denoted as L(SC$_1$)) until $\hat{z}_1^8$ becomes ready at the input of g. Likewise, $\hat{z}_1^8$ is stored until $\hat{x}_1^8$ is ready.
The proposed SC-MJL decoder architecture for N = 16 and K = 9 is shown in Fig. 4. First, the adaptive quantization block (abbreviated as Adp. Q.) reduces the input LLR quantization from Q to Q′ bits. Then, the f function, the MJL(8, 2) decoder, the g function, and the Wagner(8, 7) decoder are activated consecutively. When $\hat{z}_1^8$ and $\hat{x}_1^8$ are ready, PSUL calculates the systematic output $\hat{u}_1^{16}$. Each decoding operation takes one time step except PSUL, which performs combinational XOR operations. Therefore, the total latency of SC-MJL(16, 9) is 4 time steps, which is considerably smaller than the 30 time steps of the SC(16, 9) decoder. Furthermore, the MJL(8, 2) decoder architecture is shown in Fig. 5. It utilizes nine adders, four f functions, two d functions, one g function and one XOR gate, where each f function contains a comparator and an XOR gate and each g function has two adders and one multiplexer.
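The Wagner decoder used for single parity-check segments and the MAP rule used for repetition segments are short enough to sketch directly. The code below assumes the textbook forms of both rules, an even-parity constraint, and the convention that a non-negative LLR favors bit 0; it returns codeword bits of the segment and is not the paper's hardware description.

```python
def decode_rep(llrs):
    """MAP decoding of a repetition code: decide on the sign of the LLR sum."""
    bit = 0 if sum(llrs) >= 0 else 1
    return [bit] * len(llrs)

def decode_spc_wagner(llrs):
    """Wagner decoding of an even-parity SPC code.

    Take hard decisions; if the parity check fails, flip the least reliable bit.
    """
    bits = [0 if value >= 0 else 1 for value in llrs]
    if sum(bits) % 2 == 1:
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1
    return bits

print(decode_rep([1.0, -0.5, -0.2, 0.4]))
print(decode_spc_wagner([2.0, -1.5, 0.3, 4.0]))
```

Both rules need only one pass over the segment (a sum, or a parity and a minimum search), which is why they fit in a single time step in an unrolled decoder.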
A. Register Balancing
The proposed SC-MJL decoder simultaneously processes different codewords in a sequence of decoding stages. The complex operations in the sequential stages and strict setup/hold time requirements may cause a throughput bottleneck in the decoder. The critical path, where the worst negative slack (WNS) is minimum, may limit the clock frequency and reduce the throughput. In order to avoid this, register balancing is performed at the HDL level to merge consecutive short paths by removing the registers between them. The locations of the remaining registers are chosen according to the combinational delay of the merged stages. Applying register balancing enables the SC-MJL decoder to perform multiple calculations within a clock cycle. It reduces both the latency and the memory usage of the decoder. For example, the latency of the SC-MJL(16, 9) decoder is reduced by two clock cycles when the given registers in Fig. 4 are removed without violating the WNS.
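The idea of merging consecutive short stages can be illustrated with a simple greedy grouping. This is only a sketch under simplifying assumptions: each stage is summarized by a single combinational delay, the delays of merged stages add up, and register setup time and clock skew are ignored. The delay values are hypothetical, and the paper does not state that a greedy rule was used.

```python
def merge_stages(delays_ps, clock_period_ps):
    """Greedily merge consecutive stages while the summed delay fits in one clock period.

    Returns a list of groups; one pipeline register remains after each group.
    """
    groups, current, total = [], [], 0
    for delay in delays_ps:
        if delay > clock_period_ps:
            raise ValueError("a single stage exceeds the clock period")
        if current and total + delay > clock_period_ps:
            groups.append(current)
            current, total = [], 0
        current.append(delay)
        total += delay
    if current:
        groups.append(current)
    return groups

# Hypothetical stage delays in picoseconds and a 1 ns clock period.
stages = [300, 400, 900, 200, 500, 600]
balanced = merge_stages(stages, 1000)
print(len(stages), "stages ->", len(balanced), "pipeline stages:", balanced)
```

Every removed register shortens the pipeline by one cycle and removes one full-width register bank, which is how merging reduces both latency and memory.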
IV. IMPLEMENTATION STUDY
The SC-MJL(1024,854) decoder is implemented in 45nm ASIC technology using the general purpose (GP) standard cell library (tcbn45gsbwp12tbc). The nominal PVT values are 45nm, 1.2V and 25°C. The implementation parameters are $N_{\mathrm{LIM}}$ = 32 and $N_{\mathrm{MJL}}$ = 8. In this configuration, the shortcut counts are 13 MJL, 13 SPC, 5 REP, 16 Rate-1 and 3 Rate-0. In addition, clock gating is employed on the available registers to reduce the power dissipation.
A. Performance Results
Extensive simulations have been performed to obtain the communication performance results of the SC-MJL decoding algorithm and the adaptive quantization method. The simulations have been carried out with an AWGN channel and BPSK modulation for the (1024,854) code. The performance of the SC-MJL decoding algorithm for varying $N_{\mathrm{MJL}}$ is shown in Fig. 6. As $N_{\mathrm{MJL}}$ increases, the performance deteriorates progressively. It is observed that $N_{\mathrm{MJL}}$ = 8 causes a tolerable loss.
The communication performance of the SC and the SC-MJL decoders is shown in Fig. 7. There is a performance difference of almost 0.1 dB between the SC and SC-MJL decoders. An additional performance loss occurs when the adaptive quantization is used. Applying register balancing does not introduce any additional performance degradation. However, using a fixed Q = 4 bits quantization for both channel and internal LLRs causes more than 0.3 dB performance loss.
B. ASIC Implementation Results
The ASIC post-synthesis results of the SC(1024,854) and SC-MJL(1024,854) decoders are shown in Table I. The SC-MJL decoder dissipates 1.5 times less power than the benchmark SC decoder, while having a smaller area. The proposed adaptive quantization and register balancing architecture further reduce the power dissipation by factors of 1.4 and 2.3, respectively. Owing to the register balancing architecture, both the latency and the pipeline depth of the decoder are reduced to 40 clock cycles. Since the throughput of the given implementations is the same, the most energy-efficient implementation is the last one, with 2.4 pJ/bit.
The post-synthesis results are scaled from 45nm to 7nm technology using the conservative scaling formulas in [1]. In addition to the scaling, each implementation utilizes two parallel decoders, which operate at 585.5 MHz because the expected 2.2 GHz frequency is scaled down by a factor of 3.7. Another parameter is the area scaling, a multiplier applied to the chip area to obtain a power density that allows feasible cooling of the chip. The results show that the proposed implementation is expected to have 0.37 pJ/bit energy efficiency at 7nm while having 1 Tb/s throughput.
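The throughput and energy figures quoted above are mutually consistent, which a short calculation can confirm. The sketch assumes that throughput counts information bits and that a fully pipelined decoder emits one codeword (K = 854 information bits) per clock cycle, so throughput equals K times the clock frequency times the number of parallel decoders. The 45nm clock frequency printed below is inferred from the 427 Gb/s figure under that assumption; it is not stated in the text.

```python
def throughput_bps(k_info_bits, clock_hz, parallel_decoders=1):
    """Information throughput, assuming one codeword per clock cycle per decoder."""
    return k_info_bits * clock_hz * parallel_decoders

def power_watts(energy_pj_per_bit, throughput):
    """Power implied by an energy efficiency (pJ/bit) at a given throughput (b/s)."""
    return energy_pj_per_bit * 1e-12 * throughput

K = 854

# Clock frequency implied by 427 Gb/s at 45nm (inferred, not quoted in the text).
f_45nm = 427e9 / K
print("implied 45nm clock (MHz):", f_45nm / 1e6)

# Two parallel decoders at 585.5 MHz after scaling to 7nm.
t_7nm = throughput_bps(K, 585.5e6, parallel_decoders=2)
print("7nm throughput (Tb/s):", t_7nm / 1e12)

# Power implied by the quoted energy efficiencies.
print("45nm power (W):", power_watts(2.4, 427e9))
print("7nm power (W):", power_watts(0.37, 1e12))
```

The last line matches the 0.37 W quoted in the abstract and conclusion, since 0.37 pJ/bit at 1 Tb/s is 0.37 W.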
C. ASIC implementation comparison of SC-MJL with other high throughput polar decoders
The ASIC post-synthesis results of high throughput polar decoders are compared in Table III. Using the same scaling rule as in [27] and [21], the normalized results show that the SC-MJL decoder is the most energy-efficient decoder. Although it operates at a 1.5 times lower frequency than the SC-Fast decoder, it has 3.2 times better area efficiency due to the efficient merging of pipelined stages in the register balancing architecture.
V. CONCLUSION
In order to reach high throughput within the physical limits of current VLSI technology, we proposed the SC-MJL decoding algorithm with adaptive quantization and a register balancing architecture. First, the SC-MJL decoder architecture reduces the pipeline depth of the SC algorithm by a factor of 1.2. In addition, the proposed adaptive quantization scheme further reduces both the computational and the memory complexity of the SC-MJL decoder. The proposed decoding algorithm utilizes a deeply-pipelined and unrolled hardware architecture using combinational logic. In this architecture, the consecutive decoding stages are merged to further reduce the pipeline depth of the decoder to 40 clock cycles. The ASIC synthesis results show that the SC-MJL decoder has 427 Gb/s throughput at 45nm technology. When the results are scaled to 7nm, the throughput reaches 1 Tb/s under 10 mm² chip area with 0.37 W power dissipation. Finally, the comparison with other high throughput implementations shows that the proposed SC-MJL decoder has a remarkable area and energy efficiency.
Table III notes: * Not presented in the paper, calculated from the presented results. Normalization factor for area is 0.39 = (28/45)². † Normalization factor for power is 0.43 = (28/45)(1.0/1.2)². ‡ Normalization factor for energy efficiency is 0.27 = (28/45)²(1.0/1.2)².
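The normalization factors given in the Table III notes follow from a simple scaling rule, which the sketch below reproduces. It assumes the factors convert results between a 45nm, 1.2 V design point and a 28nm, 1.0 V design point, as the ratios 28/45 and 1.0/1.2 suggest; the function name and argument order are illustrative.

```python
def scaling_factors(node_from_nm, node_to_nm, v_from, v_to):
    """Area, power and energy normalization factors for a node and voltage change."""
    s = node_to_nm / node_from_nm   # feature-size ratio
    v = v_to / v_from               # supply-voltage ratio
    return {
        "area": s ** 2,             # area scales with the square of the feature size
        "power": s * v ** 2,        # power factor as given in the table notes
        "energy": s ** 2 * v ** 2,  # energy per bit factor as given in the table notes
    }

factors = scaling_factors(45, 28, 1.2, 1.0)
for name, value in factors.items():
    print(name, round(value, 2))
```

The energy factor equals the power factor multiplied by one more feature-size ratio, which corresponds to the throughput (clock frequency) improving by the inverse of that ratio.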
