In this paper, we propose a rate-compatible forward error-correcting (FEC) scheme based on low-density parity-check (LDPC) codes, together with its software-reconfigurable unified field-programmable gate array (FPGA) architecture. By FPGA emulation, we demonstrate that the proposed class of rate-compatible LDPC codes, based on puncturing and generalized LDPC coding with an overhead from 25% to 46%, provides a coding gain ranging from 12.67 dB to 13.8 dB at a post-FEC bit-error rate (BER) of $10^{-15}$. As a result, the proposed rate-compatible codes represent strong soft-decision FEC candidates for both short-haul and long-haul optical transmission systems.
Introduction and Motivation
Modern high-speed optical communication systems require high-performance forward error-correcting (FEC) engines that support throughputs of 100 Gbit/s or multiples thereof, have low power consumption, provide high net coding gains at a target bit-error rate of $10^{-15}$, and are preferably adaptable to time-varying optical channel conditions. Thanks to their remarkable error-correcting capabilities and the recent development of corresponding integrated circuits, low-density parity-check (LDPC) codes and turbo-product codes have recently been included in several standards of the optical community, such as ITU-T G.975 and G.709, and hence represent the key technologies enabling the development of 400 Gb/s and 1 Tb/s Ethernet and beyond [1]-[3]. Recently, soft-decision binary and non-binary LDPC codes concatenated with an outer hard-decision code, usually a high-rate Reed-Solomon (RS) or Bose-Chaudhuri-Hocquenghem (BCH) code that pushes the system BER below the target BER, have been proposed in [4] and [5]. However, the implementation of such concatenated LDPC codes requires a thorough understanding of the LDPC code performance, and a properly designed interleaver should be employed between the LDPC code and the outer code to avoid the errors at the output of the LDPC decoder. Later, well-designed and optimized single convolutional LDPC codes and spatially-coupled LDPC codes were demonstrated to have very low error floors, below the system's target bit-error rate (BER) [6]-[7]. Moreover, a real-time single-carrier system with convolutional LDPC coding operating at 400 Gb/s has been demonstrated, and standardization by the Optical Internetworking Forum (OIF) is currently underway [8]-[9].
While it is extremely important to design an LDPC code with a desirable code rate, more specifically with an overhead from 15% to 25%, that closely approaches channel capacity, it may also become desirable to have rate-compatible transmission that improves network robustness, flexibility, and throughput as traffic demands evolve. Most recently, dual-polarization (DP) quadrature phase-shift keying (QPSK), DP 16-ary quadrature amplitude modulation (DP-16QAM), and DP-64QAM with varying code rates have been studied to achieve the highest generalized mutual information (GMI) at a given signal-to-noise ratio (SNR) [10], and a total of 10 modulation formats have been explored to find the best combination of spectral efficiency and highest span loss budget [11]. However, the implementation cost of families of such rate-compatible punctured LDPC codes is significantly high, since a parity-check matrix must be stored for each family.
In this paper, we apply a different strategy to address the above problem of designing rate-compatible (RC) LDPC codes for the next generation of optical transmission systems. Our motivation is two-fold. First, a well-constructed capacity-approaching LDPC code offers the promise of substantial performance gain for next-generation optical transport network (OTN) systems. Second, a unified LDPC decoder architecture has been shown to allow a wide range of performance for OTN, where a large number of parameters can be reconfigured in order to cope with time-varying optical channel conditions and service requirements. Our field-programmable gate array (FPGA) emulation results demonstrate that the proposed class of rate-adaptive LDPC codes, with overhead ranging from 25% to 46%, can achieve a Q-limit ranging from 5.33 dB to 4.2 dB at a BER of $10^{-15}$, which corresponds to coding gains from 12.67 dB to 13.8 dB. To the best of our knowledge, this is the first work implementing rate-compatible LDPC codes with a unified architecture, which represents a promising solution for the next generation of beyond-100 Gb/s optical transport networks.
The rest of this paper is organized as follows. In Section 2, we first present a construction rule for the class of rate-compatible quasi-cyclic (QC) LDPC codes and introduce their decoding algorithms. In Section 3, we introduce a unified FPGA-based RC LDPC decoder architecture and the optimization guidelines as well as the corresponding logic utilization. In Section 4, we compare and analyze the emulation results and discuss their significance. Finally, Section 5 concludes the paper.
Rate-Compatible LDPC Codes and Corresponding Decoding Algorithms
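There exist various algebraic and combinatorial methods to construct structured LDPC codes [12]-[13]. The large-girth QC LDPC code design with column weight $\gamma$ and row weight $\rho$ based on permutation matrices is adopted due to its low memory usage for storing the parity-check matrix [13], which can be represented by
$$H = \begin{bmatrix} I^{P_{1,1}} & I^{P_{1,2}} & \cdots & I^{P_{1,\rho}} \\ I^{P_{2,1}} & I^{P_{2,2}} & \cdots & I^{P_{2,\rho}} \\ \vdots & \vdots & \ddots & \vdots \\ I^{P_{\gamma,1}} & I^{P_{\gamma,2}} & \cdots & I^{P_{\gamma,\rho}} \end{bmatrix} \qquad (1)$$
where $I^{P_{i,j}}$ represents the $B \times B$ circulant permutation matrix in which the 1 in row $r$ is located at column $(r + P_{i,j}) \bmod B$.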
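The memory saving of the permutation-matrix construction can be illustrated with a minimal sketch: only the $\gamma \times \rho$ exponent table is stored, and each row of the parity-check matrix is expanded on demand. The function name `build_qc_parity_check` and the toy exponent table are hypothetical and chosen only for illustration; they are not the large-girth design used in this paper.

```python
def build_qc_parity_check(exponents, block_size):
    """Expand an exponent table into H, stored as one list of column indices per row."""
    gamma = len(exponents)
    rho = len(exponents[0])
    rows = []
    for i in range(gamma):
        for r in range(block_size):
            # Block (i, j) is a circulant permutation matrix:
            # its row r has the single 1 at column (r + P[i][j]) mod B.
            cols = [j * block_size + (r + exponents[i][j]) % block_size
                    for j in range(rho)]
            rows.append(cols)
    return rows


# Hypothetical toy exponents (gamma = 3, rho = 5, B = 7); not a large-girth design.
P = [[0, 0, 0, 0, 0],
     [0, 1, 2, 3, 4],
     [0, 2, 4, 6, 1]]
H = build_qc_parity_check(P, 7)
```

Every row of `H` then has weight $\rho$ and every column has weight $\gamma$, while the stored description consists of only $\gamma\rho$ integers.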
One straightforward way to achieve rate-compatible coding is to design a class of LDPC codes by selecting a set of different code parameters, together with a number of modulation formats [11]. Later, borrowing the idea of shortening and puncturing of RS codes [14], a puncturing algorithm was proposed that optimizes the degree distribution of a class of LDPC codes for a given mother code [15]. Furthermore, a rate-adaptive staircase LDPC code showed improvement over shortening by utilizing a set of interleavers [16]. However, none of the above offers a hardware structure sufficiently suitable for real-time implementation; in other words, no heterogeneous implementation architecture exists. In this paper, we propose RC LDPC codes capable of both coarse- and fine-tuning, which enables significant savings in logic and memory resources. Consider an $(N, K, M)$ LDPC code, where $N$, $K$, and $M$ represent the number of variable nodes, the number of information bits, and the number of check nodes, respectively, so that the code rate satisfies $R \geq 1 - M/N$. We can reduce the code rate either by reducing $N$ or by increasing $M$. For coarse-tuning, we softly puncture several columns of the mother code represented by (1) by setting their initial log-likelihood ratios (LLRs) to the largest integer value; for fine-tuning, we replace several single-parity-check (SPC) codes by a linear block code (also referred to as a generalized code or local code). For example, let a QC-LDPC code with $\gamma = 3$, $\rho = 15$, $B = 1129$ be the mother code. Puncturing the rightmost column in (1) reduces the code rate from 0.8 to 0.7857, which represents the coarse-tuning process. Let $d$ denote the distance between two neighboring SPC codes to be replaced by a simple $(n, k, m)$ linear block code, where $n$, $k$, and $m$ represent the number of code bits, the number of information bits, and the number of check bits, respectively. After uniform substitution, the code rate can be calculated by
We choose the (15, 11) Hamming code and replace SPC codes at spacings of $d = 127$ and $d = 63$, so that the resulting code rates are 0.7953 and 0.7906, respectively, which represents the fine-tuning process. Note that for doping the SPC codes we choose a simple Hamming code over other linear block codes, such as binary BCH or Reed-Muller codes, due to its low implementation complexity. In summary, the proposed RC LDPC codes can be constructed in two steps. In step 1, we design a large-girth QC-LDPC code based on (1). In step 2, the parity-check matrix is modified either by puncturing several rightmost columns or by doping selected SPC codes with simple linear block codes.
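The coarse- and fine-tuning rates quoted above can be checked with a short sketch. The counting rule is an assumption made here for illustration: $\lfloor M/d \rfloor$ SPC checks are replaced, and each replacement by a local code with $m$ check bits adds $m - 1$ parity constraints. The block size $B = 2309$ is taken from the emulated mother code in Section 4, and the function names are hypothetical.

```python
def coarse_rate(gamma, rho, punctured_columns=0):
    """Design rate 1 - M/N after removing block columns from the mother code."""
    return 1 - gamma / (rho - punctured_columns)


def fine_rate(gamma, rho, block_size, d, m_local=4, punctured_columns=0):
    """Design rate when SPC checks at spacing d are replaced by a local code.

    Assumption (not taken from the paper): floor(M / d) checks are replaced,
    and each replacement contributes m_local - 1 additional parity constraints.
    """
    n = (rho - punctured_columns) * block_size   # number of variable nodes N
    m = gamma * block_size                       # number of check nodes M
    extra = (m // d) * (m_local - 1)
    return 1 - (m + extra) / n


mother = coarse_rate(3, 15)                       # no puncturing
one_column = coarse_rate(3, 15, punctured_columns=1)
doped_127 = fine_rate(3, 15, 2309, d=127)         # (15, 11) Hamming, m = 4
doped_63 = fine_rate(3, 15, 2309, d=63)
```

Under this counting rule the four values round to 0.8, 0.7857, 0.7953, and 0.7906. Combining five punctured block columns with $d = 63$ gives approximately 0.6858, which suggests, but does not confirm, how the lowest rate reported in Section 4 is obtained.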
For completeness of the presentation, we provide the layered decoding algorithm for the proposed RC LDPC codes. Let $M^{k,l}_{cv}$, $M^{k,l}_{vc}$, and $M_{ch,v}$ represent the message from check node $c$ to variable node $v$ at the $k$-th iteration and $l$-th layer, the message from variable node $v$ to check node $c$ at the $k$-th iteration and $l$-th layer, and the LLR from the channel, respectively, where $k = 1, \ldots, I_{max}$ and $l = 1, \ldots, \gamma$. The layered scaled min-sum algorithm adopted in this paper can be summarized as follows [17]-[18]:
2) Bit decisions:
3) Variable-node update rule:
4) Check-node update rule: a) For generalized check-nodes, a posteriori probability (APP) update rule is applied:
b) For single-parity check-nodes, scaled min-sum update rule is applied:
where $s$ denotes the scaling factor for the min-sum algorithm, which is set to 0.75. The above procedure iterates until a pre-defined maximum number of iterations is reached or all decisions have converged to the transmitted codeword. Note that only a $1/\gamma$ portion of the check nodes is involved in each layer.
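The layered schedule with the scaled min-sum check-node rule can be sketched in software as follows. This is a minimal floating-point illustration under stated assumptions, not the decoder implemented in this paper: it handles single-parity-check nodes only (generalized check nodes are omitted), it stops on a zero syndrome rather than on agreement with the transmitted codeword, and the function name and toy code are hypothetical.

```python
def layered_scaled_min_sum(check_rows, layers, channel_llr, max_iter=15, scale=0.75):
    """Layered scaled min-sum decoding; returns (hard_bits, iterations_used)."""
    posterior = list(channel_llr)                    # running a posteriori LLR per variable node
    c2v = [[0.0] * len(row) for row in check_rows]   # stored check-to-variable messages
    for it in range(1, max_iter + 1):
        for layer in layers:
            for c in layer:
                row = check_rows[c]
                # Variable-to-check messages: remove this check's previous contribution.
                v2c = [posterior[v] - c2v[c][k] for k, v in enumerate(row)]
                mags = [abs(x) for x in v2c]
                # First minimum, its position, and second minimum of the magnitudes.
                min1_pos = min(range(len(row)), key=mags.__getitem__)
                min1 = mags[min1_pos]
                min2 = min(m for k, m in enumerate(mags) if k != min1_pos)
                sign_prod = 1
                for x in v2c:
                    if x < 0:
                        sign_prod = -sign_prod
                for k, v in enumerate(row):
                    mag = min2 if k == min1_pos else min1
                    sign = sign_prod * (-1 if v2c[k] < 0 else 1)
                    c2v[c][k] = scale * sign * mag
                    posterior[v] = v2c[k] + c2v[c][k]
        bits = [1 if x < 0 else 0 for x in posterior]
        # Early termination (assumption: syndrome check instead of a reference codeword).
        if all(sum(bits[v] for v in row) % 2 == 0 for row in check_rows):
            return bits, it
    return bits, max_iter


# Hypothetical toy code with three checks, one check per layer; all-zero codeword sent.
toy_rows = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
toy_layers = [[0], [1], [2]]
toy_llr = [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0]      # bit 2 received with the wrong sign
bits, used = layered_scaled_min_sum(toy_rows, toy_layers, toy_llr)
```

Because the posterior LLRs are refreshed after every layer, later layers in the same iteration already benefit from the corrected value of bit 2, which is the reason layered decoding needs fewer iterations than a flooding schedule.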
FPGA Architecture of Proposed RC LDPC Decoder
In this section, we first provide the overall architecture of our FPGA emulation platform. After that, we provide the details of the proposed FPGA architecture for the proposed RC LDPC decoder. The puncturing-based LDPC decoder is used as a reference.
Overview Architecture
The performance of the proposed rate-adaptive LDPC codes is demonstrated using a set of FPGAs, whose high-level diagram is presented in Fig. 1. This high-level verification tool is similar to others reported in the literature [6]-[7] and consists of three main parts: a set of Gaussian noise generators, a partially pipelined LDPC decoder, and an error counter block. The Gaussian noise generator produces noise samples using a combination of the Box-Muller algorithm and the central limit theorem, following the guidelines in [19]. The generated noise samples are multiplied by the noise standard deviation $\sigma$ and fed to the LDPC decoder as quantized LLRs. There are three types of software-reconfigurable registers to be initialized in the high-level diagram. The first set of registers defines the noise variance and the codeword to be transmitted. The second set configures the rate-adaptive LDPC decoder, including the number of layered iterations, the number of bits to be punctured, and the number of generalized check nodes. The third set stores the number of uncoded errors, coded errors, and transmitted codewords. Fig. 2(a) presents an overview of the reconfigurable binary LDPC decoder, which consists of two parts, namely memories and processors. The processors can be classified into the following categories: (i) the variable-node processor (VNP), corresponding to (5); (ii) the scaled min-sum check-node processor (SMS-CNP), corresponding to (7); (iii) the MAP-based CNP (MAP-CNP, also referred to as the generalized-CNP), corresponding to (6); and (iv) the early termination unit (ETU), corresponding to (4).
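The noise-generation and LLR-quantization front end of the emulator can be mimicked in software with the sketch below. Several points are assumptions made for illustration: only the Box-Muller transform is used (the central-limit-theorem stage is omitted), antipodal $\pm 1$ signalling with $Q = 1/\sigma$ is assumed, and the quantizer step size of 0.5 and all function names are hypothetical.

```python
import math
import random


def box_muller(rng):
    """Return one standard normal sample from two uniform samples."""
    u1 = 1.0 - rng.random()          # in (0, 1], avoids log(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def quantize_llr(llr, bits=6, step=0.5):
    """Uniform symmetric quantizer returning an integer level in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1
    level = int(round(llr / step))
    return max(-max_level, min(max_level, level))


def channel_llrs(codeword, q_factor_db, rng, bits=6, step=0.5):
    """Quantized channel LLRs for a binary codeword sent over a binary-input AWGN channel."""
    sigma = 10 ** (-q_factor_db / 20)            # assumption: Q = 1/sigma for +/-1 signalling
    out = []
    for b in codeword:
        y = (1.0 - 2.0 * b) + sigma * box_muller(rng)   # bit 0 -> +1, bit 1 -> -1
        out.append(quantize_llr(2.0 * y / sigma ** 2, bits, step))
    return out


rng = random.Random(2016)                         # fixed seed for repeatability
llrs = channel_llrs([0] * 8, 5.33, rng)
```

In such a model the soft puncturing described in Section 2 would amount to overwriting the LLRs of the punctured positions with the largest level, 31 for a 6-bit representation.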
Four sets of block RAMs are used in the implementation of the RC LDPC decoder: (i) a block RAM of size $n \times BW_{M_{ch,v}}$ stores the channel LLRs, (ii) a block RAM of size $\gamma \times n \times BW_{M_{cv}}$ stores the messages from check nodes to variable nodes, (iii) a block RAM of size $n$ stores the decoded
The architecture of the SMS-CNP is implemented based on (7) and is illustrated in Fig. 2(b). We first calculate the absolute value and the sign of each input. We then find the first and second minima of the absolute values and trace back to find the position of the first minimum via a binary search tree (BST). It is worth noting that the number of muxes involved in the BST and the latency associated with the BST are proportional to the number of inputs. With this technique, we can write back three values and $\rho$ sign bits to the $M_{cv}$ memories instead of $\rho$ values. However, we do not take advantage of memory reduction techniques in the resource utilization comparison study.
The implementation of the generalized-CNP is based on (6), in which the MAP function is realized by the bidirectional Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm. As shown in Fig. 2(c), we first calculate the forward- and backward-recursion likelihoods, $\alpha(s, t)$ and $\beta(s, t)$, using the input $I(t)$, where $s$ and $t$ index the states and the time instances of the BCJR trellis; the number of states is determined by the number of check bits of the linear block code, and the trellis length by the check-node degree $\rho$. The intermediate values $\alpha(s, t)$ and $\beta(s, t)$ are then written into the forward and backward block RAMs. The outputs $O(t)$ are generated by add-max operations, which properly combine the forward likelihood $\alpha(s, t)$, the backward likelihood $\beta(s, t)$, and the input $I(t)$. The time-variant trellis of a linear block code implies that a pre-calculated routing scheme should be implemented, so that the current recursion likelihood can be updated from the appropriate previous recursion likelihood.
Furthermore, the max-star operation is replaced by the max operation, because the max-star operation has high memory consumption and requires one additional cycle. The bidirectional recursion scheme is adopted to further reduce the latency of the generalized-CNP. For a (15, 11) Hamming code, there are 16 states in total and the length of the trellis is 16, including the initial stage. The timing diagram is illustrated in Fig. 3, where $\alpha_i = \alpha(s, i)$ and $\beta_i = \beta(s, i)$ represent the forward and backward recursion likelihood vectors of size 16 at time instance $i$. Once half of the forward and backward recursion likelihoods have been updated and written into their associated block RAMs, the output $O_7$ is calculated based on the current $\alpha_7$ and the current $\beta_8$. Immediately afterwards, two outputs are generated in each cycle, based on the current $\alpha$ with a previous $\beta$ and the current $\beta$ with a previous $\alpha$. Thus, the latency of the generalized-CNP can be calculated as $T = T_{PS} + T_{local,n} + T_{SP}$, where $T_{PS}$, $T_{local,n}$, and $T_{SP}$ represent the latency of the parallel-to-serial conversion of the input, the length of the linear block code, and the latency of the serial-to-parallel conversion of the output, respectively. In summary, the complexity of both the SMS-CNP and the generalized-CNP is reasonably low, which makes the proposed rate-compatible LDPC code a very promising solution for future OTNs.
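The max-log (add-max) trellis processing of the generalized-CNP can be sketched in software for the (15, 11) Hamming code. The sketch below is an illustration under assumptions, not the hardware design: it runs the forward and backward recursions one after the other instead of bidirectionally, it uses a syndrome trellis whose parity-check columns are the binary representations of 1 to 15, and it returns extrinsic LLRs that exclude the input $I(t)$ of the bit being updated. The names `COLUMNS` and `max_log_map` are hypothetical.

```python
NEG_INF = float("-inf")
# Assumed parity-check columns of the (15, 11) Hamming code: column t is the 4-bit integer t + 1.
COLUMNS = list(range(1, 16))


def max_log_map(llr_in):
    """Extrinsic max-log-MAP LLRs on the 16-state syndrome trellis (positive LLR favours bit 0)."""
    n, states = len(COLUMNS), 16
    alpha = [[NEG_INF] * states for _ in range(n + 1)]   # 16 time instances, including the initial one
    beta = [[NEG_INF] * states for _ in range(n + 1)]
    alpha[0][0] = 0.0                                    # paths start in the zero-syndrome state
    beta[n][0] = 0.0                                     # and must end there (valid codeword)
    for t in range(n):                                   # forward recursion
        h, g = COLUMNS[t], llr_in[t] / 2
        for s in range(states):
            a = alpha[t][s]
            if a == NEG_INF:
                continue
            alpha[t + 1][s] = max(alpha[t + 1][s], a + g)          # bit 0: state unchanged
            alpha[t + 1][s ^ h] = max(alpha[t + 1][s ^ h], a - g)  # bit 1: state XOR column
    for t in range(n - 1, -1, -1):                       # backward recursion
        h, g = COLUMNS[t], llr_in[t] / 2
        for s in range(states):
            beta[t][s] = max(beta[t + 1][s] + g, beta[t + 1][s ^ h] - g)
    out = []
    for t in range(n):                                   # add-max combination of alpha and beta
        h = COLUMNS[t]
        m0 = max(alpha[t][s] + beta[t + 1][s] for s in range(states))
        m1 = max(alpha[t][s] + beta[t + 1][s ^ h] for s in range(states))
        out.append(m0 - m1)
    return out


extrinsic = max_log_map([2.0] * 15)
```

The XOR with a different column at every time instance is the software counterpart of the time-variant routing mentioned above: the predecessor state that feeds a given state changes from one trellis section to the next.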
Rate-Compatible LDPC Decoder Architecture
Given the overall architecture described in the previous subsection, we implement a (34635, 27710) QC LDPC decoder as well as a generalized version of it using the (15, 11) Hamming code, in which a set of reconfigurable registers provides the rate adaptation. The utilization reports for the Xilinx xc6vsx475t of both the punctured LDPC decoder and the proposed RC LDPC decoder are summarized in Table 1. Compared to the puncturing-based RC LDPC decoder, the proposed RC LDPC decoder uses an additional 6% of occupied slices and an additional 4% of RAMB18E1 blocks.
Emulation Results and Discussion
Before implementing the proposed RC LDPC codes in hardware, quantization is an important issue that needs to be addressed. Although non-uniform quantization and other quantization methods can be used to represent $M^{k,l}_{cv}$, $M^{k,l}_{vc}$, and $M_{ch,v}$, we employ uniform quantization for all three types of messages. We choose 6-bit resolution to ensure that any error-floor phenomenon is due to the code design itself rather than the finite-precision representation, while keeping the complexity reasonably low. The FPGA-based emulation was conducted over a binary-input (BI) AWGN channel, which is a valid assumption for an amplified spontaneous emission (ASE) noise dominated scenario. Given a girth-10 regular (34635, 27710) mother code constructed using the method discussed in Section 2, a set of RC LDPC codes can be obtained with code rates ranging from 0.8 down to 0.6858. With the above resolution, six RC LDPC decoders can be implemented in one FPGA, and with four FPGAs available in our rapid prototyping platform, 24 decoders are employed in total. Each decoder consists of three CNPs and 45 VNPs, hence the throughput of a decoder can be calculated as $F_{CLK} \cdot n / [(B/l + \delta) \cdot I_{max}]$, where $F_{CLK} = 200$ MHz is the FPGA clock frequency, $n$ is the number of bits per codeword, $B = 2309$ is the block size, $l = 3$ is the pipeline depth, $\delta = 7$ is the latency of the VNP and CNP, and $I_{max} = 45$ is the maximum number of layered iterations. It is worth noting that the decoder converges fast in the high-SNR regime (20 to 27 iterations, verified by simulation). The aggregate throughput of the mother code is ∼4.8 Gbit/s in the low-SNR regime and ∼9.6 Gbit/s in the high-SNR regime, while the throughput for the code rate of 0.6858 is ∼3.2 Gbit/s and ∼6.4 Gbit/s, respectively. The FPGA-based BER vs. Q-factor performance of the proposed RC LDPC codes is presented in Fig. 4, in which the parameters in the legend represent the column weight, the row weight, and the distance between two uniformly distributed generalized check nodes. The number of layered iterations is set to 45, which is equivalent to 30 conventional iterations, since layered decoding converges twice as fast as the conventional decoding algorithm [18]. It is worth mentioning that we puncture entire blocks from the (3, 15) LDPC mother code for coarse-tuning and replace every $d$-th check node with a simple (15, 11) Hamming code, where $d \in \{\mathrm{INF}, 127, 63\}$, for fine-tuning. The mother code with a code rate of 0.8 and the lowest-rate code with a code rate of 0.6858 achieve Q-limits of 5.33 dB and 4.2 dB at a BER of $10^{-15}$, which correspond to coding gains of 12.67 dB and 13.8 dB, respectively.
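The throughput and coding-gain figures above can be reproduced with a short calculation. The sketch assumes that the coding gain is measured against the uncoded Q-factor obtained from $\mathrm{BER} = \tfrac{1}{2}\,\mathrm{erfc}(Q/\sqrt{2})$, and that the lowest-rate code has $n = 10 \times 2309 = 23090$ transmitted bits (five punctured block columns); both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def decoder_throughput(f_clk_hz, n_bits, block_size, pipeline_depth, latency, max_iter):
    """F_CLK * n / [(B / l + delta) * I_max], in bit/s for a single decoder."""
    return f_clk_hz * n_bits / ((block_size / pipeline_depth + latency) * max_iter)


def uncoded_q_db(target_ber):
    """Q-factor in dB at which 0.5 * erfc(Q / sqrt(2)) equals target_ber (found by bisection)."""
    lo, hi = 0.0, 20.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(mid / math.sqrt(2)) > target_ber:
            lo = mid
        else:
            hi = mid
    return 20 * math.log10(hi)


# 24 decoders, worst case of I_max = 45 layered iterations.
aggregate_mother = 24 * decoder_throughput(200e6, 34635, 2309, 3, 7, 45)
aggregate_lowest = 24 * decoder_throughput(200e6, 23090, 2309, 3, 7, 45)   # assumed n
gain_mother = uncoded_q_db(1e-15) - 5.33
gain_lowest = uncoded_q_db(1e-15) - 4.2
```

With these inputs the aggregate throughputs come out close to 4.8 Gbit/s and 3.2 Gbit/s, and the uncoded reference is close to 18 dB, so the quoted gains of 12.67 dB and 13.8 dB follow directly from the Q-limits. Note that the formula counts transmitted code bits, not information bits.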
Concluding Remarks
In this paper, we have proposed a novel class of reconfigurable RC LDPC codes with overhead ranging from 25% to 46% for high-speed optical transmission systems. The BER performance has been verified on an FPGA emulation system, and it has been shown that the proposed RC LDPC codes exhibit superior waterfall performance and excellent error-floor performance down to a BER of $10^{-15}$. The coding gains of the proposed RC LDPC codes, with overheads between 25% and 46%, range from 12.67 dB to 13.8 dB at a BER of $10^{-15}$. To the best of our knowledge, this is the first implementation of such a wide range of RC LDPC codes. We believe that the proposed RC QC-LDPC code is one of the promising candidates for the next generation of optical transmission systems.
