State-of-the-art fiber-optic communication systems employ a fixed forward error correction (FEC) scheme and a fixed constellation size. While it is important to closely approach the Shannon limit by using turbo product codes (TPC) and low-density parity-check (LDPC) codes with soft-decision decoding (SDD) algorithms, rate-adaptive techniques, which enable increased information rates over short links and reliable transmission over long links, are likely to become more important with ever-increasing network traffic demands. In this invited paper, we describe a rate-adaptive non-binary LDPC coding technique and demonstrate, by FPGA-based emulation, its flexibility and good performance, exhibiting no error floor at BERs down to $10^{-15}$ over the entire code rate range, making it a viable solution for next-generation high-speed intelligent aggregation networks.
INTRODUCTION
Software-defined networking (SDN) is currently emerging as a key technology to enable the next generation of optical transport and access networks supporting ever-increasing traffic demands, owing to its dynamic, manageable, and cost-effective nature as well as its adaptability through abstraction of high-level functionalities [1]. To meet requirements such as agility and programmable configurability, the physical layer demands dynamic wavelength and bandwidth allocation, together with rate-adaptive forward error correction (FEC) codes with high flexibility and a unified architecture. In the past decades, a number of FEC schemes have been studied extensively in many communication systems, such as space communication links, digital subscriber lines, and wireless systems. Specifically, Reed-Solomon (RS) codes, concatenated codes, product codes, and low-density parity-check (LDPC) codes are recommended by various ITU-T standards [2] [3] [4] [5] [6]. They differ in terms of transmission overhead (redundancy), implementation complexity, net coding gain (NCG), and burst error correction capability, to mention a few.
In this invited paper, we describe and compare various LDPC coding schemes, including binary and non-binary LDPC codes, as well as the corresponding rate adaptation techniques for both cases. By computer simulation, we first optimize the bit-widths of the input log-likelihood ratios (LLRs), check-to-variable messages, and variable-to-check messages. We then show via FPGA emulation that the large-girth LDPC coding schemes can achieve a large net coding gain (NCG), exhibiting no error floor at BERs down to $10^{-15}$ when a layered decoding algorithm is employed. We also demonstrate a flexible rate adaptation technique with the proposed unified decoder architecture. Finally, by comparing the implementation complexity of RS codes and concatenated codes, we conclude that the proposed flexible LDPC coding schemes with a reconfigurable unified architecture represent one of the promising candidates for next-generation intelligent optical aggregation networks.
The remainder of the paper is organized as follows. In Section 2, a general construction method for non-binary LDPC codes is discussed and a hardware-friendly decoding algorithm is reviewed. The proposed software-defined rate-adaptive FPGA-based decoder architecture is introduced in Section 3, while the emulation results are presented and compared in Section 4. Section 5 concludes the paper.
RATE ADAPTIVE LDPC CODES

Construction of non-binary LDPC codes
In this section, we discuss a method of constructing high-rate non-binary LDPC codes with low error floors. The process consists of two steps: we first construct a large-girth binary LDPC code, and then replace the 1's in the binary parity-check matrix with non-zero elements of the Galois field GF(q), selected at random [7]. It is well known that density evolution (DE) [8] or extrinsic information transfer (EXIT) chart analysis [9] can be employed to derive a capacity-approaching signal-to-noise ratio threshold for a given code rate. However, in this paper, we choose a quasi-cyclic (QC) regular LDPC code design based on permutation matrices due to its implementation efficiency. Following the guidelines in [10], the parity-check matrix of a (c, r)-regular non-binary QC-LDPC code can be represented by
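$$H = \begin{bmatrix} B_{1,1} * Q_{1,1} & B_{1,2} * Q_{1,2} & \cdots & B_{1,r} * Q_{1,r} \\ \vdots & \vdots & \ddots & \vdots \\ B_{c,1} * Q_{c,1} & B_{c,2} * Q_{c,2} & \cdots & B_{c,r} * Q_{c,r} \end{bmatrix}$$
where $B_{i,j}$ and $Q_{i,j}$ are both $b \times b$ circulant permutation matrices with the same offset, over GF(2) and GF(q), respectively. The operation $*$ is element-wise multiplication over GF(q). For the sake of efficient implementation, we choose $Q_{i,j}$ such that all non-zero elements in the circulant permutation matrix are equal.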
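The two-step construction can be illustrated with a minimal Python sketch. It assumes integer labels 0..q-1 for the elements of GF(q), a toy block size, and placeholder offsets (i·j mod b) that are not the large-girth design of [10]; the function names and parameter values are hypothetical. Each block is a binary circulant permutation matrix whose 1's are replaced by a single randomly chosen non-zero label, so every row has r non-zero entries and every column has c.

```python
import random


def circulant_permutation(size, offset, value=1):
    """size x size circulant permutation matrix: row i holds `value` in column (i + offset) % size."""
    rows = [[0] * size for _ in range(size)]
    for i in range(size):
        rows[i][(i + offset) % size] = value
    return rows


def build_qc_nonbinary_h(c, r, b, q, offsets, rng):
    """Assemble a c x r array of b x b blocks, each a circulant permutation
    matrix scaled by one random non-zero GF(q) label (same label per block)."""
    h = [[0] * (r * b) for _ in range(c * b)]
    for i in range(c):
        for j in range(r):
            label = rng.randrange(1, q)  # non-zero field element, as an integer label
            block = circulant_permutation(b, offsets[i][j], label)
            for x in range(b):
                for y in range(b):
                    h[i * b + x][j * b + y] = block[x][y]
    return h


# Toy parameters (assumptions for illustration only).
c, r, b, q = 3, 5, 7, 4
offsets = [[(i * j) % b for j in range(r)] for i in range(c)]
H = build_qc_nonbinary_h(c, r, b, q, offsets, random.Random(1))

row_weights = {sum(1 for e in row if e) for row in H}
col_weights = {sum(1 for row in H if row[j]) for j in range(r * b)}
print(row_weights, col_weights)
```

Using one label per block mirrors the choice of equal non-zero elements within each circulant, which keeps the per-block field multiplication fixed.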
Decoding of non-binary LDPC codes
Several algorithms have been proposed for decoding non-binary LDPC codes, including the Q-ary sum-product algorithm (QSPA), log-domain FFT-QSPA (Log-FFT), mixed-domain FFT-QSPA, max-log QSPA, the extended min-sum algorithm, and the min-max algorithm (MMA) [11] [12] [13] [14] [15]. Let $R_{cv}^{k,l}$, $L_{vc}^{k,l}$, and $L_v$ represent the check c to variable v message, the variable v to check c message at the k-th iteration and l-th layer, and the log-likelihood ratio (LLR) from the channel, respectively, where $k = 1, \dots, I_{\max}$ and $l = 1, \dots, c$. The layered min-max algorithm (LMMA) is adopted in this paper, and the data flow can be summarized as follows:
• Bit decision step:
• Variable node processing rule:
• Check node processing rule:
The index $k'$ in Eq. (5) is set to $k - 1$ when $l < l'$ and to $k$ otherwise, and Eq. (6) is necessary for numerical reasons to ensure the convergence of the algorithm. In Eq. (7), $I(c \mid a_v = a)$ is the set of sequences of finite field elements that satisfy the parity-check equation of check node c when variable node v takes the value a; the minimization over this set is realized by a trellis-based recursive approach.
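For reference, the three steps can be sketched in the standard layered min-max form. This sketch assumes LLRs defined so that smaller values indicate more likely symbols, with $R^{0,l'}_{c'v}(a) = 0$ before the first iteration. The symbols $\hat{c}_v$, $N(c)$ (the variable nodes connected to check c), $h_{cv}$ (the non-zero entries of the parity-check matrix), and $c'$ (the check node in layer $l'$ connected to v) are assumptions chosen for illustration, and the equations are left unnumbered because their exact form may differ from Eqs. (5) to (7) referenced in the text.

```latex
\begin{align*}
% Variable node processing for layer l in iteration k
L_{vc}^{k,l}(a) &= L_v(a) + \sum_{l' \neq l} R_{c'v}^{k',l'}(a),
  \qquad k' = \begin{cases} k-1, & l < l' \\ k, & \text{otherwise} \end{cases} \\
% Normalization, needed for numerical stability
L_{vc}^{k,l}(a) &\leftarrow L_{vc}^{k,l}(a) - \min_{a' \in \mathrm{GF}(q)} L_{vc}^{k,l}(a') \\
% Check node processing (min-max rule)
R_{cv}^{k,l}(a) &= \min_{(a'_{v'}) \in I(c \mid a_v = a)} \;
  \max_{v' \in N(c) \setminus \{v\}} L_{v'c}^{k,l}(a'_{v'}) \\
% Configuration set over a field of characteristic 2
I(c \mid a_v = a) &= \Big\{ (a'_{v'})_{v' \in N(c) \setminus \{v\}} :
  \sum_{v' \in N(c) \setminus \{v\}} h_{cv'} a'_{v'} = h_{cv}\, a \Big\} \\
% Bit decision
\hat{c}_v &= \arg\min_{a \in \mathrm{GF}(q)} \left[ L_{vc}^{k,l}(a) + R_{cv}^{k,l}(a) \right]
\end{align*}
```

The case distinction on $k'$ reflects the layered schedule: layers after the current one have not yet been updated in iteration k, so their messages from iteration $k - 1$ are used.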
Rate adaptation via shortening
Basic rate adaptation via either shortening or puncturing is widely used in communication systems and can be applied to both block and convolutional codes [16]. In this paper, we use shortening to achieve rate-adaptive LDPC coding, since it allows a wide range of rate adjustment with a unified decoder architecture through a set of reconfigurable registers in the FPGA. Because of the quasi-cyclic structure of our non-binary LDPC codes, we shorten entire sub-blocks, which requires the fewest additional logic blocks. For example, starting from a (3, 15)-regular non-binary LDPC code with a rate of 0.8, we obtain a class of shortened regular non-binary LDPC codes with column weight and row weight of {(3, 14), (3, 13), (3, 12), (3, 11), (3, 10)}, corresponding to code rates of {0.786, 0.77, 0.75, 0.727, 0.7}. It is straightforward to obtain lower rates by increasing the shortening length; however, this becomes increasingly inefficient, and rates below 0.7 are not recommended in optical communication systems.
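The rates quoted above follow from the design rate 1 - c/r of a (c, r)-regular code, with one block column removed per shortened sub-block. The short Python sketch below reproduces them; the function name is hypothetical, and the small rate gain from redundant parity-check rows is ignored.

```python
def shortened_design_rate(c, r, shortened_blocks=0):
    """Design rate 1 - c/r' after removing `shortened_blocks` of the r block columns."""
    remaining = r - shortened_blocks
    if remaining <= c:
        raise ValueError("too many sub-blocks shortened")
    return 1 - c / remaining


# Mother code (3, 15) and the five shortened configurations (3, 14) ... (3, 10).
rates = [round(shortened_design_rate(3, 15, s), 3) for s in range(6)]
print(rates)
```

The values agree with the rates listed in the text to the precision given there (0.769 is quoted as 0.77).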
ARCHITECTURE OF LDPC DECODER
Overview architecture
To verify the performance of the proposed rate-adaptive LDPC codes, we use a field-programmable gate array (FPGA) platform, whose high-level diagram is illustrated in Fig. 1. This platform is similar to other platforms reported in the literature [4, 5, 17] and consists of three parts: a Gaussian noise generator, an LDPC decoder, and an error counter circuit. The Gaussian noise generator produces Gaussian-distributed log-likelihood ratios (LLRs), corresponding to BPSK transmission over an AWGN channel, using two uniform random number generators and the Box-Muller transform. The sequences obtained in this way are fed to the LDPC decoder after quantization. The software configuration interface is implemented on a lightweight MicroBlaze processor, which is used to configure the emulation setup (for example, the noise variance and shortening length) and to read out the error counts. Without loss of generality, the all-zero codeword is assumed to be transmitted. The LDPC decoder is based on the layered decoding algorithm [18] and uses a scaled min-sum check-node computation rule with a constant scaling factor for binary LDPC codes and a min-max check-node computation rule for non-binary LDPC codes. This architecture is duplicated D times in order to increase the throughput.
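The noise generator and quantizer can be mimicked in software with a minimal Python sketch. It assumes the all-zero codeword is mapped to the BPSK symbol +1, that the LLR is 2y/σ² (positive values favor bit 0), and a uniform symmetric quantizer; the step size, bit-width default, noise level, and function names are illustrative assumptions, not values from the FPGA design.

```python
import math
import random


def box_muller(rng):
    """One standard normal sample from two uniform samples (Box-Muller transform)."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def quantize(value, bits, step):
    """Uniform symmetric quantizer to a signed integer representable in `bits` bits."""
    limit = 2 ** (bits - 1) - 1
    return max(-limit, min(limit, round(value / step)))


def channel_llrs(n, sigma, bits=6, step=0.5, seed=0):
    """Quantized LLRs for the all-zero codeword (BPSK symbol +1) over AWGN."""
    rng = random.Random(seed)
    llrs = []
    for _ in range(n):
        y = 1.0 + sigma * box_muller(rng)  # received sample
        llrs.append(quantize(2.0 * y / sigma ** 2, bits, step))
    return llrs


llrs = channel_llrs(n=10000, sigma=0.8)
# With the all-zero codeword, a negative LLR is a channel error before decoding.
errors_before_decoding = sum(1 for v in llrs if v < 0)
print(errors_before_decoding / len(llrs))
```

Because the all-zero codeword is assumed, the error counter only needs to count non-zero decisions, which is what makes this emulation setup simple to implement in hardware.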
Rate adaptive binary LDPC
As shown in Fig. 2(a), the binary LDPC decoder consists of two major memory blocks (one stores the channel LLRs and the other stores a posteriori probability (APP) messages), two processing blocks (a variable node unit (VNU) and a check node unit (CNU)), an early termination unit (ETU), and a number of mux blocks whose output selection is software-reconfigurable to adjust the shortening length [7]. The memory consumption is dominated by the LLR messages, with size $n \times W_L$, and the APP messages, with size $c \times n \times W_R$, where $n$ is the codeword length, $c$ is the column weight, and $W_L$ and $W_R$ denote the precisions used to represent LLR and APP messages. The logic consumption is dominated by the CNU, as shown in Fig. 2(b). The ABS block first takes the absolute values of the inputs, and the sign XOR array produces the output signs. In the two-minimum finder block, we find the first minimum value via a binary tree and trace back the survivors to find the second minimum value as well as the position of the first minimum value. This implies that we can write 3 values and r sign bits back to the APP memories instead of r values. However, we do not take advantage of such memory reduction techniques in the comparisons in the following sections.
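The CNU data path described above (absolute values, sign XOR, two-minimum search) can be sketched in Python as follows. This is a behavioral illustration only: it uses a linear scan rather than the binary tree with trace-back used in hardware, and the scaling factor of 0.75 and the function name are assumptions.

```python
def scaled_min_sum_cnu(inputs, scale=0.75):
    """Scaled min-sum check node update.

    Returns the check-to-variable messages and the compressed summary
    (first minimum, second minimum, position of first minimum).
    """
    magnitudes = [abs(x) for x in inputs]        # ABS block
    signs = [1 if x < 0 else 0 for x in inputs]  # 1 marks a negative input
    parity = 0
    for s in signs:                              # sign XOR array
        parity ^= s

    # Two-minimum finder (linear scan here; a binary tree in hardware).
    min1 = min2 = float("inf")
    pos = -1
    for i, m in enumerate(magnitudes):
        if m < min1:
            min2, min1, pos = min1, m, i
        elif m < min2:
            min2 = m

    outputs = []
    for i in range(len(inputs)):
        # Exclude the node's own input: use the second minimum at `pos`.
        mag = scale * (min2 if i == pos else min1)
        sign = parity ^ signs[i]
        outputs.append(-mag if sign else mag)
    return outputs, (min1, min2, pos)


outputs, summary = scaled_min_sum_cnu([4, -2, 6, -8])
print(outputs, summary)
```

The returned summary is what makes the memory reduction mentioned in the text possible: the two minima, the position, and the sign bits are enough to regenerate all r outputs.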
Rate adaptive non-binary LDPC decoder architecture
Similarly to the rate-adaptive binary LDPC decoder architecture discussed above, the architecture of the LMMA-based non-binary LDPC decoder is presented in Fig. 3(a). Four types of memories are used in the implementation: the memory for $R_{cv}$, with size $c \times n \times q \times W_R$, stores the information from check nodes to variable nodes; the memory for $L_v$, with size $n \times q \times W_L$, stores the initial log-likelihood ratios; the memory for $\hat{c}$, with size $n \times \log_2 q$, stores the decoded bits; and memories inside each CNU store the intermediate values. The notation follows the previous subsection ($n$ is the codeword length, $c$ the column weight, and $W_R$ and $W_L$ the message precisions), except that q denotes the size of the Galois field. As shown in Fig. 3(b), the CNU is the most complex part of the decoder; it consists of r inverse permutators, r BCJR-based min-max processors, r permutators, and two types of first-in-first-out (FIFO) registers. The inverse permutator block shifts the incoming message vector cyclically. The first FIFO performs the parallel-to-serial conversion. After the min-max processor, which is implemented by a low-latency bidirectional BCJR algorithm, the processed data is fed into another FIFO block performing serial-to-parallel conversion, followed by the permutator block. Because the CNU design is more complex and the memory requirements are higher for the non-binary decoder than for the binary decoder, reduced-complexity architectures and selective versions of the MMA can be further exploited [19] [20].
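The forward-backward (bidirectional) min-max recursion can be illustrated with a small Python sketch. It assumes a field GF(2^p) whose elements carry integer labels so that addition is bitwise XOR, that the permutators have already absorbed the non-zero parity-check coefficients (so the constraint is simply that the symbols XOR to zero), and that the input messages are normalized and non-negative. All names and the toy message values are hypothetical, and a brute-force reference is included only to cross-check the recursion.

```python
from itertools import product

INF = float("inf")


def combine(x, y, q):
    """Min-max combination over GF(2^p): out[s] = min over u of max(x[u], y[s ^ u])."""
    return [min(max(x[u], y[s ^ u]) for u in range(q)) for s in range(q)]


def min_max_check_node(messages, q):
    """Forward-backward min-max update for one check node.

    messages[i][a] is the (non-negative) variable-to-check message of node i
    for symbol a. Returns check-to-variable messages in the same layout.
    """
    d = len(messages)
    identity = [0.0] + [INF] * (q - 1)  # partial sum is zero before any symbol
    forward = [identity]                # forward[i] covers messages[0..i-1]
    for m in messages[:-1]:
        forward.append(combine(forward[-1], m, q))
    backward = [identity]               # after reversal, backward[i] covers messages[i+1..d-1]
    for m in reversed(messages[1:]):
        backward.append(combine(backward[-1], m, q))
    backward.reverse()
    return [combine(forward[i], backward[i], q) for i in range(d)]


def brute_force_check_node(messages, q):
    """Reference implementation: enumerate every symbol sequence that XORs to zero."""
    d = len(messages)
    out = [[INF] * q for _ in range(d)]
    for symbols in product(range(q), repeat=d):
        total = 0
        for s in symbols:
            total ^= s
        if total != 0:
            continue
        for i in range(d):
            cost = max(messages[j][symbols[j]] for j in range(d) if j != i)
            out[i][symbols[i]] = min(out[i][symbols[i]], cost)
    return out


# Toy GF(4) example with three variable nodes (message values are made up).
msgs = [[0, 3, 5, 2], [1, 0, 4, 6], [2, 7, 0, 1]]
fast = min_max_check_node(msgs, 4)
reference = brute_force_check_node(msgs, 4)
print(fast == reference)
```

The recursion needs on the order of d·q² comparisons per check node, whereas the brute-force enumeration grows as q^d, which is why a trellis-based approach is preferred for row weights such as r = 15.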
[Fig. 3: (a) LMMA-based non-binary LDPC decoder architecture; (b) CNU architecture. Block labels: Mem, VNU, Norm, CNU, ETU, inv P, FIFO, Min-Max Proc.]
EMULATION RESULTS AND ANALYSIS
Comparison of three rate adaptive schemes in terms of resources utilization
We compare three rate-adaptive schemes based on binary LDPC codes, non-binary LDPC codes, and binary generalized LDPC (GLDPC) codes [21]. All three architectures can be software-defined by initializing configurable registers in the FPGA. To make a fair comparison, we first design (3, 15)-regular quasi-cyclic binary and non-binary LDPC codes of length 34,635 and select the (15, 11) Hamming code as the local code in the GLDPC codes. The required precision is 6 bits for the binary LDPC and binary GLDPC decoders, while 8 bits are required for the non-binary LDPC decoder. We implement 15 VNUs and 1 CNU in each decoder to keep the same throughput in the three cases. The resource utilization is summarized in Table 1. The binary GLDPC decoder has a slightly increased number of occupied slices and memories, since one more APP-based CNU is required to process the simple linear block code. On the other hand, the LMMA-based non-binary LDPC decoder consumes a memory size 3.6 times larger than the binary one because of the large field size and high quantization precision, while its number of occupied slices is five times larger than in the binary case because of the higher complexity of the CNU.
BER performance analysis
The BER vs. Q-factor performance of the rate-adaptive binary and non-binary LDPC codes is presented in Fig. 4 and Fig. 5. The FPGA-based emulation was conducted over the binary-input (BI) AWGN channel, with 6-bit and 8-bit precision used in the binary and non-binary LDPC decoders, respectively. A set of column weight and row weight configurations of {(3, 15), (3, 14), (3, 13), (3, 12), (3, 11), (3, 10)}, corresponding to code rates of {0.8, 0.786, 0.77, 0.75, 0.727, 0.7}, can be achieved by software-based reconfiguration of specific registers in the FPGA. The girth-10 regular (34635, 27710, 0.8) binary and non-binary mother codes achieve Q-limits of 5.2 dB and 5.14 dB at a BER of $10^{-15}$, corresponding to NCGs of 11.83 dB and 11.89 dB, respectively. The rate-adaptive non-binary LDPC codes outperform the binary LDPC codes by approximately 0.06 dB over the entire rate range from 0.7 to 0.8. In addition, we believe this gap will be larger when combined with higher-order modulation schemes enabling 100 Gb/s (with QPSK) and 400 Gb/s (with 16QAM) optical communication systems.
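The relation between the quoted Q-limits and NCGs can be checked with a short Python sketch. It assumes the commonly used definition NCG = 20 log10 Q(BER_ref) - Q_limit + 10 log10 R, where Q(BER) solves BER = 0.5 erfc(Q/√2); this definition is not stated explicitly in the text, and the function names are hypothetical.

```python
import math


def q_factor_db(ber):
    """Q-factor in dB such that ber = 0.5 * erfc(Q / sqrt(2)), found by bisection."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > ber:
            lo = mid  # BER still too high: a larger Q is needed
        else:
            hi = mid
    return 20.0 * math.log10(0.5 * (lo + hi))


def net_coding_gain_db(q_limit_db, rate, ber_ref=1e-15):
    """NCG in dB for a code of the given rate that reaches ber_ref at q_limit_db."""
    return q_factor_db(ber_ref) - q_limit_db + 10.0 * math.log10(rate)


ncg_binary = net_coding_gain_db(5.2, 0.8)
ncg_nonbinary = net_coding_gain_db(5.14, 0.8)
print(round(ncg_binary, 2), round(ncg_nonbinary, 2))
```

Under this assumed definition, the 0.06 dB difference in Q-limit carries over directly to the NCG, since both mother codes have the same rate.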
CONCLUSION
In this invited paper, we have proposed a rate-adaptive non-binary LDPC scheme and demonstrated, with a unified FPGA-based decoder architecture, that it provides excellent rate flexibility. Compared to binary LDPC codes, the non-binary LDPC codes provide an additional gain of ~0.06 dB, and we believe a larger gain can be achieved when combined with higher-order modulation. Thus, the proposed rate-adaptive non-binary LDPC codes are suitable for intelligent aggregation networks and beyond.
