Abstract-In this paper, we report an encoding and decoding method for multi-rate irregular quasi-cyclic low-density parity-check (IR-QC-LDPC) codes. The method applies to parity-check matrices with a dual-diagonal parity structure. Decoding uses the normalized min-sum algorithm (NMSA). The encoding and decoding algorithms are first verified by simulation in MATLAB: a code rate of 2/3 is selected when the initial bit error ratio is 6%, and a code rate of 5/6 is selected when the initial bit error ratio is 1.04%. We then migrate the algorithms from MATLAB to a field-programmable gate array (FPGA). On the FPGA, the encoding throughput is 183.36 Mbps and the average decoding throughput is 27.85 Mbps at an initial bit error ratio of 6%.
I. INTRODUCTION

INFORMATION reconciliation is necessary when two parties hold correlated data sets with a few discrepancies between them. The situation is equivalent to transmitting the data from one party to the other through a noisy channel, as in a quantum key distribution (QKD) system.
In a QKD system, errors between Alice and Bob are caused by imperfections of the QKD devices or by an eavesdropper. Alice and Bob communicate over an authenticated classical channel to correct these errors. Since the classical channel can be overheard by Eve, it is important to minimize the total information that has to be transmitted during the reconciliation process. Any extra disclosed information limits the performance of the QKD implementation.
In this work, we implement LDPC codes optimized for the binary symmetric channel in a QKD system. This solution addresses the rate-compatible information reconciliation protocol.
LDPC codes were first discovered by Gallager [1] and rediscovered by MacKay et al. [2]. Their performance is very close to the Shannon capacity limit, provided the codeword length is long.
An LDPC code is a linear block code that requires only a single exchange of information between Alice and Bob. Based on this idea, we designed an LDPC encoding algorithm for Alice and an LDPC decoding algorithm for Bob. Both the encoding and decoding algorithms are implemented on an FPGA.
Phromsa proposed a bit-flipping decoding algorithm based on LDPC codes to correct the errors in a QKD system; however, the BER is no more than $10^{-3}$ [3]. Cui introduced a more efficient LDPC scheme, but its decoding throughput is only several Mbps [4]. In this paper, we achieve a higher decoding throughput than Cui.
II. LDPC CODES
Yoon, Choi and Cheong introduced a novel irregular quasi-cyclic LDPC (IR-QC-LDPC) matrix $H$ [5] and a corresponding recursive accumulation algorithm for encoding. In IR-QC-LDPC codes, the parity-check matrix $H$ can be partitioned into square sub-matrices of size $z \times z$. Let $H_{i,j}$ denote the sub-matrix located at the $i$-th block row and $j$-th block column, with shift value $k_{i,j}$: $H_{i,j}$ is the $z \times z$ zero sub-matrix ($k_{i,j} = -1$), the identity matrix $I$ ($k_{i,j} = 0$), or the identity matrix cyclically shifted $k_{i,j}$ times to the right ($k_{i,j} > 0$).
In the codeword $c$, $p$ is the parity vector; the sub-vectors $s_i$, $i \in \{1, 2, \cdots, n - m - 1\}$, and $p_i$, $i \in \{1, 2, \cdots, m - 1\}$, all have length $z$, and $c$ must satisfy Eq. 1.
$H$ can be divided into an information block $H_S$ and a parity block $H_P$, so $H = [H_S | H_P]$. As for $H_S$, we can find two different prime numbers $a$ and $b$ with $a < z$,
As for $H_P$, an irregular dual-diagonal structure is created as in Eq. 3,
where $0$ denotes the identity matrix $I_{z \times z}$ with zero cyclic shift, $d$ ($d > 0$) denotes the identity matrix $I_{z \times z}$ cyclically shifted $d$ times to the right, and $-1$ denotes an all-zero matrix of size $z \times z$.
This gives the full IR-QC-LDPC parity-check mother matrix $H$ as in Eq. 4.
In summary, an IR-QC-LDPC code is constructed by cyclically coupling QC-LDPC subcodes. Moreover, the parity-check matrix corresponding to the original QC-LDPC subcode consists of $m$ block rows and $n$ block columns, and each block is a circulant matrix of size $z \times z$.
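The block structure summarized above can be illustrated with a minimal Python sketch that expands a base matrix of shift values into the full binary parity-check matrix. The function names `circulant` and `expand` and the toy base matrix are illustrative assumptions; the values are not those of Eq. 4.

```python
from typing import List


def circulant(shift: int, z: int) -> List[List[int]]:
    """Return the z x z block for one base-matrix entry.

    shift == -1 gives the all-zero block; shift >= 0 gives the identity
    matrix with every row cyclically shifted `shift` positions to the right.
    """
    if shift < 0:
        return [[0] * z for _ in range(z)]
    return [[1 if c == (r + shift) % z else 0 for c in range(z)]
            for r in range(z)]


def expand(base: List[List[int]], z: int) -> List[List[int]]:
    """Expand an m x n base matrix into an (m*z) x (n*z) binary matrix."""
    full = []
    for base_row in base:
        blocks = [circulant(s, z) for s in base_row]
        for r in range(z):
            # Concatenate row r of every block in this block row.
            full.append([bit for block in blocks for bit in block[r]])
    return full


# Toy base matrix with m = 2, n = 3 (hypothetical values, not Eq. 4).
BASE = [[1, -1, 0],
        [0, 2, -1]]
H = expand(BASE, z=3)
```

Each non-negative entry contributes exactly one 1 per row and per column of its block, so in this toy case every row of `H` has weight 2.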
III. LDPC ENCODING
As described in Sec. II, the matrix in Eq. 4 has a dual-diagonal structure, for which [5], [6] introduce a novel recursive accumulation algorithm.
Here we define a series of $\lambda_i$ as in Eq. 5.
As in Eq. 4, every element of $H$ is a $z \times z$ matrix. If a vector of length $z$ bits is multiplied by a block with $H_{i,j} \ge 0$, the result is the original vector cyclically shifted to the left by $H_{i,j}$ positions.
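The equivalence between the block multiplication and a cyclic shift can be checked numerically. The following is a minimal sketch, assuming the block is the identity matrix with rows shifted to the right; the helper names and the test vector are hypothetical.

```python
from typing import List


def circulant(shift: int, z: int) -> List[List[int]]:
    """Identity matrix with every row cyclically shifted right by `shift`."""
    return [[1 if c == (r + shift) % z else 0 for c in range(z)]
            for r in range(z)]


def matvec_gf2(mat: List[List[int]], vec: List[int]) -> List[int]:
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, vec)) % 2 for row in mat]


def rotate_left(vec: List[int], shift: int) -> List[int]:
    """Cyclic left shift of a list by `shift` positions."""
    shift %= len(vec)
    return vec[shift:] + vec[:shift]


v = [1, 0, 0, 1, 1, 0, 1]
z = len(v)
# For every shift value, the full product equals a simple left rotation.
results = [matvec_gf2(circulant(s, z), v) == rotate_left(v, s)
           for s in range(z)]
```

Every entry of `results` is `True`, which is why a hardware implementation can replace the matrix product by a barrel shifter.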
As in Fig. 1 , first we shift every
Here we can calculate every $\lambda_i$, $i \in \{1, 2, \cdots, m - 1\}$. According to Eq. 1,
where the shifted $p_0$ term in Eq. 6 denotes $p_0$ cyclically shifted $d$ times to the right. Adding Eq. 6 to Eq. 9, which represent the $m$ parity-check equations, modulo 2 yields $p_0$ as in Eq. 10.
Then $p_1$ is obtained from Eq. 6. After that, the remaining parity vectors of $p$ are obtained one by one through forward recursion. Each parity vector can be calculated in a single clock cycle on the FPGA.
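The recursion above can be sketched in Python. Because Eq. 3 is not reproduced here, the parity block layout is an assumption: the first block column holds the shift values d, 0, d in block rows 0, `mid`, and m - 1, and the remaining columns are dual-diagonal with zero shift. The names `rot`, `parity_base`, `encode`, `syndrome` and all toy values are hypothetical.

```python
from typing import List

Vec = List[int]


def rot(vec: Vec, shift: int) -> Vec:
    """Multiply by one circulant block: left cyclic shift, -1 gives zeros."""
    if shift < 0:
        return [0] * len(vec)
    shift %= len(vec)
    return vec[shift:] + vec[:shift]


def xor(a: Vec, b: Vec) -> Vec:
    return [x ^ y for x, y in zip(a, b)]


def parity_base(m: int, d: int, mid: int) -> List[List[int]]:
    """Assumed H_P base matrix: first column d, 0, d; rest dual-diagonal."""
    hp = [[-1] * m for _ in range(m)]
    hp[0][0], hp[mid][0], hp[m - 1][0] = d, 0, d
    for j in range(1, m):
        hp[j - 1][j] = 0
        hp[j][j] = 0
    return hp


def row_sums(base: List[List[int]], blocks: List[Vec]) -> List[Vec]:
    """XOR of the shifted blocks for every block row of a base matrix."""
    out = []
    for row in base:
        acc = [0] * len(blocks[0])
        for shift, block in zip(row, blocks):
            acc = xor(acc, rot(block, shift))
        out.append(acc)
    return out


def encode(hs: List[List[int]], s: List[Vec], d: int, mid: int) -> List[Vec]:
    """Return the parity blocks p_0 .. p_{m-1} for information blocks s."""
    m, z = len(hs), len(s[0])
    lam = row_sums(hs, s)              # lambda_i for every block row
    p0 = [0] * z
    for term in lam:                   # summing all rows leaves only p_0
        p0 = xor(p0, term)
    p = [p0, xor(lam[0], rot(p0, d))]  # block row 0 gives p_1
    for i in range(1, m - 1):          # forward recursion for p_2 ..
        nxt = xor(lam[i], p[i])
        if i == mid:                   # this row also contains p_0
            nxt = xor(nxt, p0)
        p.append(nxt)
    return p


def syndrome(base: List[List[int]], blocks: List[Vec]) -> List[Vec]:
    return row_sums(base, blocks)


# Toy example: m = 4 block rows, 2 information blocks, z = 5.
HS = [[0, 1], [2, -1], [-1, 0], [1, 2]]
S = [[1, 0, 1, 1, 0], [0, 1, 1, 0, 0]]
D, MID = 1, 2
P = encode(HS, S, D, MID)
```

Concatenating `HS` with `parity_base(4, D, MID)` row by row and calling `syndrome` on `S + P` gives all-zero blocks, which confirms that the recursion satisfies every parity-check equation of the assumed layout.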
IV. LDPC DECODING
A variant of belief propagation (BP), the normalized min-sum algorithm (NMSA) carries out iterative message passing to achieve convergence for all constrained variables. The NMSA iterative procedure is given in Algorithm 1.
Considering the check matrix $H$ in Eq. 4, the expansion factor is $z$. The codeword can be divided into $n$ groups of $z$ bits each, and each bit position within a group can be considered as one layer. We denote the $k$-th layer check-to-variable (C2V) message by $\delta_{i,j,k}$ and the variable-to-check (V2C) message by $\rho_{i,j,k}$.
Finally, we denote the $k$-th layer posterior probability by $Q_{i,j,k}$, the output codeword by $\hat{x}$, and the decoding status by $\hat{x}_{status}$. $L_{i,j,k}$ is a temporary parameter in Algorithm 1.
In our design, a binary symmetric channel (BSC) is assumed and the probability information is quantized to 8-bit data. We denote by $P_0$ the probability of receiving 0 when the raw message is 0 and by $P_1$ the probability of receiving 1 when the raw message is 1. The current iteration is $t$ and its maximum is $Iter_{max}$. The input $x$ is the original codeword.
A. Overall Architecture
The overall architecture of our proposed FPGA-based IR-QC-LDPC decoder is illustrated in Fig. 2.
This architecture contains a global controller, an input message unit, an iteration decision unit, an address convert unit, and the sub-modules described below. The global controller manages the whole encoding and decoding procedure. The input message unit receives the message information and check information. The decode decision unit judges whether the result during iteration is correct and outputs the decoding status. The remaining sub-architecture can be classified into the following main categories.
1) Computational logic (with pipeline registers) consisting of the check-node processor (CNP) and variable-node processor (VNP), and format converters (2's complement to sign-magnitude (C2S) and its inverse (S2C)). The function of this logic is to update the edge messages between the variable nodes and check nodes.
2) A random access memory (RAM) controller, which handles the updated messages and channel messages. Each RAM contains memory locations that can be accessed independently by address.
B. Memory Arrangement
As illustrated in Fig. 2, there are three RAM arrays for message storage:
1) Variable-node initialization RAM array (Init-Array) for the initialization messages of the original codeword;
2) C2V message RAM array (C2V-Array) for updating variable nodes with check-node messages;
3) V2C message RAM array (V2C-Array) for updating check nodes with variable-node messages.
In this paper, we realize the quantized NMSA with 8-bit binary data, so the data width of each RAM is 8. As the expansion factor is $z$, the depth of each RAM is $z$, following the structure of the check matrix $H$.
The Init-Array is shown in Fig. 3 and its size is $1 \times n$. Here we initialize $x_{j,k}$ into $L0_{j,k}$ as in Eq. 11 with C2S and store $L0_{j,k}$ at the $k$-th address of RAM $C_j$. Once the Init-Array is initialized, it does not change during the decoding of one codeword.
Considering the structure of the check matrix $H$, the sizes of the C2V-Array and V2C-Array are both $m \times n$. Because $H_{i,j}$ represents a $z \times z$ cyclic shift of the identity matrix, there is exactly one valid message in each row and each column of the cyclic matrix if $H_{i,j} \ge 0$, so there are $z$ valid messages for each $H_{i,j}$. During one iteration, $\delta_{i,j,k}$ represents the $k$-th column message and is stored at the $k$-th address of $RAM_{i,j}$ in the C2V-Array; $\rho_{i,j,k}$ represents the $k$-th row message and is stored at the $k$-th address of $RAM_{i,j}$ in the V2C-Array.
These two RAM arrays are illustrated in Fig. 4.
$RAM_{i,j}$ is located in the $i$-th row and $j$-th column of the RAM array, and its index is $RAM_{num} = i \times n + j$.
C. Address Convert Unit
There are three kinds of values for $H_{i,j}$: a positive number, zero, and $-1$. Fig. 5 displays the three corresponding types of cyclic matrices.
1) Check-node RAM address conversion:
When updating V2C messages, we need to read the $k$-th message from the C2V-Array in the same row, because the $k$-th C2V message is stored along the Y-axis and the $k$-th V2C message will be stored along the X-axis. The actual address $k'$ of the C2V message differs from RAM to RAM. The relationship between $k'$ and $k$ is given in Eq. 18.
2) Variable-node RAM address conversion:
When updating C2V messages, we need to read the $k$-th message from the V2C-Array in the same row, because the $k$-th V2C message is stored along the X-axis and the $k$-th C2V message will be stored along the Y-axis. The actual address $k'$ of the V2C message differs from RAM to RAM. The relationship between $k'$ and $k$ is given in Eq. 19.
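Eq. 18 and Eq. 19 are not reproduced here, so the following Python sketch only illustrates the index relation that a right-shifted identity block implies between its rows and columns. Which direction corresponds to which equation is not asserted; the function names and the example shift value are hypothetical.

```python
def column_of_row(k: int, shift: int, z: int) -> int:
    """Column holding the single 1 in row k of a block shifted right by `shift`."""
    return (k + shift) % z


def row_of_column(k: int, shift: int, z: int) -> int:
    """Row holding the single 1 in column k of the same block."""
    return (k - shift) % z


z, shift = 81, 5  # z as in Sec. V; the shift value is an arbitrary example
# Converting a row index to its column and back returns the original index.
round_trip = all(
    row_of_column(column_of_row(k, shift, z), shift, z) == k
    for k in range(z)
)
```

The two conversions are inverses of each other modulo z, so a message written under a row address can always be located again from the column side, and vice versa.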
D. Check-Node Processor
A CNU based on the 8-bit quantized min-sum algorithm is applied in our architecture to update check-node messages. The CNU consists of functional units performing comparison, absolute value, and XOR. Fig. 6 illustrates the relationship between $C2V_{i,j,k}$ ($\delta_{i,j,k}$) and the V2C-Array.
$C2V_{i,j,k}$ connects to the corresponding V2C messages, and the relationship between $k'$ and $k$ is given in Eq. 19.
The procedure of the CNP follows Eq. 13. Each $\delta_{i,j,k}$ is processed as follows:
1) Read the $k'$-th message from the V2C-Array;
2) Compare the absolute values of the messages, excluding the $j$-th block column, and select the minimum; at the same time, XOR the signs of the same messages;
3) Multiply the result of the previous step by a normalization factor $\alpha$ and set $\delta_{i,j,k}$ to the result, as in Eq. 13.
Exploiting the parallelism of the FPGA, we can process $m \times n$ values of $\delta_{i,j,k}$ simultaneously (in one clock) and store them into the C2V-Array in the following clock. After $z$ clocks, all check-node messages have been updated.
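The three steps of the check-node update can be sketched for a single check node as follows. This is a software illustration under assumptions, not the hardware design: messages are plain signed integers, the scaled magnitude is truncated toward zero, and the function name `check_node_update` is hypothetical.

```python
from typing import List


def check_node_update(v2c: List[int], alpha: float) -> List[int]:
    """Normalized min-sum update for one check node.

    v2c holds the signed V2C messages on all edges of the check node
    (at least two edges). Output entry j uses every input except entry j.
    """
    out = []
    for j in range(len(v2c)):
        others = v2c[:j] + v2c[j + 1:]
        negative = 0
        for msg in others:
            negative ^= 1 if msg < 0 else 0      # XOR of the sign bits
        # Minimum magnitude, scaled by alpha and truncated (assumed rounding).
        magnitude = int(alpha * min(abs(msg) for msg in others))
        out.append(-magnitude if negative else magnitude)
    return out


# Example with the normalization factor reported in Sec. V.
c2v = check_node_update([10, -4, 7, -20], alpha=0.4)
```

For this input the result is `[1, -2, 1, -1]`: the edge that carries the smallest magnitude receives the second-smallest one, and each sign is the product of the other edges' signs.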
E. Variable-Node Processor
The VNP contains two parts: variable-node initialization processing and variable-node message update processing.
1) Variable-Node Initialization Processing:
After the VNP receives the initial codeword $x$ at Bob, it initializes each bit of $x$. $x$ is divided into $n$ groups of $z$ bits each according to the matrix structure of $H$; in other words, $x$ can be written as in Eq. 20.
In Eq. 20, every $x_i$ has $z$ bits. We use $n$ RAMs to store the initial probability messages; the depth of each RAM is $z$ and the width is 8 bits. We then initialize every bit of $x_i$ and store the result into the $i$-th RAM sequentially. We can store $n$ bits in parallel, so $z$ clocks are needed to initialize $x$ and store it in the $n$ RAMs, as in Fig. 3.
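The initialization can be sketched as follows. Eq. 11 is not reproduced here, so the mapping is an assumption: the standard BSC log-likelihood ratio with equal priors, multiplied by a hypothetical scale factor and clipped to the 8-bit sign-magnitude range. The names `init_llr` and `init_array` are illustrative.

```python
import math
from typing import List


def init_llr(bit: int, p0: float, p1: float, scale: float = 16.0) -> int:
    """Quantized log-likelihood ratio of one received bit on a BSC.

    p0 = P(receive 0 | sent 0), p1 = P(receive 1 | sent 1).
    The scale factor and the clipping range are assumptions.
    """
    if bit == 0:
        llr = math.log(p0 / (1.0 - p1))
    else:
        llr = math.log((1.0 - p0) / p1)
    return max(-127, min(127, int(round(llr * scale))))


def init_array(x: List[int], n: int, z: int,
               p0: float, p1: float) -> List[List[int]]:
    """Return n RAMs of depth z; RAM j, address k holds bit x[j*z + k]."""
    assert len(x) == n * z
    return [[init_llr(x[j * z + k], p0, p1) for k in range(z)]
            for j in range(n)]


# Toy example: n = 2 groups of z = 3 bits, 6% crossover probability.
rams = init_array([0, 1, 1, 0, 0, 1], n=2, z=3, p0=0.94, p1=0.94)
```

In hardware the n RAMs are written in parallel, one address per clock, whereas this sketch simply loops over both indices.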
2) Variable-Node Processor:
In this module, the VNP reads messages from the C2V-Array and Init-Array, then saves the updated messages to the V2C-Array. There are $m \times n$ RAMs in the C2V-Array, and we can read messages from all of them at the same time.
Taking advantage of the parallel processing of the FPGA, we can process the $k$-th variable node in each $c_i$, $i = 0, 1, 2, \cdots, n - 1$, at the same time. As the initialized messages and the C2V messages are stored separately in RAMs of the same size, we can read them simultaneously, as in Fig. 7.
$V2C_{i,j,k}$ connects to the corresponding $C_{j,k}$ and C2V messages. The procedure of the VNP is as follows:
1) Read the $k'$-th message from the C2V-Array and the $k$-th initial message from the Init-Array;
2) Sum the C2V values from the same block column, excluding the $i$-th block row, together with the $k$-th initial value from the Init-Array to get the V2C message, as in Eq. 15;
3) Sum all C2V values from the same block column together with the $k$-th initial value from the Init-Array to get the posterior probability, as in Eq. 16.
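The variable-node update can be sketched for a single variable node as follows. Saturating the results to the 8-bit sign-magnitude range is an assumption, as are the names `saturate` and `variable_node_update`.

```python
from typing import List, Tuple


def saturate(value: int, limit: int = 127) -> int:
    """Clip a message to the assumed 8-bit sign-magnitude range."""
    return max(-limit, min(limit, value))


def variable_node_update(init: int, c2v: List[int]) -> Tuple[List[int], int]:
    """Return the outgoing V2C messages and the posterior of one variable node.

    init is the channel message from the Init-Array and c2v holds the
    incoming C2V messages on all edges of the variable node.
    """
    total = init + sum(c2v)
    # Each V2C message leaves out the C2V message from its own edge.
    v2c = [saturate(total - msg) for msg in c2v]
    return v2c, saturate(total)


v2c, posterior = variable_node_update(44, [1, -2, 5])
```

Computing the full sum once and subtracting each edge's own message gives the same result as summing the other edges separately, and it needs only one adder tree per variable node.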
F. Decoding Decision
This module processes the posterior probability and decides the decoding result. The decoding decision follows Eq. 17.
The decoding result is then checked against Eq. 1.
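The decision and the parity check can be sketched as follows. Eq. 17 is not reproduced here, so the sign convention (a negative posterior maps to bit 1) is an assumption; the function names and the toy base matrix are hypothetical.

```python
from typing import List


def hard_decision(posterior: List[int]) -> List[int]:
    """Assumed convention: a negative posterior means bit 1, otherwise bit 0."""
    return [1 if q < 0 else 0 for q in posterior]


def parity_ok(base: List[List[int]], bits: List[int], z: int) -> bool:
    """Check every block row of the base matrix using cyclic shifts only."""
    blocks = [bits[j * z:(j + 1) * z] for j in range(len(base[0]))]
    for row in base:
        acc = [0] * z
        for shift, block in zip(row, blocks):
            if shift >= 0:                       # -1 marks an all-zero block
                s = shift % z
                shifted = block[s:] + block[:s]  # left cyclic shift
                acc = [a ^ b for a, b in zip(acc, shifted)]
        if any(acc):
            return False
    return True


# Toy base matrix with one block row, two block columns, z = 3.
BASE = [[0, 1]]
good = hard_decision([5, 9, -3, -7, 2, 4])
bad = hard_decision([-5, 9, -3, -7, 2, 4])
status_good = parity_ok(BASE, good, z=3)
status_bad = parity_ok(BASE, bad, z=3)
```

Here `status_good` is `True` and `status_bad` is `False`; a decoder would stop iterating on the first and continue, or give up after the maximum number of iterations, on the second.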
V. SIMULATION WITH MATLAB
Note that we select the parameters $l_{max} = 1944$, $z_{max} = 81$, $n_{max} = 24$, and $m = 12, 8, 6, 4$ such that our IR-QC-LDPC code achieves code rates of 1/2, 2/3, 3/4, and 5/6, respectively. We obtain a proper normalization factor by simulation in MATLAB, testing 1000 random frames each time. If the base matrix $H$ has the form of Eq. 4, the decoding results for different normalization factors are illustrated in Fig. 8. The frame error rate (FER) is the frame error rate after error correction, and the quantum bit error rate (QBER) is the original bit error rate. $\alpha = 0.4$ is the best value for this IR-QC-LDPC code.
Then we set $Iter_{max} = 10$; the performance of the LDPC code is shown in Fig. 9. The BER in Fig. 9 is the maximum BER under the condition of FER < 0.01 after error correction. We denote the code rate by $R$, the BER by $e$, and the error correction efficiency by $f$. The relationship between $f$, $R$ and $e$ is given in Eq. 21 [10].
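The relation between these parameters can be verified with a few lines of Python, using the rate of a full-rank code with n block columns and m block rows, R = (n - m) / n. The variable names are illustrative.

```python
from fractions import Fraction

Z, N = 81, 24                      # expansion factor and block columns
codeword_bits = Z * N              # codeword length l

# Code rate and number of information bits for each number of block rows m.
rates = {m: Fraction(N - m, N) for m in (12, 8, 6, 4)}
info_bits = {m: Z * (N - m) for m in rates}
```

This gives a codeword length of 1944 bits and rates 1/2, 2/3, 3/4, and 5/6; for m = 8 the information part has 1296 bits.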
VI. IMPLEMENTATION AND PERFORMANCE BASED ON FPGA
In this paper, we use a fixed base matrix $H$ as in Eq. 22; this matrix is introduced in IEEE 802.16e [11].
The proposed encoder and decoder have been implemented on a Xilinx Kintex UltraScale XCKU040FFVA1156-2-E FPGA, which provides 242400 look-up tables (LUTs), 484800 flip-flops (FFs), and 600 block RAMs (BRAMs).
The key source generates Alice's sifted key and sends it to the LDPC encoder, and generates Bob's sifted key and sends it to the LDPC decoder. The encoder then sends the check message to the decoder after encoding.
The system clock is 25 MHz, and each message frame has a fixed length of $z \times (n - m) = 1296$ bits. The frame error rate after decoding is less than 1% if the initial BER is no more than 6%; under this condition, the error correction efficiency is 1.49 according to Eq. 21. The encoding throughput is 183.36 Mbps, while the average decoding throughput is 27.85 Mbps.
The resource utilization of the XCKU040FFVA1156-2-E with $l = 1944$, $z = 81$, $n = 24$, $m = 8$ for each module is shown in Tab. I. The module H holds the matrix of Eq. 22, and the module key source generates messages with discrepancies for encoding and decoding. The module ldpc encode encodes the message and generates the check message for the decoder.
The resource utilization of the whole architecture described in this paper is given in Tab. II. We have implemented two code rates of IR-QC-LDPC codes for a QKD system; the encoding and decoding algorithms are designed, implemented, and verified on the same FPGA, which is mounted on the KCU105 development kit. Test results indicate that the decoding throughput reaches 183.36 Mbps when the code rate is 5/6 and 27.85 Mbps when the code rate is 2/3.
