ABSTRACT Privacy amplification (PA) is a vital procedure in quantum key distribution (QKD) that shrinks the eavesdropper's information about the final key to almost zero. As the repetition frequency of discrete variable QKD (DV-QKD) systems increases, PA processing speed has become the bottleneck in many high-speed DV-QKD systems. In this paper, a high-speed adaptive field-programmable gate array (FPGA)-based PA scheme using a fast Fourier transform (FFT) is presented. To decrease the computation complexity, a modified 2-D FFT-based Toeplitz PA scheme is designed. To increase the processing speed of the scheme under the constraint of limited resources, a real-value oriented FFT acceleration method and a fast read/write balanced matrix transposition method are designed and implemented. The experimental results on a Xilinx Virtex-6 FPGA demonstrate that the throughput is nearly double that of the latest FPGA-based Toeplitz PA scheme in the literature. In addition, the scheme adapts well to the compression ratio, and its resource consumption is independent of the compression ratio. Therefore, this scheme can fit many high-speed QKD applications.
I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Yinghui Zhang.
Quantum key distribution (QKD) exploits the principles of quantum mechanics to distribute secure keys between two remote parties, called Alice and Bob. Since Bennett and Brassard proposed the first practical protocol in 1984 [1], many protocols have been proposed. These protocols can be divided into discrete variable (DV) protocols and continuous variable (CV) protocols [2]-[10]. Mainly because of its thorough security analysis, DV-QKD has drawn more attention, and many DV-QKD systems have been developed [7]-[10]. A DV-QKD system includes two parts: a quantum subsystem and a post-processing subsystem. The quantum subsystem performs quantum state preparation, transmission and measurement. The post-processing subsystem mainly consists of error reconciliation and privacy amplification (PA) [11]. The task of error reconciliation is to correct the error bits between the two parties and obtain an identical corrected bit string by exchanging information over a public classical channel [12], [13]. Since the attacker, called Eve, may not only eavesdrop on the quantum channel but also have full access to the classical channel, she may obtain some information about the corrected bit string. Therefore, PA is needed to shrink Eve's information about the final key to almost zero. Furthermore, PA is also an open issue in some techniques associated with QKD, e.g., quantum private query [14]-[16]. PA eliminates the leaked information by mapping a long corrected bit string to a much shorter final key via universal$_2$ hash function families [17]-[21]. To reduce the finite-size effect in a practical QKD system, the length of an input block for PA should be at least $10^6$, which leads to high computation complexity and a large storage requirement [22]. Therefore, PA has become the bottleneck in many QKD systems. To solve this problem, researchers have studied different kinds of hash functions, implementation algorithms and platforms. Regarding hash function selection, C. M. Zhang et al. chose a multiplication-based universal$_2$ class of hash functions to speed up the PA process, and they constructed an optimal multiplication algorithm with four basic multiplications on the central processing unit (CPU) [23]. However, the multiplication of large numbers is a complex calculation that is difficult to port and further optimize. Nowadays, Toeplitz hashing is the most widely used hash function in PA because of its simple structure and parallel nature [24]. To speed up the implementation of Toeplitz hashing based PA, several implementation algorithms have been proposed.
Hayashi et al. proposed a modified Toeplitz matrix to further decrease the computation [25]. Zhang et al. [26] proposed a block parallel algorithm. Yuan et al. [27] applied the number theoretic transform (NTT) algorithm to Toeplitz matrix multiplication. Liu et al. [28] were the first to use the fast Fourier transform (FFT) to accelerate Toeplitz hashing, which improved the processing speed significantly. Among these algorithms, the FFT-based algorithm has the lowest computation complexity, i.e., O(n log n). As for platform selection, the CPU is the conventional option for Toeplitz based PA [27], [28]. However, the performance improvement of Toeplitz based PA on a CPU is limited by the CPU's weak support for parallel computation. The graphics processing unit (GPU) has drawn much attention due to its great advantage in parallel computing. Wang et al. [29] proposed an FFT-based PA algorithm for CV-QKD on a GPU and improved the processing speed of PA to over 1Gbps. However, the volume and power consumption of a GPU are high, which makes it unsuitable for practical DV-QKD applications. The field-programmable gate array (FPGA) is a suitable platform for DV-QKD systems because of its high parallelism, compact volume and low power consumption. Zhang et al. [26] first proposed a block parallel algorithm to speed up Toeplitz hashing on FPGA. Constantin et al. [30] and Yang et al. [31] each proposed an improved block parallel algorithm for Toeplitz hashing on FPGA. The scheme of S. S. Yang et al. achieves a processing speed of 64Mbps on FPGA and reduces memory resources significantly [31].
As far as we know, all existing PA schemes on FPGA use block parallel Toeplitz hashing with a computation complexity of $O(n^2)$. It is natural to consider the FFT-based algorithm when a Toeplitz PA is designed on FPGA. However, implementing FFT-based Toeplitz PA on FPGA is a big challenge because the input block length must be at least $10^6$ and the resources are limited. In this paper, a high-speed FFT-based Toeplitz PA hardware scheme is proposed for the first time. The scheme is implemented on a Virtex-6 FPGA. The throughput of our scheme reaches 116Mbps with the input block length n = 1M. Compared with the latest FPGA-based PA scheme, our scheme achieves nearly twice the throughput on a lower-level hardware platform. Besides the high throughput, our scheme adapts well to the compression ratio, and its resource consumption is independent of the compression ratio. These advantages help it fit more QKD applications.
The rest of this paper is organized as follows. Related work is described in Section II as the basis. In Section III, the proposed high-speed FFT-based Toeplitz PA hardware scheme is introduced in detail. In Section IV, the experimental results and analysis are given. In Section V, conclusions are drawn.
II. RELATED WORK
A. PRIVACY AMPLIFICATION
Privacy amplification is a process that allows two parties, Alice and Bob, to distill a secure final key from a partially secure bit string [17]. The definition of privacy amplification is given below from the standpoint of information theory. Before the PA procedure in QKD, Alice and Bob share a random n-bit binary string X, called the corrected key. Eve learns a correlated random string W that provides t (t < n) bits of information about X, i.e., H(X|W) ≥ n − t. Alice and Bob wish to publicly choose a compression function $g: \{0,1\}^n \to \{0,1\}^r$ such that Eve's partial information about X and her complete information about g give her only negligible information about Y = g(X). This procedure is illustrated in Fig. 1. When a universal$_2$ hash function is chosen as the compression function g for PA, the mutual information between the distilled key and Eve's information satisfies
$$I(Y; g, W) \le \frac{2^{-s}}{\ln 2}, \tag{1}$$
where s = n − t − r denotes the security coefficient of PA [17], [21]. However, the above information-theoretic definition ignores the setting where Eve holds quantum information. To address this problem, Renner et al. refined the definition of privacy amplification from the standpoint of composable security [32]. They further proved that the upper bound on Eve's information is asymptotically tight under both definitions as n approaches infinity [32]. In a practical QKD system, n is suggested to be at least $10^6$ to eliminate the difference between the two definitions [33]. Therefore, the implementation of PA is usually expensive in both computation and storage.
VOLUME 7, 2019
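As a small worked example of the relation s = n − t − r and the bound 2^(−s)/ln 2, the following sketch computes the final key length and the leakage bound. The values of t and s are hypothetical and chosen only for illustration; they are not taken from any experiment in this paper.

```python
import math


def final_key_length(n: int, t: int, s: int) -> int:
    """Final key length r implied by the security coefficient s = n - t - r."""
    r = n - t - s
    if r <= 0:
        raise ValueError("n - t - s must be positive to distill a key")
    return r


def leakage_bound_bits(s: int) -> float:
    """Upper bound 2^(-s) / ln 2 on Eve's expected information, in bits."""
    return 2.0 ** (-s) / math.log(2)


# Hypothetical parameters, for illustration only.
n = 10**6       # corrected key length
t = 300_000     # bits assumed to be known to Eve
s = 64          # security coefficient
r = final_key_length(n, t, s)
print(r, r / n, leakage_bound_bits(s))
```

The ratio r / n printed here is the compression ratio used later in the paper.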
B. TOEPLITZ MATRIX BASED PA
Toeplitz matrix based PA is the most popular PA method in DV-QKD. Its essential operation is multiplying the input n-bit string X by an n × r random Toeplitz matrix [34]. For instance, (2) is a Toeplitz matrix,
Because a Toeplitz matrix is a diagonal-constant matrix, it can be constructed from its first column and first row, which means that n + r − 1 random bits are required. The FFT is an efficient algorithm for Toeplitz matrix multiplication: it reduces the computation complexity of the multiplication from $O(n^2)$ to O(n log n). The main process of Toeplitz matrix multiplication accelerated by FFT is shown as (3),
where
is the final key sequence,
is the description of Toeplitz matrix. The function of matrix P is to convert the Toeplitz matrix into a cyclic matrix. With this method, the input length of the FFT used to calculate the Toeplitz matrix multiplication is n + r − 1, which depends on the length of both the input and the output of PA. Hayashi et al. proposed a modified Toeplitz matrix as the compression function and gave the security proof [25]. The modified Toeplitz matrix is constructed by concatenating an identity matrix and a Toeplitz matrix, (I, V). For instance, (4) is a modified Toeplitz matrix,
After the Toeplitz matrix $V_{r \times (n-r)}$ is extended to a cyclic matrix, the calculation of $Y_r$ can be accelerated by FFT as shown in (6).
[
] is a part of the input sequence. The function of matrix P is to convert the modified Toeplitz matrix into a cyclic matrix. With the modified Toeplitz matrix, the required number of random bits is reduced from n + r − 1 to n − 1, and the input length of the FFT is also reduced from n + r − 1 to n − 1. In this case, the length of the FFT depends only on the input length and no longer on the output length. This feature notably reduces the design complexity of FFT-based Toeplitz PA.
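The modified Toeplitz hashing described above can be sketched directly, without FFT, to make the seed length n − 1 concrete. The index convention for V (entry (i, j) taken from seed[i − j + cols − 1]) and the choice that the first r input bits meet the identity block are assumptions made for this sketch; the example bits are made up.

```python
def toeplitz_from_seed(seed, rows, cols):
    """Build a rows x cols binary Toeplitz matrix from rows + cols - 1 seed bits.

    Assumed convention: entry (i, j) is seed[i - j + cols - 1], so every
    diagonal (constant i - j) holds a single seed bit.
    """
    if len(seed) != rows + cols - 1:
        raise ValueError("seed must contain rows + cols - 1 bits")
    return [[seed[i - j + cols - 1] for j in range(cols)] for i in range(rows)]


def modified_toeplitz_hash(x, seed, r):
    """Compute Y = X_r xor V * X_(n-r) over GF(2) for the matrix (I, V)."""
    n = len(x)
    if not 0 < r < n:
        raise ValueError("need 0 < r < n")
    x_r, x_rest = x[:r], x[r:]
    v = toeplitz_from_seed(seed, r, n - r)   # uses r + (n - r) - 1 = n - 1 bits
    y = []
    for i in range(r):
        acc = x_r[i]                         # contribution of the identity block
        for j in range(n - r):
            acc ^= v[i][j] & x_rest[j]       # GF(2) multiply-accumulate
        y.append(acc)
    return y


x = [1, 0, 1, 1, 0, 0, 1, 0]       # n = 8 corrected-key bits (made up)
seed = [1, 1, 0, 1, 0, 0, 1]       # n - 1 = 7 random bits (made up)
key = modified_toeplitz_hash(x, seed, r=3)
print(key)
```

This direct form costs O(r(n − r)) bit operations, which is the quadratic cost that the FFT-based form avoids.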
III. HIGH SPEED FFT-BASED PRIVACY AMPLIFICATION HARDWARE SCHEME
A high-speed PA hardware scheme for FPGA implementation is proposed in this section. Firstly, the overall process of the scheme is given. Then a modified 2-dimensional (2-D) FFT algorithm is proposed, followed by a real-value oriented FFT acceleration method and a block-wise matrix transposition method.
A. OVERALL PROCESS OF FFT-BASED PA HARDWARE SCHEMES
The overall process of the FFT-based PA hardware scheme is shown in Fig. 2. As a Toeplitz matrix based PA, our scheme is built upon the FFT algorithm, and the FFT length depends only on the corrected key length, not on the final key length. Therefore, the scheme adapts well to the compression ratio. The compression ratio is defined as the ratio of the final key length r to the corrected key length n. The pre-processing phase is designed to adapt to a varying compression ratio, in which the input sequence
. As the compression ratio changes, the pre-processing phase adjusts the length r in $X_r$ and the length n − r in $X_{n-1}$ as shown in Fig. 3. The dot operator denotes the element-wise product of the two FFT results. In the post-processing phase, the result of the inverse FFT (IFFT) is rounded to a Boolean sequence. The final key sequence of PA is the XOR of $X_r$ and the result of the Toeplitz cyclic convolution $Y_r$.
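The pre-processing, FFT, element-wise product, IFFT, rounding and XOR steps can be sketched in software as follows. This is a minimal floating-point sketch under our own assumptions: the FFT length is the next power of two at or above n − 1, the Toeplitz index convention is seed[i − j + (n − r) − 1], and the first r bits form $X_r$. The function names `pa_fft` and `pa_direct` are hypothetical, and the recursive FFT merely stands in for a hardware core.

```python
import cmath
import random


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out


def pa_fft(x, seed, r):
    """Modified Toeplitz PA through one cyclic convolution."""
    n, c = len(x), len(x) - r
    size = 1
    while size < n - 1:                      # FFT length depends on n only
        size *= 2
    # Pre-processing: split off X_r and zero-pad the rest and the seed.
    x_r = x[:r]
    x_pad = x[r:] + [0] * (size - c)
    v_pad = seed + [0] * (size - len(seed))
    prod = [p * q for p, q in zip(fft(x_pad), fft(v_pad))]   # element-wise product
    conv = [z.real / size for z in fft(prod, inverse=True)]
    # Post-processing: round to integers, reduce mod 2, XOR with X_r.
    y_r = [int(round(conv[i + c - 1])) % 2 for i in range(r)]
    return [a ^ b for a, b in zip(x_r, y_r)]


def pa_direct(x, seed, r):
    """Reference: the same hash by direct GF(2) matrix-vector multiplication."""
    c = len(x) - r
    return [x[i] ^ (sum(seed[i - j + c - 1] & x[r + j] for j in range(c)) % 2)
            for i in range(r)]


rng = random.Random(2019)
n = 16
x = [rng.randint(0, 1) for _ in range(n)]
seed = [rng.randint(0, 1) for _ in range(n - 1)]
for r in (2, 5, 11):                         # different compression ratios
    print(r, pa_fft(x, seed, r) == pa_direct(x, seed, r))
```

Only the split point r changes between the three runs; the FFT length stays the same, which is the adaptivity property claimed for the scheme.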
During the whole process, the main computation load comes from the FFT/IFFT. Although FFT/IFFT cores are available for FPGA design, the input length of these cores cannot satisfy the requirement of PA. To overcome this problem, a 2-D large-point FFT is built from small-point FFT hardware cores [35]. The procedure of the 2-D large-point FFT algorithm is described as Algorithm 1.
Algorithm 1 (2-D large-point FFT): the input is arranged as a k × k matrix and processed by row-wise small-point FFTs, rotation-factor multiplication and three matrix transpositions, with loops running over i, j = 0, …, k − 1.
Although this method makes it possible to accomplish the FFT/IFFT at high speed with multiple small-point FFT cores, it requires many matrix transpositions and repeated memory accesses. Thus, to speed up the PA scheme, the most important task is to reduce the number of matrix transpositions and memory accesses and to increase their speed. To meet this challenge, we design a modified 2-D FFT algorithm, a real-value oriented FFT acceleration method and a fast read/write balanced matrix transposition method.
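A 2-D large-point FFT with three transpositions can be sketched as below. This follows the textbook six-step decomposition and is not a line-by-line reconstruction of Algorithm 1; the naive `dft` stands in for a small-point FFT core, and k = 4 is used only to keep the check fast.

```python
import cmath


def dft(seq):
    """Naive DFT, standing in for a small-point FFT hardware core."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * t * m / n) for t in range(n))
            for m in range(n)]


def transpose(mat):
    """Return the transpose of a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*mat)]


def fft_2d_large(x, k):
    """Compute a k*k-point DFT from k-point transforms (six-step form)."""
    n = k * k
    assert len(x) == n
    a = [x[i * k:(i + 1) * k] for i in range(k)]      # 1-D sequence -> k x k matrix
    a = transpose(a)                                  # transposition 1
    a = [dft(row) for row in a]                       # k-point transform on each row
    a = [[a[i][j] * cmath.exp(-2j * cmath.pi * i * j / n)   # rotation factors
          for j in range(k)] for i in range(k)]
    a = transpose(a)                                  # transposition 2
    a = [dft(row) for row in a]                       # k-point transform on each row
    a = transpose(a)                                  # transposition 3
    return [v for row in a for v in row]


k = 4
x = [float((3 * i * i + 1) % 7) for i in range(k * k)]
big = fft_2d_large(x, k)
ref = dft(x)
print(max(abs(p - q) for p, q in zip(big, ref)))
```

The printed value is the largest deviation from the direct k²-point DFT and should be at the level of floating-point rounding.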
B. A MODIFIED 2-D FFT FOR PA
Since matrix transposition is very time-consuming, PA can be accelerated if fewer matrix transpositions are required. According to Algorithm 1, three matrix transpositions are needed in the 2-D large-point FFT. We find that removing the first and the third matrix transposition only affects the order of the output final key, while the mutual information between the input and the output does not change at all. Therefore, the 2-D large-point FFT/IFFT algorithm can be simplified to Algorithm 2, the Modified 2-D Large-Point FFT for PA.
Algorithm 2 Modified 2-D Large-Point FFT for PA
Algorithm 2 converts the one-dimensional input sequence $X_n$ into a two-dimensional matrix $A_{k \times k}$ and then applies the row-wise FFTs and the rotation-factor multiplication with loops over i, j = 0, …, k − 1, keeping only the second matrix transposition.
If Y = g(X) denotes Toeplitz-based PA with the 2-D large-point FFT, Toeplitz-based PA with the modified 2-D large-point FFT can be described by
$$Y' = g'(X) = T(Y_1), \quad Y_1 = g(X_1), \quad X_1 = T(X), \tag{7}$$
where the function T is a sequence transformation indicated as
where i, j = 0, 1, · · · , C − 1, and C is the number of rows of the matrix in the 2-D FFT. Although the transformation makes $Y'$ different from Y, it can be proved that the security of PA using Algorithm 2 is exactly equal to that of PA using Algorithm 1, i.e., $I(Y'; g', W) = I(Y; g, W)$. The detailed proof is presented in the following Proposition 1.
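The effect of dropping the first and third transpositions can be checked numerically. The sketch below assumes that T is the index transposition that moves element iC + j to position jC + i; this form is our assumption and is not taken from (8). Under that assumption, the modified 2-D FFT returns T(DFT(T(x))), so only the ordering of input and output changes.

```python
import cmath


def dft(seq):
    """Naive DFT, standing in for a small-point FFT hardware core."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * t * m / n) for t in range(n))
            for m in range(n)]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def t_perm(x, c):
    """Assumed form of T: output position i*c + j takes the input element j*c + i."""
    return [x[j * c + i] for i in range(c) for j in range(c)]


def modified_fft_2d(x, k):
    """2-D large-point FFT with only the middle transposition kept."""
    n = k * k
    a = [x[i * k:(i + 1) * k] for i in range(k)]      # 1-D sequence -> k x k matrix
    a = [dft(row) for row in a]                       # row transforms, no transpose first
    a = [[a[i][j] * cmath.exp(-2j * cmath.pi * i * j / n)   # rotation factors
          for j in range(k)] for i in range(k)]
    a = transpose(a)                                  # the single remaining transposition
    a = [dft(row) for row in a]
    return [v for row in a for v in row]              # no final transpose


k = 4
x = [float((5 * i + 2) % 9) for i in range(k * k)]
got = modified_fft_2d(x, k)
want = t_perm(dft(t_perm(x, k)), k)
print(max(abs(p - q) for p, q in zip(got, want)))
```

Because T is its own inverse, it is a one-to-one mapping on sequences, which is the property the proof of Proposition 1 relies on.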
Proposition 1: Let X be a random n-bit string with uniform distribution over $\{0,1\}^n$. Let W = e(X) for an arbitrary eavesdropping function $e: \{0,1\}^n \to \{0,1\}^t$, where t < n, and let the length of $Y'$ and Y be r = n − t − s, where s is a positive safety parameter and s < n − t. Let the function T be the sequence transformation given in (8). If Alice and Bob choose $Y' = g'(X)$ as in (7) or Y = g(X) as their secret key, where g is chosen at random from a universal$_2$ class of hash functions from $\{0,1\}^n$ to $\{0,1\}^r$, then Eve's expected information about the secret key satisfies $I(Y'; g', W) = I(Y; g, W) \le 2^{-s}/\ln 2$.
Proof: According to (7), the main differences between $Y'$ and Y are $Y' = T(Y_1)$ and $X_1 = T(X)$. So the proof starts with the equivalence of Eve's information uncertainty about X and $X_1$, indicated as (9) and (10), respectively.
Expanding the joint entropy $H(X, W, T, X_1)$ with the chain rule gives:
. (12) Because $X_1 = T(X)$ is a one-to-one mapping, (13) and (14) can be established,
Apparently,
Applying the same deduction as (9) - (14), (17) can be obtained,
so,
Therefore, $I(Y; g, W) = I(Y'; g', W)$ according to the definition of mutual information. On the basis of existing proof in [17]
Let us take a one-million-point PA as an example. Because our PA algorithm is based on the 2-D large-point FFT algorithm, the input sequence is loaded into a 1024 × 1024 matrix. The input and output sequences of the PA algorithm with the 2-D large-point FFT are shown in Fig. 4(a). The input and output sequences of the PA algorithm with the modified 2-D large-point FFT for PA are shown in Fig. 4(b). We have proved that the security of the two methods is equivalent.
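One way to spell out the one-to-one mapping step used in the proof is sketched below. It assumes that T is a fixed, publicly known bijection and that $X_1 = T(X)$; it is written in our own notation and is not a reconstruction of (9)-(18).

```latex
\begin{aligned}
H(X, X_1 \mid W) &= H(X \mid W) + H(X_1 \mid X, W) = H(X \mid W),
  && \text{since } X_1 = T(X) \text{ is determined by } X,\\
H(X, X_1 \mid W) &= H(X_1 \mid W) + H(X \mid X_1, W) = H(X_1 \mid W),
  && \text{since } X = T^{-1}(X_1) \text{ is determined by } X_1,\\
\Rightarrow\quad H(X_1 \mid W) &= H(X \mid W) \ge n - t .
\end{aligned}
```

Eve's uncertainty about the permuted string is therefore the same as her uncertainty about the original string, so the same universal$_2$ bound applies to both.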
With the modified method, the number of matrix transpositions and memory accesses needed by the large-point FFT/IFFT algorithm decreases significantly. Since the PA algorithm needs one FFT and one IFFT, the number of matrix transpositions in the whole PA algorithm decreases from six to two, and the time consumption of the PA algorithm decreases accordingly.
C. REAL-VALUE ORIENTED FFT ACCELERATION
Commercially available FFT hardware cores are designed to compute the FFT of a complex sequence, whereas both the input sequence $X_n$ and the description of the Toeplitz matrix $V_n$ are real sequences. Most FFT-based PA schemes treat the input sequence as the real part and set the imaginary part to zero. This wastes computing and storage resources. A real-value oriented FFT algorithm is introduced to solve this problem when computing the FFT of the input sequence x(n) and the Toeplitz sequence v(n). Their FFT results X(k) and V(k) can be obtained via one complex-valued FFT as described by (19)-(24) [36].
This optimization saves nearly half of the computing and storage resources of PA.
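The idea of obtaining two real-input transforms from one complex-valued transform can be sketched with the standard conjugate-symmetry identity. This is a generic textbook form and may differ in detail from (19)-(24); the naive `dft` again stands in for a complex FFT core, and the sample sequences are made up.

```python
import cmath


def dft(seq):
    """Naive DFT, standing in for a complex-valued FFT core."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * t * m / n) for t in range(n))
            for m in range(n)]


def two_real_ffts(x, v):
    """Transforms of two real sequences from a single complex transform.

    Pack z(n) = x(n) + j*v(n), compute Z(k), then separate the results using
    X(k) = [Z(k) + conj(Z(N-k))] / 2 and V(k) = [Z(k) - conj(Z(N-k))] / (2j).
    """
    n = len(x)
    z = dft([complex(a, b) for a, b in zip(x, v)])
    big_x, big_v = [], []
    for k in range(n):
        zc = z[(n - k) % n].conjugate()
        big_x.append((z[k] + zc) / 2)
        big_v.append((z[k] - zc) / 2j)
    return big_x, big_v


x = [1, 0, 1, 1, 0, 0, 1, 0]     # example input bits (made up)
v = [0, 1, 1, 0, 1, 0, 0, 1]     # example Toeplitz bits (made up)
fx, fv = two_real_ffts(x, v)
err = max(max(abs(p - q) for p, q in zip(fx, dft(x))),
          max(abs(p - q) for p, q in zip(fv, dft(v))))
print(err)
```

One complex transform replaces two, which is where the saving of roughly half of the computation comes from.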
D. BLOCK WISE MATRIX TRANSPOSITION
Although the modified 2-D FFT algorithm decreases the number of matrix transpositions, matrix transposition is still time-consuming. To further improve the processing rate of PA, an effective matrix transposition method, the so-called block wise matrix transposition, is introduced in our scheme [37]. The access mechanism of double data rate synchronous dynamic random access memory (DDR-SDRAM, DDR for short) makes a row span access operation cost much more time than an inline access. Because matrix transposition needs a large number of row span access operations, the block wise matrix transposition is introduced to reduce the number of row span access operations, as shown in (25) and (26).
where A is the square matrix to be transposed, the size of the matrix is N × N, N should be a perfect square, $C = \sqrt{N}$, and i, j, k, l = 0, · · · , C − 1. M indicates the DDR memory to access.
The main process of the common matrix transposition based on the DDR memory model is indicated as (27) ,
Taking the one-million-point PA algorithm as an example, the number of row span access operations in the common matrix transposition is calculated as (28),
$$T_{span} = T_{write} + T_{read} = 1024 + 1024 \times 1024 = 1049600. \tag{28}$$
The block wise matrix transposition method uses matrix partitioning to balance the number of row span accesses between read and write operations. This method reduces the total number of row span accesses and increases the processing rate of matrix transposition significantly. The main process of the block wise matrix transposition is shown in Fig. 5.
In this case, each row of the matrix is transformed into a 32 × 32 matrix. The number of row span access operations of the block wise matrix transposition method is calculated as (29).
To verify the improvement in the processing rate of matrix transposition, an experiment was carried out with DDR3-SDRAM. The experimental comparison of the two methods is given in TABLE 1.
According to the experiment results, the block wise matrix transposition method can bring a boost of the processing rate of matrix transposition by a factor of 2.76.
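The read/write balancing effect can be illustrated with a simplified memory model that counts how often consecutive accesses land on a different memory row. The tiled address mapping below (each memory row holds one C × C tile of the matrix) is our own assumption for illustration and is not taken from (25)-(26); real DDR timing is also more complex than a row-switch count.

```python
from math import isqrt


def row_switches(rows):
    """Count how often consecutive accesses land on a different memory row."""
    switches, last = 0, None
    for row in rows:
        if row != last:
            switches, last = switches + 1, row
    return switches


def transpose_naive(a):
    """Write rows in place, then read columns: cheap writes, expensive reads."""
    n = len(a)
    mem = [[None] * n for _ in range(n)]
    out = [[None] * n for _ in range(n)]
    writes, reads = [], []
    for i in range(n):
        for j in range(n):
            mem[i][j] = a[i][j]
            writes.append(i)
    for j in range(n):
        for i in range(n):
            out[j][i] = mem[i][j]
            reads.append(i)
    return out, row_switches(writes) + row_switches(reads)


def transpose_blockwise(a):
    """Store C x C tiles per memory row so reads and writes both span C rows."""
    n = len(a)
    c = isqrt(n)
    assert c * c == n, "N must be a perfect square"

    def addr(i, j):
        # Assumed mapping: tile (i // c, j // c) occupies one memory row.
        return (i // c) * c + j // c, (i % c) * c + j % c

    mem = [[None] * n for _ in range(n)]
    out = [[None] * n for _ in range(n)]
    writes, reads = [], []
    for i in range(n):
        for j in range(n):
            row, col = addr(i, j)
            mem[row][col] = a[i][j]
            writes.append(row)
    for j in range(n):
        for i in range(n):
            row, col = addr(i, j)
            out[j][i] = mem[row][col]
            reads.append(row)
    return out, row_switches(writes) + row_switches(reads)


n = 16
a = [[i * n + j for j in range(n)] for i in range(n)]
naive_out, naive_cost = transpose_naive(a)
block_out, block_cost = transpose_blockwise(a)
print(naive_cost, block_cost)
```

In this model the naive count is N + N², which matches (28) for N = 1024, while the tiled count is 2N√N; whether that equals the paper's (29) is not confirmed by the text.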
IV. IMPLEMENTATIONS AND RESULTS
The proposed PA scheme is implemented on the Xilinx ML605 Evaluation Kit. The kit includes a Virtex-6 XC6VLX240T-1FFG1156 FPGA with 241,152 logic cells and a 512MB DDR3 SDRAM. The overall structure of our PA scheme is shown in Fig. 6. The input pre-processing unit stores the corrected keys and the Toeplitz random numbers. It also converts the data to floating-point numbers for the FFT convolution. In addition, because the Toeplitz matrix should be constructed randomly for each trial, the buffer sizes of the corrected keys and the Toeplitz random numbers should be the same to guarantee the correctness and performance of the scheme. The PA control unit controls the processing sequence and the data interaction of the other units. The FFT convolution unit is the key unit of the PA module and contains five major parts. The FFT core calculates the FFT on each row of the matrix. The length of the FFT IP core is set to 1024 in the one-million-point PA algorithm. The processing rate of a single FFT core is 12.8Gbps, and the maximum rate of the matrix transposition is 19.74Gbps. Hence, two FFT IP cores are used in our scheme to match the rates of the FFT cores and the transposition. Similarly, the IFFT core calculates the IFFT on each row of the matrix. Two IFFT IP cores are used, and their lengths are also set to 1024. The real-value oriented acceleration unit completes the computation task of the real-value FFT efficiently. The rotation factor correction unit multiplies the results of the FFT/IFFT cores point-wise by the rotation factors to accomplish the 2-D large-point FFT algorithm. The data distribution unit distributes the data to the calculation units and exchanges data with the DDR3-SDRAM controller. The scheme is simulated with Modelsim v10.4, and its function is verified by comparison with a reference program in Matlab 2017a.
The scheme is then implemented on an ML605 Evaluation Kit, and the results agree with the simulation. The resource utilization of the PA scheme in hardware is shown in TABLE 2. According to the resource utilization, there are enough spare resources for the other modules needed to build a complete post-processing system on the XC6VLX240T FPGA.
The comparison of several FPGA-based implementations of PA schemes is demonstrated in TABLE 3.
According to the implementation results, the processing speed of our PA scheme reaches 116Mbps, which is nearly double that of the latest FPGA-based Toeplitz PA scheme [31]. This speed enhancement comes mainly from two factors. Firstly, the computation complexity of the FFT algorithm is lower than that of the linear feedback shift register (LFSR) based algorithm. Secondly, the modified 2-D FFT algorithm with a real-value oriented FFT acceleration method and a block wise matrix transposition is employed in this scheme. In addition, because the processing speed of the proposed scheme is mainly limited by the memory transfer rate, it can be improved greatly by simply replacing the DDR3 used in our scheme with a faster memory chip, e.g., DDR4 SDRAM.
Besides the high processing speed, another advantage of our scheme is its good adaptivity to the compression ratio. Both schemes in [30] and [31] are based on the LFSR and suffer from resource consumption that depends on the compression ratio. More specifically, to maintain a high processing speed, the resource consumption of those schemes increases sharply as the compression ratio rises. The compression ratio roughly varies in the range of 10% to 30% in existing QKD systems. For example, the compression ratio in the high-speed QKD system of [27] is 29%. The scheme proposed in [31] is resource-saving when the compression ratio of PA is fixed at 10%, but the resource requirement doubles if the compression ratio becomes 20% and the same processing speed is to be maintained. Unlike the LFSR-based PA schemes, the resource consumption of the proposed FFT-based PA is independent of the compression ratio.
Keeping the throughput constant, the resource consumptions of our scheme and the comparative two schemes are shown in Fig. 7(a) . Keeping the resource consumption stable, the throughputs of the three schemes are shown in Fig. 7(b) . From the comparison, our FFT-based scheme can meet the requirements of more QKD systems.
Although the proposed FFT-based scheme costs about 4MB BRAM, which is higher than the comparative LFSR-based schemes, 4MB is an acceptable consumption considering the total BRAM resource of a typical FPGA. For example, the XC6VLX240T FPGA used in our implementation contains 30MB BRAM.
V. CONCLUSIONS AND OUTLOOK
This paper presents a high-speed PA hardware scheme based on the FFT and its implementation on FPGA. The experimental results on a Xilinx Virtex-6 FPGA demonstrate that the throughput is nearly double that of the latest FPGA-based Toeplitz PA scheme in the literature. Compared with other representative works, the proposed PA scheme supports a wide and variable range of compression ratios. It can reach a higher processing speed with faster memory. The optimization methods proposed in this paper also fit FFT-based PA algorithms on other platforms, such as CPU and GPU. In the future, we will try to further improve the processing speed and reduce the resource consumption of the FFT-based PA scheme on FPGA.
