Abstract: Privacy amplification (PA) is a vital procedure in quantum key distribution (QKD): from the identical error-corrected key shared by the communicating parties, it distills a secret key about which the eavesdropper has only negligible information. As the repetition frequency of discrete-variable QKD (DV-QKD) systems increases, the processing speed of PA has become the bottleneck restricting the secure key rate of DV-QKD. PA based on Toeplitz hashing is widely adopted because of its simplicity and parallelism. Because this algorithm can be accelerated with the fast Fourier transform (FFT), we propose an improved FFT-based PA scheme for field-programmable gate arrays (FPGAs). This paper improves the custom FFT-based algorithm by significantly reducing the number of computations and memory read/write operations. The correctness of the scheme is verified by an implementation on a Xilinx Virtex-6 FPGA. Experiments show that the processing speed of the improved scheme is nearly double that of the classical Toeplitz hashing scheme on FPGA.
Introduction
Quantum key distribution (QKD) exploits the principles of quantum mechanics to accomplish secure key distribution. Since Bennett and Brassard proposed the first practicable protocol in 1984 [1], many protocols have been proposed. These protocols can be divided into discrete-variable (DV) protocols [2]-[6] and continuous-variable (CV) protocols [7]-[10]. Because DV-QKD was proposed earlier and its security proofs are more complete, DV-QKD has reached a mature stage and many commercial DV-QKD systems have been developed [11]-[14]. We focus on DV-QKD, where the key generation rate and the integration of the QKD system on a chip are two important research topics at this stage. DV-QKD is divided into four phases: quantum communication, public discussion, key reconciliation, and privacy amplification. The first three phases let two distant legitimate parties, usually named Alice and Bob, obtain an identical random sequence. However, in this process, some information is inevitably exposed to the eavesdropper, usually named Eve.
Privacy amplification (PA) eliminates the leaked information by distilling the final secret key from a long secret random sequence with a universal hash function [15]-[17]. Several classes of hash functions have been applied to perform PA [18], [19]. To reduce the finite-size effect in distilling secure keys, the length of the input blocks for PA should be at least 10^6 [20], which leads to long processing blocks. Zhang et al. chose a simple multiplicative universal class of hash functions to speed up the PA process, and they constructed an optimal multiplication algorithm from four basic multiplication algorithms [21]. This algorithm achieves 14.68 Mbps on a CPU, but it is iterative, which means that it consumes large resources and is unsuited for hardware implementation. Toeplitz hashing [22] is also widely used in PA because of its simplicity and parallelism. The authors in [23] proposed a block-parallel algorithm to speed up Toeplitz hashing. The authors in [24] and [25] each proposed an improved block-parallel algorithm for Toeplitz hashing. The algorithm in [25] achieves a processing speed of 64 Mbps on a field-programmable gate array (FPGA) and reduces memory resources significantly. However, this kind of algorithm has reached the speed limit imposed by its algorithmic complexity of O(n^2). The fast Fourier transform (FFT) and the fast number-theoretic transform (FNTT) are efficient fast algorithms for Toeplitz hashing, which reduce the algorithmic complexity from O(n^2) to O(n log n). The authors in [26] first proposed an FFT-based PA algorithm and implemented it on Many Integrated Core (MIC). The processing speed of the algorithm reaches 60 Mbps with a raw key length of 12.8 Mbit. The authors in [27] proposed an FFT-based PA algorithm for CV-QKD based on a graphics processing unit (GPU). The speed of privacy amplification exceeds 1 Gbps.
However, this algorithm is only suitable for CV-QKD, because it is efficient only for large raw key lengths and low compression ratios. Crucially, GPU and MIC platforms are both hard to integrate because of their volume and power consumption.
Based on this survey, the FPGA is a suitable platform for DV-QKD systems because it offers high parallelism and can be embedded. More importantly, the energy consumption of an FPGA is much lower than that of a GPU or MIC. Nevertheless, existing PA algorithms on FPGA are all block-parallel methods of Toeplitz hashing with an algorithmic complexity of O(n^2). The FFT has great potential to accelerate the PA algorithm on FPGA and to reduce its resource consumption. Unfortunately, there is no practicable FFT-based PA scheme on FPGA. The main reason is that an input block length of 10^6 increases the difficulty and cost of implementation. To solve this problem, we propose an FFT-based PA hardware acceleration algorithm and implement it on a Virtex-6 FPGA. The throughput of our algorithm reaches 116 Mbps with a raw key length of n = 1M. This is nearly twice as fast as the classical Toeplitz hashing algorithm on FPGA.
The rest of this paper is organized as follows. Section 2 describes the principle of privacy amplification with the modified Toeplitz matrix that we use. Section 3 presents the details and the key improvements of our FFT-based PA hardware acceleration algorithm. Section 4 presents our PA hardware implementation and experimental results, including the processing speed and the hardware resource requirements, together with a comparison of several FPGA-based hardware implementations. Section 5 provides a brief conclusion.
Related Work

Privacy Amplification
Privacy amplification is a process that allows two parties to distill a secret key from a secret random variable about which an eavesdropper has partial information [15]. Before the PA procedure in QKD, Alice and Bob share a random n-bit binary string W, while Eve learns a correlated random string V providing t (t < n) bits of information about W, i.e., H(W|V) ≥ n − t. Alice and Bob wish to publicly choose a compression function g: {0,1}^n → {0,1}^r such that Eve's partial information on W and her complete information on g give her arbitrarily little information about K = g(W). This procedure is illustrated in Fig. 1. Universal hash functions [22] have been shown to be suitable as the compression function g for privacy amplification [15]. The mutual information between the distilled key compressed by a universal hash function and Eve's information can be bounded using Rényi entropy,
where s = n − t − r is the security parameter of PA.
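For reference, the bound from generalized privacy amplification [15] is commonly stated in the following form. The symbol G, denoting the publicly chosen hash function as a random variable, is notation introduced here for illustration and may differ from the notation of the original equation.

```latex
% Commonly quoted form of the privacy amplification bound from [15].
% G (the random choice of hash function) is notation assumed for this sketch.
\begin{equation*}
  I(K;\, G, V) \;\le\; \frac{2^{-s}}{\ln 2}, \qquad s = n - t - r .
\end{equation*}
% Worked example: for s = 30 the bound is
% 2^{-30} / \ln 2 \approx 1.34 \times 10^{-9} bits.
```

Each additional bit of s halves the bound, so the leaked information decreases exponentially in the number of bits sacrificed beyond t.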
Considering the security threat posed by the finite-key effect, the input raw key length n of privacy amplification in DV-QKD should be larger than 10^6 [20], [28]. Therefore, the computational cost of the hash function is very large, and the choice of the hash function class is extremely important.
Modified Toeplitz Matrix
Toeplitz matrices form a particular class of universal hash functions [22]. Because a Toeplitz matrix is a diagonal-constant matrix, it can be constructed from its first column and first row, and its product with a vector can be calculated by FFT. Therefore, the required number of random bits can be reduced to n + r − 1 and the computational complexity can be reduced to O(n log n). However, the input length of the FFT would be n + r − 1 or 2n, depending on the compression ratio, which would incur considerable extra cost in hardware implementation. Hayashi et al. proposed using a modified Toeplitz matrix instead of the Toeplitz matrix as the compression function and gave the security proof [29]. The modified Toeplitz matrix is constructed by concatenating a Toeplitz matrix and the identity matrix, (X, I). For instance, Eq. (2) is a modified Toeplitz matrix.
Using the modified Toeplitz matrix, the required number of random bits can be reduced to n and the input length of the FFT can be reduced to n. In this case, the length of the FFT depends only on the input block length of the key rather than on the final key length. This feature notably reduces the design complexity of FFT-based PA.
Modified Toeplitz Matrix Calculation by FFT
The FFT is commonly used to calculate the product of a Toeplitz matrix and a vector because of its O(n log n) computational complexity. A general calculation process for the modified Toeplitz matrix is provided in this section.
Here Y_r is the final key sequence. Eq. (3) gives the calculation of the final key with the modified Toeplitz matrix S_{r×n}.
By extending the Toeplitz matrix V_{r×(n−r)} to a circulant matrix, the calculation of Y_r can be accelerated by FFT, as shown in Eq. (4).
The bracketed vector is a part of the input sequence. The matrix P complements the Toeplitz matrix V_{r×(n−r)} to a circulant matrix.
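The calculation described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the hardware implementation: the function names, the indexing convention V[i, j] = t[i − j + (n − r) − 1] for the Toeplitz part, and the zero-padding of both sequences to length n are choices made for this example. The sketch checks that the FFT-based result agrees with the direct matrix product modulo 2.

```python
import numpy as np

def modified_toeplitz_direct(x, t, r):
    """Y = (V | I) x mod 2, with V built explicitly (O(n^2), reference only)."""
    n = len(x)
    m = n - r                              # number of columns of V
    i = np.arange(r)[:, None]
    j = np.arange(m)[None, :]
    V = t[i - j + m - 1]                   # assumed diagonal-constant layout
    return (V @ x[:m] + x[m:]) % 2

def modified_toeplitz_fft(x, t, r):
    """Same result via one length-n circular convolution computed with FFTs."""
    n = len(x)
    m = n - r
    t_pad = np.zeros(n)
    t_pad[: n - 1] = t                     # n - 1 seed bits, padded to length n
    x_pad = np.zeros(n)
    x_pad[:m] = x[:m]                      # first n - r input bits, zero-padded
    conv = np.fft.irfft(np.fft.rfft(t_pad) * np.fft.rfft(x_pad), n=n)
    # Entries m-1 .. n-2 of the convolution equal V @ x[:m]; no wrap-around
    # occurs for these indices, so circular and linear convolution agree.
    vx = np.rint(conv[m - 1 : n - 1]).astype(np.int64)
    return (vx + x[m:]) % 2                # identity part adds the last r bits

rng = np.random.default_rng(0)
n, r = 64, 24                              # toy sizes; the paper uses n of about 10^6
x = rng.integers(0, 2, size=n)
t = rng.integers(0, 2, size=n - 1)
print(np.array_equal(modified_toeplitz_direct(x, t, r),
                     modified_toeplitz_fft(x, t, r)))
```

The rounding step is needed because the FFT works in floating point while the hash is defined over bits; the FFT length in this sketch depends only on n, not on r.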
High Speed FFT-based Privacy Amplification Hardware Scheme
A high-speed PA hardware scheme for FPGA implementation is proposed in this section. The overall process of the scheme, based on the FFT algorithm, is given first. Three algorithmic optimizations that exploit the features of privacy amplification are then presented.
Overall Process of FFT-based PA Hardware Schemes
The overall process of the FFT-based PA hardware scheme is shown in Fig. 2. The pre-processing phase divides the input sequence. The calculation of the FFT/IFFT requires the most computation in the entire process. Although dedicated hardware circuits for FFT/IFFT are provided in the FPGA, the input length of these circuits cannot satisfy the requirements of PA. To address this issue, a two-dimensional long FFT algorithm [30] is adopted to calculate a long FFT with small-point FFT cores. The procedure of the custom 2-D long FFT algorithm is described in Algorithm 1.
In this way, the calculation of the FFT and IFFT can be accomplished at high speed with multiple small-point FFT cores. However, this algorithm requires repeated matrix transposition and storage, which leads to a large number of memory read and write operations. Thus, the data transfer rate of the memory is the bottleneck of the entire PA algorithm. To address this shortcoming, several optimizations that exploit the features of PA are proposed to reduce the amount of stored data and improve the processing speed.
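The idea of building a long FFT from small-point FFTs can be sketched with the textbook two-dimensional (four-step) decomposition. The sketch below is an assumption-laden illustration in NumPy, not a transcription of Algorithm 1: the function name and index convention are hypothetical, and NumPy's `axis` argument hides the memory access pattern that makes transposition costly in hardware.

```python
import numpy as np

def long_fft_2d(x, k):
    """Length k*k FFT computed from k-point FFTs (four-step decomposition)."""
    n = k * k
    # Step 1: load the sequence into a k x k matrix, A[n1, n2] = x[n1*k + n2].
    a = np.asarray(x, dtype=complex).reshape(k, k)
    # Step 2: k-point FFTs down the columns (in hardware: transpose, then row FFTs).
    b = np.fft.fft(a, axis=0)
    # Step 3: multiply by the rotation (twiddle) factors W_n^(m1 * n2).
    idx = np.arange(k)
    twiddle = np.exp(-2j * np.pi * np.outer(idx, idx) / n)
    c = b * twiddle
    # Step 4: k-point FFTs along the rows; D[m1, m2] = X[m1 + k*m2].
    d = np.fft.fft(c, axis=1)
    # Step 5: a final transposition restores the natural output order.
    return d.T.reshape(-1)

rng = np.random.default_rng(1)
x = rng.standard_normal(32 * 32)
print(np.allclose(long_fft_2d(x, 32), np.fft.fft(x)))
```

With k = 1024 the same structure yields a transform of about one million points from 1024-point FFTs, which is the configuration discussed in the text.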
Real-valued FFT Acceleration
The input sequence X_n and the sequence V_n describing the Toeplitz matrix are both real sequences, but the FFT hardware circuits are designed to compute the FFT of a complex sequence. Most FFT-based PA schemes treat the input sequence as the real part and directly set the imaginary part to 0. This wastes computing and storage resources. A real-valued FFT algorithm [31] is introduced in our method to solve this problem. With this algorithm, two real-valued FFT calculations can be accomplished by one complex-valued FFT. In our scheme, we need the FFT results of the input sequence and of the Toeplitz sequence. We can obtain both results from one complex-valued FFT with the method described below.
This optimization accomplishes the FFTs of two sequences with one complex-valued FFT operation, which improves the processing rate and saves nearly half of the computing and storage resources.
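The standard way to obtain two real-valued FFTs from one complex-valued FFT can be sketched as follows. This is the textbook conjugate-symmetry identity, shown here as an illustration; whether the scheme in [31] uses exactly this form is an assumption, and the function name is hypothetical.

```python
import numpy as np

def two_real_ffts(a, b):
    """Return FFT(a) and FFT(b) for real a, b using a single complex FFT."""
    z = np.asarray(a, dtype=float) + 1j * np.asarray(b, dtype=float)
    Z = np.fft.fft(z)
    # Z_rev[k] = conj(Z[(N - k) mod N])
    Z_rev = np.conj(np.roll(Z[::-1], 1))
    A = (Z + Z_rev) / 2        # spectrum of the real part a
    B = (Z - Z_rev) / 2j       # spectrum of the imaginary part b
    return A, B

rng = np.random.default_rng(2)
a = rng.integers(0, 2, size=1024)   # e.g. the input key bits
b = rng.integers(0, 2, size=1024)   # e.g. the Toeplitz seed bits
A, B = two_real_ffts(a, b)
print(np.allclose(A, np.fft.fft(a)), np.allclose(B, np.fft.fft(b)))
```

The separation step costs only additions and a conjugation per frequency bin, which is small compared with a second full FFT.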
Algorithm 1 Custom 2-Dimensional Long FFT of x
Input:
is the transpose function of the matrix A
3: for i = 0 to k − 1 do
4:
5: end for
6: for i = 0 to k − 1 do
7:   for j = 0 to k − 1 do
8:
16: for i = 0 to k − 1 do
17:   for j = 0 to k − 1 do
18:
19:   end for
20: end for
A Modified 2D-FFT for PA
The matrix transposition and storage operations of the long FFT algorithm cost a considerable amount of time. In PA, however, the final result is the unconditionally secure key, and its security is not affected by the order in which the original key is input or the order in which the final key is output. Therefore, the input and output order of the FFT result need not be preserved in the PA algorithm. Disregarding this order, the procedure of the long FFT/IFFT algorithm described earlier (Section 3.1) can be simplified as follows:
In this way, the matrix transposition and storage required by the long FFT/IFFT algorithm are significantly reduced. Taking the one-million-point PA algorithm as an example, the output sequences of the PA algorithm with the natural-order FFT and with the unnatural-order FFT are shown in Fig. 3.
Because our one-million-point PA algorithm is based on the two-dimensional FFT algorithm, the input sequence is loaded into a 1024 × 1024 matrix. If the matrix is processed row by row in the two-dimensional FFT algorithm, the input and output sequences of the PA algorithm with the natural-order FFT are identical to those in Fig. 3(a). Meanwhile, the input and output sequences of the PA algorithm with the unnatural-order FFT are shown in Fig. 3(b), in column-by-column order.
Algorithm 2 Modified 2-D FFT for PA
1: Convert one-dimensional input sequence X_n into two-dimensional matrix A_{k×k}
2: for i = 0 to k − 1 do
3:
4: end for
5: for i = 0 to k − 1 do
6:   for j = 0 to k − 1 do
7:
13: end for
14: for i = 0 to k − 1 do
15:   for j = 0 to k − 1 do
16:
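One way to see why the order of the FFT result can be ignored is that a circular convolution only needs the two spectra to be in the same order when they are multiplied. The sketch below is an illustration under assumptions, not a transcription of Algorithm 2: the forward transform skips the final transposition, so its output is in a permuted order, and the inverse transform accepts that permuted order directly. All function names are hypothetical.

```python
import numpy as np

def fft2d_permuted(x, k):
    """Four-step FFT without the final transpose: D[m1, m2] = X[m1 + k*m2]."""
    n = k * k
    a = np.asarray(x, dtype=complex).reshape(k, k)
    idx = np.arange(k)
    twiddle = np.exp(-2j * np.pi * np.outer(idx, idx) / n)
    b = np.fft.fft(a, axis=0)
    return np.fft.fft(b * twiddle, axis=1)

def ifft2d_from_permuted(d, k):
    """Inverse transform that consumes the permuted spectrum directly."""
    n = k * k
    idx = np.arange(k)
    twiddle = np.exp(2j * np.pi * np.outer(idx, idx) / n)
    e = np.fft.ifft(d, axis=1)
    y = np.fft.ifft(e * twiddle, axis=0)
    return y.reshape(-1)                 # natural-order time-domain result

def circular_convolve(a, b, k):
    """Circular convolution; the spectra are never reordered."""
    spectrum = fft2d_permuted(a, k) * fft2d_permuted(b, k)
    return ifft2d_from_permuted(spectrum, k).real

rng = np.random.default_rng(3)
k = 16
a = rng.integers(0, 2, size=k * k)
b = rng.integers(0, 2, size=k * k)
reference = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
print(np.allclose(circular_convolve(a, b, k), reference))
```

In this sketch the point-wise product is unaffected by the permutation because both spectra are permuted identically, which removes the transpositions between the forward and inverse transforms.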
Fast Matrix Transposition
Although the modified 2D-FFT algorithm reduces the number of matrix transpositions to two, matrix transposition still takes a lot of time. A highly efficient matrix transposition method [32] is introduced in our scheme. Due to the access mechanism of DDR SDRAM, row-span access operations reduce the access speed of the DDR memory. However, matrix transposition needs a large number of row-span access operations. The highly efficient matrix transposition method uses matrix partitioning to reduce the number of row-span access operations. Taking the one-million-point PA algorithm as an example, the main process of common matrix transposition based on the DDR memory model is shown in Fig. 4. The number of row-span access operations of common matrix transposition is given by Eq. (11):
$$T_{\text{row-span}} = T_{\text{write}} + T_{\text{read}} = 1024 + 1024 \times 1024 = 1049600 \tag{11}$$
The highly efficient matrix transposition method uses matrix partitioning to balance the number of row-span accesses between the read operation and the write operation. This method reduces the total number of row-span accesses and increases the data rate significantly. The main process of the highly efficient matrix transposition is shown in Fig. 5.
In this case, each 1024-element row of the matrix is rearranged into a 32 × 32 matrix. The number of row-span access operations of the highly efficient matrix transposition method is calculated as follows:
This method clearly reduces the number of row-span accesses. We tested it on DDR3 SDRAM to measure its improvement on the data rate of matrix transposition. The experimental comparison of the two methods is given in TABLE I. According to the experimental results, the highly efficient matrix transposition method doubles the data rate of matrix transposition.
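The effect of partitioning on row-span accesses can be illustrated with a toy memory model. Everything in this sketch is an assumption made for illustration: one DDR row is taken to hold 1024 matrix elements, a row-span access is counted whenever two consecutive accesses fall in different DDR rows (plus one for the first access), and the tiled layout stores each 32 × 32 tile in one DDR row, which is not necessarily the layout used in [32]. Under the linear layout the model reproduces the count 1024 + 1024 × 1024 given for common matrix transposition.

```python
import numpy as np

def count_row_activations(ddr_rows):
    """Number of times the open DDR row changes, counting the first access."""
    return int(np.count_nonzero(np.diff(ddr_rows))) + 1

def access_orders(n):
    """Element coordinates for a row-major write and a column-major read."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    row_major = (i.ravel(), j.ravel())
    col_major = (i.T.ravel(), j.T.ravel())
    return row_major, col_major

def linear_ddr_row(i, j):
    # Assumed linear layout: matrix row i is stored in DDR row i.
    return i

def tiled_ddr_row(i, j, tile=32, n=1024):
    # Assumed tiled layout: each tile x tile block occupies one DDR row.
    return (i // tile) * (n // tile) + (j // tile)

def total_activations(n, ddr_row_of):
    (wi, wj), (ri, rj) = access_orders(n)
    writes = count_row_activations(ddr_row_of(wi, wj))
    reads = count_row_activations(ddr_row_of(ri, rj))
    return writes + reads

print(total_activations(1024, linear_ddr_row))   # linear layout
print(total_activations(1024, tiled_ddr_row))    # assumed 32 x 32 tiling
```

In this model the tiled layout gives 32 × 1024 row changes for the write and the same number for the read, so the cost is balanced between the two operations instead of being concentrated in the read.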
Implementations and Results
Our PA scheme is implemented on the ML605 Evaluation Kit. The kit is based on a Virtex-6 XC6VLX240T-1FFG1156 FPGA with 241,152 logic cells. The kit also contains 512 MB of DDR3 SDRAM to support our scheme. The overall structure of our PA scheme is shown in Fig. 6. The input data buffer stores the input key and the Toeplitz random sequence. On output, it also converts the data to floating point for the FFT convolution. The PA control module is the core controller of the PA module. It controls the FFT convolution unit to accomplish the PA computational tasks. The FFT convolution unit is the key unit of the PA module and contains five major parts. The FFT core calculates the FFT on each row of the matrix. Two FFT IP cores provided by Xilinx are used to meet the speed requirements. The length of each FFT IP core is set to 1024 in the one-million-point PA scheme. Similarly, the IFFT core calculates the IFFT on each column or row of the matrix; it also uses two FFT IP cores of length 1024. The multiply unit performs the computation required by the real-valued FFT acceleration. The rotation factor multiply unit multiplies the results of the FFT/IFFT cores point-wise by the rotation factors to accomplish the two-dimensional FFT algorithm. The data distribution unit distributes the data to the calculation units and exchanges data with the SDRAM controller. We simulated the scheme with ModelSim and verified its correctness against results computed in MATLAB. The scheme was then implemented on an ML605 Evaluation Kit, and the results agree with the simulation. The resource utilization of the PA scheme in hardware is shown in TABLE II. According to this resource utilization, there are enough spare resources for other modules to form a post-processing system on one chip together with the PA module. A comparison of several FPGA-based implementations of PA schemes is given in TABLE III.
The schemes in [23], [24], and [25] all use a linear feedback shift register (LFSR) to calculate the Toeplitz matrix. This kind of scheme can be fast and resource-saving when the compression ratio of PA is low and fixed. When the compression ratio is high or variable, the resource consumption of PA increases rapidly. The FFT-based PA implementation is not affected by the compression ratio, so it is easy to support a wide-range variable compression ratio with our PA scheme. Unlike the LFSR-based PA algorithm, the processing speed of the FFT-based PA algorithm is mainly limited by the memory transfer rate rather than by the FPGA resources. The processing speed of our PA scheme reaches 116 Mbps, and it is expected to increase with faster memory (e.g., DDR4 SDRAM).
Conclusions and Outlook
This paper presents a high-speed FFT-based PA hardware scheme and its implementation on FPGA. The scheme is verified on a Virtex-6 FPGA, and its processing speed reaches 116 Mbps. Compared with other work, the proposed PA scheme supports a wide-range variable compression ratio and can reach higher processing speeds with faster memory. The optimizations proposed in this paper can also improve FFT-based PA algorithms on other platforms, such as CPUs and GPUs. In the future, we will study the relationship between the precision of the FFT and the security of PA, and try to replace the floating-point FFT with a fixed-point FFT in the PA algorithm to reduce resource consumption.
