The FDFM (Few DSP slices and Few block Memories) approach is an efficient approach which implements a processor core executing a particular algorithm using few DSP slices and few block RAMs in a single FPGA. Since a processor core based on the FDFM approach uses few hardware resources, hundreds of processor cores working in parallel can be implemented in an FPGA. The main contribution of this paper is to develop a processor core that executes Euclidean algorithm computing the GCD (Greatest Common Divisor) of two large numbers in an FPGA. This processor core that we call GCD processor core uses only one DSP slice and one block RAM, and 1280 GCD processors can be implemented in a Xilinx Virtex-7 family FPGA XC7VX485T-2. The experimental results show that the performance of this FPGA implementation using 1280 GCD processor cores is 0.0904µs per one GCD computation for two 1024-bit integers. Quite surprisingly, it is 3.8 times faster than the best GPU implementation and 316 times faster than a sequential implementation on the Intel Xeon CPU.
Introduction
The GCD (Greatest Common Divisor) computation is widely used in computer systems for cryptography, data security and other important algorithms. Most of the time of these computer systems is consumed for computing the GCDs of very large integers. Therefore, it is an important task of accelerating the GCD computation. However, arithmetic operations on integers numbers exceeding 64 bits cannot be performed directly by a conventional 64-bit CPUs as its instruction set support integers of at most 64-bit in length. It is an efficient way to implement the arithmetic operations on large integers using hardware device such as FPGA, VLSI or GPU.
The FPGA (Field-Programmable Gate Array) is an integrated circuit designed to be configured by a designer after manufacturing. It contains an array of programmable logic blocks called CLB (Configurable Logic Block), and the reconfigurable interconnects allow the blocks to be inter-wired in different configurations. Recent FPGAs have embedded DSP48E1 slices and block RAMs. The Xilinx Virtex-7 series FPGAs have DSP slices equipped with a multiplier, adders, logic operators, etc [1] . More specifically, the DSP slice has a two-input multiplier followed by multiplexers and a three input adder/subtractor/accumulator. The DSP slice also has pipeline registers between operators to reduce the propagation time. The block RAM is an embedded memory supporting synchronized read and write operations, and can be configured as a 36Kbit or two 18Kbit dual port RAMs [2] . They are widely used in consumer and industrial products for accelerating processor intensive algorithms [3, 4, 5, 6, 7, 8, 9] . Since the continuing decline in the ratio of FPGA price to performance and its programmable features, FPGA is suitable for a hardware implementation of general purpose computing. The main contribution of this paper is to present an efficient processor core that executes the Euclidean algorithm computing the GCD of two large integers using an FPGA. The proposed processor core is designed based on the FDFM (Few DSP slices and Few block Memories) approach [10] . The key idea of the FDFM approach is to use few DSP slices and few block RAMs for constituting a processor core. We must note that the FDFM approach has some advantages. First, despite the main circuit occupies most of hardware resources of the FPGA, we can also implement the necessary hardware algorithm in the FPGA using remaining few resources as shown in Figure 1 (1) . On the other hand, we can implement multiple FDFM processors working in parallel if enough hardware resources are available as illustrated in Figure 1 (2) . In this paper, we also employ the FDFM approach to implement parallel GCD computation in the FPGA. For example, in this paper, we propose a processor core for GCD computation of 1024-bit, 2048-bit, 4096-bit, and 8192-bit integers, that uses only one DSP slice and one block RAM. We implement one processor core in the FPGA, and the frequency of the FPGA is over 380MHz, that is extremely high. If only one proposed GCD processor core is implemented in the FPGA for computing one GCD of 1024-bit, 2048-bit, 4096-bit, and 8192-bit integers, it takes 73.12µs, 253.35µs, 915.78µs, and 3614.91µs, respectively. In other words, single GCD processor core has competitive performance. Since the proposed GCD processor core uses very few resources of FPGA, we can implement more than one thousand identical processor cores in an FPGA, that all processor core work are paralleled to execute bulk GCD computation. The pairwise GCD computation that computes all pairs of integers in a set, can be used to evaluate the performance of the implementation of thousand processor cores. One of the applications for benchmarking pairwise GCD computation is breaking weak RSA keys. RSA [11] is one of the most well-known public-key cryptosystems widely used for secure data transfer. RSA cryptosystem has an encryption key open to the public. An encryption key includes a modulus n called an RSA modulus such that n = pq for two distinct large prime numbers p and q.
If the values of p and q are available, the encrypted message can be easily converted to the original message. Thus, the safety of RSA cryptosystem relies on the difficulty of factoring RSA modulus n of two large prime numbers p and q. Suppose that we have a set of many RSA encryption keys collected from the Web. If some of RSA moduli in encryption keys are generated by inappropriate implementation of a random prime number generator, they may reuse the same prime number. We call the keys sharing a prime number as weak RSA keys. If two RSA moduli share a prime number, they can be decomposed by computing the GCD of these two moduli. It is well known that the GCD can be computed very easily by Euclidean algorithms [12] . Hence, we can compute the GCDs of all pairs of RSA moduli in the Web to find the RSA keys that shraing the same prime number. In this paper, pairwise GCD computation for RSA moduli is used to measure the performance of the proposed GCD processor core based on FDFM approach. We have succeeded in implementing 1792 GCD processor cores in a Xilinx Virtex-7 family FPGA XC7VX485T-2. However, when the circuit of 1792 GCD processor cores is operated on the FPGA device, this circuit becomes unstable because the number of used resources of FPGA is too close to the maximum available resourses. Finally, we implement 1280 GCD processor cores in the FPGA, that compute the GCDs of all pairs of RSA moduli that are stored in an off-chip DDR3 memory MT8JTF12864HZ-1G6G1. Our implementation of 1280 GCD processor cores computes one GCD of two 1024-bit RSA moduli in expected 0.0904µs.
Several hardware implementations for computing the GCD on FPGAs have been presented [13, 14] . However, they just implemented Binary Euclidean algorithm to compute the GCD using programmable logic blocks as it is. Hence, they can support the GCD computation for numbers with very few bits. On the other hand, several previously published papers have presented GPU implementations of Binary Euclidean algorithm in CUDA-enabled GPUs. Fujimoto [15] has implemented Binary Euclidean algorithm using CUDA and evaluated the performance on GeForce GTX285 GPU. The experimental results show that the GCDs for 131072 pairs of 1024-bit numbers can be computed in 1.431932 seconds. Hence, his implementation runs 10.9µs per one 1024-bit GCD computation. Scharfglass et al. [16] have presented a GPU implementation of Binary Euclidean algorithm. It performs the GCD computation of all 199990000 pairs of 20000 RSA moduli with 1024 bits in 2005.09 seconds using GeForce GTX 480 GPU. Thus, their implementation performs each 1024-bit GCD computation in 10.02µs. Later, White [17] has showed that the same computation can be performed in 63.0 seconds on Tesla K20Xm. It follows that it computes each 1024-bit GCD in 3.15µs. Quite recently, Fujita et al. have presented new Euclidean algorithm called Approximate Euclidean algorithm and implemented it in the GPU [18] . Approximate Euclidean algorithm performs perform each 1024-bit GCD computation in 0.346µs on GeForce GTX 780Ti and 28.6µs on Intel Xeon X7460 (2.66GHz) CPU. Our implementation of 1280 GCD processor cores in Xilinx VC707 evaluation board [19] equipped with FPGA XC7VX485T-2 performs one 1024-bit GCD computation in 0.0904µs which is 3.8 times faster than the GPU and 316 times faster than the CPU. This paper is organized as follows. We first review several Euclidean algorithms in Section 2. Then, we show the implementation of GCD processor core in Section 3. Section 4 shows the architecture of parallel GCD computation using multiple GCD processor cores, that compute the GCD of all pairs of RSA moduli stored in an off-chip DDR3 memory. We show the experimental results in Section 5. Section 6 concludes our work.
Euclidean Algorithms for computing GCD
This section review classical Euclidean algorithm and Fast Binary Euclidean algorithm for computing the GCD of two numbers X and Y . We then show Hardware Binary Euclidean algorithm by modifying Fast Binary Euclidean algorithm, that is implemented in an FPGA.
Let GCD(X,Y ) denote the GCD of X and Y . For any odd integer X and even integer Y , GCD(X, Y ) = GCD(X, Y 2 ) holds. Also, for any even integers X and Y , GCD(X, Y ) = 2 · GCD(
holds, and so we can obtain a factor of 2 in the GCD of X and Y very easily.
For simplicity, we assume that both inputs X and Y are odd and X ≥ Y holds. Based on the fact, it should have no difficulty to modify all GCD algorithms shown in this paper to handle even input numbers. Let swap(X, Y ) denote a function to exchange the values of X and Y . We can write a standard Euclidean algorithm for computing the GCD of X and Y as follows:
Since X ≥ Y holds, modulo computation is performed and X will store the value of X mod Y , which is less than Y . After that, swap(X, Y ) is executed and X > Y always holds. The same operation is repeated until Y = 0 and X stores the GCD of input integers X and Y . However, modulo computation used in Original Euclidean algorithm is costly. So, Binary Euclidean algorithm which does not execute it, is often used to compute the GCD efficiently: in each iteration of the do-while loop. We can reduce the number of iterations of the do-while loop by removing consecutive 0 bits. Let rshift(X) be a function returning the number obtained by removing consecutive 0 bits from the least significant bit of X. For example, if X = 11010100 in binary system, then rshift(X) = 110101 in binary notation. Using swap and rshift functions, we can write the Fast Binary Euclidean algorithm as follows:
In each iteration of the do-while loop, at least one 0 bit is removed from X (or Y ). Hence, for any input numbers, the number of iteration of the do-while loop in Fast Binary Euclidean algorithm is no larger than that in the Binary Euclidean algorithm. However, we need to read all bits of X and Y to exchange them if we implement function swap as it is. Also, rshift function needs a large barrel shifter. Hence, we should avoid direct implementations of these functions in the FPGA. Instead of function rshift(X), we implement function rshift k (X), which removes at most k consecutive 0 bits from the least significant bit of X. In other words, if X has at most k consecutive 0 bits from the least significant bit, all of them can be removed in one iteration of do-while loop by executing rshift k (X). If X has more than k consecutive 0 bits, then k 0 bits from the least significant bit are removed, and rshift k (X) is repeated until X is odd. For example, rshift 2 (1101,1000)=11,0110 and rshift 2 (1101,1010)=110,1101 hold. Using rshift k , we can describe the Hardware Euclideanbased GCD algorithm as follows:
Note that operation rshift k may return an even number. Hence, one of X or Y can be an even number. If this is the case, either X ←rshift k (X) or Y ←rshift k (Y ) is executed until both of them are odd. Hence, both X and Y are odd, whenever rshift k (X − Y ) is executed. Thus, the argument of rshift k is always even and the least significant bit is 0 when it is executed. Table 1 shows the average number of iterations of the do-while loop 1024-bit RSA moduli for each values of k of rshift k . Note that k = ∞ corresponds to Fast Binary Euclidean algorithm, which performs rshift function that removes all consecutive 0 bits. Clearly, the number of iterations is smaller for large k. In the preliminary version of this paper [20] , we have implemented a barrel shifter using CLBs (Configurable Logic Blocks) to compute rshift k . To balance the computing time and the used hardware resources, we have selected k = 4. Since the CLBs are costly to compute rshift k for larger k, we use a multiplier embedded in DSP slice to compute rshift k for k = 17. Hence, we can reduce the number of CLBs and implement more GCD processor cores in an FPGA. Since the subtraction of two very large numbers X and Y returning a result which has more than 17 consecutive 0 bits from the least significant bit is a very rare case, rshift 17 of our implementation and ideal rshift k (k = ∞) of the Fast Binary Euclidean algorithm has almost the same number of iterations as shown in the table. Also, the number of iterations of k = 17 is less than that of k = 4. 
A GCD processor core for large integers
This section shows a GCD processor core, which computes the GCD of two very large numbers based on the Hardware Binary Euclidean algorithm. Our GCD processor uses only one 18k-bit block RAM and one DSP slice in the FPGA. The 18k-bit block RAM is configured as a simple dualport memory [2] with ports A and B of width 36 bits and 18 bits, respectively. Reading: Since port A of the block RAM is configured as read-only 36-bit mode, the block RAM is a 512×36-bit memory for port A. We can read 36-bit data X i Y i (0 ≤ i ≤ 56) from address i using port A for performing the operation rshift 17 
Writing: Since port B is configured as write-only 18-bit mode, the block RAM is a 1024×18-bit memory for port B. We can write 18-bit data X i in address 2i or and Y i in address 2i+1 (0 ≤ i ≤ 56) using port B. In other words, the result of rshift 17 (X − Y ) (or rshift 17 (Y − X) ) can be overwitten to X (or Y ). The DSP slice in our GCD processor core uses a pre-adder, a multiplier and a three input ALU (Arithmetic Logic Unit) as illustrated in Figure 3 . Suppose that X ≥ Y holds. We briefly show how to use the DSP slice for executing the operation rshift 17 (X − Y ) of Hardware Binary Euclidean algorithm. The 36-bit data X i Y i (0 ≤ i ≤ 56) is read from the block RAM one by one, and is connected to the pre-adder of DSP slice. The operation X −Y needs to be executed from the least significant bit of large numbers X and Y . Thus, the pre-adder is used to compute X i − Y i for each 36-bit data X i Y i one by one from X 0 Y 0 . Since X i −Y i is computed one by one, if X 0 −Y 0 < 0 holds, we need to borrow from the higher bit which is in the next word X 1 − Y 1 . In other words, X 1 − Y 1 − 1 needs to be computed for 36-bit data X 1 Y 1 , and we call −1 borrow. Let b 0 denote the borrow from X 0 − Y 0 , and let
. However, we can not compute the borrow using the pre-adder because it has only two input ports. Thus, we first perform the shift operation to remove the consecutive 0 bits from the least significant bit using the multiplier. The multiplier performs the operation
one by one, where n(1 ≤ n ≤ 17) is the number of consecutive 0 bits from the least significant bit of X − Y . If X − Y has more than 17 consecutive 0 bits from the least significant bit, n has the value 17. For example, if X 0 − Y 0 = (11, 0010, 1000, 0000, 0000), that is, X − Y has 11 consecutive 0 bits from the least significant bit. The multiplier computes (X 0 − Y 0 ) × 2 17−11 = (1100, 1010, 0000, 0000, 0000, 0000). We note that the 11 bits consecutive 0 bits are on the right of the 17-th bit of (X 0 − Y 0 ) × 2 6 . For other n(1 ≤ n ≤ 17), the n bits consecutive 0 bits are also on the right of the 17-th bit of (X 0 − Y 0 ) × 2 17−n . We use this feature to remove the consecutive 0 bits in the following. Next, since we suppose that X ≥ Y holds, ALU computes
words, the computation of (X − Y ) and (Y − X) can be switched dynamically by controlling the behavior of ALU, and the borrow is computed after the shift operation using the ALU. For example, suppose that X ≥ Y and X 0 < Y 0 hold, and X 0 − Y 0 = (11, 0010, 1000, 0000, 0000). The multiplier computes (X 0 − Y 0 ) × 2 17−11 = (1100, 1010, 0000, 0000, 0000, 0000), where the 11 consecutive 0 bits are all on the right of the 17-th bit of (X 0 − Y 0 ) × 2 6 as we shown above. Then, ALU outputs
as it is. To remove 11 consecutive 0 bits from the least significant bit of X − Y , we retain the higher 18 − 11 = 7 bits from the 17-th bit of (X 0 − Y 0 ) × 2 6 , which is (110,0101). On the other hand, suppose that X 1 − Y 1 = (01, 1011, 0100, 1011, 0100), the multiplier also computes (X 1 − Y 1 ) × 2 6 = (0110, 1101, 0010, 1101, 0000, 0000). Since there is a borrow from X 0 − Y 0 , ALU computes (X 1 − Y 1 ) × 2 6 − 1 = (0110, 1101, 0010, 1100, 1111, 1111). We note that the higher 18 bits of (
Since only 7 bits of X 0 − Y 0 are retained, we need to pick up 11 bits from the least significant bit of X 1 − Y 1 − 1 to restructure the first 18-bit word of rshift 17 (X − Y ). Hence, we pick up 11 consecutive bits on the right of the 17-th bit of (X 1 − Y 1 ) × 2 6 − 1, which is (100,1011,0011). Then, we concatenate 11 bits data (100,1011,0011) of X 1 − Y 1 − 1 with 7 bits data (110,0101) of X 0 − Y 0 to restructure the first word of rshift 17 (X − Y ), which is (10,0101,1001,1110,0101). Also, the other words of rshift 17 (X − Y ) can be obtained in the same way. The configuration of DSP slice is described as follows:
Pre-adder: The pre-adder of DSP slice has 25-bit port D and 30-bit port A. The 36-bit output of the block RAM are connected to the pre-adder via a pipeline register. X is given to port D, and Y is given to port A. The remaining bits of the ports are padded with 0. The pre-adder of DSP slice can compute D − A, A and D by controlling its behavior, in other words, the pre-adder outputs X − Y , Y or X optionally. For example, to perform the operation X − Y , the subtraction X i − Y i is performed for each 36-bit data X i Y i one by one, and the output of pre-adder is connected to the multiplier.
Multiplier: The embedded multiplier has two input ports, where one accepts an 18-bit two's complement operand from port B via a pipeline register, the other one accepts an 25-bit two's complement operand from the pre-adder via a pipeline register. We use the multiplier to perform the multiplication between the result of pre-adder and value of port B, where B has the value 2 k (0 ≤ k ≤ 17) in our implementation. Thus, the operations (X i − Y i ) × B, X i × B, and Y i × B can be executed using multiplier. In other words, shift operation can be executed for X − Y , X, and Y . Hence, we do not need a barrel shifter which is used in the preliminary version of this paper [20] . The output of multiplier is connected to ALU(Arithmetic logic unit) via a pipeline register M as shown in the Figure 3 .
ALU:
The ALU (Arithmetic Logic Unit) has three input ports, that are connected to register M , input port C of DSP slice, and port CIN , respectively. The most significant bit of register P , the negation of the most significant bit of register P and port CARRY IN are connected to port CIN of ALU. Port CIN can select one of the three values by controlling its behavior. The ALU can performs several operations such as M + C + CIN and −M − C − CIN − 1. In our implementation, C is configured as the value −1. Since M is connected to the output of multiplier, we can control the behavior of the ALU dynamically for computing
where CIN is used as the borrow corresponding to the subtraction of previous 36-bit data X i−1 Y i−1 . In the preliminary verison of this paper [20] , to dynamically compute X − Y and Y − X, we exploit two multiplexers by configuring connecting both X and Y , where the multiplexers is implemented using logic resources of FPGA. Hence, the used FPGA resources of GCD processor core proposed in this paper has decreased. The value computed by ALU is then connected to register P .
Pattern detector:
The pattern detector can determine that the value of register P matches a pattern or not, as qualified by a mask. The mask is used as enable signals for pattern detector. More specifically, if a certain bit of mask is set to "0", the corresponding bit of P AT T ERN and P is compared. Otherwise, the comparison of the corresponding bits is not performed. The value of port P AT T ERN is configured as 0.
Using the block RAM and the functionality of DSP slice, we can perform Hardware Binary Euclidean algorithm without fabric barrel shifter and multiplexers that are used in the preliminary verison of this paper. We show how each operation in Hardware Binary Euclidean algorithm can be performed. Let x 1023 x 1022 · · · x 0 denote 1024 bits representing X such that x 17 x 16 · · · x 0 represents X 0 . Similarly, let y 1023 y 1022 · · · y 0 denote 1024 bits representing Y such that y 17 y 16 · · · y 0 represents Y 0 .
X is even: The number X is write to the block RAM word by word. Thus, the condition can be determined by reading the least significant bit of X 0 when X is input into the block RAM. X ← rshift 17 (X): If X is even, function rshift 17 (X) is executed to remove the consecutive 0 bits from the least significant bit of X. Suppose that we need to compute Z=rshift 17 (X). Let Z 56 Z 55 · · · Z 0 denote 57 words representing Z and show how rshift 17 (X) is computed as the flow shown in Figure 4 . All words of X are sequentially read from the block RAM beginning with X 0 and then processed one by one in a pipelined order. X i (0 ≤ i ≤ 56) is given to the pre-adder of DSP slice. The pre-adder outputs X i as it is. Also, we must obtain the number of consecutive 0 bits from the least significant bit of X 0 to execute shift operation using the multiplier. Let δ = δ 17 δ 16 · · · δ 0 denote the result of logic prefix-or operation of X 0 . The operation δ i ← x i ∨ δ i−1 (1 ≤ i ≤ 17) is performed, where δ 0 = x 0 = 0 since X is even. For example, suppose that X 0 = (11, 0010, 1011, 1011, 0000) , where the number n of consecutive 0 bits of X is 4. We have δ = (11, 1111, 1111, 1111, 0000). Note that except the consecutive 0 bits from the least significant bit, the other bits all have the value 1. Let λ = λ 17 λ 16 · · · λ 0 denote the result of exclusive-or operation of δ. The operation
00, 0000, 0000, 0001, 0000) holds. The only one bit that has the value 1 indicates that there are 4 consecutive 0 bits from the least significant bit of X 0 . Then, the inverse of λ which has the value (00, 0010, 0000, 0000, 0000), is configured as the value of port B to perform shift operation using the multiplier of DSP slice. We note that if X has n(0 < n ≤ 17) consecutive 0 bits, the value of B will be 2 17−n . Otherwise, B = 2 0 holds. In the case of executing operation X ← rshift 17 (X), pre-adder directly outputs X 0 to the multiplier. The product of X 0 × 2 17−n is then computed by the multiplier. Similarly, for other words of X, X i × 2 17−n (1 < i ≤ 56) are also computed one by one in the same way. We note that the consecutive 0 bits of X 0 are always on the right of 17-th bit from the least significant bit of X 0 × 2 17−n . In the example above, since n = 4 holds, X 0 × 2 17−4 = (110, 0101, 0111, 0110, 0000, 0000, 0000, 0000), where the 4 consecutive 0 bits from the least significant bit of X 0 are all on the right of 17-th bit of X 0 × 2 13 . The resulting value of X 0 × 2
13
is then transferred to ALU via register M . The ALU outputs M + C + CIN , where port C is configured as a constant -1. CIN is used as borrow of subtraction of X − Y which is not needed for executing rshift 17 (X), thus CIN is set to 1. Therefore, ALU outputs the resulting value X 0 × 2 13 to register P . We then retain higher 18−4 = 14 bits from 17-th bit of P , that is (11,0010,1011,1011). In other words, the 4 consecutive 0 bits from the least significant bit of X 0 are removed. Since 4 consecutive 0 bits are removed from X 0 , we must pick n bits from its next word X 1 of X to restructure the new word Z 0 of Z = rshift 17 (X). Suppose that X 1 = (01, 1101, 0010, 0011, 1011), the same operation is performed for X 1 , and X 1 × 2 17−4 = (011, 1010, 0100, 0111, 0110, 0000, 0000, 0000) will be stored in register P in the next clock cycle since the architecture is pipelined. Similarly, 4 bits from the least significant bit of X 1 are also on the right of 17-th bit of P . Thus, we can easily pick 4 bits from the least significant bit of X 1 that are store on the right of 17-th bit of P , and then concatente with retained 14 bits of X 0 to restruct the new word Z 0 = (10, 1111, 0010, 1011, 1011) of Z = rshift 17 (X). As shown in Figure 4 , since X 56 X 55 · · · X 0 are input one by one, Z 56 Z 55 · · · Z 0 can be computed one by one and then transferred to the block RAM to overwrite the old X. We say that X ← rshift 17 (X) is executed such that n consecutive 0 bits from the least significant bit of X are removed. If input X has more than 17 consecutive 0 bits from the least significant bit, the function rshift 17 (X) is repeated until X is odd. Also, if input Y is even, the same operation is performed for Y .
X ≥ Y :
The condition X ≥ Y can be determined by comparing X and Y from the most significant bit. More specifically, X and Y are compared from the words X 56 and Y 56 . The words X 56 Y 56 are read from the block RAM concurrently, then are connected to port D and A of DSP slice, respectively. We always assume that X ≥ Y holds, thus, the pre-adder computes X 56 − Y 56 , and the resulting value is input to multiplier. The port B is configured as 2 17 in this case. Thus, multiplier computes (X 56 − Y 56 ) × 2 17 . However, since B is 18-bit two's complement, the most significant bit of B is sign bit. Hence, if B = 2 17 , the operation (X 56 − Y 56 ) × (−2 17 ) is computed by multiplier, and the resulting value is then transferred to ALU. The ALU outputs the value to register P as it is. Clearly, the value of X 56 − Y 56 is left shifted by 17 bits, and is stored in register P from 34-th bit to 17-th bit. If X 56 > Y 56 holds, the most significant bit of P have the value 1 since (X 56 − Y 56 ) × (−2 17 ) is computed by the multiplier. We determine the condition X ≥ Y if the most significant bit P [47] of P has the value 0. However, the value X 56 − Y 56 may be 0 if X 56 = Y 56 holds. Thus, we use the pattern detector to determine that 18 bits in P [34:17] of register P are all 0 or not. If X 56 = Y 56 , P [34 : 17] = 0 holds and the detector outputs the value 1. We need to compare the next words X 55 Y 55 to determine the condition X ≥ Y . It takes 3 clock cycles to determine the condition X 56 = Y 56 from the words X 56 Y 56 are input to the DSP slice, because three-stage pipeline registers are used as shown in Figure 3 . And in most of cases, we can determine the condition X ≥ Y by comparing the words X 56 Y 56 . Hence, we start to execute the operation rshift 17 (X − Y ) one clock cycle after the words X 56 Y 56 are input to DSP slice. More specifically, we start the execution of rshift 17 (X − Y ) from words X 0 Y 0 without waiting the determination of the condition X ≥ Y , which we will show in operation X ← rshift 17 (X − Y ). If X 56 = Y 56 is determined after 3 clock cycles, we terminate the execution of rshift 17 (X − Y ), and restart to compare the next words X 55 Y 55 to determine the condition X ≥ Y . In the same way, (X 1 − Y 1 ) × 2 17−n + CIN − 1 is computed by ALU in the next clock cycle, We select the value of CIN as the negation of P [47] as the borrow from
−n − 1 is computed. Next, we briefly show how to obtain the word Z 0 of Z = rshift 17 (X − Y ) is computed as shown in Figure 5 . Suppose that X 0 ≥ Y 0 holds. Since the result of X 0 − Y 0 is shifted by 17 − n bits and stored in P , the n consecutive 0 bits from the least significant bit of X 0 − Y 0 are on the right of 17-th bit of P . Hence, we retain 18 − n bits on the left of 17-th bit of P to store in a register. In other words, the n consecutive 0 bits from the least significant bit of X 0 − Y 0 stored on the right of 17-th bit of P are removed. Also, X 1 − Y 1 is shifted by 17 − n bits and stored in P . Similarly, the n bits from the least significant bit of X 1 − Y 1 are stored on the right of 17-th bit of P . Then, we can easily pick up n bits from the least significant bit of X 1 − Y 1 to concatente with higher 18 − n bits of X 0 − Y 0 to restruct the new word Z 0 as shown in Figure 5 . The same operation is executed for all words X i Y i (0 ≤ i ≤ 56) in a pipelined order. Hence, the words Z 56 Z 55 · · · Z 0 can be obtained one by one and are then written back to the block RAM to overwrite the old X. X = 0: We use a register to store the current number of bits of X. If operation X ← rshift 17 (X) or X ← rshift 17 (X − Y ) is executed, we rewrite the value of this register. We determine the condition X = 0 if the number of bits of X is not 0.
Let us briefly confirm that the GCD processor core can execute Hardware Binary Euclidean algorithm. By controlling the behavior of pre-adder, multiplier and ALU of DSP slice, we can compute rshift 17 (X − Y ), rshift 17 (Y − X), rshift 17 (X) and rshift 17 (Y ) without multiplexers and barrel shifter that use resources of FPGA. The resulting value can be written to the block RAM to overwrite X or Y . The conditions "X is even" and "Y is even" can be determined when X 0 and Y 0 are written in the block RAM. The condition "X ≥ Y " can be determined by checking X and Y from the MSB (Most Significant Bit). More specifically, if X 56 > Y 56 holds, "X ≥ Y " is determined. We execute the rshift 17 (X − Y ) without waiting the determination of the condition "X ≥ Y ", because the condition "X ≥ Y " can be determined by comparing the words X 56 and Y 56 in most of the cases. However, if X 56 = Y 56 , we terminate the execution of rshift 17 (X − Y ), and then read and compare X 55 with Y 55 . During the computation of Hardware Binary Euclidean algorithm, the number of bits of X and Y is decreased. For example, if X 56 and Y 56 both decrease to 0, the next iteration of the do-while loop of Hardware Binary Euclidean algorithm is only performed for words X i Y i (0 ≤ i < 56). We use registers to store the current numbers of bits of X and Y . If the number of bits is 0, we terminate the algorithm.
Implementation of Hierachical GCD cluster with DDR3 Memory
This section shows a hierarchical parallel architecture based on the hierarchical GCD cluster [20] using an off-chip DDR3 memory equipped in Xilinx VC707 evaluation board [19] . The proposed GCD processor core is compactly designed based on the FDFM approach. We use only one DSP slice, one block RAM and a few CLBs to implement the processor core. Therefore, single proposed FDFM GCD processor core is clocked at high frenquency and provides high performance that we show in the next section. On the other hand, by employing multiple proposed FDFM GCD processors, the computing time reduces considerably. Since the proposed GCD processor is designed based on the FDFM approach and uses very few FPGA resources, we have succeeded in implementing more than one thousand proposed GCD processor cores working in parallel in the FPGA, thus, it makes sense to use multiple servers. Each server controls more than one hundred GCD processor cores. The hierarchical GCD cluster consists of multiple GCD clusters, each of which involves multiple GCD processor cores as illustrated in Figure 6 . A single central server controls local servers, each of which maintains GCD processor cores in the same GCD cluster. We show how the hierarchical GCD cluster is used to execute pairwise GCD computation for RSA moduli. The DDR3 memory consists of 8 banks. Each bank has a memory array that can be used to store lots of moduli. Suppose that we have a lot of moduli collected from the Web and all moduli are divided into two sets. We store two sets of moduli to two different banks of DDR3 memory for simplifying the address/control circuit. Our goal is to compute all pairs of moduli using the hierarchical GCD cluster in an FPGA. For this purpose, we partition all moduli of each set into groups with m moduli each. FPGA picks one group from each set and sends them to the central server, respectively. Let N = {n 0 , n 1 , . . . n m−1 } and
} denote two groups of m moduli each that the central server in the FPGA has received. The hierarchical GCD cluster computes gcd(n i , n ′ j ) for all pairs of i and j (0 ≤ i, j ≤ m − 1), and reports the GCDs larger than 1. Next, we will show how the hierarchical GCD cluster computes the GCDs of N and N ′ using GCD clusters. Each group of m moduli is partitioned into b blocks of k moduli each, where m = bk.
be two sets of k moduli in the i-th groups of sets N and N ′ , respectively. Each cluster is assigned a task to compute the GCDs of all pairs X (∈ N i ) and Y (∈ N ′ j ) for a pair i and j (0 ≤ i, j ≤ b − 1). For this purpose, all moduli in N i and in N ′ j are copied from the block RAM in the central server to that in the local server of a GCD cluster. After the local server receives all moduli, the cluster starts computing the GCDs of all pairs X (∈ N i ) and Y (∈ N ′ j ). The local server then picks a pair X and Y and copies them to the block RAM of a GCD processor. Upon completion of the copy, the GCD processor starts computing the GCD of X and Y by the Hardware Binary Euclidean algorithm. This procedure is repeated for all GCD processors. If a GCD processor terminates the GCD computation, the local server sends a new pair to it. In this way, the GCDs of all pairs in N i and N 
Implementation results in the FPGA
We have implemented a GCD processor core for computing the GCD of 1024-bit, 2048-bit, 4096-bit, and 8192-bit integers in Xilinx Virtex-7 XC7VX485T-2. Table 2 shows the implementation results.
Slice Registers and Slice LUTs (Look-Up-Tables) are hardware resources in CLB (Configurable Logic Block) [21] , which are used to implement sequential logics. The proposed GCD processor is compactly designed based on FDFM approach. More specifically, we use only one DSP slice to perform subtraction and shift operation for very large numbers and use one block RAM to store the computed result instead of using lots of CLBs. Therefore, the proposed FDFM GCD processor is clocked at over 380MHz and provides a high performance. Calculated simply, single proposed FDFM GCD processor core computes one GCD of two 1024-bit, 2048-bit, 4096-bit and 8192-bit moduli in expected 73.12µs, 253.35µs, 915.78µs and 3614.91µs.
Recall that we control the behavior of the embedded ALU of the DSP slice to perform X − Y or Y − X dynamically instead of two multiplexers used in our previous work [20] . Also, we use the embedded multiplier of the DSP slice to perform the shift operation instead of the barrel shifter used in our previous work, where the barrel shifter uses a lot of FPGA logic resources. Since these mechanisms simplify the circuit of the proposed processor, the frequency of the proposed FDFM processor is over 380MHz that is higher than that of the implementation of our previous work. First, the simulation of pairwise GCD computation for 1024-bit RSA moduli without DDR3 memory is performed. In our implementation, a GCD cluster with a local cluster with eight 18k-bit block RAMs and 128 GCD processor cores are used. Since four 18k-bit block RAMs can store ⌊ Table 3 shows the implementation results of clusters of our work. Since a cluster server uses eight 18k-bit block RAMs, each GCD cluster with 128 GCD processors involves 128 + 8 = 136 block RAMs. In this paper, the implementation of the hierarchical GCD cluster with 14 GCD clusters and the central server, uses 14 · 128 = 1792 DSP slices and 14 · 136 + 64 = 1968 block RAMs. Due to the overhead for the connection between the central server and GCD clusters, the clock frequency is decreased to 207.04MHz. The used block RAMs of the implementation with 14 clusters are close to the available number. Since the proposed GCD processor core is compactly designed, the number of processor cores in our implementation is more than that of the preliminary verison of this paper [20] . We have evaluated the number of clock cycles to compute all GCDs of 71 × 71 = 5041 pairs of 1024-bit moduli by one GCD cluster. For this purpose, we have used RSA moduli generated by OpenSSL Toolkit. By performing the simulation, one cluster with 128 processors takes 1157789 clock cycles to compute the GCDs of 5041 pairs. If a GCD cluster is clocked at 207.04MHz as shown in Table 3 , the expected computing time is 1157789/207.04MHz = 5.592ms. Also, it takes about 71 × 2 × 57 = 8094 clock cycles to transfer a pair of two blocks involving 71 moduli each and this overhead is negligible. Since up to 14 clusters can be implemented theoretically, we can expect that the GCDs of 5041 × 14 = 70574 pairs can be computed in the same time. Therefore, we say that one GCD can be computed in expected 5.592ms/70574 = 0.0792µs. Next, for measuring the performance of GCD computation accurately, we implement the hierarchical GCD cluster to compute all pairs of moduli stored in an off-chip DDR3 memory MT8JTF12864HZ-1G6G1 [22] equipped in VC707 evaluation board [19] . Unfortunately, if the used resources of FPGA is close to the available number, the circuit of FPGA becomes unstable and can not compute the results correctly when it is actually operated in the evaluation board. According to the experimental results, 10 clusters can be implemented in the FPGA clocked at 250MHz for pairwise GCD computation of 1024-bit RSA moduli. In other words, 1280 GCD processor cores can be implemented in FPGA XC7VX485T-2 equipped in VC707 evaluation board, and works in parallel to compute GCDs of all pairs of 1024-bit RSA moduli stored in the off-chip DDR3 memory.
We use the built-in CORE Generator software of Xilinx Vivado design suite 2015.1 to generate a DDR3 memory interface core in the FPGA to control the write and read operations of the DDR3 memory. The DDR3 memory consists of 8 banks. Each bank has a 2 14 × 2 10 memory array, of which each element has 64-bit. In other words, each bank of the DDR3 memory can store up to (2 14 × 2 10 × 64bits)/1024bits = 1048576 1024-bit RSA moduli. The DDR3 memory runs in 500MHz that is 2 times faster than the FPGA. Moreover, the DDR3 memory offers high-speed data transfers on the rising and falling edges of the clock of it. Hence, the DDR3 memory can perform 500MHz/250MHz × 2 = 4 times write or read operations in one clock cycle of the FPGA. Hence, we can read 4 × 64bits = 256bits data from DDR3 memory in one clock cycle of the FPGA. Suppose that we have a lot of 1024-bit RSA moduli collected from the Web, we divide all moduli into two sets and store them to two different banks of the DDR3 memory. We partition all moduli of each set into groups with 71 × 8 = 568 moduli each. FPGA picks one group from each set and sends them to the central server, respectively. More specifically, we send read commands to the DDR3 memory interface core for reading a 1024-bit modulus. Then, the interface core performs the read operation of the DDR3 memory and the modulus is transferred to FPGA after a few clock cycles. The obtained 1024-bit modulus is then stored to the block RAMs of central server as 18-bit words in 57 clock cycles, and we read the next 1024-bit modulus at the same time. The same operation is repeated until two groups of 568 moduli are stored in the central server. Moreover, the interface core processes a refresh operation to maintain the data of the DDR3 memory in refresh interval, and other operations of DDR3 memory must wait for the refresh operation. By implementing the hierachical GCD cluster with 1280 processor cores in the FPGA, we have that it takes 7294417 clock cycles compute the GCDs of 568 × 568 = 322624 pairs, where 71646 clock cycles for transferring 568 × 2 1024-bit moduli from DDR3 memory to the central server of the FPGA is included. Comparing with the total clock cycles for computing the GCDs of 322624 pairs of 1024-bit moduli, the clock cycles for transferring moduli from DDR3 memory to central server is negligible. Moreover, after all moduli of central server are transferred to the clusters, we can read the next two groups with 568 moduli each from DDR3 memory while the GCD computation of the clusters is still being performed. In other words, the operation of transferring moduli from DDR3 memory to central server can be overlapped. Hence, we note that the transfer time is not significant. Since the hierachical GCD cluster runs in 250MHz, the computing time is 7294417/250MHz = 29.178ms. Therefore, we say that one GCD can be computed in 29.178ms/322624 = 0.0904µs. For performing pairwise GCD computation of 2048-bit, 4096-bit, and 8192-bit moduli, we have succeeded in implementing the hierachical GCD cluster that has 9 clusters in the FPGA, where the frequency of FPGA is also 250MHz. The implementation results of hierarchical clusters and computing time for one GCD of 1024-bit, 2048-bit, 4096-bit, and 8192-bit moduli is also shown in Table 4 . The hierarchical GCD cluster is designed based on FDFM GCD processors that are compact and use very few FPGA resources. One of the advantage of the FDFM approach is that we can implement multiple FDFM processors working in parallel to reduce the computing time if enough hardware resources are available. Comparing with single FDFM GCD processor core, the computing time of the hierachical GCD cluster for one GCD reduces considerably by employing more than one thousand FDFM GCD processor cores.
According to the implementation results as shown in Table 4 , the hierachical GCD cluster computes one GCD of two 8192-bit moduli in 4.7895µs, that is 52.98 times slower than the time for computing one GCD of two 1024-bit moduli. We show the reason of the large difference. Since the large input numbers are stored in the block RAM as 18-bit words and processed word by word, if the width of the input numbers increases, the number of iterations of the do-while loop of the Hardware Binary Euclidean algorithm will increase. Also, the clock cycles for performing each iteration of the do-while loop will increase. Hence, the proposed GCD processor takes more clock cycles for computing one GCD of larger numbers. For example, as shown in Table 2 , single proposed processor takes 1381328.5 clock cycles for computing one GCD of two 8192-bit moduli, that is 49.32 times more than that for computing one GCD of two 1024-bit moduli. This is the main reason for the large difference of the computing time for 1024-bit and 8192-bit moduli. Recall that each 1024-bit and 8192-bit modulus is stored in the block RAM as 57 and 456 18-bit words, respectively. Hence, the central server and cluster server take more time for transferring the 8192-bit moduli than that for 1024-bit moduli. However, since the data transfer is overlapped with the GCD computation, the transfer time does not significantly affect the large difference of the computing time for 1024-bit and 8192-bit moduli. Moreover, the number of clusters for 8192-bit moduli is 9, that is less than that for 1024-bit moduli. Based on the reasons above, one GCD of two 8192-bit moduli is computed 52.98 times slower than one GCD of two 1024-bit moduli in the implementation of hierarchical GCD cluster.
Conclusions
We have presented an efficient processor core for computing GCDs of very large numbers. Since the processor is designed based on the FDFM approach, each processor core uses only one DSP slice and one 18k-bit block RAM. We implement the hierarchical GCD cluster with 1280 processor cores in Xilinx FPGA XC7VX485T-2. The implementation with 1280 processor cores executes pairwise GCD computation for 1024-bit RSA moduli stored in an off-chip DDR3 memory on Xilinx VC707 evaluation board. The experimental results shows that our implementation of 1280 GCD processor cores computes one GCD of two 1024-bit RSA moduli in 0.0904µs including the time of data transferring from off-chip DDR3 memory to FPGA. It is 3.8 times faster than the best GPU implementation and 316 times faster than a sequential implementation on the Intel Xeon CPU.
