Abstract-Recent progress in quantum physics shows that quantum computers may be a reality in the not too distant future. Post-quantum cryptography (PQC) refers to cryptographic schemes that are based on hard problems which are believed to be resistant to attacks from quantum computers. The supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol shows promising security properties among various post-quantum cryptosystems that have been proposed. In this paper, we propose two efficient modular multiplication algorithms with special primes that can be used in SIDH key exchange protocol. Hardware architectures for the two proposed algorithms are also proposed. The hardware implementations are provided and compared with the original modular multiplication algorithm. The results show that the proposed finite field multiplier is over 6.79 times faster than the original multiplier in hardware. Moreover, the SIDH hardware/software codesign implementation using the proposed FFM2 hardware is over 31 percent faster than the best SIDH software implementation.
INTRODUCTION
The computing capability of quantum computers is significantly higher than that of classical computers. It has been estimated that a 30-qubit quantum computer would have the same processing power as a conventional computer running at 10 teraflops [1]. In 1994, Shor [2] proposed an algorithm that can quickly factorise large numbers, giving an exponential speedup of the computation. Later, in 1996, Grover's algorithm [3] was proposed to search an unsorted database with a quadratic speedup over a conventional computer (in $O(\sqrt{N})$ time rather than $O(N)$). Recent progress in the design and development of quantum computers shows that real quantum computers may be available in the not too distant future [4].
As a result, commonly used public-key cryptographic algorithms such as RSA [5] and elliptic curve cryptography (ECC) [6], which rely on the integer factorization and discrete logarithm problems and are used throughout today's communications and internet security, will be vulnerable to attacks from quantum computers. Post-quantum cryptography (PQC) [7], or quantum-safe cryptography, refers to conventional non-quantum cryptographic algorithms that are secure today and should remain secure even after practical quantum computing is a reality. Such schemes have been shown to be just as practical as classical RSA and ECC schemes [8]. Recently, NIST [9] and ETSI [10] have held workshops to discuss the importance of quantum-safe cryptography. NIST is currently hosting a standardisation process to select new post-quantum cryptographic signature and encryption schemes [11].
Much research is now being conducted into PQC. Among the various post-quantum techniques, the recent supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol [12] shows promising security properties. The SIDH key exchange scheme offers significantly smaller key sizes than other post-quantum key exchange and encryption counterparts such as the commonly cited lattice-based [13], [14], code-based [15], hash-based [16] and multivariate quadratic [17] cryptography. As SIDH is more than a decade younger than the other types of PQC schemes, little research has been conducted into evaluating its practicality. The supersingular isogeny key encapsulation (SIKE) protocol, which is based on SIDH, was submitted to the NIST post-quantum cryptography process in November 2017 [18].
In 2011, Jao and De Feo presented software implementation results showing that their proposed SIDH key-exchange protocols are over two orders of magnitude faster than classical isogeny-based cryptosystems over ordinary curves. Later, Azarderakhsh et al. implemented the same SIDH protocol on PC (x86-64) and ARM (ARMv7) platforms [19]. Their implementation is 18 to 26 percent faster, depending on the security level. In 2016, Costello et al. presented a high-speed implementation of SIDH, which is more than 2.5 times faster than the previous SIDH software results [20]. The first hardware implementation of SIDH was recently proposed, targeting a Virtex-7 Field Programmable Gate Array (FPGA) [21], and is 1.5 times faster than the best software implementation for the 512-bit SIDH scheme.
Recent studies [22], [23] found that Montgomery reduction is not optimal for the specially structured primes used in isogeny-based cryptography, and that such special moduli can be exploited for faster implementations. In SIDH, the prime $p$ is of the form

$$p = f \cdot 2^a 3^b - 1,$$
where f is a small number. Similar to other public key cryptosystems, the modular multiplication plays a very important role. In [24] , Karmakar et al. proposed an efficient finite field multiplication (EFFM) algorithm, in which the prime field has a special structure.
In this paper, we improve upon EFFM [24] and propose two new algorithms. The first algorithm proposed, referred to as the improved EFFM (FFM1), improved upon the original EFFM by reducing the number of operands. The second new finite field multiplication (FFM2) algorithm proposed is very different from the original EFFM and the FFM1, and allows for larger operand sizes while reducing the number of operations. Both proposed algorithms speed up the computation significantly. Hardware architectures for both proposed algorithms are also proposed. The hardware implementation results are provided and compared with implementations of original EFFM algorithm. This paper is an extension of previous research by the authors in [25] . It improves on [25] as follows: 1) A more detailed description of the two proposed algorithms is given with a brief review of Barrett Division; 2) A new modular multiplication algorithm (FFM2) is proposed; 3) A hardware architecture for the FFM2 algorithm is proposed; 4) The hardware results of the FFM2 algorithm are provided and compared with original EFFM algorithm and the FFM1 algorithm in [25] . The hardware results show that the FFM2 algorithm is the fastest; 5) A complete SIDH hardware/software (HW/SW) codesign result using the proposed FFM2 is provided and compared with the best SIDH software implementation.
The rest of the paper is organized as follows. Section 2 reviews the original EFFM algorithm with the special primes, the Barrett reduction algorithm and the Barrett Division algorithm. The two proposed algorithms are presented in detail in Sections 3 and 4, respectively, where hardware architectures are also provided. Section 5 presents hardware implementation results for both of the proposed modular multiplication algorithms, and provides a comparison with the previously proposed EFFM algorithm. An SIDH HW/SW codesign using the proposed FFM2 is also provided in this section. Section 6 concludes the paper.
REVIEW
[12] to guarantee 128-bit post-quantum security. [24] proposes an algorithm using a special structure of the prime to optimize the modular multiplication and reduction. This algorithm is briefly reviewed in this section.
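Assume that $p = 2 \cdot 2^a 3^b - 1$, where $a$ and $b$ are even. By using a radix $R = 2^{a/2} 3^{b/2}$, a field element $A \in \mathbb{F}_p$ can be represented as follows:

$$A = a_1 R^2 + a_2 R + a_3, \quad a_1 \in \{0, 1\}, \quad a_2, a_3 \in [0, R). \quad (1)$$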
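The field elements are converted into the above radix-$R$ representation once at the start of the algorithm and converted back once at the end. Suppose there are two numbers $A$ and $B$ in the representation of Eq. (1). By multiplying them we can get the product $C$ as follows:

$$C = A \cdot B = a_1 b_1 R^4 + (a_1 b_2 + a_2 b_1) R^3 + (a_1 b_3 + a_2 b_2 + a_3 b_1) R^2 + (a_2 b_3 + a_3 b_2) R + a_3 b_3. \quad (2)$$

Since $2R^2 = p + 1$, the powers $R^2$ and $R^4$ are congruent to $2^{-1}$ and $2^{-2} \pmod p$, respectively, so the terms that depend only on $a_1$ and $b_1$ take one of these values or 0, which can be precomputed for a fixed prime. Therefore, we can get Eq. (3), where 4 multiplication operations are required: $a_2 b_2$, $a_2 b_3$, $a_3 b_2$ and $a_3 b_3$. The other products, which are multiplied by either $a_1$ or $b_1$, can be computed by simply selecting the correct result. Therefore, $C$ can be rewritten as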
As $c_2$ and $c_3$ are in the range $[0, R^2)$, they need to be further reduced by improved Barrett reduction.
The original EFFM algorithm is shown in Algorithm 1, where the Barrett Division is the improved Barrett reduction used in [24] .
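To make the radix representation concrete, the following is a minimal Python sketch. It assumes the three-digit form $A = a_1 R^2 + a_2 R + a_3$ with $a_1 \in \{0, 1\}$, which follows from $p + 1 = 2R^2$. The helper names `to_digits` and `from_digits` and the toy parameters $a = b = 2$ are illustrative choices, not taken from [24].

```python
def to_digits(A, R):
    """Split A into (a1, a2, a3) with A = a1*R**2 + a2*R + a3 and 0 <= a2, a3 < R."""
    a1, rest = divmod(A, R * R)
    a2, a3 = divmod(rest, R)
    return a1, a2, a3


def from_digits(a1, a2, a3, R):
    """Inverse of to_digits: rebuild the integer from its three digits."""
    return a1 * R * R + a2 * R + a3


# Toy parameters (assumption: chosen only so that p is a small prime).
a, b = 2, 2
R = 2 ** (a // 2) * 3 ** (b // 2)   # R = 6
p = 2 * 2 ** a * 3 ** b - 1         # p = 71 = 2*R**2 - 1

digits = to_digits(50, R)           # 50 = 1*36 + 2*6 + 2
```

Because $2R^2 \equiv 1 \pmod p$, every element of $[0, p)$ needs at most one bit for the top digit, which is why products involving $a_1$ or $b_1$ reduce to a selection rather than a multiplication.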
Barrett Reduction
According to Euclid's division lemma, for any two positive integers $a$ and $b$ there exist $q$ and $r$ such that $a = q \cdot b + r$, $r \in [0, b)$. Therefore, we have $a \equiv r \pmod b$. In order to get such a $q$ and $r$, one division is required. However, division is a very expensive operation, which is more complex and much slower than multiplication. Thus, to speed up the division, Barrett reduction [26] converts it to a multiplication, i.e., $\times 1/b$. Furthermore, $1/b$ can be approximated as follows:

$$\frac{1}{b} \approx \frac{x}{2^k}.$$

Generally, the value $x$ is taken as $x = \lfloor 2^k / b \rfloor$. However, an error (denoted as $e$) is produced by the approximation of $1/b$, which equals $1/b - x/2^k$. To make sure the final result is correct, the resulting error in the quotient, $a \cdot e$, must be smaller than 1. This condition is met when $k \geq \log_2 a$. The whole process is shown in Algorithm 2.
Algorithm 2. The Barrett Reduction Algorithm [26] Input: Two numbers a and b, parameter k,
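The Barrett reduction described above can be sketched in a few lines of Python. This is a minimal illustration and not a transcription of Algorithm 2: the function name `barrett_reduce` is hypothetical, and the sketch assumes $0 \le a < 2^k$ so that the estimated quotient is too small by at most 1.

```python
def barrett_reduce(a, b, k):
    """Return (q, r) with a = q*b + r and 0 <= r < b.

    Assumes 0 <= a < 2**k. The only division, 2**k // b, depends on b and k
    alone, so it can be precomputed once for a fixed modulus.
    """
    x = (1 << k) // b        # precomputed approximation of 2**k / b
    q = (a * x) >> k         # estimated quotient: multiply and shift
    r = a - q * b            # candidate remainder, in [0, 2*b)
    if r >= b:               # estimate was one too small: correct once
        r -= b
        q += 1
    return q, r


# Example: reduce a 10-bit value modulo 37.
q, r = barrett_reduce(1000, 37, 10)
```

Since $x \le 2^k/b$, the estimate never exceeds the true quotient, so a single conditional subtraction is enough.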
Barrett Division
An efficient division algorithm was proposed in [24]. The algorithm divides a number $c_i \in [0, 2^a 3^b)$ by $2^{a/2} 3^{b/2}$ and calculates the quotient $q$ and remainder $r$ in an efficient way. As division by a power of two is a simple right shift operation, the division by $2^{a/2} 3^{b/2}$ can be performed according to the following steps:
Step 1: Extract the $a/2$ least significant bits of $c_i$ and store them in a variable $r_1$;
Step 2: Right shift $c_i$ by $a/2$ bits to obtain $c'_i$;
Step 3: Divide $c'_i$ by $3^{b/2}$ to get the quotient $q$ and remainder $r_2$.
Therefore, $c_i$ can be rewritten as follows:

$$c_i = q \cdot 2^{a/2} 3^{b/2} + r_2 \cdot 2^{a/2} + r_1.$$
The division by $3^{b/2}$ in Step 3 is more complex than the division by $2^{a/2}$. As $b$ is a fixed integer, it is possible to speed up the division by replacing it with a multiplication, similar to the Barrett reduction technique, as shown in Algorithm 3. Therefore, this efficient division algorithm is referred to as the Barrett Division in this paper.
Once the quotients and remainders are obtained, it is easy to repre-
Algorithm 3. The Barrett Division Algorithm [24]
Input: Two numbers $Q \in [0, 2^a 3^b)$ and $P = 2^{a/2} 3^{b/2}$ and
Output: $q$ and $r$ such that $Q = q \cdot P + r$
1:
5: if $r \geq P$ then
6: $r = r - P$;
7: $q = q + 1$;
8: return $q$, $r$
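The three steps of the Barrett Division can be sketched as follows. This is a hedged Python illustration rather than a reproduction of Algorithm 3: the function name `barrett_division`, the choice of `k`, and the way the correction is applied to the partial remainder $r_2$ are assumptions made for the sketch.

```python
def barrett_division(Q, a, b):
    """Divide Q in [0, 2**a * 3**b) by P = 2**(a/2) * 3**(b/2).

    Returns (q, r) with Q = q*P + r and 0 <= r < P. Assumes a and b are even.
    """
    half_a, half_b = a // 2, b // 2
    three = 3 ** half_b

    # Step 1: the a/2 least significant bits form the low part of the remainder.
    r1 = Q & ((1 << half_a) - 1)
    # Step 2: right shift by a/2 bits; the result is below 2**(a/2) * 3**b.
    c = Q >> half_a
    # Step 3: divide by 3**(b/2) with a Barrett-style multiply and shift.
    k = ((1 << half_a) * 3 ** b).bit_length()   # guarantees c < 2**k
    x = (1 << k) // three                       # precomputable constant
    q = (c * x) >> k
    r2 = c - q * three
    if r2 >= three:                             # quotient estimate was one too small
        r2 -= three
        q += 1

    # Recombine: Q = q*P + r2 * 2**(a/2) + r1.
    return q, (r2 << half_a) + r1


# Toy example with a = 6, b = 4, so P = 8 * 9 = 72.
q, r = barrett_division(5000, 6, 4)
```

In hardware the constants `k` and `x` would be fixed at design time, so only shifts, one multiplication, and a few additions remain at run time.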
THE FFM1 ALGORITHM AND ITS HARDWARE ARCHITECTURE
Improving upon [24] , it is found that the modular multiplication with the special field can be further simplified based on the fact that:
The improvement is further explained in detail in the following subsections.
The FFM1 Algorithm
Assume $A \in \mathbb{F}_p$, which is in the form of Eq. (1). Then we get a number $A' \in \mathbb{F}_p$ as follows:
When
, and p can be written as
Therefore, A 0 can be represented in the following form:
Suppose two numbers $A$ and $B$ are in the form of Eq. (1), and $A'$ and $B'$ are in the form of Eq. (7). Because Eq. (6) holds, the following relationship between $A \cdot B \pmod p$ and $A' \cdot B' \pmod p$ can be obtained:
The transformation of the operands is shown in Fig. 1. According to Eq. (10), we can get the result of $A \cdot B \pmod p$ by computing $A' \cdot B' \pmod p$, which saves about 5 additions. The product of $A'$ and $B'$ can be expressed as follows:
Rewriting Eq. (11) by replacing the coefficients, we can get Eq. (12) 
Algorithm 4. The FFM1 Algorithm
Input:
The EFFM needs to precalculate and store $2^{-2} \pmod p$ using 21 registers, while the proposed algorithm needs only 16 registers. Most importantly, the terms with $R^2$ have been removed in the proposed algorithm, which significantly reduces the number of operations. A detailed comparison of the hardware implementations is given in Section 5.
The Proposed Hardware Architecture for FFM1
We also propose a hardware architecture for the improved modular multiplication with the special field, which is shown in Fig. 2. In this architecture, there is one N/2-bit multiplier, one 5N/2-bit adder and one 2N-bit subtractor. An adder and a subtractor that support large word lengths are used to reduce the number of clock cycles. The whole modular multiplication process is controlled by a finite state machine (FSM).
The operands are stored in pipeline registers that are inserted between arithmetic units to further increase the performance. The inputs of the multiplier, adder and subtractor are selected by MUXs, which are controlled by the FSM. The process of a full binary multiplication is shown in Fig. 3 . Two N/2-bit operands are multiplied by the N/2-bit multiplier, and then partial products are accumulated by the 5N/2-bit adder.
There are three kinds of multiplications in the algorithm, which have different input sizes, namely, $N \times N/2$, $N \times N$, and $3N/2 \times 3N/2$. In the $3N/2 \times 3N/2$ multiplication, one of the inputs is a constant whose most significant N/2 digits always equal 2 and remain unchanged during the process. Therefore, the $3N/2 \times 3N/2$ multiplication is performed by a $3N/2 \times N$ multiplication and a shift operation. This can be performed by the circuit shown in Fig. 3. We can get the most significant N/2 digits of the final result in the second clock cycle. On the next clock cycle, the lower N/2 digits are obtained. The summation is shifted according to the weights of the partial products.
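The replacement of the $3N/2 \times 3N/2$ multiplication by a $3N/2 \times N$ multiplication and a shift can be illustrated with a short Python sketch. It assumes that "most significant N/2 digits equal to 2" means the constant has the form $K = 2 \cdot 2^N + K_{low}$ with $K_{low} < 2^N$. The function name and the toy width $N = 8$ are illustrative only.

```python
def mul_by_special_constant(x, k_low, n):
    """Return x * K for K = 2 * 2**n + k_low, where 0 <= k_low < 2**n.

    The top part of K contributes x * 2 * 2**n, which is a left shift by
    n + 1 bits, so only the product x * k_low needs a real multiplier.
    """
    return x * k_low + (x << (n + 1))


# Toy example with N = 8: x is a 12-bit (3N/2) operand, K a 12-bit constant.
N = 8
k_low = 0xAB
K = 2 * (1 << N) + k_low
product = mul_by_special_constant(0xFED, k_low, N)
```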
For example, using only 5 clock cycles, we can get the product of an $N \times N$ multiplication. The 4 $N \times N$ multiplications at the start of the algorithm take only 17 clock cycles. If the multiplication is not pipelined, 20 clock cycles are required.
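The cycle counts quoted above are consistent with a simple model, sketched below in Python. The model is an inference from the quoted numbers and not a description of the actual circuit: it assumes each $N \times N$ multiplication issues four $N/2 \times N/2$ partial products, one per cycle, plus one cycle of pipeline latency that is paid once when the multiplications are issued back to back.

```python
def multiplication_cycles(num_mults, partial_products=4, pipelined=True):
    """Estimate clock cycles for a run of N x N multiplications.

    Assumed model: one partial product per cycle and one extra cycle of
    latency, paid once if pipelined and once per multiplication otherwise.
    """
    if pipelined:
        return num_mults * partial_products + 1
    return num_mults * (partial_products + 1)


single = multiplication_cycles(1)                       # 5 cycles
four_pipelined = multiplication_cycles(4)               # 17 cycles
four_unpipelined = multiplication_cycles(4, pipelined=False)   # 20 cycles
```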
THE FFM2 ALGORITHM AND ITS HARDWARE ARCHITECTURE
As we mentioned in Section 2, the modulo $p = f \cdot 2^a 3^b - 1$ is chosen such that: (a) $f$ is a small number, such as 1 or 2; (b) $2^a \approx 3^b$. In fact, the number of values of $p$ that satisfy the above conditions is finite. In addition, the original algorithm in [24] provides rules on how to choose appropriate parameters so that $p$ is suitable for SIDH. It fixes $f$ to 2 and requires $b$ to be even. For example, the modulo used in [24] is $p = 2 \cdot 2^{386} 3^{242} - 1$. However, in the proposed FFM2, there is no such limitation on $f$ or $b$.
In the original EFFM algorithm [24], the efficient Barrett Division discussed in Section 2.3 was proposed to prevent the $c_2$ and $c_3$ values from increasing beyond the size of the modulus. Since Barrett Division uses the fact that division by two is a simple right shift operation, it can replace the complex division by simple shifting, multiplication and addition operations. Inspired by this fact, we notice that the term $f \cdot 2^a 3^b$ appears in the modulo $p$. Thus, a division whose divisor is $f \cdot 2^a 3^b$ can be performed by the Barrett Division algorithm. If the modulo were $f \cdot 2^a 3^b$ itself, we could complete the modular multiplication by using only Barrett Division. This form of modulo is called an ideal modulo, while a practical modulo is the ideal modulo plus or minus 1. In fact, there exists a relationship between the modular multiplication with an ideal modulo and the modular multiplication with a practical modulo, which will be discussed in the following subsections.
The FFM2 Algorithm
The ideal modulo equals the practical modulo plus or minus 1. For these two cases, assume the following equations:

$$C = A \cdot B, \quad (13)$$

$$C = q \cdot (f \cdot 2^a 3^b) + r. \quad (14)$$

In Eq. (13), $C$ represents the product of $A$ and $B$. In Eq. (14), $q$ and $r$ represent the quotient and remainder, respectively, of the division $C \div (f \cdot 2^a 3^b)$.
First Case: The Practical Modulo Equals the Ideal Modulo Minus 1
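In this case, we get:

$$p = f \cdot 2^a 3^b - 1. \quad (15)$$

Then, the operation $C \bmod p$ can be expressed as follows:

$$C \bmod p = (q \cdot (p + 1) + r) \bmod p = (q + r) \bmod p. \quad (16)$$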
Eq. (16) shows that the result of the modular multiplication with a practical modulo can be obtained by adding the quotient and remainder of the division by the ideal modulo. However, the sum of the quotient and remainder may be longer than the modulus length, so we need to check the range of the sum, which determines the performance of the algorithm. According to Eq. (14), $r$ is the remainder, so we can get:
In this case,
Thus, $r \leq p$;
As
We have
Moreover,
Thus, we can get
By considering both Eqs. (19) and (23), we have:
This shows that the sum of the quotient and remainder lies in the range $[0, 2p)$. In conclusion, when the sum $r + q$ is not smaller than $p$, we need to perform one more subtraction to get the final result, which is similar to the final step of the Montgomery algorithm [27].
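The range of $r + q$ in the first case can be recovered with the short derivation below. It is a reconstruction under the assumption that both operands are fully reduced, i.e., $0 \le A, B \le p - 1$, and it is not necessarily the exact chain of steps used in the original equations.

```latex
\begin{align*}
0 \le r &\le f \cdot 2^a 3^b - 1 = p, \\
C = A \cdot B &\le (p - 1)^2, \\
q = \left\lfloor \frac{C}{p + 1} \right\rfloor &\le \frac{(p - 1)^2}{p + 1} < p - 1, \\
0 \le r + q &< 2p .
\end{align*}
```

Because the sum never reaches $2p$, a single conditional subtraction of $p$ is sufficient.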
Second Case: The Practical Modulo Equals the Ideal Modulo Plus 1
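In this case, we have:

$$p = f \cdot 2^a 3^b + 1. \quad (25)$$

Then, $C \bmod p$ can be expressed as follows:

$$C \bmod p = (q \cdot (p - 1) + r) \bmod p = (r - q) \bmod p. \quad (26)$$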
Eq. (26) shows that the result of the modular multiplication with a practical modulo can be obtained by subtracting the quotient from the remainder of the division by the ideal modulo.
Similar to above, the result may be longer than the modulus, hence we need to check the range of the difference. We know that r is the remainder, so Eq. (17) holds.
We have:
As Eqs. (19) and (20) hold, and
As $r$ and $q$ are non-negative, we have $r - q \in [-p, p)$.
If the difference is smaller than 0, we need to perform another addition.
Algorithm 5. The FFM2 Algorithm
Referring to Algorithm 5, we can see that there are three main steps in implementing FFM2. The first step is to calculate the product of $A$ and $B$. Then, we use Barrett Division to get $r$ and $q$. Finally, depending on the practical modulo $p$, we may need to perform an extra subtraction or addition to obtain the correct result. Once the modulo $p$ is determined, i.e., $p$ equals either $f \cdot 2^a 3^b + 1$ or $f \cdot 2^a 3^b - 1$, no more than 4 steps are required in Algorithm 5. In conclusion, the FFM2 is much simpler than the EFFM and the FFM1. It has fewer steps and less complex operations. In particular, the FFM2 only needs to perform the Barrett Division once, while the other two algorithms need to perform it twice.
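The three steps described above can be sketched in Python for both forms of the practical modulo. This is an illustrative sketch and not a transcription of Algorithm 5: the function name `ffm2`, the `sign` parameter, and the use of Python's `divmod` in place of the Barrett Division are assumptions made to keep the example short.

```python
def ffm2(A, B, ideal, sign):
    """Compute A * B mod p for p = ideal + sign, with sign in {-1, +1}.

    `ideal` stands for f * 2**a * 3**b. Assumes 0 <= A, B < p.
    """
    p = ideal + sign
    C = A * B                     # Step 1: full product
    q, r = divmod(C, ideal)       # Step 2: stand-in for the Barrett Division
    if sign == -1:
        # p = ideal - 1, so ideal = 1 (mod p) and C = q + r (mod p).
        t = r + q
        if t >= p:                # sum lies in [0, 2p): one subtraction suffices
            t -= p
    else:
        # p = ideal + 1, so ideal = -1 (mod p) and C = r - q (mod p).
        t = r - q
        if t < 0:                 # difference lies in [-p, p): one addition suffices
            t += p
    return t


# Toy moduli built from the ideal modulo 2 * 2**2 * 3**2 = 72.
minus_case = ffm2(50, 60, 72, -1)   # p = 71
plus_case = ffm2(50, 60, 72, +1)    # p = 73
```

The same quotient and remainder serve both cases; only the final correction differs, which matches the single-MUX selection described for the hardware.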
The Proposed Hardware Architecture for FFM2
In order to have a comprehensive comparison, we also propose a hardware architecture for the FFM2, which is shown in Fig. 4. It is made of one N-bit multiplier, one 5N-bit adder and one 4N-bit subtractor. Similar to the FFM1 hardware architecture, the modular multiplication process is controlled by an FSM.
Compared with the EFFM and the FFM1, the FFM2 does not require a radix of the form $R = 2^{a/2} 3^{b/2}$. Thus, the input lengths of the multiplier, adder and subtractor are doubled. There is one $2N \times N$ multiplication, one $2N \times 2N$ multiplication and one $3N \times 3N$ multiplication. All three multiplications can be converted into a summation of several $N \times N$ multiplications. If we do not break up the multiplication, at most three additions and subtractions need to be performed, which is very simple. During the whole process, we only need 8 registers to store the precomputed values and intermediate values. Moreover, we use fewer MUXs to choose the different inputs. There are two dashed blocks in Fig. 4, which represent two different cases:
In the hardware design, only one MUX exists. The values P 1 and P 2 in Fig. 4 represent the precomputed value in the Barrett Division, and the practical modulo p, respectively.
Comparison of Hardware Architectures with Different Sizes of Multipliers
In the hardware implementation of the FFM2, we explore the impact of the operand sizes of the multipliers on the algorithm. We use two kinds of multipliers: one has the size $N \times N$, as mentioned in Section 4.2, while the other has the size $N/2 \times N/2$. In other words, we explore two ways to break a large multiplication into smaller ones. In this work, we break a $2N \times 2N$ multiplication in the two ways shown in Figs. 5 and 6. In our design, there is only one multiplier, no matter how large the multiplier is. Thus, the smaller the multiplier, the greater the number of cycles needed to complete the operation. For a $2N \times 2N$ multiplication, only 4 cycles are needed when we use the $N \times N$ multiplier, while 16 cycles are needed when we use the $N/2 \times N/2$ multiplier. However, the operating frequency is lower and more resources are consumed when we use the larger multiplier. It is clear that there is a tradeoff between throughput and hardware resources. We replace the $N \times N$ multiplier of the hardware architecture outlined in Section 4.2 with an $N/2 \times N/2$ multiplier, without changing other parts. We compare the hardware implementations of these two forms of multipliers, as shown in Table 1. It can be seen that the hardware architecture with the $N \times N$ multiplier needs less time to finish a full multiplication than the one with the $N/2 \times N/2$ multiplier. However, it requires double the number of LUTs and quadruple the number of DSPs.
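The two ways of breaking up a $2N \times 2N$ multiplication can be mimicked in software with a generic limb-splitting routine. The Python sketch below is only an illustration of the partial-product count: the function name `split_multiply` and the toy width $N = 8$ are hypothetical, and the code does not model the timing or resources of the actual design.

```python
def split_multiply(x, y, width, limb):
    """Multiply two width-bit integers using only limb x limb products.

    Returns (product, count), where count is the number of partial products,
    i.e. the number of passes through a single limb-sized multiplier.
    """
    n = width // limb
    mask = (1 << limb) - 1
    acc = 0
    count = 0
    for i in range(n):
        xi = (x >> (i * limb)) & mask
        for j in range(n):
            yj = (y >> (j * limb)) & mask
            # Accumulate each partial product at the weight of its limbs.
            acc += (xi * yj) << ((i + j) * limb)
            count += 1
    return acc, count


# Toy example with N = 8: a 16 x 16-bit multiplication.
N = 8
x, y = 0xBEEF, 0xCAFE
prod_large, count_large = split_multiply(x, y, 2 * N, N)        # N x N multiplier
prod_small, count_small = split_multiply(x, y, 2 * N, N // 2)   # N/2 x N/2 multiplier
```

Halving the multiplier width quadruples the number of partial products (4 versus 16 here), which is the cycle-count side of the tradeoff described above.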
RESULTS AND COMPARISON
The proposed algorithms FFM1 and FFM2 are implemented in hardware and compared with the original EFFM algorithm [24] . We also apply the FFM2 hardware architecture in the HW/SW codesign implementation of the complete SIDH and compare it with the best SIDH software implementation [20] in this section.
Hardware Implementations of the Proposed FFMs
The proposed algorithms and the EFFM [24] are implemented using Vivado 16.4 on the KC705 evaluation board (with a Kintex 7 FPGA chip, i.e., xc7k325tffg900-2). The proposed hardware architectures are used. In order to have a fair comparison, we choose the same finite field, i.e., the field generated by the prime $p = 2 \cdot 2^{386} 3^{242} - 1$, which is consistent with that in [24]. The hardware comparison with [24] is shown in Table 2. The proposed hardware architecture for the FFM1 algorithm uses 9,688 flip-flops (FFs), 17,247 LUTs and 122 DSP48s, which amounts to 2, 8 and 15 percent of the resources available in the FPGA, respectively. One complete modular multiplication takes only 64 clock cycles, i.e., 1.16 µs at the operating frequency of 55 MHz. Compared with the hardware implementation in [24], our proposed FFM1 design is over 6.56 times faster.
As the hardware architecture for FFM2 with the $N \times N$ multiplier is faster than that with the $N/2 \times N/2$ multiplier, as shown in Table 1, it is chosen for comparison with the other designs. It uses 11,632 flip-flops (FFs), 33,501 LUTs and 529 DSPs. It is the fastest design among all hardware implementations of the modular multiplication.
HW/SW Codesign Implementation of the SIDH
To evaluate the performance of the SIDH protocol using the proposed modular multiplication algorithm and hardware architecture, the FFM2 hardware is used to perform the modular multiplications in the protocol. To compare with the previous best SIDH software implementation [20], the same prime $2^{372} 3^{239} - 1$ is chosen and the same processor (1.7 GHz Intel i5-4210U) is used for both the software and the HW/SW codesign implementations. For a higher performance codesign, a Zynq FPGA with an ARM processor could be considered. FFM1 is not used because it is slower than FFM2 in hardware and because it requires $b$ in the prime to be even, which is not the case for the prime used in this comparison. The complete SIDH protocol is implemented in software, except for the modular multiplications, which are performed in hardware (on the Kintex 7 FPGA).
The results of both the software and the HW/SW codesign implementations are provided in Table 3. It can be seen that the SIDH codesign implementation using the proposed FFM2 hardware is 31.98 percent faster than the software implementation.
CONCLUSIONS AND FUTURE WORK
In this paper, we proposed two new modular multiplication algorithms that exploit the special structure of primes such as $p = 2 \cdot 2^{386} 3^{242} - 1$, which can be applied in SIDH. One improves on the original EFFM algorithm in [24], and the other is an entirely new algorithm, which differs in structure from the previous two algorithms. Building on the properties of the previous two algorithms, a mathematical transformation is applied to reduce the number of operations in the first new algorithm (FFM1). A hardware architecture is also proposed for each algorithm. The proposed FFM2 is over 6 times faster than the previous EFFM in hardware, and its hardware implementation is the fastest among the three algorithms. Furthermore, the FFM2 algorithm can be applied to a wide range of moduli, whereas the EFFM and FFM1 algorithms are limited in this respect. The FFM2 hardware is also applied in the complete SIDH HW/SW codesign implementation, which is over 31 percent faster than the best SIDH software implementation. Future work will look at optimized modular multiplication on $\mathbb{F}_{p^2}$, as suggested in [23].
