Abstract: In this paper, we propose a new unified architecture that utilizes the Montgomery Multiplication algorithm to perform modular multiplication of integers and binary polynomials as well as NTRU's polynomial multiplication. The unified design is capable of supporting a majority of public-key cryptosystems such as NTRU, RSA, Diffie-Hellman key exchange, and Elliptic Curve schemes, among others. Furthermore, the architecture is highly efficient in terms of area and speed.
INTRODUCTION
For a majority of public key schemes such as RSA [1], ECC [2], [3], and NTRU [4], most of the time is spent performing modular multiplications. These multiplications occur as either integer modular multiplications, binary polynomial modular multiplications, or polynomial multiplications over an integer ring. Developing a unified architecture that can perform modular multiplications for all representations is attractive because it provides the following key features: algorithm agility, resource utilization, and compatibility.
Algorithm Agility allows the user to switch cryptographic algorithms during runtime. Therefore, if current cryptosystems become obsolete due to new security needs, several back-up cryptosystems can be used as a replacement without requiring redesign.
Resource Utilization uses the same hardware to perform the majority of the arithmetic used for the different supported algorithms. Therefore, it reduces power and area consumption, which is critical for resource constrained applications. Alternatively, the architecture can be designed so that each algorithm allocates a separate area, which is not ideal.
Compatibility allows interoperability of various applications through a multitude of cryptographic algorithms. The problem stems from the fact that high end platforms may utilize cryptosystems such as RSA, whereas, for smart cards or other embedded applications, ECC or NTRU may be more reasonable choices.
Since Montgomery Multiplication (MM) is capable of improving the performance of integer and binary polynomial modular multiplications [5], [6], a number of unified architectures have been proposed. The work in [7] introduces a scalable and unified multiplier. This design supports both types of arithmetic through the introduction of the "dual-field adder," which performs addition with and without carry propagation. A word-level bit-serial version of the MM algorithm is utilized to avoid designing a dual field multiplier. Scaling is achieved by increasing the number of fixed area units. In addition, the pipeline depth and word size can be configured to meet the desired area and performance specifications.
In [8], a unified scalable Montgomery Multiplier design is presented. The design is based on the word-level MM algorithm, which is implemented using a dual field multiplier that supports $w \times w$-bit multiplications. The high-radix design utilizes the addition tree method to reduce the delay path to $\log_{3/2} w$ (instead of $w$) in the dual field multiplier and adder. The word size of the multiplier core and the pipelining depth can be configured before implementation.
In extension to the previous unified architectures, our design incorporates a third type of operation with the MM algorithm: the polynomial multiplication in NTRU's Public Key Cryptosystem, which is simply the usual convolution product of two vectors. By unifying NTRU's polynomial multiplication to the MM algorithm, a more flexible unified architecture can be designed to support more cryptosystems and a wider array of applications with a single processing element. To the best of our knowledge, no work has been published on the unification of NTRU with MM.
The rest of this paper is organized as follows: Section 2 provides a brief description of the NTRU Public Key Cryptosystem and Montgomery Multiplication. Section 3 presents the word-level Montgomery Multiplication algorithm and discusses the mathematical and algorithmic modifications that are necessary to provide support for NTRU. The new unified architecture is presented in Section 4. Section 5 describes how the unified multiplier can support every polynomial multiplication required by NTRU's major procedures. A performance analysis is presented in Section 6. Future improvements for the design are discussed in Section 7. Finally, conclusions are drawn in Section 8. 
PRELIMINARIES
This section introduces two topics to the reader as background for the sections that follow.
NTRU Public Key Cryptosystem
NTRU is a polynomial ring-based public key cryptosystem that was fully introduced in 1998 [4]. The scheme is set up by three integers $(N, p, q)$ such that:
• $N$ is prime,
• $p$ and $q$ are relatively prime, $\gcd(p, q) = 1$, and
• $q$ is much larger than $p$.
NTRU is based on polynomial additions and multiplications in the ring $R = \mathbb{Z}[x]/(x^N - 1)$. We use "$*$" to denote a polynomial multiplication in $R$, which is the cyclic convolution of two polynomials. After completion of a polynomial multiplication or addition, the coefficients of the resulting polynomial need to be reduced either modulo $q$ or $p$. As a side note, the key creation process also requires two polynomial inversions, which can be computed using the Extended Euclidean Algorithm. NTRU requires approximately $O(N^2)$ operations and a key length of $O(N)$.
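The cyclic convolution in $R$ can be illustrated with a minimal Python sketch. The function name `cyclic_convolution` and the toy parameters ($N = 3$, $q = 16$) are assumptions chosen for illustration only and are far too small for real use.

```python
def cyclic_convolution(a, b, q):
    """Multiply two polynomials in Z[x]/(x^N - 1), reducing coefficients mod q.

    a and b are coefficient lists of equal length N, lowest degree first.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length N")
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            # x^(i+j) wraps around because x^N = 1 in the ring
            k = (i + j) % n
            c[k] = (c[k] + ai * bj) % q
    return c


# Toy example (hypothetical values): (1 + 2x + 3x^2) * (4 + 5x + 6x^2) mod 16
product = cyclic_convolution([1, 2, 3], [4, 5, 6], 16)
print(product)  # [15, 15, 12]
```

The double loop makes the $O(N^2)$ operation count visible: every coefficient of one operand meets every coefficient of the other exactly once.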
More information on NTRU can be found in [4] and [9]. We briefly outline the procedures below.
Key Generation. To generate the public key, the user must:
• Choose a secret key, a random polynomial $f \in R$, where coefficients are in $(-$ [...] $f$ modulo $q$.
Once the above has been completed, the public key, $h$, is found as
$$h = F_q * g \pmod{q},$$
where $g$ is a random polynomial and $F_q$ is the inverse of $f$ modulo $q$.
Encryption. The encrypted message is computed as
$$e = p \cdot r * h + m \pmod{q},$$
where the message, $m \in R$, and the random polynomial, $r \in R$, have coefficients reduced modulo $p$.
Decryption. The decryption procedure requires three steps:
• $a = f * e \pmod{q}$,
• shift coefficients of $a$ to the range $(-q/2, q/2]$, and
• $d = F_p * a \pmod{p}$.
Note that, by choosing $f = 1 + p \cdot f_1$, where $f_1 \in R$, in the key generation step, the second polynomial multiplication in the decryption step is eliminated since $F_p = 1 \pmod{p}$.
Montgomery Multiplication
MM was introduced by P.L. Montgomery in [5] to improve the performance of computing the modular multiplication
$$c = a \cdot b \bmod m,$$
where $a$, $b$, $c$, and $m$ are integers. The reduction modulo $m$ requires a division, which is a costly operation. Instead, Montgomery replaces this division with a series of inexpensive shift operations. Montgomery's technique utilizes a residue representation:
$$\bar{a} = a \cdot R \bmod m.$$
There are two restrictions on choosing an integer residue, $R$:
• $R > m$ and
• $\gcd(R, m) = 1$.
After computation of the residues, the MM algorithm computes the product, $\bar{c}$:
$$\bar{c} = \bar{a} \cdot \bar{b} \cdot R^{-1} \bmod m.$$
As indicated by the bar notation above, the result, $\bar{c}$, is in residue form. To convert the result back to nonresidue format, one more MM is computed:
$$c = \bar{c} \cdot 1 \cdot R^{-1} \bmod m.$$
MM can also perform modular multiplications of binary polynomials, as explained in [6] . Ideally, Montgomery's technique should be used for applications that require multiple modular multiplications over the same modulus (e.g., modular exponentiations) since the residue is preserved for each consecutive multiplication.
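The residue conversion, the Montgomery product, and the conversion back can be sketched in Python as follows. This is an illustrative integer-only model, not the word-level hardware algorithm; the function names and the toy values ($m = 97$, $R = 128$) are assumptions, and `pow(m, -1, r)` requires Python 3.8 or later.

```python
def to_residue(a, r, m):
    """Convert a to residue form: a_bar = a * R mod m."""
    return (a * r) % m


def montgomery_multiply(x, y, r, m):
    """Return x * y * R^-1 mod m for x, y < m, R a power of two, gcd(R, m) = 1."""
    m_prime = (-pow(m, -1, r)) % r   # m' = -m^-1 mod R
    t = x * y
    u = (t * m_prime) % r            # chosen so that t + u*m is divisible by R
    c = (t + u * m) // r             # exact division; a right shift in hardware
    return c - m if c >= m else c    # final conditional subtraction


# Toy example (hypothetical values)
m, r = 97, 128
a, b = 25, 44
a_bar = to_residue(a, r, m)
b_bar = to_residue(b, r, m)
c_bar = montgomery_multiply(a_bar, b_bar, r, m)   # product in residue form
c = montgomery_multiply(c_bar, 1, r, m)           # back to nonresidue form
```

Only the first and last conversions touch $R$ explicitly, which is why the technique pays off when many multiplications share one modulus.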
OUR CONTRIBUTION
In this section, we describe the mathematical and algorithmic changes that are necessary to achieve NTRU's polynomial multiplication using MM. As a note, we represent all operands as bit strings, which are, according to the context, interpreted as either one of the following:
• coefficients of the expansion of an integer in powers of 2,
• coefficients of the expansion of a binary polynomial in powers of $x$,
• the binary representation of the integer coefficients of a polynomial defined over an integer ring.
There are differences in the way the three types of multiplications are computed. In integer multiplications, carries are propagated through the entire bit string representation of the product. In binary polynomial multiplications, however, no carries are propagated. In NTRU's polynomial multiplications, carries are propagated only within the integer coefficients but not among coefficients. These differences are minor and, in a hardware implementation, they can be implemented by inhibiting carry propagation when necessary. While regular modular multiplications can be unified using this method, to unify MM with NTRU, additional changes are necessary, which are described below.
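The effect of inhibiting carry propagation can be seen in a small Python model. The function `dual_field_multiply` and its `integer_mode` flag are hypothetical names for this sketch; it only mimics the arithmetic behavior, not the structure of a hardware dual field multiplier.

```python
def dual_field_multiply(a, b, integer_mode):
    """Multiply two nonnegative bit strings.

    integer_mode=True  : partial products are added (carries propagate).
    integer_mode=False : partial products are XOR-ed (no carries), i.e.
                         multiplication of binary polynomials over GF(2).
    """
    result = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            partial = a << shift
            result = result + partial if integer_mode else result ^ partial
        shift += 1
    return result


# The same operands, interpreted two ways (hypothetical values):
# as integers, 7 * 3 = 21; as polynomials, (x^2 + x + 1)(x + 1) = x^3 + 1.
as_integer = dual_field_multiply(0b0111, 0b0011, True)
as_polynomial = dual_field_multiply(0b0111, 0b0011, False)
```

The two results differ only in how the shifted partial products are combined, which is the property the unified designs exploit.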
Achieving NTRU with Montgomery Multiplication
To achieve NTRU using MM, the extra $R$ factor has to be taken into consideration. We observe that, by setting the residue $R = x^N$ and the modulus $m = x^N - 1$ in MM, the following property is introduced:
$$R \bmod m = x^N \bmod (x^N - 1) = 1.$$
When this property is applied to Montgomery's method, the residue of an operand is simply the operand itself:
$$\bar{a} = a \cdot R \bmod m = a,$$
and the Montgomery product is in nonresidue format:
$$\bar{c} = \bar{a} \cdot \bar{b} \cdot R^{-1} \bmod m = a \cdot b \bmod m = c.$$
Therefore, by restricting $R = x^N$, NTRU's operands never need to be converted to and from residue form to receive the correct result from MM.
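The property behind this observation can be checked numerically with a short Python sketch. The helper `reduce_mod_xn_minus_1` and the sample polynomial are assumptions made for illustration; reduction modulo $x^N - 1$ is modeled by folding the coefficient of $x^{N+k}$ onto $x^k$.

```python
def reduce_mod_xn_minus_1(poly, n):
    """Reduce an integer-coefficient polynomial modulo x^n - 1.

    poly is a coefficient list, lowest degree first, of any length.
    """
    out = [0] * n
    for k, coeff in enumerate(poly):
        out[k % n] += coeff   # x^k is congruent to x^(k mod n)
    return out


N = 4
R = [0] * N + [1]             # the polynomial x^N
a = [3, 0, 5, 1]              # hypothetical operand 3 + 5x^2 + x^3
a_times_R = [0] * N + a       # multiplying by x^N shifts a up by N positions

r_reduced = reduce_mod_xn_minus_1(R, N)          # x^N mod (x^N - 1)
a_bar = reduce_mod_xn_minus_1(a_times_R, N)      # residue of a
```

Here `r_reduced` is the constant polynomial 1 and `a_bar` equals `a`, matching the claim that the residue of an operand is the operand itself.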
As a result of the unification, setting $R = x^N$ restricts the residue and modulus for the integer and binary polynomial cases as follows: For integers:
$$R = 2^{N \cdot w} \quad \text{and} \quad m < 2^{N \cdot w}.$$
For binary polynomials:
$$R(x) = x^{N \cdot w} \quad \text{and} \quad \operatorname{degree}(m(x)) \le N \cdot w.$$
Unified Word-Level Algorithm
The word-level integer MM algorithm is given in Algorithm 3.2, which is a slightly modified version of the FIOS method introduced in [10]. In this algorithm,
• $a[i]$, $b[j]$, $c[i]$, and $m[i]$ represent individual elements of the word array representations of integers $a$, $b$, $c$, and $m$, respectively,
• the least significant word is stored in the 0th element of the array (e.g., $m[0]$),
• the word size is $w$ bits in length,
• $n$ is the number of words in the arrays,
• $m[0]' = -m[0]^{-1} \bmod 2^w$ is precomputed, and
• $CS$ has a total length of $2w + 1$ bits such that:
  - $C$, the most significant word, is $w + 1$ bits long and
  - $S$, the least significant word, is $w$ bits long.
After completion of Algorithm 3.2, a long subtraction of $c = c - m$ may be necessary if $c \ge m$.
Word-Level Montgomery Algorithm for $GF(p)$
Step 1: for $j = 0$ to $n - 1$
Step 2:   $CS = (a[0] \cdot b[j]) + c[0]$
Step 3:   $U = S \cdot m[0]' \bmod 2^w$
Step 4:   $CS = (m[0] \cdot U) + CS$
Step 5:   $CS = CS \gg w$
Step 6:   for $i = 1$ to $n - 1$
Step 7:     $CS = (a[i] \cdot b[j]) + c[i] + (m[i] \cdot U) + CS$
Step 8:     $c[i - 1] = S$
Step 9:     $CS = CS \gg w$
Step 10:  endfor
Step 11:  $c[n - 1] = S$
Step 12: endfor
With a few modifications, as proposed in [6], Algorithm 3.2 can be used for binary polynomials as well. The modifications include changing integer multiplications to polynomial multiplications and additions to word size XOR operations. Also, since $GF(2)$ arithmetic eliminates carry propagation, $CS$ only needs to be $2w$ bits long and the long subtraction after completion of the algorithm is no longer necessary.
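The integer word-level algorithm can be modeled in Python as a sketch under stated assumptions. The step structure follows the hardware description given later in this paper (Steps 2 to 11); the helper names, the test values, and the restriction that the modulus is odd and below $2^{nw-1}$ (so that every intermediate result fits in $n$ words) are assumptions of this sketch. `pow(x, -1, n)` requires Python 3.8 or later.

```python
def to_words(value, n, w):
    """Split an integer into n words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(n)]


def from_words(words, w):
    """Reassemble an integer from its word array."""
    return sum(word << (w * i) for i, word in enumerate(words))


def word_level_montgomery(a, b, m, w):
    """Return the word array of a * b * 2^(-n*w) mod m, possibly plus m."""
    n = len(m)
    mask = (1 << w) - 1
    m0_prime = (-pow(m[0], -1, 1 << w)) & mask        # m[0]' = -m[0]^-1 mod 2^w
    c = [0] * n
    for j in range(n):                                # Step 1
        cs = a[0] * b[j] + c[0]                       # Step 2
        u = ((cs & mask) * m0_prime) & mask           # Step 3
        cs = m[0] * u + cs                            # Step 4 (lower word becomes 0)
        cs >>= w                                      # Step 5
        for i in range(1, n):                         # Step 6
            cs = a[i] * b[j] + c[i] + m[i] * u + cs   # Step 7
            c[i - 1] = cs & mask                      # Step 8
            cs >>= w                                  # Step 9
        c[n - 1] = cs & mask                          # Step 11
    return c


# Toy example (hypothetical values): w = 8 bits, n = 2 words
W, N_WORDS = 8, 2
modulus = 27963            # odd and below 2^15
x, y = 12345, 6789
c_words = word_level_montgomery(
    to_words(x, N_WORDS, W), to_words(y, N_WORDS, W), to_words(modulus, N_WORDS, W), W
)
c_value = from_words(c_words, W)
if c_value >= modulus:     # the final long subtraction c = c - m
    c_value -= modulus
```

In this model `c_value` equals $x \cdot y \cdot R^{-1} \bmod m$ with $R = 2^{16}$; Python integers hide the $2w + 1$-bit width of $CS$ that the hardware must provide.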
We now describe the algorithmic changes that are necessary to achieve NTRU's polynomial multiplication with the MM algorithm. NTRU's operands are stored as word arrays where each coefficient is placed in a $w$-bit word. Since NTRU's modulus is comprised of $N + 1$ words (coefficients), the inner loop in Algorithm 3.2 needs to be incremented by one to process all words of the modulus. In addition, the parameter $n$ is changed to NTRU's parameter $N$. We assume that the operands and moduli for all three cases are stored as $N + 1$ word arrays. The new algorithm is shown in Algorithm 3.2.
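Word-Level Montgomery Algorithm for $GF(p)$, $GF(2^k)$, and NTRU
Step 1: for $j = 0$ to $N - 1$
Step 2:   $CS = (a[0] \cdot b[j]) + c[0]$
Step 3:   $U = S \cdot m[0]' \bmod 2^w$
Step 4:   $CS = (m[0] \cdot U) + CS$
Step 5:   $CS = CS \gg w$
Step 6:   for $i = 1$ to $N$
Step 7:     $CS = (a[i] \cdot b[j]) + c[i] + (m[i] \cdot U) + CS$
Step 8:     $c[i - 1] = S$
Step 9:     $CS = CS \gg w$
Step 10:  endfor
Step 11:  $c[N] = S$
Step 12: endfor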
There is one additional change that is necessary to make NTRU's polynomial multiplication work with the MM algorithm. In Steps 4 and 7 of Algorithm 3.2, the result needs to be reduced modulo $q$. In NTRU implementations, $q$ is selected as $2^w$ for more efficient reductions. For this case, there is no carry propagation from one word to the next. Hence, the upper word of the result from Steps 4 and 7 needs to be cleared before it is reprocessed in Step 7.
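The NTRU mode, including the clearing of the upper word, can be modeled with the following Python sketch. It assumes $q = 2^w$, stores the modulus $x^N - 1$ as $N + 1$ words with $-1$ represented as $2^w - 1$, and uses $m[0]' = 1$; the function name and the toy parameters are hypothetical.

```python
def ntru_montgomery_multiply(a, b, w):
    """Cyclic convolution of two N-coefficient polynomials mod q = 2^w,
    computed with the word-level Montgomery loop and a cleared carry word."""
    n_coeffs = len(a)
    mask = (1 << w) - 1
    a = a + [0]                                   # operands as N + 1 word arrays
    b = b + [0]
    m = [mask] + [0] * (n_coeffs - 1) + [1]       # x^N - 1, with -1 = 2^w - 1 mod q
    m0_prime = 1                                  # -m[0]^-1 mod 2^w for m[0] = -1
    c = [0] * (n_coeffs + 1)
    for j in range(n_coeffs):                     # Step 1
        cs = a[0] * b[j] + c[0]                   # Step 2
        u = ((cs & mask) * m0_prime) & mask       # Step 3
        cs = (m[0] * u + cs) & mask               # Step 4, upper word cleared
        cs >>= w                                  # Step 5 (always 0 here)
        for i in range(1, n_coeffs + 1):          # Step 6, one extra iteration
            cs = (a[i] * b[j] + c[i] + m[i] * u + cs) & mask   # Step 7, upper word cleared
            c[i - 1] = cs                         # Step 8
            cs >>= w                              # Step 9 (always 0 here)
        c[n_coeffs] = cs & mask                   # Step 11
    return c[:n_coeffs]


# Toy example (hypothetical values): N = 3, w = 4, so q = 16
result = ntru_montgomery_multiply([1, 2, 3], [4, 5, 6], 4)
print(result)  # [15, 15, 12]
```

Each outer iteration adds $a \cdot b[j]$ to the accumulator and rotates it down by one coefficient; after $N$ iterations the rotations cancel because $x^N = 1$ in the ring, leaving the cyclic convolution.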
HARDWARE DESIGN
In this section, we detail the hardware architecture that utilizes the MM algorithm to perform integer modular multiplications, binary polynomial modular multiplications, and NTRU's polynomial multiplications. Since NTRU's operands and modulus are polynomials whose coefficients are $w$-bit integers, a high-radix word-level design is the most convenient. We start to build a new unified core by extending a high-radix word-level Montgomery Multiplier core, which implements Algorithm 3.2, to support NTRU with minimal change to the hardware. To simplify the control logic, the Montgomery Multiplier core is fixed for a word size $w = 8$ bits to support the maximum length of NTRU's polynomial coefficients. The following sections first present the Montgomery Multiplier core, introduce the new unified core, then discuss the control system which orchestrates the operations of Algorithm 3.2.
High-Radix Montgomery Multiplier Core
A slightly modified version of the high-radix Montgomery Multiplier Core in [8] is shown in Fig. 1. The core is capable of performing all operations required by Algorithm 3.2 to achieve integer and binary polynomial modular multiplications. The core consists of two $8 \times 8$-bit dual field multipliers, an $8 \times 8$-bit dual field adder, and a three-way dual field adder. The three-way dual field adder accepts two operands that are 16 bits in length and a third operand that is 9 bits in length. The $f_{sel_0}$ signal determines whether the dual field components will perform integer or binary polynomial arithmetic. If $f_{sel_0} = 1$, the components perform integer arithmetic. Otherwise, the components perform binary polynomial arithmetic. The following describes how each step of Algorithm 3.2 is supported by the core.
• Step 2: $CS = (a[0] \cdot b[j]) + c[0]$
The first multiplier and adder (left side of Fig. 1) are utilized for Step 2.
• Step 3: $U = S \cdot m[0]' \bmod 2^w$
For Step 3, the second multiplier is utilized because the only operation required is a multiplication. The $S$ in Step 3 is the lower word (product1) of the result, $CS_1$, of Step 2. Also, the equation requires the reduction modulo $2^w$, which is simply the extraction of the lower word (product2) of $CS_2$. Since product2 will need to be accessible for each execution of Step 7, product2 is stored to register $U$ outside of the core.
• Step 4: $CS = (m[0] \cdot U) + CS$
The operation in Step 4 requires an $8 \times 8$-bit multiplication then a $16 \times 16$-bit addition. As a result, the second multiplier and the three-way adder are used for the computation of Step 4. The result from the operation in Step 2 ($CS_1$), the result of $m[0] \cdot U$ ($CS_2$), and $sum2 = 0$ are passed on to the three-way adder. After processing the three inputs, the three-way adder produces the final result for Step 4, $CS_3$.
• Step 5 or 9: $CS = CS \gg w$ and Step 7: $CS = (a[i] \cdot b[j]) + c[i] + (m[i] \cdot U) + CS$
For these steps, all of the components of the core are used. The first multiplier and adder combination computes $(a[i] \cdot b[j]) + c[i]$ to produce the intermediate result, $CS_1$. The second multiplier computes the second multiplication ($m[i] \cdot U$) in parallel with the first multiplier to produce the second intermediate result, $CS_2$. Finally, the three-way adder adds the two intermediate results, $CS_1$ and $CS_2$, and sum2 to produce the final result, $CS_3$. For clarification, sum2 receives the shifted result from either Step 5 (when entering the loop) or Step 9 (when within the loop), which is just the upper 9 bits, $acc_{hi}$, of the previous operation in Step 4 or Step 7, respectively.
• Step 8: $c[i - 1] = S$
This step simply assigns the lower word of the result from Step 7, $acc_{lo}$, to the respective location in the word vector, $c$, which stores the final integer result.
• Step 11: $c[n - 1] = S$
This step requires that the lower word of the shift operation in Step 9 be assigned to the respective location in the word vector, $c$. Since the lower word, $S$, is 8 bits, only the least significant 8 bits of $acc_{hi}$ are assigned to the output.
Unified Core
The Montgomery Multiplier core is extended to a unified core, which is able to support all operations in Algorithm 3.2. The modifications mentioned in Section 3.2 do not require any hardware modifications to the Montgomery Multiplier core. Instead, these modifications require either a simple change to the control logic or additional hardware outside the Montgomery Multiplier core to create the new unified core, as shown in Fig. 2. As a note, the design assumes that NTRU's integer moduli are $p < q$ and $q = 2^k$, where $k \le 8$. In addition, the new unified architecture reduces the coefficients of NTRU's product polynomial modulo $q$.
The first modification mentioned earlier requires that the length of the inner loop in Algorithm 3.2 be incremented by one as shown in Algorithm 3.2. Although this modification requires the control logic to perform an additional iteration, this does not necessarily mean that the counter size has to change. For instance, changing the control to count from 500 to 501 still requires a 9-bit counter. With the exception of $N + 1$ being a power of 2 (e.g., incrementing the count from 511 to 512), no additional hardware on top of the original control (with no NTRU support) is required to perform this additional iteration.
The next modification requires that the "carry word" from the results of Step 4 and Step 7 be cleared when NTRU is selected. This "carry word" is the upper word of the result from the three-way adder, $acc_{hi}$. In order to distinguish between the functions that the unified architecture supports, a two-bit $f_{sel}$ signal is necessary. The least significant bit, $f_{sel_0}$, determines whether the multipliers and adders perform integer multiplications and additions ($f_{sel_0} = 1$) or polynomial multiplications and additions ($f_{sel_0} = 0$). The most significant bit, $f_{sel_1}$, determines whether NTRU is selected. Since NTRU's polynomial multiplication relies on the integer multiplication and addition of its coefficients, $f_{sel_0}$ needs to be set to 1. Refer to Table 1 for clarification on the assignment of the $f_{sel}$ signal and its selected function. The "carry word," $acc_{hi}$, is cleared by AND-ing each bit with $f_{sel_1}$, as shown in Fig. 2. If NTRU is selected ($f_{sel_1} = 0$), the AND gates zero out all 9 bits of $acc_{hi}$ as needed. Otherwise, $acc_{hi}$ passes through the AND gates unchanged. Therefore, NTRU can be supported using the Montgomery Multiplier core with the addition of only nine gates.
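The gating of the carry word can be modeled in a few lines of Python. The function name `mask_carry_word` is hypothetical, and the two-bit encodings used below (NTRU as `0b01`, integers as `0b11`, binary polynomials as `0b10`) are inferred from the description of the two select bits rather than taken from Table 1.

```python
def mask_carry_word(acc_hi, fsel):
    """Model the nine AND gates on the 9-bit carry word acc_hi.

    fsel is the two-bit function select; bit 1 is 0 when NTRU is selected.
    """
    fsel1 = (fsel >> 1) & 1
    gate = 0x1FF if fsel1 else 0x000   # fsel1 replicated across all nine bits
    return acc_hi & gate


# Encodings inferred from the text (assumption, not copied from Table 1)
NTRU, BINARY_POLY, INTEGER = 0b01, 0b10, 0b11

cleared = mask_carry_word(0x1A5, NTRU)        # carry word zeroed in NTRU mode
kept = mask_carry_word(0x1A5, INTEGER)        # carry word passes unchanged
```

Because the gating sits outside the multiplier core, the integer and binary polynomial data paths behave exactly as they would without NTRU support.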
Control
The control assumes that the word arrays $a$, $b$, $c$, and $m$ reside in separate memory caches. It is also assumed that $m[0]'$ is precomputed and stored in a register. For the NTRU case, $m[0]'$ is always set to 1. By using the Montgomery Multiplier core and the additional hardware shown in Fig. 2, the control executes Algorithm 3.2 in seven stages, as shown in Fig. 3. The host system initializes the unified architecture for processing by asserting the reset signal. At this point, Algorithm 3.
• Stage 0 initializes the indexes $i$ and $j$ to zero and transmits the addresses for $a[i]$, $b[j]$, and $c[i]$ to the respective caches so that the data will be available for the next stage.
• Stage 1 is responsible for setting up the core to perform the operation in Step 2. Since Stage 2 does not require any new data from memory and $m[0]'$ is precomputed, there is no need to transmit any new addresses to memory.
• Stage 2 is responsible for setting up the core to perform the operation in Step 3. In addition, this
SUPPORTED OPERATIONS AND LIMITATIONS FOR NTRU
Previous works [7], [8] have established how the Montgomery Multiplier can achieve integer and binary polynomial modular multiplications. However, this is the first realization of a unified architecture that also supports NTRU's polynomial multiplication. This section details how the unified multiplier provides support for all the polynomial multiplications required by the key creation and encryption procedures and the decryption procedure with the addition of a modulo $p$ reduction circuit. As mentioned earlier, we restrict the integer moduli to $p < q$ and $q = 2^k$, where $k \le 8$. The polynomial size $N$, however, can be set to arbitrary lengths and is only limited by the size of the counters in the control unit. The following indicates which operations are supported and the associated assumptions.
1. Public Key: $h = F_q * g \pmod{q}$
The unified multiplier can perform the full operation above assuming that:
• The random polynomial, $g$, has coefficients from $\{0, 1, 2^k - 1\}$ (footnote 2) and
• The inverse polynomial $F_q$ of the private key $f$ modulo $q$ has been precomputed and has coefficients in the range $[0, q - 1]$.
2. Encryption: $e = p \cdot r * h + m \pmod{q}$
The unified multiplier can only perform the multiplication of $r * h \pmod{q}$. It is assumed that:
• The random polynomial, $r$, has coefficients from $\{0, 1, 2^k - 1\}$,
• The message, $m$, is added outside of the multiplier and has coefficients from $\{0, 1, 2^k - 1\}$,
• The integer multiplication of $p$ occurs outside of the multiplier either after the multiplier has computed $r * h \pmod{q}$ or with the public key, $h$, prior to encryption.
3. Decryption:
Originally, the decryption process for NTRU consists of three steps:
a. $a = f * e \pmod{q}$,
b. shift coefficients of $a$ from $[0, q - 1]$ to $[-q/2, q/2)$, and
c. $d = F_p * a \pmod{p}$.
The unified multiplier is able to compute Step a, while Step b needs to be performed outside of the multiplier. Step c cannot be computed by the multiplier because polynomial $a$'s coefficients are no longer unsigned. In addition, the multiplier does not support reduction modulo $p$. Fortunately, there is a way to slightly modify the steps so that the unified multiplier can perform the polynomial multiplications within the decryption process with minimal additional hardware. Four steps are now required:
a. $a = f * e \pmod{q}$
The unified multiplier can perform the full operation above as long as the random polynomial, $f$, has coefficients from $\{0, 1, 2^k - 1\}$.
b. Shift coefficients of $a$ from $[0, q - 1]$ to $[-q/2, q/2)$. [...]
c. $b = a \pmod{p}$
The unified multiplier does not perform this operation. It is assumed that the user will reduce the coefficients of $a$ modulo $p$. In the end, $b$ will have coefficients from $\{0, 1, 2\}$.
d. $d = F_p * b \pmod{p}$
As a result of Steps b and c, the two polynomials, $b$ and $F_p$, are now compatible for polynomial multiplication by the unified multiplier. However, a combinational reduction circuit needs to be built into the core to reduce the partial products modulo $p$ to prevent overflows.
The decryption of NTRU may be enhanced by choosing $f = 1 + p \cdot f_1$. In this case, the convolution $d = F_p * a \pmod{p}$ disappears and the first step becomes $a = f * e = e + p \cdot f_1 * e$. The convolution $f_1 * e$ is handled by our multiplier core; however, the multiplication by $p$ and the addition of $e$ need to be handled externally.
Footnote 2: Note that $2^k - 1$ serves as $-1 \bmod q$.
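The two coefficient operations that are left to the user, the shift into a signed range and the reduction modulo $p$, can be sketched in Python. The function names are hypothetical, and the half-open target range $[-q/2, q/2)$ is an assumption of this sketch; the parameters $p = 3$ and $q = 256$ are the typical values mentioned later in the paper.

```python
def center_lift(a, q):
    """Shift coefficients from [0, q-1] to [-q/2, q/2) (range is an assumption)."""
    return [x - q if x >= q // 2 else x for x in a]


def reduce_mod_p(a, p):
    """Reduce each coefficient modulo p; Python's % yields values in [0, p-1]."""
    return [x % p for x in a]


# Toy coefficients (hypothetical values) with q = 256, p = 3
a = [0, 1, 127, 128, 255]
shifted = center_lift(a, 256)        # [0, 1, 127, -128, -1]
b = reduce_mod_p(shifted, 3)         # coefficients now in {0, 1, 2}
```

After these two steps every coefficient of `b` is a small unsigned value, which is what makes `b` compatible with the unsigned multiplier.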
PERFORMANCE ANALYSIS
This section analyzes the performance of the unified multiplier. The architecture was modeled using VHDL, simulated for functionality using Mentor Graphics' ModelSim 5.5f, and synthesized with Mentor Graphics' LeonardoSpectrum tool using TSMC 0.35 µm technology [11]. The data in Table 2 summarizes the overall numerical results for various performance criteria.
As seen in Fig. 4a, the total area increases slowly with the modulus length. This is due to the increase of the counters' size in the control unit. Despite this increase, the area scales at a slow rate and the majority of the area is used for the core, as seen in Fig. 4b. Due to the increase in control logic, the maximum clock frequency of the multiplier slightly decreases as the size of the modulus grows. Each stage of the control, as described in Section 4.3, is completed within a clock cycle. The total number of clock cycles (#CC) for an operand multiplication is determined by counting the number of stages executed. In Fig. 3, after a setup stage, both the inner loop and the outer loop are executed $N$ times, which are two and six stages long, respectively. Hence, the total number of clock cycles for a unified multiplication is
By multiplying #CC with the respective clock period, we obtain the timing results as shown in Table 2. For instance, a single unified multiplication for $N = 1$ is completed in 50 ns and for $N = 600$ in a little over 9 ms.
In Table 3 , the performance of the unified multiplier is estimated for three cryptosystems, which represent each function of the multiplier. For the integer case, the unified multiplier can compute a 1,024-bit RSA operation with a short exponent in about 4 ms with less than 2; 900 gates, whereas the 1,024-bit RSA operation with a long exponent takes approximately 382 ms to complete with the same area. For the binary polynomial case, the unified multiplier requires approximately 15 ms to complete a 160-bit Elliptic Curve operation. It is assumed that projective coordinates are used and the final inverse computation time is negligible. Finally, for NTRU's highest security level (N ¼ 503), the unified multiplier can compute a polynomial multiplication in a little over 6 ms.
FUTURE IMPROVEMENTS
This section outlines the potential improvements to the unified design. Specifically, the core can be further optimized for the NTRU mode. In this mode, $m[0] = -1$ and $m[0]' = 1$; therefore, Steps 3-5 in Algorithm 3.2 simplify as follows:
Step 3: $U = S \cdot m[0]' \bmod 2^w = S$
Step 5: $CS = CS \gg w = 0$.
Hence, the second multiplier can be bypassed. Ultimately, Step 3 becomes a simple assignment operation and Steps 4 and 5 can be merged to clear $CS$:
Step 3: $U = S$
Step 4: $CS = 0$.
Furthermore, since NTRU's modulus is $x^N - 1$, within the inner loop, $m[i] = 0$ except in the last iteration, where $m[i] = 1$. This simplifies the ($m[i] \cdot U$) portion of Step 7 to a single addition which only occurs during the last iteration of the inner loop. Therefore, the second multiplier can be bypassed in Step 7 as well since only an addition is necessary. Alternatively, the second multiplier may be utilized to process the next coefficient multiplication since NTRU's partial product columns are independent. Hence, the inner loop may execute twice as fast with additional control logic.
For proof of concept, the design was presented in its most simplified form with only one unified core. With less than 3,000 gates, the footprint is significantly smaller than other unified Montgomery architectures. However, the performance is not comparable to such designs. If higher speeds are desired, the design may be scaled and pipelined at the expense of a larger footprint. This would translate into a speedup increasing (almost) linearly with the number of cores.
Finally, the user may add a modulo $p$ reduction circuit within the unified multiplier core to fully support NTRU's decryption procedure. For typical NTRU parameters $p = 3$ and $q = 256$, we were able to build a reduction unit at a cost of 46 two-input gates (less than 2 percent of the overall design).
CONCLUSIONS
A unified architecture was designed to be simple and effective in demonstrating that NTRU's polynomial multiplication could easily be achieved with high-radix Montgomery Multipliers. The mathematical basis and algorithmic changes required for achieving NTRU with the MM algorithm were established. These modifications were implemented to produce a new unified architecture by extending a generic Montgomery Multiplier core with only nine additional gates. The unified multiplier provides support for all the polynomial multiplications required by NTRU's key creation and encryption procedures, as well as the decryption procedure with the addition of a modulo p reduction circuit. However, our design does restrict the integer modulus q to a power of 2.
The performance and application analysis conducted on the new unified core has demonstrated that all three types of applications, RSA, ECC, and NTRU, can be achieved with high performance on a small footprint. It is interesting to note that 503 NTRU offers the best performance for its security level compared to the other two applications. For instance, NTRU with $N = 503$ provides a security level comparable to 4,096-bit RSA with an additional 2 ms over a short exponent 1,024-bit RSA operation. Despite the performance difference among these applications, the unified design is capable of supporting a majority of public-key cryptosystems such as NTRU [4], RSA [1], Diffie-Hellman [12], Elliptic Curves [2], [3], etc.
