NTRU is a lattice-based public key cryptosystem featuring reasonably short, easily created keys, high speed, and low memory requirements, which makes it a viable candidate for wireless networks. This paper presents two optimized designs based on the enhanced NTRU algorithm. The first is a light-weight and fast NTRU core that performs encryption only. It has a gate count of 1175 gates and a power consumption of 1.51 μW, and it completes the whole encryption process in 1498 μs at 500 kHz. As such, it is well suited to wireless sensor networks. The second is a high-speed NTRU core capable of both encryption and decryption, with delays of 16,064 μs and 128,010 μs for encryption and decryption, respectively. It consists of 25,758 equivalent gates and has a total power consumption of 59.2 μW (which would be reduced greatly if low-power methods were adopted). This core is recommended for use in base stations or servers in wireless networks.
Introduction
Wireless networks are expected to be widely used in various fields. Although wireless networks offer low-cost deployment and convenience, security is considered essential because of their inherent vulnerability. Public key cryptosystems, which use two different keys, such as RSA and ECC, have profound consequences in the areas of confidentiality, key distribution and authentication, but they are commonly regarded as complex, slow and power hungry [1]. NTRU, a relatively new lattice-based public key cryptosystem that features reasonably short, easily created keys, high speed, and low memory requirements [2], seems viable for wireless networks.
O'Rourke first implemented an NTRU core with a minimum gate count of 1483 gates (N = 503) [3], but it performs star multiplication only. Another design is presented in Kaps's paper (N = 167) [4]; it performs only encryption, with an area of 2850 gates and a power consumption of 15.1 μW at 500 kHz. The most detailed low-cost implementation of NTRU is that of AC Atici [5]. The author presented two compact NTRU architectures (N = 167). One is for encryption only, with an area of 2800 gates and a dynamic power consumption of 1.72 μW at 500 kHz. The other is capable of both encryption and decryption; it consists of 10,500 gates and consumes 6 μW of dynamic power. Several other papers focus on the optimization of area and power consumption, but only a few address the optimization of speed.
In Jeffrey Hoffstein's paper [6], an alternative form of NTRU is proposed in which the private key f is created as $f = 1 + pF$, which speeds up both key generation and decryption. Unfortunately, there is little research on the implementation of the enhanced NTRU.
The rest of this paper is organized as follows. Section 2 summarizes the basics of the enhanced NTRU algorithm. Section 3 presents the architecture of an encryption-only enhanced NTRU optimized for small area and high speed. Section 4 presents an architecture of enhanced NTRU optimized for high speed, which is capable of both encryption and decryption. Section 5 gives synthesis results on an ASIC technology library, and Section 6 concludes the paper.
Algorithm of Enhanced NTRU
NTRU is a parameterized family of lattice-based public key cryptosystems. Its basic operations are carried out in the truncated polynomial ring $R = \mathbb{Z}[X]/(X^N - 1)$. Polynomials in the ring have degree at most $N-1$, and all coefficients are integers. A polynomial $a \in R$ can be represented as:
$$a = \sum_{i=0}^{N-1} a_i X^i$$
Addition in the ring R is the same as ordinary polynomial addition. Multiplication in R, however, is referred to as star multiplication (see [2] for details) and is denoted by the symbol $*$.
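As an illustration of the star multiplication, the following Python sketch computes the cyclic convolution of two coefficient lists, that is, the ordinary polynomial product with exponents reduced modulo N. The function name `star_multiply` and its optional `modulus` argument are illustrative assumptions and do not come from the paper or from [2].

```python
def star_multiply(a, b, modulus=None):
    """Cyclic convolution of a and b, i.e. their product with X^N = 1.

    a, b: lists of N integer coefficients; index i holds the coefficient of X^i.
    modulus: if given, the coefficients of the result are reduced modulo it.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("both operands must have N coefficients")
    c = [0] * n
    for i, ai in enumerate(a):
        if ai == 0:
            continue  # zero terms contribute nothing
        for j, bj in enumerate(b):
            # X^i * X^j wraps around to X^((i + j) mod N)
            c[(i + j) % n] += ai * bj
    if modulus is not None:
        c = [x % modulus for x in c]
    return c
```

For example, with N = 3, multiplying $1 + 2X + 3X^2$ by $X$ rotates the coefficients by one position, so `star_multiply([1, 2, 3], [0, 1, 0])` returns `[3, 1, 2]`.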
The NTRU cryptosystem depends on three parameters $(N, p, q)$, where a large prime N limits the degree of polynomials in the ring and $(p, q)$ satisfies $\gcd(p, q) = 1$. It is also possible to take p to be a polynomial, for example $p = 2 + X$. In order to speed up both key generation and decryption, a new form of enhanced NTRU is suggested [6]. In enhanced NTRU, users choose f to have the form $f = 1 + pF$. Notice that this form has the property $f_p^{-1} = f^{-1} \pmod{p} = 1$, so it is not necessary to compute the inverse modulo p, and the second multiplication in the decryption process disappears. The other optimizations presented in the rest of this paper are all based on enhanced NTRU.
Before we outline the enhanced NTRU algorithm, the sets from which the four binary polynomials $F, g, r, m$ are drawn need to be specified. The following notation is used to specify the sets of polynomials:
The set of binary polynomials in R.
The set of binary polynomials in R with d ones and N-d zeros.
Then the polynomials $F, g, r, m$ are selected from the specified sets:
The private key is computed as: $f = 1 + pF$
Decryption: In order to decrypt the ciphertext e using the private key f, the user first calculates:
$$a \equiv f * e \pmod{q}$$
Then the user chooses $a \in R$ to satisfy this congruence and to lie in a pre-specified subset of R. Finally, a binary polynomial m is found that satisfies $m(-2) \equiv a(-2) \pmod{2^N + 1}$.
It can be proved that m equals the plaintext.
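The last decryption step, finding a binary m with $m(-2) \equiv a(-2) \pmod{2^N + 1}$, can be illustrated in software. The sketch below is not the mapping algorithm of [6] or the hardware mapping module of this paper. It is a substitute technique, assumed here for illustration only: the residue is shifted into the range covered by N base $-2$ digits and then converted to base $-2$. The function names are hypothetical, and N is assumed to be odd.

```python
def eval_at_minus_two(coeffs, modulus):
    """Evaluate a polynomial (low-degree coefficient first) at X = -2 mod `modulus`."""
    acc = 0
    for c in reversed(coeffs):          # Horner's rule
        acc = (acc * -2 + c) % modulus
    return acc


def recover_binary(value, n):
    """Return binary m (N coefficients) with m(-2) congruent to value mod 2^N + 1.

    Assumes N is odd. Returns None for the single residue that no binary
    polynomial of N coefficients can reach.
    """
    modulus = (1 << n) + 1
    # N digits in base -2 represent exactly the integers in [lo, hi].
    hi = sum(1 << i for i in range(0, n, 2))
    lo = -sum(1 << i for i in range(1, n, 2))
    v = value % modulus
    if v > hi:
        v -= modulus                    # pick the representative below zero
    if v < lo:
        return None
    digits = []
    for _ in range(n):
        bit = v & 1                     # parity, also correct for negative v
        digits.append(bit)
        v = (v - bit) // -2             # exact division, v - bit is even
    return digits
```

Because N base $-2$ digits cover $2^N$ consecutive integers and the modulus is $2^N + 1$, every residue except one corresponds to exactly one binary polynomial, which is why the recovered m is unique in this sketch.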
Light-Weight and Fast NTRU (Encryption-Only)
To meet the requirements of ultra-low-cost environments such as wireless sensor networks, we propose a light-weight NTRU structure based on the enhanced NTRU algorithm. It performs encryption only and is optimized for both high speed and small area. The parameter set we have chosen is $(N, q) = (107, 64)$ and $d_r = 5$, which is the lowest security level recommended in [2]. Figure 1 shows the architecture of the light-weight NTRU engine. It consists of a control engine, a 6-bit result buffer, a multiplication module, a look-up table (LUT) and a non-zero coefficients sequence generator (NCSG).
The control engine manages the encryption process. The result buffer stores the final result and the current sum during the star multiplication. The multiplication module performs the star multiplication. The public key h is pre-computed and stored in the LUT. The NCSG generates and rotates the degrees of the non-zero terms of the random polynomial r.
Thanks to the NCSG, one operand of each multiplication is fixed to 1'b1 (see Section 3.1 below), so no multiplier is needed and the multiplication module consists only of a 6-bit adder and a router.
Non-Zero Coefficients Sequence Generator (NCSG)
In the NTRU encryption process, a direct computation of $h * r$ is not time-efficient. In fact, r is a very sparse binary polynomial that has only $d_r$ non-zero coefficients. As a result, most of the multiplications in a star multiplication are unnecessary.
In this paper, we introduce a non-zero coefficients sequence generator (NCSG), which records the degrees of the non-zero terms of the polynomial r as the coefficients of r are loaded one by one. The non-zero coefficients sequence is then rotated during the computation of the star multiplication, and the control engine uses it to generate the corresponding address of h for the LUT. As Figure 2 shows, the NCSG consists of a $7 d_r$-bit circular register, a 7-bit counter, a 7-bit adder, a 2-input router, an AND gate and an OR gate.
The input of the right-hand 7-bit register is selected from two paths by a ctrl signal. Path 1 comes from the most significant 7 bits: during the computation of the star multiplication, the register acts as an ordinary shift register, and the degree of the current non-zero term is output to the control engine by the left-hand 7-bit register. Path 2 is used while the polynomial r is loaded: the output of the 7-bit counter is added to the least significant 7 bits (where the degree of the current term is stored). The counter counts the zeros arriving on r_in and resets when a one is input. The degree of the next term is then computed by the adder and loaded into the circular register. After N clock cycles, the non-zero coefficients sequence has been generated in the circular register. The clock of the circular register is a gated clock based on the AND gate. During a star multiplication, ctrl equals 0 and r_in is held high, so the clock is always enabled. During the loading stage, where ctrl equals 1, the clock of the circular register is enabled only when a one is detected on r_in. Thanks to the NCSG, the time needed to compute $h * r$ is reduced to $d_r \times N$ clock cycles. In addition, one operand of the multiplication module is constantly 1. As a result, the hardware cost of storing the polynomial r is saved.
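The loading behaviour of the NCSG can be modelled in a few lines of Python. This is a behavioural sketch only, not a description of the actual circuit: it assumes that the coefficients of r arrive lowest degree first, one per clock cycle, and the name `ncsg_load` and the bookkeeping variable `next_base` are hypothetical.

```python
def ncsg_load(r_bits):
    """Return the degrees of the non-zero terms of binary r.

    r_bits: coefficients of r, one per clock cycle, degree 0 first
    (an assumption of this model).
    """
    degrees = []
    zero_count = 0   # models the counter of zeros seen since the last one
    next_base = 0    # degree the next coefficient has if no zeros precede it
    for bit in r_bits:
        if bit == 1:
            degree = next_base + zero_count   # models the adder output
            degrees.append(degree)            # register is clocked only on a one
            next_base = degree + 1
            zero_count = 0                    # counter resets on a one
        else:
            zero_count += 1
    return degrees
```

For the input `[0, 1, 0, 0, 1, 1]`, which represents $r = X + X^4 + X^5$, the model returns `[1, 4, 5]`. Only these degrees need to be stored, rather than all N coefficients of r.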
Control Engine
The control engine is the controller of the NTRU engine. It is designed as a 4-state finite state machine (FSM) that starts in an idle state.
When a valid enc signal is detected, the encryption process starts and the FSM enters a load state, during which the coefficients of the polynomial r are loaded one by one into the NCSG. After N clock cycles, the degrees of the non-zero terms have been generated in the NCSG. The FSM then moves to the multiplication state and begins to calculate the first coefficient of $h * r$, which takes $d_r$ clock cycles. Multiplication is followed by an add state, in which the plaintext m is added to the current sum. At this point, the first coefficient of e has been calculated and the control engine outputs a done signal. After the addition of the message, the FSM returns to the multiplication state to compute the second coefficient of e. When the last coefficient of e has been calculated, the FSM returns to the idle state and the encryption process is finished.
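The encryption flow driven by this FSM can be sketched as a small software model that also counts clock cycles. The model assumes one cycle per loaded coefficient of r, one cycle per addition in the multiplication state, and one cycle per add state; these per-state costs and the function names are assumptions made for illustration.

```python
def encrypt_sparse(h, r_degrees, m, q):
    """Compute e = h * r + m (mod q) from the degrees of the ones in binary r.

    Returns the ciphertext coefficients and a modelled clock-cycle count.
    """
    n = len(h)
    cycles = n                          # load state: one cycle per coefficient of r
    e = []
    for k in range(n):
        acc = 0
        for j in r_degrees:             # multiplication state: additions only
            acc = (acc + h[(k - j) % n]) % q
            cycles += 1
        acc = (acc + m[k]) % q          # add state: add the message coefficient
        cycles += 1
        e.append(acc)
    return e, cycles


def delay_us(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles * 1_000_000 // clock_hz
```

Under these assumptions the cycle count is $N + N(d_r + 1)$. For $N = 107$ and $d_r = 5$ this gives 749 cycles, or 1498 μs at 500 kHz, which agrees with the encryption delay reported for the light-weight core.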
High-Speed NTRU
The performance gain of enhanced NTRU comes from the elimination of an inversion modulo p in key generation and of a star multiplication in decryption. On this basis, a high-speed NTRU engine is presented that further speeds up the encryption process by improving the efficiency of the star multiplication. The high-speed NTRU can be used in wireless networks that are structured around base stations and centralized servers, which do not have the limitations associated with small portable devices. Figure 3 shows the architecture of the high-speed NTRU. For decryption, a $7N$-bit e buffer is used as storage. The parameter set of NTRU used here is considered to offer a high level of security and an extremely low decryption failure probability [8].
Mapping Module
After $a = f * e \pmod{q}$ has been calculated in the decryption process, the binary plaintext m must be recovered from a. The binary polynomial m satisfies $m(-2) \equiv a(-2) \pmod{2^N + 1}$ [6].
According to [6], the transformation in the mapping algorithm can be summarized as:
The mapping module can be implemented in a purely combinational way; it performs the following function:
Algorithm( , , ), ,
where i is the value of a 7-bit counter that increases every clock cycle until i equals 2N. The connection between the mapping module and the result buffer is shown in Figure 4. When the control signal sel equals 0, a mapping operation is executed and $x_o$ is connected to the input of the right-hand 7-bit register; when sel equals 1, the result of the star multiplication is loaded into the result buffer and data serves as the input to the circular register.
According to the above description, the entire mapping process takes 2N clock cycles in total.
Small Hamming Weight Product
To further reduce the time needed to compute the product $h * r$, an alternative form that takes advantage of sparse polynomials is suggested [6, 7].
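The idea of a small Hamming weight product can be sketched as follows. The sketch assumes that r is built as $r = r_1 * r_2 + r_3$ from three sparse binary polynomials, so that $h * r = (h * r_1) * r_2 + h * r_3$. This product form is an assumption adopted for illustration, and the function names are hypothetical. Each sparse polynomial is given by the list of degrees of its ones.

```python
def sparse_star(a, degrees, q):
    """a * b mod q with X^N = 1, where binary b is given by the degrees of its ones."""
    n = len(a)
    return [sum(a[(k - j) % n] for j in degrees) % q for k in range(n)]


def product_form_times(h, r1, r2, r3, q):
    """h * (r1 * r2 + r3) mod q, computed as (h * r1) * r2 + h * r3."""
    alpha = sparse_star(h, r1, q)       # h * r1
    beta = sparse_star(alpha, r2, q)    # (h * r1) * r2
    gamma = sparse_star(h, r3, q)       # h * r3
    return [(x + y) % q for x, y in zip(beta, gamma)]
```

In this form each output coefficient costs about $d_{r_1} + d_{r_2} + d_{r_3}$ additions, whereas the expanded polynomial $r_1 * r_2 + r_3$ can have up to $d_{r_1} d_{r_2} + d_{r_3}$ non-zero terms.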
Control Engine
The control engine in this design has 6 states and begins in an idle state. When a high enc signal is detected, the encryption process starts and the FSM enters a load state. We divide the whole encryption process into three steps, as shown in Figure 5.
Step 1: computing $\alpha = h * r_1$. During the load state, the coefficients of $r_1$ are loaded one by one into the NCSG. After N clock cycles, the degrees of the non-zero terms have been generated in the NCSG. The FSM then moves to the multiplication state to calculate the first coefficient of $\alpha$, which takes $d_{r_1}$ clock cycles. Note that one multiplier operand is the constant "1" during this state. The following state is the result state, during which the first coefficient of $\alpha$ is loaded into the e buffer. The FSM then returns to the multiplication state to compute the second coefficient of $\alpha$ and stores it in the result state.
Multiplication is followed by the add state, where the plaintext m is added to the current sum. At this point, the first coefficient of $\beta$ has been calculated, and it is loaded into the result buffer in the result state. After this, the FSM computes and stores the second coefficient of $\beta$. When the last coefficient of $\beta$ has been stored in the result buffer, Step 3 follows.
Implementation Results
In this paper, our designs were implemented in Verilog and synthesized with Synopsys Design Compiler at a clock frequency of 500 kHz. The target ASIC technology library is the HJTC 0.18 μm standard cell library. We also implemented AC Atici's design [5] for comparison.
[Figure 5: the three steps of the encryption process (Step1, Step2, Step3)]
We first synthesized two encryption-only NTRU cores with the same parameter set, $(N, q) = (107, 64)$ and $d_r = 5$: one is an implementation of AC Atici's scheme and the other is our light-weight NTRU core. Table 1 shows that the proposed light-weight NTRU achieves a significant decrease in both encryption delay and area. The encryption delay is 1498 μs, which is only 6.4% of that of AC Atici's design. This was expected, since our design takes full advantage of the sparseness of the polynomial r. The area is also much smaller than that of the reference design, at only 1175 equivalent gates.
As shown in Table 2, the decrease in area is mainly due to the elimination of the r buffer, which is used to store and rotate the polynomial r and consumes 1243 gates, while the added module, the NCSG, consumes only 466 gates. Synopsys Design Compiler also reports a total power consumption of 1.51 μW.
We also synthesized our high-speed NTRU core, which performs both encryption and decryption, using the parameter set of Section 4. Table 3 shows that the high-speed NTRU completes encryption and decryption in 16,064 μs and 128,010 μs, respectively, which is a considerable gain in speed, especially for encryption. However, the area is 25,758 gates and the total power consumption is 59.2 μW. The high-speed NTRU could be used in base stations and servers of wireless networks. Since power consumption is not an optimization target in this paper, low-power methods such as clock gating and operand isolation are not used in our designs. If these methods were adopted, the power consumption would be greatly reduced.
Conclusions
In this paper we presented two hardware architectures of NTRU. The first, which is capable of encryption only, is optimized for small area and high speed. It has a gate count of 1175 gates and a total power consumption of 1.51 μW, and it completes the encryption process in 1498 μs at 500 kHz. It is well suited to ultra-low-cost environments such as wireless portable devices. The second is designed for high speed, with delays of 16,064 μs and 128,010 μs for encryption and decryption, respectively, which is a significant gain compared with the original NTRU at the same level of security. This circuit consists of 25,758 equivalent gates and has a total power consumption of 59.2 μW. The high-speed encryption-decryption NTRU core is therefore recommended for use in wireless base stations and servers.
In addition, the designs can be sped up further by using parallel polynomial multiplier units.
