ABSTRACT Ring Lizard (RLizard) is a quantum-resistant public-key cryptosystem based on the ideal lattice. RLizard uses a sparse ternary polynomial, which allows implementation with lower complexity. The Lizard scheme's proposal for the National Institute of Standards and Technology's post-quantum cryptography standardization included a reference hardware design that uses the sparse ternary property. In this paper, we present RLizard crypto-processors with improved processing speed and an improved security level against power analysis attacks. By additionally using the values that the conventional RLizard crypto-processor loads but leaves unused at each memory access, the proposed RLizard crypto-processors are two times or up to four times faster. The implementation results with three different FPGA devices show that the area overhead is approximately 50-100 flip-flops (FFs) and 50-300 lookup tables (LUTs), occupying approximately 2%-3% of the total area. The vulnerability to power analysis attacks and the proposed countermeasures were also analyzed. The experimental results confirm the vulnerability of the unprotected implementation, and the implementation results show that the masking and hiding countermeasures additionally require approximately 50-120 FFs and 100-360 LUTs. In addition, our idea can be applied to other ideal-lattice-based cryptosystems using a sparse binary or ternary polynomial, such as NTRU and Round5.
I. INTRODUCTION
Modern public-key cryptosystems, such as RSA [1] and elliptic curve cryptography (ECC) [2], [3], are based on the factoring and discrete logarithm problems, but both problems can be solved efficiently on a quantum computer by Shor's algorithm [4]. As alternatives to these problems, lattice problems, such as the learning with errors (LWE) problem [5] and the learning with rounding (LWR) problem [6], are believed to be quantum-resistant and are attracting considerable attention as the basis of new digital signature [7] and encryption [8] methods. Although cryptosystems based on the standard lattice, which operates over matrices, require large keys, the key size can be reduced by using the ideal lattice [9], which operates over polynomial rings.

The associate editor coordinating the review of this manuscript and approving it for publication was Fan Zhang.
The dominant operation of the ideal-lattice-based cryptosystems is a convolution, which is the multiplication of two polynomials over a polynomial ring. For fast convolution, a number theoretic transform (NTT) is widely used [10]-[13], whereas some cryptosystems [14]-[20] use sparse binary or ternary polynomials to lower the computational complexity of a convolution. Binary and ternary polynomials have coefficients in {0, 1} and {0, 1, −1}, respectively, and ''sparse'' means that most of the coefficients are zero. Hence, the computations involving zero coefficients can be skipped, and multiplications by ±1 can be simplified to additions and subtractions.
Among the proposals to the National Institute of Standards and Technology's (NIST's) post-quantum cryptography standardization [21], the following ideal-lattice-based cryptosystems use sparse binary or ternary polynomials: NTRU [15], [16], Round5 [17], and ring Lizard (RLizard) [18]-[20]. These cryptosystems can be implemented with a simple structure and fast processing speed by using the sparse binary or ternary property. However, to our knowledge, there are only a few studies on their hardware implementation. The NTRU crypto-processor in [22] calculates all the coefficients of the convolution product in parallel. This parallel style requires many resources, and the ternary property is not used. In [23], the sparse ternary property is also not used, so $n^2$ multiplications are required, where $n$ is the order of the polynomial. In [24], the sparse ternary property seems to be used, but the detailed hardware structure is not explained. These three works concern the hardware implementation of the classic NTRU, and the proposals [15]-[17] of NTRU and Round5 to the NIST's post-quantum cryptography standardization included only software implementations. As for RLizard, the work in [19] included only its software implementation. The proposal of the Lizard scheme [20] even included the Verilog source code for a hardware implementation that uses the sparse ternary property; however, its RLizard crypto-processor operates at only a single basic processing speed.
The secret key can be exposed not only through a weakness of the cryptographic scheme itself but also through well-known side channel attacks, which are known to be very effective. Hence, when implementing a cryptographic scheme, resistance to existing side channel attacks must be seriously considered. This point is supported by the fact that resistance to side channel attacks is required in standards such as Common Criteria (CC) [25] and FIPS 140-3 [26]. However, the previous research [15]-[17], [20] only addressed the implementation of NTRU and RLizard and did not include countermeasures against side channel attacks.
In this paper, we show the improved hardware design of the RLizard cryptosystem. The contributions are as follows:
- The processing speed of the conventional RLizard crypto-processor [20] was significantly improved with almost no area overhead.
- The vulnerabilities to power analysis [27]-[30] were analyzed, and countermeasures were proposed.

If the order of the polynomials is large, memory usage is essential, and managing memory accesses can have a significant impact on the processing speed. The main difficulty is that the amount of data that can be accessed from memory at one time is limited, so speed cannot simply be improved by parallel processing at the cost of area. Our contribution is that, under this data access limitation, we improved the processing speed by using the values that are loaded but otherwise discarded at each memory access. This contrasts with the conventional RLizard crypto-processor in [20], which uses only one of the loaded values for each memory access. Additionally, the proposed method is applicable to hardware implementations of other cryptosystems using a binary or ternary polynomial, such as NTRU and Round5.
We also show the conventional RLizard crypto-processor's vulnerabilities to power analysis. Although similar research on NTRU has already been conducted in [31], that research focused on a software implementation, and its computation of the convolution is slightly different from that of our implementation. As a result, the power analysis experimental results were also different, and different countermeasures were proposed.
The organization of this work is as follows. Section II provides information regarding the cryptographic algorithms and main operations of the RLizard scheme. In Section III, the common overall structure is shown, and the proposed speed-up method is presented. In Section IV, the vulnerability to power analysis attacks and the proposed countermeasures are shown with the experimental results. Section V compares the implementation results of the RLizard crypto-processors. Section VI concludes this work.
II. PRELIMINARY
In this section, the cryptographic algorithms of the RLizard scheme are described. Then, the main operations of the algorithms for the hardware implementation are explained.
A. RLIZARD SCHEME
The RLizard scheme has three algorithms: RLizard.KeyGen, RLizard.Enc, and RLizard.Dec. Their inputs and outputs, such as keys, plaintexts, and cipher-texts, are expressed as polynomials, and the algorithms perform polynomial operations. The detailed operations are shown in Table 1, where $p$ and $q$ are powers of two ($p < q$), and polynomials and their operations are defined over a ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$. In Table 1, a private key $s(x)$ and temporary secret $r(x)$ are sparse ternary secrets, so most of their coefficients are zero, and only $h$ coefficients have nonzero values of ±1. An error polynomial $e(x)$ has coefficients in the range (−7, 7) drawn according to a Gaussian distribution. $a(x)$ and $b(x)$ are public keys with coefficients in $\mathbb{Z}_q$. m(x) and m (x) are binary polynomials representing plaintext. $(c_1(x), c_2(x)) \in \mathbb{Z}_q^n \times \mathbb{Z}_q^n$ is the intermediate result of the encryption algorithm and is scaled down by rounding off, so a cipher-text $c = (c_1(x), c_2(x)) \in \mathbb{Z}_p^n \times \mathbb{Z}_p^n$.
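The scaling-down step can be illustrated with a short sketch. It assumes the usual round-to-nearest map from $\mathbb{Z}_q$ to $\mathbb{Z}_p$, that is, $x \mapsto \lfloor (p/q)\,x \rceil \bmod p$; the function names are hypothetical, and the exact rounding rule of Table 1 is not restated here.

```python
def scale_down(x, p=256, q=1024):
    """Map a coefficient in Z_q to Z_p by rounding (p/q) * x to the nearest integer.

    Assumption: ties are rounded up and the result is reduced modulo p.
    """
    return ((x * p + q // 2) // q) % p


def scale_down_poly(coeffs, p=256, q=1024):
    """Apply scale_down to every coefficient of a polynomial given as a list."""
    return [scale_down(c, p, q) for c in coeffs]
```

With p = 256 and q = 1024, this amounts to adding 2 and dropping the two least significant bits, which is why powers of two are convenient in hardware.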
The polynomials of (1) … Since $x^n \equiv -1 \pmod{x^n + 1}$, (1) can be represented as (2).
$u(x)$ is the ternary polynomial $s(x)$ or $r(x)$, so instead of (2), we can use (3).
Except for the initial value of f k in line 2, Algorithm 1 is similar to the convolution computation algorithm in [31] ; however, Algorithm 1 is more suitable for hardware implementation, and a ternary polynomial is used instead of a binary polynomial.
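As a sketch, forms of (1)-(3) that are consistent with the surrounding description can be written as follows. This assumes that (1) is the convolution product of $u(x)$ and $v(x)$ followed by the addition of $w(x)$, and that the $h$ nonzero coefficients of $u(x)$ are stored as positions $u_{pos}[i]$ with signs $\sigma_i \in \{+1, -1\}$; the symbols $\sigma_i$ and $\varepsilon_{i,k}$ are introduced only for this illustration.

```latex
\begin{align}
f(x) &= u(x)\,v(x) + w(x) \in R_q, \\
f_k &= w_k + \sum_{i=0}^{k} u_i\,v_{k-i} - \sum_{i=k+1}^{n-1} u_i\,v_{n+k-i},
       \qquad 0 \le k < n, \\
f_k &= w_k + \sum_{i=0}^{h-1} \varepsilon_{i,k}\,\sigma_i\,v_{(k-u_{pos}[i]) \bmod n},
       \qquad
       \varepsilon_{i,k} =
       \begin{cases}
         +1, & u_{pos}[i] \le k, \\
         -1, & u_{pos}[i] > k.
       \end{cases}
\end{align}
```

The second sum in the middle line carries a minus sign because the products whose exponents reach $n$ wrap around with $x^n \equiv -1$; in the last line the same wrap-around is captured by $\varepsilon_{i,k}$.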
Algorithm 1 Convolution Product and Addition
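The convolution product and addition described above can be sketched in software as follows. This is an illustrative model rather than the listing of Algorithm 1: it assumes that the nonzero coefficients of $u(x)$ are given as position and sign lists (signs encoded as +1 or −1), and the function names are hypothetical. A dense schoolbook version is included only as a reference for checking.

```python
import random


def conv_add_dense(u, v, w, q):
    """Schoolbook reference: f = u*v + w in Z_q[x]/(x^n + 1), u given densely."""
    n = len(v)
    f = list(w)
    for i, ui in enumerate(u):
        if ui == 0:
            continue
        for j, vj in enumerate(v):
            k = i + j
            if k >= n:                    # x^n = -1: wrap around with a sign flip
                f[k - n] -= ui * vj
            else:
                f[k] += ui * vj
    return [c % q for c in f]


def conv_add_sparse(u_pos, u_sign, v, w, q):
    """Same result using only the h nonzero positions of u (signs are +1 or -1)."""
    n = len(v)
    f = []
    for k in range(n):
        acc = w[k]                        # initial value of f_k
        for pos, sign in zip(u_pos, u_sign):
            j = (k - pos) % n
            wrap = -1 if k < pos else 1   # sign flip caused by x^n = -1
            acc += sign * wrap * v[j]     # only an addition or a subtraction
        f.append(acc % q)
    return f
```

The sparse version performs n·h additions or subtractions and no multiplications by general coefficients, which is the property the hardware designs below rely on.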
III. HARDWARE DESIGN OF RLIZARD CRYPTO-PROCESSORS
In this section, the overall hardware structure of the conventional RLizard crypto-processor in [20] is first explained.
Then, our speed-up method is presented, and the modified hardware design is compared with the conventional one. For simple notation, the conventional RLizard crypto-processor is called RLZDx1, and the two proposed RLizard cryptoprocessors are called RLZDx2 and RLZDx4, as these cryptoprocessors are up to two and four times faster than RLZDx1, respectively.
A. COMMON HARDWARE STRUCTURE
The overall structure of the RLizard crypto-processor in [20] is shown in Fig. 1. The adders and registers for accumulation only require a small number of resources, and the dual-port memory for storing polynomials occupies most of the area. For example, key generation requires a total of 1152 words of space for the 512-word $a(x)$, 128-word $s(x)$, and 512-word $b(x)$ when the recommended parameters, $p = 256$, $q = 1024$, $t = 2$, $n = 1024$, and $h = 128$ [18], are used. From now on, we use the recommended parameters unless otherwise mentioned. The required spaces of polynomials are shown in Table 3. To reduce the memory size, the RLizard crypto-processor in [20] used the following two methods. First, $e(x)$ and $m(x)$ are stored in the remaining bits of the words for $a(x)$ and $b(x)$, as a 12-bit space is still unused after storing two 10-bit coefficients of $a(x)$ or $b(x)$ within a word. Second, the output of each algorithm, $f(x)$, is not stored in the memory. Instead, whenever one word of coefficients of $f(x)$ has been computed and collected, the word is output. For example, key generation requires only 640 words for the input polynomials, $a(x)$ and $s(x)$, instead of the 1152 words needed to additionally store the output polynomial, $b(x)$. The proposed RLizard crypto-processors also have the same hardware structure.
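The word counts above can be reproduced with a small sketch. The bit layout used here is an assumption made for illustration (32-bit words, one 10-bit coefficient in the low bits of each half-word), as are the function names; only the figures of two coefficients per word, 12 spare bits, and the 1152 and 640 word totals come from the text.

```python
MASK10 = 0x3FF  # a 10-bit coefficient (q = 1024)


def pack_word(c_lo, c_hi):
    """Pack two 10-bit coefficients into one 32-bit word (hypothetical layout)."""
    return (c_lo & MASK10) | ((c_hi & MASK10) << 16)


def half_wrd(word, upper):
    """Select the upper or lower half-word and return its 10-bit coefficient."""
    return (word >> 16) & MASK10 if upper else word & MASK10


def spare_bits(word_bits=32, coeff_bits=10, coeffs_per_word=2):
    """Bits left over in a word, usable for coefficients of e(x) or m(x)."""
    return word_bits - coeff_bits * coeffs_per_word


def memory_words(n=1024, h=128, store_output=True):
    """Words needed for key generation: a(x), s(x), and optionally b(x)."""
    words_a = n // 2                      # two coefficients of a(x) per word
    words_s = h                           # assumed: one (position, sign) entry per word
    words_b = n // 2 if store_output else 0
    return words_a + words_s + words_b
```

Calling `memory_words()` and `memory_words(store_output=False)` gives the two totals quoted above, so streaming the output word by word saves the 512 words that $b(x)$ would otherwise occupy.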
B. PROPOSED SPEED-UP METHOD
To present the proposed speed-up method, this subsection compares the algorithms of the RLizard crypto-processors. First, the algorithm of RLZDx1 is shown. Then, it is shown how the proposed method improves the processing speed.
1) THE CONVENTIONAL METHOD IN THE RLZDX1
RLZDx1 performs (1) according to Algorithm 2. Compared to Algorithm 1, code for memory accesses and register usage is added. Note that Mem0, Mem1, and Mem2 are logically separated for simple explanation. They are implemented as one dual-port memory in the crypto-processors. The function HalfWrd in lines 2, 5, and 9 selects the upper or lower half word. In decryption, $w_i$ and $v_i$ are one byte in size, so lines 2 and 9 are replaced with their byte-wise counterparts, in which the function Byte selects one byte from the given word.
Algorithm 2 Convolution and Addition in RLZDx1
1: for k = 0 to n − 1
2-3: …
4: for i = 0 to h − 1
5-10: …
11: end for
12: $f_k \leftarrow t_0$
13: end for

Fig. 2 shows an execution example of Algorithm 2. The coefficients in the blue rectangle (Group1), $w_i$ and $v_j$, are accumulated to the variable $t_0$ to calculate $f_k$. Note that when the coefficients that belong to Group1 are loaded, their adjacent coefficients are also loaded together at each memory access.
In addition to Group1, one more coefficient is loaded for each memory access in the key generation and encryption. These additional coefficients are marked by the red rectangle (Group2) in Fig. 2. For example, in Fig. 2, when $w_0$, $v_6$, and $v_4$ (Group1) are loaded for $f_0$, $w_1$, $v_7$, and $v_5$ (Group2) are also loaded together, respectively. Although some of the coefficients in Group2, such as $w_1$, $v_7$, and $v_5$, can be used to calculate $f_1$, RLZDx1 does not use them. If they were also used, the processing speed could be improved. However, this cannot be achieved simply by adding one adder and one register, because only some coefficients in Group2 are required for $f_{k-1}$ and the others are required for $f_{k+1}$. As shown in Fig. 2, the positions of the red rectangles are not fixed.

In decryption, $w_i$ and $v_i$ are one byte in size, so two more coefficients, marked by the yellow rectangles (Group3) in Fig. 2, are additionally loaded together. This means that the processing speed of decryption can be improved more than that of key generation and encryption.
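Why the Group2 coefficient belongs sometimes to $f_{k-1}$ and sometimes to $f_{k+1}$ can be seen from a small sketch. It assumes that two coefficients share a word, so that $v_{2m}$ and $v_{2m+1}$ are always loaded together; the function name is hypothetical.

```python
def loaded_pair(k, pos, n):
    """Return (needed index, neighbour index, output the neighbour belongs to).

    Assumes two coefficients per word: v[2m] and v[2m+1] share a word.
    """
    j = (k - pos) % n            # Group1: coefficient needed for f_k
    if j % 2 == 0:
        mate = j + 1             # upper half of the same word
        target = (k + 1) % n     # it is a term of f_{k+1}
    else:
        mate = j - 1             # lower half of the same word
        target = (k - 1) % n     # it is a term of f_{k-1}
    return j, mate, target
```

For $n = 8$ and $k = 0$, the nonzero positions 2 and 4 give `(6, 7, 1)` and `(4, 5, 1)`, matching the example in which $v_7$ and $v_5$ are terms of $f_1$, whereas an odd position such as 1 gives `(7, 6, 7)`, a term of $f_{n-1}$. The parity of the needed index therefore decides which neighbouring output benefits.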
2) PROPOSED METHOD IN THE RLZDX2 AND RLZDX4
In RLZDx2 and RLZDx4, the adjacent coefficients discarded in RLZDx1 are additionally used. RLZDx2 uses one more coefficient (Group2), so its processing speed is double that of RLZDx1. The pseudo-code is shown in Algorithm 3. In lines 13-14 of Algorithm 3, $t_j$ and $t_{j+1}$ are two of $t_{-1}$, $t_0$, and $t_1$, so $f_k$ and one of $f_{k-1}$ and $f_{k+1}$ are calculated together.
Algorithm 3 Convolution and Addition in RLZDx2
Input: Mem0 ← ($u_{pos}$, $u_{sign}$), Mem1 ← $v(x)$, Mem2 ← $w(x)$
Output: $f(x)$
1: $t_{-1} \leftarrow 0$
2: for k = 0 to n − 2 step 2
3: …

… accumulated to a register $t_1$. Then, $t_1$ is passed to $t_{-1}$, and $v_0$ and $v_2$ are accumulated. From $t_{-1}$ and $t_0$, $f_1$ and $f_2$ are obtained together. In the same way, $f_3, f_4, \ldots, f_{n-2}$ can be calculated. Only $f_{n-1}$ is calculated in a different way. In Fig. 3, $v_6$ and $v_0$ in the green rectangle are accumulated to a separate register $t'_{-1}$. The value in $t'_{-1}$ is kept until $f_6$ is calculated. Then, $t'_{-1}$, $w_7$, $v_5$, and $v_3$ are accumulated to the register $t_1$, and $f_7$ is obtained. RLZDx2 requires three additional registers, $t_{-1}$, $t'_{-1}$, and $t_1$. Furthermore, another adder is required to calculate one of $t_{-1}$ and $t_1$.
RLZDx4 uses Algorithm 3 for key generation and encryption, and Algorithm 4 for decryption. Algorithm 4 processes four coefficients together including three adjacent coefficients (Group2 and Group3). Consequently, RLZDx4 can process decryption four times faster than RLZDx1. The additional coefficients are processed in a similar way to the method in Fig. 3 .
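The two-coefficient scheme can be modelled in software to check that it produces the same outputs with half the memory reads of $v(x)$. This is a behavioural sketch under stated assumptions, not the authors' Algorithm 3: $n$ is even, two coefficients share a word, the secret is a list of (position, sign) pairs with signs +1 or −1, and the names `t_prev` and `t_wrap` are hypothetical stand-ins for $t_{-1}$ and the extra register used for $f_{n-1}$.

```python
import random


def conv_add_x1(u, v, w, q):
    """RLZDx1-style reference: one useful coefficient per word read of v(x)."""
    n, reads = len(v), 0
    f = []
    for k in range(n):
        t0 = w[k]
        for pos, sign in u:
            reads += 1                            # one word read of v(x)
            t0 += sign * (-1 if k < pos else 1) * v[(k - pos) % n]
        f.append(t0 % q)
    return f, reads


def conv_add_x2(u, v, w, q):
    """RLZDx2-style sketch: both coefficients of every loaded word are used."""
    n, reads = len(v), 0
    f = [0] * n
    t_prev = 0        # t_{-1}: partial sum of f_{k-1}
    t_wrap = 0        # extra register: terms of f_{n-1} collected while k = 0
    for k in range(0, n, 2):
        t0, t1 = w[k], w[k + 1]                   # w_k and w_{k+1} share a word
        for pos, sign in u:
            j = (k - pos) % n
            wrap = -1 if k < pos else 1
            reads += 1                            # the word holding v[j] and its mate
            t0 += sign * wrap * v[j]
            if j % 2 == 0:
                t1 += sign * wrap * v[j + 1]      # term of f_{k+1}
            elif k == 0:
                t_wrap += sign * v[j - 1]         # term of f_{n-1}, no wrap-around
            else:
                t_prev += sign * wrap * v[j - 1]  # term of f_{k-1}
        if k > 0:
            f[k - 1] = t_prev % q
        f[k] = t0 % q
        t_prev = t1                               # t_1 is passed on to t_{-1}
    f[n - 1] = (t_prev + t_wrap) % q
    return f, reads
```

Each odd-indexed output is completed across two consecutive steps, which is why one extra accumulator and one extra adder suffice, and the wrap-around case of $f_{n-1}$ is the only one that needs a value held from the first step to the last.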
C. HARDWARE DESIGN OF RLIZARD CRYPTO-PROCESSORS
The hardware structure of RLZDx1 and RLZDx2 is shown in Fig. 4. The largest component is the dual-port memory, and only a few small registers and adders are additionally required. The values of $u_{pos}$ and $u_{sign}$ are loaded through W0, and the coefficients of $w(x)$ and $v(x)$ are loaded through W1. The bit selection that extracts one or two coefficients from the one-word data, W0 and W1, is simply expressed by bit organizers (BOs) such as BO1 and BO2 in Fig. 4. The accumulation part using W1 differs depending on the crypto-processor version. The RLZDx1 in Fig. 4(a) chooses one coefficient (Group1) from W1 and accumulates it to the register $t_0$ using a single adder. After accumulation, the final value is collected in the register WrdOut. Meanwhile, the RLZDx2 in Fig. 4(b) chooses two coefficients (Group1 and Group2) from W1 and accumulates one coefficient to the register $t_0$ and the other to the register $t_{-1}$ or $t_1$ using one additional adder. As explained in the previous subsection, the register $t'_{-1}$ is additionally required for $f_{n-1}$. Similarly, the RLZDx4 can be designed by adding six more registers and two more adders compared to RLZDx2, although the RLZDx4 is not included in Fig. 4.

Algorithm 4 Convolution and Addition in RLZDx4
Input: Mem0 ← ($u_{pos}$, $u_{sign}$), Mem1 ← $v(x)$, Mem2 ← $w(x)$
Output: $f(x)$
1: $(t_{-3}, t_{-2}, t_{-1}) \leftarrow (0, 0, 0)$
2: for k = 0 to n − 4 step 4
3-7: …
8: for i = 0 to h − 1
9-21: …
22: else
23: …
24: end if
25: …
IV. COUNTERMEASURE AGAINST POWER ANALYSIS ATTACKS
All the above RLizard crypto-processors may be vulnerable to CPA attacks [28]. In this section, the vulnerability of RLZDx1 is analyzed as a representative case, and countermeasures against CPA attacks are proposed. Then, we present the power analysis experimental results on all the above RLizard crypto-processors.
A. VULNERABILITY TO POWER ANALYSIS ATTACKS
The power consumption depends on the processed data. By analyzing the correlation between data and power traces, secret data can be revealed. These attacks are called correlation power analysis (CPA). The main purpose of these attacks is to reveal the secret key; hence, crypto-processors are usually attacked while performing decryptions.
In this work, we conducted a CPA according to a typical attack model in [30] as follows.
- When (4) is performed, some bits of the register $t_0$ are changed, which causes power consumption. The number of changed bits in the register $t_0$ can be expressed as (5), where $HD(x, y)$ is the Hamming distance (HD) of $x$ and $y$. This is an HD model, which is one of the most commonly used power models.
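A CPA with the HD model can be illustrated on synthetic data. The sketch below is not the experimental setup of this paper: it assumes that the register $t_0$ is 15 bits wide, that it is initialized with $c_{2,0}$ and then updated with one signed coefficient of $c_1(x)$ selected by the first secret entry, and that a trace sample equals the HD plus Gaussian noise. All function names and the small parameters ($n = 32$, 400 traces) are hypothetical.

```python
import random

WIDTH = 15                          # assumed register width: log2(h * p) = 15
MASK = (1 << WIDTH) - 1


def hd(x, y):
    """Hamming distance of two register values."""
    return bin((x ^ y) & MASK).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def first_update(c1, c2, sign, pos):
    """Register value before and after the first accumulation for f_0."""
    wrap = -1 if pos > 0 else 1                 # k = 0 < pos wraps around
    before = c2[0]
    after = before + sign * wrap * c1[(-pos) % len(c1)]
    return before & MASK, after & MASK


def simulate(secret, n=32, p=256, count=400, noise=1.0, seed=7):
    """Random cipher-texts and noisy HD leakage for a fixed (sign, pos) secret."""
    rng = random.Random(seed)
    cts = [([rng.randrange(p) for _ in range(n)],
            [rng.randrange(p) for _ in range(n)]) for _ in range(count)]
    traces = [hd(*first_update(c1, c2, *secret)) + rng.gauss(0, noise)
              for c1, c2 in cts]
    return cts, traces


def cpa_attack(cts, traces, n):
    """Return the (sign, pos) guess with the highest correlation."""
    best, best_corr = None, -2.0
    for sign in (1, -1):
        for pos in range(n):
            model = [hd(*first_update(c1, c2, sign, pos)) for c1, c2 in cts]
            corr = pearson(model, traces)
            if corr > best_corr:
                best, best_corr = (sign, pos), corr
    return best, best_corr
```

For every guess, the attacker predicts the HD from the known cipher-text and keeps the guess whose prediction correlates best with the measured power, which is the procedure the experiments below apply to real traces.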
B. PROPOSED COUNTERMEASURE
Two types of countermeasures were used. The first method is masking, which eliminates the correlation between data and power consumption. A random number is usually mixed into the intermediate value related to the secret to prevent the HD for each guess from being calculated. In the RLizard crypto-processor, the intermediate value is stored in the register $t_0$. The second method is hiding, which makes the power consumption constant or randomized regardless of the processed data. In the RLizard crypto-processor, the order of loading $s_{pos}[i]$ and $s_{sign}[i]$ can be randomized. The RLZDx1 with these countermeasures against CPA attacks is called RLZDx1-AntiCPA, and the pseudo-code of RLZDx1-AntiCPA for decryption is shown in Algorithm 5.
Algorithm 5 Convolution and Addition in RLZDx1-AntiCPA
1: … generates l random bits
2: for k = 0 to n − 1
3: …
4: $r_M \leftarrow 0$
5: …
For masking, we tried two methods. First, we randomized the initial value of the register $t_0$. As a result, the actual power consumed corresponds to (6), where $r$ is a random number. However, our experimental result shows that this masking method is not effective; hence, it was not used and was excluded from Algorithm 5. This is because (5) and (6) have a high correlation value. This can be confirmed by the fact that $HD(r_0, r_0 + r_1)$ and $HD(r_0 + r_2, r_0 + r_2 + r_1)$ have a correlation value of 0.45, where $r_0$, $r_1$, and $r_2$ are random numbers. Alternatively, we randomized $c_{1,j}$, which is each coefficient of $c_1(x)$ accumulated to the register $t_0$. As a result, the actually consumed power corresponds to (7).
In lines 6 and 12 of Algorithm 5, a random number $r_T$ is generated for each $i$ and added to $c_{1,j}$ before $c_{1,j}$ is accumulated to the register $t_0$. $r_T$ is also accumulated to $r_M$ in line 13, and $r_M$ is removed from $f_k$ in line 15. This masking method blinds all selected coefficients of $c_1(x)$, so simple power analysis, which observes the operations related to zero values after setting the coefficients of the input polynomial $v(x)$ to zeros and nonzeros, can also be prevented. Furthermore, different random values of $r_T$ are used for $0 \le i < h$; thus, it is expected that second-order CPA can also be prevented.

For hiding, the randomized loading order prevents $s_{pos}[0]$ and $s_{sign}[0]$ from being processed at the same time point after the start. Moreover, the values previously accumulated in $t_0$ at that time also change every time. As a result, the correlation for the right guess becomes too small to be distinguishable.
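The masking and hiding steps described above can be modelled in a few lines to confirm that they do not change the result. This is a behavioural sketch and not Algorithm 5 itself: the way the sign is applied to the masked value, the use of a random starting entry `r_h` for the loading order, and all function names are assumptions made for illustration.

```python
import random


def dec_conv_plain(s, c1, c2, mod):
    """Unprotected accumulation: f_k = c2_k plus signed coefficients of c1(x)."""
    n = len(c1)
    f = []
    for k in range(n):
        t0 = c2[k]
        for pos, sign in s:
            wrap = -1 if k < pos else 1
            t0 += sign * wrap * c1[(k - pos) % n]
        f.append(t0 % mod)
    return f


def dec_conv_masked(s, c1, c2, mod, rng):
    """Masked and reordered accumulation in the spirit of RLZDx1-AntiCPA."""
    n, h = len(c1), len(s)
    r_h = rng.randrange(h)                  # hiding: random starting entry (assumed)
    f = []
    for k in range(n):
        t0, r_m = c2[k], 0
        for step in range(h):
            pos, sign = s[(r_h + step) % h]
            wrap = -1 if k < pos else 1
            r_t = rng.randrange(mod)        # fresh mask for every accumulation
            masked = (c1[(k - pos) % n] + r_t) % mod
            t0 = (t0 + sign * wrap * masked) % mod
            r_m = (r_m + sign * wrap * r_t) % mod   # remember what was added
        f.append((t0 - r_m) % mod)          # remove the accumulated mask
    return f
```

Because every value written to `t0` contains a fresh `r_t`, the register transitions no longer depend only on the cipher-text and the secret, while subtracting `r_m` at the end restores the unmasked $f_k$.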
C. EXPERIMENTAL RESULTS
We implemented the RLizard crypto-processors on an XC3S1500 (Spartan 3) using Xilinx ISE 14.2 and acquired power traces using a LeCroy DDA-3000 oscilloscope with a 10 GHz sample rate at a clock frequency of 1 MHz. We used a fixed secret, $(s_{sign}[0], s_{pos}[0]) = (0, 491)$. 40,000 different cipher-texts $(c_1(x), c_2(x))$ were used when attacking the RLizard crypto-processors without countermeasures, and the number of cipher-texts was increased to 200,000 when any countermeasure was used. First, we implemented RLZDx1 and attacked it. Fig. 5 shows the correlations for all the possible guesses over time. The blue line represents the correlation of the right guess, which has the highest value at ∼3000 ns after the start signal. Instead of correlations over time, observing only the maximum correlations for all the possible guesses is sufficient to determine the right guess. Fig. 6 shows these maximum correlations. In Figs. 6(a), (d), and (g), RLZDx1, RLZDx2, and RLZDx4 have maximum correlation values of 0.155, 0.172, and 0.287 for the right guess, respectively. These values are higher than those for the other guesses, which means that the attacks were successful. Note that the maximum correlation values of RLZDx2 and RLZDx4 are higher than that of RLZDx1. This is because two and four coefficients, instead of one, are selected depending on $s_{pos}[0]$ in RLZDx2 and RLZDx4, respectively. This may make the attack on RLZDx2 and RLZDx4 easier than that on RLZDx1.
Figs. 6(b), (e), and (h) show the experimental results on the RLizard crypto-processors with masking. The maximum correlation for the right guess is not distinguishable, so we can see that the countermeasure is effective. On the other hand, Fig. 7 shows the experimental results on RLZDx1 with the method that randomizes the initial value of $t_0$. $t_0$ is initialized as $c_{2,0} + r$ instead of $c_{2,0}$, and the power consumption is affected by the random number $r$, as explained by (6). As shown in Fig. 7, the maximum correlation for the right guess was reduced but is still higher than the others. This means that this method is not effective enough to prevent CPA, as explained in the previous subsection.
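The residual correlation that makes the initial-value masking ineffective can be checked numerically. The sketch below estimates the correlation between $HD(r_0, r_0 + r_1)$ and $HD(r_0 + r_2, r_0 + r_2 + r_1)$ by sampling; the 8-bit width with wrap-around addition is an assumption (one-byte values, as in decryption), and the function names are hypothetical.

```python
import random

BITS = 8                        # assumed width: one-byte values
MOD = 1 << BITS


def hd(x, y):
    """Hamming distance of two values."""
    return bin(x ^ y).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def initial_value_mask_correlation(samples=50000, seed=3):
    """Correlation between the unmasked HD and the HD with a random offset r2."""
    rng = random.Random(seed)
    plain, masked = [], []
    for _ in range(samples):
        r0, r1, r2 = (rng.randrange(MOD) for _ in range(3))
        plain.append(hd(r0, (r0 + r1) % MOD))
        masked.append(hd((r0 + r2) % MOD, (r0 + r2 + r1) % MOD))
    return pearson(plain, masked)
```

Under these assumptions the estimate should land close to the value of 0.45 quoted in the previous subsection. The intuition is that the bits that flip are determined mostly by the added value $r_1$, which both expressions share, and only through the carries by the starting value.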
Figs. 6 (c), (f), and (i) show the experimental results on RLizard crypto-processors with hiding, which are similar to those when masking is applied. That is, this method is also an effective countermeasure against CPA.
V. IMPLEMENTATION RESULTS
This section presents the implementation results of RLizard crypto-processors to analyze the area overhead of the proposed speed-up method and the proposed countermeasure against CPA.
A. PERFORMANCE COMPARISON OF RLIZARD CRYPTO-PROCESSORS
We implemented the RLizard crypto-processors in three different Xilinx FPGA devices, XC3S1500, XC6SLX9, and XC6VCX75T, using Xilinx ISE 14.2. The implementation results after place and route are shown in Table 4. We slightly optimized the source code of [20], so the results of RLZDx1 obtained here are slightly different from those of the source code provided in [20]. In addition, XC3S1500 requires more FFs than the other devices. This is because the memory used in XC3S1500 does not support byte-write, which is useful when storing the coefficients of e(x) and m (x) in the remaining bits of each word after storing the coefficients of a(x) and b(x).
The RLZDx2 and RLZDx4 have processing speeds two and four times faster than RLZDx1, respectively. Although the numbers of flip-flops (FFs) and lookup tables (LUTs) increase by approximately 20% to 140%, the required resources have not increased much once the two block RAMs (BRAMs), which are the largest components, are taken into account. According to [20], the area of the memory part is 98.3k gate equivalents (GEs), and the area of the other part is only 1.4k GEs in an ASIC implementation using the Samsung 65-nm CMOS process. The area excluding the memory is less than 2% of the total area. Therefore, the processing time of the RLizard crypto-processor was improved to be two and four times faster, while the area overhead is negligible.

B. IMPLEMENTATION OVERHEAD OF COUNTERMEASURE AGAINST POWER ANALYSIS ATTACKS

Table 4 also shows the area overhead when the proposed countermeasure is applied. Compared with RLZDx1, RLZDx1-AntiCPA has an additional 51 FFs and 96 LUTs in XC6SLX9, most of which are caused by the registers and the random number generator (RNG) that store and generate $r_H$ and $r_T$. To reduce the overhead, a compact RNG [32], which requires only 15 FFs and 18 LUTs, was chosen. Because the RNG can also be used to generate random errors in RLizard.KeyGen, the area of the RNG can be removed from the area overhead. In this case, the area overhead is only 36 FFs and 78 LUTs.

In terms of the processing speed, $r_T$ for $i = 0$ and $k = 0$ and $r_H$ can be generated while the crypto-processor is in an idle state, and $r_T$ for $1 \le i < n$ can be generated in parallel with the calculation of $f_i$ ($0 \le i < n-1$) using a 15-bit LFSR. The size of the LFSR is determined according to $\log_2(hp) = 15$, as r H is used to blind hp-bit coefficients of $c_1(x)$. This does not increase the required clock cycles at all; it reduces the maximum clock frequency, but only slightly.
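A 15-bit LFSR of the kind mentioned above can be sketched as follows. The feedback polynomial $x^{15} + x^{14} + 1$ is an assumption chosen because it is a standard maximal-length polynomial; the text does not state which taps the compact RNG of [32] uses, and the function names are hypothetical.

```python
def lfsr15_step(state):
    """One step of a 15-bit Fibonacci LFSR for x^15 + x^14 + 1 (assumed taps)."""
    new_bit = ((state >> 14) ^ (state >> 13)) & 1
    return ((state << 1) | new_bit) & 0x7FFF


def lfsr15_period(seed=1):
    """Count the steps until the state returns to the (nonzero) seed."""
    state, steps = lfsr15_step(seed), 1
    while state != seed:
        state = lfsr15_step(state)
        steps += 1
    return steps


def lfsr_width(h=128, p=256):
    """Number of bits needed for values below h * p (both powers of two)."""
    return (h * p).bit_length() - 1
```

Such a generator needs only 15 flip-flops and a single XOR gate for the feedback, and it produces one new state per clock cycle, which fits the observation that mask generation can run in parallel with the accumulation without adding clock cycles.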
VI. CONCLUSION
In this work, we proposed hardware implementation methods that improve the processing speed of the RLizard crypto-processor and its security against CPA. Our implementation results show that the proposed RLizard crypto-processors are two and four times faster than the conventional one with little area overhead. As for the security against CPA attacks, the vulnerability was analyzed, and the proposed countermeasures were validated by experimental results. The implementation results show that the overhead of the proposed countermeasures is also very small. Although we did not perform simple power analysis (SPA) [27] or second-order power analysis [29], we expect that our countermeasures would also be strong against them. Such experiments will be performed in future work. In addition, we expect that our proposed methods can be applied to other ideal-lattice-based cryptosystems using binary or ternary polynomials, such as NTRU and Round5.
