A new viewpoint on the design of finite-field multiplier architectures is presented. The proposed architecture reduces the number of clock cycles at the cost of a comparatively small increase in hardware area. It offers high performance, a compact structure, reconfigurability, and independence from the choice of the Galois field. It can be applied to high-security cryptographic applications such as lattice cryptography defined over $Z_p[x]/(x^n+1)$.
Introduction
With quantum computers, and the quantum algorithms designed for them, expected to appear in the near future, third-generation encryption methods such as RSA and ECC (elliptic curve cryptography), whose security rests on the hardness of prime factorization or discrete logarithm problems, are in a precarious position [1] [2] and are expected to be replaced by post-quantum cryptographic algorithms such as lattice cryptography.
Since 2009, when Craig Gentry presented a fully homomorphic encryption scheme based on lattices [3] [4], lattice cryptography has been drawing attention from a variety of fields, including mathematics, computer science, and VLSI implementation.
The hardness of lattice cryptography starts from the SVP (shortest vector problem) and the CVP (closest vector problem), leads to the subset-sum problem, and extends to the LWE (learning with errors) problem [5]. More recently, to handle the large key size ($O(n^2)$) efficiently, Lyubashevsky, Peikert, and Regev proposed Ring-LWE [6], in which the key size is reduced to $O(n)$.
Ring-LWE cryptography uses polynomial multiplication as its main operation. In this paper, we use the Cooley-Tukey FFT algorithm to process the polynomial multiplication efficiently.
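As a point of reference for the operation mentioned above, the following is a minimal sketch of multiplication in $Z_p[x]/(x^n+1)$. It uses plain schoolbook multiplication rather than an FFT, and the function name `negacyclic_mul` and the coefficient-list representation are assumptions made for illustration only.

```python
def negacyclic_mul(a, b, p):
    """Multiply two polynomials in Z_p[x]/(x^n + 1) by the schoolbook method.

    a and b are coefficient lists of equal length n, lowest degree first.
    This is an O(n^2) reference sketch, not an FFT-based implementation.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length")
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % p
            else:
                # x^n = -1 in this ring, so the term wraps around with a sign flip
                c[k - n] = (c[k - n] - ai * bj) % p
    return c
```

For example, with n = 2 and p = 17, (1 + x)(1 + x) = 1 + 2x + x^2 reduces to 2x because x^2 = -1.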
GF Multiplier
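We consider arithmetic in an extension field of GF(2). The extension degree is denoted by $m$, so the field is denoted by GF($2^m$). This field is isomorphic to GF(2)[x]/(P(x)), where
$$P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$$
is an irreducible polynomial of degree $m$ with $p_i \in$ GF(2). Let $\alpha$ be a root of $P(x)$. The multiplication $Z(\alpha) = A(\alpha)B(\alpha) \bmod P(\alpha)$ of two field elements proceeds as follows: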
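A conventional bit-serial multiplication in GF($2^m$) can be sketched in software as follows. The sketch assumes an LSB-first schedule (one bit of $B$ consumed per clock cycle while $A$ is repeatedly multiplied by the root $\alpha$ of $P(x)$), which may differ in detail from the form of equation (1). The function name and the integer encoding (bit $i$ holds the coefficient of $\alpha^i$, and `poly` includes the $x^m$ term) are illustrative assumptions.

```python
def gf2m_mul_serial(a: int, b: int, poly: int, m: int) -> int:
    """LSB-first bit-serial multiplication of a and b in GF(2^m).

    Field elements are ints whose bit i is the coefficient of alpha^i.
    poly encodes P(x) including its x^m term (e.g. 0x11B for x^8+x^4+x^3+x+1).
    """
    z = 0
    for i in range(m):              # one iteration models one clock cycle
        if (b >> i) & 1:
            z ^= a                  # accumulate b_i * alpha^i * A(alpha)
        a <<= 1                     # A <- alpha * A
        if (a >> m) & 1:
            a ^= poly               # reduce using alpha^m = sum of p_i alpha^i
    return z
```

With the polynomial $x^8+x^4+x^3+x+1$ (encoded as `0x11B`), `gf2m_mul_serial(0x57, 0x83, 0x11B, 8)` gives `0xC1`, and the loop runs for $m$ cycles.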
Here we divide the last row of equation (1) into an even part and an odd part.
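One way to write this split, assuming the last row of equation (1) has the LSB-first form $Z(\alpha) = \sum_{i=0}^{m-1} b_i \alpha^i A(\alpha) \bmod P(\alpha)$ with $\alpha$ a root of $P(x)$ (the summation index $j$ and the exact bounds are notation introduced for this sketch):

```latex
\begin{align*}
Z_{\mathrm{even}}(\alpha) &= \sum_{j=0}^{\lceil m/2 \rceil - 1} b_{2j}\,\alpha^{2j} A(\alpha) \bmod P(\alpha) \\
Z_{\mathrm{odd}}(\alpha)  &= \alpha \sum_{j=0}^{\lfloor m/2 \rfloor - 1} b_{2j+1}\,\alpha^{2j} A(\alpha) \bmod P(\alpha) \\
Z(\alpha) &= Z_{\mathrm{even}}(\alpha) + Z_{\mathrm{odd}}(\alpha)
\end{align*}
```

Both sums step through the same sequence $\alpha^{2j} A(\alpha)$, so they can be accumulated in parallel from a single $A(\alpha)$ register.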
Now we perform the reduction using the following property.
Here, $p_{m-1}$ in the second row of equation (3) can be taken to be zero for simplification without loss of generality, because irreducible polynomials with low Hamming weight (trinomials or pentanomials with $p_i = 1$ only in the low-order terms) are used in real-world applications. This can be described as follows:
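Assuming $\alpha$ is a root of $P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$, the reduction identities implied by $P(\alpha) = 0$ over GF(2) can be written as below. This is a sketch derived from the definition of $P(x)$ and is not necessarily identical in layout to equation (3).

```latex
\begin{align*}
\alpha^{m}   &= \sum_{i=0}^{m-1} p_i\,\alpha^{i} \\
\alpha^{m+1} &= p_{m-1}\,\alpha^{m} + \sum_{i=0}^{m-2} p_i\,\alpha^{i+1}
              = \sum_{i=0}^{m-2} p_i\,\alpha^{i+1} \qquad (p_{m-1} = 0)
\end{align*}
```

With $p_{m-1} = 0$, the term $\alpha^{m+1}$ reduces in a single step to terms of degree at most $m-1$, so multiplying by $\alpha^2$ needs no second reduction pass.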
Now the reduction based on $\alpha^2$ can be performed. The first row of equation (2) tells us that a circuit is needed that multiplies $A(\alpha)$ by $\alpha^2$ (from now on called the $\alpha^2$-multiplying circuit) so that the terms of the even part can be calculated recursively. The odd part must additionally multiply the result itself by $\alpha$, which costs one more clock cycle, to produce the odd part of $Z(\alpha)$. Thus, if we take an odd exponent $m$, we can obtain the final result $Z(\alpha)$ by adding (XORing) $Z_{even}(\alpha)$ and $Z_{odd}(\alpha)$ without wasting a cycle. Fig. 1 shows the conceptual schematic of the $\alpha^2$-multiplying circuit, and Fig. 2 shows the implementation of $Z_{even}(\alpha)$ of equation (2). $Z_{odd}(\alpha)$ of equation (2) can be implemented similarly by adding two 2-to-1 multiplexers per bit plus $w_t - 1$ extra multiplexers for selecting the $\alpha$- or $\alpha^2$-multiplying circuit, where $w_t$ is the Hamming weight of the irreducible polynomial, which is very small.
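The even/odd scheme can be modelled behaviourally as in the sketch below. It is a software model under stated assumptions, not the circuit of Fig. 1 or Fig. 2: the function names, the integer encoding (bit $i$ is the coefficient of $\alpha^i$, `poly` includes the $x^m$ term), and the cycle accounting (one cycle per pair of bits of $B$, with the final multiplication of the odd part by $\alpha$ taking one cycle that overlaps the last even-part accumulation when $m$ is odd) are all assumptions drawn from the description above.

```python
from typing import Tuple


def mul_alpha(a: int, poly: int, m: int) -> int:
    """Multiply a field element by alpha and reduce modulo P(alpha)."""
    a <<= 1
    if (a >> m) & 1:
        a ^= poly
    return a


def gf2m_mul_even_odd(a: int, b: int, poly: int, m: int) -> Tuple[int, int]:
    """Return (product, modelled cycle count) using the even/odd split."""
    mask = (1 << m) - 1
    a &= mask
    b &= mask                       # bits at or above position m are ignored
    z_even = 0
    z_odd = 0
    for j in range((m + 1) // 2):   # each iteration consumes b_{2j} and b_{2j+1}
        if (b >> (2 * j)) & 1:
            z_even ^= a             # even part accumulates alpha^(2j) * A
        if (b >> (2 * j + 1)) & 1:
            z_odd ^= a              # odd part accumulates the same alpha^(2j) * A
        # alpha^2-multiplying step, modelled as two alpha steps
        a = mul_alpha(mul_alpha(a, poly, m), poly, m)
    z_odd = mul_alpha(z_odd, poly, m)   # odd part is multiplied by alpha once
    cycles = m // 2 + 1                 # assumed accounting, see lead-in
    return z_even ^ z_odd, cycles
```

For $m = 8$ with `poly = 0x11B`, the model returns the same product as a bit-serial multiplier (`0x57 * 0x83 = 0xC1`) in 5 modelled cycles instead of 8. For an odd exponent such as $m = 3$ it reports 2 cycles, that is $(m+1)/2$.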
In this way, the number of clock cycles for one field multiplication is reduced by a factor of two. Consequently, we can build a multiplier $t$ times as fast as the serial one using circuits up to the $\alpha^t$-multiplying circuit, at a cost of less than $t$ times the resources. The resource reduction is achieved by sharing the $A(\alpha)$ register.
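The same split can be written for a general factor $t$. The following is a sketch assuming the LSB-first serial form with $\alpha$ a root of $P(x)$; the residue index $r$, the partial sums $Z_r$, and the index bounds are notation introduced here, and how the residual factor $\alpha^r$ is applied in hardware is not specified by this sketch.

```latex
\begin{align*}
Z(\alpha)   &= \sum_{r=0}^{t-1} \alpha^{r}\, Z_r(\alpha) \bmod P(\alpha) \\
Z_r(\alpha) &= \sum_{j=0}^{\lceil (m-r)/t \rceil - 1} b_{tj+r}\,\alpha^{tj} A(\alpha) \bmod P(\alpha)
\end{align*}
```

For $t = 2$ this gives $Z_0 = Z_{even}$ and $\alpha Z_1 = Z_{odd}$. All $t$ partial sums consume the same sequence $\alpha^{tj} A(\alpha)$, which is consistent with sharing a single $A(\alpha)$ register.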
Performance Evaluation
For a simple and reasonable comparison, we consider an implementation in GF($2^{2k}$). Fig. 3 is the block diagram of the hybrid multiplier architecture. According to SEC 1 [3], the Hamming weight of the suggested irreducible polynomials is very small relative to the exponent $m$ and can therefore be ignored, and the width of the serial register ($B(\alpha)$) is the same in both architectures. We therefore focus our comparison on the shaded areas of Fig. 2 and Fig. 3. Table 1 shows the comparison results. In the proposed architecture, there are $m$ AND gates, $m$ XOR gates, and $m$ registers in $Z_{even}(\alpha)$; $m$ AND gates, $m$ XOR gates, $2m$ 2-to-1 multiplexers, and $m$ registers in $Z_{odd}(\alpha)$; plus $m$ registers (for storing $A(\alpha)$) and $m$ XOR gates for the final addition. The critical delay path lies in $Z_{odd}(\alpha)$, with one additional multiplexer compared with the traditional architecture, whereas the hybrid architecture includes a parallel GF($2^{n/2}$) multiplier whose delay grows severely as $n$ increases. The proposed architecture takes one clock cycle more than the hybrid architecture because the exponent is even.
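The component counts listed above for the proposed architecture can be totalled with a small helper. This is a sketch that only adds up the figures stated in the text. The function name and dictionary keys are illustrative assumptions, and the multiplying circuits and the $B(\alpha)$ register are left out, as in the comparison above.

```python
def proposed_resources(m: int) -> dict:
    """Total the gate and register counts stated in the text for exponent m."""
    z_even = {"AND": m, "XOR": m, "MUX": 0, "REG": m}
    z_odd = {"AND": m, "XOR": m, "MUX": 2 * m, "REG": m}
    # m registers storing A(alpha) and m XOR gates for the final addition
    shared = {"AND": 0, "XOR": m, "MUX": 0, "REG": m}
    return {key: z_even[key] + z_odd[key] + shared[key] for key in z_even}
```

The totals come to $2m$ AND gates, $3m$ XOR gates, $2m$ 2-to-1 multiplexers, and $3m$ registers.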
Conclusion
A new approach to improving the performance of the serial multiplier architecture in GF($2^m$) is presented. In high-security cryptographic applications, the exponent of the finite field should be a large prime number. The proposed multiplier architecture suits such applications while requiring moderately more resources than the hybrid architecture, which is limited in the choice of exponents of the finite field. It can also operate at a high clock frequency thanks to its short critical delay path.
