Abstract: In this study, we present an efficient finite field arithmetic architecture based on a systolic array for multiplication, which is the core operation for division and exponentiation. To obtain dedicated area-efficient circuits, we adopt the Montgomery multiplication algorithm and a systolic array. We first derive an efficient arithmetic algorithm from the typical Montgomery multiplication using an effective factor, and then design an efficient semi-systolic array based multiplication architecture that is highly suitable for pipelined operation. The proposed multiplier reduces area complexity by at least 40% compared to the corresponding existing structures.
Introduction
Finite field arithmetic operations have recently been applied in a variety of fields, including cryptography and error-correcting codes [1, 2]. Among these operations, multiplication is the most important, because time-consuming operations such as exponentiation, division, and multiplicative inversion can be decomposed into repeated multiplications. Thus, a fast multiplication architecture with low complexity is needed to design dedicated high-speed circuits.
The Montgomery multiplication (MM) algorithm, introduced by Montgomery, is one of the most interesting and useful advances in this realm for fast modular integer multiplication [3]. MM was successfully adapted to the finite field $GF(2^m)$ by Koc and Acar [4]. MM over $GF(2^m)$ is a very efficient solution for the design of fast architectures and VLSI implementations [5, 6]. Hariri and Reyhani-Masoleh considered concurrent error detection for MM over $GF(2^m)$ [6]: three different multipliers, namely bit-serial, digit-serial, and bit-parallel multipliers, were considered, and a concurrent error detection scheme was derived and implemented for each of them. Many semi-systolic multipliers over $GF(2^m)$ have been developed [7, 8, 9, 10].
Huang et al. [7] proposed a semi-systolic polynomial basis multiplier over $GF(2^m)$ to reduce both area and time complexities. They also proposed semi-systolic polynomial basis multipliers with concurrent error detection and correction capability. Kim and Jeon [8] proposed a much faster Montgomery multiplier than the architecture proposed in [7], using a two-fold architecture in which two different architectures operate at the same time. Kim and Kim [9] proposed a multiplier that is more area-efficient than the multipliers proposed in [7, 8]. Recently, Choi and Lee [10] proposed a low-complexity semi-systolic multiplier based on the redundant basis representation of the finite field elements.
In this paper, we derive an efficient multiplication algorithm that reduces the hardware complexity of typical architectures. The proposed algorithm enables multiplication to operate in pipelined computation so that two different operands can be computed in the same hardware architecture.
Let $\alpha$ and $\beta$ be two elements of $GF(2^m)$, and define $\gamma = \alpha \cdot \beta \bmod G$, where $G$ denotes $G(x)$. Also, let $A$ and $B$ be two Montgomery residues; then they are defined as
$$A = \alpha \cdot R \bmod G, \qquad B = \beta \cdot R \bmod G,$$
where the Montgomery factor $R$ and the irreducible polynomial $G$ are relatively prime, i.e., $\gcd(R, G) = 1$. Then, the MM algorithm over $GF(2^m)$ can be formulated as
$$P = A \cdot B \cdot R^{-1} \bmod G.$$
Then, $P$ can be expressed as
$$P = (\alpha R)(\beta R) R^{-1} \bmod G = \alpha \beta R \bmod G = \gamma \cdot R \bmod G$$
by the definition of the Montgomery residue. This means that $P$ is the Montgomery residue of $\gamma$.
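The Montgomery residue property described above can be checked numerically. The following is a minimal software sketch, not the hardware architecture of this paper: polynomials over GF(2) are stored as Python integers (bit i holds the coefficient of x^i), and the field GF(2^4) with G(x) = x^4 + x + 1 and the factor R = x^2 are illustrative assumptions. The helper names `gf_mul`, `gf_inv`, and `mont_mul` are hypothetical.

```python
def gf_mul(a, b, g, m):
    """Return a*b mod g in GF(2^m); g is the irreducible polynomial including x^m."""
    r = 0
    for i in range(m):                    # carry-less (XOR) schoolbook product
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * m - 2, m - 1, -1):  # reduce degrees 2m-2 down to m
        if (r >> i) & 1:
            r ^= g << (i - m)
    return r


def gf_inv(a, g, m):
    """Return a^(-1) for nonzero a, using a^(2^m - 2) = a^(-1)."""
    result, base, e = 1, a, (1 << m) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base, g, m)
        base = gf_mul(base, base, g, m)
        e >>= 1
    return result


def mont_mul(A, B, R_inv, g, m):
    """Montgomery product P = A * B * R^(-1) mod g."""
    return gf_mul(gf_mul(A, B, g, m), R_inv, g, m)


# Assumed example parameters: GF(2^4), G(x) = x^4 + x + 1, R = x^2.
G, M = 0b10011, 4
R = 0b0100
R_INV = gf_inv(R, G, M)

alpha, beta = 0b0010, 0b0011              # alpha = x, beta = x + 1
A = gf_mul(alpha, R, G, M)                # Montgomery residue of alpha
B = gf_mul(beta, R, G, M)                 # Montgomery residue of beta
P = mont_mul(A, B, R_INV, G, M)
gamma = gf_mul(alpha, beta, G, M)
assert P == gf_mul(gamma, R, G, M)        # P is the Montgomery residue of gamma
```

Because the result stays in residue form, chained multiplications (for example in exponentiation) only need a conversion at the start and at the end.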
Proposed architecture
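Based on the property of parallel architecture, we choose the Montgomery factor $R = x^{\lfloor m/2 \rfloor}$. Then, the MM over $GF(2^m)$ can be formulated as
$$P = A \cdot B \cdot x^{-\lfloor m/2 \rfloor} \bmod G.$$
We know that $x$ is a root of $G(x)$, i.e., $G(x) = 0$, and that $g_m = g_0 = 1$ for all irreducible polynomials. Thus, $x^m$ and $x^{-1}$ are as follows:
$$x^m = g_{m-1}x^{m-1} + \cdots + g_1 x + 1 \bmod G,$$
$$x^{-1} = x^{m-1} + g_{m-1}x^{m-2} + \cdots + g_2 x + g_1 \bmod G.$$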
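The two identities for $x^m$ and $x^{-1}$ translate directly into single-step operations: multiplying by x is a left shift followed by a conditional addition of G, and multiplying by $x^{-1}$ is a conditional addition of G followed by a right shift. The sketch below illustrates this in software, assuming polynomials are stored as Python integers (bit i is the coefficient of x^i) and that `g` includes the x^m term; the function names are hypothetical.

```python
def mul_x(a, g, m):
    """Return a * x mod g.

    After the shift, a set bit m stands for x^m, which is replaced by
    g_{m-1} x^{m-1} + ... + g_1 x + 1 by XOR-ing with g.
    """
    a <<= 1
    if (a >> m) & 1:
        a ^= g
    return a


def mul_x_inv(a, g, m):
    """Return a * x^(-1) mod g.

    If the constant term of a is 1, adding g (which has g_0 = 1) clears it
    without changing the field element; the polynomial is then divisible by x.
    """
    if a & 1:
        a ^= g
    return a >> 1


# Assumed example field: GF(2^4) with G(x) = x^4 + x + 1.
G, M = 0b10011, 4
x_inv = mul_x_inv(1, G, M)       # x^(-1) = x^3 + g_3 x^2 + g_2 x + g_1 = x^3 + 1
assert mul_x(x_inv, G, M) == 1   # x * x^(-1) = 1
```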
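Then, with $b_i$ denoting the coefficients of $B$, $A \cdot B \cdot x^{-\lfloor m/2 \rfloor} \bmod G$ can be expressed as follows:
$$P = \sum_{i=0}^{m-1} b_i A\, x^{\,i - \lfloor m/2 \rfloor} \bmod G.$$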
This expression shows that $P$ can be divided into two parts: one based on the negative powers of $x$ and the other based on the positive powers of $x$. Let $l = \lceil m/2 \rceil$ and $k = \lfloor m/2 \rfloor$. Then $P$ can be denoted by
$$P = C + D.$$
From (4) and (5) 
and
From (7) and (8), the recurrence equations of $C$ and $D$ can be formulated by the following equations, where $C^{(0)} = D^{(0)} = 0$.
From (9) and (10), we can obtain the coefficients of $C^{(i)}$ and $D^{(i)}$ as follows:
From (6), (11) and (12), we perform index transformation in the proposed formulas and eliminate unnecessary computations. We circularly shift the coefficient computation of $C$
$\langle j-1 \rangle + a$
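The split of the product into a negative-power part and a positive-power part can be illustrated in software. The sketch below is one possible formulation consistent with the description in this section, not necessarily the exact recurrences of the paper: it assumes $B = \sum b_i x^i$, takes $C$ as the sum of the terms $b_i A x^{i-k}$ for $i < k$ and $D$ as the sum of the terms for $i \ge k$, and uses the assumed recurrences $C^{(i)} = (C^{(i-1)} + b_{i-1}A)\,x^{-1}$ and $D^{(i)} = D^{(i-1)}\,x + b_{m-i}A$ with $C^{(0)} = D^{(0)} = 0$. All function names are hypothetical.

```python
def mul_x(a, g, m):
    """Return a * x mod g (g includes the x^m term)."""
    a <<= 1
    if (a >> m) & 1:
        a ^= g
    return a


def mul_x_inv(a, g, m):
    """Return a * x^(-1) mod g (requires g_0 = 1)."""
    if a & 1:
        a ^= g
    return a >> 1


def mont_mul_split(A, B, g, m):
    """Return A * B * x^(-floor(m/2)) mod g as C + D, with two independent loops."""
    k = m // 2           # floor(m/2)
    l = m - k            # ceil(m/2)
    C = 0                # negative-power part: b_0 .. b_{k-1}
    for i in range(1, k + 1):
        if (B >> (i - 1)) & 1:
            C ^= A
        C = mul_x_inv(C, g, m)
    D = 0                # non-negative-power part: b_k .. b_{m-1}
    for i in range(1, l + 1):
        D = mul_x(D, g, m)
        if (B >> (m - i)) & 1:
            D ^= A
    return C ^ D


def mont_mul_reference(A, B, g, m):
    """Reference: full product A*B mod g, then floor(m/2) multiplications by x^(-1)."""
    p = 0
    for i in range(m - 1, -1, -1):
        p = mul_x(p, g, m)
        if (B >> i) & 1:
            p ^= A
    for _ in range(m // 2):
        p = mul_x_inv(p, g, m)
    return p


# Assumed example: GF(2^5) with G(x) = x^5 + x^2 + 1.
assert mont_mul_split(0b10110, 0b01101, 0b100101, 5) == \
    mont_mul_reference(0b10110, 0b01101, 0b100101, 5)
```

In this sketch each loop runs about m/2 iterations and neither depends on the other, which is the property that lets the two parts be computed concurrently instead of in m sequential steps.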
Complexity analysis
In CMOS VLSI technology, each gate is composed of several transistors [11]. We assume that $A_{\mathrm{AND2}} = 6$, $A_{\mathrm{XOR2}} = 6$, and $A_{\mathrm{LATCH1}} = 8$, where $A_{\mathrm{GATE}n}$ denotes the transistor count of an $n$-input gate. For a further comparison of time complexity, we adopt the practical integrated circuits in [12] and make the following assumptions, as discussed in detail in [7]: $T_{\mathrm{AND2}} = 7$, $T_{\mathrm{XOR2}} = 12$, and $T_{\mathrm{LATCH1}} = 13$, where $T_{\mathrm{GATE}n}$ denotes the propagation delay of an $n$-input gate. A circuit comparison between the proposed multiplier and the related multipliers is given in Table I. The results show that the proposed semi-systolic multiplier saves about 50%, 50%, 45%, and 40% in area complexity compared to the existing multipliers of Huang et al. [7], Kim-Jeon [8], Kim-Kim [9], and Choi-Lee [10], respectively. The time complexity of our multiplier is the same as that of Kim-Jeon [8], while it is about 50%, 27%, and 36% lower than that of Huang et al. [7], Kim-Kim [9], and Choi-Lee [10], respectively.
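The area and delay figures in such a comparison follow from simple weighted sums over gate counts. The sketch below shows the arithmetic using the per-gate values stated above; the cell composition and critical path in the example are hypothetical and are not taken from Table I, and the helper names are assumptions for illustration.

```python
# Per-gate transistor counts and propagation delays, as assumed in the text.
AREA = {"AND2": 6, "XOR2": 6, "LATCH1": 8}
DELAY = {"AND2": 7, "XOR2": 12, "LATCH1": 13}


def transistor_count(gates):
    """Total transistor count for a mapping {gate name: number of gates}."""
    return sum(AREA[name] * count for name, count in gates.items())


def path_delay(path):
    """Total propagation delay along a sequence of gate names."""
    return sum(DELAY[name] for name in path)


def saving_percent(reference, proposed):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# Hypothetical cell (NOT from Table I): 2 AND2, 2 XOR2, 3 one-bit latches.
cell = {"AND2": 2, "XOR2": 2, "LATCH1": 3}
cell_area = transistor_count(cell)
cell_delay = path_delay(["AND2", "XOR2", "LATCH1"])
```

For a semi-systolic array, the total area would be the per-cell count multiplied by the number of cells, and the total latency the per-cell delay multiplied by the number of pipeline stages; both multipliers depend on the specific design being compared.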
Conclusion
In this paper, we proposed a semi-systolic architecture for MM over finite fields. We derived an efficient algorithm that is highly suitable for the design of parallel pipelined structures. The algorithm allows the computation to share the hardware architecture, reducing not only time complexity but also hardware complexity by nearly 40% compared to the recent study. We expect that our architecture can be used efficiently in applications that demand high-speed computation for security purposes, such as crypto coprocessor design.
