Abstract: This paper presents a semi-systolic Montgomery multiplier based on the redundant basis representation of finite field elements. The proposed multiplier has lower hardware and time complexities than related multipliers. We also propose a serial systolic Montgomery multiplier that is well suited to space-limited hardware. Furthermore, a simple inversion based on the proposed scheme is presented.
Introduction
In finite field arithmetic, addition is trivial but multiplication is time-consuming. Other operations such as exponentiation and inversion can be performed using repeated multiplication. As a result, efficient multiplier architectures are important from a system performance point of view. Another crucial factor affecting field arithmetic efficiency is the choice of the basis. Wu et al. [1] proposed a redundant basis (RB) to embed a finite field into a minimal cyclotomic ring with the elegant multiplicative structure of a cyclic group. A number of systolic multipliers over $GF(2^m)$ have been introduced [2, 3, 4, 5]. Recently, Huang et al. [5] proposed a semi-systolic multiplier to reduce both time and space complexities. Chiou et al. [4] proposed a semi-systolic Montgomery multiplier (MM) with concurrent error detection capability. However, most existing semi-systolic multipliers suffer from several shortcomings, including large time and/or hardware overhead. In this letter, we propose a low-complexity multiplication algorithm based on the RB and two systolic multipliers over $GF(2^m)$. The proposed scheme can be used as a kernel circuit for multiplication and exponentiation (inversion).
2 MM for finite field
2.1 Bit-parallel (semi) systolic MM
Let β be a primitive nth root of unity in some extension of $GF(2)$. The nth cyclotomic field $GF(2^n)$ over $GF(2)$ is defined to be the splitting field of $x^n - 1$ over $GF(2)$. Then, $GF(2^n)$ is generated by β over $GF(2)$ and any element $A$ of $GF(2^n)$ can be represented as $A = \sum_{i=0}^{n-1} a_i \beta^i$, where $a_i \in GF(2)$. Let $GF(2^m)$ be a field that can be embedded in $GF(2^n)$. It has been shown that $GF(2^m)$ is contained in $GF(2^n)$ iff $n$ is odd and $m$ divides the multiplicative order of 2 mod $n$ [1]. Note that the representation of $A$ is not unique, since $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$. For a second element $B = \sum_{i=0}^{n-1} b_i \beta^i$, the product $T = AB$ is obtained as $T = \sum_{j=0}^{n-1} t_j \beta^j$, where $t_j = \sum_{i=0}^{n-1} a_i b_{\langle j-i \rangle}$. Note that $\langle j-i \rangle$ denotes that $j - i$ is to be reduced modulo $n$.
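As an illustration of the two facts above, the embedding condition and the coefficient formula for T = AB can be checked with a short Python sketch. The names `order_of_2`, `embeds` and `rb_multiply`, and the representation of an element as a list of n bits [a_0, ..., a_{n-1}], are assumptions made for this example only; the sketch is a software model, not the hardware design.

```python
def order_of_2(n):
    """Multiplicative order of 2 modulo an odd n > 1."""
    k, x = 1, 2 % n
    while x != 1:
        x = (2 * x) % n
        k += 1
    return k


def embeds(m, n):
    """Embedding condition quoted from [1]: n odd and m divides ord_n(2)."""
    if n < 3 or n % 2 == 0:
        return False
    return order_of_2(n) % m == 0


def rb_multiply(a, b):
    """RB product as a cyclic convolution over GF(2):
    t_j = XOR over i of a_i AND b_<j-i>, indices reduced modulo n."""
    n = len(a)
    assert len(b) == n
    t = [0] * n
    for j in range(n):
        for i in range(n):
            t[j] ^= a[i] & b[(j - i) % n]
    return t


if __name__ == "__main__":
    print(embeds(4, 5))                      # True
    # A = 1 + beta, B = beta^4, so AB = beta^4 + beta^5 = 1 + beta^4
    print(rb_multiply([1, 1, 0, 0, 0], [0, 0, 0, 0, 1]))  # [1, 0, 0, 0, 1]
```

The double loop mirrors the formula term by term; it is written for readability, not speed.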
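MM was proposed originally for efficient integer modular multiplication. Later, it was shown that MM is also applicable to $GF(2^m)$. Instead of computing $AB$ directly, MM computes $ABR^{-1}$ for a fixed field element $R$.
In the RB, $\beta^n = 1$, so $A\beta = a_{n-1} + a_0\beta + a_1\beta^2 + \cdots + a_{n-2}\beta^{n-1}$. Thus, the multiplication of $A$ by β can be obtained using one right cyclic shift of the coefficients of $A$.
The multiplicative inverse of β is $\beta^{-1} = \beta^{n-1}$, so the multiplication of $A$ by $\beta^{-1}$ is obtained by one left cyclic shift of $A$ as $A\beta^{-1} = a_1 + a_2\beta + \cdots + a_{n-1}\beta^{n-2} + a_0\beta^{n-1}$.
On the other hand, the squaring of an element $A$ can be optimized owing to the fact that cross terms disappear because they come in pairs and the underlying field is $GF(2)$. Since $n$ is odd and $\beta^n = 1$, $A^2 = \sum_{i=0}^{n-1} a_i \beta^{\langle 2i \rangle}$. Thus, the squaring of $A$ is obtained simply by using the subscript operation of the coefficient $a_i$ as $A^2 = \sum_{i=0}^{n-1} a_{\langle i/2 \rangle} \beta^i$, where $\langle i/2 \rangle$ denotes $i \cdot 2^{-1}$ reduced modulo $n$.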
Note that these properties are useful for constructing efficient low-complexity field arithmetic architectures for GFð2 m Þ defined by the RB.
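The three RB properties described above (multiplication by β, multiplication by β^{-1}, and squaring) amount to index permutations of the coefficient vector. A minimal Python sketch follows, assuming an element is stored as a list [a_0, ..., a_{n-1}]; the helper names `mul_beta`, `mul_beta_inv` and `rb_square` are hypothetical and chosen only for this illustration.

```python
def mul_beta(a):
    """A * beta: one right cyclic shift, (a_0,...,a_{n-1}) -> (a_{n-1},a_0,...,a_{n-2})."""
    return a[-1:] + a[:-1]


def mul_beta_inv(a):
    """A * beta^(-1): one left cyclic shift, (a_0,...,a_{n-1}) -> (a_1,...,a_{n-1},a_0)."""
    return a[1:] + a[:1]


def rb_square(a):
    """A^2: coefficient a_i moves to position <2i>; a permutation because n is odd."""
    n = len(a)
    assert n % 2 == 1, "the RB requires an odd n"
    s = [0] * n
    for i in range(n):
        s[(2 * i) % n] = a[i]
    return s


if __name__ == "__main__":
    a = [1, 1, 0, 1, 0]            # A = 1 + beta + beta^3, n = 5
    print(mul_beta(a))             # [0, 1, 1, 0, 1]
    print(mul_beta_inv(a))         # [1, 0, 1, 0, 1]
    print(rb_square(a))            # [1, 1, 1, 0, 0]
```

None of the three operations needs a logic gate in this model; each is a rewiring of the coefficients, which is why they are attractive in hardware.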
A new algorithm to compute the Montgomery product $T = ABR^{-1}$ over the RB can be obtained by
From this, an iterative procedure for computing $ABR^{-1}$ can be formulated as follows.
for $i = 0, 1, \ldots, n-1$. After $n$ iterations, $T$ is obtained, where
$\sum_{j=0}^{n-1} a_{i,j}\beta^j$ is the $i$th intermediate value. Equation (1) can be reformulated as the following recursive equations.
From (2) and (3) it is evident that $\deg(T_{n-1}) \le n-1$. To reduce the critical path delay, the operation of (3) can be reorganized as follows:
For simplicity, the binary field $GF(2^4)$ is used to illustrate the systolic multiplier architecture over the RB, where $GF(2^4)$ can be embedded in the minimal cyclotomic field $GF(2^5)$. Based on the proposed algorithm, the hardware architecture of the semi-systolic multiplier is shown in Fig. 1(a), where $(n-1) \times n$ basic cells, $n$ AND gates and $n$ XOR gates are used and "•" denotes a 1-bit latch. In Fig. 1(b), the basic cell at position $(i, j)$ performs the following logic operations ($1 \le i \le n-1$ and $0 \le j \le n-1$): $a_{i,j} = a_{i-1,j+1}$; $c_{i,j} = b_{n-i-1} \cdot a_{i-1,j+1}$; $t_{i,j} = t_{i-1,j} \oplus c_{i-1,j}$. In Fig. 1(a), the cell at position $(i, j)$ receives $a_{i-1,j+1}$ from the cell at position $(i-1, j+1)$ of the previous row and computes $c_{i,j}$ and $t_{i,j}$.
Each coefficient of $B$ enters the array from the left side and flows in the direction $[0, 1]^T$, where the superscript $T$ denotes the transpose operator. $a_{\langle j-3 \rangle}$ ($0 \le j \le n-1$) enters index $[1, j]^T$ from the top and flows in the direction $[1, -1]^T$, where $a_{0,j+1} = a_{\langle j-3 \rangle}$. $a_j$ ($2 \le j \le n-1$) also enters index $[j, n-1]^T$ from the right side and flows in the direction $[1, -1]^T$, where $a_{j-1,n} = a_j$. The values $t_j$ and $c_j$ ($0 \le j \le n-1$) enter index $[1, j]^T$ from the top, are combined with the partial products generated by the previous row to give new partial products that are passed on to the next row, and flow in the direction $[1, 0]^T$, where $t_{0,j} = t_j = 0$ and $c_{0,j} = c_j = a_{\langle j+1 \rangle} \cdot b_{n-1}$. The result $T$ is obtained from the bottom row of the array after $n-1$ iterations.
It can be seen from Fig. 1(a) that $a_{i,0}$ generated on the left side of the $i$th row enters the right side cell of the $(i+1)$th row (i.e., one left cyclic shift). In Fig. 1(b), the basic cell consists of one 2-input AND gate and one 2-input XOR gate, and the cell at position $(i, j)$ receives $a_{i-1,j+1}$ as its input from the $(i-1, j+1)$th cell, $t_{i-1,j}$ and $c_{i-1,j}$ from the $(i-1, j)$th cell, and $b_i$ from the $(i, j-1)$th cell. In Fig. 1(a), the left side input $b_i$ (resp., the right side input $a_{j+1}$) is staggered by one clock cycle relative to $b_{i+1}$ (resp., $a_j$), where $n-3 \ge i \ge 0$ and $1 \le j \le n-2$.
2.2 Bit-serial systolic MM
By projecting Fig. 1(a) in the east direction (projection vector $[0, 1]^T$ and schedule vector $[2, 1]^T$) and retiming by the cut-set systolization techniques [6], a new one-dimensional serial systolic multiplier can be derived. The result is shown in Fig. 2(a), where "•" denotes a 1-bit latch. This multiplier consists of $n-1$ identical basic cells, one 2-input AND gate and one 2-input XOR gate, where the functions of the basic cell are depicted in Fig. 2(b).
Note that according to the projection, the input values other than $B$ enter the left side of the array in serial form, while the coefficients of $B$ should stay inside the array, i.e., $b_{n-i-2}$ ($0 \le i \le n-2$) should remain at the $i$th cell to be ready for execution. It is possible to incorporate one additional 2-to-1 MUX and one 1-bit latch into each cell in Fig. 2(a), so that $b_i$ may also enter the array serially, most significant bit first, at the same time as the control sequence ctr. The multiplier of Fig. 2 is controlled by a control sequence ctr = 011...1 of length $n$. When ctr = 0 enters the $i$th cell, $b_{n-i-1}$ also enters that cell and is loaded, for $1 \le i \le n-1$. The basic cell of Fig. 2(b) consists of one 2-input AND gate, one 2-input XOR gate, one 2-to-1 MUX, and nine 1-bit latches, and its critical path delay is that of one 2-input AND gate plus one 2-to-1 MUX. If the input data come in continuously, this multiplier produces output results at a rate of one per $m+1$ clock cycles with a latency of $3m+1$ clock cycles. The result $t_j$ ($0 \le j \le n-1$) emerges from the right side of the array in serial form, least significant bit first.
In addition, an application of the proposed scheme is to compute the inverse of any element in $GF(2^m)$. Since every nonzero $A \in GF(2^m)$ satisfies $A^{2^m-1} = 1$, the inverse can be written as $A^{-1} = A^{2^m-2} = A^{2} \cdot A^{2^2} \cdots A^{2^{m-1}}$. This method shows that the inversion contains $m-2$ multiplications and $m-1$ squarings. Note that the squaring can be easily obtained by the subscript operation. Hence, the inverse element can be calculated using $m-2$ stages of the proposed multiplier.
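The inversion procedure can be modelled in software as follows. This is a sketch under stated assumptions: it uses GF(2^4) with n = 5 (the example field of this letter), stores elements as 5-bit lists [a_0, ..., a_4], and performs multiplication by plain cyclic convolution rather than by the proposed Montgomery array. All function names are hypothetical. Because the RB representation is not unique, `canonical` uses 1 + β + ... + β^{n-1} = 0 to clear the top coefficient before two representations are compared.

```python
def rb_multiply(a, b):
    """Cyclic convolution over GF(2) (software stand-in for the multiplier)."""
    n = len(a)
    t = [0] * n
    for j in range(n):
        for i in range(n):
            t[j] ^= a[i] & b[(j - i) % n]
    return t


def rb_square(a):
    """Squaring by index permutation: a_i moves to position <2i> (n odd)."""
    n = len(a)
    s = [0] * n
    for i in range(n):
        s[(2 * i) % n] = a[i]
    return s


def rb_inverse(a, m):
    """A^(2^m - 2) = A^2 * A^4 * ... * A^(2^(m-1)):
    m - 1 squarings and m - 2 multiplications."""
    power = rb_square(a)              # A^2
    result = power
    for _ in range(m - 2):
        power = rb_square(power)      # next A^(2^k)
        result = rb_multiply(result, power)
    return result


def canonical(a):
    """Add the all-ones vector (which represents 0) if the top coefficient is 1."""
    return [x ^ a[-1] for x in a]


if __name__ == "__main__":
    beta = [0, 1, 0, 0, 0]
    inv = rb_inverse(beta, 4)
    print(inv)                                   # [0, 0, 0, 0, 1], i.e. beta^4
    print(canonical(rb_multiply(beta, inv)))     # [1, 0, 0, 0, 0], i.e. 1
```

For m = 4 the loop performs m - 1 = 3 squarings and m - 2 = 2 multiplications, matching the counts stated above.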
Analysis and conclusion
We obtained the area of the gates, multiplexer and latch along with their worst-case intrinsic delays pertaining to unit drive-strength from the "SAMSUNG STD 150 0.13 µm 1.2 V CMOS Standard Cell Library" databook. Using these data, we estimated the time and area complexities of the proposed structure and the related structures. The notations $T_{GATEn}$ and $A_{GATEn}$ denote the delay and area of the $n$-input cell, respectively. Table I summarizes the time and area requirements for the cells used in our analysis.
To demonstrate the efficiency of the proposed method, we measure the area-time (AT) complexity of each work and then calculate the improvement. From Table II, we can see that the semi-systolic multiplier of Fig. 1 obtains clear area, time, and AT advantages over other multipliers. In detail, the comparison results show that the AT complexity of the proposed semi-systolic multiplier is improved by approximately 62%, 56%, and 38% compared to the multipliers of Lee et al., Chiou et al., and Huang et al., respectively. The proposed parallel (resp., serial) multiplier produces results at a rate of one per 1 (resp., $m+1$) cycles with a latency of $m+1$ (resp., $3m+1$) cycles using $O(m^2)$ (resp., $O(m)$) area complexity. Note that the parallel semi-systolic architectures have better throughput but much higher hardware cost than the serial systolic architecture of Fig. 2.
This work presents an efficient multiplication algorithm for computing the modular multiplication, which is the crucial operation in finite field arithmetic. The proposed scheme exploits the characteristics of the MM and the RB to construct a low-complexity systolic multiplication architecture. In particular, the serial systolic multiplier is attractive for space-constrained applications. The proposed architectures have the features of regularity, modularity, concurrency, and unidirectional data flow and thus are well suited for VLSI implementation.
Table II. Columns: Lee et al. [3], Chiou et al. [4], Huang et al. [5], Fig. 1, Fig. 2. Rows: # cells.
