Abstract: This paper presents an efficient parallel-in parallel-out systolic array for $AB^2$ over $GF(2^m)$ using the polynomial basis. As compared to existing related systolic arrays, the proposed array gains a significant reduction in hardware complexity. The proposed architecture includes the features of regularity, modularity and local interconnection. Accordingly, it is well suited for VLSI implementation and can be easily applied as a basic architecture for computing an inversion/division operation.
Introduction
Finite fields $GF(2^m)$ have received a lot of attention in many applications, such as error-correcting codes and cryptography [1, 2]. The important arithmetic operations involved in these applications are multiplication, inversion/division, and exponentiation. These operations can be carried out using a modular $AB$ multiplier or a modular $AB^2$ multiplier. For example, division is performed using multiplication and a multiplicative inverse, that is, $A/B = AB^{-1}$, while the inverse can be regarded as a special case of exponentiation, because $B^{-1} = B^{2^m-2} = (B(B(B \cdots B(B(B)^2)^2 \cdots)^2)^2)^2$. In order to compute $B^{-1}$, the $AB^2$ operation can be used recursively. However, as an inverse operation is quite time-consuming, a high-speed circuit is preferable for such operations.
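The recursive use of $AB^2$ for inversion can be illustrated with a minimal software sketch. It is not the systolic architecture of this paper: the bitmask representation, the function names `gf_mul`, `ab2`, and `inverse`, and the example field $GF(2^4)$ with $G = x^4 + x + 1$ are all assumptions made here for illustration. Starting from $R = B$, the loop applies $R \leftarrow B R^2$ a total of $m - 2$ times and finishes with one squaring, which yields $B^{2^m-2}$.

```python
def gf_mul(a, b, g, m):
    """Multiply a*b in GF(2^m). Elements are bitmasks (bit i = coefficient of alpha^i);
    g is the field polynomial as a bitmask that includes the x^m term."""
    r = 0
    for i in range(m - 1, -1, -1):   # Horner's rule over the bits of b, MSB first
        r <<= 1                      # multiply the partial result by alpha
        if (r >> m) & 1:             # degree reached m: reduce modulo G
            r ^= g
        if (b >> i) & 1:             # add a when coefficient b_i is 1
            r ^= a
    return r


def ab2(a, b, g, m):
    """Compute A * B^2 mod G with two plain multiplications (illustrative only)."""
    return gf_mul(a, gf_mul(b, b, g, m), g, m)


def inverse(b, g, m):
    """Compute B^(-1) = B^(2^m - 2) by using AB^2 recursively."""
    if b == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    r = b
    for _ in range(m - 2):
        r = ab2(b, r, g, m)          # r <- B * r^2
    return ab2(1, r, g, m)           # final squaring, with A = 1


# Assumed example field: GF(2^4) with G = x^4 + x + 1.
G, M = 0b10011, 4
for b in range(1, 1 << M):
    assert gf_mul(b, inverse(b, G, M), G, M) == 1
```

In this sketch an inversion costs $m - 1$ calls to `ab2`, which is why a fast $AB^2$ circuit directly shortens inversion and division.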
For this purpose, parallel-in parallel-out systolic arrays for $AB^2$ using the polynomial basis representation in $GF(2^m)$ have been proposed by Wei [3] and by Wang and Guo [4]. Note that the systolic design in [3] has bidirectional data flow, while the circuit in [4] has unidirectional data flow. Lee et al. [5] proposed a systolic array for computing $AB^2 + C$ over $GF(2^m)$ using the irreducible all-one polynomial. Unfortunately, such irreducible all-one polynomials are very rare. Recently, Kim and Lee [6] proposed low-complexity parallel and serial systolic architectures for $AB^2$ multiplication in $GF(2^m)$. These circuits still have certain shortcomings for various applications due to their high circuit complexity and long cell delay. Thus, further research on an efficient power-sum circuit is needed.
In this paper, we propose an efficient parallel-in parallel-out systolic array with unidirectional data flow for $AB^2$ over $GF(2^m)$ using the polynomial basis representation. As compared to the related circuits in [3, 4, 6], the proposed one gains advantages in terms of chip area and latency.
$AB^2$ Algorithm in $GF(2^m)$
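Let A and B be two elements in $GF(2^m)$ with a primitive polynomial G of degree m, where
$$A = \sum_{i=0}^{m-1} a_i \alpha^i, \quad B = \sum_{i=0}^{m-1} b_i \alpha^i, \quad G = x^m + \sum_{i=0}^{m-1} g_i x^i.$$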
The coefficients $a_i$, $b_i$, and $g_i$ are in {0, 1}. Each element in $GF(2^m)$ is a residue mod G and all arithmetic operations are performed by taking the results modulo 2.
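As a small worked illustration of this representation, take $GF(2^4)$ with $G(x) = x^4 + x + 1$ and let $\alpha$ be a root of G. This particular polynomial is chosen here only as an example and is not taken from the paper. Reducing modulo G, with coefficients taken modulo 2, gives:

```latex
\begin{align*}
\alpha^4 &= \alpha + 1, \\
\alpha^5 &= \alpha \cdot \alpha^4 = \alpha^2 + \alpha, \\
\alpha^6 &= \alpha \cdot \alpha^5 = \alpha^3 + \alpha^2, \\
(\alpha^3 + 1) + (\alpha^3 + \alpha) &= \alpha + 1
  \quad (\text{since } 2\alpha^3 \equiv 0 \bmod 2).
\end{align*}
```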
It is easy to check that
It is well known that the squaring of an element B in $GF(2^m)$ can be represented as
$$B^2 = \sum_{i=0}^{m-1} b_i \alpha^{2i}.$$
$R = AB^2 \bmod G$ can be expressed as
where $R = \sum_{j=0}^{m-1} r_j \alpha^j$. Further expanding the last summations over j in (3), we obtain the following recursion for R:
where
j α j . With F , F , and (7), we can re-express the ith loop operation as follows:
m−2 . With (8), we obtain the following recursion for R :
The coefficients of
, and E (i) can be rewritten in terms of recurrence equations:
where 0 ≤ j ≤ m − 1 and c
Equations (14), (15), and (16) can be computed independently and have a similar computational form. Therefore, the three equations can be computed sequentially using one processing element. If a systolic architecture is designed using the above equations, the required latency is 4m + 3 clock cycles.
Step 1: c
Step 2: c
Step 3: for i = 1 to m do begin
Step 4:
Step 5: for k = m/2 − 1 to 0 do begin
Step 6: c
Step 9: end for
Step 10: end for
Step 11: for k = m/2 − 1 to 0 do begin
Step 12: 2k + $e^{(m)}_{2k}$
Step 13: end for
Step 14: return R
Fig. 1. A two-dimensional SFG array for $AB^2$ in $GF(2^4)$.
Since Steps 6, 7, and 8 have an identical computation structure, these steps can be computed sequentially using one processing element.
Fig. 2.
The ith row of the array realizes the ith iteration from Step 3 to Step 10 of Algorithm 1. As shown in Fig. 1, A, F, and F enter the array from the top in sequence, and B enters from the left. As the input data pass through the array, the coefficients of R emerge from the bottom row.
Each $P_{i,k}$ cell (Fig. 2(a)), for 1 ≤ i ≤ m and 0 ≤ k ≤ m/2 − 1, consists of two 2-input AND gates ($AND_2$) and two 2-input XOR gates ($XOR_2$) and sequentially performs Steps 6, 7, and 8 in Algorithm 1. Each $Q_i$ cell is shown in Fig. 2(b). Each S cell (Fig. 2(c)), for 0 ≤ k ≤ m/2 − 1, consists of four $XOR_2$ gates and four 1-bit latches and performs Step 12 in Algorithm 1. In other words, it computes $r_{2k+1}$ and $r_{2k}$ using data from $P_{m,k}$, for 0 ≤ k ≤ m/2 − 1. After execution of the S cells, the result R emerges from the bottom of the S cells.
By applying the cut-set systolization techniques [7] to the SFG array in Fig. 1, we can derive a parallel-in parallel-out systolic array for $AB^2$. Its hardware complexity can be estimated based on the cells and latches.
Conclusions
In CMOS VLSI technology, each gate is composed of several transistors [8].
We assume that $AND_2$, $XOR_2$, LATCH, and 3×1 SW consist of 6, 8, 8, and 16 transistors, respectively. Also, for a further comparison of time complexity, we adopt the practical integrated circuits in [9] and the following assumptions are made:
, and $T_{3 \times 1\,SW} = 24$, where $T_{GATEn}$ denotes the propagation delay of an n-input gate. Also, we assume that one $XOR_3$ and one $XOR_4$ are constructed using two $XOR_2$ and three $XOR_2$ gates, respectively.
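The transistor assumptions above can be turned into a per-cell area estimate with a short sketch. The gate counts for the P and S cells are taken from the cell descriptions in the text; the dictionary layout and the helper name `cell_transistors` are hypothetical, and any latches that systolization adds to a P cell are not counted because the text does not state them.

```python
# Transistor counts per gate, as assumed in the text.
TRANSISTORS = {"AND2": 6, "XOR2": 8, "LATCH": 8, "SW3x1": 16}

# XOR3 and XOR4 are assumed to be built from two and three XOR2 gates.
TRANSISTORS["XOR3"] = 2 * TRANSISTORS["XOR2"]
TRANSISTORS["XOR4"] = 3 * TRANSISTORS["XOR2"]


def cell_transistors(gates):
    """Sum transistor counts for a cell given as {gate name: number of gates}."""
    return sum(TRANSISTORS[name] * count for name, count in gates.items())


# Gate inventories as stated in the cell descriptions (logic of the SFG cells only).
P_CELL = {"AND2": 2, "XOR2": 2}
S_CELL = {"XOR2": 4, "LATCH": 4}

print(cell_transistors(P_CELL))  # 28
print(cell_transistors(S_CELL))  # 64
```

Multiplying such per-cell counts by the number of cells, and adding the latches introduced by systolization, gives the kind of area figure used in an area-time comparison.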
In this paper, we have presented an efficient parallel-in parallel-out systolic array for $AB^2$ in $GF(2^m)$. Table I lists some important parameters of the proposed array and the related arrays in [3, 4, 6]. According to this comparison, the proposed array saves about 78%, 53%, and 43% (for m ≥ 100) in area-time product as compared to the arrays of Wei [3], Wang and Guo [4], and Kim and Lee [6], respectively. As compared to the related circuits in [3, 4, 6], the proposed array gains a significant reduction in hardware complexity. Also, the simplicity, regularity, and modularity of the proposed architecture allow for easy extension and make this design well suited for implementation using VLSI technologies, particularly for cryptographic applications.
