Systolic array implementation of Euclid's algorithm for inversion and division in GF(2^m) by Jyh-Huei Guo & Chin-Liang Wang
SYSTOLIC ARRAY IMPLEMENTATION OF EUCLID'S ALGORITHM FOR INVERSION
AND DIVISION IN GF(2^m)
Jyh-Huei Guo and Chin-Liang Wang 
Department of  Electrical Engineering, National Tsing Hua University 
Hsinchu, Taiwan 300, Republic of China 
clwang@ee.nthu.edu.tw 
Abstract - This paper presents a new systolic VLSI
architecture for computing inverses and divisions in
finite fields GF(2^m) based on a variant of Euclid's
algorithm. It is highly regular, modular, and thus well
suited to VLSI implementation. It has O(m^2) area
complexity and can produce one result per clock cycle
with a latency of 8m-2 clock cycles. Compared with
existing related systolic architectures with the same
throughput performance, the proposed one achieves a
significant improvement in area complexity.
I. INTRODUCTION 
Finite fields GF(2^m) have found many applications in
areas of communications, such as error-correcting codes
[1], [2] and cryptography [3]. In these applications,
computing inverses and divisions in GF(2^m) is usually
required. Since such computations are quite
time-consuming, it is desirable to design high-speed circuits for
them to meet real-time requirements.
There have been a number of hardware structures
available for fast inversion and/or division in GF(2^m) (see,
for example, [4]-[12]). Among them, the designs in [4]-[9]
employ serial-form input and/or serial-form output.
Basically, such circuits with serial-form data transmission
involve small hardware complexity but might have
unsatisfactory throughput performance when m gets large.
In contrast, the designs in [10]-[12] make use of parallel-in
parallel-out I/O and achieve higher throughput rates with
more hardware complexity. All these parallel architectures
are designed based on the concept of systolic arrays [13]
and are able to provide the maximum throughput in the
sense of producing new results at a rate of one per clock
cycle, i.e., the time complexity is O(1). However, their
hardware requirements seem too high for large fields; the
area complexity is O(m·2^m) for the circuit in [10] and is
O(m^3) for those in [11] and [12].
In this paper, a new parallel-in parallel-out systolic
array with O(1) time complexity and O(m^2) area complexity
is proposed for inversion and division in GF(2^m). The
architecture is designed based on a variant of Euclid's
algorithm for computing the greatest common divisor of
two polynomials. It is highly regular, modular, and thus
well suited to VLSI implementation. As compared to
previous systolic architectures for inversion and division
with the same throughput performance, the proposed one
saves a significant amount of chip area.
This work was supported by the National Science Council of the
Republic of China under Grant NSC-85-2215-E-007-017.
II. VARIANTS OF EUCLID'S ALGORITHM
FOR COMPUTING INVERSIONS AND
DIVISIONS IN GF(2^m)
Let A(a) and B(a) be two elements in GF(2^m), G(a) be
the primitive polynomial of degree m, and C(a) = A(a)/B(a)
mod G(a). Then we have
$$A(a) = a_{m-1}a^{m-1} + \cdots + a_1 a + a_0 \tag{1}$$
$$B(a) = b_{m-1}a^{m-1} + \cdots + b_1 a + b_0 \tag{2}$$
$$G(a) = a^m + g_{m-1}a^{m-1} + \cdots + g_1 a + g_0 \tag{3}$$
$$C(a) = c_{m-1}a^{m-1} + \cdots + c_1 a + c_0 \tag{4}$$
$$B(a)C(a) + G(a)D(a) = A(a) \tag{5}$$
for some element D(a) in GF(2^m), where each coefficient
of the polynomials is in {0, 1}. All arithmetic operations in
GF(2^m) are performed by taking the results mod 2, and
C(a) is called the inverse of B(a) when A(a) = 1.
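The relation in (5) can be checked numerically. The following is a minimal sketch, assuming each element is stored as an integer whose bit i holds the coefficient of a^i; the helper name `gf2m_mul` is hypothetical, and the sample values are the GF(2^4) example used in Section II.C.

```python
def gf2m_mul(x: int, y: int, g: int, m: int) -> int:
    """Multiply two GF(2^m) elements modulo G(a).

    Bit i of each integer is the coefficient of a^i (an assumed encoding).
    """
    result = 0
    for _ in range(m):
        if y & 1:
            result ^= x            # addition in GF(2) is XOR
        y >>= 1
        x <<= 1                    # multiply x by a
        if (x >> m) & 1:           # degree reached m: subtract G(a)
            x ^= g
    return result


M = 4
G = 0b10011   # a^4 + a + 1
A = 0b1110    # a^3 + a^2 + a
B = 0b1011    # a^3 + a + 1
C = 0b0011    # a + 1, the quotient A(a)/B(a) reported in Section II.C

# B(a)C(a) mod G(a) should reproduce A(a), which is (5) reduced mod G(a).
print(gf2m_mul(B, C, G, M) == A)
```

Working modulo G(a) absorbs the G(a)D(a) term of (5), so the check only needs the product B(a)C(a).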
A. The Original Euclid's Algorithm
To perform the inversion/division operations defined above,
the following Euclid's algorithm [2] can be used:
    R = B(a); S = G(a); U = A(a); V = 0;
    while R ≠ 0 do
        Q = S DIV R; (*DIV: polynomial division*)
        temp = S - QR; S = R; R = temp;
        temp = V - QU; V = U; U = temp;
    end
    U = C(a).
One disadvantage of this algorithm is that the number of
iterations needed to compute C(a) in a given field is not
fixed, which makes it difficult to realize using VLSI
techniques.
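The variable iteration count is easy to observe in software. The sketch below is one possible rendering of the loop above, assuming polynomials over GF(2) are stored as integers (bit i is the coefficient of a^i); all function names are hypothetical. In this sketch the loop runs until R = 0, at which point the last swap has moved the quotient into V, and since the products are not reduced inside the loop the value is reduced mod G(a) before being returned.

```python
def deg(p: int) -> int:
    """Degree of a GF(2) polynomial stored as an int (deg(0) is -1)."""
    return p.bit_length() - 1


def poly_divmod(s: int, r: int):
    """Quotient and remainder of s / r over GF(2); r must be nonzero."""
    q = 0
    while deg(s) >= deg(r):
        shift = deg(s) - deg(r)
        q ^= 1 << shift
        s ^= r << shift
    return q, s


def poly_mul(x: int, y: int) -> int:
    """Carry-less (unreduced) product of two GF(2) polynomials."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        y >>= 1
    return result


def euclid_divide(a_poly: int, b_poly: int, g_poly: int):
    """Return (A/B mod G, number of loop iterations)."""
    r, s, u, v = b_poly, g_poly, a_poly, 0
    iterations = 0
    while r != 0:
        q, rem = poly_divmod(s, r)
        s, r = r, rem                      # temp = S - QR; S = R; R = temp
        v, u = u, v ^ poly_mul(q, u)       # temp = V - QU; V = U; U = temp
        iterations += 1
    # The loop ends with S = gcd(B, G) = 1 and the quotient held in V.
    return poly_divmod(v, g_poly)[1], iterations


G = 0b10011                                # a^4 + a + 1
print(euclid_divide(0b1110, 0b1011, G))    # three iterations for this input
print(euclid_divide(0b0001, 0b0001, G))    # one iteration for this input
```

The two calls take different numbers of iterations in the same field, which is the data-dependent behavior that makes a fixed hardware schedule awkward.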
0-7803-3073-0/96/$5.00 ©1996 IEEE
B. The Modified Euclid's Algorithm in [9]
To overcome the above-mentioned problem, Brunner et
al. [9] proposed a modification of Euclid's algorithm that
always involves 2m iterations to compute an inverse or
division in GF(2^m). The algorithm can be summarized as
follows:
    R = B(a); S = G = G(a); U = A(a); V = 0;
    count = 0;
    for i = 1 to 2m do
        if r_m == 0 then (*occurring m times*)
            R = a·R; U = a·U mod G;
            count = count + 1;
        else
            if s_m == 1 then
                S = S + R; V = V + U;
            end
            S = a·S;
            if count == 0 then
                R ↔ S; U ↔ V; (*exchange operations*)
                U = a·U mod G;
                count = count + 1;
            else (*occurring m times*)
                U = U/a mod G;
                count = count - 1;
            end
        end
    end (*U = C(a) = A(a)/B(a) mod G(a); count = 0*)
where r_m and s_m denote the most significant coefficients of
R and S, respectively. This algorithm involves 2m iterations
(i = 1 to 2m), and “count = 0” always occurs at the end of the
last iteration [9]. In other words, both of the statements
“count = count + 1” and “count = count - 1” run m times
during the 2m iterations. To realize the algorithm, a
serial-in serial-out pipelined architecture was given in [9]. This
circuit possesses good area-time performance, but it is not a
systolic design and suffers from broadcasting problems.
The reason why the algorithm is not amenable to systolic
array implementation is that its arithmetic operations are
not uniform during the 2m iterations. It performs “U = a·U
mod G” for some iterations, and performs “U = U/a mod
G” for the others.
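The behavior of the counter can be traced with a short software model. This is a minimal sketch of the 2m-iteration algorithm summarized above, assuming R and S are held as (m+1)-bit integers and U and V as m-bit integers (bit i is the coefficient of a^i); the function names and the two tally variables are hypothetical additions for illustration.

```python
def mul_a(x: int, g: int, m: int) -> int:
    """Return a*x mod G(a)."""
    x <<= 1
    if (x >> m) & 1:
        x ^= g
    return x


def div_a(x: int, g: int) -> int:
    """Return x/a mod G(a); relies on G(a) having constant term 1."""
    if x & 1:
        x ^= g          # adding G(a) clears the constant term
    return x >> 1


def modified_euclid(a_poly: int, b_poly: int, g_poly: int, m: int):
    """Return (U, count, number of increments, number of decrements)."""
    r, s, u, v = b_poly, g_poly, a_poly, 0
    count = plus = minus = 0
    top = 1 << m                      # mask for r_m and s_m
    for _ in range(2 * m):
        if not r & top:
            r <<= 1
            u = mul_a(u, g_poly, m)
            count += 1
            plus += 1
        else:
            if s & top:
                s ^= r
                v ^= u
            s <<= 1
            if count == 0:
                r, s = s, r           # exchange operations
                u, v = v, u
                u = mul_a(u, g_poly, m)
                count += 1
                plus += 1
            else:
                u = div_a(u, g_poly)
                count -= 1
                minus += 1
    return u, count, plus, minus


# GF(2^4) example from Section II.C: G = a^4+a+1, A = a^3+a^2+a, B = a^3+a+1.
print(modified_euclid(0b1110, 0b1011, 0b10011, 4))
```

For this input the model ends with count = 0 and four increments and four decrements, in line with the statement that each count update runs m times. The mix of `mul_a` and `div_a` calls on U across iterations is the non-uniformity discussed above.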
C. A New Variant of Euclid’s Algorithm 
Note that, if the statement “U = U/a mod G” is removed
from the algorithm given above, the final result will
become U = C(a)·a^m, instead of the desired answer U =
C(a). To obtain the correct answer, we can execute the
operation “U = U/a mod G” m times after the 2m iterations
have been completed. It can also be seen from the
statements “temp = V - QU; V = U; U = temp” given in
Section II.A that removing the statement “U = U/a mod G”
is equivalent to executing the statement “V = a·V mod G”.
Moreover, the statements “R ↔ S; U ↔ V; U = a·U mod
G” are equivalent to “V = a·V mod G; R ↔ S; U ↔ V”.
With these observations, we can derive the following
algorithm for inversion/division in GF(2^m):
    R = B(a); S = G = G(a); U = A(a); V = 0;
    count = 0;
    Part A:
    for i = 1 to 2m do
        if r_m == 0 then
            R = a·R; U = a·U mod G;          (I)
            count = count + 1;
        else
            if s_m == 1 then
                S = S + R; V = V + U;        (II)
            end
            S = a·S; V = a·V mod G;          (III)
            if count == 0 then
                R ↔ S; U ↔ V;                (IV)
                count = count + 1;
            else
                count = count - 1;
            end
        end
    end (*U = C(a)·a^m mod G(a); count = 0*)
    Part B:
    for i = 2m+1 to 3m do
        U = U/a mod G;
    end (*U = C(a) = A(a)/B(a) mod G(a)*)
Apparently, the new variant of Euclid’s algorithm
consists of two parts: Part A first generates a temporary
result C(a)·a^m mod G(a), and then Part B divides it by a^m
to yield the correct answer. Table I demonstrates a
procedure of the proposed algorithm for computing
inverses/divisions in GF(2^4), where G(a) = a^4+a+1, A(a) =
a^3+a^2+a, and B(a) = a^3+a+1. At step i = 2m = 8, U = a^2+1
is the temporary result C(a)·a^m mod G(a), and at step i =
3m = 12, U = a+1 is the correct answer C(a) = A(a)/B(a)
mod G(a). As compared to the algorithm described in
Section II.B, the new algorithm involves more uniform
arithmetic operations during the recursive computation,
and is thus easier to realize using a systolic
architecture.
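The two values quoted for the GF(2^4) example can be reproduced with a small software model. This is a minimal sketch of the two-part algorithm, assuming R and S are (m+1)-bit integers and U and V are m-bit integers with bit i holding the coefficient of a^i; the function names are hypothetical and the model says nothing about the hardware timing.

```python
def mul_a(x: int, g: int, m: int) -> int:
    """Return a*x mod G(a)."""
    x <<= 1
    if (x >> m) & 1:
        x ^= g
    return x


def div_a(x: int, g: int) -> int:
    """Return x/a mod G(a); relies on G(a) having constant term 1."""
    if x & 1:
        x ^= g
    return x >> 1


def new_variant_divide(a_poly: int, b_poly: int, g_poly: int, m: int):
    """Return (U after Part A, U after Part B)."""
    r, s, u, v = b_poly, g_poly, a_poly, 0
    count = 0
    top = 1 << m                          # mask for r_m and s_m
    # Part A: 2m iterations, U ends as C(a)*a^m mod G(a).
    for _ in range(2 * m):
        if not r & top:
            r <<= 1
            u = mul_a(u, g_poly, m)
            count += 1
        else:
            if s & top:
                s ^= r
                v ^= u
            s <<= 1
            v = mul_a(v, g_poly, m)
            if count == 0:
                r, s = s, r
                u, v = v, u
                count += 1
            else:
                count -= 1
    part_a = u
    # Part B: m divisions by a remove the a^m factor.
    for _ in range(m):
        u = div_a(u, g_poly)
    return part_a, u


part_a, result = new_variant_divide(0b1110, 0b1011, 0b10011, 4)
print(bin(part_a), bin(result))   # a^2 + 1 after step 8, a + 1 after step 12
```

Every Part-A iteration now uses only multiplications by a, and every Part-B iteration only a division by a, so each row of a hardware array can perform one fixed kind of operation.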
III. SYSTOLIC IMPLEMENTATION OF THE
PROPOSED ALGORITHM
Fig. 1 shows a systolic architecture to implement the
proposed algorithm for computing inverses and divisions in
GF(2^m), where ‘0’ denotes a one-cycle delay element. It
consists of a subarray of 2m Type-I cells and 2m×m Type-II
cells for realizing the Part-A operations and a subarray of
m×m Type-III cells for realizing the Part-B operations. The
ith row of each subarray performs the ith-iteration
operations of the corresponding part. The functions of these
three types of basic cells are illustrated in Figs. 2 to 4.
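As a quick check of the cell counts just stated (a worked sum derived from those counts, with N(m) introduced here only as a label for the total), the number of cells grows quadratically in m:

```latex
N(m) = \underbrace{2m}_{\text{Type-I}}
     + \underbrace{2m \cdot m}_{\text{Type-II}}
     + \underbrace{m \cdot m}_{\text{Type-III}}
     = 3m^2 + 2m = O(m^2),
\qquad N(3) = 6 + 18 + 9 = 33 .
```

The m = 3 value corresponds to the array size drawn in Fig. 1.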
TABLE I
An Example of Computing Inverses/Divisions in GF(2^4)
Based on the Proposed Algorithm
(G(a) = a^4+a+1, A(a) = a^3+a^2+a, B(a) = a^3+a+1)
Fig. 1. The proposed systolic architecture for computing
inversions/divisions in GF(2^m) (m = 3).
Fig. 2. The circuit of Type-I cell in Fig. 1.
Fig. 3. The circuit of Type-II cell in Fig. 1.
Fig. 4. The circuit of Type-III cell in Fig. 1.
Each Type-I cell is used to generate the following control
signals:
    RUmulti = (r_m == 0)
    Add = (r_m == 1) & (s_m == 1)
    Exchange = (r_m == 1) & (count == 0)
    count’ = count - 1, if (count ≠ 0) & (r_m == 1)
    count’ = count + 1, else
When RUmulti = 1, the corresponding row of Type-II cells
executes the operations given in (I); otherwise, it does the
operations of (III). The Add and Exchange signals are used
to determine whether the operations of (II) and (IV) are
performed or skipped. The Part-A subarray generates the
temporary result C(a)·a^m mod G at its bottom row, and
then sends it to the Part-B subarray for further processing.
With little effort, one can check that the inversion/division
results will emerge from the bottom of the Part-B subarray
at a rate of one per clock cycle. It can also be seen that the
proposed systolic architecture has area complexity of O(m^2)
and a latency of 8m-2 clock cycles.
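The control equations of the Type-I cell translate directly into a small behavioral model. The sketch below is an illustration only: the names `TypeIOutputs` and `type_i_cell` are hypothetical, and the model captures the logic of the signals, not the gate-level circuit of Fig. 2 or its timing.

```python
from typing import NamedTuple


class TypeIOutputs(NamedTuple):
    ru_multi: bool     # RUmulti
    add: bool          # Add
    exchange: bool     # Exchange
    next_count: int    # count'


def type_i_cell(r_m: int, s_m: int, count: int) -> TypeIOutputs:
    """Behavioral model of the Type-I control signals for one row."""
    ru_multi = r_m == 0
    add = r_m == 1 and s_m == 1
    exchange = r_m == 1 and count == 0
    if count != 0 and r_m == 1:
        next_count = count - 1
    else:
        next_count = count + 1
    return TypeIOutputs(ru_multi, add, exchange, next_count)


# r_m = 0: only the R/U multiplication is selected and count increases.
print(type_i_cell(r_m=0, s_m=1, count=0))
# r_m = 1 with count = 0: Add and Exchange are both asserted.
print(type_i_cell(r_m=1, s_m=1, count=0))
```

The three cases of the count update match the three branches of Part A: an increment when r_m = 0, an increment on an exchange, and a decrement otherwise.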
IV. CONCLUSIONS
Table II gives a comparison of the proposed parallel-in
parallel-out systolic array for inversion and division in
GF(2^m) with those in [11] and [12]. We can see from this
table that all the architectures compared reach the same
throughput rate of one result per clock cycle, but the
proposed one has a much smaller area requirement, much
shorter latency, and much better area-time product
performance.

TABLE II
Comparison of Some Parallel-In Parallel-Out Systolic
Arrays for Computing Inversions/Divisions in GF(2^m)

Item                  Wei [11]          Wang & Guo [12]   Proposed
--------------------  ----------------  ----------------  --------------------------
Number of cells       m^2(m-1)          m^2(m-1)/2        Type I: 2m
                                                          Type II: 2m^2
                                                          Type III: m^2
Throughput (1/cycle)  1                 1                 1
Latency (cycles)      3m^2-2m           2m^2-3m/2         8m-2
Maximum cell delay    T_AND2 + T_XOR3   T_AND2 + T_XOR4   T_AND2 + T_XOR? + ?T_MUX2
Cell complexity       3 AND_2's         6 AND_2's         Type I:
                      1 XOR_2           2 XOR_4's           5 AND_2's, 2 XOR_?'s
                      1 XOR_3           17 latches          5 MUX_2's, 1 INV
                      13 latches                            log_2(m+1)-bit adder
                                                            zero-check circuit
                                                            9+2log_2(m+1) latches
                                                          Type II:
                                                            4 AND_2's, 2 XOR_2's
                                                            1 XOR_?, 8 MUX_2's
                                                            18 latches
                                                          Type III:
                                                            1 AND_2, 1 XOR_2
                                                            4 latches
AT-product            O(m^3)            O(m^3)            O(m^2)

AND_i : i-input AND gate; XOR_i : i-input XOR gate.
INV : inverter; MUX_i : i-input multiplexer.
T_ANDi : the propagation delay through an i-input AND gate.
T_XORi : the propagation delay through an i-input XOR gate.
T_MUXi : the propagation delay through an i-input multiplexer.

REFERENCES
[1] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting
Codes. Cambridge, MA: MIT Press, 1972.
[2] E. R. Berlekamp, Algebraic Coding Theory. New
York: McGraw-Hill, 1968.
[3] D. E. R. Denning, Cryptography and Data Security.
Reading, MA: Addison-Wesley, 1983.
[4] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch,
J. K. Omura, and I. S. Reed, “VLSI architectures for
computing multiplications and inverses in GF(2^m),”
IEEE Trans. Comput., vol. C-34, pp. 709-719, Aug.
1985.
[5] G.-L. Feng, “A VLSI architecture for fast inversion in
GF(2^m),” IEEE Trans. Comput., vol. 38, pp. 1383-
1386, Oct. 1989.
[6] C.-L. Wang and J.-L. Lin, “A systolic architecture for
computing inverses and divisions in finite fields
GF(2^m),” IEEE Trans. Comput., vol. 42, pp. 1141-
1146, Sep. 1993.
[7] M. A. Hasan and V. K. Bhargava, “Bit-level systolic
divider and multiplier for finite fields GF(2^m),” IEEE
Trans. Comput., vol. 41, pp. 972-980, Aug. 1992.
[8] K. Araki, I. Fujita, and M. Morisue, “Fast inverters
over finite field based on Euclid’s algorithm,” Trans.
IEICE, vol. E-72, pp. 1230-1234, Nov. 1989.
[9] H. Brunner, A. Curiger, and M. Hofstetter, “On
computing multiplicative inverses in GF(2^m),” IEEE
Trans. Comput., vol. 42, pp. 1010-1015, Aug. 1993.
[10] M. Kovac, N. Ranganathan, and M. Varanasi, “SIGMA:
A VLSI systolic array implementation of a Galois field
GF(2^m) based multiplication and division algorithm,”
IEEE Trans. VLSI Systems, vol. 1, pp. 22-30, Mar.
1993.
[11] S.-W. Wei, “VLSI architectures for computing
exponentiations, multiplicative inverses, and divisions
in GF(2^m),” in Proc. 1995 IEEE Int. Symp. Circuits
Syst., London, May 1995, pp. 4.203-4.206.
[12] C.-L. Wang and J.-H. Guo, “New systolic arrays for
C+AB^2, inversion, and division in GF(2^m),” in Proc.
1995 European Conference Circuit Theory Design,
Istanbul, Turkey, Aug. 1995, pp. 431-434.
[13] H. T. Kung, “Why systolic architectures?,” IEEE
Trans. Comput., vol. 15, pp. 37-46, Jan. 1982.
