Bit-Serial Systolic Array Implementation of Euclid's Algorithm for Inversion and Division in GF(2^m), by Guo, J.-H. & Wang, C.-L.
Bit-Serial Systolic Array Implementation of Euclid's Algorithm
for Inversion and Division in GF(2^m)
Jyh-Huei Guo and Chin-Liang Wang 
Department of Electrical Engineering, National Tsing Hua University 
Hsinchu, Taiwan 300, Republic of China 
Abstract 
This paper presents two serial-in serial-out systolic arrays
for inversion or division in GF(2^m) with the standard basis
representation. They can produce results at a rate of one per m
cycles after an initial delay of 5m - 4 cycles. The proposed
arrays involve unidirectional data flow and are highly regular
and modular. Thus, they are well suited to VLSI implementation
with fault-tolerant design. Compared with existing related
systolic designs with the same time complexity and I/O format,
the proposed arrays achieve a significant reduction in
hardware area.
I. Introduction 
Finite fields GF(2^m) have found various applications in areas
of communications, such as error-correcting codes [1]-[2]
and cryptography [3]. In these applications, computing inverses
or divisions in GF(2^m) is usually required. It is thus
desirable to design hardware-efficient architectures for them.
A number of VLSI architectures for computing inverses
and/or divisions in GF(2^m) have been reported in the literature.
Among them, the circuits in [4]-[11] are designed based
on the concepts of systolic architecture [12]. As described in
[12], a systolic system gains a considerable speed improvement
over a von Neumann system and is particularly suitable
for VLSI implementation due to its simple, regular communication
and control structure. The time complexity is O(1) for
the divider in [4] (with O(m · 2^m) area complexity), the dividers
in [5]-[6] (with O(m^3) area complexity), the divider in [7]
(with O(m ) area complexity), and the divider and inverter
with O(m^2) area complexity in [8]; while the time complexity
is O(m) for the dividers in [9]-[11] (with O(m^2) area complexity)
and the divider and inverter with O(m) area complexity
in [8]. It should be noted that the circuits in [4]-[8] are
parallel-in parallel-out schemes, while those in [9]-[11] are
serial-in serial-out schemes.
For some applications where the value of m is large (for
example, a cryptographic system with m = 1000), a system
with large area complexity becomes impractical because of
its large hardware area. In such situations, a system with
small area complexity is highly desirable. From the above, we
can see that the divider and inverter with O(m) area complexity
in [8] have the least area complexity and area-time product.
However, they are both parallel-in parallel-out schemes, where
the I/O pin count is too large for large m, and involve
bi-directional data flow. They produce results at a rate of one
per 2m - 2 cycles, which may be too slow for some applications.
Besides, the input data must be interleaved to reach 100%
utilization efficiency.
In this paper, two serial-in serial-out systolic arrays are
proposed for computing inverses or divisions in GF(2^m). One
is for division and the other is for inversion. Both proposed
architectures have O(m) area complexity and O(m) time
complexity. If the input data come in continuously, they
can produce output results at a rate of one per m cycles after
an initial delay of 5m - 4 cycles. The proposed arrays are
highly regular and modular and involve unidirectional data
flow. As described in [13]-[14], a system with unidirectional
data flow outperforms a system with bi-directional data flow
in terms of chip cascadability, fault tolerance, and possible
wafer-scale integration. Thus, they are well suited to VLSI
implementation with fault-tolerant design [13]-[14]. Compared
with existing related systolic designs with the same time
complexity and I/O format, the proposed arrays achieve a
significant reduction in hardware area.
II. An Algorithm for Inversion / Division in GF(2^m) [8]
Let A(x) and B(x) be two elements in GF(2^m), G(x) be the
primitive polynomial used to generate the field, and C(x) be
the result of A(x) / B(x) mod G(x), where

    A(x) = a_{m-1} x^{m-1} + a_{m-2} x^{m-2} + \cdots + a_0          (1)
    B(x) = b_{m-1} x^{m-1} + b_{m-2} x^{m-2} + \cdots + b_0          (2)
    G(x) = x^m + g_{m-1} x^{m-1} + g_{m-2} x^{m-2} + \cdots + g_0    (3)
    C(x) = c_{m-1} x^{m-1} + c_{m-2} x^{m-2} + \cdots + c_0          (4)

Each coefficient of the polynomials is in {0, 1}. To compute
the inverse 1 / B(x) mod G(x), Euclid's algorithm [2]
can be used:
Euclid's algorithm for inversion in GF(2^m)
    R = B(x); S = G(x); U = 1; V = 0;
    while R ≠ 0 do
        Q = S DIV R;  (* DIV: polynomial division *)
        temp = S + Q · R; S = R; R = temp;
        temp = V + Q · U; V = U; U = temp;
    end  (* V has the result of 1 / B(x) mod G(x) *)
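The listing above can be exercised in software. The following is a minimal Python sketch, not part of the original paper: polynomials over GF(2) are encoded as integers (bit i holds the coefficient of x^i), and the helper names `poly_divmod`, `poly_mul`, and `euclid_inverse`, as well as the example field polynomial x^3 + x + 1, are assumptions chosen for illustration. The function also returns the number of loop iterations, which varies with the operand.

```python
def poly_divmod(s, r):
    """Divide GF(2) polynomial s by r (r != 0); return (quotient, remainder).

    Polynomials are ints: bit i holds the coefficient of x^i.
    """
    q = 0
    while s.bit_length() >= r.bit_length():
        shift = s.bit_length() - r.bit_length()
        q ^= 1 << shift      # record the quotient term x^shift
        s ^= r << shift      # subtract (XOR) the shifted divisor
    return q, s


def poly_mul(a, b):
    """Carry-less (GF(2)) polynomial product, without modular reduction."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p


def euclid_inverse(b, g):
    """Return (1 / B(x) mod G(x), number of loop iterations)."""
    r, s, u, v = b, g, 1, 0
    iterations = 0
    while r != 0:
        q, rem = poly_divmod(s, r)      # Q = S DIV R, rem = S + Q*R
        s, r = r, rem                   # S = R; R = temp
        v, u = u, v ^ poly_mul(q, u)    # V = U; U = V + Q*U
        iterations += 1
    return v, iterations


G = 0b1011  # x^3 + x + 1 (assumed example field polynomial)
for b in range(1, 8):
    inv, n = euclid_inverse(b, G)
    print(f"B={b:03b}  1/B={inv:03b}  iterations={n}")
```

Running the loop shows that different operands in the same field need different numbers of iterations, which is the property discussed next.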
One disadvantage of the algorithm is that the number of
iterations needed to compute an inverse in a given field is not
fixed but depends on the operand. This makes it difficult to
realize using VLSI techniques. In [8], a new variant of Euclid's
algorithm for computing divisions in GF(2^m) was proposed. The
algorithm consists of 2m - 2 iterations and can be summarized
as follows:
Algorithm for division in GF(2^m) [8]
    R = B(x); S = G = G(x); U = A(x); V = T = 0;
    state = 0; count = 0;
    for i = 1 to 2m - 2 do
        R = x · R; T = x · T mod G;                  (*)
        if state == 0 then
            count = count + 1;
            if r_m == 1 then  (* r_m: coefficient of x^m of R *)
                tmp = R;                             (I)
                R = R + S;                           (II)
                S = tmp; T = U;                      (I)
                state = 1;
            end
        else
            count = count - 1;
            if r_m == 1 then
                R = R + S; T = T + U;                (II)
            end
            if count == 0 then
                V = T + V; U ↔ V;                    (III)
                state = 0;
            end
        end
    end  (* U has the result C(x) *)
The algorithm can be used to compute the inverse 1 / B(x)
mod G(x) simply by setting the initial condition of U to
"U = 1" instead of "U = A(x)". Besides, the operation
"T = x · T mod G" in (*) can be simplified to "T = x · T"
when computing inversion in GF(2^m).
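As an illustration of the fixed-iteration algorithm of [8], the following Python sketch steps through the 2m - 2 iterations with polynomials encoded as integers (bit i holds the coefficient of x^i). It is not the authors' code: the names `gf_divide`, `gf_mul`, and the `reduce_t` switch are assumptions introduced here, and the example field polynomial x^3 + x + 1 is likewise an assumed choice. Setting `reduce_t=False` with a = 1 mimics the inversion variant in which "T = x · T mod G" is replaced by "T = x · T".

```python
def gf_mul(a, b, g, m):
    """Multiply two GF(2^m) elements (helper used only to check results)."""
    p = 0
    for _ in range(m):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:     # reduce modulo G(x) when degree reaches m
            a ^= g
    return p


def gf_divide(a, b, g, m, reduce_t=True):
    """Return A(x) / B(x) mod G(x) using 2m - 2 fixed iterations.

    g encodes G(x) including its x^m term; b must be nonzero.
    """
    r, s, u, v, t = b, g, a, 0, 0
    state, count = 0, 0
    for _ in range(2 * m - 2):
        r <<= 1                              # R = x * R
        t <<= 1                              # T = x * T ...
        if reduce_t and (t >> m) & 1:
            t ^= g                           # ... mod G
        r_m = (r >> m) & 1                   # coefficient of x^m of R
        if state == 0:
            count += 1
            if r_m:
                r, s = r ^ s, r              # (II) R = R + S, (I) S = old R
                t = u                        # (I) T = U
                state = 1
        else:
            count -= 1
            if r_m:
                r ^= s                       # (II) R = R + S
                t ^= u                       # (II) T = T + U
            if count == 0:
                v ^= t                       # (III) V = T + V
                u, v = v, u                  # (III) swap U and V
                state = 0
    return u


G, M = 0b1011, 3  # x^3 + x + 1, m = 3 (assumed example field)
for b in range(1, 8):
    c = gf_divide(1, b, G, M)
    print(f"1/{b:03b} = {c:03b}  check: {gf_mul(c, b, G, M)}")
```

Unlike the classical form, the loop bound here depends only on m, which is what makes a row-per-iteration hardware mapping possible.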
III. VLSI Architecture and Chip Implementation
A. A Dependence Graph for Inversion/Division in GF(2^m) [8]
Fig. 1 shows a dependence graph (DG) of the above division
algorithm in GF(2^m), where m = 3 [8]. It consists of 2m - 2
Type-1 cells and (2m - 2) x m Type-2 cells, where the
functions of these two types of basic cells are depicted in
Figs. 2-3. The cells in the ith row of this array realize the ith
iteration. For each row, the Type-1 cell generates the control
signals Ctrl1, Ctrl2, and Ctrl3 for the present iteration and
computes the values of count and state for the next iteration
(i.e., count' and state' in Fig. 2) according to the following
logic functions:
    Ctrl1 = (state == 0) & (r_m == 1)
    Ctrl2 = (r_m == 1)
    Ctrl3 = (state == 1) & (count == 0)
    count' = count + 1, if state == 0
             count - 1, if state == 1
    state' = \overline{state}, if ((r_m == 1) & (state == 0))
                                  or ((count == 0) & (state == 1))
             state, otherwise
Type-2 cells in the corresponding row receive Ctrl1, Ctrl2,
and Ctrl3 from the Type-1 cell and execute the operations of (I),
(II), and (III) when the control signals Ctrl1, Ctrl2, and Ctrl3
are true, respectively, where the (i + 1)th Type-2 cell
(0 ≤ i ≤ m - 1) from the right evaluates the (i + 1)th least
significant coefficients of R, S, U, V, and T (i.e., r'_i, s'_i,
u'_i, v'_i, and t'_i in Fig. 3). The coefficients of the division
result C(x) will emerge from the bottom row of the array (i.e.,
after 2m - 2 iterations).
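The control behaviour of one Type-1 cell can be sketched in a few lines of Python. This is an illustrative model only, with hypothetical function names `type1_cell` and `trace_controls`. One assumption is made explicit: the zero test for Ctrl3 is applied to the decremented value count', which matches the algorithm listing in Section II (count is decremented before the test) and the zero-check block on count' in Fig. 2. The r_m sequence used in the usage line is the one obtained by stepping the Section II algorithm by hand for B(x) = x with G(x) = x^3 + x + 1.

```python
def type1_cell(state, count, r_m):
    """One Type-1 cell: control signals for this row, count'/state' for the next."""
    next_count = count + 1 if state == 0 else count - 1
    ctrl1 = state == 0 and r_m == 1
    ctrl2 = r_m == 1
    ctrl3 = state == 1 and next_count == 0   # assumption: test applied to count'
    next_state = 1 - state if (ctrl1 or ctrl3) else state
    return ctrl1, ctrl2, ctrl3, next_count, next_state


def trace_controls(r_m_bits):
    """Chain Type-1 cells down the rows of the DG for a given r_m sequence."""
    state, count = 0, 0
    rows = []
    for r_m in r_m_bits:
        c1, c2, c3, count, state = type1_cell(state, count, r_m)
        rows.append((c1, c2, c3, count, state))
    return rows


# r_m per row for B(x) = x, G(x) = x^3 + x + 1 (m = 3, so 2m - 2 = 4 rows)
for row in trace_controls([0, 1, 0, 1]):
    print(row)
```

In this trace, Ctrl1 fires in row 2 (operations (I) and (II)), and Ctrl2 and Ctrl3 fire together in row 4 (operations (II) and (III)).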
B. A New Variant of the DG in Section III.A 
By projecting the DG in Fig. 1 along the east direction, a
one-dimensional signal flow graph (SFG) array can be derived.
One disadvantage associated with the SFG array is that,
since each Type-1 cell in Fig. 1 involves one adder (subtractor)
circuit, each basic cell of the SFG array will involve one adder
(subtractor) circuit of log_2(m + 1) bits and thus has an area
complexity of O(log_2 m). When the value of m gets large, each
basic cell will become large. To prevent such a problem, we can
adopt the following approach.

Fig. 1. A dependence graph for division in GF(2^3) [8].
Fig. 2. The circuit of the Type-1 cell in Fig. 1. (Legible labels: an
adder producing count' = count + 1 if c = 0 and count' = count - 1
if c = 1; a zero check with o = 1 if count' = 0 and o = 0 otherwise.)
Fig. 3. The circuit of the Type-2 cell in Fig. 1.

Fig. 4 shows a new variant of the DG in Fig. 1, where 2m - 2
Type-3 cells, illustrated in Fig. 5, and (2m - 2) x m Type-4
cells, depicted in Fig. 6, are used. The functions of Type-3
and Type-4 cells in Fig. 4 are similar to those of Type-1 and
Type-2 cells in Fig. 1. The main difference is that each Type-3
cell, unlike the Type-1 cell, does not involve the adder
(subtractor) circuit of log_2(m + 1) bits. For each row of the
DG, the Type-3 cell generates the control signals Ctrl2 and Ctrl3
for the present iteration and computes the value of state
for the next iteration (i.e., state' in Fig. 5). Type-4 cells in
the corresponding row receive Ctrl2, Ctrl3, and state from the
Type-3 cell and generate the control signal Ctrl1 according to
the logic function

    Ctrl1 = (state == 0) & Ctrl2                                 (10)

The control signals Ctrl1, Ctrl2, and Ctrl3 are used to control
the Type-4 cells in the same way as they are used to control
the Type-2 cells described in Section III.A. The method for
tracing the value of count can be explained as follows. Each
Type-4 cell incorporates one extra 2-to-1 multiplexer for
tracing the value of count. For the Type-4 cells in the first
row of the DG, only the (1, 1) cell receives 1 for its inc,
while the others receive 0's for their inc's. All the cells in the
first row receive 0's for their dec's.

1) For the first row of the DG: Since state = 0, the count-flag
   (c-flag) will be 1 for the (1, 1) cell and 0 for the other
   Type-4 cells. The count-zero (c-zero) is 0 for the Type-3
   cell. Let this indicate that the count is 1 after 1 iteration.
2) For the second row of the DG: If state = 0, the c-flag will
   be 1 for the (2, 2) cell and 0 for the other Type-4 cells and
   the c-zero will be 0 for the Type-3 cell. Let this mean that
   the count becomes 2 after 2 iterations. If state = 1, the
   c-flag will be 0 for all the Type-4 cells and the c-zero will be
   1 for the Type-3 cell. Let this represent that the value of
   count reduces to 0 after 2 iterations.

It can be checked that, for a certain row of the DG, one of the
following two situations occurs.

3) The c-flag is 1 for only one Type-4 cell and 0 for the other
   Type-4 cells and the c-zero is 0 for the Type-3 cell.
4) The c-zero is 1 for the Type-3 cell and the c-flag is 0 for all
   the Type-4 cells.

In general, the situation that the c-flag is 1 for the (i, j) cell
indicates that the value of count becomes j after i iterations,
while the situation that the c-zero is 1 for the Type-3 cell in
the ith row means that the value of count reduces to 0 after i
iterations. If the c-flag is 1 for the (i, j) cell, where 2 ≤ j ≤ m,
the c-flag will be 1 for the (i + 1, j + 1) cell (i.e., count
increasing) or the (i + 1, j - 1) cell (i.e., count decreasing)
depending on the value of state. If the c-flag is 1 for the (i, 1)
cell, the c-flag will be 1 for the (i + 1, 2) cell for state = 0 or
the c-zero will be 1 for the Type-3 cell in the (i + 1)th row for
state = 1. Besides, if the c-zero is 1 for the Type-3 cell in the
ith row, the c-flag will be 1 for the (i + 1, 1) cell.

Fig. 4. A new variant of the DG shown in Fig. 1.
Fig. 5. The circuit of the Type-3 cell in Fig. 4.
Fig. 6. The circuit of the Type-4 cell in Fig. 4.

C. Bit-Serial Systolic Array Implementation
By projecting the DG in Fig. 4 along the east direction
following the projection procedure in [15], we can derive a
one-dimensional signal flow graph (SFG) array as shown in
Fig. 7, where 2m - 2 basic cells as shown in Fig. 8 are used
and "0" denotes a 1-bit 1-cycle delay element. The SFG array
is controlled by a sequence 011...1 of length m. According to
the projection, each basic cell of the SFG array should contain
the circuitry of Type-3 and Type-4 cells. Since the control
signals Ctrl2 and Ctrl3, state, and t_{m-1} of the ith iteration
must be broadcast to all the Type-4 cells in the ith row of the
DG, four extra 2-to-1 multiplexers and four 1-bit latches are
added to each basic cell of the SFG array for this purpose.
Three 2-input AND gates are also added to each basic cell of
the SFG array because three zeros must be fed to each row of
the DG from the rightmost cell. When the control signal is at
logic 0, these AND gates generate three zeros. Besides, each
basic cell of the SFG array incorporates one 2-to-1 multiplexer
for data flow arrangement.

Fig. 7. A one-dimensional SFG array for division in GF(2^3).
Fig. 8. The circuit of the basic cell in Fig. 7.
Fig. 9. A new serial-in serial-out systolic array for division
in GF(2^3).

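The count-tracing scheme of Section III.B replaces a binary counter by a single flag that moves along the row. A minimal Python sketch of that bookkeeping is given below; it is an interpretation for illustration, not the circuit itself. The names `step_count_flags` and `decode_count` are hypothetical, the list `c_flags` stands for the c-flags of one row (index j - 1 for cell j), and the starting condition (c_zero = 1 before the first row) is an assumed software equivalent of feeding 1 to the inc input of the (1, 1) cell.

```python
def step_count_flags(c_zero, c_flags, state):
    """Advance the one-hot count record by one row of the DG.

    c_flags[j - 1] == 1 means count == j; c_zero == 1 means count == 0.
    state is the state of the row being computed (0: count up, 1: count down).
    """
    m = len(c_flags)
    new_flags = [0] * m
    new_zero = 0
    if state == 0:
        # inc path: the flag moves one cell up; a set c-zero enters cell 1
        if c_zero:
            new_flags[0] = 1
        for j in range(m - 1):
            if c_flags[j]:
                new_flags[j + 1] = 1
    else:
        # dec path: the flag moves one cell down; leaving cell 1 sets c-zero
        if c_flags[0]:
            new_zero = 1
        for j in range(1, m):
            if c_flags[j]:
                new_flags[j - 1] = 1
    return new_zero, new_flags


def decode_count(c_zero, c_flags):
    """Recover the integer count from the one-hot record."""
    return 0 if c_zero else c_flags.index(1) + 1


# state per row for B(x) = x, G(x) = x^3 + x + 1 (hand-traced example)
c_zero, c_flags = 1, [0, 0, 0]
for state in [0, 0, 1, 1]:
    c_zero, c_flags = step_count_flags(c_zero, c_flags, state)
    print(state, c_zero, c_flags, decode_count(c_zero, c_flags))
```

Each update only selects between two neighbouring inputs, which is why one 2-to-1 multiplexer per Type-4 cell is enough and no log_2(m + 1)-bit adder is needed.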
The SFG array in Fig. 7 can be easily retimed by using the
cut-set systolization techniques [16] to derive a serial-in
serial-out systolic array as shown in Fig. 9. When the input data
come in continuously, this array can produce output results at
a rate of one per m cycles after an initial delay of 5m - 4
cycles. The coefficients of the division result C(x) will emerge
from the right-hand side of the array in serial form with the
MSB first.
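The timing figures quoted above can be turned into a small back-of-the-envelope calculation. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the first result is counted at the end of the 5m - 4 cycle initial delay and that each later result follows m cycles after the previous one. The 2m - 2 cycles per result used for comparison is the rate stated in Section I for the O(m)-area parallel-in parallel-out design of [8].

```python
def initial_delay_cycles(m):
    """Initial delay of the proposed serial-in serial-out arrays."""
    return 5 * m - 4


def cycles_for_results(n, m):
    """Cycles until the n-th result under the stated counting assumption."""
    return initial_delay_cycles(m) + (n - 1) * m


def ref8_cycles_per_result(m):
    """Cycles per result quoted in Section I for the O(m)-area design of [8]."""
    return 2 * m - 2


m = 1000
print(initial_delay_cycles(m))       # initial delay in cycles
print(cycles_for_results(10, m))     # cycles until the 10th result
print(ref8_cycles_per_result(m) / m) # ratio of cycles per result, [8] vs proposed
```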
From Section II, it can be seen that the operation "T = x · T
mod G" of the division algorithm can be simplified to
"T = x · T" when computing inversion in GF(2^m). Thus, the
systolic array in Fig. 9 can also be simplified for computing
inversion. For performing "T = x · T", t_{m-1} and the g_i's
(0 ≤ i ≤ m - 1) are no longer needed and thus the logic gates
associated with them can be saved. Figs. 10 and 11 show the
correspondingly simplified systolic array and basic cell,
respectively. Each basic cell saves one 2-to-1 multiplexer, one
2-input AND gate, and three latches.

Fig. 10. A new serial-in serial-out systolic array for inversion
in GF(2^3).
Fig. 11. The circuit of the basic cell in Fig. 10.
Fig. 12. A chip layout of the proposed systolic array for
division in GF(2').

D. Chip Implementation 
For the purpose of verification, a prototype chip of the
proposed systolic array for division in GF(2 ) was designed
based on the COMPASS 0.6 µm CMOS standard cell library.
We completed the work by using the VERILOG hardware
description language and the CADENCE CAD tool. The
resulting chip layout is shown in Fig. 12, where the entire die
size is about 1.66 x 2.73 mm^2 and the core area is about 0.67
x 1.62 mm^2. The critical-path delay of the core circuitry is
around 3 ns under the technology used. When the I/O pad
delay is considered, it was estimated that the chip can run at a
clock rate of up to 167 MHz.
IV. Conclusions
In this paper, two serial-in serial-out systolic arrays have
been presented for computing inverses or divisions in GF(2^m)
over the standard basis. One is for division and the other
is for inversion. Both proposed arrays possess the features of
regularity, modularity, and unidirectional data flow. Thus,
they are well suited to implementation using VLSI techniques
with fault-tolerant design [13]-[14].
Table I gives a comparison of the proposed arrays with
existing related systolic designs. All the circuits compared
have the same time complexity of O(m) and the same I/O format
of serial-input serial-output. From this table, we can see that
our proposed arrays have the smallest latency among the circuits
compared and the same throughput performance as the circuits
in [10]-[11]. Besides, our circuits involve only O(m) area
complexity, which is a significant improvement in hardware
complexity, and one control signal, instead of two for
the circuits in [9] and [11]. Table II gives transistor count
estimations of the circuits compared in Table I based on the
following assumptions: an inverter, a 2-input AND gate, a
2-input XOR gate, a 2-to-1 multiplexer, a 2-input OR gate, and
a 1-bit latch consist of 2, 4, 6, 6, 6, and 8 transistors,
respectively [17]. According to the estimations, our proposed
arrays have lower transistor counts over a wide range of m and
substantially lower transistor counts when the value of m
is large. For example, for m = 1000, the transistor counts of
the circuits in Figs. 9 and 10 are only about 1.89% and 1.64%
of that of the circuits in [10]-[11], respectively. Therefore,
our proposed arrays are well suited to various applications,
especially those with a large value of m, such as cryptography.

Table I
Comparison of the proposed arrays with existing related systolic designs

Item                    Circuits in [9]-[11]     Circuits in Figs. 9 and 10
Throughput (1/cycles)   1/(2m - 1) [9]           1/m
                        1/m [10], [11]
Latency (cycles)        7m - 3 [9]               5m - 4
                        5m - 1 [10], [11]
Area complexity         O(m^2)                   O(m)
Area-time product       O(m^3)                   O(m^2)
# of control signals    2 [9], [11]              1
                        1 [10]
Operation               Division                 Division (Fig. 9)
                                                 Inversion (Fig. 10)

References
[1] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes.
Cambridge, MA: MIT Press, 1972.
[2] E. R. Berlekamp, Algebraic Coding Theory. New York:
McGraw-Hill, 1968.
[3] D. E. R. Denning, Cryptography and Data Security. Reading,
MA: Addison-Wesley, 1983.
[4] M. Kovac, N. Ranganathan, and M. Varanasi, "SIGMA: A VLSI
systolic array implementation of a Galois field GF(2^m) based
multiplication and division algorithm," IEEE Trans. VLSI Systems,
vol. 1, pp. 22-30, Mar. 1993.
[5] S.-W. Wei, "VLSI architectures for computing exponentiations,
multiplicative inverses, and divisions in GF(2^m)," in Proc. 1995
IEEE Int. Symp. Circuits Syst., London, May 1995, pp. 4.203-4.206.
[6] C.-L. Wang and J.-H. Guo, "New systolic arrays for C + AB^2,
inversion, and division in GF(2^m)," in Proc. 1995 European
Conference Circuit Theory Design, Istanbul, Turkey, Aug.
1995, pp. 431-434.
[7] J.-H. Guo and C.-L. Wang, "Systolic array implementation of
Euclid's algorithm for inversion and division in GF(2^m)," in
Proc. 1996 IEEE Int. Symp. Circuits Syst., Atlanta, May 1996,
pp. 11.481-11.484.
[8] J.-H. Guo and C.-L. Wang, "Hardware-efficient systolic array
implementations of Euclid's algorithm for inversion and division
in GF(2^m)," in Proc. 1996 Int. Comput. Symp.-Int. Conf.
Comput. Architecture, Kaohsiung, Taiwan, Dec. 1996, pp. 221-228.
[9] C.-L. Wang and J.-L. Lin, "A systolic architecture for computing
inverses and divisions in finite fields GF(2^m)," IEEE Trans.
Comput., vol. 42, pp. 1141-1146, Sep. 1993.
[10] M. A. Hasan and V. K. Bhargava, "Bit-level systolic divider
and multiplier for finite fields GF(2^m)," IEEE Trans. Comput.,
vol. 41, pp. 972-980, Aug. 1992.
[11] S. T. J. Fenn, M. Benaissa, and D. Taylor, "GF(2^m) multiplication
and division over the dual basis," IEEE Trans. Comput.,
vol. 45, pp. 319-327, Mar. 1996.
[12] H. T. Kung, "Why systolic architectures?," Computer, vol. 15,
pp. 37-46, Jan. 1982.
[13] H. T. Kung and M. Lam, "Fault tolerant and two level pipelining
in VLSI systolic arrays," in Proc. MIT Conf. Advanced Res.
VLSI, Cambridge, MA, Jan. 1984, pp. 74-83.
[14] J. V. McCanny, R. A. Evans, and J. G. McWhirter, "Use of
unidirectional data flow in bit-level systolic array chips,"
Electron. Lett., vol. 22, pp. 540-541, May 1986.
[15] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ:
Prentice-Hall, 1988.
[16] S. Y. Kung, "On supercomputing with systolic/wavefront array
processors," Proc. IEEE, vol. 72, pp. 867-884, July 1984.
[17] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design:
A System Perspective. Reading, MA: Addison-Wesley, 1985.
Table II
Transistor count estimations of the circuits listed in Table I
4256    6688    48k
3696    5808
* The transistor count of the circuit in [9] is about 3 times that of the circuit in [10].
