Novel digit-serial systolic array implementation of Euclid's algorithm for division in GF(2^m) by Guo, Jyh-Huei & Wang, Chin-Liang
NOVEL DIGIT-SERIAL SYSTOLIC ARRAY IMPLEMENTATION OF EUCLID'S
ALGORITHM FOR DIVISION IN GF(2^m)
Jyh-Huei Guo and Chin-Liang Wang 
Department of Electrical Engineering, National Tsing Hua University 
Hsinchu, Taiwan 300, Republic of China 
ABSTRACT 
In this paper, a novel digit-serial systolic array for computing
divisions in GF(2^m) over the standard basis is presented. To the
authors' knowledge, this is the very first digit-serial systolic divider
for GF(2^m). The proposed architecture possesses the features of
regularity, modularity, and unidirectional data flow. Thus, it is well
suited to be implemented using VLSI techniques with fault-tolerant
design. One important feature of the proposed architecture is that
different throughput performances can be easily achieved by varying
the digit size. By choosing the digit size appropriately, the proposed
digit-serial architecture can meet the throughput requirement
of a certain application with minimum hardware.
1.  INTRODUCTION 
Finite fields GF(2^m) play an important role in communications.
Some applications, such as error-correcting codes [1]-[2] and
cryptography [3], usually involve the division operation in finite
fields GF(2^m). Performing such arithmetic in software on a
general-purpose computer is straightforward, but it is neither fast
enough nor cost-effective for related real-time applications,
especially for public-key cryptosystems, where large fields are
adopted [3]. Therefore, special-purpose architectures for GF(2^m)
division become indispensable.
A number of VLSI architectures for computing inverses and/or
divisions in GF(2^m) have been reported in the literature. Among
them, the circuits in [4]-[12] are designed based on the concepts of
systolic architecture [13]. Existing architectures for inversion
and/or division in GF(2^m) can be categorized into two types:
bit-parallel and bit-serial architectures. Basically, a bit-parallel
system reaches much better throughput performance than a bit-serial
one, but it involves much more circuit complexity. For some
applications, bit-serial computation may be too slow, while fully
bit-parallel computation may be faster than necessary and too
hardware-consuming. To improve the trade-off between throughput
performance and hardware complexity, the adoption of digit-serial
architectures [14]-[15] seems to be a good approach.
In this paper, a novel digit-serial systolic divider for GF(2^m)
over the standard basis is presented. If the input data come in
continuously, it can produce division results at a rate of one every
m / L clock cycles with a latency of (5m / L) - 1 clock cycles, where
L is the selected digit size. The proposed array is highly regular and
modular and thus well suited to VLSI implementation. As we will
see, the proposed divider and some existing dividers reach the
smallest Area-Time (AT) product of O(m^2). The most important
feature of the proposed architecture is that different throughput
performances can be easily achieved simply by varying the digit
size. If the digit size is chosen appropriately, the proposed
digit-serial architecture can meet the throughput requirement of a
certain application with minimum hardware.
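The timing figures quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `divider_timing` is hypothetical, and it assumes that L divides m so that m / L and 5m / L are whole numbers of clock cycles.

```python
def divider_timing(m, L):
    """Return (cycles_per_result, latency_cycles) for degree m and digit size L.

    Assumption of this sketch: L divides m, so both quantities are integers.
    """
    if L < 1 or m % L != 0:
        raise ValueError("this sketch assumes that L divides m")
    cycles_per_result = m // L      # one division result every m / L cycles
    latency_cycles = 5 * m // L - 1  # latency of (5m / L) - 1 cycles
    return cycles_per_result, latency_cycles


if __name__ == "__main__":
    # GF(2^6) with digit size L = 3
    print(divider_timing(6, 3))  # (2, 9)
```

For GF(2^6) with L = 3, the sketch gives one result every 2 clock cycles and a latency of 9 clock cycles.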
2. THE DIVISION ALGORITHM FOR GF(2^m) IN [8]
Let A(x) and B(x) be two elements in GF(2^m), G(x) be the
primitive polynomial used to generate the field, and C(x) be the
result of A(x) / B(x) mod G(x), where

A(x) = a_{m-1} x^{m-1} + a_{m-2} x^{m-2} + \cdots + a_0    (1)
B(x) = b_{m-1} x^{m-1} + b_{m-2} x^{m-2} + \cdots + b_0    (2)
G(x) = x^m + g_{m-1} x^{m-1} + g_{m-2} x^{m-2} + \cdots + g_0    (3)
C(x) = c_{m-1} x^{m-1} + c_{m-2} x^{m-2} + \cdots + c_0    (4)

Each coefficient of the polynomials is in {0, 1}. To compute the
division A(x) / B(x) mod G(x), the algorithm in [8], which is based on
Euclid's algorithm, can be used. It consists of 2m iterations and can
be summarized as follows:
The GF(2^m) division algorithm in [8]
R = B(x); S = G = G(x); U = A(x); V = T = 0;
state = 0; count = 0;
for i = 1 to 2m do
    R = x · R; T = x · T mod G;
    if state == 0 then
        count = count + 1;
        if r_m == 1 then    (* r_m: coefficient of x^m of R *)
            tmp = R;                  (I)
            R = R + S;                (II)
            S = tmp; T = U;           (I)
            state = 1;
        end
    else
        count = count - 1;
        if r_m == 1 then
            R = R + S; T = T + U;     (II)
        end
        if count == 0 then
            V = T + V; U <-> V;       (III)    (* exchange U and V *)
            state = 0;
        end
    end
end (* V has the result C(x) *)
After 2m iterations, polynomial V contains the division result C(x). 
Two parallel-in parallel-out architectures and a serial-in serial-out 
architecture were developed based on this algorithm in [8] and 
[12]. In the following, a digit-serial architecture is given for the 
same problem. 
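A bit-level software model can help in checking the iteration structure. The following Python sketch is one reading of the listing above, not the authors' code: polynomials are packed into integers (bit i holds the coefficient of x^i), the step labelled (III) is taken to be V = T + V followed by an exchange of U and V, and the names `gf2m_divide` and `gf2m_multiply` are hypothetical. The multiplier is included only to verify the quotient.

```python
def gf2m_divide(a, b, g, m):
    """Compute a / b in GF(2^m) with 2m iterations of the shift-based scheme.

    a, b: field elements as integers below 2**m, with b != 0.
    g:    generating polynomial as an integer with bit m set.
    """
    top = 1 << m
    R, S, U, V, T = b, g, a, 0, 0
    state, count = 0, 0
    for _ in range(2 * m):
        R <<= 1                      # R = x * R
        T <<= 1                      # T = x * T mod G
        if T & top:
            T ^= g
        r_m = (R >> m) & 1           # coefficient of x^m of R
        if state == 0:
            count += 1
            if r_m:
                R, S = R ^ S, R      # (I)/(II): S takes old R, R becomes R + S
                T = U                # (I)
                state = 1
        else:
            count -= 1
            if r_m:
                R ^= S               # (II)
                T ^= U
            if count == 0:
                V ^= T               # (III): V = T + V ...
                U, V = V, U          # ... then exchange U and V
                state = 0
    return V


def gf2m_multiply(a, b, g, m):
    """Shift-and-add product of a and b modulo g (used only for checking)."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= g
    return result


if __name__ == "__main__":
    # 1 / (x + 1) in GF(2^3) generated by x^3 + x + 1
    print(bin(gf2m_divide(0b001, 0b011, 0b1011, 3)))  # 0b110
```

Multiplying the returned quotient by B(x) and comparing with A(x) for every pair in a small field, for example GF(2^4) generated by x^4 + x + 1, is a simple way to exercise the model.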
3. NOVEL DIGIT-SERIAL SYSTOLIC ARRAY 
IMPLEMENTATION 
3.1 Dependence graph of the above division algorithm [12]
0-7803455-3/98/$10.00 1998 IEEE
Fig. 1. A dependence graph for division in GF(2^6). (Legend: output O_i : [*, g_i, *, c_i]; *: don't care.)
Fig. 1 shows a dependence graph (DG) of the above division
algorithm for GF(2^m), where m = 6. It consists of 2m Type-1 cells
and 2m × m Type-2 cells, where the functions of these two types of
basic cells are depicted in Figs. 2-3. The coefficients of A(x), B(x),
and G(x) enter the array from the top, and the cells in the i-th row
of this array realize the i-th iteration of the algorithm. The value of
count is traced based on the tracing scheme in [12], where each
Type-2 cell incorporates one 2-to-1 multiplexer (the one with Inc
and Dec as its inputs and C-flag as its output) for this purpose. The
value of count will increase or decrease depending on the value of
state. That is,
count' = \begin{cases} count + 1, & \text{if } state == 0 \\ count - 1, & \text{if } state == 1 \end{cases}    (5)
where count' represents the value of count for the next iteration.
Besides, the logic value of C-zero of the Type-1 cell can be used to
determine whether the value of count equals zero. For the Type-1
cell in the i-th row, "C-zero = 1" means that "count = 0" after i
iterations, while "C-zero = 0" means that "count ≠ 0" after i
iterations (for more details, see [12]).

Fig. 2. The circuit of Type-1 cell in Fig. 1.
Fig. 3. The circuit of Type-2 cell in Fig. 1.
For each row, the Type-1 cell also generates the control signals
Ctrl1, Ctrl2, and Ctrl3 for the present iteration and computes
the value of state for the next iteration (i.e., state' in Fig. 2)
according to the following logic functions:

Ctrl1 = (state == 0) & (r_m == 1)    (6)
Ctrl2 = (r_m == 1)    (7)
Ctrl3 = (C-zero == 1)    (8)
state' = \begin{cases} \overline{state}, & \text{if } ((r_m == 1) & (state == 0)) \text{ or } ((C-zero == 1) & (state == 1)) \\ state, & \text{otherwise} \end{cases}    (9)
Type-2 cells in the corresponding row receive Ctrl1, Ctrl2, and
Ctrl3 from the Type-1 cell and execute the operations of (I), (II), and
(III) when the control signals Ctrl1, Ctrl2, and Ctrl3 are true,
respectively, where the (i + 1)-th Type-2 cell (0 ≤ i ≤ m - 1) from the
right evaluates the (i + 1)-th least significant coefficients of R, S, U,
V, and T (i.e., r'_i, s'_i, u'_i, v'_i, and t'_i in Fig. 3). The coefficients of the
division result C(x) will emerge from the bottom of the array (i.e.,
after 2m iterations).
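The control functions of the Type-1 cell can be written out as a small truth-table model. This is a sketch under stated assumptions rather than a description of the actual circuit: the function name `type1_controls` is hypothetical, signals are modelled as the integers 0 and 1, and state' is assumed to be the complement of state exactly when (r_m == 1 and state == 0) or (C-zero == 1 and state == 1), which matches the two places where the algorithm of Section 2 changes state.

```python
def type1_controls(state, r_m, c_zero):
    """Return (ctrl1, ctrl2, ctrl3, next_state) for one row of the DG.

    state, r_m and c_zero are 0 or 1. The toggle rule for next_state is an
    assumption of this sketch, read off the algorithm in Section 2.
    """
    ctrl1 = int(state == 0 and r_m == 1)   # enables operations (I)
    ctrl2 = int(r_m == 1)                  # enables operations (II)
    ctrl3 = int(c_zero == 1)               # enables operations (III)
    toggle = (r_m == 1 and state == 0) or (c_zero == 1 and state == 1)
    next_state = state ^ int(toggle)       # complement state when toggle holds
    return ctrl1, ctrl2, ctrl3, next_state


if __name__ == "__main__":
    # Enumerate every input combination of the cell.
    for state in (0, 1):
        for r_m in (0, 1):
            for c_zero in (0, 1):
                print(state, r_m, c_zero, type1_controls(state, r_m, c_zero))
```

Enumerating all eight input combinations in this way gives a reference table against which a gate-level description of the cell could be compared.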
3.2 Novel digit-serial systolic divider for GF(2^m)
To design a digit-serial systolic divider for GF(2^m) with digit
size L (L ≥ 2), we can proceed according to the following procedure:
Fig. 4. Partition of one part of the DG in Fig. 1 with L = 3.
Fig. 5. Implementation of the circuit in Fig. 4 using one cell with L = 3.
1) Partition the DG in Fig. 1 into 2m / L identical parts, where each
part consists of L rows. For example, we can partition the DG in
Fig. 1 into 4 parts with L = 3.
2) Partition one part of the DG into (m / L) + 1 regions. Fig. 4
shows an example with L = 3. Region 1 and region (m / L) + 1 are
triangular, while the other regions are parallelograms. Region 1
contains L Type-1 cells and (L - 1)L / 2 Type-2 cells, region i
(2 ≤ i ≤ m / L) comprises L^2 Type-2 cells, and region (m / L) + 1
includes (L + 1)L / 2 Type-2 cells.
3) Scheduling: The regions in Fig. 4 are defined as equitemporal
regions; all the processor elements belonging to the same region
are processed at the same time. The processor elements in region
i are scheduled to be processed at the i-th clock cycle.
4) To implement the circuit in Fig. 4 according to the above
schedule, the cell in Fig. 5 can be used, where "*" denotes a
one-cycle delay element. The cell is composed of L Type-3
cells (shown in Fig. 6), L^2 Type-2 cells, 3L 2-input AND gates, L
2-to-1 multiplexers, and 19L + 4 one-bit one-cycle delay elements.
A control sequence Ct = 011...1 of length m / L is used to
control the cell. Since Ctrl1, Ctrl2, Ctrl3, state, and t_{m-1} from
each Type-1 cell in Fig. 4 must be broadcast to all the Type-2
cells in the same row, each Type-3 cell incorporates four 2-to-1
multiplexers and four one-bit latches for this purpose. When the
control signal Ct is at logic 0, Ctrl2, Ctrl3, state, and t_{m-1} are
loaded. Besides, 3L 2-input AND gates are added to the cell in
Fig. 5 because three zeros must be fed into each row of the
circuit in Fig. 4 from the rightmost cell. When the control signal
is at logic 0, these AND gates generate zeros. In addition, the
cell in Fig. 5 also incorporates L 2-to-1 multiplexers for data
flow arrangement.

Fig. 6. The circuit of Type-3 cell in Fig. 5.
Fig. 7. A novel digit-serial systolic divider for GF(2^6) with digit size L = 3.
5) By concatenating 2m / L of the cells shown in Fig. 5, a
digit-serial systolic divider for GF(2^m) can be obtained. Fig. 7 shows
the result. The coefficients of A(x), B(x), and G(x) enter the
array from the left at a rate of L bits every clock cycle, and those of
C(x) emerge from the right of the array at the same rate with a
latency of (5m / L) - 1 clock cycles. That is, the systolic divider
can produce division results at a rate of one every m / L clock
cycles.
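As a sanity check on the partitioning in step 2, the region sizes can be tallied to confirm that one part of L rows still contains L Type-1 cells and L·m Type-2 cells, as expected from L rows of the DG. The helper name `partition_cell_counts` is hypothetical, and the sketch assumes that L divides m.

```python
def partition_cell_counts(m, L):
    """List the (Type-1, Type-2) cell counts of the (m / L) + 1 regions.

    Assumption of this sketch: L >= 2 and L divides m.
    """
    if L < 2 or m % L != 0:
        raise ValueError("this sketch assumes L >= 2 and that L divides m")
    # Region 1: triangular, holds the L Type-1 cells.
    regions = [{"type1": L, "type2": (L - 1) * L // 2}]
    # Regions 2 .. m / L: parallelograms of L^2 Type-2 cells each.
    regions += [{"type1": 0, "type2": L * L} for _ in range(2, m // L + 1)]
    # Region (m / L) + 1: the closing triangle.
    regions.append({"type1": 0, "type2": (L + 1) * L // 2})
    return regions


if __name__ == "__main__":
    regions = partition_cell_counts(6, 3)
    print(len(regions))                       # 3 regions, i.e. (m / L) + 1
    print([r["type2"] for r in regions])      # [3, 9, 6]
    print(sum(r["type2"] for r in regions))   # 18, i.e. L * m
```

Since region i is processed at the i-th clock cycle, the number of regions also equals the number of clock cycles one cell spends on one part of the DG under this schedule.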
When the digit size L gets large, the maximum propagation delay of
the cell in Fig. 5 becomes large and thus the clock rate decreases.
To overcome this problem, we can further pipeline the cell in Fig. 5
so that the maximum propagation delay is kept small when the digit
size L gets large. For example, we can further pipeline the cell in
Fig. 5 into 3 stages by placing one one-cycle delay element on each
of the communication links crossed by the gray lines.
4. CONCLUSIONS 
In this paper, a novel digit-serial systolic divider for GF(2^m)
over the standard basis is presented. To the authors' knowledge,
this is the very first digit-serial systolic divider for GF(2^m). The
proposed divider possesses the features of regularity, modularity,
and unidirectional data flow. Thus, it is well suited to be
implemented using VLSI techniques with fault-tolerant design [16]-[17].
TABLE I gives a comparison of some systolic arrays for division
in GF(2^m). The systolic divider in [4] is not considered for
comparison due to its large area-complexity of O(m · 2^m). It can be
seen that the proposed divider and the dividers in [7]-[8] and [12]
reach the smallest Area-Time (AT) product of O(m^2). Compared
to the dividers in [9]-[11], the proposed systolic divider performs
better since it involves less area-complexity and reaches a
higher throughput rate. The dividers in [5]-[7] and Fig. 4 of [8]
are only suitable for some very high-speed applications where
the value of m is not large. For applications where a large value
of m is adopted, their large area-complexity will make single-chip
implementation impossible. As for the divider in Fig. 8 of [8] and
the divider in [12], they involve small area-complexities, but their
throughput rates may be too low for some real-time applications.
The proposed digit-serial systolic divider has a throughput
rate of L / m and involves O(Lm) area-complexity. By varying the
digit size L, different throughput rates can be easily achieved.
By choosing the digit size L appropriately, the proposed
digit-serial architecture can meet the throughput requirement of a
certain application with minimum hardware.
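The digit-size trade-off can be illustrated with a short selection routine. This is a hypothetical sketch, not a procedure from the paper: the function name `smallest_digit_size` and the parameter `max_cycles_per_result` are assumptions, L is restricted to divisors of m, and the smallest feasible L is taken as the minimum-hardware choice because the area-complexity O(Lm) grows with L.

```python
def smallest_digit_size(m, max_cycles_per_result):
    """Return the smallest L (2 <= L <= m, L dividing m) with m / L <= budget.

    max_cycles_per_result is the largest acceptable number of clock cycles
    between consecutive division results (a hypothetical application budget).
    """
    for L in range(2, m + 1):
        if m % L == 0 and m // L <= max_cycles_per_result:
            return L  # first hit is the smallest L, hence the least area
    raise ValueError("no digit size dividing m meets the requested rate")


if __name__ == "__main__":
    print(smallest_digit_size(6, 3))  # 2
    print(smallest_digit_size(6, 2))  # 3
```

A tighter cycle budget pushes the choice towards a larger L and therefore a larger array, which is the trade-off summarized in TABLE I.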
5. REFERENCES 
[1] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting
Codes. Cambridge, MA: MIT Press, 1972.
[2] E. R. Berlekamp, Algebraic Coding Theory. New York:
McGraw-Hill, 1968.
[3] D. E. R. Denning, Cryptography and Data Security. Reading,
MA: Addison-Wesley, 1983.
[4] M. Kovac, N. Ranganathan and M. Varanasi, "SIGMA: A
VLSI systolic array implementation of a Galois field GF(2^m)
based multiplication and division algorithm," IEEE Trans.
VLSI Systems, vol. 1, pp. 22-30, Mar. 1993.
[5] S.-W. Wei, "VLSI architectures for computing exponentiations,
multiplicative inverses, and divisions in GF(2^m)," in
Proc. 1995 IEEE Int. Symp. Circuits Syst., London, May 1995,
pp. 4.203-4.206.
I Area complexity 
1 
[6] C.-L. Wang and J.-H. Guo, "New systolic arrays for C + AB^2,
inversion, and division in GF(2^m)," in Proc. 1995 European
Conference Circuit Theory Design, Istanbul, Turkey, Aug.
1995, pp. 431-434.
[7] J.-H. Guo and C.-L. Wang, "Systolic array implementation of
Euclid's algorithm for inversion and division in GF(2^m)," in
Proc. 1996 IEEE Int. Symp. Circuits Syst., Atlanta, May 1996,
pp. II.481-II.484.
[8] J.-H. Guo and C.-L. Wang, "Hardware-efficient systolic array
implementations of Euclid's algorithm for inversion and
division in GF(2^m)," in Proc. 1996 Int. Comput. Symp.-Int. Conf.
Comput. Architecture, Kaohsiung, Taiwan, Dec. 1996, pp.
221-228.
[9] C.-L. Wang and J.-L. Lin, "A systolic architecture for
computing inverses and divisions in finite fields GF(2^m)," IEEE
Trans. Comput., vol. 42, pp. 1141-1146, Sept. 1993.
[10] M. A. Hasan and V. K. Bhargava, "Bit-level systolic divider
and multiplier for finite fields GF(2^m)," IEEE Trans. Comput.,
vol. 41, pp. 972-980, Aug. 1992.
[11] S. T. J. Fenn, M. Benaissa, and D. Taylor, "GF(2^m)
multiplication and division over the dual basis," IEEE Trans. Comput.,
vol. 45, pp. 319-327, Mar. 1996.
[12] J.-H. Guo and C.-L. Wang, "Bit-serial systolic array
implementation of Euclid's algorithm for inversion and division in
GF(2^m)," in Proc. 1997 Int. Symp. on VLSI Technology,
Systems, and Applications, Taipei, Taiwan, June 1997, pp. 113-117.
[13] H. T. Kung, "Why systolic architectures?," Computer, vol. 15,
pp. 37-46, Jan. 1982.
[14] R. L. Hartley and P. F. Corbett, "Digit-serial processing
techniques," IEEE Trans. Circuits Syst., vol. 37, pp. 707-719,
June 1990.
[15] R. L. Hartley and P. F. Corbett, "Designing systolic arrays
using digit-serial arithmetic," IEEE Trans. Circuits Syst.-II:
Analog and Digital Signal Processing, vol. 39, pp. 62-65,
Jan. 1992.
[16] H. T. Kung and M. Lam, "Fault tolerant and two level
pipelining in VLSI systolic arrays," in Proc. MIT Conf. Advanced
Res. VLSI, Cambridge, MA, Jan. 1984, pp. 74-83.
[17] J. V. McCanny, R. A. Evans, and J. G. McWhirter, "Use of
unidirectional data flow in bit-level systolic array chips,"
Electron. Lett., vol. 22, pp. 540-541, May 1986.
TABLE I. COMPARISON OF SOME SYSTOLIC ARRAYS FOR DIVISION IN GF(2^m)
Circuits The circuits The circuits The circuits in The proposed 
Throughput rate 171 & Fig. 4 in [SI 1 [9] 1 / (2m-  1)  
(unit=l/cycles) [10]-[12] 1 I m 
Time complexity 
L l m  Fig. 8 in [8] 1 / (2m - 2) 
171 & Fig. 4 in [SI O(1) 
O h )  
112) O(m) 
O(m) 
191-1111 O b L )  
Area-Time product 
Fig. 8 in-181 U(k)  
[7] & Fig. 4 in 181 U(mL) 
Fig. 8 in IS] O(m) 
O(m) 
parallel-in 
parallel-out 
O b Z )  
serial-in 
serial-out 
191-1111 O(m') 
1121 4 m 2 )  
O(m / L) 
digit-serial 
* L is the digit size. 
