Design of a linear systolic array for computing modular multiplication and squaring in GF(2^m) by Lee, Won-Ho et al.
PERGAMON 




Computers and Mathematics with Applications 42 (2001) 231-240 
www.elsevier.nl/locate/camwa 
Design of a Linear Systolic Array for
Computing Modular Multiplication and
Squaring in GF(2^m)
WON-HO LEE, KEON-JIK LEE AND KEE-YOUNG YOO
Department of Computer Engineering
Kyungpook National University, Taegu, Korea
yook@knu.ac.kr
(Received July 1999; revised and accepted April 2000) 
Abstract--One of the main operations in public key cryptosystems is modular exponentiation.
In this paper, we analyze the Montgomery algorithm and design a linear systolic array
that performs modular multiplication and modular squaring simultaneously. The proposed
systolic array achieves lower hardware complexity by making use of common-multiplicand
multiplication in the right-to-left modular exponentiation over GF(2^m). For the fast computation
of modular exponentiation, the proposed systolic array gives a 1.25 times improvement in area-time
complexity compared to existing multipliers. It suffers a small loss in time complexity, but gives
a 1.44 times improvement in area complexity, since it executes only once the parts that are common
to the simultaneous computation of modular multiplication and squaring.
It could be designed on VLSI hardware and used in IC cards. © 2001 Elsevier Science Ltd.
All rights reserved.
Keywords--Public key cryptosystem, Montgomery algorithm, Common-multiplicand multiplication,
Modular exponentiation, Systolic array.
1. INTRODUCTION 
Galois fields have found numerous practical applications in modern communication, such as
error-correcting codes, switching theory, cryptography, and digital signal processing. An example
of a cryptographic application is the Diffie-Hellman key exchange algorithm [1]. This system is
based on modular exponentiation over the Galois field GF(2^m). Modular exponentiation can also
be used to perform public-key data exchange and digital signatures, as shown by ElGamal [2].
The exponentiation operation can be implemented as a series of squaring and multiplication
operations in GF(2^m) using the binary method.
Recently, several types of multipliers for GF(2^m), which is of particular interest for many
applications, have been proposed. The multiplication circuits proposed in [3,4] have complex, irregular
communication and control structures, and thus are not suitable for VLSI implementation. There
are three popular basis representations for elements of GF(2^m), and multipliers of
each type that are easily realized using VLSI techniques have been proposed. These are
0898-1221/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved. Typeset by AMS-TeX
PII: S0898-1221(01)00147-X
(1) the standard (polynomial) basis multiplier [5-7], which uses the standard basis
representation for all elements of the field,
(2) the dual basis multiplier [8], which uses the dual basis representation for the multiplicand
and the standard basis representation for the multiplier, and
(3) the Massey-Omura normal basis multiplier [9], which uses the normal basis representation
for both the multiplicand and the multiplier.
The latter two types of finite field multipliers need basis conversion, while the first type does
not. In this paper, we restrict our interest to the standard basis.
In [5], Yeh et al. designed a systolic array using the standard basis method. It is well suited to
VLSI implementation, but it requires two control signals. The multipliers for GF(2^m) described
in [6] and [7] are also suitable for implementation using VLSI techniques, but they involve
broadcasts in the data flow. It is often desirable to avoid broadcasting when designing a high-speed system.
In [10], Wang and Lin proposed a systolic array based on [6]. It has unidirectional data flow.
In [11,12], Yen et al. proposed the common-multiplicand multiplication concept.
In this paper, we propose a time-optimal linear systolic array for computing modular
multiplication and squaring simultaneously in GF(2^m). This array can be used efficiently for modular
exponentiation in GF(2^m). We find parallelism in the binary square-and-multiply modular
exponentiation algorithm [13] and analyze the Montgomery algorithm [14] for simultaneously
computing modular multiplication and squaring in a way that efficiently utilizes the common-multiplicand
multiplication concept. A dependence graph (DG) is derived from the recurrence form of the Montgomery
algorithm using the common-multiplicand. Using a space-time transformation, we design
systolic arrays for performing modular squaring and multiplication from the DG, and propose a
time-optimal linear systolic array. To prepare the designed systolic array for hardware
implementation, we simulate it using ALTERA MAX+PLUS II. We compare the systolic GF(2^m)
multipliers in [10] and [15] with the proposed systolic array when they are applied to computing
modular exponentiation. The proposed architecture has regularity, modularity, concurrency,
and unidirectional data flow, and thus is well suited to VLSI implementation.
This paper is organized as follows. Section 2 gives an overview of arithmetic operations in GF(2^m),
such as Montgomery modular multiplication and modular exponentiation. The concurrent
execution of the Montgomery modular multiplication is discussed in Section 3. A linear systolic
array for performing both modular multiplication and squaring is designed, and its performance
is analyzed, in Section 4. Finally, a conclusion is given in Section 5.
2. ARITHMETIC OPERATION IN GF(2^m)
2.1. Polynomial Representation
An element of the Galois field GF(2^m) can be represented in several different ways [16]. The
polynomial representation is useful and suitable for both hardware and software implementation. The
algorithm for the Montgomery multiplication described in this paper is based on the polynomial
representation. According to this representation, an element of GF(2^m) is a polynomial of
length m, i.e., of degree less than or equal to m - 1, written as

$$a(x) = \sum_{i=0}^{m-1} a_i x^i = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1 x + a_0,$$

where the coefficients a_i ∈ GF(2). These coefficients are also referred to as the bits of the element a
of GF(2^m), and the element a is represented as a = (a_{m-1} a_{m-2} ... a_1 a_0).
The addition of two elements a and b in GF(2^m) is performed by adding the polynomials a(x)
and b(x), where the coefficients are added in the field GF(2). This is equivalent to a bit-wise XOR
operation on a and b. In order to multiply two elements a and b in GF(2^m), we need an
irreducible polynomial of degree m. Let n(x) be an irreducible polynomial of degree m over the
field GF(2). The product c = ab in GF(2^m) is obtained by computing c(x) = a(x)b(x) mod n(x),
where c(x) is a polynomial of length m representing the element c ∈ GF(2^m). Thus, the
multiplication operation in GF(2^m) is accomplished by multiplying the corresponding polynomials
modulo the irreducible polynomial n(x).
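The bit-vector view of GF(2^m) arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names gf2m_add and gf2m_mul, the encoding of a polynomial as an integer whose bit i is the coefficient a_i, and the example field GF(2^4) with n(x) = x^4 + x + 1 are choices made here, not taken from the paper.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two field elements: coefficient-wise addition in GF(2) is XOR."""
    return a ^ b


def gf2m_mul(a: int, b: int, n: int, m: int) -> int:
    """Return a(x) * b(x) mod n(x), where deg n = m and deg a, deg b < m."""
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a          # accumulate b_i * a(x) * x^i (already reduced)
        a <<= 1             # a(x) := a(x) * x
        if (a >> m) & 1:
            a ^= n          # reduce modulo n(x) once the degree reaches m
    return c


# Illustrative field (an assumption): GF(2^4) with n(x) = x^4 + x + 1.
N, M = 0b10011, 4
total = gf2m_add(0b0110, 0b0101)        # (x^2 + x) + (x^2 + 1)
product = gf2m_mul(0b0010, 0b1000, N, M)  # x * x^3 = x^4 mod n(x)
```

Here total is x + 1, and product is also x + 1 because x^4 = x + 1 modulo n(x).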
2.2. Montgomery Multiplication Algorithm
Instead of computing ab in GF(2^m), the Montgomery algorithm computes abr^{-1} in GF(2^m),
where r is a special fixed element of GF(2^m). Almost the same idea was proposed by Montgomery
for modular multiplication on integers. Montgomery's technique is applicable to the field GF(2^m)
as well. The selection r(x) = x^m turns out to be very useful. Thus, r is the element of
the field represented by the polynomial r(x) mod n(x); i.e., if n = (n_m n_{m-1} ... n_1 n_0), then
r = (n_{m-1} n_{m-2} ... n_1 n_0). The Montgomery multiplication method requires that r(x) and n(x)
are relatively prime, i.e., gcd(r(x), n(x)) = 1. For this assumption to hold, it suffices that n(x)
not be divisible by x. Since n(x) is an irreducible polynomial over the field GF(2), this will
always be the case. Since r(x) and n(x) are relatively prime, there exist two polynomials r^{-1}(x)
and n'(x) such that r(x)r^{-1}(x) + n(x)n'(x) = 1, where r^{-1}(x) is the inverse
of r(x) modulo n(x). The polynomials r^{-1}(x) and n'(x) can be computed using the extended
Euclidean algorithm. The Montgomery multiplication of a and b in GF(2^m) is defined as the
product c(x) = a(x)b(x)r^{-1}(x) mod n(x), which can be computed using Algorithm 1 [17].
[Algorithm 1: Montgomery Multiplication Algorithm for GF(2^m)]
Input: a(x), b(x), n(x)
Output: c(x) = a(x)b(x)x^{-m} mod n(x)
MMM(a(x), b(x))
Step 1. c(x) := 0
Step 2. for i = 0 to m - 1 do
Step 3.     c(x) := c(x) + a_i b(x)
Step 4.     c(x) := c(x) + c_0 n(x)
Step 5.     c(x) := c(x)/x.
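As a minimal sketch, Steps 1-5 of Algorithm 1 can be transcribed into Python on integer-encoded polynomials (bit i holds the coefficient of x^i). The function name mmm, the explicit parameters n and m, and the example field GF(2^4) with n(x) = x^4 + x + 1 are assumptions made for illustration; the sketch also assumes n_0 = 1, which holds for any irreducible n(x).

```python
def mmm(a: int, b: int, n: int, m: int) -> int:
    """Montgomery product a(x) b(x) x^(-m) mod n(x), following Steps 1-5."""
    c = 0                   # Step 1
    for i in range(m):      # Step 2
        if (a >> i) & 1:
            c ^= b          # Step 3: c(x) := c(x) + a_i b(x)
        if c & 1:
            c ^= n          # Step 4: c(x) := c(x) + c_0 n(x), clears the x^0 term
        c >>= 1             # Step 5: exact division by x
    return c


# Illustrative field (an assumption): GF(2^4), n(x) = x^4 + x + 1.
N, M = 0b10011, 4
R = N ^ (1 << M)            # r = x^m mod n(x): n without its leading coefficient
identity_check = [mmm(a, R, N, M) for a in range(16)]
```

Since MMM(a, r) = a r r^{-1} = a, identity_check reproduces the inputs 0, 1, ..., 15, which matches the remark above that r is given by the low m bits of n.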
2.3. Modular Exponentiation Operation
The exponentiation operation can be implemented as a series of squaring and multiplication
operations in GF(2^m) using the binary square and multiply algorithm [9]. This algorithm is
usually employed to compute m(x)^e mod n(x), where e can be expressed as e = Σ_{i=0}^{n-1} e_i 2^i with
e_i ∈ {0, 1}. The algorithm has a left-to-right variant and a right-to-left variant. The right-to-left
method can be used to compute modular squaring and modular multiplication concurrently.
The right-to-left binary square and multiply algorithm is represented as Algorithm 2.
[Algorithm 2: Right-to-Left Binary Square and Multiply Algorithm]
Input: m(x), e, R2 = r(x)r(x) mod n(x), modulus n(x)
Output: c(x) = m(x)^e mod n(x)
MMMEXP(m(x), e, R2)
Step 1. m̄(x) := MMM(m(x), R2)
Step 2. c̄(x) := MMM(1, R2)
Step 3. for i = 0 to n - 1 do
Step 4.     if e_i = 1 then c̄(x) := MMM(m̄(x), c̄(x))
Step 5.     m̄(x) := MMM(m̄(x), m̄(x))
Step 6. c(x) := MMM(c̄(x), 1).
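The right-to-left exponentiation above can be sketched in Python as follows. This is an illustrative transcription, not the authors' implementation: the names mmm, x_pow_mod, and mmm_exp, the parameter name msg for m(x), the use of e.bit_length() as the number of exponent bits, and the example field GF(2^4) with n(x) = x^4 + x + 1 are all assumptions.

```python
def mmm(a: int, b: int, n: int, m: int) -> int:
    """Montgomery product a(x) b(x) x^(-m) mod n(x) (Algorithm 1)."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b
        if c & 1:
            c ^= n
        c >>= 1
    return c


def x_pow_mod(k: int, n: int, m: int) -> int:
    """Return x^k mod n(x) by k shift-and-reduce steps."""
    p = 1
    for _ in range(k):
        p <<= 1
        if (p >> m) & 1:
            p ^= n
    return p


def mmm_exp(msg: int, e: int, n: int, m: int) -> int:
    """Compute msg(x)^e mod n(x) with the right-to-left method."""
    r2 = x_pow_mod(2 * m, n, m)           # R2 = r(x) r(x) mod n(x), r(x) = x^m
    m_bar = mmm(msg, r2, n, m)            # Step 1: msg(x) r(x)
    c_bar = mmm(1, r2, n, m)              # Step 2: r(x), i.e. 1 in Montgomery form
    for i in range(e.bit_length()):       # Step 3
        if (e >> i) & 1:
            c_bar = mmm(m_bar, c_bar, n, m)   # Step 4: multiplication
        m_bar = mmm(m_bar, m_bar, n, m)       # Step 5: squaring
    return mmm(c_bar, 1, n, m)            # Step 6: leave Montgomery form


# Illustrative field (an assumption): GF(2^4), n(x) = x^4 + x + 1.
N, M = 0b10011, 4
x_to_4 = mmm_exp(0b0010, 4, N, M)
```

In this field x^4 = x + 1, so x_to_4 equals 0b0011. Steps 4 and 5 read the same m_bar and write different variables, which is the independence the next section exploits.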
3. MONTGOMERY ALGORITHM USING THE
COMMON-MULTIPLICAND
In this section, the computational parts that are common to the concurrent execution of
modular squaring and multiplication in the modular exponentiation are extracted. By computing
these parts only once, the execution time and hardware size can be reduced.
In the case e_i = 1 in Algorithm 2, m̄(x) is used in both MMM(m̄(x), c̄(x)) and
MMM(m̄(x), m̄(x)). This is the so-called common-multiplicand multiplication, and m̄(x) is called the
common-multiplicand. Because there is no data dependency between Steps 4 and 5 in Algorithm 2,
MMM(m̄(x), c̄(x)) and MMM(m̄(x), m̄(x)) can be computed concurrently. Our basic idea is to
compute the parts common to MMM(m̄(x), c̄(x)) and MMM(m̄(x), m̄(x)) only once. Thus,
when the algorithm is implemented in hardware, the hardware size can be reduced. Assuming that m̄(x) and
c̄(x) have m bits and n(x) has m + 1 bits, the computation of MMM(m̄(x), c̄(x)) in Step 4 can
be rewritten as follows:
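$$
\begin{aligned}
\mathrm{MMM}(\bar{m}(x), \bar{c}(x)) &= \bar{m}(x)\bar{c}(x)r^{-1} \bmod n(x) \\
&= \bar{m}(x)\bar{c}(x)x^{-m} \bmod n(x) \\
&= \bar{m}(x)\left(\bar{c}_{m-1}x^{m-1} + \bar{c}_{m-2}x^{m-2} + \cdots + \bar{c}_1 x + \bar{c}_0\right)x^{-m} \bmod n(x) \\
&= \big(\bar{m}(x)\bar{c}_{m-1}x^{-1} \bmod n(x) + \bar{m}(x)\bar{c}_{m-2}x^{-2} \bmod n(x) + \cdots \\
&\qquad + \bar{m}(x)\bar{c}_1 x^{-m+1} \bmod n(x) + \bar{m}(x)\bar{c}_0 x^{-m} \bmod n(x)\big) \bmod n(x).
\end{aligned}
\tag{1}
$$

Similarly, the computation of MMM(m̄(x), m̄(x)) in Step 5 can be rewritten as follows:

$$
\begin{aligned}
\mathrm{MMM}(\bar{m}(x), \bar{m}(x)) &= \bar{m}(x)\bar{m}(x)r^{-1} \bmod n(x) \\
&= \bar{m}(x)\bar{m}(x)x^{-m} \bmod n(x) \\
&= \bar{m}(x)\left(\bar{m}_{m-1}x^{m-1} + \bar{m}_{m-2}x^{m-2} + \cdots + \bar{m}_1 x + \bar{m}_0\right)x^{-m} \bmod n(x) \\
&= \big(\bar{m}(x)\bar{m}_{m-1}x^{-1} \bmod n(x) + \bar{m}(x)\bar{m}_{m-2}x^{-2} \bmod n(x) + \cdots \\
&\qquad + \bar{m}(x)\bar{m}_1 x^{-m+1} \bmod n(x) + \bar{m}(x)\bar{m}_0 x^{-m} \bmod n(x)\big) \bmod n(x).
\end{aligned}
\tag{2}
$$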
In equations (1) and (2), m̄(x)x^{-1} mod n(x), m̄(x)x^{-2} mod n(x), ..., m̄(x)x^{-m+1} mod n(x),
and m̄(x)x^{-m} mod n(x) are the parts common to modular squaring and multiplication. The
common-multiplicand multiplication algorithm to calculate MMM(m̄(x), c̄(x)) and MMM(m̄(x), m̄(x))
is given as Algorithm 3.
[Algorithm 3: Montgomery Algorithm for Computing Multiplication and Squaring]
Input: m̄(x), c̄(x), n(x)
Output: X(x) = MMM(m̄(x), c̄(x)), Y(x) = MMM(m̄(x), m̄(x))
Step 1. t(x) = m̄(x)
Step 2. X(x) = 0, Y(x) = 0
Step 3. For i = 0 to m - 1 do
Step 4.     t(x) = (t(x) + t_0 n(x))/x
Step 5.     X(x) = X(x) + c̄_{m-1-i} t(x), Y(x) = Y(x) + m̄_{m-1-i} t(x)
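The common-multiplicand idea can be sketched in Python: the shared value t(x) = m̄(x) x^{-(i+1)} mod n(x) is updated once per iteration and then added into both accumulators. This is a hedged sketch, not the authors' code: the function names, the integer encoding of polynomials, the example field GF(2^4) with n(x) = x^4 + x + 1, and in particular the bit index m - 1 - i used for the accumulation (read off from equations (1) and (2)) are assumptions.

```python
def mmm(a: int, b: int, n: int, m: int) -> int:
    """Reference Montgomery product a(x) b(x) x^(-m) mod n(x) (Algorithm 1)."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b
        if c & 1:
            c ^= n
        c >>= 1
    return c


def mmm_mul_sqr(m_bar: int, c_bar: int, n: int, m: int) -> tuple[int, int]:
    """Return (MMM(m_bar, c_bar), MMM(m_bar, m_bar)) sharing the common part t."""
    t = m_bar                      # Step 1
    x_acc = y_acc = 0              # Step 2
    for i in range(m):             # Step 3
        if t & 1:
            t ^= n                 # Step 4: add t_0 n(x) so that x divides t(x)
        t >>= 1                    #         t(x) = m_bar(x) x^-(i+1) mod n(x)
        k = m - 1 - i              # bit selected in this iteration (assumed order)
        if (c_bar >> k) & 1:
            x_acc ^= t             # Step 5: multiplication accumulator X
        if (m_bar >> k) & 1:
            y_acc ^= t             # Step 5: squaring accumulator Y
    return x_acc, y_acc


# Illustrative field (an assumption): GF(2^4), n(x) = x^4 + x + 1.
N, M = 0b10011, 4
pair = mmm_mul_sqr(0b0001, 0b0011, N, M)
```

For this input, pair agrees with (mmm(1, 3, N, M), mmm(1, 1, N, M)); the reduction in Step 4 is carried out once per iteration instead of once per product.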
The value t(x), which is the common part of the multiplication and squaring, is computed
once in Step 4 and is then used to compute X(x) and Y(x) in Step 5 at the same time. In
Algorithm 3, all polynomial operations in GF(2^m) can be converted into bit-level operations.
Algorithm 4 represents a bit-level regular recurrence equation of the Montgomery algorithm for
computing X = MMM(a, b) and Y = MMM(a, a) using the common-multiplicand. The values a, b,
and n are m̄(x), c̄(x), and n(x), respectively, and the value m is the maximum degree of the polynomial n(x).
[Algorithm 4: Bit-Level Recurrence Equation of the Montgomery Algorithm Using the Common-Multiplicand]
GFMMM(a, b, n, m)
1. for j = 0 to m do
2.     if j ≠ 0 then T[1][j - 1] = T[0][j] ^ T[0][0] & N[0][j]
3.     X[1][j] = X[0][j]
4.     Y[1][j] = Y[0][j]
5.     N[1][j] = N[0][j]
6. for i = 1 to m do
7.     for j = 0 to m do
8.         X[i + 1][j] = X[i][j] ^ T[i][j] & B[i][j]
9.         Y[i + 1][j] = Y[i][j] ^ T[i][j] & A[i][j]
10.        if j ≠ 0 then T[i + 1][j - 1] = T[i][j] ^ T[i][0] & N[i][j]
11.        A[i][j + 1] = A[i][j]
12.        B[i][j + 1] = B[i][j]
13.        N[i + 1][j] = N[i][j].
In Algorithm 4, a coordinate (i, j) satisfies 0 ≤ i ≤ m - 1, 0 ≤ j ≤ m, and the initial values of the inputs
A, B, N, T, X, and Y are the following:
    A[m - i][0] = a[i], 0 ≤ i ≤ m - 1,    B[m - i][0] = b[i], 0 ≤ i ≤ m - 1,
    N[0][j] = n[j], 0 ≤ j ≤ m,
    T[0][j] = a[j], 0 ≤ j ≤ m - 1,    T[i][m] = 0, 0 ≤ i ≤ m - 1,
    X[0][j] = 0, 0 ≤ j ≤ m,    Y[0][j] = 0, 0 ≤ j ≤ m.
In Algorithm 4, the first for loop (Steps 1-5) performs the modular reduction (x^{-1}), and the
second, nested for loop (Steps 6-13) executes the modular multiplication. The array T, i.e., the
common parts, computed in the ith iteration is used to compute X and Y in the (i + 1)th iteration.
The final result of MMM(a, b) is stored from bit X[m + 1][0] to bit X[m + 1][m - 1], and the final
result of MMM(a, a) is stored from bit Y[m + 1][0] to bit Y[m + 1][m - 1].
As shown in Algorithm 3, the modular squaring MMM(a, a), i.e., the output Y[m + 1][0], ...,
Y[m + 1][m - 1], is needed only when the exponent bit of the exponentiation operation is 1.
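To make the indexing of Algorithm 4 concrete, the recurrence can be simulated directly on two-dimensional bit arrays in Python. This is a sketch under assumptions: the function name gfmmm, the integer encoding of the inputs, the array sizes, and the choice to initialise T[i][m] = 0 for every row (the text states it for 0 ≤ i ≤ m - 1) are made here; the operator & is taken to bind tighter than ^, as in C.

```python
def gfmmm(a: int, b: int, n: int, m: int) -> tuple[int, int]:
    """Simulate Steps 1-13 of Algorithm 4; return (MMM(a, b), MMM(a, a))."""
    def bit(v: int, k: int) -> int:
        return (v >> k) & 1

    # Row index i runs over 0..m+1, column index j over 0..m (A, B: 0..m+1).
    T = [[0] * (m + 1) for _ in range(m + 2)]   # T[i][m] stays 0 (assumed for all i)
    X = [[0] * (m + 1) for _ in range(m + 2)]
    Y = [[0] * (m + 1) for _ in range(m + 2)]
    N = [[0] * (m + 1) for _ in range(m + 2)]
    A = [[0] * (m + 2) for _ in range(m + 1)]
    B = [[0] * (m + 2) for _ in range(m + 1)]

    for j in range(m + 1):                      # initial values
        N[0][j] = bit(n, j)
        T[0][j] = bit(a, j)
    for i in range(m):
        A[m - i][0] = bit(a, i)
        B[m - i][0] = bit(b, i)

    for j in range(m + 1):                      # Steps 1-5: reduction row
        if j != 0:
            T[1][j - 1] = T[0][j] ^ (T[0][0] & N[0][j])
        X[1][j] = X[0][j]
        Y[1][j] = Y[0][j]
        N[1][j] = N[0][j]

    for i in range(1, m + 1):                   # Steps 6-13: multiplication rows
        for j in range(m + 1):
            X[i + 1][j] = X[i][j] ^ (T[i][j] & B[i][j])
            Y[i + 1][j] = Y[i][j] ^ (T[i][j] & A[i][j])
            if j != 0:
                T[i + 1][j - 1] = T[i][j] ^ (T[i][0] & N[i][j])
            A[i][j + 1] = A[i][j]
            B[i][j + 1] = B[i][j]
            N[i + 1][j] = N[i][j]

    def to_int(row: list[int]) -> int:
        return sum(row[j] << j for j in range(m))

    return to_int(X[m + 1]), to_int(Y[m + 1])


# Illustrative field (an assumption): GF(2^4), n(x) = x^4 + x + 1.
result = gfmmm(0b0001, 0b0011, 0b10011, 4)
```

With a = 1 and b = x + 1 the rows of T take the values 9, 13, 15, 14 (as integers), and result is (1, 14), i.e. a b x^{-4} = 1 and a a x^{-4} = x^3 + x^2 + x.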
4. SYSTOLIC ARRAY DESIGN
4.1. Linear Systolic Array for Computing Both Multiplication and Squaring
In this section, we design a linear systolic array for the modified Montgomery algorithm over
GF(2^m) according to the following procedure [18]. First, the data dependence graph (DG) and
the initial positions of the variables for Algorithm 4 are extracted. Second, a space transformation
matrix (S), which maps the computation space onto a processor space, and a time vector (H),
which specifies the sequence of the operations, are chosen. Third, a systolic array network
is configured from the space-time transformation, and the initial positions of the input values are fixed.
Finally, the execution of the derived systolic array is simulated and verified.
The DG over GF(2^4) (m = 4) obtained from the recurrence equation is shown in Figure 1.
In the DG, the cells in the first row execute Steps 1-5 of Algorithm 4, while the others
execute Steps 6-13. The cells in the first column, represented by filled circles, execute
Step 10. The initial position and data flow of each variable are as follows. The bits a[i] and b[i]
(0 ≤ i ≤ m - 1) are supplied to point [i, 0]^T and flow in the [0, 1]^T direction without being updated.
The bits n[j], X[0][j] (= 0), and Y[0][j] (= 0), 0 ≤ j ≤ m, are supplied to point [0, j]^T and flow
in the [1, 0]^T direction. The bits X[i][j] and Y[i][j] (1 ≤ i ≤ m) are computed at all computation
points except the furthest right cells in each row, and flow in the [1, 0]^T direction. Each bit
of T[0][j] (= a[j], 1 ≤ j ≤ m + 1) is supplied to point [0, j]^T, and each bit of T[i][m] (= 0)
(0 ≤ i ≤ m) is supplied to point [i, m]^T. Each bit T[i][j] (1 ≤ j ≤ m) is computed at all points
represented by circles, and then flows in the [1, -1]^T direction.
Figure 1. Dependence graph for GF(2^4).
All computation points lying on a line with direction vector [1, -2]^T, shown as a dotted line in Figure 1,
are executed in parallel. The resulting bits X[m + 1][j] and Y[m + 1][j] (0 ≤ j ≤ m - 1) appear
from points [m, j]^T at time step 2(m + 1) + j. Since the m bits of X and Y are output on successive time
steps, the last bit (j = m - 1) appears at time step 2(m + 1) + (m - 1) = 3m + 1, so the total latency,
which is the minimum computation time, is 3m + 1.
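From the DG, the data dependence matrix (D1) of the modular reduction part is specified as
follows:

$$D_1 = [d_N, d_{n_0}, d_T, d_{T[0]}, d_X, d_Y] = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & -1 & 1 & 0 & 0 \end{bmatrix}. \tag{3}$$

And the data dependence matrix (D2) of the modular multiplication part is specified as follows:

$$D_2 = \begin{bmatrix} 0 & 0 & 1 & 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 & -1 & 1 & 0 & 0 \end{bmatrix}. \tag{4}$$

In Figure 1, there are four possible projection vectors, [0, 1]^T, [1, 0]^T, [1, -1]^T, and [1, 1]^T.
Because the array projected along [0, 1]^T has the best performance, we only consider this
case. In this case, the space transformation matrix (S) is [1, 0], and a schedule time vector (H) is
[2, 1]. The space-time transformation matrix (T) is then

$$T = \begin{bmatrix} H \\ S \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}. \tag{5}$$

Each computation point p = [i, j]^T is mapped to

$$Tp = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} 2i + j \\ i \end{bmatrix}. \tag{6}$$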
(a) Modular multiplication part. (b) Modular reduction part.
Figure 4. Structures of processing elements of the systolic array.
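The data dependence matrix (TD1) of the modular reduction part is

$$TD_1 = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & -1 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 2 & 1 & 1 & 2 & 2 \\ 1 & 1 & 1 & 0 & 1 & 1 \end{bmatrix}, \tag{7}$$

and the data dependence matrix (TD2) of the modular multiplication part is

$$TD_2 = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 & -1 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 2 & 2 & 1 & 1 & 2 & 2 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 & 1 \end{bmatrix}. \tag{8}$$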
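The space-time mapping can be checked numerically with a short Python sketch. The entries of D1 and D2 below are the dependence vectors as read from equations (3), (4), (7), and (8); the helper names matmul and output_time and the choice m = 4 are assumptions made for illustration.

```python
H = [2, 1]                      # schedule (time) vector
S = [1, 0]                      # space transformation (projection along [0, 1]^T)
T = [H, S]                      # space-time transformation matrix

# Dependence matrices, one column per variable (entries assumed as in (3), (4)).
D1 = [[1, 1, 1, 0, 1, 1],
      [0, 0, -1, 1, 0, 0]]
D2 = [[0, 0, 1, 1, 1, 0, 1, 1],
      [1, 1, 0, 0, -1, 1, 0, 0]]


def matmul(p: list[list[int]], q: list[list[int]]) -> list[list[int]]:
    """Plain integer matrix product p * q."""
    return [[sum(p[r][k] * q[k][c] for k in range(len(q)))
             for c in range(len(q[0]))] for r in range(len(p))]


TD1 = matmul(T, D1)             # row 0: link delays, row 1: link directions
TD2 = matmul(T, D2)


def output_time(m: int, j: int) -> int:
    """Time step at which result bit j leaves the array, as stated in the text."""
    return 2 * (m + 1) + j


m = 4
pe_indices = {S[0] * i + S[1] * j for i in range(m + 1) for j in range(m + 1)}
latency = max(output_time(m, j) for j in range(m))
```

Every entry in the first rows of TD1 and TD2 is at least 1, so each data link has a positive delay; for m = 4 the projection yields five PEs and a latency of 13 time steps, consistent with 3m + 1.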
In each transformed data dependence matrix, the first row specifies the delay time of the data flow
links, and the second row specifies their direction. For example, the designed systolic array
with m = 4 is shown in Figure 2.
In Figure 2, one delay element (denoted by "*") is placed on each vertical path along which the n_0, N,
X, and Y values flow. The numbers of processing elements (PEs) in the modular reduction part
and the modular multiplication part are 1 and 4, respectively. The PE of the modular reduction part is
drawn as a rectangle, and the PEs of the modular multiplication part are drawn as circles. The value T[0]
stays in the PE of the modular reduction part, and the values A, B, and T[0] stay in the PEs of the modular
multiplication part. This array has a problem because the values A and B are broadcast. If the values A
and B can be supplied from outside the array in parallel, it would be easy to design the semisystolic
array in hardware. So, we modify the structure of the array to get rid of this broadcast. The
values A and B can be supplied from outside the array using a control signal. If the control
signal is not zero (= 1), the value a is stored in a register in each PE and used for the PE computation;
otherwise, the values A and B are transferred to a neighboring PE. The modified systolic array
is shown in Figure 3. The structure of the PEs of the linear systolic array is shown in Figure 4. In
Figure 4, a filled rectangle is a delay element. A control signal (ctl) always has m times of 0
and 1, because the value n_0 is 1. The derived linear systolic array has five PEs, and after 13 clock
cycles, all bits of the results of both multiplication and squaring are simultaneously generated
from the left-most PE.
4.2. Performance Analysis 
By generalizing the systolic array shown in Figure 4, we see that the total number of PEs of
the proposed linear systolic array for GF(2^m) is m + 1: one PE for the reduction part and m PEs for
the multiplication part of Algorithm 4. After 3m + 1 clock cycles, all bits of both multiplication and
squaring are simultaneously output from the left-most PE. This computation cycle count is the same
as the minimum number of computation time steps of Algorithm 4. So, the proposed systolic array is said
to be time-optimal. The proposed systolic array is designed from an algorithm that utilizes the
common parts in the simultaneous computation of modular multiplication and squaring. So,
it can be used much more efficiently for the fast computation of exponentiation over GF(2^m)
than the existing multipliers of references [10] and [15].
The proposed systolic array was described in VHDL and then simulated with the ALTERA
MAX+PLUS II tool on a Pentium-II 333 to check its computation time and correctness. The
computation times of the proposed systolic array and of existing multipliers performing only modular
multiplication, for m = 16, 32, 64, are shown in Table 1. Table 1 shows that
the execution time of the proposed systolic array is only slightly longer than that of the existing systolic
multipliers.
Table 1. VHDL simulation results of systolic arrays for GF(2^m).
          Wang et al. [10]    Heo et al. [15]    Proposed Array
m = 16    1.425 µs            1.425 µs           1.455 µs
m = 32    2.865 µs            2.865 µs           2.895 µs
m = 64    5.745 µs            5.745 µs           5.775 µs
In order to compare the performance of the proposed systolic array with existing multipliers,
the following assumptions from [19] are made: T_2XOR = 4Δ, A_2XOR = 3A_U, T_2AND = 2Δ, A_2AND =
2A_U, T_2MUX = 3Δ, A_2MUX = 3A_U, T_L = 7Δ, A_L = 8A_U, where T_GATE and A_GATE are the
time and area requirements of a two-input gate, T_L and A_L are the delay and area of a one-bit
latch, and Δ and A_U are the delay and area of an inverter circuit, respectively. Table 2 shows
the area (A) and the computation time (T) of one cell (PE) of the proposed systolic array and of
existing systolic multipliers. The A, T, and area-time (AT) complexity of the proposed
systolic array for GF(2^m) are as follows:

$$
\begin{aligned}
A &= 136A_U \cdot m + 120A_U = (136m + 120)A_U, \\
T &= 23\Delta \cdot (3m + 1) = (69m + 23)\Delta, \\
AT &\approx 136mA_U \cdot 69m\Delta = 9384m^2 A_U \Delta.
\end{aligned}
\tag{9}
$$
On the other hand, the A, T, and area-time (AT) complexity of the systolic multipliers of
references [10] and [15] are as follows. Note that these systolic arrays compute only one modular
multiplication.

$$
A = 98A_U \cdot m = 98mA_U, \qquad T = 20\Delta \cdot 3m = 60m\Delta, \qquad
AT = 98mA_U \cdot 60m\Delta = 5880m^2 A_U \Delta. \tag{10}
$$

The proposed systolic array therefore has a higher AT complexity than the existing systolic multipliers,
because it computes both multiplication and squaring, whereas the existing multipliers
compute only one multiplication.
Table 2. Comparison of systolic arrays for multiplication and squaring in GF(2^m).
                  Wang et al. [10]            Heo et al. [15]             Proposed Array
Number of PEs     m                           m                           m + 1
Cell Complexity   3 2-input AND               2 2-input AND               3 2-input AND
                  2 2-input XOR               2 2-input XOR               3 2-input XOR
                  2 2-MUX                     2 2-MUX                     3 2-MUX
                  10 Latch                    10 Latch                    14 Latch
Latency           3m                          3m                          3m + 1
Area Per Cell     3A_2AND + 2A_2XOR           2A_2AND + 2A_2XOR           3A_2AND + 3A_2XOR
                  + 2A_2MUX + 10A_L = 98A_U   + 2A_2MUX + 10A_L = 96A_U   + 3A_2MUX + 14A_L = 136A_U
Delay Per Cell    T_2AND + 2T_2XOR            T_2AND + 2T_2XOR            T_2AND + T_2XOR
                  + T_2MUX + T_L = 20Δ        + T_2MUX + T_L = 20Δ        + T_2MUX + 2T_L = 23Δ
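The per-cell figures in Table 2 follow from the gate counts and the unit costs assumed from [19]. The short Python sketch below recomputes them; the dictionary layout and the names AREA, DELAY, GATES, PATH, and weighted are illustrative choices, and areas and delays are expressed in units of A_U and Δ.

```python
# Unit costs of each component, in inverter areas (A_U) and inverter delays (Δ).
AREA = {"AND": 2, "XOR": 3, "MUX": 3, "LATCH": 8}
DELAY = {"AND": 2, "XOR": 4, "MUX": 3, "LATCH": 7}

# Components per cell, as listed under "Cell Complexity".
GATES = {
    "Wang [10]": {"AND": 3, "XOR": 2, "MUX": 2, "LATCH": 10},
    "Heo [15]": {"AND": 2, "XOR": 2, "MUX": 2, "LATCH": 10},
    "Proposed": {"AND": 3, "XOR": 3, "MUX": 3, "LATCH": 14},
}

# Components on the critical path, as listed under "Delay Per Cell".
PATH = {
    "Wang [10]": {"AND": 1, "XOR": 2, "MUX": 1, "LATCH": 1},
    "Heo [15]": {"AND": 1, "XOR": 2, "MUX": 1, "LATCH": 1},
    "Proposed": {"AND": 1, "XOR": 1, "MUX": 1, "LATCH": 2},
}


def weighted(counts: dict[str, int], unit: dict[str, int]) -> int:
    """Sum of component counts weighted by their unit cost."""
    return sum(k * unit[gate] for gate, k in counts.items())


cell_area = {name: weighted(c, AREA) for name, c in GATES.items()}
cell_delay = {name: weighted(c, DELAY) for name, c in PATH.items()}
```

The results are 98, 96, and 136 area units and 20, 20, and 23 delay units, matching the "Area Per Cell" and "Delay Per Cell" rows.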
As shown in Algorithm 2, a fast exponentiator performs modular exponentiation as a series of
simultaneous modular multiplications and squarings. It can be
implemented as follows.
Case 1. m iterations of the proposed systolic array, performing both multiplication and
squaring simultaneously.
Case 2. m iterations of two multipliers of references [10] and [15] in parallel.
The AT complexity of each case is as follows:

$$\text{Case 1} = 9384m^3 A_U \Delta, \qquad \text{Case 2} = 11760m^3 A_U \Delta. \tag{11}$$

From (11), we see that the proposed systolic array is used efficiently in the computation of
modular exponentiation. For the fast computation of modular exponentiation, the proposed
systolic array gives a 1.25 times improvement in AT complexity compared to Case 2
(11760/9384 ≈ 1.25). The proposed systolic array suffers a 15% loss in time complexity
(69mΔ versus 60mΔ per iteration). As shown in the empirical simulation
results of Table 1, however, the execution time of the proposed systolic array is almost the same as that of
existing multipliers. It gives a 1.44 times improvement in area complexity (2 × 98mA_U versus 136mA_U),
since it executes only once the common parts that exist in the concurrent computation of modular
multiplication and squaring.
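The ratios quoted above can be reproduced from the leading terms of equations (9) and (10). The Python sketch below does this with exact fractions; the function names and the decision to drop the lower-order terms (120A_U and 23Δ), as equation (11) does, are assumptions of this illustration.

```python
from fractions import Fraction


def at_case1(m: int) -> int:
    """AT of m iterations of the proposed array (leading terms only)."""
    area, time = 136 * m, 69 * m
    return area * time * m


def at_case2(m: int) -> int:
    """AT of m iterations of two existing multipliers working in parallel."""
    area, time = 2 * 98 * m, 60 * m
    return area * time * m


m = 64                                               # any m gives the same ratios
at_gain = Fraction(at_case2(m), at_case1(m))         # AT improvement of Case 1
area_gain = Fraction(2 * 98, 136)                    # area improvement
time_loss = Fraction(69, 60) - 1                     # relative increase in time
```

Evaluated numerically, at_gain is about 1.25, area_gain about 1.44, and time_loss is 0.15, which are the three figures stated in the text.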
From equations (9) and (10), we see that when a single multiplier is used for the
exponentiation, the hardware complexity is lower than that of the proposed systolic
array. However, the proposed systolic array gives a 1.5 times speed-up in the average case, and a 2.0
times speed-up in the worst case, compared to using one multiplier for modular exponentiation.
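One way to see where the 1.5 and 2.0 figures come from is to count Montgomery multiplications per exponent. The Python sketch below is a simplified model, not a measurement: it assumes that a single multiplier performs the squaring and the conditional multiplication of Algorithm 2 one after the other, that the proposed array performs both in one pass, and that the 15% difference in cell delay is ignored. The function names are hypothetical.

```python
from fractions import Fraction


def sequential_passes(e: int, nbits: int) -> int:
    """Passes on one multiplier: a squaring per bit plus a product per 1 bit."""
    return nbits + bin(e).count("1")


def parallel_passes(e: int, nbits: int) -> int:
    """Passes on the proposed array: one combined pass per exponent bit."""
    return nbits


def average_speedup(nbits: int) -> Fraction:
    """Speed-up averaged over all exponents of exactly nbits bits (leading zeros allowed)."""
    exponents = range(1 << nbits)
    seq = sum(sequential_passes(e, nbits) for e in exponents)
    par = sum(parallel_passes(e, nbits) for e in exponents)
    return Fraction(seq, par)


nbits = 8
avg = average_speedup(nbits)
worst = Fraction(sequential_passes((1 << nbits) - 1, nbits),
                 parallel_passes((1 << nbits) - 1, nbits))
```

Under this model avg is exactly 3/2, because half of the exponent bits are 1 on average, and worst is 2 for the all-ones exponent.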
5. CONCLUSION 
In this paper, we have proposed a time-optimal linear systolic array that performs modular
multiplication and modular squaring simultaneously for the efficient implementation of modular
exponentiation. It makes use of the common-multiplicand concept in the Montgomery algorithm for
computing multiplication and squaring simultaneously. We derived a DG from the recurrence form of the
Montgomery algorithm, and then designed a linear systolic array from the DG by a space-time
transformation.
The number of PEs of the proposed systolic array over GF(2^m) is m + 1, and its latency is 3m + 1.
The latency of the proposed systolic array is the same as the minimum number of parallel computation steps
of the Montgomery algorithm performing both multiplication and squaring simultaneously. So,
the proposed systolic array is a time-optimal systolic array. Since the proposed systolic array
executes only once the common parts that exist in the concurrent computation of modular multiplication
and squaring, it can be used efficiently to implement a fast exponentiator. The
proposed systolic array suffers a 15% loss in time complexity, but it gives a 1.44 times improvement
in area complexity, and consequently a 1.25 times improvement in AT complexity,
compared to existing multipliers. We have thus proposed a linear systolic array that can be efficiently
utilized for modular exponentiation.
The proposed architecture has regularity, modularity, concurrency, and unidirectional data
flow, and thus is well suited to VLSI implementation. The proposed systolic array could be used
in cryptographic hardware and IC cards.
REFERENCES 
1. W. Diffie and M. Hellman, New directions in cryptography, IEEE Trans. on Info. Theory IT-22 (6), 644-654,
(1976).
2. T. ElGamal, A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Trans.
Inform. Theory IT-31, 469-472, (1985).
3. B.A. Laws and C.K. Rushforth, A cellular-array multiplier for GF(2^m), IEEE Trans. Comput. C-20,
1573-1578, (1971).
4. T.C. Bartee and D.I. Schneider, Computation with finite fields, Inform. Contr. 6, 79-98, (1963).
5. C.S. Yeh, I.S. Reed and T.K. Truong, Systolic multipliers for finite fields GF(2^m), IEEE Trans. Computer
C-33, 357-360, (December 1984).
6. P.A. Scott, S.E. Tavares and L.E. Peppard, A fast VLSI multiplier for GF(2^m), IEEE J. Selected Areas in
Comm. 4, 62-66, (January 1986).
7. S. Bandyopadhyay and A. Sengupta, Algorithms for multiplication in Galois field for implementing using
systolic arrays, Proc. Inst. Elect. Eng., pt E 35, 336-339, (November 1988).
8. I.S. Hsu, I.S. Reed, T.K. Truong, H.M. Shao and L.J. Deutsch, The VLSI implementation of a Reed-Solomon
encoder using Berlekamp's bit-serial multiplier algorithm, IEEE Trans. Comput. C-33, 906-911, (October
1984).
9. C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura and I.S. Reed, VLSI architectures for
computing multiplications and inverses in GF(2^m), IEEE Trans. Computer C-34, 709-717, (August 1985).
10. C.L. Wang and J.L. Lin, Systolic array implementation of multipliers for finite fields GF(2^m), IEEE Trans.
Circuits Systems 38, 796-800, (July 1991).
11. S.M. Yen and C.S. Laih, Common-multiplicand multiplication and its applications to public key cryptography,
Electronics Letters 29 (17), 1583-1584, (August 1993).
12. J.C. Ha, J.H. Oh, K.Y. Yoo and S.J. Moon, Circuit design of modular multiplier for fast exponentiation,
Proceeding of KIISC 7 (1), 221-231, (1997).
13. D.E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Second Edition,
Addison-Wesley, Reading, MA, (1981).
14. P.L. Montgomery, Modular multiplication without trial division, Math. of Comput. 44, 519-521,
(1985).
15. Y.J. Heo, K.J. Lee and K.Y. Yoo, Design of systolic array for modular multiplication in GF(2^k), Proceeding
of ICCCS'98, 61-66, (November 1998).
16. A.J. Menezes, Editor, Application of Finite Fields, Kluwer Academic Publishers, Boston, MA, (1993).
17. C.K. Koc and T. Acar, Montgomery multiplication in GF(2^k), In Proceedings of Third Workshop on Selected
Areas in Cryptography, Queen's University, Kingston, Ontario, Canada, pp. 95-106, (August 1996).
18. S.Y. Kung, VLSI Array Processors, Prentice-Hall, (1988).
19. K. Hwang, Computer Arithmetic: Principles, Architectures, and Design, Wiley, New York, (1979).
