# SYSTOLIC ARRAY IMPLEMENTATION OF EUCLID'S ALGORITHM FOR INVERSION AND DIVISION IN $\boldsymbol{G F}\left(\mathbf{2}^{m}\right)$ 

Jyh-Huei Guo and Chin-Liang Wang<br>Department of Electrical Engineering, National Tsing Hua University<br>Hsinchu, Taiwan 300, Republic of China<br>clwang@ee.nthu.edu.tw


#### Abstract

This paper presents a new systolic VLSI architecture for computing inverses and divisions in finite fields $G F\left(2^{m}\right)$ based on a variant of Euclid's algorithm. It is highly regular, modular, and thus well suited to VLSI implementation. It has $O\left(m^{2}\right)$ area complexity and can produce one result per clock cycle with a latency of $8 m-2$ clock cycles. As compared to existing related systolic architectures with the same throughput performance, the proposed one gains a significant improvement in area complexity.


## I. INTRODUCTION

Finite fields $G F\left(2^{m}\right)$ have found many applications in areas of communications, such as error-correcting codes [1], [2] and cryptography [3]. In these applications, computing inverses and divisions in $G F\left(2^{m}\right)$ is usually required. Since such computations are quite time-consuming, it is desirable to design high-speed circuits for them to meet real-time requirements.

There have been a number of hardware structures available for fast inversion and/or division in $G F\left(2^{m}\right)$ (see, for example, [4]-[12]). Among them, the designs in [4]-[9] employ serial-form input and/or serial-form output. Basically, such circuits with serial-form data transmission involve small hardware complexity but might have unsatisfactory throughput performance when $m$ gets large. In contrast, the designs in [10]-[12] make use of parallel-in parallel-out I/O and achieve higher throughput rates with more hardware complexity. All these parallel architectures are designed based on the concept of systolic arrays [13] and are able to provide the maximum throughput in the sense of producing new results at a rate of one per clock cycle, i.e., the time complexity is $\mathrm{O}(1)$. However, their hardware requirements seem too high for large fields; the area complexity is $\mathrm{O}\left(m \cdot 2^{m}\right)$ for the circuit in [10] and is $\mathrm{O}\left(m^{3}\right)$ for those in [11] and [12].

In this paper, a new parallel-in parallel-out systolic array with $\mathrm{O}(1)$ time complexity and $\mathrm{O}\left(m^{2}\right)$ area complexity is proposed for inversion and division in $G F\left(2^{m}\right)$. The architecture is designed based on a variant of Euclid's algorithm for computing the greatest common divisor of two polynomials. It is highly regular, modular, and thus well suited to VLSI implementation. As compared to previous systolic architectures for inversion and division with the same throughput performance, the proposed one saves a significant amount of chip area.

This work was supported by the National Science Council of the Republic of China under Grant NSC-85-2215-E-007-017.

## II. VARIANTS OF EUCLID'S ALGORITHM FOR COMPUTING INVERSIONS AND DIVISIONS IN $\boldsymbol{G F}\left(\mathbf{2}^{\boldsymbol{m}}\right)$

Let $A(\alpha)$ and $B(\alpha)$ be two elements in $G F\left(2^{m}\right)$, let $G(\alpha)$ be the primitive polynomial of degree $m$, and let $C(\alpha)=A(\alpha) / B(\alpha) \bmod G(\alpha)$. Then we have

$$
\begin{align*}
& A(\alpha)=a_{m-1} \alpha^{m-1}+\cdots+a_{1} \alpha+a_{0}  \tag{1}\\
& B(\alpha)=b_{m-1} \alpha^{m-1}+\cdots+b_{1} \alpha+b_{0}  \tag{2}\\
& G(\alpha)=\alpha^{m}+g_{m-1} \alpha^{m-1}+\ldots+g_{1} \alpha+g_{0}  \tag{3}\\
& C(\alpha)=c_{m-1} \alpha^{m-1}+\cdots+c_{1} \alpha+c_{0}  \tag{4}\\
& B(\alpha) C(\alpha)+G(\alpha) D(\alpha)=A(\alpha) \tag{5}
\end{align*}
$$

for some element $D(\alpha)$ in $G F\left(2^{m}\right)$, where each coefficient of the polynomials is in $\{0,1\}$. All arithmetic operations in $G F\left(2^{m}\right)$ are performed by taking the results mod 2, and $C(\alpha)$ is called the inverse of $B(\alpha)$ when $A(\alpha)=1$.
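As a concrete check of (5), field elements can be held as integer bitmasks in which bit $i$ stores the coefficient of $\alpha^{i}$. The following Python sketch is illustrative only: the bitmask encoding and the helper name `gf_mul` are assumptions introduced here, not part of the paper. It verifies $B(\alpha) C(\alpha) \equiv A(\alpha) \pmod{G(\alpha)}$ for the $G F\left(2^{4}\right)$ example that appears later in Table I.

```python
def gf_mul(x: int, y: int, g: int, m: int) -> int:
    """Multiply two GF(2^m) elements (bitmask encoding) modulo g."""
    acc = 0
    for _ in range(m):
        if y & 1:
            acc ^= x          # addition over GF(2) is XOR
        y >>= 1
        x <<= 1               # multiply x by alpha
        if (x >> m) & 1:
            x ^= g            # reduce with G(alpha) when degree reaches m
    return acc

M = 4
G = 0b10011   # alpha^4 + alpha + 1
A = 0b1110    # alpha^3 + alpha^2 + alpha
B = 0b1011    # alpha^3 + alpha + 1
C = 0b0011    # alpha + 1, the quotient reported in Table I

print(gf_mul(B, C, G, M) == A)
```

Because subtraction and addition coincide over $G F(2)$, the term $G(\alpha) D(\alpha)$ in (5) is absorbed by the XOR-based reduction step.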

## A. The Original Euclid's Algorithm

To perform inversion/division operations defined above, the following Euclid's algorithm [2] can be used:

$$
\begin{aligned}
& R=B(\alpha) ; S=G(\alpha) ; U=A(\alpha) ; V=0 ; \\
& \text { while } R \neq 0, \text { do } \\
& \quad Q=S \text { DIV } R ;(* \text { DIV: polynomial division*) } \\
& \quad \text { temp }=S-Q \cdot R ; S=R ; R=\text { temp; } \\
& \quad \text { temp }=V-Q \cdot U ; V=U ; U=\text { temp; } \\
& \text { end } \\
& C(\alpha)=V \bmod G(\alpha) .
\end{aligned}
$$

One disadvantage of this algorithm is that the number of iterations needed to compute $C(\alpha)$ in a given field is not fixed; it depends on the operands. This makes the algorithm difficult to realize using VLSI techniques.
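The data-dependent iteration count can be observed with a small Python sketch of the loop above. It is a sketch under stated assumptions: polynomials are bitmasks (bit $i$ is the coefficient of $\alpha^{i}$), the helper names are hypothetical, and the quotient is read from $V$ reduced mod $G$ once $R$ reaches zero, since at that point $S$ holds the gcd (which is 1) and $V \cdot B \equiv A \pmod{G}$.

```python
def poly_divmod(s: int, r: int) -> tuple[int, int]:
    """Quotient and remainder of GF(2)[x] polynomials held as bitmasks."""
    q = 0
    while r and s.bit_length() >= r.bit_length():
        shift = s.bit_length() - r.bit_length()
        q ^= 1 << shift
        s ^= r << shift
    return q, s

def poly_mul(x: int, y: int) -> int:
    """Carry-less product in GF(2)[x], without modular reduction."""
    acc = 0
    while y:
        if y & 1:
            acc ^= x
        x <<= 1
        y >>= 1
    return acc

def euclid_divide(a: int, b: int, g: int) -> tuple[int, int]:
    """Return (A/B mod G, number of loop iterations)."""
    r, s, u, v = b, g, a, 0
    steps = 0
    while r != 0:
        q, rem = poly_divmod(s, r)
        s, r = r, rem
        v, u = u, v ^ poly_mul(q, u)   # subtraction is XOR over GF(2)
        steps += 1
    # Assumption of this sketch: the result is V reduced mod G at exit.
    return poly_divmod(v, g)[1], steps

print(euclid_divide(0b1110, 0b1011, 0b10011))  # Table I operands
print(euclid_divide(0b0001, 0b0001, 0b10011))  # trivial inversion of 1
```

The two calls use the same field but take different numbers of iterations, which is the irregularity the fixed-length variants below are designed to remove.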

## B. The Modified Euclid's Algorithm in [9]

To overcome the above-mentioned problem, Brunner et al. [9] proposed a modification of Euclid's algorithm that always involves $2 m$ iterations to compute an inverse or division in $G F\left(2^{m}\right)$. The algorithm can be summarized as follows:

```
R = B(α); S = G = G(α); U = A(α); V = 0;
count = 0;
for i = 1 to 2m do
    if r_m == 0 then (*occurring m times*)
        R = α·R; U = α·U mod G;
        count = count + 1;
    else
        if s_m == 1 then
            S = S + R; V = V + U;
        end
        S = α·S;
        if count == 0 then
            R ↔ S; U ↔ V; (*exchange operations*)
            U = α·U mod G;
            count = count + 1;
        else (*occurring m times*)
            U = U/α mod G;
            count = count - 1;
        end
    end
end (*U = C(α) = A(α)/B(α) mod G(α); count = 0*)
```

where $r_{m}$ and $s_{m}$ denote the most significant coefficients of $R$ and $S$, respectively. This algorithm involves $2 m$ iterations ($i=1$ to $2 m$), and "count $=0$" always holds at the end of the last iteration [9]. In other words, the statements "count $=$ count $+1$" and "count $=$ count $-1$" each run $m$ times during the $2 m$ iterations. To realize the algorithm, a serial-in serial-out pipelined architecture was given in [9]. This circuit possesses good area-time performance, but it is not a systolic design and suffers from broadcasting problems. The reason why the algorithm is not amenable to systolic array implementation is that its arithmetic operations are not uniform during the $2 m$ iterations: it performs "$U=\alpha \cdot U \bmod G$" for some iterations and "$U=U / \alpha \bmod G$" for the others.
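The fixed-length loop of this subsection can be exercised with the following Python sketch. It assumes the bitmask encoding (bit $i$ is the coefficient of $\alpha^{i}$) and $g_{0}=1$, so that adding $G$ clears the constant term before a division by $\alpha$; the function names are hypothetical and the code is only an illustration of the listing, not the circuit of [9].

```python
def mul_alpha(x: int, g: int, m: int) -> int:
    """Compute alpha * x mod g."""
    x <<= 1
    return x ^ g if (x >> m) & 1 else x

def div_alpha(x: int, g: int) -> int:
    """Compute x / alpha mod g, assuming the constant term of g is 1."""
    return (x ^ g) >> 1 if x & 1 else x >> 1

def modified_euclid_divide(a: int, b: int, g: int, m: int) -> tuple[int, int]:
    """Return (A/B mod G, final count) after exactly 2m iterations."""
    r, s, u, v = b, g, a, 0
    count = 0
    for _ in range(2 * m):
        if not (r >> m) & 1:           # r_m == 0
            r <<= 1
            u = mul_alpha(u, g, m)
            count += 1
        else:
            if (s >> m) & 1:           # s_m == 1
                s ^= r
                v ^= u
            s <<= 1
            if count == 0:
                r, s = s, r            # exchange operations
                u, v = v, u
                u = mul_alpha(u, g, m)
                count += 1
            else:
                u = div_alpha(u, g)
                count -= 1
    return u, count

print(modified_euclid_divide(0b1110, 0b1011, 0b10011, 4))
```

Note how `u` is multiplied by $\alpha$ in some iterations and divided by $\alpha$ in others; this is the non-uniformity discussed above.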

## C. A New Variant of Euclid's Algorithm

Note that, if the statement "$U=U / \alpha \bmod G$" is removed from the algorithm given above, the final result will become $U=C(\alpha) \cdot \alpha^{m}$ instead of the desired answer $U=C(\alpha)$. To obtain the correct answer, we can execute the operation "$U=U / \alpha \bmod G$" $m$ times after the $2 m$ iterations have been completed. It can also be seen from the statements "temp $=V-Q \cdot U$; $V=U$; $U=$ temp" given in Section II.A that removing the statement "$U=U / \alpha \bmod G$" is equivalent to executing the statement "$V=\alpha \cdot V \bmod G$". Moreover, the statements "$R \leftrightarrow S$; $U \leftrightarrow V$; $U=\alpha \cdot U \bmod G$" are equivalent to "$V=\alpha \cdot V \bmod G$; $R \leftrightarrow S$; $U \leftrightarrow V$". With these observations, we can derive the following algorithm for inversion/division in $G F\left(2^{m}\right)$:

```
R = B(α); S = G = G(α); U = A(α); V = 0;
count = 0;

Part A:
for i = 1 to 2m do
    if r_m == 0 then
        R = α·R; U = α·U mod G;
        count = count + 1;
    else
        if s_m == 1 then
            S = S + R; V = V + U;
        end
        S = α·S; V = α·V mod G;
        if count == 0 then
            R ↔ S; U ↔ V;
            count = count + 1;
        else
            count = count - 1;
        end
    end
end (*U = C(α)·α^m mod G(α); count = 0*)

Part B:
for i = 2m+1 to 3m do
    U = U/α mod G;
end (*U = C(α) = A(α)/B(α) mod G(α)*)
```

The new variant of Euclid's algorithm thus consists of two parts: Part A first generates a temporary result $C(\alpha) \cdot \alpha^{m} \bmod G(\alpha)$, and Part B then divides it by $\alpha^{m}$ to yield the correct answer. Table I demonstrates a procedure of the proposed algorithm for computing inverses/divisions in $G F\left(2^{4}\right)$, where $G(\alpha)=\alpha^{4}+\alpha+1$, $A(\alpha)=\alpha^{3}+\alpha^{2}+\alpha$, and $B(\alpha)=\alpha^{3}+\alpha+1$. At step $i=2 m=8$, $U=\alpha^{2}+1$ is the temporary result $C(\alpha) \cdot \alpha^{m} \bmod G(\alpha)$, and at step $i=3 m=12$, $U=\alpha+1$ is the correct answer $C(\alpha)=A(\alpha) / B(\alpha) \bmod G(\alpha)$. As compared to the algorithm described in Section II.B, the new algorithm involves more uniform arithmetic operations during the recursive computation, and is thus easier to realize using a systolic architecture.
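The two-part procedure can be reproduced in software with the Python sketch below. It assumes the bitmask encoding (bit $i$ is the coefficient of $\alpha^{i}$) and $g_{0}=1$; the function names `part_a`, `part_b`, and `divide` are hypothetical labels chosen here. For the Table I operands it returns the intermediate value $\alpha^{2}+1$ after Part A and $\alpha+1$ after Part B.

```python
def mul_alpha(x: int, g: int, m: int) -> int:
    """Compute alpha * x mod g."""
    x <<= 1
    return x ^ g if (x >> m) & 1 else x

def div_alpha(x: int, g: int) -> int:
    """Compute x / alpha mod g, assuming the constant term of g is 1."""
    return (x ^ g) >> 1 if x & 1 else x >> 1

def part_a(a: int, b: int, g: int, m: int) -> int:
    """2m uniform iterations; returns C * alpha^m mod G."""
    r, s, u, v = b, g, a, 0
    count = 0
    for _ in range(2 * m):
        if not (r >> m) & 1:               # r_m == 0
            r <<= 1
            u = mul_alpha(u, g, m)
            count += 1
        else:
            if (s >> m) & 1:               # s_m == 1
                s ^= r
                v ^= u
            s <<= 1
            v = mul_alpha(v, g, m)         # replaces U = U/alpha of Section II.B
            if count == 0:
                r, s = s, r
                u, v = v, u
                count += 1
            else:
                count -= 1
    return u

def part_b(u: int, g: int, m: int) -> int:
    """m divisions by alpha remove the alpha^m factor left by Part A."""
    for _ in range(m):
        u = div_alpha(u, g)
    return u

def divide(a: int, b: int, g: int, m: int) -> int:
    return part_b(part_a(a, b, g, m), g, m)

G, A, B, M = 0b10011, 0b1110, 0b1011, 4
print(bin(part_a(A, B, G, M)), bin(divide(A, B, G, M)))
```

Within Part A only multiplications by $\alpha$ occur, and Part B uses only divisions by $\alpha$, so each part maps onto rows of identical cells.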

## III. SYSTOLIC IMPLEMENTATION OF THE PROPOSED ALGORITHM

Fig. 1 shows a systolic architecture to implement the proposed algorithm for computing inverses and divisions in $G F\left(2^{m}\right)$, where '$\bullet$' denotes a one-cycle delay element. It consists of a subarray of $2 m$ Type-I cells and $2 m \times m$ Type-II cells for realizing the Part-A operations and a subarray of $m \times m$ Type-III cells for realizing the Part-B operations. The $i$th row of each subarray performs the $i$th-iteration operations of the corresponding part. The functions of these three types of basic cells are illustrated in Figs. 2 to 4.

TABLE I
An Example of Computing Inverses/Divisions in GF( $\mathbf{2}^{4}$ ) Based on the Proposed Algorithm
$\left(G(\alpha)=\alpha^{4}+\alpha+1, A(\alpha)=\alpha^{3}+\alpha^{2}+\alpha, B(\alpha)=\alpha^{3}+\alpha+1\right)$

| $i$ | count | $R$ | $S$ | $U$ | $V$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  | 0 | $\alpha^{3}+\alpha+1$ | $\alpha^{4}+\alpha+1$ | $\alpha^{3}+\alpha^{2}+\alpha$ | 0 |
| 1 | 1 | $\alpha^{4}+\alpha^{2}+\alpha$ | $\alpha^{4}+\alpha+1$ | $\alpha^{3}+\alpha^{2}+\alpha+1$ | 0 |
| 2 | 0 | $\alpha^{4}+\alpha^{2}+\alpha$ | $\alpha^{3}+\alpha$ | $\alpha^{3}+\alpha^{2}+\alpha+1$ | $\alpha^{3}+\alpha^{2}+1$ |
| 3 | 1 | $\alpha^{4}+\alpha^{2}$ | $\alpha^{4}+\alpha^{2}+\alpha$ | $\alpha^{3}+1$ | $\alpha^{3}+\alpha^{2}+\alpha+1$ |
| 4 | 0 | $\alpha^{4}+\alpha^{2}$ | $\alpha^{2}$ | $\alpha^{3}+1$ | $\alpha^{3}+\alpha^{2}$ |
| 5 | 1 | $\alpha^{3}$ | $\alpha^{4}+\alpha^{2}$ | $\alpha^{3}+\alpha+1$ | $\alpha^{3}+1$ |
| 6 | 2 | $\alpha^{4}$ | $\alpha^{4}+\alpha^{2}$ | $\alpha^{2}+1$ | $\alpha^{3}+1$ |
| 7 | 1 | $\alpha^{4}$ | $\alpha^{3}$ | $\alpha^{2}+1$ | $\alpha^{3}+\alpha+1$ |
| 8 | 0 | $\alpha^{4}$ | $\alpha^{4}$ | $\alpha^{2}+1$ | $\alpha^{2}+1$ |
| 9 |  |  |  | $\alpha^{3}+\alpha+1$ |  |
| 10 |  |  |  | $\alpha^{3}+\alpha^{2}$ |  |
| 11 |  |  |  | $\alpha^{2}+\alpha$ |  |
| 12 |  |  |  | $\alpha+1$ |  |



Fig. 1. The proposed systolic architecture for computing inversions/divisions in $G F\left(2^{m}\right)$, $m=3$.


Fig. 2. The circuit of Type-I cell in Fig. 1.


Fig. 3. The circuit of Type-II cell in Fig. 1.


Fig. 4. The circuit of Type-III cell in Fig. 1.

Each Type-I cell is used to generate the following control signals:

$$
\begin{aligned}
& \text { RUmulti }=\left(r_{m}==0\right) \\
& \text { Add }=\left(r_{m}==1\right) \&\left(s_{m}==1\right) \\
& \text { Exchange }=\left(r_{m}==1\right) \&(\text { count }==0) \\
& \text { count' }=\text { count }-1, \text { if }(\text { count } \neq 0) \&\left(r_{m}==1\right) \\
& \text { count' }=\text { count }+1, \text { otherwise }
\end{aligned}
$$

When RUmulti $=1$, the corresponding row of Type-II cells executes the operations given in (I); otherwise, it executes the operations of (III). The Add and Exchange signals determine whether the operations of (II) and (IV) are performed or skipped. The Part-A subarray generates the temporary result $C(\alpha) \cdot \alpha^{m} \bmod G$ at its bottom row, and then sends it to the Part-B subarray for further processing. With little effort, one can check that the inversion/division results emerge from the bottom of the Part-B subarray at a rate of one per clock cycle. It can also be seen that the proposed systolic architecture has an area complexity of $\mathrm{O}\left(m^{2}\right)$ and a latency of $8 m-2$ clock cycles.
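The control equations of the Type-I cell can be modeled behaviorally as below. This Python sketch is an assumption-laden illustration: the function name `type1_cell` and the tuple return value are hypothetical, and no timing or latching is modeled. The three calls correspond to the transitions into rows $i=1$, $2$, and $3$ of Table I.

```python
def type1_cell(r_m: int, s_m: int, count: int) -> tuple[int, int, int, int]:
    """Return (RUmulti, Add, Exchange, count') for one Part-A iteration."""
    ru_multi = int(r_m == 0)
    add = int(r_m == 1 and s_m == 1)
    exchange = int(r_m == 1 and count == 0)
    if count != 0 and r_m == 1:
        next_count = count - 1
    else:
        next_count = count + 1
    return ru_multi, add, exchange, next_count

# Inputs taken from the state before each of the first three rows of Table I.
print(type1_cell(0, 1, 0))   # R = alpha^3+alpha+1, S = G, count = 0
print(type1_cell(1, 1, 1))   # R = alpha^4+alpha^2+alpha, S = G, count = 1
print(type1_cell(1, 0, 0))   # S = alpha^3+alpha, count = 0
```

Exactly one of RUmulti and the pair (Add, Exchange) can be active in a given row, which mirrors the if/else structure of Part A.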

## IV. CONCLUSIONS

Table II gives a comparison of the proposed parallel-in parallel-out systolic array for inversion and division in $G F\left(2^{m}\right)$ with those in [11] and [12]. We can see from this table that all the architectures compared reach the same throughput rate of one result per clock cycle, but the proposed one has a much smaller area requirement, a much shorter latency, and a much better area-time product.

## REFERENCES

[1] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes. Cambridge, MA: MIT Press, 1972.
[2] E. R. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968.
[3] D. E. R. Denning, Cryptography and Data Security. Reading, MA: Addison-Wesley, 1983.
[4] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch, J. K. Omura, and I. S. Reed, "VLSI architectures for computing multiplications and inverses in $\mathrm{GF}\left(2^{m}\right)$," IEEE Trans. Comput., vol. C-34, pp. 709-719, Aug. 1985.
[5] G.-L. Feng, "A VLSI architecture for fast inversion in $\mathrm{GF}\left(2^{m}\right)$," IEEE Trans. Comput., vol. 38, pp. 1383-1386, Oct. 1989.
[6] C.-L. Wang and J.-L. Lin, "A systolic architecture for computing inverses and divisions in finite fields $\mathrm{GF}\left(2^{m}\right)$," IEEE Trans. Comput., vol. 42, pp. 1141-1146, Sep. 1993.
[7] M. A. Hasan and V. K. Bhargava, "Bit-level systolic divider and multiplier for finite fields $\mathrm{GF}\left(2^{m}\right)$," IEEE Trans. Comput., vol. 41, pp. 972-980, Aug. 1992.
[8] K. Araki, I. Fujita, and M. Morisue, "Fast inverters over finite field based on Euclid's algorithm," Trans. IEICE, vol. E-72, pp. 1230-1234, Nov. 1989.
[9] H. Brunner, A. Curiger, and M. Hofstetter, "On computing multiplicative inverses in $\mathrm{GF}\left(2^{m}\right)$," IEEE Trans. Comput., vol. 42, pp. 1010-1015, Aug. 1993.
[10] M. Kovac, N. Ranganathan, and M. Varanasi, "SIGMA: A VLSI systolic array implementation of a Galois field $\mathrm{GF}\left(2^{m}\right)$ based multiplication and division algorithm," IEEE Trans. VLSI Systems, vol. 1, pp. 22-30, Mar. 1993.
[11] S.-W. Wei, "VLSI architectures for computing exponentiations, multiplicative inverses, and divisions in $\mathrm{GF}\left(2^{m}\right)$," in Proc. 1995 IEEE Int. Symp. Circuits Syst., London, May 1995, pp. 4.203-4.206.
[12] C.-L. Wang and J.-H. Guo, "New systolic arrays for $\mathrm{C}+\mathrm{AB}^{2}$, inversion, and division in $\mathrm{GF}\left(2^{m}\right)$," in Proc. 1995 European Conference Circuit Theory Design, Istanbul, Turkey, Aug. 1995, pp. 431-434.
[13] H. T. Kung, "Why systolic architectures?," IEEE Computer, vol. 15, pp. 37-46, Jan. 1982.

TABLE II
Comparison of Some Parallel-In Parallel-Out Systolic Arrays for Computing Inversions/Divisions in GF( $\mathbf{2}^{m}$ )

|  | Wei [11] | Wang & Guo [12] | Proposed |
| :---: | :---: | :---: | :---: |
| Number of Cells | $m^{2}(m-1)$ | $m^{2}(m-1) / 2$ | Type I: $2 m$ <br> Type II: $2 m^{2}$ <br> Type III: $m^{2}$ |
| Throughput (1/cycle) | 1 | 1 | 1 |
| Latency (cycles) | $3 m^{2}-2 m$ | $2 m^{2}-3 m / 2$ | $8 m-2$ |
| Maximum Cell Delay | $\mathrm{T}_{\mathrm{AND} 2}+\mathrm{T}_{\mathrm{XOR} 3}$ | $\mathrm{T}_{\mathrm{AND} 2}+\mathrm{T}_{\mathrm{XOR} 4}$ | $\mathrm{T}_{\mathrm{AND} 2}+\mathrm{T}_{\mathrm{XOR} 3}+2 \mathrm{~T}_{\mathrm{MUX} 2}$ |
| Cell Complexity | $3\ \mathrm{AND}_{2}$'s <br> $1\ \mathrm{XOR}_{2}$ <br> $1\ \mathrm{XOR}_{3}$ <br> 13 latches | $6\ \mathrm{AND}_{2}$'s <br> $2\ \mathrm{XOR}_{4}$'s <br> 17 latches | Type I: $5\ \mathrm{AND}_{2}$'s, $2\ \mathrm{XOR}_{2}$'s, $5\ \mathrm{MUX}_{2}$'s, 1 INV, $\log _{2}(m+1)$-bit adder, zero-check circuit, $9+2 \log _{2}(m+1)$ latches <br> Type II: $4\ \mathrm{AND}_{2}$'s, $2\ \mathrm{XOR}_{2}$'s, $1\ \mathrm{XOR}_{3}$, $8\ \mathrm{MUX}_{2}$'s, 18 latches <br> Type III: $1\ \mathrm{AND}_{2}$, $1\ \mathrm{XOR}_{2}$, 4 latches |
| AT-product | $\mathrm{O}\left(m^{3}\right)$ | $\mathrm{O}\left(m^{3}\right)$ | $\mathrm{O}\left(m^{2}\right)$ |

$\mathrm{AND}_{i}$: $i$-input AND gate; $\mathrm{XOR}_{i}$: $i$-input XOR gate.<br>
INV: inverter; $\mathrm{MUX}_{i}$: $i$-input multiplexer.<br>
$\mathrm{T}_{\mathrm{AND} i}$: the propagation delay through an $i$-input AND gate.<br>
$\mathrm{T}_{\mathrm{XOR} i}$: the propagation delay through an $i$-input XOR gate.<br>
$\mathrm{T}_{\mathrm{MUX} i}$: the propagation delay through an $i$-input multiplexer.
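To get a feel for the closed-form entries of Table II, the cell counts and latencies can be evaluated for a concrete field size. The Python sketch below is illustrative only: the function name `table2_row` and the dictionary keys are hypothetical, and the choice $m=8$ is an arbitrary example. Cell counts are not directly comparable as area, because the three designs use cells of different complexity.

```python
def table2_row(m: int) -> dict:
    """Evaluate the Table II cell-count and latency formulas for degree m."""
    return {
        "wei_cells": m * m * (m - 1),
        "wang_guo_cells": m * m * (m - 1) // 2,
        "proposed_cells": 2 * m + 2 * m * m + m * m,   # Type I + II + III
        "wei_latency": 3 * m * m - 2 * m,
        # 2m^2 - 3m/2, written so that it is exact for even m
        "wang_guo_latency": (4 * m * m - 3 * m) // 2,
        "proposed_latency": 8 * m - 2,
    }

for name, value in table2_row(8).items():
    print(f"{name}: {value}")
```

The proposed array's latency grows linearly in $m$ while the other two grow quadratically, so the gap widens for larger fields.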

