Abstract - Finite field arithmetic logic is central to the implementation of Reed-Solomon coders and of some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that can be easily realized on VLSI chips. Massey and Omura [1] recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. In this paper, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2^m). With the simple squaring property of the normal basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2^m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable, and therefore naturally suitable for VLSI implementation.
I. INTRODUCTION
The finite field GF(2^m) is a number system containing 2^m elements. Its attractiveness in practical applications stems from the fact that each element can be represented by m binary digits. The practical application of error-correcting codes makes considerable use of computation in GF(2^m). Both the encoding and decoding devices for the important Reed-Solomon codes must perform computations in GF(2^m) [2], [3]. The decoding device for the binary BCH codes also must perform computation in GF(2^m) [2], [3]. In addition, recent advances in secret communication, such as encryption and decryption of digital messages, also require the use of computation in GF(2^m) [4]. Hence, there is a need for good algorithms for doing multiplication and inversion in finite fields.
Yeh, Reed, and Truong [5] presented a design for performing multiplication in GF(2^m) that is suitable for VLSI implementation. In their design, the elements in the field are represented by a conventional basis {1, α, α^2, α^3, ..., α^{m-1}}, where α is a root of an irreducible polynomial of degree m over GF(2). Some other previous works on the multiplier in GF(2^m) by Bartee and Schneider [6], Gallager [7], and Laws and Rushforth [8] are also based on the conventional basis of GF(2^m). However, these circuits are not suited for use in VLSI systems, due to irregular wire routing and complicated control problems as well as a nonmodular structure or lack of concurrency [9].
Manuscript received July 3, 1984.
Recently, Massey and Omura [1] invented a multiplier that obtains the product of two elements in the finite field GF(2^m). In their invention, they utilize a normal basis of the form {α, α^2, α^4, ..., α^{2^{m-1}}} to represent elements of the field. In this basis, again, each element in the field GF(2^m) can be represented by m binary digits.
In the normal-basis representation, the squaring of an element in GF(2^m) is readily shown to be a simple cyclic shift of its binary digits. Multiplication in the normal-basis representation requires for any one product digit the same logic circuitry as it does for any other product digit. Adjacent product-digit circuits differ only in their inputs, which are cyclically shifted versions of one another. In this paper, a pipeline architecture suitable for VLSI design is developed for a Massey-Omura multiplier on GF(2^m). In comparison to the multiplier designed in [5] [...]
[...] for which the roots {α, α^2, α^4, ..., α^{2^{m-1}}} are linearly independent. These linearly independent roots clearly form a normal basis of GF(2^m).
Suppose that {α, α^2, α^4, ..., α^{2^{m-1}}} is a normal basis of GF(2^m). By (2) and (3), the square of (1) [...] Fig. 1.
By (2) and (3) it is readily seen that 1 = α + α^2 + α^4 + ... + α^{2^{m-1}} for any element α that generates a normal basis of GF(2^m). This implies that the normal basis representation of 1 is (1, 1, 1, ..., 1).
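The two properties just stated, squaring as a cyclic shift and the all-ones representation of 1, can be checked numerically. The following is a minimal sketch, not the authors' circuit. It assumes the field GF(2^4) generated by x^4 + x^3 + 1 with α = x, a choice made here only for illustration, and the helper names (`gf_mul`, `to_poly`, `FROM_POLY`, `square`) are hypothetical.

```python
from itertools import product

M = 4            # field degree (assumed for illustration)
POLY = 0b11001   # x^4 + x^3 + 1, an assumed irreducible polynomial


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) held as polynomial-basis bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:      # degree reached M: reduce modulo POLY
            a ^= POLY
    return result


# Normal basis {alpha, alpha^2, alpha^4, alpha^8} with alpha = x (assumed).
BASIS = [0b0010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def to_poly(digits):
    """Map normal-basis digits (b0, ..., b_{M-1}) to a polynomial-basis mask."""
    value = 0
    for digit, element in zip(digits, BASIS):
        if digit:
            value ^= element
    return value


# Inverse map, built by enumeration. It has 2^M entries only if the
# conjugates of alpha are linearly independent, i.e. form a basis.
FROM_POLY = {to_poly(d): d for d in product((0, 1), repeat=M)}
assert len(FROM_POLY) == 2 ** M


def square(digits):
    """Squaring in the normal basis: one cyclic shift of the digits."""
    return digits[-1:] + digits[:-1]


# Check the shift property against ordinary field squaring for every element.
for digits in product((0, 1), repeat=M):
    value = to_poly(digits)
    assert FROM_POLY[gf_mul(value, value)] == square(digits)

# The field element 1 has the all-ones normal-basis representation.
assert FROM_POLY[1] == (1,) * M
```

The enumeration-based inverse map is only practical for small m. It is used here because it keeps the check independent of any particular basis-conversion method.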
Let β = [b_0, b_1, ..., b_{m-1}] and γ = [c_0, c_1, ..., c_{m-1}] be two elements of GF(2^m) in a normal basis representation. Then the last term d_{m-1} of the product,
    δ = β γ = [d_0, d_1, ..., d_{m-1}],    (5)
is some binary function of the components of β and γ, i.e.,
    d_{m-1} = f(b_0, b_1, ..., b_{m-1}; c_0, c_1, ..., c_{m-1}).    (6)
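Since squaring means a cyclic shift of an element in a normal basis representation, one has
    β^2 = [b_{m-1}, b_0, ..., b_{m-2}],  γ^2 = [c_{m-1}, c_0, ..., c_{m-2}],  δ^2 = β^2 γ^2 = [d_{m-1}, d_0, ..., d_{m-2}].    (7)
Hence, the last component d_{m-2} of δ^2 is obtained by the same function f in (6) operating on the components of β^2 and γ^2.
That is, d_{m-2} = f(b_{m-1}, b_0, ..., b_{m-2}; c_{m-1}, c_0, ..., c_{m-2}). Squaring repeatedly in this manner gives
    d_{m-1} = f(b_0, b_1, ..., b_{m-1}; c_0, c_1, ..., c_{m-1})
    d_{m-2} = f(b_{m-1}, b_0, ..., b_{m-2}; c_{m-1}, c_0, ..., c_{m-2})
    ...
    d_0 = f(b_1, b_2, ..., b_{m-1}, b_0; c_1, c_2, ..., c_{m-1}, c_0).    (8)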
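The idea of reusing a single function f on cyclically shifted inputs can be sketched in software as follows. This is an illustrative model only, under the assumption that the field is GF(2^4) generated by x^4 + x^3 + 1 with α = x. The names `LAMBDA`, `f`, `rotate`, and `mo_multiply` are hypothetical, and f is derived here from the multiplication table of the basis elements rather than taken from the paper's figures.

```python
from itertools import product

M = 4            # field degree (assumed for illustration)
POLY = 0b11001   # x^4 + x^3 + 1, an assumed irreducible polynomial


def gf_mul(a, b):
    """Reference multiplier on polynomial-basis bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return result


# Normal basis {alpha^(2^i)} with alpha = x (assumed).
BASIS = [0b0010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def to_poly(digits):
    value = 0
    for digit, element in zip(digits, BASIS):
        if digit:
            value ^= element
    return value


FROM_POLY = {to_poly(d): d for d in product((0, 1), repeat=M)}

# LAMBDA[i][j] is the last normal-basis digit of alpha^(2^i) * alpha^(2^j).
# The product function f is the mod 2 sum of b_i * c_j over all (i, j)
# for which LAMBDA[i][j] is 1.
LAMBDA = [[FROM_POLY[gf_mul(BASIS[i], BASIS[j])][M - 1] for j in range(M)]
          for i in range(M)]


def f(b, c):
    """Last product digit d_{M-1} as a mod 2 sum of AND terms."""
    acc = 0
    for i in range(M):
        for j in range(M):
            acc ^= LAMBDA[i][j] & b[i] & c[j]
    return acc


def rotate(digits, k):
    """Cyclic right shift by k positions, i.e. squaring k times."""
    k %= len(digits)
    return digits[-k:] + digits[:-k] if k else digits


def mo_multiply(b, c):
    """Apply the same f to shifted inputs to obtain d_{M-1}, ..., d_0."""
    d = [0] * M
    for k in range(M):
        d[M - 1 - k] = f(rotate(b, k), rotate(c, k))
    return tuple(d)


# Compare against the reference multiplier for every pair of elements.
for b in product((0, 1), repeat=M):
    for c in product((0, 1), repeat=M):
        assert mo_multiply(b, c) == FROM_POLY[gf_mul(to_poly(b), to_poly(c))]
```

The loop in `mo_multiply` corresponds to the sequential mode of operation. Evaluating the m calls to `f` independently corresponds to the parallel mode, in which m identical copies of f see cyclically shifted inputs.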
The equations in (8) define the Massey-Omura multiplier. In the normal basis representation this multiplier has the property that the same logic function f that is used to find the last component d_{m-1} of the product δ can be used to find sequentially the remaining components d_{m-2}, d_{m-3}, ..., d_0 of the product. This feature of the product operation requires only one logic function f of the 2m components of β and γ to sequentially compute the m components of the product. Fig. 2 illustrates the logic diagram of the above-described sequential-type Massey-Omura multiplier on GF(2^m). Alternatively, for parallel operation this feature permits the use of m identical logic functions f for calculating simultaneously all components of the product. In the latter case, the inputs to the m logic functions f are connected directly to the components of β and γ. The connections from the components of β and γ to the m functions f differ only in that they are cyclically shifted versions of one another. Fig. 3 [...]
By (9) the product of β and γ is [...] + b_2 c_0 + b_0 c_1 + b_1 c_0 + b_2 c_1 + b_1 c_2. (11)
Comparing (11) [...]
A pipeline structure of a Massey-Omura multiplier for GF(2^4) is shown in Fig. 5. This structure has a sequential type of operation. For each of the two inputs to the f function, corresponding to β and γ, an inverter, two sets of shift registers B and R, and eleven pass transistors are utilized. Note that registers B and R have an identical circuit structure.
In Fig. 5 [...] (k = 1, 2, 3, 4). Then the R-registers are cyclically shifted. Such a cyclic-shift operation is needed to sequentially yield the product components d_3, d_2, d_1, and d_0 of δ. While the R-registers are cyclically shifting the components of β (or γ), the components of another element in GF(2^4) following β (or γ) can be fed into the buffer B-registers. Therefore, the structure in Fig. 5 provides a pipeline operation in which no time is lost except for an initial fixed time delay. The VLSI layout of a Massey-Omura multiplier for GF(2^4) is shown in Fig. 6. The technology used in this layout is 4 µm NMOS. It has eight pins and takes a chip area of 1248 µm × 996 µm.
Fig. 7 illustrates a system structure of a pipelined Massey-Omura multiplier for GF(2^m). For this general case over GF(2^m), the buffer and the cyclic shift mechanism in Fig. 7 have m - 1 and m stages, respectively. Each stage consists of a shift register and a gate transistor. The product function f is a mod 2 sum of AND products of the components of the two inputs being multiplied. Such a circuit for function f consists of an AND programmed logic array (PLA) [4] followed by an XOR sequential PLA. In the XOR sequential PLA there are several levels of XORs. At each level, the inputs, pair [...]
It should be noted that as m gets large, the number of mod 2 sums in the function f becomes large. In this case, more XORs, and as a consequence more levels in the XOR sequential PLA, are required. To maximize the pipeline operation speed, shift registers are required between the XOR levels in order to store the XOR outputs of the intermediate levels.
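The growth in the number of XOR levels with the number of AND terms can be illustrated with a short sketch. It assumes that each level combines its inputs two at a time, which is one plausible reading of the description above and not a confirmed detail of the authors' PLA. The function name `xor_tree_levels` is hypothetical.

```python
def xor_tree_levels(terms):
    """Reduce AND-term outputs to one bit, level by level.

    Returns every intermediate level, so levels[k] is what would be held
    in the registers placed after the k-th XOR level of a pipelined tree.
    """
    levels = [list(terms)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # XOR adjacent pairs; an unpaired last input passes through unchanged.
        nxt = [prev[i] ^ prev[i + 1] for i in range(0, len(prev) - 1, 2)]
        if len(prev) % 2:
            nxt.append(prev[-1])
        levels.append(nxt)
    return levels


# Example with seven AND-term outputs (values chosen arbitrarily).
levels = xor_tree_levels([1, 0, 1, 1, 0, 1, 1])
xor_level_count = len(levels) - 1   # number of XOR levels needed
mod2_sum = levels[-1][0]            # final mod 2 sum of all terms
```

Under this pairing assumption the number of levels grows as the base-2 logarithm of the number of AND terms, rounded up, which is why larger m calls for more intermediate registers.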
Another approach to the realization of the product function f is to use a standard AND-OR PLA [10]. This is possible since x + y = x·ȳ ∨ x̄·y, where "+" is the mod 2 sum, "∨" denotes inclusive OR, and the overbar denotes the complement. In general, although the design of f by the use of such a PLA is tedious, the product function f can be accomplished in less than one clock cycle. One tradeoff for such a design is the large chip area required. The required area for such a PLA increases [...]
Fig. 8 shows a flowchart diagram of this procedure. This recursive algorithm for computing an inverse element in GF(2^4) can be realized using the circuit shown in Fig. 9. In this circuit the parallel-type Massey-Omura multiplier shown in Fig. 3, with the circuit for the product function f shown in Fig. 4, is utilized.
To illustrate, let Ld1 and Ld2 be two control signals with a period of four clock signals as shown in Fig. 9. Also let the normal basis representation of α be (a_0, a_1, a_2, a_3). At the end of the third clock pulse, the values a_1, a_2, a_3 are stored in the input buffer flip-flops B_1, B_2, B_3, respectively. During the fourth clock cycle, a_3, a_0, a_1, and a_2 are simultaneously [...] the inversion. When the third multiplication is completed, Ld2 = 1. Thus, the output product digits, which together represent the inverse element α^{-1}, are fed into the output buffer flip-flops B_k. Finally, these are sequentially shifted from the inversion circuit.
The above technique for computing the inverse of an element in GF(2^4) takes four clock cycles. During these four clock cycles, the circuit in Fig. 9 allows the bits of the next element (following α) to be fed into it and the bits of the previous element to be shifted out of it, simultaneously. This type of circuit provides a full pipeline capability. A VLSI layout of the pipeline inversion circuitry for GF(2^4) is presented in Fig. 10. The technology used in this layout is 4 µm NMOS. It has ten pins and takes a chip area of 2220 µm × 1440 µm. Fig. 11 shows the system structure of an inversion circuit for the general finite field GF(2^m).
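The inversion can be modeled in software from the identity α^{-1} = α^{2^m - 2} = α^2 · α^4 · ... · α^{2^{m-1}}, which holds for every nonzero element of GF(2^m). The sketch below is an illustration under stated assumptions, not a transcription of the flowchart in Fig. 8: the recursion shown (start from the square of the input, then alternately multiply and square) is assumed because it needs exactly three multiplications in GF(2^4), matching the count mentioned above. The field polynomial x^4 + x^3 + 1, the choice α = x, and all function names are likewise assumptions. The `multiply` helper is a stand-in that goes through a polynomial-basis product for brevity.

```python
from itertools import product

M = 4            # field degree (assumed for illustration)
POLY = 0b11001   # x^4 + x^3 + 1, an assumed irreducible polynomial


def gf_mul(a, b):
    """Reference multiplier on polynomial-basis bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return result


# Normal basis {alpha^(2^i)} with alpha = x (assumed).
BASIS = [0b0010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def to_poly(digits):
    value = 0
    for digit, element in zip(digits, BASIS):
        if digit:
            value ^= element
    return value


FROM_POLY = {to_poly(d): d for d in product((0, 1), repeat=M)}
ONE = (1,) * M   # normal-basis representation of the field element 1


def multiply(b, c):
    """Stand-in for the Massey-Omura multiplier (normal-basis digits in and out)."""
    return FROM_POLY[gf_mul(to_poly(b), to_poly(c))]


def square(digits):
    """Squaring in the normal basis is one cyclic shift."""
    return digits[-1:] + digits[:-1]


def inverse(a):
    """Return a^(2^M - 2) using M - 1 multiplications and cyclic shifts."""
    if not any(a):
        raise ZeroDivisionError("zero has no inverse in GF(2^m)")
    b = square(a)            # b = a^2
    c = ONE                  # running product
    for _ in range(M - 1):   # three multiplications when M = 4
        c = multiply(b, c)   # c accumulates a^2 * a^4 * ... 
        b = square(b)        # next power: a^4, a^8, ...
    return c


# Every nonzero element times its computed inverse must give 1.
for a in product((0, 1), repeat=M):
    if any(a):
        assert multiply(a, inverse(a)) == ONE
```

Because each squaring is only a rewiring of the digits, the cost of this procedure is dominated by the m - 1 multiplications.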
V. CONCLUSION
In this paper, we have illustrated the VLSI design of a new multiplication algorithm in GF(2^m). With this new multiplier, the computation of inverse elements in GF(2^m) can also be easily accomplished. Both multiplication and inversion circuits are being fabricated. The expected speed for both circuits is around 10 MHz. In comparison to the multiplier designed in [5], the Massey-Omura multiplier is much simpler; it requires minimum controls and interconnections. For the multiplier in GF(2^m), the design in [5] requires 10m registers, 2m AND gates, and 2m XOR gates, while the design in this paper requires only 2m registers. The number of gates required in the Massey-Omura multiplier is highly dependent on the irreducible polynomial used to generate the field. However, for some particular fields, the irreducible polynomial can be chosen such that the multiplier needs only 2m - 1 AND gates and 2m - 2 XOR gates [11], [12]. Therefore, the chip areas required for the Massey-Omura multipliers of these fields are smaller than those of the design in [5].
