Abstract-In this paper an algorithm for GF(2^m) multiplication/division is presented and a new, more generalized definition of duality is proposed. From these the bit-serial Berlekamp multiplier is derived and shown to be a specific case of a more general class of multipliers. Furthermore, it is shown that hardware efficient, bit-parallel dual basis multipliers can also be designed. These multipliers have a regular structure, are easily extended to different GF(2^m) and are hence suitable for VLSI implementations. As in the bit-serial case, these bit-parallel multipliers can also be hardwired to carry out constant multiplication. These constant multipliers have reduced hardware requirements and are also simple to design. In addition, the multiplication/division algorithm also allows a bit-serial systolic finite field divider to be designed. This divider is modular, independent of the defining irreducible polynomial for the field, easily expanded to different GF(2^m) and its longest delay path is independent of m.
INTRODUCTION
Finite field arithmetic is fundamental to the implementation of Reed-Solomon (RS) codes [1] and certain cryptographic systems [2]. The most important finite field arithmetic operation is multiplication, and a considerable amount of work has been carried out defining finite field multipliers suitable for implementation in VLSI [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Of these, the most suitable for use in RS codecs appear to be the Berlekamp [3] and Massey-Omura [9] bit-serial multipliers, and the Mastrovito bit-parallel multiplier [6], [7].
The bit-parallel Mastrovito multiplier for GF(2^m) operates over the polynomial basis and comprises m identical inner product modules and one further module, the design of which is dependent on the defining irreducible polynomial for the field [6]. The bit-parallel Massey-Omura multiplier had previously been regarded as the most suitable bit-parallel multiplier for implementation in VLSI because it comprises m identical modules. However, it has been shown that the Mastrovito multiplier requires only around half the gates the Massey-Omura multiplier needs to implement the necessary combinational logic [6]. Given the multiplier's high regularity and low hardware requirements it is therefore considered highly suited to implementation in RS codecs. The bit-serial Massey-Omura multiplier for GF(2^m) operates over the normal basis and consists of 2m register elements and at least (2m - 1) AND gates and (2m - 2) XOR gates [13]. The bit-serial Berlekamp multiplier, however, operates over both the polynomial basis and the dual basis and has lower hardware requirements than the Massey-Omura multiplier [8]. That is, a bit-serial Berlekamp multiplier for GF(2^m) requires 2m register elements, m AND gates, and (m + H(pp) - 3) XOR gates where H(pp) denotes the Hamming weight of the defining irreducible polynomial for the field. The bit-serial Berlekamp multiplier also has the advantage that it can be hardwired to carry out constant multiplication, further reducing its hardware requirements and allowing one such multiplier to yield many products [15].
The authors are with the Department of Electrical and Electronic Engineering, the University of Huddersfield, Queensgate, Huddersfield, HD1 3DH, U.K. E-mail: slfenn@eng.hud.ac.uk. Manuscript revised Feb. 5, 1995. For information on obtaining reprints of this article, please send e-mail to: transactions@computer.org, and reference IEEECS Log Number C96007.
Crucial to the operation of the Berlekamp multiplier are the trace function and the concept of duality. The trace function is a linear function from GF(p^m) to GF(p), where the trace of β ∈ GF(p^m), tr(β), is defined to be $\operatorname{tr}(\beta) = \sum_{i=0}^{m-1} \beta^{p^i}$. Two bases are then said to be dual to one another if a set of conditions is satisfied by the trace values of the basis elements [3].
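As a concrete check of this definition, the following minimal Python sketch evaluates the trace over GF(2^4). It assumes the field is defined by p(x) = x^4 + x + 1 (the polynomial used in the worked example later in the paper) and stores an element as an integer whose bit i is the coefficient of α^i. The names `gf_mul` and `trace` are illustrative and do not come from the paper.

```python
M = 4            # assumption: GF(2^4) chosen for illustration
POLY = 0b10011   # assumed defining polynomial p(x) = x^4 + x + 1

def gf_mul(x, y):
    """Multiply two GF(2^M) elements held as polynomial-basis bit masks."""
    r = 0
    for i in range(M):
        if (y >> i) & 1:
            r ^= x << i
    for i in range(2 * M - 2, M - 1, -1):   # reduce modulo p(x)
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def trace(beta):
    """tr(beta) = beta + beta^2 + beta^4 + ... + beta^(2^(M-1))."""
    total, power = 0, beta
    for _ in range(M):
        total ^= power
        power = gf_mul(power, power)
    return total

traces = [trace(z) for z in range(2 ** M)]
```

Every entry of `traces` is 0 or 1, and `trace(x ^ y) == trace(x) ^ trace(y)` holds for all pairs, which is the linearity that the multiplier derivations rely on.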
In this paper we extend the idea of the trace function to encompass any general linear function. Using a linear function and a new definition of duality, a finite field multiplication/division algorithm is presented and from this the bit-serial Berlekamp multiplier is derived. This algorithm also allows us to derive a bit-parallel dual basis multiplier which has a number of appealing characteristics. For example, the bit-parallel multiplier is regular, easily expandable to different GF(2^m) and hardware efficient. It can also be easily hardwired to perform constant multiplication.
This same algorithm also permits a bit-serial systolic divider to be designed. Finite field division is required in a number of RS decoding techniques such as time domain decoding [16] and the standard Berlekamp-Massey algorithm for solving the key equation [17]. The presented divider is modular, easily expanded to different GF(2^m) and the longest delay path is independent of m. These characteristics are in contrast to those of the inverters presented in [18], [19], [20]. Hence this divider is highly suited to VLSI implementation.
By selecting suitable dual bases, the problem of these operators functioning over two different bases can be largely circumvented. Dual bases can be chosen which require little or no extra hardware to convert them to or from the polynomial basis. These operators can, therefore, be utilized throughout RS encoding and decoding circuits and since dual basis multipliers have particularly low hardware requirements, this results in the overall RS codec being particularly hardware efficient. This, in combination with the fact that there also exist efficient dual basis inverters/dividers, means that the dual basis is an appropriate basis for RS codecs to operate over.
MATHEMATICAL BACKGROUND
It is assumed throughout this paper that the reader is familiar with the basic theory of finite fields; for more details see, for example, [21].
Let F_{p^m} denote the set of all linear functions f : GF(p^m) → GF(p).
The trace function is a well known example of such a linear function and has previously been exploited to produce finite field multipliers and to give results relating to the solution of quadratic equations over a finite field [3], [17]. However, there are a number of other available linear functions and it is frequently more convenient to employ one of these rather than the trace function.
THEOREM 1. Let {λ_i} be a basis for GF(p^m) ... of the form ... In [10] only the case where f is the trace function was considered; however, these results also hold for any general f [22]. These results are reviewed and extended in Appendix A to consider the general case of f being any suitable linear function. In addition, the optimal dual bases for GF(2^m) (m = 2, 3, ..., 10) are also presented.
GENERALIZED EXPRESSION FOR FINITE FIELD MULTIPLICATION AND DIVISION
A general result, considered only in the context of division, was first presented in [23]. From this expression we are able to derive the Berlekamp bit-serial multiplier and also to present a class of bit-parallel dual basis multipliers. Furthermore, this theorem also forms the kernel of the bit-serial systolic divider described in Section 6.
THEOREM 4 [22], [23]. ...
Then the following relation holds. ... [10] it is also possible to use these multipliers throughout RS decoding circuits.
We now derive the bit-serial Berlekamp multiplier. If Theorem 4 is restricted to operation over GF(2^m) and we let ... This immediately gives rise to the bit-serial multiplication scheme originally proposed by Berlekamp. Consider ... With this multiplication scheme, both b and a are represented in the dual basis and c is represented in the polynomial basis. However, it is frequently required to enter both b and c in the dual basis, and so a dual to polynomial basis converter must be incorporated within the multiplier circuitry. As illustrated in Appendix A, the hardware required to carry out this transformation often proves trivial. For example, with GF(2^m) multipliers for m = 2, 3, 4, 5, 6, 7, 9, 10, for which irreducible trinomials exist, no extra hardware is required to carry out this basis conversion, only a reordering of basis coefficients. With GF(2^8), for which no irreducible trinomial exists, the dual to polynomial basis transformation circuit only requires an extra two XOR gates. For a more extensive survey and discussion of bit-serial multipliers see, for example, Chapter 3 in [7].
In some applications, it is required to adopt bit-parallel architectures rather than bit-serial ones to achieve the required performance. Although bit-serial dual basis multipliers based on the trace function have been widely employed in applications such as RS encoders [3], [15], it will often prove advantageous to employ bit-parallel dual basis multipliers, particularly in more complex circuits such as RS decoders. To this end, we now consider the design of such multipliers.
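The bit-serial scheme derived in this section can be sketched in software as follows. The sketch assumes that f is the least significant polynomial basis coefficient and β = 1, so that the dual basis coordinate b_i equals f(α^i b), and it uses GF(2^4) with p(x) = x^4 + x + 1 purely as an illustrative field. All function names are hypothetical. The register holding b is clocked m times with feedback taken from the low-order coefficients of p(x), and each clock yields one product bit as a GF(2) inner product with c.

```python
M = 4
POLY = 0b10011                                # assumed p(x) = x^4 + x + 1
P_LOW = [(POLY >> i) & 1 for i in range(M)]   # coefficients p_0 .. p_{M-1}

def gf_mul(x, y):
    """Reference polynomial-basis multiplication, used only for checking."""
    r = 0
    for i in range(M):
        if (y >> i) & 1:
            r ^= x << i
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def to_dual(x):
    """Dual basis coordinates under the assumed f (bit 0) and beta = 1."""
    return [gf_mul(x, 1 << i) & 1 for i in range(M)]

def bits(x):
    """Polynomial basis coefficients c_0 .. c_{M-1}."""
    return [(x >> i) & 1 for i in range(M)]

def berlekamp_serial(b_dual, c_poly):
    """Clock an LFSR loaded with b; emit one dual-basis product bit per clock."""
    reg = list(b_dual)
    out = []
    for _ in range(M):
        out.append(sum(r & c for r, c in zip(reg, c_poly)) & 1)   # M AND gates + XOR tree
        feedback = sum(r & p for r, p in zip(reg, P_LOW)) & 1     # next bit b_{M+k}
        reg = reg[1:] + [feedback]
    return out
```

Calling `berlekamp_serial(to_dual(b), bits(c))` returns the dual basis coordinates of the product bc under the stated assumptions.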
BIT-SERIAL BERLEKAMP MULTIPLIERS
Let a, b, c ∈ GF(2^m) such that a = bc and let {λ_i} be the dual basis to the polynomial basis for β ∈ GF(2^m) and f ∈ F_{2^m}.
where b_{m+k} (k ≥ 0) are given by (6). From these equations it can be seen that the m product bits are generated by m identical functions of the form
all that changes in these functions is the value of k.
A bit-parallel dual basis multiplier for GF(2^m) can, therefore, be constructed out of m GF(2) inner product modules (of type A, say) that implement (7) and one module (of type B, say) that generates the b_k (k = m, m + 1, ..., 2m - 2) from (6).
An example of such a multiplier for GF(2^4) is given below.
Bit-Parallel Dual Basis Multiplier for GF(2^4)
Let p(x) = x^4 + x + 1 be the defining irreducible polynomial for the field and let α be a root of p(x). From (7), four type A modules are required each implementing the function
This equation can be implemented by the circuit shown in Fig. 2. From p(x) = x^4 + x + 1 and (7) it is observed that
and so the type B module can be implemented by the circuit shown in Fig. 3. The four type A modules and the type B module are then combined to form the full bit-parallel dual basis multiplier for GF(2^4) shown in Fig. 4. If ... $c = \sum_{i=0}^{3} c_i \alpha^i$ is the polynomial basis representation of c, the registers in ... Once these values have been loaded into the registers, the product bits a_i (i = 0, 1, 2, 3) become immediately available on the output lines. Note that in Fig. 4 both b and c enter the multiplier as represented in the dual basis and that by reordering the coefficients of c, the required dual to polynomial basis conversion is carried out. This is because, as shown in Table 2 in Appendix A, if {1, α, α^2, α^3} is the polynomial basis then {1, α^3, α^2, α} forms the dual basis.
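The GF(2^4) example can be checked in software. The sketch below assumes that the type B module computes b_{4+k} = b_k + b_{k+1} (the recurrence implied by p(x) = x^4 + x + 1) and that each type A module forms the inner product a_k = b_k c_0 + b_{k+1} c_1 + b_{k+2} c_2 + b_{k+3} c_3. Both expressions are reconstructions under the assumption that f is the least significant polynomial basis coefficient and β = 1, and the function names are illustrative. Both operands enter in the dual basis, and c is converted by the reordering implied by the dual basis {1, α^3, α^2, α}.

```python
POLY = 0b10011   # p(x) = x^4 + x + 1

def gf_mul(x, y):
    """Reference polynomial-basis multiplication in GF(2^4), for checking."""
    r = 0
    for i in range(4):
        if (y >> i) & 1:
            r ^= x << i
    for i in range(6, 3, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - 4)
    return r

def to_dual(x):
    """Dual basis coordinates under the assumed f (bit 0) and beta = 1."""
    return [gf_mul(x, 1 << i) & 1 for i in range(4)]

def type_b(b):
    """Extend b_0..b_3 with b_4, b_5, b_6 using b_{4+k} = b_k XOR b_{k+1}."""
    return b + [b[k] ^ b[k + 1] for k in range(3)]

def type_a(window, c):
    """GF(2) inner product of four consecutive b bits with c_0..c_3."""
    return (window[0] & c[0]) ^ (window[1] & c[1]) ^ (window[2] & c[2]) ^ (window[3] & c[3])

def pdbm_gf16(b_dual, c_dual):
    """Bit-parallel product: dual basis inputs, dual basis output."""
    # Dual {1, a^3, a^2, a} -> polynomial {1, a, a^2, a^3} is a pure reordering.
    c_poly = [c_dual[0], c_dual[3], c_dual[2], c_dual[1]]
    ext = type_b(list(b_dual))
    return [type_a(ext[k:k + 4], c_poly) for k in range(4)]
```

All four product bits depend only on the loaded operands, so they are available at once, in contrast to the one-bit-per-clock behaviour of the bit-serial scheme.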
Complexity Analysis
It has been seen that a bit-parallel dual basis multiplier ... and one further module. It is not possible to generalize the hardware requirements of this extra module, although the lowest complexity modules for GF(2^m) for a range of m have been found [7]. Both the PDBM and the PMM also require 2m register elements, and so these components are not considered in the comparison.
The delay of the PDBM is now considered. Let D_A be the delay through a 2-input AND gate and let D_X be the delay ... (For an explanation of (8), see Appendix B.) Thus the delay through the PDBM in total is
It can be seen that the hardware complexity of the PDBM is at its minimum when p(x) is a trinomial. Furthermore, the delay through the PDBM is at its minimum when p(x) is a trinomial of the form p(x) = x^m + x + 1. In Table 1, the hardware complexity and delays of the PDBM and the PMM are given for GF(2^m) for m = 2, 3, ..., 10. The overall hardware levels and delays of the PMM are taken from [7]. The PDBM makes use of the values of p(x) given in Table 2 in Appendix A. Because these polynomials have the lowest Hamming weights of the available defining polynomials and the lowest values of k for each m, these choices of p(x) yield the fastest and least hardware intensive bit-parallel dual basis multipliers. For GF(2^8), the extra two XOR gates required to carry out the dual to polynomial basis conversion have been included. This allows both the multiplier and the multiplicand to enter the multiplier represented in the dual basis. From Table 1, it can be seen that the PDBM and the PMM have very similar hardware requirements and longest delay paths. This is to be expected given the architectural similarities of the two multipliers, although the underlying algorithms are entirely different. The only significant difference is to be found with GF(2^8): when m = 8, the PDBM requires 11 fewer XOR gates but has a longest delay path 3D_X more than the PMM. However, in [6], [7], it was noted that with some values of m it is possible to reduce the required number of XOR gates by reusing partial sums. This approach can also be adopted with the PDBM, and these modified values are shown in brackets in Table 1.
TABLE 1 HARDWARE REQUIREMENTS AND DELAYS OF DUAL BASIS AND POLYNOMIAL BASIS BIT-PARALLEL MULTIPLIERS

BIT-SERIAL SYSTOLIC DIVIDER FOR GF(2^m)
We now present a dual basis, systolic bit-serial divider which is also based upon Theorem 4. If we take c = a / b and enforce the condition b ≠ 0 in Theorem 4, we arrive at the following linear functions division algorithm (LFDA):
1) Construct (5).
2) Solve (5) to obtain the desired value of c.
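The two LFDA steps can be sketched in software as follows. The sketch assumes that (5) is the m × m linear system whose row k reads b_k c_0 + b_{k+1} c_1 + ... + b_{k+m-1} c_{m-1} = a_k, with a and b in the dual basis (f taken as the least significant polynomial basis coefficient, β = 1) and c in the polynomial basis. GF(2^4) with p(x) = x^4 + x + 1 is used for illustration and all function names are hypothetical. Step 2 is carried out here by plain Gauss-Jordan elimination over GF(2), not by a systolic array.

```python
M = 4
POLY = 0b10011                                # assumed p(x) = x^4 + x + 1
P_LOW = [(POLY >> i) & 1 for i in range(M)]

def gf_mul(x, y):
    """Reference polynomial-basis multiplication, used for checking."""
    r = 0
    for i in range(M):
        if (y >> i) & 1:
            r ^= x << i
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def to_dual(x):
    """Dual basis coordinates under the assumed f (bit 0) and beta = 1."""
    return [gf_mul(x, 1 << i) & 1 for i in range(M)]

def build_system(b_dual):
    """Step 1: extend b to b_0..b_{2M-2} and form the M x M matrix [b_{k+i}]."""
    ext = list(b_dual)
    for k in range(M - 1):
        ext.append(sum(ext[k + i] & P_LOW[i] for i in range(M)) & 1)
    return [ext[k:k + M] for k in range(M)]

def solve_gf2(matrix, rhs):
    """Step 2: Gauss-Jordan elimination over GF(2) on a nonsingular system."""
    rows = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(M):
        pivot = next(r for r in range(col, M) if rows[r][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(M):
            if r != col and rows[r][col]:
                rows[r] = [x ^ y for x, y in zip(rows[r], rows[col])]
    return [row[M] for row in rows]

def lfda_divide(a, b):
    """Return c = a / b (b != 0) as a polynomial-basis bit mask."""
    c_bits = solve_gf2(build_system(to_dual(b)), to_dual(a))
    return sum(bit << i for i, bit in enumerate(c_bits))
```

Because multiplication by a nonzero b is invertible, the matrix built in step 1 is nonsingular whenever b ≠ 0, which is why the pivot search always succeeds in that case.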
The implementation of the LFDA can, therefore, be broken down into two sections: the section that generates the system represented in (5) and the section that solves these equations. The first of these problems is now addressed and, as an example, an implementation of the LFDA for GF(2^4) is presented.
Generating the System of Equations
Solving Systems of Simultaneous Linear Equations over GF(2)
Simultaneous linear equations can be solved by Gauss-Jordan elimination (GJE), Gaussian elimination with back substitution, LU decomposition, or matrix inversion. Solving systems of simultaneous linear equations is of O(m^3) complexity and so a degree of parallelism is required if these techniques are to be implemented. In [25], a systolic array for carrying out upper matrix triangularization was presented and in [5] this array was modified to carry out full GJE. This is achieved by clocking the array a further m clock cycles. Hence, to implement the LFDA, all that is required is to combine this GJE systolic array with an array of MF cells.
The GJE array given in [5] consists of two distinct processors, conventionally called "round" and "square" processors. The round processors determine whether a particular row is to act as a pivot and the square processors carry out the GF(2) arithmetic specified by the round processor at the beginning of the row. For a detailed description of these cells, see [5]. The full arrangement of round and square processors for carrying out GJE over GF(2^4) is shown in Fig. 7.
Complete Systolic Implementation of the LFDA
These architectures are now combined to produce the complete systolic implementation of the LFDA. An example of the systolic implementation of the LFDA for GF(2^4) is shown in Fig. 8. The square boxes D_i represent delay units of i clock cycles. These are required because from Figs. 6 and 7 it can be seen that the output from the first MF cell is available in a different format to that required by the GJE array. There is a 3m clock cycle delay between the first coefficient of b entering the GJE array and the first coefficient of the vector becoming available at the bottom. Because an (m - 1) delay is required to convert the order of the b_i values from that produced by the MF cells to that required by the GJE array, the LFDA has a delay of (4m - 1) clock cycles between the first basis coefficients of a and b entering the circuit and the first basis coefficient of c becoming available on the output. The LFDA therefore has a total computation time of (5m - 1) clock cycles. This implementation of the LFDA also supports pipelining. As regards the hardware requirements of this implementation of the LFDA, each of the (m - 1) MF cells requires six registers and the delay units D_i consist of m(m + 1)/2 registers. The GJE array consists of 2(m^2 + 4m - 1) registers and so the divider requires (2.5m^2 + 14.5m - 8) registers in total.
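The register and latency figures quoted in this section are easy to tabulate. The following sketch simply evaluates the stated expressions for a given m; the function name and dictionary keys are illustrative.

```python
def lfda_resources(m):
    """Register count and timing of the systolic LFDA, per the stated formulas."""
    mf_registers = 6 * (m - 1)                 # six registers per MF cell
    delay_registers = m * (m + 1) // 2         # delay units D_i
    gje_registers = 2 * (m * m + 4 * m - 1)    # GJE array
    return {
        "registers": mf_registers + delay_registers + gje_registers,
        "latency": 3 * m + (m - 1),            # 4m - 1 clock cycles
        "total_time": 5 * m - 1,               # clock cycles
    }
```

For m = 8 this gives 268 registers, in agreement with 2.5m^2 + 14.5m - 8.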
Comparisons with Other Dividers
A similar systolic divider has been presented by Hasan and Bhargava [5]. Both dividers have the same hardware requirements and operation time and both can also be pipelined. The main difference between these two dividers, apart from the underlying algorithms, is the way in which they require the finite field elements to be represented. The divider presented in [5] operates over the polynomial basis and, whereas the coefficients of a and b enter the circuit most significant coefficient first, the coefficients of c leave the circuit least significant coefficient first. In some situations this would prove unacceptable, and extra hardware would have to be included to reorder these coefficients. In the LFDA implementation given here, all coefficients enter and leave the circuit least significant coefficient first. However, it is to be remembered that a and b are represented in the dual basis while c is represented in the polynomial basis, although, as has already been pointed out, any required basis conversions can often be carried out with little or no extra hardware.
In fact, it can be regarded as an advantage that the LFDA operates over the dual basis. This is because multiplication is the most frequently used operation in the implementation of RS codecs and it is important to be able to utilize the most hardware efficient multipliers available. As has been seen earlier, hardware efficient dual basis multipliers exist in both bit-serial and bit-parallel forms and if the operating basis of a circuit is taken to be the dual basis, it is desirable that all the arithmetic operators should function over this same basis to avoid unnecessary basis conversions. Hence, in deciding whether to utilize the divider presented here or that in [5], the main criterion will be the required operating basis.
The divider presented here also shares a number of advantageous characteristics with the divider described in [5]. For example, both dividers can operate with any irreducible polynomial, there are no global communications, the divider is easily extended to different GF(2^m) and the longest delay path is independent of m. Compare this with the inverters presented in [18], [19], [20], for example. Firstly and most obviously, it should be noted that for an inverter to carry out division, an extra finite field multiplier is required. (However, it is to be noted that by appropriately initializing the inverter in [20], this circuit can in fact carry out division.) These inverters all have structures dependent upon the choice of p(x) and they are also nonsystolic. Furthermore, the inverters described in [19] and [20] require two control lines whilst the inverter presented in [18] requires four control lines. For GF(2^m) with large values of m, therefore, the dividers presented here and in [5] are more suitable for implementation in VLSI.
CONCLUSIONS
In this paper we have suggested an alternative definition of duality based not on the trace function but on the idea of a general linear function. This has the advantage of simplifying the selection of dual bases since tr(z) ∀z ∈ GF(2^m) does not have to be calculated and f can be taken to be a single polynomial basis coefficient. The advantages of taking f to be the least significant polynomial basis coefficient, say, as opposed to the trace function will therefore be greatest when m is large. A general algorithm for carrying out finite field multiplication/division has been presented from which the Berlekamp bit-serial multiplier has been derived. This approach further allows for a class of bit-parallel dual basis multipliers to be designed. These multipliers have as low hardware requirements as any other bit-parallel multiplier, are easily extended to different GF(2^m) and have a regular structure. Furthermore, it is straightforward to design these multipliers given the irreducible polynomial for the field. It has also been demonstrated that these multipliers can be easily hardwired to carry out constant multiplication. Through use of the general multiplication/division algorithm a systolic bit-serial divider has been derived. Being systolic, this divider is modular, easily expandable to different GF(2^m) and has a longest delay path independent of m. Consequently, it is anticipated this divider is most suited to applications where m is large or where high clock speeds are required. Furthermore, this divider also operates over the dual basis and so can be incorporated in circuits using hardware efficient, dual basis multipliers.
By using the techniques first presented in [10] it is possible to select dual bases which are permutations or which are close to being permutations of the polynomial basis. This results in little or no extra hardware being required to implement basis conversions and so dual basis operators can be utilized throughout RS codecs. Thus the problem traditionally associated with dual basis operators, that of the circuits functioning over two different bases, can be largely overcome. Given that dual basis multipliers have low hardware requirements and there also exist efficient dual basis inverters and dividers ([23], [26] and Section 6 above), it is therefore suggested that the dual basis is an appropriate basis for RS codecs to operate over, whether the basis coefficients are represented bit-serially or in parallel.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 45, NO. 3, MARCH 1996
Table 2 (columns: m, p(x), β, dual basis)
Hence in this instance the dual basis is merely a permutation of the polynomial basis. ... Hence, in this instance, the dual basis can be obtained from the polynomial basis with two additions and a reordering of basis coefficients. We now apply these results to produce the optimal dual bases for GF(2^m) (m = 2, 3, ..., 10).
A.3 Optimal Dual Bases for Selected GF(2^m)
α^{m-1}, α^{m-2}, α^{m-3}, ..., α^{k+2}, α^{k+1}
1. There are irreducible trinomials (listed in ... By taking f to be the least significant polynomial basis coefficient rather than the trace function, finding the appropriate value of β is simplified because the values of tr(z) ∀z ∈ GF(2^m) do not have to be calculated. In fact, these values of β do not have to be calculated at all since all we are ultimately interested in are the dual basis elements and these can be obtained directly from (9) and (10).
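The permutation property can be confirmed by brute force for small fields. The sketch below assumes that f is the least significant polynomial basis coefficient and β = 1, and searches for the elements λ_j satisfying f(α^i λ_j) = 1 if i = j and 0 otherwise. The function names are hypothetical, and the exhaustive search is only practical for small m.

```python
def gf_mul(x, y, m, poly):
    """Multiply in GF(2^m); elements are polynomial-basis bit masks."""
    r = 0
    for i in range(m):
        if (y >> i) & 1:
            r ^= x << i
    for i in range(2 * m - 2, m - 1, -1):
        if (r >> i) & 1:
            r ^= poly << (i - m)
    return r

def dual_basis(m, poly):
    """Dual of {1, a, ..., a^(m-1)} under the assumed f (bit 0) and beta = 1."""
    basis = []
    for j in range(m):
        for z in range(1, 2 ** m):
            # z is lambda_j if f(a^i * z) is 1 exactly when i == j
            if all((gf_mul(z, 1 << i, m, poly) & 1) == (i == j) for i in range(m)):
                basis.append(z)
                break
    return basis
```

For example, `dual_basis(4, 0b10011)` returns `[1, 8, 4, 2]`, that is {1, α^3, α^2, α}, which is a reordering of the polynomial basis for p(x) = x^4 + x + 1.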
It is to be noted that the dual bases for GF(2^m) with m ≤ 10 have only been considered in this paper because we have considered dual basis operators with implementation in RS codecs in mind. Were it required for RS codecs to be implemented with m > 8 or with cryptographic applications with very large values of m, the techniques used in [10], [22] and reviewed above may have to be extended to cover irreducible polynomials which are not of the above form.
APPENDIX B
DELAY THROUGH THE B TYPE MODULE IN THE PDBM
To calculate the delay through the B type module of the PDBM, we must consider the generation of b_{2m-2}. In particular ... and so we must find the greatest integer w such that
That is, we must find the greatest integer w such that
$w \le \frac{m-2}{m-k}$
and so $w = \left\lfloor \frac{m-2}{m-k} \right\rfloor$.
We must also include the delay involved in generating b_{2m-2} itself and hence we arrive at (8), as required.
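As a sanity check, and assuming the closing expression of this appendix reads w = ⌊(m - 2)/(m - k)⌋, the sketch below counts how many times the highest-index term feeding b_{2m-2} is itself a generated bit. It assumes p(x) = x^m + x^k + (lower order terms), so that the highest index appearing in the recurrence for b_{m+j} is j + k. This reading of the derivation and the function name are assumptions, not taken from the paper.

```python
def feedback_chain_depth(m, k):
    """Count generated bits that must settle before b_{2m-2} can be formed."""
    index, w = 2 * m - 2, 0
    # Each step moves from b_index to its highest-index operand b_{index-(m-k)};
    # the step counts only while that operand is itself generated (index >= m).
    while index - (m - k) >= m:
        index -= m - k
        w += 1
    return w
```

Under these assumptions the count matches the floor expression for every m from 2 to 10 and every k from 1 to m - 1, and it is zero when k = 1, consistent with the observation that the delay is smallest for p(x) = x^m + x + 1.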
