Abstract: Two new systolic architectures are presented for multiplication in the finite field GF(2^m). Both architectures are based on the standard basis representation. In Architecture-I, the authors speed up the operation by using a new partitioning scheme for the basic cell of a straightforward systolic architecture, which shortens the clock cycle period. In Architecture-II, they eliminate the one-clock-cycle gap between iterations by pairing off the cells of Architecture-I. They compare their architectures with previously proposed systolic architectures and a semi-systolic architecture, and show that Architecture-I offers the highest speed and Architecture-II the lowest hardware complexity.
Introduction
In recent years, finite fields have been widely used in various data communication applications such as switching theory, error correction coding, pseudo-random number generation and cryptosystems [1, 2]. A high-speed and low-complexity design for finite field arithmetic is essential for meeting the demands of wider bandwidths, better security, and higher portability for personal communication.
In these applications, the Galois field of order q = p^m, denoted as GF(p^m), is usually used, especially in the case p = 2. Addition over GF(2^m) can be easily implemented by bit-wise exclusive-OR without any carry propagation problem. Multiplication over GF(2^m), on the other hand, is much more complex. In the implementation of multiplication over GF(2^m), the design may use standard basis, normal basis, or dual basis representation. In this paper, we propose two high-speed and low-complexity systolic architectures based on standard basis representation (SBR).
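As a small illustration of the carry-free addition mentioned above, the following Python sketch stores an element of GF(2^m) as an integer whose bit j holds the coefficient of the jth basis element. The function name `gf2m_add` and the integer encoding are assumptions made for this example and do not come from the paper.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two GF(2^m) elements given as coefficient bit vectors.

    Coefficients are added modulo 2, so the sum is a bit-wise XOR
    and no carry ever propagates between bit positions.
    """
    return a ^ b


if __name__ == "__main__":
    x, y = 0b1011, 0b0110
    print(bin(gf2m_add(x, y)))
```

Because every element is its own additive inverse in characteristic 2, subtraction is the same operation as addition.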
For a high-speed multiplier over GF(2^m), several designs [3–5] adopting the architecture of semi-systolic arrays have been proposed. However, all these semi-systolic architectures have to broadcast some global signals, and the broadcasting problem becomes more difficult to handle as the bit length m grows. On the other hand, due to the regularity of cells and the locality of connections, a 'pure' systolic array, instead of a semi-systolic array, is usually a more appropriate choice for VLSI implementation [6–13]. These systolic architectures usually decompose multiplication over GF(2^m) into a sequence of additions to sum up partial products, and modular operations to perform SBR conversion. We may classify these systolic designs into two categories according to their computation procedure: (1) summing first and (2) modulus first. The 'summing first' designs perform the addition first and then convert the sum into SBR form. Within this category, Wang et al. [7] implemented AB in a straightforward way, Wei [8] implemented AB^2 + C, Guo et al. [9] used a high-radix implementation for computing AB, and Mekhallalati et al. [10] slightly modified the systolic architecture in a semi-systolic array by applying a re-timing process to reduce the initial delay. Conversely, the 'modulus first' designs convert both the addend and the augend into SBR form first and then perform the addition. Within this category, Yeh et al. [6] implemented an architecture to compute AB + C, Ghafoor et al. [11] adopted the same architecture to perform the exponentiation operation, and Hasan et al. [12] arranged the converted addend and augend in a matrix form so that all the computations of multiplication, division, and inversion can be treated as matrix operations.
Among all these architectures, Yeh's design [6] is the fastest due to its shortest clock period, while one of Mekhallalati's designs (Systolic-II) [10] has a superior performance in the area-time product. In Yeh's architecture, the main operation is decomposed into two parallel operations to shorten the clock period, and some flip-flops are inserted in the architecture to avoid the inherent one-clock-cycle-gap problem of a bit-by-bit systolic array. However, the partitioning scheme needs one extra control signal, and the insertion of flip-flops increases both the area and power consumption. In Mekhallalati's design, re-timing is applied on the connections between the ith cell and the (i+1)th cell of a semi-systolic architecture, where i is odd, to avoid the one-clock-cycle-gap problem and to reduce the latency down to m clock cycles, where m is the degree of the finite field GF(2^m). In this design, a circuit-level optimisation is also applied on the cells to shorten the clock period.
In this paper, we propose two new architectures, Architecture-I and Architecture-II, to further improve the operation speed and to reduce the area complexity. Architecture-I effects the partitioning on the general cells in Kung's design [14] to shorten the clock period. Architecture-II is constructed by pairing off the cells in Architecture-I to reduce the latency. As will be shown in Section 4, the partitioning of cells makes Architecture-I one of the fastest designs for computing independent multiplications, while the alliance of partitioning and pairing makes Architecture-II the fastest design for computing dependent multiplications. Moreover, Architecture-II has the lowest area-time complexity no matter whether the computed multiplications are dependent or independent.
Systolic Architecture-I
GF(2^m), an extension field of GF(2), contains 2^m elements and is associated with a special polynomial F(x). Here, F(x) is a monic, irreducible polynomial over GF(2) of degree m and can be expressed as

$$F(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0 \tag{1}$$

where $f_i$ is either 0 or 1. If α is a root of F(x), the set $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ forms the standard basis of GF(2^m). Any two elements A(α), B(α) ∈ GF(2^m) can be expressed in SBR form as:

$$A(\alpha) = \sum_{i=0}^{m-1} a_i \alpha^i, \qquad B(\alpha) = \sum_{i=0}^{m-1} b_i \alpha^i \tag{2}$$

where $a_i$ and $b_i$ are binary numbers. The multiplication of A and B can be computed by multiplying A(α) with B(α) first and then performing the reduction modulo F(α) to convert the product back to the SBR form. An algorithm for computing the multiplication P(α) = A(α) × B(α) over GF(2^m) can be expressed as:
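Multiplication algorithm over GF(2^m) by using modulus operation

$$
\begin{aligned}
& R^0(\alpha) = 0 \\
& \text{for } i = 1 \text{ to } m: \\
& \qquad R^i(\alpha) = \bigl(R^{i-1}(\alpha)\,\alpha + a_{m-i}B(\alpha)\bigr) \bmod F(\alpha) \\
& P(\alpha) = R^m(\alpha)
\end{aligned}
$$

where $a_j$ is the jth coefficient of A(α), $R^i(\alpha) = \sum_{j=0}^{m-1} r^i_j \alpha^j$ is the partial sum after the ith iteration, and

$$\bigl(R^{i-1}(\alpha)\,\alpha + a_{m-i}B(\alpha)\bigr) \bmod F(\alpha) = \bigl(R^{i-1}(\alpha)\,\alpha\bigr) \bmod F(\alpha) + a_{m-i}B(\alpha).$$

This is because $a_{m-i}B(\alpha)$ is already in SBR form. Hence, the computation of $R^i(\alpha)$ can be treated as the combination of a modular operation and an addition. The modular operation $(R^{i-1}(\alpha)\,\alpha) \bmod F(\alpha)$ can be computed by converting the highest-order term of $R^{i-1}(\alpha)\,\alpha$ into the SBR form first, and then adding the converted result to the remaining part of $R^{i-1}(\alpha)\,\alpha$. That is,

$$\bigl(R^{i-1}(\alpha)\,\alpha\bigr) \bmod F(\alpha) = r^{i-1}_{m-1}\sum_{j=0}^{m-1} f_j \alpha^j + \sum_{j=1}^{m-1} r^{i-1}_{j-1}\alpha^j \tag{3}$$

At the bit level, the algorithm becomes

$$
\begin{aligned}
& \text{for } i = 1 \text{ to } m: \\
& \qquad \text{for } j = m-1 \text{ down to } 0: \\
& \qquad\qquad r^i_j = r^{i-1}_{m-1} f_j + r^{i-1}_{j-1} + a_{m-i} b_j
\end{aligned}
$$

with $r^0_j = 0$ for all j and $r^{i-1}_{-1} = 0$.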
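The polynomial-level and bit-level forms of the multiplication can be cross-checked with a minimal Python sketch. It assumes the MSB-first recurrence described in the text: each iteration multiplies the partial sum by α, replaces the α^m term using F, and adds $a_{m-i}B$. Elements are stored as integers whose bit j is the jth coefficient, and `f` holds only the low m coefficients $f_{m-1} \ldots f_0$ of F(x). The function names and this encoding are choices made for illustration, not taken from the paper.

```python
def gf_mul_poly(a: int, b: int, f: int, m: int) -> int:
    """Polynomial-level form: R^i = (R^{i-1} * alpha + a_{m-i} * B) mod F."""
    r = 0
    for i in range(1, m + 1):
        r <<= 1                      # multiply the partial sum by alpha
        if (r >> m) & 1:             # an alpha^m term appeared
            r ^= (1 << m) | f        # replace alpha^m by sum_j f_j alpha^j
        if (a >> (m - i)) & 1:       # coefficient a_{m-i}, MSB first
            r ^= b                   # B is already in SBR form
    return r


def gf_mul_bits(a: int, b: int, f: int, m: int) -> int:
    """Bit-level form: r^i_j = r^{i-1}_{m-1} f_j + r^{i-1}_{j-1} + a_{m-i} b_j."""
    def bit(x: int, j: int) -> int:
        return (x >> j) & 1

    r_prev = [0] * m                 # r^0_j = 0
    for i in range(1, m + 1):
        msb = r_prev[m - 1]          # shared by every bit of iteration i
        a_bit = bit(a, m - i)
        r_cur = [0] * m
        for j in range(m - 1, -1, -1):   # computed from MSB toward LSB
            shifted = r_prev[j - 1] if j > 0 else 0
            r_cur[j] = (msb & bit(f, j)) ^ shifted ^ (a_bit & bit(b, j))
        r_prev = r_cur
    return sum(r_prev[j] << j for j in range(m))


if __name__ == "__main__":
    # GF(2^4) with F(x) = x^4 + x + 1, so f = 0b0011.
    m, f = 4, 0b0011
    print(bin(gf_mul_poly(0b0010, 0b1000, f, m)))
    print(bin(gf_mul_bits(0b0010, 0b1000, f, m)))
```

In GF(2^4) with F(x) = x^4 + x + 1, the product α · α^3 is α^4, which reduces to α + 1, so both functions return 0b0011 for these inputs.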
In the above algorithm, the main operation can be computed bit-by-bit by adding three operands: $r^{i-1}_{m-1} f_j$, $r^{i-1}_{j-1}$, and $a_{m-i} b_j$. Because $r^{i-1}_{m-1}$, the MSB of $R^{i-1}(\alpha)$, is involved in the computation of $r^i_j$ for all j, it is more efficient to compute $R^i(\alpha)$ with the most significant bit calculated first. Hence, in the above algorithm, we compute $R^i(\alpha)$ starting from the MSB toward the LSB. A 2D systolic architecture [7] for the implementation of this multiplication algorithm is illustrated in Fig. 1. In this figure, the data dependency between bits and between iterations is shown. The cell on the jth column and ith row computes the jth bit of $R^i(\alpha)$ by computing

$$r^i_j = r^{i-1}_{m-1} f_j + r^{i-1}_{j-1} + a_{m-i} b_j \tag{4}$$

where $r^{i-1}_{m-1}$ is the most significant coefficient of $R^{i-1}(\alpha)$. By applying vertical projection on this 2D systolic array, we get the 1D systolic array shown in Fig. 2. The timing sequence of the $r^i_j$'s is illustrated in Fig. 3. Note that the delay time between successive iterations is two clock cycles; there is a one-clock-cycle gap between $r^{i-1}_{m-1}$ and $r^i_{m-1}$, a one-clock-cycle gap between $r^i_{m-1}$ and $r^{i+1}_{m-1}$, and so on. This is due to the inherent characteristics of the (mod F(α)) operation in the multiplication algorithm, as explained in the next paragraph.
In eqn. 4, it can be seen that $r^{i-1}_{j-1}$ is required to compute $r^i_j$. Suppose we want to compute $R^i(\alpha)$ bit-by-bit. After the computation of $r^{i-1}_j$, we need one more clock cycle to calculate $r^{i-1}_{j-1}$ before the computation of $r^i_j$. That is to say, there is a one-clock-cycle delay between the computation of the same-order coefficient in two adjacent iterations. Hence, the average computation time of this architecture for N m-bit multiplications over GF(2^m) becomes 2mN clock cycles: mN cycles to operate plus mN interlaced clock cycles to wait. To further improve the performance of the architecture, these idle clock cycles can be used to compute another independent operation without any timing conflict; i.e. this bit-by-bit architecture can achieve the performance of m clock cycles per operation when computing independent multiplications.
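The two-cycle spacing between iterations can be reproduced with a small dependency model in Python. This is a simplified sketch under stated assumptions, not the cell timing of Figs. 2 and 3: every bit takes one clock cycle, bit j of iteration i waits for $r^{i-1}_{j-1}$, and the MSB of the previous iteration is assumed to reach cell j through its neighbour, cell j+1. The function name `earliest_cycles` is hypothetical.

```python
def earliest_cycles(m: int, iterations: int) -> list[list[int]]:
    """Earliest clock cycle t[i][j] at which r^i_j can be produced.

    Assumed model: one cycle per bit, bits of an iteration are
    produced from MSB to LSB, and R^0 = 0 is known at cycle 0.
    """
    assert m >= 2, "the model needs at least two bit positions"
    t = [[0] * m]                                  # iteration 0
    for i in range(1, iterations + 1):
        row = [0] * m
        for j in range(m - 1, -1, -1):
            # The MSB cell reads r^{i-1}_{m-1} directly; other cells get
            # it one hop later, after their neighbour j+1 has fired.
            deps = [t[i - 1][m - 1]] if j == m - 1 else [row[j + 1]]
            if j > 0:
                deps.append(t[i - 1][j - 1])       # needs r^{i-1}_{j-1}
            row[j] = 1 + max(deps)
        t.append(row)
    return t


if __name__ == "__main__":
    t = earliest_cycles(m=5, iterations=4)
    print([t[i][4] for i in range(1, 5)])          # MSB cycle per iteration
```

Under this model, each bit position fires every second cycle, which is the idle slot that an interleaved independent multiplication can occupy.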
In this paper, we propose a new architecture to further improve the computation speed by partitioning the main operation of the bit-level algorithm. In eqn. 4, note that $f_j$, $a_{m-i}$, and $b_j$ could be available in advance. As r [...] The [...] Fig. 6. The delay time between successive iterations is still two clock cycles, while the clock period has now been shortened due to the partitioning operation. Note that Architecture-I can also achieve full utilisation when calculating independent multiplications over GF(2^m). The clock period before partitioning is the delay of two XOR gates and one AND gate. After partitioning, the clock period becomes the delay of one XOR gate and one AND gate. Since the computation of $p^i_j$ and $r^i_j$ can be done in parallel, the average number of clock cycles per multiplication is not changed. The detailed comparison of Architecture-I and some other architectures will be presented later, in Section 4.
Systolic Architecture-II
As mentioned before, even though Architecture-I can compute a sequence of independent multiplications with full utilisation, it can only achieve 50% utilisation for a sequence of dependent multiplications. This is because the dependence between multiplications precludes the possibility of interlaced computations. In this case, each cell can only operate half of the time and has to wait for the other half due to the one-clock-cycle-gap problem. In this section, we propose another architecture, Architecture-II, which is more efficient than Architecture-I when computing dependent multiplications. In Architecture-II, we use cell merging in order to calculate some of the operations beforehand. These pre-computed operations make it possible to remove the idle cycles of Architecture-I, which increases the computation efficiency when dealing with dependent multiplications.
As mentioned before, there is a one-clock-cycle-gap problem in a bit-by-bit architecture. Our Architecture-I is basically a bit-by-bit architecture, and the one-clock-cycle-gap problem does exist in that architecture. To avoid this problem, we merge the cells in Architecture-I in a specific way, as shown in Fig. 7. In this figure, we group r [...] (Fig. 8). The area and power consumption of the architecture can thus be greatly reduced after eliminating these latches.
After merging the cells, the kth general cell of Architecture-II in iteration i computes the following four operations r [...] where j = 2k. The operation of Architecture-II can be expressed in the following algorithm (for odd m): [...]

Since there is no one-clock-cycle-gap problem in Architecture-II, we can compute N dependent m-bit multiplications in mN clock cycles, with a clock period of about one AND and two XOR gate delays. Conversely, Architecture-I computes N dependent m-bit multiplications in 2mN clock cycles, with a clock period of about one AND and one XOR gate delay. Therefore, for computing dependent multiplications, Architecture-II is superior to Architecture-I in computation speed, area size, and power consumption. Detailed comparisons of Architecture-II with some other architectures will be presented in the next section.
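The speed comparison for dependent multiplications can be made explicit with symbolic gate delays. The symbols $T_{\mathrm{AND}}$ and $T_{\mathrm{XOR}}$, and the totals $T_{\mathrm{I}}$ and $T_{\mathrm{II}}$, are introduced here only for illustration; the cycle counts and approximate clock periods are those stated above, and wiring and latch delays are ignored.

```latex
\begin{aligned}
T_{\mathrm{I}}  &\approx 2mN\,\bigl(T_{\mathrm{AND}} + T_{\mathrm{XOR}}\bigr), \\
T_{\mathrm{II}} &\approx mN\,\bigl(T_{\mathrm{AND}} + 2\,T_{\mathrm{XOR}}\bigr), \\
T_{\mathrm{I}} - T_{\mathrm{II}} &\approx mN\,T_{\mathrm{AND}} > 0 .
\end{aligned}
```

Under this simplified model, Architecture-II finishes N dependent multiplications sooner than Architecture-I by roughly one AND-gate delay per clock cycle of Architecture-II, regardless of the XOR delay.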
The comparisons of the average speed for computing a sequence of dependent and independent multiplications are shown in Table 1. In this table, all the bit-by-bit structures (including [6, 7, 12] and Architecture-I) need 2m clock cycles per operation for dependent multiplications, and m clock cycles per operation for independent multiplications. On the other hand, for most architectures which are not bit-by-bit structures, like our Architecture-II and Mekhallalati's Systolic-II design [10], the average time becomes m clock cycles per operation for both dependent and independent multiplications. Note, however, that Mekhallalati's Systolic-Ib design [10] needs 2m clock cycles per operation for both dependent and independent multiplications. This is because, while this architecture is computing a multiplication, no other multiplication can be computed in parallel.
Among the bit-by-bit structures, Architecture-I uses partitioning to shorten the clock period. In Yeh's design [6] and Hasan's design [12], different partition methods are adopted. On the other hand, for those structures which are not bit-by-bit architectures, the clock period is lengthened after merging. In both of Mekhallalati's designs [10], the cells are merged via the re-timing process of a semi-systolic array, and the expanded clock period is shortened by applying optimisation to the merged cells. In our Architecture-II, however, the cells are merged from Architecture-I, which has applied partitioning on the general cells in Fig. 1. Due to the finer structure of Architecture-I, it is easier, and there is more flexibility, to carry out the merging while keeping a balanced pipeline structure.

The timing and area estimates are based on the delay and gate count information of a TSMC 0.35 µm cell library. Since there is no 5-input XOR gate in the library, we estimate its delay and gate count ourselves. In Table 1 we can see that the delay in [6, 15] is similar to the delay of Architecture-I. This is because all these designs have applied partitioning on their architectures. However, due to their complicated methods of handling the one-clock-cycle-gap problem, Yeh's and Hasan's designs [6, 15] consume a larger area than our Architecture-I. Table 1 shows that the computation speed of our Architecture-II is the fastest when calculating dependent multiplications. Moreover, as mentioned before, a large number of latches can be removed after merging. Therefore, the area size and power consumption are greatly reduced in Mekhallalati's designs and our Architecture-II design. In Table 1, we can also see that Architecture-I is superior to the others in speed when computing independent multiplications, while Architecture-II is more suitable for computing dependent multiplications.
In Figs. 11 and 12 we illustrate the implementation of modular exponentiation with Architecture-I and Architecture-II acting as the main cores, respectively. In these figures, the controllers generate the control signals to collect the results of modular multiplications and to arrange the inputs of modular multiplications. According to an HSPICE simulation, the delay for Architecture-I is 1.6 ns and the delay for Architecture-II is 2.3 ns. To calculate exponentiations over GF(2^155), Architecture-I can achieve 4 Mbit/s and Architecture-II can achieve 2.8 Mbit/s.
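The quoted throughputs can be related to the clock periods with a back-of-the-envelope Python sketch. It assumes that one m-bit exponentiation occupies about m × m clock cycles (m multiplication steps of m cycles each). That cycle count is an assumption made here, not a figure stated in the text, although it reproduces the reported numbers closely. The function name is hypothetical.

```python
def exp_throughput_bits_per_s(m: int, clock_period_s: float) -> float:
    """Estimated exponentiation throughput in bit/s.

    Assumption (not from the paper): an m-bit exponentiation takes
    about m * m clock cycles, i.e. m steps of m cycles each.
    """
    cycles_per_exponentiation = m * m
    return m / (cycles_per_exponentiation * clock_period_s)


if __name__ == "__main__":
    m = 155
    arch1 = exp_throughput_bits_per_s(m, 1.6e-9)   # Architecture-I, 1.6 ns
    arch2 = exp_throughput_bits_per_s(m, 2.3e-9)   # Architecture-II, 2.3 ns
    print(f"Architecture-I : {arch1 / 1e6:.2f} Mbit/s")
    print(f"Architecture-II: {arch2 / 1e6:.2f} Mbit/s")
```

With m = 155, the estimate reduces to 1/(m × clock period), which gives roughly 4.0 Mbit/s for a 1.6 ns period and 2.8 Mbit/s for a 2.3 ns period.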
Conclusion
We have proposed two architectures for increasing the performance of multiplication over GF(2^m). This was achieved by increasing the number of pipeline stages to shorten the clock cycle period, and by pairing off the cells to avoid the one-clock-cycle-gap problem. Of these two architectures, Architecture-I is suitable for calculating independent multiplications and Architecture-II is suitable for calculating dependent multiplications. Architecture-II also has the lower area-time complexity. Two architectures are also proposed to compute exponentiations over GF(2^m), based on our Architecture-I and Architecture-II, respectively. The architecture using Architecture-I as its main core can achieve 4 Mbit/s, while the architecture using Architecture-II can achieve 2.8 Mbit/s.
References

