Abstract-This paper presents an error-tolerant, hardware-efficient VLSI architecture for bit-parallel systolic multiplication over the dual basis, which can be pipelined. The architecture is well suited to VLSI implementation because of its regularity, modular structure, and unidirectional data flow. Its longest delay path is shorter and its area is smaller than those of the bit-parallel systolic multiplication architectures reported earlier. The architecture is implemented using Austria Microsystems' 0.35 µm CMOS technology. It can also operate over both the dual basis and the polynomial basis.
I. INTRODUCTION
Finite field (Galois field) arithmetic over GF(2^m) finds increasing application in public-key cryptography, error detecting and correcting codes [9], VLSI testing [10], and digital signal processing [11]. The elements of GF(2^m) have several equivalent representations, e.g. the polynomial basis (PB), the normal basis, and the dual basis. Dual-basis operators frequently have the lowest hardware requirements of all available operators [18] [19]. The two basic operations over GF(2^m) are addition and multiplication. Addition over GF(2^m) is relatively straightforward to implement, requiring at most m XOR gates. Multiplication is much more expensive in terms of gate count and clock cycles. Other operations over GF(2^m), such as exponentiation, division, and inversion, can be performed by repeated multiplication. A variety of multiplier architectures have been proposed for the different basis representations. For high-speed VLSI implementation, the preferred architecture for a polynomial basis (PB) multiplier is the systolic array. In this type of architecture, a basic cell is repeated in an array and signals flow in one direction between neighbours. PB systolic array multipliers in GF(2^m) can be classified into four categories: bit-serial, bit-parallel, hybrid, and digit-serial. The bit-serial architecture, which processes one bit of input data per clock cycle, has the minimum area and the minimum throughput of all the categories; it is area-efficient and suitable for low-speed applications, but it suffers from high latency. The most widely used bit-serial multiplier is the dual basis Berlekamp bit-serial multiplier [12], which requires less hardware. PB bit-serial and bit-parallel systolic multipliers were presented in [8, 13].
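The field operations mentioned above (XOR addition, multiplication, and inversion by repeated multiplication) can be illustrated with a minimal software sketch. The field size m = 4, the polynomial x^4 + x + 1, and the function names `gf_add`, `gf_mul`, `gf_pow`, and `gf_inv` are assumptions chosen for illustration; they are not taken from the paper.

```python
# Minimal sketch of GF(2^m) arithmetic in the polynomial basis.
# Assumption: m = 4 and P(x) = x^4 + x + 1 (illustrative choice only).
M = 4
POLY = 0b10011  # bit i holds the coefficient of x^i

def gf_add(a, b):
    """Addition is a bitwise XOR of the m coefficient bits."""
    return a ^ b

def gf_mul(a, b):
    """Shift-and-add multiplication with reduction modulo POLY."""
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):      # degree reached m: reduce
            a ^= POLY
    return result

def gf_pow(a, e):
    """Exponentiation by repeated squaring and multiplication."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result

def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^m - 2)."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return gf_pow(a, (1 << M) - 2)
```

Inversion here costs a logarithmic number of multiplications, which is why the cost of the multiplier dominates the cost of the other field operations.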
A bit-serial dual basis systolic multiplier over GF(2^m) was presented in [3]; it requires more hardware than the multiplier proposed in [6] and does not support pipelining. To support pipelining, a modified version that requires less hardware is presented in [14]. The bit-parallel multiplier needs the largest area and provides the maximum throughput. A bit-parallel architecture, capable of processing one whole word of input data per clock cycle, is ideal for high-speed applications when pipelined at the bit level. These architectures are typical examples of the area-speed tradeoff. Mastrovito proposed an algorithm, along with its hardware architecture, for PB multiplication [7], known as the Mastrovito algorithm/multiplier. A formulation for polynomial basis multiplication and a generalized bit-parallel hardware architecture for special reduction polynomials were presented in [2]. A testable polynomial basis bit-parallel multiplier circuit over GF(2^m) was presented in [21]. Although bit-serial dual basis multipliers have been widely employed in applications such as RS encoders [3], it has been shown in [19] that it is advantageous to employ bit-parallel dual basis multipliers, particularly in more complex circuits such as RS decoders and syndrome calculators. Bit-parallel dual basis multipliers therefore allow for reduced-complexity constant multipliers. In this paper, we present a hardware-efficient, fast bit-parallel systolic architecture over the dual basis, which can be pipelined and which has error detecting capability based on a parity prediction technique.
II. PRELIMINARIES
a) Polynomial Multiplication
Let GF(N) denote a set of N elements, where N is a power of a prime number, with two special elements 0 and 1 representing the additive and multiplicative identities respectively, and two operators, addition '+' and multiplication '.'. GF(N) is a finite field if it forms a commutative ring with identity under these two operators in which every nonzero element has a multiplicative inverse. Finite fields GF(2^m) can be generated with primitive polynomials of the form

$$P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i, \qquad p_i \in \mathrm{GF}(2).$$

Let $\{\lambda_i\}$ and $\{\mu_i\}$ be bases of GF(2^m), let f be a linear function from GF(2^m) to GF(2), and let β be a nonzero element of GF(2^m). Then the bases are said to be dual with respect to f and β if

$$f(\beta \lambda_i \mu_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$

In this case $\{\lambda_i\}$ is the standard basis and $\{\mu_i\}$ is the dual basis.

This work was supported in part by Royal Society (UK) Grant.
978-1-4244-2587-7/09/$25.00 ©2009 IEEE

We now restate the multiplication algorithm utilized here. This result was first presented in the context of division [16] but has subsequently been used to describe finite-field multiplication [15]. Furthermore, as observed in [1], the following represents a generalized and alternative representation of the Berlekamp bit-serial multiplier.
Theorem 1 [14] : 
We have modified eqn. (1) to obtain eqn. (2), where all that changes between these functions is the value of k.
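For orientation, dual-basis multiplication as it is commonly formulated in the literature can be sketched in software. This is an assumption-laden illustration and not a restatement of eqns. (1) and (2): it takes m = 4, P(x) = x^4 + x + 1, β = 1, and the linear function f that returns the coefficient of α^0. The operand names (`a_poly` in the polynomial basis, `b_dual` and the product in the dual basis) and all function names are hypothetical. The product bits are computed as c_k = Σ_i a_i b_{i+k}, with the extra bits generated by the feedback b_{m+k} = Σ_j p_j b_{j+k}.

```python
# Sketch of dual-basis multiplication over GF(2^4); all parameters are assumed.
M = 4
POLY = 0b10011          # P(x) = x^4 + x + 1
P = [1, 1, 0, 0]        # p_0..p_3, the low-order coefficients of P(x)

def gf_mul(a, b):
    """Reference polynomial-basis multiplication modulo POLY."""
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result

def f(x):
    """Assumed linear function GF(2^m) -> GF(2): coefficient of alpha^0."""
    return x & 1

def to_dual(x):
    """Dual-basis coordinates x_i = f(x * alpha^i), taking beta = 1."""
    coords = []
    for _ in range(M):
        coords.append(f(x))
        x = gf_mul(x, 0b10)   # multiply by alpha
    return coords

def dual_basis_mul(a_poly, b_dual):
    """c_k = XOR_i a_i & b_{i+k}, with b_{m+k} = XOR_j p_j & b_{j+k}."""
    b_ext = list(b_dual)
    for k in range(M - 1):            # generate b_m .. b_{2m-2}
        feedback = 0
        for j in range(M):
            feedback ^= P[j] & b_ext[j + k]
        b_ext.append(feedback)
    c = []
    for k in range(M):
        bit = 0
        for i in range(M):
            bit ^= ((a_poly >> i) & 1) & b_ext[i + k]
        c.append(bit)
    return c

if __name__ == "__main__":
    a, b = 0b0110, 0b1011
    print(dual_basis_mul(a, to_dual(b)), to_dual(gf_mul(a, b)))
```

The two lists printed by the usage example should agree, since the second is the dual-basis representation of the reference polynomial-basis product. Only the index offset k differs from one output bit to the next, which is what makes a regular array of identical cells possible.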
A bit-parallel dual basis multiplier over GF(2^m) can, therefore, be constructed using two cells. We introduce cell-1, shown in Fig. 2, to generate eqn. (3), and cell-2, shown in Fig. 1, to generate eqn. (2). The horizontal data path is the bottleneck to pipelining in this design: it consists of an AND-XOR tree whose depth is O(m). We therefore modify the horizontal data path by replacing the tree of depth O(m) with a binary tree of depth O(log_2 m). For this purpose, we introduce a new cell (Fig. 2) to generate eqn. (2). The complete circuit for the dual basis systolic multiplier over GF(2^4) is shown in Fig. 3. Latches are introduced in Fig. 3 to make the architecture suitable for pipelining. There is a delay of m clock cycles between the operands 'b' and 'c' entering the multiplier and the result becoming available on the output lines. After this initial delay, results are produced continuously, one per clock cycle.
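The effect of replacing an O(m)-deep XOR path with a balanced tree can be illustrated with a small sketch that counts XOR levels. The functions `xor_chain` and `xor_tree` are hypothetical helpers that model only logic depth; they do not describe the cell structure of Figs. 1 to 3.

```python
# Sketch: depth of a linear XOR chain versus a balanced XOR tree.
def xor_chain(bits):
    """Accumulate left to right; returns (result, depth in XOR levels)."""
    acc = bits[0]
    depth = 0
    for b in bits[1:]:
        acc ^= b
        depth += 1
    return acc, depth

def xor_tree(bits):
    """Pair up signals level by level; returns (result, depth in XOR levels)."""
    level = list(bits)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd signal passes through unchanged
        level = nxt
        depth += 1
    return level[0], depth

if __name__ == "__main__":
    products = [1, 0, 1, 1, 0, 1, 1, 0]   # e.g. m = 8 partial products
    print(xor_chain(products), xor_tree(products))
```

Both functions compute the same sum, but for m inputs the chain needs m - 1 XOR levels while the tree needs ceil(log_2 m), which is the reduction the modified horizontal data path aims for.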
b) Hardware and Delay Analysis
We compare our proposed architecture with the bit-parallel architecture described in [16]. From the table we can conclude that the number of AND gates is the same as in the previous architecture [19], but an m-bit dual basis systolic multiplier requires m fewer XOR gates in this architecture. The longest path delay is also reduced, in both its AND-gate and XOR-gate components; the XOR-gate delay grows as log_2 m instead of m.
In Table 2 , the hardware complexity and delays of the DPM [19] and the our proposed DPM architecture are given for GF(2 m ) for (m = 2, 3, . . .,10). From Table 1 , it can be seen that for every case, the hardware complexity and delays of our proposed DPM architecture are less compared to those of the DPM architecture [19] .
IV. Error Detection Using Parity Checking
We use an error-detection scheme with a very high probability of detecting faults in the bit-parallel systolic multiplier over GF(2^m) using the dual basis, with some additional outputs, called check bits, as shown in Fig. 4. We assume that no interconnections or buses are faulty and that each test phase with the test circuits is separately controllable. First, we attach parity bits b_p and a_p to the input elements; multiplying (ANDing) the inputs, we have,
Now, from eqn. (2) of the previous architecture, we get
Now, we denote the modulo-2 sum of the outputs of the multiplier by r = c_0 ⊕ c_1 ⊕ c_2 ⊕ c_3.
Here, we add some extra lines and gates for testing purposes. The q line is derived from the modulo-2 addition of b_p.c_p and the y_i lines.
Now, rearranging, we see that q and r are the same:
A parity checking circuit for the bit-parallel systolic multiplier over GF(2^4) using the dual basis is presented in the figure. If the circuit operates correctly, then q and r agree and p = r ⊕ q = 0. If any cell in the circuit is faulty, the fault changes the output lines and is reflected in the r line; since q remains unaltered, p = 1 and the fault is detected. A failure in a y_i line is likewise detected by p = 1. In fact, a few of the y_i terms drop out of the output parity check, because they appear an even number of times in the output coefficients and cancel in the parity-checking operation. The scheme can be improved further, since the y_i terms are sums of the results of some of the individual cells. If those cells can be temporarily disconnected and connected to lines that produce the desired feedback lines, then the extra gates are not required for the check line q; the circuit complexity is then reduced and less time is required.
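The check p = r ⊕ q can be illustrated with a small behavioural sketch. The prediction used below is a hypothetical regrouping of the product sum by input bit, chosen only so that q is computed along a different path from the outputs; it is not the q line of Fig. 4, which is built from the input parity bits and the y_i lines. The function names and the `fault_at` parameter, which flips one output bit to model a faulty cell, are likewise assumptions.

```python
# Sketch of parity-based fault detection: p = r XOR q.
from functools import reduce
from operator import xor

def parity(bits):
    """Modulo-2 sum of an iterable of bits."""
    return reduce(xor, bits, 0)

def multiply_rows(a, b_ext, fault_at=None):
    """c_k = XOR_i a_i & b_ext[i+k]; optionally flip one output bit."""
    m = len(a)
    c = [parity(a[i] & b_ext[i + k] for i in range(m)) for k in range(m)]
    if fault_at is not None:
        c[fault_at] ^= 1              # model a single faulty cell
    return c

def predicted_parity(a, b_ext):
    """q = XOR_i a_i & (XOR_k b_ext[i+k]): the same sum, regrouped by input bit."""
    m = len(a)
    return parity(a[i] & parity(b_ext[i + k] for k in range(m)) for i in range(m))

def check(a, b_ext, fault_at=None):
    """Return p: 0 if the parities agree, 1 if a fault is flagged."""
    r = parity(multiply_rows(a, b_ext, fault_at))
    q = predicted_parity(a, b_ext)
    return r ^ q

if __name__ == "__main__":
    a = [1, 0, 1, 1]                  # a_0..a_3
    b_ext = [1, 1, 0, 1, 0, 0, 1]     # b_0..b_6, including feedback bits
    print(check(a, b_ext), check(a, b_ext, fault_at=2))
```

A single flipped output always changes r and is therefore flagged, whereas an even number of flipped outputs leaves r unchanged; this is the usual limitation of a single parity bit.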
DELAY:
As the architecture is pipelined, the path delay of each stage is the same, except for the last stage, which has the maximum path delay. For an m-bit architecture, this delay is $T_d = 2m\,T_{XOR} + T_{AND}$. In our example in Fig. 1, where m = 4, the path delay is $T_d = 8\,T_{XOR} + T_{AND}$.
a) Simulation Result
We have modeled our proposed architecture in VHDL. The design was simulated in "ModelSim XE III 6.3c", and the functionality of the multiplier was checked for different values of m. Physical synthesis and place and route were done using Magma Design Automation EDA tools, based on Austria Microsystems 0.35 µm technology. The post-CTS, post-detailed-route layout of the design for GF(2^4) is shown in Fig. 5.
V. CONCLUSIONS
The paper presented a fast, error-tolerant, dual-basis bit-parallel systolic multiplier architecture over GF(2^m), which can be pipelined and which requires less hardware than the multiplier architecture proposed earlier. The proposed multiplier can also operate over both the dual basis and the polynomial basis, and its longest delay path is shorter than that of the architecture presented earlier. A simple and efficient error detection procedure using parity checking has been incorporated with some additional AND-XOR gates.
