A new systolic serial-parallel scheme that implements the Montgomery multiplier is presented. The serial input of this multiplier consists of two sets of data that enter in a bit-interleaved form. The results are also derived in the same form. The design, with minor modifications, can be used for the implementation of the RSA algorithm. The circuit yields low hardware complexity and permits high-speed operation with 100% efficiency.
Introduction
The core of an RSA [1] cryptosystem is modular exponentiation, which can be decomposed into a sequence of modular multiplications and squarings. Because of the operand length (>512 bits), these operations have to be performed in a serial, pipelined way. The most efficient algorithm for modular multiplication was presented by Montgomery [2]. One approach [3], [4] proposes a direct implementation of the Montgomery scheme using two similar circuits: one for multiplication and one for squaring. However, it suffers from a large combinational delay. Another approach [5], [6] realizes the modular multiplication and squaring in two discrete stages: pure product generation and modular reduction. In this approach, the combinational delay is halved, more than doubling the performance.
In this paper, a new implementation of the Montgomery multiplier is presented, which follows the direct approach while achieving higher performance. The circuit is modified so that it realizes both the modular multiplication and the squaring in a bit-interleaved form. The modular exponentiation takes approximately $2n^2$ clock cycles, with the lowest hardware complexity per bit reported so far.
The Montgomery Multiplier
The Montgomery algorithm is presented below:
(Inputs)
Modulus: $N$ ($n$-bit integer)
Multiplier: $B$ ($n$-bit integer); $B = b_{n-1}, b_{n-2}, \ldots, b_0$
Multiplicand: $A$ ($n$-bit integer); $A = a_{n-1}, a_{n-2}, \ldots, a_0$
$q_{i+1} = P \bmod 2$;
End; {For}
Given that N is an odd number we define
. Thus, (1) can be rewritten as follows:
At the ith step, the term
is computed in the upper part of the circuit of Fig. 1b, while the results are shifted and accumulated in the lower part according to (3). The $q_i$ values are derived serially during the first $n$ cycles, while the modular product $P$ is produced during the next $n$ cycles. The systolic operation of this circuit requires the serial data $b_i$ to be interleaved with zeros. Because of the internal pipelining, the feedback of $q_i$ is delayed by two clock cycles. The zero-bit interleaving synchronizes $q_i$ with the next iteration of (3). The Montgomery product $P$ is derived in the same bit-interleaved way. However, the idle time slot can be exploited to compute the modular product of a second number. In this manner, two modular product bits are generated in successive clock cycles without interference between their intermediate results.
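The bit-interleaved serial format described above can be illustrated with a small sketch. The helper names (`to_bits_lsb_first`, `interleave`, `deinterleave`) are hypothetical, and the least-significant-bit-first ordering is an assumption made for illustration; the text does not state the bit order on the serial line.

```python
def to_bits_lsb_first(value, n_bits):
    """Split an integer into n_bits bits, least significant bit first (assumed order)."""
    return [(value >> i) & 1 for i in range(n_bits)]


def interleave(stream_x, stream_y):
    """Alternate the bits of two equally long streams: x0, y0, x1, y1, ..."""
    out = []
    for x, y in zip(stream_x, stream_y):
        out.extend((x, y))
    return out


def deinterleave(stream):
    """Recover the two streams that occupy the even and odd time slots."""
    return stream[0::2], stream[1::2]


n_bits = 4
b_bits = to_bits_lsb_first(0b1011, n_bits)

# Single operand: the second time slot is idle, i.e. filled with zeros.
with_zeros = interleave(b_bits, [0] * n_bits)

# Two operands: the idle slot carries the bits of a second number.
c_bits = to_bits_lsb_first(0b0110, n_bits)
two_operands = interleave(b_bits, c_bits)
```

In this model each operand still sees one of its bits every second clock cycle, which is the spacing the two-cycle feedback delay of $q_i$ calls for.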
In each multiplication cycle, the control line R is fed with two traveling 'ones', which enable the downloading of two interleaved Montgomery products into a register via a multiplexer in each cell.
The carry generated by the upper part of the (n-1)th cell must be added to the carry of the lower part in an extra Full-Adder, as shown in Fig. 1b.
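As a software illustration of the arithmetic in this section, the following is a minimal sketch of a bit-serial Montgomery product in which the quotient bit is taken from the previous partial result, in line with $q_{i+1} = P \bmod 2$. The exact form of the recurrence is an assumption: the constant `n_half` $= (N+1)/2$, the extra $(n+1)$th iteration and the final `%` reduction are choices made here for the sketch, not details confirmed by the text, and the function name is hypothetical. It models only the arithmetic, not the systolic timing.

```python
def montgomery_product(a, b, n_mod, n_bits):
    """Return (a * b * 2**-n_bits) mod n_mod using a bit-serial recurrence.

    Assumes n_mod is odd, 0 <= a < n_mod and 0 <= b < 2**n_bits.
    """
    n_half = (n_mod + 1) >> 1  # assumed constant (N + 1) / 2; an integer since N is odd
    p, q = 0, 0
    # n_bits + 1 iterations: the last one sees b_n = 0 and only halves P.
    for i in range(n_bits + 1):
        b_i = (b >> i) & 1
        # With q = P mod 2, (P div 2) + q * (N + 1) / 2 equals (P + q * N) / 2,
        # so the halving is exact modulo N.
        p = (p >> 1) + q * n_half + b_i * a
        q = p & 1
    # Software shortcut: a hardware design would use conditional subtraction instead.
    return p % n_mod


# Example with an 8-bit odd modulus (values chosen arbitrarily).
product = montgomery_product(100, 57, 239, 8)
```

Because the quotient bit depends only on the previous value of `p`, it is available before the current addition finishes, which is the property that lets the feedback be pipelined.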
The Montgomery Exponentiator
The RSA algorithm can be implemented with the use of the square-and-multiply scheme.
(Inputs)
Message: $M$ ($n$-bit integer)
Encryption Key: $E$ ($e_{n-1}, \ldots, e_0$)
(Output)
The interleaved computation of two Montgomery products in two consecutive time slots, presented above, is well suited to this algorithm, which requires one multiplication and one squaring per iteration. The first slot can be used for the modular squaring $(A^2 \cdot 2^{-n}) \bmod N$, while the second is used for the modular multiplication $(A \cdot B \cdot 2^{-n}) \bmod N$. The squaring result $A$ is produced in both serial and parallel form. The parallel form is latched and used at the next iteration as the parallel input for both operations. The latching is controlled by the R signal. The new cell is shown in Fig. 2a.
The initial value of the latches is $M$. The P line, which carries the interleaved bits $a_i$, $b_i$ of $A = (A^2 \cdot 2^{-n}) \bmod N$ and $B = (A \cdot B \cdot 2^{-n}) \bmod N$ respectively, is redirected through a multiplexer into the serial input of the multiplier for the next iteration. This multiplexer also permits the input of the initial value of $B$. The encrypted message is obtained after $2n^2$ clock cycles as the final value of $B$.
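The square-and-multiply flow of this section can be sketched in software as follows. This is a minimal sketch under stated assumptions: the exponent is scanned from its least significant bit, so that the squared value $A$ feeds both operations of the next iteration, and the operands are moved into and out of Montgomery form in software. Neither the conversion steps nor the function names `mont_mul` and `mont_exp` come from the text. `mont_mul` is a plain arithmetic reference for $(A \cdot B \cdot 2^{-n}) \bmod N$, not a model of the circuit, and it needs Python 3.8 or later for `pow` with a negative exponent.

```python
def mont_mul(a, b, n_mod, n_bits):
    """Reference Montgomery product (a * b * 2**-n_bits) mod n_mod."""
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod


def mont_exp(m, e, n_mod, n_bits):
    """Compute m**e mod n_mod with one squaring and one multiplication per bit of e."""
    r = pow(2, n_bits, n_mod)
    a = (m * r) % n_mod  # M in Montgomery form (assumed pre-processing step)
    b = r                # 1 in Montgomery form
    for i in range(n_bits):
        e_i = (e >> i) & 1
        if e_i:
            # Second time slot: modular multiplication B = A * B * 2^-n mod N.
            b = mont_mul(a, b, n_mod, n_bits)
        # First time slot: modular squaring A = A^2 * 2^-n mod N.
        a = mont_mul(a, a, n_mod, n_bits)
    # Leave Montgomery form with a final multiplication by 1 (assumed post-processing).
    return mont_mul(b, 1, n_mod, n_bits)


# Small example with a 12-bit odd modulus (values chosen for illustration).
cipher = mont_exp(65, 17, 3233, 12)
```

With $n$ iterations and the squaring and multiplication of each iteration sharing one interleaved pass of about $2n$ clock cycles, the total is consistent with the approximately $2n^2$ cycles stated above.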
Conclusions
The circuit of Fig. 2 is systolic and operates with 100% efficiency, interleaving multiplication and squaring on a bit basis, while the maximum combinational delay equals that of a gated Full-Adder ($T_c$). Using the proposed design for both squaring and multiplication permits operation on large operands, i.e. over 1024 bits. The critical path delay of [4] and [5] comprises two Full-Adders and some control logic; it is therefore normalized to $2T_c$. Additionally, the architecture of [6] does not include the control circuit needed to realize the RSA algorithm. An overall comparison in terms of hardware complexity ($H$), the time required for a full exponentiation ($T_{exp}$) and performance ($C_p = H \cdot T_{exp}$) is given in Table 1. The proposed design is approximately 2 and 3 times more efficient than [4] and [5], respectively. Compared to [6], our circuit's performance is about 20% higher. This is due to the direct implementation of the Montgomery algorithm, which yields a decrease in the circuit's complexity equal to 19 gates per bit.
