A new elliptic curve cryptographic processor architecture is presented that results in a considerable reduction in power consumption and offers a range of trade-offs between speed and power consumption. This is achieved by exploiting the inherent parallelism that exists in elliptic curve point addition and doubling. A further trade-off is achieved by using digit serial-parallel multipliers instead of the serial-serial multipliers used in conventional architectures. In effect, the new architecture exploits parallelism at the algorithm level as well as at the arithmetic element level. This parallelism can be exploited either to increase the speed of operation or to reduce power consumption by reducing the frequency of operation and hence the supply voltage.
INTRODUCTION
Elliptic curve cryptosystems (ECC) have received considerable attention from mathematicians around the world ever since the original proposal by N. Koblitz and V. Miller in 1985 [1] [2] [3] [4] [5] [6] [7] [8]. ECC is based on the discrete logarithm problem over the points on an elliptic curve. To date, no significant breakthroughs have been made in determining weaknesses in the algorithm. Because the problem appears so difficult to crack, key sizes can be reduced considerably, even exponentially [2, 7], compared to the key sizes used by other cryptosystems. This has made ECC a challenger to RSA, one of the most popular public key methods. Although critics are still skeptical as to the reliability of this method, several encryption techniques have been developed recently using the properties of elliptic curves.
Several cryptographic processors have been proposed in the literature recently [4, 5, 12]. The conventional approach in the design of these processors is to adopt serial computation both at the algorithmic level, by using a single multiplier, and at the arithmetic level, by using a serial multiplier. The reason for sequential operation is that it leads to low area for the large word lengths needed for secure encryption (i.e. > 160 bits [8]). This classical approach can reduce area; however, the constraint of current technology is not gate count but power consumption. Reducing area is not necessarily the best approach to reducing power consumption.
In this paper, a power-time flexible architecture is proposed that exploits the parallelism inherent at both the algorithmic level and the arithmetic level of ECC. This is contrary to existing designs [4, 5, 12], which opt for sequential operations to minimize area. It is strongly believed that exploiting both levels of parallelism leads to an even better trade-off between time and power consumption.
ELLIPTIC CURVES OVER GF(2^k)
It is assumed that the reader is familiar with arithmetic over elliptic curves; for a good review, the reader is referred to [8], which covers the elliptic curve equation over GF(2^k). Elliptic curve cryptography requires repeated point addition and doubling operations [2, 6, 8]. In affine coordinates, any point operation over an elliptic curve requires an inversion, which is the most expensive operation over GF(2^k) [1, 8, 12]. A common approach is to eliminate the need for inversion by representing the elliptic curve points in projective coordinates [1, 4, 6, 8, 12]. This replaces the inversion with several multiplication operations. This approach is also adopted in the processor proposed here.
GF(2^k) ECC POINT OPERATIONS OVER PROJECTIVE COORDINATES
To eliminate the need for performing inversion in GF(2^k), the affine coordinates (x, y) of a point are projected to (X, Y, Z), where x = X/Z^2 and y = Y/Z^3. The projected elliptic curve equation becomes:
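As a worked sketch of this projection, assume the affine curve has the standard non-supersingular form over GF(2^k) with coefficients a and b. That form is an assumption here, since the affine equation is not reproduced in this text. Substituting x = X/Z^2 and y = Y/Z^3 and clearing denominators gives:

```latex
% Assumed affine curve over GF(2^k) (standard non-supersingular form):
\[
  y^2 + xy = x^3 + a x^2 + b , \qquad b \neq 0 .
\]
% Substitute x = X/Z^2 and y = Y/Z^3:
\[
  \frac{Y^2}{Z^6} + \frac{XY}{Z^5} = \frac{X^3}{Z^6} + \frac{a X^2}{Z^4} + b .
\]
% Multiply both sides by Z^6:
\[
  Y^2 + XYZ = X^3 + a X^2 Z^2 + b Z^6 .
\]
```

Every term is now a polynomial in X, Y and Z, so a point can be carried through repeated operations without any field inversion until a final conversion back to (x, y).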
The formulas for projective point addition of two elliptic curve points are as follows:
The formulas for projective point doubling of P are given by:
The complete data flow graph for doubling a point is shown in Figure 1. It requires ten multiplications and four k-bit XOR operations. Figure 2 shows the data flow graph for adding two elliptic curve points. It requires twenty multiplications and seven k-bit XOR operations. Under the binary method for computing nP, any elliptic curve crypto processor that uses projective coordinates must implement the dataflow graphs in Figures 1 and 2 iteratively.
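The iteration implied by the binary method can be sketched as a left-to-right double-and-add loop. This is a minimal illustration, not the processor's control logic: the function name `binary_scalar_mult` and the `double`/`add` callables are hypothetical, and the demo uses plain integers as a stand-in group so that the result can be checked by hand.

```python
def binary_scalar_mult(n, point, double, add):
    """Compute n*point with the left-to-right binary (double-and-add) method.

    `double` and `add` are caller-supplied group operations (hypothetical
    stand-ins for the doubling and addition dataflow graphs).
    Returns (result, number_of_doublings, number_of_additions).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    result = point            # accounts for the leading 1 bit of n
    doublings = additions = 0
    for bit in bin(n)[3:]:    # remaining bits of n, most significant first
        result = double(result)
        doublings += 1
        if bit == "1":
            result = add(result, point)
            additions += 1
    return result, doublings, additions


# Stand-in group: integers under addition, so n*P is ordinary multiplication.
result, n_dbl, n_add = binary_scalar_mult(
    13, 5, double=lambda p: 2 * p, add=lambda p, q: p + q
)
```

For n = 13 (binary 1101) the loop performs three doublings and two additions, which is why both dataflow graphs are exercised repeatedly for every scalar multiplication.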
PROPOSED ARCHITECTURE
The architecture of the new processor is shown in Figure 3. The new architecture has the following features:
• It has three digit serial-parallel multipliers,
• It can perform a multiply-add operation in the same instruction,
• It has a power management unit.
The basic motivation behind the design of the proposed architecture is to exploit, as much as possible, the full parallelism that exists in ECC. The trade-off between power and time can be achieved by reducing the clock frequency and consequently the supply voltage. It is well known that reducing the supply voltage is the most effective means of reducing power consumption. In the design proposed here, this is exploited at the algorithmic level by using more than one multiplier. It is also exploited at the arithmetic element level by using serial-parallel multipliers such as those reported in [10, 11] rather than the conventional approach of using a serial multiplier. The benefits of parallel implementation of ECC on power are discussed in more detail in section 7.
The reason for using only three multipliers is as follows. From Figures 1 and 2, the critical paths of the doubling and addition dataflow graphs are effectively 5 and 7 GF(2^k) multiplications, respectively. Here the time for GF(2^k) addition is ignored since it is negligible compared to multiplication. Therefore, the lower bound on the computation time for one elliptic point operation in the calculation of nP (a doubling followed by an addition) is 12 GF(2^k) multiplications.
Figure 1. Data flow graph for doubling an elliptic curve point
It can easily be seen from Figures 1 and 2 that performing three multiplications in parallel meets this lower bound, and that any further concurrent multiplications will not reduce the computation time. It should also be noted that the utilization of the three multipliers is very high. As can be seen from Figures 1 and 2, all three multipliers are used in eight of the 12 steps, and only a single multiplier is used in just two of the 12 steps.
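The argument for three multipliers can be checked with a small calculation. The sketch below uses only the multiplication counts and critical-path lengths stated in the text. The bound max(critical path, ceil(multiplications / multipliers)) is a generic scheduling lower bound, and treating it as achievable for every multiplier count is an assumption. The function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

# (number of multiplications, critical-path length) as stated in the text
DOUBLING = (10, 5)
ADDITION = (20, 7)


def min_steps(num_mults, critical_path, multipliers):
    """Lower bound on multiplication steps for one dataflow graph."""
    return max(critical_path, ceil(num_mults / multipliers))


def steps_per_double_and_add(multipliers):
    """Lower bound on steps for one doubling followed by one addition."""
    return (min_steps(*DOUBLING, multipliers)
            + min_steps(*ADDITION, multipliers))


def utilization(multipliers):
    """Fraction of multiplier slots doing useful work at the lower bound."""
    total_mults = DOUBLING[0] + ADDITION[0]
    return Fraction(total_mults,
                    multipliers * steps_per_double_and_add(multipliers))


bounds = {m: steps_per_double_and_add(m) for m in (1, 2, 3, 4)}
```

Under these assumptions the bound falls from 30 steps with one multiplier to 12 with three and stays at 12 with four, and three multipliers are busy in 30 of 36 slots. That total matches the figures in the text if the two remaining steps each use two multipliers (8 x 3 + 2 x 2 + 2 x 1 = 30), which is an inference rather than something the text states.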
The advantage of performing a multiply-add operation in one instruction is that the dataflow graphs in Figures 1 and 2 contain computations where the outputs of two multipliers must be added. Such a feature circumvents the need to store these values back in the registers and fetch them again for their subsequent addition. This saves both cycles and power.
The purpose of the power management unit is to ensure that the power consumption of blocks that are not used is kept to a minimum. This is achieved by clock-gating the registers of these blocks and ensuring that the logic in these blocks is static. There are two possible cases where blocks are not used. The first is when not all three multipliers are used, and the second is when the application wordlength is less than the wordlength of the processor.
Figure 2. Data flow graph for adding two points
PERFORMANCE COMPARISONS
Serial-Parallel Vs. Serial Multiplication
In the crypto processor presented here, we also propose to use GF(2^k) digit serial-parallel multipliers such as those reported in [10, 11]. These digit serial-parallel structures lead to a much better trade-off between power and time. Given N as the wordlength and M as the number of digits, each of size N/M, Table 1 compares the serial-parallel multiplier with a serial one of the same digit size.
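To make the digit serial-parallel idea concrete, the following sketch multiplies two GF(2^k) elements by consuming one operand a digit at a time while the other operand is applied in full at every step, so a multiplication takes M steps for M digits. It is a behavioural model only, not the hardware structures of [10, 11]. The function names are hypothetical, and the field polynomial in the example (the AES polynomial for GF(2^8)) is used purely as a well-known test vector.

```python
def reduce_poly(value, poly, k):
    """Reduce a GF(2) polynomial (stored as an int) modulo `poly` of degree k."""
    for bit in range(value.bit_length() - 1, k - 1, -1):
        if (value >> bit) & 1:
            value ^= poly << (bit - k)
    return value


def digit_serial_parallel_mul(a, b, poly, k, digit_bits):
    """Multiply a*b in GF(2^k), taking b one digit per step (top digit first).

    Operand `a` is used in parallel (all k bits) at every step.
    Returns (product, steps), where steps == ceil(k / digit_bits).
    """
    num_digits = -(-k // digit_bits)          # ceil(k / digit_bits)
    mask = (1 << digit_bits) - 1
    acc, steps = 0, 0
    for i in reversed(range(num_digits)):
        digit = (b >> (i * digit_bits)) & mask
        partial = 0
        for j in range(digit_bits):           # carry-free product a * digit
            if (digit >> j) & 1:
                partial ^= a << j
        # Horner step: acc * x^digit_bits + a * digit, reduced mod poly
        acc = reduce_poly((acc << digit_bits) ^ partial, poly, k)
        steps += 1
    return acc, steps


# GF(2^8) with x^8 + x^4 + x^3 + x + 1, split into M = 4 digits of 2 bits
product, steps = digit_serial_parallel_mul(0x57, 0x83, 0x11B, 8, 2)
```

In the notation of the text, `k` plays the role of the wordlength N, `digit_bits` is the digit size N/M, and the returned step count is M.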
Since power is P = f C V_S^2, and assuming that V_S = k f_o, where f_o is the maximum operating frequency for the given V_S, then at f = f_o, P = k^2 f_o^3 C.
Given that serial-parallel computation requires M cycles compared to M^2 cycles for the serial multiplier, the clock frequency of the serial-parallel multiplier can be reduced by a factor of M for the same execution time. As can be clearly seen from Table 1, operating the serial-parallel multiplier at a clock frequency of f_o/M reduces power by a factor of M^2. A further advantage of the proposed architecture is that it gives the designer the flexibility to operate at a higher clock frequency, up to f_o, at the expense of higher power consumption. This demonstrates the advantage of digit serial-parallel computation. The GF(2^k) modulo adder is to be implemented in bit-parallel fashion, since its area is not significant compared to the multiplier and minimizing the addition time reduces the overall multiply-add cycle time.
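The M^2 figure can be reproduced from the power model with a short calculation. The sketch below assumes that the supply voltage scales linearly with clock frequency, as in the text, and additionally assumes that the serial-parallel multiplier switches M times the capacitance of the serial one. That second assumption is not taken from Table 1, which is not reproduced here. It is chosen because it is the scaling under which the model yields the stated factor. All names are hypothetical.

```python
from fractions import Fraction


def dynamic_power(freq, cap, k=1):
    """P = f * C * V^2 with V = k * f, i.e. P = k^2 * f^3 * C."""
    return k ** 2 * freq ** 3 * cap


def power_reduction_factor(M, f_o=Fraction(1), cap=Fraction(1)):
    """Serial power divided by serial-parallel power at equal execution time.

    Serial multiplier:          M^2 cycles at f_o,    capacitance C.
    Serial-parallel multiplier: M cycles   at f_o/M,  capacitance M*C
                                (assumed scaling, see lead-in).
    Both take M^2 / f_o seconds per multiplication.
    """
    serial = dynamic_power(f_o, cap)
    serial_parallel = dynamic_power(f_o / M, M * cap)
    return serial / serial_parallel


factors = {M: power_reduction_factor(M) for M in (2, 4, 8)}
```

Exact fractions are used so that the ratio comes out as M^2 without rounding. Lowering the frequency alone would give a factor of M^3, and the assumed M-fold increase in switched capacitance brings it back to M^2.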
Parallel Vs Sequential Implementation
The power consumption of using three multipliers is compared with that of using a single multiplier and two multipliers in Figure 4 for different execution times. Here time is computed as time = (number of cycles) / f_o. The same assumptions made for Table 1 apply.
In existing designs [4, 5], a single multiplier is used to perform all the multiplications needed in Figures 1 and 2. The reason is that using more than one multiplier is perceived to be too expensive. However, as can be seen from Figure 4, the proposed architecture leads to much lower power consumption than using one or two multipliers for the same execution time.
It is also clear that using three multipliers gives a wider range of trade-offs between power and speed. In fact, the case of using two multipliers does not provide any advantage over the other two options. Finally, the proposed architecture can support a further reduction in power by switching to single-multiplier operation when required. In this case the power management unit simply ensures that the other two multipliers do not consume any dynamic power.
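The direction of this comparison can be illustrated with a rough model. It is a sketch under stated assumptions and does not reproduce Figure 4. The cycle counts per doubling-plus-addition are 30 for one multiplier (10 + 20 multiplications) and 12 for three, as in the text, while the value of 15 for two multipliers is an assumed lower bound. Switched capacitance is assumed proportional to the number of multipliers, and power is taken as proportional to C f^3. The names are hypothetical.

```python
from fractions import Fraction

# Multiplication steps per doubling-plus-addition for each multiplier count.
# 30 and 12 follow from the text; 15 is an assumed lower bound.
CYCLES = {1: 30, 2: 15, 3: 12}


def relative_power(multipliers, exec_time=Fraction(1)):
    """Power (arbitrary units) needed to finish in `exec_time`.

    The clock must run at f = cycles / exec_time, capacitance is assumed
    proportional to the multiplier count, and P is proportional to C * f^3.
    """
    freq = Fraction(CYCLES[multipliers]) / exec_time
    return multipliers * freq ** 3


power = {m: relative_power(m) for m in CYCLES}
saving_vs_single = power[3] / power[1]
```

Under these assumptions three multipliers need roughly a fifth of the power of a single multiplier for the same execution time, because the lower clock frequency outweighs the extra switched capacitance.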
CONCLUSION
A novel GF(2^k) elliptic curve crypto processor is proposed in this paper. The new architecture results in a considerable reduction in power consumption and offers users a range of trade-offs between power and time. The basic feature of the new architecture is that it exploits the inherent parallelism in the computation of doubling and adding points over an elliptic curve, as well as in multiplication. Performance evaluation shows a considerable advantage over sequential implementation in terms of power consumption and time.
