ABSTRACT This paper proposes a novel scalable digit-serial inverter structure with low space complexity that performs inversion in $GF(2^m)$ based on a previously modified extended Euclidean algorithm. The structure is suitable for a fixed-size processor that reuses the same core and does not require the core size to be changed when $m$ is modified. The structure is obtained by applying a nonlinear methodology that gives the designer more flexibility to control the processing element workload and also reduces the overhead of communication between processing elements. Implementation results for the proposed scalable design and previously reported efficient designs show that the proposed structure reduces area by 83.0% to 88.3% and energy by 75.0% to 85.0% relative to those designs, at the cost of lower throughput. This makes the proposed design more suitable for constrained implementations of cryptographic primitives in ultra-low power devices such as wireless sensor nodes and radio frequency identification (RFID) devices.
I. INTRODUCTION AND RELATED WORK
In resource-constrained platforms, implementing public key cryptosystems (PKC) is a challenge due to the limitations on area and power consumption [1]. Compared to other PKC algorithms, elliptic curve cryptography (ECC) algorithms offer the same level of security with smaller key sizes, which has led to their use in resource-constrained applications. Many recent hardware implementations of ECC meet the area, energy, and timing limitations of these applications [2]-[5]. These implementations concentrate mainly on the efficient implementation of field multiplication and field inversion, as these are the most costly operations in ECC. Field inversion is much slower and consumes more power than field multiplication. Thus, improving the performance of the inversion operation improves the overall performance of the ECC system.
The systolic architectures for binary field inversion can be classified into three basic types. The first type is composed of two-dimensional arrays of processing elements (PEs) and has an area complexity of $O(m^2)$ [6], [7]. These architectures are more suitable for high-throughput applications that require small values of $m$, but they are not suitable for applications that require large values of $m$, because the high area complexity makes implementing them on a single chip infeasible. The second type consists of one-dimensional arrays of PEs and has a low area complexity of $O(m)$. These architectures include folded bit-parallel architectures [8], bit-serial architectures [8], [9], and in-place bit-parallel architectures [8], [10]. The throughput of these architectures may be too low for some real-time applications. The third type is the digit-serial architecture [9], [11], [12], which trades off area complexity against throughput in its circuit implementation. A digit-serial architecture with a digit size of $d$ bits has an area complexity of $O(dm)$. Different values of $d$ yield different throughputs.
In this paper, we propose a scalable digit-serial architecture, suitable for resource-constrained devices, that performs inversion in $GF(2^m)$ based on a previously modified extended Euclidean algorithm. This architecture is composed of a one-dimensional array of PEs and has a low area complexity of $O(T)$, where $T$ is the number of PEs in the systolic array. The architecture is obtained by applying a nonlinear technique, proposed by the authors in [13] and [14], to the inversion algorithm.
The paper is organized as follows: Section II discusses the adopted extended Euclidean-based finite field inversion algorithm over $GF(2^m)$. Section III shows how to parallelize this algorithm using nonlinear data scheduling and projection techniques. Section IV discusses the proposed scalable architecture. Section V discusses the complexity of the proposed design and compares it to previous work. Finally, Section VI provides the conclusions of this work.
II. FINITE FIELD INVERSION
The finite field $GF(2^m)$ can be defined using the irreducible polynomial:
where
A field element $A$ in $GF(2^m)$ can be represented by the polynomial:
$$A(x) = \sum_{i=0}^{m-1} a_i x^i$$
where $a_i \in GF(2)$ for $0 \le i < m$.
Suppose a polynomial Â in GF(2) represents the multi-
The most commonly used inversion algorithms are based on Fermat's little theorem, extended Euclidean algorithm (EEA), and Gaussian elimination. In practice, EEA is mostly used to carry out inversion.
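As an illustration of EEA-based inversion, the following Python sketch inverts an element of $GF(2^m)$ using the classical polynomial extended Euclidean algorithm with explicit polynomial division. It is not the modified algorithm of [15] and not the proposed architecture. The function names and the integer bit-vector encoding (bit i of an integer holds the coefficient of $x^i$) are assumptions made for this example.

```python
def gf2_poly_divmod(a: int, b: int) -> tuple:
    """Divide polynomial a by polynomial b over GF(2); return (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient = 0
    deg_b = b.bit_length() - 1
    # Cancel the leading term of a until its degree drops below deg(b).
    while a.bit_length() - 1 >= deg_b:
        shift = (a.bit_length() - 1) - deg_b
        quotient ^= 1 << shift
        a ^= b << shift
    return quotient, a


def gf2_poly_mul(a: int, b: int) -> int:
    """Carry-less (GF(2)) product of two polynomials, without modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2m_mul(a: int, b: int, q_poly: int) -> int:
    """Product of a and b reduced modulo the field polynomial q_poly."""
    return gf2_poly_divmod(gf2_poly_mul(a, b), q_poly)[1]


def gf2m_inverse(a: int, q_poly: int) -> int:
    """Multiplicative inverse of a modulo the irreducible polynomial q_poly."""
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    # Invariant: t0 * a = r0 (mod q_poly) and t1 * a = r1 (mod q_poly).
    r0, r1 = q_poly, a
    t0, t1 = 0, 1
    while r1 != 1:
        quot, rem = gf2_poly_divmod(r0, r1)
        r0, r1 = r1, rem
        # Subtraction over GF(2) is XOR.
        t0, t1 = t1, t0 ^ gf2_poly_mul(quot, t1)
    return gf2_poly_divmod(t1, q_poly)[1]
```

For example, in $GF(2^3)$ with the irreducible polynomial $x^3 + x + 1$ (encoded as `0b1011`), `gf2m_inverse(0b010, 0b1011)` returns `0b101`, since $x(x^2 + 1) = x^3 + x \equiv 1$. The long division inside the loop is the step that the modified algorithm of [15] is designed to avoid.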
Yan et al. [15] proposed a modified EEA-based inversion algorithm that avoids the long division needed in each iteration of the conventional EEA-based inversion algorithm by replacing the degree comparison with a ring counter. This algorithm computes four intermediate polynomials, R, S, Y, and H.
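[Algorithm 1: bit-level modified EEA-based inversion algorithm of Yan et al. [15]. Only a fragment of the listing survives: an "end if" followed by step 17, a loop over j from m − 1 down to 0.]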
Since the leading coefficient of $Q(x)$ is always equal to 1, this bit does not need to be computed or stored, as mentioned in [15]. Thus, the coefficients of $Q(x)$ can be stored in a register of size $m$. Algorithm 1 is the bit-level version of the modified EEA-based inversion algorithm of Yan et al. [15]. In this algorithm, the terms $r^i_j$, $s^i_j$, $y^i_j$, and $h^i_j$ represent the $j$-th bit of R, S, Y, and H at iteration $i$, respectively.
III. PARALLELIZATION OF THE INVERSION ALGORITHM
The ranges of the $i$ and $j$ indices of Algorithm 1 define a set of points in a convex hull $D$ in the 2-D integer space, i.e., $D \subset \mathbb{Z}^2$ [13]. In this algorithm, there are two input variables [...] Figure 1 at the bottom of the gray nodes column. The bits $c1^i$ and $c2^i$ are generated inside the right-most nodes (gray nodes) in each row $i$ and broadcast to the remaining nodes in the same row. This is indicated by the horizontal lines in Figure 1. Also, the $sign^i$ bit is generated inside the right-most nodes (gray nodes) using iteration steps 13 and 15. This is indicated by the vertical line in the right-most column in Figure 1. Each point in the dependence graph (DG) of Figure 1 is assigned a time value $t(p)$ using the timing function $S$ and a parameter $T$, which represents the number of nodes to be computed at the same time step.
where the terms $\lfloor ph/T \rfloor$ and $\lceil ph/T \rceil$ represent the floor and ceiling functions, respectively, and 'ph' is a placeholder for the argument. The node timing function $S$ is shown in Figure 2 for $m = 3$ and $T = 2$. Nodes that have the same time index are indicated by the light blue areas, and the value in each light blue area is the time index. This time index depends on the values of both the $i$ and $j$ indices. When $m$ is not an integer multiple of $T$, the number of nodes processed at each time step is not constant. To have a constant number of nodes processed at each time step, $m$ should be an integer multiple of $T$. Thus, the value of $m$ should be increased to $m'$ using the following relation:
$$m' = T \lceil m/T \rceil, \qquad \mu = m' - m$$
For the case when $m = 3$ and $T = 2$, we get $m' = 4$ and $\mu = 1$, as shown in the figure. We chose to pad the least significant bits of R, S, H, and Y with $\mu$ zeros to make $m'$ an integer multiple of $T$. This is indicated by the dark blue nodes shown in Figure 2.
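The padding rule can be sketched in a few lines of Python. The sketch assumes that the padded field size is the smallest multiple of $T$ that is not less than $m$, which agrees with the $m = 3$, $T = 2$ example above (padded size 4, $\mu = 1$). The function name `padded_field_size` is hypothetical.

```python
def padded_field_size(m: int, T: int) -> tuple:
    """Return (padded size, mu): the smallest multiple of T >= m and the number of padding zeros."""
    if m <= 0 or T <= 0:
        raise ValueError("m and T must be positive integers")
    blocks = -(-m // T)      # integer ceiling of m / T
    m_padded = T * blocks    # assumed padded field size
    mu = m_padded - m        # number of zeros appended to R, S, H, and Y
    return m_padded, mu
```

With the synthesis parameters used later in the paper ($m = 233$, $T = 64$), this rule gives a padded size of 256 and $\mu = 23$.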
Using the nonlinear scheduling function $S$ represented by Eq. (4), we can control the workload per time step and the number of time steps required to complete the inversion operation. In this case, the workload is equal to $T = 2$ and the inversion operation requires $(2m - 1)(\lceil m/T \rceil + 1)$ time steps to complete. Now, we need to map several nodes of the DG onto a single node of the resulting systolic array. The projection technique discussed in [13], [14], [16], [17] can be used to perform this mapping. The third author of this paper explained in [13] how to carry out the projection operation using a projection matrix $P$. Since our algorithm is two-dimensional, $P$ reduces to a row vector. The valid projection matrices associated with the scheduling function $S$ are as follows:
Each scalable systolic array configuration is associated with one projection matrix; therefore, the processor design space allows for two scalable systolic designs. The scalable systolic array related to the projection matrix $P_2$ has a high control complexity and is not suitable for VLSI implementation. Thus, this design is not considered further in this paper. In the following section, we investigate the scalable systolic array related to the projection matrix $P_1$.
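The time-step count stated in this section for the scheduling function $S$ can be evaluated directly. The sketch below assumes that the lost bracket in the count is a ceiling, i.e. $(2m - 1)(\lceil m/T \rceil + 1)$, which matches the per-PE block length used in Section IV. The function name is hypothetical.

```python
def inversion_time_steps(m: int, T: int) -> int:
    """Time steps for one inversion, assuming the count (2m - 1) * (ceil(m / T) + 1)."""
    if m <= 0 or T <= 0:
        raise ValueError("m and T must be positive integers")
    blocks = -(-m // T)  # integer ceiling of m / T
    return (2 * m - 1) * (blocks + 1)
```

Under this reading, the $m = 3$, $T = 2$ example takes $5 \times 3 = 15$ time steps. Increasing $T$ shortens the schedule and adds PEs, which is the workload control that the nonlinear scheduling provides.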
IV. THE PROPOSED SCALABLE DESIGN FOR THE INVERSION ALGORITHM
Using projection matrix $P_1$, a point $p \in D$ will be mapped onto the point:
where $\mu = T \lceil m/T \rceil - m$, and $\delta = 0$
The scalable systolic array design resulting from this mapping, when $m = 3$ and $T = 2$, is shown in Figure 3. The number of PEs is $T + 1$. Thus, the required number of PEs depends only on the value of $T$ and not on the field size $m$. Figure 4(a) shows the details of the control processing element $PE_T$. Figure 4(b) shows the details of $PE_j$. Each PE processes $\lceil m/T \rceil = m'/T$ bits, where $m'$ is given in Eq. (7), and works on one bit at each clock cycle. The processing elements $PE_j$, $0 \le j \le T - 2$, store $m'/T$ bits for S, R, Y, and H, as shown by the four sets of FIFO buffers in Figure 4(b). Also, the processing element $PE_T$ needs to store $m'/T$ bits for $sign^0$, as shown by the FIFO buffer in Figure 4(a). The processing element $PE_{T-1}$ needs to store only $(m'/T) - 1$ bits for S, R, Y, and H.
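The FIFO storage described in this paragraph can be tallied with a short Python sketch. It counts only the FIFO bits named above (S, R, Y, and H in the data PEs and the sign FIFO in the control PE), not any other registers in Figure 4, and it assumes the per-PE block length is $\lceil m/T \rceil$. The function name and the dictionary keys are hypothetical.

```python
def fifo_storage_bits(m: int, T: int) -> dict:
    """Tally the FIFO bits described in the text for a T + 1 PE array."""
    if m <= 0 or T <= 0:
        raise ValueError("m and T must be positive integers")
    block = -(-m // T)  # bits handled per PE: ceil(m / T)
    # PE_0 .. PE_{T-2}: four FIFOs (S, R, Y, H) of `block` bits each.
    regular = 4 * block * (T - 1)
    # PE_{T-1}: four FIFOs of `block - 1` bits each.
    last_data = 4 * (block - 1)
    # PE_T (control): one sign FIFO of `block` bits.
    control = block
    return {
        "num_pes": T + 1,
        "bits_per_pe": block,
        "total_fifo_bits": regular + last_data + control,
    }
```

For the Figure 3 example ($m = 3$, $T = 2$) this gives 3 PEs, 2 bits per PE, and 14 FIFO bits in total. The PE count depends only on $T$, while the FIFO depth grows with $m$.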
The operation of each $PE_j$ ($0 \le j < T$) for the proposed scalable design can be summarized as follows:
1) For the first $(m'/T) + 1$ time steps (i.e., $1 \le t \le (m'/T) + 1$), all the D-FFs of FIFO_sign are set so that the initial value of $sign^0$ is equal to 1 throughout these time steps. Also, during these time steps, MUX8 is set to accept input $D_0$. The register at the output of the shifter is loaded every $(m'/T) + 1$ time steps.
2) For the first $(m'/T) + 1$ time steps (i.e., $1 \le t \le (m'/T) + 1$), MUXs $M_1$, $M_3$, $M_4$, and $M_7$ are set to accept the inputs $s^0_{m-1}$, $s^0_k$, $r^0_k$, and $y^0_k$ corresponding to the polynomials S, R, and Y, respectively. During these time steps, the flip-flops of FIFO_H are cleared so that the input bits $h^0_k$ are equal to zero. $PE_j$ will accept bits $s^0_k$, $r^0_k$, and $y^0_k$ at time $t$ such that:
These bits will be loaded in FIFO_S, FIFO_R, and FIFO_Y , respectively. 
5) The output product H is available at time $t$, which satisfies the inequalities:
We can conclude that the scalable design is suited to resource-constrained embedded applications for the following reasons:
1) The number of PEs in the systolic array can be controlled.
2) Inter-processor communication is limited to one-bit data only.
V. COMPLEXITY COMPARISON
From Figure 3, we can estimate the time and area complexities of the proposed scalable design. Table 1 compares the area, latency, and critical path delay of the proposed scalable digit-serial design to the closest digit-serial competitor designs in the literature [12], [18].
In Table 1 we have:
1) $T_A$ is the AND gate delay.
2) $T_{MUX}$ is the MUX delay.
3) $T_X$ is the XOR gate delay.
4) where $S$ is the number of pipeline stages inserted in each PE and $d$ is the digit size.
5)
To verify the area and performance (delay and power) of the proposed scalable design, we used the Synopsys synthesis tools package, version 2005.09-SP2, for logic synthesis and power analysis of the proposed design as well as the most efficient digit-serial designs of [12] and [18]. The designs were first described in VHDL and then synthesized to the gate level for a field size of $m = 233$, digit size $d = 4$, $S = 1$, and $T = 64$ using a 45 nm, 1.1 V standard-cell CMOS technology. Table 2 shows the synthesis results (area, delay, power) of the different digit-serial inverters. It also shows the calculated energy and the throughput rate, which are used to measure the degree of improvement achieved by each digit-serial inverter design. The power was estimated at a low frequency of 100 kHz, which is suitable for ultra-low power devices such as RFID.
From this table, we notice that the proposed scalable design achieves a significant reduction in area (ranging from 83.0% to 88.3%) and energy (ranging from 75.0% to 85.7%) over the compared efficient designs, which makes it very suitable for constrained implementations of cryptographic primitives in ultra-low power devices that have tight restrictions on area and power consumption. On the other hand, the proposed design has significantly lower throughput than all the other designs.
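The percentage reductions quoted above follow the usual relative-saving formula, (baseline − proposed) / baseline. A minimal sketch is given below. The function name is hypothetical, and the sample figures in the usage note are made up for illustration; they are not values from Table 2.

```python
def percent_reduction(baseline: float, proposed: float) -> float:
    """Relative saving of `proposed` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline * 100.0
```

For instance, a hypothetical baseline of 200 units and a proposed value of 50 units give `percent_reduction(200.0, 50.0) == 75.0`. The same formula applies to area and to energy once both designs are measured in the same units.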
VI. SUMMARY AND CONCLUSION
This paper presented a new scalable systolic array structure that performs inversion in $GF(2^m)$ based on a previously modified extended Euclidean algorithm. The structure is suitable for a fixed-size processor that reuses the same core and does not require the core size to be changed when $m$ is modified. The structure is obtained by applying a nonlinear methodology that gives the designer more flexibility to control the processing element workload and also reduces the overhead of communication between processing elements. Implementation results for the proposed scalable digit-serial design and the previously reported efficient digit-serial designs show that the proposed scalable structure achieves a significant reduction in area and power, which makes it more suitable for constrained implementations of cryptographic primitives in ultra-low power devices such as wireless sensor nodes and radio frequency identification (RFID) devices.
