This paper presents a novel bit-serial semi-systolic array structure that simultaneously executes modular multiplication and squaring operations in GF($2^m$). The architecture is derived using a systematic methodology based on the proper choice of the scheduling and projection vectors applied to the algorithm dependency graph. The resulting architecture shares the data-path between the two operations, and hence saves area compared to using a separate data-path for each operation. Also, the simultaneous calculation of both operations significantly decreases the execution time of modular exponentiation, which mainly depends on these two core operations. Complexity analysis indicates that the developed bit-serial semi-systolic array structure outperforms the most recent competing bit-serial systolic and non-systolic structures in terms of area-time (AT) by at least 24%. This makes the proposed structure more appropriate for use in resource-constrained cryptographic processors.
Introduction and related work
Modular multiplication and squaring operations are at the heart of modular exponentiation. Thus, the performance of the modular exponentiation operation is mainly determined by the performance of these two operations. Various hardware structures in GF($2^m$) have been developed to increase the performance of these crucial operations [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Unfortunately, these hardware implementations mainly concentrate on increasing the performance of the multiplication operation and do not provide unified structures that compute both operations simultaneously. Thus, they limit the use of modular exponentiation in several resource-constrained cryptographic and error-correcting code applications due to their considerable area and time overhead.
There are several unified systolic and semi-systolic array structures presented in the literature to concurrently execute both modular multiplication and squaring operations in GF($2^m$). Choi et al. [13] presented an approach for merging both operations in a combined systolic array structure. The proposed approach has the merit of reducing the space overhead of the systolic array as well as improving its utilization. They developed bit-serial and bit-parallel systolic array structures based on the proposed approach. Kim et al. [14] presented a bit-parallel systolic array structure based on the unified algorithm reported in [15]. This algorithm is based on the bipartite method discussed in [16]. In this method, the multiplier operand is divided into two parts that can be processed in parallel, leading to a significant reduction in algorithm latency. Also, Kim et al. [17] presented a unified bit-parallel semi-systolic array structure to concurrently execute modular multiplication and squaring in GF($2^m$). The developed structure is based on the Montgomery multiplication algorithm.
On the other hand, there are conventional (non-systolic) architectures used to separately perform both operations based on the Mastrovito multiplier algorithm. The efficient serial architectures that are suitable for the targeted resource-constrained applications are the architectures of [18, 19]. These architectures are derived for irreducible ω-nomials (polynomials with ω non-zero terms) and trinomials, and have lower critical path delay compared to the previously reported results.
In this paper, we present a novel bit-serial semi-systolic array structure to concurrently execute multiplication and squaring in GF($2^m$) based on the bipartite multiplication-squaring algorithm reported in [15]. The structure is derived using a systematic methodology that consists of the following three steps: 1) extracting the algorithm dependency graph (DG); 2) assigning time values to each node of the DG based on a chosen scheduling vector; 3) projecting several nodes of the DG onto a specific processing element (PE) cell based on a chosen projection vector. The developed bit-serial semi-systolic array structure has lower area and AT complexities compared to the most recent existing bit-serial systolic and non-systolic structures of [13, 18, 19]. This enables the use of the proposed bit-serial array structure in different resource-constrained cryptographic and error-correcting code applications.
The paper is arranged as follows: Section 2 briefly explains the adopted bipartite multiplication-squaring algorithm. Section 3 gives the hardware details of the developed bit-serial semi-systolic array. Section 4 provides the complexity analysis of the developed and related bit-serial structures. Section 5 concludes the paper.
Bipartite multiplication and squaring algorithm in GF($2^m$)
The details of the bipartite multiplication and squaring algorithm in GF($2^m$) are stated in [14, 15]. In this section, we only provide a brief discussion of this algorithm to help understand the developed design. Let $C(x)$ and $D(x)$ represent any two polynomial elements in GF($2^m$). Also, let $F(x)$ be the irreducible polynomial used to produce the field elements of this field. These polynomials can be expressed as:
$$C(x) = \sum_{j=0}^{m-1} c_j x^j \tag{1}$$
$$D(x) = \sum_{j=0}^{m-1} d_j x^j \tag{2}$$
$$F(x) = x^m + \sum_{j=0}^{m-1} f_j x^j \tag{3}$$
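where the coefficients $c_j, d_j, f_j \in GF(2)$. Since $x$ is a root of $F(x)$, $x^m \bmod F(x)$ and $x^{m+1} \bmod F(x)$ can be expressed as follows:
$$x^m \bmod F(x) = \sum_{j=0}^{m-1} f_j x^j \tag{4}$$
$$x^{m+1} \bmod F(x) = \sum_{j=0}^{m-1} f'_j x^j = F'(x) \tag{5}$$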
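Assume $F'(x)$ is available in advance, and let $l = \lceil m/2 \rceil$ and $k = \lfloor m/2 \rfloor$. We can express the modular multiplication and squaring as:
$$P(x) = C(x)D(x) \bmod F(x) \tag{6}$$
$$S(x) = C(x)^2 \bmod F(x) \tag{7}$$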
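We can split $P(x)$ and $S(x)$ into two portions as:
$$P(x) = (H(x) + xG(x)) \bmod F(x) \tag{8}$$
$$S(x) = (V(x) + xU(x)) \bmod F(x) \tag{9}$$
where
$$H(x) = \sum_{i=0}^{l-1} d_{2i}\, C(x) x^{2i} \bmod F(x) \tag{10}$$
$$G(x) = \sum_{i=0}^{k-1} d_{2i+1}\, C(x) x^{2i} \bmod F(x) \tag{11}$$
$$V(x) = \sum_{i=0}^{l-1} c_{2i}\, C(x) x^{2i} \bmod F(x) \tag{12}$$
$$U(x) = \sum_{i=0}^{k-1} c_{2i+1}\, C(x) x^{2i} \bmod F(x) \tag{13}$$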
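The term $C(x)x^{2i} \bmod F(x)$ is common in Eqs. (10), (11), (12), and (13) and can be defined as $C_i(x) = C_{i-1}(x)x^2 \bmod F(x)$, where $C_0(x) = C(x)$ and $0 \le i \le l-1$. Using Eqs. (4) and (5), we can formulate $C_i(x)$ as:
$$C_i(x) = \sum_{j=2}^{m-1} c^{i-1}_{j-2} x^j + c^{i-1}_{m-2} \sum_{j=0}^{m-1} f_j x^j + c^{i-1}_{m-1} \sum_{j=0}^{m-1} f'_j x^j \tag{14}$$
We can formulate the coefficients of $C_i(x)$, $c^i_j$, as:
$$c^i_j = c^{i-1}_{j-2} + c^{i-1}_{m-2} f_j + c^{i-1}_{m-1} f'_j, \quad c^{i-1}_{-1} = c^{i-1}_{-2} = 0 \tag{15}$$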
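We can express $H(x)$, $G(x)$, $V(x)$, $U(x)$ based on (14) as:
$$H(x) = \sum_{i=0}^{l-1} d_{2i}\, C_i(x) \tag{16}$$
$$G(x) = \sum_{i=0}^{k-1} d_{2i+1}\, C_i(x) \tag{17}$$
$$V(x) = \sum_{i=0}^{l-1} c_{2i}\, C_i(x) \tag{18}$$
$$U(x) = \sum_{i=0}^{k-1} c_{2i+1}\, C_i(x) \tag{19}$$
The recurrence equations of $H(x)$, $G(x)$, $V(x)$, $U(x)$ can be expressed as:
$$H_i(x) = H_{i-1}(x) + d_{2(i-1)}\, C_{i-1}(x), \quad 1 \le i \le l \tag{20}$$
$$G_i(x) = G_{i-1}(x) + d_{2i-1}\, C_{i-1}(x), \quad 1 \le i \le k \tag{21}$$
$$V_i(x) = V_{i-1}(x) + c_{2(i-1)}\, C_{i-1}(x), \quad 1 \le i \le l \tag{22}$$
$$U_i(x) = U_{i-1}(x) + c_{2i-1}\, C_{i-1}(x), \quad 1 \le i \le k \tag{23}$$
with $H_0(x) = G_0(x) = V_0(x) = U_0(x) = 0$. We can express the coefficients of $H_i(x)$, $G_i(x)$, $V_i(x)$, $U_i(x)$ in the recursive form as:
$$h^i_j = h^{i-1}_j + d_{2(i-1)}\, c^{i-1}_j \tag{24}$$
$$g^i_j = g^{i-1}_j + d_{2i-1}\, c^{i-1}_j \tag{25}$$
$$v^i_j = v^{i-1}_j + c_{2(i-1)}\, c^{i-1}_j \tag{26}$$
$$u^i_j = u^{i-1}_j + c_{2i-1}\, c^{i-1}_j \tag{27}$$
Equations (24), (25), (26), and (27) can be computed concurrently as there is no data dependency between them.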
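$P(x)$ and $S(x)$ can be calculated based on Eqs. (8) and (9) as follows:
$$P(x) = (H_l(x) + xG_k(x)) \bmod F(x) \tag{28}$$
$$S(x) = (V_l(x) + xU_k(x)) \bmod F(x) \tag{29}$$
Based on Eqs. (28) and (29), we can calculate the coefficients of $P(x)$ and $S(x)$ as follows:
$$p_j = h^l_j + g^k_{j-1} + g^k_{m-1} f_j \tag{30}$$
$$s_j = v^l_j + u^k_{j-1} + u^k_{m-1} f_j \tag{31}$$
where $g^k_{-1} = u^k_{-1} = 0$ and $0 \le j \le m-1$.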
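The recurrences of this section can be checked in software. The following is a minimal Python sketch, not the hardware design itself: it assumes polynomials are encoded as integers (bit $j$ holds the coefficient of $x^j$), and the helper names `clmul`, `mod_reduce`, `bits`, and `bipartite_mul_sqr` are hypothetical names introduced only for this illustration. It iterates the coefficient updates for $c^i_j$, $h^i_j$, $g^i_j$, $v^i_j$, $u^i_j$ and then forms $p_j$ and $s_j$, treating out-of-range operand bits as zero.

```python
def clmul(a, b):
    """Carry-less (GF(2)) product of two integer-encoded polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def mod_reduce(poly, F, m):
    """Reduce an integer-encoded polynomial modulo F(x) of degree m."""
    for deg in range(poly.bit_length() - 1, m - 1, -1):
        if (poly >> deg) & 1:
            poly ^= F << (deg - m)
    return poly

def bits(p, m):
    """Lowest m coefficients of p, index j = coefficient of x^j."""
    return [(p >> j) & 1 for j in range(m)]

def bipartite_mul_sqr(C, D, F, m):
    """Return (C*D mod F, C*C mod F) using the bipartite recurrences (m >= 2)."""
    l = (m + 1) // 2                              # l = ceil(m/2)
    f = bits(F, m)                                # coefficients of x^m mod F(x)
    fp = bits(mod_reduce(1 << (m + 1), F, m), m)  # coefficients of x^(m+1) mod F(x)
    c, d = bits(C, m), bits(D, m)
    ci = list(c)                                  # coefficients of C_{i-1}(x)
    h, g, v, u = [0] * m, [0] * m, [0] * m, [0] * m
    for i in range(1, l + 1):
        even, odd = 2 * (i - 1), 2 * i - 1
        d_even, c_even = d[even], c[even]
        d_odd = d[odd] if odd < m else 0          # zero when the index runs past m-1
        c_odd = c[odd] if odd < m else 0
        h = [h[j] ^ (d_even & ci[j]) for j in range(m)]
        g = [g[j] ^ (d_odd & ci[j]) for j in range(m)]
        v = [v[j] ^ (c_even & ci[j]) for j in range(m)]
        u = [u[j] ^ (c_odd & ci[j]) for j in range(m)]
        if i < l:                                 # the last stage does not need C_l(x)
            top2, top1 = ci[m - 2], ci[m - 1]
            ci = [(ci[j - 2] if j >= 2 else 0) ^ (top2 & f[j]) ^ (top1 & fp[j])
                  for j in range(m)]
    p = [h[j] ^ (g[j - 1] if j >= 1 else 0) ^ (g[m - 1] & f[j]) for j in range(m)]
    s = [v[j] ^ (u[j - 1] if j >= 1 else 0) ^ (u[m - 1] & f[j]) for j in range(m)]
    to_int = lambda coeffs: sum(bit << j for j, bit in enumerate(coeffs))
    return to_int(p), to_int(s)

if __name__ == "__main__":
    m, F = 5, 0b100101                            # F(x) = x^5 + x^2 + 1
    for C in range(1 << m):
        for D in range(1 << m):
            P, S = bipartite_mul_sqr(C, D, F, m)
            assert P == mod_reduce(clmul(C, D), F, m)
            assert S == mod_reduce(clmul(C, C), F, m)
```

The exhaustive comparison against a direct carry-less multiply followed by reduction is only practical for small $m$; for field sizes such as $m = 233$ a random sample of operands would be used instead.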
Proposed bit-serial semi-systolic array
We used the recursive equations of (15), (24), (25), (26) […]
[…] $f_j$ and $f'_j$ are the inputs to the nodes of the upper section of the DG. In this section, the partial results of $h^i_j$, $g^i_j$, $v^i_j$ and $u^i_j$ alongside the transmitted input bits of $f_j$ and $f'_j$ are represented by the vertical lines. The diagonal red lines in this section indicate the computed partial results of $c^i_j$. Also, the input bits $c_{2(i-1)}, c_{2i-1}, d_{2(i-1)}, d_{2i-1}$ alongside the resulting partial bits $c^{i-1}_{m-2}, c^{i-1}_{m-1}$ are broadcast horizontally to all nodes of this section. The output bits $h^l_j, g^k_j, v^l_j, u^k_j$ produced by the upper section alongside the transmitted bits of $f_j$ are fed as inputs to the lower section of the DG to compute the bits of modular multiplication $p_j$ and squaring $s_j$, as shown in Fig. 1.
We used the approach previously reported in [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] to obtain the scheduling vector $S = [2\ \ {-1}]$ and the projection vector $P = [1\ \ 0]^T$. These vectors are used to assign a time value to each node in the DG and to project several nodes of the DG onto a specific PE cell. Fig. 1 also shows the resulting node timing. The adopted timing results in serial feeding of the inputs and outputs of the DG. Fig. 2 shows the semi-systolic array structure that results from applying the projection vector $P = [1\ \ 0]^T$ to the DG. It consists of $l + 1$ PEs, where $l = \lceil m/2 \rceil$. The PEs are classified into three types: $PE_i$, $PE_l$, and $PE_{l+1}$. $PE_i$ represents the first $l - 1$ PEs, while $PE_l$ and $PE_{l+1}$ represent the $l$-th and $(l+1)$-th PEs, respectively. $PE_l$ is a simplified version of $PE_i$ in which there is no need to compute $c^i_j$. $PE_{l+1}$ is the last PE and computes the bits of modular multiplication ($p_j$) and squaring ($s_j$) according to Eqs. (30) and (31), respectively. As Fig. 2 shows, bits of $u^i_j$, $v^i_j$, $g^i_j$, $h^i_j$, and $f_j$ are pipelined between all the PEs. Bits of $c^i_j$ are pipelined between the first $l$ PEs. Bits of $c^i_{j-1}$, $c^i_{j-2}$, $f'_j$, $0 \le j \le m-1$, are pipelined between only the first $l - 1$ PEs. Bits of $c_{2(i-1)}, c_{2i-1}, d_{2(i-1)}, d_{2i-1}$ are located at the first $l$ PEs. Fig. 3 shows the logic details of each PE. The last two bits $c^{i-1}_{m-2}, c^{i-1}_{m-1}$ are held in $PE_i$ at clock cycles $3(i-1) + 2$, $1 \le i \le l-1$, using the MUX-Latch combinations shown in Fig. 3(a). Also, the last two bits $u^k_{m-1}$ and $g^k_{m-1}$ are held in $PE_{l+1}$ at clock cycle $3l + 2$ using the MUX-Latch combinations shown in Fig. 3(c). The select signal $S_{in}$ is pipelined between the PEs through the $D_s$ latches to control the MUXes and synchronize the latching process.
The following briefly describes the operation of the developed semi-systolic array.
1) At the first clock cycle, $t = 1$, the select input of the two MUXes shown in Fig. 3(a) is set (i.e., $S_{in} = 1$) in $PE_1$ to pass the last two input bits $c^0_{m-1}$ and $c^0_{m-2}$, which are used alongside the input bits $c_0, c_1, d_0, d_1, c^0_{m-3}, u^0_{m-1}, v^0_{m-1}, g^0_{m-1}, h^0_{m-1}, f_{m-1}, f'_{m-1}$ to compute the intermediate bits $c^1_{m-1}, u^1_{m-1}, v^1_{m-1}, g^1_{m-1}, h^1_{m-1}$. Notice that the input signals $c^{i-1}_j, c^{i-1}_{j-1}, c^{i-1}_{j-2}$ shown in $PE_i$, Fig. 3(a), are assigned the values of $c^0_{m-1}$, $c^0_{m-2}$, and $c^0_{m-3}$ at this clock cycle.
2) At the second clock cycle, $t = 2$, the select input of the two MUXes is reset (i.e., $S_{in} = 0$) in $PE_1$ to hold the last two bits $c^0_{m-1}$ and $c^0_{m-2}$, which are used alongside the input bits $c_0$, […]
3) […] $h^0_j$, $f_j$ and $f'_j$, $0 \le j \le m-3$, besides the located bits $c_0, c_1, d_0, d_1$ and the kept bits $c^0_{m-1}, c^0_{m-2}$, are used to compute the intermediate bits $c^1_j, u^1_j, v^1_j, g^1_j, h^1_j$, $0 \le j \le m-3$.
4) At clock cycles $t = 3(i-1) + 1$, $2 \le i \le k$, $PE_i$ starts running precisely like $PE_1$ and sets the MUXes (i.e., $S_{in} = 1$) to pass the last two input bits $c^{i-1}_{m-1}$ and $c^{i-1}_{m-2}$, which are used alongside the bits $c_{2(i-1)}, c_{2i-1}, d_{2(i-1)}, d_{2i-1}, c^{i-1}_{m-3}, u^{i-1}_{m-1}, v^{i-1}_{m-1}, g^{i-1}_{m-1}, h^{i-1}_{m-1}, f_{m-1}, f'_{m-1}$ to compute the intermediate bits $c^i_{m-1}, u^i_{m-1}, v^i_{m-1}, g^i_{m-1}, h^i_{m-1}$.
5) At clock cycles $t = 3(i-1) + 2$, $2 \le i \le k$, the select input of the two MUXes is reset (i.e., $S_{in} = 0$) in $PE_i$ to hold the last two bits $c^{i-1}_{m-1}$ and $c^{i-1}_{m-2}$, which are used alongside the input bits $c_{2(i-1)}$, […], $f_{m-2}$ and $f'_{m-2}$ to update the intermediate bits $c^i_{m-2}, u^i_{m-2}, v^i_{m-2}, g^i_{m-2}, h^i_{m-2}$.
6) Through clock cycles $t \le 3(i-1) + (m-j)$, $2 \le i \le k$ and $0 \le j \le m-3$, the remaining input bits […] $h^{i-1}_j$, $f_j$ and $f'_j$, $2 \le i \le k$ and $0 \le j \le m-3$, besides the located bits $c_{2(i-1)}, c_{2i-1}, d_{2(i-1)}, d_{2i-1}$ and the kept bits $c^{i-1}_{m-1}, c^{i-1}_{m-2}$, are used to compute the intermediate bits $c^i_j, u^i_j, v^i_j, g^i_j, h^i_j$, $2 \le i \le k$ and $0 \le j \le m-3$.
7) Through clock cycles $t \ge 3(l-1) + (m-j)$, $0 \le j \le m-1$, $PE_l$ runs exactly like $PE_i$ to update the intermediate bits $u^k_j, v^l_j, g^k_j, h^l_j$, $0 \le j \le m-1$. Notice that there is no need to update the $c^l_j$ signal in $PE_l$. Also notice that when $l \ne k$ (this means that $m$ is odd), the located bits $c_{2i-1}$ and $d_{2i-1}$ are assigned zero values, so that the updated values of $u^l_j$ and $g^l_j$ remain the same as the previous values of $u^k_j$ and $g^k_j$ produced by $PE_k$ (i.e., $u^l_j = u^k_j$ and $g^l_j = g^k_j$). On the other hand, when $l = k$ (this means that $m$ is even), all the located bits $c_{2(l-1)}, c_{2l-1}, d_{2(l-1)}, d_{2l-1}$ are assigned zero values, leading to $u^l$ […]
8) Through clock cycles $t \ge 3l + (m-j)$, $0 \le j \le m-1$, $PE_{l+1}$ runs to compute serially the bits of modular multiplication $p_j$ and squaring $s_j$. It is worth noting that the control signal $S_{in}$ is set ($S_{in} = 1$) at clock cycle $t = 3l + 1$ to pass the last bits $u^k_{m-1}$ and $g^k_{m-1}$ through the MUXes of $PE_{l+1}$, and it is reset ($S_{in} = 0$) at the beginning of clock cycle $t = 3l + 2$ to hold these bits inside $PE_{l+1}$ for use through the remaining clock cycles. Also, it is worth noting that some D-latches are added before the XOR gate inputs in $PE_i$ and $PE_{l+1}$ to decrease the critical path delay (CPD) and hence increase the clock frequency. As a result, the final result is obtained after $3l + m + 1$ clock cycles instead of $3l + m$ clock cycles.
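The timing described in the steps above can be summarized in a few lines of Python. This is only a bookkeeping sketch under the assumption that $PE_i$ handles bit position $j$ at clock cycle $3(i-1) + (m-j)$, as the step descriptions suggest; the function names `pe_cycle`, `hold_cycle`, and `latency` are hypothetical and are not part of the design.

```python
def pe_cycle(i, j, m):
    """Clock cycle at which PE i (1-based) handles bit position j."""
    return 3 * (i - 1) + (m - j)

def hold_cycle(i):
    """Clock cycle at which PE i resets S_in to hold its two latched bits."""
    return 3 * (i - 1) + 2

def latency(m, extra_latches=True):
    """Total clock cycles until the last output bit (j = 0) leaves PE l+1."""
    l = (m + 1) // 2                     # l = ceil(m/2)
    last_output = pe_cycle(l + 1, 0, m)  # equals 3l + m
    # The D-latches added before the XOR inputs cost one additional cycle.
    return last_output + (1 if extra_latches else 0)

if __name__ == "__main__":
    m = 233
    print(pe_cycle(1, m - 1, m), hold_cycle(1), latency(m, False), latency(m))
```

For $m = 233$, $l = 117$, so the sketch reports $3l + m = 584$ cycles without the extra latches and $585$ cycles with them.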
Complexity analysis
We used the NanGate (15 nm, 0.8 V) Open Cell Library to estimate the area (A) and delay (T) of the basic cells (2-input AND gate, 2-input XOR gate, 2-to-1 MUX, and D-latch) in terms of a 2-input NAND gate. Based on the estimated values of the basic cells, we evaluated the area and time complexities of the proposed structure and the compared efficient serial structures of [13, 18, 19], as shown in Tables I and II, respectively. The Choi design [13] is a systolic serial structure, while the Masoleh designs [18, 19] are non-systolic serial structures. The Masoleh designs are derived for irreducible ω-nomials (irreducible polynomials with ω non-zero terms) and trinomials. The trinomial-based designs have better performance and are therefore selected here for comparison. The estimated area and delay of the basic cells are as follows: $A_{AND} = 1.2$, $T_{AND} =$ […]
In Table I, we estimated the total gate count (TGC) of each array structure in terms of the field size $m$ based on the area values of the basic cells. In Table II, we multiplied the estimated latency of each design by the corresponding critical path delay (CPD) to obtain the total delay (TD). The area-time (AT) complexity of each array structure is also given in this table; it is estimated by multiplying the TGC of each array structure by the corresponding TD. By examining the expressions given in Table I, we can deduce that the total area (TGC) of the basic cells of the proposed serial structure, $H_9 = 71.7m - 7.9$, is lower than that of the other serial structures. Also, the expressions given in Table II show that the total delay of the proposed serial structure, $TD_4 = 91m + 36.4$, is lower than that of the other serial structures. Moreover, the AT complexity of the proposed serial structure, $AT_4 = 6{,}520m^2 - 2{,}464m - 287.56$, is lower than that of the other serial structures by at least 24%, as shown in Table III.
The evaluated throughput of the compared structures is given in Table II, and the results show that all structures have the same throughput. Based on the analysis given in Tables I and II, we quantified the area (A), total computation time (T), and area-time (AT) for $m = 233$ and $t = 74$, as shown in Table III. The results indicate that the developed bit-serial array structure outperforms the compared ones in AT by at least 24%. This makes it more suitable for use in resource-constrained cryptographic processors.
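The AT comparison can be reproduced numerically from the closed-form expressions. The sketch below is an illustration only: it transcribes the $AT_1$ to $AT_4$ expressions given in the notes accompanying Table II, assumes $AT_4$ belongs to the proposed structure (as stated in the text), and uses the hypothetical function names `at_expressions` and `min_improvement`.

```python
from math import ceil, log2

def at_expressions(m, t):
    """Area-time products AT1..AT4, transcribed from the Table II notes."""
    lg = ceil(log2(m))
    at1 = 8605 * m**2 - 2868 * m
    at2 = ((6473.9 + 2240.3 * lg) * m**2
           + (297.3 + 102 * lg) * m * t
           - (444.1 + 153.7 * lg) * m)
    at3 = ((6573 + 2274.6 * lg) * m**2
           - (205.5 + 71.1 * lg) * m * t
           - (612.9 + 212.1 * lg) * m)
    at4 = 6520 * m**2 - 2464 * m - 287.56   # proposed structure
    return at1, at2, at3, at4

def min_improvement(m, t):
    """Smallest relative AT reduction of the proposed design over the others."""
    *others, proposed = at_expressions(m, t)
    return min(1 - proposed / other for other in others)

if __name__ == "__main__":
    print(f"{min_improvement(233, 74):.3f}")
```

For $m = 233$ and $t = 74$ the smallest reduction is about 0.24, obtained against $AT_1$, which agrees with the "at least 24%" figure quoted above.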
Summary and conclusion
In this paper, we developed a unified bit-serial semi-systolic array structure that simultaneously computes both modular multiplication and squaring operations in GF($2^m$). Sharing the data-path between the two operations reduces the area overhead compared to using a separate data-path for each operation. Also, the concurrent computation of both operations significantly increases the execution speed of the modular exponentiation operation. The obtained results show that the developed bit-serial array structure outperforms the most recent existing competing serial structures in terms of area-time (AT). This makes it more suitable for use in resource-constrained cryptographic processors.

Notes to Table I:
(1) $H_1 = 5m + 3$, $H_2 = 5m + 2t - 4$, $H_3 = 19m + 4t - 4$, $H_4 = 176.4m + 8.1t - 12.1$, where $t$ is the power of the second trinomial term ($x^m + x^t + 1$).
(2) $H_5 = 7m + 6$, $H_6 = 7m + 6$, $H_7 = 18m - 2t - 2$, $H_8 = 179.1m - 5.6t - 16.7$.
(3) $l = \lceil m/2 \rceil$.
(4) $H_9 = 71.7m - 7.9$.

Notes to Table II:
(1) $AT_1 = 8{,}605m^2 - 2{,}868m$, $AT_2 = (6{,}473.9 + 2{,}240.3\lceil \log_2(m) \rceil)m^2 + (297.3 + 102\lceil \log_2(m) \rceil)mt - (444.1 + 153.7\lceil \log_2(m) \rceil)m$, $AT_3 = (6{,}573 + 2{,}274.6\lceil \log_2(m) \rceil)m^2 - (205.5 + 71.1\lceil \log_2(m) \rceil)mt - (612.9 + 212.1\lceil \log_2(m) \rceil)m$, $AT_4 = 6{,}520m^2 - 2{,}464m - 287.56$.
(2) $TD_1 = 109.2m - 36.4$, $TD_2 = 12.7m\lceil \log_2(m) \rceil + 36.7m$, $TD_3 = 12.7m\lceil \log_2(m) \rceil + 11.3m$, $TD_4 = 91m + 36.4$.
