Abstract: Systolic implementation of Karatsuba algorithm (KA)-based digit-serial multipliers over $GF(2^m)$ on field-programmable gate array (FPGA) platforms has many attractive features, such as an efficient tradeoff in area-time complexity and a high throughput rate. On the other hand, it suffers from high register complexity, which increases area and power consumption. In this paper, we present an algorithm and architecture for efficient FPGA implementation of a KA-based digit-serial systolic multiplier over $GF(2^m)$ based on the polynomials recommended by the National Institute of Standards and Technology (NIST). Several techniques are explored and used to realize efficient implementation of these multipliers. First, we propose a novel KA-based approach whose computational complexity is significantly lower than that of the existing one. Second, we propose register minimization techniques, namely redundant register removal, two-stage pipelining, and register sharing, to reduce the register complexity of the proposed structure. Third, we adopt an FPGA-specific digit-parallel implementation strategy to optimize the area-time-power complexities of the proposed structure on FPGA platforms. The results obtained from FPGA synthesis indicate that the proposed multiplier (for the field $GF(2^{233})$ based on a NIST trinomial) has significantly lower area-time-power complexities than the existing designs; e.g., the proposed structure can achieve 65.7% and 73.6% reductions in area-delay product and power-delay product, respectively, over the best of the existing KA-based systolic structures.
systems, and smart-grid communication, etc. [1]-[4]. Elliptic curve cryptography (ECC) gives much stronger security per bit than RSA [5], [6], and hence it can play a promising role in the above-mentioned applications [5]. ECC realization requires point addition operations, which can be efficiently realized by field operations such as addition, squaring, and multiplication in binary extension fields (addition and squaring are fast operations). Field multiplication is considered the bottleneck of ECC due to its large area consumption, long computation time, and high power consumption. The National Institute of Standards and Technology (NIST) [6] has recommended five irreducible polynomials for ECC implementation, and considerable effort has been devoted to efficient implementation of multiplication over $GF(2^m)$ based on these NIST polynomials [7]-[21].
In terms of throughput style, multipliers over $GF(2^m)$ can be classified into three forms: bit-parallel, bit-serial, and digit-serial [7]-[10]. Bit-parallel structures usually deliver one output per cycle, but their area complexities are large. Bit-serial structures are small and simple, but they process only one bit per cycle. Digit-serial structures provide the necessary trade-off between area and time complexities, and therefore they have gained substantial attention recently.
The Karatsuba algorithm (KA) is widely used in finite field multiplication over $GF(2^m)$ to reduce the computational complexity [23], [24]. Recently, a number of KA-based systolic multipliers over $GF(2^m)$ have been reported (systolic designs usually provide high-throughput computation) [12]-[16]. Efficient digit-serial and digit-parallel systolic multipliers over $GF(2^m)$ based on KA are introduced in [12]. A digit-serial systolic KA-based multiplier for special classes of polynomials over $GF(2^m)$ is reported in [13]. Another subquadratic space-complexity KA-based parallel systolic multiplier using a block recombination technique is presented in [14]. In addition, a low-complexity digit-serial dual-basis systolic multiplier over $GF(2^m)$ using KA is proposed in [15]. An efficient KA-based parallel systolic multiplier based on two-term decomposition is presented in [16].
FPGA devices have been widely used in many embedded systems due to features such as low cost, reconfigurability, and convenient development. In this paper, we propose a low-complexity systolic Karatsuba multiplier over $GF(2^m)$ based on NIST polynomials for FPGA implementation.
Moreover, we propose several novel techniques to achieve optimal implementation. The major contributions of this paper are:
• We propose a novel KA-based approach whose computational complexity is much smaller than that of the existing one. The corresponding KA-based systolic structure has significantly lower area-time-power complexities than the existing designs.
• Register complexity is found to be significantly high in the existing systolic structures. Therefore, we employ several register minimization techniques, in order to reduce the register-complexity of the proposed systolic structure significantly.
• We adopt an efficient FPGA-specific digit-parallel implementation strategy to optimize the area-time-power complexities of the proposed structure. We have implemented the proposed structure as well as the existing designs on the same FPGA device. The results obtained from synthesis confirm the efficiency of the proposed structure over the existing designs.

The rest of the paper is organized as follows. In Section II, the mathematical formulation for the derivation of the proposed KA-based algorithm is presented in detail. The proposed algorithm is introduced in Section III. In Section IV, the proposed digit-serial systolic KA-based multiplier is presented along with novel register minimization techniques. In Section V, the area, time, and power complexities of the proposed design are presented and compared with those of the existing designs. The conclusion is given in Section VI.
II. MATHEMATICAL FORMULATION OF PROPOSED ALGORITHM
In this section, we present the mathematical formulation for the derivation of the proposed algorithm based on two-term Karatsuba decomposition for NIST recommended polynomials, which is extended to four-term KA decomposition.
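A. Existing Two-Term KA of [12] and [24]
Let $A$, $B$, and $C$ be field elements in $GF(2^m)$. The finite field multiplication can be expressed as

$$C = A \cdot B \bmod f(x), \tag{1}$$

where $A = \sum_{i=0}^{m-1} a_i x^i$, $B = \sum_{i=0}^{m-1} b_i x^i$, and $C = \sum_{i=0}^{m-1} c_i x^i$, for $a_i$, $b_i$, and $c_i \in \{0, 1\}$, and $f(x)$ is an irreducible polynomial.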
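As a point of reference for the field multiplication defined above, the following is a minimal bit-serial software sketch, not the systolic structure proposed in this paper. It assumes that field elements are stored as Python integers with bit $i$ holding the coefficient of $x^i$; the names `gf2m_mul`, `NIST_B163`, and `NIST_B233` are illustrative choices, and the two polynomials are the NIST pentanomial and trinomial used later in the synthesis experiments.

```python
# Illustrative constants (names are assumptions, not from the paper).
NIST_B163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1
NIST_B233 = (1 << 233) | (1 << 74) | 1                       # x^233 + x^74 + 1


def gf2m_mul(a: int, b: int, f: int) -> int:
    """Return a * b mod f over GF(2), assuming a and b have degree below deg(f)."""
    m = f.bit_length() - 1          # degree of the irreducible polynomial
    c = 0
    for i in reversed(range(m)):    # scan the bits of b from the MSB down (Horner form)
        c <<= 1                     # multiply the running result by x
        if (c >> m) & 1:            # degree reached m: subtract (XOR) f once
            c ^= f
        if (b >> i) & 1:            # add a when the current bit of b is set
            c ^= a
    return c
```

Each loop iteration corresponds to one bit of $B$, which is why bit-serial designs are compact but need on the order of $m$ cycles per product.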
For two-term KA [12], [24], a multiplicand element $A = \sum_{i=0}^{m-1} a_i x^i$ can be expressed as:
Similarly, the other multiplicand
Product of A and B is then given by:
B. Proposed Two-Term KA
Let us define the three partial products of (4) as:
, and each $C_{L,u}$ can be obtained as
For any given value of v, we derive B
L x wv+u . Similarly, we have the same steps of (5)
Besides, we further optimize (4) as
Let us define $C_L + C_{LH} + C_H = C_{T0}$, and then we rewrite (10) into
C. Detailed Final Step
Let us define $C_L$, $C_H$, and $C_{T0}$ as:
where the detailed step of calculating ( 2 can be seen in Subsections D and E later.
From (11) and (12), we have
where 
D. For the NIST Recommended Pentanomials
which implies that
Substituting $x^m$ in (13) with (16), we have
Replacing $x^m$ in (14) with (16), (11) can be computed as
Then we can obtain
For simplicity of discussion, let us first define
where the term
for NIST pentanomials as follows [21] :
Similar expressions as those of (19)- (20) can be obtained
1) Register Sharing Algorithm: Besides, from the operations of (20), we find that there are
Here, we select bits {c 3 to reduce the register-complexity in the systolic structure.
E. For the NIST Recommended Trinomials
Let $f(x) = x^m + x^{k_4} + 1$ be a trinomial of degree $m$, for $1 < k_4 \le m - 1$. Since $x$ is a root of the trinomial, we have
such that
Substituting it in (13), we have
such that $C_{LH1}$ can be obtained directly from $(C_L + C_H x)$ through bit-addition operations. Replacing $x^m$ in (14) with (24), (11) can be computed as
Then we can obtain $C_{LHT2} \cdot x^{k_4} \bmod f(x)$ by the following steps:
Similarly, we define
L HT 2,i x i can be derived from C L HT 2 for NIST trinomials as [9] 
1) Register Sharing Algorithm: As in the case of pentanomials, in (28) we find that there are $(m - k_4)$ identical bits in $C_{LHT2}$ and $C_{LHT2} \cdot x^{k_4}$. Let us define $x^m, x^{m+1}, \ldots, x^{m+k_4-1}$ as the extended polynomial basis, and we have
Here, we can select bits {c
L HT 2 . This facilitates the reduction of registers in the systolic structure.
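The reduction idea used in Subsections D and E, replacing $x^m$ by the low-degree terms of $f(x)$, can be sketched in software as follows. This is a minimal illustration under the assumption that polynomials are stored as Python integers (bit $i$ is the coefficient of $x^i$). The function name `reduce_sparse` and its `tail` parameter (the exponents of $f(x)$ below $x^m$) are hypothetical, and the loop does not model the register-sharing hardware itself.

```python
def reduce_sparse(c: int, m: int, tail: tuple) -> int:
    """Reduce c modulo f(x) = x^m + sum(x^e for e in tail), with every e < m."""
    mask = (1 << m) - 1
    while c >> m:                 # repeat while any term of degree >= m remains
        hi = c >> m               # coefficients of x^m, x^(m+1), ...
        c &= mask                 # keep the part that is already reduced
        for e in tail:            # x^m == sum of x^e, so fold hi back in at each e
            c ^= hi << e
    return c


# Exponent tails of the two NIST polynomials used in the synthesis experiments.
TRINOMIAL_233_TAIL = (74, 0)          # x^233 + x^74 + 1
PENTANOMIAL_163_TAIL = (7, 6, 3, 0)   # x^163 + x^7 + x^6 + x^3 + 1
```

For a trinomial the fold adds the same high-part bits twice, once unshifted and once shifted by $k_4$ positions, which is consistent with the observation above that the two operands share identical bits.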
III. PROPOSED ALGORITHM
In this section, we present the proposed two-term KA as well as the four-term decomposition strategy. A comparison of computational complexity (especially of the final step) between the proposed algorithm and the existing one is also included.
A. Proposed Two-Term KA
The proposed digit-serial algorithm based on two-term Karatsuba-like decomposition (4)-(30) is described in Algorithm 1, where Steps 2.1 to 2.6 refer to digit-serial multiplication.
Algorithm 1
... can be either a NIST recommended pentanomial or trinomial).
2. Multiplication step
2.1 for $u = 0$ to $w - 1$
    for $v = 0$ to $d - 1$
2.3 $C_{L,u} = \sum_{v=0}^{d-1} B_L^{wv+u} a_{2wv+2u}$.
2.4.a $D_L = D_L + \sum_{u=0}^{w-1} C_{L,u}$.
2.4.b $D_{LH} = D_{LH} + \sum_{u=0}^{w-1} C_{LH,\ldots}$
3.1 $C = \{(D_L)^2 + (D_L + D_{LH} + D_H)^2 x + (D_H)^2 x^2\} \bmod f(x) = \{(C_L + C_H x)^2 + (C_{T0})^2 x\} \bmod f(x) = \{C_{LHT1} + x^m \cdot C_{LHT2}\} \bmod f(x)$.

1) Comparison of Computational Complexity of the Final Step: Compared with the final steps in [12], our final step of (11) has significantly lower computational complexity; e.g., the number of modular reduction operations is reduced from 3 [12, Fig. 6] to only 1 (according to (11)), which results in a significant reduction of the required number of XOR gates. The detailed comparison of the computational complexities of the final step of the proposed algorithm and the existing algorithm is shown in Table I (Equation (28) requires $k_4$ XOR gates, while the operations of (11) and (26) need $(5m - 1)$ XORs, which can be seen in Table IV later).
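The final step of Algorithm 1, $C = \{(C_L + C_H x)^2 + (C_{T0})^2 x\} \bmod f(x)$ with $C_{T0} = C_L + C_{LH} + C_H$, can be checked numerically with the sketch below. It assumes an even/odd split of the operands (suggested by the indices $a_{2wv+2u}$ in Step 2.3), so that $A = A_L^2 + A_H^2 x$ over $GF(2)$, and it uses integers as bit vectors. All function names are illustrative, and `poly_mod` is a plain long-division reference reduction rather than the FMA cell of the proposed structure.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product of two polynomials stored as ints."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def split_even_odd(a: int):
    """Return (A_L, A_H) such that A = A_L(x)^2 + x * A_H(x)^2 over GF(2)."""
    lo = hi = 0
    i = 0
    while a:
        lo |= (a & 1) << i            # even-indexed coefficient
        hi |= ((a >> 1) & 1) << i     # odd-indexed coefficient
        a >>= 2
        i += 1
    return lo, hi


def square(p: int) -> int:
    """Square in GF(2)[x]: coefficient i moves to position 2i."""
    out = 0
    i = 0
    while p:
        out |= (p & 1) << (2 * i)
        p >>= 1
        i += 1
    return out


def poly_mod(c: int, f: int) -> int:
    """Reference reduction of c modulo f by long division."""
    m = f.bit_length() - 1
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)
    return c


def ka2_mul(a: int, b: int, f: int) -> int:
    """Two-term Karatsuba-like product with one modular reduction at the end."""
    a_l, a_h = split_even_odd(a)
    b_l, b_h = split_even_odd(b)
    c_l = clmul(a_l, b_l)
    c_h = clmul(a_h, b_h)
    c_lh = clmul(a_l ^ a_h, b_l ^ b_h)
    c_t0 = c_l ^ c_lh ^ c_h
    unreduced = square(c_l ^ (c_h << 1)) ^ (square(c_t0) << 1)
    return poly_mod(unreduced, f)     # the only reduction in the whole product


NIST_B163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
NIST_B233 = (1 << 233) | (1 << 74) | 1
```

The three sub-products are left unreduced, so `poly_mod` is called exactly once per multiplication, mirroring the single modular reduction counted in the comparison above.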
B. Proposed Four-Term Karatsuba-Like Decomposition
1) Existing Four-Term KA of [12] and [24]: Similar to the two-term decomposition, we have
Let us define as follows:
From [12] and [24] , we have
2) Proposed Four-Term KA: Then, we have the following steps: 
Thus we can rewrite (42) into
Then, we can use similar steps of (12)-(30) to compute the final step.
3) Comparison of Computational Complexity of the Final Step: In the case of four-term decomposition also, our final step of (46) has significantly lower computational complexity than that of [12]. The number of modular reduction operations is reduced from 6 (based on (41)) to only 1 (according to (46)). The detailed comparison of the computational complexities of the final step of the proposed algorithm and the existing algorithm is shown in Table II, where our proposed algorithm involves significantly fewer XOR gates (similar to the calculation of Table I, the XOR gates are estimated from (46)). Note that the proposed two- and four-term KA decompositions can be extended to $n$-term KA decomposition for $n > 4$.
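One way to see how a four-way split with a single final reduction can arise is to apply the even/odd two-term split recursively. The sketch below does this for an arbitrary recursion depth and returns the unreduced product, so that one reduction can follow at the end. It is an assumption-laden illustration (integers as bit vectors, hypothetical function names) and is not claimed to match the four-term formulation of this section term by term.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product of two polynomials stored as ints."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def split_even_odd(a: int):
    """Return (A_L, A_H) such that A = A_L(x)^2 + x * A_H(x)^2 over GF(2)."""
    lo = hi = 0
    i = 0
    while a:
        lo |= (a & 1) << i
        hi |= ((a >> 1) & 1) << i
        a >>= 2
        i += 1
    return lo, hi


def square(p: int) -> int:
    """Square in GF(2)[x]: coefficient i moves to position 2i."""
    out = 0
    i = 0
    while p:
        out |= (p & 1) << (2 * i)
        p >>= 1
        i += 1
    return out


def ka_even_odd(a: int, b: int, depth: int) -> int:
    """Unreduced product a * b using `depth` levels of even/odd Karatsuba."""
    if depth == 0:
        return clmul(a, b)
    a_l, a_h = split_even_odd(a)
    b_l, b_h = split_even_odd(b)
    c_l = ka_even_odd(a_l, b_l, depth - 1)
    c_h = ka_even_odd(a_h, b_h, depth - 1)
    c_lh = ka_even_odd(a_l ^ a_h, b_l ^ b_h, depth - 1)
    c_t0 = c_l ^ c_lh ^ c_h
    # Same recombination as the two-term final step, without the reduction.
    return square(c_l ^ (c_h << 1)) ^ (square(c_t0) << 1)
```

With `depth=2` the operands are effectively split by index modulo 4 and the recursion performs $3^2 = 9$ base multiplications; deeper recursion corresponds to the $n$-term extension mentioned above.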
IV. PROPOSED DIGIT-SERIAL KA-BASED SYSTOLIC MULTIPLIER
For simplicity of discussion, we present here systolic multipliers over $GF(2^m)$ based on the proposed two-term Karatsuba-like decomposition. Note that the design strategy proposed for two-term decomposition can be extended to obtain the structure of a multiplier based on $n$-term decomposition.
A. Proposed Two-Term KA-Based Digit-Serial Systolic Multiplier
The proposed digit-serial multiplier based on Algorithm 1 is shown in Fig. 1. It consists of three systolic arrays, where each array consists of $d$ processing elements (PEs) and one shift-accumulation (SAC) cell. The proposed design also has one pre-computing-adder (PCA) cell. In addition, a final modular addition (FMA) cell is needed for the operation of Step 3.1 of Algorithm 1. Three pairs of operands ($A_L$, $A_H$, $B_L$, $B_H$, $A_{LH}$, $B_{LH}$) are fed to three separate systolic arrays, according to Algorithm 1 (the internal structure of the PCA is shown in Fig. 2(a)). The internal structures of the first PE (from the left) and the regular PEs (PE-2 to PE-$d$) are shown in Figs. 2(b) and (c), respectively. According to Step 2.3 of Algorithm 1, each regular PE receives the $w$-position-right-shifted input from the left and then adds it to the result from $m/2$ parallel AND gates. The output of the PE (comprising $(m/2 + vw)$ bits) is then fed to the PE on its right. The SAC cell receives the first input and feeds its output to the FMA cell in $w$ successive clock cycles to yield the final output $C$.
1) Register Minimization Techniques:
The key strategy to minimize the register complexity of the structure is to minimize the number of signal bits in every pipelining stage. The different register minimization schemes used in the proposed structure are discussed in the following.
• Elimination of Redundant Registers. The internal structure of PCA cell of the proposed multiplier is shown in Fig. 2(a) . Unlike the regular PE, PE-1 does not have any XOR cell, as shown in Fig. 2(b) . The internal structure of a regular PE (PE-2 to PE-d) is shown in Fig. 2(c) .
To reduce the register complexity within the regular PE, we remove the registers used for pipelining identical bits. The detailed internal structure obtained by applying this technique to the regular PE is shown in Fig. 2(d), where only $m/2$ bit-registers (within the dotted area) are used. The input from the left side of the PE is right-shifted by $w$ positions, and these shifted bits are redundantly pipelined within each PE; thus, these registers can be removed and the bits can be connected directly to the final output. Note that the critical path of these PEs is $(T_A + T_X)$, where $T_A$ and $T_X$ refer to the propagation time of an AND gate and an XOR gate, respectively. This technique is also used in the FMA cell, as described below.
• Two-Stage Pipelining. This technique is applied to the FMA cell of the proposed structure (as shown in Fig. 3), specifically for NIST pentanomials. Considering the operations of (18)-(20), the maximum time required to obtain $C^2_{L2(E)}$ from $C^2_{L2}$ is $2T_X$. To maintain the same critical path as that in other parts of the structure, we use a two-stage pipelining (TSP) technique to reduce the delay of obtaining $C^2_{L2(E)}$ from $C^2_{L2}$ to $T_X$, as shown in Fig. 3(b). To minimize the number of registers in stage-1, we perform only those XOR operations that are needed according to (19), while the rest of the XOR operations are executed in stage-2. The critical path of the proposed FMA is $(T_A + T_X)$, which is the same as that of Fig. 3(a).
• Register Sharing. This technique is applied in parallel with the two-stage pipelining technique. As shown in (19)-(20) and (26)-(27), in the process of obtaining $C^2_{L2(E)}$ from $C^2_{L2}$, multiple bits can be shared to reduce the register complexity. The details of the FMA cell applying this technique are shown in Fig. 3. Note that we have also used the redundant register elimination technique in the FMA cell, such that the bits which do not participate in the computation can be connected to the final-stage operation without pipelining. The detailed register counts (after applying the proposed register minimization techniques) for the FMA cell can be seen in Tables III and IV for NIST recommended pentanomials and trinomials, respectively.

The proposed systolic structure gives the first output $(d + w + 6)$ cycles after the operands are fed to the structure, while successive outputs are obtained every $w$ cycles thereafter.
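The latency statement above translates into a simple cycle count for a stream of multiplications. The helper below is a small sketch of that arithmetic only; the function name `total_cycles` is illustrative, and the values $d = w = 11$ in the usage note are hypothetical inputs, not figures reported in this paper.

```python
def total_cycles(d: int, w: int, n_products: int) -> int:
    """Cycles until the n-th product is available.

    The first output appears after (d + w + 6) cycles and each later
    output follows w cycles after the previous one.
    """
    if n_products < 1:
        raise ValueError("n_products must be at least 1")
    return (d + w + 6) + (n_products - 1) * w
```

For example, with the hypothetical values $d = w = 11$, `total_cycles(11, 11, 1)` gives 28 cycles for the first product and `total_cycles(11, 11, 3)` gives 50 cycles for the third.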
B. Digit-Parallel Implementation
The proposed multipliers are targeted to be implemented in FPGA. Therefore, we derive here a digit-parallel structure which could be more suitable for FPGA implementation.
For any integer value of $d$, we can have $d = pr + q$, where $0 \le q < r$ and $r < d$. Without loss of generality, we can assume $q = 0$. Then, we can rewrite (7) as
where $0 \le h \le p - 1$ and $0 \le f \le r - 1$. Based on (44), we can modify the PEs in the structure of Fig. 2 to derive suitable structures for FPGA platforms. For example, to obtain a proposed structure for $p = 2$, a pair of PEs of the previously proposed structure can be merged to form a new regular PE, as shown in Figs. 4(a)-(d). The critical path of the new PE of Fig. 4(d) thus becomes $(T_A + 2T_X)$. Accordingly, the internal structure of the FMA cell should also be changed, as shown in Figs. 4(e) and (f) for the case of NIST recommended pentanomials and trinomials, respectively. The strategy used to derive the structure for $p = 2$ can be extended to other values of $p$, which will result in structures with different critical paths. Note that this digit-parallel implementation is different from that of [12]. In the proposed design, the number of bits of operand $A$ fed to the PEs remains the same as in the bit-level design, while in [12] the number of such bits increases with the digit size for digit-level parallel implementation. The proposed strategy (Fig. 4) is quite useful for FPGA platforms since the basic unit of an FPGA can be mapped to multiple logic gates, depending mainly on the technology of the FPGA device. If the value of $p$ is chosen appropriately, the proposed multiplier implemented on an FPGA platform can have the optimal area-time complexity.
A. Comparison of Area and Time Complexities
The area and time complexities in terms of gate count, register count, latency, critical path, and average computation time (ACT) of the proposed structures for both NIST pentanomials and trinomials are listed in Table V (for simplicity of discussion, only designs based on two-term KA decomposition are listed here). For a fair comparison, we list only polynomial-basis KA designs: four recently reported structures of [12], [13], [17], and [20].
As shown in Table V, different structures have different area-time complexities for different digit sizes. The structures of [12], [17], and [20] are suitable for NIST trinomials, while the design in [13] fits both pentanomials and trinomials. For a fair comparison, we have coded the proposed structures ($p = 1$, $p = 2$, and $p = 4$) and the existing designs of [12], [17], and [21] in VHDL based on two NIST polynomials ($f(x) = x^{163} + x^7 + x^6 + x^3 + 1$ and $f(x) = x^{233} + x^{74} + 1$), each with the same digit size (as recommended in [12], where $d = w = \sqrt{m/2}$). Liu et al. [17] have shown the efficiency of their design over [20]; thus, we only compare with the synthesized results of [17]. We have synthesized these structures using Altera Quartus II 12.1 on the same Stratix II EP2S180f1508 FPGA device (the device used in [13]). The area-time-power complexities (in terms of the number of adaptive LUTs (ALUTs), delay, power consumption, area-delay product (ADP), and power-delay product (PDP)) of the proposed and the existing designs of [12], [13], [17], and [21] are listed in Table VI (the results of [13] are from [13, Table IV]). Note that we have verified the functionality of the proposed structure through simulation and the RTL Viewer tool provided by Altera Quartus II.
As shown in Table VI, the proposed structures ($p = 1$, $p = 2$, and $p = 4$) significantly outperform the existing designs (including [17] and [21]), i.e., they have smaller ADP and PDP. This can be attributed to two facts: (i) the proposed algorithm has lower complexity than the existing one (as shown in Tables I and II); and (ii) the several register minimization techniques employed also benefit the overall performance, especially in the reduction of ALUTs (e.g., the proposed structure of $p = 1$ uses 2240 fewer ALUTs than [12]). For the NIST pentanomial $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$, the proposed design ($p = 1$) has at least 74.9% reduction in ADP over the existing design of [13]. For the NIST trinomial $f(x) = x^{233} + x^{74} + 1$, the proposed design ($p = 1$) has at least 17.6% and 3.4% reductions in ADP and PDP, respectively, over the design of [12].
For our structure optimized specifically for FPGA platforms, the proposed structure ($p = 4$) has at least 97.2% reduction in ADP over the existing structure of [13] based on the NIST pentanomial. For the NIST trinomial, the proposed structure ($p = 4$) has at least 65.7% and 73.6% reductions in ADP and PDP, respectively, over the structure of [17].
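The reduction percentages quoted above follow from the usual definitions of ADP and PDP as products of area (or power) and delay. The sketch below spells out that arithmetic under the assumption that each design is summarized by one area figure, one delay figure, and one power figure; the function names are illustrative, and the numbers in the usage note are hypothetical and are not taken from Table VI.

```python
def adp(area_aluts: float, delay: float) -> float:
    """Area-delay product: area (e.g., ALUT count) times delay."""
    return area_aluts * delay


def pdp(power: float, delay: float) -> float:
    """Power-delay product: power consumption times delay."""
    return power * delay


def reduction_percent(baseline: float, proposed: float) -> float:
    """Relative saving of `proposed` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline
```

For instance, a hypothetical baseline with 1000 ALUTs and a delay of 2.0 units against a hypothetical proposed design with 500 ALUTs and a delay of 1.0 unit gives `reduction_percent(adp(1000, 2.0), adp(500, 1.0))`, i.e., a 75% ADP reduction.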
B. Discussion
It can be seen from Table VI that the proposed architecture with $p = 4$ achieves the best area-time-power performance among all the designs, where the reduction of registers brought by the proposed digit-parallel implementation is significant. It is also noted that as $p$ increases, the ALUTs inside the FPGA are used more efficiently, since one ALUT can be mapped to multiple logic operations and thus the overall number of ALUTs is reduced, as shown by the reduction of the area-time-power complexities of the proposed structures from $p = 1$ to $p = 2$. However, the reduction in the number of ALUTs does not increase linearly with $p$, as evidenced by the cases of the proposed structures between $p = 2$ and $p = 4$ (on this device, the ALUT has almost reached its maximum utilization in the case of $p = 2$). Therefore, for practical applications, one can choose suitable values of $p$ (corresponding to the digit size $d$) to obtain optimal implementations on specific FPGA-based platforms.
VI. CONCLUSION
An efficient design and implementation of multipliers over $GF(2^m)$ based on NIST polynomials is proposed, especially for FPGA platforms. We have proposed an efficient KA-based algorithm for digit-serial multiplication in which the complexity is significantly reduced compared with the existing one. Based on the proposed algorithm, we have derived a digit-serial structure, where three different approaches have been introduced to reduce the register complexity. Moreover, we have presented digit-parallel structures for FPGA platforms. The FPGA synthesis results show that the proposed multipliers have significantly lower area-time-power complexities than the existing competing designs; for the NIST trinomial field $GF(2^{233})$, the proposed structure (for $p = 4$) achieves nearly 65.7% and 73.6% reductions in ADP and PDP, respectively, over the best of the existing architectures. The proposed structures can also be used on application-specific integrated circuit (ASIC) platforms for low-power and high-performance implementation. Future work will focus on practical application of the proposed multipliers in resource-constrained platforms such as wearable devices and deeply embedded systems.
