Abstract: Systolic all-one-polynomial (AOP) multipliers usually suffer from high register complexity, especially on field-programmable gate array (FPGA) platforms, where register resources are not abundant. In this paper, we show that AOP-based systolic multipliers can achieve low register-complexity implementations and that the proposed architectures can be employed as computation cores to derive efficient implementations of systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting scheme that significantly reduces the register complexity of existing AOP-based systolic multipliers. We find that the modified AOP-based structure can be packed as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm that can fully employ the proposed AOP-based computation core. The proposed Montgomery algorithm employs a novel precomputed-modular operation, and the systolic structures based on this algorithm fully inherit the advantages of the AOP-based core (low register complexity, low critical-path delay, and low latency), apart from a marginal hardware overhead introduced by a precomputation unit. The proposed architectures are implemented using Xilinx ISE 14.1, and the results show that the proposed designs achieve at least 61.8% and 47.6% less area-delay product and power-delay product, respectively, than the best of the competing designs. Index Terms: All-one polynomial (AOP), finite field multiplication, irreducible trinomials, low register complexity, Montgomery algorithm, systolic structure.
polynomial basis [5] [6] [7] [8] [9] [10] [11] [12] [13] and normal basis [14] [15] [16] [17] , which can be selected to represent the field operations. Nevertheless, in hardware realization, polynomial basis multipliers usually have simpler hardware structures than normal basis ones and hence are more widely used [8] .
All-one polynomials (AOPs) and trinomials are two important classes of irreducible polynomials in use [7] [8] [9] [10] [11] , [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] . The AOP-based multipliers can be used for the nearly AOP, which could be used for efficient realization of cryptosystems [24] . The AOP-based structures can also be used as a kernel circuit for field exponentiation, inversion, and division architectures [24] . Trinomial-based multipliers, however, are more popular than AOP-based ones, as two trinomials have been recommended by the National Institute of Standards and Technology (NIST) for ECC implementation [5] . Because of the complexity differences, AOPs and trinomials are not usually considered together in practical field multiplication implementations [18] .
There are basically two kinds of structures for multipliers over GF(2^m): systolic designs and nonsystolic designs. Systolic multipliers over GF(2^m) based on irreducible polynomials are preferred in high-performance applications because of features such as modularity and regularity [5] [6] [7] [8] [9] [10] [11] . However, systolic structures also have high register complexity, since all processing elements (PEs) in the systolic array need registers for pipelining [5] , while nonsystolic designs usually have lower complexity but a larger critical-path delay.
For practical applications, especially on field-programmable gate array (FPGA) platforms, where register resources are not abundant, low register-complexity systolic structures are required. Many efforts have been reported to reduce the register complexity in systolic multipliers based on irreducible AOPs and trinomials [7] [8] [9] [10] , [23] [24] [25] [26] . A bit-parallel AOP-based systolic multiplier has been introduced in [23] . Another efficient AOP-based design is presented in [24] . Moreover, a low-complexity systolic Montgomery AOP-based multiplier has been proposed in [7] . Lee et al. [7] presented a bit-parallel systolic trinomial multiplier. Meher [8] proposed efficient bit-parallel systolic and supersystolic designs. Xie et al. [9] introduced a low register-complexity systolic structure. Very recently, Montgomery systolic multipliers with an efficiently reduced register count were presented [10] . Several other works were reported for efficient realization of finite field Montgomery multiplication over GF(2^m) [11] , [17] .
Based on the above discussion, in this paper, we introduce a novel strategy to design low register-complexity structures for multiplication over GF(2^m). First, two low register-complexity AOP-based systolic multipliers are proposed. Then, the two designs are optimized as standard computation cores. After that, an efficient Montgomery multiplication algorithm (for trinomials) based on a novel precomputed-modular (PCM) operation is proposed for low register-complexity implementation. The structures based on the proposed Montgomery algorithm can directly employ the proposed AOP-based cores. Finally, FPGA implementation results are presented to confirm the efficiency of the proposed architectures.
The rest of this paper is organized as follows. The proposed AOP-based computation core is presented in Section II. The applications of the proposed AOP-based core, namely, the proposed Montgomery algorithm and systolic multipliers for trinomials, are described in Section III. In Section IV, we benchmark the hardware and time complexities of the proposed designs along with the corresponding existing works. Conclusions are given in Section V.
II. LOW REGISTER-COMPLEXITY AOP-BASED SYSTOLIC MULTIPLIERS (AOP-BASED COMPUTATION CORE)
In this section, we first briefly review the AOP-based multiplication algorithm and then present our proposed architectures based on the existing structures.
A. Review of AOP Multiplication Algorithm [24]
For simplicity of discussion, let f(α) = α^k + α^{k−1} + · · · + α + 1 be an irreducible AOP of degree k over GF(2) (where k + 1 is prime and 2 is a primitive root modulo k + 1). Let x ∈ GF(2^m) be a root of f(α) = 0; then we have
and then we have
Then, let {x^k, x^{k−1}, . . . , x, 1} be the extended polynomial basis [27] . For any A, B, C ∈ GF(2^m), these can be represented in the extended polynomial basis as
where a_j, b_j, c_j ∈ GF(2), for 0 ≤ j ≤ k − 1, and a_k = 0, b_k = 0, and c_k = 0. Let us define C as the product of A and B, and then we have
which can be written in this form
Let us define A_0 = A and
Then, we have
where
One can also extend to obtain A_{i+l} from
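The review above can be illustrated with a small software model. The following is a minimal sketch, not the architecture itself, and it rests on one standard identity: since (x + 1)f(x) = x^{k+1} + 1, a root x of the AOP satisfies x^{k+1} = 1, so multiplying by x in the extended basis is a cyclic shift of the k + 1 coefficients. The function names and the bit-list representation (index j holds the coefficient of x^j) are assumptions made for illustration.

```python
def aop_multiply(a, b, k):
    """Multiply A and B in the extended basis of a degree-k AOP.

    a and b are lists of k + 1 bits; index j holds the coefficient of x^j.
    The result is the product modulo x^(k+1) + 1.
    """
    n = k + 1
    c = [0] * n
    a_i = list(a)                          # A_0 = A
    for i in range(n):
        if b[i]:
            # accumulate the partial product b_i * A_i
            c = [cj ^ aj for cj, aj in zip(c, a_i)]
        # A_{i+1} = x * A_i: a cyclic shift, because x^(k+1) = 1
        a_i = [a_i[-1]] + a_i[:-1]
    return c


def to_standard_basis(c):
    """Fold the x^k coefficient back using x^k = x^(k-1) + ... + x + 1."""
    top = c[-1]
    return [cj ^ top for cj in c[:-1]]


def reference_multiply(a, b, k):
    """Schoolbook product reduced modulo f(x) = x^k + ... + x + 1 (for checking)."""
    prod = [0] * (2 * k + 1)
    for i, bi in enumerate(b):
        if bi:
            for j, aj in enumerate(a):
                prod[i + j] ^= aj
    for d in range(2 * k, k - 1, -1):      # clear x^d for every d >= k
        if prod[d]:
            for j in range(k + 1):
                prod[d - k + j] ^= 1       # add x^(d-k) * f(x)
    return prod[:k]


# Example with k = 4 (k + 1 = 5 is prime and 2 is a primitive root modulo 5).
a = [1, 0, 1, 1, 0]                        # A = 1 + x^2 + x^3, a_k = 0
b = [0, 1, 1, 0, 0]                        # B = x + x^2,       b_k = 0
c_ext = aop_multiply(a, b, 4)
c_std = to_standard_basis(c_ext)
```

In this model each loop iteration plays the role of one PE: it selects a shifted copy of A, gates it with one bit of B, and adds it to the running sum.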
B. Existing Systolic Structures
The conventional systolic structure based on the algorithm in Section II-A is shown in Fig. 1 (structure-I: S-I). It consists of (k + 1) PEs of three types: PE-1, PE-2, and regular PE. The internal structures of these PEs are shown in Fig. 1(b)-(d), respectively, where BSC denotes the bit-shifting cell. The latency of the structure in Fig. 1 is (k + 1) cycles, where the duration of each cycle period is T_A + T_X (T_A and T_X refer to the delays of an AND gate and an XOR gate, respectively). A recent work has presented a systolic structure with a low critical-path delay (only T_X) [24] , and it is shown in Fig. 2 (structure-II: S-II). The entire structure contains (k + 2) PEs, where the internal structures of the PEs are shown in Fig. 2(b)-(e), respectively. The latency of this structure is (k + 2) cycles (critical-path delay: T_X).
Fig. 2. Existing low critical-path structure of [24] for AOP-based multiplication (structure-II: S-II), where the black box denotes the registers.
C. Modified Low Register-Complexity Structures
For the structures of Figs. 1 and 2, we find that k^2 registers in the PEs pipeline identical data (in shifted order) to the neighboring PEs. These registers can be removed if we change the broadcasting strategy. As shown in Figs. 3 and 4, i.e., MS-I and MS-II, a shifted connection strategy is used in which the input A is fed directly to each PE, which reduces the number of registers required. The details of the shifted connection are also shown in Figs. 3 and 4. To reduce the complexity further, we have used NAND and XNOR gates to replace the original AND and XOR gates, as depicted in [7] and [11] (the critical-path delay is then shortened to T_NA + T_XN, where T_NA and T_XN represent the delays of NAND and XNOR gates, respectively, as evidenced by the normalized area and delay comparison of various logic gates shown in Table I). It is noted that for AOP-based multiplication, the last PE (inside the dotted box) can be removed, as b_k = 0. The modified structures involve nearly the same time complexity as the previous ones, but the register complexity is significantly reduced.
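The gate substitution mentioned above preserves the function of a multiply-accumulate cell, which can be checked with a truth table. The sketch below assumes a single-bit cell that computes (a AND b) XOR c_in; the exact gate arrangement used in [7] and [11] may differ, and the function names are hypothetical.

```python
from itertools import product


def cell_and_xor(a, b, c_in):
    """Original cell: AND gate followed by XOR gate."""
    return (a & b) ^ c_in


def cell_nand_xnor(a, b, c_in):
    """Modified cell: NAND gate followed by XNOR gate."""
    nand = 1 ^ (a & b)          # NAND(a, b)
    return 1 ^ (nand ^ c_in)    # XNOR(nand, c_in)


# The two inversions cancel, so both cells agree on all eight input patterns.
equivalent = all(
    cell_and_xor(a, b, c) == cell_nand_xnor(a, b, c)
    for a, b, c in product((0, 1), repeat=3)
)
```

Because the inversion introduced by the NAND gate is undone by the XNOR gate, the swap changes the gate delays and area but not the computed bit.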
D. Low-Latency Implementations
For practical applications, we can further reduce the latencies of structures shown in Figs. 3 and 4, for k + 1 = pq + f , where 0 ≤ f ≤ q. Without loss of generality, we assume f = 0, and then, we can decompose the original systolic array of k +1 PEs into p parallel arrays to achieve low-latency implementations, as shown in Fig. 5 . An extra pipelined adder tree consisting of XNOR gates and registers is needed to add the results from p arrays together to yield the final result C.
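The decomposition described above can be sketched in software as follows. This is an illustrative model only: it assumes f = 0, a hypothetical vector length n = pq (which need not be prime in this model), bit lists indexed by power of x, and a pairwise XOR tree standing in for the pipelined adder tree. All names are assumptions.

```python
def cyclic_shift(a, i):
    """Coefficients of x^i * A modulo x^n + 1 (pure rewiring, no logic)."""
    i %= len(a)
    return a[-i:] + a[:-i] if i else list(a)


def xor_vec(u, v):
    return [x ^ y for x, y in zip(u, v)]


def array_output(a, b, start, q):
    """Sum of the q partial products b_i * A_i handled by one sub-array."""
    acc = [0] * len(a)
    for i in range(start, start + q):
        if b[i]:
            acc = xor_vec(acc, cyclic_shift(a, i))
    return acc


def adder_tree(vectors):
    """Pairwise XOR reduction; returns the sum and the number of tree levels."""
    levels = 0
    while len(vectors) > 1:
        paired = [xor_vec(vectors[j], vectors[j + 1])
                  for j in range(0, len(vectors) - 1, 2)]
        if len(vectors) % 2:
            paired.append(vectors[-1])     # odd one out passes through
        vectors = paired
        levels += 1
    return vectors[0], levels


def low_latency_multiply(a, b, p, q):
    """Split n = p*q partial products over p arrays of q PEs, then add."""
    assert len(a) == len(b) == p * q
    partial = [array_output(a, b, s * q, q) for s in range(p)]
    return adder_tree(partial)


a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
b = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
result, levels = low_latency_multiply(a, b, p=3, q=4)
```

Each sub-array needs only q accumulation steps instead of pq, at the cost of the extra tree levels needed to merge the p partial results.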
E. Digit-Parallel Structures
We can combine neighboring PEs in a systolic array into one PE to reduce the register usage further. Fig. 6 shows an example of combining two neighboring PEs into one PE (based on the PEs from MS-I). The critical-path delay of the new PE thus becomes (T_NA + 2T_XN). For simplicity, we define the structure based on the new PE in Fig. 6(b) as a digit-level parallel structure with digit size d = 2. If we choose the value of d appropriately, the proposed architecture can achieve the optimal area-time complexity for specific applications.
F. Area-Time Complexities
The area-time complexities of the proposed designs in Figs. 3-6 are shown in Table II , along with the existing and conventional designs of Figs. 1 and 2. It can be seen that the proposed designs involve significantly less area-time complexity when compared with competing ones, especially in terms of the register complexity.
G. FPGA Implementation of Various AOP-Based Structures
We have also implemented these AOP-based systolic structures to confirm the efficacy of the proposed structures. We have synthesized these designs using Xilinx ISE 14.1 on Virtex 6 FPGA family with k = 162. The results in terms of area-time-power complexity are shown in Table III .
It can be seen that the proposed structures outperform the existing ones, especially in area complexity. Since there is only a minor difference between the critical-path delays of T_NA + T_XN and T_XN on FPGA platforms, the proposed MS-II does not have a significant advantage over existing ones. Therefore, the proposed MS-I can be used more widely than MS-II in practical applications.
H. AOP-Based Computation Core
To fully utilize the special property of the proposed AOP-based multipliers, we pack the proposed structure (optionally combined with the structure of Fig. 6) as a standard computation core. The standard computation core is shown in Fig. 7. It has k + 1 input bits from A, k + 1 input bits from B, and k + 1 output bits of C. For practical applications of this standard computation core, we can replace k + 1 with any other integer. It is noted that the PEs of both MS-I and MS-II can be used as internal structures for this computation core.
III. APPLICATION OF THE PROPOSED AOP-BASED COMPUTATION CORE
In this section, we focus on the application of the AOP-based computation core to obtain a low register-complexity Montgomery multiplication based on trinomials.
A. Montgomery Multiplication Algorithm
Let f(x) be a degree-m irreducible trinomial over GF(2) as
f(x) = x^m + x^n + 1 (12)
where 1 ≤ n ≤ m − 1, such that we can have the Montgomery multiplication as [12]
C = A · B · r^{−1} mod f(x) (13)
where r is the Montgomery factor that satisfies gcd(r, f(x)) = 1 (gcd refers to the greatest common divisor). Different algorithms select r differently to obtain the corresponding structures, as shown in [10] [11] [12] .
In this algorithm, we have chosen r = x^t = x^{(m−1)/2} (for the NIST recommended trinomials, m is an odd number). Then, (13) can be expressed as
For C 1 , we define A
where 0 ≤ i ≤ t − 1. C 1 can be expressed as
Let us define
Since x is a root of f(x) = x^m + x^n + 1, we have x^m + x^n = 1 and, dividing by x, x^{m−1} + x^{n−1} = x^{−1}. Substituting these into (19) yields
for 0 ≤ j ≤ m − 2 and j ≠ n.
Similarly, for C_2, we can define
With these definitions, C_2 can be expressed as
Let us define again A
for 1 ≤ j ≤ m − 1 and j ≠ n.
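The process reviewed in this subsection can be sketched in software under stated assumptions. With r = x^t and m − 1 = 2t, the product A · B · x^{−t} splits into a sum over negative powers of x (bits b_0 to b_{t−1}), the middle term b_t · A, and a sum over positive powers (bits b_{t+1} to b_{2t}). Which of the two sums corresponds to C_1 and which to C_2 is inferred here from the use of x^{−1} in the C_1 derivation. The function names, the bit-list representation, and the small trinomial x^7 + x^3 + 1 are illustrative choices, not taken from the paper.

```python
def xor_vec(u, v):
    return [p ^ q for p, q in zip(u, v)]


def mul_x(a, m, n):
    """x * A mod f(x) = x^m + x^n + 1; a[j] is the coefficient of x^j."""
    top = a[m - 1]
    out = [0] + a[:m - 1]
    if top:                      # x^m = x^n + 1
        out[n] ^= 1
        out[0] ^= 1
    return out


def mul_x_inv(a, m, n):
    """x^-1 * A mod f(x), using x^-1 = x^(m-1) + x^(n-1)."""
    low = a[0]
    out = a[1:] + [0]
    if low:
        out[m - 1] ^= 1
        out[n - 1] ^= 1
    return out


def montgomery_multiply(a, b, m, n):
    """C = A * B * x^-t mod f(x), t = (m - 1) / 2, as C1 + b_t * A + C2."""
    t = (m - 1) // 2
    c1 = [0] * m
    a_neg = list(a)
    for i in range(1, t + 1):    # A * x^-i is weighted by b_(t-i)
        a_neg = mul_x_inv(a_neg, m, n)
        if b[t - i]:
            c1 = xor_vec(c1, a_neg)
    c2 = [0] * m
    a_pos = list(a)
    for i in range(1, t + 1):    # A * x^i is weighted by b_(t+i)
        a_pos = mul_x(a_pos, m, n)
        if b[t + i]:
            c2 = xor_vec(c2, a_pos)
    middle = list(a) if b[t] else [0] * m
    return xor_vec(xor_vec(c1, middle), c2)


# Example over the hypothetical small field defined by x^7 + x^3 + 1.
m, n = 7, 3
a = [1, 0, 1, 1, 0, 0, 1]
b = [0, 1, 1, 0, 1, 0, 1]
c = montgomery_multiply(a, b, m, n)
```

Multiplying the result by x^t (t applications of `mul_x`) recovers the ordinary product A · B mod f(x), which is a convenient way to check the split.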
B. Proposed Montgomery Multiplication Algorithm
Equations (12)- (22) represent the standard Montgomery multiplication process. To facilitate the Montgomery multiplication suitable for employing the proposed AOP-based computation core, we present the following proposed algorithm.
Let x^m be an extended polynomial basis. From (19), we define
such that a (25) where ξ(·) represents the bit-selection operation.
We can similarly extend (23) to A
U,m+1 x m+1 , where x m+1 is an extended polynomial basis and
U,m = a (1)
Thus, a 
In conclusion, we can have
where x^{m+1}, . . . , x^{m+t−1} are defined as extended polynomial basis and (applicable to the two trinomials recommended by NIST, where m − n > t)
where a
. ., and A (t )

, that is ξ(A (t )
Similarly, for C 2 , we have
and (applicable to two trinomials recommended by NIST, where m − n > t)
Similar to (27) , we can have
Based on the above, (14) can be rewritten as
Algorithm 1 Proposed Montgomery Multiplication
From (28), (29), (32), and (33), we can derive A_U^{(t)} and A_V^{(t)} directly from operand A through XOR operations, and thus, we define this operation as the PCM operation as
Based on (23)- (35), the proposed Montgomery multiplication algorithm for employing the AOP-based computation core is thus given in Algorithm 1.
In Algorithm 1, Step 2.2 refers to the bit-parallel multiplication process. According to the proposed algorithm, we generate operands at the first cycle period, and then, they are distributed into t partial products to be accumulated in a systolic way, which greatly facilitates employing the proposed AOP-based computation core, since all involved bits are already generated from PCM (the details can be seen in Section III-C).
C. Proposed Low Register-Complexity Systolic Structure Employing the AOP-Based Computation Core
The proposed structure based on Algorithm 1 (employing the proposed AOP-based computation core) is shown in Fig. 8(a). It contains one AOP-based computation core and three extra PEs. PE-0 yields two outputs (each with m + t − 1 bits) to the computation core, to be selectively connected with m − 1 input ports. As shown in Fig. 8(c), PE-0 performs the PCM of operand A and yields two outputs (A_U^{(t)} and A_V^{(t)}) to the computation core (the m bits of operand A are shared). The PCM of (32) takes only one T_X delay, while the PCM of (29) takes 2T_X. To limit the critical-path delay to one XOR delay, we have used two-stage XOR operations [stage-I uses the least number of XOR gates required by (29), while stage-II realizes the remaining operations of (29) and (32)], as shown by an example design in Fig. 9. PE-1 calculates the multiplication of operand A and b_t according to Algorithm 1, while PE-2 performs the final addition to produce the output C.
The internal structure of the AOP-based computation core is shown in Fig. 8(b), where we have used PEs from MS-I as internal PEs for e = 2 (one can extend the structure to any value of e). The computation core contains (2t + 1) PEs, where the detailed designs of the PEs are shown in Fig. 8(d)-(f), respectively. PE-1 multiplies one m-bit operand by one bit of operand B and passes the result to the PE on its right. The regular PE multiplies the selected operand by one bit of operand B, adds the result to the input from the previous PE, and passes the sum to the PE on its right. The last PE, PE-2, adds the outputs of the two systolic arrays and yields the final result.
The critical-path delay of the proposed multiplier of Fig. 8 is (T_NA + T_XN) (if we choose the PEs from MS-II, the critical-path delay will be T_XN). The proposed design gives the first output of the desired product (t + 3) cycles after the pair of operands is fed to the structure, while the successive outputs are produced in every cycle thereafter.
D. Low-Latency Structure
Let 2t = eh + l, where 0 ≤ l ≤ h. For simplicity, we assume l = 0; however, the scheme can be extended to l ≠ 0. Then, we can rewrite (34) as
where the original two systolic arrays in the computation core of Fig. 8 can be divided into e arrays (each array has h PEs), as shown in Fig. 10. The latency of the structure in Fig. 10 is only (h + 3 + log_2 e) cycles (PE-0 and PE-1 take two cycles to be processed in parallel, while the adder tree and PE-2 require log_2 e and one cycle, respectively), which is significantly shorter than the previous one in Fig. 1. A pipelined adder tree is used to add together the results of the e systolic arrays of the computation core.
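The latency expressions quoted in the text can be evaluated with a small helper. This sketch covers only the l = 0 case, assumes that e is a power of two so that log_2 e is an integer, and uses hypothetical function names. The choice e = 8 in the example is purely illustrative (it divides 2t = 232 exactly for m = 233) and is not a configuration reported in the paper.

```python
def latency_two_arrays(m):
    """First-output latency of the two-array structure: t + 3 cycles."""
    t = (m - 1) // 2
    return t + 3


def latency_parallel_arrays(m, e):
    """First-output latency with e parallel arrays: h + 3 + log2(e) cycles.

    Only the case 2t = e * h (l = 0) with e a power of two is handled.
    """
    two_t = m - 1
    if e < 1 or e & (e - 1):
        raise ValueError("this sketch assumes e is a power of two")
    if two_t % e:
        raise ValueError("this sketch covers only l = 0 (e must divide 2t)")
    h = two_t // e
    log2_e = e.bit_length() - 1
    return h + 3 + log2_e


base_233 = latency_two_arrays(233)            # t = 116
fast_233 = latency_parallel_arrays(233, 8)    # h = 29, log2(e) = 3
```

For e = 2 the text quotes t + 3 cycles for the structure of Fig. 8, so the second helper is meant for larger values of e.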
E. Digit-Parallel Structure
We can also employ the PEs from Fig. 6 to obtain a digit-parallel structure that reduces the register complexity further. It is noted that the digit-parallel structure can be combined with the low-latency one to achieve an optimal implementation.
IV. AREA AND TIME COMPLEXITIES
In this section, we benchmark the hardware and time complexities of the proposed architectures.
A. Comparison
The area and time complexities in terms of logic gate count, register count, latency, and critical-path delay of the proposed and existing structures of [7] [8] [9] [10] are listed in Table IV. The proposed architectures outperform the existing ones, especially in the register count. The proposed architectures have lower area-time complexity than the design of [7] . When compared with the low-latency supersystolic structure of [8] , the proposed architecture (Fig. 10) has shorter latency (if we choose e = √m) and fewer registers. Furthermore, when compared with the two architectures in [9] and [10] , the proposed architectures not only have a lower register count but also have significantly lower latency. Among all the existing architectures, only [8] and [9] have proposed digit-parallel structures similar to the one shown in Fig. 6. Table IV shows that the proposed digit-parallel structures have a lower register count and shorter latency than those of [8] and [9] . For a fair benchmark, we have also compared the register count and latency of various architectures based on the trinomials f(x) = x^233 + x^74 + 1 and f(x) = x^409 + x^87 + 1, as shown in Table V. It can be seen that the proposed architecture, especially Fig. 8 (MS-I), has the highest efficiency in terms of both register count and latency.
B. FPGA Implementations
We have implemented the proposed architectures, including the structures of Fig. 8 (e = 2) and Fig. 10 (e = 16 and d = 2), using Xilinx ISE 14.1 on the Virtex 6 FPGA family based on the trinomials f(x) = x^233 + x^74 + 1 and f(x) = x^409 + x^87 + 1. The area-time-power complexities of the best existing designs [9] , [10, Fig. 3] are also obtained. The area-time-power complexities of all these designs are shown in Table VI. As shown in Table VI, the proposed structures significantly outperform the existing designs. At GF(2^233), the proposed structures have at least 61.8% and 47.6% less area-delay product (ADP) and power-delay product (PDP), respectively, than the state-of-the-art previous architectures. The advantage grows with the field size: at GF(2^409), the proposed designs have at least 66.2% and 56.2% less ADP and PDP, respectively, than the competing ones.
C. Discussion
Table VI shows that the proposed architecture of Fig. 10 (e = 16 and d = 2) achieves the best area-time complexity among all the designs. The reduction in registers brought by the digit-parallel implementation is significant. For practical applications, one can choose suitable values of d (coordinated with the selection of e) to obtain optimal realizations based on the usage models and the performance and implementation objectives.
It is worth mentioning that, once packed as a computation core, the AOP-based multipliers can be used as a regular component in practical cryptosystems, even though AOP-based designs are usually not preferred in such systems because of security issues [5] . In the future, we plan to extend the AOP-based cores to pentanomial-based cryptosystems.
V. CONCLUSION
An efficient new scheme for low-complexity implementation of finite field multipliers over GF(2^m) based on trinomials, benchmarked on the FPGA platform, has been proposed. We have proposed a modified data broadcasting technique to reduce the register complexity of the existing AOP-based multipliers. Then, the AOP-based multipliers have been packed as standard computation cores to be used for trinomial-based multipliers. Moreover, a novel low register-complexity Montgomery multiplication algorithm for systolic trinomial-based finite field multipliers is presented. The systolic multiplier based on the proposed algorithm can employ the AOP-based computation core to offer low register-complexity implementations. We have also introduced structures for low-latency and digit-parallel implementations. Both the theoretical analysis and the FPGA implementation results have confirmed the higher efficiency of the proposed architectures compared with the competing ones.
