Abstract
In this letter, we propose a low-latency semi-systolic architecture for multiplication based on the shifted polynomial basis (SPB) over finite fields. Compared with related multipliers, the proposed multiplier reduces time complexity by at least 49.9% and area-time complexity by at least 23.7%. The proposed multiplier can be used as a core circuit for various applications.
Introduction
Finite field arithmetic has been widely used in various areas such as error-correcting codes and cryptosystems [1, 2, 3, 4, 5, 6]. Multiplication over finite fields is a very important operation because time-consuming operations such as division and exponentiation can be performed by repeated multiplications. Thus, an efficient multiplication architecture with low complexity is needed to design dedicated high-speed circuits. Various architectures for arithmetic over $GF(2^m)$ have been developed [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. In this paper, we propose a low-latency multiplication algorithm and a semi-systolic multiplier based on the shifted polynomial basis (SPB) over $GF(2^m)$.
Proposed multiplication algorithm based on SPB
$GF(2^m)$ has $2^m$ elements and is associated with an irreducible polynomial $F(w)$ of degree $m$. Let $A = \sum_{i=0}^{m-1} a_i x^{i-v}$ and $B = \sum_{i=0}^{m-1} b_i x^{i-v}$ be two elements using SPB over $GF(2^m)$. The multiplication based on the SPB is defined in [26, 27, 28, 29, 30]. We know that $x$ is a root of $F(w)$, i.e., $F(x) = 0$ over all
To simplify the derivation of our multiplication algorithm, we assume that $m$ is even. To obtain an efficient parallel architecture, let $v = m/2$. Then, the multiplication based on SPB can be presented as
We can observe that $S$, $T$, $\bar{S}$, and $\bar{T}$ require $Ax^{2i}$ and $Ax^{-2i}$. For convenience in presenting the equations, we put $h = v/2$ and assume that $v$ is even. We define
From the above equations, the recurrence equations of $S$, $T$, $\bar{S}$, and $\bar{T}$ can be formulated as
They can be executed simultaneously because there is no data dependency among them. After computing $S$, $T$, $\bar{S}$, and $\bar{T}$, we need to compute $C$ and $D$.
Finally, the result of the multiplication can be obtained by computing $C + x^{-1}D \bmod F$.
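A bit-serial software reference model can serve as a cross-check for an SPB multiplier. The Python sketch below is not the parallel algorithm proposed in this section. It assumes that an element is packed into an integer whose bit $i$ holds the coefficient of $x^{i-v}$, that $F$ is packed the same way with constant term 1, and that all helper names (`clmul`, `div_by_x`, `spb_mul`, and so on) are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two bit-packed polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(p, f):
    """Reduce polynomial p modulo f over GF(2)."""
    m = f.bit_length() - 1  # degree of f
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p


def div_by_x(p, f):
    """Return a polynomial congruent to p * x^{-1} mod f (f has constant term 1)."""
    if p & 1:
        p ^= f  # adding f does not change the field element and clears bit 0
    return p >> 1


def spb_mul(a, b, f, v):
    """Multiply two SPB elements; bit i of a, b and the result weighs x^{i-v}."""
    p = clmul(a, b)  # A * B = x^{-2v} * p
    for _ in range(v):  # we need c with A * B = x^{-v} * c, so c = p * x^{-v}
        p = div_by_x(p, f)
    return poly_mod(p, f)


def spb_to_pb(a, f, v):
    """Convert an SPB-packed element to the ordinary polynomial basis."""
    for _ in range(v):
        a = div_by_x(a, f)
    return a


def pb_mul(a, b, f):
    """Ordinary polynomial-basis multiplication, used only for checking."""
    return poly_mod(clmul(a, b), f)
```

For example, with $F(w) = w^4 + w + 1$ (`f = 0b10011`) and $v = 2$, the field element 1 is stored as `1 << v`, and `spb_mul(a, 1 << v, f, v)` returns `a` for every 4-bit `a`.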
Proposed semi-systolic array for multiplication
In this section, we propose a low-latency semi-systolic multiplier based on SPB using the proposed algorithm. The semi-systolic arrays for computing $C$ and $D$ over $GF(2^{12})$ are presented in Fig. 1(a) and Fig. 1(b), respectively, which are composed of $m \times h$ W
j. Each $V_j$ cell employs one 2-input AND gate, one 3-input XOR gate, and one 1-bit latch to compute $c_j$. Each $\bar{V}_j$ cell employs one 2-input AND gate, one 3-input XOR gate, and two 1-bit latches to compute $d_j$. The semi-systolic arrays for computing $C$ and $D$ in Fig. 1 each take $h + 1$ clock cycles. The low-latency architecture for multiplication based on SPB over $GF(2^m)$ is depicted in Fig. 3(a). Fig. 3(b) shows the Y module for computing $C + x^{-1}D \bmod F$, which includes $m$ 2-input AND gates, $m - 1$ 3-input XOR gates, one 2-input XOR gate, and $m$ 1-bit latches. The proposed architecture has a latency of $h + 2$ clock cycles. If we assume that one 3-input XOR gate is constructed using two 2-input XOR gates, the delay of each clock cycle is that of one 2-input AND gate, two 2-input XOR gates, and one 1-bit latch.
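The operation of the Y module, $C + x^{-1}D \bmod F$, can be sketched bit by bit in Python. This is an illustrative model under stated assumptions, not the circuit of Fig. 3(b): bit $i$ of each integer holds coefficient $i$ of the operand, $F$ has constant term 1 and leading coefficient $f_m = 1$, and the function name `y_module` is hypothetical. Under these assumptions each output bit is $c_i \oplus d_{i+1} \oplus (d_0 \wedge f_{i+1})$, with $d_m = 0$.

```python
def y_module(c, d, f, m):
    """Return C + x^{-1} D mod F on m-bit coefficient vectors (assumed encoding)."""
    d0 = d & 1  # coefficient that wraps around when D is divided by x
    y = 0
    for i in range(m):
        c_i = (c >> i) & 1
        d_next = (d >> (i + 1)) & 1  # d_{i+1}; zero when i = m - 1
        f_next = (f >> (i + 1)) & 1  # f_{i+1}; equals f_m = 1 when i = m - 1
        y |= (c_i ^ d_next ^ (d0 & f_next)) << i
    return y
```

Written this way, the top bit reduces to a two-operand XOR because $d_m = 0$, while the remaining $m - 1$ bits each combine three terms, which appears consistent with the gate counts stated for the Y module.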
Complexity analysis and conclusion
To compare the time and area complexity, we adopt the "SAMSUNG STD 150 0.13 µm 1.2V CMOS Standard Cell Library" databook. Based on this library, we can estimate the time and area complexities of the proposed and the related multipliers. We assume that $A_{AND2} = 6.68$, $T_{AND2} = 0.094$ ns, $A_{XOR2} = 12.00$, $T_{XOR2} = 0.167$ ns, $A_{LATCH1} = 16.00$, and $T_{LATCH1} = 0.157$ ns, where $A_{GATEn}$ and $T_{GATEn}$ denote the transistor count and delay of an $n$-input gate, respectively. A comparison between the proposed and the related semi-systolic multipliers is given in Table I. Although Choi-Lee's multiplier [24] has the lowest area complexity among the multipliers in Table I, its throughput is 1/2. In contrast, our multiplier has the lowest time and area-time complexity among the multipliers in Table I, and its throughput is 1. Although the area complexity of our multiplier is nearly the same as that of Kim-Kim [15], our multiplier saves about 64.9%, 49.9%, and 50.0% in time complexity as compared to Huang [13], Kim-Kim [15], and Choi-Lee [24], respectively. In terms of area-time (AT) complexity, our multiplier saves about 68.1%, 49.7%, and 23.7% as compared to Huang [13], Kim-Kim [15], and Choi-Lee [24], respectively.
In this paper, we proposed a low-latency semi-systolic architecture for multiplication over the finite field $GF(2^m)$. The comparison with related works shows that the proposed multiplier has lower latency and more efficient area-time complexity. Moreover, since the proposed multiplier offers simplicity, regularity, modularity, and pipelinability, it is well suited to VLSI implementation. We expect that our architecture can be used efficiently in various applications that demand high-speed computation for security purposes, including crypto coprocessor design.
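The per-cycle delay and latency stated in the text can be worked out from the assumed cell-library values. The Python sketch below uses only figures given in this letter (gate delays and areas, a latency of $h + 2$ cycles with $h = m/4$, and the Y module gate counts). The function names are hypothetical, and the entries of Table I are not reproduced.

```python
# Cell-library values assumed in the text (area as transistor count, delay in ns).
A_AND2, T_AND2 = 6.68, 0.094
A_XOR2, T_XOR2 = 12.00, 0.167
A_LATCH1, T_LATCH1 = 16.00, 0.157


def clock_period_ns():
    """One AND2, one XOR3 built from two XOR2 gates, and one latch."""
    return T_AND2 + 2 * T_XOR2 + T_LATCH1


def latency_cycles(m):
    """Latency h + 2 with v = m / 2 and h = v / 2, so m must be a multiple of 4."""
    if m % 4 != 0:
        raise ValueError("m must be a multiple of 4 so that v and h are integers")
    return m // 4 + 2


def latency_ns(m):
    """Total latency in nanoseconds under the assumed gate delays."""
    return latency_cycles(m) * clock_period_ns()


def y_module_area(m):
    """Area of the Y module: m AND2, (m - 1) XOR3, one XOR2, and m latches."""
    a_xor3 = 2 * A_XOR2  # a 3-input XOR is modeled as two 2-input XOR gates
    return m * A_AND2 + (m - 1) * a_xor3 + A_XOR2 + m * A_LATCH1
```

Under these assumptions the clock period is 0.585 ns, and for $m = 12$ the latency is 5 clock cycles, or about 2.93 ns.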
