Abstract. Multiplication is the main finite field arithmetic operation in elliptic curve cryptography, and its bit-serial hardware implementation is attractive in resource-constrained environments such as smart cards, where the chip area is limited. In this paper, a new serial-output bit-serial multiplier using polynomial bases over binary extension fields is proposed. It generates one bit of the product in each clock cycle with a latency of one cycle. To the best of our knowledge, this is the first time that such a serial-output bit-serial multiplier architecture using polynomial bases for general irreducible polynomials has been proposed.
Introduction
The multiplication over the finite (or Galois) field $GF(2^m)$ is the main arithmetic operation in elliptic curve cryptography [7, 11], and choosing a suitable basis plays an important role in its efficient implementation [6]. A field element can be represented using different bases, such as a polynomial basis (PB), a normal basis, or a dual basis. Among them, the polynomial basis representation is simpler and has received more attention for hardware implementation.
A hardware implementation of a finite field multiplier can be categorized as either bit-parallel or bit-serial. In a bit-parallel multiplier over $GF(2^m)$, once the $2m$ bits of the two inputs are received, the $m$ bits of the product are obtained together at the output after a propagation delay through various logic gates. Such a parallel multiplier (see, for example, [16, 10, 15, 5, 18, 13, 12]) requires $O(m^2)$ gates. A bit-serial multiplier, on the other hand, takes $m$ clock cycles for one multiplication using $O(m)$ gates.
Bit-serial multipliers can be further divided into two types according to whether their output is parallel or serial. In parallel-output bit-serial (POBS) multipliers, all $m$ output bits of the product are available at the end of the $m$-th cycle, whereas serial-output bit-serial (SOBS) multipliers generate one bit of the product in each of these $m$ cycles. Examples of the former type include the well-known LSB-first and MSB-first bit-serial polynomial basis multipliers [14, 3] and the normal basis multiplier due to Agnew et al. [1], while those of the latter type are Berlekamp's bit-serial dual basis multiplier [2] and Massey-Omura's original bit-serial normal basis multiplier [8]. Usually, POBS multipliers run at a much higher clock rate than their SOBS counterparts. However, the latency to generate the first bit of the product in SOBS multipliers is one clock cycle, compared to $m$ clock cycles for POBS ones. Therefore, in applications that must be implemented in resource-constrained environments such as smart cards, SOBS multipliers result in faster overall computation than POBS multipliers, since such a system usually runs at a low operating clock frequency. In this paper, we propose a new SOBS PB multiplier for a general irreducible polynomial. To the best of our knowledge, this is the first time that a SOBS PB multiplier has been proposed for general polynomials.
The organization of this article is as follows. In Section 2, the traditional bit-serial architectures for PB multiplication over $GF(2^m)$ are introduced. In Section 3, the matrix formulations for PB multiplication are revisited, and we then derive the formulations for the proposed multiplier structure. A new serial-output bit-serial multiplier is proposed in Section 4. Finally, conclusions are given in Section 5.
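The finite field $GF(2^m)$ consists of $2^m$ field elements and is constructed by the polynomial basis $\{1, \alpha, \alpha^2, \cdots, \alpha^{m-1}\}$, where $\alpha$ is a root of the irreducible polynomial

$$P(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1. \qquad (1)$$

In (1), $1 \le t_1 < t_2 < \cdots < t_{\omega-2}$, and $\omega$ is the number of non-zero terms. Then, each field element $B \in GF(2^m)$ can be written with respect to this basis as

$$B = \sum_{i=0}^{m-1} b_i \alpha^i, \quad b_i \in \{0, 1\}, \qquad (2)$$

where the $b_i$'s are the coordinates of $B$. For convenience, these coordinates will be denoted in vector notation as

$$b = [b_0, b_1, \cdots, b_{m-1}]^T, \qquad (3)$$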
where $T$ denotes the transposition of a vector or a matrix. There are two types of bit-serial multipliers, namely LSB-first and MSB-first [3]. The LSB-first bit-serial multiplier is shown in Figure 1(a). In this multiplier structure, both $X = \langle x_{m-1}, \cdots, x_1, x_0 \rangle$ and $Y = \langle y_{m-1}, \cdots, y_1, y_0 \rangle$ are $m$-bit registers. Let $X(n)$ and $Y(n)$ denote the contents of $X$ and $Y$ at the $n$-th clock cycle, $0 \le n \le m$, respectively. Suppose the $X$ register in Figure 1(a) is initialized with $A$, i.e., $X(0) = A$. Then the output of this register at the $n$-th clock cycle is $X(n) = X^{(n)} \in GF(2^m)$, which is calculated from the input of this register, i.e., $X^{(n-1)}$, using the $\alpha$ module shown in Figure 1(a) as

$$X^{(n)} = \alpha \cdot X^{(n-1)} \bmod P(\alpha), \qquad (4)$$

where $X^{(0)} = A$. Also, suppose that the register $Y$ is initially cleared, i.e., $Y(0) = 0$. Then, one can obtain the content of $Y$ at the first clock cycle as $Y(1) = b_0 A$ and in general at the $n$-th clock cycle as

$$Y(n) = Y(n-1) + b_{n-1} X(n-1). \qquad (5)$$

Let $C$ denote the PB multiplication of $A$ and $B$, i.e., $C = AB \bmod P(\alpha)$. Then, using (2) and (4) recursively, one can obtain

$$C = \sum_{i=0}^{m-1} b_i \left( A \alpha^i \bmod P(\alpha) \right) = \sum_{i=0}^{m-1} b_i X^{(i)}, \qquad (6)$$

and noting the fact that $X(n) = X^{(n)}$, one can determine that after $m$ clock cycles the register $Y$ contains the coordinates of $C$, i.e., $Y(m) = C$. The multiplication of $b_i$ by $X^{(i)}$ in (6) is done using $m$ 2-input AND gates. This is shown with the double circle module with a dot inside in Figure 1(a). Also, the sum operation in (6) is implemented with $m$ 2-input XOR gates, which is shown with a double circle module with a plus inside. Since the coordinates of $B$ enter the multiplier from the least significant bit (LSB), i.e., $b_0$, this multiplier is referred to as the LSB-first bit-serial multiplier.
The MSB-first bit-serial multiplier is shown in Figure 1(b). This structure implements

$$C = (\cdots((b_{m-1} A)\alpha + b_{m-2} A)\alpha + \cdots + b_1 A)\alpha + b_0 A, \qquad (7)$$

where the $\bmod\ P(\alpha)$ operations after multiplications by $\alpha$ are omitted for simplicity. If the registers $U$ and $V$ are initialized with $A = (a_{m-1}, \cdots, a_1, a_0)$ and $0 = (0, \cdots, 0, 0)$, respectively, then one can verify that after the $m$-th clock cycle the register $V$ contains the coordinates of $C$, i.e., $V(m) = C$. It is noted that multiplexers may be used for the parallel load of inputs into the registers in Figure 1. These are not shown in the figure for simplicity.
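The two traditional structures can be mimicked in software to see that they compute the same product. The following is only a behavioural sketch, not a gate-level model: field elements are packed into Python integers (bit $i$ holds the coordinate of $\alpha^i$), and the names `times_alpha`, `lsb_first`, `msb_first`, `reference_multiply` and the mask `p_low` (the polynomial $P(x)$ without its $x^m$ term) are illustrative choices that do not appear in the paper.

```python
def times_alpha(x, m, p_low):
    """Return alpha * x mod P(alpha); p_low is P(x) without its x^m term, as a bit mask."""
    x <<= 1
    if (x >> m) & 1:                  # a degree-m term appeared: replace alpha^m by p_low
        x ^= (1 << m) | p_low
    return x


def lsb_first(a, b, m, p_low):
    """LSB-first model: b_0 is consumed first, the product is in Y after m cycles."""
    x, y = a, 0                       # X(0) = A, Y(0) = 0
    for n in range(m):
        if (b >> n) & 1:
            y ^= x                    # Y accumulates b_n * X(n)
        x = times_alpha(x, m, p_low)  # X(n+1) = alpha * X(n) mod P(alpha)
    return y


def msb_first(a, b, m, p_low):
    """MSB-first model: b_{m-1} is consumed first, the product is in V after m cycles."""
    v = 0                             # V(0) = 0
    for n in reversed(range(m)):
        v = times_alpha(v, m, p_low)  # V <- alpha * V mod P(alpha)
        if (b >> n) & 1:
            v ^= a                    # V <- V + b_n * A
    return v


def reference_multiply(a, b, m, p_low):
    """Schoolbook carry-less product followed by reduction, used only as a cross-check."""
    s = 0
    for i in range(m):
        if (b >> i) & 1:
            s ^= a << i
    for k in range(2 * m - 2, m - 1, -1):
        if (s >> k) & 1:
            s ^= (1 << k) ^ (p_low << (k - m))
    return s
```

For instance, with $m = 7$ and `p_low = 0b0101011` (that is, $x^5 + x^3 + x + 1$), both `lsb_first(0b10, 0b1000000, 7, 0b0101011)` and `msb_first(...)` with the same arguments give the coordinates of $\alpha^7 = \alpha^5 + \alpha^3 + \alpha + 1$. In both models the full product only becomes available after the last loop iteration, which is the parallel-output behaviour described above.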
Matrix Formulations for PB Multiplication Revisited
In [10, 9], Mastrovito showed that the coordinates of $C = AB \bmod P(\alpha)$ are obtained from the matrix-by-vector product $c = [c_0, c_1, \cdots, c_{m-1}]^T = M b$, where $M$ is an $m \times m$ binary matrix whose entries depend on the coordinates of $A$ and the entries of the $(m-1) \times m$ reduction matrix $Q$, which is defined by

$$[\alpha^m, \alpha^{m+1}, \cdots, \alpha^{2m-2}]^T \equiv Q \, [1, \alpha, \cdots, \alpha^{m-1}]^T \pmod{P(\alpha)}. \qquad (8)$$

The Mastrovito matrix $M$ has been studied in [15] and [5] for irreducible trinomials and arbitrary polynomials, respectively. A systematic design to obtain the Mastrovito matrix $M$ for general irreducible polynomials is presented in [18].
To find the PB multiplication, another approach is proposed in [17] and [12] for irreducible trinomials and arbitrary polynomials, respectively. The multiplication operation in this approach consists of two parts: the product of two field elements $A = (a_{m-1}, \cdots, a_1, a_0)$ and $B \in GF(2^m)$, i.e., $AB$, followed by the modular reduction, i.e., $C = AB \bmod P(\alpha)$. Let us denote the result of the product of two polynomials as

$$AB = D + \alpha^m E, \qquad (9)$$

where

$$D = \sum_{k=0}^{m-1} d_k \alpha^k, \quad E = \sum_{k=0}^{m-2} e_k \alpha^k.$$

It is shown in [12] that the coordinates of $E$ and $D$ can be obtained from the following:

$$d_k = \sum_{j=0}^{k} a_{k-j} b_j, \quad 0 \le k \le m-1, \qquad (10)$$

$$e_k = \sum_{j=k+1}^{m-1} a_{m+k-j} b_j, \quad 0 \le k \le m-2. \qquad (11)$$

Then, one can calculate the coordinates of $C = (c_{m-1}, \cdots, c_1, c_0)$ from the following reduction equation [12]:

$$c = d + Q^T e. \qquad (12)$$
Let us define the down shift of the matrix $S$ by $j$ rows as $S[\downarrow j]$ and the right shift of $S$ by $i$ columns as $S[\rightarrow i]$, where the positions emptied by the shifts are filled with zeros. Then, it is shown in [4] that the $Q^T$ matrix in (12) can be represented as
where the sets (1)) and
In (14), $I_{(m-1) \times (m-1)}$ is the $(m-1) \times (m-1)$ identity matrix and $0_{1 \times (m-1)}$ is a zero row vector with $m-1$ entries. Then, using (13), the matrix reduction equation of (12) is simplified in [4] to
and
It is noted that to obtain the set N ⊂ {0, 1, · · · , m − 1} in (17), one can use the algorithm proposed in [18] . For the irreducible polynomial P (x) with the second highest degree t ω−2 ≤ (m + 1)/2, it is proved in [4] 
In the following, we show another approach to find this set for arbitrary irreducible polynomial.
For a given irreducible polynomial $P(x)$ stated in (1), the reduction matrix defined in (8) is fixed. Thus, the entries of $Q$ are constant, i.e., $q_{i,j} \in \{0, 1\}$, and can be found from (8) for the underlying polynomial $P(x)$. Let us assume the entries of column 0 of $Q$, i.e., $q_{i,0}$, $0 \le i \le m-2$, are given. Let $n$ and $r_j$ ($0 \le j \le n-1$) be the number of nonzero entries of column 0 of this matrix and their row positions, respectively, i.e.,

$$R = \{r_0, r_1, \cdots, r_{n-1}\}, \quad q_{r_j, 0} = 1.$$

This column is equal to row 0 of $Q^T$ and is obtained from (13) for $j = 0$. Then, one can easily see that $R = N$, i.e., the elements of $N$ are the locations of the non-zero entries of column 0 of the reduction matrix.

Remark 1. Since $\alpha^m = \sum_{i=1}^{\omega-2} \alpha^{t_i} + 1$, which is obtained from (1), one can easily see that $r_0 = 0$ for any irreducible polynomial [4].

Remark 2. It is noted that for the irreducible trinomial $P(x) = x^m + x + 1$, i.e., $t_{\omega-2} = 1$, $\omega = 3$, column 0 of $Q$ has only one nonzero entry, i.e., $n = 1$, which is in row $r_0 = 0$.

Remark 3. If $t_{\omega-2} > 1$, then the second nonzero entry in column 0 of $Q$ is in row $r_1 = m - t_{\omega-2}$.
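The set $R$ can also be found numerically by building the reduction matrix row by row. The sketch below assumes the usual convention that row $i$ of $Q$ holds the coordinates of $\alpha^{m+i} \bmod P(\alpha)$ with the constant term in column 0; the function names `reduction_matrix` and `column0_rows` and the argument `t` (the tuple of middle exponents $t_1, \cdots, t_{\omega-2}$) are hypothetical helpers introduced for illustration.

```python
def reduction_matrix(m, t):
    """Rows 0..m-2 of Q; row i lists the coordinates of alpha^(m+i) mod P(alpha),
    constant term first. t holds the middle exponents of P(x)."""
    p_low = [0] * m
    for exponent in (0, *t):
        p_low[exponent] = 1                      # alpha^m = 1 + sum of alpha^{t_i}
    rows, current = [], p_low[:]
    for _ in range(m - 1):
        rows.append(current)
        overflow = current[m - 1]                # coefficient that would reach alpha^m
        current = [0] + current[:m - 1]          # multiply by alpha
        if overflow:
            current = [c ^ p for c, p in zip(current, p_low)]
    return rows


def column0_rows(m, t):
    """The set R: row positions of the nonzero entries in column 0 of Q."""
    return [i for i, row in enumerate(reduction_matrix(m, t)) if row[0]]
```

Under these assumptions, `column0_rows(7, (1,))` returns `[0]`, in line with Remark 2, and `column0_rows(7, (1, 3, 5))` returns `[0, 2]`, whose second entry equals $m - t_{\omega-2} = 7 - 5$ as stated in Remark 3.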
In the following, we slightly simplify e in (17) to present the key formulation for the proposed SOBS multiplier. Since
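one can see that $I_{m \times (m-1)}[\rightarrow i]\, e$ is equal to the up shift of the vector $[e_0, \cdots, e_{m-2}, 0]^T$ by $i$ positions, i.e.,

$$e[\uparrow i] = [e_i, e_{i+1}, \cdots, e_{m-2}, 0, \cdots, 0]^T. \qquad (20)$$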
Therefore, we conclude the above discussion to state the following.
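Lemma 1. Let the finite field $GF(2^m)$ be constructed by the general irreducible polynomial $P(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$. Then the coordinates of the PB multiplication $C = AB \bmod P(\alpha)$ can be obtained in two steps:

$$\hat{e} = \sum_{i \in R} e[\uparrow i], \qquad (21)$$

followed by

$$c = d + \sum_{j \in T} \hat{e}[\downarrow j], \qquad (22)$$

where $d$, $e$, $e[\uparrow i]$ and $\hat{e}[\downarrow j]$ are obtained from (10), (11), (20) and (16), respectively.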
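The two steps of Lemma 1 can be checked numerically against an ordinary multiply-then-reduce routine. The sketch below makes two assumptions that are read off the worked example rather than stated in the lemma itself: $T$ is taken to be $\{0, t_1, \cdots, t_{\omega-2}\}$, and $d$, $e$ are taken as the low $m$ and high $m-1$ coefficients of the unreduced product. The function names and the bit-list representation (index = power of $\alpha$) are illustrative only.

```python
def lemma1_multiply(a, b, m, t, r):
    """a, b: lists of m bits (index = power of alpha); t: middle exponents of P(x);
    r: the set R. Returns the coordinates c_0..c_{m-1}."""
    # Unreduced product: d = low m coefficients, e = high m-1 coefficients.
    s = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]
    d, e = s[:m], s[m:] + [0]         # e padded to [e_0, ..., e_{m-2}, 0]
    # First step: e_hat = sum over i in R of e shifted up by i positions.
    e_hat = [0] * m
    for i in r:
        for k in range(m - i):
            e_hat[k] ^= e[k + i]
    # Second step: c = d + sum over j in T of e_hat shifted down by j positions.
    c = d[:]
    for j in (0, *t):
        for k in range(j, m):
            c[k] ^= e_hat[k - j]
    return c


def reference_multiply(a, b, m, t):
    """Schoolbook product reduced one coefficient at a time, used as a cross-check."""
    s = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]
    for k in range(2 * m - 2, m - 1, -1):
        if s[k]:
            s[k] = 0
            for exponent in (0, *t):
                s[k - m + exponent] ^= 1
    return s[:m]
```

With `m = 7`, `t = (1, 3, 5)` and `r = (0, 2)`, the two functions agree on every pair of inputs. Note that the second step involves only shifts of a single vector, which is what makes a one-bit-per-cycle schedule possible.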
Proposition 1. The reduction matrix method stated by (21) and (22) in Lemma 1 requires
number of two-input XOR gates with the critical path delay of at most

$$\left( \lceil \log_2 n \rceil + \lceil \log_2 \omega \rceil \right) T_X, \qquad (24)$$

where $T_X$ is the time delay of an XOR gate.
Proof. The number of bit-wise addition (XOR gates) required for (21) is
Similarly, implementation of (22) requires
Thus, by adding (25) and (26), the proof of (23) is complete. The time delay of (24) is obtained if we add the delay of (21), i.e., $\lceil \log_2 n \rceil T_X$, to the delay of (22), i.e., $\lceil \log_2 \omega \rceil T_X$.
New Serial-Output Bit-Serial Multiplier
Unlike the bit-serial multipliers presented in Section 2, this multiplier generates one bit of the product in each clock cycle with a latency of one clock cycle.
Architecture
In order to develop a bit-serial multiplier, Lemma 1 is used to generate the coordinates of C in the order of c 0 , followed by c 1 , · · · , and c m−1 . The new architecture, which is referred to as serial-output bit-serial (SOBS) multiplier, is shown in Figure 2 
As seen in this figure, the output of shift register L are connected to n−1 right shift (RS) blocks as well as the BTX array. The RS(r i
where $-$ denotes that nothing is connected to those $r_i$ left-most coordinates. The outputs of the RS($r_1$) and RS($r_{n-1}$) blocks, i.e., $L_{\rightarrow r_1}$ and $L_{\rightarrow r_{n-1}}$, respectively, are shown in Figure 2(b). This figure also shows how the outputs of the BTX array, i.e., $v_{m-1}, \cdots, v_1$, are obtained. As seen in Figure 2(b), the BTX array requires $m - 1 - r_1$ BTXs whose numbers of inputs vary from 2 to $n$. Specifically, it consists of $m - 1 - r_{n-1}$ BTXs with $n$ inputs, $r_{n-1} - r_{n-2}$ BTXs with $n - 1$ inputs, $\cdots$, and $r_2 - r_1$ BTXs with 2 inputs, i.e., 2-input XOR gates. In general, the BTX array includes $r_{i+1} - r_i$ BTXs with $i + 1$ inputs for $1 \le i \le n - 1$ (assume $r_n = m - 1$). Therefore, as seen in Figure 2(b), the outputs of the BTX array, i.e., the $v_i$'s, are obtained as follows:
Using Figure 2(b) or (27), one can obtain the number of XOR gates required for realizing the BTX array in Figure 2(a) as

$$\sum_{i=1}^{n-1} i \, (r_{i+1} - r_i) = (n-1)(m-1) - \sum_{i=1}^{n-1} r_i. \qquad (28)$$

Also, the time delay of the longest path between the inputs and outputs of the BTX array is $\lceil \log_2 n \rceil T_X$.
using $m - 1$ AND gates and $m - 2$ XOR gates with a time delay of $T_A + \lceil \log_2 (m-1) \rceil T_X$. Similarly, the output of IP($m$) generates
which requires $m$ AND gates and $m - 1$ XOR gates with a time delay of $T_A + \lceil \log_2 m \rceil T_X$.
Initialization and Multiplication Operation
In this section, we show that by proper initialization of the shift registers, the bit-serial multiplier generates the coordinates of $C$ in such a way that $c_0$ and $c_{m-1}$ are the first and last bits output from $c$, respectively. Let us initialize the shift registers $L$ and $U$ with the coordinates of $A$ as
In fact, only one bit of U, i.e., u m−1 , is initialized with a 0 and other bits are cleared. Also, the register B is initialized with the coordinates of B as Figure 2 (a) after the τ -th clock cycle. Then, by substituting (31) into (27) and using (29), one can obtain the initial value of the output of IP(m − 1) in Figure 2 (a) as
Using (11) and (21), one can simplify (32) to $x_0(0) = \sum_{i \in R} e_i = \hat{e}_0$. Similarly, let $U(\tau)$ and $d(\tau)$ be the contents of the shift register $U$ and the signal $d$ in Figure 2(a) after the $\tau$-th clock cycle, $0 \le \tau \le m - 1$. Then, by using (10) and (30), one can see that

$$d(\tau) = d_\tau. \qquad (33)$$

Thus, noting that the contents of register $X$ are initially cleared, i.e., $x_j = 0$, $j \ne 0$, one can find that $c$ in Figure 2(a) outputs $c_0$ after initialization, i.e.,

$$c(0) = x_0(0) + d(0) = \hat{e}_0 + d_0 = c_0. \qquad (34)$$
In the following, we show that the output c in Figure 2 (a) generates c τ after the τ -th clock cycle. At this time, the coordinates of register L is changed from the initial value of
Then, using (32) with the new value of L, the output of IP(m − 1) generates
which simplifies to

$$x_0(\tau) = \hat{e}_\tau \qquad (35)$$

if (11) and (21) are used.
To obtain the output $c$ after the $\tau$-th clock cycle, i.e., $c(\tau)$, we need the contents of the shift register $X$, which are found as

$$x_i(\tau) = x_{i-1}(\tau - 1), \quad 1 \le i \le t_{\omega-2}. \qquad (36)$$

By recursively using (36), one can find $x_i(\tau) = x_0(\tau - i)$ for $\tau \ge i$, which can be written as

$$x_i(\tau) = \hat{e}_{\tau - i}, \quad \tau \ge i, \qquad (37)$$

if we use (35). Thus, the output of Figure 2(a) after the $\tau$-th clock cycle is $c(\tau) = \sum_{j \in T} x_j(\tau) + d(\tau)$. Therefore, by using (33), (35), (37) and Lemma 1, one can find $c(\tau) = c_\tau$.
An Example
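We consider the field $GF(2^7)$ defined by the irreducible polynomial $P(x) = x^7 + x^5 + x^3 + x + 1$, for which the reduction matrix can be obtained as

$$Q = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 \end{bmatrix}. \qquad (38)$$

It is seen from column 0 of (38) that $n = 2$, $r_0 = 0$, and $r_1 = 2$. For this example, $R = \{0, 2\}$ and $T = \{0, 1, 3, 5\}$. Table 1 shows how Figure 2(a) generates the coordinates of $C$ at each clock cycle $\tau$.

| $\tau$ | $v_6, v_5, v_4, v_3, v_2, v_1$ | $x_0$ | $x_1, x_2, x_3, x_4, x_5$ | $d$ | $c = x_0 + x_1 + x_3 + x_5 + d$ |
|---|---|---|---|---|---|
| 0 | $a_6,\ a_5,\ a_6 + a_4,\ a_5 + a_3,\ a_4 + a_2,\ a_3 + a_1$ | $\hat{e}_0$ | $0, 0, 0, 0, 0$ | $d_0$ | $\hat{e}_0 + d_0 = c_0$ |
| 1 | $0,\ a_6,\ a_5,\ a_6 + a_4,\ a_5 + a_3,\ a_4 + a_2$ | $\hat{e}_1$ | $\hat{e}_0, 0, 0, 0, 0$ | $d_1$ | $\hat{e}_1 + \hat{e}_0 + d_1 = c_1$ |
| 2 | $0,\ 0,\ a_6,\ a_5,\ a_6 + a_4,\ a_5 + a_3$ | $\hat{e}_2$ | $\hat{e}_1, \hat{e}_0, 0, 0, 0$ | $d_2$ | $\hat{e}_2 + \hat{e}_1 + d_2 = c_2$ |
| 3 | $0,\ 0,\ 0,\ a_6,\ a_5,\ a_6 + a_4$ | $\hat{e}_3$ | $\hat{e}_2, \hat{e}_1, \hat{e}_0, 0, 0$ | $d_3$ | $\hat{e}_3 + \hat{e}_2 + \hat{e}_0 + d_3 = c_3$ |
| 4 | $0,\ 0,\ 0,\ 0,\ a_6,\ a_5$ | $\hat{e}_4$ | $\hat{e}_3, \hat{e}_2, \hat{e}_1, \hat{e}_0, 0$ | $d_4$ | $\hat{e}_4 + \hat{e}_3 + \hat{e}_1 + d_4 = c_4$ |
| 5 | $0,\ 0,\ 0,\ 0,\ 0,\ a_6$ | $\hat{e}_5$ | $\hat{e}_4, \hat{e}_3, \hat{e}_2, \hat{e}_1, \hat{e}_0$ | $d_5$ | $\hat{e}_5 + \hat{e}_4 + \hat{e}_2 + \hat{e}_0 + d_5 = c_5$ |
| 6 | $0,\ 0,\ 0,\ 0,\ 0,\ 0$ | $0$ | $\hat{e}_5, \hat{e}_4, \hat{e}_3, \hat{e}_2, \hat{e}_1$ | $d_6$ | $\hat{e}_5 + \hat{e}_3 + \hat{e}_1 + d_6 = c_6$ |

Table 1. The multiplication operation for $GF(2^7)$ generated by $x^7 + x^5 + x^3 + x + 1$.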
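The cycle-by-cycle behaviour in Table 1 can be reproduced with a small software model. This is a behavioural sketch inferred from the table, not a description of the actual circuit: it assumes that $L$ and $U$ act as one long right-shifting register, that IP($m-1$) pairs $v_i$ with $b_{m-i}$, and that IP($m$) pairs $u_{m-1-j}$ with $b_j$. The names `sobs_stream` and `reference_multiply` and the bit-list representation (index = power of $\alpha$) are illustrative.

```python
def sobs_stream(a, b, m, t, r):
    """Yield c_0, c_1, ..., c_{m-1}, one bit per clock cycle.
    a, b: lists of m bits; t: middle exponents of P(x); r: the set R."""
    l = [0] + a[1:]                   # l[i] = l_i for 1 <= i <= m-1 (l[0] is unused)
    u = [0] * (m - 1) + [a[0]]        # only u_{m-1} is loaded, with a_0
    x = [0] * (max(t) + 1)            # x[1..t_{omega-2}] models the X shift register
    for _ in range(m):
        # BTX array: v_i adds l_{i+shift} over shift in R (shifted-out positions add nothing).
        v = [0] * m
        for i in range(1, m):
            for shift in r:
                if i + shift <= m - 1:
                    v[i] ^= l[i + shift]
        # Inner products: x_0 from v and B, d from U and B.
        x[0] = 0
        for i in range(1, m):
            x[0] ^= v[i] & b[m - i]
        d = 0
        for j in range(m):
            d ^= u[m - 1 - j] & b[j]
        # Output: c = x_0 + (x_j over the nonzero exponents j in T) + d.
        c = x[0] ^ d
        for j in t:
            c ^= x[j]
        yield c
        # Clock edge: X shifts by one place; L||U shifts right as one register.
        x = [0] + x[:-1]
        u = u[1:] + [l[1]]
        l = [0] + l[2:] + [0]


def reference_multiply(a, b, m, t):
    """Schoolbook product reduced one coefficient at a time, used as a cross-check."""
    s = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]
    for k in range(2 * m - 2, m - 1, -1):
        if s[k]:
            s[k] = 0
            for exponent in (0, *t):
                s[k - m + exponent] ^= 1
    return s[:m]
```

Calling `list(sobs_stream(a, b, 7, (1, 3, 5), (0, 2)))` collects the bits in the order $c_0, c_1, \cdots, c_6$; because the function is a generator, the first bit is available after a single iteration, mirroring the one-cycle latency of the serial output.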
Complexity Analysis
In this section, we obtain the space and time complexities of the proposed serial-output bit-serial (SOBS) multiplier.
Proposition 2. For the finite field $GF(2^m)$ generated by the general irreducible $\omega$-nomial $P(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$, the SOBS PB multiplier (Figure 2(a)) requires $3m + t_{\omega-2} - 1$ 1-bit registers, $2m - 1$ 2-input AND gates, and $(n + 1)(m - 1) + \omega - 2 - \sum_{i=1}^{n-1} r_i$ 2-input XOR gates.
Proof. The number of 1-bit registers includes the ones in the $L$ and $U$ shift registers, i.e., $2m - 1$, the register $B$, i.e., $m$, and the shift register $X$, i.e., $t_{\omega-2}$. Thus, the multiplier requires $3m + t_{\omega-2} - 1$ 1-bit registers. The IP($m$) and IP($m - 1$) blocks require $m$ and $m - 1$ AND gates, respectively. Therefore, the multiplier requires $2m - 1$ 2-input AND gates. The number of XOR gates is obtained by adding those for the BTX array, the IP($m$) and IP($m - 1$) blocks, and the BTX block, which are (28), $m - 1$, $m - 2$, and $\omega - 1$, respectively. As a result, the number of XOR gates required in the multiplier is $(n - 1)(m - 1) - \sum_{i=1}^{n-1} r_i + (m - 1) + (m - 2) + (\omega - 1) = (n + 1)(m - 1) + \omega - 2 - \sum_{i=1}^{n-1} r_i$, and the proof is complete.
The time complexity of the multiplier is determined by three factors: the latency, the number of clock cycles required for the whole multiplication, and the critical path delay. Let us define the latency as the number of clock cycles needed until the first bit of the output is available. Based on this definition, one can see that the latency of the SOBS multiplier is one and the entire multiplication requires $m$ clock cycles. The critical path delay, which is the delay of the longest path from the registers to the output $c$, determines the maximum operating frequency. By properly implementing the BTX block in Figure 2(a), one can minimize this delay as follows.

Proposition 3. Let $T_A$ and $T_X$ be the delays of an AND gate and an XOR gate, respectively. Then, the critical path delay of the SOBS PB multiplier (Figure 2(a)) is at most $T_A + \max(T_1, T_2)$, where

$$T_1 = \left( 1 + \lceil \log_2 (\omega - 1) \rceil + \lceil \log_2 m \rceil \right) T_X, \quad T_2 = \left( 1 + \lceil \log_2 (m - 1) \rceil + \lceil \log_2 n \rceil \right) T_X.$$

Proof. The critical path delay of the multiplier is determined by the maximum delay between the two paths from the shift registers $L$ and $U$ to the output $c$. In order to minimize this delay, one can implement $c$ in Figure 2(a) as $c = c' + x_0$, where

$$c' = \sum_{j \in T \setminus \{0\}} x_j + d. \qquad (39)$$
Since the path delay from the shift register $U$ to the output $d$ is $T_A + \lceil \log_2 m \rceil T_X$ and (39) requires $\lceil \log_2 (\omega - 1) \rceil T_X$ using a BTX, one can see that the delay to generate $c'$ is at most $T' = T_A + (\lceil \log_2 (\omega - 1) \rceil + \lceil \log_2 m \rceil) T_X$. Also, the delay to generate $x_0$ from the shift register $L$ is $T'' = T_A + (\lceil \log_2 (m - 1) \rceil + \lceil \log_2 n \rceil) T_X$. Therefore, the total delay to generate $c$ is $T_X + \max(T', T'')$, which is equal to $T_A + \max(T_1, T_2)$, and the proof is complete.
Comparison
Table 2 shows the comparison of the proposed SOBS PB multiplier with the traditional LSB-first and MSB-first ones presented in Section 2 in terms of time and space complexities. To illustrate the differences between the complexities of the proposed multiplier and those of the other multipliers, the complexities are tabulated for both irreducible $\omega$-nomials and irreducible trinomials. The number of XOR gates $\gamma$ in this table is obtained for the irreducible trinomial $P(x) = x^m + x^k + 1$. For the $GF(2^{233})$ field recommended by NIST, one can use $m = 233$, $k = 74$, and $T_3 = T_A + 10 T_X$ in this table. As seen from this table, the proposed SOBS multiplier has the lowest latency at the expense of a longer critical path and a larger area requirement.

Table 2. Comparison of multipliers in terms of time and space complexities for irreducible $\omega$-nomials and trinomials, where $\gamma = (n + 1)(m - 1) + \omega - 2 - \sum_{i=1}^{n-1} r_i$, $T_1 = (1 + \lceil \log_2 (\omega - 1) \rceil + \lceil \log_2 m \rceil) T_X$, $T_2 = (1 + \lceil \log_2 (m - 1) \rceil + \lceil \log_2 n \rceil) T_X$, and $T_3 = T_A + (2 + \lceil \log_2 m \rceil) T_X$.
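To get a feel for the numbers, the expressions from Propositions 2 and 3 can be evaluated for a concrete field. The helper below is a sketch: `sobs_complexity` and `ceil_log2` are illustrative names, the set $R$ must be supplied by the caller (with $r_0 = 0$), and the delays are returned as multiples of $T_X$ so that the critical path is $T_A$ plus the larger of the two values times $T_X$.

```python
def ceil_log2(k):
    """Ceiling of log2(k) for an integer k >= 1, computed without floating point."""
    return (k - 1).bit_length()


def sobs_complexity(m, t, r):
    """Gate counts and delay coefficients taken from Propositions 2 and 3.
    t: middle exponents of P(x); r: the set R with r_0 = 0."""
    omega, n = len(t) + 2, len(r)
    return {
        "registers": 3 * m + max(t) - 1,
        "and_gates": 2 * m - 1,
        "xor_gates": (n + 1) * (m - 1) + omega - 2 - sum(r),  # r_0 = 0 adds nothing
        # T1 and T2 in units of T_X; critical path is at most T_A + max(t1, t2) * T_X.
        "t1": 1 + ceil_log2(omega - 1) + ceil_log2(m),
        "t2": 1 + ceil_log2(m - 1) + ceil_log2(n),
    }
```

For the $GF(2^7)$ example with `t = (1, 3, 5)` and `r = (0, 2)`, this evaluates to 25 registers, 13 AND gates and 19 XOR gates, with a critical path of at most $T_A + 6 T_X$.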
Conclusions
A new serial-output bit-serial multiplier structure for general irreducible polynomials has been proposed. The proposed multiplier can be used for applications such as RFID tags, where the field size and the irreducible polynomial are fixed. We have obtained the complexities of the proposed multiplier and compared them with those of the LSB-first and MSB-first multipliers. Unlike the parallel-output multipliers, which have a latency of $m$ clock cycles, the proposed serial-output bit-serial multiplier has a latency of one clock cycle. This is achieved at the expense of a longer critical path delay and a larger area requirement.
It is interesting to note that by connecting the output of the proposed multiplier to the serial-input of the LSB-first multiplier, one can obtain a hybrid structure which performs two multiplications together. The results of such a hybrid structure are available in parallel after m clock cycles and it has practical applications for fast cryptographic computations.
The proposed bit-serial multiplier can be extended to obtain a new serial-output digit-serial multiplier by replicating the BTX, IP($m$), and IP($m - 1$) blocks in Figure 2(a). The latency of such a digit-serial multiplier is one, and it generates $K$ bits of the product in each clock cycle, for a total of $\lceil m/K \rceil$ clock cycles for the entire multiplication.
