In [11] , a new approach to subquadratic space complexity multiplication for extended finite fields has recently been proposed for hardware implementation. In this article, we develop the corresponding algorithm for software implementation. Compared to the Karatsuba algorithm, the proposed algorithm has a lower theoretical time complexity when the size of the input is greater than a fixed integer. While its recursive implementation is as simple as that of the Karatsuba algorithm, it requires less memory to store the look-up table than the latter, e.g., 512 bytes vs. 128 kilobytes in our implementation. To the best of our knowledge, this is the first better alternative to the Karatsuba algorithm for software implementation dealing with "intermediate" sized finite fields.
m. When elements of GF (2 n ) are represented using a polynomial basis, they are polynomials of degree less than n and coefficients in GF (2) . Similarly, addition and multiplication operations in GF (2 n ) are defined as addition and multiplication of two such polynomials modulo f (u).
Since no carry occurs in GF (2) addition, GF (2 n ) addition is straightforward.
For the software implementation, the polynomial basis GF (2 n ) multiplication operation may be performed by the following two steps: the conventional polynomial multiplication followed by the reduction modulo the field generating irreducible polynomial. The second step becomes simple if some special irreducible polynomials are chosen to generate GF (2 n ), e.g., irreducible trinomials or pentanomials. In fact, there is no known value of n for which an irreducible polynomial of degree n and weight w < 6 does not exist over GF (2) [1] .
The elementary method to perform the first step, i.e., the conventional polynomial multiplication, is the school method which has an asymptotic complexity O(n 2 ). Recursive application of the divide-and-conquer based Karatsuba algorithm leads to a running time of O(n log 2 3 ) ≈ O(n 1.585 ) for n = 2 i (i > 0) [2] . The k-way (k > 2) generalization of the Karatsuba algorithm has a better asymptotic complexity [3] . For practical implementation, the crossover point between the school and the Karatsuba methods is reported to be n = 576 in [4] . Other more sophisticated algorithms, e.g., the FFT and Cantor's multiplication, have better asymptotic complexity and are suitable for large values of n [5] . The crossover point between these methods and the Karatsuba method is relatively large, e.g., the crossover point between the Karatsuba and the Cantor algorithms are 35840 and 16000 in [4] and [6] , respectively. For software implementation of these algorithms, "a typical experience is that the school method is best for small inputs, Karatsuba's algorithm takes over for intermediate sizes, and a fast, for example, an FFT-based method, excels for large problems" [7] .
In this article, we focus on the software implementation of the GF (2 n ) multiplication algorithm for cryptographic applications. In some of these applications, the value of n is often a few hun- and GF (2 571 ). According to the above suggestion, it seems that the Karatsuba algorithm is the best choice for intermediate sizes of n. In the following, we will combine two existing techniques: optimal Toeplitz matrix-vector product formulae [8] and the coordinate transformation technique of [9] and [10] , and propose an alternative scheme for software implementation of the subquadratic GF (2 n ) multiplication. Theoretical analysis indicates that it is faster than the Karatsuba algorithm.
In order to speedup the implementation of the Karatsuba algorithm, in practice the multiplication of polynomials of degree less than 8 are often performed by the table-lookup technique [4] .
The corresponding table uses 128 kilobytes. In our implementation of the proposed algorithm, the table-lookup technique is also adopted. Although the size of tables that we employ is relatively small, i.e., 512 bytes, experimental results show that the algorithm is still faster than the Karatsuba algorithm for n = 2 i , where (6 < i < 18). Another advantage of the proposed algorithm is its simplicity, i.e., its recursive implementation is as simple as the Karatsuba algorithm. Therefore, the proposed method is a better alternative to the Karatsuba algorithm.
The remainder of this article is organized as follows: The algorithm is proposed in Section II, and the software implementation for irreducible trinomials is presented in Section III. Finally, concluding remarks are made in Section IV.
II. ALTERNATIVE SUBQUADRATIC MULTIPLICATION ALGORITHM FOR SHIFTED POLYNOMIAL BASIS
In this section, first we briefly review the subquadratic space complexity scheme of [11] and then present explicit formulae for general irreducible trinomials.
A. Optimal Toeplitz Matrix-Vector Product Formula for n = 2 i (i > 0)
In this subsection, the noncommutative matrix-vector multiplication scheme for n = 2
and its asymptotic time complexities are presented. Let T = (m k,i ) be an n × n Toeplitz matrix over GF (2), where
that T is determined by the 2n − 1 elements of the first row and the first column. Let V be an n × 1 column vector. Matrix T and vector V can be split as follows:
where T 0 , T 1 and T 2 are (n/2) × (n/2) matrices and are in the Toeplitz form, and V 0 and V 1 are (n/2) × 1 column vectors. The following noncommutative formula compute the Toeplitz matrix-vector product T V [8] :
where
. Please note that in (1) the product of an n × n Toeplitz matrix and an n × 1 vector is primarily reduced to three products of matrix and vector of sizes (n/2) × (n/2) and (n/2) × 1.
Let T mvp (n) denote the time complexity for the computation of matrix-vector product T V .
The following recurrence relations can be established when (1) is used recursively to compute
For the purpose of comparison, we also present the recurrence relation that describes the time complexity of the Karatsuba algorithm as follows.
T kara (n) = 3T kara (n/2) + c kara n + d kara .
After solving these recurrence relations using Lemma 1 of [11] , we have
where subscripts (i.e., mvp and kara) are omitted.
Therefore, recursive application of (1) and the Karatsuba algorithm have the same asymptotic complexity, i.e., O(n log 2 3 ). In the next section we will estimate values of e, c and d, and show
) of the proposed algorithm is smaller than that of the Karatsuba algorithm if formula (1) and the Karatsuba method are applied until n reaches 1.
Similar to the generalization of the Karatsuba method from n = 2
where p is a small odd prime [3] , we may also obtain matrix-vector product formulae for other small primes. The case of n = 3 i (i > 0) has been presented in [11] . In situations where none of these small primes divides n, one possible solution is to increase the size of the matrix and vector by 1 by padding zeroes, and then apply the existing complexity results to the case n + 1.
B. Formulation Using Shifted Polynomial Basis for General Irreducible Trinomials
Shifted polynomial basis (SPB) is a shifted version of the standard polynomial basis. GF (2 n ) multiplication operation in the SPB is similar to that of the standard polynomial basis. GF (2 n )
hardware multipliers based on the SPB often have lower time complexity than the corresponding polynomial basis multiplier. Let x be a root of the irreducible polynomial f (u) and GF (2
is defined as follows [12] :
polynomial basis with respect to M .
From now on, we assume that
T be the column vector of SPB basis elements, A = (a 0 , a 1 , · · · , a n−1 ) T be the coordinate column vector of the field element a = x
and B, C and D are defined similarly.
Besides the two-step multiplication method mentioned in the introduction section, the other method to compute the product c = ab is to form a binary n × n matrix Z first, which depends on b and f (u), and then perform a matrix-vector product. Namely, the product c = ab may be computed as follows:
where Z i is the coordinate column vector of x i−v b with respect to the SPB (0 ≤ i ≤ n + 1), and Z is an n × n matrix.
This matrix-vector product method has been widely used to design GF (2 n ) multipliers using VLSI technologies. However, Z is not generally a Toeplitz matrix. Using coordinate transformation techniques presented in [9] and [10] , one may first transform Z into a Toeplitz matrix T , i.e., T = U Z, where U is the transform matrix. Then compute the Toeplitz matrix-vector product D = T A. Finally, the result C is obtained by C = U −1 D. In [11] , this idea has been used and the following simple transformation matrix U has been obtained:
where I v×v is the v × v identity matrix.
Since an n × n Toeplitz matrix is determined by the 2n − 1 elements of the first and last columns, we now determine the explicit formulae of these elements in T . Due to the property of the transformation matrix U , it is clear that T is obtained from Z by moving the lower n − v rows to the upper n − v rows. Therefore, we first determine the first and the last columns of Z.
From (3), we know that these two columns correspond to coordinate vectors of element x −v b and x n−v−1 b, respectively, which are given as follows:
and
Let x n−v−1 b and x −v b be the two elements of GF (2 n ) whose SPB coordinates form the first and last columns of T . Applying the transformation U to Z, we may obtain explicit formulae of elements x n−v−1 b and x −v b as follows:
where < j > denotes j mod n.
Note that explicit formulae of matrix T and U for irreducible pentanomials g(u) = u n + u k+1 + u k + u k−1 + 1 (1 < k < n − 1) have been presented in [11] .
III. SOFTWARE IMPLEMENTATION

A. Data Representation and Computation of Inner-product
Let z be the full width of the data-path of the general-purpose processor on which the software will run, e.g., z = 32 for a Pentium processor. For the simplicity of introducing the data representation, we assume that n is a multiple of z in this paper. Since no irreducible binary trinomial exists for the case 8|n, we assume that the multiplication operation is performed in the ring GF (2)[u]/(f (u)), where f (u) is a trinomial and n = 2 i (i > 2), so that the following discussion is meaningful.
In software implementation, the column vector A = (a 0 , a 1 , · · · , a n−1 ) T is defined as a 1-
, which is stored in n z successive computer's memory words, each having z consecutive coefficients (i.e., bits) are packed into one word. In this paper, we suppose that bit a 0 is stored in the least significant bit (LSB) of V A [0] and bit a n−1 the most significant bit (MSB) of
Since the detailed representation of T relies on the computation procedure of the matrix-vector product T A, we now introduce the latter. The usual computation procedure regards T A as an array of n inner-products to be computed one at a time in top-to-bottom order. In software implementation, the inner-products c i (0 ≤ i ≤ n − 2) is calculated using two steps:
Step 1: Compute GF (2) products t i,j a j by processor's AN D instruction, where 0 ≤ j ≤ n−1.
Step 2: Compute the GF (2) summation of the above n products using bit-wise XOR operations.
Due to the property of the Toeplitz matrix T , we know that it is sufficient to represent T by its first and last columns. Please note that elements in the first row are the same as those in the last column. Therefore, we may use a 1-dimensional array V T [0..( Combining the results of (4) and (5), we have the following explicit formula of the i-th bit
This formula shows that the i-th bit of V T is the summation of b <2v−i> and either 0 or b <v−i> for the case 1 ≤ i ≤ 2n − 1. But we know that the i-th bit of V A is a i , i.e., bit sequences b <2v−i> and a i have different subscript orders: one is in ascending order and the other descending order.
Since our goal is to compute the inner-product in the above Step 1, we should first reverse the bit order of the input vector B and then use this formula to form V T . In our implementation, a small table is used to perform this bit-reverse operation. Its size is 256 bytes. The content of the i-th entry is the reserved eight bits of the binary representation of integer i (0 ≤ i ≤ 255).
Based on these representations, the following loop may be used to illustrate the computation of all inner-product c i s (0 ≤ i ≤ n − 1).
programming language like C, we may perform step S1 via the or (00000000) 2 if the binary representation of integer i contains an odd or even number of 1 bits (0 ≤ i ≤ 255), respectively. Please note that for the sake of software implementation on a typical processor, this table is allowed to have considerable amount of redundancy since only 256 bits are sufficient. Here, we also note that there is a parity bit in some processor's status register, e.g., Intel's x86 and MCS51 series. This bit is set if the least-significant byte of the result contains an even number of 1 bits, i.e., it is the GF (2) summation of 1s in the least-significant 8 bits. Therefore, we may not only speed-up the computation but also save this small table if the proposed algorithm is coded using the assembly language of these processors.
Finally, we indicate an optimization of the above inner-product computation. Using the method in Table I , it requires 32 shift operations of the 64-bit vector V T to compute the matrix-vector product of size n = 32. If the minimal memory access unit of most computer is one byte, we may compute c 0 , c 8 , c 16 and c 24 before performing the first shift operation, then c 1 , c 9 , c 17 and c 25 , and so on. In general, this optimization can save 24 shift operations.
B. Proposed Algorithm and Theoretical Analysis
We now present the proposed algorithm in the C programming language. The recursive subprogram "void mvp(w32 * c, w32 * a, w32 * b, int s)" is called by the GF (2 n ) multiplication subprogram "void multiplication(w32 * c, w32 * a, w32 * b)", where w32 denotes the 32-bit unsigned long data type and " * c" the pointer to the product of a and b. Integer s denotes the number of the computer words that used to store the vector c. For the purpose of comparison, we also present the Karatsuba algorithm in Table IV .
In Tables III and IV, For practical implementation of the Karatsuba algorithm on a 32-bit computer, actual computations, i.e., step S0 in Table IV is performed when the size of the operand reaches 1 computer word (32 bits). In [4] , it is shown that the best way to perform the multiplication of polynomials of degree less than 32 is by applying the Karatsuba method twice, which yields 9 multiplications of 8-bit blocks, and then obtaining these 9 16-bit products via the table-lookup technique (the corresponding table uses 128 kilobytes). We also follow this approach in our implementation.
For the implementation of the matrix-vector product algorithm, actual computation step S0 in Table III is performed when the size of the operand reaches 4. The reason is that the minimal operation unit of the AN D operation is 1 computer word, and we also want to apply the matrixvector product formula (1) twice in step S0 so that both subprograms have the same number of recursive callings.
C. Timing Results
In order to compare the actual performance of these two algorithms, we have implemented them in ANSI C and tested on the following two computers: 
