Abstract-We develop a new and simple way to describe Karatsuba-like algorithms for multiplication of polynomials over $\mathbb{F}_2$. We restrict the search for small circuits to a class of circuits we call symmetric bilinear. These are circuits in which AND gates only compute functions of the form $\sum_{i \in S} a_i \cdot \sum_{i \in S} b_i$ with $S \subseteq \{0, \ldots, n-1\}$. These techniques yield improved recurrences for $M(kn)$, the number of gates used in a circuit that multiplies two $kn$-term polynomials, for $k = 4, 5, 6,$ and $7$. We built and verified the circuits for $n$-term binary polynomial multiplication for values of $n$ of practical interest. Circuits for $n$ up to 100 are posted at
DEFINITIONS
Polynomials. The input to our problem will be polynomials of degree $n-1$. We refer to a polynomial of degree $n-1$ as an $n$-term polynomial (even though some of the terms may be zero). For integers $i < j$, we identify a polynomial $A = a_i x^i + a_{i+1} x^{i+1} + \cdots + a_j x^j$ with the tuple $(a_i, a_{i+1}, \ldots, a_j)$, and denote $A[i] = a_i$ and $A[i..j] = (a_i, \ldots, a_j)$.
Symmetric Circuits. For the purpose of this paper, we restrict ourselves to circuits with the following structure, which we call symmetric bilinear circuits. A symmetric bilinear circuit contains only binary XOR (addition) and binary AND (multiplication) gates. It consists of a top layer consisting only of XOR gates, a multiplication layer consisting only of AND gates, and a bottom layer consisting only of XOR gates.
For an integer $n > 1$, let $M(n)$ be the size of the smallest circuit over $(\wedge, \oplus, 1)$ computing the polynomial product of two $n$-term polynomials. We emphasize that there does not necessarily exist a symmetric bilinear circuit of size $M(n)$ for all $n$, although all the best sizes we know of can be achieved using circuits of this form.
We will use three metrics for the class of symmetric bilinear circuits. Let $C$ be a circuit in the class. The multiplicative cost of $C$, denoted $M^{\wedge}(C)$, is the number of AND gates. The upper additive cost of $C$, denoted $M^{\oplus}(C)$, is the number of XOR gates in the top layer of $C$. The bottom additive cost of $C$, denoted $M_{\oplus}(C)$, is the number of XOR gates in the bottom layer of $C$. With respect to multiplicative complexity, there is only one optimal symmetric bilinear circuit for $n = 2$. The top layer calculates
$$a_0 + a_1, \qquad b_0 + b_1.$$
The multiplication layer calculates
$$P_0 = a_0 b_0, \qquad P_1 = (a_0 + a_1)(b_0 + b_1), \qquad P_2 = a_1 b_1.$$
The multiplicative cost is three (which is optimal among all Boolean circuits). Both additive costs $M^{\oplus}$ and $M_{\oplus}$ are 2. For degree-2 polynomials, a computer search yields exactly six circuits of multiplicative complexity six in the prescribed class. Of these, two have bottom additive cost 6 (the other four circuits have bottom additive costs 7, 7, 8, 8).
Linear Operators and Representation of Circuits. Let $W$ be an $n \times m$ matrix over $\mathbb{F}_2$. The function $x \mapsto W \cdot x$ can be computed using only XOR gates. We denote by $s(W)$ the number of XOR gates necessary to compute this mapping. For a symmetric bilinear circuit $C$ that on input $a = (a_0, \ldots, a_{n-1})$ and $b = (b_0, \ldots, b_{n-1})$ computes the output $c = (c_0, \ldots, c_{2n-2})$, there exists a unique matrix $T$ such that the $i$th AND gate computes the $i$th coordinate of the Hadamard product $(T \cdot a) \circ (T \cdot b)$. We call this matrix $T$ the top matrix of $C$. Similarly, the bottom part of the circuit can be described as a matrix, which we call the main matrix of $C$, denoted $R$. More precisely, given the top matrix $T$, the main matrix $R$ satisfies $c = R \cdot [(T \cdot a) \circ (T \cdot b)]$.
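A symmetric bilinear circuit $C$ is completely described by the two matrices $(T, R)$, along with XOR circuits computing them. The two matrices corresponding to the circuit in Fig. 1 are
$$T = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$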
Note that the matrix T is used twice in the circuit C.
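As a concrete illustration, the following Python sketch evaluates a symmetric bilinear circuit from its pair $(T, R)$ and compares the result with schoolbook multiplication for $n = 2$. The matrices are the usual Karatsuba ones and are an assumption here (they are not copied from Fig. 1); the helper names `mat_vec`, `symmetric_bilinear`, and `schoolbook` are hypothetical.

```python
from itertools import product

# Assumed matrices for the 2-term Karatsuba circuit (illustrative, not taken from Fig. 1).
T = [[1, 0],
     [1, 1],
     [0, 1]]
R = [[1, 0, 0],
     [1, 1, 1],
     [0, 0, 1]]

def mat_vec(W, x):
    """Multiply the matrix W by the vector x over F_2 (XOR gates only)."""
    return [sum(w & xi for w, xi in zip(row, x)) % 2 for row in W]

def symmetric_bilinear(T, R, a, b):
    """Evaluate c = R * ((T a) o (T b)) over F_2."""
    ta, tb = mat_vec(T, a), mat_vec(T, b)   # top layer: T is applied to both inputs
    p = [u & v for u, v in zip(ta, tb)]     # AND layer: Hadamard product
    return mat_vec(R, p)                    # bottom layer

def schoolbook(a, b):
    """Reference product of two coefficient lists over F_2."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] ^= ai & bj
    return c

# Exhaustive check over all 16 input pairs.
all_match = all(
    symmetric_bilinear(T, R, list(a), list(b)) == schoolbook(list(a), list(b))
    for a in product([0, 1], repeat=2)
    for b in product([0, 1], repeat=2)
)
print(all_match)
```

The top matrix is applied once to each operand, which is why its XOR cost is counted twice in the upper additive cost.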
PREVIOUS WORK
Asymptotic Complexity. Much work has been put into giving asymptotically good algorithms for binary polynomial multiplication; see [5], [6], [7]. Currently, the asymptotically best algorithm is due to Harvey, van der Hoeven, and Lecerf, who showed that $M(n) \in O(n \log n \, 8^{\log^* n})$, where $\log^*(\cdot)$ denotes the iterated logarithm, an unbounded but extremely slowly growing function.
Concrete Complexity. For values of $n$ that are interesting for cryptographic purposes (say, $n \le 1000$), the asymptotic bounds do not say much about the concrete circuit complexity. For this we need to employ a combination of different recursive relations.
This problem has received much attention in recent years; see [1], [2], [4], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23]. Bernstein, in [1], used various (new and old) recursive constructions to build small circuits for $n$-term polynomial multiplication for $2 \le n \le 1000$. Notably he obtains results "better than anything that can be found in the hardware literature" [1, page 7], including results reported in [9], [17], [20], [24], [25]. In recent work [4], Cenk and Hasan show that smaller circuits exist for many values of $n$. Their work applies new recurrence relations in a manner similar to that of [1]. Recurrence relations sufficient to obtain the values stated in [4] are shown in Table 1.
Known Recursive Constructions
Many different recursive constructions for polynomial multiplication have been suggested. Most of these constructions are based on either interpolation or a generalization of Karatsuba. Interpolation methods are usually based on the approach of Toom in [26]. Concrete constructions for $k = 2, 3$ have been proposed by Bernstein [1] and by Cenk and Hasan [4].
That Karatsuba multiplication can be generalized has been observed many times. Notable references include Montgomery [11], Weimerskirch and Paar [14], Bernstein [1], and more recently Cenk et al. [4], [27]. Some recurrences have been patented, as in Montgomery's patent [28] and in Koç and Erdem's patent [29]. Table 1 contains known recurrences. The "alg" column in each of the tables assigns numbers to some of the associated algorithms. These numbers will be used later in this paper, when reporting how our circuits are obtained. This paper improves on the recurrences for $M(4n)$, $M(5n)$, $M(6n)$, and $M(7n)$ (Table 2).
FINDING NEW KARATSUBA-LIKE RECURRENCES
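There is a standard way to convert any symmetric bilinear circuit $C$ for multiplication of $k$-term polynomials into a recurrence. The recurrence yields an upper bound on $M(k \cdot n)$ in the following way: to multiply two $k \cdot n$-term polynomials $A, B$, divide $A, B$ into $k$ blocks. That is, write $A$ as
$$A = \sum_{i=0}^{k-1} A_i x^{i \cdot n},$$
with each $A_i$ an $n$-term polynomial, and similarly write $B$ as
$$B = \sum_{i=0}^{k-1} B_i x^{i \cdot n}.$$
The product $A \cdot B$ can be written as
$$A \cdot B = \sum_{t=0}^{2k-2} U_t x^{t \cdot n},$$
where
$$U_t = \sum_{i+j=t} A_i \cdot B_j.$$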
To compute the polynomials $U_0, \ldots, U_{2k-2}$ we use the circuit $C$, where each XOR gate is replaced with polynomial addition (bitwise XOR) and each AND gate is replaced with a circuit for $n$-term multiplication. Each of the top XOR gates in $C$ results in $n$ gates, and each of the bottom XOR gates results in $2n-1$ gates. Each AND gate is replaced by a circuit for $n$-term multiplication, using $M(n)$ gates. We immediately have that the cost of computing $U_0, \ldots, U_{2k-2}$ is $n \cdot M^{\oplus}(C) + (2n-1) \cdot M_{\oplus}(C) + M^{\wedge}(C) \cdot M(n)$. Finally, to obtain the actual bits of the result, we have to take care of the overlap between $U_i x^{i \cdot n}$ and $U_{i+1} x^{(i+1) \cdot n}$. One way to do this is by doing bitwise XOR with the high-order $n-1$ bits from $U_i$ and the low-order $n-1$ bits from $U_{i+1}$ for $i = 0, \ldots, 2k-3$. This uses $(2k-2) \cdot (n-1)$ gates. This gives the generic recurrence
$$M(kn) \le M^{\wedge}(C) \cdot M(n) + n \cdot M^{\oplus}(C) + (2n-1) \cdot M_{\oplus}(C) + (2k-2)(n-1). \tag{1}$$
An upper bound for the cost $M^{\oplus}(C)$ is twice the cost of multiplication by the top matrix $T$. An upper bound for the cost $M_{\oplus}(C)$ is the cost of multiplication by the main matrix $R$.
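The gate count of this generic construction can be tallied with a short Python sketch. It simply adds up the four contributions described above (top XORs on $n$-bit blocks, bottom XORs on $(2n-1)$-bit products, the substituted multipliers, and the overlap merging). The function name `generic_bound` and the use of a schoolbook multiplier as the base case are assumptions made for illustration.

```python
def generic_bound(k, n, and_gates, top_xors, bottom_xors, m_n):
    """Gate count of the block construction for two (k*n)-term polynomials.

    and_gates, top_xors, bottom_xors describe the k-term circuit C;
    m_n is the size of the n-term multiplier substituted for each AND gate.
    """
    top = n * top_xors                   # each top XOR acts on n-bit blocks
    bottom = (2 * n - 1) * bottom_xors   # each bottom XOR acts on (2n-1)-bit products
    overlap = (2 * k - 2) * (n - 1)      # merging neighbouring U_i and U_{i+1}
    return and_gates * m_n + top + bottom + overlap

# Karatsuba circuit for k = 2: 3 AND gates, 2 top XORs, 2 bottom XORs.
for n in (1, 2, 4, 8):
    schoolbook_n = n * n + (n - 1) ** 2  # assumed base multiplier: n^2 ANDs, (n-1)^2 XORs
    print(n, generic_bound(2, n, 3, 2, 2, schoolbook_n))
```

For the Karatsuba circuit the non-multiplication part of this count is $2n + 2(2n-1) + 2(n-1) = 8n - 4$ gates.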
Let $A = (A_0, \ldots, A_{k-1})$, $B = (B_0, \ldots, B_{k-1})$, and $U = (U_0, \ldots, U_{2k-2})$. Then
$$U = R \cdot [(T \cdot A) \circ (T \cdot B)].$$
Table 2 (last row): $M(7n) \le 22M(n) + 107n - 33$, Eq. (13).
As we noted earlier, the polynomials $U_i$ overlap. Let $t = M^{\wedge}(C)$ (i.e., the number of AND gates in $C$). Then $P = (P_0, \ldots, P_{t-1})$ satisfies $P = (T \cdot A) \circ (T \cdot B)$.
We will show that the bits of c can be obtained by multiplying P with an extended version of the matrix R.
Example: Karatsuba (k = 2)
To illustrate how to optimize the bottom layer, we describe a way to obtain the Karatsuba recurrence $M(2n) \le 3M(n) + 7n - 3$ in Table 1. This recurrence is not novel, but this particular way of deriving it naturally generalizes to arbitrary values of $k$. Let the input polynomials be
$$A = A_0 + A_1 x^{n}$$
and
$$B = B_0 + B_1 x^{n}$$
for $n$-term polynomials $A_0, A_1, B_0, B_1$. Let the result be
$$C = A \cdot B,$$
and let $U_t = \sum_{i+j=t} A_i \cdot B_j$, for $t = 0, 1, 2$. Multiplication by the matrices $T, R$ corresponding to Fig. 1 can be done with one and two additions, respectively. Thus Eq. (1) yields
$$M(2n) \le 3M(n) + 8n - 4.$$
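To improve on this, we consider the overlap more carefully. For a $(2n-1)$-term polynomial $D$ let $L(D)$, $M(D)$, and $H(D)$ be the unique polynomials with $n-1$, $1$, and $n-1$ terms, respectively, that satisfy
$$D = L(D) + M(D) \, x^{n-1} + H(D) \, x^{n},$$
and let $(P_0, P_1, P_2) = (T \cdot A) \circ (T \cdot B)$.
Our aim is to write the polynomial $C$ in terms of $P_0, P_1, P_2$. Using the fact that $U_0 = P_0$, $U_1 = P_0 + P_1 + P_2$, and $U_2 = P_2$, we have
$$C = P_0 + (P_0 + P_1 + P_2) \, x^{n} + P_2 \, x^{2n}.$$
Now we can write the components of the polynomial $C$ as linear functions of the low, middle, and high parts of the polynomials computed by the multiplication gates:
$$\begin{pmatrix} C[n-1] \\ C[2n-1] \\ C[3n-1] \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} M(P_0) \\ M(P_1) \\ M(P_2) \end{pmatrix}, \tag{2}$$
$$\begin{pmatrix} C[0..n-2] \\ C[n..2n-2] \\ C[2n..3n-2] \\ C[3n..4n-2] \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} L(P_0) \\ L(P_1) \\ L(P_2) \\ H(P_0) \\ H(P_1) \\ H(P_2) \end{pmatrix}. \tag{3}$$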
All matrices in Eqs. (2) and (3) are matrices of polynomials. Now it remains to find good circuits to compute the linear operators in Eqs. (2) and (3). Since these matrices are small, it is easy to see that the first can be computed using two additions, and the second can be computed using five additions. Each addition in the computation of the first linear mapping costs 1 XOR gate, and in the second linear mapping each addition costs $n-1$ XOR gates. The top part still uses $2n$ XOR gates. This yields the previously stated recurrence
$$M(2n) \le 3M(n) + 7n - 3. \tag{4}$$
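The count of two single-bit additions and five $(n-1)$-bit additions can be checked with a small Python sketch of one Karatsuba level. It splits each block product into its low, middle, and high parts, reassembles the result, and tallies XOR gates along the way. The function names are hypothetical, the block products are computed by a schoolbook routine standing in for the recursive multipliers, and the shared subexpression $L(P_2) + H(P_0)$ is one possible way to reach five additions.

```python
import random

def xor(u, v):
    """Bitwise XOR of two equal-length coefficient lists."""
    return [x ^ y for x, y in zip(u, v)]

def schoolbook(a, b):
    """Reference product of two coefficient lists over F_2."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] ^= ai & bj
    return c

def karatsuba_step(A, B):
    """One Karatsuba level on 2n-term inputs; returns (product, XOR gate count)."""
    n = len(A) // 2
    A0, A1, B0, B1 = A[:n], A[n:], B[:n], B[n:]
    # Top layer: two block additions of n bits each.
    P = [schoolbook(A0, B0), schoolbook(xor(A0, A1), xor(B0, B1)), schoolbook(A1, B1)]
    gates = 2 * n
    L = [p[:n - 1] for p in P]   # low parts, n-1 terms
    M = [p[n - 1] for p in P]    # middle coefficients, 1 term
    H = [p[n:] for p in P]       # high parts, n-1 terms
    # Middle bits: two single-bit additions.
    mid = [M[0], M[0] ^ M[1] ^ M[2], M[2]]
    gates += 2
    # Remaining blocks: five additions of n-1 bits, sharing L2 + H0.
    shared = xor(L[2], H[0])
    block1 = xor(xor(shared, L[0]), L[1])
    block2 = xor(xor(shared, H[1]), H[2])
    gates += 5 * (n - 1)
    C = L[0] + [mid[0]] + block1 + [mid[1]] + block2 + [mid[2]] + H[2]
    return C, gates

random.seed(0)
n = 5
A = [random.randint(0, 1) for _ in range(2 * n)]
B = [random.randint(0, 1) for _ in range(2 * n)]
C, gates = karatsuba_step(A, B)
print(C == schoolbook(A, B), gates == 7 * n - 3)
```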
Generalizing to k ≥ 3
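The construction of the previous section generalizes to any value of $k$. Given the two matrices $(T, R)$, which define a symmetric bilinear circuit $C$ for multiplication of $k$-term polynomials, the output bits $C[\cdot]$ of the product satisfy the following equations:
$$\begin{pmatrix} C[n-1] \\ C[2n-1] \\ \vdots \\ C[(2k-1)n-1] \end{pmatrix} = R \cdot \begin{pmatrix} M(P_0) \\ \vdots \\ M(P_{t-1}) \end{pmatrix}. \tag{5}$$
Let the extended matrix $E$ be defined as
$$E = \begin{pmatrix} R_0 & 0 \\ R_1 & R_0 \\ \vdots & \vdots \\ R_{2k-2} & R_{2k-3} \\ 0 & R_{2k-2} \end{pmatrix}, \tag{6}$$
where $R_i$ is the $i$th row of $R$. Let $(P_0, \ldots, P_{t-1}) = (T \cdot A) \circ (T \cdot B)$. Then the remaining components of $C[\cdot]$ can be written as
$$\begin{pmatrix} C[0..n-2] \\ C[n..2n-2] \\ \vdots \\ C[(2k-1)n..2kn-2] \end{pmatrix} = E \cdot \begin{pmatrix} L(P_0) \\ \vdots \\ L(P_{t-1}) \\ H(P_0) \\ \vdots \\ H(P_{t-1}) \end{pmatrix}. \tag{7}$$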
This yields the recurrence
$$M(kn) \le M^{\wedge}(C) \cdot M(n) + 2n \cdot s(T) + (n-1) \cdot s(E) + s(R). \tag{8}$$
Note that this allows for a succinct description of a multiplication circuit: each such circuit is described by XOR circuits for $T$, $R$, and $E$.
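The following Python sketch shows how a pair $(T, R)$ could drive a block multiplication for general $k$. It rests on an assumption about the extended matrix: row $i$ of $E$ is taken to be $(R_i \mid R_{i-1})$, with zero rows outside the range of $R$, which is one reading consistent with the $k = 2$ example. The 3-way matrices `T3` and `R3` are the standard Karatsuba-style ones and are not necessarily those used in the paper; all function names are hypothetical.

```python
import random

def schoolbook(a, b):
    """Reference product of two coefficient lists over F_2."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] ^= ai & bj
    return c

def combine(W, vecs, width):
    """Apply the F_2 matrix W to a list of bit-vectors of the given width."""
    out = []
    for row in W:
        acc = [0] * width
        for w, v in zip(row, vecs):
            if w:
                acc = [x ^ y for x, y in zip(acc, v)]
        out.append(acc)
    return out

def extended_matrix(R):
    """Assumed form of E: row i is (R_i | R_{i-1}), zero rows beyond the ends of R."""
    zero = [0] * len(R[0])
    rows = [zero] + R + [zero]
    return [rows[i + 1] + rows[i] for i in range(len(R) + 1)]

def block_multiply(A, B, k, T, R):
    """Multiply two (k*n)-term polynomials using the circuit (T, R) on n-term blocks."""
    n = len(A) // k
    blocks_a = [A[i * n:(i + 1) * n] for i in range(k)]
    blocks_b = [B[i * n:(i + 1) * n] for i in range(k)]
    TA, TB = combine(T, blocks_a, n), combine(T, blocks_b, n)
    P = [schoolbook(u, v) for u, v in zip(TA, TB)]   # stands in for the M(n) circuits
    low = [p[:n - 1] for p in P]
    mid = [[p[n - 1]] for p in P]
    high = [p[n:] for p in P]
    middle_bits = combine(R, mid, 1)                        # multiplication by R
    rest = combine(extended_matrix(R), low + high, n - 1)   # multiplication by E
    C = []
    for j in range(2 * k):
        C += rest[j]
        if j < 2 * k - 1:
            C += middle_bits[j]
    return C

T2 = [[1, 0], [1, 1], [0, 1]]
R2 = [[1, 0, 0], [1, 1, 1], [0, 0, 1]]
# Assumed 3-way matrices: products a0, a1, a2, a0+a1, a0+a2, a1+a2 (times the b analogues).
T3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
R3 = [[1, 0, 0, 0, 0, 0],
      [1, 1, 0, 1, 0, 0],
      [1, 1, 1, 0, 1, 0],
      [0, 1, 1, 0, 0, 1],
      [0, 0, 1, 0, 0, 0]]

random.seed(1)
for k, T, R, n in ((2, T2, R2, 4), (3, T3, R3, 3)):
    A = [random.randint(0, 1) for _ in range(k * n)]
    B = [random.randint(0, 1) for _ in range(k * n)]
    print(k, block_multiply(A, B, k, T, R) == schoolbook(A, B))
```

In this sketch the only XOR work outside the block products is the two applications of $T$, one of $R$ on single bits, and one of $E$ on $(n-1)$-bit vectors, matching the three additive terms of Eq. (8).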
Critical Paths. Let $d_T, d_R, d_E$ be the lengths of the critical paths of circuits computing the linear transformations $T, R, E$, respectively. Since the top layer is followed by the $n$-term multipliers, and the circuits for $R$ and $E$ then operate in parallel, the total delay $D_C$ of our construction satisfies the following recurrence:
$$D_C(kn) \le d_T + D_C(n) + \max(d_R, d_E).$$
We remark that obtaining a circuit for linear operators purely using common subexpressions results in so-called "cancellation-free" circuits (also called SUM-circuits). For some linear operators these circuits are highly suboptimal [30]; see also [31, Section 5.3]. Indeed, for the extended matrix $E$ in the 6-way split below, the only minimal-sized circuit we have found has cancellation, so it could not have been obtained using only common subexpressions.
Finding Recursion Circuits. The recurrence in Eq. (8) suggests the following strategy for finding recurrences upper bounding $M(k \cdot n)$ for a fixed $k$: first find circuits for $k$-term multiplication with the smallest possible number of AND gates. Among these, find one where Eq. (8) is as good as possible.
We remark that both of these tasks are computationally very challenging; computing the tensor rank is NP-hard [32]. The problem of finding the smallest XOR circuit for a given matrix is NP-hard and MAX-SNP-hard [33], meaning that if $\mathrm{NP} \ne \mathrm{P}$ even finding a circuit which is at most a particular constant factor larger than the optimum is intractable.
Note that the matrix $T$ fully determines the matrices $R$ and $E$. The requirement on $T$ is that the elements of the Hadamard product $(T \cdot a) \circ (T \cdot b)$ span the target bilinear forms in $c$. We use the following randomized heuristic: 1) randomly generate candidate matrices for $T$; 2) solve for matrices $R$, $E$; 3) find (linear) circuits for multiplication by $T$, $R$, and $E$. For the first step we used the heuristics of [33] (see footnote 3). We parametrized this search so as to favor matrices with rows of low Hamming weight (heuristically, this is expected to yield circuits with fewer XOR gates). For the second step we used the heuristics of [34] for small low-depth XOR circuits, although we prioritized size. This explains the large depth of our circuits for $R$ and $E$ with $k = 6$ (see Table 3).
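Steps 1 and 2 of this heuristic can be sketched in Python as follows. Each product $(T_i \cdot a)(T_i \cdot b)$ is flattened into a vector of $k^2$ coefficients, and each row of $R$ is found by solving a linear system over $\mathbb{F}_2$. This is a minimal sketch under stated assumptions: candidates are drawn uniformly at random (without the low-weight bias described above), step 3 is omitted, and the names `solve_f2` and `solve_main_matrix` are hypothetical.

```python
import random

def solve_f2(columns, target):
    """Find x over F_2 with sum_i x[i] * columns[i] == target, or None if impossible."""
    t = len(columns)
    # One equation per coordinate: bits 0..t-1 hold coefficients, bit t the right-hand side.
    eqs = [sum(columns[i][e] << i for i in range(t)) | (target[e] << t)
           for e in range(len(target))]
    pivots = []                                  # (pivot column, reduced equation)
    for col in range(t):
        idx = next((r for r, eq in enumerate(eqs) if (eq >> col) & 1), None)
        if idx is None:
            continue
        piv = eqs.pop(idx)
        eqs = [eq ^ piv if (eq >> col) & 1 else eq for eq in eqs]
        pivots = [(c, p ^ piv if (p >> col) & 1 else p) for c, p in pivots]
        pivots.append((col, piv))
    if any(eq >> t for eq in eqs):               # a leftover equation 0 = 1
        return None
    x = [0] * t
    for col, piv in pivots:
        x[col] = (piv >> t) & 1                  # free variables are left at 0
    return x

def solve_main_matrix(T, k):
    """Return R with U = R * ((T a) o (T b)), or None if T's products do not span the targets."""
    # Flattened bilinear form of product i: coefficient of a_r * b_s is T[i][r] & T[i][s].
    forms = [[row[r] & row[s] for r in range(k) for s in range(k)] for row in T]
    R = []
    for j in range(2 * k - 1):
        target = [1 if r + s == j else 0 for r in range(k) for s in range(k)]
        x = solve_f2(forms, target)
        if x is None:
            return None
        R.append(x)
    return R

# Toy random search for k = 2 with three products.
rng = random.Random(0)
found = None
for _ in range(10000):
    T = [[rng.randint(0, 1) for _ in range(2)] for _ in range(3)]
    R = solve_main_matrix(T, 2)
    if R is not None:
        found = (T, R)
        break
print(found)
```

For $k = 2$ the three targets are linearly independent, so only candidates whose rows are the three distinct nonzero vectors of $\mathbb{F}_2^2$ survive step 2.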
New Recurrence Relations
Using the approach described in the previous sections we obtain several new recurrence relations. These are shown in Table 2. We describe each recurrence by describing the two matrices $T, R$ associated with it (the matrix $E$ is derived from $R$). The straight-line programs computing each of the matrices are available as supplemental material to this article.
3-Way Split. It is easily seen that $s(T) = 3$, $s(R) \le 6$, and $s(E) \le 12$. This gives the recurrence
$$M(3n) \le 6M(n) + 18n - 6,$$
which is the same recurrence as reported in [27].
4-Way Split. Again, it is not hard to verify that $s(T) \le 5$, $s(R) \le 12$, $s(E) \le 24$, and that this uses 9 multiplications. We get the recurrence
$$M(4n) \le 9M(n) + 34n - 12. \tag{10}$$
We note that this is a little better than what one would get by applying Eq. (4) twice, which gives $M(4n) \le 9M(n) + 35n - 12$.
5-Way Split. For $k = 5$, a computer search gave matrices $T, R, E$ with $s(T) \le 8$, $s(R) \le 19$, and $s(E) \le 38$, using 13 multiplications. The matrices are given in Appendix A.1. This gives the recurrence
$$M(5n) \le 13M(n) + 54n - 19. \tag{11}$$
6-Way Split. For $k = 6$, a computer search found matrices $T, R, E$ with $s(T) \le 13$, $s(R) \le 30$, and $s(E) \le 59$, using 17 multiplications. The matrices are given in Appendix A.2. This gives the recurrence
$$M(6n) \le 17M(n) + 85n - 29. \tag{12}$$
7-Way Split. Similarly for $k = 7$, a computer search found matrices $T, R, E$ with $s(T) \le 16$, $s(R) \le 41$, and $s(E) \le 75$. This structure required 22 multiplications. This leads to the recurrence
$$M(7n) \le 22M(n) + 107n - 33. \tag{13}$$
The matrices are given in Appendix A.3.
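The arithmetic that turns the reported additive complexities into recurrences can be reproduced with a few lines of Python. The sketch below plugs the values of $s(T)$, $s(R)$, $s(E)$ and the multiplication counts quoted above for $k = 2, 4, 5, 6$ into Eq. (8), and also composes the Karatsuba recurrence with itself for comparison with the 4-way split. The function names `recurrence` and `compose` are hypothetical.

```python
def recurrence(and_gates, s_T, s_R, s_E):
    """Coefficients (m, a, b) of M(kn) <= m*M(n) + a*n + b obtained from Eq. (8)."""
    # 2n*s(T) + (n-1)*s(E) + s(R) = (2*s(T) + s(E))*n + (s(R) - s(E))
    return and_gates, 2 * s_T + s_E, s_R - s_E

def compose(outer, inner, k_inner):
    """Substitute `inner` (a k_inner-way split) into the M(.) term of `outer`."""
    m1, a1, b1 = outer
    m2, a2, b2 = inner
    return m1 * m2, m1 * a2 + a1 * k_inner, m1 * b2 + b1

# (AND gates, s(T), s(R), s(E)) as quoted in the text.
reported = {2: (3, 1, 2, 5), 4: (9, 5, 12, 24), 5: (13, 8, 19, 38), 6: (17, 13, 30, 59)}
table = {k: recurrence(*vals) for k, vals in reported.items()}
for k, (m, a, b) in table.items():
    print(f"M({k}n) <= {m}M(n) + {a}n - {-b}")

karatsuba_twice = compose(table[2], table[2], 2)
print("Karatsuba twice:", karatsuba_twice, "4-way split:", table[4])
```

The two-level Karatsuba bound and the 4-way bound share the same multiplication count and constant term; the difference lies only in the coefficient of $n$.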
3. The techniques of [2] can be used to speed up this search.
MULTIPLICATIVE COMPLEXITY OF POLYNOMIAL MULTIPLICATION
A natural question about the recurrences in the previous section is whether they can be improved. Do matrices giving better recurrence relations exist? In particular, is it possible to find matrices giving a smaller number of multiplications in the recursion? It turns out that using this technique, for $k = 2, 3, 4, 5$ there is not, but for $k \ge 6$ there could be. In this section we will briefly sketch why.
There is a known relationship between error correcting codes and quadratic Boolean circuits computing finite field or polynomial multiplication (see [35], [36]). Roughly speaking, any quadratic Boolean circuit computing $n$-term polynomial multiplication induces an error correcting code with certain parameters. Therefore the nonexistence of certain codes can be used to prove the nonexistence of certain circuits. More specifically, Kaminski [37] shows that if there exists a method for multiplying two $n$-term polynomials in $\mathbb{F}_2[X]$ which uses $l$ multiplications over $\mathbb{F}_2$, then there exists a linear code of length $l$, weight $2n - d$, and dimension $d$. This holds for all $n \le d < 2n$. Table 4 gives multiplicative complexity lower bounds derived using known bounds for the length of linear codes (see e.g., [38]). The column $(l, d, w)$ indicates parameters for a code with length $l$, dimension $d$ and weight $w$ that does not exist and therefore establishes the lower bound. We leave it as open problems to close the gaps for $k = 6, 7$, and to find recurrences with better low-order terms than what we provide in the previous section.
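To illustrate how a code-length bound turns into a lower bound on the number of multiplications, the Python sketch below swaps in the Griesmer bound for the code tables of [38]. This is a substitution made for illustration only: the Griesmer bound is easy to compute but generally weaker than tabulated bounds, so the values printed are not those of Table 4. The sketch also assumes that "weight" above refers to the minimum distance of the code; the function names are hypothetical.

```python
def griesmer_length(dim, dist):
    """Griesmer bound: least possible length of a binary linear code
    with the given dimension and minimum distance."""
    return sum(-(-dist // 2 ** i) for i in range(dim))   # sum of ceil(dist / 2^i)

def mult_lower_bound(n):
    """Lower bound on the multiplications needed for n-term polynomials over F_2,
    taking the best Griesmer bound over all dimensions n <= d < 2n."""
    return max(griesmer_length(d, 2 * n - d) for d in range(n, 2 * n))

for n in range(2, 8):
    print(n, mult_lower_bound(n))
```

For $n = 2$ and $n = 3$ this simple bound already matches the multiplication counts 3 and 6 used above; for larger $n$ it falls below them, which is where stronger code bounds are needed.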
RESULTS
Although Eqs. (10), (11), (12), and (13) improve on known Karatsuba recurrences, they only yield upper bounds on the circuit complexity of multiplication. Determining the actual size of the circuits that can now be obtained requires an experimental approach. Our approach was as follows: first find small circuits for $n = 2, 3, \ldots, k$. Then, to find a circuit for multiplication of $(k+1)$-term polynomials, apply each of the applicable recursive constructions, using the previously found circuits as base cases. Then look for obvious inefficiencies (unused gates, two distinct gates computing the same function, etc.). Finally, select the smallest circuit and continue. Table 5 shows the sizes and depths obtained, and compares them to the previous state of the art.
Table 5 notes: Alg denotes which recurrence is used, given as the number of the algorithm (see Tables 1 and 2); the * for $n = 11$ indicates that the circuit published in [39] is used; bold text indicates an improvement.
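The iterative procedure can be imitated in miniature with a dynamic programme over gate counts. The Python sketch below is only a rough stand-in: it knows a schoolbook baseline and the splits for $k = 2, 4, 5, 6$ evaluated through Eq. (8), it only splits sizes that are exact multiples of $k$, and it performs no post-processing of the circuits. Its output therefore consists of upper bounds under these assumptions and is not expected to reproduce Table 5. The function name `best_upper_bounds` is hypothetical.

```python
def best_upper_bounds(limit):
    """Return {size: (gate count, method)} for sizes 1..limit."""
    # (k, AND gates, s(T), s(R), s(E)) as quoted in the text for k = 2, 4, 5, 6.
    splits = [(2, 3, 1, 2, 5), (4, 9, 5, 12, 24), (5, 13, 8, 19, 38), (6, 17, 13, 30, 59)]
    best = {}
    for size in range(1, limit + 1):
        # Schoolbook baseline: size^2 AND gates and (size-1)^2 XOR gates.
        best[size] = (size * size + (size - 1) ** 2, "schoolbook")
        for k, ands, s_T, s_R, s_E in splits:
            if size % k:
                continue
            n = size // k
            cost = ands * best[n][0] + 2 * n * s_T + (n - 1) * s_E + s_R   # Eq. (8)
            if cost < best[size][0]:
                best[size] = (cost, f"{k}-way split")
    return best

bounds = best_upper_bounds(32)
for size in (2, 4, 8, 16, 32):
    print(size, bounds[size])
```

Even in this restricted setting the choice of split varies with the size, which is why all applicable constructions are tried at every step.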
CONCLUSIONS
We proposed a new method to describe, find, and analyze Karatsuba-like recurrences. The method leads to better recurrences than previously known for splitting into 4, 5, 6, and 7 blocks. We constructed circuits for binary polynomial multiplication which are smaller, and often of lower depth, than previously known. The circuits were verified by computing the algebraic normal form of the outputs. Circuits up to degree 99 are posted at http://cswww.cs.yale.edu/homes/peralta/CircuitStuff/BinPolMult.tar.gz.
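Verification through the algebraic normal form can be sketched in Python as follows. A Boolean function is stored as a set of monomials, XOR is the symmetric difference of such sets, and AND multiplies monomials pairwise with coefficients reduced modulo 2. The straight-line program format (tuples of output name, gate type, and two input names) is a hypothetical one chosen for this example and is not the format of the posted circuits.

```python
def var(name):
    """ANF of a single input variable: one monomial containing that variable."""
    return frozenset({frozenset({name})})

def anf_xor(f, g):
    """XOR of two ANFs: monomials appearing in exactly one of them."""
    return f ^ g

def anf_and(f, g):
    """AND of two ANFs: pairwise monomial products, using x*x = x and 1 + 1 = 0."""
    out = set()
    for m1 in f:
        for m2 in g:
            out ^= {m1 | m2}
    return frozenset(out)

def run(program, inputs):
    """Evaluate a straight-line program symbolically; returns the ANF of every wire."""
    env = {v: var(v) for v in inputs}
    for out, op, x, y in program:
        env[out] = anf_xor(env[x], env[y]) if op == "xor" else anf_and(env[x], env[y])
    return env

# 2-term Karatsuba circuit in the hypothetical straight-line format.
program = [
    ("t0", "xor", "a0", "a1"),
    ("t1", "xor", "b0", "b1"),
    ("p0", "and", "a0", "b0"),
    ("p1", "and", "t0", "t1"),
    ("p2", "and", "a1", "b1"),
    ("u", "xor", "p0", "p1"),
    ("c1", "xor", "u", "p2"),
]
env = run(program, ["a0", "a1", "b0", "b1"])
expected_c1 = anf_xor(anf_and(var("a0"), var("b1")), anf_and(var("a1"), var("b0")))
print(env["c1"] == expected_c1)
```

Because the algebraic normal form of a Boolean function is unique, equality of the computed and expected monomial sets certifies the output for all inputs at once.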
APPENDIX A MATRICES AND THEIR ADDITIVE COMPLEXITY
This section contains the matrices $T, R$ found for the 5- through 7-way splits. We also report upper bounds on the additive complexities of multiplying by these matrices (and by the derived matrix $E$). We note that we could derive circuits with lower depths than those reported in Table 5 simply by optimizing the circuits for $T, R, E$ with respect to depth rather than size. This could be of value in applications.
A.1 5-Way Split
The matrices used in the 5-way split are as follows: 
The extended matrix $E$ is derived from $R$. The additive complexities of our circuits for $T, R, E$ are 8, 19, and 38, respectively.
A.2 6-Way Split
The matrices used in the 6-way split are as follows:
The additive complexities of our circuits for $T, R, E$ are 13, 30, and 59, respectively.
A.3 7-Way Split
The matrices used in the 7-way split are as follows:
The additive complexities of our circuits for $T, R, E$ are 16, 41, and 75, respectively.
