We set new speed records for multiplying long polynomials over finite fields of characteristic two. Our multiplication algorithm is based on an additive FFT (Fast Fourier Transform) by Lin, Chung, and Huang in 2014 comparing to previously best results based on multiplicative FFTs. Both methods have similar complexity for arithmetic operations on underlying finite field; however, our implementation shows that the additive FFT has less overhead. For further optimization, we employ a tower field construction because the multipliers in the additive FFT naturally fall into small subfields, which leads to speed-ups using table-lookup instructions in modern CPUs. Benchmarks show that our method saves about 40% computing time when multiplying polynomials of 2 28 and 2 29 bits comparing to previous multiplicative FFT implementations.
Introduction
Multiplication for long binary polynomials in the ring F 2 [x], where F 2 is the finite field with two elements, is a fundamental problem in computer science. It is needed to factor polynomials in number theory [vzGG96] [vZGG02] and is central to the modern Block Wiedemann algorithm [Tho02] . Block Wiedemann is in turn integral to the Number Field Sieve [AFK + 07], the critical attack against the RSA cryptosystem. Another application of Block Wiedemann is the XL algorithm [CCNY12] , which is a critical attack against multivariate publickey cryptosystems [DGS06] .
To the best of our knowledge, all currently fast binary polynomial multiplication algorithms are based on a Fast Fourier Transform (FFT) algorithm. The FFT is an algorithm for efficiently evaluating polynomials at subgroups in the underlying finite field.
Previous works on multiplying binary polynomials
There have been good reviews of algorithms and implementations for multiplying binary polynomials in [BGTZ08] and [HvdHL16] .
In general, these previous methods of multiplication were based on "multiplicative" FFTs, evaluating polynomials at multiplicative subgroups formed by roots of unity. Since the sizes of multiplicative groups existing in F 2 k are restricted, multiplicative FFTs in the binary field are somewhat more difficult than that over the reals and the complex. These multiplicative FFTs, evaluating polynomials at n points, reach their best efficiency only when n is a size of a multiplicative subgroup.
In [BGTZ08] , Brent et al. implemented mainly the Schönhage [Sch77] algorithm for long polynomials with complexity O(n log n log log n) of field operations. Schönhage's algorithm is based on ternary FFTs over binary finite fields. In their implementation, the optimal number of evaluation points is 3 k .
Harvey, van der Hoeven, and Lecerf [HvdHL16] presented multiplication using an FFT of a mixed radix approach. They applied several discrete Fourier transforms(DFTs) for different input sizes, e.g., Cooley-Tukey [CT65] for the largest scale DFT. In particular, they need to find a suitable finite field which is of a size close to a machine word and simultaneously allows both abundant multiplicative subgroups. They proposed F 2 60 which elegantly satisfies these conditions.
Recent Progress: Additive FFTs
Following Cantor [Can89] , alternative methods are developed to evaluate polynomials at points that form an additive subgroup in a field of characteristic 2. These methods are called "additive FFTs".
Cantor presented a basis, termed "Cantor basis" in the literature, for constructing a finite field as well as an FFT over the field. His FFT works with the complexity of O(n log n) multiplications and O(n log log 2 3 n) additions( XOR) for evaluating n = 2 m elements.
In 2010, Gao and Mateer [GM10] presented an additive FFT (heretofore "GM FFT") over F 2 k , where the evaluation points are an additive subgroup of size 2 k in the underlying GF. The additive subgroups are easier to form than multiplicative subgroups in the fields of characteristic 2. However, the complexity in GM FFT is O(n log 2 n) XOR operations for evaluating a polynomial at n = 2 m points in general. It can be optimized to O(n log n log log n) only when m is a power of 2, and the polynomials in this case are represented in a special polynomial basis introduced by Cantor [Can89] . In 2014, Bernstein and Chou [BC14] presented an efficient implementation of the GM FFT. In 2014, Lin, Chung, and Han [LCH14] proposed a more general variant ("LCH FFT"), which uses a different polynomial basis than the standard one. The polynomial basis results in a more regular structure in the butterfly stage. When representing the underlying finite field in a Cantor basis, the butterfly stage is the same as the optimized GM FFT and Bernstein-Chou. In the subsequent work [LANH16] , they presented a method for converting the polynomial from standard basis. The complexity for the conversion is O(n log n log log n) XOR operations for n being any power of two, which reaches the same complexity as multiplicative FFTs.
More details of the LCH FFT are reviewed in Sec. 2.3.
Our Contributions
In this paper, we present a faster method of multiplication for long binary polynomials based on the recently developed LCH FFT comparing to previously multiplicative-FFT-based algorithms. From the faster results of our additive-FFT-based implementations, we confirm that the recent development of additive FFTs helps the multiplication for binary polynomials.
Our implementation is faster than previous multiplicative FFT codes for two reasons. First, the FFT in our algorithm uses simple binary butterflies stages leading to 1 2 n log n multiplications for n evaluation points, compared to (for example) the ternary FFT used in Schönhage's algorithm which leads to 4 3 n log 3 n multiplications. This factor confers a 10%-20% advantage over multiplicative-FFT-based implementations and will be discussed in Sec. 3. 1 . Second, we exploit the fact that the multipliers in the additive FFT are in subfields, which reduces the average time taken per multiplication. We will discuss the method in Sec. 3.2 and 3.3. The overall improvement is 10%-40% over previous implementations.
Preliminaries

Multiplying with Segmentation of Binary Polynomials
In this section, we discuss the general method of multiplications for long binary polynomials.
Suppose we are multiplying two polynomials a(x) = a 0 +a 1 x+· · ·+a d−1
The polynomials are represented in bit sequence with length d. The standard Kronecker segmentation for multiplying binary polynomials is performed as follows:
1. Partition the polynomials to w-bits blocks. There are n = d/w blocks.
a (y) := a 0 + a 1 y + · · · + a n−1 y n−
such that a 0 = (a 0 + a 1 z + . . . + a w−1 z w−1 ), a 1 = (a w + . . . + a 2w−1 z w−1 ), . . . , a n−1 = (a (n−1)w + a (n−1)w+1 z + . . . + a nw−1 z w−1 ) and same for b j .
3. Calculate c (y) = a (y) · b (y) = c 0 + c 1 y + · · · ∈ F 2 2w [y] (using FFTs).
4.
Map z → x and y → x w . Then, collect terms and coefficients to find the result of multiplication of binary polynomials. We need to add together at most 2 coefficients at any power.
FFT-based Polynomial multiplication. It is well known that polynomial multiplication can be done using FFT [CLRS09] . To multiply two degree-(n−1) polynomials a (y) and b (y) ∈ F 2 2w [y] with FFT algorithms, the standard steps are as follows:
1. (fft) Evaluate a (y) and b (y) at 2n points by an FFT algorithm.
(pointmul)
Multiply the evaluated values pairwise together.
(ifft)
Interpolate back into a polynomial of degree ≤ 2n−1 by the inverse FFT algorithm.
The complexity of polynomial multiplication is the same as the FFT in use.
Alternative Representations of Finite Fields
The field of two elements, denoted as F 2 , is the set {0, 1}. The multiplication of F 2 is logic AND and addition is logic XOR. In this paper, every field will be an algebraic extension of F 2 . We will apply interchangeable representations for each used finite field (or Galois field, GF). The illustrative example is the field of 2 128 elements, denoted by F 7 in [Can89] and [BGTZ08] . We will switch representations during computation to achieve a better efficiency for field multiplication. All fields of the same size are isomorphic and the cost of changing representation is a linear transformation.
The Irreducible Polynomial Construction of F 2 128
We choose the basic working field to be the same as in AES-GCM, denoted as F 2 128 :
An element in F 2 128 is represented as a binary polynomial of degree < 128. The F 2 128 can also be a linear space of dimension 128 with the basis (x i ) 127 i=0 .
The cost of multiplication for F 2 128 . There are hardware instructions for multiplying small binary polynomials which fits for the multiplication of F 2 128 in many platforms. PCLMULQDQ is a widely used instruction for multiplying 64-bit binary polynomials in x86. Since one PCLMULQDQ performs 64 × 64 → 128 bits, the multiplication of F 2 128 costs roughly 5 PCLMULQDQ (3 for multiplying 128-bit polynomials with Karatsuba's method and 2 for reducing the 256-bit result back to 128 bits with linear folding). More details about multiplications of F 2 128 can be found in [GK14].
Cantor Basis for Finite Field as Linear Space
Gao and Mateer presented an explicit construction of Cantor Basis for finite field in [GM10] . The Cantor basis (β i ) satisfies β 0 = 1, β 2 i + β i = β i−1 for i > 0. 
Definition 2.2. Given a basis (β i ) m−1 i=0 in the base field, its sequence of subspaces is V i := span{β 0 , β 1 , . . . , β i−1 }. Its subspace vanishing polynomials (s i ) are,
Note that V i is a field with linear basis (β j ) i−1 j=0 only when i is power of 2.
From [Can89] and [GM10] , vanishing polynomials s i (x) w.r.t. the Cantor basis (β i ) has the following useful properties:
• (linearity) s i (x) contains only monomials in the form x 2 m .
If k = 2 i0 + 2 i1 + · · · + 2 ij , where i 0 < i 1 < · · · < i j , then we can write s k (x) = s 2 i 0 (s 2 i 1 (· · · (s 2 i j (x)) · · · )). Therefore, every s i is a composition of functions which only has two terms (see Table 1 ).
Evaluating s i (x) in Cantor basis Computing s i (x) in the Cantor basis is very fast since the representation of s i (α) would exactly be that of α shifted right by i bits, or s i (φ β (j)) = φ β (j i). For example, we have s i (β i ) = β 0 = 1.
The Lin-Chung-Han (LCH) FFT
In this section, we introduce how to evaluate a degree (n − 1) polynomial at n points in a binary field with the LCH FFT. We assume that n is a power of 2.
The polynomial is padded with 0 for high-degree coefficients if the actual degree of polynomial is not n − 1.
The LCH FFT requires that the evaluated polynomial is converted into a particular kind of basis, called novelpoly basis.
The novelpoly basis
The polynomial basis used in LCH FFT was presented in [LCH14] . The novelpoly basis for polynomials must be distinguished from bases for field. Although the LCH FFT is independent of underlying basis of finite field, we assume that the field is in Cantor basis for simplicity.
Definition 2.3. Given the Cantor basis (β i ) for the base field and its vanishing polynomials (s i ), define the novelpoly basis w.r.t. (β i ) to be the polynomials (X k )
I.e., X k (x) is the product of all s i (x) where the i-th bit of k is set.
Since deg(s i (x)) = 2 i , clearly deg(X k (x)) = k.
To perform LCH FFT, the evaluated polynomial f (x) has to be converted into the form f (x) = g(X) = g 0 + g 1 X 1 (x) + . . . + g n−1 X n−1 (x).
LCH's Butterfly
The evaluation of a polynomial in the novelpoly basis can be done through a "Butterfly" process, denoted as FFT LCH . 1 The general idea of evaluating
Since s k (x) is linear, evaluations at V k−1 + β k−1 can be quickly calculated with the evaluations at V k−1 and the butterfly process. It is a divide-and-conquer process that the polynomial f (x) = g(X) can be expressed as two half-sized polynomials h 0 (X) and
FFT LCH is detailed in Algorithm 1. The FFT LCH evaluates the converted polynomial g(X) at points V log n + α. Line 5 and 6 perform the butterfly process (see Fig. 1 and Fig. 2 ). Inverse FFT LCH simply performs the butterflies in reverse.
Algorithm 1: LCH FFT in novelpoly. 1 FFT LCH (f (x) = g(X), α) :
input : a polynomial: g(X)
We note there are two multipliers s k (α) and s k (β k ) in the FFT LCH and s k (β k ) = 1 in Cantor basis, avoiding one multiplication. This constant reduction of multiplications won't affect the asymptotic complexity; however, one can not bear the extra multiplication in practice. Although FFT LCH is applicable to any basis of field, the choice for a practitioner might be Cantor basis only.
Conversion to novelpoly Basis w.r.t. Cantor Basis
The evaluated polynomial has to be in novelpoly basis for performing FFT LCH . We review the conversion algorithms in this section. The fast conversion relies 
. Recursively divide f 0 (x) and f 1 (x) by lower s i (x) and eventually express f (x) as a sum of non-repetitive products of the s i (x), which is the desired form for g(X). We know the division comprise only XOR operations since the coefficients of s i (x) are always 1 in Cantor basis. Therefore the complexity of division by s i (x) depends on the number of terms in s i (x). This is functionally equivalent to the Cantor transform.
[LANH16] does better by 1. finding the largest 2 2 i such that 2 2 j < deg f and then do variable substitution (Alg. 2) to express f as a power series of s 2 i .
2.
Recursively express the series in s 2 i as a series in X j (s 2 i ), where j < 2 2 i .
Recursively express each coefficient of
Algorithm 2: Variable Substitution
The detail of basis conversion is given in Algorithm 3 and an example is given in Appendix B. Note that the algorithms rely on the simple form of (s i ) instead of field representations of coefficients.
Algorithm 3: Basis conversion: monomial to novelpoly w.r.t Cantor.
Binary Polynomial Products with Additive FFT
We can have a fast multiplication for binary polynomials simply by applying LCH FFT with the Cantor basis as the underlying FFT in the general method of Sec. 2. 1 .
Besides the straightforward method, we also present a faster algorithm by accelerating the field multiplication in the FFT. The acceleration relies on a special tower field representation, making all multipliers in butterflies short.
A Simple Method of Multiplying Binary Polynomials
A simple version of our multiplication for binary polynomials is to keep the working field in the representation of polynomial basis and apply LCH FFT with evaluation points in Cantor basis to the general multiplication in Sec. 2. 1 . The details of the straightforward algorithm is presented in Alg. 4.
For more details of FFT LCH , we choose w = 64 and use F 2 128 as our base field. The evaluated points are {φ β (0), . . . , φ β (2n − 1)}, which can be seen on line 6 in Alg. 4. The multiplier s k (α) is calculated in Cantor basis and then switched to its representation in F 2 128 with a linear map (field isomorphism). Although the multipliers in the Cantor basis are short numbers, they are random-looking 128-bit polynomials in F 2 128 . The field multiplication in F 2 128 is performed with 5 PCLMULQDQ (cf. Sec. 2.2.1). The other multiplier s k (β k ) = 1 in the Cantor basis.
Advantages of the Additive FFT We can expect that the simple structure of the LCH FFT leads to a lower complexity. One way is to count the butterflies. LCH FFT has a binary structure, which means at each of log n layers there are n/2 butterflies for a total of 1 2 n log n multiplications. Considering a ternary FFT instead, there will be log 3 n layers of analogous structure to butterflies, in number n/3 each. At each of these structures, one has to make four multiplications for a total of 4 3 n log 3 n multiplications. All else being equal, the multiplicative complexity of binary structure of the additive FFT is about 1.68 times lower than that of the ternary FFT, which is used in the Schöhage-like algorithm in [BGTZ08] . Similarly we hold an advantage over the even more complex FFT method in [HvdHL16] .
Results Please refer to Tab. 2 in Sec. 5. Simply using an additive FFT confers a 10%-20% advantage over state-of-the-art libraries in [HvdHL16, BGTZ08].
The Tower Construction for Binary Finite Fields
We consider this sequence of extension fields.
Thus, decimal subscripts or a tilde denotes the field is in tower representation.
We can now define a basis for F 2 256 as a vector space over F 2 .
Definition 3. 1 .
Henceforth (v k ) will be our default basis unless otherwise specified.
Numbers in hex such as 0x1f also denote the representatives under the basis (v i ).
Hence the sequence (v k ) can also be written as
Under this notation, we can order two elements in the tower field and thus define "big" or "small" by comparing their representative numbers. We can see that the representation of each field is embedded in the lower bit(s) of the field that is twice as wide. For example, the elements {0x00, 0x01, . . . , 0x0f} in F 256 form the subfield F 16 . Therefore, a "small" number in the tower field is often in a subfield.
Compatibility between Tower and Cantor Bases
We discuss the "compatibility" between tower and Cantor bases in this section. It is clear the basis (v 0 , v 1 , . . .) is not a Cantor basis since v 2 3 +v 3 = v 2 . However, Claim 1. The sequence of subspaces (V k ) and subspace vanishing polynomials (s i ) are the same w.r.t. the Cantor basis (β j ) and the tower basis (v j ).
We first assume that V k := span(v 0 , . . . , v k−1 ) and s k (x) = c∈V k (x − c) and we will show that these are the same as those from a Cantor basis. Note again that V k is a field only for k = 2 m a power of two, and only in this case we
Proof: See Appendix A.
Note that s i (x) w.r.t. any basis of field is linear [LCH14, Theorem 1]. Thus, the vanishing polynomials of tower and Cantor bases are the same. Since the s i (x) determines the V i by unique factorization theorem, we have proved Claim 1.
Subfield Multiplication in Tower Fields
We show multiplying by a subfield element is not only cheaper than general multiplication but also calculated as a vector-scalar product in this section. The cost depends on the size of the subfield. For example, to multiply a ∈ F 2 2w by b ∈ F 2 w , the a is represented as a polynomial a := a 0 + a 1 x ∈ F 2 w [x] with a 0 , a 1 ∈ F 2 w . Hence the product of a · b ∈ F 2 2w is calculated as two multiplications in F 2 w , i.e., (a 0 + a 1 x) · b. It is easy to generalize to the following proposition.
Proposition 2. Given a ∈ F 2 l 1 = V l1 , b ∈ F 2 l 2 = V l2 , and l 2 |l 1 , a · b ∈ V l1 can be performed with l 1 /l 2 field multiplications in F 2 l 2 .
LCH FFT over Tower Fields
Besides the compatibility with Cantor allowing efficient FFT LCH and basis conversion in tower fields, we show the multipliers in FFT LCH fall to smaller numbers which can be optimized by subfield multiplication with Prop. 2 in this section.
Recall that the two multipliers are s k (α) and s k (v k ), which corresponds to s k (β k ) in Alg. 1 .
First we have s k (v k ) = 1 by Prop. 1 . We can thus avoid the multiplication of s k (v k ) as in Cantor basis. One butterfly unit thus contains one multiplication and two additions. Figure 1 shows the details of the butterfly unit. It is also an example of evaluating a degree-1 polynomial f (x) = f 0 +f 1 x at points {c, c+1}, which α = c with the notation in Alg. 1 . Since X 1 (x) = x, the degree-1 polynomial after basis conversion is identical to the original polynomial, i.e., g(X)
Now we discuss the multiplier s k (α).
Proposition 3. For any
Contrast this with the Cantor basis, where β 2
, s 1 shortens the tower representation of x by exactly one bit. Since s k is just applying s 1 consecutively k times, the multiplier s k (α) in LCH butterflies is (a) k bits shorter than α in the tower representation, and (b) independent of the least significant k bits of α. Fig. 2 depicts the evaluation of a degree-7 polynomial at 16 points {0, 1, . . . , 0xf} with LCH FFT, using 4 (= log 16) layers of butterflies. We can observe that the multipliers in butterflies are smaller than the actual evaluation points. Figure 2 : The forward butterfly units for evaluating a degree-7 polynomial f (x) = g(X) = g 0 + · · · + g 7 X 7 at 16 points {0, 1, . . . , 0xf}.
Since coefficients above degree-7 are all 0, the first layer is simply a "fanout" (in software, copying a block of memory). The multipliers in second layer are calculated by evaluating the degree-4 s 2 (x) at two α ∈ {0, 0x8 ∈ F 16 }, resulting in the small multipliers {0, 0x2 ∈ F 4 }. The third layer evaluates the degree-2 s 1 (x) at 4 points α ∈ {0, 0x4, 0x8, 0xc} and results in the multipliers {0, 0x2, 0x5, 0x7}. In the last layer, the multipliers are the subfield elements {0, 0x2, 0x4, 0x6, . . . 0xe} themselves because s 0 (α) = α.
Faster Multiplication using FFTs in
The main idea of our algorithm is to trade an expensive field isomorphism for the faster multiplication by small subfield elements.
We show the algorithm for multiplying binary polynomials in Alg. 5. Assume we use a w = 64-bit word first. The input polynomials are in 64-bit blocks after partitioning(Split()). We perform the basis conversion before changing representations so as to have more densely packed data during the conversion. Then we convert the data to the tower representation. The data are kept in the tower representation for performing FFT LCH . We efficiently compute the subfield multiplication with the technique in Sec. 4. 1 . The pointmul are also performed in F 2 128 ; the somewhat different operations are detailed in Sec. 4.1.3. The inverse butterfly and basis conversion stages are then computed (still in the tower representation) before converting back to the polynomial basis of F 2 128 . Then we split the 128-bit results into 64-bit blocks and collate coefficients for the final result (InterleavedCombine()).
An initial block width of w = 128 bits is also possible and in this case the operative field is F 2 256 .
Algorithm 5: Multiplications for binary polynomials.
Implementation
F 2 128 Multiplication by subfield elements
The operation for multiplying an element of tower fields by an subfield element is implemented as a scalar multiplication of a vector over various subfields by a scalar with Prop. 2. We show how to calculate the product efficiently in current mainstream computers in this section.
Scalar Multiplication with Vector Instruction Set
The typical single-instruction-multiple-data (SIMD) instruction set nowadays is Advanced Vector Extensions 2 (AVX2), providing 256-bit ymm registers on x86 platforms. We use the table-lookup instruction VPSHUFB in AVX2 for multiplying elements in F 2 128 by subfield elements which are F 16 to F 2 32 . We demonstrate the scalar multiplication over subfields with PSHUFB instruction which is the precursor to VPSHUFB and uses 128-bit xmm registers.
PSHUFB takes two 16-byte sources which one is a lookup table of 16 bytes x = (x 0 , x 1 , . . . , x 15 ) and the other is 16 indices y = (y 0 , y 1 , . . . , y 15 ). The 16-byte result of "PSHUFB x, y" at position i is x yi mod 16 if y i ≥ 0 and 0 if y i < 0. VPSHUFB simply performs two copies of PSHUFB in one instruction. The two instructions are suitable for scalar multiplication over small fields [CYC13] .
For scalar multiplication over F 16 , we first prepare 16 tables; each table stores the product of all elements and a specific element in F 16 . Suppose we have a ∈ F 32 16 and b ∈ F 16 , we can apply VPSHUFB to the prepared "multiply-by-b" table and a for the product of a · b. Since we use one 256-bit register to store 64 elements in F 16 , the data in a have to be split into nibbles (4-bit chunks) before applying VPSHUFB.
The scalar multiplication over F 256 is similar to F 16 except that the number of prepared tables becomes 256 and one multiplication costs 2 VPSHUFB.
For scalar multiplication over F 2 16 or F 2 32 , we implement the field multiplication as polynomial multiplication in F 256 [x] because 2 16 or 2 32 prepared tables is too much for caches. The Karatsuba's method is applied to reduce the total number of multiplications in F 256 while multiplying polynomials.
Transpose the Data Layout for Higher Parallelism
While multiplying F 2 128 elements by elements in F 2 16 or F 2 32 , the bytes in a F 2 128 element might multiply by different multipliers in F 256 . Since we use scalar multiplication over F 256 as our building blocks, multiplying by different multipliers reduces the efficiency.
Example: Product of F 2 128 and F 2 16 elements The natural data layout of F 2 128 consists of eight consecutive F 2 16 elements, each with its two bytes stored side by side. Suppose we are multiplying a = (a 0 , . . . , a 15 ) ∈ F 2 128 := F 16 256 by c = (c 0 , c 1 ) ∈ F 2 16 := F 2 256 . The a is naturally stored as ((a 0 , a 1 ), (a 2 , a 3 ), . . . , (a 14 , a 15 )) ∈ F 8 2 16 . To perform the field multiplication in F 2 16 := F 256 [x] with Karatsuba, we then must compute ((a 0 , 0), (a 2 , 0), . . . , (a 14 , 0)) · c 0 and ((0, a 1 ), . . . , (0, a 15 )) · c 1 . In this case, one has to mask off half the components in a and thus reduces the efficiency.
"Byte-Slicing" layout: To perform many scalar multiplications withs VPSHUFB efficiently, a possible solution is to store each of the 16 bytes in an F 2 128 element in a separate register. This rearrangement of data layout is exactly equivalent to a 16 × 16 transposition of a byte matrix. After each byte in the same position of F 2 128 elements is collected in the same register (cf. Fig. 3 ), SIMD instructions can cover entire registers.
Transposition as needed: While multiplying two elements by a subfield element instead of one multiplicand in previous example, e.g., [a, b ∈ F 2 128 ] multiply by c ∈ F 2 16 , we only have to split the even and odd bytes in a and b for performing scalar multiplication efficiently. Hence, we can just use a 2 × 2 transposition, which converts a = (a 0 , . . . , a 15 ) and b = (b 0 , . . . , b 15 ) to (a 0 , b 0 , a 2 , b 2 , . . . , b 14 ) and (a 1 , b 1 , a 3 , . . . , b 15 ), to split the even and odd bytes. The 2 × 2 transposition involves only 2 registers instead of 16 registers in a 16 × 16 transposition and thus is more efficient. For multiplying by an F 2 32 element, we need to 4 × 4 transpose our data. The following Fig. 3 depicts the process of data rearrangement on different demand.
(k 0 , . . . , k 15 ) (i 2 , j 2 , k 2 , l 2 , i 6 , . . . , l 14 ) (a 14 , b 14 , c 14 , . . . , l 14 ) ( l 0 , . . . , l 15 ) (i 3 , j 3 , k 3 , l 3 , i 7 , . . . , l 15 ) (a 15 , b 15 , c 15 , . . . , l 15 ) Figure 3 : "Byte-matrix transpose" for F 2 128 elements: The data layout in the middle is for multiplying by F 2 32 elements, and the rightmost layout is for pointmul.
To transpose a matrix, we use similar techniques in [War12], i.e., transposing 16 × 16 can be done after transposing 4 × 4. While performing the transposition, we first collect elements in one register with a byte shuffle instruction (VPSHUFB). The interchange of data between different registers is accomplished by swizzle instructions in AVX2 instruction set. We refer readers to [Int15] [Fog17] for more information on AVX instructions.
Field Multiplication in pointmul
In pointmul, we need a generic field multiplication instead of subfield multiplication in FFT LCH for data in tower representation. Assuming the data in F 2 128 and in the byte-slice layout, we need the pointwise field multiplication in pointmul, in contrast to scalar multiplication in FFT LCH . Since the byte-slice layout, the multiplication in F 2 128 can be easily performed as polynomial multiplication in F 2 64 [x] in the SIMD manner. In other words, we reduce one parallelized multiplication in F 2 128 to several parallelized multiplications in F 2 128 . The process is recursively applied until the pointwise multiplication in F 16 . We then apply the method in [CLP + 17] , which use the logarithm/exponential tables for SIMD field arithmetic, for the pointwise multiplication in F 16 .
Revisiting the Choice of Base Field
We discuss the choice of working field based on the cost of field multiplication. While multiplying d bits polynomials, we split the data into w-bit wide chunks and then apply FFT with l = 2w-bit base field. Here we want to decide on a suitable size of base field. In the typical x86 CPU, one may choose l to be 64,128, or 256.
Irreducible polynomial constructed field: In this construction, the field multiplication is implemented with PCLMULQDQ instruction. Using Karatsuba's method and linear folding for the reduction, one multiplication costs (1+2) = 3, (3 + 2) = 5, and (9 + 4) = 13 PCLMULQDQ in F 2 64 , F 2 128 and F 2 256 respectively. The number of butterflies in FFT LCH are d w log 2d w . With one field multiplication for each butterfly, we conclude that working on F 2 128 has lowest number of PCLMULQDQ for FFT LCH .
Tower field: For l-bit tower field, the costs of multiplying by subfield elements in FFT LCH or generic elements in pointmul is proportional to length of field l as describing in Prop. 2. Since number of multiplications in FFT LCH proportions to d l log d l , to enlarge l by two results in one layer less (∝ log d l ) in FFT LCH but the same cost for each layer of butterflies(∝ l · d l ). The cost of generic multiplication in pointmul is ∝ l 1.7 using recursive Karatsuba in Sec. 4. 1. 3. If we compare l = l 1 with l = 2l 1 , the smaller field is more efficient. However, the length of field should fit the underlying machine architecture for better efficiency. In the Intel Haswell architecture with AVX2 instruction set, it is the most efficient to use 256 bits as a unit because memory access is the best with 256-bit alignment. Therefore, we choose l = 256 bits for Intel Haswell in our implementation.
Field Isomorphism
The change of field representations is simply a matrix product for a pre-defined matrix I with the data α ∈ V k as a vector. We compute the product I · α with the famous method of four Russians(M4R) [AH74] .
With M4R of l-bit, one first prepars all possible products of I and l-bit vectors, i.e., prepares . . , v k−1 ). To compute I · α, one splits α to l-bit chunks, looks up the prepared tables for various segments of α, and combines the results. The number of operations proportions to bit-length of α.
The choice of l depends on the size of cache for efficiency. The size of I for 128-bit field is 128 × 128 bits, and L1 cache is 32 KiB for data in the Intel Haswell architecture. Therefore, we choose the M4R with 4-bit, which results in 32KiB = 16 × 128 × 128 bits prepared tables.
Calculation of Multipliers in the Butterflies
Suppose we want to calculate s i (α) for α ∈ V k in a tower field. s i (α) can be calculated recursively via s i (α) = s i−1 (s 1 (α)). Since s 1 is a linear map, we can prepare a table S1 := [s 1 (v 1 ), . . . , s 1 (v k )] for the images of s 1 on all basis elements (v 1 , . . . , v k ). Since α = b i ·v i with b i ∈ {0, 1}, we have s 1 (α) = S1·α which is a matrix production. For further optimization, we can omit the least i bits of α While evaluating s i (x) by Prop. 3 .
In our implementations, we actually precomputed 31 tables S1, S2, . . . , S31 for the evaluations of s 1 (x), s 2 (x), . . . , s 31 (x) at up to 2 32 points and avoid the need for recursion. Again, we use M4R of 8-bit to accelerate matrix-vector products. Querying a result of 32-bit α costs 4 table lookups and the storage for tables is 4 KiB(4 × 2 8 × 32 bits), which fits our L1 cache.
Results and Discussion
We benchmark our implementation 2 on the Intel Haswell architecture (same as [HvdHL16] ). Our hardware is Intel Xeon E3-1245 v3 @3. The first notable result is that our simple version over F 2 128 outperforms the previous implementations in [BGTZ08] and [HvdHL16] for polynomials over 2 17+6 bits. The result shows that additive FFT are better based FFT for multi-plying binary polynomials given the same multiplication in the base field using PCLMULQDQ.
We can also compare the effectiveness of field multiplication from second and third rows. The comparison between F 2 128 and F 2 128 shows that the subfield multiplication with VPSHUFB outperforms generic multiplication with PCLMULQDQ. This is a counter-intuitive result since PCLMULQDQ is dedicated to multiply binary polynomials by design. The effect can be shown qualitatively: we can do 32 multiplications in F 256 using 2 VPSHUFB's. Multiplying by F 2 32 elements (the largest multipliers in the butterflies) costs 9 F 256 multiplications. Multiplying an F 2 128 element by an F 2 32 element is 4 F 2 32 multiplications, so each F 2 32 to F 2 128 product takes 9/4 VPSHUFB's on average. Similarly, a F 2 16 to F 2 128 product takes 3/4 VPSHUFB's. The average is less than 2 VPSHUFB's compared to 5 PCLMULQDQ's when using F 2 128 . [VDHLL17] is currently the fastest implementation. We note the same technique can be applied to additive FFT as well as about 50% saving of cost. The method for additive FFT is described in Appendix C.
Discussion
The profiles of various components The ratio of relative cost in our fast subfield multiplication between basis conversion : butterfly process : pointmul : change of representations for multiplying 2 20+6 -bit polynomials are 1: 3.06: 0.08: 0, 1: 2.11: 0.27: 0.54, and 0.73: 1.74: 0.47: 0.57 for F 2 128 , F 2 128 and F 2 256 , respectively.
The results show that the change of tower representation increases the cost in pointmul besides the cost itself. However, the efficient subfield multiplication reduces the cost of butterfly process and the effects are greater than the cost increased. The version of F 2 256 can even reduce the cost of basis conversion because the better aligned memory access fits into machine architecture.
More results on newer Intel architecture and profiles can be found in Appendix D.
Truncated FFT with Non-Power-of-Two Terms
The experiments are actually the optimal cases for multiplying polynomials with power-of-two terms by additive-FFT-based multipliers.
Since the cost of additive-FFT over number of terms of polynomial is highly stairwise, one can truncate the FFT to make the curve somewhat smoother as in [BGTZ08] . One truncated version of additive FFT for n = 3 · 2 k was shown in [CLP + 17].
Further Discuss on Other Possible Implementations
In this section, we discuss two possible variants of implementations.
Tower field implementations with evaluation points in Cantor basis
If we choose evaluation points in Cantor basis, the evaluation of s i (α) seems faster than the M4R technique in previous section. However, we still need to change the representation of s i (α) from Cantor basis to tower field in this case. This operation results in the same cost as the calculation of s i (α) in tower field with M4R.
Performing subfield multiplication in Cantor basis For subfield multiplications in tower field, the efficiency comes from the powerful VPSHUFB instruction. Since we don't have Prop. 2 in Cantor basis, we can not use VPSHUFB in Cantor basis in the same way of tower fields. Another options for implementing field multiplication in Cantor basis is to use bit-sliced data and logical instructions [BC14] . There are at least 64 ymm registers for operating in 64-bit base field, resulting inefficiency from too much data in play.
Concluding Remarks
We have presented our efficient multipliers based on recently developed additive FFTs, which has similar but lower multiplicative complexity as the ternary variant of Schönhage's algorithm used in [BGTZ08] .
In [BGTZ08] , Brent et al. also implemented the Cantor [Can89] algorithm beside Schönhage. They concluded that "Schönhage's algorithm is consistently faster by a factor of about 2 (than Cantor)". In their implementation, multiplicative FFT outperformed additive FFT. Our experiments show that recently developed additive-FFT does help for multiplying binary polynomials of large degrees in practice. We derive a further advantage by exploiting the lower cost of multiplying by subfield elements in tower fields.
Future Work: Our implementation is written in C and may be improved with assembly for better register allocations. Further, AVX-512 instructions, featuring 512-bit SIMD instructions, will be more widely available soon. We probably cannot speed up the additive FFT multiplier to a factor of two, but surely AVX-512 can be expected to offer a substantial advance.
A Claims about s i (x) in Tower Fields and Proofs
We first discuss the special case that k = 2 l is power of two and there is a F q for q = k. By Galois's theory, we have s k (x) = b∈Fq (x − b) = x q + x. Proposition 4. If q is a power of two, and choose any a ∈ F q such that
Proof.
But zero here would be contradictory because all q solutions of x q = x are already in F q , but x l is in F q 2 \ F q , hence we must have x q l + x l = 1. Proof of Prop. 1 . If q = 2 j is a power of two, then V q = F 2 q and the result holds according to the Proposition above. So we assume that the proposition holds for all k < h and h = q + < 2q, where q = 2 j , and note that for
In the proof of Prop. 3, we need the following consequences from definition 3.1 and 2.3.
Proof of Prop. 3. If i = 2 k , then v i = x k+1 , and v 2 i + v i = x 1 x 2 · · · x k = v 2 k −1 by the defintion of the x i so the claim holds. Using mathematical induction, we let 2 k > i = 2 k−1 + j.
The first term is the product of two terms in F 2 2 k−1 and therefore is itself in F 2 2 k−1 = V 2 k−1 ⊆ V i . The second is the product of v 2 k−1 and an element of V j (by the induction hypothesis), which is by Corollary 3 also in V 2 k−1 +j = V i .
B Basis Conversion: An Example
Figure 4: From f (x) = f 0 + · · · + f 15 x 15 to g(X) = g 0 + · · · + g 15 X 15 by Algorithm 3. Figure 4 shows an example for converting a degree-15 polynomial to novelpoly basis. We have to divide by 3 different s i (x)'s, namely s 1 (x), s 2 (x), and s 3 (x), which has 4 terms. One can see there are actually 4 layers of division in Figure 4 and the number of XOR's are the same in all layers.
What the first 2 layers perform is variable substitution into terms of y = s 2 (x) = x 4 +x (See algorithm 2). The first layer is a division by s 2 (x) 2 = x 8 +x 2 and second layer is divisions by s 2 (x) = x 4 +x on the high degree and low degree polynomials from the first layer. The third layer is a division by s 3 (x) = y 2 +y by adding between coefficients of terms differing by a factor of s 2 (x), and 4 positions apart (see 3rd column of Table 1 ). The last layer divides by s 1 (x) = x 2 + x for 4 short polynomials (Last loop in Algo. 3). Note that we only do division by two terms polynomials in the conversion.
C Frobenius bijection between F 2 [x] <128·n and F n 2 128
In this section, we perform an FFT directly on F 2 [x] instead of combining many coefficients to a larger finite field, i.e., abandoning the Kronecker segmentation. We refer the reader to [VDHLL17] for preliminaries and focus on performing the idea with additive-FFT in this section.
C.1 The partition of evaluated points
Given φ 2 is the Frebenius operation in F 2 128 , which is exactly to square an element in an extension field of characteristic 2. Thus, we have φ 2 (β 0 ) = β 0 , and φ 2 (β i ) = β 2 i = β i + β i−1 for i > 0 in Cantor basis. Moreover, the order of φ 2 for β i>0 is 2 · 2 log 2 i .
Definition C. 1 . Given a(x) ∈ F 2 [x] <128·n and n = 2 m with m < 63 , define a linear map E βm+64+Vm : F 2 [x] <128·n → F n 2 128 is the evaluations of a(x) at points β m+64 + V m , i.e., E βm+64+Vm : a(x) → {a(β m+64 + i) : i ∈ V m } .
We can see the evaluation of a(x) at n = 2 m points which is less than the number of coefficients. However, the evaluation points are in F 2 128 . It can be checked the order of φ 2 (β m+64 ) is 128.
Proposition 5. E βm+64+Vm is a bijection between F 2 [x] <128·n and F n 2 128 .
C.2 Truncated Additive FFT
To evaluate a polynomial a(x) ∈ F 2 [x] <128·n at 2 m points in F 2 128 , we treat each coefficient of a(x) is an element in F 2 128 . Since there are 128 · n = 128 · 2 m coefficients and only 2 m evaluation points, the E βm+64+Vm is exactly a truncated additive FFT to 1/128 points of an FFT at points β m+64 + V m+7 . The basis conversion can be proceeded straightforwardly with alg. 3. With the notation of alg. 1, the evaluation of a(x) at β m+64 +V m can be accomplished with the scalar displacement α = β m+64 and reserve only first 1/128 results of the evaluation points V m+7 .
C.3 Encoding: the First Seven Layers of Truncated FFT
We can also translate the truncated FFT to an "encoding", corresponding to the starting 7 layers of butterflies, of input coefficients and followed by a normal FFT LCH over F 2 128 [x] <n at points V m with α = β m+64 . Since there is only one bit for each element in the first layer of the truncated FFT, each element of 8th layer is effected by 2 7 bits in the first layers The multipliers from the first to 7th layers are (s m+i (β m+64 )) 1 i=7 . Hence, we have the "encoding" for the i-th element is 127 j=0 a j·n+i ·( k-th bit of j is 1 s m+7−k (β m+64 )). Every element in 8th layer of FFT LCH is a result of linear map with input (a j·n+i ) 127 j=0 which is a 128 × n transpose of input data. Table 3 shows the raw data of various componets of our implementations. Table 4 shows the performance of our implemetations in Intel Skylake architecture.
D Profiles: Raw Data
