Abstract. This work presents a new method to compute the GHASH function involved in the Galois/Counter Mode of operation for block ciphers. If $X = X_1 \ldots X_n$ is a bit string made of $n$ blocks of 128 bits each, then the GHASH function effectively computes $X_1H^n + X_2H^{n-1} + \cdots + X_nH$, where $H$ is an element of the binary field $\mathbb{F}_{2^{128}}$. This operation is usually computed by using $n$ successive multiply-add operations over $\mathbb{F}_{2^{128}}$. In this work, we propose a method to replace all but a fixed number of those multiplications by additions on the field. This is achieved by using the characteristic polynomial of $H$. We present both how to use this polynomial to speed up the GHASH function and how to efficiently compute it for each session that uses a new $H$.
Introduction
The Galois/Counter mode (GCM) is one of the modes of operation recommended by the National Institute of Standards and Technology (NIST) [10]. As an authenticated encryption mode, it generates both a cipher text and an authentication tag for each session. The cipher text is obtained by using the counter mode of encryption (CTR) with a 128-bit block cipher algorithm (usually AES). The authentication part is performed by using a universal hash function (GHASH) based on multiplication in the binary field $\mathbb{F}_{2^{128}}$ [9]. More precisely, if $X = X_1 \ldots X_n$ is a bit stream divided into 128-bit blocks, the authentication process computes $X_1H^n + X_2H^{n-1} + \cdots + X_nH$, where $H$ depends on the encryption key.
High speed GCM designs are usually based on both fast implementations of the block cipher [2, 4, 6, 16] and bit-parallel multipliers over $\mathbb{F}_{2^{128}}$ [1, 3, 8, 11]. One can refer to [7, 13, 14, 15] for various implementations of the AES-GCM mode of encryption.
In practice, the $\mathbb{F}_{2^{128}}$ multiplier operates faster than the block cipher. However, it is easy to speed up the block cipher computation by taking advantage of the natural parallelism of the CTR mode. In that case, the critical path of the $\mathbb{F}_{2^{128}}$ multiplier used by the GHASH function becomes the bottleneck. To overcome this problem, it is always possible to parallelize the GHASH computation process as suggested in [9]. Such solutions imply a proportional increase in the number of multipliers.
In this paper, we propose a new way to speed up the GHASH function for long messages, by reducing the bottleneck from the delay of the multiplier to that of one XOR and one AND gate. This is achieved by first computing the characteristic polynomial of $H$ and then performing the computation of the GHASH function modulo this polynomial. Effectively, we replace the finite field multiplication by a faster operation. In addition, our method can be parallelized to further reduce the GHASH computation time.
The remainder of this paper is organized as follows: Section 2 briefly describes the functioning of GHASH and the implementation of the binary field $\mathbb{F}_{2^{128}}$. In Section 3, we describe our approach to use characteristic polynomials to compute GHASH and a parallelization of this operation using multiple polynomial reduction units. In Section 4, we present how to efficiently compute a characteristic polynomial over $\mathbb{F}_{2^{128}}$. Finally, a few concluding remarks are made in Section 5.
GHASH authentication function
GCM uses two main operations: encryption and authentication. Encryption is performed using a block cipher encryption function whereas the authentication mainly requires the use of multiplication over $\mathbb{F}_{2^{128}}$. One can find a full description of the encryption process in [9]. Below we only describe the authentication part.
Authentication function
Let $X = X_1 X_2 \ldots X_n$ be a bit string divided into 128-bit blocks ($X_n$ might be padded with 0's) and $H$ the 128-bit hash sub-key. The GHASH function performs as follows:
One can verify that Algorithm 1 computes $X_1H^n + X_2H^{n-1} + \cdots + X_nH$. In practice, $X$ is obtained as the concatenation of the cipher text, the authentication data and the lengths of these two pieces of data, and $H$ is generated by applying the block cipher to the zero block. Note that Algorithm 1 requires $n$ multiplications and $n$ additions on the field $\mathbb{F}_{2^{128}}$. Assuming that we have only one multiplier and one adder, these operations would be performed in sequence and the total computation time for GHASH is approximately $n$ times the combined delay of a multiplication and an addition over the field. On binary fields, the delay of an addition is exactly that of a bitwise XOR operation. In Table 1, we summarize various bit-parallel multiplication methods and their space and time complexities in terms of gate counts and gate delays.
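The Horner-style loop described above can be sketched in Python as follows. This is only an illustrative sketch: the names `gf_mul` and `ghash_horner` are hypothetical, field elements are stored as Python integers with bit $i$ holding the coefficient of $x^i$ (an assumed "natural" bit order, not the reflected bit order of the GCM specification), and padding and length encoding are left out.

```python
M = 128
# x^128 + x^7 + x^2 + x + 1, with bit i holding the coefficient of x^i
# (assumed bit order, chosen for readability).
MODULUS = (1 << M) | 0x87


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add product of two field elements, reduced on the fly."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # field addition is a bitwise XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if a >> M:               # degree reached 128: reduce
            a ^= MODULUS
    return result


def ghash_horner(blocks, h: int) -> int:
    """Evaluate X_1 H^n + X_2 H^(n-1) + ... + X_n H, one multiply-add per block."""
    y = 0
    for x in blocks:
        y = gf_mul(y ^ x, h)
    return y
```

Each block costs one field multiplication and one XOR, which is the sequential dependency that the rest of the paper sets out to shorten.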
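Table 1. Bit-parallel multiplication methods with their space and time complexities. Columns: Multiplier, #AND, #XOR, Gate delay ($D_M$). First row: CRT [1], O(m ...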
Underlying computations
In the GCM specification, the binary field $\mathbb{F}_{2^{128}}$ is defined as
$$\mathbb{F}_{2^{128}} = \mathbb{F}_2[x]/(x^{128} + x^7 + x^2 + x + 1),$$
that is to say, the set of residues of polynomials with coefficients in $\mathbb{F}_2$ modulo $x^{128} + x^7 + x^2 + x + 1$. An element $A$ of this field can be easily represented as a vector of length 128 with coefficients in $\mathbb{F}_2$.
In that case, addition is very simple as it just consists of a bitwise XOR between the two summands. Multiplication can be performed by a succession of shifts and additions. Depending on the platform on which GCM is implemented, it is possible to improve this operation, as shown in [9] or [3] for instance.
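As a small illustration of these two operations, the sketch below stores a field element as a Python integer whose bit $i$ is the coefficient of $x^i$. The bit order and the helper names (`gf_add`, `gf_mul_x`, `gf_mul`) are assumptions made for readability; the GCM specification uses a reflected bit order, and practical implementations use the faster techniques cited above.

```python
M = 128
MODULUS = (1 << M) | 0x87   # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def gf_add(a: int, b: int) -> int:
    """Addition in the field: coefficient-wise addition modulo 2."""
    return a ^ b


def gf_mul_x(a: int) -> int:
    """Multiply by x: one shift, then one conditional reduction."""
    a <<= 1
    if a >> M:
        a ^= MODULUS
    return a


def gf_mul(a: int, b: int) -> int:
    """Multiplication as a succession of shifts and additions."""
    result = 0
    for i in range(M):
        if (b >> i) & 1:
            result = gf_add(result, a)   # add a * x^i when bit i of b is set
        a = gf_mul_x(a)
    return result
```

For instance, `gf_mul_x(1 << 127)` returns `0x87`, the representation of $x^7 + x^2 + x + 1$, because $x^{128}$ is replaced by the low-degree part of the modulus.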
One common feature among several of the previously reported methods for improving GCM is that all efforts are made to optimize the field multiplication only. In this work, we propose a way to improve the efficiency of the overall computation of $\mathrm{GHASH}_H(X) = X_1H^n + X_2H^{n-1} + \cdots + X_nH$.
Computing GHASH using a characteristic polynomial
In order to easily present our method, here we consider a bit string divided into $n$ blocks, $X_1 X_2 \ldots X_n$, and an element $H$ of $\mathbb{F}_{2^{128}}$. Our goal is to compute $\mathrm{GHASH}_H(X) = X_1H^n + X_2H^{n-1} + \cdots + X_nH$. As $H \in \mathbb{F}_{2^{128}}$, there exists a polynomial $\chi_H$ of degree 128 (at most), with coefficients in $\mathbb{F}_2$, such that $\chi_H(H) = 0$, called the characteristic polynomial of $H$ over $\mathbb{F}_2$. This section shows how to use such a polynomial to limit the number of field multiplications involved in the polynomial evaluation.
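Using $\chi_H$
The polynomial $\chi_H(T) = \prod_{i=0}^{127} (T + H^{2^i})$ is called the characteristic polynomial of $H$; it satisfies $\chi_H(H) = 0$ and has its coefficients, denoted $c_i$ in the sequel, in $\mathbb{F}_2$.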
It is important to note that all the $c_i$'s are in $\mathbb{F}_2$, which means that computing $(X_{m-i+1} + c_i X_i)$ is just an addition over $\mathbb{F}_{2^{128}}$ or a simple parameter assignment. Moreover, all these additions are independent and can all be performed in parallel. Figure 1 shows a diagram of a possible hardware implementation of this operation. The output of the polynomial reduction unit (PRU) is a polynomial of degree $m - 1$ in $H$. We can feed the output of the PRU to a multiply-and-add unit to apply the Horner scheme and eventually obtain the desired GHASH value, which is an element of $\mathbb{F}_{2^{128}}$ (see Figure 2).
A generalized description of the operation of the structure of Figure 2 is given in Algorithm 2, where a polynomial P of degree n ≥ m can be computed using at most m − 1 field multiplications, replacing each of the remaining n − m multiplications by a maximum of m parallel field additions.
Algorithm 2 computes $P$ in two steps:
- first it computes the remainder of $P$ modulo $\chi_H$ using a shift-and-add algorithm,
- then it effectively calculates $P$ using the Horner scheme.
Algorithm 2 GHASH
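The two steps can be sketched in Python as follows. This is a software illustration under stated assumptions rather than a transcription of the algorithm: the function names are hypothetical, field elements use a natural bit order (bit $i$ is the coefficient of $x^i$), the characteristic polynomial is obtained by directly expanding $\prod_i (T + H^{2^i})$ instead of the faster methods of Section 4, the list `reg` stands in for the PRU registers, and a trailing zero coefficient accounts for the absence of a constant term in the GHASH polynomial.

```python
M = 128
MODULUS = (1 << M) | 0x87   # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in the field."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result


def char_poly(h: int) -> list:
    """Coefficients (lowest degree first) of prod_{i<M} (T + h^(2^i))."""
    poly = [1]
    root = h
    for _ in range(M):
        shifted = [0] + poly                              # poly * T
        scaled = [gf_mul(c, root) for c in poly] + [0]    # poly * root
        poly = [s ^ t for s, t in zip(shifted, scaled)]
        root = gf_mul(root, root)                         # next conjugate
    return poly                                           # entries are 0 or 1


def ghash_charpoly(blocks, h: int) -> int:
    chi = char_poly(h)
    taps = [i for i in range(M) if chi[i]]   # non-zero coefficients below T^M
    reg = [0] * M                            # reg[i]: coefficient of T^i
    # Step 1: remainder of X_1 T^n + ... + X_n T modulo chi_H (XORs only).
    for x in list(blocks) + [0]:
        top = reg[-1]
        reg = [x] + reg[:-1]                 # multiply by T, bring in next block
        for i in taps:                       # T^M = sum of c_i T^i modulo chi_H
            reg[i] ^= top
    # Step 2: Horner evaluation of the remainder at H (M - 1 multiplications).
    y = reg[-1]
    for coeff in reversed(reg[:-1]):
        y = gf_mul(y, h) ^ coeff
    return y
```

In the first loop every register update is a block-wide XOR with no dependency on a multiplier, which is why these updates can run in parallel in hardware; field multiplications appear only in the final Horner pass.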
Complexity: For long messages, the value of $n$ is expected to be much larger than that of $m$ (e.g. if the size of $X$ is 1 MByte, then $n = 2^{16}$), and in Algorithm 2 the upper for loop will dominate in terms of computational cost. As noted previously, one important feature of Algorithm 2 is that all the additions involved in the upper for loop are independent. Moreover, the $c_i$'s are elements of $\mathbb{F}_2$, which means that the operation $Y_{i-1} + c_i C$ is just a bitwise XOR operation. In the end, each step of this loop requires $m$ bitwise XOR operations. Thus, our method trades $n - m$ multiplications by $H$ over $\mathbb{F}_{2^m}$ against $m(n - m)(W_H - 1)$ bit-level XOR operations, where $W_H$ is the number of non-zero coefficients of $\chi_H$.
In the case of a hardware implementation on a parallel architecture, a step of the loop would require $m^2$ XOR gates, all used in parallel, and the gate delay is $D_X + D_A$ (the delay of one XOR gate and one AND gate). In the end, the implementation of the GHASH function as shown in Figure 2 has a space complexity of $O(m^2)$ and the total time delay for the upper loop is $(n - m)(D_X + D_A)$. The final $m - 1$ field multiplications can be performed using any of the already available methods as listed in Table 1.
Comparison: Let $n$ be the length of the message to be treated (considered as a sequence of $m$-bit blocks). Implementations based on Algorithm 1 perform, at each step of the algorithm, one field multiplication and one block-wide bitwise XOR operation. Denoting by $D_M$ the delay of a field multiplication and by $D_X$ that of an XOR, the total delay is:
$$n(D_M + D_X).$$
On the other hand, Algorithm 2 first requires the computation of the characteristic polynomial. We denote the cost of this computation as $D_\chi$. It is important to remark that this cost is constant, i.e. independent of $n$. Then, Algorithm 2 performs $m^2$ parallel bitwise XOR operations in each pass of the upper loop. Finally, the algorithm ends by computing $m - 1$ field multiplications and $m$ bitwise XOR operations on $m$-bit blocks. Assuming that the $m$ $Y$-registers of the PRU can be initialized in parallel, the total delay can then be approximated as:
$$D_\chi + (n - m)(D_X + D_A) + (m - 1)D_M + mD_X.$$
Asymptotically (i.e. for large n), the new method reduces the critical path delay of the GHASH function to that of one XOR plus one AND operation (i.e. O(1)), whereas that of the traditional method is due to the multiplier, i.e. O(log m) for bit-parallel implementations. This reduction in the critical path is obtained at the cost of a PRU. However, the number of multiplications in the final Horner scheme being constant, using a smaller but slower multiplier can greatly reduce the additional cost, in terms of space, without changing the overall asymptotic complexity.
Delay reduction with multiple PRUs
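Let $n = 2n'$ be an even integer (in order to simplify notations). Then one can write
$$\mathrm{GHASH}_H(X) = \underbrace{\left(X_1 (H^2)^{n'-1} + X_3 (H^2)^{n'-2} + \cdots + X_{n-1}\right)}_{P_1} H^2 + \underbrace{\left(X_2 (H^2)^{n'-1} + X_4 (H^2)^{n'-2} + \cdots + X_n\right)}_{P_2} H. \quad (1)$$
Now assume that we have two PRUs whose feedback connections are defined by the characteristic polynomial of $H^2$ over $\mathbb{F}_2$. Then these PRUs can process even and odd numbered blocks of $X$ in parallel in $\frac{n}{2} - m$ steps. Referring to (1), the outputs of the PRUs are $P_1 \bmod \chi_{H^2}$ and $P_2 \bmod \chi_{H^2}$, which we call partial GHASH values. One can then simply multiply and add according to the equality
$$\mathrm{GHASH}_H(X) = P_1 H^2 + P_2 H.$$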
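More generally, let us assume that we have $r$ PRUs whose feedback connections are defined by the characteristic polynomial of $H^r$ over $\mathbb{F}_2$. For simplicity, we also assume that $n$ is a multiple of $r$ (we pad zeros to $X$ if needed). Then we can decompose $X$ into $r$ different sets $P_1, P_2, \ldots, P_r$ such that
$$\mathrm{GHASH}_H(X) = \sum_{i=1}^{r} \sum_{j=0}^{n/r-1} X_{i+jr} H^{n-(i+jr)+1}, \quad (2)$$
where for $1 \le i \le r$ we have
$$P_i = \sum_{j=0}^{n/r-1} X_{i+jr} (H^r)^{n/r-1-j},$$
each $P_i \bmod \chi_{H^r}$ being computed by one PRU in $\frac{n}{r}$ iterations, as shown in Fig. 3. (The number of iterations reduces to $\frac{n}{r} - m$ if each PRU can be initialized with $m$ coefficients in parallel at the beginning of its operation.) We now write (2) in terms of $P_i$ as follows:
$$\mathrm{GHASH}_H(X) = \sum_{i=1}^{r} P_i H^{r-i+1}. \quad (3)$$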
Since each $P_i$ is a degree $m - 1$ polynomial in $H^r$, one can verify that the sum of the products in (3) is nothing but a degree $rm$ polynomial in $H$ (with the constant term being zero). We can then use one more PRU, whose feedback connection is defined by the characteristic polynomial of $H$, and reduce the polynomial in $H$ from degree $r(m - 1)$ to degree $m - 1$. Note that computing this characteristic polynomial, unlike that of $H^r$, can be done in parallel with obtaining the $P_i$. Moreover, if $r$ is a power of two, then $H^r$ and $H$ share the same characteristic polynomial and hence the latter does not need to be computed a second time. We obtain the final result by applying the Horner scheme to the output of the final PRU, and this requires $m - 1$ multiply-and-add operations.
Hence, given $\chi_{H^r}$ and using $r + 1$ PRUs and one multiplier (see Fig. 3), the GHASH computation of $X$ has the following time delay
Comparison: As noted in [9], the conventional method of computing GHASH can also be parallelized. Assume that we have $r$ field multipliers. Then all $P_i$, for $1 \le i \le r$, can be computed in $\frac{n}{r} - 1$ iterations, assuming that we have $H, H^2, \ldots, H^r$, and the time delay of each iteration is $D_M + D_X$. The $r$ resulting products are then summed, which incurs a delay of $(\log_2 r)D_X$, assuming additions are done in parallel in a binary tree fashion. Thus, given $H^i$, $1 \le i \le r$, using $r$ field multipliers and $r - 1$ adders, one can compute GHASH based on Algorithm 1 with the following time delay:
. . . In hardware implementations, $D_X$ is normally larger than $D_A$, and more importantly, for bit-parallel implementations we have $D_M \ge (\log_2 m)D_X$. For example, as mentioned in Table 1, the Karatsuba algorithm based bit-parallel multiplier over $\mathbb{F}_{2^{128}}$ has $D_M > 7D_X$. Thus for a large $n$ (i.e., long messages), it is clear that (4) is considerably smaller than (5).
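The interleaved decomposition on which both parallel schemes rely can be checked numerically with the sketch below. It assumes one particular indexing, in which stream $i$ collects the blocks $X_i, X_{i+r}, X_{i+2r}, \ldots$, is evaluated as a polynomial in $H^r$ by Horner's rule, and is finally weighted by $H^{r-i+1}$. The function names and the natural bit order are assumptions, and the PRUs themselves are not modeled: each stream is evaluated with ordinary field multiplications, as in the conventional parallel method.

```python
M = 128
MODULUS = (1 << M) | 0x87   # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in the field."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result


def ghash_interleaved(blocks, h: int, r: int) -> int:
    """GHASH computed from r independent streams (len(blocks) divisible by r)."""
    if len(blocks) % r != 0:
        raise ValueError("pad the message so that n is a multiple of r")
    powers = [1]                       # powers[e] = H^e for e = 0..r
    for _ in range(r):
        powers.append(gf_mul(powers[-1], h))
    h_r = powers[r]
    total = 0
    for i in range(r):                 # 0-based stream index
        partial = 0
        for x in blocks[i::r]:         # Horner's rule in H^r
            partial = gf_mul(partial, h_r) ^ x
        total ^= gf_mul(partial, powers[r - i])   # weight H^(r - i)
    return total
```

The $r$ inner loops are independent of one another, so each can be assigned to its own multiplier (or, in the proposed scheme, to its own PRU), leaving only the weighting and the final summation to be done once.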
Computation of the characteristic polynomial over $\mathbb{F}_{2^{128}}$
In this section, we first recall a method from Gordon [5] that determines the characteristic polynomial of an element of a finite field and requires, in our context, 127 multiplications and 127 squarings. We then propose a new method taking advantage of the tower structure of $\mathbb{F}_{2^{128}}$ that can be faster than the first one depending on the representation of the finite field.
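As an illustration of the first method, the sketch below evaluates the product of the terms $x + A^{2^i}$ inside the field and then adds the field modulus to recover the monic degree-128 polynomial. It relies on the fact that the characteristic polynomial of $x$ is the defining polynomial of the field. The function name, the bit-mask encoding of the result and the natural bit order are assumptions made for this example.

```python
M = 128
MODULUS = (1 << M) | 0x87   # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in the field."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result


def char_poly_gordon(a: int) -> int:
    """Return chi_A as a bit mask: bit i is the coefficient of T^i."""
    x = 2                                # the field element x
    acc = 1
    conj = a
    for _ in range(M):
        acc = gf_mul(acc, x ^ conj)      # multiply by (x + A^(2^i))
        conj = gf_mul(conj, conj)        # next conjugate, by squaring
    # acc holds chi_A(x), i.e. chi_A(T) modulo the field polynomial; chi_A is
    # monic of degree M, so adding the field polynomial restores it.
    return acc ^ MODULUS
```

The first multiplication (by 1) and the last squaring (whose result is unused) are trivial, which leaves the 127 multiplications and 127 squarings quoted above.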
Gordon's method: Let $A \in \mathbb{F}_{2^m}$. Then the characteristic polynomial of $A$ is given by
$$\chi_A(T) = \prod_{i=0}^{m-1} (T + A^{2^i}).$$
We want to find this polynomial in expanded form, with coefficients in $\mathbb{F}_2$. To this end, we evaluate it at $x$ inside the field:
$$\chi_A(x) = \prod_{i=0}^{m-1} (x + A^{2^i}) = \sum_{i=0}^{m-1} B_i x^i.$$
In other words, the coefficients of the representation of the field element $\chi_A(x)$ correspond to those of the polynomial $\chi_A(T) \bmod \chi_x(T)$. As $\chi_A(T)$ has degree 128, we have $\chi_A(T) = (\chi_A(T) \bmod \chi_x(T)) + \chi_x(T)$. This can be easily evaluated by adding the coefficients of $\chi_A(x)$ to those of $\chi_x$ modulo 2. Evaluating $\chi_A(x)$ can be done using 127 field multiplications and 127 field squarings.

Gordon's method is quite general and can be applied to any binary field. We now propose a new method that takes into account that $\mathbb{F}_{2^{128}}$ has a very special structure. We will use the following lemma to compute the polynomial $\chi_A$. Let $P = \sum_{i=0}^{d} p_i T^i$ be a polynomial in $\mathbb{F}_{2^n}[T]$ and $k$ an integer. We denote
$$\sigma_k(P) = \sum_{i=0}^{d} p_i^{2^k} T^i.$$
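Lemma 1. Let $\mathbb{F}_{2^m}$ be a binary field, and let $\mathbb{F}_{2^{2m}}$ be a degree 2 field extension of $\mathbb{F}_{2^m}$. The following assertions hold:
1. If $P, Q \in \mathbb{F}_{2^{2m}}[T]$ and $k$ is an integer, then $\sigma_k(PQ) = \sigma_k(P)\sigma_k(Q)$.
2. If $P \in \mathbb{F}_{2^{2m}}[T]$ and $A \in \mathbb{F}_{2^{2m}}$ satisfy $P(A) = 0$, then $Q = P\sigma_m(P)$ satisfies $Q(A) = 0$ and $Q \in \mathbb{F}_{2^m}[T]$.

Proof. 1. Let us write $P = \sum_i p_i T^i$ and $Q = \sum_j q_j T^j$, so that
$$PQ = \sum_{\ell} \Big(\sum_{i+j=\ell} p_i q_j\Big) T^{\ell}.$$
Then, we apply $\sigma_k$ to $PQ$. Since raising to the power $2^k$ is additive in characteristic 2, we get
$$\sigma_k(PQ) = \sum_{\ell} \Big(\sum_{i+j=\ell} p_i q_j\Big)^{2^k} T^{\ell} = \sum_{\ell} \Big(\sum_{i+j=\ell} p_i^{2^k} q_j^{2^k}\Big) T^{\ell} = \sigma_k(P)\sigma_k(Q),$$
which proves the assertion.
2. We now prove the second assertion of the lemma. Let $Q = P\sigma_m(P)$. We first check that $Q(A) = 0$: we have
$$Q(A) = P(A)\,\sigma_m(P)(A) = 0.$$
In order to show that $Q \in \mathbb{F}_{2^m}[T]$ we have to prove that $\sigma_m(Q) = Q$, since this would mean that each coefficient of $Q$ is in $\mathbb{F}_{2^m}$. Using the first assertion and the fact that $a^{2^{2m}} = a$ for every $a \in \mathbb{F}_{2^{2m}}$, we compute
$$\sigma_m(Q) = \sigma_m(P)\,\sigma_m(\sigma_m(P)) = \sigma_m(P)\,\sigma_{2m}(P) = \sigma_m(P)\,P = Q.$$
This completes the proof.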
Let us now see how to use the above result to compute the characteristic polynomial of an element $A \in \mathbb{F}_{2^8}$ over $\mathbb{F}_2$. We begin with the polynomial $P_0 = T + A$, which satisfies $P_0(A) = 0$ and $P_0 \in \mathbb{F}_{2^8}[T]$.
- Now we apply the lemma for the field extension $\mathbb{F}_{2^8}/\mathbb{F}_{2^4}$ to obtain a polynomial $P_1 = P_0\sigma_4(P_0)$ which satisfies $P_1(A) = 0$ and $P_1 \in \mathbb{F}_{2^4}[T]$. Moreover we have $P_1 = (T + A)(T + A^{2^4})$.
- We apply the lemma again for the field extension $\mathbb{F}_{2^4}/\mathbb{F}_{2^2}$ to obtain a polynomial $P_2 = P_1\sigma_2(P_1)$ which satisfies $P_2(A) = 0$ and $P_2 \in \mathbb{F}_{2^2}[T]$. Moreover we also have $P_2 = (T + A)(T + A^{2^4})(T + A^{2^2})(T + A^{2^6})$.
- Finally we apply the lemma a third time for the field extension $\mathbb{F}_{2^2}/\mathbb{F}_2$ to the polynomial $P_2$. We get $P_3 = P_2\sigma_1(P_2) = \prod_{i=0}^{7} (T + A^{2^i})$, and $P_3 \in \mathbb{F}_2[T]$ satisfies $P_3(A) = 0$. If we look at the expression of $P_3$ we can see that it is equal to the characteristic polynomial of $A$.
This method is generalized in Algorithm 3 to binary fields $\mathbb{F}_{2^m}$ where $m = 2^k$ is a power of 2.
Algorithm 3 Computing the characteristic polynomial of A
Require: $A \in \mathbb{F}_{2^{2^k}}$
Ensure: $P_A$ (the characteristic polynomial of $A$)
  $P \leftarrow T + A$
  for $i = k - 1$ to $0$ do
    $P \leftarrow P \times \sigma_{2^i}(P)$
  end for
  return ($P$)

Proposition 1. Algorithm 3 returns the characteristic polynomial of $A$ over $\mathbb{F}_2$.
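Proof. Applying Lemma 1 for successive extension fields of degree 2 shows that the returned polynomial $P$ satisfies $P(A) = 0$ and $P \in \mathbb{F}_2[T]$. Let us now prove that the polynomial $P$ is the characteristic polynomial of $A$, i.e., we show that the returned $P$ satisfies
$$P = \prod_{i=0}^{2^k-1} (T + A^{2^i}).$$
Specifically, we will show by induction on the index of the loop that the polynomial $P$ computed in the $j$-th loop is the characteristic polynomial of $A$ over the field $\mathbb{F}_{2^{2^{k-j}}}$.
- For $j = 0$ it is clear that $P = T + A$ is the characteristic polynomial of $A$ over $\mathbb{F}_{2^{2^k}}$.
- For $j = 1$ we have $P = (T + A)(T + A^{2^{2^{k-1}}})$, thus the assertion is true.
- Assume that after $j$ loops we have $P = \prod_{i=0}^{2^j-1} (T + A^{2^{i \cdot 2^{k-j}}})$. Then the next loop computes
$$P\,\sigma_d(P) = \prod_{i=0}^{2^j-1} (T + A^{2^{2id}})(T + A^{2^{(2i+1)d}}) = \prod_{i=0}^{2^{j+1}-1} (T + A^{2^{id}}),$$
where $d = 2^{k-(j+1)}$. This completes the proof.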
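A direct software transcription of the loop of Algorithm 3 could look as follows for $k = 7$. The sketch makes simplifying assumptions that differ from the complexity analysis of the paper: all arithmetic is carried out in $\mathbb{F}_{2^{128}}$ rather than in the intermediate subfields, polynomial products are schoolbook rather than Karatsuba, and the names (`sigma`, `poly_mul`, `char_poly_tower`) and the natural bit order are chosen only for this example.

```python
M = 128
MODULUS = (1 << M) | 0x87   # x^128 + x^7 + x^2 + x + 1 (assumed bit order)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in the field."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result


def sigma(poly, k: int):
    """sigma_k: raise every coefficient to the power 2^k (k squarings each)."""
    out = []
    for c in poly:
        for _ in range(k):
            c = gf_mul(c, c)
        out.append(c)
    return out


def poly_mul(p, q):
    """Schoolbook product of two polynomials with field coefficients."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                out[i + j] ^= gf_mul(a, b)
    return out


def char_poly_tower(a: int, k: int = 7):
    """Coefficients (lowest degree first) of the characteristic polynomial of a."""
    p = [a, 1]                             # T + A
    for i in range(k - 1, -1, -1):         # i = k-1, ..., 0
        p = poly_mul(p, sigma(p, 1 << i))  # P <- P * sigma_{2^i}(P)
    return p
```

After the last iteration every coefficient is 0 or 1, in line with Proposition 1, so the result can be used directly as the feedback pattern of a polynomial reduction unit.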
Complexity.
Let us now evaluate the complexity of Algorithm 3. We suppose that each polynomial multiplication is done using the Karatsuba method, and that each exponentiation to the power $2^k$ is done by computing $k$ successive squarings.
First, we make two observations:
- Let us first check that $P$ has degree $2^j$ after $j$ iterations of the for loop. This is true when $j = 0$ since $P$ is initialized to $T + A$. In each iteration the degree of $P$ is multiplied by 2, thus after $j$ iterations the degree of $P$ must be equal to $1 \times 2^j$.
- At the end of the $j$-th iteration of the for loop, the polynomial $P$ belongs to $\mathbb{F}_{2^{2^{k-j}}}[T]$. Indeed, if $P$ belongs to $\mathbb{F}_{2^{2^{k-j}}}[T]$, then at step $j + 1$ we compute $P = P \times \sigma_{2^{k-(j+1)}}(P)$, which belongs to $\mathbb{F}_{2^{2^{k-(j+1)}}}[T]$.
From the above two remarks, we can now evaluate the complexity of Algorithm 3.
- At the $j$-th iteration of the for loop, $P$ is a monic polynomial of degree $2^j$ (and thus has $2^j$ coefficients). As its coefficients are in $\mathbb{F}_{2^{2^{k-j+1}}}$ and we need to raise each of them to the power $2^{2^{k-j}}$, we have to perform $2^j \times 2^{k-j}$ squarings over $\mathbb{F}_{2^{2^{k-j+1}}}$ to compute $\sigma_{2^{k-j}}(P)$. Let us denote by $S_{2^k}$ the cost of a squaring over $\mathbb{F}_{2^{2^k}}$; we assume that $S_{2^k} = 2S_{2^{k-1}}$ (squaring is linear over binary fields). Then, the total complexity of computing the polynomials $\sigma_{2^{k-j}}(P)$ is
$$\sum_{j=1}^{k} 2^k S_{2^{k-j+1}} = \sum_{j=1}^{k} \frac{2^k}{2^{j-1}} S_{2^k} \le 2^{k+1} S_{2^k}.$$
- In the same manner, at the $j$-th iteration, after computing $\sigma_{2^{k-j}}(P)$, the algorithm computes $P\sigma_{2^{k-j}}(P)$. Thus, we multiply two degree $2^j$ polynomials. Performing this multiplication using the Karatsuba algorithm requires $3^j$ multiplications of field elements. Let us denote by $M_{2^k}$ the cost of a multiplication over $\mathbb{F}_{2^{2^k}}$; we assume that $M_{2^k} = 3M_{2^{k-1}}$. Then, the computational cost is equal to
$$\sum_{j=1}^{k} 3^j M_{2^{k-j+1}} = \sum_{j=1}^{k} \frac{3^j}{3^{j-1}} M_{2^k} = 3kM_{2^k}.$$
In the end, the overall cost of computing the characteristic polynomial of an element of $\mathbb{F}_{2^{2^k}}$ over $\mathbb{F}_2$ is, in terms of number of operations over $\mathbb{F}_{2^{2^k}}$, $3kM_{2^k} + 2^{k+1}S_{2^k}$.
In the case of GCM, $k = 7$ and thus the number of field operations is 21 multiplications and 256 squarings over $\mathbb{F}_{2^{128}}$.
Remark 1. It is not always suitable to decompose the field $\mathbb{F}_{2^{2^k}}$ into $k$ extensions of degree 2. As an example, $\mathbb{F}_{2^{128}}$ can be seen as a degree 4 extension of $\mathbb{F}_{2^{32}}$ and $\mathbb{F}_{2^{32}}$ as a degree 32 extension of $\mathbb{F}_2$. In that case, multiplying two elements of $\mathbb{F}_{2^{64}}$ is performed on the extension field $\mathbb{F}_{2^{128}}$, which means that $M_{2^7} = M_{2^6}$. In the end, the total complexity would be:
This shows that depending on the representation, the computational cost might vary. However, this computation is done once and for all at the beginning of a GCM session. As such a session can involve thousands of field multiplications (the plain text can have up to $2^{39}$ bits, i.e. $2^{32}$ 128-bit blocks), this additional cost can be considered as negligible for long sessions.
Remark 2.
For large values of $n$, the computation of GHASH using the characteristic polynomial is expected to be several times faster than traditional methods that use $n$ field multiplications. For example, consider $n = 10{,}000$ and the FPGA based bit-parallel multiplier from [12], which has a delay of $D_M = 6.637$ ns. This multiplier is based on the Karatsuba algorithm and hence $D_M = 3(\log_2(2^7))D_X + D_A$, i.e., ignoring the delay due to the single level of AND gates, we have $D_M/D_X \approx 21$. Thus, the traditional method for computing GHASH will require $10{,}000 \times (D_X + D_M) \approx 69.6$ µs. On the other hand, the characteristic polynomial based GHASH using the same multiplier will require approximately $10{,}000 \times (D_X + D_A) + 128 \times (D_M + D_X) + 135 \times D_M \approx 8.1$ µs, resulting in a more than 8-fold reduction in the computation time.
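The arithmetic behind these two estimates can be reproduced with a few lines of Python. The text does not give a value for $D_A$; the sketch assumes $D_A = D_X$, which is an assumption made here only to obtain concrete numbers, and all variable names are illustrative.

```python
D_M = 6.637          # ns, multiplier delay quoted from [12]
D_X = D_M / 21       # ns, from the ratio D_M / D_X ~ 21
D_A = D_X            # ns, ASSUMPTION: AND delay taken equal to the XOR delay
n, m = 10_000, 128

# Traditional method: one multiplication and one XOR per block.
traditional_ns = n * (D_X + D_M)

# Characteristic polynomial method: the three terms of the estimate above.
reduction_ns = n * (D_X + D_A)       # XOR/AND steps of the reduction loop
horner_ns = m * (D_M + D_X)          # final multiply-and-add operations
extra_ns = 135 * D_M                 # remaining term of the estimate
charpoly_ns = reduction_ns + horner_ns + extra_ns

speedup = traditional_ns / charpoly_ns
print(f"traditional: {traditional_ns / 1000:.1f} us")
print(f"char. poly.: {charpoly_ns / 1000:.1f} us")
print(f"ratio:       {speedup:.1f}")
```

Under this assumption the two totals come out at roughly 69.5 µs and 8.1 µs, in line with the figures quoted above.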
Conclusions
In this paper we have proposed a new way to improve the performance of the GHASH function of GCM. Our method is based on the use of the characteristic polynomial of the hash sub-key $H$. It has allowed us to trade most of the field multiplications involved in the authentication tag computation for a series of 128 independent field additions. This is very attractive for high performance implementations where all such additions can be performed in parallel. This allows us to reduce the delay of each of the first $n - 128$ multiplications over $\mathbb{F}_{2^{128}}$ to that of one XOR and one AND operation. To illustrate the effectiveness of the proposed method, we have considered $n = 10{,}000$ and the Karatsuba algorithm based bit-parallel multiplier on FPGA from [12], which has a delay of $D_M = 6.637$ ns and $D_M/D_X \approx 21$, and we have estimated that, compared to the traditional method, the new method can reduce the GHASH computation time by a factor of about eight.
In this paper, we have also shown the flexibility of our method in terms of parallelization. Using multiple polynomial reduction units allows us to efficiently parallelize the computations and improve the performance of GHASH even further. Finally, we have also proposed a method, specific to $\mathbb{F}_{2^{128}}$, to compute the initial characteristic polynomial efficiently.
