Abstract: We study Dickson bases for binary field representation. Such a representation is of interest when no optimal normal basis exists for the field. We express the product of two field elements as Toeplitz or Hankel matrix-vector products. This provides a parallel multiplier which is subquadratic in space and logarithmic in time. Using the matrix-vector formulation of the field multiplication, we also present sequential multiplier structures with linear space complexity.
DICKSON POLYNOMIALS
Dickson polynomials over finite fields were introduced by Dickson in [2]. These polynomials have several applications and interesting properties, the main one being a permutation property over finite fields. For a complete account, the reader may refer to [7]. Our interest here is in the use of Dickson polynomials for finite field representation, with a view to efficient binary field multiplication. There are two kinds of Dickson polynomials, and there are several ways to define and construct both of them. We give here the definition from [7] of the Dickson polynomials of the first kind.
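Definition 1 (Dickson Polynomial [7, p. 9]). Let $R$ be a ring and $a \in R$. The Dickson polynomial of the first kind $D_n(X, a)$ is defined by
$$D_n(X, a) = \sum_{i=0}^{\lfloor n/2 \rfloor} \frac{n}{n-i} \binom{n-i}{i} (-a)^i X^{n-2i}. \quad (1)$$
For $n = 0$, we set $D_0(X, a) = 2$ and, for $n = 1$, we have $D_1(X, a) = X$.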
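Dickson polynomials of the first kind also satisfy the three-term recurrence $D_n(X, a) = X D_{n-1}(X, a) - a D_{n-2}(X, a)$ for $n \geq 2$, which is often more convenient for computation than the explicit sum. As a minimal sketch (not taken from the paper), the Python snippet below generates $D_i(X, 1)$ reduced modulo 2 in this way. The function names `dickson_f2` and `poly_str`, and the encoding of a polynomial as an integer bitmask (bit $k$ is the coefficient of $X^k$), are assumptions made for illustration.

```python
def dickson_f2(n):
    """Return [D_0, ..., D_n] with D_i = D_i(X, 1) reduced modulo 2.

    Each polynomial is an int bitmask: bit k is the coefficient of X^k.
    """
    polys = [0, 0b10]  # D_0 = 2 = 0 in characteristic 2, D_1 = X
    for _ in range(2, n + 1):
        # D_i = X * D_{i-1} - D_{i-2}; over F_2 subtraction is XOR
        polys.append((polys[-1] << 1) ^ polys[-2])
    return polys[: n + 1]


def poly_str(p):
    """Human-readable form of a bitmask polynomial, highest degree first."""
    terms = [("1" if k == 0 else "X" if k == 1 else f"X^{k}")
             for k in range(p.bit_length() - 1, -1, -1) if (p >> k) & 1]
    return " + ".join(terms) if terms else "0"


for i, d in enumerate(dickson_f2(5)):
    print(f"D_{i} = {poly_str(d)}")
```

Over $\mathbb{F}_2$ this gives, for instance, $D_3 = X^3 + X$ and $D_4 = X^4$; each $D_i$ with $i \geq 1$ has leading term $X^i$.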
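In this paper, we consider only the Dickson polynomials $\beta_i = D_i(X, 1)$ in $\mathbb{F}_2[X]$.

Theorem 1. Let $P \in \mathbb{F}_2[X]$ be an irreducible polynomial of degree $n$. Then $\mathcal{B} = \{\beta_1, \ldots, \beta_n\}$ is a basis of $\mathbb{F}_{2^n} = \mathbb{F}_2[X]/(P)$.

Proof. For a detailed proof, we refer to [6]; we give only a brief explanation here. Using (1), we can see that $\beta_i = X^i +$ terms of lower degree. This implies that the conversion matrix from $\{\beta_1, \ldots, \beta_n\}$ to $\{X, \ldots, X^n\}$ is lower triangular with 1s on the diagonal. The conversion matrix is thus invertible and, since $\{X, \ldots, X^n\}$ forms a basis of $\mathbb{F}_{2^n}$, $\mathcal{B} = \{\beta_1, \ldots, \beta_n\}$ is a basis of $\mathbb{F}_{2^n} = \mathbb{F}_2[X]/(P)$. $\square$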
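The following theorem will be used extensively for the construction of subquadratic multipliers in the Dickson basis.

Theorem 2. For all $i, j \geq 0$, the following equation holds:
$$\beta_i \beta_j = \beta_{i+j} + \beta_{|i-j|}. \quad (2)$$

Proof. We show it by induction on $i$ and $j$. We can easily check that (2) holds for $i, j \leq 1$. We suppose that the equation is true for all $i, j \leq n$ and we prove that it is true for $i, j \leq n+1$. We first prove it for $i = n+1$ and $j \leq n$. Using the recurrence $\beta_{n+1} = X\beta_n + \beta_{n-1}$, we have
$$\beta_{n+1}\beta_j = X\beta_n\beta_j + \beta_{n-1}\beta_j = X(\beta_{n+j} + \beta_{n-j}) + (\beta_{n+j-1} + \beta_{|n-j-1|})$$
by induction hypothesis. Now, we have
$$\beta_{n+1}\beta_j = (X\beta_{n+j} + \beta_{n+j-1}) + (X\beta_{n-j} + \beta_{|n-j-1|}) = \beta_{n+1+j} + \beta_{n+1-j}.$$
For the other case $i = n+1$ and $j = n+1$, the product $\beta_{n+1}\beta_{n+1}$ is obtained using similar tricks. $\square$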
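The theorem above concerns the product of two Dickson polynomials; for $D_i = D_i(X, 1)$ in characteristic 2, the classical identity reads $D_i D_j = D_{i+j} + D_{|i-j|}$. The sketch below is a numerical sanity check of this identity for small indices, not a proof. Polynomials are encoded as integer bitmasks, and the helper names `clmul`, `dickson_f2` and `product_rule_holds` are hypothetical.

```python
def clmul(a, b):
    """Carry-less (F_2[X]) product of two bitmask polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def dickson_f2(n):
    """[D_0, ..., D_n] over F_2 via D_i = X * D_{i-1} + D_{i-2}."""
    polys = [0, 0b10]  # D_0 = 2 = 0 over F_2, D_1 = X
    for _ in range(2, n + 1):
        polys.append((polys[-1] << 1) ^ polys[-2])
    return polys[: n + 1]


def product_rule_holds(limit):
    """Check D_i * D_j == D_{i+j} + D_{|i-j|} for all 0 <= i, j <= limit."""
    d = dickson_f2(2 * limit)
    return all(
        clmul(d[i], d[j]) == d[i + j] ^ d[abs(i - j)]
        for i in range(limit + 1)
        for j in range(limit + 1)
    )


print(product_rule_holds(16))
```

The case $i = j$ shows why $D_0 = 0$ matters over $\mathbb{F}_2$: squaring simply doubles the index, $D_i^2 = D_{2i}$.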
FIELD MULTIPLICATION USING LOW-WEIGHT DICKSON POLYNOMIALS
In this section, we consider multiplication of two elements of the binary field $\mathbb{F}_{2^n} = \mathbb{F}_2[X]/(P)$, where the polynomial $P$ is a low-weight Dickson polynomial. In particular, we consider two- and three-term Dickson polynomials $P$, i.e., Dickson binomials and trinomials. As with low-weight conventional polynomials, the use of low-weight Dickson polynomials is expected to yield multipliers with lower space complexity.
In Table 1, we give the degrees $n \in [160, 285]$ of the fields $\mathbb{F}_{2^n} = \mathbb{F}_2[X]/(P)$ where $P$ is a low-weight Dickson polynomial. Specifically, since no irreducible Dickson binomials were available, we have looked for almost Dickson binomials (ADB), i.e., irreducible $P$ satisfying $P \times (X + 1) = \beta_{n+1} + 1$. We also give Dickson trinomials (DT) of the form $P = \beta_n + \beta_k + 1$ with $k \leq n/2$. For the purpose of comparison, we also mention, for each degree, whether an ONB of type I or II exists (marked as NI and NII).
Our main goal here is to express the product of two elements, represented in the Dickson basis, as a Toeplitz or Hankel matrix-vector product. Recall that an $n \times n$ Toeplitz matrix $T = [t_{i,j}]$ satisfies $t_{i,j} = t_{i+1,j+1}$ and a Hankel matrix $H = [h_{i,j}]$ satisfies $h_{i,j} = h_{i+1,j-1}$. We will then use the subquadratic Toeplitz matrix-vector product of [3] to design a subquadratic multiplier.
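To make the two definitions concrete, the sketch below builds a Toeplitz and a Hankel matrix from $2n - 1$ values and checks the defining relations $t_{i,j} = t_{i+1,j+1}$ and $h_{i,j} = h_{i+1,j-1}$. The helper names and the particular indexing of the input list are assumptions chosen for this illustration.

```python
def toeplitz(diag):
    """n x n Toeplitz matrix from its 2n-1 diagonal values.

    diag[k] is the value on the diagonal i - j = k - (n - 1).
    """
    n = (len(diag) + 1) // 2
    return [[diag[i - j + n - 1] for j in range(n)] for i in range(n)]


def hankel(anti):
    """n x n Hankel matrix from its 2n-1 anti-diagonal values anti[i + j]."""
    n = (len(anti) + 1) // 2
    return [[anti[i + j] for j in range(n)] for i in range(n)]


def is_toeplitz(m):
    n = len(m)
    return all(m[i][j] == m[i + 1][j + 1]
               for i in range(n - 1) for j in range(n - 1))


def is_hankel(m):
    n = len(m)
    return all(m[i][j] == m[i + 1][j - 1]
               for i in range(n - 1) for j in range(1, n))


T = toeplitz([1, 2, 3, 4, 5])
H = hankel([1, 2, 3, 4, 5])
print(T, is_toeplitz(T))
print(H, is_hankel(H))
```

Both kinds of matrix are determined by only $2n - 1$ entries rather than $n^2$, which is what the subquadratic matrix-vector algorithms exploit.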
Irreducible Dickson Binomials
In this section, we focus on finite fields $\mathbb{F}_{2^n} = \mathbb{F}_2[X]/(P)$ where $P$ is a two-term polynomial of the form $P = \beta_n + 1$, where $\beta_n$ is the $n$th Dickson polynomial.
The elements of $\mathbb{F}_{2^n}$ are expressed in the Dickson basis $\mathcal{B} = \{\beta_1, \ldots, \beta_n\}$. Our main goal is to show that the product of two elements $A$ and $B$ in $\mathbb{F}_{2^n}$ can be computed as a matrix-vector product $M_A \cdot B$, where $M_A$ is a sum of a Toeplitz matrix and an essentially Hankel matrix.
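If we multiply two elements $A = \sum_{i=1}^{n} a_i \beta_i$ and $B = \sum_{j=1}^{n} b_j \beta_j$ expressed in $\mathcal{B}$ and use Theorem 2, we get the following:
$$C = A \times B = \underbrace{\sum_{i,j=1}^{n} a_i b_j \beta_{i+j}}_{S_1} + \underbrace{\sum_{i,j=1}^{n} a_i b_j \beta_{|i-j|}}_{S_2}. \quad (3)$$
Now, we express each of the sums $S_1$ and $S_2$ as matrix-vector products. Let us begin with $S_1$. We remark that $S_1$ has an expression similar to that of a product of two polynomials of the same degree. In other words, $S_1$ can be computed as $Z_A \cdot B$, where $Z_A$ is the $2n \times n$ Toeplitz matrix whose entry in the row of $\beta_m$ and column $j$ is $a_{m-j}$ if $1 \leq m - j \leq n$ and 0 otherwise. (4)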
We reduce the matrix $Z_A$ modulo $P = \beta_n + 1$ to get nonzero coefficients only on the rows corresponding to $\beta_1, \ldots, \beta_n$. We use the fact that $\beta_{n+i}$, for $i \geq 0$, satisfies
$$\beta_{n+i} = \beta_i + \beta_{n-i} \mod P.$$
This equation is a simple consequence of (2) and of the fact that $\beta_n = 1 \mod P$. This implies that each row corresponding to $\beta_{n+i}$ is reduced into two rows, one corresponding to $\beta_i$ and the other to $\beta_{n-i}$. After performing this reduction and removing zero rows, we get

Finally, we get an expression of $S_1$ as a matrix-vector product where the matrix is a sum of a Toeplitz matrix and an essentially Hankel matrix. Now, we do the same for $S_2$. We split $S_2$ into two sums

We express $S_{2,1}$ and $S_{2,2}$ as matrix-vector products
So, now we have each of $S_1$ and $S_2$ in the required form. We finally write $S_{1,1} + S_{2,2} = T_A \cdot B$, where $T_A$ is a Toeplitz matrix, and $S_{1,2} + S_{2,1} = H_A \cdot B$, where $H_A$ is a Hankel matrix. We obtain
$$C = A \times B = T_A \cdot B + H_A \cdot B, \quad (8)$$
as stated at the beginning of the current section.
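The index arithmetic behind this construction can be sketched in a few lines of Python. The sketch assumes the product rule $\beta_i \beta_j = \beta_{i+j} + \beta_{|i-j|}$ and the reduction $\beta_{n+i} = \beta_i + \beta_{n-i}$ modulo $P = \beta_n + 1$, and compares the result with ordinary polynomial arithmetic modulo $P$. It works in the quotient ring $\mathbb{F}_2[X]/(P)$, so it does not require $P$ to be irreducible; the small degree $n = 5$ and all function names are illustrative assumptions, and no Toeplitz or Hankel structure is exploited here.

```python
def clmul(a, b):
    """Carry-less (F_2[X]) product of two bitmask polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def polymod(a, m):
    """Remainder of a modulo m in F_2[X]."""
    dm = m.bit_length()
    while a.bit_length() >= dm:
        a ^= m << (a.bit_length() - dm)
    return a


def dickson_f2(n):
    polys = [0, 0b10]  # beta_0 = 0 over F_2, beta_1 = X
    for _ in range(2, n + 1):
        polys.append((polys[-1] << 1) ^ polys[-2])
    return polys[: n + 1]


def to_poly(coords, beta):
    """Polynomial sum of coords[i-1] * beta_i for i = 1..n."""
    p = 0
    for i, bit in enumerate(coords, start=1):
        if bit:
            p ^= beta[i]
    return p


def mul_dickson_binomial(a, b):
    """Coordinates of A*B in {beta_1..beta_n} modulo beta_n + 1."""
    n = len(a)
    c = [0] * (2 * n + 1)  # c[m] is the coefficient of beta_m
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if a[i - 1] & b[j - 1]:
                c[i + j] ^= 1
                c[abs(i - j)] ^= 1  # c[0] belongs to beta_0 = 0, dropped below
    for m in range(n + 1, 2 * n + 1):  # beta_{n+i} = beta_i + beta_{n-i}
        if c[m]:
            c[m - n] ^= 1
            c[2 * n - m] ^= 1
    return c[1 : n + 1]


n = 5
beta = dickson_f2(n)
P = beta[n] ^ 1  # P = beta_n + 1
a, b = [1, 0, 1, 1, 0], [0, 1, 1, 0, 1]
lhs = polymod(clmul(to_poly(a, beta), to_poly(b, beta)), P)
rhs = polymod(to_poly(mul_dickson_binomial(a, b), beta), P)
print(lhs == rhs)
```

Because each reduced index $n + i$ only feeds indices at most $n$, a single reduction pass suffices in the binomial case.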
Dickson Trinomials
Now we assume that the field $\mathbb{F}_{2^n}$ is defined by a three-term irreducible Dickson polynomial $P = \beta_n + \beta_k + 1$ with $k \leq n/2$.
The elements of $\mathbb{F}_{2^n} = \mathbb{F}_2[X]/(P)$ are expressed in the Dickson basis $\mathcal{B} = \{\beta_1, \ldots, \beta_n\}$. Our aim is to express the product of two elements $A$ and $B$ of $\mathbb{F}_{2^n}$ as a Toeplitz or Hankel matrix-vector product. We again use the expression of the product $C = A \times B = S_1 + S_2$ given in (3). As in the previous section, we express $S_1$ and $S_2$ as matrix-vector products separately. Specifically,
1. The sum $S_1$ is expressed as $Z_A \cdot B$, where $Z_A$ is given in (4).
2. For $S_2$, we use the expression of (5) and put it in matrix-vector product form.

Now, we replace $S_1$ and $S_2$ by their corresponding expressions given above in
In (10), the addition of two $2n \times n$ Toeplitz matrices results in a single $2n \times n$ Toeplitz matrix. The latter can be split horizontally in the middle to obtain two $n \times n$ Toeplitz matrices, say $T_{up}$ and $T_{down}$, which can then be multiplied separately by the vector $(b_1, \ldots, b_n)$, with a total cost of two $n \times n$ Toeplitz matrix-vector products.
The other $2n \times n$ Hankel matrix in (10) is all-zero in its lower $n$ rows, which therefore contribute nothing to the cost of the matrix-vector multiplication. Thus, the total computational cost of (10) is no more than three $n \times n$ Toeplitz or Hankel matrix-vector products.
The reduction:
The resulting expression of $C$ in (10) is an unreduced form of $A \times B$, since it has nonzero coefficients $c_i$ on rows $i = n+1, \ldots, 2n$. These coefficients are obtained by multiplying $T_{down}$ by the vector $(b_1, b_2, \ldots, b_n)$, and must be reduced modulo $P = \beta_n + \beta_k + 1$. From (2) and $\beta_n = \beta_k + 1 \mod P$, we have, for $1 \leq i \leq n$,
$$\beta_{n+i} = \beta_{k+i} + \beta_{|k-i|} + \beta_i + \beta_{n-i} \mod P.$$
We reduce the expression of $C = \sum_{i=1}^{2n} c_i \beta_i$ by replacing each $\beta_i$ for $i > n$ by the expression given above. Since we assume $k < n/2$, this process must be done twice to get a reduced expression of $C$. A circuit can be designed to perform this process; it requires $6n - k$ XOR gates and is performed in time $3T_X$ (see [6] for details).
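The reduction step can be sketched in software as follows. The sketch assumes the product rule $\beta_i \beta_j = \beta_{i+j} + \beta_{|i-j|}$, from which $\beta_n = \beta_k + 1 \bmod P$ gives $\beta_{n+i} = \beta_{k+i} + \beta_{|k-i|} + \beta_i + \beta_{n-i}$. Indices are processed from $2n$ downward so that any index above $n$ produced by $\beta_{k+i}$ is reduced again later in the same loop, which plays the role of the second reduction round. The result is checked against polynomial arithmetic in $\mathbb{F}_2[X]/(P)$; the parameters $n = 5$, $k = 2$ and all function names are illustrative assumptions, and $P$ is not required to be irreducible for the check.

```python
def clmul(a, b):
    """Carry-less (F_2[X]) product of two bitmask polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def polymod(a, m):
    """Remainder of a modulo m in F_2[X]."""
    dm = m.bit_length()
    while a.bit_length() >= dm:
        a ^= m << (a.bit_length() - dm)
    return a


def dickson_f2(n):
    polys = [0, 0b10]  # beta_0 = 0 over F_2, beta_1 = X
    for _ in range(2, n + 1):
        polys.append((polys[-1] << 1) ^ polys[-2])
    return polys[: n + 1]


def to_poly(coords, beta):
    """Polynomial sum of coords[i-1] * beta_i for i = 1..n."""
    p = 0
    for i, bit in enumerate(coords, start=1):
        if bit:
            p ^= beta[i]
    return p


def mul_dickson_trinomial(a, b, k):
    """Coordinates of A*B in {beta_1..beta_n} modulo beta_n + beta_k + 1."""
    n = len(a)
    c = [0] * (2 * n + 1)  # c[m] is the coefficient of beta_m
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if a[i - 1] & b[j - 1]:
                c[i + j] ^= 1
                c[abs(i - j)] ^= 1  # c[0] belongs to beta_0 = 0, dropped below
    for m in range(2 * n, n, -1):  # highest index first
        if c[m]:
            c[m] = 0
            i = m - n
            for t in (k + i, abs(k - i), i, n - i):  # all strictly below m
                c[t] ^= 1
    return c[1 : n + 1]


n, k = 5, 2
beta = dickson_f2(n)
P = beta[n] ^ beta[k] ^ 1  # P = beta_n + beta_k + 1
a, b = [1, 1, 0, 0, 1], [0, 1, 0, 1, 1]
lhs = polymod(clmul(to_poly(a, beta), to_poly(b, beta)), P)
rhs = polymod(to_poly(mul_dickson_trinomial(a, b, k), beta), P)
print(lhs == rhs)
```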
Parallel Multiplier
We can design a multiplier using the expression of the multiplication in $\mathbb{F}_{2^n}$ as a Toeplitz or Hankel matrix-vector product (TMVP). Specifically, we use the Toeplitz or Hankel matrix-vector multiplier presented in [3] to perform these products. In Table 2, we recall the complexity of the TMVP multiplier established by Fan and Hasan [3].
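To illustrate the idea behind a two-way split TMVP over $\mathbb{F}_2$, the sketch below writes $T = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix}$ and $V = (V_0, V_1)$, and computes $TV$ from three half-size products $P_0 = (T_0 + T_1)V_1$, $P_1 = (T_1 + T_2)V_0$ and $P_2 = T_1(V_0 + V_1)$ as $(P_0 + P_2,\ P_1 + P_2)$. This is a software sketch of the recursion only, not the hardware architecture of [3]; the storage of a Toeplitz matrix as a list of $2n - 1$ diagonal values, the fallback to the schoolbook product for odd sizes, and the function names are assumptions.

```python
def tmvp_naive(d, v):
    """Schoolbook product T*v over F_2, with T[i][j] = d[i - j + n - 1]."""
    n = len(v)
    out = []
    for i in range(n):
        acc = 0
        for j in range(n):
            acc ^= d[i - j + n - 1] & v[j]
        out.append(acc)
    return out


def xor_vec(x, y):
    return [p ^ q for p, q in zip(x, y)]


def tmvp_two_way(d, v):
    """Two-way split Toeplitz matrix-vector product over F_2."""
    n = len(v)
    if n == 1:
        return [d[0] & v[0]]
    if n % 2:  # odd size: fall back to the schoolbook method
        return tmvp_naive(d, v)
    h = n // 2
    # Diagonal lists of the three distinct h x h Toeplitz blocks
    d0 = d[: 2 * h - 1]       # T0, top-right block
    d1 = d[h : 3 * h - 1]     # T1, both diagonal blocks
    d2 = d[2 * h :]           # T2, bottom-left block
    v0, v1 = v[:h], v[h:]
    p0 = tmvp_two_way(xor_vec(d0, d1), v1)
    p1 = tmvp_two_way(xor_vec(d1, d2), v0)
    p2 = tmvp_two_way(d1, xor_vec(v0, v1))
    return xor_vec(p0, p2) + xor_vec(p1, p2)


d = [1, 0, 1, 1, 0, 0, 1]  # 4 x 4 Toeplitz matrix, 2n - 1 = 7 diagonals
v = [1, 1, 0, 1]
print(tmvp_two_way(d, v) == tmvp_naive(d, v))
```

Three half-size products replace the four needed by a direct block product, which is the source of the subquadratic gate count when the split is applied recursively.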
In the case of Dickson binomials, to compute the matrix-vector products of (8), we need two TMVP multipliers in parallel. Each of them can use the two-way or three-way split approach of [3]. We also need an additional $2n$ XOR gates to compute the coefficients of $T_A$ and to add the results of the two matrix-vector products.
In the case of Dickson trinomials, as specified in Section 3.2, three TMVPs are done in parallel using the two-way or three-way split approach of [3]. We also need to perform a reduction using the circuit depicted in [6]. We obtain the complexities of Table 3 below, where the second-leftmost column indicates $b$-way splits, with $b$ being either two or three.
The rows of Table 3 labeled by † (resp. ‡) refer to the method of [5] combined with the polynomial multiplication of [9] (resp. [11]).
In a recent paper, Mullin and Mahalanobis [8] pointed out that there are links between the Dickson basis and the normal basis. In practice, a Dickson basis is interesting when no optimal normal basis exists for the considered field. This is the case for the NIST-recommended binary fields $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{283}}$. Using Table 1, we observe that these NIST fields can be constructed with Dickson trinomials, and thus we obtain a subquadratic multiplier in each of these cases.

In this section, we present sequential multipliers. Each of these multipliers takes $O(n)$ clock cycles but has a space complexity of $O(n)$.
Using Irreducible Dickson Binomials

Multiplier with Bit-Serial Output
In the sequel, we denote the entry at $(i, j)$ of the Toeplitz and the Hankel matrices of (8) by $T_{i,j}$ and $H_{i,j}$, and their $i$th rows by $T_{i,*}$ and $H_{i,*}$, respectively, so that, for $1 \leq i \leq n$,
$$c_i = (T_{i,*} + H_{i,*}) \cdot B. \quad (11)$$
We remark that
1. $T_{n,*}$ consists of the coordinates of input $A$ rotated left by one position, i.e., $T_{n,*} = [a_{n-1}\ a_{n-2}\ \cdots\ a_1\ a_n]$. On the other hand, $H_{n,*}$ is the all-zero row vector and $H_{n-1,*} = [0\ a_{n-1}\ a_{n-2}\ \cdots\ a_2\ a_1]$.
2. Given $T_{i,*}$ and $H_{i,n-1}$, we can express $T_{i-1,*}$ for $1 < i \leq n$ as $T_{i-1,*} = [T_{i,2}\ \ T_{i,3}\ \cdots\ T_{i,n}\ \ T_{i,1} + H_{i,n-1}]$. Furthermore, given the row $H_{i,*}$ and the entry $T_{i+1,1}$, we can express $H_{i-1,*}$ for $1 < i \leq n-1$ as follows

The following diagram (Fig. 1) corresponds to a sequential structure that realizes the multiplication $C = A \times B$ in accordance with (11). In the initial clock cycle, the left-side register (LR) in the diagram is loaded with $T_{n,*}$ and the right-side register (RR) with $H_{n,*}$. In this cycle, rows $T_{n,*}$ and $H_{n,*}$ are added and an inner product is performed to yield $c_n = (T_{n,*} + H_{n,*}) \cdot B$. Also, in this cycle, the output of the MUX is $a_1$ (in other cycles, the MUX output is the second-rightmost bit of RR). In the next cycle, RR is loaded with $H_{n-1,*}$ and LR is shifted left to generate $T_{n-1,*}$, eventually yielding $c_{n-1}$.
For each of the following $n - 2$ clock cycles, LR is shifted left, RR is shifted right, their contents are bitwise added, and an inner product is performed to produce one coordinate of $C$. The space and time complexities of the architecture of Fig. 1 are given in Table 4.
Sequential Multiplier with Bit-Parallel Output
Referring to (8), we denote the columns of the Toeplitz matrix as $T_{*,i}$ for $1 \leq i \leq n$ and those of the essentially Hankel matrix as $H_{*,i}$ for $1 \leq i \leq n$. Thus, we can write $A \times B = [T_{*,1} + H_{*,1}\ \ T_{*,2} + H_{*,2}\ \cdots\ T_{*,n} + H_{*,n}] \cdot B$, i.e.,
$$C = \sum_{i=1}^{n} b_i \, (T_{*,i} + H_{*,i}). \quad (12)$$
We remark that

2. Given the column $T_{*,i}$ and the entry $H_{1,i-1}$, we can express $T_{*,i+1}$ as $T_{*,i+1} = [T_{n,i} + H_{1,i-1},\ T_{1,i},\ \ldots,\ T_{n-1,i}]^t$, $1 \leq i \leq n$, where $H_{1,0}$ is assumed to be $a_1$. Additionally, given the column $H_{*,i}$ and the entry $T_{n,i}$, we can express $H_{*,i+1}$ as $H_{*,i+1} = [H_{2,i},\ H_{3,i},\ \ldots,\ H_{n-1,i},\ T_{n,i},\ 0]^t$, $1 \leq i \leq n$.

In the following diagram (Fig. 2), the column vectors $T_{*,1}$ and $H_{*,1}$ are initially loaded into the top register (TR) and the bottom register (BR), respectively. The one-bit feedback cell F is initialized with $H_{1,0} = a_1$. If TR is shifted downward and BR upward with the feedback connections as shown in the diagram, the new contents of TR and BR will be $T_{*,2}$ and $H_{*,2}$, respectively. Note that BR is an $(n-1)$-bit shift register, since $H_{n,i} = 0$ for $1 \leq i \leq n$. With additional shifts on TR and BR, the remaining columns of the Toeplitz and the essentially Hankel matrices are generated.
Each corresponding pair of columns (i.e., $T_{*,i}$ and $H_{*,i}$) is added and the resulting column is multiplied by $b_i$ (in the diagram, this is shown using an array of XOR and AND gates).
The weighted columns are accumulated in accordance with (12) to produce the desired output $C$ in a total of $n$ clock cycles. The delay and the space complexity of this architecture are given in Table 4.
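The difference between the two output formats can be sketched in software. In the sketch below, the multiplication matrix $M_A$ for $P = \beta_n + 1$ is built explicitly, column by column, from the product rule $\beta_i \beta_j = \beta_{i+j} + \beta_{|i-j|}$ and the reduction $\beta_{n+i} = \beta_i + \beta_{n-i}$, rather than generated by shift registers as in the hardware. The bit-serial schedule then produces one inner product per loop iteration (one clock cycle), while the bit-parallel schedule accumulates one weighted column per iteration. All function names are hypothetical.

```python
def mult_matrix_binomial(a):
    """Rows of M_A; column j holds the coordinates of A * beta_j mod beta_n + 1."""
    n = len(a)
    rows = [[0] * n for _ in range(n)]
    for j in range(1, n + 1):
        c = [0] * (2 * n + 1)  # c[m] is the coefficient of beta_m
        for i in range(1, n + 1):
            if a[i - 1]:
                c[i + j] ^= 1
                c[abs(i - j)] ^= 1  # c[0] belongs to beta_0 = 0, never read
        for m in range(n + 1, 2 * n + 1):  # beta_{n+i} = beta_i + beta_{n-i}
            if c[m]:
                c[m - n] ^= 1
                c[2 * n - m] ^= 1
        for i in range(1, n + 1):
            rows[i - 1][j - 1] = c[i]
    return rows


def bit_serial(rows, b):
    """One output coordinate per cycle, starting with c_n."""
    n = len(b)
    c = [0] * n
    for i in range(n - 1, -1, -1):  # one clock cycle per output bit
        acc = 0
        for j in range(n):          # inner product of row i with B
            acc ^= rows[i][j] & b[j]
        c[i] = acc
    return c


def bit_parallel(rows, b):
    """All output coordinates available together after n cycles."""
    n = len(b)
    acc = [0] * n
    for j in range(n):              # one clock cycle per input bit b_j
        if b[j]:
            for i in range(n):      # accumulate column j weighted by b_j
                acc[i] ^= rows[i][j]
    return acc


a = [1, 0, 1, 1, 0]
b = [0, 1, 1, 0, 1]
M = mult_matrix_binomial(a)
print(M[-1])                                   # last row of M_A
print(bit_serial(M, b) == bit_parallel(M, b))
```

Both schedules take $n$ iterations; they differ only in whether each iteration finishes one coordinate of $C$ or updates all of them.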
Using Irreducible Dickson Trinomials
From (10) of Section 3.2, the coefficients $c_1, c_2, \ldots, c_n$ are given by

Note that $c_{n+1}, c_{n+2}, \ldots, c_{2n}$ can be reduced as explained at the end of Section 3.2.
Below, we will first present a hardware structure to generate $c_1, c_2, \ldots, c_n$ in accordance with (13). Then, we will discuss how to use part of the above hardware to generate $c_{n+1}, c_{n+2}, \ldots, c_{2n}$. In practice, one can first generate $c_{n+1}, c_{n+2}, \ldots, c_{2n}$. While these $n$ bits are reduced, one can generate $c_1, c_2, \ldots, c_n$. This overlap of operations will effectively eliminate/hide the extra time for the reduction of $c_{n+1}, \ldots, c_{2n}$.
Sequential Multiplier with Bit-Serial Output
We denote the rows of the Toeplitz and the Hankel matrices of (13) as $T_{i,*}$ and $H_{i,*}$, respectively. For $1 \leq i \leq n$, we can then write the following

where $H_{0,1}$ is assumed to be $a_1$ and

In Fig. 3 below, registers RR and LR are initialized with $T_{1,*}$ and $H_{1,*}$, respectively. The feedback cell F is initialized with $a_1 = H_{0,1}$. Then, when a shift is applied to these registers together, the second rows of the Toeplitz and the Hankel matrices of (13) are formed in RR and LR, respectively. This happens because the shift and the feedback connection shown in Fig. 3 essentially realize (15). The remaining rows of the two matrices are formed pair by pair with successive shifts.
Note that LR is $n - 1$ bits long, since the rightmost bit of each row of the Hankel matrix is zero. The upper part of Fig. 3 is similar to that of Fig. 1; it adds the corresponding rows of the Toeplitz and the Hankel matrices and then performs inner product operations to yield $c_1, c_2, \ldots, c_n$.
To generate $c_{n+1}, c_{n+2}, \ldots, c_{2n}$ using the structure in Fig. 3, we initialize RR with $[a_n, a_{n-1}, \ldots, a_1]$, which is the first row of the upper triangular Toeplitz matrix of (14). Register LR and cell F are initialized with all zeros. Then, with successive shifts, RR will contain the remaining rows of the Toeplitz matrix and LR will have all zeros. This will result in $c_{n+1}, c_{n+2}, \ldots, c_{2n}$ at the output of Fig. 3.
The time and the space complexities of the structure of Fig. 3 are given in Table 4. These exclude the cost associated with the reduction of $c_{n+1}, c_{n+2}, \ldots, c_{2n}$.
Bit-Parallel Output
Let $T$ be the Toeplitz matrix in (13) and let $H$ be the Hankel matrix in (13). $T_{*,i}$ and $H_{*,i}$ denote the $i$th columns of $T$ and $H$, respectively.
In order to generate $T_{*,i}$ and $H_{*,i}$, we note that in (13) $T_{*,i} = T_{i,*}$ and $H_{*,i} = H_{i,*}$. In other words, the $i$th column is the same as the $i$th row for each of the matrices. Thus, the columns can be generated using the same system of feedback registers as shown in Fig. 3 earlier.
To obtain $c_1, c_2, \ldots, c_n$ in bit-parallel fashion, the inner product unit of Fig. 3 can be replaced with an array of weighting AND gates and accumulators (as used in Fig. 2). The complete diagram is shown in Fig. 4, and its space and time complexities are given in Table 4.
The coefficients $c_{n+1}, c_{n+2}, \ldots, c_{2n}$ from (14) will be computed with the same hardware of Fig. 2. Specifically, in Fig. 4, RR is initialized with the first column of the above matrix. The accumulators, LR, and F are all initialized to zero. Then, in $n$ clock cycles with weighting inputs $b_n, b_{n-1}, \ldots, b_1$, the accumulators will hold $c_{2n}, c_{2n-1}, \ldots, c_{n+1}$.
Complexity and Comparison
In Table 4, we report the resulting complexities of the different sequential multipliers based on the Dickson basis representation. For the purpose of comparison, we also give the complexity of the method of [4] using ONBs of type I and II. We remark that when no ONB is available, a Dickson binomial seems to be the best choice, since a Dickson trinomial requires an increased number of clock cycles.
CONCLUSION
In this paper, we have presented new parallel multipliers based on the Dickson basis representation of binary fields. The multiplier for an irreducible Dickson binomial has a complexity similar to that of the subquadratic multiplier for ONB II of [4]. For an irreducible Dickson trinomial, the multiplier has slightly higher space complexity, but can still be used for fields of degree several hundred (for example, those used in today's elliptic curve cryptographic systems).
In this paper, we have also presented sequential multipliers using the above-mentioned Dickson representation. The sequential multipliers have a space complexity of $O(n)$. We have considered both bit-serial and bit-parallel output formats for the sequential multipliers. Compared to the sequential multipliers with bit-parallel output format presented in [1] and [8], the sequential multipliers presented here with the same output format reduce the number of XOR and AND gates by a factor of two or more, while keeping the number of flip-flops and clock cycles about the same.
ACKNOWLEDGMENTS
A preliminary version of this work was presented at the WAIFI 2008 conference [6] . This work was supported in part by an NSERC research grant awarded to Dr. Anwar Hasan.
