Abstract-Two digit-level finite field multipliers in $F_{2^m}$ using redundant representation are presented. Embedding $F_{2^m}$ in the cyclotomic field $F_2^{(n)}$ causes a certain amount of redundancy, and consequently field multiplication using redundant representation requires more hardware resources. Based on a specific feature of redundant representation in a class of finite fields, two new multiplication algorithms along with their pertaining architectures are proposed to alleviate this problem. Considering area-delay product as a measure of evaluation, it is shown that both proposed architectures considerably outperform existing digit-level multipliers using the same basis. It is also shown that, for a subset of the fields, the proposed multipliers achieve better area-delay complexity than several recently proposed optimal normal basis multipliers. The main characteristics of the post-place-and-route application-specific integrated circuit implementation of the proposed multipliers for three practical digit sizes are also reported.
Index Terms-Digit-level architecture, finite field arithmetic, multiplication algorithm, redundant representation.
I. INTRODUCTION

Finite field computation has recently gained growing attention due to its wide range of applications in coding theory, error control coding, and especially in cryptography, where ElGamal [1] and elliptic curve cryptography (ECC) [2], two of the three well-known cryptosystems, are based on finite field arithmetic [3]. Finite field computation is performed using arithmetic operations in the underlying finite field. Among the basic field operations, multiplication plays a fundamental role, as more complicated operations, namely field exponentiation and field inversion, can be carried out by consecutive use of field multiplication [2], [4], [5].
Similar to linear algebra, the concept of representation bases is also used in finite field arithmetic to represent field elements. The choice of representation system, which is mainly affected by the hardware in use and the requirements of the cryptosystem, has a great impact on computational performance. A number of representation systems for binary extension fields have been proposed in the literature, such as polynomial basis [6], normal basis (NB) [7], redundant basis (RB) [8], and dual basis [9]. In both NB and redundant representation, squaring can be performed by applying a simple permutation to the coordinates. This makes them more efficient for hardware implementations of cryptographic algorithms that use frequent squaring or exponentiation, such as point addition/doubling in ECC. Moreover, redundant representation is of special interest due to its unique feature of accommodating ring-type operations. This not only offers an almost cost-free squaring operation but also eliminates the need for modular reduction in multiplication.
The idea of embedding a field in a larger ring was first put forward by Gao et al. [10], [11] for performing fast multiplication using NB. Later on, Wu et al. [8] introduced redundant representation, also known as RB, and finite field multiplication using this representation system. In efforts to increase the multiplication speed or to reduce the hardware complexity, several architectures have been proposed afterward, such as the comb-style architecture [12] and linear feedback shift register (LFSR)-based architectures [13], [14]. More recently, Xie et al. [15] proposed a recursive decomposition scheme for digit-level serial/parallel structures to achieve lower area-time-power complexities.
Regardless of the structure of the architecture in use, the main drawback of redundant representation is that it contains a certain amount of redundancy, as embedding the field $F_{2^m}$ of size $m$ in the cyclotomic field $F_2^{(n)}$ of size $n$ ($n > m$) is not a one-to-one mapping. As a result, redundant representation requires more bits to represent a field element, where the number of representation bits depends on the size of the cyclotomic field in which the underlying field is embedded. In this paper, our focus is on digit-level architectures for RB multipliers. We show that a specific feature of redundant representation can be used for a class of finite fields to significantly reduce the architectural complexity of RB multipliers and thereby compensate for the inherent redundancy of this representation system. Two variants of the multiplication algorithm, along with their corresponding architectures, are presented. It is shown that the proposed architectures have highly regular structures and are thus suitable for hardware implementation. Comparisons with existing digit-level RB architectures reveal that both proposed architectures outperform other RB architectures when area-delay product is taken as the measure of performance. A comparison between the performance of the proposed multipliers and that of several optimal NB (ONB) multipliers is also given. Finally, hardware realizations of the proposed multipliers for three practical digit sizes are presented.
The organization of this paper is as follows. Section II contains a brief review of RB and finite field multiplication using this representation system. In Section III, two new digit-level algorithms and architectures for RB multiplication are presented. The architectural complexity and the performance comparison are discussed in Section IV, followed by the details of VLSI implementations of multipliers for a practical field size in Section V. Concluding remarks are given in Section VI.
II. BRIEF REVIEW OF REDUNDANT BASIS
Let $F_2$ denote a field of characteristic 2 and let $x^n - 1$ be a polynomial of degree $n$ over $F_2$, where $n$ is a positive integer. Then, the splitting field of $x^n - 1$, denoted by $F_2^{(n)}$, is called the $n$th cyclotomic field over $F_2$. Let $\beta$ be a primitive $n$th root of unity in an extension field of $F_2$. Then, $F_2^{(n)}$ is generated by $\beta$ over $F_2$ and elements of $F_2^{(n)}$ can be represented in the form of

$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in F_2. \qquad (1)$$

This representation of $A$ is not unique, i.e., for a given element of $F_2^{(n)}$ represented by the $n$-tuple $(a_0, a_1, \ldots, a_{n-1})$, $a_i \in F_2$, there exist different tuples representing the same element. For instance, each element in $F_2^{(n)}$ and its ones' complement represent the same field element, as explained in Lemma 1.
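Lemma 1: Assume that field element $E \in F_{2^m}$ is represented by $(e_0, e_1, \ldots, e_{n-1})$, $e_i \in F_2$, with respect to $I = \{1, \beta, \ldots, \beta^{n-1}\}$. Then

$$E = (e_0, e_1, \ldots, e_{n-1}) = (\bar{e}_0, \bar{e}_1, \ldots, \bar{e}_{n-1}), \quad \bar{e}_i = e_i + 1. \qquad (2)$$

Proof: Since the powers of a primitive $n$th root of unity, i.e., $\{\beta^i, 0 \le i \le n-1\}$, form a cyclic group of order $n$, we have $\beta^n = 1$. Because $\beta \ne 1$ and $\beta^n - 1 = (\beta - 1)(1 + \beta + \beta^2 + \cdots + \beta^{n-1})$, it follows that $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$. Adding this zero sum to $E$ complements every coordinate without changing the element.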
An interesting example is the identity element of the field with respect to the operation "+," namely, "0," which can be represented by both $n$-tuples $(0, 0, \ldots, 0)$ and $(1, 1, \ldots, 1)$. Due to the redundancy in this representation system, (1) is called a redundant representation of $A$, and $I = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}$ is referred to as an RB for any subfield of $F_2^{(n)}$ [8].
In order for a field of characteristic two, $F_{2^m}$, to be embedded in $F_2^{(n)}$, the following relationship between $n$ and $m$ should be satisfied.
Theorem 1: Let $n$ be an odd positive integer greater than $m$. Then, $F_{2^m}$ is contained in $F_2^{(n)}$ if and only if $m$ divides the multiplicative order of 2 modulo $n$.
For more information about conversion between the normal representation system and redundant representation, the reader is referred to [8] and [17].
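The embedding condition can be explored numerically. The following Python sketch assumes the standard fact that the splitting field of $x^n - 1$ over $F_2$ is $F_{2^k}$, where $k$ is the multiplicative order of 2 modulo $n$, so that $F_{2^m}$ is contained in $F_2^{(n)}$ exactly when $m$ divides $k$. The function names are illustrative and not taken from the paper.

```python
def multiplicative_order(base: int, modulus: int) -> int:
    """Smallest k >= 1 with base**k == 1 (mod modulus); assumes gcd(base, modulus) == 1."""
    k, value = 1, base % modulus
    while value != 1:
        value = (value * base) % modulus
        k += 1
    return k


def smallest_cyclotomic_order(m: int) -> int:
    """Smallest odd n > m such that m divides the order of 2 modulo n."""
    n = m + 1 if m % 2 == 0 else m + 2  # first odd integer greater than m
    while multiplicative_order(2, n) % m != 0:
        n += 2
    return n


for m in (5, 233):
    n = smallest_cyclotomic_order(m)
    # (n - 1) / m is the parameter T when n has the form T*m + 1
    print(f"m = {m}: n = {n}, (n - 1) / m = {(n - 1) / m}")
```

For m = 233 the search stops at n = 467, which agrees with the field used in the case study later in the paper.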
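B. Multiplication Using Redundant Representation in $F_{2^m}$
One of the unique advantages of using RB in finite field arithmetic is that it eliminates the need for modular reduction in the multiplication operation. This useful feature stems from the fact that the basis elements $1, \beta, \beta^2, \ldots, \beta^{n-1}$ form a cyclic group of order $n$. As a direct result,

$$\beta^i \cdot \beta^j = \beta^{(i+j) \bmod n}. \qquad (3)$$

Let field elements $A$ and $B \in F_{2^m}$ be expressed with respect to the RB $I = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}$ as

$$A = \sum_{i=0}^{n-1} a_i \beta^i, \qquad B = \sum_{i=0}^{n-1} b_i \beta^i \qquad (4)$$

respectively, where $a_i, b_i \in F_2$. Note that $n \ge m + 1$ and $\beta^n = 1$. Then $C$, the product of $A$ and $B$, can be given by

$$C = A \cdot B = \sum_{j=0}^{n-1} c_j \beta^j$$

where

$$c_j = \sum_{i=0}^{n-1} a_i \, b_{(j-i) \bmod n}, \quad j = 0, 1, \ldots, n-1. \qquad (5)$$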
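Since $\beta^n = 1$, the product coordinates are a cyclic convolution of the operand coordinates over $F_2$. A minimal Python sketch of this computation follows; the names `rb_multiply` and `complement` are illustrative. It also shows the effect of Lemma 1: complementing one operand either leaves the product tuple unchanged or complements it, so the represented field element is the same.

```python
def rb_multiply(a, b):
    """Product coordinates c_j = sum_i a_i * b_((j - i) mod n) over F_2."""
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length n")
    c = [0] * n
    for j in range(n):
        for i in range(n):
            c[j] ^= a[i] & b[(j - i) % n]  # AND = multiply, XOR = add in F_2
    return c


def complement(a):
    """Ones' complement of a coordinate tuple (same field element by Lemma 1)."""
    return [x ^ 1 for x in a]


# beta * beta^4 = beta^5 = 1 when n = 5
beta = [0, 1, 0, 0, 0]
beta4 = [0, 0, 0, 0, 1]
print(rb_multiply(beta, beta4))

a = [1, 0, 1, 1, 0]
b = [0, 1, 1, 0, 1]
c = rb_multiply(a, b)
c_alt = rb_multiply(complement(a), b)
print(c_alt == c or c_alt == complement(c))
```

No reduction step appears anywhere in the loop, which is the property the paragraph above refers to.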
III. PROPOSED DIGIT-LEVEL SIPO MULTIPLIERS USING REDUNDANT REPRESENTATION
In this section, we first present a new algorithm for RB multiplication. Based on this algorithm, we propose two new optimized digit-level serial-in parallel-out (SIPO) architectures. These architectures are adopted for a class of finite fields in which $n$ can be expressed as $n = Tm + 1$, where $T \ge 2$ is an even number. As will be seen in the remainder of this section, this condition enables us to devise an architecture that significantly reduces the complexity of the multiplier. Corollary 1, which corresponds to [8, Lemma 2], describes a specific feature that results from the abovementioned condition.
Corollary 1: Let $A \in F_{2^m}$ and assume that its redundant representation is given by $(a_0, a_1, \ldots, a_{n-1})$ with respect to RB $I$ over $F_{2^m}$. If $n$ can be expressed as $Tm + 1$, where $T \ge 2$ is even, then

$$a_i = a_{n-i}, \quad i = 1, 2, \ldots, (n-1)/2. \qquad (6)$$

For ECC applications, the field size $m$ is usually selected within the range of 150 to 600 according to the security standards [19]. It can be shown, via [18], that Corollary 1 covers over 60% of all finite fields within the practical range.
As an example, for the first 100 fields in the aforementioned range, Table I lists the orders of the smallest cyclotomic fields that contain $F_{2^m}$, restricted to the fields for which the relationship between $n$ and $m$ satisfies the requirement of Corollary 1.
A. Proposed Digit-Level RB Multiplication Algorithm, Type-a
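Assume that $w$, $1 \le w \le (n-1)/2$, denotes the digit size of the multiplier. Excluding $a_0$ from the coordinate set, let the rest of the coordinates be equally divided into $2w$ parts, each $d$ bits long, where $d = \lceil (n-1)/(2w) \rceil$, as

$$A = (a_0, A_0, A_1, \ldots, A_{2w-1}), \quad A_k = (a_{kd+1}, a_{kd+2}, \ldots, a_{kd+d}).$$

Note that positions outside the coordinate set (indices greater than $n-1$) are padded with zeros. Replacing subscript $i$ of $a_i$ in (5) with $kd + \ell$ gives

$$c_j = a_0 b_j + \sum_{k=0}^{2w-1} \sum_{\ell=1}^{d} a_{kd+\ell} \, b_{(j-kd-\ell) \bmod n}. \qquad (7)$$

Based on the definition of $d$, we have

$$n - 1 \le 2wd < n - 1 + 2w.$$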
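As a result, the upper bound of the subscript $kd + \ell$ in the above-mentioned double summation falls within the range of $n - 1$ to $n - 1 + 2w$, and thus all the nonzero terms of the product coefficient $c_j$ are included in (7). Under the required conditions of Corollary 1, the last $Tm/2$ coordinates of a field element are mirror reflections of the first $Tm/2$ coordinates. A new function $\phi(i)$ can be utilized to map the set of integers used in the subscripts of the coordinates to the set $\{0, 1, \ldots, (n-1)/2\}$, as follows:

$$\phi(i) = \begin{cases} i \bmod n, & 0 \le i \bmod n \le (n-1)/2 \\ n - (i \bmod n), & \text{otherwise.} \end{cases} \qquad (8)$$

Taking into account (8) and (9), (7) can be rewritten as

$$c_j = a_0 b_j + \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{kd+\ell} \left( b_{\phi(j-kd-\ell)} + b_{\phi(j+kd+\ell)} \right) \qquad (12)$$

for $j = 0, 1, \ldots, n-1$. In the last term, $a$ is replaced with $\hat{a}$ because the coordinates $a_i$ for $i = (n+1)/2$ to $n-1$ are equal to their corresponding $\hat{a}_{n-i}$, where $\hat{a}_i = a_i$ for $1 \le i \le (n-1)/2$ and $\hat{a}_i = 0$ for the zero-padded positions $i > (n-1)/2$.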
The complexity of the multiplication operation carried out using (12) can be further reduced by utilizing Lemma 1. Taking into consideration (2), the term $a_0 b_j$ can be removed from (12). If coordinate $a_0$ is equal to zero, then the original representation of $A$ is used unchanged. Otherwise, in the case $a_0 = 1$, the complement of $A$ can be used instead of $A$ without any effect on the multiplication operation. As a result, the product coefficient $c_j$ can be obtained as

$$c_j = \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{kd+\ell} \left( b_{\phi(j-kd-\ell)} + b_{\phi(j+kd+\ell)} \right). \qquad (13)$$

Lemma 2 shows that if $a_0 = 0$, then $c_0$ is equal to zero too.
Lemma 2: If $a_0 = 0$, then $c_0 = 0$.
Proof: Substituting zero for $j$ in (13) and using $\phi(-i) = \phi(i)$ from the definition of function $\phi(i)$ in (8), we have

$$c_0 = \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{kd+\ell} \left( b_{\phi(kd+\ell)} + b_{\phi(kd+\ell)} \right) = 0.$$

Using (13), it can be easily proven that $c_{n-j} = c_j$ for $j = 1, 2, \ldots, (n-1)/2$.
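The folded form of the product can be checked exhaustively on a small field. The sketch below is illustrative only: it assumes $m = 5$, $n = 11$ ($T = 2$), takes $\phi(i)$ as the fold of $i \bmod n$ onto $\{0, \ldots, (n-1)/2\}$, and enumerates all symmetric tuples satisfying $a_i = a_{n-i}$. All function names are hypothetical. It compares the folded sum against plain cyclic convolution and checks that $c_0 = 0$ and $c_{n-j} = c_j$.

```python
from itertools import product


def phi(i, n):
    """Fold index i onto {0, ..., (n-1)/2}: phi(i) = phi(-i) = phi(n - i)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def cyclic_product(a, b):
    """Reference product: c_j = sum_i a_i * b_((j - i) mod n) over F_2."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) & 1 for j in range(n)]


def folded_product(a, b):
    """c_j = sum_{i=1}^{(n-1)/2} a_i (b_phi(j-i) + b_phi(j+i)), using a_0 = 0."""
    n = len(a)
    h = (n - 1) // 2
    if a[0] == 1:                      # Lemma 1: switch to the complement of A
        a = [x ^ 1 for x in a]
    return [sum(a[i] & (b[phi(j - i, n)] ^ b[phi(j + i, n)])
                for i in range(1, h + 1)) & 1 for j in range(n)]


def symmetric_tuples(n):
    """All n-tuples with a_i = a_(n-i) for i = 1..(n-1)/2."""
    h = (n - 1) // 2
    for a0 in (0, 1):
        for half in product((0, 1), repeat=h):
            yield [a0, *half, *reversed(half)]


n = 11
ok = True
for a in symmetric_tuples(n):
    for b in symmetric_tuples(n):
        c = folded_product(a, b)
        ref = cyclic_product(a, b)
        same_element = c == ref or c == [x ^ 1 for x in ref]
        mirrored = all(c[j] == c[n - j] for j in range(1, n))
        ok = ok and same_element and c[0] == 0 and mirrored
print(ok)
```

The comparison allows for a complemented tuple because, by Lemma 1, a tuple and its ones' complement represent the same field element.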
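In order to obtain a digit-level multiplication algorithm, first decompose (13) into two double summations as

$$c_j = \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{kd+\ell} \, b_{\phi(j-kd-\ell)} + \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{kd+\ell} \, b_{\phi(j+kd+\ell)}. \qquad (14)$$

Then define the signals $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$ for $j = 1, 2, \ldots, (n-1)/2$ and $k = 0, 1, \ldots, w-1$, as follows:

$$p^{(\ell)}_{j,k} = p^{(\ell-1)}_{j,k} + \hat{a}_{kd+\ell} \, b_{\phi(j-kd-\ell)}, \qquad q^{(\ell)}_{j,k} = q^{(\ell-1)}_{j,k} + \hat{a}_{kd+\ell} \, b_{\phi(j+kd+\ell)} \qquad (16)$$

with $p^{(0)}_{j,k} = q^{(0)}_{j,k} = 0$. Comparing (16) with (13), it follows that

$$c_j = \sum_{k=0}^{w-1} \left( p^{(d)}_{j,k} + q^{(d)}_{j,k} \right).$$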
If the values of $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$ can be calculated and accumulated for all the values of $j$ and $k$ at each clock cycle, then it takes $d = \lceil (n-1)/(2w) \rceil$ clock cycles to obtain all the product coefficients. Algorithm 1 describes the multiplication process in more detail. To perform arithmetic operations in the binary field $F_2$, one XOR gate and one AND gate are required to realize a bitwise addition and a bitwise multiplication, respectively. In Step 5 of the algorithm, for given $j$ and $k$, an AND gate is used to multiply one bit of each input operand together, and then an XOR gate is required to perform the addition operation. Step 6 can be implemented using one XOR gate and one AND gate in a similar way to Step 5. A pair of flip-flops is also required for given $j$ and $k$ to store the values of the two signals $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$.

Algorithm 1: Digit-Level RB Multiplication Algorithm Where $n$ Can Be Expressed as $n = Tm + 1$, $T \ge 2$ and Even
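The accumulation process can be simulated in software. The following Python sketch is not the authors' Algorithm 1 listing; it is a behavioral model under the assumptions that each accumulator pair is updated as $p \leftarrow p + \hat{a}_{kd+\ell} b_{\phi(j-kd-\ell)}$ and $q \leftarrow q + \hat{a}_{kd+\ell} b_{\phi(j+kd+\ell)}$ once per clock cycle, that both operands satisfy the symmetry of Corollary 1, and that $\phi$ folds indices onto $\{0, \ldots, (n-1)/2\}$. Function and variable names are illustrative.

```python
def phi(i, n):
    """Fold index i onto {0, ..., (n-1)/2}: phi(i) = phi(-i) = phi(n - i)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def cyclic_product(a, b):
    """Reference product: c_j = sum_i a_i * b_((j - i) mod n) over F_2."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) & 1 for j in range(n)]


def digit_level_type_a(a, b, w):
    """Behavioral model of the type-a accumulation for symmetric operands."""
    n = len(a)
    h = (n - 1) // 2
    d = -(-h // w)                       # d = ceil((n - 1) / (2w))
    if a[0] == 1:                        # Lemma 1: use the complement so a_0 = 0
        a = [x ^ 1 for x in a]
    # a_hat is 1-indexed and zero-padded to w*d entries
    a_hat = [0] + list(a[1:h + 1]) + [0] * (w * d - h)
    p = [[0] * w for _ in range(h + 1)]  # accumulators p[j][k]
    q = [[0] * w for _ in range(h + 1)]  # accumulators q[j][k]
    for l in range(1, d + 1):            # one iteration per clock cycle
        for j in range(1, h + 1):        # parallel in hardware
            for k in range(w):           # parallel in hardware
                bit = a_hat[k * d + l]
                p[j][k] ^= bit & b[phi(j - k * d - l, n)]
                q[j][k] ^= bit & b[phi(j + k * d + l, n)]
    c = [0] * n                          # c_0 = 0 by Lemma 2
    for j in range(1, h + 1):            # final XOR network
        for k in range(w):
            c[j] ^= p[j][k] ^ q[j][k]
        c[n - j] = c[j]                  # symmetry c_(n-j) = c_j
    return c


# Symmetric operands for n = 11 (m = 5, T = 2)
a = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
b = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(all(digit_level_type_a(a, b, w) == cyclic_product(a, b) for w in range(1, 6)))
```

The two inner loops over `j` and `k` correspond to hardware that operates in parallel; only the loop over `l` consumes clock cycles.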
B. New Multiplier Architecture, DL-SRB-a
An architecture for the proposed multiplier can be constructed based on the steps described in Algorithm 1 at $\ell = 1$. Fig. 1 shows the proposed architecture, hereafter referred to as the digit-level symmetrical RB type-a multiplier (DL-SRB-a). From top to bottom, the architecture consists of an $n$-bit circular shift register, which should be initialized with the coordinates of operand $B$. This shift register provides inputs to a wire expansion module with $n$ inputs and $w(n-1)$ outputs, followed by $(n-1)/2$ identical modules ($M_1, M_2, \ldots, M_{(n-1)/2}$) shown inside the dashed boxes. At the bottom, there is a network of XOR gates adding the $2w$ outputs of each module $M_j$ together to form the output coordinates.
Each module $M_j$ is made of a layer of $2w$ AND gates receiving the outputs of the wire expansion module as their inputs. As mentioned earlier, input $A$ should be fed into the multiplier in a digit-serial fashion (comb style). According to (13), the multiplication operation is performed using the $\hat{a}_i$ coefficients, which are equal to the $(n-1)/2$ coordinates of $A$ from coordinate number 1 to $(n-1)/2$. We will refer to this set of coordinates of $A$ as $\hat{A}$. Let $\hat{A}$ be divided into $w$ parts of length $d$ in the same way as was done earlier for $A$, as

$$\hat{A} = (\hat{A}_0, \hat{A}_1, \ldots, \hat{A}_{w-1}), \quad \hat{A}_k = (\hat{a}_{kd+1}, \hat{a}_{kd+2}, \ldots, \hat{a}_{kd+d}).$$
Note that $\hat{A}$ is padded with $wd - (n-1)/2$ zeros in the most significant word. In the first clock cycle, the first bits of every word, i.e., $a_1, a_{d+1}, \ldots, a_{(w-1)d+1}$, form an input set to the multiplier. In the second clock cycle, the inputs are the second bits of every word, $a_2, a_{d+2}, \ldots, a_{(w-1)d+2}$, and so on. For given $j$ and $k$, in each clock cycle, the argument of function $\phi$ in $b_{\phi(j-kd-\ell)}$ decreases by one in Step 5. An $n$-bit circular shift register can be used, as shown in Fig. 2 by R1, to generate the required coefficients in Step 5. This circular shift register should be initially loaded, from left to right, with $b_{n-1}, b_{n-2}, \ldots, b_0$. In contrast, the argument of function $\phi$ in $b_{\phi(j+kd+\ell)}$ in Step 6 increases by one in each clock cycle. In this case, a similar circular shift register, namely R2, with the same initial contents but the opposite shift direction should be utilized to produce the required coefficients.
Lemma 3: Assuming the required conditions of the symmetry property explained in Corollary 1 are satisfied, only one circular shift register of length $n$ suffices to facilitate both the operations of Steps 5 and 6. Let the upper half of register R be initialized with the equivalent coordinates from the lower half of operand $B$ in the way shown in Fig. 2. Since $\phi(i) = \phi(n - i)$, an increase/decrease by one in the argument of function $\phi$ within the range of $(n+1)/2$ to $n$ is equal to a decrease/increase by one within the range of $(n-1)/2$ to 0. Also, since $\phi(-i) = \phi(i)$, an increase/decrease by one in the argument of function $\phi$ within the range of $-(n+1)/2$ to 0 is equal to a decrease/increase by one within the range of 0 to $(n-1)/2$. Consequently, the lower half of register R is used when an initial decrease in the lower half or an initial increase in the upper half of $B$ is required, and the upper half of register R is used when an initial increase in the lower half or an initial decrease in the upper half of $B$ is needed.
Take an example in which $j = 5$ and $k = 0$. At the first clock cycle, the value of function $\phi$ in Step 5 is equal to $\phi(5 - 1) = 4$. It decreases by one in each clock cycle up to the fifth cycle and increases by one at each cycle afterward ($\phi(5 - 2) = 3, \ldots, \phi(0) = 0, \phi(-1) = 1, \ldots$).
The corresponding function value in Step 6 is initially equal to $\phi(5 + 1) = 6$ at the first clock cycle. It increases by one in each clock cycle up to the $((n-1)/2 - 5)$th cycle, and decreases by one at each cycle afterward ($\phi(5 + 2) = 7, \ldots, \phi((n+1)/2) = (n-1)/2, \phi((n+3)/2) = (n-3)/2, \ldots$). As a result, $R_4$ and $R_{n-6}$ should be used to produce $p^{(\ell)}_{5,0}$ and $q^{(\ell)}_{5,0}$, respectively. As can be seen in Fig. 1, the number of AND gates exceeds the number of flip-flops in register R. The role of the wire expansion module with $n$ inputs and $w(n-1)$ outputs is to receive input bits from register R and to deliver them to the AND gates as follows: for given $j$ and $k$, the inputs to $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$ should be connected to $R_{\phi(j-kd-1)}$ and $R_{\phi(j+kd+1)}$, respectively. The wire expansion module therefore only permutes and reorders the input bits and does not contain any logic gates.
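The index sequences in this example can be reproduced with a few lines of Python. The sketch assumes $n = 467$ (the cyclotomic order used for $m = 233$ later in the paper) and the fold $\phi(i) = \min(i \bmod n,\; n - (i \bmod n))$; the variable names are illustrative. It prints the Step 5 indices around the turning point at the fifth cycle and the Step 6 indices around the turning point near cycle $(n-1)/2 - 5 = 228$.

```python
def phi(i, n):
    """Fold index i onto {0, ..., (n-1)/2}: phi(i) = phi(-i) = phi(n - i)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


n = 467   # cyclotomic order for m = 233 (T = 2)
j = 5     # output coordinate, with k = 0

# Step 5 uses b_phi(j - l): the index falls to 0 at cycle 5, then rises again.
step5_indices = [phi(j - l, n) for l in range(1, 9)]

# Step 6 uses b_phi(j + l): the index rises to (n - 1)/2 = 233, then falls.
step6_turnaround = [phi(j + l, n) for l in range(226, 232)]

print(step5_indices)
print(step6_turnaround)
```

Note that the sequences are shown over more cycles than a single multiplication with $k = 0$ would use; the purpose is only to show how the fold turns a monotonic index into a shift that reverses direction.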
Depending on the choices of n and w, the complexity of the multiplier may be reduced one step further. Recall that 
In the rest of this paper, the notation $\tilde{w}$ will be used in place of $w$ to denote the number of parallel branches required in each module $M_j$. $\tilde{w}$ can be defined as follows:
A noteworthy feature of the proposed architecture is that the critical path of the multiplier is independent of the field size ($m$), the degree of the cyclotomic field ($n$), and the digit size ($w$). The length of the critical path, in terms of the number of logic gates, remains constant regardless of the number of flip-flops in register R and the values of $j$ and $k$. As the wire expansion module does not require logic cells, the critical path is composed of one AND gate and one XOR gate. Assuming that $T_A$ and $T_X$ denote the time delays of a two-input AND gate and a two-input XOR gate, respectively, the critical path delay is equal to $T_{cp} = T_A + T_X$. Note that the XOR network shown in Fig. 1 (bottom) is not part of the critical path of the multiplier, as the summation in Step 12 of Algorithm 1 only needs to be performed once at the end of the multiplication operation.
In other words, the proposed architecture can be viewed as a sequential circuit followed by a combinational circuit. In the sequential part (which contains the whole circuit excluding the XOR tree), partial products $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$ are recursively generated and stored in the flip-flops at each clock cycle. Note that during the first $d$ clock cycles, the outputs of the XOR trees need not be stored, as they do not play any role in the computations performed in the sequential circuit. However, the product coordinates are not available immediately after $d$ clock cycles. It takes another time delay of $\lceil \log_2 2w \rceil T_X$, associated with the binary tree of $(2w - 1)$ two-input XOR gates (combinational circuit), before the product coordinates can be read from the output end. To prevent the combinational circuit from becoming the critical path, this step should be performed over multiple cycles. A common solution is to use intermediate flip-flops to break a long path into smaller pieces. However, the use of extra flip-flops can be avoided provided that the inputs of the combinational circuit are kept unchanged so that the combinational circuit has enough time to generate valid outputs. In the proposed architecture, this is done by padding each input sequence $\hat{A}_0, \hat{A}_1, \ldots, \hat{A}_{w-1}$ with $d_{ex}$ zeros, where

$$d_{ex} = \left\lceil \frac{\lceil \log_2 2w \rceil \, T_X}{T_{clock}} \right\rceil \qquad (19)$$

and $T_{clock}$ refers to the clock period. If the clock period is chosen to be equal to the critical path delay to achieve the maximum operating frequency, $T_{clock}$ should be replaced with $T_{cp}$ in (19). Finally, the total number of clock cycles needed for a single multiplication operation is equal to $d + d_{ex}$.
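The latency of the type-a multiplier can be estimated with a short calculation. The sketch below rests on two assumptions that should be checked against the original equations: that the number of extra cycles is $d_{ex} = \lceil \lceil \log_2 2w \rceil T_X / T_{clock} \rceil$, and that delays are given as integers in a common unit. The example values $T_A = 1$ and $T_X = 2$ follow the delay assumption stated in the comparison section; the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def cycles_type_a(n, w, t_xor, t_clock):
    """Return (d, d_ex, d + d_ex) for a type-a multiplication.

    n        cyclotomic order (odd)
    w        digit size
    t_xor    delay of a two-input XOR gate (integer units)
    t_clock  clock period (same integer units)
    """
    d = -(-(n - 1) // (2 * w))               # d = ceil((n - 1) / (2w))
    tree_depth = (2 * w - 1).bit_length()    # ceil(log2(2w)) for w >= 1
    d_ex = ceil(Fraction(tree_depth * t_xor, t_clock))
    return d, d_ex, d + d_ex


T_A, T_X = 1, 2                              # assumed relative gate delays
T_CP = T_A + T_X                             # type-a critical path delay
for w in (8, 16, 32):
    print(w, cycles_type_a(467, w, T_X, T_CP))
```

Exact rational arithmetic is used for the ratio so that the outer ceiling is not affected by floating-point rounding.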
C. New Multiplier Architecture, DL-SRB-b
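At the expense of a slight increase in the critical path delay, the number of logic gates and flip-flops used in the architecture of Fig. 1 can be significantly reduced. Starting from the closed formula of (13), instead of the decomposition shown in (14), define two intermediate signals: the pre-added operand bit $b_{\phi(j-kd-\ell)} + b_{\phi(j+kd+\ell)}$ and the accumulated sum $s^{(\ell)}_{j,k}$, with $s^{(0)}_{j,k} = 0$ and

$$s^{(\ell)}_{j,k} = s^{(\ell-1)}_{j,k} + \hat{a}_{kd+\ell} \left( b_{\phi(j-kd-\ell)} + b_{\phi(j+kd+\ell)} \right). \qquad (20)$$

Comparing (20) with (13), product coordinates $c_j$ can be expressed as

$$c_j = \sum_{k=0}^{w-1} s^{(d)}_{j,k}.$$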
The new algorithm can be obtained by replacing Steps 5-12 in Algorithm 1 with the following steps:
 7: end for
 8: end for
 9: end for
10: for all values of j = 1, 2, ..., (n − 1)/2, compute in parallel
11: for all values of k = 0, 1, ..., w − 1, compute in serial
12:
Note that in each clock cycle, Steps 5 and 6 should be computed in serial. Fig. 3 shows the modified architecture, referred to as DL-SRB-b. As can be seen from Fig. 3, the new architecture is similar to the previously proposed architecture, DL-SRB-a, in the sense that it utilizes the same wire expansion module and the same $n$-bit circular shift register to store operand $B$. Operand $A$ is also fed into the multiplier in the same way as earlier. The main difference between the two architectures originates from the difference between the two modules shown inside the dotted boxes in Figs. 1 and 3. In the type-a architecture, one bit of operand $B$ is multiplied by one bit of operand $A$, and the resulting partial product is stored separately in its respective accumulation unit. In contrast, in the type-b architecture, two bits of operand $B$ are first added together before they enter the AND gate, and the result is fed into the accumulation unit. As a result, the critical path delay of the new architecture changes from $T_A + T_X$ to $T_A + 2T_X$. In the latter architecture, the numbers of accumulation units and AND gates are each reduced by half, from $w(n-1)$ to $w(n-1)/2$. Since half of the addition operations are performed before the accumulation units, the size of the binary XOR tree is also reduced from $2w - 1$ to $w - 1$.
Similar to DL-SRB-a, the multiplication delay of DL-SRB-b is composed of two parts: $d$ and $d_{ex}$. The first part corresponds to Steps 5 and 6 of the algorithm, carried out by modules $M_j$ during $d$ clock cycles. The second part corresponds to the time delay of a $w$-input XOR gate or a binary tree of $(w - 1)$ two-input XOR gates. Assuming that a binary tree of two-input XOR gates is used, the total number of clock cycles required to complete a single multiplication can be calculated as

$$d + d_{ex} = \left\lceil \frac{n-1}{2w} \right\rceil + \left\lceil \frac{\lceil \log_2 w \rceil \, T_X}{T_{clock}} \right\rceil.$$
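The type-b data path can be modeled in the same behavioral style. The Python sketch below is an illustration rather than the authors' listing: it assumes a single accumulator per branch updated as $s \leftarrow s + \hat{a}_{kd+\ell}(b_{\phi(j-kd-\ell)} + b_{\phi(j+kd+\ell)})$, symmetric operands as in Corollary 1, and the fold $\phi$ onto $\{0, \ldots, (n-1)/2\}$. All names are hypothetical.

```python
def phi(i, n):
    """Fold index i onto {0, ..., (n-1)/2}: phi(i) = phi(-i) = phi(n - i)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def cyclic_product(a, b):
    """Reference product: c_j = sum_i a_i * b_((j - i) mod n) over F_2."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) & 1 for j in range(n)]


def digit_level_type_b(a, b, w):
    """Behavioral model of the type-b accumulation for symmetric operands."""
    n = len(a)
    h = (n - 1) // 2
    d = -(-h // w)                       # d = ceil((n - 1) / (2w))
    if a[0] == 1:                        # Lemma 1: use the complement so a_0 = 0
        a = [x ^ 1 for x in a]
    a_hat = [0] + list(a[1:h + 1]) + [0] * (w * d - h)   # 1-indexed, padded
    s = [[0] * w for _ in range(h + 1)]  # one accumulator per (j, k)
    for l in range(1, d + 1):            # one iteration per clock cycle
        for j in range(1, h + 1):
            for k in range(w):
                # pre-add two bits of B, then AND with one bit of A
                pre_added = b[phi(j - k * d - l, n)] ^ b[phi(j + k * d + l, n)]
                s[j][k] ^= a_hat[k * d + l] & pre_added
    c = [0] * n
    for j in range(1, h + 1):            # final XOR tree over w branches
        for k in range(w):
            c[j] ^= s[j][k]
        c[n - j] = c[j]
    return c


a = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]   # symmetric, n = 11
b = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]   # symmetric, n = 11
print(all(digit_level_type_b(a, b, w) == cyclic_product(a, b) for w in range(1, 6)))
```

Compared with a model that keeps separate accumulators for the two sums, this one stores half as many bits per module, at the price of one extra XOR in front of each AND.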
IV. ARCHITECTURAL COMPLEXITIES AND COMPARISON
The area complexities of the proposed architectures can be readily calculated from Figs. 1 and 3. In the case of the type-a structure shown in Fig. 1, the circular shift register contains $n$ flip-flops to store the coordinates of operand $B$. As described earlier, the structure utilizes $(n-1)/2$ identical modules $M_j$, $j = 1, 2, \ldots, (n-1)/2$, each of which employs $2w$ flip-flops to store the values of signals $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$ at each clock cycle for all the values of $k$ from 0 to $w - 1$. In total, the number of required flip-flops comes to $w(n-1) + n$. There are also $w(n-1)$ two-input AND gates, each followed immediately by a two-input XOR gate. Assuming that the XOR network at the bottom of Fig. 1 is made only of two-input XOR gates, the architecture requires $(4w - 1)(n-1)/2$ XOR gates altogether.
In the case of the type-b architecture, each module $M_j$ contains only $w$ parallel branches instead of the $2w$ in its type-a counterpart. As a result, the number of AND gates and the total number of flip-flops decrease to $w(n-1)/2$ and $w(n-1)/2 + n$, respectively. XOR gates appear in two separate layers in the structure of Fig. 3. The first layer consists of $w(n-1)/2$ gates and the layer at the bottommost part of the structure requires $(w - 1)(n-1)/2$ gates. In sum, a total of $(2w - 1)(n-1)/2$ XOR gates is used in the structure of the type-b multiplier.
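The gate-count expressions stated above can be evaluated directly. The following Python sketch simply tabulates those expressions for a given $n$ and $w$; it uses $w$ rather than the reduced branch count $\tilde{w}$, and the function name and dictionary keys are illustrative.

```python
def gate_counts(n, w):
    """Gate and flip-flop counts from the closed-form expressions in the text."""
    h = (n - 1) // 2                      # number of modules M_j
    type_a = {
        "AND": 2 * w * h,                 # w(n - 1)
        "XOR": (4 * w - 1) * h,           # (4w - 1)(n - 1)/2
        "FF": 2 * w * h + n,              # w(n - 1) + n
    }
    type_b = {
        "AND": w * h,                     # w(n - 1)/2
        "XOR": (2 * w - 1) * h,           # (2w - 1)(n - 1)/2
        "FF": w * h + n,                  # w(n - 1)/2 + n
    }
    return type_a, type_b


for w in (8, 16, 32):
    a_counts, b_counts = gate_counts(467, w)
    print(w, a_counts, b_counts)
```

For n = 467 and w = 8 this gives 3728 AND gates, 7223 XOR gates, and 4195 flip-flops for type-a, against 1864, 3495, and 2331 for type-b.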
ONBs are the most efficient classes of Gaussian NBs (GNBs) [16], [24]. To achieve smaller area and time overheads when using NBs over binary extension fields, it is recommended to use a GNB with the least possible type. The least possible type for a GNB is equal to 2, and the type-2 GNB is also known as the type-II ONB. Since ONB is the most efficient class of NB, it is of interest to compare the complexity of the proposed multipliers with that of several recently proposed ONB multipliers. Table II compares the hardware complexities of the two proposed multipliers with those of existing digit-level RB multipliers and several ONB multipliers. The comparison is made in terms of the number of logic cells used, critical path delay ($T_{cp}$), and multiplication delay. Among the architectures listed in Table II, the three architectures presented in [13], [14], and [21] are based on an LFSR structure, whereas the others have non-LFSR structural designs. The architecture most comparable to the proposed one is the "comb-style" architecture presented in [12]. Although the overall structures of the two architectures might seem similar, there are two important differences between them. First, the comb-style architecture in [12] implements a general RB multiplier and does not utilize the symmetry property even where it is applicable. Second, for each output coordinate in this architecture, the results of the partial bitwise products are added together first, and then the resultant value enters the accumulation unit. The addition is applied over $w$ partial products together with the current data stored in the accumulation unit before updating the output flip-flop at each clock cycle.
As a result, the critical path contains the XOR chain, thus causing additional delay cost.
As can be seen in Table II, the second proposed architecture, DL-SRB-b, requires the smallest gate count among the RB multipliers. In terms of maximum operating frequency, PS-III has the smallest critical path delay. However, this reduction in critical path delay is achieved by utilizing a layer of flip-flops between the AND gates and the pipelined XOR tree in the structure of PS-III [15, Fig. 7], at the cost of about $w$ times more flip-flops and a significantly longer multiplication delay. The proposed structure DL-SRB-a, together with the "High-speed" structure in [14], has the second smallest critical path delay among all the structures under comparison.
In order to enable a better comparison, the area and delay complexities of the multipliers listed in Table II have been calculated and tabulated in Table III as a case study. Among the five field sizes recommended by the National Institute of Standards and Technology for elliptic curve applications [19], $m = 233$ is the only one for which a type-II ONB exists. For this reason, in all the calculations made for Table III, the field size was selected as $m = 233$. Note that $F_{2^{233}}$ can be embedded into the cyclotomic field $F_2^{(467)}$.
As mentioned earlier, accommodating ring-type operations is a unique feature of redundant representation, which not only provides a cost-free squaring operation but also eliminates the need for modular reduction in finite field operations. However, these advantages are achieved at the cost of a certain level of redundancy in the number of bits required to represent field elements. It should be noted that the appropriate choice of representation system generally depends on the overall specifications of the cryptographic system being implemented, such as field size (security level), the frequency of multiplication and exponentiation operations, the overhead of basis conversions, fault tolerance, and so on. Although numerical comparison reveals that the proposed architectures can effectively reduce the area-delay complexity of RB multipliers (by almost half) for 60% of all field sizes, the main focus of this paper is placed on the roughly 20% of the fields for which $T$ is equal to 2. In that case, not only does the complexity of the proposed RB multipliers become comparable to that of ONB multipliers, but, as suggested in Table III, they may even outperform ONB multipliers.
For each multiplier listed in Table III, the calculations were made for three practical digit sizes, 8, 16, and 32, based on the following assumptions. The required areas for an AND gate, an XOR gate, and a D-type flip-flop are assumed to be equal to $\delta_A$, $\delta_X$, and $\delta_R$ square units, respectively. Parameter $r$ in the second row of Table II represents the number of output product bits generated simultaneously in each clock cycle. To make a fair comparison in terms of gate counts and multiplication delay, the value of parameter $r$ is assumed to be equal to $w$ in Table III where needed. It is also assumed that the propagation delay of an XOR gate is twice as long as the delay of an AND gate [25]. The column entitled "Area Cost" in Table III shows the total area required by logic gates and registers for each multiplier.
Assuming that the propagation delay of an AND gate is equal to 1 delay unit, the column entitled "Delay Cost" presents the multiplication delays relative to the delay of an AND gate. As shown in Table III, DL-SRB-a offers much lower delay costs than the other multipliers. DL-SRB-b stands in second position, having the second lowest delay cost except for one case, digit size 8, in which PS-III shows a slightly better performance. In the design of digit-level finite field multipliers, there is always a tradeoff between delay and area costs as two important design factors, and reducing one of them generally results in an increase in the other. To achieve a fair comparison, the area-delay product of the multipliers has been calculated and listed in the rightmost column of Table III. As can be seen, both of the proposed architectures show much lower area-delay costs than all the existing RB multipliers for all digit sizes listed in Table III. In the case of DL-SRB-b, the area-delay cost is 53%, 51%, and 47% lower than that of the most comparable architecture when $w$ = 8, 16, and 32, respectively. In comparison with ONB multipliers, the DL-SRB-b architecture offers 24%, 29%, and 7% area-delay improvement for digit sizes 8, 16, and 32, respectively.
It has been proven that if there exists a type-II ONB for representing field elements in $F_{2^m}$, then a cyclotomic field of degree $2m + 1$ ($T = 2$) always exists [8]. However, the converse is not always true. The existence of a cyclotomic field of degree $n = 2m + 1$ for $F_{2^m}$ does not necessarily imply that a type-II ONB exists for that particular field size. As a result, the advantage of using the proposed multipliers becomes more distinct when $T = 2$ but no ONB exists, for example, in the case of $m$ = 200, 204, or 224.
V. HARDWARE IMPLEMENTATION
In order to verify the theoretical results, both proposed multipliers were implemented in hardware as separate application-specific integrated circuit (ASIC) modules for three digit sizes: 8, 16, and 32. The multipliers have been realized for the binary extension field of degree 233. Note that in this case $T$ is equal to 2 and the result of Corollary 1 is applicable to the cyclotomic field of degree $n = 467$. All the implementations were carried out in a seven-metal-layer 65-nm CMOS process from STMicroelectronics with the CMOS065LP standard cell library. Fig. 4 shows the design flow used to realize each multiplier. The implementation process started with writing Verilog code to describe the multiplier in a hardware description language. The C language was used to generate netlist blocks describing the numerous interconnections between logic gates as the main part of the RTL code. Then, the RTL design was synthesized to an optimal gate-level design using Design Compiler from Synopsys. In the final stage, the netlist was imported into Cadence SoC Encounter to perform floorplanning, cell placement, clock tree synthesis, reset net synthesis, and routing tasks. Three rounds of simulations were also carried out, after the RTL design, synthesis, and place-and-route stages, to ensure the correct functionality of the multiplier. A set of golden results was initially created by simulating the multiplier with a large set of randomly generated input operands in MATLAB. Then, in each round of simulation, the same set of input data was fed into the multiplier and the product values were compared against the golden set. To obtain accurate power estimation, the final netlist was generated by Encounter and then simulated for 1000 pairs of random input vectors by NCSim to extract and store the switching activity information of all internal nets in value change dump (VCD) format. The switching activity information was then fed into Encounter to calculate the power consumption values.
The main characteristics of the ASIC implementations of the proposed multipliers are listed in Table IV. It should be noted that the gap between the critical path delays measured in the post-synthesis stage and the post-place-and-route stage increases as the value of $w$ changes from 8 to 32. These changes in critical path delay stem from two facts. First, increasing the level of parallelism in the architectures of the multipliers can significantly increase the capacitive load of certain nets, such as input A, reset, and clock. Consequently, a longer buffer chain is required to properly drive the logic cells connected to the high-fan-out nets, thus causing additional delay. Second, contributing factors such as interconnect and parasitic capacitances can only be taken into account for timing analysis after place-and-route, when the layout is fully routed. Such factors eventually lead to a longer critical path delay.
VI. CONCLUSION
Two new digit-level SIPO finite field multipliers using redundant representation have been proposed. For about 60% of the field sizes within the practical range for ECC applications, the relationship between the extension degree $m$ and the size $n$ of the smallest cyclotomic field in which $F_{2^m}$ can be embedded is expressed as $n = Tm + 1$ for $T$ even and greater than or equal to 2 [18]. In this case, a specific feature of redundant representation was used to alleviate the redundancy problem in this representation system. Numerical complexity comparison showed that both new architectures have the lowest delay cost compared with the existing RB architectures. One of the proposed architectures achieved at least 2.12 times higher performance (for different digit sizes over $F_{2^{233}}$) in comparison with the most comparable RB architecture when considering area-delay complexity as a measure of performance. In about 20% of cases, where $T = 2$, the proposed multipliers can show better performance than ONB multipliers where these exist, and can show much better performance than NB multipliers when $T = 2$ but no ONB exists (e.g., field sizes 200, 204, and 224). VLSI implementation of the proposed architectures for the binary extension field of degree 233 and three practical digit sizes in 65-nm CMOS technology was also presented.
