Abstract: In this work, we consider the detection of errors in polynomial, dual, and normal bases arithmetic operations. Error detection is performed by the recomputing with shifted operands (RESO) method while the operation unit is in use. This scheme is efficient for pipelined architectures, particularly systolic arrays. Additionally, one semisystolic multiplier for each of the polynomial, dual, type I, and type II optimal normal bases is presented. The results show that, with better or similar space and time overheads compared to a number of related previous works, the multipliers generally have a higher error-detection capability; e.g., the error-detection capability of the RESO-based scheme for single and multiple stuck-at faults in a polynomial basis multiplier is 100 percent. Finally, we also comment on how RESO can be used for concurrent error correction to deal with transient faults.
INTRODUCTION

ACHIEVING an acceptable security level for hardware implementations of large digital systems, especially those with critical applications, has received significant attention recently. During the use of these digital systems, faults in the circuits may occur either due to natural causes or due to deliberate fault injection by an attacker (see, for example, [1], [2], [3], [4]). Faulty circuits are likely to generate erroneous results that may make the whole system unusable or may help the attacker to break the system. As a result, error detection is important for these systems. This paper mainly investigates the detection of random errors in polynomial, dual, and normal bases arithmetic operations, which have applications in deep space channel coding [5], VLSI testing [6], and cryptography [7], [8]. Although the schemes proposed in this paper may not provide a complete solution to the problem of deliberately injected faults, they are likely to make an attack more difficult. This is because the number of faults that an attacker can exploit is reduced to the number of faults that cannot be detected by the scheme.
Some random errors in a number of hardware-based cryptosystems or their arithmetic accelerators can be detected at an upper level operation, e.g., in elliptic curve cryptography (ECC), if a point leaves the curve, this can be easily detected by point verification [1], [4]. This is, however, not always possible. In the case of ECC, a fault may move a point to another point without leaving the curve, and this has been exploited in the so-called sign change fault attack [2]. As a result, mechanisms for error detection in finite-field operations can still be quite important.
A number of schemes have been recently proposed in this regard (see [9], [10], [11]). Some of the techniques used in these schemes can be categorized as follows:
• Using parity bits: In this approach, the parity of the output is predicted and compared against its actual parity. Some examples are [12], [13], [14], [15], [16], [17].
• Scaling techniques: This approach, which is used in [18], [19], scales the inputs of a multiplier up by a factor; at the end of the multiplication, the correctness of the result is checked by one or two divisions.
• Nonlinear techniques: One example of this approach [20] is to compute a nonlinear residue for each input of the operation and then predict the residue of the output using the residues of the inputs. To assure the correctness of the operation, one can compare the predicted residue against the actual output residue. This approach appears to be expensive in terms of area and time for detecting random errors, although the high overhead can be justified for preventing sophisticated attacks under complex nonlinear models.
• Time-redundancy-based techniques: In this approach, methods such as recomputing with shifted operands are used to detect errors in operations. In [21], [22], this method is used for detecting errors in polynomial basis multipliers.
This paper considers the issue of detecting errors concurrently in polynomial, dual, and normal bases arithmetic operations. Our scheme for error detection is based on recomputing with shifted operands (RESO) and is efficient for pipelined architectures such as systolic arrays. To further investigate this scheme, one finite-field semisystolic multiplier is presented for each of the polynomial, dual, type I, and type II optimal normal bases. Then, the concurrent error detection (CED) scheme is applied to them. Additionally, the space and time complexities of these multipliers are compared against a number of systolic and/or semisystolic multipliers previously published in the literature. Furthermore, the error-detection capability of each multiplier is evaluated by simulation-based fault injection. Our results show that, with better or similar space and time overheads compared to a number of related previous works, the proposed multipliers generally have a high error-detection capability; e.g., the percentage of error detection of our scheme for single and multiple stuck-at faults in a polynomial basis multiplier is 100 percent. Finally, we also comment on how RESO can be used for concurrent error correction to deal with transient faults.
The organization of this paper is as follows: In Section 2, some preliminaries about polynomial, dual, and normal bases as well as the RESO method are reviewed. Section 3 presents the application of the RESO-based concurrent error-detection scheme to field arithmetic using polynomial, dual, and normal bases. General pipelined architectures, which are suitable for these schemes, along with an overhead analysis are given in Section 4. In Section 5, the CED scheme is investigated in more detail for polynomial, dual, and normal bases multipliers. The error-detection capability of the scheme is then evaluated in Section 6. In Section 7, some comments on the concurrent error-correction strategy are given. Finally, Section 8 gives a few concluding remarks.
PRELIMINARIES
This section briefly reviews polynomial, normal, and dual bases, and the RESO method.
Polynomial, Dual, and Normal Bases
Let $F(x) = \sum_{i=0}^{m} f_i x^i$ be an irreducible polynomial over $GF(2)$ of degree $m$. The polynomial (or canonical) basis is defined as $\{1, x, x^2, \ldots, x^{m-1}\}$. An element $A \in GF(2^m)$ can be represented using the polynomial basis (PB) as $A = \sum_{i=0}^{m-1} a_i x^i$, where $a_i \in GF(2)$. Moreover, suppose the dual basis (DB) of the polynomial basis defined above is represented as $\{y_0, y_1, \ldots, y_{m-1}\}$, where $y_i \in GF(2^m)$, then we can write
$$Tr(x^i y_j) = \begin{cases} 1, & i = j, \\ 0, & i \neq j, \end{cases}$$
where the trace function of an element $a \in GF(2^m)$ is
$$Tr(a) = \sum_{i=0}^{m-1} a^{2^i}.$$
Furthermore, given that $x \in GF(2^m)$, $\{x, x^2, x^{2^2}, \ldots, x^{2^{m-1}}\}$ is a normal basis (NB) if the set members are linearly independent.
RESO Method
RESO is a technique for CED in arithmetic and logic units introduced by Patel and Fung [23], [24]. This technique is based on time redundancy. Suppose $x$ and $f(x)$, respectively, are the input and the output of a computation unit $f$. Also, suppose $E$ and $D$ are two functions such that $D(f(E(x))) = f(x)$. Now, we store the result of the computation of $f(x)$ (first run) in a register and compare it with the result of the computation of $D(f(E(x)))$ (second run). Any difference between the results of these two runs indicates an error. The functions $E$ and $D$ are referred to as encoding and decoding, respectively, and can usually be chosen such that $D = E^{-1}$. It is worth mentioning that for conventional binary operands, $E$ and $D$ are simple shifts of operand bits, which is why the technique is referred to as RESO.
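The two-run comparison can be illustrated with a minimal Python sketch. It assumes an ordinary integer adder as the computation unit, a one-bit left shift as the encoding, and a one-bit right shift as the decoding. The helper names (`reso_check`, `make_stuck_adder`) and the output stuck-at fault model are hypothetical and chosen only for illustration.

```python
def reso_check(f, encode, decode, *operands):
    """Run f twice (plain and with encoded operands) and compare the results."""
    first = f(*operands)                                  # first run: f(x)
    second = decode(f(*(encode(x) for x in operands)))    # second run: D(f(E(x)))
    return first, second, first != second                 # True flags an error


def make_stuck_adder(bit, value):
    """Adder whose output bit `bit` is stuck at `value` (illustrative fault model)."""
    def adder(a, b):
        s = a + b
        return s | (1 << bit) if value else s & ~(1 << bit)
    return adder


shift_left = lambda x: x << 1     # E: shift each operand left by one bit
shift_right = lambda x: x >> 1    # D: shift the result right by one bit

healthy = lambda a, b: a + b
faulty = make_stuck_adder(bit=0, value=0)   # output bit 0 stuck at 0

print(reso_check(healthy, shift_left, shift_right, 5, 6))  # (11, 11, False)
print(reso_check(faulty, shift_left, shift_right, 5, 6))   # (10, 11, True)
print(reso_check(faulty, shift_left, shift_right, 4, 6))   # (10, 10, False)
```

In the last call the fault is masked in both runs, so the two results agree and are still correct.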
CONCURRENT ERROR-DETECTION STRATEGY
Errors may be caused by different types of faults such as open faults, short (bridging) faults, and/or stuck-at faults. Furthermore, the faults can be transient or permanent. We assume that locations of these faults, occurred naturally or injected deliberately, are random.
In this section, we extend the RESO method to concurrently detect errors in arithmetic operations over the field $GF(2^m)$. For the polynomial, normal, and dual bases, the encoding and decoding functions are chosen so that the overhead costs (in terms of area and time) are fairly low. The arithmetic operations considered in this work include addition/subtraction, multiplication, inversion, and division. Fig. 1 shows a general architecture of an operation with concurrent error detection. In the figure, the two encoding functions of the inputs are $E_1$ and $E_2$, and the decoding function of the output is $D$. Clearly, for inversion, which is a unary operation, the second input in Fig. 1 should not be considered.
Let us assume that the arithmetic operation performs the function $f$. Then, we can write $C = f(A, B)$. Also, let $A_{E_1} = E_1(A)$ and $B_{E_2} = E_2(B)$. Considering that $C'$ is the result of the second computation (i.e., run) after decoding, we have
$$C' = D(f(A_{E_1}, B_{E_2})).$$
As a result, in the absence of errors, $C' = C$, and any mismatch between $C$ and $C'$ indicates an error.
In the following, the aforementioned concurrent error-detection strategy for each basis is investigated. The overhead cost of the CED scheme will be discussed in Section 4.
CED for PB Arithmetic Operations
Let us denote the PB of $GF(2^m)$ as $\{1, x, \ldots, x^{m-1}\}$. A possible candidate for encoding and decoding functions in the PB representation of the elements of the field is multiplication by $x$ or $x^{-1}$. Clearly, all arithmetic operations are modulo the field defining polynomial $F(x)$. Particularly, elements $x^m$ and $x^{-1}$ modulo $F(x)$ are as follows:
$$x^m = \sum_{i=0}^{m-1} f_i x^i \bmod F(x), \qquad x^{-1} = \sum_{i=1}^{m} f_i x^{i-1} \bmod F(x).$$
The multiplication of $x$ and an arbitrary element $A$ of $GF(2^m)$ is performed as follows:
$$xA \bmod F(x) = (0, a_0, a_1, \ldots, a_{m-2}) + a_{m-1}(1, f_1, f_2, \ldots, f_{m-1}). \quad (3)$$
Similarly, the multiplication of $x^{-1}$ and $A$ is performed as follows:
$$x^{-1}A \bmod F(x) = (a_1, a_2, \ldots, a_{m-1}, 0) + a_0(f_1, f_2, \ldots, f_{m-1}, 1). \quad (4)$$
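As a small worked example of the two scalings, assume (purely for illustration) the field $GF(2^4)$ with $F(x) = x^4 + x + 1$, so that $(f_0, f_1, f_2, f_3, f_4) = (1, 1, 0, 0, 1)$, and take $A = 1 + x + x^3$, i.e., $(a_0, a_1, a_2, a_3) = (1, 1, 0, 1)$:

```latex
\begin{align*}
xA \bmod F(x) &= (0, a_0, a_1, a_2) + a_3\,(1, f_1, f_2, f_3)
               = (0,1,1,0) + (1,1,0,0) = (1,0,1,0) = 1 + x^2, \\
x^{-1}A \bmod F(x) &= (a_1, a_2, a_3, 0) + a_0\,(f_1, f_2, f_3, 1)
               = (1,0,1,0) + (1,0,0,1) = (0,0,1,1) = x^2 + x^3 .
\end{align*}
```

Both results can be checked directly using $x^4 = x + 1$: $x(1 + x + x^3) = x + x^2 + x^4 = 1 + x^2$, and $x(x^2 + x^3) = x^3 + x^4 = 1 + x + x^3 = A$.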
Hereafter, (3) and (4) are referred to as (forward) scaling and inverse scaling, respectively. Both scalings are of low cost for hardware implementation. An overhead analysis will be given in Section 4.1. Note that multiplication of an element with $x^i$ or $x^{-i}$ can be considered as $i$ consecutive scalings or inverse scalings, respectively. In the following, encoding and decoding functions are determined for basic field arithmetic operations, namely, addition/subtraction, multiplication, inversion, and division. Also, we show the procedure of CED in each PB arithmetic operation, assuming that $A, B, T \in GF(2^m)$.
1. Addition/Subtraction:
a. Compute $A + B$ and store the result in a register.
Compare this result with that of a.
2. Multiplication:
a. Compute $A \times B$ and store the result in a register.
Compare this result with that of a.
Although the alternative encodings and decodings as given above are relatively more efficient for implementation, they may result in a lower error-detection capability. For example, a permanent single-bit fault at the end of an arithmetic operation cannot be detected, since such a fault changes the results of both runs in the same manner and generates identical results even in the presence of the fault.
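The multiplication check can be sketched in Python as follows. The sketch assumes the toy field $GF(2^4)$ with $F(x) = x^4 + x + 1$, field elements packed into integers (bit $i$ holds $a_i$), one forward scaling per operand as encoding, and two inverse scalings as decoding, which is the choice described for the PB multiplier in Section 5.1. The bit-serial multiplier, the function names, and the output stuck-at fault are illustrative assumptions and do not model the semisystolic design of this paper.

```python
M = 4            # field degree (toy value, assumed for illustration)
F = 0b10011      # F(x) = x^4 + x + 1, bit i holds f_i


def scale(a):
    """Forward scaling: x * A mod F(x)."""
    a <<= 1
    if (a >> M) & 1:      # a_{m-1} was 1, so reduce by F(x)
        a ^= F
    return a


def inv_scale(a):
    """Inverse scaling: x^-1 * A mod F(x)."""
    if a & 1:             # a_0 is 1, so add F(x) before dividing by x
        a ^= F
    return a >> 1


def pb_mul(a, b):
    """Bit-serial PB multiplication: C = sum of b_i * (x^i * A mod F(x))."""
    c = 0
    for i in range(M):
        if (b >> i) & 1:
            c ^= a
        a = scale(a)
    return c


def ced_mul(a, b, mul=pb_mul):
    """Return (first-run result, error flag) for a RESO-checked multiplication."""
    first = mul(a, b)                     # run 1: A * B
    t = mul(scale(a), scale(b))           # run 2: (xA) * (xB)
    second = inv_scale(inv_scale(t))      # decode: x^-2 * T
    return first, first != second


def stuck_at_0_bit0(a, b):
    """Hypothetical faulty multiplier: output bit 0 permanently stuck at 0."""
    return pb_mul(a, b) & ~1


print(ced_mul(3, 5))                        # (15, False): fault-free
print(ced_mul(3, 5, mul=stuck_at_0_bit0))   # (14, True): the two runs disagree
```

Because the fault hits the two runs at different positions relative to the decoded result, the permanent fault in this example produces different outputs in the two runs and is flagged.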
CED for DB Arithmetic Operations
Similar to PB arithmetic operations, a suitable candidate for encoding and decoding functions in the DB representation of the elements of the field is multiplication of an element by $x$ or $x^{-1}$. This multiplication is considered in Lemma 1.
Lemma 1. Let $A$ be an element of $GF(2^m)$ represented in the dual basis. Let $F(x) = \sum_{i=0}^{m} f_i x^i$ be the field defining polynomial. Then, the (forward) scaling and an inverse scaling can be performed as follows:
Proof. The proof for forward scaling can be found in [25]. Similarly, the inverse scaling can be proved as follows: According to Section 2, the DB representation of $A$ is $A = \sum_{i=0}^{m-1} a'_i y_i$. Moreover, according to [25], we have
Therefore, for $1 \le j \le m-1$, we have
Also,
Note that for low-weight $F(x)$, the hardware implementation of DB scalings requires only a few gates (see Section 4.2). Moreover, functions $E_1$, $E_2$ (if applicable), and $D$ can be chosen to be the same as those for the PB representation given in Section 3.1, and then, a similar CED procedure can be performed.
CED for NB Arithmetic Operations
considering that the arithmetic is performed in characteristic 2, we have 
The hardware implementations of NB squaring and taking the square root have no cost (see Section 4.3). Therefore, in NB arithmetic operations, proper choices for encoding and decoding functions are squaring and taking the square root. Moreover, the procedures of CED in NB arithmetic operations are more uniform since the encoding function(s) and the decoding function are always squaring and taking the square root, respectively. The CED procedures follow, assuming that $A, B, T \in GF(2^m)$ and $n$ is a nonnegative integer.
• Addition/Subtraction:
a. Compute $A + B$ and store the result in a register.
Because of cost-free squaring using NB, the above encoding can also be applied to exponentiation as shown below. We note that an extension of such encoding/decoding to exponentiation using PB or DB involves a lot more hardware.
• Exponentiation:
Take the square root of $T$. Compare this result with that of a.
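A minimal Python sketch of the NB addition check follows. It assumes that NB coordinates are packed into an integer, that squaring is a one-position rotate-left of those coordinates and the square root the opposite rotation (the direction depends on the coordinate ordering, which is an assumption here), and a toy word length of five bits. The function names and the stuck-at fault are hypothetical.

```python
M = 5                      # toy field degree (assumed for illustration)
MASK = (1 << M) - 1


def nb_square(a):
    """Squaring in NB modeled as a one-position cyclic shift (rotate left)."""
    return ((a << 1) | (a >> (M - 1))) & MASK


def nb_sqrt(a):
    """Square root in NB modeled as the opposite cyclic shift (rotate right)."""
    return ((a >> 1) | ((a & 1) << (M - 1))) & MASK


def ced_nb_add(a, b, add):
    """Return (first-run result, error flag) for a RESO-checked NB addition."""
    first = add(a, b)                         # run 1: A + B
    t = add(nb_square(a), nb_square(b))       # run 2: A^2 + B^2
    second = nb_sqrt(t)                       # decode: square root of T
    return first, first != second


def xor_add(a, b):
    """Fault-free field addition (bitwise XOR of coordinates)."""
    return a ^ b


def stuck_at_1_bit0(a, b):
    """Hypothetical faulty adder: output bit 0 permanently stuck at 1."""
    return (a ^ b) | 1


print(ced_nb_add(0b10110, 0b00010, xor_add))          # (20, False)
print(ced_nb_add(0b10110, 0b00010, stuck_at_1_bit0))  # (21, True)
```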
PIPELINE ARCHITECTURE AND OVERHEAD ANALYSIS
The above CED scheme introduces more than 100 percent time redundancy in straightforward (nonpipeline) architectures, where only one computation unit is used. For example, in the case of multiplication, if the first run has a computation time of $\tau$, so does the second run, and hence, the overall computation time with CED is at least $2\tau$. One way to reduce the overall computation time is to overlap the two runs using a pipeline architecture for the computation unit. For an $l$-stage pipeline unit, the delay in each stage can be close to $\tau/l + \delta$, where $\delta$ is the interstage buffering time. Thus, if the two runs can be performed in the pipeline unit one after another, the overall computation time with CED can be approximated as $(l + 1)(\tau/l + \delta)$. Note that, like the nonpipeline architecture, the pipelined unit also has a throughput of 50 percent. However, pipelining allows us to reduce the overall latency of the operation with CED by overlapping various stages of the computation. Hereafter, the overhead of the overall latency is referred to as time overhead.
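The latency comparison above can be made concrete with a short Python sketch. The parameter names `tau` (computation time of one run), `stages` (number of pipeline stages), and `delta` (interstage buffering time) are chosen here for illustration, and the numbers are arbitrary.

```python
def nonpipelined_ced_time(tau):
    """Two back-to-back runs on a single nonpipelined unit."""
    return 2 * tau


def pipelined_ced_time(tau, stages, delta):
    """Two runs issued one after another into a pipeline with `stages` stages."""
    stage_delay = tau / stages + delta     # per-stage delay including buffering
    return (stages + 1) * stage_delay      # the second result leaves one stage later


tau, stages, delta = 100.0, 10, 1.0        # arbitrary illustrative numbers
print(nonpipelined_ced_time(tau))              # 200.0
print(pipelined_ced_time(tau, stages, delta))  # 121.0
```

With these assumed values, overlapping the two runs reduces the latency with CED from 200 to 121 time units, while one result is still produced per two issued operations.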
An example of the pipelined architecture is the systolic array, which is used for high-performance arithmetic operations. As shown in Fig. 2a, one buffer is added to the end of the pipeline architecture to store the result of the (first) computation of the arithmetic operation. Then, the result of the second computation after decoding will be compared against the content of the last buffer of the pipeline. Alternatively, it is possible to start performing the operation with encoded inputs first, and then, perform the normal computation. In this case, a decoder should be placed after the added buffer, as shown in Fig. 2b.
Suppose that the number of pipeline stages is $l$. Let the propagation delays of the encoding function, the decoding function, the $i$th stage of the pipeline (including a buffer), the buffer, the equality checker, and one XOR gate be $t_e$, $t_d$, $t_i$, $t_b$, $t_c$, and $t_X$, respectively. Let $t'_{clk}$ and $t_{clk}$ be the clock periods of the pipelined arithmetic operation with and without CED, respectively. Clearly, $t_{clk} \ge \max\{t_i\}$ for $1 \le i \le l$. Also, in practice, $t_{clk} \ge t_X$. For each pipeline architecture of Fig. 2, the clock period $t'_{clk}$ and the time overhead are given in Table 1. One can choose whichever of the aforementioned architectures has the smaller time overhead. Note that $t_1$, $t_i$, and $t_l$ in Table 1 are the propagation delays of the first stage, the $i$th stage, and the final stage, respectively, including the delay of the corresponding buffer.
It is worth mentioning that in some pipeline architectures such as systolic arrays, the delay of the equality checker ($t_c$) can be larger than the other delays mentioned in the second column of Table 1 under $t'_{clk}$. In this case, one may be able to reduce $t_c$ using a suitable method such as pipelining the checker itself. This will be addressed in more detail in Section 5.4. Also, as mentioned earlier, the arithmetic operations considered in this work include addition/subtraction, multiplication, inversion, and division.
Overheads in PB Operations
The hardware implementations of the PB scaling or inverse scaling are of very low cost since they need a cyclic shift to the right or left, which is free of cost in hardware, and $\omega - 2$ XOR gates, where $\omega$ is the Hamming weight of $F(x)$. The propagation delay for one PB scaling or inverse scaling is $t_X$, since there is one level of XOR gates in the implementation. As mentioned before, the multiplication of a finite-field element with $x^i$ or $x^{-i}$ can be implemented by $i$ scalings or inverse scalings, respectively. Therefore, the maximum number of XOR gates required for the implementation is $i(\omega - 2)$.
To implement a PB arithmetic operation with CED, we need two encoding functions at maximum. Each of them consists of one scaling or inverse scaling. Additionally, we need one decoding function that consists of a maximum of two scalings or inverse scalings. Therefore, $4(\omega - 2)$ XOR gates are needed for the encoding and decoding functions of a PB arithmetic operation. We also have $t_e \approx t_X$ and $t_d \le 2t_X$.
Overheads in DB Operations
The hardware implementations of the DB scalings need a cyclic shift to the right or left and $\omega - 2$ XOR gates. The propagation delay for one DB scaling or inverse scaling is $(\omega - 2)t_X$ due to the least significant bit of a scaling or the most significant bit of an inverse scaling.
Similar to PB arithmetic operations, for a DB arithmetic operation, a maximum of $4(\omega - 2)$ XOR gates are needed for the encoding and decoding functions. We also have $t_e \approx (\omega - 2)t_X$ and $t_d \le 2(\omega - 2)t_X$.
Overheads in NB Operations
Squaring and taking the square root of an element represented in NB need just a cyclic shift to the right or left. Therefore, squaring and taking the square root have almost no area and time overhead in hardware implementation and we have $t_e = t_d = 0$.
The area overheads for pipeline implementations of PB, DB, and NB $GF(2^m)$ arithmetic operations with the proposed CED are summarized in Table 2a. It is worth mentioning that the added buffer is $m$ bits long. Also, $m$ XOR gates and one $m$-input OR gate are needed in the equality checker unit. Note that for all practical values of $m$, either a trinomial ($\omega = 3$) or a pentanomial ($\omega = 5$) can be found as the field defining polynomial [26].
Note that if E and D functions of an NB operation with CED are considered to be a number of squarings and taking square roots, the area overhead of the operation does not change. For the following PB and DB operations, the E and D functions can be considered as:
.
A CLOSER LOOK AT POLYNOMIAL, DUAL, AND NORMAL BASES MULTIPLIERS WITH CED
In this section, two semisystolic multipliers for PB and DB bases and two such multipliers for NB basis are presented. Then, the time and area complexities of each of them with or without CED are given.
A Systolic PB Multiplier with CED
Let $A, B, C \in GF(2^m)$ and $C$ be the result of the multiplication of $A$ and $B$ using PB. Then, we can write
$$C = AB \bmod F(x) = \sum_{i=0}^{m-1} b_i x^i A \bmod F(x).$$
Let $A^{(0)} = A$ and $A^{(i)} = xA^{(i-1)} \bmod F(x)$. Then, we have
$$C = \sum_{i=0}^{m-1} b_i A^{(i)}.$$
Expression (7) can be written in a recursive manner as follows:
(9) in (8), we have
It is worth mentioning that, except for $f_0$ and $f_m$, the number of nonzero $f_i$s for $1 \le i \le m-1$ is $\omega - 2$, where $\omega$ is the Hamming weight of $F(x)$. Fig. 4 shows the semisystolic PB multiplier. In the figure, it is assumed that $f_i$, $f_j$, and $f_k$ are not zero. Consequently, the cells of the columns $i$, $j$, and $k$ are type 2 (PBT2). Generally, we have $(m - \omega + 2)$ PBT1 and $(\omega - 2)$ PBT2 cells in each row. Furthermore, PBT1 and PBT2 contain (one AND, one XOR, two latches) and (one AND, two XORs, two latches), respectively. Table 3a presents the total number of required gates or latches for a semisystolic PB multiplier. It is worth mentioning that for all practical values of $m$, one can find an irreducible low-weight polynomial: either a trinomial or, where a trinomial does not exist, a pentanomial [26]. Also, the computation time for each type of cell is presented in Table 3b.
For the purpose of error detection, we applied the method discussed in Section 3.1. In other words, each encoding function is one forward scaling and the decoding function is two inverse scalings. Table 4 shows the area and time complexities of the multiplier presented here along with a number of previously published systolic or semisystolic PB multipliers without CED capability and with CED capability if applicable.
In [22] and [21], two checkers (comparators) are used. One is to detect errors at the output of the circuit and the other is to detect errors on global lines. These lines are horizontal lines that connect one bit of one of the inputs (say input $B$) to all cells in a row of the multiplier. However, in our proposed scheme, since both inputs are encoded, the detection of errors in the global lines is possible. The error-detection capability of the scheme is investigated in Section 6. Also, a proof for single-bit error detection in global lines of a PB multiplier is presented in the Appendix.
According to Table 4, the space complexities and latencies of the multipliers with or without CED presented in this work seem to be better when compared to the other multipliers mentioned in the table. Note that $\omega = 3$ or $\omega = 5$, and the latency of the multiplier without CED in [22] is the same as that of our work, but that multiplier is not general. Admittedly, the cell time complexity of our work is not among the best. However, this may not be considered a drawback in multipliers with CED because the bottleneck for determining the minimum clock period is usually the propagation delay of the equality checkers, not the cell time complexity. This will be further investigated in Section 5.4.
A Systolic DB Multiplier with CED
Suppose that $A, B, C \in GF(2^m)$ and $C$ is the product of $A$ and $B$. The formulation for DB multiplication is similar to [21]. However, this space complexity has to be more, since the corresponding encoding and decoding functions were not considered in the report.
Considering Expression (12), we can consider two types of cells for semisystolic DB multipliers, as shown in Fig. 5. Except for the last column of the multiplier ($j \neq m - 1$), type 1 cells are used (see Fig. 5a). These cells require one 2-input AND gate, one 2-input XOR gate, and two 1-bit latches.
Type 2 cells (Fig. 5b) are used in the last column. In addition to the gates and latches needed for type 1 cells, type 2 cells require one extra 2-input or 4-input XOR gate depending on whether the defining polynomial of the underlying field is a trinomial ($\omega = 3$) or a pentanomial ($\omega = 5$), respectively. Fig. 6 shows a semisystolic DB multiplier.
The error-detection scheme presented in Section 3.2 is applied to this multiplier. Hence, the encoding and decoding functions are one forward scaling and two inverse scalings, respectively. Table 5 summarizes the time and space complexities of this work and a number of previously published multipliers with and without CED capability as appropriate.
As shown in Table 5, the multipliers (with and without CED) presented in this work outperform previously reported similar multipliers. It is worth mentioning that latches are relatively more area-consuming components, and hence, the multiplier in [29] requires more space than our work. The cell time complexity of our work is not lower than that of the other multipliers; however, this does not imply that the minimum clock period (MCP) of our work with CED is larger than that of another multiplier with CED. This is mostly the case when the other delay parameters for determining the MCP (according to Table 1), such as the propagation delay of the equality checker, are larger than the cell time complexity.
A Systolic NB Multiplier with CED
In this section, two multipliers using optimal normal bases of type I and type II are presented.
Table note (2): Should be similarly computed according to Table 1; $\max\{t_i\}$ is the same as that of the multiplier without CED.
Type I Optimal Normal Basis (ONB1)
Suppose that $m + 1$ is a prime number and 2 is primitive in $GF(m + 1)$. Then, the field defining polynomial can be chosen to be $F(x) = \sum_{i=0}^{m} x^i$, which is an all-one irreducible polynomial over $GF(2)$. Let $x$ be a root of $F(x)$. Since $F(x) \mid (x^{m+1} - 1)$, we have $x^{m+1} = 1$. Therefore, the set of normal basis elements presented in Section 2.1 can be reduced accordingly. The resulting set has $m$ linearly independent elements [32] as follows and is referred to as type I optimal normal basis:
$$\{x, x^2, \ldots, x^{m-1}, x^m\}.$$
It is worth mentioning that the order of the elements in the above set is different from the conventional representation of a normal basis. Therefore, we define the following permutation functions that basically change the order of the coefficients in the normal basis representations:
Suppose that the NB and ONB1 representations of $A$ are $A = \sum_{i=0}^{m-1} a''_i x^{2^i}$ and $A = \hat{a}_1 x + \hat{a}_2 x^2 + \hat{a}_3 x^3 + \cdots + \hat{a}_m x^m$, respectively. Also, let us assume that after permutation, we have $\hat{a}_j = a''_i$. Then, $j = 2^i \bmod (m + 1)$. Now, suppose that $A, B, C \in GF(2^m)$ are represented in ONB1. Hence, $A = \sum_{i=0}^{m} \hat{a}_i x^i$ and $B = \sum_{i=0}^{m} \hat{b}_i x^i$, where $\hat{a}_0 = \hat{b}_0 = 0$. Therefore, using the previous notations, we have
Expression (13) is very similar to PB multiplication except that this multiplication is $(m + 1)$ bits long. Additionally, for $1 \le i \le m$, we have as follows:
$$C = C^{(m)} + c^{(m)}_0 \cdot F(x).$$
Fig. 7 shows a cell of a semisystolic ONB1 multiplier. It is worth mentioning that since $F(x)$ is an all-one polynomial, we can simply add the LSB of $C^{(m)}$ to all other bits of $C^{(m)}$ in the hardware implementation. Fig. 8a shows a semisystolic ONB1 multiplier. As shown in Fig. 8a, $\hat{b}_0 = 0$ is one input of all AND gates of the cells in the first row. Therefore, the first row can be omitted, and then, the $\hat{a}_i$s should be fed into the first row after one rotation, as shown in Fig. 8b. Clearly, in this way, the space and latency of the multiplier are slightly reduced.
The error-detection scheme presented in Section 3.3 is applied to this multiplier. The encoding and decoding functions are one or two shifts to the left or right. Clearly, the encodings should be performed before the permutation $\pi_1$ and the decoding should be performed after the inverse permutation $\pi_1^{-1}$. The time and space complexities of this work with and without CED capability are presented in Table 6. According to Table 6, the ONB1 multiplier presented here is better than that in [29] in terms of space complexity and latency, and they are the same in terms of cell time complexity. Furthermore, to the best of our knowledge, this is the first work that addresses CED in NB multipliers.
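The reordering between the NB and ONB1 coefficient orders, in which the NB coefficient with index $i$ moves to ONB1 position $j = 2^i \bmod (m + 1)$, can be sketched in Python as follows. The function names and the toy degree $m = 4$ (where $m + 1 = 5$ is prime and 2 is primitive modulo 5) are assumptions made for illustration.

```python
def onb1_exponents(m):
    """Return [2^i mod (m+1) for i = 0..m-1], the ONB1 position of NB coefficient i."""
    p = m + 1
    exps = [pow(2, i, p) for i in range(m)]
    if sorted(exps) != list(range(1, m + 1)):
        raise ValueError("2 does not generate all nonzero residues modulo m + 1")
    return exps


def nb_to_onb1(nb_coeffs):
    """Reorder NB coefficients (index i) into ONB1 coefficients (index j = 1..m)."""
    m = len(nb_coeffs)
    hat_a = [0] * (m + 1)                  # position 0 stays 0
    for i, j in enumerate(onb1_exponents(m)):
        hat_a[j] = nb_coeffs[i]
    return hat_a


print(onb1_exponents(4))                   # [1, 2, 4, 3]
print(nb_to_onb1([1, 0, 1, 1]))            # [0, 1, 0, 1, 1]
```

The check inside `onb1_exponents` rejects degrees for which the powers of 2 do not cover every position from 1 to $m$, e.g., $m = 6$.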
Type II Optimal Normal Basis (ONB2)
Suppose that $2m + 1$ is a prime number and one of the following conditions holds:
• 2 is primitive in $GF(2m + 1)$, or
• $2m + 1 \equiv 3 \pmod 4$ and the multiplicative order of 2 modulo $2m + 1$ is $m$.
Then, the field $GF(2^m)$ can be constructed using the normal element $\gamma + \gamma^{-1}$ [32] and the basis for field representation is referred to as type II optimal normal basis as follows:
$$\{\gamma + \gamma^{-1}, \gamma^{2} + \gamma^{-2}, \gamma^{2^2} + \gamma^{-2^2}, \ldots, \gamma^{2^{m-1}} + \gamma^{-2^{m-1}}\},$$
where $\gamma$ is a primitive $(2m + 1)$th root of unity. Hence, for
It is worth mentioning that the above set can be rewritten as follows [33]:
$$\{\gamma + \gamma^{-1}, \gamma^{2} + \gamma^{-2}, \gamma^{3} + \gamma^{-3}, \ldots, \gamma^{m} + \gamma^{-m}\}.$$
Similar to ONB1, a permutation function is needed to convert an NB representation to an ONB2 representation and vice versa. Hence, $\pi_2$:
Suppose that the NB and ONB2 representations of
respectively. Also, let us assume that after permutation, we haveâ j ¼ a
we have
Now, we adjust the power of the basis for $C_{11}$, $C_{12}$, $C_{21}$, and $C_{22}$ by changing the variables as follows:
Note that for $i = 1$, the upper bound of the second summation becomes negative and the result is all zero.
Table note (1): Should be computed according to Table 1; $\max\{t_i\}$ is the same as that of the multiplier without CED.
4. For $C_{22}$, the powers of the basis are larger than $m$. Hence, using $\gamma^{2m+1} = 1$, we have
Now, let $2m + 1 - (j + i) = k$. Then, we have
To derive a single closed form for the multiplication, we have
where $|i - k|$ is the absolute value of $(i - k)$ and $\hat{b}_0 = 0$. Also, we have
To have a recursive (rolled) form which is suitable for systolic arrays, we can write the above expression as follows:
Fig. 9 shows two possible implementations for a cell of the ONB2 semisystolic multiplier using (16). One of these implementations can be chosen based on space and/or time complexities, e.g., if $T_{3X} < 2T_X$, then the propagation delay of the cell shown in Fig. 9b is lower.
In this work, we have chosen the cell shown in Fig. 9b . A complete semisystolic ONB2 multiplier is shown in Fig. 10 .
Similar to the ONB1 multiplier, the multiplier of Fig. 10 can be equipped with the error-detection scheme presented in Section 3.3. The encoding and decoding functions are one or two shifts to the left or right. It is worth mentioning that the encodings and the decoding should be performed before the permutation function $\pi_2$ and after the inverse permutation function $\pi_2^{-1}$, respectively. The time and space complexities of this work for ONB1 and ONB2 along with a number of other related previous works with and/or without CED capability are presented in Table 6.
According to Table 6 , the ONB2 multiplier presented here can be considered among the best in terms of space complexity and is the best in terms of latency. Additionally, as mentioned earlier, to the best of our knowledge, this is the first work addressing the CED in NB multipliers.
Some Notes on Delays of Cells and Equality Checkers
The clock rate of a pipeline architecture can be affected by a number of parameters presented in the second column of Table 1 . The propagation delays of the stages of some pipeline architectures such as systolic or semisystolic arrays are small. Particularly, for these architectures, one important parameter to determine the clock rate is the delay of the equality checker. In the following, this issue is investigated. The equality checker is basically one level of XOR gates to check the equality of the bits of two inputs and one OR unit to determine the final error signal, as shown in Fig. 11 .
A straightforward method to design the OR unit is to use 2-input OR gates. Then, $m - 1$ such gates in $\lceil \log_2 m \rceil$ levels, assuming a binary tree organization, are needed. Therefore, $t_c = t_X + \lceil \log_2 m \rceil t_{OR2}$. Alternatively, one can construct the $m$-input OR unit using $\lceil \log_n m \rceil$ levels of $n$-input OR gates.
For the purpose of simulation, gates were modeled using ratioed logic that uses only one PMOS transistor in the pull-up network. We used 0.18 μm technology. Also, we initialized all inputs of the gates with zero and, after a while, changed the value of only one of them to 1. The result of the transient response simulation is shown in Fig. 12.
According to the figure, the two-level 13-input OR design has the smallest propagation delay. Furthermore, both the eight-level 2-input OR and the one-level 163-input OR designs are significantly slower. The reason the one-level 163-input OR design is slow is that all 163 NMOS transistors of the pull-down network are connected in parallel. This produces a large parasitic capacitance at the output of the gate, which takes a long time to discharge when one NMOS transistor turns on. It is worth mentioning that if one needs to use the standard cells of a library, the best choice is the four-level 4-input OR design, because the 13-input OR gate is often unavailable in standard cell libraries and the four-level 4-input design has the smallest propagation delay after the two-level 13-input design.
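The level counts quoted above for $m = 163$ follow from taking the smallest number of levels of $n$-input gates that covers all $m$ signals. The Python sketch below computes them and models the checker functionally; the function names are hypothetical, and the gate delays passed to `checker_delay` must be supplied by the caller, since, as the simulation shows, the delay of an OR gate depends strongly on its fan-in.

```python
def or_tree_levels(m, fan_in):
    """Smallest number of levels of fan_in-input OR gates that combines m signals."""
    levels, reach = 0, 1
    while reach < m:
        reach *= fan_in
        levels += 1
    return levels


def checker_delay(m, fan_in, t_xor, t_or):
    """One XOR level plus the OR-tree levels (delays are caller-supplied)."""
    return t_xor + or_tree_levels(m, fan_in) * t_or


def mismatch(word_a, word_b):
    """Functional model of the checker: 1 if any bit of the two words differs."""
    return int((word_a ^ word_b) != 0)


for n in (2, 4, 13, 163):
    print(n, or_tree_levels(163, n))       # 8, 4, 2, and 1 levels, respectively
```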
Moreover, if after designing the equality checker, $t_c$ is found to be larger than the other parameters that determine the clock rate of the pipeline, and the resulting clock rate is slower than what is desirable, then one can implement the equality checker in a pipelined manner. In other words, the equality checker can be divided into two or more stages such that the propagation delays of its stages become smaller than the desired clock period. Clearly, this approach results in a larger latency in terms of the required clock cycles.
ERROR-DETECTION CAPABILITY
In this section, the capabilities of the error-detection schemes discussed earlier are evaluated. With regard to the duration of faults, we consider two categories of faults in our simulations as follows:
• Transient faults: These faults, if they occur, are assumed to affect only one of the two runs.
• Permanent (or intermittent) faults: These faults, if they occur, affect both runs.
The CED scheme can detect all transient faults, because these faults either make the output erroneous or are masked. In the first case, the results of the first and the second runs are different: one of the runs is fault-free, so its result is correct, while the other is erroneous. Hence, the error due to the faults is detected. In the second case, the results of both runs are identical and correct.
For permanent (or intermittent) faults, the possible cases are as follows:
1. The faults are masked in both runs. Hence, the results of both runs are correct.
2. The faults are masked in one of the runs. Hence, the result of one of the runs is correct.
3. The faults are not masked. Hence, the results of both runs are wrong.
   a. The wrong results are different.
   b. The wrong results are identical.
The proposed CED scheme fails to detect only case 3b. To show that this case is not frequent, we have performed a number of simulation-based fault injections on the PB, DB, ONB1, and ONB2 multipliers presented in Section 5. Fault injections were performed in a C model of the multiplier. We injected stuck-at faults (both stuck-at 1 and stuck-at 0) at the input and output pins of the gates of the multiplier. In the proposed scheme, the same faults are injected at the same locations of the circuit in both runs. Fault injection in a complete multiplier with a field degree of approximately 163 is extremely time-consuming. Therefore, faults were injected in only one randomly chosen row of cells of the semisystolic multipliers. Hereafter, the percentage ratio of the number of detected and masked faults over the number of all online injected faults is referred to as the percentage of error detection. We performed the fault injections in two phases as discussed below.
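The case analysis and the definition of the percentage of error detection can be sketched in Python as follows. The function names and the string labels are illustrative assumptions. The results passed in are assumed to be the decoded outputs of the two runs, compared against a fault-free reference that is available only in simulation.

```python
from collections import Counter


def classify(correct, run1, run2):
    """Classify a permanent-fault outcome into case 1, 2, 3a, or 3b."""
    ok1 = run1 == correct
    ok2 = run2 == correct
    if ok1 and ok2:
        return "1"    # masked in both runs: results correct
    if ok1 != ok2:
        return "2"    # masked in one run: results differ, so detected
    # Both runs are wrong.
    return "3a" if run1 != run2 else "3b"


def detection_percentage(outcomes):
    """Percentage of detected and masked faults over all injected faults.

    Only case 3b (identical wrong results) counts as undetected.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no faults were injected")
    return 100.0 * (total - counts["3b"]) / total


if __name__ == "__main__":
    outcomes = [classify(4, 4, 4), classify(4, 4, 11),
                classify(4, 9, 11), classify(4, 11, 11)]
    print(outcomes, detection_percentage(outcomes))
```

In this sketch, cases 1, 2, and 3a all count toward the percentage of error detection, which matches the definition above in which masked faults are grouped with detected ones.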
Single-Bit Stuck-At Faults
In this experiment, only a single one-bit stuck-at fault was injected during the multiplication. As mentioned earlier, the location of a fault can be at the input and/or output pins of gates. Hence, to perform fault injection, a multiplexer can be placed at the fault location, where the control signal of the multiplexer selects between the original value at that point and the fault value. Moreover, the fault value can be chosen to be 1 or 0.
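The multiplexer-based injection can be modeled in software in a few lines. The Python sketch below is a minimal illustration, not the C model used in the experiments: the cell structure (one AND gate feeding one XOR gate) and the fault-site names are hypothetical and chosen only to show the mechanism.

```python
def inject(value, fault):
    """2:1 multiplexer at a fault site.

    fault is None (pass the original value) or 0/1 (stuck-at value).
    """
    return value if fault is None else fault


def faulty_cell(a, b, c_in, faults=None):
    """Hypothetical cell computing c_in XOR (a AND b) with fault sites.

    faults maps a site name to a stuck-at value. The AND output is not
    a separate site because it is the same net as the XOR input
    'xor_in_p'.
    """
    faults = faults or {}
    a = inject(a, faults.get("and_in_a"))
    b = inject(b, faults.get("and_in_b"))
    p = inject(a & b, faults.get("xor_in_p"))
    c = inject(c_in, faults.get("xor_in_c"))
    return inject(p ^ c, faults.get("xor_out"))


if __name__ == "__main__":
    print(faulty_cell(1, 1, 0))                    # fault-free
    print(faulty_cell(1, 1, 0, {"xor_in_p": 0}))   # stuck-at 0 on an XOR input
    print(faulty_cell(1, 1, 0, {"xor_in_p": 1}))   # stuck-at 1, masked here
```

A stuck-at fault changes the output only when the stuck value differs from the fault-free value at that site, which is the masking behavior distinguished in the case analysis of this section.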
The number of faults that can be injected into a multiplier for each set of inputs depends on the number of AND gates and XOR gates of that multiplier (see Tables 4, 5, and 6). It is worth mentioning that the output pins of the AND gates in each cell, which are direct inputs of the XOR gates, were not injected with faults separately, because those XOR pins were already injected. In this experiment, we simulated the fault injection for the PB, DB, ONB1, and ONB2 multipliers. Each multiplier was simulated for one million random input pairs, and for every pair, all the aforementioned single-bit stuck-at faults were injected.
In the following, we give an example of a single stuck-at fault injection in a $GF(2^4)$ ONB1 multiplier. Let $A = x^2 + x^4 = (00101)$ and $B = x + x^2 = (01100)$ be the inputs of the multiplier in ONB1. The fault-free result of the multiplication is $x^2 = (00100)$. We inject a stuck-at 1 fault at the right-hand side input of the XOR gate (see Fig. 7) in the first cell of the second row of the multiplier. In the first computation, according to Fig. 8a, we have
3. Decoding (taking the square root): (01011).
The final results of the first and the second computations are the same and both are incorrect. Therefore, the fault cannot be detected. It is worth mentioning that, as presented in Table 7, we could not find any undetected error for the PB, DB, and ONB2 multipliers in our simulations.
Multiple-Bit Stuck-At Faults
To inject multiple-bit stuck-at faults, the locations for the injections were randomly selected from the aforementioned single-bit fault locations. Then, a stuck-at 0 or stuck-at 1 fault was randomly injected at each selected location. We injected 500 multiple-bit stuck-at faults for each set of inputs. Furthermore, one million random sets of inputs were simulated in each experiment. For example, for a $GF(2^{163})$ PB multiplier, there are 825 single-bit stuck-at fault locations. All simulations for the PB, DB, ONB1, and ONB2 multipliers resulted in the detection of all errors.
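The random selection step can be sketched in Python as follows. This is only an illustration: the text does not state how many locations are faulted simultaneously in one multiple-bit fault, so `num_faults` is left as a free parameter, and the function name and the location indexing are assumptions.

```python
import random


def random_multiple_fault(num_locations, num_faults, rng):
    """Pick distinct fault locations and a random stuck-at value for each.

    num_locations: number of single-bit fault locations (e.g., 825 for
    the GF(2^163) PB multiplier mentioned in the text).
    num_faults: how many locations are faulted at once (assumed parameter).
    Returns a dict mapping location index to stuck-at value (0 or 1).
    """
    locations = rng.sample(range(num_locations), num_faults)
    return {loc: rng.randint(0, 1) for loc in locations}


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed only for reproducibility
    # 500 multiple-bit faults for one set of inputs, each with 3 locations.
    fault_sets = [random_multiple_fault(825, 3, rng) for _ in range(500)]
    print(len(fault_sets), fault_sets[0])
```

Each returned dictionary would then be applied identically in both runs, since permanent faults affect the same locations in the first and the second computation.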
CONCURRENT ERROR CORRECTION
The RESO method can also be used for correcting errors resulting from transient faults that affect fewer than half of the runs of one operation. As stated earlier, we assume that 
