Abstract-The detection of errors in arithmetic operations is an important issue. This paper discusses the detection of multiple-bit errors due to faults in bit-serial and bit-parallel polynomial basis (PB) multipliers over binary extension fields. Our approach is based on multiple parity bits. Experimental results presented here show that as the number of parity bits increases, the area overhead tends to increase linearly, but the probability of error detection approaches unity fairly quickly, e.g., for eight parity bits. In a bit-serial implementation of a $GF(2^{163})$ PB multiplier using eight parity bits, the area overhead and the probability of error detection are 10.29% and 0.996, respectively. This is achieved without any increase in the computation time of the $GF(2^{163})$ multiplier.
I. INTRODUCTION

DIGITAL systems that require a large number of circuits for their implementation can be more prone to produce erroneous results simply because of the increase in the probability that one of the circuits may become faulty while in use. As a result, for sensitive or critical applications, large digital systems are generally designed with some kind of mechanism to provide correct functionality or to detect errors.
In some hardware-based cryptosystems or their arithmetic accelerators, a finite field multiplier can be the component that occupies the most silicon area [1], [2] and, hence, can be subject to hardware faults. Depending on the cryptosystem, errors due to such faults in the multiplier can be detected at an upper level operation, e.g., in elliptic curve cryptography (ECC), if a point leaves the curve it can be easily detected by point verification [3], [4]. This is, however, not always possible. In the case of ECC, a fault may move a point to another point without leaving the curve and this has been exploited in the so-called sign change fault attack [5]. As a result, some kind of mechanism for error detection in the finite field multiplier can be quite important in cryptography as well as in other critical applications where finite field multipliers of various sizes are used, for example, in deep-space channel coding [6] and VLSI testing [7].
Manuscript received April 7, 2006; revised October 20, 2006 and January 5, 2007. A preliminary version of this paper was presented at the 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems [24].
One technique to detect errors in hardware implementations is online testing or concurrent error detection (CED). CED is used to concurrently test a system while the system is operating normally [8]. CED can test the circuit at full operating speed without stopping the system or switching it to test mode. Accordingly, CED can detect transient faults, which may not be detected in offline testing, since they may not occur in test mode (see [9]-[17] for CED examples). This paper focuses on the detection of errors in extension field multipliers. The complexity of multiplication is much higher than that of the field's two basic operations, namely addition and subtraction. Other complex finite field arithmetic operations such as inversion and exponentiation over binary extension fields can be performed by repeated multiplications [18], [19].
In [11], Fenn et al. presented a concurrent error detection scheme for finite field multipliers over binary extension fields. They used a parity bit for detecting errors in bit-serial multipliers, using a number of bases for representation of fields defined by an irreducible all-one polynomial. Thus, the scheme is not generic in the sense that it cannot be used for other field defining polynomials. In [10], Chiou presented a concurrent error detection scheme for two bit-parallel systolic multipliers over extension fields whose field defining polynomials are irreducible all-one polynomials or irreducible equally spaced polynomials. In [14] and [16], Reyhani-Masoleh and Hasan developed a generic parity-based error detection scheme for both bit-serial and bit-parallel polynomial basis multipliers. The scheme can detect any odd number of erroneous bits. In this scheme, the input parity is developed through the multiplier, and the predicted output parity is compared to the actual output parity. If the two parities differ, an error signal is given. This paper extends [14] and [16] by applying multiple parity bits. The concept of multiple parity is already known and used in some other applications [20], [21], but this is the first time it is being used for finite field multiplication. Like [16], our work can be applied to any finite field $GF(2^m)$. However, unlike [16], our work can detect all odd parity errors as well as most of the even parity errors. Additionally, our work can detect at least $m$ multiple-bit errors in the $GF(2^m)$ multiplier.
The main contributions of this paper are summarized as follows.
• A multiple parity scheme that can detect multiple-bit errors in both bit-serial and bit-parallel polynomial basis multipliers over binary extension fields is presented. The error detection capability of the scheme in the presence of multiple-bit random errors is also investigated. With our proposed frequency of check points, a maximum of one multiple-bit error in each round of the bit-serial operation (or each row of the bit-parallel operation) can be detected. This implies that in a $GF(2^m)$ polynomial basis multiplier, at least $m$ multiple-bit errors can be detected.
• A number of experimental analyses are presented, including the simulation-based fault-injection evaluation of the scheme and the analyses of the area and time overheads. Our experimental results show that the area overhead tends to increase linearly as the number of parity bits increases but the probability of undetected errors decreases quite quickly. Furthermore, the area overhead for the bit-serial implementation is quite low, e.g., for eight parity bits the area overhead is 10.29% and the error detection probability is 0.996. The area overhead for a bit-parallel implementation of the multiplier is greater than the corresponding bit-serial one, but it is still lower than the conventional dual modular redundant systems.
The average time overhead due to the use of the scheme in bit-parallel implementations is 25%. For bit-serial implementations, time overheads have been observed to be small to negligible.
The organization of the remainder of this paper is as follows. In Section II, some preliminaries about polynomial basis (PB) multiplication are discussed. A concurrent error detection strategy is presented in Section III. In Section IV, the error detection capability of the scheme is investigated. Our experimental results for this scheme are reported in Section V. Finally, Section VI gives a few concluding remarks.
II. PRELIMINARIES
In this section, first PB multiplication is briefly explained. Then three main components for the construction of bit-serial and bit-parallel multipliers are introduced.
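Let $F(z) = z^m + \sum_{i=0}^{m-1} f_i z^i$, with $f_i \in GF(2)$, be an irreducible polynomial over $GF(2)$ of degree $m$. Let $\alpha \in GF(2^m)$ be a root of $F(z)$, i.e., $F(\alpha) = 0$. The polynomial (or canonical) basis is defined as the following set:
$$\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}.$$
Each element $A$ of $GF(2^m)$ can be represented using the PB as $A = \sum_{i=0}^{m-1} a_i \alpha^i$, where $a_i \in GF(2)$. The multiplication of $\alpha$ and an arbitrary element $A$ of $GF(2^m)$ can be represented with respect to the PB as
$$\alpha \cdot A \bmod F(\alpha) = \sum_{i=0}^{m-1} (a_{i-1} + a_{m-1} f_i)\,\alpha^i, \qquad a_{-1} = 0.$$
Hereafter, the module that receives $A$ as input and computes $\alpha \cdot A \bmod F(\alpha)$ is called the $\alpha$-Mul module. Let $C$ be the product of two elements $A$ and $B$ of $GF(2^m)$. Then, the PB representation of $C$ is as follows:
$$C = A \cdot B \bmod F(\alpha) = \sum_{i=0}^{m-1} b_i \cdot X^{(i)} \qquad (1)$$
where $X^{(0)} = A$ and $X^{(i)} = \alpha \cdot X^{(i-1)} \bmod F(\alpha)$ for $1 \le i \le m-1$. In (1), "$\cdot$" is a scalar multiplication, since $b_i \in GF(2)$ and $X^{(i)} \in GF(2^m)$, and "$+$" is a vector addition, since its two operands are elements of $GF(2^m)$. Modules that perform scalar multiplication and vector addition are hereafter referred to as the SM module and the VA module, respectively. These two modules and the $\alpha$-Mul module discussed earlier are the main components of a PB multiplier. In accordance with (1) and using these three main components, bit-serial and bit-parallel PB multipliers can be constructed as shown in Fig. 1 (see [22] for a similar multiplier architecture).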
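As an illustration of how (1) maps onto the three main components, the following is a minimal Python sketch of a bit-serial PB multiplication. It assumes that a field element is stored as an integer whose bit $i$ is the coefficient of $\alpha^i$ and that the field defining polynomial is passed as an integer that includes its leading coefficient. The function names and the small example field $GF(2^4)$ with $F(z) = z^4 + z + 1$ are illustrative choices, not taken from the paper.

```python
def alpha_mul(x: int, f: int, m: int) -> int:
    """alpha-Mul step: multiply x by alpha and reduce modulo F(alpha)."""
    x <<= 1
    if (x >> m) & 1:      # a coefficient of alpha^m appeared
        x ^= f            # replace alpha^m by the lower terms of F
    return x


def scalar_mul(bit: int, x: int) -> int:
    """SM step: multiply a field element by an element of GF(2)."""
    return x if bit else 0


def vector_add(x: int, y: int) -> int:
    """VA step: add two field elements (bitwise XOR)."""
    return x ^ y


def pb_multiply(a: int, b: int, f: int, m: int) -> int:
    """Bit-serial PB multiplication: one SM, VA and alpha-Mul step per round."""
    c = 0
    x = a                                  # x holds A * alpha^i in round i
    for i in range(m):
        c = vector_add(c, scalar_mul((b >> i) & 1, x))
        x = alpha_mul(x, f, m)
    return c


if __name__ == "__main__":
    F = 0b10011   # z^4 + z + 1 (example field, an assumption of this sketch)
    M = 4
    # alpha^3 * alpha = alpha^4 = alpha + 1 in this field
    print(bin(pb_multiply(0b1000, 0b0010, F, M)))
```

A bit-parallel multiplier evaluates the same recurrence with the loop unrolled into rows of combinational logic.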
III. CONCURRENT ERROR DETECTION STRATEGY
In this section, an error detection scheme for PB multipliers is presented. Errors may be caused by different types of faults such as open faults, short (bridging) faults, and/or stuck-at faults. Furthermore, the faults can be transient or permanent. The goal of this scheme is to detect as many random errors as possible, including single and multiple errors. Towards this goal, we use a parity-based method. One-bit parity is able to detect the presence of any odd number of erroneous bits [23]. Here, we use additional parity bits in order to increase the error detection capability. In particular, an $m$-bit input is divided into $k$ parts and for each part one parity bit is used. Thus, the $m$-bit PB representation of $A$ is divided as follows:
$$A = (A_0, A_1, \ldots, A_{k-1}).$$
The length of , , is if otherwise.
For the sake of simplicity, we assume that $k$ divides $m$ and the length of each part is $l = m/k$, i.e.,
$$A_j = (a_{jl}, a_{jl+1}, \ldots, a_{jl+l-1}), \qquad 0 \le j \le k-1.$$
The parity of $A_j$ is denoted as $p_{A_j}$. Using the parity bits of the $A_j$'s, a $k$-bit parity of $A$ is formed as follows:
$$P_A = (p_{A_0}, p_{A_1}, \ldots, p_{A_{k-1}}).$$
Then, using the parity $P_A$, we construct the encoded $A$ as follows:
$$(A, P_A) = (a_0, a_1, \ldots, a_{m-1}, p_{A_0}, p_{A_1}, \ldots, p_{A_{k-1}}).$$
Unlike $A$, which is represented with $m$ bits, the field defining irreducible polynomial $F(z)$ requires $m+1$ bits. In order to have the same length for partitioning, we exclude the leading coefficient of $F(z)$ and divide $F$ into $k$ parts as follows:
$$F = (F_0, F_1, \ldots, F_{k-1}), \qquad F_j = (f_{jl}, f_{jl+1}, \ldots, f_{jl+l-1}).$$
The parity bit of $F_j$, $0 \le j \le k-1$, is denoted as $p_{F_j}$. One of the important issues in detecting errors in the output of a finite field multiplier (or an arbitrary circuit, in general) is parity prediction. The latter refers to the task of determining the parity of the expected outputs by using the corresponding inputs as well as the functionality of the circuit. As mentioned in Section II, a PB multiplier consists of three modules: 1) the $\alpha$-Mul module; 2) the SM module; and 3) the VA module. In the following, the parity prediction method for each of these modules will be discussed.
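A. Multiple Parity Prediction in $\alpha$-Mul Module
In the following, the output parity of an $\alpha$-Mul module is predicted.
Let $A' = \alpha \cdot A \bmod F(\alpha)$, i.e., $\alpha \cdot A$ must be reduced by $F(\alpha)$ as follows:
$$A' = \sum_{i=0}^{m-1} a_i \alpha^{i+1} \bmod F(\alpha) = a_{m-1} \sum_{i=0}^{m-1} f_i \alpha^i + \sum_{i=1}^{m-1} a_{i-1} \alpha^i.$$
Now, we group the expression and obtain
$$A' = \sum_{i=0}^{m-1} (a_{i-1} + a_{m-1} f_i)\,\alpha^i.$$
Thus, the $j$th part of $A'$ for $0 \le j \le k-1$ can be derived as
$$A'_j = (a_{jl-1} + a_{m-1} f_{jl},\; a_{jl} + a_{m-1} f_{jl+1},\; \ldots,\; a_{jl+l-2} + a_{m-1} f_{jl+l-1}) \qquad (2)$$
where $a_{-1} = 0$. Fig. 2 shows a circuit diagram implementing $A'_j$. In practice, many coefficients of $F$ are zero and, hence, the corresponding XOR gates in Fig. 2 are not needed. By cascading $k$ copies of the circuit shown in Fig. 2, an $\alpha$-Mul module can be constructed as illustrated in Fig. 3.
Let $\omega$ be the Hamming weight of $F(z)$. The total number of two-input XOR gates required in an $\alpha$-Mul module is $\omega - 2$, since no XOR gate is needed for the first and the last coefficients of $F(z)$. For parity prediction of the $j$th part of the $\alpha$-Mul module, we have the following lemma, where $0 \le j \le k-1$ and $a_{-1} = 0$.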
Lemma 1: Let $p_{A_j}$ and $\hat{p}_{A'_j}$ be the parities of the input and the expected output of the $j$th part of the $\alpha$-Mul module, respectively. Then
$$\hat{p}_{A'_j} = p_{A_j} + a_{jl-1} + a_{jl+l-1} + a_{m-1}\, p_{F_j}.$$
Proof: Using (2) the proof is immediate.
Fig. 4 shows the parity prediction circuit of the $j$th part of the $\alpha$-Mul module, where $\hat{p}_{A'_j}$ is the predicted parity of $A'_j$. The parity of the $j$th part of $F$ is $p_{F_j}$ and is assumed to be known, since it can be precomputed. Thus, the corresponding AND gate is not really required. On the other hand, $F(z)$ can be a trinomial or a pentanomial and usually it can be chosen so that the parities of all parts become zero, i.e., $p_{F_j} = 0$ for $0 \le j \le k-1$. In this case, the value of $a_{m-1}$ is not important and one XOR gate is removed. In the worst case, the circuit of Fig. 4 can be implemented with three two-input XOR gates. The total number of two-input XOR gates for the whole parity prediction circuit is at most $3k$. Hereafter, an $\alpha$-Mul module together with its parity prediction circuit (PPC) is referred to as the $\alpha$-Mul-P module. It should be mentioned that a different partitioning of $A$ and $F$ can change the parity prediction circuit of the $\alpha$-Mul module. Appendix I presents a partitioning of $A$ and $F$ that reduces the number of XOR gates of each parity prediction circuit by two, i.e., each PPC can be constructed with only one XOR gate.
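The parity prediction of Lemma 1 can be checked exhaustively on a small field. The sketch below assumes that the $j$th part holds the $l = m/k$ consecutive coefficients $a_{jl}, \ldots, a_{jl+l-1}$ and that the predicted parity of the $j$th output part is $p_j + a_{jl-1} + a_{jl+l-1} + a_{m-1} p_{F_j}$ with $a_{-1} = 0$, which is what follows from multiplying by $\alpha$ and reducing. The helper names and the example field $GF(2^6)$ with $F(z) = z^6 + z + 1$ and $k = 3$ are hypothetical choices for illustration.

```python
def part_parities(x: int, m: int, k: int) -> list[int]:
    """Parity of each of the k contiguous l-bit parts of an m-bit value."""
    l = m // k
    mask = (1 << l) - 1
    return [bin((x >> (j * l)) & mask).count("1") & 1 for j in range(k)]


def alpha_mul(x: int, f: int, m: int) -> int:
    """Multiply x by alpha and reduce modulo F (f includes the leading bit)."""
    x <<= 1
    if (x >> m) & 1:
        x ^= f
    return x


def predict_alpha_mul_parities(a: int, p: list[int], f: int, m: int, k: int) -> list[int]:
    """Predict the part parities of alpha*A from A and its part parities."""
    l = m // k
    pf = part_parities(f & ((1 << m) - 1), m, k)   # F without its leading coefficient
    a_top = (a >> (m - 1)) & 1                     # a_{m-1}
    predicted = []
    for j in range(k):
        a_in = (a >> (j * l - 1)) & 1 if j > 0 else 0   # a_{jl-1}, with a_{-1} = 0
        a_out = (a >> (j * l + l - 1)) & 1              # a_{jl+l-1}
        predicted.append(p[j] ^ a_in ^ a_out ^ (a_top & pf[j]))
    return predicted


if __name__ == "__main__":
    M, K, F = 6, 3, 0b1000011      # z^6 + z + 1, three parity bits (example values)
    ok = all(
        predict_alpha_mul_parities(a, part_parities(a, M, K), F, M, K)
        == part_parities(alpha_mul(a, F, M), M, K)
        for a in range(1 << M)
    )
    print("prediction matches for all inputs:", ok)
```

In hardware the three XOR operations per part correspond to the worst-case gate count mentioned above; when $p_{F_j} = 0$ the last term vanishes.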
B. Parity Prediction in Scalar Multiplication and Vector Addition Modules
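In this work, scalar multiplication refers to multiplication of an element of $GF(2^m)$ by an element of $GF(2)$ and vector addition refers to addition of two elements of $GF(2^m)$. For $b \in GF(2)$ and $A \in GF(2^m)$, scalar multiplication of $b$ and $A$ is $b \cdot A = \sum_{i=0}^{m-1} (b\, a_i)\,\alpha^i$. Thus
$$\hat{p}_{(b \cdot A)_j} = b \cdot p_{A_j}, \qquad 0 \le j \le k-1. \qquad (3)$$
For $A, B \in GF(2^m)$, vector addition of $A$ and $B$ is $A + B = \sum_{i=0}^{m-1} (a_i + b_i)\,\alpha^i$. Thus
$$\hat{p}_{(A + B)_j} = p_{A_j} + p_{B_j}, \qquad 0 \le j \le k-1. \qquad (4)$$
The circuits for parity prediction, as defined in (3) and (4), are shown in Fig. 5, where they need $k$ two-input AND gates and $k$ two-input XOR gates, respectively. These circuits for parity bits are now included with the SM and the VA modules appropriately and the resulting new modules are hereafter referred to as SM-P and VA-P.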
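The two prediction rules in (3) and (4) are simple enough to state directly in code. The following Python sketch assumes contiguous parts of equal length $m/k$ and models SM-P as an AND of the scalar bit with every parity bit and VA-P as a bitwise XOR of the two parity vectors. The function names `sm_p` and `va_p` are hypothetical labels for this sketch.

```python
def part_parities(x: int, m: int, k: int) -> list[int]:
    """Parity of each of the k contiguous (m/k)-bit parts of an m-bit value."""
    l = m // k
    mask = (1 << l) - 1
    return [bin((x >> (j * l)) & mask).count("1") & 1 for j in range(k)]


def sm_p(bit: int, x: int, px: list[int]) -> tuple[int, list[int]]:
    """Scalar multiplication with parity prediction: one AND per parity bit."""
    value = x if bit else 0
    parity = [bit & p for p in px]
    return value, parity


def va_p(x: int, px: list[int], y: int, py: list[int]) -> tuple[int, list[int]]:
    """Vector addition with parity prediction: one XOR per parity bit."""
    return x ^ y, [u ^ v for u, v in zip(px, py)]


if __name__ == "__main__":
    M, K = 6, 3    # example sizes, chosen only for illustration
    a, b = 0b101101, 0b011001
    pa, pb = part_parities(a, M, K), part_parities(b, M, K)
    s, ps = va_p(a, pa, b, pb)
    print("VA-P prediction correct:", ps == part_parities(s, M, K))
    t, pt = sm_p(1, a, pa)
    print("SM-P prediction correct:", pt == part_parities(t, M, K))
```

Both rules are linear over $GF(2)$, which is why no information about the individual coefficients is needed beyond the input parities.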
C. Parity Checking Circuit
In order to detect errors in the multiple parity scheme, the predicted parity bits should be compared with the corresponding actual parity bits. Actual parity bits are generated by a parity generating circuit. Fig. 6 shows the parity generator and the parity checker.
In Fig. 6, $Y$ and $\tilde{Y}$ can be considered as the expected and the actual outputs of one of the three modules discussed earlier. $\hat{P}_Y$ and $P_{\tilde{Y}}$ are the $k$-bit parities of $Y$ and $\tilde{Y}$, respectively. The results of the bit-by-bit comparison of $\hat{P}_Y$ and $P_{\tilde{Y}}$ are ORed to signal any difference, which indicates an error. The parity generator is constructed from $k$ XOR trees, which together contain $m - k$ two-input XOR gates. Furthermore, $k$ two-input XOR gates are required for the comparison. The total numbers of two-input XOR and OR gates required for a parity checker are $m$ and $k - 1$, respectively.
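The behavior of the parity generator and checker can be sketched as follows. This is a minimal Python model, assuming contiguous parts of equal length $m/k$; the names `parity_generator` and `parity_checker` and the 8-bit example are illustrative assumptions. The example also shows the limitation discussed later in the paper: an error that flips an even number of bits inside every affected part leaves all parities unchanged.

```python
def parity_generator(value: int, m: int, k: int) -> list[int]:
    """One XOR tree per part: the actual parity of each (m/k)-bit part."""
    l = m // k
    mask = (1 << l) - 1
    return [bin((value >> (j * l)) & mask).count("1") & 1 for j in range(k)]


def parity_checker(value: int, predicted: list[int], m: int, k: int) -> bool:
    """Compare actual and predicted parities bit by bit and OR the results."""
    actual = parity_generator(value, m, k)
    return any(a ^ p for a, p in zip(actual, predicted))


if __name__ == "__main__":
    M, K = 8, 4                      # four 2-bit parts (example sizes)
    expected = 0b10110100
    predicted = parity_generator(expected, M, K)
    print(parity_checker(expected, predicted, M, K))             # no error
    print(parity_checker(expected ^ 0b00000001, predicted, M, K))  # one flipped bit
    print(parity_checker(expected ^ 0b00000011, predicted, M, K))  # two bits, same part
    print(parity_checker(expected ^ 0b00000101, predicted, M, K))  # two bits, different parts
```

With more parity bits the parts become shorter, so fewer error patterns keep every part at even parity.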
D. PB Multiplier With CED
To construct a bit-serial and a bit-parallel multiplier with concurrent error detection capability, we use the PPC-embedded modules $\alpha$-Mul-P, SM-P, and VA-P. Fig. 7 shows a bit-serial multiplier with PPC. $A$ and $B$ are the inputs of the multiplier. Register $X$ is initialized with $A$ and its $k$-bit parity $P_A$. A parity checker can be placed at each of the three locations: L1, L2, and L3. In Section IV, the frequency of check points will be discussed. Fig. 8 shows a bit-parallel multiplier with PPC. In the bit-parallel multiplier, a parity checker can be placed after each module. Thus, there can be as many error checkers as there are modules in a bit-parallel multiplier.
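To make the data flow of a bit-serial multiplier with CED concrete, the following Python sketch simulates one possible arrangement. It is a simplified model under stated assumptions, not the circuit of Fig. 7: contiguous parts of equal length, a single checker on the accumulator after the VA-P step (assumed here to play the role of location L3), and a fault model in which an error vector is XORed into the value held in the operand register while its stored parity bits stay intact. All function names and the example field $GF(2^6)$ with $F(z) = z^6 + z + 1$ are hypothetical.

```python
def part_parities(x: int, m: int, k: int) -> list[int]:
    l = m // k
    mask = (1 << l) - 1
    return [bin((x >> (j * l)) & mask).count("1") & 1 for j in range(k)]


def alpha_mul_p(x: int, px: list[int], f: int, m: int, k: int):
    """alpha-Mul with parity prediction from the current value and parities."""
    l = m // k
    pf = part_parities(f & ((1 << m) - 1), m, k)
    top = (x >> (m - 1)) & 1
    new_px = []
    for j in range(k):
        a_in = (x >> (j * l - 1)) & 1 if j else 0
        a_out = (x >> (j * l + l - 1)) & 1
        new_px.append(px[j] ^ a_in ^ a_out ^ (top & pf[j]))
    x <<= 1
    if (x >> m) & 1:
        x ^= f
    return x, new_px


def multiply_with_ced(a, b, f, m, k, fault_round=None, error=0):
    """Return (product, error_detected) for a bit-serial run of m rounds."""
    x, px = a, part_parities(a, m, k)      # operand register and its parity
    c, pc = 0, [0] * k                     # accumulator and its predicted parity
    detected = False
    for i in range(m):
        if i == fault_round:
            x ^= error                     # fault hits the register value only
        bit = (b >> i) & 1
        s, ps = (x, px) if bit else (0, [0] * k)            # SM-P
        c, pc = c ^ s, [u ^ v for u, v in zip(pc, ps)]      # VA-P
        if part_parities(c, m, k) != pc:                    # parity checker
            detected = True
        x, px = alpha_mul_p(x, px, f, m, k)                 # alpha-Mul-P
    return c, detected


if __name__ == "__main__":
    M, K, F = 6, 3, 0b1000011
    print(multiply_with_ced(0b100000, 0b000010, F, M, K))
    print(multiply_with_ced(0b100000, 0b000110, F, M, K, fault_round=1, error=0b000100))
```

In this model a single-bit register error is either flagged in a later round or never reaches the accumulator, in which case the product is still correct.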
IV. ERROR DETECTION CAPABILITY
In this section, first, the error model is explained. Then, the probability of error detection at the output of the circuit using the multiple parity method is determined. Finally, the frequency of the check points is discussed.
A. Error Modelling
The effect of a fault, such as a transient fault, in one location of the multiplier circuit is modelled by XORing an error vector with the expected correct "value" of that location. The $i$th bit of the error vector of a location being one implies that the $i$th bit of the value of the location has changed from 0 to 1 or vice versa due to a fault. If the location is one of the main components ($\alpha$-Mul-P, SM-P, or VA-P), without loss of generality, we can assume that the error vector is XORed with the output of the component. It is worth mentioning that the parity prediction circuits (PPCs), parity generators, and parity checkers should be fault-free or at least self-checking [8]. Since in practice the number of parity bits $k$ is much less than the size $m$ of the input operands of the multiplier, the self-checking technique is feasible. In this paper, these circuits are assumed to be fault-free or self-checking. It will be shown in Section V that with a moderate number of parity bits the probability of error detection becomes quite close to unity. As an example, for $GF(2^{163})$, with eight parity bits, the error detection probability is approximately 0.996. Let $E = (e_0, e_1, \ldots, e_{m+k-1})$ be the representation of an error of a location in the multiplier. The first $m$ bits of $E$ correspond to errors in an element, say $A \in GF(2^m)$, that is part of the value of that location. The remaining $k$ bits of $E$ correspond to errors in the $k$-bit parity vector $P_A$. Note that although we assume the parity prediction and the parity checking circuits to be fault-free or self-checking, an error may occur in the parity bits anywhere in the remainder of the multiplier circuit, such as the registers in the bit-serial implementation of the multiplier or the wires through which the parity signals propagate.
If one assumes otherwise, i.e., the parity bits/signals are error free, then all registers and wires through which these signals travel have to be fault-free, even though some of these registers and wires are not part of the parity prediction and checking circuits.
Since $E$ is an $(m+k)$-tuple vector and the all-zero $E$ corresponds to no error, the number of possible errors is $2^{m+k} - 1$. We logically divide $E$ into $k$ parts each of length $l + 1$ bits, where the $j$th part is
$$E_j = (e_{jl}, e_{jl+1}, \ldots, e_{jl+l-1}, e_{m+j}), \qquad 0 \le j \le k-1.$$
In the following, we investigate which kind of errors cannot be detected by the $k$-bit parity scheme.
B. Probability of Error Detection
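Let $E$ be an odd parity error, i.e., the number of 1's in $E$ is odd. Then the parity of at least one of the partitions is odd. Therefore, $E$ can be detected by the proposed CED method and the probability of undetected error is zero. Let $E$ be a nonzero even parity error. Since each partition is $l + 1 \ge 2$ bits long, there is at least one such error $E$ whose partitions all have even parity. Such an error cannot be detected. Accordingly, the probability of undetected errors is nonzero.
Theorem 1: Let $k$ be the number of parity bits of the scheme. Suppose $q$ is the probability that $e_i = 1$ for $0 \le i \le m+k-1$. The probability of error detection is given as follows:
$$P_d = 1 - \left(\frac{1 + (1 - 2q)^{l+1}}{2}\right)^{k} + (1 - q)^{m+k}. \qquad (5)$$
Proof: $P_d = 1 - P_u$, where $P_u$ is the probability of undetected errors. As mentioned, all nonzero errors with even parity in all of their partitions are undetectable. Thus, considering that error vectors are $(m+k)$ bits long and each of them has $k$ partitions, we first need to compute the probability that an $n$-bit number has even parity. Let $P_e(n)$ and $P_o(n)$ be the probabilities that an $n$-bit number has even and odd parity, respectively. Thus, $P_o(n) = 1 - P_e(n)$. Moreover, let $p$ be the probability that a bit of the error vector is zero, i.e., $p = 1 - q$. We proceed in a recursive manner:
$$P_e(n) = p\,P_e(n-1) + q\,P_o(n-1) = (p - q)\,P_e(n-1) + q.$$
With $P_e(0) = 1$, unrolling the recursion gives the closed formula
$$P_e(n) = (p - q)^n + q \sum_{i=0}^{n-1} (p - q)^i.$$
Now, we write the expression only in terms of $q$:
$$P_e(n) = \frac{1 + (1 - 2q)^n}{2}.$$
The probability that an $(l+1)$-bit partition of the error vector has even parity is $P_e(l+1)$. Moreover, the partitions are independent. Thus, the probability of having a vector with even parity in each of its partitions is $P_e(l+1)^k$ or
$$\left(\frac{1 + (1 - 2q)^{l+1}}{2}\right)^{k}.$$
However, the zero vector should be excluded and hence
$$P_u = \left(\frac{1 + (1 - 2q)^{l+1}}{2}\right)^{k} - (1 - q)^{m+k}.$$
As a result
$$P_d = 1 - P_u = 1 - \left(\frac{1 + (1 - 2q)^{l+1}}{2}\right)^{k} + (1 - q)^{m+k}.$$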
As mentioned, $q$ is the probability of an error vector bit being one. A reduction of $q$ increases the probability of having an all-zero error vector. This reduction means a reduction in the probability of (nonzero) errors, which in turn means a reduction in the probability of undetectable errors. Thus, with a reduction in $q$, the probability of error detection increases.
As can be determined from (5), as the number of parity bits increases, the probability of error detection quickly approaches unity, reaching 0.996 for eight parity bits.
C. Frequency of the Check Points
Suppose that there are several multiple-bit errors in a location of the circuit of a PB multiplier. For having an error detection capability as given in Theorem 1, each of the previously mentioned locations in Section III-D should have a parity checker. This causes a very high area overhead especially for bit-parallel multipliers. The following lemma helps us reduce the number of checkers considerably.
Lemma 2:
Suppose only a maximum of one multiple-bit error occurs per round of a bit-serial multiplier or per row of a bit-parallel multiplier (see Figs. 7 and 8). Then any such error can be detected with the probability $P_d$, given in Section IV-B, using a parity checker at L3 of the bit-serial multiplier or a parity checker before the vertical input of every VA-P and one parity checker after the final VA-P in the bit-parallel multiplier.
Proof: It should be verified whether a detectable error vector can be changed to an undetectable one after passing through a main component and before reaching one of the check points.
If a detectable error vector passes through an $\alpha$-Mul-P module, it can be changed to an undetectable one. However, the check points are located so that any error vector can reach one of the check points without passing through any $\alpha$-Mul-P module. Therefore, one of the following cases should be considered: 1) a detectable error vector passes through an SM-P module; 2) a detectable error vector passes through a VA-P module; or 3) both.
In the first case, if $b_i = 0$ then regardless of the other input value, the value of the output vector and parity are zero. This is a correct result and there is no error anymore. If $b_i = 1$, then the input and the output of the SM-P module are equal. Hence, the error vector passes SM-P without any change.
In the second case, if only one of the two inputs of VA-P module has erroneous bits, the error vector can pass the VA-P module without any change. Since a maximum of one multiple-bit error is allowed in a round of a bit-serial multiplier or in a row of a bit-parallel multiplier, only one of the inputs of VA-P can be erroneous.
In the third case, the error must occur before an SM-P module but after the $\alpha$-Mul-P module (in the corresponding row of a bit-parallel multiplier). Therefore, according to case 1 and case 2, it passes the SM-P and VA-P modules and reaches the parity checker.
V. RESULTS
Important performance measures for an error detection scheme include error detection capability, area, and time overheads. In this section, the results of our studies on these measures are presented. The results can guide the choice of a proper number of parity bits for design requirements.
A. Simulation-Based Fault Injection
We have injected stuck-at faults into a $GF(2^{163})$ PB multiplier to evaluate the error detection capability of the proposed scheme. The fault injection was performed in a C model of the multiplier. Furthermore, the fault injection was at the gate level, i.e., stuck-at faults (both stuck-at 1 and stuck-at 0) were injected at the input and output pins of the gates of the multiplier. In the proposed scheme, a checker is placed at the end of a round of a bit-serial multiplier (or at the end of the row of a bit-parallel one). Moreover, the scheme can detect an error if the error can be detected in one round of a bit-serial multiplier (or a row of a bit-parallel one). Fault injection in a complete $GF(2^{163})$ multiplier is extremely time consuming. In order to reduce the time for completing experiments, faults were injected in only one round of a bit-serial multiplier (and a row of a bit-parallel one). In the following, two phases of our fault injections are presented.
[Table I: error detection capability of the $GF(2^{163})$ PB multiplier against stuck-at faults in one round of a bit-serial (or one row of a bit-parallel) multiplier.]
[Fig. 9. Fault injection at a gate pin.]
1) Single-Bit Stuck-at Faults:
In this experiment, single-bit stuck-at faults were injected at the input or output pins of gates. As shown in Fig. 9 , to inject a fault at a point, a multiplexer is placed at that point, where the control signal of the multiplexer selects between the original value of that point and the fault. Also, the fault can be chosen to be stuck-at 1 or stuck-at 0.
In a $GF(2^m)$ PB multiplier, there are $\omega - 2$ two-input XOR gates, $m$ two-input AND gates, and $m$ two-input XOR gates in the $\alpha$-Mul, SM, and VA modules, respectively, where $\omega$ is the Hamming weight of the field defining polynomial. Single-bit stuck-at faults are injected at all input and output pins except the output pins of the AND gates of the SM module, because they are direct inputs of the VA module's XOR gates. Therefore, the number of locations for single-bit stuck-at fault injections at a round of a bit-serial (or a row of a bit-parallel) multiplier is $3(\omega - 2) + 2m + 3m = 5m + 3\omega - 6$. Additionally, for each input or output gate pin, two single-bit faults can be injected. Hence, the number of single-bit stuck-at faults that should be injected at a round of a bit-serial (or a row of a bit-parallel) multiplier is $2(5m + 3\omega - 6)$. In this experiment, we simulated the multiplier for one million random inputs and for every input, all the previously mentioned single-bit stuck-at faults were injected. As shown in Table I, all faults were detected.
2) Multiple-Bit Stuck-at Faults: For multiple-bit stuck-at fault injection, the locations of the previously mentioned single-bit faults were randomly selected and a stuck-at 0 or stuck-at 1 was randomly injected there. Furthermore, simulations were performed for one million random inputs and for every input, 500 random multiple-bit stuck-at faults were injected. It is worth mentioning that for a $GF(2^{163})$ multiplier experiment, there are 824 single-bit stuck-at fault locations. Therefore, the number of accesses to those locations, whether or not any fault is injected, is 412 billion, implying that the experiment is very time consuming. As shown in Table I, the error detection capability of the scheme for multiple-bit stuck-at fault injections is 99.61%.
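The figures quoted above (824 fault locations and 412 billion accesses) can be reproduced with a short calculation. The Python sketch below assumes three pins per two-input gate, with the AND-gate outputs of the SM module excluded as described in the text, and a field defining polynomial of Hamming weight 5, as is the case for the NIST pentanomial $z^{163} + z^7 + z^6 + z^3 + 1$. The function name is illustrative.

```python
def fault_locations(m: int, omega: int) -> int:
    """Pins available for single-bit stuck-at injection in one round (or row)."""
    alpha_mul_pins = 3 * (omega - 2)   # omega - 2 XOR gates, two inputs and one output each
    sm_pins = 2 * m                    # m AND gates, inputs only
    va_pins = 3 * m                    # m XOR gates, two inputs and one output each
    return alpha_mul_pins + sm_pins + va_pins


if __name__ == "__main__":
    locations = fault_locations(163, 5)
    single_bit_faults = 2 * locations               # stuck-at 0 and stuck-at 1 per pin
    accesses = locations * 500 * 1_000_000          # 500 fault sets for each of 10^6 inputs
    print(locations, single_bit_faults, accesses)
```

The count grows linearly in $m$, so the per-round injection cost for larger fields can be estimated with the same function.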
B. Time and Area Overheads
We have described the multiple-bit parity scheme in VHDL to obtain a realistic approximation of the area and time overheads.
In order to reduce the number of XOR gates in the multiplier, the field defining polynomial $F(z)$ can be chosen to be a trinomial or a pentanomial such that the parity of $F$ in each partition is zero, i.e., $p_{F_j} = 0$ for $0 \le j \le k-1$. In Section B of the Appendix, the complexity of the parity prediction circuit for the NIST recommended irreducible polynomials for ECDSA is discussed.
We used Modelsim to simulate the design for checking its correct functionality. We implemented the multiple parity scheme on a Xilinx Spartan 3 (XC3S5000) FPGA using Xilinx ISE 7.1i.
1) Bit-Serial PB Multiplication:
The circuit of a complete bit-serial multiplier with CED is shown in Fig. 10. The circuit consists of two major blocks: the PB multiplier with PPC and the checker. The parity generator of the checker is used at the initialization phase to generate the parity of input $A$. Note that no extra clock cycle is needed for the circuit shown in Fig. 10 when compared to a bit-serial PB multiplier without CED.
From the first experiment, we obtained the area overhead percentage of the scheme for multipliers of different field sizes. The number of parity bits for this experiment was chosen to be eight since the probability of error detection was within an acceptable range for our experiment (about 0.996). Furthermore, the defining polynomials of the fields used in the experiment included the NIST recommended irreducible polynomials for ECDSA. Fig. 11 shows the result of the experiment.
As shown in the figure, the area overhead for a fixed number of parity bits tends to decrease as the size of the field increases. The area overhead does not decrease in a strictly monotonic way because the field-programmable gate array (FPGA) compiler used in the experiment optimizes the multiplier for different field sizes differently. The worst area overhead percentage among the fields implemented is for and is still reasonably low, i.e., 12%.
In the second experiment, we implemented the scheme for and using the NIST recommended field defining polynomials for ECDSA and , respectively. Both of these polynomials are quite suitable for implementation because the PPCs of the scheme would be in the simplest form since, in a -bit parity scheme, we have and As shown in Fig. 12 , area overhead cost increases as the number of parity bits increases. For all points in each graph depicted in the figure, a line is fitted as follows:
for overhead of parity bits for overhead of parity bits (6) As expected, according to the first experiment, the slope of the fitted line for is more than that for , i.e., the area overhead increase rate versus parity-bit numbers in is lower. Furthermore, based on the experimental results, area overhead tends to increase linearly except for very small numbers of parity bits.
Note that (6) implies that even if one parity is used for each information bit, circuit overhead is not expected to be more than 100%, which is the overhead for the conventional dual modular redundant (DMR) scheme.
In the second experiment, we also investigated the time overhead of the and PB multipliers for different numbers of parity bits. Since there is no extra clock cycle, the time overhead is equal to the clock period overhead. We obtain the clock periods from the post place and route static timing report of Xilinx ISE. Except for four cases, there was no clock period overhead and, in turn, no time overhead for the bit-serial implementation of the multipliers. These four cases belong to the PB multiplier shown in Table II . According to the table, the time overheads even for these cases are small.
2) Bit-Parallel PB Multiplication: A circuit diagram of a complete bit-parallel PB multiplier with CED is depicted in Fig. 13. The parity checker is very similar to that presented in Fig. 10. As shown in Fig. 13, once the inputs $A$ and $B$ are updated, the results of the multiplication and error detection are ready after a certain amount of delay due to the propagation of various signals through the circuit, where no clocking is used.
For the bit-parallel multiplier, the first experiment was to measure the area overhead percentage of the eight parity-bit scheme for multipliers of different field sizes. The results show that the area overhead decreases as the field size increases (see Fig. 14).
There is a major difference between the structure of bit-serial and bit-parallel PB multipliers and this affects the area overhead considerably. A bit-serial PB multiplier contains simple and shift registers, but a bit-parallel multiplier does not. Basically, registers are relatively area consuming components in FPGAs. Therefore, assuming that one wants to implement a PB multiplier for a field of size , the area (in terms of slices) needed for a bit-parallel PB multiplier without CED is significantly smaller than times the area needed for a bit-serial multiplier. Accordingly, CED overhead on a bit-parallel PB multiplier is much higher than that on a bit-serial one. This fact can be observed easily in the experiments reported in this section. The second experiment was to investigate the area and time overheads' increase rates versus the number of parity bits for the field (see Fig. 15 ). The field defining polynomial is . According to Table III , the bit-parallel implementation is very area consuming; therefore, similar experiments for the field are extremely time consuming and clearly that design does not fit into our current FPGA. However, the area overhead results for higher values of are expected to be better than the result of this experiment as one can infer from Fig. 14 , where the number of parity bits is fixed to eight. Fig. 15 illustrates that as the number of parity bits increases, the area overhead for a bit-parallel implementation increases at a greater rate compared to the bit-serial implementation. However, the area overhead may still be acceptable for some applications. This is because for obtaining a sufficiently high probability of error detection (say 0.996), one needs only about eight parity bits in the proposed scheme and it results in about 50% area overhead, which is better than 100% overhead of the DMR scheme.
In the bit-parallel implementation, the time overhead is the increase in the delay of the critical path, i.e., the maximum propagation delay from one of the input pins to one of the output pins. We obtain the delays from all input pins to output pins from the post place and route static timing report of Xilinx ISE. The time overhead for the bit-parallel implementation of a PB multiplier versus the number of parity bits is given in Fig. 16, which shows that the time overhead is generally less than 25% when more than a couple of parity bits are used.
VI. CONCLUSION
In this paper, a multiple parity error detection scheme is presented for concurrent detection of errors in polynomial basis multipliers. In this scheme, the probability of error detection for random errors is more than 75% and it quickly approaches unity for approximately eight parity bits. The overhead of our implementation tends to increase linearly as the number of parity bits increases. Results show that the area overhead cost of the bit-serial implementation is lower than that for the bit-parallel one. Both implementations have lower area overheads than the traditional dual modular redundant scheme for a sufficient number of parity bits. Additionally, the average time overhead due to the use of the scheme in bit-parallel implementations is around 25%, while for bit-serial implementations time overheads have been observed to be small to negligible. It is hoped that using the results presented in this paper, one will be able to choose an appropriate number of parity bits for specific applications.
APPENDIX A ALTERNATIVE PARTITIONING
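In this section, another partitioning of $A$ and $F$ is presented. The new partitioning reduces the overhead of the parity prediction circuit of the $\alpha$-Mul module.
As mentioned, $A$ is partitioned into $k$ parts. As before, we assume that $m$ is divisible by $k$ and $l = m/k$. The alternative (vertical) partitioning is illustrated as follows:
$$\begin{array}{cccc} A_0 & A_1 & \cdots & A_{k-1} \\ \hline a_0 & a_1 & \cdots & a_{k-1} \\ a_k & a_{k+1} & \cdots & a_{2k-1} \\ \vdots & \vdots & & \vdots \\ a_{(l-1)k} & a_{(l-1)k+1} & \cdots & a_{m-1} \end{array}$$
For $0 \le j \le k-1$, the $j$th partition is
$$A_j = (a_j, a_{j+k}, \ldots, a_{j+(l-1)k}).$$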
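A. Structure of $\alpha$-Mul Module
With the vertical partitioning, the $j$th part of $A' = \alpha \cdot A \bmod F(\alpha)$ is
$$A'_j = (a_{j-1} + a_{m-1} f_j,\; a_{j+k-1} + a_{m-1} f_{j+k},\; \ldots,\; a_{j+(l-1)k-1} + a_{m-1} f_{j+(l-1)k}) \qquad (7)$$
where $a_{-1} = 0$. Fig. 17 shows the $j$th part of the $\alpha$-Mul module. The complete $\alpha$-Mul module is shown in Fig. 18. The number of gates is exactly the same as for the previous $\alpha$-Mul module mentioned in Section III-A, as only the position of the coordinates is changed.
The following lemma discusses parity prediction in the $j$th part of the $\alpha$-Mul module.
Lemma 3: Let $p_{A_j}$ and $p_{F_j}$ be the parities of the $j$th parts of $A$ and $F$, respectively, and let $\hat{p}_{A'_j}$ be the predicted parity of the $j$th part of $A'$. Then $\hat{p}_{A'_j} = p_{A_{j-1}} + a_{m-1}\, p_{F_j}$ for $1 \le j \le k-1$, and $\hat{p}_{A'_0} = p_{A_{k-1}} + a_{m-1}(1 + p_{F_0})$.
Proof: According to (7), we have
$$\hat{p}_{A'_j} = \sum_{i=0}^{l-1} a_{j+ik-1} + a_{m-1}\, p_{F_j}.$$
Therefore, for $1 \le j \le k-1$, the sum is the parity of $A_{j-1}$ and we have $\hat{p}_{A'_j} = p_{A_{j-1}} + a_{m-1}\, p_{F_j}$. For $j = 0$, since $a_{-1} = 0$, the sum equals $p_{A_{k-1}} + a_{m-1}$ and we have $\hat{p}_{A'_0} = p_{A_{k-1}} + a_{m-1}(1 + p_{F_0})$.
The $p_{F_j}$'s can be precomputed. Therefore, the maximum number of gates required for the parity prediction circuit of each part of the $\alpha$-Mul module is one XOR gate. No XOR gate is needed for the parity prediction circuit of a part of the $\alpha$-Mul module in which the coefficient of $a_{m-1}$ is zero, i.e., when $p_{F_j} = 0$ for $1 \le j \le k-1$ or $p_{F_0} = 1$ for $j = 0$.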
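The single-XOR prediction for the vertical partitioning can be checked exhaustively on a small field. The Python sketch below assumes that part $j$ collects the coefficients $a_j, a_{j+k}, a_{j+2k}, \ldots$ and that the same interleaving is applied to $F$ without its leading coefficient. Under that assumption, multiplying by $\alpha$ moves part $j-1$ into part $j$, so the predicted parity is $p_{j-1} + a_{m-1} p_{F_j}$ for $j \ge 1$ and $p_{k-1} + a_{m-1}(1 + p_{F_0})$ for $j = 0$. The names and the example field $GF(2^6)$ with $F(z) = z^6 + z + 1$ are hypothetical.

```python
def vertical_parities(x: int, m: int, k: int) -> list[int]:
    """Parity of part j, where part j holds bits j, j+k, j+2k, ... of x."""
    return [sum((x >> i) & 1 for i in range(j, m, k)) & 1 for j in range(k)]


def alpha_mul(x: int, f: int, m: int) -> int:
    """Multiply x by alpha and reduce modulo F (f includes the leading bit)."""
    x <<= 1
    if (x >> m) & 1:
        x ^= f
    return x


def predict_vertical(a: int, p: list[int], f: int, m: int, k: int) -> list[int]:
    """Predict the vertical part parities of alpha*A with one XOR per part."""
    pf = vertical_parities(f & ((1 << m) - 1), m, k)   # precomputable constants
    top = (a >> (m - 1)) & 1                           # a_{m-1}
    predicted = [p[k - 1] ^ (top & (1 ^ pf[0]))]       # part 0 wraps around from part k-1
    predicted += [p[j - 1] ^ (top & pf[j]) for j in range(1, k)]
    return predicted


if __name__ == "__main__":
    M, K, F = 6, 3, 0b1000011
    ok = all(
        predict_vertical(a, vertical_parities(a, M, K), F, M, K)
        == vertical_parities(alpha_mul(a, F, M), M, K)
        for a in range(1 << M)
    )
    print("vertical prediction matches for all inputs:", ok)
```

Because the constants derived from $F$ are fixed, each AND with `top` either disappears or reduces to a wire in hardware, leaving at most one XOR gate per part.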
B. Comparison of $\alpha$-Mul-P Modules
According to Section V-A, the scheme with eight partitions results in a fairly high probability of error detection for values of $m$ that are of interest for elliptic curve cryptosystems. Therefore, we have divided each of the corresponding NIST recommended irreducible polynomials into eight partitions using our horizontal and vertical partitioning methods. Table IV gives the number of partitions with nonzero parity and the number of required two-input XOR gates for the PPC of the $\alpha$-Mul module along with the NIST recommended irreducible polynomials.
As can be seen in Table IV, the $\alpha$-Mul-P module is more area efficient with the vertical partitioning than with the horizontal partitioning. However, the $\alpha$-Mul-P module is much less resource consuming than either of the SM-P and VA-P modules. Therefore, the overheads resulting from the vertical partitioning are expected to be very similar to those presented in Section V for the horizontal partitioning.
