Abstract-This paper studies the problem of designing a low complexity Concurrent Error Detection (CED) circuit for the complex multiplication function commonly used in Digital Signal Processing circuits. Five novel CED architectures are proposed and their computational complexity, area, and delay evaluated in several circuit implementations. The most efficient architecture proposed reduces the number of gates required by up to 30 percent when compared with a conventional CED architecture based on Dual Modular Redundancy. Compared to a Residue Code CED scheme, the area of the proposed architectures is larger. However, for some of the proposed CEDs delay is significantly lower with reductions exceeding 30 percent in some configurations.
INTRODUCTION
In recent years, the number of soft errors occurring in digital Integrated Circuits (ICs) has been increasing due to reductions in feature sizes and supply voltages [1]. This has prompted renewed research interest in developing methods for detecting the occurrence of soft errors and single faults. Concurrent Error Detection (CED) seeks to identify errors in parallel with calculation of the circuit output [2]. Often, CED solutions employ Dual Modular Redundancy (DMR), whereby two copies of the module to be protected are instantiated and a comparator flags an error when the outputs differ. Typically, DMR doubles the area of the module.
Digital Signal Processing (DSP) modules are common in ICs for communications and multimedia [3]. In the case of DSP, CED techniques often make use of the mathematical properties of, or the relationships between, a module's inputs and outputs to provide efficient solutions [4]. A number of publications have proposed efficient techniques for CED in arithmetic modules. In [5], [6], [7], residue codes were used to detect errors in real multiplications. The technique computes the residues, modulo m, of the multiplier inputs, multiplies these residues, and calculates the residue, modulo m, of the result. The method compares this residue with the conventional multiplication output, modulo m. The choice of the value of m is critical [7], and depends on the implementation of the adders and multipliers to be protected and on the required fault detection coverage. The use of higher values of m increases the capability of the circuit to detect errors, but also increases area and delay overhead.
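The residue check described above can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit of [5], [6], [7]: the function name `residue_check_passes` and the default modulus of 7 are assumptions made for the example.

```python
def residue_check_passes(x: int, y: int, product: int, m: int = 7) -> bool:
    """Return True when `product` is consistent with x * y modulo m.

    The checker only handles small residues: it multiplies the residues of
    the inputs and compares the result with the residue of the main output.
    """
    predicted = ((x % m) * (y % m)) % m
    return predicted == product % m


x, y = 1234, 567
correct = x * y

# Error-free case: the residues agree.
print(residue_check_passes(x, y, correct))             # True

# A single bit flip changes the output by +/- 2**k, which is never a
# multiple of 7, so it is detected.
print(residue_check_passes(x, y, correct ^ (1 << 5)))  # False

# An error that is a multiple of m escapes detection, which is why the
# choice of m matters.
print(residue_check_passes(x, y, correct + 7))         # True
```

The last call illustrates the coverage limitation: any error whose magnitude is a multiple of m leaves the residue unchanged.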
Residue codes have been extended by combining the approach with time redundancy [8] . Parity prediction has also been studied as a method to protect multipliers [9] . The implementation of fault tolerant multipliers has also been studied in the context of security to prevent attacks [10] .
Previous work focuses on multiplication of two real numbers. However, in DSP systems, complex numbers are commonly used. CED schemes optimized for complex multiplication would be beneficial both in terms of circuit cost (area) and circuit speed (delay). In this paper, we investigate CED for complex multiplication, an operation common to many DSP algorithms, based on arithmetic transformations.
Herein, we propose five efficient architectures for CED in complex multiplication and assess their computational complexity, circuit area and delay. To the authors' knowledge, the proposed architectures have not been described previously.
The following section provides background information on implementation of complex multiplication in ICs. Section 3 presents conventional and proposed schemes for CED in complex multiplication. Section 4 presents an evaluation of the proposed techniques in terms of circuit area and delay for various multiplier configurations. Finally, the conclusions of the paper are summarized in Section 5.
BACKGROUND
Complex multiplication of two complex numbers, $a + bj$ and $c + dj$, is defined as:
$$y = (a + bj)(c + dj) = (ac - bd) + (ad + bc)j, \tag{1}$$
where $j$ represents the square root of $-1$. This direct implementation of the operation requires four real multiplications and two additions, for a total of six arithmetic operations. In DSP systems, a more efficient, indirect, implementation is typically used [11]:
$$t_1 = (a + b)(c + d), \quad t_2 = ac, \quad t_3 = bd, \quad y_{real} = t_2 - t_3, \quad y_{imag} = t_1 - t_2 - t_3. \tag{2}$$
This indirect approach, also known as Golub's method, requires only three multiplications and five additions, for a total of eight arithmetic operations. This is advantageous since, in conventional ICs, multiplication is much more area expensive and slower than addition.
The implementation of complex multiplication according to (1) and (2) is illustrated in Figs. 1 and 2. This graphical representation will be useful in the following to present the proposed Concurrent Error Detection checkers. The convention for the minus boxes is that the output is equal to the left input minus the right input. For each checker, the number of subtractions is reported together with the number of additions as they have a similar complexity.
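As a minimal sketch of the two implementations, the following Python functions mirror the direct form (four multiplications, two additions) and the indirect form (three multiplications, five additions). The intermediate names `t1`, `t2`, and `t3` follow the usage later in the text ($t_2 = ac$, $t_3 = bd$); taking $t_1 = (a + b)(c + d)$ is an assumption consistent with the operation counts given above. The function names are hypothetical.

```python
def complex_mul_direct(a, b, c, d):
    """(a + bj)(c + dj) with four real multiplications and two additions."""
    y_real = a * c - b * d
    y_imag = a * d + b * c
    return y_real, y_imag


def complex_mul_indirect(a, b, c, d):
    """Same product with three real multiplications and five additions."""
    t1 = (a + b) * (c + d)   # two additions, one multiplication
    t2 = a * c
    t3 = b * d
    y_real = t2 - t3         # third addition (subtraction)
    y_imag = t1 - t2 - t3    # fourth and fifth additions
    return y_real, y_imag


# Both forms agree on the example values used later in the paper.
print(complex_mul_direct(5, 3, 8, 9))    # (13, 69)
print(complex_mul_indirect(5, 3, 8, 9))  # (13, 69)
```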
CONCURRENT ERROR DETECTION FOR COMPLEX MULTIPLICATIONS
The conventional CED solution for complex multiplication is based on DMR. Using the direct implementation, this approach gives a complexity of eight real multiplications, four real additions, and two real comparisons. Using the indirect implementation, in both the original and redundant modules, leads to an overall complexity of six real multiplications, 10 real additions, and two real comparisons. Herein, we propose five novel checkers. Checker 1 is based on noting that in the error-free case (see (1)):
$$y_{real} + y_{imag} = (ac - bd) + (ad + bc) = (a + b)(c + d) - 2bd. \tag{3}$$
Hence, from (2), we can verify that:
$$y_{real} + y_{imag} = t_1 - 2bd. \tag{4}$$
Based on this, it would seem that $y$ can be checked by simply adding its real and imaginary parts and comparing the result with $t_1 - 2bd$. This would eliminate one multiplication in the checking module since multiplication by $-2$ can be implemented simply with a bit shift and bit flip. However, using this approach, an error in $t_2$ ($= ac$) cannot be detected since it affects $y_{real}$ and $y_{imag}$ in equal and opposite ways, meaning that their sum is unchanged. An additional check must be introduced to detect errors in the calculation of $t_2$. This adds a redundant multiplication, giving a complexity of six multiplications, nine additions, and two comparisons. The scheme is shown in Fig. 3, where the elements of the checker are shaded.
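The blind spot of the sum test can be reproduced with a short Python model. This is a sketch under assumptions, not the circuit of Fig. 3: the indirect form is taken as $t_1 = (a + b)(c + d)$, $t_2 = ac$, $t_3 = bd$, the reference side is simply recomputed in software, and the names `indirect_multiply`, `sum_check_passes`, and `t2_check_passes` are hypothetical.

```python
def indirect_multiply(a, b, c, d, t2_error=0):
    """Indirect complex multiplication with an optional fault injected in t2."""
    t1 = (a + b) * (c + d)
    t2 = a * c + t2_error        # injected fault (0 means error-free)
    t3 = b * d
    return {"t2": t2, "y_real": t2 - t3, "y_imag": t1 - t2 - t3}


def sum_check_passes(a, b, c, d, out):
    """Sum test: y_real + y_imag must equal (a + b)(c + d) - 2bd."""
    reference = (a + b) * (c + d) - 2 * (b * d)
    return out["y_real"] + out["y_imag"] == reference


def t2_check_passes(a, c, out):
    """Additional test: compare t2 with a redundant multiplication a * c."""
    return out["t2"] == a * c


faulty = indirect_multiply(5, 3, 8, 9, t2_error=4)
# The fault adds 4 to y_real and subtracts 4 from y_imag, so the sum is intact.
print(sum_check_passes(5, 3, 8, 9, faulty))  # True: the fault is missed
print(t2_check_passes(5, 8, faulty))         # False: the extra check catches it
```

The cancellation holds for any fault value in `t2`, which is why the sum test alone is insufficient for the indirect implementation.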
Checker 2 is based on noting that in the indirect implementation, calculation of $y_{imag}$ shares all of the steps needed for calculation of $y_{real}$, except for subtraction of $t_3$ ($= bd$) from $t_2$ ($= ac$). Hence, the complex multiplication results can be checked by verifying that:
The scheme reduces the number of multiplications to five and has seven additions and two comparisons. The checker is illustrated in Fig. 4 .
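The equality verified by Checker 2 is not reproduced in this copy of the text. The Python sketch below therefore rests on an assumption: that the checker verifies $y_{imag} = ad + bc$ with two extra multiplications and duplicates the subtraction $t_2 - t_3$, which matches the stated totals of five multiplications, seven additions, and two comparisons. The function name, the fault site labels, and the fault injection mechanism are hypothetical.

```python
def checker2_flags_error(a, b, c, d, fault_site=None, fault_value=1):
    """Model of the indirect multiplier plus an ASSUMED form of Checker 2.

    fault_site names the intermediate value to corrupt ("t1", "t2", "t3",
    "y_real", "y_imag"); None means error-free operation.
    """
    def inject(name, value):
        return value + fault_value if name == fault_site else value

    # Main module: indirect implementation.
    t1 = inject("t1", (a + b) * (c + d))
    t2 = inject("t2", a * c)
    t3 = inject("t3", b * d)
    y_real = inject("y_real", t2 - t3)
    y_imag = inject("y_imag", t1 - t2 - t3)

    # Checker (assumed): two multiplications, one addition, one duplicated
    # subtraction, and two comparisons.
    imag_ok = y_imag == a * d + b * c
    real_ok = y_real == t2 - t3
    return not (imag_ok and real_ok)


print(checker2_flags_error(5, 3, 8, 9))  # False: no fault, no flag
for site in ("t1", "t2", "t3", "y_real", "y_imag"):
    print(site, checker2_flags_error(5, 3, 8, 9, fault_site=site))  # True
```

Under this assumption, a fault in any of the three products disturbs $y_{imag}$, and only the final subtraction for $y_{real}$ needs its own duplicate.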
Checker 3 is based on rearranging (4) and verifying that:
$$y_{real} + y_{imag} + 2bd = (a + b)(c + d). \tag{6}$$
This checker is not efficient when used with the indirect implementation since $bd$ is not available as part of the original module and since $t_2$ must be checked individually (as before, errors in $t_2$ cancel when $y_{real}$ is added to $y_{imag}$). However, the checker can be efficiently implemented using the direct implementation for the original module. This approach has the advantage that $y_{real}$ and $y_{imag}$ are calculated independently and so a single error cannot affect both values. In addition, $bd$ is already available in the original module and does not need to be calculated as part of the checker. The resulting architecture is shown in Fig. 5. The complexity of this approach is five multiplications, six additions, and one comparison.
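A minimal Python model of Checker 3 on top of the direct implementation is sketched below. It assumes the check takes the form $y_{real} + y_{imag} + 2bd = (a + b)(c + d)$, which is a rearrangement of the sum identity and matches the stated cost of one extra multiplication, four extra additions, and one comparison. The function name and the fault site labels are hypothetical.

```python
def checker3_flags_error(a, b, c, d, fault_site=None, fault_value=1):
    """Direct multiplier plus an assumed form of Checker 3.

    fault_site names the product to corrupt ("ac", "bd", "ad", "bc");
    None means error-free operation.
    """
    def inject(name, value):
        return value + fault_value if name == fault_site else value

    # Main module: direct implementation, four independent products.
    ac = inject("ac", a * c)
    bd = inject("bd", b * d)
    ad = inject("ad", a * d)
    bc = inject("bc", b * c)
    y_real = ac - bd
    y_imag = ad + bc

    # Checker: reuses bd from the main module (2 * bd is a shift in
    # hardware) and adds a single multiplication, (a + b) * (c + d).
    lhs = y_real + y_imag + 2 * bd
    rhs = (a + b) * (c + d)
    return lhs != rhs


print(checker3_flags_error(5, 3, 8, 9))  # False: no fault, no flag
for site in ("ac", "bd", "ad", "bc"):
    print(site, checker3_flags_error(5, 3, 8, 9, fault_site=site))  # True
```

Note that a fault in the shared product `bd` is still detected: it changes `y_real` by the negated fault value and `2 * bd` by twice the fault value, so the left-hand side shifts by the fault value rather than cancelling.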
Checker 4 is based on the indirect implementation and checks the following equality:
In this case, the subtraction $(ac) - (bd)$ has to be duplicated, as was the case in Checker 2, as it is not covered in (7). This checker reduces the number of real multiplications required to four. The checker is illustrated in Fig. 6.
Finally, Checker 5 is based on an implementation of complex multiplication that is different from those presented in Section 2. The checker is shown in Fig. 7 . It computes the real and imaginary part as: 
Then, CED can be implemented by checking:
which requires only four multiplications. This method can be seen as an extension of the Karatsuba formula to detect errors in finite field multipliers presented in [12]. Some calculations, for example, $(a + b)(c + d)$, are used in both sides of the checker. However, they have different multiplication weights. For example, the output of $(a + b)(c + d)$ propagates to the Left Hand Side (LHS) of (9) divided by two and to the Right Hand Side (RHS) directly. Therefore, an error in $(a + b)(c + d)$ will cause a mismatch between the LHS and RHS of (9) and an error will be detected. For example, consider the input values: $a = 5$, $b = 3$, $c = 8$, and $d = 9$. We obtain
If an error occurs in $(a + b)$, changing its value, e.g., from 8 to 9 (for erroneous values we use red, bold font), the computation becomes
Hence, the values on the LHS and RHS differ, and the error is detected.
The complexity and delay of the CED schemes are summarized in Tables 1 and 2 in terms of the number of arithmetic operations. As can be seen, Checker 5 has the lowest computational complexity. It is notable that the highest complexity multiplier structure considered leads to the simplest checker and the lowest complexity overall. For the direct implementation, the checker with the lowest complexity is Checker 3. For the indirect implementation, Checker 4 has the lowest cost, with the same number of multiplications as Checker 5. The checker based on DMR of the direct implementation is the fastest solution. All of the proposed schemes incur some delay overhead. In the next section, circuit implementations of the checkers are presented to evaluate their circuit area and circuit delay.
EVALUATION
All five proposed checkers were proven analytically to provide 100 percent accurate fault detection when implemented with full precision with respect to the input data bit width.
To evaluate the circuit complexity and delay of checkers, they were implemented in VHDL and synthesized for the 45 nm OSU FreePDK Standard Cell library [13] using Synopsys Design Compiler. The synthesis was done using independent modules for the original multiplication and the checkers so that the protection logic was not removed by Design Compiler during the optimization process. In a first set of implementations, full precision checking was used for various input bit widths. No truncation or rounding is done such that the outputs have approximately twice the number of bits of the inputs. In a second set of implementations, reduced precision checking is used. In all cases, two options for synthesis were considered: area optimization and speed optimization. The results for the first option provide insight into the minimum circuit area and the second provides insight into the minimum delay.
To compare with other error detection techniques, DMR and Concurrent Error Detection using residue codes were implemented. DMR was considered for both the direct and the indirect implementations and also for reduced precision checking. For residue codes, the scheme was applied by performing the modulo operation at the inputs and outputs of the complex multiplication such that both adders and multipliers are protected. Modulo 7 was used to obtain good error coverage [5] (see footnote 1). Concurrent error detection using residue codes does not allow for truncation, and therefore these schemes were not considered for reduced precision checking. The results for the unprotected implementations are also reported for completeness.
The circuit area, in terms of the number of equivalent NAND2 gates, for the full precision implementations is reported in Table 3. Comparing the direct and indirect unprotected multipliers, it can be observed that the area savings of the indirect implementation over the direct one are small. This is due to the optimizations performed by Synopsys Design Compiler, which works at lower abstraction levels (use of carry-save representation, high radix Booth recoding, conversion from Sum of Product to Product of Sum, etc.) [14]. These optimizations can produce unexpected results with unconventional arithmetic such as that in (2). This has been observed previously in floating-point implementations of complex multipliers, where indirect implementations had no advantage over the direct ones [15]. In our case, the indirect implementation gives better results but with only small savings over the direct implementation.
It can be observed that the DMR versions require more than two times the number of gates of the unprotected implementation, as expected. As with the unprotected version, there is little benefit in using the indirect DMR rather than the direct DMR. Checker 5 has the least area, with reductions of up to 30 percent compared with the most efficient DMR checker. The other checkers also provide significant reductions in the number of gates compared to DMR. The Residue CED scheme provides similar area to that of Checker 5 when used with the direct implementation and significantly less when used with the indirect implementation.
The delay results, in nanoseconds, for the full precision implementations are reported in Table 4. The direct DMR implementation provides the lowest delay, with an overhead of up to 8 percent over the unprotected version. Checker 5 has the highest delay, with an overhead of up to 66 percent over the direct DMR implementation and up to 76 percent over the unprotected version. Other checkers, such as Checker 2, have only a 15 percent delay overhead over the direct DMR implementation and a 21 percent overhead over the unprotected version in the worst case. The Residue CED schemes incur delays that are significantly larger than that of Checker 2 but less than that of Checker 5.
In many DSP systems, it is sufficient to detect errors that have a large magnitude. This has led to the concept of reduced precision redundancy [16]. In our case, reduced precision can be used in the elements that are needed for the CED but not for the main complex multiplication operation to reduce area. These elements are shaded in Figs. 3, 4, 5, 6, and 7. Reduced precision cannot be used with the Residue checkers. Therefore, those schemes are not included in the comparison.
To evaluate the effects of reduced precision on the area and delay, implementations using 16-bit inputs for the main operation and 8-bit inputs for the checker components were evaluated. This ensures that large errors that affect the most significant bits will be detected. The magnitude of the smallest error that will be detected can be calculated by analyzing the quantization in the different checkers. Since the objective is only to study how reduced precision affects the implementation complexity and speed of the different checkers, a full analysis is not included in the paper. However, assuming that the inputs are normalized to 0.5 such that multiplications do not increase the quantization error, the largest undetected error can be bounded. As truncation is only done at the inputs of the checkers, in the worst case the truncation errors add to give a value equal to the number of truncated inputs times the maximum truncation error. If that value is used as the threshold to detect errors, then the maximum undetected error will in the worst case be twice that value (as in that case the error cannot be masked by quantization effects).
Footnote 1: Synthesis was performed by configuring Synopsys Design Compiler to implement parallel prefix structures for the adders and multipliers. For these structures, modulo 7 may not be sufficient to achieve the desired error detection coverage depending on the application. In those cases, larger values of m would be required, with the associated area overhead.
The areas of the reduced precision checker implementations are reported in Table 5. Checker 2 provides the lowest area, with a reduction of 11 percent compared to the smallest DMR checker. The reduced precision Checker 2 architecture provides a 17 percent area saving relative to the smallest full precision architecture (Checker 5). Checker 5 shows the least area saving since all multiplication operations are part of the main complex multiplication operation, as shown in Fig. 7, and must therefore be full precision.
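The threshold argument for reduced precision checking can be illustrated with a small software model of reduced precision DMR on the direct implementation. This is a sketch under stated assumptions, not the synthesized design: inputs are modeled as unsigned 16-bit fractions below 0.5 (one reading of "normalized to 0.5"), the checker truncates them to their 8 most significant bits, and the threshold is four truncated inputs times the maximum truncation error of $2^{-8}$. All names are hypothetical.

```python
import random

FRAC_BITS = 16    # precision of the main operation
CHECK_BITS = 8    # precision of the checker inputs
MAX_TRUNC_ERROR = 2.0 ** -CHECK_BITS
THRESHOLD = 4 * MAX_TRUNC_ERROR   # four truncated inputs: a, b, c, d


def truncate(x: int) -> int:
    """Keep the CHECK_BITS most significant bits of a FRAC_BITS-bit value."""
    drop = FRAC_BITS - CHECK_BITS
    return (x >> drop) << drop


def to_real(x: int) -> float:
    """Interpret an unsigned FRAC_BITS-bit integer as a fraction in [0, 1)."""
    return x / (1 << FRAC_BITS)


def reduced_precision_flags_error(a, b, c, d, error=0.0):
    """Compare the full precision product with a truncated-input copy.

    `error` is added to y_real of the main module to model a fault.
    """
    ar, br, cr, dr = (to_real(v) for v in (a, b, c, d))
    y_real = ar * cr - br * dr + error
    y_imag = ar * dr + br * cr

    at, bt, ct, dt = (to_real(truncate(v)) for v in (a, b, c, d))
    check_real = at * ct - bt * dt
    check_imag = at * dt + bt * ct

    # The comparison is a magnitude test against the threshold, not an
    # equality test, because truncation alone creates small differences.
    return (abs(y_real - check_real) > THRESHOLD
            or abs(y_imag - check_imag) > THRESHOLD)


random.seed(0)
half = 1 << (FRAC_BITS - 1)   # integer code for 0.5
samples = [tuple(random.randrange(half) for _ in range(4)) for _ in range(1000)]

false_alarms = sum(reduced_precision_flags_error(*s) for s in samples)
missed_large = sum(
    not reduced_precision_flags_error(*s, error=2.5 * THRESHOLD) for s in samples
)
print(false_alarms, missed_large)
```

In this model, truncation alone never pushes the difference past the threshold, while any fault larger than twice the threshold is always flagged; faults between one and two times the threshold may or may not be detected depending on how the truncation errors combine.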
The delay results in nanoseconds for the reduced precision implementations are reported in Table 6 . In this case, the delay for the DMR implementation increases slightly as the comparison is no longer a simple equality check. Checker 2 provides the lowest delay overhead, 13 percent, similar to that of a full precision implementation.
In summary, we can conclude that the Residue scheme using the indirect implementation should be used to minimize the circuit complexity (area) when delay is not an issue and full precision checking is required. Minimum delay is provided by the direct DMR architecture. However, if near-minimum delay is acceptable, Checker 2 should be used to reduce circuit complexity. If reduced precision checking is acceptable for the application, Checker 2 provides the lowest circuit area and is smaller than the smallest full precision alternative.
CONCLUSIONS
This paper proposed five new Concurrent Error Detection architectures for complex multiplication. The benefits in terms of circuit complexity reductions were evaluated for various configurations showing reductions of up to 30 percent in the area compared to DMR. Compared to a Residue CED scheme, the area is greater but the delay for some of the proposed architectures is significantly lower showing reductions of over 30 percent in some configurations. The use of reduced precision for the CED elements was also considered showing that it results in a different CED architecture providing the best results. The proposed solutions provide a wide set of efficient building blocks to implement CED in DSP systems. 
