Abstract: Test issues in application-specific digital filter datapaths are investigated. It is found that such designs can contain hundreds of redundant faults, making it difficult to accurately determine fault coverage. Since these redundant faults tend to appear in the same general location as test-resistant faults, the presence of many redundant faults can hide significant untested faults despite high overall test coverage. Classes of redundant faults that arise in digital filter datapaths are described, and we propose a suite of techniques for identifying and eliminating the most common redundancies based on arithmetic optimization. The approach is suitable as a front-end to more accurate fault simulation, or can be used in the design process to eliminate redundant logic. The approach is validated as a tool for developing very high-coverage built-in self-test circuits, showing that 100% fault coverage can be achieved in 24k-gate filters with as little as 1% area overhead. When used as a datapath optimization technique, the average area reduction over 15 designs was 8.9%, compared with moderately optimized designs. As a front-end to fault simulation, the approach yielded a 97.9% average reduction in the number of undetected faults across the 15 designs.
I. INTRODUCTION
DSP designs tend to have some very desirable test properties, chief among them good observability and controllability over large parts of the datapath. However, some designs contain faults that are strongly resistant to detection by random patterns, posing an obstacle to low-cost on-chip testing [3]. In some cases, small alterations to the design can provide enough improvement in observability or controllability to implement a built-in self-test approach that does not seriously impact circuit area or performance [4], [5]. In other cases, a functional built-in self-test (BIST) approach based on analog test concepts has shown some promise [6].
Both of these approaches offer low area and performance overhead. The first approach tends to be specific to particular applications (e.g., decimators used in 1-bit sigma-delta modulation [4] or general-purpose array multipliers [5]), but does show the potential of low-overhead BIST in DSP. Other solutions have taken advantage of the regularity of iterative logic array (ILA) implementations to create testable designs [7]-[9]. These approaches, having their roots in the early work of Friedman [10] and Kautz [11], are most applicable to general-purpose filters that are not dedicated to a particular filter transfer function.
The more varied and specialized structures found in large, application-specific digital filters, such as those produced by filter generation tools like FIRGEN [12], pose a more complex test problem because many of the structures can be resistant to random-pattern testing. Counil and Cambon have investigated testing small- to moderate-sized application-specific filter designs, and were able to reach high coverages (mid- to high-90's) using a functional approach [6]. This approach, based on short, triangle-shaped test waveforms, can provide good initial coverage of the less test-resistant structures. In particular, it can provide good toggle coverage or stuck-node (stem fault) coverage [13]. However, if a more stringent fault model is used, we find that many of the more test-resistant faults are checkpoint faults on fanout branches [14], and are not detected by such short general test sequences.
Developing high-coverage tests (whether built-in or external) for application-specific filters is considerably complicated by the presence of these random-test-resistant faults. The typical test-resistant fault in a digital filter datapath is a stuck-fault in the carry logic of upper-bit adders, and is most often associated with an overflow condition at the bit in question. Since these faults lie in upper bits, the magnitude of the fault effect at the filter output is large, and consequently we would like to have very high coverage of these faults. In contrast, the lower bits tend to be highly testable, leading to high fault coverage figures with little test effort. Thus, a relatively short test set can give relatively high fault coverage while leaving a significant kernel of faults undetected.
In addition to the test-resistant faults, the application-specific nature of these designs leads to many more redundant (untestable) faults than are found in general-purpose, programmable architectures. The typical redundant fault in these designs is a stuck-fault on a fanout branch in the carry logic of upper-bit adders, which is exactly the same type of fault as the typical test-resistant fault.
Distinguishing between test-resistant and redundant faults in these designs can be difficult. Although automatic test pattern generation (ATPG) tools can be used to identify redundant faults, the large size and sequential nature of these designs make this impractical in many cases. Logic synthesis tools can also be used to identify and remove redundant faults, but in practice many redundant faults remain. This is to be expected since redundancy identification is, in general, a difficult problem. Consequently, typical designs will contain a significant number of redundant faults, giving the appearance of poor coverage of the high-order datapath bits that are of greatest concern.
If effective tests are to be developed for signal processing datapaths, as many redundant faults as possible must be eliminated from the fault universe so that real test problems can be accurately identified. The goal of this work is to identify the more common types of redundancies that arise in DSP datapaths, providing a framework for eliminating them and enabling more accurate fault simulation. This can be done most efficiently by analysis of the register-transfer-level (RTL) description of the design using arithmetic techniques. Targeting the most common forms of redundancy, it is possible to remove hundreds of redundant faults from large designs quickly.
The overall approach is shown in Fig. 1, where we will be focusing on the shaded portion. The RTL design description is analyzed to identify structures that are redundant, and the logic is marked to indicate the specific redundancies that it embodies. There are two general ways the results can be used: first, as a front-end to fault simulation, it allows more accurate estimation of the true coverage. Second, since most of the techniques used involve elimination of redundant logic, the approach can also be used as a design aid, producing an optimized design that uses fewer gates and is possibly slightly faster in some cases. Other benefits include greatly accelerated ATPG since redundant faults cause high test generation times.
As a front-end to fault simulation, the tool can be used to design more effective tests by enabling fault coverage targets to be raised. In some cases, reduced overhead BIST becomes possible. A common BIST strategy for dealing with low overall coverage is to split the design into independently tested pieces. In the case of filters, splitting the design in this way can make test-resistant faults more testable, but can also make redundant faults become testable. The net effect is higher apparent fault coverage, while the true fault coverage (excluding redundant faults) may improve only slightly at considerable cost in terms of added circuit area.
Evaluating the number of redundant faults that remain after application of the approach is difficult, for the reasons mentioned earlier. To validate the approach, we will use pseudorandom test sequences. Our motivation for using pseudorandom BIST is twofold: first, it provides an upper bound on the number of redundant faults that remain in the designs. Second, it provides a demonstration of reduced overhead BIST design using accurate fault simulation. We will stress improvements in the test generator over more intrusive BIST in these designs. As an example of what is possible, we will show that a 24 000-gate design can be 100% tested with 1% BIST area overhead, while the initial design contained about 700 redundant faults. If the redundancy-elimination techniques are used for circuit optimization, significant area savings are possible. Compared to designs implemented using only basic optimization techniques, it is not uncommon to see a 5%-10% reduction in active circuit area, counting BIST overhead. In this way, testability techniques can pay for themselves.
We will focus on finite-impulse response (FIR) filters since they are perhaps the most widely implemented class of digital signal processing applications and are a basic building block of many more complex systems. However, the general approach is geared toward any system that can be described as a network of shift, add, delay, sign-extension, and truncation elements. In order to gauge the efficacy of the approach across a fairly broad slice of designs, we will use five filter specifications selected from the literature that include three lowpass filters, a wide-band bandpass filter used in video processing, and a predistortion filter. The designs will be implemented using three different common architectures: cascaded ripple-carry adders, carry-save array pipelines, and adder tree structures. Both signed and unsigned arithmetic will be considered.
II. PRELIMINARIES

A. Application
The target applications we will focus on here are FIR filters. To provide a brief review of terms and definitions, FIR filters essentially perform a weighted moving average of a sequence of input samples. This is described by the linear constant-coefficient difference equation

$$y(n) = \sum_{k=0}^{N} h_k \, x(n-k)$$

where $N$ is the filter order, $y(n)$ is the output signal at time $n$, $x(n)$ is the input at time $n$, and $h_k$, $k = 0, \ldots, N$, is the set of filter coefficients, which also corresponds to the impulse response of the filter. The designer frequently has some flexibility in choosing the filter coefficients, and it is often possible to select the $h_k$ such that they can be expressed as a power of two or the sum or difference of two powers of two, which leads to efficient VLSI implementations.
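As a minimal sketch of the weighted moving average performed by an FIR filter, the following Python function evaluates the difference equation directly. The function name `fir_filter` and the coefficient values are hypothetical and chosen only for illustration; each coefficient is a power of two or a sum or difference of two powers of two, and the input is assumed to be zero before time 0.

```python
def fir_filter(h, x):
    """Evaluate y(n) = sum_k h[k] * x(n - k), taking x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


# Hypothetical coefficients (not taken from any filter in the text).
h = [2**-3, 2**-1 - 2**-4, 2**-1 - 2**-4, 2**-3]

# Driving the filter with a unit impulse reproduces the coefficients,
# which is why the coefficient set is also the impulse response.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = fir_filter(h, impulse)
```

Because the coefficients are dyadic, the floating-point arithmetic here is exact, so the impulse response can be compared with the coefficient list directly.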
B. Number Representation
Digital filters may be implemented either using a fixed-point number representation or floating point. While floating point simplifies the system level design by eliminating overflow problems, it has much greater hardware complexity for a given dynamic range. Here, we focus on fixed-point implementations. It is common to represent fixed-point numbers as either integers or binary fractions. We will assume that signed signals are represented using two's-complement binary fractions; the most-significant bit (MSB) of an $n$-bit signal is interpreted as the sign bit, with $n-1$ bits to the right of the binary point. An $n$-bit signal $x$ represents the number given by the sum

$$x = -x_{n-1} + \sum_{i=0}^{n-2} x_i \, 2^{\,i-(n-1)}$$

where $x_{n-1}$ is the MSB and $x_0$ is the least-significant bit (LSB).
A ternary signed-digit representation is also useful in discussing filter design. Filter coefficients are typically converted to a canonic signed-digit (CSD) representation, in which no two adjacent digits are nonzero [15]. For example, the coefficient 0.001110 would be represented as $0.0100\bar{1}0$, where $\bar{1}$ represents the CSD digit $-1$. Finally, a common filter design procedure involves conversion of a signed signal to an unsigned signal plus an offset. Inverting the MSB of a signed two's-complement signal and zero-extending the result maps the signed signal into the positive range of a wider signal. The new signal is related to the original signal by an offset, which can be added in at a later point in the datapath, usually as a correction factor at the output of the filter, where multiple correction factors can be added in one addition operation. At the register-transfer level, we will denote the operation of inverting the MSB and zero-extending by zcvt. The testability of filters implemented using both signed and unsigned arithmetic will be examined.
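The offset relation behind zcvt can be checked exhaustively for a small width. The sketch below works on integer bit patterns rather than binary fractions, and the helper names `to_signed` and `zcvt` are illustrative only. Under that assumption, inverting the MSB and reading the result as an unsigned (zero-extended) value adds a constant $2^{w-1}$ to every signed value of width $w$.

```python
def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def zcvt(bits, width):
    """Invert the MSB of a width-bit pattern.

    The returned nonnegative integer is the zero-extended result, so it
    can be placed in any wider unsigned signal unchanged.
    """
    return (bits & ((1 << width) - 1)) ^ (1 << (width - 1))


WIDTH = 4
# Collect the difference between the converted and the original value
# over every 4-bit pattern; a single constant offset is expected.
offsets = {zcvt(p, WIDTH) - to_signed(p, WIDTH) for p in range(1 << WIDTH)}
```

Since the offset is the same for every input, it can be folded into one correction constant later in the datapath, as the text describes.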
C. Scaling
A key property of two's-complement numbers is that intermediate nonsaturating additions can be allowed to overflow as long as the final sum (as computed with infinite precision) lies in the permitted range ($-1$ to $1 - 2^{-(n-1)}$ for $n$-bit numbers) [16]. In FIR filters, this property means that the designer only has to be concerned with scaling the input signal or filter coefficients such that the output node does not overflow.
Overflow is permitted in intermediate nodes of an adder tree or chain as long as they use at least as many bits to represent the sum as the output node. In infinite impulse response (IIR) filters, the situation is complicated by the need to also prevent feedback signals from overflowing. In either case, the approach typically used is to make all adders in the datapath the same width, and scale the input and/or filter coefficients such that overflow does not occur.
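The tolerance of two's-complement arithmetic to intermediate overflow can be illustrated with a small wraparound accumulator. This is a sketch on integers rather than binary fractions, with hypothetical helper names `wrap` and `wrapped_sum`; the 4-bit width and the operand values are chosen only so that one partial sum overflows.

```python
def wrap(value, width):
    """Reduce an integer to a width-bit two's-complement value (wraparound)."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def wrapped_sum(terms, width):
    """Accumulate terms with a nonsaturating width-bit adder at every step."""
    acc = 0
    partials = []
    for t in terms:
        acc = wrap(acc + t, width)
        partials.append(acc)
    return acc, partials


WIDTH = 4            # representable range is -8..7
terms = [6, 5, -7]   # exact partial sums are 6, 11 (out of range), 4
total, partials = wrapped_sum(terms, WIDTH)
```

The second partial sum wraps to a negative value, yet the final result equals the exact sum because that sum lies within the 4-bit range.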
From the testing perspective, this scaling method presents a problem since many of the input signals to adders may actually use only a small portion of the dynamic range available, and consequently a number of the high-order outputs may function only as sign bits. It is impossible to assert overflow conditions at adder bits that output redundant sign bits, resulting in untestable stuck-at faults in the carry logic. In Section III-A we will revisit scaling as a technique for identifying redundant faults.
D. Architectures
Three implementation architectures, all based on the popular transposed direct-form architecture [16], were examined. The filters are composed of cascaded taps, where each tap implements a constant-multiplication operation, a sum and a delay. The three implementation architectures involve variations on the constant-multiplication operation: the first is the Linear architecture, where ripple-carry addition operations are performed in series with the main datapath. The sxt operator denotes sign-extension. The second is the carry-save adder array, where the ripple adders are replaced by carry-save adders, and the datapath is expanded to include both carry and sum signals. In this architecture, signed inputs are converted to unsigned quantities and zero extended using the zcvt operator. The last is a tree-based architecture, where ripple-carry adders are used to perform the constant-multiplication off the main path, and the result is then added into the main path. The three tap RTL architectures are illustrated in Fig. 2.
E. Fault Model
Since the principal active element in all the designs was the full-adder cell, the fault model used for this cell is of some concern. We used the common gate-level model shown in Fig. 3, where the faults modeled are the stuck-faults at gate input and output pins. Often, the hardest tests to apply using pseudorandom techniques are those associated with overflow conditions at the next-to-MSB full-adder. If the $c$ input is the carry input to the full adder, the overflow tests are $\bar{a}\bar{b}c$ and $ab\bar{c}$ (or tests 1 and 6 if the minterms are interpreted as binary numbers). Tests 2 and 5 are also common test-resistant faults in high-order adder bits, while tests 0, 3, 4, and 7 tend to be much easier to apply [3].
Under this fault model, testing the carry logic requires five tests: the three labeled (essential) in the right half of Fig. 3 , and one from each of the two equivalence classes, labeled 1 and 2. Under this model, overflow test 6 is nonessential, since the logic tested by it can also be tested by the easier (more probable) test 7. 
F. Sequential Testing Considerations
While sequential circuits in general suffer from the problem that undetectable faults may not be redundant [17] , FIR filters fall into the class of feedback-free circuits, where this is not a concern [17] , [18] . Although the register stages of Fig. 2 are closely related to pipelines, these filters are typically not strict pipelines due to the injection of global signals in each stage, making efficient application of combinational redundancy identification techniques problematic. Some of the larger filters examined contain close to 900 flip-flops, with combinational logic complexity on the order of 20 000 gate-equivalents, making general sequential redundancy identification difficult. Desirable test properties include the ability to apply external initialization sequences. Consequently, resets are often omitted on flip-flops, eliminating the need to test for reset faults. Additionally, faults on clock lines in these circuits are easily detectable since observable fault effects are generated simply by toggling data through the datapath flip-flops.
III. DATAPATH REDUNDANCY ELIMINATION
In this section, we outline the techniques used to identify and eliminate redundant faults in filter datapaths. Primarily, this involves arithmetic analysis of the RTL design, first to find constraints on signal magnitude and phase relationships, and then to use this information to perform a set of logical transformations of the gate-level design. All these transformations except one (described in Section III-B) involve marking logic as inactive or as implementing a reduced functional behavior. As such, the effect of almost all the transformations is to eliminate logic that contains redundant faults, and consequently the approach is applicable either to fault simulation of unoptimized datapath logic (essentially as a front-end to fault simulation), or to optimization of the datapath logic directly. Once the RTL analysis is complete, additional gate-level optimizations can be performed which include constant propagation and mapping of full-adder cells to reduced functionality cells such as half-adders, operations typically performed by synthesis systems.
To summarize the approach, we start redundancy elimination by identifying and removing redundant sign bits using scaling techniques followed by further redundancy elimination involving the top bits of adders and subtracters. In signed arithmetic, these optimizations apply to operators where sign-extending arithmetic is performed. In carry-save adder (CSA) array architectures, we assume that unsigned arithmetic is being used. In CSA architectures, scaling is sufficient to identify most redundancies, although one additional redundancy involving reduced functionality CSA's will be discussed.
A. Scaling
The single most important design-for-testability optimization that can be performed on filters is to scale signal widths to the minimum size needed, eliminating redundant sign bits. Computing the minimum width needed to hold a signal can be performed using any of a number of standard fixed-point scaling techniques [16, Sec. 6.9.2]. In redundancy elimination, we use $l_1$ scaling since it is the most conservative scaling technique, guaranteeing that the circuit behavior will not be altered. To briefly review $l_1$ scaling, the behavior of a signal $v(n)$ at an internal node can be characterized using the idealized impulse response of the subfilter that outputs at the internal node

$$v(n) = \sum_{k=0}^{M} f_k \, x(n-k)$$

where $x(n)$ is the filter input, $f_k$ is the impulse response of the subfilter, and $M$ is the order of the subfilter. The finite sum applies only for FIR subfilters. For IIR filters, the limit on the sum becomes infinite, although in practice its convergence properties are good for stable systems. Using the property that the magnitude of a sum is less than or equal to the sum of the magnitudes of its terms, and then replacing $|x(n-k)|$ with $x_{\max}$ (the maximum input signal magnitude), we obtain

$$|v(n)| \le x_{\max} \sum_{k=0}^{M} |f_k|$$

This gives an upper bound on the signal amplitude at the internal node. Without knowing more about the characteristics of the input signal, we assume that it is capable of swinging through the full range available to it ($x_{\max} = 1$). The upper bound on the signal amplitude is then

$$|v(n)| \le \sum_{k=0}^{M} |f_k|$$

It can be shown that this bound is exact in the sense that an input signal exists that achieves the upper bound. This bound is usually very conservative for other than contrived input signals [16, Eq. 2.55]. Since we have assumed that the system is properly scaled to start with, at most $w$ bits are needed to hold any result, where $w$ is the width of the filter output. If the above bound is 0.5 or less, then the MSB can be omitted. If it is 0.25 or less, then the top two bits can be omitted, and so on.

In general, the number of bits $w_v$ required for signal $v$ to not overflow (other than the allowable overflow discussed in Section II-C) is given by

$$w_v = w + \left\lceil \log_2 \sum_{k=0}^{M} |f_k| \right\rceil$$

In practice, to account for departures from the ideal impulse response, we perform scaling analysis by taking the $l_1$ norm of the subfilter's actual response to maximum-magnitude positive and negative impulses. Essentially equivalent to the $l_1$ scaling described above, this approach ensures that truncation and offset effects are properly accounted for by applying a maximum magnitude impulse at the input of the filter and simulating the behavior of the filter. Overflow is inhibited during this calculation so that maximum signal values are accurately known even if the specified datapath is not wide enough to hold the result. Thus, intermediate overflow of the type discussed in Section II-C is supported.
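The width computation can be sketched in a few lines of Python. This is an illustration of the idealized sum-of-magnitudes bound only, not of the simulation-based analysis used in practice; the function names and the subfilter coefficients are hypothetical, and the input is assumed to swing through the full range (maximum magnitude 1).

```python
import math


def l1_bound(impulse_response, x_max=1.0):
    """Upper bound on the node amplitude: x_max times the sum of |f_k|."""
    return x_max * sum(abs(c) for c in impulse_response)


def droppable_msbs(bound):
    """High-order bits that can be omitted for a given amplitude bound.

    One bit for a bound of 0.5 or less, two bits for 0.25 or less, and so on.
    """
    if bound <= 0:
        raise ValueError("bound must be positive")
    return max(0, math.floor(-math.log2(bound)))


def required_width(output_width, impulse_response):
    """Bits needed at an internal node of a filter with the given output width."""
    return output_width - droppable_msbs(l1_bound(impulse_response))


# Hypothetical subfilter impulse response feeding an internal node.
subfilter = [2**-3, -(2**-2), 2**-4]
node_width = required_width(16, subfilter)
```

For this example the bound is 0.4375, so one redundant sign bit can be removed and the node needs 15 of the 16 output bits.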
Once minimal signal widths have been established, additional redundancies exist at operators where sign extension is applied to the operands (for example, at points where the datapath widens after an addition operation), or where signals subject to a phase constraint (identical or opposite signs) are added or subtracted. This is addressed through the sxt-add, shift-add, shift-subtract, and restricted-overflow optimizations, to be discussed next.

(Figure caption: C and S denote carry and sum, respectively, in the full adder.)
B. SXT-Add Transformation
In signed-arithmetic (two's-complement) datapaths, this transformation applies wherever the datapath width expands to accommodate an adder output that is wider than its widest input. (After scaling, all adder output widths should either be equal to or one greater than their widest input width.) The transformation is shown in Fig. 4, where one full-adder is eliminated and two inverters are added. (In some cell libraries, full-adder cells with inverted carry inputs are available, eliminating one of the inverters.) This optimization is local to the top two bits of the adder (i.e., it does not assume anything beyond the context of the top two full adders), and can be verified using the identities $S(a, b, C(a, b, c)) = C(a, b, \bar{c})$ and $S(a, b, c) = \overline{S(a, b, \bar{c})}$, where $C$ and $S$ are the carry and sum functions, respectively, of a full adder.
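The locality of this transformation makes it easy to check by exhaustive enumeration. The sketch below assumes one realization consistent with the description (Fig. 4 itself is not reproduced here): the carry input to the next-to-MSB full adder is inverted, its carry output is used directly as the sign-extended MSB, and its sum output is inverted. In identity form, that is $S(a,b,C(a,b,c)) = C(a,b,\bar{c})$ and $S(a,b,c) = \overline{S(a,b,\bar{c})}$. The function names are illustrative.

```python
from itertools import product


def fa_sum(a, b, c):
    """Sum output of a full adder."""
    return a ^ b ^ c


def fa_carry(a, b, c):
    """Carry output of a full adder (majority function)."""
    return (a & b) | (a & c) | (b & c)


def sxt_add_reference(a, b, c):
    """Top two sum bits of a sign-extending adder built from two full adders.

    a and b are the operand sign bits (each used twice because of sign
    extension) and c is the carry into the next-to-MSB position.
    """
    low = fa_sum(a, b, c)
    high = fa_sum(a, b, fa_carry(a, b, c))
    return high, low


def sxt_add_transformed(a, b, c):
    """Same two bits from one full adder plus two inverters (assumed form)."""
    nc = c ^ 1                      # inverter on the carry input
    high = fa_carry(a, b, nc)       # carry output used as the MSB
    low = fa_sum(a, b, nc) ^ 1      # inverter on the sum output
    return high, low


equivalent = all(
    sxt_add_reference(*v) == sxt_add_transformed(*v)
    for v in product((0, 1), repeat=3)
)
```

With the carry input inverted, the overflow pattern 001 seen by the original cell becomes 000 at the remaining full adder, which is the kind of remapping of faults to tests that the transformation relies on.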
It should be noted that unlike the other redundancy elimination techniques discussed here, this transformation does more than simply delete logic: since the carry input to the nextto-MSB adder is inverted, the mapping of faults to tests is changed. Consequently, to be effective, this transformation must be applied as part of the design process; it should not be applied in fault simulation unless it reflects the actual design. Under most common gate-level fault models, none of the original sign-extending adder's logic is redundant as long as it is properly scaled, since there is no carry logic out of the MSB and there are no redundant sign bits at the adder's output. The purpose of this transformation is primarily to improve the random-pattern testability of the next-to-MSB full adder by mapping faults with difficult tests to easier tests. Whether this is effective or not will depend on the fault-model used.
The sxt-add transformation improves the testability of the adder by mapping overflow test 1 to test 0, which is nonessential, while overflow test 6 is still mapped to equivalence class 1 (see Section II-E), and is therefore also nonessential. Thus, as a result of the transformation, none of the essential tests map any longer to difficult tests. The effect of the transformation is shown in Fig. 8.
As an example of the difference this transformation can make, we applied it to an implementation of a ten-tap filter we described in [19], based on filter coefficients used by Counil and Cambon [6]. One of the adders in this design used little more than half its full output range. For this adder, it can be calculated that the probabilities of asserting tests 1 and 6 are each 0.002% for an independent, uniformly distributed input signal, giving an expected test length on the order of 50 000 vectors for a standard linear-feedback shift-register (LFSR)-generated input signal. Using the sxt-add transformation, the filter can be fully tested with less than 100 LFSR-generated vectors.

Fig. 5. Generalized sign-extending shift-add redundancy elimination. Applies when the inputs to the adder have a positive phase relation, and the output width is greater than the input width. A similar redundancy elimination technique applies to subtracters with negative phase inputs.
C. Shift-Add Redundancy
The shift-add redundancy, shown in Fig. 5, is similar to the sxt-add redundancy, with the addition of a positive phase constraint on the adder inputs. This commonly occurs in adders where one input is a shifted version of the other input, but also occurs wherever the two inputs always have the same sign. The transformation can be verified using the identities $S(a, a, c) = c$ and $C(a, a, c) = a$. An analogous optimization applies when negative-phase signals are subtracted.
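The effect of the positive phase constraint on the top two full adders can be checked by enumeration. In this sketch, which assumes both operand sign bits equal a common value `s`, the next-to-MSB sum reduces to the incoming carry and the sign-extended MSB reduces to `s`, so neither full adder is needed. The variable names are illustrative.

```python
def fa_sum(a, b, c):
    """Sum output of a full adder."""
    return a ^ b ^ c


def fa_carry(a, b, c):
    """Carry output of a full adder (majority function)."""
    return (a & b) | (a & c) | (b & c)


results = []
for s in (0, 1):          # common sign bit of both operands (positive phase)
    for c in (0, 1):      # carry into the next-to-MSB position
        low = fa_sum(s, s, c)          # next-to-MSB sum bit
        carry = fa_carry(s, s, c)      # carry into the MSB position
        high = fa_sum(s, s, carry)     # sign-extended MSB sum bit
        results.append((low == c, carry == s, high == s))

all_hold = all(all(r) for r in results)
```

Both top output bits are therefore plain wires under the phase constraint: one carries the incoming carry and the other carries the shared sign.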
1) Restricted Overflow Redundancy:
Signal phase information makes it possible to remove the top two full-adders in a sign-extending shift-add operator. This removes most of the redundant logic in a properly scaled constant-multiplier block. However, the strongly correlated signals in such blocks do give rise to additional redundant faults with some frequency; specifically, it may not be possible to assert both overflow tests in the highest-order remaining full adder. If this is the case, the redundant faults can be marked as untestable, or the adder can be replaced with a circuit that implements the simplified logic.
This restricted overflow redundancy can be efficiently identified. In a generalized shift-add operation, the adder is fed by signals derived from a common signal , shown in Fig. 9. The and blocks represent multiplication operations that may be composed of shift, add, subtract, and truncation operations, where corresponds to the larger coefficient. Interpreting all signals as two's-complement integers, asserting a positive overflow test at the top adder requires and where is the adder's output width, and the output width of is assumed to be less than the width of . Similarly, the negative overflow test requires and . The conditions for the positive overflow test are illustrated in Fig. 9. The key to deriving these conditions is to recognize that the input to the adder is sign-extended by one bit, the input is sign-extended by at least two bits, and the top bits of both signals are identical due to the phase constraint.
If the multipliers' transfer functions are monotonic, the existence of a positive overflow test can be ascertained by maximizing the input signal subject to the constraint on the output of , and determining whether overflow at the top adder is triggered by this input such that if then positive overflow is impossible
If the positive overflow test cannot be asserted, the carry logic can be modified as shown in Fig. 10 , or the corresponding faults can be marked as redundant. The corresponding negative overflow test is such that if then negative overflow is impossible
In these tests, the actual operation implemented by the multiplier is denoted as , while the coefficient that it implements is labeled . For example, a multiplier where might be implemented as . Other possible implementations include and . Since each implementation has a different transfer function, the problem of finding the maximum value that satisfies the output constraint on may require some search. The search procedure can be accelerated by estimating and based on the multiplier's ideal behavior, and then using the monotonicity property to find the exact solution to the maximization problem. An algorithm to implement this procedure is given in [20] .
An important special case that occurs frequently in practice is where the larger multiplier is simply a left or right shift operation. In this situation, the transfer function is well-defined and no search is required. It should also be noted that the negative-overflow redundancy cannot occur in this case.
Nonmonotonic multipliers pose a special problem, requiring some limited additional search to guarantee that no overflow can occur. An example of a nonmonotonic multiplier is . These multipliers tend to be noisy since they typically contain many truncation operations and have low gain. We will not consider them further here.
While the approach described above is a general way to identify restricted-overflow redundancies, there are some common structures that produce this type of redundancy. One such occurrence involves the addition of two signals related by a shift , where the width of the source signal (excluding any redundant sign bits) is less than or equal to the relative shift plus two. An example is adding and times a 6-bit input signal, producing a 13-bit result. Here, , , and , so a restricted-overflow redundancy will occur in this adder logic. Additionally, if the smaller signal is truncated to two bits, a positive overflow cannot be generated at the

A more subtle example of this redundancy is found in the final subtraction of , where is taken to be a 6-bit integer, refers to right-shifting by the indicated number of bits followed by integer truncation, and denotes left-shifting, expanding the signal width to hold the added bits and zero filling on the right. The result is 11 bits wide. Let , , and . Testing for negative overflow in the subtracter requires setting and testing for . Choosing , , so no negative overflow is possible at this bit of the subtracter, and the redundant logic can be eliminated or otherwise marked as untestable for fault simulation.
D. Shift-Subtract Redundancy
This redundancy occurs when negative phase signals are added (or, equivalently, positive phase signals are subtracted). Both a strong form and a weak form exist; the weak case, shown in Fig. 6 , applies when the phase constraint exists, but the magnitudes of the two input signals are not related. If the magnitude of one of the input signals is always smaller than the other, then the strong form applies, shown in Fig. 7 . To eliminate all redundancies, the strong form should be used where possible. The weak form is derived using the identity . The strong form follows directly from the signal magnitude constraint.
Identifying this redundancy using gate-level techniques is more involved, requiring justification across the full width of the adder. As an example, consider an 8-bit signal shifted left by 4 bits and subtracted from itself, i.e.,
where the indicate carry bits, are the sum output bits, and the prime denotes logical inversion. To apply the strong form transformation, the don't-care conditions and are needed. Tracing through the justification of the last don't-care condition shows that the scope of the optimization spans several adder bits before the justification terminates in a contradiction. In contrast, knowledge of signal phase and magnitude constraints alone is sufficient to derive the optimization shown in Fig. 7.
E. Shift-By-One Redundancy
When a signal is added or subtracted with an identical signal shifted by one bit position, logical redundancy is induced at the gate level. This redundancy can be avoided by converting all constant-multiplier coefficients to CSD representation, where the binary multiplier is transformed to a ternary signed-digit representation with digits drawn from $\{\bar{1}, 0, 1\}$, where no two adjacent digits are nonzero [15]. The CSD representation leads to an implementation consisting of the minimal number of adders and subtracters. For example, the coefficient 0011100, if implemented as the sum of its three adjacent shifted terms ($0010000 + 0001000 + 0000100$), would result in logical redundancies in the first addition operation. This is avoided by implementing the coefficient as $0100\bar{1}00$ ($0100000 - 0000100$), which also eliminates one adder.
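A CSD recoding can be sketched with the standard non-adjacent-form recurrence. This is one common way to obtain such a representation and is not claimed to be the procedure used by any particular filter tool; the function name `to_csd` is hypothetical, and the coefficient is treated as a nonnegative integer bit pattern.

```python
def to_csd(value):
    """Return signed digits (-1, 0, +1), least significant first.

    Uses the non-adjacent-form recurrence: an odd value emits +1 when
    value % 4 == 1 and -1 when value % 4 == 3, which forces the next
    digit to be zero. Defined here for nonnegative integers only.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


coeff = 0b0011100                          # example coefficient from the text
csd = to_csd(coeff)                        # least significant digit first
binary_terms = bin(coeff).count("1")       # shifted terms in plain binary
csd_terms = sum(1 for d in csd if d != 0)  # shifted terms after recoding
```

For this coefficient the recoding has two nonzero digits instead of three, so the constant multiplier needs one subtracter rather than two adders, and no two terms differ by a one-bit shift.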
F. Incrementer/Decrementer Redundancy
In cases where a 1-bit signed signal (in two's complement, corresponding to the values $\{0, -1\}$) is combined with a wider signed signal, gate-level redundancies result if a normal adder or subtracter is used. In this case, redundancy is eliminated by replacing an adder with a decrementer, and a subtracter with an incrementer. At the RT level, this optimization can be represented by negating and zero-extending the 1-bit signal (a zero-cost operation) and inverting the sense of the addition or subtraction operation with respect to the zero-extended input. The adder or subtracter is then converted to an incrementer or decrementer during constant propagation, when the upper zeros in the zero-extended signal are pushed through the operator.
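The RT-level rewrite can be checked arithmetically. In the sketch below, a 1-bit two's-complement signal with bit pattern `bit` has the value `-bit`, so its negation is simply `bit` read as an unsigned (zero-extended) quantity. The names and the operand range are illustrative.

```python
def one_bit_signed(bit):
    """Value of a 1-bit two's-complement signal: 0 -> 0, 1 -> -1."""
    return -bit


checks = []
for x in range(-8, 8):          # a small range of wider signed operands
    for bit in (0, 1):
        s = one_bit_signed(bit)
        u = bit                 # negated and zero-extended 1-bit signal
        # Adding s equals subtracting u (decrement when the bit is set);
        # subtracting s equals adding u (increment when the bit is set).
        checks.append(x + s == x - u and x - s == x + u)

all_hold = all(checks)
```

Because `u` has only one bit that can be nonzero, propagating its constant upper zeros through the operator leaves a decrementer or incrementer controlled by that bit.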
G. Carry-Save Adder Array Redundancies
When high-performance datapaths are implemented using CSA arrays, special consideration must be given to scaling and high-order bit redundancies. To determine the minimum required bit width at any point in a CSA array, the scaling operation is essentially performed along a diagonal cut of the array. Using a technique from Baugh-Wooley multipliers, we assume that all inputs to the array have been converted to positive quantities and zero-extended, reducing the load on the most significant input bits. We also assume that a compensation vector has been provided by the designer to correct for the offset introduced by this operation, which is added to the final result of a CSA array operation, converting it back to a signed quantity.
Conversion to a positive quantity consists of inverting the MSB and zero-extending the result, and will be denoted by zcvt. Carry-save subtraction is implemented by adding the inverted subtrahend signal (denoted by inv in Fig. 2) , rather than taking the two's complement. A correction factor is included in the compensation vector to correct for the missing offset. If the lower bits of the subtrahend are constant zeros (as in the output of a left-shift operation), the inv operation passes these through and the value added to the final compensation vector is shifted left by a corresponding number of bits.
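The offsets introduced by zcvt and inv can be made concrete with a behavioral sketch. The helper names (`to_signed`, `inv_upper`, `signed_sum_via_zcvt`, `subtract_via_inv`) and the integer-level modeling are assumptions for illustration; the compensation values follow from two's-complement arithmetic rather than from a circuit given in the text.

```python
def to_signed(word: int, n: int) -> int:
    """Interpret an n-bit pattern as a two's-complement value."""
    return word - (1 << n) if word & (1 << (n - 1)) else word


def zcvt(word: int, n: int) -> int:
    """Invert the MSB; the result is the signed value plus 2**(n-1)."""
    return word ^ (1 << (n - 1))


def signed_sum_via_zcvt(words: list[int], n: int) -> int:
    """Add positive (zcvt) quantities, then remove the accumulated offset."""
    compensation = -(len(words) << (n - 1))
    return sum(zcvt(w, n) for w in words) + compensation


def inv_upper(value: int, m: int, k: int) -> int:
    """Invert bits k..m-1 only; constant-zero low bits pass through."""
    mask = ((1 << m) - 1) & ~((1 << k) - 1)
    return value ^ mask


def subtract_via_inv(a: int, b: int, m: int, k: int) -> int:
    """a - b (mod 2**m) using inv plus a correction of 1 << k.

    Assumes the low k bits of b are constant zeros (a left shift by k).
    """
    return (a + inv_upper(b, m, k) + (1 << k)) % (1 << m)
```

With k = 0 the correction is the familiar +1 of two's-complement negation; for a subtrahend shifted left by k bits the correction moves to bit k, matching the shifted compensation value described above.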
1) CSA Scaling: Scaling CSA arrays is similar to normal scaling, but must take into account the fact that the carries from lower bits will not reach the top of the array until the signal has passed through several adders. When trying to determine if a carry output is possible from an adder bit, we can trace back through the array to determine which signals can contribute to that carry. Fig. 11 illustrates the type of analysis we propose using to scale CS adder arrays. The figure corresponds to the first ten taps of a 25-tap FIR filter [15], consisting of 13 CS adder stages (A0-A12). In this example, we want to determine whether the datapath widens at the output of adder A12. To do this, we examine whether a nonzero carry output is possible from the highest active adder bit, which at adder A12 is bit 11.
We will refer to the staircase on the right as the carry horizon, which starts at the current highest active bit, and expands to the right by one bit for every step back in the CSA array. No input to the right of the carry horizon can influence the current carry output, so these inputs are treated as identically zero for this analysis. The maximum sum possible due to inputs that lie to the left of the carry horizon is determined by choosing at each tap the signal phase that maximizes the sum. If the sum overflows the current datapath width, the datapath is expanded.
The left boundary shows the minimum datapath width needed to hold the input signals to each adder, as determined by scaling analysis, where the dashed line indicates the highest active carry input at each adder, and the solid line indicates the highest active sum output. The datapath can widen due to either a wide data input signal (as in the case of the input to adder A2), or due to an MSB carry output from a prior CSA stage (as in the case of the carry input to adder A4). The minimum width is found by iteratively applying the scaling analysis, starting at the highest active (nonconstant) input bit of the topmost adder, and moving down. All logic to the left of this boundary is deleted.
The analysis is complicated by the possibility that, after a wide input is encountered, some upper bits will only act as feed-through bits. In our analysis, we track both the sum and carry datapath widths in order to account for feed-through sum bits. The algorithm for performing this scaling analysis, based on the assumption that the array is adding positive quantities only, as produced by zcvt operators, is given in [20] . This analysis assumes that the maximum value at any input to the array also maximizes any sub-word at that input (also consistent with the use of zcvt), which allows us to track carries at lower bits while ignoring feed-throughs at upper bits. If this is not the case (e.g., the maximum value at an input is 01000 while 00111 maximizes subwords below bit 3), the scaling procedure should be modified to increase the carry width when the sum width increases.
After a number of narrow inputs have been applied in succession, it is possible that upper carry bits will become inactive. Since we are only computing an upper bound on the carry width, it is possible that redundant faults will remain due to such reduced carry activity.
Constant propagation is applied after CSA scaling to push the adjusted width information through the array, converting full-adders to half-adders, feed-throughs, or constants, as appropriate.
2) CSA ABX Redundancies: Since sign-extension is eliminated in a CSA implementation, the sxt-add and related optimizations do not apply. Instead, we find that redundancy occurs in the upper bits when the sum and carry signals are strongly related. This typically occurs when the sum and carry signals arriving at an adder cell are mutually exclusive. The exclusion condition can be used to eliminate redundancy, for example by replacing a full adder having mutually exclusive sum and carry inputs with an ABX cell, shown in Fig. 12. Exclusion conditions can be detected when the array is scaled. If the carry width increases at the current adder, it is possible that this is only due to the current data input. If the new carry bit cannot be activated without the current input data, then the sum and carry inputs to the uppermost bit are mutually exclusive. Referring to Fig. 11, if the signal input to the current row is zeroed and a smaller maximum width is then determined by scaling analysis, the MSB is marked as an ABX adder.
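The effect of an exclusion condition on a full adder can be checked exhaustively. The sketch below is one simplification that follows from two inputs never being 1 together; the cell of Fig. 12 is not reproduced here, so `exclusive_adder` is a hypothetical stand-in and may differ from the actual ABX cell.

```python
from itertools import product


def full_adder(a: int, b: int, x: int) -> tuple[int, int]:
    """Standard full adder: returns (sum, carry)."""
    s = a ^ b ^ x
    c = (a & b) | (a & x) | (b & x)
    return s, c


def exclusive_adder(a: int, b: int, x: int) -> tuple[int, int]:
    """Simplified cell, valid only when a and b are never both 1.

    With a & b == 0, a ^ b equals a | b and the a & b carry term vanishes,
    leaving an OR gate feeding a half adder.
    """
    t = a | b
    return t ^ x, t & x


def cells_agree_under_exclusion() -> bool:
    """Compare both cells on every input pattern allowed by the exclusion."""
    return all(
        full_adder(a, b, x) == exclusive_adder(a, b, x)
        for a, b, x in product((0, 1), repeat=3)
        if not (a and b)
    )
```

The two cells differ only on patterns with a = b = 1, which the exclusion condition rules out; faults in a full adder that are detectable only by those patterns would be redundant.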
IV. EXPERIMENTAL RESULTS
To validate the approach, we created 15 sample designs based on five filter specifications, each implemented using the three architectures described in Section II-D. To offer realistic comparison points, we also created 15 baseline designs, one for each filter/architecture pair, using common optimizations. Using the full suite of scaling-based optimizations to produce an optimized version of each design, we then compared the fault coverages of these designs with the corresponding baseline designs.
Of the five filter specifications, the coefficients of three were scaled so that the maximum passband gain was unity. Two designs (filt25 and filt60) were scaled more aggressively on the assumption that the passband signal was relatively weak. In these designs, the maximum passband gain was roughly equal to the passband width, and was 2.4 and 7.2, respectively. In general, more aggressive scaling results in easier-to-test designs since internal signal amplitudes tend to be larger.
The redundancy elimination techniques used are shown in Table I. The baseline optimizations are intended to reflect the level of optimization found in filters described in the literature, and correspond roughly to what is found in synthesized designs and commercial fault simulators. In addition to the shift-by-one and incrementer/decrementer optimizations mentioned in Section III, the optimizations include constant propagation (where register bits that are tied zero are also eliminated), dead logic elimination, and cell conversion (e.g., mapping a full-adder with a zero input to a half-adder).
Since the tree designs are amenable to simple adder sizing based on setting adder widths to the maximum input width plus one, up to the maximum datapath width, this approach was used to produce more realistic baseline designs. Also, since the CSA arrays investigated use unsigned arithmetic, constant pushing acts as a crude form of scaling. Thus, both the CSA and Tree baseline designs include some simple adder width optimizations.
The filter specifications were drawn from sample designs described by various authors [6] , [12] , [15] , [21] , [22] . The test cases range from ten-tap to 64-tap fixed-coefficient FIR filters, with a mix of input/output/coefficient widths ranging from 8 bits to 16 bits, consisting of as many as 193 adders. The fundamental statistics for each filter specification are shown in Table II . Mapping each design to the LSI 10k library yielded the design sizes shown in Table III , where the gate counts are in terms of two-input gate-equivalents. Applying the optimization techniques as a design tool, substantial area reductions were possible in most designs, shown in Table IV . The average area reduction was 8.9%.
In order to gauge the impact of redundancy removal, we chiefly relied on random-pattern self-test techniques, since automatic test pattern generation is unwieldy on large filter designs. On the nine smaller filter designs and two of the larger designs, this was sufficient to show that all redundant faults had been eliminated from the designs. Guaranteeing 100% removal of redundant faults is difficult in the four remaining large designs, since they contain a relatively high number of random-pattern resistant faults that are difficult to distinguish from redundant faults. However, the techniques described here eliminated between 81.9% and 99.9% of the untested faults in these designs. The following sections describe the test strategy in detail and examine the overhead involved in implementing such a self-test scheme.
TABLE I: REDUNDANCY ELIMINATION TECHNIQUES APPLIED TO BASELINE AND OPTIMIZED DESIGNS
TABLE II: FUNDAMENTAL FILTER STATISTICS
TABLE III: GATE-EQUIVALENTS, LSI LOGIC 10k LIBRARY
TABLE IV: PERCENT REDUCTION IN ACTIVE CIRCUIT AREA DUE TO REDUNDANT FAULT ELIMINATION (LSI 10k LIBRARY)
A. BIST Test Pattern Generator
To measure the extent to which redundancies had been eliminated from filter designs, we employed a pseudorandom test approach, adapted to wring out many of the more stubborn random-pattern test resistant faults. Although the main goal of testing was to determine how many faults could be attributed to redundancy, some consideration was given to finding a scheme that could achieve very high coverage and be efficiently implemented on-chip as part of a self-test scheme.
While standard LFSR sequences can efficiently test a large proportion of the faults in these designs, they have two drawbacks: first, since the output bits are related by a shift, some faults are missed due to correlation effects. Second, the output signal variance of a standard LFSR is approximately 0.3333. This variance is not always high enough to exercise the upper bits of all adders, so a higher variance test signal is sometimes needed.
To address the correlation problem, we have found that an exclusive-or network can effectively destroy any linear correlation at the LFSR's output. The decorrelating circuit we used XOR's the LFSR's LSB with all other output bits. In some cases, it may be sufficient to only XOR some of the upper bits, depending on the nature of the correlation problem.
The test signal variance problem can be addressed by using a test signal with variance close to one, the upper limit on signal variance if the test signal is interpreted as a two's-complement number in the interval $[-1, 1)$; specific approaches are outlined in [23], [3], and [24]. One means of implementing a maximum-variance test signal is to simply use 1 bit of an LFSR to select between the maximum positive and minimum negative representable numbers. This test should only be used to supplement other test sequences, since by itself it does not offer enough testing of low-order bits.
In most filters, an input register is available that can be modified to add an LFSR test generation mode. In Fig. 13 , we show a circuit that can be used to add decorrelating and maximum-variance modes to the input register.
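The three generator modes can be modeled behaviorally as a rough sketch. The 8-bit width, the feedback taps (8, 6, 5, 4), the use of the LSB as the maximum-variance select bit, and all function names are assumptions for illustration; the circuit of Fig. 13 is not reproduced here.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def lfsr_step(state: int) -> int:
    """One step of an 8-bit Fibonacci LFSR (assumed taps 8, 6, 5, 4)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
    return (state >> 1) | (bit << (WIDTH - 1))


def decorrelate(word: int) -> int:
    """XOR the LSB into every other bit; the LSB itself passes through."""
    return word ^ (MASK & ~1) if word & 1 else word


def max_variance(word: int) -> int:
    """One LFSR bit selects the largest positive or most negative word."""
    return 0x7F if word & 1 else 0x80


def to_fraction(word: int) -> float:
    """Interpret an 8-bit word as a two's-complement fraction in [-1, 1)."""
    signed = word - (1 << WIDTH) if word & (1 << (WIDTH - 1)) else word
    return signed / (1 << (WIDTH - 1))


def patterns(seed: int, count: int, switch_at: int):
    """Decorrelating mode first, then maximum-variance mode."""
    state = seed
    for i in range(count):
        yield decorrelate(state) if i < switch_at else max_variance(state)
        state = lfsr_step(state)


def variance(samples) -> float:
    """Population variance of a sequence of numbers."""
    xs = list(samples)
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def lfsr_states(seed: int, count: int) -> list[int]:
    """Raw LFSR states, for comparison with the shaped modes."""
    out, state = [], seed
    for _ in range(count):
        out.append(state)
        state = lfsr_step(state)
    return out
```

Over one full period of this model, the raw LFSR words have a variance near one third when read as fractions, while the maximum-variance mode yields a variance close to one, at the cost of exercising only two distinct input words.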
B. BIST Overhead Considerations
The amount of area and delay added by any BIST scheme is of great concern. We have attempted to minimize this overhead by restricting the test hardware to a single test generator and compressor for each filter. In small filters, it is often sufficient to give the filter input register an LFSR mode and the output register a multiple-input signature register (MISR) mode. In larger filters, high test coverage requirements and maximum test sequence length restrictions will often drive the addition of the decorrelating and maximum-variance test modes provided by the circuit described earlier.
The performance impact of the decorrelator/maximum-variance (DCMV) circuit may be of concern in some high-speed circuits. In a ripple-carry adder architecture, the impact on delay can be minimized by eliminating the LSB multiplexer, feeding the LSB directly through to the circuit. This does not significantly reduce the variance of the maximum-variance test signal, and the delay impact on the critical path through the LSB is eliminated. In carry-save adder array architectures, all bits are on the critical path, so up to two gate delays are added to the critical path. In cases where this added delay is not acceptable, a pipeline register (PREG) can be added between the output of the decorrelating circuit and the filter. A drawback of this approach is the latency added by the pipeline register.
Another performance concern is the delay added by converting the input register to an LFSR. This involves converting the input flip-flops to multiplexed-input flip-flops. In common cell libraries, this modification has little impact on the clock-to-Q time of the register. Instead, the setup time of the register is increased. This could be of concern if the arrival time at the input to the filter is critical. A similar concern applies to the output register, which is converted to a MISR. Again, in timing-critical designs, additional pipelining may need to be considered.
Using the LSI Logic 10k library, the approximate circuit area overhead of our BIST approach can be estimated. There are three schemes we will consider.
1) LFSR and MISR. This is sufficient for many small circuits.
2) LFSR, DCMV, MISR. This provides higher-quality tests with at most two added gate delays.
3) LFSR, DCMV, pipeline register, MISR. This provides high-quality tests and eliminates the delay of the previous scheme at the cost of added area. The most likely application of this scheme is to CSA-based designs.
In small filters, the LFSR can be the same width as the input signal. In very large filters, it may be necessary to generate longer test sequences. In such cases, the width of the input register is increased to accommodate a larger LFSR, while the circuit input width remains unchanged.
The overhead of each scheme is determined by the input signal width, the LFSR width, and the output signal width. The area overhead in terms of two-input gate-equivalents of each BIST component is given by
Since filters do not always require resettable registers, the LFSR and MISR costs include the cost of adding an initialization capability to these registers. The pipeline register is assumed to be nonresettable. The overhead of the three schemes is given by
In some circuits, it may be possible to reduce the size of the DCMV circuit while maintaining good test coverage. This can be done on a trial-and-error basis, starting at the LSB. If bypassing the DCMV at this bit does not reduce fault coverage, the DCMV circuitry for this bit is replaced by a feed-through, and the process is repeated at the next higher bit. In our examples, we use a full-width DCMV circuit.
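The trial-and-error reduction can be sketched as a simple loop. The `coverage` callback is hypothetical and stands in for a full fault-simulation run with the given set of bits bypassed; stopping at the first bit whose bypass lowers coverage is an assumption, since the text does not say what happens after a failed trial.

```python
from typing import Callable, FrozenSet


def prune_dcmv(width: int,
               coverage: Callable[[FrozenSet[int]], float]) -> set[int]:
    """Return the DCMV bit positions that can become feed-throughs.

    `coverage(bypassed)` is a hypothetical fault-simulation hook that
    returns the fault coverage with the given bits bypassed.
    """
    baseline = coverage(frozenset())
    bypassed: set[int] = set()
    for bit in range(width):          # start at the LSB and move upward
        trial = bypassed | {bit}
        if coverage(frozenset(trial)) < baseline:
            break                     # assumed stopping rule
        bypassed = trial
    return bypassed


def toy_coverage(bypassed: FrozenSet[int]) -> float:
    """Toy model: coverage drops once any bit at position 3 or above is bypassed."""
    return 100.0 if all(b < 3 for b in bypassed) else 99.0
```

With the toy model, `prune_dcmv(8, toy_coverage)` keeps DCMV logic on bits 3 through 7 and replaces bits 0 through 2 with feed-throughs.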
For the smallest filter, filt10, the first scheme is adequate to reach 100% coverage with an 8-bit LFSR. The overhead in this case is 126 gate-equivalents.
The total area of the baseline linear-architecture design is 1976 gate-equivalents, giving an overhead figure of 6.4%. However, the redundancy elimination techniques described here provide an area savings of 384 gate-equivalents, giving a net area reduction of 258 gate-equivalents, or 13% relative to the baseline design. The larger filters, such as filt64, require a more aggressive BIST design, but the cost is amortized over a larger area, so the test overhead percentage is typically very low. For the filt64 CSA-based design, assuming a pipeline register is required (Scheme 3), the overhead is the sum of the LFSR, DCMV, PREG, and MISR costs, or 1.2% of the baseline design active circuit area. The area reduction due to redundant fault elimination is 2253 gate-equivalents, so again there is a net decrease in active circuit area if BIST insertion is combined with aggressive redundant fault pruning.
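The filt10 figures quoted above are mutually consistent, which can be verified with a few lines of arithmetic. The variable and function names are chosen for this sketch; the BIST overhead is back-computed here from the quoted savings and net reduction rather than from the component cost formulas.

```python
BASELINE_AREA = 1976       # filt10 linear baseline, gate-equivalents
AREA_SAVED = 384           # removed by redundancy elimination
NET_REDUCTION = 258        # quoted net reduction after adding BIST


def bist_overhead(area_saved: int, net_reduction: int) -> int:
    """BIST area implied by the quoted savings and net reduction."""
    return area_saved - net_reduction


def percent_of_baseline(area: int, baseline: int) -> float:
    """Express an area as a percentage of the baseline design area."""
    return 100.0 * area / baseline


overhead = bist_overhead(AREA_SAVED, NET_REDUCTION)
overhead_pct = percent_of_baseline(overhead, BASELINE_AREA)
net_pct = percent_of_baseline(NET_REDUCTION, BASELINE_AREA)
```

The implied overhead is 126 gate-equivalents, which rounds to the quoted 6.4% of the baseline, and the net reduction rounds to the quoted 13%.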
The BIST strategies used for each of the five filter specifications are shown in Table V, where the area overhead is in terms of LSI 10k gate-equivalents. The BIST active area overhead is compared to the baseline design sizes in Table VI. If redundant fault elimination is used to reduce the area of the design, the net overhead of BIST plus redundant fault elimination is shown in Table VII. The average size of a BISTed, optimized design was 5% smaller than the baseline, non-BIST design.
C. Fault Coverage Results
The results of fault simulating the 30 BISTed designs are shown in Table VIII. In each case, an upper limit of 10 000 vectors was imposed, and the final fault coverage for both optimized ("Opt.") and baseline ("Orig.") designs is given. Decorrelating and maximum-variance test generation modes were used to increase fault coverage in all designs except filt10, for which a simple LFSR sufficed. The pattern generator was started in decorrelating mode and switched to maximum-variance mode after 2k or 4k vectors, depending on which gave the best coverage. The test length corresponds to the last vector at which a fault was detected. It was assumed that no aliasing occurred in the response compressor.
Directly comparing fault coverage between different filter designs can be misleading. Since test-resistant faults are typically found at the MSB-side of the datapath, the test problem remains essentially unchanged if bits are added at the LSB side of the datapath. Consequently, it makes sense to normalize the number of missed faults by the number of adders in the design rather than directly comparing the number of missed faults with the total number of faults. The value of this metric for the thirty designs examined is shown in Table IX .
Redundancy identification eliminates both redundant faults and testable faults associated with logic that can be eliminated, leading to a substantial reduction in the total number of adder faults in most designs, as shown in Table X . The reduction in the total number of undetected faults shown in Table XI corresponds primarily to the elimination of redundant faults, although a few faults are more easily detected due to the SXT-Add random-test enhancing transformation described in Section III-B.
Fault simulation plots are shown in Figs. 14-19. Figs. 14-17 compare the results for one architecture across four filters, while Figs. 18 and 19 compare the three architectures across two designs. In Fig. 14, the generator is switched to maximum-variance mode after 4000 vectors, catching a few undetected faults. In Figs. 15-17, the generator was switched to maximum-variance mode after 2000 vectors.
Redundancy elimination allows fair comparison of the testability of different architectures. Figs. 18 and 19 compare the number of undetected faults for the Linear, CSA, and Tree architectures. The test generator was switched to maximum variance mode after 2000 vectors, except for the filt60 CSA and Tree designs, which reached 100% coverage with the generator switched to maximum variance mode after 4000 vectors. The results show the CSA and Tree architectures to be relatively amenable to random-pattern testing as compared to the Linear architecture.
V. CONCLUSION
Redundant faults can be an obstacle to gauging the true effectiveness of any test scheme, particularly in application-specific digital filters where these faults can be hard to distinguish from highly test-resistant faults. Analysis of the RTL design using arithmetic techniques based on scaling theory and signal phase and magnitude constraints provides an efficient means of identifying and eliminating most redundant faults in these designs.
Elimination of these faults can be done as a preprocessing phase for more accurate fault simulation, or it can be used to eliminate redundant logic from the design itself, in which case it is possible to make significant area reductions as compared to moderately optimized designs, like those produced by automated tools. In 11 of 15 optimized designs, it was possible to show that all redundant faults had been eliminated using these techniques. Across all designs, there was an average reduction of 97.9% in the number of faults that remained undetected after up to 10 000 pseudorandom test patterns had been run through. In terms of area, there was an average reduction of 8.9%. Elimination of redundant faults also made it possible to compare the testability of different implementation architectures without the bias of large numbers of redundant faults.
Using these techniques as a basis for accurately identifying test problems during BIST development, we found that slight improvements in the test generator were preferred to more aggressive BIST insertion techniques, leading to the development of a decorrelator and maximum-variance mode for the pattern generator. In large filters, the BIST area overhead was as low as 1%, while designs as large as 24k gates were 100% tested under a standard gate-level fault model. Thus, it appears that pseudorandom BIST can be an effective approach to testing filters: the area of the added self-test hardware is, in many cases, smaller than the area of the logic removed by redundant fault elimination, for a net reduction in area over the unoptimized design with no self-test capability.
