(21)
The transfer function of Fig. 8 can be shown to be 
The overall loop gains remain closer to unity over a broader range of gain error and, hence, the circuit quality factor remains closer to the desired value. Positive one and two percent changes in amplifier gains result in 6.2 and 21.4% reductions of Q, respectively. Similarly, negative one and two percent changes in amplifier gains cause 5.8 and 19.5% reductions of Q. Thus, application of this filter approach to fixed gain amplifier integration requires gain tracking of around 1% for the various amplifiers. This appears reasonable for amplifiers fabricated on a common substrate.
VII. CONCLUSIONS
Several improvements have been made to a known active filtering method using all-pass networks. A new all-pass circuit, based on a fixed gain integrable amplifier, extends the filter's resonant frequency to the 40-50 MHz range. A modified configuration improves the off-resonant performance of the filter. A second modification of the configuration improves the stability of the circuit and reduces the sensitivity of the Q with respect to amplifier gain.
I. INTRODUCTION
Recursive digital filters are usually implemented as a cascade of second-order sections to minimize finite-precision effects. Nevertheless, due to the inherent feedback loop of the recurrence, such finite-precision effects can induce overflow and limit cycle oscillations even in stable filters. Overflow oscillations are of large amplitude, caused by arithmetic overflow due to fixed precision, and limit cycles are small amplitude oscillations caused by round-off error in multiplication. Both types of oscillations can be induced with zero or nonzero inputs. The direct-form filter is faster and cheaper than other structures but more susceptible to oscillation. Eliminating or reducing oscillations without compromising speed is desirable.
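The round-off mechanism behind limit cycles can be reproduced in a small integer simulation. The following Python sketch is illustrative only: the function name, the 4-bit coefficient format, and the initial state are assumptions chosen for the example and are not taken from this paper. Each product is rounded to the nearest integer before the sum, as in a fixed-point direct-form section with zero input.

```python
def zero_input_response(a_q, b_q, m, y1, y2, steps):
    """Simulate y(n) = Q(a*y(n-1)) + Q(b*y(n-2)) with zero input.

    a_q and b_q are integer coefficients scaled by 2**m, so a = a_q / 2**m.
    The state is held in integer LSB units, and Q rounds each product to the
    nearest integer (ties toward +infinity).
    """
    half = 1 << (m - 1)
    out = []
    for _ in range(steps):
        # >> on Python ints is a floor shift, also for negative values.
        y = ((a_q * y1 + half) >> m) + ((b_q * y2 + half) >> m)
        out.append(y)
        y1, y2 = y, y1
    return out


# Assumed example: a = 0, b = -15/16, so the pole magnitude is about 0.97.
response = zero_input_response(a_q=0, b_q=-15, m=4, y1=0, y2=10, steps=40)
print(response[:6])    # the transient shrinks at first
print(response[-6:])   # later samples keep oscillating instead of reaching zero
```

With infinite precision this stable filter would decay toward zero. In the rounded version the nonzero samples stop shrinking once they reach magnitude 7, which is a zero-input limit cycle.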
Filter rate is also affected by the arithmetic used for computation. Particularly with conventional arithmetic (bit-parallel, least-significant-bit-first computation), the dependence of the output $y(n)$ on $y(n-1)$ limits operating speed. Recent papers have shown that most-significant-digit-first (MSDF) or on-line arithmetic can reduce the dependency to the digit level [2], [4]. MSDF and on-line algorithms compute digit-serially, with inputs and outputs in MSDF sequence, enabling the computation of $y(n)$ to begin as soon as the most significant digit of $y(n-1)$ is produced. Consequently, the clock rate of MSDF and on-line designs is independent of word length, whereas the clock rate of conventional designs diminishes with increasing word length. Therefore, MSDF and on-line designs are faster than conventional ones for words longer than a particular precision which depends on technology [3]. For on-line designs, the precision extension (PE) method can be used to suppress all oscillations without affecting sampling rate [1]. However, the large precision extension makes the PE method costly.
II. MSDF AND ON-LINE ARITHMETIC
MSDF and on-line algorithms produce outputs digit-serially, beginning with the most significant digit (MSD). In on-line algorithms, all operands are in on-line form (digit-serial MSDF). If some of the operands are in parallel form, the algorithm is referred to as an MSDF algorithm. An on-line radix-$r$ operand $x$, assumed to be a fraction, is expressed in terms of its digits $X_j$ as $x = \sum_{j=0}^{d-1} X_j r^{-j}$, where $X_j$ is in the redundant digit set $\{-\rho, \ldots, \rho\}$ such that $r > \rho \ge r/2$. Having input and output digits in the same digit set allows cascading of on-line modules and facilitates recursive computation. An important characteristic of on-line computation is the on-line delay, imp, which is the number of clocks between input and output digits of identical weight. The on-line delay is typically two to five clocks. On-line arithmetic offers a systematic approach to deriving digit-serial algorithms. Well-established on-line algorithms exist for common operations. On-line and MSDF algorithms are particularly well suited for VLSI implementations of high-speed recursive filters [1], [3], [4].
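As an illustration of the redundant representation, the sketch below evaluates an operand from its digits, taking the value as the sum of each digit $X_j$ weighted by $r^{-j}$, most significant digit first. The function name and the radix-4 digit set $\{-2, \ldots, 2\}$ are assumptions made for the example. It shows that two different digit strings can represent the same value, which is the redundancy that lets an on-line unit emit its most significant digits before all input digits are known.

```python
from fractions import Fraction


def online_value(digits, radix):
    """Return x = sum_j X_j * radix**(-j) for digits X_0, X_1, ... (MSD first)."""
    return sum(Fraction(d, radix ** j) for j, d in enumerate(digits))


# Radix-4 digits taken from the assumed redundant set {-2, ..., 2}.
x1 = online_value([0, 1, -2], radix=4)   # 1/4 - 2/16
x2 = online_value([0, 0, 2], radix=4)    # 2/16
print(x1, x2)
```

Exact rational arithmetic is used here only so that the two values can be compared without floating-point error.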
Precision Extension (PE) Method
The PE method of eliminating all self-sustained oscillations in a stable fixed-point second-order filter ($y(n) = u(n) + a\,y(n-1) + b\,y(n-2)$, where coefficients $a$ and $b$ are real-valued) using on-line arithmetic is described in [1]. The relevant results described in [1] are summarized below. The symbols used are as follows: $m$ is the number of fraction bits representing coefficients, $b_D$ the number of desired output bits, $b_L$ the number of additional least significant bits required to eliminate limit cycles from the desired output, $b_O$ the number of additional most significant bits required to eliminate overflow oscillations from the desired output, $b_W$ the working precision in bits, and $E$ the maximum normalized quantization error due to multiplication.
The limit cycle magnitude is less than $2E(4/\pi)2^{1.5m}$ for pole magnitudes >0.9. Since the limit cycle corrupts the least significant part of the fixed-point result, extending working precision sufficiently at the least significant bit (LSB) end eliminates the limit cycle from the actual output. The required extension is given by (1). Overflow oscillations are caused by internal overflows. Assuming the quantization error is negligible, and that $u(n) < 1$, the number of bits required at the most significant bit (MSB) end to prevent overflow is given by (2). Thus, the working precision required to eliminate both overflow and limit cycles from the actual output is given by (3).
(1)
In implementations using conventional arithmetic, such an extension of precision reduces the sampling rate. In contrast, if on-line or MSDF arithmetic is used, the sampling rate is unaffected. The increase in hardware for PE is significant regardless of the arithmetic used. For example, having coefficients with 10 fraction bits requires a working precision of 44 bits. The dynamic scaling (DS) method is a less costly method of eliminating oscillations in on-line filters.
III. THE DS METHOD
With a moderate extension of working precision, the MSD of the output y(n) can be guaranteed to be zero during zero-input (u(n) = 0) limit cycle oscillations. This allows the output to be left-shifted by one digit for the computation of the next output y(n+1); provided an exponent is introduced to keep track of the shifts. Shifting can be done again when the MSD becomes zero sometime later. By induction, the shifting can be done until the exponent is decremented to the point at which the desired output is zero for a given precision. Thus, limit cycles are eliminated from the desired output by increasing working precision just sufficient to guarantee a zero MSD when u(n) = 0.
Thus, for radix-4 digits, the DS method requires a minimum working Table I. The number of bits required for the PE method is calculated assuming $b_D = m$.
Overflow oscillations are eliminated by incrementing the exponent ($E_y$) and right-shifting the output when $y(n)$ overflows. For radix-4 digits, the maximum value of $E_y$ is $E_{max} = \lceil b_O/2 \rceil$. Eliminating limit cycles from the desired output requires $b_D$ zero bits to be absorbed into the exponent. For radix-4 digits, the minimum value of $E_y$ is $E_{min} = -\lceil b_D/2 \rceil$.
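A short worked example of these exponent bounds may help. In the sketch below the function name and the parameter `bits_per_digit` are illustrative, and the value of 14 overflow bits is an assumption picked for the example; only the value of 8 desired output bits corresponds to the implementation discussed in Section IV. Each radix-4 digit carries two bits, so the bit counts are divided by two and rounded up.

```python
import math


def exponent_range(b_o, b_d, bits_per_digit=2):
    """Return (E_min, E_max), in digits, for the given bit counts.

    b_o: extra MSB-end bits needed against overflow (assumed value in the demo)
    b_d: number of desired output bits
    """
    e_max = math.ceil(b_o / bits_per_digit)
    e_min = -math.ceil(b_d / bits_per_digit)
    return e_min, e_max


print(exponent_range(b_o=14, b_d=8))
```

For 8 desired output bits the lower bound is -4, which agrees with the exponent range stated for the implementation in Section IV.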
The working precision for the PE method is about 2.5 times that required for the DS method, as Table I shows. Cost savings result because, as shown later, the cost of introducing an exponent into the computation is less than that for precision extension according to (3).
The DS Algorithm
The DS algorithm is based on two scaling operations, advance and retard, performed on on-line operands. The advance operation performs a 1-digit left shift of the mantissa and also decrements the exponent. The retard operation performs an $n$-digit right shift and increments the exponent by $n$. Table II illustrates advance and retard (for $n = 2$) operations performed on operand $y$ with digits $Y$.
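The two scaling operations can be sketched on a digit list as follows. This is a minimal model under stated assumptions, not the hardware scaler: the function names are illustrative, the operand is a fixed-length list of radix-4 digits with the most significant digit first, and the represented value is assumed to be the mantissa times the radix raised to the exponent.

```python
from fractions import Fraction


def value(digits, exponent, radix=4):
    """Value of an on-line operand: mantissa (MSD first) times radix**exponent."""
    mantissa = sum(Fraction(d, radix ** j) for j, d in enumerate(digits))
    return mantissa * Fraction(radix) ** exponent


def advance(digits, exponent):
    """Shift the mantissa left by one digit and decrement the exponent."""
    if digits[0] != 0:
        raise ValueError("advance requires a zero most significant digit")
    return digits[1:] + [0], exponent - 1


def retard(digits, exponent, n):
    """Shift the mantissa right by n digits and increment the exponent by n."""
    kept = digits[:len(digits) - n]   # the n low-order digits are dropped
    return [0] * n + kept, exponent + n


y = [0, 1, -2, 0, 0]
adv_digits, adv_exp = advance(y, 0)
ret_digits, ret_exp = retard(y, 0, 2)
print(adv_digits, adv_exp)
print(ret_digits, ret_exp)
```

Both operations leave the represented value unchanged in this example because the digit shifted out is zero in each case. A retard that drops nonzero low-order digits would truncate the operand.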
Normalized $u(n)$ is denoted by $u$, and $y(n)$, $y(n-1)$, $y(n-2)$ are denoted by $y$, $y_1$, $y_2$. $E_u$ is the exponent of $u$ and $E_y$ is the exponent for both $y_1$ and $y_2$. $E'_y$ is the exponent of $y$ and is also the new exponent of $y_1$. The advance operation is indicated by ADV = 1.
$R_u$ is the magnitude of shift, in digits, for retard on $u$, and $R_y$ the same for retard on $y_1$ and $y_2$. $R_v$ is the magnitude of shift, in digits, caused by overflow in $y_1$ from the previous computation ($R_v \in \{0, 2\}$). Signals $Y_{1z} = 1$ and $Y_{2z} = 1$ indicate that the leading fraction digit of $y_1$ or $y_2$, respectively, is zero.
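DS Algorithm

Begin
Step 0: Initially $E_y = 0$, $y_1 = 0$, $y_2 = 0$
Step 1: Compute $R_y = \max(E_u - E_y, R_v)$, $R_y \ge 0$
Step 2: Find (Boolean) ADV $= (R_v = 0) \wedge Y_{1z} \wedge Y_{2z} \wedge (E_u < E_y) \wedge (E_y > E_{min})$
Step 3: Compute the retard value $R_u$ for $u$
Step 4: Compute the new exponent $E'_y$
Step 5: Retard/Advance $u$, $y_1$, $y_2$
Step 6: Execute on-line fixed-point computation $y = u + a y_1 + b y_2$
Step 7: $E_y \leftarrow E'_y$
Step 8: Go to Step 1
End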
Step 1 of the DS algorithm indicates the two conditions that require $y_1$ and $y_2$ to be retarded: when $E_u > E_y$, or when $y_1$ overflows from the previous computation. Since both conditions may occur simultaneously, the maximum retard value is chosen.
Step 2 specifies the conditions for advance: no overflow, leading fraction digits of $u$, $y_1$, and $y_2$ must be zero, and the exponent must be greater than the minimum value. As Step 3 indicates, the retard value of $u$ is different because $u$ is normalized (i.e., its MSD is nonzero). Normalization is convenient because $u$ needs no advance subsequently. Also, advance by more than 1 increases complexity and delay of the DS unit.
Step 4 calculates the new exponent based on the scaling operations performed.
Step 5 scales the on-line operands that are input to the on-line fixed-point computation in Step 6. To reduce complexity, $y_1$ and $y_2$ share the same exponent and are retarded or advanced identically. The radix and digit set chosen for the DS algorithm are the same as those for the fixed-point computation.
Besides extending precision, no other changes are required in the fixed-point computation.
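The control decisions of Steps 1 and 2 can be written as a small function. This is a hedged sketch: the function and argument names are illustrative, and only the two steps whose conditions are stated explicitly in the algorithm are modeled. The computations of the retard value for the input and of the new exponent are deliberately left out.

```python
def exponent_unit_signals(e_u, e_y, r_v, y1_zero, y2_zero, e_min):
    """Return (R_y, ADV) following Steps 1 and 2 of the DS algorithm.

    e_u, e_y : exponents of u and of the pair (y1, y2)
    r_v      : shift (in digits) demanded by an overflow of y1, 0 if none
    y1_zero, y2_zero : True if the leading fraction digit of y1 / y2 is zero
    e_min    : smallest allowed exponent
    """
    # Step 1: retard y1 and y2 by the larger of the two requirements, never < 0.
    r_y = max(e_u - e_y, r_v, 0)
    # Step 2: advance only if every condition holds.
    adv = (r_v == 0) and y1_zero and y2_zero and (e_u < e_y) and (e_y > e_min)
    return r_y, adv


print(exponent_unit_signals(e_u=0, e_y=-1, r_v=0, y1_zero=True, y2_zero=True, e_min=-4))
print(exponent_unit_signals(e_u=-2, e_y=0, r_v=0, y1_zero=True, y2_zero=True, e_min=-4))
```

In the first call the input exponent exceeds that of the state, so the state is retarded by one digit and no advance occurs. In the second call all advance conditions hold and no retard is needed.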
IV. IMPLEMENTATION
This section describes gate array (LCA10K) implementations of the DS scheme and the PE method and compares performance and cost. Both implementations are for $m = 8$ and $b_D = 8$. Fig. 1 shows the block diagram of the DS scheme. Three scalers and a single exponent unit are connected to a word module (two cascaded radix-4 on-line MA modules described in [3]) that computes the on-line fixed-point recurrence $y = u + a y_1 + b y_2$. The exponent unit computes the retard values and advance signal as shown in the graph in Fig. 2.
In general, for $d$-digit working precision, $d$ + imp clocks are required. For 8-digit working precision, the computation of $y(n)$ takes 12 clocks, C0, C1, ..., C11. Synchronization of the digits is done by introducing appropriate delays (1D, 4D, and 6D in Fig. 1) based on the known latencies of the fixed-point computation and the scalers. The latency of the fixed-point computation is nine clocks (two cascaded MA modules, each with a four-clock latency, plus a latch) and that of each scaler is two. Since the delay for the $y_1$ loop must be 12 (the same as the computation cycle), a delay of 1 is inserted (shown as 1D in Fig. 1). To synchronize inputs for the fixed-point computation, $y_1$ is delayed another 12 clocks to produce $y_2$. Since the scaler takes two clocks, a delay of 10 clocks must be inserted in the $y_2$ path. The 10-clock delay is split into delays 6D and 4D to achieve synchronization at the scaler input and at the fixed-point computation input.
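The delay bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below uses illustrative names; the numeric inputs are the ones quoted in the text (8-digit precision, an on-line delay of 4 clocks, a nine-clock fixed-point latency, and a two-clock scaler latency).

```python
def synchronization_delays(d, imp, fixed_point_latency, scaler_latency):
    """Return (cycle length, extra y1-loop delay, delay in the y2 path) in clocks."""
    cycle = d + imp
    # The y1 loop (fixed-point unit plus scaler) must take exactly one cycle.
    y1_delay = cycle - (fixed_point_latency + scaler_latency)
    # y2 is y1 delayed by one cycle, of which the scaler supplies its own latency.
    y2_delay = cycle - scaler_latency
    return cycle, y1_delay, y2_delay


print(synchronization_delays(d=8, imp=4, fixed_point_latency=9, scaler_latency=2))
```

The result reproduces the 12-clock cycle, the single inserted delay in the first feedback loop, and the 10-clock delay in the second path that is split into 6D and 4D.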
The DS unit operates as follows. In clock C0 the scalers receive $R_u$, $R_y$, and ADV from the exponent unit and also the MSDs of $u$, $y_1$, and $y_2$. The on-line inputs are scaled and the fixed-point computation begins in C2. The exponent unit begins computation in C10, when inputs $E_y$, $E_u$, $R_v$, $Y_{1z}$, and $Y_{2z}$ are gated by GATE. ADV, $R_y$, and $R_u$ have to be stable in C1. Computation of $E'_y$ is not critical because the value is not needed at the beginning of the iteration. In C0, the most significant fraction digits are available at the inputs of the scalers shown in Fig. 1. Since the scalers have a delay of two clocks, the scaled outputs are available in C2. The signal CLD clears the digit registers of the scalers before each computation cycle begins. CLZ clears the flip-flop that outputs Z to the scalers.
(Table I). The exponent range required for the DS scheme is $-4 \le E_y \le 7$. The two implementations are compared in Table III. The DS implementation has a sampling rate 80% higher than that of the equivalent PE implementation and is 13% smaller. Thus the rate/gate ratio for the DS scheme is more than twice that for the PE implementation. As $m$ or $b_D$ is increased, the required working precision for the PE scheme increases relative to that of the DS scheme (Table I) and the increase in cost of the scalers and the exponent unit of the DS scheme is relatively small. Thus for higher working precision the DS scheme is even more cost-effective.
Applying DS to Arrays: The DS method may be applied to maximum-rate arrays such as that shown in Fig. 3 for 16-bit I/O. Since the output of a word module is input to the next word module without delay, the array delivers the maximum rate achievable with the given MA modules ($\mathrm{MaxRate} = 1000/(t_{clk} \cdot \mathrm{imp})$).
Rather than advancing on-line inputs before each computation, the advancing can be done once for several computations. Thus leading zeros are allowed to accumulate and are removed simultaneously. Limit cycles would still be effectively eliminated since leading zeros in the I/O would be gradually removed. Consider an array similar to Thus the array requires 8 MA modules with a working precision of 22 bits.
With one DS unit for the array, the normalizer feeding the independent inputs $u(n)$ through $u(n+3)$ to the word modules must perform block normalization, producing four on-line inputs with a single exponent $E_u$. The cost of a maximum-rate word module array with DS is estimated at 14 000 gates (block normalizer approximately 1000 gates, eight MA modules with 22-bit I/O 11 856 gates, DS unit 1179 gates). In contrast, using PE in the array takes 28 260 gates (six word modules, with 36-bit precision). Thus, the cost of a PE scheme is twice the cost of a DS scheme for maximum-rate arrays. With DS, the maximum rate is given by $\mathrm{MaxRate}_{DS} = 1000/(t_{clk}(\mathrm{imp} + 2/N_{DS}))$. For the example considered, $d = 8$, imp = 4, and $N_{DS} = 4$, $\mathrm{MaxRate}_{DS}$ = 24.4 MSamples/s, which is 88% of the maximum rate without DS (27.8 MSamples/s). Although the array using PE is more regular and 12% faster, the maximum-rate array with DS is more cost-effective, with a rate/cost ratio of 1.8 times that of an array with PE.
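The cost and rate comparison for the array can be verified directly from the figures quoted above. The variable names in this sketch are illustrative; all numbers are taken from the text.

```python
# Gate counts quoted for the maximum-rate array with DS.
ds_gates = 1000 + 11856 + 1179      # block normalizer + MA modules + DS unit
pe_gates = 28260                    # array using PE
ds_rate, pe_rate = 24.4, 27.8       # MSamples/s, with and without DS

cost_ratio = pe_gates / ds_gates                        # PE cost relative to DS
rate_fraction = ds_rate / pe_rate                       # DS rate relative to PE
rate_per_gate_gain = (ds_rate / ds_gates) / (pe_rate / pe_gates)

print(ds_gates, round(cost_ratio, 2), round(rate_fraction, 2),
      round(rate_per_gate_gain, 2))
```

The component counts sum to roughly 14 000 gates, the PE array costs about twice as much, the DS array runs at about 88% of the PE rate, and the rate/cost ratio favors DS by a factor close to 1.8.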
V. CONCLUSION
We have proposed the DS scheme and shown that it is more cost-effective than the PE method in eliminating limit cycle oscillations and overflow oscillations in on-line implementations of direct-form recursive filters. The scheme is implemented by adding a DS unit to a fixed-point on-line word module. Except for precision adjustment, easily achieved by adding bit-slices, no changes to the word module are required. Implementation in a 1.5-μm gate array technology shows that, for word modules with an output precision of 8 bits, the DS scheme is 13% smaller than the PE scheme and has a sampling rate 80% higher. Maximum-rate arrays using the DS scheme require only half as many gates as an array using PE and operate at 88% of the maximum rate. For higher output precision the DS scheme is even more cost-effective. Having automatic scaling, the DS scheme eliminates the need for scaling between cascaded sections.
