INTRODUCTION

While Σ∆M techniques [1, 2] are applied widely in analog conversion sub-systems, both analog-to-digital (ADC) and digital-to-analog (DAC) converters, these methods have enjoyed much less exposure in the broader application domain, where flexible and configurable solutions, traditionally supplied via a software DSP (soft-DSP), are required. This limited level of exposure is easy to understand. Most, if not all, of the efficiencies and optimizations afforded by Σ∆M are hardware oriented and so cannot be capitalized on in the fixed precision pre-defined datapath found in a soft-DSP processor. This limitation, of course, does not exist in a field programmable gate array (FPGA) DSP solution. With FPGAs the designer has complete control of the silicon to implement any desired datapath and employ optimal word precisions in the system, with the objective of producing a design that satisfies the specifications in the most economically sensitive manner. While implementation of a digital Σ∆ ASIC (application specific integrated circuit) is of course possible, economic constraints make the implementation of such a building block, one that would provide the flexibility and be generic enough to cover a broad market cross-section, impractical. FPGA based hardware provides a solution to this problem. FPGAs are an off-the-shelf commodity item that provide a silicon feature set ideal for constructing high-performance DSP systems. These devices maintain the flexibility of software based solutions, while providing levels of performance that match, and often exceed, ASIC solutions.
There is a rich and expanding body of literature devoted to the efficient and effective implementation of digital signal processors using FPGA based hardware. More often than not, the most successful of these techniques involves a paradigm shift away from the methods that provide good solutions in software programmable DSP systems. This paper reports on the rich set of design opportunities that are available to the signal processing system designer through innovative combinations of Σ∆M techniques and FPGA signal processing hardware. The applications considered include narrow-band filters, both single-rate and multirate, a DC canceler, and Σ∆M hybrid digital-analog control loops for simplifying carrier recovery, timing recovery and AGC (automatic gain control) loops in a digital communication receiver. The paper is organized as follows: Section 2 presents a brief overview of FPGA architecture. In Section 3 a simple single-loop base-band Σ∆ modulator is introduced. This structure is then extended to a novel architecture that permits center frequency tuning, as well as a method for working with the system degrees of freedom to trade off modulator bandwidth against dynamic range. The tunable Σ∆M is then utilized for implementing area efficient FPGA FIR filters. The process for computing the modulator coefficients for lowpass, bandpass and highpass designs is described. In Section 4, a new Σ∆ modulator architecture is described that provides a very simple method for tuning using only a single coefficient. In any fixed-point datapath, careful consideration must be given to the DC aspects of the design. For example, the introduction of a DC component due to sample truncation between the stages of a multistage multi-rate filter can be problematic, causing arithmetic saturation or increasing the bit error rates in a digital receiver. Section 5 describes a unique Σ∆ modulator approach to building a DC canceler.
In Section 6, Σ∆ methods are described for simplifying the implementation of hybrid digital-analog control loops in a system such as a software defined radio. In Section 7 some comments on the industrial implications of the techniques considered in the paper are presented. Finally, some conclusions are drawn in Section 8.
FPGA ARCHITECTURE
There is a rich range of FPGAs provided by many semiconductor vendors including Xilinx, Altera, Atmel, AT&T and several others. The architectural approaches are as diverse as there are manufacturers, but some generalizations can be made. Most of the devices are basically organized as an array of logic elements and programmable routing resources used to provide the connectivity between the logic elements, FPGA I/O pins and other resources such as on-chip memory. The structure and complexity of the logic elements, as well as the organization and functionality supported by the interconnection hierarchy, distinguish the devices from each other. Other device features such as block memory and delay locked loop technology are also significant factors that influence the complexity and performance of an algorithm that is implemented using FPGAs. A logic element usually consists of one or more RAM (random access memory) n-input look-up tables, where n is between 3 and 6, and one to several flip-flops. There may also be additional hardware support in each element to enable high-speed arithmetic operations. This generic FPGA architecture is shown in Figure 1. Also illustrated in the figure (as wide lines) are several connections between logic elements and the device input/output (I/O) ports. Application specific circuitry is supported in the device by downloading a bit stream into SRAM (static random access memory) based configuration memory. This personalization database defines the functionality of the logic elements, as well as the internal routing. Different applications are supported on the same FPGA hardware platform by configuring the FPGA(s) with appropriate bit streams. As a specific example, consider the Xilinx Virtex™ series of FPGAs [3].
The logic elements, called slices, essentially consist of two 4-input look-up tables (LUTs), two flip-flops, several multiplexors and some additional silicon support that allows the efficient implementation of carry-chains for building high-speed adders, subtracters and shift registers. Two slices form a configurable logic block (CLB) as shown in Figure 2. The CLB is the basic tile that is used to build the logic matrix. Some FPGAs, like the Xilinx Virtex families, supply on-chip block RAM. Figure 3 shows the CLB matrix that defines a Virtex FPGA. Current generation Virtex silicon provides a family of devices offering 768 to 12,288 logic slices, and from 8 to 32 variable form factor block memories. Xilinx XC4000 and Virtex [3] devices also allow the designer to use the logic element LUTs as memory, either ROM or RAM. Constructing memory with this distributed memory approach can yield access bandwidths in the many tens of gigabytes per second range. Typical clock frequencies for current generation devices are in the range of 100 to 200 MHz. In contrast to the logic slice architecture employed in Xilinx Virtex devices, the logic block architecture employed in the Atmel AT40K [4] FPGA is shown in Figure 4. Like the Xilinx device, combinational logic is realized using look-up tables. In this case, two 3-input LUTs and a single flip-flop are available in each logic cell. The pass gates in a cell form part of the signal routing network and are used for connecting signals to the multiple horizontal and vertical bus planes. In addition to the orthogonal routing resources, indicated as N, S, E and W in Figure 4, a diagonal group of interconnects (NW, NE, SE, and SW), associated with each cell x output, is available to provide efficient connections to a neighboring cell's x bus inputs.
The objective of the FPGA/DSP architect is to formulate algorithmic solutions for applications that best utilize FPGA resources to achieve the required functionality. This is a three-dimensional optimization problem in power, complexity and bandwidth. The remainder of this paper describes some novel FPGA solutions to several signal processing problems. The results are important in an industrial context because they enable either smaller, and hence more economic, solutions to important problems, or allow more arithmetic compute power to be realized with a given area of silicon.
Σ∆ MODULATORS, FIR FILTERS AND FPGAS
This section describes a method employing sigma-delta modulation (Σ∆M) techniques for implementing area efficient finite impulse response (FIR) filters using FPGA hardware. Before treating the FPGA filter design, a brief review of Σ∆ modulation encoding is presented.
Σ∆ MODULATION
Sigma-Delta modulation is a source coding technique most prominently employed in analog-to-digital and digital-to-analog converters. In this context, hybrid analog and digital circuits are used in the realization. Figure 5 shows a single loop Σ∆ modulator. Provided the input signal is busy enough, the linearized discrete time model of Figure 6 can be used to illustrate the principle. In this figure the 1-bit quantizer is modeled by an additive white noise source with variance $\sigma_q^2$. The z-transform of the modulator output is

$$Y(z) = H_s(z)\,X(z) + H_n(z)\,Q(z), \qquad H_s(z) = \frac{H(z)}{1 + H(z)}, \qquad H_n(z) = \frac{1}{1 + H(z)},$$

where

$$H(z) = \frac{z^{-1}}{1 - z^{-1}}$$

is the transfer function of a delay and an ideal integrator, and $H_s(z)$ and $H_n(z)$ are the signal and noise transfer functions (NTF) respectively. In a good Σ∆ modulator, $H_s(\omega)$ will have a flat frequency response in the interval $|f| \le B$. In contrast, $H_n(\omega)$ will have a high attenuation in the frequency band $|f| \le B$ and a don't care region in the interval $B < |f| < f_s/2$. For the single loop Σ∆ in Figure 6, $H_s(z) = z^{-1}$ and $H_n(z) = 1 - z^{-1}$. Thus the input signal is not distorted in any way by the network and simply experiences a pure delay from input to output. The performance of the system is determined by the noise transfer function $H_n(z)$, whose magnitude response is given by

$$|H_n(f)| = 2\left|\sin\!\left(\frac{\pi f}{f_s}\right)\right|$$

and is shown in Figure 7. The in-band quantization noise variance is

$$\sigma_n^2 = \int_{-B}^{B} S_q(f)\,|H_n(f)|^2\,df,$$

where $S_q(f) = \sigma_q^2/f_s$ is the power spectral density of the quantization noise. Observe that for a non-shaped (or white) noise spectrum, increasing the sampling rate by a factor of 2, while keeping the bandwidth B fixed, reduces the quantization noise by 3 dB. For a first order Σ∆M it can be shown that

$$\sigma_n^2 \approx \frac{\pi^2}{3}\,\sigma_q^2 \left(\frac{2B}{f_s}\right)^{3}$$

for $f_s \gg 2B$. Under these conditions doubling the sampling frequency reduces the noise power by 9 dB, of which 3 dB is due to the reduction in $S_q(f)$ and a further 6 dB is due to the filter characteristic $H_n(f)$. The noise power is reduced by increasing the sampling rate to spread the quantization noise over a large bandwidth and then by shaping the power spectrum using an appropriate filter.
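The 3 dB and 9 dB figures quoted above can be checked numerically. The following is a minimal sketch, assuming the first-order noise transfer function Hn(z) = 1 - z^-1 and a white quantization noise density of variance/fs; the function names, the unit bandwidth and the two sample rates are illustrative choices and are not taken from the paper.

```python
import math

def inband_noise_power(fs, B, shaped, sigma_q2=1.0, steps=2000):
    """Integrate Sq(f) * |Hn(f)|^2 over |f| <= B with the midpoint rule."""
    sq = sigma_q2 / fs              # white quantization-noise density
    df = 2.0 * B / steps
    total = 0.0
    for i in range(steps):
        f = -B + (i + 0.5) * df
        # |1 - exp(-j*2*pi*f/fs)|^2 = 4*sin^2(pi*f/fs) for the first-order loop
        gain = 4.0 * math.sin(math.pi * f / fs) ** 2 if shaped else 1.0
        total += sq * gain * df
    return total

def db(x):
    return 10.0 * math.log10(x)

B = 1.0  # signal bandwidth (assumed units), with fs >> 2B in both cases
white_gain = db(inband_noise_power(64.0, B, False)) - db(inband_noise_power(128.0, B, False))
shaped_gain = db(inband_noise_power(64.0, B, True)) - db(inband_noise_power(128.0, B, True))
print(f"white noise: {white_gain:.2f} dB per doubling of fs")
print(f"first-order shaped noise: {shaped_gain:.2f} dB per doubling of fs")
```

Doubling the sample rate in this sketch lowers the in-band noise by roughly 3 dB without shaping and roughly 9 dB with first-order shaping, in agreement with the discussion above.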
REDUCED COMPLEXITY FILTERS USING Σ∆ MODULATION TECHNIQUES
Σ∆M techniques can be employed for realizing area efficient narrow-band filters in FPGAs. These filters are utilized in many applications, for example narrow-band communication receivers, multi-channel RF surveillance systems and some spectrum management problems. A uniform quantizer operating at the Nyquist rate is the standard solution to the problem of representing data within a specified dynamic range. Each additional bit of resolution in the quantizer provides an increase in dynamic range of approximately 6 dB. A signal with 60 dB of dynamic range requires 10 bits, while 16 bits can represent data with a dynamic range of 96 dB. While the required dynamic range of a system fixes the number of bits required to represent the data, it also affects the expense of subsequent arithmetic operations, in particular multiplications. In any hardware implementation, and of course this includes FPGA based DSP processors, there are strong economic imperatives to minimize the number and complexity of the arithmetic components employed in the datapath. The proposal investigated in this section is to employ noise-shaping techniques to reduce the precision of the input data samples so that the complexity of the multiply-accumulate (MAC) units in the filter can be minimized. Of course, the pre-processing must not compromise the integrity of the signal in the band of interest. The net result is a reduction in the amount of FPGA logic resources required to realize the specified filter. Consider the structure shown in Figure 8. Instead of applying the quantized data x(n) from the analog-to-digital converter directly to the filter, it will be pre-processed by a Σ∆ modulator.
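The 6 dB per bit rule of thumb used above can be written as a one-line calculation. This is only a sketch: the constant 6.02 dB per bit is the usual approximation for a uniform quantizer, and the helper name is hypothetical.

```python
import math

DB_PER_BIT = 6.02  # approximate dynamic range contributed by each quantizer bit

def bits_for_dynamic_range(dr_db):
    """Smallest word length whose dynamic range covers dr_db."""
    return math.ceil(dr_db / DB_PER_BIT)

print(bits_for_dynamic_range(60.0), bits_for_dynamic_range(96.0))
```

For the two cases quoted in the text (60 dB and 96 dB) this gives 10 and 16 bits respectively.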
The re-quantized input samples $\hat{x}(n)$ are now represented using fewer bits per sample, so permitting the subsequent filter H(z) to employ reduced precision multipliers in the mechanization. The filter coefficients are still kept to a high precision. The Σ∆ data re-quantizer is based on a single loop error feedback sigma-delta modulator [1] shown in Figure 9. In this configuration, the difference between the quantizer input and output sample is a measure of the quantization error, which is fed back and combined with the next input sample. The error-feedback sigma-delta modulator operates on a highly oversampled input and uses the unit delay $z^{-1}$ as a predictor. With this basic error feedback modulator only a small fraction of the bandwidth can be occupied by the required signal. In addition, the circuit only operates at baseband. A larger fraction of the Nyquist bandwidth can be made available, and the modulator can be tuned, if a more sophisticated error predictor is employed. This requires replacing the unit delay with a prediction filter P(z). This generalized modulator is shown in Figure 10. The operation of the re-quantizer can be understood by considering the transform domain description of the circuit. This is expressed in Eq. (7) as

$$\hat{X}(z) = X(z) + Q(z)\left(1 - P(z)\,z^{-1}\right) \qquad (7)$$

where Q(z) is the z-transform of the equivalent noise source added by the quantizer q(·), P(z) is the transfer function of the error predictor filter, and $X(z)$ and $\hat{X}(z)$ are the transforms of the system input and output respectively. P(z) is designed to have unity gain and leading phase shift in the bandwidth of interest. Within the design bandwidth, the term $Q(z)\left(1 - P(z)\,z^{-1}\right) \approx 0$ and so $\hat{X}(z) \approx X(z)$. By designing P(z) to be commensurate with the system passband specifications, the in-band spectrum of the re-quantizer output will ideally be the same as the corresponding spectral region of the input signal.
Figure 10. Tunable sigma-delta modulator using a linear modulator in the feedback path.

To illustrate the operation of the system consider the task of recovering a signal that occupies 10% of the available bandwidth and is centered at a normalized frequency of 0.3 Hz. The stopband requirement is to provide 60 dB of attenuation. Figure 11(a) shows the input test signal. It comprises an in-band component and two out-of-band tones that are to be rejected. Figure 11(b) is a frequency domain plot of the signal after it has been re-quantized to 4 bits of precision by a Σ∆ modulator employing an 8th order predictor in the feedback path. Notice that the 60 dB dynamic range requirement is supported in the bandwidth of interest, but that the out-of-band SNR has been compromised. This is of course acceptable, since the subsequent filtering operation will provide the necessary rejection. A 160-tap filter H(z) satisfies the problem specifications. The frequency response of H(z) using 12-bit filter coefficients is shown in Figure 11(c). Finally, H(z) is applied to the reduced sample precision data stream $\hat{x}(n)$ to produce the spectrum shown in Figure 11(d).
Observe that the desired tone has been recovered, the two out-of-band components have been rejected, and that the in-band dynamic range meets the 60 dB requirement.
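The basic error-feedback re-quantizer can be sketched in a few lines. The code below assumes the simplest case described in the text, a unit delay as the predictor (P(z) = 1), together with a 4-bit mid-tread quantizer with saturation; the function name, the full-scale range and the test signal are illustrative assumptions and do not reproduce the 8th order design used for Figure 11.

```python
import math

def requantize(x, bits=4, full_scale=1.0):
    """First-order error-feedback re-quantizer: x_hat(n) = x(n) + q(n) - q(n-1)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    err = 0.0                      # previous quantization error q(n-1)
    out = []
    for sample in x:
        v = sample - err           # combine the input with the fed-back error
        code = math.floor(v / step + 0.5)                      # round to nearest level
        code = max(-levels // 2, min(levels // 2 - 1, code))   # saturate to the word length
        y = code * step
        err = y - v                # quantization error q(n) introduced on this sample
        out.append(y)
    return out

# Slowly varying (highly oversampled) test tone, well inside the quantizer range.
x = [0.5 * math.sin(2.0 * math.pi * 0.01 * n) for n in range(2000)]
x_hat = requantize(x)
```

Because the output equals the input plus the first difference of the quantization error, the running sum of the output tracks the running sum of the input to within half a quantizer step, which is one way of seeing that the noise has been pushed away from DC.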
PREDICTION FILTER DESIGN
The design of the error predictor filter is a signal estimation problem [7, 8] . The optimum predictor is designed from a statistical viewpoint. The optimization criterion is based on the minimization of the mean-squared error. As a consequence, only the second-order statistics (autocorrelation function) of a stationary process are required in the determination of the filter. The error predictor filter is designed to predict samples of a band-limited white noise process Nxx(ω) shown in Figure 12 .
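Nxx(ω) is defined as

$$N_{xx}(\omega) = \begin{cases} 1, & |\omega| \le \theta \\ 0, & \text{otherwise} \end{cases}$$

and is related to the autocorrelation sequence rxx(m) by the discrete-time Fourier transform (DTFT)

$$N_{xx}(\omega) = \sum_{m=-\infty}^{\infty} r_{xx}(m)\,e^{-j\omega m}. \qquad (9)$$

The autocorrelation function rxx(n) is found by taking the inverse DTFT of Eq. (9),

$$r_{xx}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} N_{xx}(\omega)\,e^{j\omega n}\,d\omega.$$

Nxx(ω) is non-zero only in the interval $-\theta \le \omega \le \theta$, giving rxx(n) as

$$r_{xx}(n) = \frac{1}{2\pi}\int_{-\theta}^{\theta} e^{j\omega n}\,d\omega = \frac{\theta}{\pi}\,\frac{\sin(\theta n)}{\theta n}. \qquad (11)$$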
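So the autocorrelation function corresponding to a band-limited white noise power spectrum is a sinc function. Samples of this function are used to construct an autocorrelation matrix which is used in the solution of the normal equations to find the required coefficients. Leaving out the scaling factor in Eq. (11), the required autocorrelation function rxx(n), truncated to p samples, is defined as

$$r_{xx}(n) = \frac{\sin(\theta n)}{\theta n}, \qquad r_{xx}(0) = 1.$$

The normal equations are defined as

$$\sum_{k=1}^{p} a_k\,r_{xx}(m-k) = r_{xx}(m), \qquad m = 1, \ldots, p$$

[7]. This system of equations can be compactly written in matrix form by first defining several matrices. To design a p-tap error predictor filter first compute a sinc function consisting of p + 1 samples and construct the autocorrelation matrix Rxx as

$$R_{xx} = \begin{bmatrix} r_{xx}(0) & r_{xx}(1) & \cdots & r_{xx}(p-1) \\ r_{xx}(1) & r_{xx}(0) & \cdots & r_{xx}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{xx}(p-1) & r_{xx}(p-2) & \cdots & r_{xx}(0) \end{bmatrix}.$$

With the coefficient vector $A = [a_1, \ldots, a_p]^T$ and the right-hand side $r = [r_{xx}(1), \ldots, r_{xx}(p)]^T$, the normal equations become

$$R_{xx}\,A = r. \qquad (18)$$

Solving Eq. (18) is an ill-conditioned problem. To arrive at a solution for A, a small constant ε is added to the elements along the diagonal of the autocorrelation matrix Rxx in order to improve its conditioning (that is, to lower its condition number). The actual autocorrelation matrix used to solve for the predictor filter coefficients is

$$\tilde{R}_{xx} = R_{xx} + \epsilon I.$$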
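The design procedure lends itself to a short numerical sketch. The code below assumes a one-step linear predictor convention for the normal equations, the unscaled sinc autocorrelation, and a diagonal loading constant of 1e-6; the order p = 4, the band edge θ = 0.1π and all function names are illustrative assumptions rather than values from the paper. A small Gaussian elimination routine is included so the example has no external dependencies.

```python
import math

def sinc_autocorr(theta, n):
    """Unscaled autocorrelation of band-limited white noise: sin(theta*n)/(theta*n)."""
    return 1.0 if n == 0 else math.sin(theta * n) / (theta * n)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def design_predictor(p, theta, eps=1e-6):
    """Solve (Rxx + eps*I) A = r for the p predictor coefficients."""
    R = [[sinc_autocorr(theta, abs(i - j)) + (eps if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    rhs = [sinc_autocorr(theta, m) for m in range(1, p + 1)]
    return solve(R, rhs)

def error_gain(a, w):
    """Magnitude of the noise shaping term 1 - z^-1 P(z) at w radians/sample."""
    e = 1.0 - sum(ak * complex(math.cos(w * (k + 1)), -math.sin(w * (k + 1)))
                  for k, ak in enumerate(a))
    return abs(e)

theta = 0.1 * math.pi
coeffs = design_predictor(4, theta)
grid = [-theta + (i + 0.5) * 2.0 * theta / 200 for i in range(200)]
inband_mse = sum(error_gain(coeffs, w) ** 2 for w in grid) / len(grid)
print(coeffs, inband_mse)
```

The in-band mean-squared value of the shaping term is small for this configuration, which is the property the re-quantizer relies on; the diagonal loading trades a little in-band suppression for a numerically well-behaved solution.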
BANDPASS PREDICTOR
The previous section described the design of a lowpass predictor. In this section bandpass processes are considered. A bandpass predictor filter is designed by modulating a lowpass prototype sinc function to the required center frequency θ0 [10]. The bandpass predictor coefficients hBP(n) are obtained by solving the normal equations with a heterodyned sinc function

$$r_{xx}^{BP}(n) = \frac{\sin(\theta n)}{\theta n}\,\cos(\theta_0 n).$$
HIGHPASS PREDICTOR
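The highpass predictor coefficients hHP(n) are obtained by solving the normal equations with a sinc function heterodyned to the half sample rate

$$r_{xx}^{HP}(n) = \frac{\sin(\theta n)}{\theta n}\,\cos(\pi n) = (-1)^n\,\frac{\sin(\theta n)}{\theta n}.$$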
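The bandpass and highpass cases differ from the lowpass design only in the autocorrelation samples fed to the normal equations. A minimal sketch, assuming the unscaled sinc prototype and cosine heterodyning (the function name and argument order are hypothetical):

```python
import math

def heterodyned_autocorr(theta, theta0, n):
    """Lowpass sinc autocorrelation shifted to center frequency theta0 (radians/sample)."""
    base = 1.0 if n == 0 else math.sin(theta * n) / (theta * n)
    return base * math.cos(theta0 * n)

theta = 0.1 * math.pi
lowpass = [heterodyned_autocorr(theta, 0.0, n) for n in range(5)]
bandpass = [heterodyned_autocorr(theta, 0.6 * math.pi, n) for n in range(5)]
highpass = [heterodyned_autocorr(theta, math.pi, n) for n in range(5)]
```

Setting theta0 to 0 recovers the lowpass prototype, and setting it to π alternates the sign of successive samples, which is the highpass case.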
Submitted to the IEEE Special Issue Magazine on Industrial Signal Processing, August 1999. 
The most challenging aspect of implementing the data modulator is producing an efficient implementation for the prediction filter P(z). The desire to support high sample rates, and the requirement of zero latency for P(z), preclude bit-serial methods for this problem. In addition, for the sake of area efficiency, parallel multipliers that exploit one time-invariant input operand (the filter coefficients) will be used, rather than general variable-variable multipliers. The constant coefficient multiplier (KCM) is based on a multi-bit inspection version of Booth's algorithm [9]. Partitioning the input variable into 4-bit nibbles is a convenient selection for the Xilinx Virtex function generators (FG) [3]. Each FG has 4 inputs and can be used for combinatorial logic or as application RAM/ROM. Each logic slice [3] in the Virtex logic fabric comprises 2 FGs, and so can accommodate a 16 × 2 memory slice. Using the rule of thumb that each bit of filter coefficient precision contributes 5 dB to the sidelobe behavior, 12-bit precision is used for P(z). 12-bit precision will also be employed for the input samples. There are three 4-bit nibbles in each input sample. Concurrently, each nibble addresses an independent 16 × 16 look-up table (LUT). The bit growth incorporated here allows for worst case filter coefficient scaling in P(z). No pipeline stages are permitted in the multipliers because of P(z)'s location in the feedback path of the modulator. It is convenient to use the transposed FIR filter for constructing the predictor. This allows the adders and delay elements in the structure to occupy a single slice. 64 slices are required to build the accumulate-delay path. The FPGA logic requirement for P(z), using a 9-tap predictor, is Γ(P(z)) = 9 × 40 + 64 = 424 CLBs. A small amount of additional logic is required to complete the entire Σ∆ modulator. The final slice count is 450. The entire modulator comfortably operates with a 113 MHz clock.
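The nibble-partitioned look-up table idea behind the KCM can be modeled in software. The sketch below is a simplified behavioral model, assuming an unsigned 12-bit input and one 16-entry table per nibble; it does not model the Booth-style signed recoding mentioned in the text, and the function names and the example constant are hypothetical.

```python
def build_kcm_tables(k, nibbles=3):
    """One 16-entry table per 4-bit nibble, each entry holding k * nibble_value."""
    return [[k * v for v in range(16)] for _ in range(nibbles)]

def kcm_multiply(tables, x):
    """Multiply an unsigned input by the constant stored in the tables."""
    acc = 0
    for i, table in enumerate(tables):
        nibble = (x >> (4 * i)) & 0xF       # select the i-th 4-bit field of the input
        acc += table[nibble] << (4 * i)     # weight the partial product by its position
    return acc

tables = build_kcm_tables(1234)            # 1234 is an arbitrary example coefficient
print(kcm_multiply(tables, 4095))          # same value as 1234 * 4095
```

All three table look-ups are independent, which mirrors the concurrent addressing of the LUTs in hardware; only the final shifted additions combine the partial products.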
This clock frequency defines the system sample rate, so the architecture can support a throughput of 113 MSamples per second. The critical path through this part of the design is related to the exclusion of pipelining in the multipliers.
REDUCED COMPLEXITY FIR MECHANIZATION
Now that the input signal is available as a reduced precision sample stream, filtering can be performed using area optimized hardware. For the reasons discussed above, 4-bit data samples are a convenient match for Virtex devices. Figure 13 shows the structure of the reduced complexity 39% of the logic resources of a direct implementation.
Σ∆ DECIMATORS
The procedure for re-quantizing the source data can also be used effectively in an m:1 decimation filter. An interesting problem is presented when high input sample rates (≥ 150 MHz) must be supported in FPGA technology. High-performance multipliers are typically realized by incorporating pipelining in the design. This naturally introduces some latency into the system. The location of the predictor filter P(z) requires a zero-latency design. Instead of re-quantizing, filtering and decimating, which would of course require a Σ∆ modulator running at the input sample rate, this sequence of operations must be re-ordered to permit several slower modulators to be used in parallel. The process is performed by first decimating the signal, then re-quantizing and then filtering. Now the Σ∆ modulators operate at the reduced output sample rate. This is depicted in Figure 15. To support arbitrary center frequencies, and any arbitrary, but integer, down-sampling factor m, the bandpass decimation filter must employ complex weights. The filter weights are of course just the bandpass modulated coefficients of a lowpass prototype filter designed to support the bandwidth of the target signal. Samples are collected from the A/D and alternated between the two modulators. Both modulators are identical and use the same predictor filter coefficients. The re-quantized samples are processed by an m:1 complex polyphase filter to produce the decimated signal. Several design options are presented once the signal has been filtered and the sample rate lowered. Figure 15 illustrates one possibility. Now that the data rate has been reduced, the low rate signal is easily shifted to baseband with a simple, and area efficient, complex heterodyne. One multiplier and a single digital frequency synthesizer could be time shared to extract one or multiple channels. It is interesting to investigate some of the changes that are required to support the Σ∆ decimator.
What may not be immediately obvious is that the center frequency of the prediction filter must be designed to predict samples in the required spectral region in accordance with the output sample rate. For example, consider m = 2, and the required channel center frequency located at 0.1 Hz, normalized with respect to the input sample rate. The prediction filter must be designed with a center frequency located at 0.2 Hz. In addition, the quality of the prediction must be improved. With respect to the output sample rate, the predictors are required to operate over a wider fractional bandwidth. This implies more filter coefficients in P(z). The increase in complexity of this component must of course be balanced against the savings that result in the reduced complexity filter stage to confirm that a net savings in logic requirements is produced. To more clearly demonstrate the approach, consider a 2:1 decimator, a channel center frequency at 0.2 Hz and a 60 dB dynamic range requirement. Figure 16(a) shows the double sided spectrum of the input test signal. The input signal is commutated between Σ∆0 and Σ∆1 to produce the two low precision sequences x0(n) and x1(n). The respective spectrums of these two signals are shown in Figures 16(b) and 16(c). The complex decimation filter response is defined in Figure 16(d). After filtering, a complex sample stream supported at the low output sample rate is produced. This spectrum is shown in Figure 16(e). Observe that the out-of-band components in the test signal have been rejected by the specified amount and that the in-band data meets the 60 dB dynamic range requirement. For comparison, the signal spectrum resulting from applying the processing stages in the order re-quantize, filter and decimate is shown in Figure 16(f).
The interesting point to note is that while the dual Σ∆ modulator approach satisfies the system performance requirements, its out-of-band performance is not quite as good as the response depicted in Figure 16(f). The stopband performance of the dual modulator architecture has degraded by approximately 6 dB. This can be explained by noting that the shaping noise produced by each modulator is essentially statistically independent. Since there is no coupling between these two components prior to filtering, complete phase cancelation of the modulator noise cannot occur in the polyphase filter.
DISCUSSION
To provide a frame of reference for the Σ∆ decimator, consider an implementation that does not pre-process the input data, but just applies it directly to a polyphase decimation filter. A complex filter processing real-valued data consumes double the FPGA resources of a filter with real weights. For N = 160, 15344 CLBs are required. This figure is based on a cost of 40 CLBs for each KCM and 8 CLBs for an add-delay component. Now consider the logic accounting for the dual modulator approach. The area cost Γ(FIR) for this filter is

$$\Gamma(\mathrm{FIR}) = 2\,\Gamma(\Sigma\Delta) + 2\left(N\,\Gamma(\mathrm{MUL}) + 8\,(N-1)\right)$$

where Γ(Σ∆) represents the logic requirements for one Σ∆ modulator, and Γ(MUL) is the logic needed for a reduced precision multiplier. Using the filter specifications defined earlier, and 18-tap error prediction filters, Γ(FIR) = 2 × 738 + 2 × ((160 + 159) × 8) = 6596. Comparing the area requirements of the two options produces the ratio

$$\frac{\Gamma(\mathrm{FIR})}{15344} = \frac{6596}{15344} \approx 0.43.$$

So for this example, the re-quantization approach has produced a realization that is significantly more area efficient than a standard tapped-delay line implementation.

Figure 16. (b) Shaped data x0(n). (c) Shaped data x1(n). (d) Complex filter. (e) Recovered result. (f) Filtered signal, single modulator.
CENTER FREQUENCY TUNING
For both the single-rate and multi-rate Σ∆ based architectures, the center frequency is defined by the coefficients in the predictor filter and the coefficients in the primary filter. The constant coefficient multipliers can be constructed using the FPGA function generators configured as RAM elements. When the system center frequency is to be changed, the system control hardware would update all of the tables to reflect the new channel requirements. If only several channel locations are anticipated, separate configuration bit streams [3] could be stored, and the FPGA(s) re-configured as needed.
BANDPASS Σ∆Ms USING ALLPASS NETWORKS
In an earlier section we discussed how to design a predicting filter for the feedback loop of a standard sigma-delta modulator. The predicting filter increases the order of the modulator so that the modified structure has additional degrees of freedom relative to a single-delay noise feedback loop. These extra degrees of freedom have been used in two ways, first to broaden the bandwidth of the loop's noise transfer function, and second to tune its center frequency. The tuning process entailed an off-line solution of the normal equations which, while not difficult, does present a small delay and the need for a background processor. We can define a sigma-delta loop with a completely different architecture that offers the same flexibility, namely wider bandwidth and a tunable center frequency, and that does not require this background task. In this alternate architecture, a fixed set of feedback weights from a set of digital integrators defines a base-band prototype filter with a desirable NTF. The filter is tuned to arbitrary frequencies by attaching to each delay element $z^{-1}$ a simple sub-processing element that performs a base-band to band-pass transformation of the prototype filter. This processing element tunes the center frequency of its host prototype with a single real and selectable scalar. The structure of a fourth order prototype sigma-delta loop is shown in Figure 17. The time series and spectrum obtained by using the loop with a 4-bit quantizer are shown in Figure 18. In this structure the digital integrator poles are located on the unit circle at DC. The local feedback (a1 and a2) separates the poles by sliding them along the unit circle, and the global feedback (b1, b2, b3 and b4) places these poles in the feedback path of the quantizer so they become noise transfer function zeros. A block diagram of a digital filter with the transfer function G(z) is shown in Figure 19.
Figure 17. Fourth order sigma-delta loop.

Examining the left-hand block diagram, we find the transfer function from x(n) to y(n) is the allpass network $-(1 - cz)/(z - c)$, while the transfer function from x(n) to v(n) is $-(1/z)(1 - cz)/(z - c)$. If we absorb the external negative sign change in the internal adders of the filter we obtain the simple right-hand side version of the desired transfer function G(z). After the block diagram substitution has been made, we obtain Figure 20, the tunable version of the low-pass prototype. The basic structure of the prototype remains the same when we replace the delay with the tunable all-pass network. The order of the filter is doubled by the substitution since each delay is replaced by a second order sub-filter. Tuning is trivially accomplished by changing the c multiplier of the all-pass network. The tuned version of the system reverts back to the prototype response if we set c to 1. Figure 21 presents the time series and spectrum obtained by using the tunable loop with a 4-bit quantizer shown in Figure 20. The single sided bandwidth of the prototype filter is distributed to the positive and negative spectral bands of the tuned filter. Thus the two-sided bandwidth of each spectral band is approximately 4% of the input sample rate. We now estimate the computational workload required to operate the prototype and tunable filter. The prototype filter has six coefficients to form the 4 poles and the 4 zeros of the transfer function. The two coefficients a1 and a2 determine the four zero locations. These are small coefficients and can be set to simple binary scalars. The values computed for this filter for a1 and a2 were 0.0594 and 0.0110. These can be approximated by 1/16 and 1/128, which lead to no significant shift of the spectral zeros in the NTF. These simple multiplications are of course virtually free in the FPGA hardware since they are implemented with suitable wiring. The four coefficients bk, k = 0, ..., 3, which are 1.000, 0.6311, 0.1916 and 0.0283 respectively, were replaced with coefficients containing one or two binary symbols to obtain the values 1.000, 1/2 + 1/8 (0.625), 1/8 + 1/16 (0.1875) and 1/32 (0.03125). When the sigma-delta loop ran with these coefficients there was no discernible change in bandwidth or attenuation level of the loop. The loop operates equally well in the tuning mode and the non-tuning mode with the approximate coefficients listed above. Thus the only real multiplies in the tunable sigma-delta loop are the c coefficients of the all-pass networks. These networks are unconditionally stable and always exhibit all-pass behavior even in the presence of finite arithmetic and finite precision coefficients. This is because the same coefficient forms the numerator and the denominator. Errors in approximating the coefficients for c simply result in a frequency shift of the filter's tuned center. The c coefficient is determined from the cosine of the center frequency (in radians/sample). The curve for this relationship is shown in Figure 22. Also shown is an error due to approximating c by c + δc. The question is, what is the change in center frequency, from θ to θ + δθ, due to the approximation of c? We can see that the slope at the operating point on the cosine curve is −sin(θ), so that δc/δθ ≈ −sin(θ) and |δc| ≈ δθ sin(θ) is the required precision to maintain a specified error. We note that tuning sensitivity is most severe for small frequencies where sin(θ) is near zero. The tolerance term, δθ sin(θ), is quadratic for small frequencies, but the lowest frequency that can be tuned by the loop is half the NTF pass-band bandwidth. For the fourth order system described here, this bandwidth is 4% of the sample rate, so the half bandwidth angle is 2% or 0.126 radians. To assure that the frequency to which the loop is tuned has an error smaller than 1% of the center frequency we require δc < δθ sin(θ), that is δc < (0.126/100)(0.126) ≈ 0.0002, which corresponds to a 14-bit coefficient.
An error of less than 10% of the center frequency can be achieved with 10-bit coefficients. The tuning multipliers could be implemented as full multipliers in the FPGA hardware or as dynamically reconfigured KCMs (KDCMs), as shown in Figure 23. The latter approach conserves FPGA resources at the expense of introducing a start-up penalty each time the center frequency is changed. The start-up period is the initialization time of the KCM LUT. When a new center frequency is desired, the tuning constant is presented to the k input of the KDCM and the load signal LD is asserted. This starts the initialization engine, which requires 16 clock cycles to initialize the 16 locations in the multiplier LUT. The initialization engine relies on the automatic shift mode [3] of the Virtex LUTs. In this mode of operation a LUT's register contents are passed from one cell to the next on each clock tick. This avoids the requirement for a separate address generator and multiplexor in the initialization hardware. Observe from Figure 23 that the initialization engine introduces only a small amount of additional hardware over that of a static KCM.
There is approximately a factor of 4 difference in area between a KDCM and a full multiplier.
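To make the LUT-based multiplier concrete, the following Python sketch models a constant-coefficient multiplier built around a 16-entry table. The text states only that the multiplier LUT has 16 locations that are reloaded when the constant changes; the nibble-wise shift-and-add organization, the unsigned operands and the function names below are our assumptions for illustration, not a description of Figure 23.

```python
def build_kcm_lut(k):
    """Return the 16-entry table k * n for every 4-bit value n.

    Reloading the constant means rewriting these 16 entries, which is the
    16-cycle start-up penalty described in the text.
    """
    return [k * n for n in range(16)]


def kcm_multiply(lut, x, nibbles=4):
    """Multiply an unsigned input of 4 * nibbles bits by the table's constant.

    Each 4-bit slice of x addresses the table, and the partial products are
    combined with shifts and adds (assumed organization).
    """
    acc = 0
    for i in range(nibbles):
        nibble = (x >> (4 * i)) & 0xF
        acc += lut[nibble] << (4 * i)
    return acc


lut = build_kcm_lut(13)              # tuning constant, hypothetical value
print(kcm_multiply(lut, 0xBEEF) == 13 * 0xBEEF)

lut = build_kcm_lut(29)              # retune: only the table contents change
print(kcm_multiply(lut, 0x1234) == 29 * 0x1234)
```

The datapath is identical for every constant; changing the center frequency only rewrites the table, which is why the dynamic version adds little hardware to a static KCM.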
Σ∆ DC CANCELER
Unwanted DC components can be introduced into a DSP datapath at several places. A DC component may be presented to the system via an un-trimmed offset in the analog-to-digital conversion pre-processing circuit, or may be attributed to bias in the A/D converter itself. Even if the sampled input signal has a zero mean, DC content can be introduced through arithmetic truncation processes in the fixed-point datapath. For example, in a multistage multi-rate filter, the intermediate filter output samples may be quantized between stages in order to compensate for the filter processing gain and thereby keep the word-length requirements manageable. The introduced DC bias can impact the dynamic range performance of a system and potentially increase the error rate in a digital receiver application. In a fixed-point datapath, the bias can cause unnecessary saturation events that would not occur if the DC were not present in the system. In a digital communication receiver employing M-ary QAM modulation, the DC bias can interfere with the symbol decision process, causing incorrect decoding and therefore increasing the bit error rate. In some cases the introduced bias is of no concern and can be ignored. However, for other applications it is desirable to remove the DC component. One solution to removing the unwanted DC level is to employ a DC canceler. A simple canceler is shown in Figure 24. It is easy to show that the transfer function of the network is

$$H(z) = \frac{z - 1}{z - (1 - \mu)}$$
The cancelation is due to the transfer function zero at 0 Hz. The pole at 1 - µ controls the system bandwidth. Figure 25a is a spectral domain representation of a biased signal presented to the DC canceler. Figure 25b is the processed signal spectrum at yq(n) in Figure 24. We observe that the DC content in the input signal has been completely removed. However, in the process of running the canceling loop, the network processing gain has caused a dynamic range expansion. So although the sample stream yq(n) is a zero-mean process, it requires a larger number of bits to represent each sample than is desirable. The only option with this circuit is to re-quantize yq(n) to produce y(n) using the quantizer Q(·). The effect of this operation is shown in Figure 25c, which demonstrates, not surprisingly, that after an 8-bit quantizer the signal again has a DC component, and we are almost back to where we started. How can the canceler be re-organized to avoid this implementation pitfall? One option is to embed the re-quantizer in the feedback loop in the form of a Σ∆ modulator, as shown in Figure 26. The modulator can be a very simple 1st order loop, such as the error feedback Σ∆ modulator shown in Figure 9. Figure 25d demonstrates the operation of the circuit for 8-bit output data. Observe from the figure that the DC has been removed from the signal while employing the same 8-bit output sample precision that was used in Figure 24. The simple Σ∆M employed in the canceler is easily implemented in an FPGA.
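The difference between the two arrangements can be illustrated with a short simulation. The Python sketch below is a minimal model under stated assumptions, not a reproduction of Figures 24 to 26: the quantizer is modeled as truncation to an 8-bit grid on [-1, 1), the embedded modulator is a first-order error feedback loop, the integrator of the Σ∆ version is fed from the quantized output (our reading of the re-organized loop), and the loop gain µ = 2^-6 and the test signal are arbitrary choices.

```python
import math

STEP = 2.0 ** -7      # 8-bit grid on [-1, 1) (assumed word format)
MU = 2.0 ** -6        # loop gain; places the pole at 1 - MU (assumed value)


def truncate(v):
    """Truncating quantizer Q(.): rounds toward minus infinity on the grid."""
    return math.floor(v / STEP) * STEP


def plain_canceler(x):
    """Canceler followed by an external re-quantizer."""
    w, out = 0.0, []
    for xn in x:
        yq = xn - w               # full-precision, zero-mean loop output
        w += MU * yq              # integrator tracks the input DC level
        out.append(truncate(yq))  # truncation outside the loop adds a new bias
    return out


def sigma_delta_canceler(x):
    """Canceler with the re-quantizer embedded in the loop (error feedback)."""
    w, err, out = 0.0, 0.0, []
    for xn in x:
        u = (xn - w) + err        # add back the previous quantization error
        y = truncate(u)
        err = u - y               # error saved for the next sample
        w += MU * y               # integrator sees the quantized output
        out.append(y)
    return out


def tail_mean(seq, skip=10_000):
    """Mean after discarding the start-up transient."""
    tail = seq[skip:]
    return sum(tail) / len(tail)


x = [0.3 + 0.5 * math.sin(2.0 * math.pi * 0.0371 * n) for n in range(50_000)]
mean_plain = tail_mean(plain_canceler(x))
mean_sd = tail_mean(sigma_delta_canceler(x))
print("external quantizer mean:", mean_plain)   # close to -STEP/2
print("embedded quantizer mean:", mean_sd)      # close to zero
```

With the quantizer outside the loop the output carries a residual bias of roughly half an LSB, while the embedded version drives the mean of the 8-bit output itself toward zero, because the integrator now observes the signal after quantization.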
SIMPLIFY DIGITAL RECEIVER CONTROL LOOPS USING Σ∆ MODULATORS
In earlier sections we recognized that when a sampled data input signal has a bandwidth that is a small fraction of its sample rate, the sample components from this restricted bandwidth are highly correlated. We took advantage of that correlation to use a digital sigma-delta modulator to requantize the signal to a reduced number of bits. The sigma-delta modulator encodes the input signal with a reduced number of bits while preserving full input precision over the signal bandwidth, by placing the increased noise due to requantization in out-of-band spectral positions that are already scheduled to be rejected by subsequent DSP processing. The purpose of this requantization is to allow the subsequent DSP processing to be performed with reduced arithmetic resource requirements, since the desired data is now represented by a smaller number of bits. A similar remodulation of data samples can be employed for signals generated within a DSP process when the bandwidth of the signals is small compared to the sample rate of the process. A common example of this circumstance is the generation of control signals used in the feedback paths of a digital receiver. These control signals include a gain control signal for a voltage controlled amplifier in an automatic gain control (AGC) loop and VCO (voltage controlled oscillator) control signals in carrier recovery and timing recovery loops [10, 11, 12]. A block diagram of a receiver with these specific control signals is shown in Figure 27. The control signals are generated from processes operating at a sample rate appropriate to the input signal bandwidth. The bandwidth of the control loops in a receiver is usually a very small fraction of the signal bandwidth, which means that the control signals are very heavily oversampled. As a typical example, in a cable TV modem, the input bandwidth is 6 MHz, the processing sample rate is 20 MHz, and the loop bandwidth may be 50 kHz.
For this example, the ratio of sample rate to bandwidth is 400-to-1. As seen in Figure 27, the process of delivering these oversampled control signals to their respective control points entails the transfer of 16-bit words to external control registers, requiring appropriate busses, addressing, and enable lines, as well as the operation of 16-bit digital-to-analog converters (DACs). We can use a sigma-delta modulator to requantize the 16-bit oversampled control signals in the digital receiver prior to passing them out of the processing chip. The sigma-delta modulator can preserve the required dynamic range over the signal's restricted bandwidth with a one-bit output. As suggested in Figure 28, the transfer of a single bit to control the analog components is a significantly less difficult task than the original. We no longer require registers to accept the transfer, the busses to deliver the bits, or the DAC to convert the digital data to the analog levels the data represents. All that is needed is a simple filter (and likely an analog amplifier to satisfy drive level and offset requirements). Experience shows that a 1-bit, one-loop sigma-delta modulator can achieve 80 dB dynamic range and requires a single RC filter to reconstruct the analog signal. A two-loop sigma-delta modulator is required to achieve 16-bit precision, for which a double RC filter is required to reconstruct the analog output signal. Figure 29 shows the time response of the one-bit two-loop sigma-delta converter to a slowly varying control signal and the reconstructed signal obtained from the dual-RC filter. Figure 30 shows the spectrum obtained from a 1-bit two-loop modulator and the spectrum obtained from an unbuffered RC-RC filter. This example has shown how, with minimal additional hardware, an FPGA can generate analog control signals to control low-bandwidth analog functions in a system.
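The one-bit control path can be sketched in a few lines of Python. This is a simplified model under stated assumptions rather than the circuit behind Figures 29 and 30: it uses a first-order (one-loop) modulator instead of the two-loop converter, it stands in for the analog RC-RC network with two cascaded discrete-time one-pole sections running at the modulator clock, and it places each RC corner at the 50 kHz loop bandwidth. The 20 MHz clock and 50 kHz bandwidth are the cable modem figures quoted earlier; the 5 kHz test signal and all names are ours.

```python
import math

FS = 20e6             # modulator clock in Hz (processing sample rate from the text)
LOOP_BW = 50e3        # control-loop bandwidth in Hz (from the text)
ALPHA = 1.0 - math.exp(-2.0 * math.pi * LOOP_BW / FS)   # one-pole RC model


def one_bit_modulator(x):
    """First-order loop: the integrator accumulates input minus 1-bit output."""
    s, bits = 0.0, []
    for xn in x:
        y = 1.0 if s >= 0.0 else -1.0   # the single output pin, as +/-1
        s += xn - y
        bits.append(y)
    return bits


def rc_rc(seq):
    """Two cascaded one-pole low-pass sections standing in for the RC-RC filter."""
    v1 = v2 = 0.0
    out = []
    for u in seq:
        v1 += ALPHA * (u - v1)
        v2 += ALPHA * (v1 - v2)
        out.append(v2)
    return out


# Slowly varying control word: a 5 kHz tone at 40% of full scale (hypothetical).
control = [0.4 * math.sin(2.0 * math.pi * n / 4000.0) for n in range(20_000)]

recovered = rc_rc(one_bit_modulator(control))   # what the analog side sees
reference = rc_rc(control)                      # same filter on the ideal signal
worst = max(abs(a - b) for a, b in zip(recovered, reference))
print("worst-case reconstruction error:", worst)
```

Comparing against the filtered ideal signal separates the modulator's residual noise from the lag of the reconstruction filter itself. A two-loop modulator would push more of the noise out of band and reduce the residual further, at the cost of a second accumulator.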
An observation worthy of note is that the audio engineering community has recognized the advantage offered by this option of requantizing a 16-bit oversampled data stream to a 1-bit data stream. In that community, the output signal is intentionally upsampled by a factor of 64 and then requantized to 1 bit in a process called a MASH converter. Nearly all CD players use the MASH converter to deliver analog audio signals.
WHAT HAVE WE GAINED?
What has been achieved by expressing our signal processing problems in terms of Σ∆M techniques? The paper has demonstrated some Σ∆M techniques for the compact implementation of certain types of filter and control applications using FPGAs. This optimization can be used in several ways to bring economic benefits to a commercial design. By exploiting Σ∆M filter processes, a given processing load may be realizable in a lower-density, and hence less expensive, FPGA than is possible without access to these techniques. An alternative would be to perform more processing using the same hardware, for example, processing multiple channels in a communication system. In addition to FPGA area trade-offs, the Σ∆M methods can result in reduced power consumption in a design. Power p may be expressed as

$$p = c\,v^{2} f_{clk}$$

where c is capacitance, v is voltage and fclk is the system clock frequency. By reducing the silicon area requirements of a filter, we reduce the switched capacitance and so simultaneously reduce the power consumption of the design. For the examples considered earlier, logic resource savings of greater than 50% were demonstrated. These savings translate proportionally into increased efficiency in the system power budget, and this of course is very important for mobile applications. The Σ∆M AGC, timing and carrier recovery control loop designs are also important examples in an industrial context. The examples illustrated how the component count in a mixed analog/digital system can be reduced. In fact, not only is the component count reduced, but printed circuit board area is minimized. This results in more reliable and physically smaller implementations. The reduced component count also results in reduced power consumption. In addition, since the control loops no longer require wide output buses from the FPGA to multi-bit DACs that generate analog control voltages, power consumption is decreased because fewer FPGA I/O pads are being driven.
CONCLUSION
FPGAs open a range of opportunities in the solution space that can result in high-performance and economic solutions to a DSP problem. Often this is best achieved via an appropriate paradigm shift in the algorithmic domain to find an optimal approach that best exploits the cellular LUT/flip-flop FPGA architecture. This paper has illustrated how Σ∆M techniques can be combined with FPGA technology to address a range of signal processing problems. These included single-rate and multi-rate filters, a DC canceler, and the efficient and compact generation of analog control signals for AGC, carrier recovery and timing recovery functions in a communication receiver. The source data re-quantization approach is suitable for both single-rate and multi-rate filter processes. The proposed method arms the DSP/FPGA engineer with another tool that is useful for certain filtering requirements. For the examples considered here, logic savings in excess of 50% were demonstrated. As the frequency band of interest occupies a smaller fractional bandwidth, the order of the required filter increases. This growth tends to make the data re-quantization approach more attractive, as the cost of the modulator consumes a decreasing proportion of the entire design. While the paper has focused exclusively on Σ∆M methods in the context of FPGA hardware, we feel that there is a broader lesson delivered in the study. The signal processing literature is full of creative solutions to real-world problems. Often these solutions are closed to a designer because they do not map well to software programmable DSP architectures. The algorithm will of course have an ASSP solution, but this may not be an option for reasons of schedule, economies of scale, and flexibility. FPGAs do, however, give immediate access to the diverse range of potential solutions. And they do so while simultaneously providing flexibility and high performance; frequently the performance equals or exceeds that of an ASSP.
In this era of systems-on-a-chip, increased fiscal pressure, tighter engineering deadlines, time-to-market constraints, and increasing performance demands, new system-level and hardware architectures must be employed. FPGAs provide a mechanism for working with and performing trade-offs between all of these important variables. As we move into the next millennium, reconfigurable FPGA technology will increasingly provide solutions to signal processing problems.
