Abstract: This paper studies the design, signal round-off noise, and complexity optimization of a new digital intermediate frequency (IF) architecture for a software radio receiver (SRR). The IF under study consists of digital filters with fixed coefficients, except for a limited number of multipliers required in the Farrow-based sampling rate converter (SRC). The fixed-coefficient filters can be implemented efficiently using sum-of-power-of-two (SOPOT) coefficients and the multiplier-block technique, which yields a minimum-adder realization. Apart from the multipliers required in the SRC, the digital IF can be implemented without any multiplications. While most multiplier-less filter design and realization methods address only the coefficient round-off problem by minimizing the number of SOPOT terms used, the proposed design methodology aims to minimize a more realistic hardware complexity measure of the digital IF, such as the number of adder cells and registers, subject to given spectral and accuracy specifications. The motivation is that the complexity is closely related to the target output accuracy, which is specified statistically by the total output noise power generated by rounding the intermediate data.
and to process it by a sophisticated programmable system, probably consisting of a combination of re-configurable or programmable hardware and digital signal processors (DSPs). Due to various limitations of current digital technology, such as the speed and power dissipation of the device technology used to fabricate the ADCs [3], most software radio architectures digitize the down-converted signal at the intermediate frequency (IF). Fig. 1(a) shows a commonly used digital IF for SRR. The analog IF signal is first digitized at a bandwidth of, say, 20 to 70 MHz. A programmable digital decimator and a sampling rate converter (SRC) are then employed to isolate the desired user's channel from the signal spectrum and convert it to an appropriate sampling rate for further processing by the digital signal processor (DSP) [1].
Conventional receivers usually consist of a programmable digital decimator, which is normally realized using multistage decimators, a programmable FIR filter (PFIR), and an SRC, as shown in Fig. 1(b). The reason for this arrangement is that the required sampling rate of the baseband signal might be considerably lower than that at the IF. By passing the sampled IF signal through the decimators, which consist of bandlimiting low-pass filters (LPFs) and downsamplers, the unwanted signals can be suppressed and the sampling rate can be lowered. By appropriately choosing the number of decimation stages, the bandwidth of the input signal can be decimated by an integer factor that is just sufficient to cover the baseband signal. After the integer decimation, a PFIR is usually needed to remove the residual interference from adjacent channels, because the sampling rate is usually not an integer multiple of the channel spacing. Finally, an SRC is used to provide the necessary arbitrary rate-change factor so that the sampling rate of its output is suitable for the baseband processor, which usually operates at a multiple of the sampling rate of the baseband signal. Hence, signals with a wide variety of bandwidths can be accommodated using this architecture [1], [4]-[6]. A drawback of this structure is that the output of the multistage decimators, which is obtained by downsampling the high-rate IF signal from the ADC, has to be upsampled again (say by an -band interpolated filter) in order to carry out the arbitrary sample rate conversion. Another important problem is the high complexity of the PFIR, since a considerable number of high-speed general-purpose multipliers are required for its implementation, especially for wideband signals.
Recently, the authors have proposed a new digital IF architecture for SRRs, as shown in Fig. 2 [7], [8]. Unlike in conventional SRRs, the SRC, which is based on a Farrow-based variable digital filter (VDF) [9], is performed immediately after the multistage decimators so that the PFIR can be replaced by a half-band filter (HBF) with fixed coefficients, provided that the arbitrary downsampling ratio is properly chosen. This modification eliminates the need for the PFIR, which is usually the bottleneck for wideband signals. Moreover, instead of upsampling the output of the multistage decimators again, as in conventional receivers, to achieve arbitrary downsampling ratios [1], [4]-[6], the VDF is able to perform the same function since it can be designed to provide a variable fractional delay and additional attenuation in its passband and stopband, respectively. The performance of this SRR can be further improved by using a new second-order compensator, which compensates for the passband droop of the basic cascaded integrator-comb (CIC) filter. In [10], allpass-based low-pass anti-aliasing filters are employed to realize the multistage decimator in order to achieve a lower system delay and implementation complexity.
In this paper, the design, signal round-off analysis, and complexity optimization of the proposed digital IF for SRR (or SRR for simplicity) are studied. Since the proposed SRR consists of digital filters with fixed coefficients, except for the multipliers required to implement the Farrow-based SRC, it can be implemented using canonical signed digit or sum-of-power-of-two (SOPOT) coefficients [7], [11]-[13]. In addition, the redundancy in realizing the multiplications by these SOPOT coefficients can be significantly reduced by using the multiplier-block (MB) technique [14], which gives rise to a minimum-adder realization. As a result, apart from the limited number of multipliers required in the Farrow structure, the entire digital IF can be implemented without any multiplications. While most multiplier-less filter design and realization methods address the coefficient round-off problem by minimizing the number of SOPOT terms used, the proposed design methodology aims to minimize a more realistic hardware complexity measure of the digital IF subject to given spectral and accuracy specifications. The motivation behind this approach is that the complexity is closely related to the target output accuracy. A lower output accuracy means that a shorter internal wordlength, and hence a lower complexity, can be employed. To this end, the multiplier-less digital filter with no signal round-off noise is first designed. The output accuracy of this digital filter is then specified statistically by its output noise power, which is generated by the rounding operations performed on the intermediate data. Using this signal round-off model, the internal wordlengths of all the intermediate data are determined. The hardware complexity measure to be minimized is the exact wordlength used for each intermediate data word, which is related to the number of adder cells and/or registers used. Other more sophisticated models, such as power consumption, may be used instead.
For simplicity, only the former measure is considered in this paper. In contrast to conventional approaches that minimize only the total number of SOPOT terms, the new criterion is more realistic and general for hardware implementation. While the random search algorithm in [8] and the mixed integer linear programming approach in [15] are very flexible methods, their design times are expected to increase considerably when a large number of variables is involved. In principle, it is possible to optimize the SOPOT coefficients and the internal wordlengths simultaneously. However, the design complexity increases significantly. For these reasons, we solve the two problems separately and propose two other novel and simple algorithms to address the wordlength determination problem. It is shown that if the wordlengths are relaxed from integer to real-valued quantities, then the problem can be solved using the Lagrange multiplier method [16] and a closed-form solution can be obtained. A similar work, which was concerned with real-time wordlength adaptation in adaptive FIR filters, can be found in [17]. An important limitation of these approaches is that the solution so obtained is usually not integer valued and hence has to be rounded up to the next larger integer. Fortunately, the rounded solution can be used as an initial guess for the random search algorithm, which greatly reduces the searching time. By recognizing the close similarity between the wordlength determination problem and the bit allocation problem in data compression [18]-[20], we further propose an algorithm based on the marginal analysis method [19], [20]. The basic idea of the proposed algorithm is to successively increase the wordlength of one of the intermediate output points, chosen so as to lower the output round-off noise power as much as possible, until the given bit accuracy or bit budget (total wordlength or complexity) is met.
Design results show that the proposed algorithm works well with a large number of variables. Furthermore, when the solution obtained from the previous method is coupled with the bit-allocation algorithm, a near-optimal solution can be obtained within several seconds on a Pentium 4 personal computer. It should be noted that the proposed algorithms are also applicable to the realization of other linear time-invariant (LTI) systems, including the low-delay FIR and digital allpass filter-based SRR in [10]. The rest of this paper is organized as follows. Section II is devoted to the principle and design of the proposed SRR and its generalization to multiple receiving channels. Section III is devoted to the multiplier-less realization of the proposed SRR. Section IV presents the signal round-off and overflow analysis. Various algorithms for minimizing the internal wordlengths of the proposed SRR while satisfying the given specifications are described in Section V. This is then followed by a detailed design example and the field-programmable gate array realization of a multi-standard receiver in Section VI. Finally, conclusions are drawn in Section VII.
II. PRINCIPLE AND DESIGN OF PROPOSED DIGITAL IF ARCHITECTURE
As mentioned earlier, the conventional receiver, which is shown in Fig. 1(b), uses multistage decimators followed by a PFIR and an SRC. In [7], [8], and [10], a new digital IF architecture, as shown in Fig. 2, is proposed. Its purpose is to extract a desired user channel with a given bandwidth and decimate it to a lower sampling rate for further processing by the DSP. Depending on the required downsampling ratio, the digitized IF signal from the high-speed ADC will optionally be passed through a compensated CIC filter with a decimation factor of , which is a positive power-of-two integer, and an appropriate number of multistage decimators, each with a decimation factor of two. Without loss of generality, we assume that our receiver consists of the compensated CIC filter with a maximum decimation factor of 16 and a three-stage decimator, so that it can support signal bandwidths ranging from the GSM to the Hiperlan/2 standards. That is, the downsampling ratio ranges from 4 to 295.3849, assuming that the digital IF signal is sampled at 80 M samples per second (sps), and the maximum downsampling ratio of the SRR is 512; a detailed example will be discussed in Section VI. The maximum downsampling ratio can be adjusted by increasing or decreasing the decimation factor of the CIC filter and the number of anti-aliasing filters. As an illustration, Fig. 2(c) shows a three-stage decimator consisting of three general LPFs denoted by LPF#1, LPF#2, and LPF#3. The output of the multistage decimators is then fed to a VDF-based SRC to provide the required fractional sample rate conversion. Finally, the output of the SRC is fed to an HBF, which is just sufficient to reduce the residual interference while keeping the system delay and complexity as low as possible. Consequently, the overall downsampling ratio of the proposed SRR is given by (2-1), where is the downsampling ratio of the compensated CIC filter;
is the arbitrary downsampling ratio of the SRC; and is an integer representing the number of the remaining 2-to-1 decimators to be selected. Fig. 3 shows an example of the operation of the proposed SRR for . From (2-1), the downsampling ratio can be achieved by choosing , and . In general, the VDF-based SRC is more complicated to design and realize than the other digital filters in the SRR. Therefore, it is preferable to perform the arbitrary sample rate conversion by the VDF-based SRC after the compensated CIC filter and the multistage decimators so that the sampling rate and hence the power consumption can be lowered.
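As a rough illustration of how an overall downsampling ratio might be split among the stages described above, the following sketch assumes the model total = cic * 2**k * beta * 2, where the trailing factor 2 is the final HBF and beta lies between 1 and 2. The function name `split_ratio`, its parameter names, and the policy of filling the 2-to-1 decimators before enabling the CIC filter are assumptions made for this example only; they are not taken from the paper.

```python
import math

def split_ratio(total, max_cic=16, max_stages=3):
    """Split an overall ratio into (cic, k, beta).

    Assumed model (hypothetical names, not the paper's notation):
        total = cic * 2**k * beta * 2
    cic  : power-of-two CIC decimation factor (1 means the CIC is bypassed)
    k    : number of 2-to-1 decimators in use
    beta : arbitrary SRC ratio, 1 <= beta <= 2
    The trailing factor 2 accounts for the final half-band filter.
    """
    max_pow = int(math.log2(max_cic)) + max_stages
    if not 2.0 <= total <= 4.0 * 2 ** max_pow:
        raise ValueError("ratio outside the supported range")
    x = total / 2.0                 # remove the HBF factor
    _, e = math.frexp(x)            # x = m * 2**e with 0.5 <= m < 1
    p = min(e - 1, max_pow)         # power-of-two part, capped by the hardware
    beta = x / 2 ** p               # what is left for the SRC
    k = min(p, max_stages)          # assumed policy: use 2-to-1 stages first
    cic = 2 ** (p - k)
    return cic, k, beta

# Ratio quoted in the text for the GSM case at 80 Msps.
cic, k, beta = split_ratio(295.3849)
```

With the default limits, the largest ratio the sketch accepts is 16 * 8 * 2 * 2 = 512, which agrees with the maximum downsampling ratio stated in the text.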
A. Generalization for Receiving More Channels
The proposed architecture in Fig. 2 can be generalized to receive a set of adjacent user channels instead of a single one. The basic idea is to employ an -channel over-sampled discrete Fourier transform (DFT) filter bank (FB) after the HBF, as shown in Fig. 4(a). The downsampling ratio of the FB is . More precisely, we treat consecutive channels as a single channel and choose an appropriate sampling factor to remove the interference from other channels. After the HBF, the sampling rate will be , where is the channel spacing. By decimating the output by a factor of 2 and using an -channel over-sampled DFT FB with a downsampling ratio of , the users' channels can be isolated, each having a sampling rate of . In this paper, we shall only focus on its polyphase structure shown in Fig. 4(a). This can be viewed as a generalization of the software radio base-station first proposed in [21] for the digital advanced mobile phone system, in that the programmable decimator and arbitrary SRC proposed above are employed to select the desired channels and convert them to an appropriate sampling rate for the -channel DFT FB. In DFT FBs, each subband filter, , , is obtained by modulating a low-pass prototype filter using the inverse DFT (IDFT). In order to avoid aliasing, the passband edge and stopband edge of the prototype filter should satisfy [22] and , and the center frequency for the th channel is given by . As an example, Fig. 4(b) shows the frequency response of an 8-channel oversampled DFT FB with a downsampling ratio of 4. It is obtained by modulating an FIR linear-phase (LP) prototype with and , which has a passband ripple of 0.00173 and a stopband ripple of 0.0001. By cascading this 8-channel DFT FB with the SRR, up to eight consecutive channels can be extracted by the digital IF simultaneously. In what follows, the design of the various components of the proposed SRR will be briefly outlined.
Their multiplier-less realization and finite wordlength effects will be described later in Sections III-V. 
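The modulation step of the DFT FB described above can be sketched as follows. The sketch assumes the common convention in which the k-th subband filter is the prototype multiplied by exp(j*2*pi*k*n/M), so that its response is the prototype response shifted to the center frequency 2*pi*k/M. The toy prototype coefficients and the helper names are hypothetical; the actual prototype of Fig. 4(b) is not reproduced here.

```python
import cmath
import math

def modulate(prototype, k, num_channels):
    """k-th subband filter under the assumed convention h_k[n] = h[n] * exp(j*2*pi*k*n/M)."""
    return [h * cmath.exp(2j * math.pi * k * n / num_channels)
            for n, h in enumerate(prototype)]

def response(h, omega):
    """Frequency response sum_n h[n] * exp(-j*omega*n)."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

# Toy low-pass prototype (illustrative values only).
prototype = [0.1, 0.2, 0.4, 0.2, 0.1]
M = 8

h3 = modulate(prototype, 3, M)
center = 2 * math.pi * 3 / M
# At its center frequency, subband 3 has the DC gain of the prototype.
gain_at_center = abs(response(h3, center))
```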
B. Design of the Second-Order Cascaded Integrator-Comb (CIC) Compensator
The basic CIC filter [4] is commonly employed when a large downsampling ratio is required, because of its reasonable performance and low hardware complexity. However, the passband droop of the CIC filter significantly limits the quality of anti-aliasing filters if the decimation ratio is small [23]. In general, the transfer function of the CIC filter is given by (2-2) where ; and is the number of CIC stages. The sharpened CIC (SCIC) filter [6] and the interpolated second-order polynomial (ISOP) [5] were proposed to improve the passband droop of the basic CIC filters. Although the performance of the simple ISOP is inferior to that of the SCIC, it is usually sufficient for most applications. On the other hand, its implementation complexity is significantly lower than that of the SCIC due to its simple structure. A drawback of the ISOP is the rather high dynamic range of the filter coefficients and its long delay chain. In [7], [8], and [10], the following second-order CIC compensator is proposed:
where and are real-valued constants to be determined. As shown in Fig. 5(a), it is placed after the CIC filter. Note that is chosen to be linear-phase so as to avoid any phase distortion and reduce the implementation complexity. Given the frequency response of the CIC filter in (2-2), the coefficients and can be readily determined using the Parks-McClellan algorithm. It was shown that the compensated CIC filter has a lower dynamic range of filter coefficients and a shorter delay chain than the ISOP because of the increased flexibility provided by a general linear-phase filter. On the other hand, the hardware complexity is still very low thanks to the use of SOPOT coefficients and the MB technique, which will be discussed later in Section III. Table I summarizes the SOPOT coefficients of the CIC compensator. Note that only two adders are required to implement the multiplications with and . Furthermore, using the noble identity [24], the compensated CIC filter in Fig. 5(a) can be implemented more efficiently as shown in Fig. 5(b).
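The passband droop that the compensator is meant to correct can be inspected numerically with a short sketch. It assumes the textbook CIC magnitude response |sin(w*M/2) / (M*sin(w/2))|^N, normalized to unit gain at DC; this normalization and the names `cic_gain` and `droop_db` are assumptions for illustration and may differ from the scaling used in (2-2).

```python
import math

def cic_gain(omega, m, n_stages):
    """Magnitude of a DC-normalized CIC filter (assumed textbook form).

    omega    : frequency in rad/sample at the CIC input rate
    m        : decimation factor
    n_stages : number of integrator-comb stages
    """
    if abs(math.sin(omega / 2.0)) < 1e-12:
        return 1.0                      # limit of the ratio where sin(w/2) = 0
    ratio = math.sin(omega * m / 2.0) / (m * math.sin(omega / 2.0))
    return abs(ratio) ** n_stages

def droop_db(omega, m, n_stages):
    """Attenuation in dB relative to DC at frequency omega."""
    return -20.0 * math.log10(cic_gain(omega, m, n_stages))

# Droop at a quarter of the output Nyquist band for M = 16 and N = 4.
example_droop = droop_db(0.25 * math.pi / 16, 16, 4)
```

The response has nulls at multiples of 2*pi/M, which is where the aliasing bands are centered after decimation by M.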
C. Design of Low-Pass Anti-Aliasing Filters
In this subsection, we shall consider the design of the low-pass anti-aliasing filters in the multistage decimators shown in Fig. 2(c).
For the sake of presentation, a multistage decimator with a decimation factor of four is designed as an example. The overall passband and stopband edges of the decimation filter are chosen as and , respectively, and the desired stopband attenuation is 80 dB. The multistage decimator is divided into two sub-stages, and each sub-stage decimates the input signal by a factor of two. Although the given specification can be met by a single FIR filter, the multistage implementation usually has a much lower complexity [24]. In [6], all the filters in the multistage decimator are chosen as LP HBFs. The passband and stopband edges of the first and second HBFs are and , respectively. Note that, due to the structural constraints of the HBF, has to be equal to . This limits the performance of the decimation, as can be seen from the dotted line in Fig. 6. The stopband attenuation of the multistage decimator designed using these HBFs has a peak of about 60 dB between and because the transition band of the first HBF coincides with one of the transition bands of the interpolated frequency response of the second HBF. Therefore, the first HBF cannot attenuate the transition band of the second HBF to the desired stopband attenuation. To avoid this problem, one may slightly modify the specification of the first HBF to and , since the transition band of the first aliased folding in the second HBF starts at . This will, however, increase the system delay and complexity of the multistage decimators. Another possibility is to employ general linear-phase LPFs. To satisfy the given stopband attenuation, their stopband edges should start at the transition band of the first aliased folding of the previous filters. As a result, the passband and stopband edges of the first LPF should be and , respectively. The frequency response of the multistage decimator designed using this LPF is shown as the solid line in Fig. 6. It can be seen that the stopband attenuation of 80 dB is now achieved.
In general, let and be, respectively, the passband and stopband edges of the th anti-aliasing filter, relative to its input sampling rate , and be the downsampling ratio of the th decimator. Then the th anti-aliasing filter satisfies the following inequalities:
and (2-4) where and are the overall passband and stopband edges of previous digital filters. In this work, general LPFs will be employed in the multistage decimators to improve the flexibility and hence the system performance. Moreover, to attenuate the aliasing component around , even-length filters are used so that a zero is imposed at for all the anti-aliasing filters. The filter coefficients can be readily determined using the Parks-McClellan algorithm. Their multiplier-less realization will be discussed later in Section III. Next, we shall consider the design and implementation of the SRC.
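The remark that an even-length linear-phase filter automatically has a zero at the half sampling frequency can be checked with a small sketch. The coefficient values below are hypothetical and only reuse the length-12 symmetry h(n) = h(11 - n); they are not the designed LPF coefficients.

```python
import cmath
import math

def freq_response(h, omega):
    """Frequency response sum_n h[n] * exp(-j*omega*n)."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

# Hypothetical half of a symmetric impulse response (illustrative values only).
half = [0.01, -0.03, 0.02, 0.08, 0.20, 0.30]
h = half + half[::-1]          # length 12, h[n] == h[11 - n]

# At omega = pi the terms h[n] * (-1)**n cancel in pairs (n, 11 - n),
# because n and 11 - n always have opposite parity.
gain_at_pi = abs(freq_response(h, math.pi))
gain_at_dc = abs(freq_response(h, 0.0))
```

The cancellation depends only on the even length and the symmetry, so it survives any later quantization of the coefficients that preserves the symmetry.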
D. Design of SRC
The design of programmable SRCs with arbitrary conversion factors was studied in detail by Ramstad [24]. In general, there are two approaches to implementing an SRC, with different tradeoffs between the sampling rate and the hardware complexity. One is to employ the structure in Fig. 7(a), where the input signal is first up-sampled by a factor of by inserting zeros between successive time samples. This creates images in the frequency domain, which are then removed by an -band interpolated filter with spectral support from to . If is sufficiently large, further interpolation with an irrational downsampling ratio can be achieved simply by a low-order interpolation such as Lagrange interpolation [26] or cubic spline [27]. One drawback of employing this structure is that the output of the multistage decimators, which is obtained by downsampling the high-rate IF signal from the ADC, has to be upsampled again by the -band filter. Alternatively, the functions of the -band filter and the interpolator can be implemented using a VDF [28], [29] with a control parameter , as in Fig. 7(b). For modest downsampling ratios, the VDF-based SRC is more efficient than the structure in Fig. 7(a) because its coefficients can be jointly optimized to fulfill the given spectral and fractional-delay specifications. In the proposed SRR, is chosen to lie between 1 and 2. This leads to a better performance without having to increase the sampling rate as in the -band filter approach. As a result, the operating rate of the multistage decimators can be significantly lowered to reduce the power consumption. With this choice of , the overall downsampling factor of the SRR is greater than 2 because the output of the SRC must be fed to the HBF. The proposed VDF-based SRC has the following ideal frequency response:
where , , and are the group delay, the passband and stopband edges of the VDF, respectively. One advantage of employing the VDF is that it can be implemented efficiently using the Farrow structure [9], as shown in Fig. 8. It consists of a set of subfilters followed by multiplications with the appropriate powers of the parameter . More precisely, the transfer function of a VDF can be expressed as in (2-6). This allows us to compute the required samples at fractional sampling intervals by tuning a single parameter , which in turn provides the required arbitrary sampling rate conversion. For the design of VDFs, interested readers are referred to [28] and [29]. As an example, Fig. 9 shows the frequency responses of the VDF-based SRC with the passband and stopband cutoff frequencies respectively given by and . This Farrow-based VDF can be efficiently implemented using the method in [12]. More precisely, all the subfilters are redrawn in their transposed forms so that the input is multiplied directly with all the constant filter coefficients. Consequently, by making use of SOPOT coefficients and the MB technique [14], the total number of additions can be kept minimal by reusing the intermediate results generated. Finally, it was found that the symmetric or anti-symmetric impulse responses of the subfilters significantly decrease the hardware complexity.
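The way a Farrow structure combines fixed subfilters with powers of a single tuning parameter can be sketched as follows. The subfilter outputs are combined by Horner's rule, so only one general multiplication per subfilter branch is needed. The two-branch linear interpolator used here is a deliberately simple stand-in, not the VDF designed in the paper, and the names `farrow_output`, `LINEAR`, and `mu` are hypothetical.

```python
def farrow_output(window, subfilters, mu):
    """Evaluate sum_l mu**l * (c_l . window) with Horner's rule.

    window     : most recent input samples [x[n], x[n-1], ...]
    subfilters : list of fixed coefficient lists c_0, c_1, ...
    mu         : tuning parameter (here a fractional delay in samples)
    """
    # Each subfilter is an ordinary fixed-coefficient FIR filter.
    branch = [sum(c * x for c, x in zip(coeffs, window)) for coeffs in subfilters]
    y = 0.0
    for v in reversed(branch):     # Horner: ((v_L * mu + v_{L-1}) * mu + ...) + v_0
        y = y * mu + v
    return y

# Linear interpolation written as a Farrow structure (illustrative stand-in):
# y = x[n] + mu * (x[n-1] - x[n]), i.e. x delayed by mu samples.
LINEAR = [[1.0, 0.0], [-1.0, 1.0]]

# For the ramp x[n] = n, delaying sample 5 by 0.25 gives 4.75.
y = farrow_output([5.0, 4.0], LINEAR, 0.25)
```

Only the multiplications by `mu` are general-purpose; the subfilter coefficients are constants, which is what makes the SOPOT and MB techniques applicable to them.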
III. MULTIPLIER-LESS REALIZATION OF SRR
In this section, the multiplier-less realization of the proposed SRR will be described. As mentioned earlier, the constant coefficients in the CIC compensator, LPFs, HBF, and the subfilters of the VDF can be efficiently implemented as a limited number of shifts and additions by employing the SOPOT representations [11]-[13] as follows:
where and ; and are positive integers and their values determine the dynamic range of the coefficients; is the number of terms used in the coefficient approximation. To further reduce the implementation complexity, the MB technique proposed in [14] is also employed. The basic idea of the MB is to reduce the redundancies in implementing all the SOPOT coefficients by removing any possible common sub-expressions in their representations. We now briefly describe how the SOPOT coefficients of the proposed SRR can be determined. For example, given the real-valued coefficients in the subfilters of the VDF-based SRC, the corresponding SOPOT coefficients can be obtained by a number of methods [7], [10]-[13], [30]. Here, we shall employ the random search algorithm reported in [7] because different types of constraints can be easily incorporated. The objective function to be minimized can be written as follows:
where is the passband peak ripple error; is the stopband peak ripple error; is the fractional-delay peak ripple error; is the total number of SOPOT terms used to implement all the SOPOT coefficients; and are, respectively, the desired frequency response and group delay, which are defined in (2-6); and are, respectively, the frequency response and group delay of the multiplier-less VDF; , and are the maximum tolerances of the passband, stopband, and fractional-delay peak ripple errors, respectively. In the random search algorithm, the given real-valued coefficients are first obtained by the WLS approach proposed in [28] and [29]. Let be the vector containing these real-valued coefficients. Then, the algorithm repetitively calculates a candidate SOPOT vector given by
where is a random vector with elements chosen in the range . is a user-defined variable used to control the size of the neighborhood to be searched, and is the rounding operator that converts every element inside the input vector to its closest SOPOT value bounded between and . The performance measures , and of the new SOPOT coefficients are then calculated. The set that yields the minimum total number of SOPOT terms, while satisfying the given specifications and the wordlength constraints, is declared as the optimum solution. The SOPOT coefficients for the other components, namely the HBF, LPFs, and CIC compensator, can be determined by the same approach. Table II summarizes the specifications and performances before and after SOPOT optimization for the components of the proposed SRR. The corresponding SOPOT coefficients are listed in Tables III-V, except those for the VDF-based SRC and the HBF because of page limitation. Figs. 9 and 10 show the corresponding frequency responses. It can be seen that they possess good frequency characteristics while achieving a low implementation complexity. In fact, there is a tradeoff between the filter performance and the hardware complexity. If the lower bound is sufficiently large, the performance of the multiplier-less filters will be close to that of their real-valued counterparts, at the expense of increased hardware complexity.
TABLE IV. SOPOT coefficients of the LPF#2 (filter length = 12, h(n) = h(11 - n)).
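The random search described above can be sketched in a simplified form. The sketch perturbs the real-valued coefficients within a small neighborhood, rounds each candidate to a fixed-point grid, counts the nonzero canonical signed-digit (CSD) terms, and keeps the feasible candidate with the fewest terms. Several simplifications are assumptions of this example rather than the method of [7]: feasibility is judged by the peak deviation from the real-valued frequency response instead of separate passband, stopband, and group-delay tolerances, rounding is to a uniform grid with `frac_bits` fractional bits, and all function and parameter names are hypothetical.

```python
import cmath
import math
import random

def csd_terms(k):
    """Number of nonzero digits in the canonical signed-digit form of integer k."""
    k, count = abs(k), 0
    while k:
        if k & 1:
            k -= 2 - (k & 3)       # digit is +1 if k % 4 == 1, otherwise -1
            count += 1
        k >>= 1
    return count

def peak_error(h_ref, h_q, grid=64):
    """Peak magnitude of the frequency-response difference on [0, pi]."""
    worst = 0.0
    for i in range(grid + 1):
        w = math.pi * i / grid
        d = sum((q - r) * cmath.exp(-1j * w * n)
                for n, (r, q) in enumerate(zip(h_ref, h_q)))
        worst = max(worst, abs(d))
    return worst

def random_search(h_ref, frac_bits, tol, alpha, iters=500, seed=0):
    """Return (coefficients, total_terms) of the best feasible candidate found."""
    rng = random.Random(seed)
    scale = 2 ** frac_bits
    best, best_terms = None, None
    for it in range(iters):
        # Iteration 0 is plain rounding; later ones search a neighborhood of size alpha.
        ints = [round((h + (0.0 if it == 0 else alpha * rng.uniform(-1.0, 1.0))) * scale)
                for h in h_ref]
        cand = [k / scale for k in ints]
        if peak_error(h_ref, cand) > tol:
            continue               # violates the (simplified) specification
        terms = sum(csd_terms(k) for k in ints)
        if best is None or terms < best_terms:
            best, best_terms = cand, terms
    return best, best_terms

# Toy coefficients (illustrative only), 10 fractional bits, search radius of 4 LSBs.
h_real = [0.05, 0.25, 0.4, 0.25, 0.05]
best, best_terms = random_search(h_real, 10, 0.01, 4.0 / 1024)
```

A larger `alpha` widens the neighborhood and can find sparser coefficients, at the cost of more rejected candidates for a given tolerance.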
To implement this multiplier-less SRR using the MB technique, all FIR filters or subfilters are implemented in their transposed form, as shown in Fig. 8(b). Instead of passing the input signal through the delay chain as in the direct-form implementation, it is now multiplied with a large number of constant coefficients in SOPOT form before the products are added together. Therefore, the redundant additions in these SOPOT products can be removed by an MB, which greatly reduces the arithmetic complexity. In principle, it is possible to remove all the redundancy found in the SOPOT coefficients, leading to a minimum-adder realization. This can drastically reduce the number of adders required for realizing the SRR. Interested readers are referred to [14] for more details on the generation of the MB. Similarly, the above multiplier-less realization approach can be applied to the DFT-FB-based channelizer discussed in Section II using the technique proposed in [31] and [32]. However, details are omitted due to page limitation. We now present the signal round-off model for the SRR. The problem of wordlength determination using this model will be given later in Section V.
Fig. 11. Transposed form implementation of a typical FIR digital filter with (a) round-off noise model due to the finite wordlength effect, (b) noise power being modeled as uncorrelated white noise sources. D: register; Q{·}: rounding operator; e(n): rounding noise; P : rounding noise power; P : total output noise power at ith stage.
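To make the shift-and-add idea of this section concrete, the sketch below approximates a real coefficient by a few signed powers of two and then reconstructs its value from those terms. It uses a simple greedy rule (repeatedly take the signed power of two nearest to the residual), which is an assumption of this example and not the random search or the MB generation used in the paper; the names `to_sopot`, `sopot_value`, `min_exp`, and `max_exp` are hypothetical.

```python
import math

def to_sopot(value, max_terms, min_exp, max_exp):
    """Greedy SOPOT approximation: list of (sign, exponent) with value ~ sum(sign * 2**exponent)."""
    terms = []
    residual = value
    for _ in range(max_terms):
        mag = abs(residual)
        if mag < 2.0 ** (min_exp - 1):
            break                          # no allowed term can reduce the error
        _, e = math.frexp(mag)             # mag = m * 2**e with 0.5 <= m < 1
        lo = e - 1                         # 2**lo <= mag < 2**(lo + 1)
        exp = lo if mag - 2.0 ** lo <= 2.0 ** (lo + 1) - mag else lo + 1
        exp = max(min_exp, min(max_exp, exp))
        sign = 1 if residual > 0 else -1
        terms.append((sign, exp))
        residual -= sign * 2.0 ** exp
    return terms

def sopot_value(terms):
    """Value represented by a list of (sign, exponent) terms."""
    return sum(sign * 2.0 ** exp for sign, exp in terms)

# 0.375 needs two terms; 0.3 is approximated with three.
terms_a = to_sopot(0.375, 2, -8, 0)
terms_b = to_sopot(0.3, 3, -8, 0)
```

In hardware each term corresponds to a hard-wired shift, so a coefficient with T terms costs T - 1 adders before any sharing of common sub-expressions by the MB.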
IV. SIGNAL ROUND-OFF AND OVERFLOW ANALYSES

A. Analysis of Signal Round-Off Noise
Signal round-off errors occur due to the rounding of intermediate signals after multiplications. Since the exact round-off errors are difficult to analyze, they are usually treated as uncorrelated white noise sources [33]. For rounding operations, the quantization noise will have a zero mean with a variance equal to , where is the quantization step-size. In other words, the variance is determined by the number of fractional bits that are retained after multiplication. Consider the example in Fig. 11(a), where a digital FIR filter with impulse response is implemented in the transposed form. Assume that the filter coefficients are represented as SOPOT coefficients and are simultaneously realized using the MB technique. Hence, the maximum wordlengths required for the products can be determined. To minimize the hardware complexity, these products may be rounded using the signal round-off operator . In fixed-point arithmetic, each intermediate signal can be represented in the form of , where is the number of integer bits including the sign bit and is the number of fractional bits. In general, if bits are rounded to bits, where , then the noise variance is given by where . More generally, consider the round-off noise model of the LTI system in Fig. 12, where the signals to be quantized are for ; is the total number of rounding sources. From (4-1), if is rounded to bits, then the variance of the quantization error, , is given by . Let the transfer function from to the output be , . Furthermore, we assume that the noise sources are uncorrelated. Hence, the variance of the output noise at can be expressed as follows:
where ; is the transfer function from to ; and is the impulse response corresponding to . Returning to the proposed SRR, Fig. 11(b) shows the noise power model of the th stage of the SRR. If there are such rounding processes at the th stage, then the total noise power due to these rounding sources is simply given by (4-3)
The total output noise power at the th stage, , taking into account noise sources at previous stages is (4-4) where is the impulse response of the digital filter in the current stage, which is assumed to have a filter length of . The output accuracy at the th stage, in terms of the number of fractional bits, is therefore approximately given, in bits, by (4-5) [33]. It should be noted that the larger the number of noise sources, the lower the output accuracy. The noise power can, however, be reduced by increasing the internal wordlengths for the fractional bits at different stages of the SRR, at the expense of increased hardware complexity. Next, we shall consider the signal overflow effects.
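The round-off noise model of this subsection can be sketched numerically under the standard assumptions stated above: each rounding source is white with variance step**2 / 12, the sources are uncorrelated, and each reaches the output through its own impulse response. The conversion from noise power back to an equivalent number of fractional bits simply inverts step**2 / 12 and is an assumption of this example, since the exact form of (4-5) is not reproduced here; all names are hypothetical.

```python
import math

def rounding_variance(frac_bits):
    """Variance of rounding to frac_bits fractional bits: step**2 / 12 with step = 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return step * step / 12.0

def output_noise_power(sources):
    """Total output noise power for uncorrelated sources.

    sources : list of (frac_bits, impulse_response_from_source_to_output)
    """
    return sum(rounding_variance(bits) * sum(g * g for g in resp)
               for bits, resp in sources)

def equivalent_frac_bits(noise_power):
    """Fractional bits b whose single rounding noise 2**(-2b) / 12 equals noise_power."""
    return -0.5 * math.log2(12.0 * noise_power)

# One 10-bit source with unit gain behaves like 10 fractional bits at the output;
# a second identical source doubles the noise power and costs half a bit.
one_source = equivalent_frac_bits(output_noise_power([(10, [1.0])]))
two_sources = equivalent_frac_bits(output_noise_power([(10, [1.0]), (10, [1.0])]))
```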
B. Overflow Handling
Signal overflows occur when the allocated wordlength of the integer bits is insufficient to accommodate the growth in the integer wordlength of the signal after additions. In order to avoid overflow, more bits must be allocated to the integer part of the adder output and the register holding it. There is, however, an option to retain or decrease the number of bits in the fractional part, depending on the required output accuracy. In FIR filters, it is possible to determine whether signal overflow will occur at a particular adder using the L1 scaling measure. More precisely, the input signal is assumed to take on its maximum value denoted by . Then, the maximum value after implementing the th impulse response coefficient of the target system is bounded by (4-6). Using (4-6), it is possible to determine the worst-case integer wordlength of each adder, and hence the size of its output register, to avoid signal overflow. Other methods, such as L2 scaling, can also be used to handle signal overflows, but they leave a small probability that overflows will occur. To decide whether the fractional bits at an adder output should be retained or reduced, we can imagine that a noise source is generated by the rounding operation, and the minimum acceptable wordlength is then determined as if it were a rounding source due to multiplication. If the minimum wordlength obtained is larger than the existing wordlength, then the wordlength has to be increased. Otherwise, rounding can be performed if the additional noise generated does not violate the prescribed accuracy. For IIR filters, scaling is usually performed at certain stages of the system to avoid overflow [33]. Since scaling is a multiplication operation, it can be treated similarly by our model.
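The L1 bound used above for overflow handling can be illustrated with a short sketch for a transposed-form FIR filter. It assumes that the adder associated with tap k accumulates the contributions of taps k to N-1, so its worst-case magnitude is x_max times the sum of the absolute values of those coefficients, and that signals are stored in two's complement with the sign bit counted as an integer bit. These structural assumptions and the name `adder_integer_bits` are hypothetical; the exact form of (4-6) is not reproduced here.

```python
import math

def adder_integer_bits(h, x_max=1.0):
    """Worst-case integer bits (including sign) at each transposed-form adder.

    Entry k bounds the partial sum that contains taps k .. N-1 by
    x_max * sum(|h[j]| for j >= k) and returns the smallest I such that
    the bound fits in the two's-complement range [-2**(I-1), 2**(I-1)).
    """
    bits = []
    tail = 0.0
    for coeff in reversed(h):
        tail += abs(coeff)
        bound = x_max * tail
        _, e = math.frexp(bound)       # bound = m * 2**e with 0.5 <= m < 1
        bits.append(max(1, e + 1))     # need bound < 2**(I-1), i.e. I >= e + 1
    return bits[::-1]

# Taps 0.5, 0.25, 0.25 with |x| <= 1: the final sum can reach 1.0,
# which does not fit in [-1, 1), so the output adder needs 2 integer bits.
example_bits = adder_integer_bits([0.5, 0.25, 0.25])
```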
V. WORDLENGTH DETERMINATION
In this section, the problem of minimizing the hardware complexity of the SRR subject to a prescribed output accuracy is studied. Since adder cells and/or registers usually account for the major hardware resources, their number is employed as the measure of hardware complexity. Other measures can also be used with slight modification of the algorithms. From Section IV, we know that the number of adder cells and registers is determined by the exact wordlengths used for the intermediate data. Therefore, the internal wordlengths of the intermediate data are the variables to be optimized. In general, the determination of the internal wordlengths can be done in three steps. First, the real-valued coefficients of the SRR are designed as detailed in Section II. They are then converted into SOPOT coefficients using the random search algorithm and implemented using the MB technique as described in Section III. After that, the output accuracy of the SRR is specified statistically by its output noise power, which is assumed to be generated by the rounding operations performed. These depend on the formats of the internal wordlengths as described in Section IV-A. In what follows, we shall describe three algorithms to determine these internal wordlengths. In our proposed IF architecture, all the filters are FIR filters except for the CIC filter, which is implemented as an integrator. The wordlengths of the basic CIC filter without any round-off and overflow errors will be treated separately in Section V-D.
A. Analytic Solution
The problem of determining the wordlengths for a given output noise power can be formulated as the constrained optimization problem in (5-1), in which a weighted sum of the wordlengths is minimized subject to the prescribed output noise power. Here, the weight vector is constant and the variable vector represents the fractional parts of the internal wordlengths to be determined. In most cases, all the weights are chosen as one. If the wordlengths are allowed to take on real values, instead of integer values, then the minimization problem in (5-1) can be solved analytically using the method of Lagrange multipliers [16]. The Lagrangian function is defined in (5-2), with a Lagrange multiplier associated with the noise power constraint. Taking the partial derivatives of (5-2) and setting them to zero yields the stationarity conditions.
From the resulting equations, the wordlengths are obtained as in (5-4). Equating the left-hand side of (5-4) and after slight manipulation, one gets the desired result (5-8). Alternatively, we can minimize the output noise power subject to a given bit budget. The design problem then becomes that stated in (5-9). Using again the Lagrange multiplier method, the optimal solution is found to be as given in (5-10).
A possible problem with the analytical formula above is that the resulting wordlengths are real valued. To obtain an integer solution, they need to be rounded up to the next largest integers. Moreover, for an extremely low bit budget or a large target variance, some wordlengths can even become negative. On the other hand, for a high bit budget or a small target variance, the problem is less serious and the solution so obtained is more accurate.
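To make the closed-form idea concrete, the following sketch works under an assumed noise model in which source i contributes (g_i/12)·2^(-2 b_i) to the output noise, with g_i a hypothetical noise gain. Minimizing the weighted sum of b_i subject to a total noise target with a Lagrange multiplier then gives b_i = 0.5·log2((g_i/12)·W/(w_i·target)), where W is the sum of the weights. This is a derivation for the assumed model only; it is not claimed to reproduce (5-8) term by term.

```python
import math

def total_noise(gains, bits):
    """Output noise under the assumed model: sum of (g/12) * 2^(-2b)."""
    return sum(g / 12.0 * 2.0 ** (-2 * b) for g, b in zip(gains, bits))

def analytic_wordlengths(gains, target_noise, weights=None):
    """Real-valued wordlengths minimizing sum(w*b) with total_noise == target."""
    weights = weights or [1.0] * len(gains)
    w_sum = sum(weights)
    return [0.5 * math.log2((g / 12.0) * w_sum / (w * target_noise))
            for g, w in zip(gains, weights)]

# Hypothetical noise gains and a target equal to one 16-bit rounding source.
gains = [1.0, 4.0, 0.25]
target = 2.0 ** -32 / 12.0

real_bits = analytic_wordlengths(gains, target)
int_bits = [math.ceil(b) for b in real_bits]  # round up to stay within the target
print(real_bits, int_bits, total_noise(gains, int_bits) <= target)

# A very loose target drives some wordlengths negative.
print(analytic_wordlengths(gains, 10.0))
```

Rounding up keeps the constraint satisfied but overshoots the accuracy, which is the inefficiency that the integer-valued methods in the following subsections aim to avoid.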
B. Random Search Algorithm
Similar to the random search algorithm proposed in [8], the minimization problem in (5-1) can be reformulated as a search problem in which a measure of hardware cost, say the total number of adder cells, is minimized over candidate wordlength vectors bounded by a reference vector, which stores the maximum wordlengths of the intermediate data [see, e.g., Fig. 11(a)]. The latter provides a rounding option to either retain the fractional part of each scaled output or reduce it by an appropriate value, depending on the required output accuracy. More precisely, the basic idea is to search for the wordlength vectors in the neighborhood of their full-precision values, i.e., the values without rounding. Our goal is to minimize the internal wordlengths of the intermediate data, as specified by the wordlength vector and the reference vector, so that the hardware cost is minimized subject to the given specifications. The candidate with the minimum cost is declared as the solution of this problem. Similarly, the above formulation can be modified to handle the problem of maximizing the output bit accuracy with a given bit budget. There are several advantages of this algorithm: 1) it is applicable to problems with very complicated inequality constraints, as illustrated in this work; 2) the time to obtain a high-quality solution is manageable on present-day computers, especially when an initial solution is available by some means (for instance, one may use the solution obtained in Section V-A to speed up the search); 3) it is possible to combine this search with the SOPOT determination, but the computational time will be greatly increased. For simplicity, the two processes are performed separately in this work. Finally, the solution is integer valued.
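The search idea can be sketched as follows. This is a deliberately simplified random search and not the algorithm of [8]: the cost is taken to be the total number of fractional bits as a stand-in for adder cells, the noise model (g/12)·2^(-2b) per source is assumed, and the names `random_search`, `b_ref`, and `max_drop` are hypothetical.

```python
import random

def total_noise(gains, bits):
    """Output noise under the assumed model: sum of (g/12) * 2^(-2b)."""
    return sum(g / 12.0 * 2.0 ** (-2 * b) for g, b in zip(gains, bits))

def random_search(gains, b_ref, target_noise, max_drop=8, iterations=5000, seed=0):
    """Randomly reduce wordlengths below the reference vector, keeping the
    cheapest candidate that still meets the noise target."""
    rng = random.Random(seed)
    best = list(b_ref)
    if total_noise(gains, best) > target_noise:
        raise ValueError("reference wordlengths do not meet the noise target")
    best_cost = sum(best)
    for _ in range(iterations):
        # Candidate in the neighborhood of the full-precision reference.
        cand = [max(0, b - rng.randint(0, max_drop)) for b in b_ref]
        cost = sum(cand)
        if cost < best_cost and total_noise(gains, cand) <= target_noise:
            best, best_cost = cand, cost
    return best, best_cost

gains = [1.0, 4.0, 0.25]          # hypothetical noise gains
b_ref = [24, 24, 24]              # hypothetical full-precision wordlengths
target = 2.0 ** -32 / 12.0
best, cost = random_search(gains, b_ref, target)
print(best, cost)
```

Because feasibility is checked by evaluating the constraint directly, the same loop accommodates more complicated inequality constraints by replacing the feasibility test.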
C. Bit Allocation Algorithm
It is interesting to note that the analysis in Section V-A is similar to a classical problem in signal compression, known as the bit allocation problem. There are in general two different approaches to solving this problem, namely the discrete Lagrange multiplier method [18] and the Marginal Analysis method [19], [20]. Next, we shall extend the latter to solve the wordlength determination problem. The first problem we address is to minimize the weighted sum of the wordlengths in (5-1) subject to a given noise power. All the wordlengths are first initialized to zero. The algorithm then allocates one bit at a time to one of the wordlengths until the target noise power is met. In each step, the wordlength whose increase yields the largest reduction in output noise power is selected and increased by one bit. The pseudo code of this algorithm is summarized in Table VI. Note that the wordlengths are both non-negative and integer valued. A similar algorithm for minimizing the output noise power subject to a given bit budget can be derived as in Table VII. Again, the wordlengths are both non-negative and integer valued. For multiplier-less realization using SOPOT coefficients, one can easily compute the wordlength required to achieve a given output error variance. Once it is determined, the exact rounding operation at each node, and hence the complexity of the adders and registers, can be determined exactly. The overflow prevention can also be determined according to Section IV-B if the maximum input format is known. Finally, the algorithms described in this section can be combined to shorten the search time, as we shall illustrate by an example in Section VI.
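The greedy allocation described above can be sketched as follows. This is an illustrative version only and not the pseudo code of Table VI: it assumes unit weights and the noise model (g/12)·2^(-2b) per source, and the names `allocate_bits` and `max_total_bits` are hypothetical.

```python
def noise_terms(gains, bits):
    """Per-source output noise under the assumed model (g/12) * 2^(-2b)."""
    return [g / 12.0 * 2.0 ** (-2 * b) for g, b in zip(gains, bits)]

def allocate_bits(gains, target_noise, max_total_bits=10000):
    """Marginal analysis: start from zero bits and repeatedly give one bit to
    the source whose extra bit reduces the output noise the most."""
    if target_noise <= 0:
        raise ValueError("target noise power must be positive")
    bits = [0] * len(gains)
    while sum(noise_terms(gains, bits)) > target_noise:
        if sum(bits) >= max_total_bits:
            raise RuntimeError("bit limit reached before meeting the target")
        terms = noise_terms(gains, bits)
        # One extra bit divides a term by 4, so the reduction is 0.75 * term.
        reductions = [0.75 * t for t in terms]
        best = max(range(len(bits)), key=lambda i: reductions[i])
        bits[best] += 1
    return bits

gains = [1.0, 4.0, 0.25]           # hypothetical noise gains
target = 2.0 ** -32 / 12.0         # noise of one 16-bit rounding source
bits = allocate_bits(gains, target)
print(bits, sum(bits), sum(noise_terms(gains, bits)) <= target)
```

The result is integer valued and non-negative by construction, and the loop stops as soon as the target is met, so no bits are spent beyond what the constraint requires at the granularity of one bit.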
D. CIC Filters
In this subsection, the wordlength determination of the basic CIC filter is briefly studied. Without loss of generality, the CIC filter is assumed to have a given number of stages, as shown in Fig. 13. First of all, consider the integrator section at the left-hand side of the downsampler. Here, we propose to scale down the input signal at each integrator by a power-of-two factor, i.e., by a right shift, to avoid excessive round-off and overflow errors, at the expense of slightly increased hardware complexity. Each integrator has a programmable shifter, which is designed to shift the incoming signal by zero up to a maximum number of bits determined by the maximum downsampling ratio of the CIC filter, which is a positive power-of-two integer. In general, implementing an arbitrary right shift up to this maximum requires a multistage programmable shifter. Next, consider the comb section at the right-hand side of the downsampler shown in Fig. 13. Each comb filter can be viewed as the transposed form of a particular FIR filter with two coefficients of 1 and -1, so that the L1 scaling measure mentioned in Section IV-B can also be applied. Therefore, both the integrator and comb sections will be free of round-off and overflow errors if the integer and fractional bits are appropriately allocated. As an example, the number of stages and the maximum downsampling ratio of the CIC filter are chosen to be 3 and 16, respectively, in order for the proposed SRR to support an overall downsampling ratio from 2 to 512. If a larger maximum downsampling ratio is chosen, the SRR can support a larger range of downsampling ratios by slightly modifying the programmable shifters of the CIC filter. In this work, the input signal to the CIC filter is assumed to have a wordlength format of , i.e., 14 bits with . The wordlengths of the basic CIC filter so obtained are shown in Table VIII. The output signal at the CIC filter has a wordlength format of .
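The amount of growth that the shifters must absorb can be estimated with the standard result that a CIC decimator with a given number of stages and downsampling ratio, and a differential delay of one, has a DC gain equal to the ratio raised to the number of stages. The sketch below assumes that each integrator shifts right by log2 of the downsampling ratio, which is one reading of the scheme above; the function names are hypothetical.

```python
import math

def cic_dc_gain(num_stages, ratio, diff_delay=1):
    """DC gain of a CIC decimator: (ratio * diff_delay) ** num_stages."""
    return (ratio * diff_delay) ** num_stages

def cic_bit_growth(num_stages, ratio, diff_delay=1):
    """Worst-case integer bit growth without any scaling."""
    return math.ceil(num_stages * math.log2(ratio * diff_delay))

def per_integrator_shift(ratio):
    """Right shift per integrator, assuming a power-of-two downsampling ratio."""
    if ratio < 1 or ratio & (ratio - 1):
        raise ValueError("downsampling ratio must be a positive power of two")
    return ratio.bit_length() - 1

# Values from the example in the text: 3 stages, maximum ratio 16.
stages, max_ratio = 3, 16
print(cic_dc_gain(stages, max_ratio))
print(cic_bit_growth(stages, max_ratio))
print(per_integrator_shift(max_ratio))
```

With these values the per-integrator shifts add up to the full bit growth, so the overall DC gain is normalized to unity under the stated assumption.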
It should be noted that the proposed wordlength determination algorithms can also be used to further minimize the wordlengths of the CIC filter since its internal transfer functions are known. For simplicity, we employ the wordlength formats in Table VIII and use the proposed algorithms to determine the wordlengths of the remaining components of the SRR, as we shall illustrate in the next section. Now, we present a detailed design example of the proposed SRR for a multi-standard receiver.
VI. DESIGN EXAMPLE
In this section, we demonstrate the application of the proposed SRR to a multi-standard receiver supporting the GSM, W-CDMA, CDMA2000 and Hiperlan/2 standards. The hardware complexity and the performance of the SRR using the real-valued and SOPOT coefficients are then examined. Finally, a comparison between the proposed and conventional SRR is presented. First of all, let us assume that the digitized IF signal is sampled at 80 Msps. Table IX summarizes some of the useful parameters of the GSM, W-CDMA, CDMA2000 and Hiperlan/2 standards [30], [34], [35]. It also includes the configurations and the computational complexities for both the real-valued and SOPOT coefficient realizations of the proposed SRR. The target specifications of the proposed SRR are as follows:
(0.015 dB in passband deviation), (80 dB in stopband attenuation), (50 dB in fractional-delay error), and (96 dB in output accuracy). The output accuracy, in terms of the number of fractional bits, can be calculated from (4-4) to be 16 bits. Table X shows the passband deviations and the stopband attenuations of the proposed SRR using the real-valued and the SOPOT coefficients for different operating ranges of , i.e., cascading different components. In particular, the frequency responses of the SRR with , i.e., cascading the LPF#3, HBF and the VDF-based SRC with , using the real-valued and SOPOT coefficients are shown in Fig. 14. It can be seen that the stopband attenuations of the SRR using the real-valued coefficients are slightly better than those obtained using the SOPOT coefficients. The opposite is true for the passband deviations. This is because the random search algorithm succeeded in finding a set of SOPOT coefficients which meets the stopband specification of 80 dB (though inferior to the real-valued coefficients) and, at the same time, slightly improves on the given passband deviation specification of 0.015 dB. This substantiates the usefulness of the proposed multiplier-less realization method described in Section III. It should be noted that the total number of multipliers required to implement all the real-valued coefficients of the SRR is 117. On the other hand, the multiplier-less realization using SOPOT coefficients requires only 252 adders. After using the MB technique, the number of adders is further reduced to 111, which is about 44% of the hardware resources required for directly implementing all the SOPOT coefficients. The frequency responses of the proposed SRR, using the SOPOT coefficients, with the operating ranges: a)
, i.e., cascading the HBF and the VDF-based SRC with , and b) , i.e., cascading the LPF#1, LPF#2, LPF#3, HBF and the VDF with , are shown in Fig. 15(a) and (b), respectively. Once the SOPOT coefficients are determined, the internal transfer functions of these filters are known and the rounding sources in the receiver, apart from those in the basic CIC filter, are identified. Using (5-8), the optimal wordlength format for each intermediate signal is obtained. The weight vector has all its entries equal to one. The optimal value of the objective function is found to be 4145.2. As mentioned earlier, the entries of the solution vector are not integer valued. Therefore, for practical implementation, they are rounded up to the next larger integers so that the 16-bit accuracy is still met. The corresponding value of the objective function becomes 4276 and the total noise power is decreased to 1.21 (or 16.527 bit accuracy). The results obtained using the random search algorithm are summarized as follows: ; (i.e., 16.157 bit accuracy) and the computational time is about 20 minutes. For the bit allocation method, we obtain the following results in Table XI: ; (16.002 bit accuracy) and the computation time is within one minute. This method gives the best solution among the three algorithms studied, with a much lower computational time than the random search algorithm. This suggests that the proposed bit allocation algorithm is well suited even for a large-scale system. It should be noted that the computational time of the proposed algorithm can be further reduced to a few seconds if the solution obtained from (5-8) is used as an initial guess. In order to avoid overflow, the worst-case integer bit format of each intermediate signal can then be calculated as described in Section IV-B, assuming that the input signal to the compensated CIC filter has a format of , i.e., 14 bits with . The final output is found to have a wordlength format of . The wordlength formats of the filter outputs are also shown in Fig.
2(b) and (c). Table XI summarizes the design results for the various wordlength determination algorithms. If a fixed wordlength of 24 bits is used, the following results are obtained: ; (13.684 bit accuracy). This suggests that the proposed variable internal wordlength approach is more efficient than the traditional fixed-wordlength method in minimizing the hardware complexity of the SRR while achieving a prescribed output bit accuracy.
Since the proposed SRR is considerably different from the traditional programmable SRR, it is very difficult to make an exact comparison. Nevertheless, to give the reader an idea of the potential benefits and hardware savings of the proposed SRR, a comparison with the programmable receiver proposed in [5] is considered below. The architecture in [5] consists of a CIC filter with , an ISOP sharpening filter, five modified HBFs (MHBFs) as the multistage decimators, and a PFIR. Since the SRC was not designed in [5], we assume that it is realized using the same VDF-based SRC that we have proposed in Section II-C, so that the two have the same complexity. In addition, the finite wordlength effects were not considered in [5]. Therefore, the comparison is based on the number of adders and multipliers required. Except for the ISOP sharpening filter, the dynamic range of the filter coefficients in the two structures is comparable. Table XII shows the hardware complexities of the other components for the two SRRs. It can be seen that the major hardware resources of the structure in [5] are the variable multipliers required in the PFIR. Although the multiplications can be time multiplexed using a high-speed multiplier, this will limit the maximum clock speed of the receiver for wideband applications, i.e., small downsampling factors. In the proposed architecture, the PFIR is replaced by an HBF with fixed coefficients, thanks to the novel VDF-based SRC. Therefore, the complexity can be greatly reduced. Note that, although the VDF still requires three variable multipliers, this is still far fewer than the number required for the PFIR. Apart from the PFIR, the hardware complexities of the two digital IF architectures are comparable. The proposed SRR, however, considerably outperforms [5] in passband deviation. Moreover, the results for the allpass-based realization of the proposed SRR [10] are also given in Table XII as a reference.
It has slightly higher complexity and larger group delay error, but a considerably lower system delay, than its FIR linear-phase counterpart. Interested readers are referred to [10] for more details.
The proposed SRR has also been implemented and tested using the Altera Stratix FPGA and the Quartus II EDA tools from Altera Corporation [36]. Because of the use of SOPOT coefficients and the MB technique, the main hardware resources needed to implement the proposed SRR are shifters and carry-save adders. The hardware complexity, in terms of the logic elements (LEs) and LC registers (LRs) required for the proposed SRR, is summarized in Table XIII. It can be seen that the implementation of the VDF consumes about half of the total hardware resources, most of which come from the four expensive multipliers in the Farrow structure, which are implemented using the DSP blocks in the Stratix FPGA. Since the conventional receiver in [5] requires 55 multipliers, the numbers of LEs and LRs required are about 110000 and 55000, respectively, which are approximately nine times more than those required for the proposed SRR. For ASIC implementation, we expect that similar savings can be obtained if the SOPOT representation is employed, while the multiplications may be realized as optimized multiplier modules. For simplicity, we do not explore these additional implementation issues and options further in this paper. Overall, the design results illustrate the effectiveness of the proposed approach in reducing hardware complexity. It should be noted that the DFT-FB-based channelizer can also be implemented without any multipliers using the technique in [31] and [32] so as to reduce hardware complexity. However, details are omitted due to page limitation.
VII. CONCLUSION
The signal round-off analysis and complexity optimization of a new digital IF for SRR are presented. An advantage of the SRR is that it can be implemented without any multiplications, apart from the limited number of multipliers required in the Farrow structure. Moreover, two novel algorithms for determining the internal wordlengths of the digital IF subject to a prescribed output accuracy are presented. The first one gives a closed-form analytic solution using the Lagrange multiplier method, assuming that the wordlength is a real-valued quantity. The second one is based on the Marginal Analysis method and gives an integer-valued solution. Design results show that the proposed algorithms work well with a large number of variables and that they are applicable to the wordlength determination problems arising in the realization of related digital filtering systems. Because of their short computational time, they may be useful for redesigning and reconfiguring systems, not only for the proposed SRR but also for other real-time applications. Another interesting direction is the extension of the current IF architecture to utilize IIR filters to reduce the overall system delay. The main challenge will be the more complicated signal overflow and round-off problems caused by the increased dynamic range of the intermediate signals.
