Abstract-In this work, we revisit the implementation of polyphase filter banks in the Quadratic Residue Number System (QRNS) for banks with a large number of channels, developing a new design methodology suited to the large systems required in the new generation of satellites. Furthermore, we compare the QRNS filter bank with an equivalent bank implemented in the traditional Complex Two's Complement System (CTCS) in terms of throughput, area and power dissipation. The results for large filter banks confirm the power savings obtained by using the QRNS.
I. INTRODUCTION
Previous work showed the Quadratic Residue Number System (QRNS) [1], [2] to be more efficient in terms of area and power dissipation than the traditional Complex Two's Complement System (CTCS) in the implementation of polyphase filter banks [3]. In [3] the filter is an 8-channel polyphase filter bank, while here we design a 128-channel polyphase filter bank derived from a 1024-tap prototype filter and compare the CTCS and QRNS implementations in terms of throughput, area and power dissipation.
The system specifications are derived from the analysis of channelizers used in the new generation of satellites for multimedia communications [4].
Because of the large number of channels and the size and complexity of the design, we developed a design methodology based on the characterization of the filter components and on spread-sheets to perform the design space exploration and avoid long synthesis iterations. Furthermore, for both CTCS and QRNS, we developed parametric tools for generating the RTL-level VHDL description of the filter, similar to the one described in [5]. The paper is organized as follows: Section II illustrates the filter design, including the fixed-point analysis. Section III discusses the filter implementation, i.e., timing constraint issues and the CTCS and QRNS architectures, while Section IV describes the design flow and the tools used. The implementation results are presented in Section V.
II. FILTER DESIGN AND FXP ANALYSIS
The main difference from the design of [3] is the much larger number of channels (8 vs. 128) and the consequent decimation in frequency, which allows the use of serial filters in the bank (one serial filter per channel); the reference architecture is still a polyphase filter architecture. The number of taps of the prototype filter required for the proposed application is 1024. The filter coefficients a_k have been scaled by a power-of-two factor to maximize register utilization, ensuring that the coefficients are in the range [0.5, 1), while the input samples x(n) are represented by 11 bits. The fixed-point analysis to determine the number of bits for the integer and fractional parts of the internal variables has been carried out both by theoretical analysis and by simulation. The final wordlengths are illustrated in Figure 1. Truncation has been introduced at the filter outputs, reducing the wordlength from 23 to 17 bits. The fixed-point analysis of the IDFT coefficient quantization gave 10 bits for the twiddle-factor representation, while the IDFT outputs require 34 bits.
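The power-of-two scaling and the 23-to-17-bit truncation described above can be sketched as follows. This is a minimal illustration, not the tool flow used for the design: it assumes that the range [0.5, 1) refers to the largest coefficient magnitude and that truncation simply drops the six least significant bits; the function names are hypothetical.

```python
import math

def power_of_two_scale(coeffs):
    """Return (shift, scaled) such that max(|scaled|) lies in [0.5, 1).

    Assumption: the [0.5, 1) range applies to the largest magnitude.
    """
    peak = max(abs(c) for c in coeffs)
    if peak == 0:
        raise ValueError("all-zero coefficient set")
    # frexp writes peak = m * 2**e with 0.5 <= m < 1, so multiplying
    # every coefficient by 2**(-e) moves the peak into [0.5, 1).
    _, e = math.frexp(peak)
    shift = -e
    return shift, [math.ldexp(c, shift) for c in coeffs]

def truncate_lsbs(value, in_bits=23, out_bits=17):
    """Drop the (in_bits - out_bits) LSBs of a two's complement integer.

    Assumption: plain truncation, i.e. an arithmetic right shift.
    """
    return value >> (in_bits - out_bits)
```

Because the scale factor is a power of two, it costs nothing in hardware beyond a reinterpretation of the binary point.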
III. FILTER IMPLEMENTATION
The reference architecture chosen for the filter bank is the polyphase structure. It is composed of a decimator and C = 128 FIR sub-filters (channels) of 1024/128 = 8 taps each, working at frequency F_CLK/C, followed by an Inverse Discrete Fourier Transform (IDFT) unit. The IDFT has been chosen instead of the Inverse Fast Fourier Transform (IFFT) because of the smaller wordlength required to represent the intermediate variables. In fact, the IDFT architecture is composed of a single level of multipliers, while the IFFT, in order to share some intermediate results, uses several levels of multipliers, which increases the wordlength of the intermediate variables. This choice is mandatory for the QRNS implementation because, in this number representation, scaling is a very costly operation.
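The structure described above (decimation, C sub-filters, then an IDFT across the sub-filter outputs) can be checked against a direct channel-by-channel computation on a small example. The sketch below is in floating point and is only illustrative: the sign convention of the exponent, the unnormalised IDFT, and all function names are assumptions not taken from the paper.

```python
import cmath

def polyphase_split(h, C):
    """Branch r receives the prototype taps h[r], h[r + C], h[r + 2C], ..."""
    if len(h) % C:
        raise ValueError("prototype length must be a multiple of C")
    return [h[r::C] for r in range(C)]

def sample(x, t):
    """Input sample at time t, zero outside the recorded signal."""
    return x[t] if 0 <= t < len(x) else 0

def channelize(x, h, C, n):
    """All C channel outputs at decimated time n: sub-filters, then IDFT."""
    branches = polyphase_split(h, C)
    # Each sub-filter runs at the decimated rate on every C-th input sample.
    u = [sum(taps[m] * sample(x, (n - m) * C - r) for m in range(len(taps)))
         for r, taps in enumerate(branches)]
    # Unnormalised IDFT across the C branch outputs.
    return [sum(u[r] * cmath.exp(2j * cmath.pi * k * r / C) for r in range(C))
            for k in range(C)]

def channelize_direct(x, h, C, n):
    """Reference: mix channel k to baseband, filter with h, decimate by C."""
    out = []
    for k in range(C):
        acc = 0
        for i, hi in enumerate(h):
            t = n * C - i
            acc += hi * sample(x, t) * cmath.exp(-2j * cmath.pi * k * t / C)
        out.append(acc)
    return out
```

With a 1024-tap prototype and C = 128, `polyphase_split` returns 128 branches of 8 taps each, matching the sub-filter size given in the text.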
A. Timing Constraint
To perform the design space exploration (DSE), the basic blocks of the filter bank have been characterized in terms of delay, area and power consumption. In particular, the space-certified library of standard cells used in the design has a delay of 150 ps for a fan-out-of-4 (FO4) load. The typical delay of the XOR gate ranges from 300 to 350 ps (depending on the loading conditions). The delay to register data in flip-flops (propagation delay plus set-up time) is in the order of 400-500 ps. To have a reasonable logic depth, about 15-20 gates between registers, the clock period should be no shorter than 5 ns, which corresponds to a maximum clock frequency of 200 MHz. The timing constraints in the different parts of the polyphase filter are reported in Table I. As previously stated, to avoid long synthesis iterations, a DSE based on spread-sheets and pre-characterized functional blocks has been carried out. Several design implementation alternatives have been explored by selecting different values for the following parameters:
• the level of parallelism;
• the structure of the serial filter;
• the positioning of pipeline registers in the IDFT block;
• the choice of the moduli set for the QRNS implementation.
The spread-sheet-based DSE was accurate within 10% for timing and 5% for area (power dissipation was not evaluated with the spread-sheets because of its dependency on switching activity).
In parallel with the DSE, we developed tools that generate the RTL-level VHDL description of the units from configuration files containing the system specification, to ease debugging and to allow a fast re-design when modifications to the architecture were necessary.
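The clock-period reasoning at the start of this subsection amounts to a simple stage-delay budget, which is also the kind of estimate a spread-sheet row would hold. The sketch below assumes that the XOR delay is representative of every gate on the critical path and that the stage delay is the plain sum of gate delays and register overhead; the function name is hypothetical.

```python
def clock_budget(gate_delay_ps, logic_depth, register_overhead_ps):
    """Return (clock period in ns, clock frequency in MHz) for one stage.

    Assumption: stage delay = logic_depth * gate_delay + register overhead.
    """
    period_ps = gate_delay_ps * logic_depth + register_overhead_ps
    return period_ps / 1000.0, 1.0e6 / period_ps

# 15 gates of 300 ps plus 500 ps of register overhead: 5 ns, 200 MHz.
period_ns, f_clk_mhz = clock_budget(300, 15, 500)

# Sub-filter rate F_CLK / C for C = 128 channels.
subfilter_rate_mhz = f_clk_mhz / 128
```

With these inputs the budget gives exactly 5 ns and 200 MHz, and a sub-filter rate of 1.5625 MHz. Under the same assumptions, the opposite corner (20 gates of 350 ps) gives 7.5 ns, so the 5 ns figure corresponds to the shallow end of the quoted logic-depth range.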
B. CTCS polyphase filter
The architecture of the CTCS polyphase filter is illustrated in Figure 2 . The dashed vertical lines in the figure represent pipeline registers. The filter bank has been implemented by 128 serial sub-filters (one for each channel, one complex multiply-accumulate (MACC) unit per sub-filter).
The 23-bit fixed-point representation at the sub-filters output is truncated to 17 bits to reduce the IDFT complexity. The IDFT is implemented by a serial architecture that computes one channel per clock cycle (it is designed to meet the design constraint of T_CLK0 = 5.0 ns). The matrix-row by vector product is implemented by an array of 128 complex multipliers, and two adder trees are used to accumulate the result for the real and imaginary parts.
The implementation results for this unit are illustrated in Table II .
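The matrix-row by vector product performed by the serial IDFT in each clock cycle can be sketched as below. The 128 channels and the 10-bit twiddle factors come from the text; the rounding and scaling of the twiddle factors, the sign of the exponent, and the function names are assumptions made for illustration.

```python
import math

C = 128            # number of channels (from the text)
TWIDDLE_BITS = 10  # twiddle-factor wordlength (from the text)

def twiddle(k, r):
    """Quantised exp(+j*2*pi*k*r/C) as (re, im) integers (assumed rounding)."""
    scale = (1 << (TWIDDLE_BITS - 1)) - 1
    angle = 2.0 * math.pi * k * r / C
    return round(scale * math.cos(angle)), round(scale * math.sin(angle))

def adder_tree(values):
    """Pairwise reduction of a non-empty list, mirroring a balanced adder tree."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def idft_row(k, u):
    """One IDFT output: u is a list of C (re, im) integer pairs."""
    re_terms, im_terms = [], []
    for r, (a, b) in enumerate(u):
        c, d = twiddle(k, r)
        # A complex multiplier in CTCS needs four real products,
        # with cross terms between real and imaginary parts.
        re_terms.append(a * c - b * d)
        im_terms.append(a * d + b * c)
    # Two separate trees accumulate the real and the imaginary part.
    return adder_tree(re_terms), adder_tree(im_terms)
```

Calling `idft_row(k, u)` for k = 0, ..., 127 on successive cycles reproduces the one-channel-per-cycle schedule described above.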
C. QRNS polyphase filter
The architecture of the QRNS polyphase filter is illustrated in Figure 3 . It is composed of three main blocks:
• the QRNS sub-filters bank,
• a Conversion plus Truncation and Base Extension (CTBE) block,
• the QRNS IDFT,
plus the input and output conversions from CTCS to QRNS and vice-versa.
For the sub-filters bank, the following moduli are selected to cover the 23 bits of dynamic range: {13, 17, 29, 37, 41}.
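The suitability of this moduli set can be verified numerically. The check below relies on two standard facts: the dynamic range of a residue system is the product of its pairwise coprime moduli, and -1 is a quadratic residue modulo an odd prime p exactly when p leaves remainder 1 on division by 4, which is what a QRNS modulus requires. The helper names are hypothetical.

```python
import math

SUBFILTER_MODULI = [13, 17, 29, 37, 41]  # from the text
REQUIRED_BITS = 23                       # from the text

def is_prime(n):
    """Trial division, adequate for small moduli."""
    return n > 1 and all(n % d for d in range(2, math.isqrt(n) + 1))

def is_qrns_modulus(p):
    """True when p is a prime with p % 4 == 1, so that -1 has a square root mod p."""
    return is_prime(p) and p % 4 == 1

def dynamic_range_bits(moduli):
    """log2 of the product of the moduli."""
    return math.log2(math.prod(moduli))
```

For the set above the product is 9722453, which exceeds 2^23 = 8388608, so 23 bits are covered.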
The QRNS serial sub-filters are each implemented with a single QRNS MACC unit. The dynamic range at the output of the QRNS sub-filters (23 bits for both the real and the imaginary part) is truncated to 17 bits. To perform truncation, it is necessary to convert from QRNS to CTCS and then truncate the binary representation. The truncated values are then converted back to the QRNS representation, with base extension for 35 bits of dynamic range (moduli 53 and 61 are added). The structure of the IDFT is similar to that of the CTCS IDFT, except that the real and imaginary parts are completely separated (no mixed products) due to the QRNS representation.
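The absence of mixed products and the convert-truncate-extend step can be illustrated with the textbook QRNS mapping, in which a complex integer a + jb is represented modulo each prime p by the pair (a + j_p b, a - j_p b), where j_p squared is congruent to -1 modulo p. The sketch below is a behavioural model only, assuming this standard mapping and a plain Chinese Remainder Theorem reconstruction; it does not describe the converter architecture of [3], and all function names are hypothetical.

```python
import math

SUB = [13, 17, 29, 37, 41]   # sub-filter moduli (from the text)
EXT = SUB + [53, 61]         # extended set for the IDFT (from the text)

def sqrt_minus_one(p):
    """Smallest j with j*j = -1 (mod p); exists for primes p with p % 4 == 1."""
    for j in range(2, p):
        if (j * j + 1) % p == 0:
            return j
    raise ValueError("no square root of -1 modulo %d" % p)

def to_qrns(a, b, moduli):
    """Map a + jb to one pair (Z, Z*) per modulus."""
    out = []
    for p in moduli:
        j = sqrt_minus_one(p)
        out.append(((a + j * b) % p, (a - j * b) % p))
    return out

def qrns_mul(x, y, moduli):
    """Complex product: two independent modular products per modulus."""
    return [((x1 * y1) % p, (x2 * y2) % p)
            for (x1, x2), (y1, y2), p in zip(x, y, moduli)]

def from_qrns(z, moduli):
    """Back to signed (re, im) integers via the Chinese Remainder Theorem."""
    M = math.prod(moduli)
    re = im = 0
    for (z1, z2), p in zip(z, moduli):
        inv2 = pow(2, -1, p)
        a = ((z1 + z2) * inv2) % p
        b = ((z1 - z2) * inv2 * pow(sqrt_minus_one(p), -1, p)) % p
        Mp = M // p
        w = Mp * pow(Mp, -1, p)
        re += a * w
        im += b * w
    signed = lambda v: v - M if v > M // 2 else v
    return signed(re % M), signed(im % M)

def ctbe(z, moduli=SUB, extended=EXT, drop_bits=6):
    """Convert to binary, truncate 23 -> 17 bits, re-encode on the extended set."""
    re, im = from_qrns(z, moduli)
    return to_qrns(re >> drop_bits, im >> drop_bits, extended)
```

For example, multiplying the encodings of 3 + 4j and 5 - 2j with `qrns_mul` and decoding the result with `from_qrns` yields (23, 14). The product of the extended moduli is 31432690549, just under 2^35. The modular inverses use the three-argument `pow` with a negative exponent, which requires Python 3.8 or later.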
The implementation results in terms of area and power consumption are shown in Table II .
Details on the QRNS decomposition and on the input and output conversions can be found in [3] .
IV. USED TOOLS AND DESIGN FLOW
The tools used in the design are [...]
• Cell Internal Power: power dissipated in the cells' internal nodes and short-circuit power.
3) The node activity at RTL level is back-annotated on the synthesized netlist.
4) Starting from these activities, the tool computes by propagation the activities on the internal nodes of the synthesized circuit and generates the final power estimation.
The motivation for this approach, i.e. simulating the RTL code instead of the synthesized netlist, is to obtain a faster power estimation with reasonable accuracy. The main limitation of this approach is that the power consumption deriving from glitches is not computed; in any case, we expect minor power estimation errors, especially for the QRNS architecture (very short carry chains and local interconnects).
V. RESULTS
Table II shows the synthesis results for the overall CTCS and QRNS architectures. Area is reported as the number of equivalent NAND2 gates. The totals are obtained by adding, for area and power, the 128-channel filter bank data to the IDFT data (i.e. the totals do not take into account the ROM for the twiddle factors).
The two architectures show about the same area, but the power dissipated in the QRNS filter is about half that of the CTCS.
These results partially confirm, for large polyphase filters, the previous findings (in [3] the area and power reductions were higher). The smaller power consumption obtained by the QRNS is the result of a reduced switching activity, due to the QRNS representation and operators, and of a reduced switching capacitance (the product of activity and load at a node). That is, even if the overall switching activity is similar, in QRNS the activity is more evenly distributed over the nodes.
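The role of the switching capacitance can be made explicit with the usual first-order model of switching power in CMOS. The symbols below are not used in the paper and are introduced only for illustration: \alpha_i is the average number of 0-to-1 transitions of node i per clock cycle, C_i is the load capacitance of that node, V_{DD} is the supply voltage and f_{CLK} is the clock frequency.

```latex
P_{\mathrm{sw}} \;=\; V_{DD}^{2}\, f_{CLK} \sum_{i \,\in\, \mathrm{nodes}} \alpha_i \, C_i
```

Under this model, two circuits with a similar total activity \sum_i \alpha_i can still differ in power if one of them concentrates its activity on heavily loaded nodes, since each \alpha_i is weighted by C_i.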
VI. CONCLUSIONS
In this paper the implementation of a 128-channel polyphase filter bank is illustrated. The CTCS version is compared with the QRNS version. The architectures show about the same area, but the power dissipated in the QRNS filter is about half that of the CTCS. Future work to understand these results in depth will include: power consumption evaluation based on the synthesized netlist, including the delays, in order to understand the advantages of QRNS with respect to glitching power; and post-layout power consumption evaluation, in order to understand the impact of the interconnects in the two architectures. We expect that QRNS, due to its shorter interconnections, should guarantee less switching power on the interconnect.
