Abstract
Systolic architectures for Wave Digital Filters are investigated for low-power applications.
Introduction
Wave Digital Filters, especially the Lattice forms which are a Parallel Combination of Allpass Subfilters (PCAS), are becoming established for various signal processing applications [1, 2].
Because of their low coefficient sensitivity and regularity, the filters are ideal for VLSI implementation. In earlier work, the 2-port adaptor was employed to realise the 2nd-order section, the filter's fundamental building block. Recent work, however, suggests the superiority of a 3-port implementation, especially in terms of speed [3]. With an increasing demand for portability, low-power implementation of the filters is now relevant. This letter explores the effect of different levels of pipelining on area, speed and power.
Pipelined architectures
The 3-port adaptor implements the 3 functions:
( )
If ( ) ( ) is to say to process that number of independent data streams per clock cycle. In the implementation of a filter of sufficient order this implies a reduction in the total system area by the same factor.
Results
The above architectures, and also the starting, non-pipelined case which we call B1, are implemented with Cadence Design Framework II using 1 µm ES2 standard cell CMOS technology.
The layouts are automatically generated, and Verilog simulations with a large number of randomly generated test vectors are performed, with extracted capacitances included. The minimum clock period is found by direct search, and the average power is determined by post-processing a node activity dump and incorporating the average cell power quoted in the data library.
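The direct search for the minimum clock period can be pictured as stepping the period down from a known-good value until the simulation first fails. The sketch below is only an illustration under stated assumptions: the function name `min_clock_period_ps`, the integer picosecond grid and the `passes` callback (standing in for a full simulation run against the test vectors) are all hypothetical and do not come from the original tool flow.

```python
from typing import Callable


def min_clock_period_ps(
    passes: Callable[[int], bool],  # hypothetical: True if simulation is correct at this period
    start_ps: int,                  # a period known to be long enough
    step_ps: int,                   # search resolution
) -> int:
    """Return the shortest passing period found by stepping down from start_ps."""
    if step_ps <= 0:
        raise ValueError("step_ps must be positive")
    if not passes(start_ps):
        raise ValueError("start_ps must be a passing clock period")
    period = start_ps
    # Keep shortening the period while the next candidate still passes.
    while period - step_ps > 0 and passes(period - step_ps):
        period -= step_ps
    return period


if __name__ == "__main__":
    # Stand-in for the simulator: assume the circuit works at 7300 ps or slower.
    print(min_clock_period_ps(lambda p: p >= 7300, start_ps=20000, step_ps=100))
```

Integer picoseconds are used here so that repeated subtraction does not accumulate floating-point error; the result is only as fine as the chosen step.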
The results are shown in Table 1. The minimum clock period decreases with the number of bit slices in the critical path of the architecture, but not strictly in proportion, owing to the differing sum and carry delays of the standard cell used and to the latch delays and setup times. The simulated power multiplied by the clock period is the energy per filter sample, because the number of clock cycles taken to complete one filter sample (column 2) is exactly compensated for by the multiplexing capability discussed above. The area is the bounding box of the synthesized layout. The effective area is the area divided by the multiplex factor, and it is used in calculating the power-area-delay product shown in the last column of the table.
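The derived quantities described above can be written out as a short calculation. This is a minimal sketch, not the authors' post-processing script: the class name `ArchResult`, the field names and the example numbers are hypothetical, and taking the clock period as the delay term in the power-area-delay product is an assumption, since the text does not define that term explicitly.

```python
from dataclasses import dataclass


@dataclass
class ArchResult:
    """Hypothetical record for one row of the results table."""
    name: str
    power_mw: float          # simulated average power
    clock_period_ns: float   # minimum clock period
    area_mm2: float          # bounding box of the layout
    multiplex_factor: int    # independent data streams handled

    def energy_per_sample_pj(self) -> float:
        # Power times clock period; mW * ns gives pJ.
        return self.power_mw * self.clock_period_ns

    def effective_area_mm2(self) -> float:
        # Area shared between the multiplexed data streams.
        return self.area_mm2 / self.multiplex_factor

    def power_area_delay(self) -> float:
        # Assumption: the delay term is the minimum clock period.
        return self.power_mw * self.effective_area_mm2() * self.clock_period_ns


if __name__ == "__main__":
    # Illustrative numbers only, not values from the paper.
    row = ArchResult("example", power_mw=10.0, clock_period_ns=20.0,
                     area_mm2=4.0, multiplex_factor=2)
    print(row.energy_per_sample_pj(), row.effective_area_mm2(), row.power_area_delay())
```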
We have isolated the contributions to the power dissipation from different parts of the circuits. These are shown in Figure 3a. Pcell decreases with the number of bit slices in the register transfer paths.
Because in each case the number of bit-level operations required to complete the filter cycle is the same, this reduction is entirely attributable to a reduction in glitch activity [4]. This effect competes with the increasing power used in the pipeline registers, Pclk, leading, at least in principle, to a minimum in the total power. This minimum occurs here and is seen for the architecture B15. Comparing the optimum architecture B15 with the starting architecture B1, we see that the power is approximately 50% of that of B1 and the power-area-delay product is increased by a factor of approximately 5. These comparisons are all for a single fixed supply voltage. The effect of varying Vdd is shown in Figure 3b. The minimum is less pronounced at lower supply voltages. The comparisons with supply voltage scaling [5] to normalised maximum speed are to be considered elsewhere. However, we feel that the system design context of VLSI filters is one in which a single or at best a limited number of supply voltages would be available.
Conclusions
This letter presents a low-power, high-speed pipelining solution for implementing a 2nd-order allpass section using a 3-port adaptor. In this work, we have shown that the lowest-power solution is achieved by selecting 2-bit-level pipelining, which is also shown to give optimal power-area-delay performance.
