INTRODUCTION
The preference for analog signal processing (ASP) over digital signal processing (DSP) stems from the former's cost effectiveness and lower power consumption, both of which are imperative for the personal communications systems (PCS) market. Analog filters, implemented in one of the conventional semiconductor technologies, are far less complex than their digital counterparts and occupy a much smaller area. Also, when operated from a 3.3-5V supply voltage, a digital filter consumes more energy than an analog one, implying shorter battery life.
The submicron features of modern CMOS processes provide high packing density, and their low threshold voltages allow the supply voltage to be reduced, resulting in dramatic reductions in energy consumption [1]. As their power and area become comparable to those of ASP circuits, digital circuits are emerging as a better choice for portable applications. Digital implementation of complex and programmable systems is much easier than analog implementation and is less sensitive to process variations.
Additional energy savings can be achieved by custom designing a DSP architecture for a particular target application [2]. This paper describes a high-order, fully programmable digital filter optimized for low energy per computation, while meeting throughput requirements for processing of audio signals. The filter was implemented in a low power, low voltage (LPLV) 0.5µm CMOS process.
FILTER DESIGN
The filter engine organization is shown in Figure 1. In every sampling period, the biquad arithmetic unit performs up to 16 cascaded 16-bit biquad computations. The filter order is externally programmable and can be dynamically adjusted to achieve energy savings. A multiplexer selects the data input for the biquad unit from the actual data input and the biquad unit output. In this way, a higher order function is implemented by cascading biquad sections [3], whose control data flow graph (CDFG) is shown in Figure 2. Programmable multiplication coefficients for each biquad section are stored in the coefficient RAM blocks.
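The cascading scheme can be illustrated with a short behavioral sketch. It is not the circuit described here: it assumes a direct form II biquad (the actual CDFG in Figure 2 may differ), uses floating-point values instead of the 16-bit datapath, and the names `Biquad`, `filter_sample` and the coefficient names `b0..a2` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Biquad:
    # Hypothetical coefficient names; one set per section, as if read
    # from the coefficient RAM.
    b0: float
    b1: float
    b2: float
    a1: float
    a2: float
    w1: float = 0.0  # internal state delayed by one sample
    w2: float = 0.0  # internal state delayed by two samples

    def step(self, x: float) -> float:
        # Assumed direct form II structure: one intermediate value w
        # feeds both the feedback and the feedforward part.
        w = x - self.a1 * self.w1 - self.a2 * self.w2
        y = self.b0 * w + self.b1 * self.w1 + self.b2 * self.w2
        self.w2, self.w1 = self.w1, w
        return y


def filter_sample(sections, x, order=None):
    """Pass one input sample through the first `order` sections in series."""
    active = sections if order is None else sections[:order]
    data = x  # first pass: the multiplexer selects the external data input
    for section in active:
        # later passes: the multiplexer selects the previous biquad output
        data = section.step(data)
    return data
```

Lowering `order` at run time skips the remaining sections entirely, which is the mechanism by which a programmable filter order can save energy.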
0-7803-4455-3/98/$10.00 © 1998 IEEE
[...] between parallel and serial multiplication schemes. It was estimated that, for the given technology, a parallel 16-by-16 bit multiplier would require an implementation area of approximately 6-7 mm² and have many long interconnects contributing significantly to the total energy consumption. The size of a parallel multiplier usually means that implementing it more than once is too expensive. A single-multiplier architecture requires a lumped memory organization, with an inherent overhead in memory access dissipation [2]. In order to provide locality of computation, the biquad arithmetic unit was partitioned into two multiplier-accumulator blocks, MAC1 and MAC2, each connected to its own coefficient memory block, as shown in Figure 3. The use of two multipliers implied their sequential implementation in order to meet the area/cost requirements of commercial portable systems.
Multiplication in the MAC blocks is based on the simple shift-add multiplication technique, using a carry-propagate adder (CPA) and an accumulator. The MAC blocks are custom-architecture multiplier-accumulators that employ a modification of shift-add multiplication so that each performs 2 multiplications and 2 additions. The MAC architecture is shown in Figure 4. In order to compute c1·X1 + c2·X2, the sum of coefficients c1 + c2 is pre-calculated by CPA1.
Data bits X1i and X2i (i=0..15) select the input to CPA2 from the 4 possible values: zero, c1, c2 or c1 + c2, to be added to the accumulated result, shifted right by one bit position. CPA2 is also used to add the external input.
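The dual shift-add scheme can be checked with a small bit-level model. This is a simplified sketch, not the MAC circuit itself: it assumes unsigned operands (two's complement handling is omitted), models the accumulator as an unbounded integer, and adds the external input once at the end, which is an assumption about ordering. The function name `dual_shift_add` and its parameters are hypothetical.

```python
def dual_shift_add(c1, c2, x1, x2, ext=0, nbits=16):
    """Return c1*x1 + c2*x2 + ext for non-negative c1, c2 and unsigned
    nbits-wide x1, x2, using one selection and one addition per data bit."""
    # Role of CPA1: pre-calculate c1 + c2 once, before the bit-serial loop.
    candidates = (0, c1, c2, c1 + c2)
    acc = 0
    for i in range(nbits):
        bit1 = (x1 >> i) & 1
        bit2 = (x2 >> i) & 1
        # The two data bits select one of: zero, c1, c2, c1 + c2.
        selected = candidates[bit1 + 2 * bit2]
        # Role of CPA2: add the selected value at the top of the
        # accumulator, then shift the accumulated result right by one bit.
        acc = (acc + (selected << nbits)) >> 1
    # Assumption: the external input is added after the last shift.
    return acc + ext
```

For example, `dual_shift_add(3, 7, 5, 9)` evaluates 3·5 + 7·9 in 16 add-and-shift steps instead of the 32 that two separate serial multiplications would need.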
Special attention was given to minimizing interconnect lengths in the MAC layout because of the large switching activity of its internal nodes and the known interconnect capacitance for the particular technology. As a rule of thumb, 10µm of interconnect wire introduces capacitance equal to the gate capacitance of a minimum size transistor. The described MAC architecture is convenient for layout implementation because its simultaneous use of coefficients c1 and c2 allows the arithmetic and coefficient memory circuits to be bit-sliced as a single slice with maximum packing density, as shown in Figure 5.
Circuit level simulations showed that a ripple carry adder (RCA) is fast enough for the specified number of computations and sampling rate of audio signals. Simulations also showed that RCA energy consumption is approximately equal to register energy consumption for the same datapath width. Therefore, for sequential multiplication, architectures using fewer registers are more energy efficient. For this reason, the described architecture, which uses only one register, was found more energy efficient than other architectures that typically require two registers of datapath size.
Use of bus connections between data memory blocks X1 and X2 was avoided by interleaving these blocks, as shown in Figure 6, and providing point-to-point connections between corresponding X1 and X2 cells.
The coefficient memory blocks were implemented as register files with dual-read, single-write architecture, allowing for external read/write and internal read access. Depending on the application, the coefficients can be programmed during filter configuration time, or updated dynamically.
At the circuit level, standard CMOS circuits were used to drive all long wires connecting major building blocks, as well as for the implementation of control logic. By contrast, different circuit styles were used inside the datapath blocks to achieve compact layout and minimize the clock wire length. Interconnect capacitance rather than circuit input capacitance was found to dominate the total clock dissipation for the particular datapath layout width. Consequently, the aspect ratio of datapath bit-slices was identified as an important part of the overall energy saving strategy. Narrower and longer bit slices are more energy efficient because they reduce the wire capacitance of clocks and other global signals crossing the datapath [5]. They also allow larger spacing between global wires, thus reducing parasitic capacitance.
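A rough estimate shows why slice width matters. The sketch below applies the paper's rule of thumb (10 µm of wire is equivalent to one minimum-size gate capacitance) to a global wire crossing the datapath, using the 30 µm bit-slice width reported for the adder cell. The slice count of 16 and the 60 µm comparison width are assumptions for illustration only, and `wire_gate_equivalents` is a hypothetical helper.

```python
def wire_gate_equivalents(slice_width_um, n_slices, um_per_gate=10.0):
    """Capacitance of a wire crossing all bit slices, expressed in units
    of the gate capacitance of a minimum-size transistor."""
    return slice_width_um * n_slices / um_per_gate


# 30 um slices (from the paper), 16 slices (assumed datapath width).
narrow = wire_gate_equivalents(30.0, 16)
# Hypothetical layout with slices twice as wide, for comparison only.
wide = wire_gate_equivalents(60.0, 16)
```

Under these assumptions a clock wire crossing the narrow datapath loads its driver like 48 minimum-size gates, and doubling the slice width doubles that load before any latch input capacitance is counted.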
The 18 transistor full-adder circuit [6] shown in Figure 7a was used for CPA implementation due to its minimum transistor count. Its layout was compacted close to the minimum defined by the diffusion design rules, as shown in Figure 7b, thus providing low device and interconnect capacitance. The adder cell length of 30µm defines the width of the MAC bit-slice. An optimized latch circuit derived from the LEAP multiplexer gate [7] is shown in Figure 8a. This latch, which has no PMOS pass transistors, has lower input capacitance and allows higher packing density. It was used for implementation of register file blocks, shift registers and some latches in the MAC datapath. The similar LATCH-MUX circuit, shown in Figure 8b, was implemented in the MAC to minimize its bit-slice size and reduce the switching activity of global control signals. It combines the functions of a 4:1 multiplexer and the latch following it. Its control inputs were obtained by gating the clock with the multiplexer control inputs.
Leakage power dissipation is the key implementation issue for a low voltage CMOS technology featuring only low threshold transistors. Since the majority of transistors in DSP chips are in static RAM blocks, most leakage current takes place in the RAM. In this design, 2/3 of all transistors are used in the coefficient and data memory. The total leakage current for coefficient memory is defined by the width of the leakage path of a memory cell. Based on the total leakage path in the implemented memory blocks and the worst case process specification, the total static dissipation caused by leakage current does not exceed 8% of the overall dissipation.
The sources of energy consumption, obtained by simulation and layout based calculations, are listed in Table 1. The total energy consumption is 337pJ per biquad section. The energy of the adders and registers dominates the total dissipation (58%), and the interconnects are responsible for an additional 25%.
Figure 8. Optimized latch circuits: a) latch, b) LATCH-MUX
EXPERIMENTAL RESULTS
The filter was implemented using a 1V CMOS process with 0.32V threshold P and N MOSFETs. A micrograph of the filter implementation is shown in Figure 9. Its core area is 1.1 x 1.4 mm². It operates from a supply voltage of 1V at a clock frequency of up to 20MHz and consumes an average energy of 330pJ per biquad section, in close agreement with simulation. The characteristics of the filter are listed in Table 2.
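The per-section figures translate into energy per sample and average power as sketched below. The 337 pJ and 330 pJ values and the limit of 16 sections come from the paper; the 48 kHz sampling rate in the usage note is an assumed example, and the estimate ignores any energy spent outside the biquad computations. The function names are hypothetical.

```python
E_BIQUAD_SIM_PJ = 337.0   # simulated energy per biquad section (paper)
E_BIQUAD_MEAS_PJ = 330.0  # measured energy per biquad section (paper)
MAX_SECTIONS = 16         # maximum cascaded sections per sample (paper)


def energy_per_sample_nj(e_biquad_pj, sections=MAX_SECTIONS):
    """Energy for one output sample, in nanojoules."""
    return e_biquad_pj * sections / 1000.0


def average_power_uw(e_biquad_pj, sample_rate_hz, sections=MAX_SECTIONS):
    """Average power in microwatts at a given sampling rate."""
    # pJ per sample times samples per second gives 1e-12 W; scale to uW.
    return e_biquad_pj * sections * sample_rate_hz * 1e-6


def relative_difference(simulated, measured):
    """Fractional gap between simulation and measurement."""
    return (simulated - measured) / simulated
```

With all 16 sections active, the measured figure corresponds to about 5.28 nJ per sample, or roughly 253 µW at an assumed 48 kHz sampling rate. The measured and simulated energies differ by about 2%.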
CONCLUSION
An energy efficient, low voltage implementation of a complex DSP function has been described in this paper. The experimental results show that a single low threshold CMOS technology is a viable choice for the implementation. An architectural design strategy that simultaneously minimizes the energy of datapath switching, interconnect capacitance and static leakage was demonstrated.
ACKNOWLEDGEMENT

transistors 28000
