Abstract
Introduction
The least mean squares (LMS) adaptive filtering algorithm, used in a broad range of engineering applications, is not amenable to pipelining [1], a technique indispensable in the design of low-power or high-speed architectures [2]. The problem stems from the absence of a sufficient number of delays (registers) in the error feedback loop, which provides the filter's prediction error to all the taps in order to update the filter coefficients.
The use of delayed coefficient adaptation in the LMS algorithm [1] has been an important contribution towards developing pipelined LMS adaptive filter architectures. The basic idea is to update the filter coefficients using a delayed value of the error, which provides the required registers in the error feedback loop to pipeline the filter architecture. However, as documented in [1], the convergence speed of the delayed least mean squares (DLMS) adaptive filtering algorithm degrades progressively as the adaptation delay increases. This is of particular concern in a non-stationary environment, as it could lead to a loss of tracking capability. Long et al. [1] therefore recommend that "every effort should still be made to keep the delay as small as possible if it is not avoidable".
0-8186-7755496 $05.00 © 1996 IEEE
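The delayed update described above can be sketched in a few lines of Python. This is a minimal illustration, not the architecture of this paper: it assumes the commonly used DLMS recursion w(n+1) = w(n) + μ e(n-D) x(n-D), a noise-free system identification setup with white Gaussian input, and hypothetical names (`dlms_identify`, `unknown`, `delay`) that do not appear in the source.

```python
import random
from collections import deque


def dlms_identify(unknown, num_samples, mu, delay, seed=0):
    """Identify the FIR system `unknown` with a DLMS adaptive filter.

    delay = 0 reduces to the standard LMS algorithm. For delay = D the
    coefficient update at time n uses the error and input vector from
    time n - D (assumed recursion, see the lead-in above).
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    rng = random.Random(seed)
    n_taps = len(unknown)
    w = [0.0] * n_taps            # adaptive filter coefficients
    x_hist = [0.0] * n_taps       # input history, most recent sample first
    pending = deque()             # (error, input vector) pairs awaiting use
    squared_errors = []
    for _ in range(num_samples):
        # Shift a new white Gaussian sample into the input history.
        x_hist = [rng.gauss(0.0, 1.0)] + x_hist[:-1]
        d = sum(h * x for h, x in zip(unknown, x_hist))   # desired response
        y = sum(wi * x for wi, x in zip(w, x_hist))       # filter output
        e = d - y                                         # prediction error
        squared_errors.append(e * e)
        pending.append((e, x_hist))
        # Apply the update only once the error is `delay` samples old.
        if len(pending) > delay:
            e_old, x_old = pending.popleft()
            w = [wi + mu * e_old * xi for wi, xi in zip(w, x_old)]
    return w, squared_errors


if __name__ == "__main__":
    plant = [0.5, -0.3, 0.2, 0.1]  # hypothetical unknown system
    for D in (0, 12):
        w, errs = dlms_identify(plant, 4000, mu=0.01, delay=D)
        print(D, [round(c, 4) for c in w])
```

Calling the function with several values of `delay` and a larger step size `mu` is one way to observe how the delay narrows the range of step sizes for which the filter remains stable, which is the mechanism usually cited for the slower convergence of DLMS.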
Based on the DLMS algorithm, many researchers have proposed pipelined architectures [3-5]. The architecture reported in [5] uses significantly fewer adaptation delays than other existing architectures, thus resulting in superior convergence speed. In this paper, we use this architecture to derive a low-power CMOS configurable processor array (CPA) for DLMS adaptive filtering. Towards this end, we extend the synthesis procedure reported in [5] to synthesize pipelined architectures with minimal adaptation delay subject to a power constraint. In order to meet the given power constraint, the supply voltage is treated as a free variable. The resulting low-power architecture exploits the parallelism in the DLMS algorithm to meet the required computational throughput. An interesting and novel tradeoff between algorithmic performance and power dissipation is exhibited by this architecture. This tradeoff is illustrated through a system identification example.
Based on the low-power architecture, which is designed for a specific filter order (N), sample period (T_s) and power reduction factor (β), we derive a CPA configurable for N, T_s and β. Both the programmable clock frequency required for the configurability of the CPA and the variable power supply needed to meet the required power reduction are realized by a phase-locked-loop-based design [6]. The CPA has minimal hardware overhead for the configurability and hence will not dissipate significantly more power than the corresponding hard-wired array for a given N, T_s and β. The CPA has two phases of operation: the configure phase, wherein the processor array is programmed for a specific N, T_s and β, and the execution phase, wherein the processor array performs the desired DLMS adaptive filtering operation.
The organization of the paper is as follows. Section 2 reviews the pipelined DLMS architecture with minimal adaptation delay reported in [5]. In Section 3, we extend the synthesis methodology of [5] to incorporate a power constraint. We also discuss the novel tradeoff between algorithmic performance and power dissipation exhibited by the resulting low-power architecture, and substantiate this tradeoff through a system identification example. This low-power architecture is designed for a specific N, T_s and β. In Section 4, we extend this design to realize a processor array configurable for N, T_s and β. We summarize the work in Section 5.
Pipelined DLMS Architecture
The pipelined architecture shown in Fig. 1 is derived by applying a sequence of function-preserving transformations (as described in [5]) to the standard signal flow graph representation of the DLMS adaptive filtering algorithm. In the figure, μ represents the step-size, and x, y and d are, respectively, the input, the output and the desired response of an N-th order adaptive filter. Further, m and l are, respectively, the number of pipeline registers for a multiplier and for a multiplier followed by an adder tree of depth ⌈log_2 M⌉. Following the standard notation, the pipeline registers are shown at the output of the arithmetic units. M is chosen such that the broadcast delay of the M fan-out lines in the error-broadcast path is less than the critical path delay (T_crit) of the architecture. The precise expressions for Fig. 1 are available in [7].
The circular systolic array shown in Fig. 1 consists of a boundary processor module (BPM) and L (= N/(MP)) folded processor modules (FPMs), where P indicates the number of computations performed by an FPM in any given sample period T_s. For ease of exposition, we shall assume throughout this paper that N is a multiple of MP. The extension of the architecture to the more general case of arbitrary N is straightforward and is available in [7].
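The relation between the filter order and the array size can be checked with a small helper. This sketch assumes that the number of FPMs is L = N/(MP), which is what the divisibility assumption on N suggests; the function name `fpm_count` is hypothetical.

```python
def fpm_count(n_taps, fan_out, ops_per_fpm):
    """Number of folded processor modules, assuming L = N / (M * P).

    n_taps      -- filter order N
    fan_out     -- M, the fan-out of the error-broadcast path
    ops_per_fpm -- P, computations performed by one FPM per sample period
    """
    if min(n_taps, fan_out, ops_per_fpm) <= 0:
        raise ValueError("N, M and P must be positive")
    # The text assumes N is a multiple of M*P; other cases are left to [7].
    if n_taps % (fan_out * ops_per_fpm) != 0:
        raise ValueError("N must be a multiple of M*P in this sketch")
    return n_taps // (fan_out * ops_per_fpm)
```

For a fixed N and M, halving P doubles the number of FPMs, which is the folding versus parallelism choice that the later sections trade against the supply voltage.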
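The critical path of the architecture consists of an adder and a 2-to-1 multiplexer, as indicated in the figure. [Equations (1)-(4) are not recoverable from the source text; (3) gives the adaptation delay D_A and (4) the holdup delay D_H.]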
Since minimizing the adaptation delay (D_A) directly contributes to improved algorithmic performance (convergence speed) [1], it is beneficial to minimize D_A. However, we show in the following section that D_A is a monotonically decreasing function of the supply voltage V, so that lowering V to save power increases D_A. This brings about a novel tradeoff between algorithmic performance and power dissipation.
Algorithmic Performance vs. Power Dissipation
We first establish the relationship between the adaptation delay D_A and the supply voltage V. Towards this end, we require the following expression for the [...]
where C is the capacitive load of the critical path, ξ and V_T are device-level model parameters, and V is the supply voltage. As we know from [1], minimizing the adaptation delay directly contributes to improved algorithmic performance [...]
where C_M and C_A are the effective switched capacitances of a multiplier and an adder, respectively (for ease of exposition, we neglect the power dissipated by the control circuitry and the boundary processor module). In (7), for the given algorithm specifications and technology, the only free variable is the supply voltage V. [...] we need more FPMs than in the minimum D_A solution. Further, note that for any β < 1, the adaptation delay increases because P decreases. This brings about the interesting tradeoff between algorithmic performance and power dissipation. This tradeoff is illustrated through a system identification example described below.
[...] and the synthesis solution with a power constraint β = 0.25, for various metrics. Note that the synthesis solution with a power constraint uses extra hardware in order to achieve the power reduction. Further, its adaptation delay is larger than that of the minimum adaptation delay solution (the synthesis solution without a power constraint). Fig. 2 shows the simulation results obtained using [...] Further, note the shift in the convergence plot of [4] due to the large holdup delay, which is a function of the filter order.
[...] where α ranges between 1 and 2. For example, in a 0.6 μm CMOS technology [10] wherein α = 1.6, D_A and D_H for the low-power solution are 12 and 2, respectively. This would then imply that the degradation in the convergence speed for the low-power solution [...] (8) and (9) are simple high-level equations, the precise power reduction in an actual implementation may be somewhat different from β.
Nevertheless, it is significant that such a tradeoff can be made at the highest level of abstraction viz., the algorithmic level.
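The direction of the tradeoff can be illustrated numerically. The sketch below is not the paper's equations, which are incomplete in this text; it assumes the widely used alpha-power delay model, in which the critical path delay is proportional to V/(V - V_T)^α, and it assumes that at a fixed throughput the dynamic power scales as V^2, so that a power reduction factor β corresponds to a supply of V_ref·sqrt(β). The reference supply of 3.3 V and the threshold of 0.7 V are hypothetical values chosen only for illustration.

```python
import math


def relative_delay(v, v_t, alpha):
    """Critical path delay up to a constant factor (assumed alpha-power model)."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v / (v - v_t) ** alpha


def scaled_supply(v_ref, beta):
    """Supply voltage for power reduction factor beta, assuming power ~ V^2."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    return v_ref * math.sqrt(beta)


def slowdown(v_ref, v_t, alpha, beta):
    """Factor by which the critical path delay grows when power is cut to beta."""
    v = scaled_supply(v_ref, beta)
    return relative_delay(v, v_t, alpha) / relative_delay(v_ref, v_t, alpha)


if __name__ == "__main__":
    # Hypothetical technology numbers; alpha = 1.6 is the value quoted in the text.
    for beta in (1.0, 0.5, 0.25):
        print(beta, round(slowdown(3.3, 0.7, 1.6, beta), 3))
```

Under these assumptions a slower critical path means that fewer computations fit into one sample period, so P decreases and more FPMs are needed, which is consistent with the increase in adaptation delay described above.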
Configurable Processor Array
In this section, we describe a low-power configurable processor array (CPA) for DLMS adaptive filtering, configurable for filter order N, sample period T_s and power reduction factor β. We see from the previous section that the low-power architecture for a given N, T_s and β is precisely specified by the values of P, L and V. Hence, the low-power CPA should be made programmable for P, L and V. In other words, the number of FPMs, the delay lines, the control for the multiplexers, the clock frequency and the supply voltage should be made programmable.
We conceive of a CPA having L_max FPMs.
Since the number of FPMs used (L ≤ L_max) is programmable, the error input (refer Fig. 1) should be broadcast to each of the L_max FPMs separately. But this would distort the systolic nature of the architecture. This is overcome by introducing a register and a 2-to-1 multiplexer in each FPM (refer mux-V in Fig. 3), at a minimal cost of a few adaptation delays. This then results in a CPA which is a linear systolic array, as shown in Fig. 3. In order to incorporate the above change, mux-I in each FPM (refer Fig. 3) is now a 3-to-1 multiplexer. The precise multiplexer definitions are available in [7]. The adaptation delay required by this linear systolic array for a specified N, T_s and β is given by the expression DA = + [2m+y+L1.
Note that this value of D_A is marginally more than that given by (3). The (L_max - L) unused FPMs are powered down to minimize power dissipation [2].
The delay lines in the CPA which are a function of P are indicated by an arrow across them (refer Fig. 3). Since P is programmable, each of these delay lines has to be replaced by a programmable delay line. Further, the multiplexers in the CPA, except mux-V, are P-periodic. Hence, the control for these multiplexers should be programmable. Both can be achieved by straightforward design procedures, the details of which are available in [11].
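A behavioural model may help to picture these two programmable elements. The sketch below is only an illustration under stated assumptions: the class `ProgrammableDelayLine` and the function `periodic_select` are hypothetical names, the select pattern (one cycle high in every P) is invented for the example, and the actual multiplexer definitions are those of [7] and [11].

```python
from collections import deque


class ProgrammableDelayLine:
    """Delay line whose length can be set to any value from 1 to max_delay."""

    def __init__(self, max_delay):
        if max_delay < 1:
            raise ValueError("max_delay must be at least 1")
        self.max_delay = max_delay
        self.configure(max_delay)

    def configure(self, delay):
        """Program the number of registers and clear their contents."""
        if not 1 <= delay <= self.max_delay:
            raise ValueError("delay outside the programmable range")
        self.delay = delay
        self.regs = deque([0] * delay, maxlen=delay)

    def step(self, value):
        """Clock the line once: return the oldest value, shift in the new one."""
        out = self.regs[0]
        self.regs.append(value)   # maxlen discards the oldest entry
        return out


def periodic_select(cycle, period):
    """Illustrative P-periodic control: 1 on the first cycle of each period."""
    return 1 if cycle % period == 0 else 0
```

Reprogramming the delay with `configure` corresponds to the configure phase, while repeated calls to `step` model the execution phase.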
Exactly as in the case of the CPA for an FIR filter [11], we require a programmable clock (of period T_s/P) and a programmable power supply. This can be achieved by a phase-locked-loop (PLL) based design [6], as described in [11]. The operation of the CPA consists of two phases, viz., the configure-phase and the execution-phase. During the configure-phase of the processor array, the various configurable elements are programmed. During the execution-phase, the desired DLMS adaptive filtering operation is performed. Note that the hardware overhead for configurability is minimal. This leads us to conclude that, for a given N, T_s and β, the CPA will not dissipate significantly more power than the corresponding hard-wired array.
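The configure-phase can be pictured as validating and latching the three programmable quantities P, L and V. The following sketch is a hypothetical software model, not the hardware of this paper: the class name `CPAConfig`, its fields and the assumption that the clock period equals T_s/P (one FPM computation per clock cycle) are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CPAConfig:
    """Settings programmed into the array during the configure-phase."""
    ops_per_fpm: int        # P
    active_fpms: int        # L
    supply_voltage: float   # V
    sample_period: float    # T_s
    l_max: int              # number of FPMs physically present

    def __post_init__(self):
        if self.ops_per_fpm < 1:
            raise ValueError("P must be at least 1")
        if not 1 <= self.active_fpms <= self.l_max:
            raise ValueError("L must satisfy 1 <= L <= L_max")
        if self.supply_voltage <= 0 or self.sample_period <= 0:
            raise ValueError("V and T_s must be positive")

    @property
    def powered_down_fpms(self):
        """FPMs left unused and switched off: L_max - L."""
        return self.l_max - self.active_fpms

    @property
    def clock_period(self):
        """Assumed clock period: one FPM computation per cycle, T_s / P."""
        return self.sample_period / self.ops_per_fpm
```

A configuration object of this kind would be created once per choice of N, T_s and β and then left unchanged throughout the execution-phase.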
Conclusion
We have presented a low-power pipelined DLMS adaptive filter architecture which exhibits a novel tradeoff between algorithmic performance and power dissipation. This architecture exploits the parallelism in the DLMS algorithm to meet the required computational throughput. It was shown that this use of parallelism to reduce power results in an increase in the adaptation delay, thereby creating a tradeoff between algorithmic convergence speed and the power reduction factor. This tradeoff was illustrated with a system identification example.
A processor array configurable for filter order, sample period and power reduction factor was derived from the low-power architecture. Since the CPA has minimal control overhead, it is expected that it will not dissipate significantly more power than its hard-wired counterpart.
