Abstract
Introduction
The use of delayed coefficient adaptation in the LMS algorithm has enabled the design of modular systolic architectures for real-time transversal adaptive filtering [1]-[3]. However, the convergence behaviour of this delayed least mean squares (DLMS) algorithm is degraded compared with that of the standard LMS algorithm, and it worsens as the adaptation delay increases [4]. Previous systolic architectures have used a large number of adaptation delays, since these were necessary for systolizing the system to support high sampling rates in a real-time environment. Hence, the design of a modular systolic architecture for transversal adaptive filtering that maintains the convergence behaviour of the LMS algorithm by minimizing the adaptation delay, and that also supports high input sampling rates with minimal input/output latency, is an important problem [5].
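The role of the adaptation delay can be illustrated with a short behavioural sketch. The code below is not taken from the architectures of [1]-[3]; it is a minimal pure-Python model that assumes the usual DLMS recursion w(n+1) = w(n) + mu e(n-D) x(n-D), with D the adaptation delay. The function name `dlms` and its parameters are hypothetical and chosen only for illustration. Setting `delay` to 0 recovers the standard LMS algorithm.

```python
from collections import deque


def dlms(x, d, order, mu, delay):
    """Behavioural model of a transversal DLMS filter (illustrative assumption).

    x     : input samples
    d     : desired-response samples
    order : number of filter taps
    mu    : step-size
    delay : adaptation delay D (0 gives standard LMS)
    Returns the list of error samples e(n) = d(n) - y(n).
    """
    w = [0.0] * order              # coefficients, initialised to zero
    hist = [0.0] * order           # tap-delay line, most recent sample first
    pending = deque()              # (error, input vector) pairs awaiting use
    errors = []
    for xn, dn in zip(x, d):
        hist = [xn] + hist[:-1]    # shift the new sample into the delay line
        y = sum(wi * hi for wi, hi in zip(w, hist))
        e = dn - y
        errors.append(e)
        pending.append((e, hist))
        # The update at time n uses the error and data from time n - delay.
        if len(pending) > delay:
            e_old, h_old = pending.popleft()
            w = [wi + mu * e_old * hi for wi, hi in zip(w, h_old)]
    return errors
```

Running this model on the same input with several values of `delay` and comparing how quickly the squared error decays is one simple way to observe the dependence on the adaptation delay described in [4].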
The area-efficient modular systolic architecture derived in this paper uses a number of function-preserving transformations to modify the standard signal flow graph (SFG) representation of the DLMS algorithm into a systolic architecture with minimal adaptation delay (DA). The key transformations used are slowdown [...] Finally, as in [3], the use of associativity of addition reduces the input/output latency to a small constant, which is independent of the filter order. With the use of carry-save arithmetic, the systolic folded pipelined architecture can support very high sampling rates, limited only by the delay of a full adder.
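The claim that carry-save arithmetic limits the critical path to a full-adder delay can be made concrete with a generic sketch. The code below is an illustration of the carry-save principle only, not the arithmetic circuit of this paper; the function names are hypothetical. Each carry-save step reduces three operands to a sum word and a carry word using independent bitwise full adders, so no carry ripples across bit positions until a single final addition.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to (sum_word, carry_word).

    Each bit position acts as an independent full adder, so
    a + b + c == sum_word + carry_word.
    """
    sum_word = a ^ b ^ c                              # full-adder sum bits
    carry_word = ((a & b) | (b & c) | (a & c)) << 1   # full-adder carry bits
    return sum_word, carry_word


def accumulate_carry_save(values):
    """Accumulate values in carry-save form; resolve carries once at the end."""
    sum_word, carry_word = 0, 0
    for v in values:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, v)
    return sum_word + carry_word   # the only carry-propagate addition
```

In hardware terms, every iteration of the loop corresponds to one level of full adders, which is why the per-step delay does not grow with the word length.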
The organization of this paper is as follows. In Section 2 we derive the systolic architecture. In Section 3 we analyse and evaluate the architecture and present simulation results for an adaptive line enhancer that show the improved convergence behaviour due to reduced DA. We summarize the work in Section 4.
Deriving the Systolic Architecture
The architecture shown in Figure 1 is derived from the standard SFG representation of the DLMS algorithm [1] by applying the holdup, associativity, retiming, and slowdown transformations [6], in that order.
In Figure 1, DH is the holdup delay, P is the slowdown factor, N is the filter order, μ is the step-size, and X, Y and R are, respectively, the input, the output and the desired response. There are MP fan-out lines from a register in the error broadcast path. For ease of exposition, we assume N to be a multiple of [...] have been moved to break the error broadcast path and the output accumulation path at regular intervals. This retiming is possible due to the use of the adaptation delays. The system inputs are assumed to be P-slow [6], and hence the hardware utilization efficiency (HUE) of this architecture is 100/P %.
Note that, in Figure 1, [...] sets of arithmetic units have been identified. In each set, the P arithmetic units in the filtering and weight-update portions are denoted as filtering arithmetic units (FAUs) and weight-update arithmetic units (WAUs), respectively.
In order to regain 100% HUE, we use the folding transformation [6]. Using locally-sequential-globally-parallel mapping, the P FAUs and WAUs of a set are mapped onto one physical FAU and WAU, respectively. Further, the FAUs and the WAUs are scheduled in the order of their index j, j = 0, ..., (P-1). The control circuitry of the resulting folded architecture is derived using the approach reported in [6]. Figure 2 shows the systolic array, while the details of the boundary processor module (BPM) and a folded processor module (FPM) are shown in Figure 3.
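The effect of folding on hardware utilization can be sketched behaviourally. The model below is an assumption made for illustration and does not reproduce the control circuitry of the paper: it treats one set of P logical filtering units as multiply-accumulate operations, and the names `unfolded_utilization` and `run_folded` are hypothetical. In the unfolded P-slow system each of the P units is busy in one clock cycle out of every P, whereas after folding a single physical unit serves logical unit j in cycle j of every group of P cycles.

```python
def unfolded_utilization(P, frames):
    """Utilization of P units that are each busy one cycle in every P."""
    clock_cycles = P * frames              # P clock cycles per input sample
    unit_cycles = P * clock_cycles         # P physical units are present
    busy_cycles = P * frames               # each unit works once per sample
    return busy_cycles / unit_cycles


def run_folded(weights, input_frames):
    """One physical unit time-shared by the P logical units of a set.

    weights      : the P coefficients handled by the set
    input_frames : list of length-P sample lists, one per group of P cycles
    Returns (outputs, utilization of the single physical unit).
    """
    P = len(weights)
    outputs, busy_cycles, clock = [], 0, 0
    for frame in input_frames:
        acc = 0
        for j in range(P):                 # logical unit j runs in cycle j
            acc += weights[j] * frame[j]
            busy_cycles += 1
            clock += 1
        outputs.append(acc)
    utilization = busy_cycles / clock if clock else 0.0
    return outputs, utilization
```

Under these assumptions the unfolded model yields a utilization of 1/P, and the folded unit is busy in every clock cycle while producing the same partial sums.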
The complex control circuitry present at the input to the FAU and WAU of an FPM (refer to Figure 3(b)), consisting of a delay-line with (MP-1)P registers and P-to-1 multiplexers Mux i, i = 1, ..., M, is replaced by a simple structure (refer to Figure 4 [...] registers. The correctness of this structural optimization is established in [7]. Further, by moving an appropriate number of registers from the BPM into the FPMs, the processors are pipelined as shown in Figure 4. Using the standard notation [6], the pipeline registers are shown at the output of the arithmetic units. The numbers of pipeline registers p1 and p2 are given by:
[...] and p2 = [...], where T_m and T_a denote the delay of a multiplier and an adder, respectively. For ease of exposition, we have neglected the delay associated with the pipeline registers. Since pipelining changes the multiplexer definitions, an extended retiming theorem presented in [7] is used to redefine them. The precise multiplexer definitions are available in [7]. Note that, in an FPM, the error signal is broadcast from a synchronization register to M [...] The performance of the architecture derived above is analyzed and evaluated in the following section.
Analysis and Evaluation
In Table 1, we compare the new architecture with the systolic architecture reported in [2]. It should be noted that the performance metrics of [1] are essentially the same as those of [2]. From the table, it is clear that the new architecture gains over [2] by a factor of MP in the adaptation delay and by a factor of P in hardware requirements, with minimal control overheads. Also, the input/output latency is independent of N. Further, the fastest sampling rate that the new architecture can support is [...].
Summary
We have presented an area-efficient modular systolic architecture for DLMS adaptive filtering with minimal adaptation delay and input/output latency. Due to the significant reduction in the adaptation delay, the convergence behaviour of the proposed architecture is considerably closer to that of the LMS algorithm than the architectures of [1]-[3]. The architecture was synthesized by the use of a proper sequence of function-preserving transformations, namely associativity, retiming, slowdown, and folding. With the use of carry-save arithmetic, the systolic folded pipelined architecture can support very high sampling rates, limited only by the delay of a full adder.
