A block exact formulation [1] of the LMS algorithm is well suited for efficient implementation on multiple-ALU architectures. The method can be applied to any number of
INTRODUCTION
It can be very difficult to implement the least mean square (LMS) adaptive filter algorithm efficiently on a digital signal processor (DSP) with multiple arithmetic logic units (ALUs). Parallelism is hard to achieve when the filter coefficients change for each sample.
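The per-sample dependency can be seen in a minimal sketch of the conventional LMS loop. The function name `lms`, its arguments, and the zero initial coefficients are assumptions chosen for illustration; Python is used here only for readability.

```python
def lms(x, d, num_taps, mu):
    """Sample-by-sample LMS sketch; returns (errors, final coefficients)."""
    h = [0.0] * num_taps        # filter coefficients (assumed to start at zero)
    delay = [0.0] * num_taps    # delay[i] holds x[n - i]
    errors = []
    for xn, dn in zip(x, d):
        delay = [xn] + delay[:-1]                        # shift in one new sample
        y = sum(hi * xi for hi, xi in zip(h, delay))     # FIR output with current taps
        e = dn - y                                       # error against desired output
        # Every coefficient changes here, so the next output cannot be
        # started until this update has finished.
        h = [hi + mu * e * xi for hi, xi in zip(h, delay)]
        errors.append(e)
    return errors, h
```

Because `h` is rewritten inside the loop, each output depends on the error of the previous one, which is the obstacle to computing several outputs in parallel.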
Benesty and Duhamel [1] have shown how to formulate the exact LMS algorithm in a block form and rearrange the computations to reduce the total number of multiplies and adds. The block exact form holds the filter constant during a block and then corrects for updates within the block. This form allows the LMS algorithm to be tailored to the resources of a multi-ALU processor without the data rearrangement. Optimizing the utilization of processor resources can lead to large performance increases even without reducing the number of multiplies and adds.
Because the block form can be derived for any block size, this technique is applicable to any number of ALUs in a custom system on a chip (SOC), or to various generic commercial multi-ALU DSPs. An example analysis with source code is presented for the StarCore SC140 DSP core.
For large numbers of filter taps the resulting method translates to a 33% increase in performance over a traditional technique. Matlab and assembly code for algorithm implementations are available on the web [2].
IMPLEMENTATION CONSTRAINTS
A multi-ALU DSP is designed to compute several multiply or multiply-accumulate (MAC) operations in parallel. This requires the ability to move data and instructions in parallel to keep up with the ALUs. Wide data buses for moving operands in parallel bring data alignment problems with them, which may reduce algorithm performance.
Wayne T. Padgett is on sabbatical leave from Rose-Hulman Inst. of Tech. Email: Wayne.Padgett@Rose-Hulman.edu. He would like to acknowledge the insightful comments and suggestions of the StarCore Applications Team, especially Mso Zeng, Joe Monaco, Kevin Shay, and Stephen Dew.
Most DSP architectures are capable of circular addressing to avoid data movement in delay buffers, but data alignment can cause problems.
In an FIR filter, several taps can be computed at a time, but the delay buffer needs to shift one sample, not several. Since most DSPs are highly optimized for FIR filter computation, a variety of techniques have been developed to deal with the problem of needing to shift a circular buffer by a single sample when the data has a bus width of several words.
Standard Techniques
Multiplexing the incoming and outgoing data to allow "mis-aligned" loads can solve the problem, but this requires extra hardware, and will incur a speed or silicon area penalty.
If the goal is to maximize performance, mis-aligned loads should be avoided.
Multi-sample programming [3] involves simply computing several outputs at a time so that the delay buffer also needs to shift several outputs at a time. A side benefit of this technique is that many data values can be reused, reducing the required data bus bandwidth and power requirements. Unfortunately, the LMS algorithm changes the filter for each output sample so that multiple outputs can't be computed at the same time.
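The multi-sample idea can be sketched for a fixed FIR filter as follows. This is an illustrative sketch only: the function names, the block size argument, and the zero padding before the first sample are assumptions, and the inner loop over `k` stands in for the four ALUs working in parallel.

```python
def fir_reference(h, x):
    """Direct-form FIR, one output per pass; x[k] is taken as 0 for k < 0."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]

def fir_multisample(h, x, block=4):
    """Compute `block` outputs per pass over the taps (coefficients fixed)."""
    assert len(x) % block == 0
    y = []
    for n0 in range(0, len(x), block):
        acc = [0.0] * block              # one accumulator per output sample
        for i, hi in enumerate(h):       # each coefficient is fetched once...
            for k in range(block):       # ...and reused for `block` outputs
                idx = n0 + k - i
                if idx >= 0:
                    acc[k] += hi * x[idx]
        y.extend(acc)
    return y
```

Each coefficient fetch serves four outputs, which is the data reuse referred to above; the delay line advances by a whole block, so the single-sample shift never arises.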
Multiple-loop programming involves writing several copies of the filter code, one for each possible data alignment, and using the correct one for each filter output. This method has two disadvantages: it increases code size by roughly the same multiple as the number of ALUs, and it tends to require more registers because it needs to realign the data in the processor.
Multiple-buffer programming is a variant of multiple-loop programming which stores the four possible alignments of the delay buffer in memory. This avoids the need for extra registers, but it shares the code size penalty and also multiplies the data buffer storage memory requirement by the number of ALUs.
The StarCore SC140 has four ALUs and can compute four MACs in one cycle. To support the ALUs, it has two data buses, each of which can move four 16-bit operands to or from memory in a cycle. Peak performance is most important in the inner loops of an algorithm. Like many DSPs, the SC140 has a zero-overhead loop to maximize efficiency in the inner loops. To optimize performance, four MACs must be executed every cycle if the process is arithmetic limited, or two four-operand moves must be executed every cycle if the process is data movement limited. The processor uses a VLIW architecture with a variable length execution set (VLES) capable of six instructions each cycle: four ALU instructions and two address-unit (move) instructions. Each VLES may contain one to six instructions.
Move operations can be implemented for a single operand, pairs of operands, or groups of four operands. The addressing modes available are flexible and include direct, indirect, indirect plus index, modulo, and reverse-carry modes. Mis-aligned loads are not available on the SC140 to avoid the performance penalties mentioned above.
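The two resource limits described above give a simple lower bound on the cycles needed by an inner loop. The sketch below is a hedged illustration: the function name `min_cycles` and the idea of counting work in MACs and four-operand moves are assumptions, while the limits of four MACs and two four-operand moves per cycle come from the text.

```python
import math

ALUS = 4              # MACs issued per cycle
MOVES_PER_CYCLE = 2   # four-operand moves per cycle (one per data bus)

def min_cycles(macs, quad_moves):
    """Lower bound on loop cycles and the resource that sets it."""
    arithmetic = math.ceil(macs / ALUS)
    movement = math.ceil(quad_moves / MOVES_PER_CYCLE)
    limit = "arithmetic" if arithmetic >= movement else "data movement"
    return max(arithmetic, movement), limit
```

For example, a loop body with 8 MACs and 6 four-operand moves is bounded by data movement at 3 cycles, whereas 16 MACs with 4 four-operand moves is bounded by arithmetic at 4 cycles.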
BLOCK EXACT LMS FORMULATION
The goal of the block formulation of the LMS is to convert the computation from one that involves a new filter at each sample time to one that uses a fixed filter for an appropriate number of samples (four in the case of the SC140), followed by an update correction process. This has the advantage of making the LMS into an efficiently implementable fixed FIR filter, plus a small correction.
where d[n] is the desired filter output. Note that the standard FIR filter computation y[n]
When four delayed versions of these equations are written, for e[n], e[n-1], e[n-2], and e[n-3], we obtain three more equations of the form
This procedure produces the desired fixed filter formulation, but the outputs can still only be computed one at a time, since each still depends on previous error values. This problem is resolved by inverting a matrix to remove the e[n] values from the right hand side. First, (4) is rewritten in matrix form using the substitution
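As a hedged sketch of the steps just described, assume the standard LMS recursion with an input vector $\mathbf{x}[n]$ holding the $T$ most recent samples, a coefficient vector $\mathbf{h}[n]$, a step size $\mu$, and the shorthand $s_i[n] = \mu\,\mathbf{x}^T[n]\,\mathbf{x}[n-i]$. All of this notation is an assumption made for illustration and may differ from the numbered equations of the paper.

```latex
% Assumed notation: x[n] = (x[n], ..., x[n-T+1])^T, h[n] = coefficients after sample n,
% s_i[n] = mu * x^T[n] x[n-i].
\begin{align*}
e[n] &= d[n] - \mathbf{x}^T[n]\,\mathbf{h}[n-1], \qquad
\mathbf{h}[n] = \mathbf{h}[n-1] + \mu\, e[n]\,\mathbf{x}[n] \\[4pt]
% Four consecutive errors expressed with the filter held at h[n-4]:
e[n-3] &= d[n-3] - \mathbf{x}^T[n-3]\,\mathbf{h}[n-4] \\
e[n-2] &= d[n-2] - \mathbf{x}^T[n-2]\,\mathbf{h}[n-4] - s_1[n-2]\,e[n-3] \\
e[n-1] &= d[n-1] - \mathbf{x}^T[n-1]\,\mathbf{h}[n-4] - s_1[n-1]\,e[n-2] - s_2[n-1]\,e[n-3] \\
e[n]   &= d[n]   - \mathbf{x}^T[n]\,\mathbf{h}[n-4]   - s_1[n]\,e[n-1] - s_2[n]\,e[n-2] - s_3[n]\,e[n-3]
\end{align*}
% Matrix form: a unit lower-triangular system in the four errors.
\[
\begin{pmatrix}
1 & 0 & 0 & 0 \\
s_1[n-2] & 1 & 0 & 0 \\
s_2[n-1] & s_1[n-1] & 1 & 0 \\
s_3[n] & s_2[n] & s_1[n] & 1
\end{pmatrix}
\begin{pmatrix} e[n-3] \\ e[n-2] \\ e[n-1] \\ e[n] \end{pmatrix}
=
\begin{pmatrix}
d[n-3] - \mathbf{x}^T[n-3]\,\mathbf{h}[n-4] \\
d[n-2] - \mathbf{x}^T[n-2]\,\mathbf{h}[n-4] \\
d[n-1] - \mathbf{x}^T[n-1]\,\mathbf{h}[n-4] \\
d[n] - \mathbf{x}^T[n]\,\mathbf{h}[n-4]
\end{pmatrix}
\]
% Block coefficient update once the four errors are known:
\[
\mathbf{h}[n] = \mathbf{h}[n-4] + \mu \sum_{k=0}^{3} e[n-k]\,\mathbf{x}[n-k]
\]
```

Under these assumptions the right-hand side involves only the fixed filter $\mathbf{h}[n-4]$, and the system matrix is unit lower triangular, so it is always invertible.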
Fortunately, (5) can be rewritten more compactly as

$$
\begin{pmatrix}
1 & 0 & 0 & 0 \\
-s_1[n-2] & 1 & 0 & 0 \\
s_1[n-1]\,s_1[n-2] - s_2[n-1] & -s_1[n-1] & 1 & 0 \\
-s_3[n] + s_1[n-2]\,s_2[n] + s_1[n]\,s_2[n-1] - s_1[n-2]\,s_1[n-1]\,s_1[n] & -s_2[n] + s_1[n]\,s_1[n-1] & -s_1[n] & 1
\end{pmatrix}
\qquad (9)
$$

the StarCore Filter Library [2], although it has been optimized slightly. The LMS-4L is the LMS implemented in four loops, one for each data alignment. The BELMS algorithm is the LMS rearranged as described above. Note that the BELMS is not equivalent to the well-known block LMS: the block LMS only updates once per block, while the BELMS computes an entire block, then corrects the error outputs as if the filter had been updated at each sample time.
Fig. 1 shows the number of cycles required by each algorithm for 16, 32, 64, 128, 256, and 512 taps. This plot was generated using a simulator with a realistic memory model, so memory conflicts can occur and cause some accesses to take more than one cycle. Therefore, the theoretical numbers do not match the simulation results perfectly. Table 1 gives a detailed comparison of various algorithm features.
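The claim that the BELMS reproduces the sample-by-sample LMS errors can be checked numerically. The following is a minimal sketch, not the SC140 implementation: the names `lms`, `belms`, `window`, and `dot` are hypothetical, the correction step uses forward substitution in place of an explicit matrix inverse, and none of the multiply-saving rearrangements of [1] are included.

```python
import random

def window(x, n, taps):
    """Return [x[n], x[n-1], ..., x[n-taps+1]], with zeros before the start."""
    return [x[n - i] if n - i >= 0 else 0.0 for i in range(taps)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lms(x, d, taps, mu):
    """Reference: coefficients updated after every sample."""
    h, errs = [0.0] * taps, []
    for n in range(len(x)):
        xv = window(x, n, taps)
        e = d[n] - dot(xv, h)
        h = [hi + mu * e * xi for hi, xi in zip(h, xv)]
        errs.append(e)
    return errs, h

def belms(x, d, taps, mu, block=4):
    """Block exact sketch: fixed-filter pass, error correction, block update."""
    assert len(x) % block == 0
    h, errs = [0.0] * taps, []
    for n0 in range(0, len(x), block):
        xs = [window(x, n0 + k, taps) for k in range(block)]
        # 1) Errors computed with the filter held fixed over the block.
        e = [d[n0 + k] - dot(xs[k], h) for k in range(block)]
        # 2) Correct each error for the updates that the sample-by-sample
        #    LMS would have applied earlier in the block.
        for k in range(block):
            for j in range(k):
                e[k] -= mu * dot(xs[k], xs[j]) * e[j]
        # 3) Apply all coefficient updates for the block.
        for k in range(block):
            h = [hi + mu * e[k] * xi for hi, xi in zip(h, xs[k])]
        errs.extend(e)
    return errs, h

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(64)]
d = [random.uniform(-1, 1) for _ in range(64)]
e_ref, h_ref = lms(x, d, 8, 0.05)
e_blk, h_blk = belms(x, d, 8, 0.05)
print(max(abs(a - b) for a, b in zip(e_ref, e_blk)))  # rounding-level difference
```

The two error sequences agree to within floating-point rounding, whereas a conventional block LMS, which omits step 2, would generally not match.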
The specific features and important numbers are discussed further below. The coefficient update loop is limited by the number of reads and writes, not by the arithmetic. To maximize the data bus bandwidth, it uses two four-register sets as accumulators, processing eight coefficients in three cycles. This achieves the maximum of six four-operand moves in three cycles, but requires T to be a multiple of eight. The LMS-SS should be able to perform at 0.875T cycles plus overhead, but memory conflicts in the simulation increased the number of cycles per output to 0.9422T. Note that the code size is smallest for this method and it performs well for small numbers of filter taps.
Table 1 includes several more rows describing the following algorithm characteristics: These quantities are given for both the FIR loops and the coefficient update loops of each algorithm.
LMS-4L
The four loop LMS (LMS-4L) is a better comparison for the BELMS algorithm for large filter sizes, since it can take advantage of modulo addressing. It also has a similar code size to the BELMS algorithm. Although the overhead appears much larger for LMS-4L than the other two methods, it could probably be reduced somewhat. The parameter of interest is the increase in cycles per tap. The switch to modulo addressing allows the LMS-4L to achieve four taps per cycle, T/4, in the FIR loop, maximizing the resources available (instructions issued, data bandwidth, and MACs).
The LMS-4L does data alignment in registers in both loops. In the coefficient update loop, the extra registers used for data alignment are not available to keep two copies of the coefficient accumulators as is done in the LMS-SS. Therefore, the LMS-4L requires eight cycles to process four taps, or T/2 cycles for the update loop. This problem could be resolved by going to a four buffer approach, but only with the associated data memory penalty. The four buffer approach was not implemented, but its theoretical performance is shown in Table 1 as "Best Possible" for the LMS-4L. Note that the utilization numbers are low for the update loop of the LMS-4L. The simulated cycle counts in Fig. 1 are averaged over the four loops.
BELMS
Because the BELMS algorithm operates in blocks of four output samples, it is a good fit for the four-ALU SC140. As noted above, the BELMS algorithm can be re-derived for any desired block size or number of ALUs. The FIR loop can be implemented using a multi-sample algorithm since the filter is fixed for four samples. Even though the loop processes four taps per cycle, the data bandwidth is only 24% of the maximum, reducing power consumption. In the update loop, four coefficient corrections are computed at a time, relieving the data movement limitation, so that the loop is now arithmetic limited. The processor can handle four coefficient updates per cycle, doing four MACs every cycle, again with reduced data bandwidth.
The result is that for filters large enough to neglect overhead, the BELMS method achieves T/2 cycles per output, while the next best choice LMS-4L only achieves 3T/4 cycles per output, a 33% improvement.
This improvement is reduced to about 24% in the simulation due to differences in memory conflicts. Because the BELMS allows reuse of coefficients, it maintains a 20% performance advantage even over the memory intensive four buffer method (T/2 vs. 5T/8).
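The theoretical percentages follow directly from the cycles-per-output figures. The short sketch below reproduces the arithmetic with exact fractions; the variable and function names are assumptions, the FIR and update terms are those quoted for each loop, and overhead is neglected.

```python
from fractions import Fraction

# Cycles per output in units of T (number of taps), overhead neglected.
LMS_4L = Fraction(1, 4) + Fraction(1, 2)   # FIR loop T/4 + update loop T/2
BELMS = Fraction(1, 4) + Fraction(1, 4)    # FIR loop T/4 + update loop T/4
FOUR_BUFFER = Fraction(5, 8)               # quoted total for the four buffer method

def cycle_reduction(new, old):
    """Fraction of cycles saved by `new` relative to `old`."""
    return 1 - new / old

print(cycle_reduction(BELMS, LMS_4L))       # 1/3
print(cycle_reduction(BELMS, FOUR_BUFFER))  # 1/5
```

The 33% and 20% figures are therefore reductions in cycle count relative to the LMS-4L and four buffer methods, respectively.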
The BELMS implementation processes a frame of 40 samples at a time, so overhead values are averaged over 40 outputs. Also, the autocorrelation values must be computed once before the recursion starts, and this is done once per frame in this implementation. This one non-recursive computation depends on T, but its cost can be divided over as large a frame size as desired given frame latency constraints.
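The split between a one-time computation and a recursion can be sketched as follows, assuming the autocorrelation values are sliding-window sums of the form r[n] = x[n]x[n-lag] + ... over T terms. That form, and the names `sample`, `autocorr_direct`, and `autocorr_frame`, are assumptions for illustration rather than the paper's definitions.

```python
def sample(x, n):
    """x[n], with zeros outside the recorded signal."""
    return x[n] if 0 <= n < len(x) else 0.0

def autocorr_direct(x, n, lag, taps):
    """Windowed autocorrelation at time n: cost proportional to taps."""
    return sum(sample(x, n - k) * sample(x, n - k - lag) for k in range(taps))

def autocorr_frame(x, start, length, lag, taps):
    """Values for one frame: one direct computation, then a cheap recursion."""
    r = autocorr_direct(x, start, lag, taps)   # done once per frame
    out = [r]
    for n in range(start + 1, start + length):
        r += sample(x, n) * sample(x, n - lag)                 # newest product enters
        r -= sample(x, n - taps) * sample(x, n - taps - lag)   # oldest product leaves
        out.append(r)
    return out
```

Only the first value of each frame costs on the order of T multiplies; every later value costs two, which is why a longer frame spreads the T-dependent cost over more outputs.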
CONCLUSIONS
The BELMS algorithm allows the LMS to be customized for a multi-ALU architecture. Given typical hardware constraints, the BELMS is the most computationally efficient algorithm available for large numbers of filter taps. Because of the optimal use of processor resources, the BELMS algorithm produces a 33% increase in performance over a four loop LMS method for the StarCore SC140. This performance improvement is possible for a variety of telecommunications systems without any hardware modifications.
