A VLSI implementation of a dedicated digital signal processor is presented. The processor is tailored for efficient implementation of transform domain adaptive filters. It incorporates a butterfly processor which performs the butterfly operation to implement the required transformation. It is also able to perform complex addition, subtraction and multiplication. The butterfly processor makes use of a redundant binary tree multiplier with a recently proposed coding of signed-digit numbers which reduces the number of levels in the tree by one. An on-chip read only memory (ROM) holds the transformation coefficients. The contents of the ROM determine the type of transform. The processor incorporates an ALU to perform integer arithmetic and address calculations and to implement the circular memory scheme. For fastest access, the essential variables of the algorithm are held in a register file.
INTRODUCTION
Extensive research on the Least Mean Square (LMS) algorithm [1, 2] has resulted in various implementations. To speed up the convergence of the algorithm, the input samples are decorrelated by an orthonormal transform. The resulting structure is called the transform domain adaptive filter (TDAF) and the adaptive algorithm is called transform domain LMS (TDLMS). The effect of different transformations (e.g., the Discrete Fourier Transform (DFT) and different Discrete Sine and Cosine Transforms (DST and DCT)) on the performance of the LMS algorithm has recently been studied [3]. Implementing these transforms with structures similar to the FFT gives the transformation a computational complexity of $O(N \log N)$, where $N$ is the length of the transformation and $O(\cdot)$ denotes the order of complexity. To perform the transformation, $(N/2)\log_2 N$ butterflies (defined later in this paper) must be executed. In a recent paper [4], it is shown that the decimation-in-time FFT performed in a sliding fashion can be implemented with an order of complexity as low as $N$. In the sliding implementation of the transform, the transformation is performed after each input sample enters the transformation structure. A further generalization of the sliding FFT, in which a block of $L$ input samples is slid into an $N$-point FFT structure, has been proposed in [5]. The number of butterflies to be executed in this case is

$$n_{b,\mathrm{SFFT}} = \frac{N}{2}\left(\log_2 L + 2\right) - L.$$

In order to implement the sliding FFT efficiently, one should make use of a circular memory scheme [6]. Implementing the transformation of the input samples in a sliding fashion with a circular memory scheme considerably reduces the time requirement of TDLMS.
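As a rough check on the butterfly counts discussed above, the following Python sketch counts butterflies with a simple recursive argument. It is an illustration under stated assumptions, not the derivation of [4] or [5]: the function names are hypothetical, N and L are assumed to be powers of 2 with 2 <= N and L <= N, and the count assumes that each decimation-in-time split can reuse half-length transforms computed at the previous update.

```python
def full_fft_butterflies(n: int) -> int:
    """Butterflies in a complete radix-2 FFT of length n: (n/2) * log2(n)."""
    stages = n.bit_length() - 1           # log2(n) when n is a power of 2
    return (n // 2) * stages


def sliding_fft_butterflies(n: int, block: int = 1) -> int:
    """Butterflies per update when `block` new samples slide into an n-point FFT.

    Assumed counting argument (not taken from the paper): a length-n transform
    is merged from two half-length transforms by n/2 butterflies. Sliding by one
    sample reuses one half and slides the other by one sample; sliding by an
    even block slides both halves by block/2 samples.
    """
    if n == 1:
        return 0                           # a 1-point transform needs no butterfly
    combine = n // 2                       # butterflies that merge the two halves
    if block == 1:
        return combine + sliding_fft_butterflies(n // 2, 1)
    return combine + 2 * sliding_fft_butterflies(n // 2, block // 2)


def closed_form(n: int, block: int) -> int:
    """Closed form matching the recursion: (n/2) * (log2(block) + 2) - block."""
    return (n // 2) * ((block.bit_length() - 1) + 2) - block


if __name__ == "__main__":
    for block in (1, 2, 4, 8):
        print(block, sliding_fft_butterflies(8, block), closed_form(8, block))
```

Under these assumptions the count grows linearly with N when L = 1 and reduces to the full FFT count when L = N.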
Another technique used to improve the efficiency is the Frequency Domain Block LMS (FBLMS) structure, in which the time domain convolution in the adaptive algorithm is replaced by element-by-element multiplication in the frequency domain [7]. This structure can improve the time requirement of the algorithm for the case L > 64. The FBLMS structure of [7] requires 3 DFTs and 2 inverse DFTs. In this structure, one DFT and one inverse DFT can be removed. The resulting structure is called unconstrained FBLMS. The latter has the same speed of convergence as the former at the expense of a higher misadjustment. The misadjustment, however, is close to that of the former for the case L << N [5, 8]. The structure of unconstrained FBLMS is shown in Figure 1. In the rest of this paper, the term FBLMS refers to the structure of unconstrained FBLMS. In the figure, FFT (1) as follows
where $n_{b,\mathrm{PiFFT}}$ and $n_{b,\mathrm{PFFT}}$ are the numbers of butterflies that must be executed to obtain the pruned inverse FFT and the pruned FFT, respectively. Based on the above discussion, implementing the TDAF with the FBLMS structure and using the sliding FFT to transform the input samples reduces the order of complexity of the transformation. Moreover, using the pruned transform in the remaining parts of the FBLMS structure reduces the time requirement of the adaptive algorithm.
Another way to enhance the performance of the transformations is to use a special purpose processor to perform certain tasks. There have been many designs that implement the FFT in hardware. [9] describes a 4K FFT processor based on a computational element that performs a radix-4 butterfly computation which, in turn, uses radix-2 butterflies to perform the operation. Each computational element incorporates 8 complex adders and 3 complex multipliers. The circuit elements used to implement this design were commercial ECL and TTL integrated circuits. In this design, 4 FFT processors are incorporated to perform filtering in the frequency domain and conversion back to the time domain. Another example is [10], where a radix-2 FFT is implemented. Shift registers are used in this design to change the order of the results of each stage. This design was made of discrete elements. [11] gives a clear explanation of radix-2, radix-4 and radix-8 butterflies. Examples of VLSI implementations of FFT processors can be found in [12]. A more recent paper proposed a single FFT processor which made use of 2 multiplier-accumulators and was able to perform one butterfly in 4 clock cycles [13]. Incorporating [14, 16]. Special hardware should therefore be available in the processor which can implement the circular memory scheme efficiently.
In this section, an efficient software implementation of the sliding FFT is presented. Figure 2 shows the structure of a decimation-in-time FFT of length 8. Taking L to be 1, once one transformation is completed, only the highlighted butterflies in the figure need to be executed to perform the transformation at the next sampling time, because the results of the rest were already calculated at the previous sampling time. These results must therefore be kept in memory. [6] discusses the difficulties encountered in keeping and accessing the results in memory, considers different approaches to the problem and finally proposes an efficient approach, which is briefly discussed here.
FIGURE 2 Decimation-in-time FFT of length 8.
Let $N$ be a power of 2 indicating the length of the transform. The transformation is performed in $\log_2 N$ stages. In order to hold the results of each stage, $N$ memory elements are needed. The input samples also require the same amount of memory, so $N(\log_2 N + 1)$ elements are needed in total. An array called x is allocated with a length equal to the smallest integer power of 2 that is greater than or equal to this number. This is given by

$$L_x = 2^{\lceil \log_2 (N(\log_2 N + 1)) \rceil} \tag{4}$$

where $L_x$ denotes the length of x and $\lceil i \rceil$ denotes the smallest integer which is greater than or equal to $i$.
In order to access x, an index called $I_x$ is used. It is first initialized to 0 and, after each transformation is completed, it is incremented according to

$$I_x \leftarrow (I_x + 1)\ \mathrm{AND}\ (L_x - 1). \tag{5}$$

In effect, this relation implements a circular memory. Figure 3 shows the circular memory for the case $N = 4$. By using (4), the length of the array is found to be $L_x = 16$. However, the number of stages required to perform the transformation is 2, so x has 3 parts and only 12 elements are involved in each transformation. At first, $I_x$ is set to zero, pointing to the physical start of x. The first input sample, x(0), is put into the $((I_x + N - 1)\ \mathrm{AND}\ (L_x - 1))$'th element. All other elements are set to zero and then the transformation is performed (Fig. 3a). To perform the next transformation, $I_x$ is incremented according to the assignment statement (5). The next input sample, x(1), is then put into the $((I_x + N - 1)\ \mathrm{AND}\ (L_x - 1))$'th element of x. This is the element next to the one containing x(0). In this way, the first part of the array is rotated by one position. This is equivalent to shifting the contents of this part by one position, as well as shifting all other parts (Fig. 3b). The same procedure is followed for the next three input samples (Fig. 3c). When the next transformation is about to be performed, the last part of the array, which contains the transform of the input samples, is broken into two sections. This, however, is transparent to the algorithm because it uses the AND operator to index into the array (Fig. 3d).
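The indexing rules in (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration only: the class and method names are hypothetical, N is assumed to be a power of 2, and the butterfly computations that fill the remaining parts of the array are omitted.

```python
class CircularFFTMemory:
    """Hypothetical helper holding the array x and the index Ix of (4) and (5)."""

    def __init__(self, n: int):
        stages = n.bit_length() - 1             # log2(N) for a power of 2
        needed = n * (stages + 1)               # input part plus one part per stage
        self.n = n
        # Smallest power of 2 that is >= needed, as in (4).
        self.length = 1 << (needed - 1).bit_length()
        self.mask = self.length - 1             # Lx - 1, used with AND
        self.x = [0.0] * self.length
        self.ix = 0                             # Ix starts at the physical start of x

    def write_sample(self, sample: float) -> int:
        """Store a new input sample at (Ix + N - 1) AND (Lx - 1)."""
        pos = (self.ix + self.n - 1) & self.mask
        self.x[pos] = sample
        return pos

    def advance(self) -> None:
        """Increment Ix according to (5); the AND makes the memory circular."""
        self.ix = (self.ix + 1) & self.mask


if __name__ == "__main__":
    mem = CircularFFTMemory(4)
    print(mem.length)                           # array length for N = 4
    print(mem.write_sample(1.0))                # position of x(0)
    mem.advance()
    print(mem.write_sample(2.0))                # position of x(1)
```

Because the array length is a power of 2, the wrap-around costs a single AND instead of a comparison and branch, which is the property the dedicated hardware is meant to exploit.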
The circular memory scheme discussed above has been implemented on conventional computers as well as on digital signal processors, and the time requirement of the algorithm has been measured; the results are reported in [6]. Based on these results, the circular memory scheme enables one to implement the sliding FFT with O(N) complexity, provided that the processor has enough hardware to implement the scheme without any overhead instructions. Figure 4 shows the block diagram of the processor. It can implement the FBLMS structure.
ARCHITECTURE OF THE MAIN PROCESSOR
where $x_1$ and $x_2$ are the inputs to the butterfly, $W$ is a transformation coefficient and $y_1$ and $y_2$ are the outputs. In terms of real and imaginary parts, (6) and (7) can be written as follows:

$$y_1.r = x_1.r + W.r \cdot x_2.r - W.i \cdot x_2.i \tag{10}$$
$$y_1.i = x_1.i + W.r \cdot x_2.i + W.i \cdot x_2.r \tag{11}$$
$$y_2.r = x_1.r - W.r \cdot x_2.r + W.i \cdot x_2.i \tag{12}$$
$$y_2.i = x_1.i - W.r \cdot x_2.i - W.i \cdot x_2.r \tag{13}$$

In addition to the butterfly operation, the butterfly processor is able to perform complex and real multiplications, as well as addition and subtraction. Complex multiplication can be performed by substituting $x_1$ in (6) and (7) with zero.
Complex addition and subtraction can be performed by substituting $W$ in (6) and (7) with one.
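The basic operations above can be illustrated with a short Python sketch that works on (real, imaginary) pairs. It is a behavioural sketch under assumptions, not a model of the hardware: the function names are hypothetical, and the second output is assumed to follow the standard radix-2 form y2 = x1 - W*x2, which is consistent with obtaining subtraction by setting W to one.

```python
def butterfly(x1, x2, w):
    """Return (y1, y2) with y1 = x1 + w*x2 and y2 = x1 - w*x2.

    Each argument is a (real, imaginary) pair. Only real additions and
    multiplications are used, mirroring the real/imaginary expansion.
    """
    x1r, x1i = x1
    x2r, x2i = x2
    wr, wi = w
    pr = wr * x2r - wi * x2i        # real part of w*x2
    pi = wr * x2i + wi * x2r        # imaginary part of w*x2
    return (x1r + pr, x1i + pi), (x1r - pr, x1i - pi)


def complex_multiply(a, b):
    """Complex product a*b, obtained by setting x1 to zero."""
    y1, _ = butterfly((0, 0), b, a)
    return y1


def complex_add_sub(a, b):
    """Return (a + b, a - b), obtained by setting W to one."""
    return butterfly(a, b, (1, 0))


if __name__ == "__main__":
    print(butterfly((1, 2), (3, 4), (0, 1)))
    print(complex_multiply((1, 2), (3, 4)))
    print(complex_add_sub((1, 2), (3, 4)))
```

Reusing one datapath for all three operations is the design choice described in the text: only the values fed to the x1 and W inputs change.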
After presenting the basic operations, a detailed explanation of the architecture of the butterfly processor now follows.
The communication between the butterfly processor and the main processor is accomplished through input and output buffers. The input buffers contain the 2 inputs to the butterfly and the twiddle factor (i.e., the transformation coefficient). They are called x1InBuf, x2InBuf and WInBuf. The output buffers contain the 2 outputs of the butterfly. They are called y1OutBuf and y2OutBuf and are shown in the top row of the figure. These buffers are internally latched. Therefore, while the butterfly processor is busy performing the current operation, it is possible to read the results of the last operation from the output buffers and write the inputs and the twiddle factor of the next one into the input buffers in the same clock cycle. The width of the input and output buffers is 24 bits for the real part and 24 bits for the imaginary part. The input buffer of the twiddle factor is 16 bits wide for the real part and 16 bits wide for the imaginary part. Each read or write to any of the buffers accesses both the real and imaginary parts.
The operation of signals mentioned in Figure 5 is described here.
Signals whose names start with "Load" load the latch to which they are connected with its input. For instance, "Loadr1" loads r1 with its input. The load operation is performed at the end of the cycle in which the signal is active. Exceptions are the signals Loadx1InBuf, Loadx2InBuf, LoadWN and LoadMulOutWin, which load the input at the beginning of the cycle. These signals are highlighted by "*" in Figure 5. Signals whose names end in "iSel" are connected to multiplexers which select either the real or the imaginary part of the latch to which their inputs are connected. Each of these signals, when active, selects the imaginary part of the latch. For instance, signal x1iSel selects the imaginary part of x1. Signals whose names end in "InSel" are connected to multiplexers whose inputs are named "0" and "1", respectively (Fig. 5). Each of these signals, when active, selects the input named "1". For instance, r1InSel selects the input "1" of multiplexer r1IMux, which is the output of Adder3. Signals whose names start with "Out" are connected to tri-state buffers and activate them. For instance, signal OutShifter3 activates the tri-state buffers connected to the output of Shifter3. When signal SubMBar is active, the output of Adder4 is the difference of its inputs. Otherwise, its output is the sum of its inputs. When signal Shifter3Div2 is active, the output of Shifter3 is its input times 0.5. Otherwise, its output is equal to its input. Signal MulOutWinValue contains the value to be loaded into MulOutWinReg.
Sequence of Executions in the Butterfly Processor
In this section, the sequence of operations performed to calculate the butterfly and the inverse butterfly is explained.
For the butterfly operation, the butterfly processor performs the steps shown in Table I. In this table, identifiers are the names of the latches in the butterfly processor shown in Figure 5. Furthermore, the suffix ".r" refers to the real part and the suffix ".i" refers to the imaginary part of the latches. All the assignments shown in Table I are performed at the end of the cycles. The exceptions are the assignments highlighted by "*". The reason is that the values of the latches W, x1 and x2 are required during Cycle (1) and should be loaded at the start of the cycle. Table I also shows which of the signals in the architecture of Figure 5 should be activated to perform each operation. The steps shown in Table I are in accordance with (10), (11), (12) and (13). By tracing the values of the variables mentioned in this table, y1.r, y1.i, y2.r and y2.i are found to contain the values of the above equations at the end of Cycle (5).
An interesting feature of the butterfly processor is that while it is performing Cycle (5) of one butterfly operation, it can also perform Cycle (1) of another one. This is similar to the concept of pipelining. The reason is, as can be seen in Table I (1) and should be loaded at the start of the cycle. Table II also shows which of the signals in the architecture of Figure 5 should be activated to perform each operation. The steps shown in Table II are in accordance with (14), (15), (16) and (17). By tracing the values of the variables mentioned in this table, y1.r, y1.i, y2.r and y2.i are found to contain the values of the above equations at the end of Cycle (4).
As can be seen in the table, the butterfly processor performs the inverse butterfly operation in 4 cycles.
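The effect of overlapping Cycle (5) of one butterfly with Cycle (1) of the next can be illustrated with a small cycle count. This is a sketch under assumptions: the function name and parameters are hypothetical, butterflies are assumed to be issued back to back, and no overlap is assumed for the inverse butterfly because the text does not state one.

```python
def total_cycles(count: int, cycles_per_op: int, overlap: int = 0) -> int:
    """Clock cycles needed for `count` back-to-back operations.

    `overlap` is the number of cycles an operation shares with the one
    that follows it (1 for the butterfly: Cycle (5) runs with Cycle (1)).
    """
    if count <= 0:
        return 0
    return cycles_per_op + (count - 1) * (cycles_per_op - overlap)


if __name__ == "__main__":
    # Forward butterfly: 5 cycles each, last cycle overlapped with the next one.
    print(total_cycles(1, 5, overlap=1), total_cycles(12, 5, overlap=1))
    # Inverse butterfly: 4 cycles each, no overlap assumed.
    print(total_cycles(12, 4))
```

With the overlap, a long run of butterflies costs about 4 cycles each, which is the throughput quoted for the butterfly processor.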
Multiplier
The butterfly processor incorporates an integer 24-bit by 16-bit signed-digit binary-tree multiplier. The output of the multiplier is 40 bits wide. The coding of the signed digits is the one discussed in [15], where designing the Booth encoders based on that coding reduces the number of levels in the tree by one.
The binary tree has only 3 levels. In other words, there are only 4 partial products in the multiplier. The 16-bit input is connected to the Booth encoders.
Multiplier Output Window
The calculations in the main processor use fixed-point representation, and the position of the decimal point changes in different stages of the adaptive algorithm, which causes the precision of the output of the multiplier to change. On the other hand, the butterfly processor is able to make use of only 24 bits of the product. This requires a mechanism to select the appropriate 24 bits out of the result. For this purpose, a window called MulOutWin is provided at the output of the multiplier. The contents of a register called the multiplier output window register determine which 24 bits are selected. This register is loaded by the main processor.
The multiplier output window is a multiplexer. The principle behind it is the same as that of the barrel shifter made from pass transistors [16]. The window, however, is built from barrel cells that use transmission gates. Each barrel cell consists of two transmission gates and has two data inputs, namely Data0 and Data1, and one data output, namely DataOut. It also has four command lines. In the barrel multiplexer, the barrel cells are arranged in rows as shown in the figure. The first row, controlled by signal Shift1.1, selects 39 of the 40 bits of the output of the multiplier. It behaves like a sliding window which can shift one position and show 39 bits. When Shift1.1 is set to 1, the first row outputs the least significant 39 bits of its input, and when set to 0, it outputs the most significant 39 bits. The second row, controlled by signal Shift1.2, selects 38 of the 39 bits of the first row. When Shift1.2 is set to 1, the second row outputs the least significant 38 bits of its input, and when set to 0, it outputs the most significant 38 bits. The third row, controlled by signal Shift2, selects 36 of the 38 bits of the second row. When Shift2 is set to 1, the third row outputs the least significant 36 bits of its input, and when set to 0, it outputs the most significant 36 bits. The fourth row, controlled by signal Shift4, selects 32 of the 36 bits of the third row. When Shift4 is set to 1, the fourth row outputs the least significant 32 bits of its input, and when set to 0, it outputs the most significant 32 bits. The last row, controlled by signal Shift8, selects 24 of the 32 bits of the fourth row. It is similar to a sliding window which can shift 8 positions. When Shift8 is set to 1, the last row outputs the least significant 24 bits of its input, and when set to 0, it outputs the most significant 24 bits. Each Shift signal, when active, makes the corresponding row select the least significant bits of its input.
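The row-by-row selection described above can be modelled behaviourally in Python. This is a sketch under assumptions, not a description of the circuit: the function and constant names are hypothetical, the 40-bit product is treated as a plain bit pattern (sign handling is ignored), and the five rows are taken to drop 1, 1, 2, 4 and 8 bits in that order.

```python
# Bits dropped by each row, from the first row to the last.
ROW_SHIFTS = (1, 1, 2, 4, 8)


def select_window(product: int, signals, width: int = 40) -> int:
    """Select 24 of the 40 product bits.

    `signals` holds one 0/1 value per row. A value of 1 keeps the least
    significant bits of that row's input; 0 keeps the most significant bits.
    """
    value = product & ((1 << width) - 1)
    for shift, keep_low in zip(ROW_SHIFTS, signals):
        width -= shift                   # each row narrows the window
        if not keep_low:
            value >>= shift              # keep the upper bits: drop the lowest ones
        value &= (1 << width) - 1        # keep only `width` bits of what remains
    return value


if __name__ == "__main__":
    p = 0xABCDEF1234                     # example 40-bit pattern
    print(hex(select_window(p, (1, 1, 1, 1, 1))))   # lowest 24 bits
    print(hex(select_window(p, (0, 0, 0, 0, 0))))   # highest 24 bits
    print(hex(select_window(p, (1, 1, 1, 1, 0))))   # bits 8 to 31
```

Since the row sizes sum to 16, the five signals together place the 24-bit window at any offset from 0 to 16 bits above the least significant bit of the product.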
The timing of the blocks of the chip has been simulated with the SPICE circuit simulator. The slowest blocks of the chip are the ROMs, which, according to the simulations, provide data in 80 nanoseconds. Another slow device in the processor is the multiplier, which lies on the path between the input buffers of the butterfly processor, the multiplier output window and the adders Adder3 and Adder4 (Fig. 5). Based on SPICE simulations of the sub-blocks of this path, its delay is approximately 70 nanoseconds. Based on simulations of the long routes in the main processor, the routing delay is at most 20 nanoseconds. This includes the routing of the control signals generated by the controller.
A two-phase clock is chosen to avoid race hazards. Figure 8 shows the clock pulses. Phase 1 is used to load the state flip-flops of the controller and the stage flip-flops of the butterfly processor, as well as the registers which should be loaded at the start of the cycle (Tables I and II). Phase 2 is used to load data into the other registers. The limiting circuit in determining the clock frequency is the ROM. Therefore, at least 80 nanoseconds are required. The width of each clock pulse is set to the maximum delay it takes for the control signals to reach their destinations, i.e., 20 nanoseconds. The margin between the two phases is also set to the same amount so that, because of the delay, the two phases will not be active at the same time at any location in the chip. The period of the clock is therefore 140 nanoseconds or, equivalently, the clock frequency is selected to be approximately 7 MHz. Using this as the highest clock frequency, the time requirement of an unconstrained FBLMS algorithm developed for this processor is shown in Table III.
Figure 9 shows the floor plan of the butterfly processor and Figure 10 shows its layout. Figure 11 shows the floor plan of the main processor and Figure 12 shows its layout.
CONCLUSION
A special purpose processor was proposed to efficiently implement the FBLMS algorithm. The processor incorporates a dedicated butterfly processor which, in turn, makes use of a modern multiplier. The butterfly processor performs one butterfly operation in 4 clock cycles and has only one multiplier, compared to a previously known structure which requires the same number of cycles but two multipliers. Other essential blocks of the processor include the ALU and the controller. The architecture of the processor as well as its VLSI implementation were discussed. The timing performance of the processor was also presented.
