The new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation nearly by half compared to existing ones.
III. COMPLEXITY ANALYSIS OF THE POST-ADDITION STAGES
For the post-addition stages, let $A(N)$ and $B(N)$, respectively, denote the number of all required additions and the number of additions required in the final stage, and let $C(N)$ denote the number of nodes that do not require butterfly computations in the first $\log_2 N$ stages. From (13) 
IV. CONCLUSIONS
An index-permutation based 2-D DCT algorithm has been presented in this correspondence. The succinct derivation of the proposed algorithm makes it easier to describe the process of how to map one 2-D DCT into a number of 1-D DCT's. From the idea of [8] , a matrix-form-based systematic expression for the post-addition stages in the proposed algorithm, which may improve the regularity of the structure, is currently under investigation.
I. INTRODUCTION
Many high-speed FFT processors have been obtained by implementing the fast Fourier transform in pipelined digital hardware with a butterfly calculation unit, two-port data memory, ROM for storing twiddle factors, and memory addressing controller integrated on a chip. It is possible to use an in-place strategy that stores butterfly outputs in those memory locations used by the inputs to the same butterfly. The in-place strategy requires only a minimum amount of memory. For this reason, only the in-place radix-2 decimation-in-time version of the fast Fourier transform is considered here.
If the butterfly unit has parallel inputs and outputs, then the two butterfly inputs will be accessed in the memory, and two butterfly outputs will be written back to the same memory in each cycle. In order to avoid this memory bottleneck, the two-port memory module is always divided into two separate banks so that two data values can be read and two data values written on each memory cycle.
Pease [4] observed that the addresses of the two inputs differ in their parity. It is therefore possible to divide the memory into two banks: one bank for addresses with even parity and the other for addresses with odd parity. Based on this observation, Cohen [1] proposed a simplified control logic for radix-2 FFT processors. Johnson [2] arranged the memory module in a radix-r FFT processor in a similar way. The drawback of these strategies is that the address parity must be calculated before memory is accessed, and the delay of the parity calculation is large. Sinha [5] presented an interesting approach to memory assignment. The approach uses the well-known triple-loop control structure. As a result, the address generation in that approach is complicated, and its delay is large.
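The parity observation can be checked numerically. The following is a minimal sketch, assuming an in-place radix-2 FFT in which the two input addresses of a butterfly at pass p differ only in bit p; the function names are hypothetical and are not taken from [1] or [4].

```python
def parity(x: int) -> int:
    """Return the XOR of all bits of x (0 for even parity, 1 for odd)."""
    return bin(x).count("1") & 1


def butterfly_pairs(n_bits: int, p: int):
    """Yield the address pairs (s, t) of pass p for a 2**n_bits-point FFT.

    Assumption: s has bit p cleared, and t equals s with bit p set.
    """
    for s in range(1 << n_bits):
        if not (s >> p) & 1:
            yield s, s | (1 << p)


def parity_always_differs(n_bits: int) -> bool:
    """Check that s and t have opposite parity for every butterfly of every pass."""
    return all(
        parity(s) != parity(t)
        for p in range(n_bits)
        for s, t in butterfly_pairs(n_bits, p)
    )
```

Because s and t differ in exactly one bit, their parities always differ, so a parity-based split into two banks never sends both inputs to the same bank. The cost is the bit-wise XOR over the whole address, which is the parity delay discussed above.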
The processing speed or throughput of the FFT processor is dominated by its pipeline cycle or its clock rate. The disadvantage of existing methods [1], [5] is that their address generation delays are close to that of a carry-lookahead adder. This limits the clock rate of the butterfly unit for large transforms. The objective of our work is to shorten the address generation delay, which helps to increase the clock rate at which FFT processors can operate and thereby improves their performance.
II. RADIX-2 FFT ALGORITHM AND ADDRESS GENERATION FOR BUTTERFLY OPERATIONS
The discrete Fourier transform (DFT) of an $N$-point sequence is defined by
$$X_k = \sum_{m=0}^{N-1} x_m W_N^{mk}$$
where $W_N^{mk} = e^{-j(2\pi mk/N)}$ and $k = 0, 1, \ldots, N-1$. The radix-2 FFT is an efficient way to compute an $N$-point DFT, as shown in Fig. 1. We assume that the inputs are arranged in bit-reversed order and that the outputs are produced in normal order. Define $X_k(-1) = x_m$ with $k = \mathrm{rev}(m)$ (the bit-reversed number corresponding to $m$). The butterfly calculations at pass $p$, which are shown in Fig. 1, are expressed by
where 
Cohen [1] proposed a simple approach to generating $s$ and $t$, the two input addresses of a butterfly operation. Denote by $b$ and $p$ the butterfly operation and pass indices, respectively. Their bitwidths are
Note that the twiddle factor address can be generated by using an operation similar to a shift-left operation. Since the address generation logic is very simple and its delay is only half that of the data address generation, the twiddle factor address generation is not discussed here.
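The in-place computation described in this section can be illustrated in software. The sketch below assumes the standard radix-2 decimation-in-time butterfly with inputs loaded in bit-reversed order; it is not the hardware address generator of [1], and the names `bit_reverse` and `fft_in_place` are hypothetical.

```python
import cmath


def bit_reverse(m: int, n_bits: int) -> int:
    """Return the n_bits-wide bit reversal of m."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (m & 1)
        m >>= 1
    return r


def fft_in_place(x):
    """Radix-2 decimation-in-time FFT computed in place, pass by pass."""
    n_points = len(x)
    n_bits = n_points.bit_length() - 1
    assert 1 << n_bits == n_points, "length must be a power of two"
    # X_k(-1) = x_m with k = rev(m): load the inputs in bit-reversed order.
    data = [x[bit_reverse(k, n_bits)] for k in range(n_points)]
    for p in range(n_bits):
        for s in range(n_points):
            if (s >> p) & 1:
                continue  # s is the butterfly input whose bit p is 0
            t = s | (1 << p)  # t differs from s only in bit p
            # Twiddle exponent: the low p bits of s shifted left by n_bits-1-p.
            j = (s & ((1 << p) - 1)) << (n_bits - 1 - p)
            w = cmath.exp(-2j * cmath.pi * j / n_points)
            a, b = data[s], w * data[t]
            # Outputs overwrite the locations of the two inputs (in-place).
            data[s], data[t] = a + b, a - b
    return data
```

Each butterfly reads and writes exactly the two locations s and t, which is why a two-bank memory that serves both in one cycle removes the access bottleneck.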
III. AN EFFECTIVE CONFLICT-FREE MEMORY ADDRESSING ASSIGNMENT

A. Address Generation for Parallel Data Access
Equations (4) and (5) indicate that the addresses $s$ and $t$ differ only in the $p$th bit at pass $p$. Let the binary notation of $s$ at pass $p$ be given by $s = s_{n-1} s_{n-2} \cdots s_{p+1}\, 0\, s_{p-1} \cdots s_1 s_0$. Based on (4) and (5), we have $t = s_{n-1} s_{n-2} \cdots s_{p+1}\, 1\, s_{p-1} \cdots s_1 s_0$. Denote $s_r = t_r = s_{n-1} s_{n-2} \cdots s_{p+1} s_{p-1} \cdots s_1 s_0$. We place $X_s(p-1)$ and $X_t(p-1)$ in addresses $s_r$ and $t_r$ of memory banks $M_0$ and $M_1$, respectively. This assignment allows the pair of butterfly inputs $X_s(p-1)$ and $X_t(p-1)$ to be accessed in a conflict-free manner.
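The bit-deletion step can be sketched as follows. This is an illustration under the assumption that bit $p$ of the full address selects the bank and the remaining $n-1$ bits form the address within the bank; the helper names are hypothetical.

```python
def split_address(addr: int, p: int):
    """Return (bank, reduced_address) for a full address at pass p.

    Bit p of addr selects the bank; deleting bit p gives the bank address.
    """
    low = addr & ((1 << p) - 1)   # bits p-1 .. 0
    high = addr >> (p + 1)        # bits n-1 .. p+1
    return (addr >> p) & 1, (high << p) | low


def reads_are_conflict_free(n_bits: int) -> bool:
    """Check that s and t map to different banks at the same reduced address."""
    for p in range(n_bits):
        for s in range(1 << n_bits):
            if (s >> p) & 1:
                continue
            t = s | (1 << p)
            bank_s, s_r = split_address(s, p)
            bank_t, t_r = split_address(t, p)
            if not (bank_s == 0 and bank_t == 1 and s_r == t_r):
                return False
    return True
```

For example, with $p = 2$ the addresses $s = 10011_2$ and $t = 10111_2$ both reduce to $1011_2$, in banks 0 and 1, respectively, so a single reduced address drives both banks and no parity computation is needed.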
Furthermore, the memory banks to store $X_s(p)$ and $X_t(p)$ are determined according to the value of $s_{p+1}$ ($s_{p+1} = b_0$). This indicates that the pair of butterfly outputs at pass $p$, $X_s(p)$ and $X_t(p)$, are located in the same memory bank. With this assignment, the two inputs of butterfly operations at pass $(p+1)$ can be accessed simultaneously.
Based on the above insight, we propose a memory addressing assignment for the inputs and outputs of butterfly operations, which is shown in Table I. In Table I, we denote the butterfly counter and pass counter by $b$ and $p$, respectively, and the binary notation of $b$ is given by $b = b_{n-2} b_{n-3} \cdots b_1 b_0$. Define $A_s = b_{n-2} b_{n-3} \cdots b_1 0$ and $A_t = b_{n-2} b_{n-3} \cdots b_1 1$. The address generation logic for read and write operations is depicted in Figs. 2 and 3, respectively.
B. The Properties of Our Memory Addressing Assignment
We need to analyze the properties of our strategy to ensure its correctness.
1) Conflict-Free Addressing for Reads:
The insight mentioned above shows that $X_s(p-1)$ and $X_t(p-1)$ can be accessed in parallel.
2) Conflict-Free Addressing for Writes:
We know that the outputs of a butterfly operation at pass $p$, $X_s(p)$ and $X_t(p)$, belong to the same memory bank. At the end of a butterfly calculation, $X_s(p)$ is written to a memory bank, whereas $X_t(p)$ is stored in a register and written to the same memory bank in the next memory cycle. Table I shows that if the outputs of one butterfly calculation belong to memory bank 0, then the outputs of the next butterfly calculation will belong to memory bank 1. Therefore, the outputs of two adjacent butterfly operations can be written simultaneously.
3) Conflict-Free Addressing for Reads and Writes Simultaneously: Let the input and output addresses of one butterfly operation be $s_r$, $t_r$, $s_w$, and $t_w$, respectively, and the input and output addresses of the next butterfly operation be $s'_r$, $t'_r$, $s'_w$, and $t'_w$, respectively. Assume the value of the butterfly counter for one butterfly operation is
Based on our strategy presented in the previous subsection, we obtain a memory addressing assignment, which is given below.
[Table: addresses for read/write and the corresponding memory banks]
We see that $X_s(p)$ and $X_t(p)$ are written back to the same locations where $X_s(p-1)$ and $X_t(p-1)$ reside, respectively. However, $X_t(p)$ needs to overwrite the location of $X_s(p-1)$, and $X_s(p)$ needs to overwrite the location of $X_t(p-1)$. Fortunately, since memory reads, butterfly calculations, and memory writes form a pipeline to compute an FFT, $X_s(p-1)$ has been loaded into the butterfly unit before $X_t(p)$ is written to its location. Therefore, the input reading of one butterfly and the output writing of the previous butterfly can be performed without conflict.
IV. MEMORY ADDRESSING FOR THE INPUT/OUTPUT OF FFT PROCESSORS
A. Memory Addressing for Input
When sampled data $X_k(-1)$ [see Section II, $X_k(-1) = x_m$, with $k = \mathrm{rev}(m)$] are loaded into an FFT processor, the data should be placed in such a way that the inputs of butterfly operations at the zeroth pass ($p = 0$) can be accessed concurrently. We choose a memory bank to place the sampled data according to the least significant bit of its index, as shown in Table II.
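The input placement can be sketched in software. The bank selection by the least significant bit of $k$ follows the text; the address within the bank is an assumption here (the remaining bits, $k$ shifted right by one), chosen to agree with the reduced address $s_r$ at $p = 0$. The function names are hypothetical.

```python
def bit_reverse(m: int, n_bits: int) -> int:
    """Return the n_bits-wide bit reversal of m."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (m & 1)
        m >>= 1
    return r


def load_input(samples):
    """Place sample x_m at index k = rev(m): bank k & 1, address k >> 1."""
    n_points = len(samples)
    n_bits = n_points.bit_length() - 1
    assert 1 << n_bits == n_points, "length must be a power of two"
    banks = [[None] * (n_points // 2), [None] * (n_points // 2)]
    for m, x_m in enumerate(samples):
        k = bit_reverse(m, n_bits)
        # Assumed address within the bank: the index with its LSB removed.
        banks[k & 1][k >> 1] = x_m
    return banks
```

With this layout, the two inputs of every pass-0 butterfly, $X_{2i}(-1)$ and $X_{2i+1}(-1)$, sit at the same address $i$ in banks 0 and 1. For eight samples `0..7`, `load_input` gives `[[0, 2, 1, 3], [4, 6, 5, 7]]`.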
B. Memory Addressing for Output
Based on our strategy, the memory banks storing the final results are determined according to the least significant bit of their indices, as in the assignment for input described in the previous subsection. However, the address within each memory bank should be modified. The locations of FFT outputs are shown in Table III.
[1] and [5]
It is reasonable to assume that the delays of some basic circuits are as follows:
In Cohen's scheme [1], a cyclic shift on the butterfly counter should be performed, and the address parity should be calculated before a read or write is performed. Taking the two operations of address and data multiplexing (interchanges) into account, the total delays of address generation for reads and writes are given by $T_{\mathrm{read}} = T_{\mathrm{write}} = \max\{T_{rl}, T_{\mathrm{parity}}\} + 2T_{\mathrm{mux}} = 2T_{\mathrm{and}} \lceil \log_2 (n-1) \rceil + 4T_{\mathrm{and}}$.
Sinha [5] used the control method of three separate iteration loops to compute the fast Fourier transform. An addition should be done in the address generation (see [5,
For our scheme, the delay of address generation for reads is dominated by the delay of the cyclic shift operation, which is given by $T_{\mathrm{read}} = T_{rl} = T_{\mathrm{and}} \lceil \log_2 n \rceil + T_{\mathrm{and}}$. Since storing $X_t(p)$ in a register does not add an additional delay to that of address generation, and the register access time of the write operation is less than that of memory, the delay of address generation for writes is given by $T_{\mathrm{write}} = T_{rl} + T_{\mathrm{mux}} = T_{\mathrm{and}} \lceil \log_2 n \rceil + 3T_{\mathrm{and}}$.
The delay comparisons of Cohen's and Sinha's designs [1], [5] and ours are summarized in Table IV. Compared with [1] and [5], our scheme reduces the delay of address generation by nearly half.
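The delay expressions for Cohen's scheme and for our scheme can be evaluated for a concrete size. The sketch below expresses all delays in units of $T_{\mathrm{and}}$ and assumes that $n$ is the number of address bits, so that $n = 10$ for a 1024-point transform; the function names are hypothetical, and the expression for Sinha's design is not included because it is not reproduced here.

```python
import math


def cohen_delay(n: int) -> int:
    """Read/write delay of Cohen's scheme: 2*ceil(log2(n-1)) + 4, in T_and units."""
    return 2 * math.ceil(math.log2(n - 1)) + 4


def proposed_read_delay(n: int) -> int:
    """Read delay of the proposed scheme: ceil(log2(n)) + 1, in T_and units."""
    return math.ceil(math.log2(n)) + 1


def proposed_write_delay(n: int) -> int:
    """Write delay of the proposed scheme: ceil(log2(n)) + 3, in T_and units."""
    return math.ceil(math.log2(n)) + 3


n = 10  # assumed: address bitwidth of a 1024-point FFT
print(cohen_delay(n), proposed_read_delay(n), proposed_write_delay(n))  # 12 5 7
```

For $n = 10$, the read delay drops from 12 to 5 and the write delay from 12 to 7 gate delays, which is in line with the reduction by nearly half stated above.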
B. Hardware Complexity Comparisons
The size of each address generation circuit is approximated to a first order by the number of gates and transistors. The actual area required for a given circuit will depend on the types of gates, the number of gates, and the amount of wiring area, but the relative sizes are consistent with the gate counts. In addition, the wiring complexity of our design is comparable with that of Cohen's design [1] . To simplify the analysis and comparisons, we use the gate count and transistor count as a hardware complexity measure.
The complexity comparisons of the scheme in this correspondence and the other two schemes are shown in Table V. The comparisons are based on 1024-point FFT processors with a data representation of 20 bits (corresponding to a 40-bit representation for a complex datum). In addition, the address generation circuits for reads and writes are
We see from Table V that the hardware complexity of our scheme is comparable with that of Cohen's design [1]. This remains true when the bitwidth of the data representation increases (for a bitwidth of 32).
Although the hardware complexity of our scheme is higher than that of Sinha's design [5] , their FFT processor is not pipelined. If their processor is pipelined, then the complexity of their address generation circuit will be higher than ours.
VI. SUMMARY
We have proposed an effective approach to the memory addressing of FFT processors, which is simple and is suitable for pipelined FFT processor implementations. The analysis and comparisons we made show that the delay associated with address generation is reduced nearly by half with equivalent hardware complexity compared with Cohen's design [1] . Two effective memory addressing schemes for the input and output of FFT processors are also given. With our strategy, a powerful FFT processor can be implemented efficiently.
ACKNOWLEDGMENT
The author is grateful to the reviewers for their suggestions and comments, which greatly improved the quality of the paper.
I. INTRODUCTION
In recent years, the development of low-power devices for applications in the fields of communications and DSP has become an active area of research due to the proliferation of mobile communication systems. For this reason, numerous power reduction techniques have been proposed at the algorithmic level [1]-[4], architectural level [5], logic level [6], and circuit level [1]. These techniques are currently being applied to develop low-power and high-speed transceivers for applications such as asymmetric digital subscriber loop (ADSL) [7], high-speed digital subscriber loop [8], and ATM-LAN [9] to achieve high bit rate digital communication over bandlimited channels.
The transceivers in most of these applications employ some form of adaptive equalization at the receiving end to combat corruption of the transmitted signal due to several sources of distortion such as intersymbol interference (ISI), crosstalk, and additive noise. In many of these applications, transmission schemes such as quadrature amplitude modulation (QAM) are employed, where the receiver consists of a phase splitter or a Hilbert transformer followed by

Manuscript received April 14, 1997; revised February 19, 1998. This work was supported by the National Science Foundation under CAREER Award MIP-9623737. The associate editor coordinating the review of this paper and approving it for publication was Dr. Elias Manolakos.
The authors are with the Coordinated Science Laboratory/Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: rhegde@uivlsi.csl.uiuc.edu; shanbhag@uivlsi.csl.uiuc.edu).
Publisher Item Identifier S 1053-587X(99) .

In CAP transceivers, a PSPE is employed at the receiving end. This receiver consists of a parallel arrangement of two adaptive equalizers. In this correspondence, we propose a low-power architecture for the PSPE employed in a CAP transceiver by exploiting the Hilbert relationship between the optimum solutions of the receive filters. The rest of this correspondence is organized as follows. In the next section, we describe the generic CAP transceiver scheme. In Section III, we present the proposed receiver architecture and analyze its properties. In Section IV, we show, via analysis and simulation results, that the proposed architecture results in considerable power savings in an ATM-LAN environment with marginal degradation in performance.
II. THE CAP TRANSMISSION SCHEME
The block diagram of the generic CAP transmitter is shown in Fig. 1(a). The bit stream to be transmitted is passed through a scrambler in order to randomize the data and is then fed into an encoder. The encoder maps a block of $m$ bits into one of $k = 2^m$ unique complex symbols $S_n = r_n + jq_n$ in a $k$-CAP scheme. In the 16-CAP scheme described here, we have $m = 4$ and $k = 16$. The impulse responses of the shaping filters $p(kT')$ and $\tilde{p}(kT')$ (1) Due to the Hilbert relationship, the magnitude response of $\tilde{p}(n)$ is the same as that of $p(n)$, but the phase response of $\tilde{p}(n)$ is shifted by $+90^\circ$ and $-90^\circ$ in the positive and negative frequency regions, respectively.
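The encoder mapping can be illustrated with a short sketch. The constellation is not specified in the text above, so the code assumes a square 4 x 4 grid with levels -3, -1, 1, 3 and a plain binary (not Gray-coded) bit assignment; both choices and all names are hypothetical.

```python
# Assumed amplitude levels for each of the two symbol components.
LEVELS = (-3, -1, 1, 3)


def encode_block(bits):
    """Map a block of m = 4 bits to one of k = 16 symbols S_n = r_n + j q_n."""
    assert len(bits) == 4 and all(b in (0, 1) for b in bits)
    r_n = LEVELS[(bits[0] << 1) | bits[1]]  # first two bits select r_n
    q_n = LEVELS[(bits[2] << 1) | bits[3]]  # last two bits select q_n
    return complex(r_n, q_n)


def encode_stream(bits):
    """Encode a scrambled bit stream, four bits per symbol."""
    assert len(bits) % 4 == 0
    return [encode_block(bits[i:i + 4]) for i in range(0, len(bits), 4)]
```

Every distinct 4-bit block yields a distinct symbol, which gives the $k = 2^m = 16$ symbols of the 16-CAP scheme.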
The CAP receiver, which is shown in Fig. 1(b), consists of an analog-to-digital (A/D) converter operating at a sampling frequency of $1/T'$ followed by two adaptive digital filters in parallel [10], which also operate at a sampling frequency of $1/T' = K/T$, where $K$ is the oversampling factor and $T$ is the symbol period. In the 16-CAP scheme, we have $K = 4$. These filters form the in-phase and quadrature-phase equalizers. The filter (F) block in each of these equalizers consists of an FIR filter whose coefficients are computed recursively in the weight-update (WUD) block using the popular least mean squares (LMS) algorithm [12]. This algorithm minimizes the mean squared error (MSE) given by 
