We are interested in developing a programmable baseband processor for multiple radio standards, including the wireless LAN standards 802.11a and 802.11b. 802.11a is based on OFDM and uses a 64-point FFT. Demodulation of the complementary code keying (CCK) used in 802.11b includes the computation of a modified Walsh transform.
INTRODUCTION
With the upcoming 4th generation wireless systems and convergence of multiple radio standards into a single terminal, there is a need for building blocks that can be configured for computing different algorithms used in different standards.
As a starting point for developing a programmable baseband processor, the IEEE wireless LAN standards 802.11a/b/g have been studied. It was found that computation of the FFT, which is used in OFDM standards such as 802.11a and g, and the fast Walsh transform, which is used in the 802.11b standard, can use much the same datapath if the radix-4 FFT algorithm is used. This paper describes converged hardware for computation of the 64-point FFT, the 64-point discrete cosine transform (DCT) and the Walsh transform. The 64-point FFT is used in several OFDM standards, including IEEE 802.11a. The Walsh transform is needed for demodulation of CCK (complementary code keying), which is used in IEEE 802.11b.
The discrete cosine transform is used in several common audio and video compression algorithms. This makes the processor useful for both baseband and application acceleration, for example for DAB (Digital Audio Broadcasting), DVB (Digital Video Broadcasting) or a 4th generation wireless multimedia terminal. The DCT is often computed using an FFT processor with some pre- and postprocessing. The described FFT/FWT processor has been extended to also allow efficient computation of the DCT.

0-7803-7946-2/03/$17.00 ©2003 IEEE
Although this particular implementation only computes 64-point transforms, the concept is easily extended to include e.g. 256- or 1024-point transforms. Only the memory size and some parts of the address generation would be modified.
Section 2 of this paper explains the theory behind the radix-4 FFT and the modified fast Walsh transform. Section 3 presents the proposed datapath and section 4 describes the addressing scheme. This is followed by implementation and synthesis results and conclusions.
THEORY

The radix-4 FFT algorithm
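The discrete Fourier transform (DFT), for $0 \le l < 64$, is defined by

$$X(l) = \sum_{k=0}^{63} x(k)\, W_{64}^{kl} \qquad (1)$$

where $W_{64} = \exp(-j2\pi/64)$ (and, in general, $W_N = \exp(-j2\pi/N)$). We now set out to derive a radix-4 FFT of (1). This is done by factoring $64 = 4 \times 4 \times 4$. The resulting algorithm will be similar to a 3-D DFT on a $4 \times 4 \times 4$ cube. We make the following replacements of the indices

$$k = 16k_2 + 4k_1 + k_0, \qquad l = 16l_2 + 4l_1 + l_0 \qquad (2)$$

where $0 \le k_i, l_i < 4$. We will also need $\tilde{l}$, the bit-reversed version of $l$,

$$\tilde{l} = l_2 + 4l_1 + 16l_0 \qquad (3)$$

We begin by evaluating

$$W_{64}^{k\tilde{l}} = W_{64}^{(16k_2+4k_1+k_0)(l_2+4l_1+16l_0)} = W_4^{k_2 l_2}\, W_{16}^{k_1 l_2}\, W_4^{k_1 l_1}\, W_{64}^{k_0(l_2+4l_1)}\, W_4^{k_0 l_0} \qquad (4)$$

By inserting (2), (3) and (4) into (1), we get

$$X(\tilde{l}) = \sum_{k_0=0}^{3} (-j)^{k_0 l_0}\, W_{64}^{k_0(l_2+4l_1)} \sum_{k_1=0}^{3} (-j)^{k_1 l_1}\, W_{16}^{k_1 l_2} \sum_{k_2=0}^{3} (-j)^{k_2 l_2}\, x(16k_2+4k_1+k_0) \qquad (5)$$

where we have used that $W_4 = -j$. (5) is a radix-4 FFT that produces a bit-reversed output vector.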
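The factorization $64 = 4 \times 4 \times 4$ can be checked numerically. The following Python sketch is not the authors' hardware; it evaluates three stages of 4-point sums using the index split in (2) and compares the result with a direct DFT. It assumes that the reversed index is $\tilde{l} = l_2 + 4l_1 + 16l_0$, as the exponent in (4) suggests. The names `radix4_fft64`, `dft64` and `digit_reverse4` are illustrative only, and separate lists are used per stage for clarity even though each result is stored at the position an in-place computation would use.

```python
import cmath

# Twiddle table: W64[n] = exp(-j*2*pi*n/64). Note that (-j)**n == W64[16*n % 64].
W64 = [cmath.exp(-2j * cmath.pi * n / 64) for n in range(64)]

def dft64(x):
    """Direct 64-point DFT, used only as a reference."""
    return [sum(x[k] * W64[(k * l) % 64] for k in range(64)) for l in range(64)]

def digit_reverse4(l):
    """Reverse the three base-4 digits of l: 16*l2 + 4*l1 + l0 -> 16*l0 + 4*l1 + l2."""
    return 16 * (l % 4) + 4 * ((l // 4) % 4) + (l // 16)

def radix4_fft64(x):
    """Three radix-4 stages; out[l] holds X(digit_reverse4(l))."""
    assert len(x) == 64
    a = [complex(v) for v in x]

    # Stage 1: sum over k2, result indexed by (l2, k1, k0).
    b = [0j] * 64
    for l2 in range(4):
        for k1 in range(4):
            for k0 in range(4):
                b[16 * l2 + 4 * k1 + k0] = sum(
                    W64[(16 * k2 * l2) % 64] * a[16 * k2 + 4 * k1 + k0]
                    for k2 in range(4))

    # Stage 2: sum over k1 with twiddle W16^(k1*l2) = W64^(4*k1*l2).
    c = [0j] * 64
    for l2 in range(4):
        for l1 in range(4):
            for k0 in range(4):
                c[16 * l2 + 4 * l1 + k0] = sum(
                    W64[(16 * k1 * l1 + 4 * k1 * l2) % 64] * b[16 * l2 + 4 * k1 + k0]
                    for k1 in range(4))

    # Stage 3: sum over k0 with twiddle W64^(k0*(4*l1 + l2)).
    out = [0j] * 64
    for l2 in range(4):
        for l1 in range(4):
            for l0 in range(4):
                out[16 * l2 + 4 * l1 + l0] = sum(
                    W64[(16 * k0 * l0 + k0 * (4 * l1 + l2)) % 64] * c[16 * l2 + 4 * l1 + k0]
                    for k0 in range(4))
    return out
```

Each stage only needs multiplications by powers of $-j$ inside the 4-point sum, which are trivial in hardware; the non-trivial twiddle factors appear once per stage output.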
The modified FWT algorithm
The modified Walsh transform, for $0 \le l < 64$, is defined by
(6)
The kernel function p(m, n) is given by
(7)
where $m = \langle m_2 m_1 m_0 \rangle$ is in base 2 and $n = \langle n_2 n_1 n_0 \rangle$ is in base 4.
We now try a deduction similar to the one carried out in section 2.1. Inserting (8) into (6), we get z(2l) = h and all other entries are zero.
With these modifications (5) and (10) are identical.
The DCT algorithm
To calculate a DCT using FFT hardware, the input samples have to be reordered according to figure 3 c and the output samples have to be multiplied by compensation factors. The theory behind this method is explained, for example, in [5]. The input to the DCT is real-valued and only the real part of the output is used. However, by using both real and imaginary parts of the input samples, and applying some post-processing, two DCTs can be computed simultaneously. 16x12 instead of 14x12) and the coefficient ROM will hold 64x4=256 coefficients instead of 48x3=144.
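As an illustration of computing a DCT with an FFT, the sketch below uses a well-known even/odd input reordering followed by multiplication with a compensation factor $\exp(-j\pi k/(2N))$. Whether this is exactly the reordering of figure 3 c is an assumption, since the figure is not reproduced here. The DCT is taken as the unnormalized $C(k) = \sum_n x(n)\cos(\pi(2n+1)k/(2N))$, a direct DFT stands in for the FFT hardware, and all function names are hypothetical. Two real inputs are packed into the real and imaginary parts of one complex transform and separated afterwards using conjugate symmetry.

```python
import cmath
import math

def dft(v):
    """Direct DFT, standing in for the FFT hardware."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def reorder(x):
    """Even-indexed samples first, odd-indexed samples in reverse order at the end."""
    n = len(x)
    v = [0.0] * n
    for m in range(n // 2):
        v[m] = x[2 * m]
        v[n - 1 - m] = x[2 * m + 1]
    return v

def two_dcts_via_fft(x1, x2):
    """Compute the DCTs of two real sequences with a single complex transform."""
    n = len(x1)
    z = [complex(a, b) for a, b in zip(reorder(x1), reorder(x2))]
    Z = dft(z)
    c1, c2 = [], []
    for k in range(n):
        zk = Z[k]
        zc = Z[(n - k) % n].conjugate()
        v1 = (zk + zc) / 2       # spectrum of the real-part sequence
        v2 = (zk - zc) / 2j      # spectrum of the imaginary-part sequence
        w = cmath.exp(-1j * cmath.pi * k / (2 * n))  # compensation factor
        c1.append((w * v1).real)
        c2.append((w * v2).real)
    return c1, c2

def dct_direct(x):
    """Reference DCT evaluated straight from its definition."""
    n = len(x)
    return [sum(x[m] * math.cos(math.pi * (2 * m + 1) * k / (2 * n)) for m in range(n))
            for k in range(n)]
```

Only the real part of each compensated output is kept, which matches the statement that the imaginary part of the result is not needed.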
DATAPATH
In FFT and DCT modes the datapath is pipelined into three stages. In the first stage two additions are executed, in the second stage one (real) multiplication is executed and the third stage has one addition (which is part of the complex multiplication) and one round or truncate operation.
In FWT mode, the critical path is just one addition and the datapath is not pipelined.
Based on 802.11a, 12-bit precision (i.e. 12 bits each for real and imaginary parts) has been chosen for input and output data. Precision requirement investigations have shown that to reach this precision it is enough to use 16-bit precision for intermediate results and 12-bit precision for coefficients. Each complex multiplier consists of four 12x16 bit real multipliers and two 30-bit adders. The memory word length is 32 bits.
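A complex multiplication built from four real multiplications and two additions, followed by a round or truncate step, can be sketched as follows. The fixed-point scaling is an assumption made for illustration only: coefficients are taken as signed 12-bit values with 11 fractional bits, and data as signed 16-bit integers. The paper does not state the number formats, and `cmul_fixed` and `COEF_FRAC_BITS` are hypothetical names.

```python
COEF_FRAC_BITS = 11  # assumed: 12-bit signed coefficient with 11 fractional bits

def cmul_fixed(a_re, a_im, c_re, c_im, rounding=True):
    """Multiply data (a_re + j*a_im) by coefficient (c_re + j*c_im) in integer arithmetic."""
    # Four real multiplications (16-bit data times 12-bit coefficient).
    p_rr = a_re * c_re
    p_ii = a_im * c_im
    p_ri = a_re * c_im
    p_ir = a_im * c_re
    # Two additions complete the complex product.
    re = p_rr - p_ii
    im = p_ri + p_ir
    # Round (add half an LSB) or truncate, then drop the fractional bits.
    bias = (1 << (COEF_FRAC_BITS - 1)) if rounding else 0
    return (re + bias) >> COEF_FRAC_BITS, (im + bias) >> COEF_FRAC_BITS
```

For example, multiplying `(1000, -2000)` by the coefficient `(1024, 0)`, which represents 0.5 under the assumed scaling, yields `(500, -1000)`.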
Memory architecture
All algorithms use in-place calculation, so it is enough to use one memory.
The memory is divided into four banks of 16 words each, and each bank into two subbanks of eight words each. In each clock cycle the processor may read one word from each bank (four words in total) and write one word to each subbank (eight words in total).
Our implementation uses a register file with eight banks that have separate read and write ports. Once the correct subbank has been selected, the three most significant bits of the address are used for addressing within the subbank.
ADDRESSING
All addressing is based on two 6-bit counters, one for generating read addresses and one for write addresses. (The write counter is delayed by a number of steps corresponding to the pipeline depth; in FFT or DCT mode the coefficient ROM is also addressed by a delayed value of the read counter.) In the following description it is assumed that the output of the 6-bit read or write counter is {x0,x1,x2,x3,x4,x5}. [...] indicates the current step. The first step has 2 iterations (= 4 butterflies), the second step has 4 iterations and the last step has 8 iterations. The eight addresses that are written in parallel are a0-a7 below. The four addresses that are read in parallel are a0-a3: [...] {1,1,0,x5,0,0}, a6 = {1,0,0,x5,0,1}, a7 = {1,1,0,x5,0,1}
FFT and DCT
a4 = {1,0,0,x5,0,0}, a5 =
Input and Output
Because in-place calculation is used, the input (for DCT and FWT) and the output (for all transforms) are not in order. The data therefore has to be reordered before and/or after the computation.
For FFT the input is in order and the output is in bit-reversed order, that is, sample number {s0,s1,s2,s3,s4,s5} is found at address {s5,s4,s3,s2,s1,s0}. For FWT the input and output are reordered as described by figure 3 a and b. For DCT the input is reordered as described by figure 3 c and the output is in bit-reversed order. In our implementation the reordering is built into the ports used for storing input data and reading output data.
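The bit-reversed output mapping stated above, where the six address bits appear in the opposite order to the six bits of the sample number, can be sketched in a few lines of Python. The function name `bit_reverse6` is illustrative; in the described hardware this mapping is built into the output port rather than computed.

```python
def bit_reverse6(sample):
    """Return the 6-bit address whose bits are those of `sample` in reverse order."""
    addr = 0
    for i in range(6):
        # Move bit i of the sample number to bit (5 - i) of the address.
        addr = (addr << 1) | ((sample >> i) & 1)
    return addr

def read_in_order(memory):
    """Read a 64-word result memory so that samples come out in natural order."""
    return [memory[bit_reverse6(n)] for n in range(64)]
```

Because reversing the bits twice gives back the original number, the same mapping serves both for locating a sample and for finding which sample an address holds.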
RESULT
The FFT/FWT/DCT processor described above has been implemented in VHDL and synthesized in a 0.13 µm process using Cadence Physically Knowledgeable Synthesis.
CONCLUSION
This paper has presented similarities that have been found between the radix-4 FFT algorithm and the modified Walsh transform, and has shown how these similarities have been exploited to design a converged processor for the 64-point FFT, the 64-point DCT and the modified Walsh transform. The proposed architecture is suitable for a baseband processor that needs to handle both OFDM standards like IEEE 802.11a and the IEEE 802.11b standard, which uses the modified Walsh transform for CCK demodulation. The performance exceeds by far the requirements of these two standards.
