In this paper, a prototype design for the Physical Layer of the IEEE 802.11a standard, which is based on the Orthogonal Frequency Division Multiplexing (OFDM) technique, is presented. Implementation aspects of an OFDM modem on a Xilinx field-programmable gate array (FPGA) are addressed. The system includes synchronization circuitry used for packet detection and time synchronization.
Introduction
Multi-carrier modems employing baseband OFDM are becoming increasingly desirable for high-speed mobile digital transmission systems. The basic principle of OFDM is to divide a high-rate data stream into a number of lower-rate streams that are transmitted over a number of multiplexed orthogonal subcarriers [1]. The orthogonality of the subcarriers makes the spectral efficiency of OFDM very high. The lower symbol rate on each subcarrier and the insertion of a guard time interval into each symbol enable the system to deal with channel time dispersion caused by multipath delay spread.
As specified by IEEE 802.11a, the physical layer is based on a 52-carrier OFDM modulation scheme including 48 subcarriers for data symbols and 4 pilot subcarriers for the channel estimation process [2]. The block diagram of the implemented modem is shown in Figure 1. The transmitter includes symbol mapper, IFFT and cyclic extension blocks. The IFFT block generates the subcarriers and the cyclic extension block adds the guard time interval to the OFDM symbol.
The receiver consists of the guard time removal, FFT, symbol demapper and synchronizer blocks.
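The transmitter and receiver chains described above can be illustrated with a minimal floating-point sketch. It assumes the usual 802.11a layout (64-point transform, subcarriers -26 to 26 without DC, pilots at -21, -7, 7 and 21, and a 16-sample guard interval). The function names, the naive DFT and the constant pilot value are illustrative placeholders and are not taken from the implemented modem.

```python
import cmath

N_FFT = 64                       # IFFT/FFT size
CP_LEN = 16                      # guard interval length in samples (assumed)
PILOT_IDX = (-21, -7, 7, 21)     # pilot subcarrier positions (assumed layout)
# 48 data subcarriers: -26..26 without DC and without the pilots
DATA_IDX = [k for k in range(-26, 27) if k != 0 and k not in PILOT_IDX]


def idft(bins):
    """Naive inverse DFT; stands in for the IFFT block."""
    n_pts = len(bins)
    return [sum(bins[k] * cmath.exp(2j * cmath.pi * k * n / n_pts)
                for k in range(n_pts)) / n_pts
            for n in range(n_pts)]


def dft(samples):
    """Naive forward DFT; stands in for the FFT block."""
    n_pts = len(samples)
    return [sum(samples[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]


def ofdm_modulate(data_symbols, pilot_value=1 + 0j):
    """Map 48 data symbols and 4 pilots onto 64 bins, IFFT, add cyclic prefix."""
    if len(data_symbols) != len(DATA_IDX):
        raise ValueError("expected 48 data symbols")
    bins = [0j] * N_FFT
    for k, symbol in zip(DATA_IDX, data_symbols):
        bins[k % N_FFT] = symbol          # negative indices wrap to upper bins
    for k in PILOT_IDX:
        bins[k % N_FFT] = pilot_value     # placeholder pilot value
    time_samples = idft(bins)
    return time_samples[-CP_LEN:] + time_samples   # cyclic extension


def ofdm_demodulate(samples):
    """Remove the guard interval, FFT, and extract the 48 data subcarriers."""
    bins = dft(samples[CP_LEN:CP_LEN + N_FFT])
    return [bins[k % N_FFT] for k in DATA_IDX]
```

In this sketch one OFDM symbol is 80 samples long (16 guard samples plus 64 useful samples), and demodulating an undistorted symbol returns the original 48 data symbols up to rounding error.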
In OFDM systems, the complexity of synchronization is one of the major difficulties. For packet-oriented applications like IEEE 802.11a, a preamble is sent as the training sequence at the beginning of the transmission. Packet detection is the initial task of the synchronization process. The training sequence is also used to achieve frequency synchronization, which avoids Inter-Carrier Interference (ICI), and to estimate symbol timing, which determines the position of the Fast Fourier Transform (FFT) window. The transmitter and receiver implementations are demonstrated in Sections 3 and 4. The synchronization is discussed in Section 5.
Finally, Section 6 provides the timing and area results of mapping the proposed OFDM modem into FPGA. Conclusions are given in Section 7.
Design Flow
The design flow starts with floating-point modeling and simulation using Cadence Signal Processing Worksystem (SPW). As the first step, algorithms for different parts of the system (including synchronization algorithms) are selected by performing system simulations. The Bit Error Rate (BER) versus bit energy to noise spectral density ($E_b/N_0$) curve is used to verify the performance of the system. Then, the floating-point model is converted to a fixed-point model in SPW and the arithmetic precisions for different parts of the system are identified.
In the next step, fixed-point blocks are replaced with Hardware Design System (HDS) blocks to automatically generate VHDL code. The HDS library does not have equivalent blocks for all fixed-point blocks. In these cases, VHDL code is prepared manually or imported from Intellectual Property (IP) cores such as the Xilinx Coregen library. Another reason to use IP cores is that HDS-generated VHDL code may not be optimized for the targeted hardware. By comparing the BER curve from this step with the one obtained from the floating-point model simulation, it is possible to verify that the HDS model has the required functional performance. The input data of the floating-point model is converted to fixed-point and applied to the HDS model or VHDL code. Therefore, all signals, including the OFDM symbols and received data in each step of the design, can be compared with the results of previous steps.
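The floating-point to fixed-point conversion and the step-by-step signal comparison can be sketched as follows. The 16-bit two's-complement word length is the one used for the mapper output in this design; the position of the binary point (15 fractional bits) and the saturating behaviour are assumptions made only for this example, and the function names are hypothetical.

```python
WORD_BITS = 16                 # word length used for the mapper output
FRAC_BITS = 15                 # assumed binary point position (Q1.15)
SCALE = 1 << FRAC_BITS
Q_MIN = -(1 << (WORD_BITS - 1))
Q_MAX = (1 << (WORD_BITS - 1)) - 1


def to_fixed(value):
    """Quantize a real value to a saturated 16-bit two's-complement integer."""
    return max(Q_MIN, min(Q_MAX, int(round(value * SCALE))))


def from_fixed(code):
    """Convert a fixed-point integer code back to a real value."""
    return code / SCALE


def quantize_complex(sample):
    """Quantize the real and imaginary parts of a complex sample separately."""
    return complex(from_fixed(to_fixed(sample.real)),
                   from_fixed(to_fixed(sample.imag)))


def max_abs_error(reference, quantized):
    """Largest deviation between a floating-point and a fixed-point signal."""
    return max(abs(a - b) for a, b in zip(reference, quantized))
```

Comparing `max_abs_error` between the floating-point reference and each later model gives a simple per-signal check alongside the BER curves.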
Then the VHDL code is synthesized in Synopsys. The final step is to map the circuitry into the FPGA. Timing and area analysis is done after the circuitry is realized. If the timing and area constraints imposed by the standard specification are not met, the models have to be modified and the above steps repeated.
Transmitter Implementation
Figure 2 depicts the proposed transmitter circuitry. The short and long preamble symbols are stored in "ROM1" and are transmitted at the beginning of the transmission mode. All input data are continuously mapped and stored in "Dual Port RAM1" within 4 µs time periods. The four pilot symbols can also be inserted in this block from the other input port of "Dual Port RAM1". The real and imaginary components of the input data form a vector of 64 samples, and each sample is a 16-bit 2's complement number. These components are generated by the mapper block.
Imported from the Coregen library, the IFFT block computes a 64-point complex fast Fourier transform employing a Cooley-Tukey radix-4 decimation-in-frequency algorithm. To meet the IEEE 802.11a standard, the IFFT period needs to be 3.2 µs, which requires a highly pipelined architecture. For this purpose, the Dual-Memory-Space (DMS) configuration of the IFFT core is utilized. The FFT computation process has three phases: Data Load Phase, Compute Phase and Result Upload Phase. Each phase needs several clock cycles. DMS mode allows input, computation and output operations to be overlapped, and all computation is done within 192 clocks. The four blocks "Mux1", "Dual Port RAM2", "IFFT" and "Dual Port RAM3" in Figure 2 are used for IFFT computation. To simplify the diagram, some blocks, including the central circuit controller, are not drawn. The controller consists of a counter, which counts between 0 and 287, and some combinational logic. Together they control all the blocks in the circuitry at each state.
Receiver Implementation
Figure 3 shows the proposed receiver circuitry. Most of the blocks are connected to a controller circuit that is not shown in this diagram. The controller block is a counter similar to the one in the transmitter; however, the associated combinational logic is designed to schedule the blocks shown in Figure 3. "Dual Port RAM1" in the diagram is used for cyclic prefix removal and stores the incoming OFDM symbols at a 20 MHz clock rate. The Coregen FFT core with DMS configuration computes the FFT components of the received data. The four blocks "Mux1", "Dual Port RAM2", "FFT Core" and "Dual Port RAM3" are used for 64-point FFT computations. The clock speed for these blocks is 72 MHz. Pilot symbols are removed after FFT computation; they are available at output port A of "Dual Port RAM3". The 48 extracted data symbols are stored in "Dual Port RAM4". The final Parallel to Serial block converts the demapped integer symbols into the output bit streams.
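The clock figures quoted for the transmitter and receiver can be cross-checked with a short calculation. The sketch below assumes that the controller counter runs from the 72 MHz clock and wraps once per 4 µs OFDM symbol, which appears to match the 0 to 287 count range; the variable names are illustrative.

```python
from fractions import Fraction

CORE_CLOCK_HZ = 72_000_000                  # clock quoted for the FFT blocks
SAMPLE_CLOCK_HZ = 20_000_000                # clock quoted for incoming OFDM samples
SYMBOL_PERIOD_S = Fraction(4, 1_000_000)    # 4 us OFDM symbol period
IFFT_PERIOD_S = Fraction(32, 10_000_000)    # 3.2 us IFFT period
COMPUTE_CLOCKS = 192                        # clocks quoted for the computation


def clocks_in(duration_s, clock_hz):
    """Number of clock periods that fit into a duration (exact arithmetic)."""
    return duration_s * clock_hz


# Counter states per symbol at 72 MHz (assumed to be what the controller counts).
controller_states = clocks_in(SYMBOL_PERIOD_S, CORE_CLOCK_HZ)

# Clock budget available for one transform inside the 3.2 us IFFT period.
ifft_clock_budget = clocks_in(IFFT_PERIOD_S, CORE_CLOCK_HZ)

# Samples per OFDM symbol at the 20 MHz sample clock, guard interval included.
samples_per_symbol = clocks_in(SYMBOL_PERIOD_S, SAMPLE_CLOCK_HZ)
```

Under these assumptions the 192 compute clocks fit inside the 3.2 µs budget, and one symbol period spans exactly the 288 counter states.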

OFDM Synchronization Circuitry
Synchronization is one of the major tasks of any receiver. For an OFDM receiver, it has to be done before demodulating the subcarriers. Usually, there are at least two synchronization tasks: timing synchronization and frequency synchronization. The frequency synchronizer has to estimate and correct carrier frequency offsets of the received signal. The timing synchronizer, on the other hand, has to find the symbol boundaries to prevent ISI and ICI. Several schemes for the synchronization of this system are proposed in [3], [4] and [5]. The first task of the synchronization circuitry in an IEEE 802.11a receiver is to detect the training symbols and the starting point of the data frame. The short preamble symbols are used for frame detection, coarse frequency offset estimation and large-scale time synchronization. The long preamble training symbols are utilized for fine frequency tuning and for improving channel estimation accuracy. A pilot-assisted phase tracker enhances synchronization by removing the residual phase errors.
One of the properties of the short preamble symbol in the IEEE 802.11a standard is its periodicity. We can exploit this property by correlating the incoming signal with a delayed version of the same signal. This is called "Delayed Correlation". The complex correlation of the input symbols $r_n$ in a window of length N is computed according to the following equation:
A modification [5] can be applied here to rewrite the summation of Equation (1) into a recursive structure as given by:
where N is the length of the sliding window. The average power of the input frame in the time window N can be calculated by:
If we have L complex samples in one-half of the first training symbol (excluding the cyclic prefix), then the estimation window in Equations (1) and (4) will be L. The starting point of a frame can be estimated from the following equation:
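The quantities described in this section can be written out explicitly. The following is a sketch in assumed notation (causal indexing, a delay of $D$ samples, window length $N$, correlation $S_n$, power $P_n$ and metric $M_n$); it is not necessarily the exact form or numbering of Equations (1) to (5). The delayed correlation and its recursive form are

$$S_n = \sum_{k=0}^{N-1} r_{n-k}\, r^{*}_{n-k-D}$$

$$S_n = S_{n-1} + r_n\, r^{*}_{n-D} - r_{n-N}\, r^{*}_{n-N-D}$$

the average power over the same window is

$$P_n = \sum_{k=0}^{N-1} \left| r_{n-k} \right|^2$$

and a normalized metric for locating the start of a frame is

$$M_n = \frac{\left| S_n \right|^2}{P_n^2}$$

The recursive form follows from subtracting $S_{n-1}$ from $S_n$: all terms cancel except the newest product, which enters the window, and the oldest product, which leaves it.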
The above criterion is a timing metric. The synchronization schemes proposed in [4] and [5] are based on this parameter. To apply the above equation, we need a complex multiplier, two adders (one for addition and one for subtraction) and two delay blocks, as shown in Figure 4. The average power estimation defined in Equation (4) can also be done with the same technique (called the recursive summation structure). In Figure 4, the input signal r(t) is correlated with a τ-delayed version of itself. The parameter T is the sampling time. Each delay τ is equal to 16T. The * sign denotes the complex conjugate of a signal.
$C_{snr}$ is a coefficient less than 1, which acts as the threshold in the following equation:

$$\left| S_n \right|^2 \geq \text{threshold} \cdot P_n^2 \qquad (6)$$

The model is simulated in an AWGN channel, and the result shows that the system is able to detect the preambles correctly with SNR more than 15 when $C_{snr}$ is equal to 0.81. For SNR less than 15, the system can still detect the preamble, but $C_{snr}$ should be given lower values. A counter (not shown in the figure) observes the validity of Equation (6) for five short preamble symbols. The square of the correlated symbols should be more than the square of the power for 5 symbols.
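The detection rule can be sketched in software as follows. This is a floating-point illustration under stated assumptions, not the implemented circuit: the window length is taken equal to the 16-sample delay, the power is measured on the undelayed samples, and "five short preamble symbols" is interpreted as the threshold test holding for 80 consecutive samples. All function names and the synthetic test signal are hypothetical.

```python
import cmath
import random

D = 16            # delay in samples (tau = 16 T)
N = 16            # sliding window length (assumed equal to the delay)
C_SNR = 0.81      # threshold coefficient quoted in the text
HOLD = 5 * D      # assumed: test must hold for five 16-sample periods


def delayed_correlation(r):
    """Return correlation S and power P per sample, using recursive sums."""
    S, P = [], []
    s, p = 0j, 0.0
    for n in range(len(r)):
        if n >= D:                       # newest product enters the window
            s += r[n] * r[n - D].conjugate()
            p += abs(r[n]) ** 2
        if n - N >= D:                   # oldest product leaves the window
            s -= r[n - N] * r[n - N - D].conjugate()
            p -= abs(r[n - N]) ** 2
        S.append(s)
        P.append(p)
    return S, P


def direct_correlation(r, n):
    """Same correlation as a plain summation, for cross-checking."""
    return sum(r[n - k] * r[n - k - D].conjugate()
               for k in range(N) if n - k - D >= 0)


def detect_packet(r):
    """Index where |S|^2 >= C_SNR * P^2 starts to hold for HOLD samples."""
    S, P = delayed_correlation(r)
    run = 0
    for n in range(len(r)):
        if P[n] > 0 and abs(S[n]) ** 2 >= C_SNR * P[n] ** 2:
            run += 1
            if run >= HOLD:
                return n - HOLD + 1
        else:
            run = 0
    return None


def make_test_signal(seed=1, noise_std=0.05):
    """Noise, then ten repetitions of a random 16-sample pattern, then noise."""
    rng = random.Random(seed)
    pattern = [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(D)]
    clean = [0j] * 200 + pattern * 10 + [0j] * 100
    return [c + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
            for c in clean]
```

The recursive update needs only one complex multiplication, one addition and one subtraction per sample, which mirrors the multiplier, two adders and two delay blocks described for Figure 4.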
The real and imaginary parts of the correlation result are available at the output of the design. They are used for phase estimation and frequency synchronization. An algorithm called CORDIC [6] can convert their Cartesian representation to polar coordinates, which are necessary for frequency synchronization.
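The Cartesian-to-polar conversion can be illustrated with a vectoring-mode CORDIC sketch. It uses floating-point arithmetic for readability, whereas a hardware version would use shifts, adds and a small arctangent table. The iteration count and function name are assumptions for this example and do not describe the core referenced in [6].

```python
import math


def cordic_to_polar(x, y, iterations=16):
    """Return (magnitude, angle in radians) of x + jy via vectoring CORDIC."""
    angle = 0.0
    # Pre-rotate by +/-90 degrees so the vector lies in the right half-plane.
    if x < 0:
        if y >= 0:
            x, y, angle = y, -x, math.pi / 2
        else:
            x, y, angle = -y, x, -math.pi / 2
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i
        if y > 0:    # rotate clockwise to drive y towards zero
            x, y = x + y * step, y - x * step
            angle += math.atan(step)
        else:        # rotate counter-clockwise
            x, y = x - y * step, y + x * step
            angle -= math.atan(step)
        gain *= math.sqrt(1.0 + step * step)   # accumulated CORDIC gain
    return x / gain, angle
```

Applied to the real and imaginary parts of the correlation output, the returned angle is the phase that a frequency synchronizer would work from.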
Mapping OFDM modem into FPGA
The targeted hardware for implementing the circuitry is a Xilinx Virtex-II FPGA. The Virtex-II family delivers complete solutions for telecommunication, wireless, networking and DSP applications [7]. The device used for this design is the XC2V6000FF1152-4. There are 6 million gates on the device, including 33,792 slices, 144 multipliers and 144 RAM blocks (each RAM block is 18 Kbits). The internal clock speed on the device can go up to 420 MHz, which fully meets the needs of the OFDM circuitry.
After the mapping and routing phase, the area results are reported as summarized in Table 1 and Table 2.
The timing reports (from the Xilinx Timing Analyzer tool) show no timing conflict in the modem circuitry. The reported maximum net delay is 10.023 ns. Since the maximum clock speed used in the system is 72 MHz, the synthesized circuitry meets the timing constraints. According to the simulation results, the total time required for one OFDM symbol computation in the transmitter is 8.1 µs, or 583 clock cycles. However, the circuitry is highly pipelined, and the input data are fed into the circuitry continuously at bit rates of up to 72 Mbits/s.
The short and long training preambles need a total of 16 µs, or 1152 clock cycles, for transmission. This time can be overlapped with the time necessary for the initial input data modulation and computations.
On the receiver side, the circuitry needs 7.9 µs, or 569 clock cycles, to extract the transmitted bits.
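The reported latencies can be checked against the 72 MHz clock with a few lines of arithmetic. Treating the maximum net delay as a bound on the usable clock period is a simplification made only for this sketch, and the helper names are illustrative.

```python
CLOCK_MHZ = 72.0          # maximum clock speed used in the system
MAX_NET_DELAY_NS = 10.023 # reported maximum net delay


def cycles_to_us(cycles, clock_mhz=CLOCK_MHZ):
    """Duration in microseconds of a number of clock cycles."""
    return cycles / clock_mhz


def clock_period_ns(clock_mhz=CLOCK_MHZ):
    """Clock period in nanoseconds."""
    return 1000.0 / clock_mhz


transmitter_latency_us = cycles_to_us(583)    # one OFDM symbol in the transmitter
preamble_duration_us = cycles_to_us(1152)     # short plus long preambles
receiver_latency_us = cycles_to_us(569)       # bit extraction in the receiver
timing_slack_ns = clock_period_ns() - MAX_NET_DELAY_NS
```

The cycle counts reproduce the quoted 8.1 µs, 16 µs and 7.9 µs figures after rounding, and the 72 MHz clock period is longer than the reported net delay.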
Conclusion
In this paper, an OFDM modem prototype for the Physical Layer of the IEEE 802.11a standard has been designed and implemented. The proposed circuitry occupies a total of 3675 slices on the FPGA, which is approximately equal to 1,480,000 gates. The input data can be modulated at data rates of up to 72 Mbits/s. The design flow presented in this paper can be used for rapid prototyping of other algorithms. We have shown how the proposed OFDM modem can be modeled in floating point and fixed point. With this method it is possible to evaluate the complexity of the algorithms in a communication system.
