A 1Gbps FPGA-based wireless baseband MIMO transceiver by Toal, C et al.
An FPGA 1Gbps Wireless Baseband MIMO Transceiver  
 
 
 
 
ABSTRACT 
This paper presents the design and 
implementation of a baseband transceiver for 
System on a Chip (SoC). The presented 
architecture utilizes a 4x4 Multiple-Input Multiple-
Output (MIMO) system and is capable of enabling 
1Gbps wireless transmission.  
 
I. INTRODUCTION 
Next-generation wireless networks are expected
to provide high-speed internet access anywhere
and at any time. The popularity of the iPhone and
other smartphones accelerates this trend and
creates new traffic demand. Consequently, there is
an increasing demand for high data rate
transmission in future wireless networks. However,
the data transmission rate is limited by the channel
capacity, the theoretical upper limit on the data
rate beyond which error-free transmission is
impossible.
 
II. BACKGROUND 
 
A.  Orthogonal Frequency Division 
Multiplexing 
OFDM is a method of encoding digital data on 
multiple carrier frequencies. A large number of 
closely spaced orthogonal sub-carrier signals are 
used to carry data. The data is divided into several 
parallel data streams or channels, one for each 
sub-carrier. Each sub-carrier is modulated with a
conventional modulation scheme (such as
quadrature amplitude modulation or phase-shift
keying) at a low symbol rate, maintaining total data
rates similar to conventional single-carrier
modulation schemes in the same bandwidth.
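As a minimal sketch of the idea, the sub-carrier
symbols can be turned into a time-domain OFDM
symbol with an inverse DFT and recovered with a
forward DFT. The function names and the direct
O(N^2) transform are illustrative assumptions; a
real design uses an FFT core.

```python
import cmath


def idft(freq):
    """Map one symbol per sub-carrier to N time-domain samples."""
    n = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(time):
    """Recover the per-sub-carrier symbols from N time-domain samples."""
    n = len(time)
    return [
        sum(time[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Eight QPSK symbols, one per sub-carrier (illustrative values).
symbols = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
recovered = dft(idft(symbols))
```

Because the sub-carriers are orthogonal over the
symbol period, `recovered` matches `symbols` up to
floating-point rounding.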
 
B. Multiple Input Multiple Output 
MIMO is the use of multiple antennas at both 
the transmitter and receiver to improve 
communication performance. 
 
III. RELATED WORK 
 
IV. DESIGN AND IMPLEMENTATION OF 4x4 
MIMO OFDM BASEBAND CIRCUIT 
 
A.  Transmitter Architecture 
Fig. 1 shows a block diagram of the 4x4 MIMO 
transmitter architecture. The data is broken into 
four separate and independent channels that will 
each be encoded and modulated for transmission.  
The transmitter must transmit preamble data
before each burst of OFDM frames. The
transmitter encodes and interleaves incoming data.
The bits are grouped and mapped to complex I and
Q data according to the modulation scheme. The
modulated symbols are converted to the time
domain via the IFFT and transmitted, each with its
cyclic prefix, as a series of OFDM frames.
The transmitter is preloaded with the frequency 
domain values for the short and long training 
sequences (STS and LTS), OFDM symbol pilots 
and a symbol mapper look-up table.  Uncoded data 
is streamed into the convolutional encoder. A 
generic convolutional encoder has been 
developed. Prior to logic synthesis, a user can
specify the data-path width, the code rate R and
the puncture pattern.
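The behaviour of such an encoder can be sketched
in software. This is a bit-serial model, not the
authors' circuit. It assumes the constraint-length-7
generator polynomials 133 and 171 (octal) used by
802.11a, which the text does not state, and the
puncture mask in the usage note is only an
illustrative pattern.

```python
def conv_encode(bits, polys=(0o133, 0o171), k=7, puncture=None):
    """Rate-1/len(polys) convolutional encoder with optional puncturing.

    polys and k are assumptions (802.11a values); puncture is a repeating
    0/1 keep-mask applied to the coded bit stream.
    """
    state = 0
    coded = []
    for b in bits:
        # Newest bit enters at the MSB of a k-bit window, older bits shift right.
        state = ((state >> 1) | (b << (k - 1))) & ((1 << k) - 1)
        for g in polys:
            # Each output bit is the parity of the taps selected by g.
            coded.append(bin(state & g).count("1") & 1)
    if puncture is not None:
        coded = [c for i, c in enumerate(coded) if puncture[i % len(puncture)]]
    return coded
```

For example, `conv_encode(bits, puncture=[1, 1, 1, 0, 0, 1])`
keeps four of every six coded bits, turning the
rate-1/2 mother code into a rate-3/4 code.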
The output of the convolutional encoder is 
passed to a block interleaver circuit. The block 
interleaver consists of two memories, implemented 
using a large register structure on the FPGA. (The 
interleaving pattern of this entity meant that it could 
not be implemented using the embedded block 
RAM resources). 
The dual memory system allows continual 
streaming of data. Only when an entire memory 
block is full can it be read out to the symbol 
mapper. As one memory is accepting data from the 
convolutional encoder, the other memory streams 
data out using the interleaving pattern specified
by the 802.11a standard. A local finite state
machine (FSM) controls the data flow through the
interleaver.

[Figure 1 block diagram, four parallel channels: test bench ROM, convolutional encoder (rate 1/2), block interleaver (Mem A / Mem B with FSM), symbol mapper ROM, OFDM symbol pilots LUT, STS and LTS ROMs, IFFT, cyclic prefix buffers A / B with FSM, JESD204A output.]
Figure 1: MIMO Transmitter Architecture.

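The 802.11a interleaving pattern referred to above
is a two-step permutation of the coded bits in one
OFDM symbol. The sketch below models it as an
index map; the function names are hypothetical, and
the 802.11a symbol sizes used in the usage note (for
example 192 coded bits per symbol for 16-QAM) are
an assumption, since the paper's 64-point and
512-point configurations may use different sizes.

```python
def interleave_map(n_cbps, n_bpsc):
    """Return perm where input bit k is written to output position perm[k].

    n_cbps: coded bits per OFDM symbol, n_bpsc: coded bits per sub-carrier.
    """
    s = max(n_bpsc // 2, 1)
    perm = []
    for k in range(n_cbps):
        # First permutation: adjacent coded bits go to non-adjacent sub-carriers.
        i = (n_cbps // 16) * (k % 16) + k // 16
        # Second permutation: alternate between more and less significant bits.
        j = s * (i // s) + (i + n_cbps - (16 * i) // n_cbps) % s
        perm.append(j)
    return perm


def interleave(bits, n_bpsc):
    perm = interleave_map(len(bits), n_bpsc)
    out = [0] * len(bits)
    for k, j in enumerate(perm):
        out[j] = bits[k]
    return out


def deinterleave(bits, n_bpsc):
    # The receiver uses the same map with the read and write roles exchanged.
    perm = interleave_map(len(bits), n_bpsc)
    return [bits[j] for j in perm]
```

Calling `interleave(block, 4)` on a 192-bit block
and then `deinterleave(..., 4)` on the result
returns the original block.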
The symbol mapper is a simple look-up
memory. The address of this memory is the output
of the block interleaver. The address width (that
is, the interleaver output width) defines the
modulation scheme: 1 bit for binary phase shift
keying (BPSK), 2 bits for QPSK, 4 bits for 16-QAM
and 6 bits for 64-QAM. Each address of the symbol
mapper LUT contains the I and Q values of the
corresponding constellation point.
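A look-up table of this kind can be generated
offline and loaded as a memory initialisation file.
The sketch below builds a 16-entry table for
16-QAM. The Gray-coded level assignment, the
1/sqrt(10) normalisation and the choice of the two
upper address bits for I are assumptions borrowed
from 802.11a, not details given in the paper.

```python
import math

# Assumed Gray mapping of a 2-bit group to an amplitude level.
GRAY_2BIT = {0b00: -3, 0b01: -1, 0b11: 1, 0b10: 3}


def build_16qam_lut():
    """Return a list of (I, Q) pairs indexed by the 4-bit mapper address."""
    scale = 1 / math.sqrt(10)  # gives unit average symbol power
    lut = []
    for addr in range(16):
        i_bits = (addr >> 2) & 0b11  # assumption: upper two bits select I
        q_bits = addr & 0b11         # assumption: lower two bits select Q
        lut.append((GRAY_2BIT[i_bits] * scale, GRAY_2BIT[q_bits] * scale))
    return lut


lut = build_16qam_lut()
```

In hardware each entry would additionally be
quantised to the fixed-point width of the data-path.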
The control path contains a master FSM which 
controls the transmission of each burst of OFDM 
frames including the preamble sequence. It 
enables the STS and LTS frequency domain 
symbols to be read out of their respective 
memories, and encoded data symbols to be read 
from an incoming FIFO and fed into the IFFT.  
The final block before transmission is the cyclic 
prefix (CP) block.  
 
[Figure 3 diagram: OFDM symbols (fft_real, fft_imag) from the IFFT are written into a cyclic prefix buffer addressed from 0 to 2*fft_size - 1 under a read/write state machine (signals sop, eop, nd, rfd) and read out to the digital IF processing.]
Figure 3: Transmitter Cyclic Prefix Architecture.
 
The transmitter cyclic prefix block is shown in Fig.
3. This entity consists of a single dual-port memory
element implemented in block RAM, and a control
state machine. The memory element is twice the
size of the OFDM frame, which is necessary to
enable continuous data streaming. The last 25% of
the OFDM symbol is selected as the cyclic prefix
and must be transmitted first, so a symbol cannot
be read out until it has been written in full.
Therefore, while one complete frame is being
transmitted through the read port of the memory,
the other half of the memory collects incoming
data through the write port.
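The read-side addressing of this double buffer can
be sketched as follows. The function name and the
`half` argument are hypothetical; the sketch only
models the address sequence, not the handshaking
signals.

```python
def cp_read_addresses(fft_size, half):
    """Read addresses for one OFDM symbol held in buffer half 0 or 1.

    The last quarter of the symbol is read first (the cyclic prefix),
    followed by the whole symbol.
    """
    base = half * fft_size
    cp_len = fft_size // 4
    prefix = range(base + fft_size - cp_len, base + fft_size)
    body = range(base, base + fft_size)
    return list(prefix) + list(body)
```

For an 8-point symbol in the lower half this gives
addresses 6, 7, 0, 1, ..., 7, so each symbol of
`fft_size` samples leaves the block as
`fft_size + fft_size // 4` samples.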
The symbol mapper lookup memory is 
duplicated in a second RAM. The dual port nature 
of the memory enables two look-up tables to 
service all four channels.  
The MIMO system control path differs in that it
must schedule preamble transmission from each
channel one at a time to allow channel estimation
at the receiver. Fig. 2 shows the MIMO system
preamble transmission pattern. STS data is
transmitted from channel 0 only: the STS is
required only for time synchronization, and
enabling a single transmitter ensures a clean
signal. LTS data is transmitted from all four
channels one after another. This is essential for
channel estimation at the receiver.
 
TX 0:  STS  LTS  -    -    -    DATA
TX 1:  -    -    LTS  -    -    DATA
TX 2:  -    -    -    LTS  -    DATA
TX 3:  -    -    -    -    LTS  DATA
Figure 2: MIMO Preamble Pattern.
 
B. Receiver Architecture 
The receiver must detect, demodulate and 
decode received OFDM symbols back to the 
original bit stream. Fig. 5 shows a diagram of the 
MIMO receiver architecture. 
 
[Figure 5 block diagram: four input buffers and a time synchroniser feed FFT 0 to FFT 3. The averaged (+, ÷2) LTS outputs fill the LTS frequency-domain buffers, which hold the channel estimates Ĥ00 to Ĥ33 with 512 entries each. A QR decomposition and matrix inverse block produces the inverted channel estimate matrices Ĥ⁻¹00 to Ĥ⁻¹33. The MIMO detector combines these with the frequency-domain data buffers r0 to r3 to give y0 to y3, each followed by a per-channel demapper, deinterleaver and Viterbi decoder.]
Figure 5: MIMO Receiver Architecture.

The first major entity on the receiver is the time
synchroniser. The time synchroniser is designed to
locate the start of a burst of OFDM frames when
the system is in idle mode. Fig. 4 shows a diagram
of the time synchroniser architecture.
 
 
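[Figure 4 diagram: a 32-stage shift register feeds parallel complex multipliers whose second inputs are the pre-calculated complex-conjugate time-domain values of the expected preamble samples. A pipelined adder tree sums the products, and a magnitude calculation, threshold comparator and control/decision logic assert the "time synchronisation locked" signal.]
Figure 4: Time Synchronizer Architecture. 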
 
The time synchronizer must locate the end of
the STS frame and the start of the LTS frame. The
circuit is preloaded with the complex conjugate
values of the last 16 STS symbols and the first 16
LTS symbols. The incoming data is correlated with
the pre-stored data. Every clock cycle, a sliding
window of 32 consecutive data samples is
multiplied with the 32 pre-stored preamble values
and summed. 32 parallel complex multipliers are
required along with a pipelined adder structure.
The magnitude of the resulting complex value is
calculated. A CORDIC block is used as it is much
more resource efficient than square-root
calculation logic. The CORDIC output is compared
with a stored threshold value (representing the final
STS to LTS transition peak). Once the signal is
greater than the threshold value, the system
assumes the start of a frame has been located.
The time synchronizer is implemented on the
FPGA using 128 18-bit multipliers, i.e. four real
multipliers for each of the 32 complex multipliers.
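A behavioural model of this correlator is sketched
below. It evaluates one window at a time rather
than all 32 products in parallel, and the function
and argument names are hypothetical. The threshold
is left as a parameter because the paper does not
give its value.

```python
def find_frame_start(samples, reference, threshold):
    """Return the first window index whose correlation magnitude exceeds
    the threshold, or None if no window does.

    reference holds the expected preamble samples; their complex
    conjugates play the role of the pre-stored values.
    """
    taps = [r.conjugate() for r in reference]
    n = len(taps)
    for start in range(len(samples) - n + 1):
        window = samples[start:start + n]
        acc = sum(s * t for s, t in zip(window, taps))
        if abs(acc) > threshold:  # magnitude, done by a CORDIC in hardware
            return start
    return None
```

With a unit-magnitude 32-sample reference, a
perfect match gives a correlation magnitude of 32,
while any partial overlap is bounded by the number
of overlapping samples.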
The input to the receiver contains a circular 
buffer. The buffer is large enough to handle time 
synchronizer latency. Once the start of frame is 
located, the LTS symbol minus the cyclic prefix is 
passed to the FFT.  Each subcarrier output is 
averaged from the two LTS frames and passed to 
the channel estimation block. Each LTS symbol is 
averaged using an adder followed by right-shift 
logic. 
Data is streamed into all four channels and
stored temporarily in four individual circular
buffers. Once the time synchronizer indicates
that the start of frame is located, the received LTS
data, minus the cyclic prefix, is streamed into all
four FFTs. For each subcarrier within the OFDM
symbol a 4x4 complex matrix is obtained. This is
the channel matrix. For each burst of OFDM
symbols an array of 16 memories is populated
with the channel matrices.
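One way to see how the staggered LTS transmissions
yield a full channel matrix is the sketch below.
It assumes that each entry is obtained by dividing
the received LTS value by the known transmitted
value on that subcarrier; the paper does not spell
out this step, and the data layout and names are
hypothetical.

```python
def estimate_channel(rx_lts, lts_ref):
    """Per-subcarrier channel matrices from time-staggered LTS symbols.

    rx_lts[j][i][k]: FFT output on RX antenna i, subcarrier k, in the slot
    where only TX antenna j sends the LTS.
    lts_ref[k]: known LTS value on subcarrier k (0 marks an unused one).
    Returns h where h[k][i][j] estimates the path from TX j to RX i.
    """
    n_tx, n_rx = len(rx_lts), len(rx_lts[0])
    h = []
    for k, ref in enumerate(lts_ref):
        if ref == 0:
            h.append(None)  # nothing was transmitted on this subcarrier
            continue
        h.append([[rx_lts[j][i][k] / ref for j in range(n_tx)]
                  for i in range(n_rx)])
    return h
```

Because only one transmitter is active in each LTS
slot, each slot fills one column of the matrix for
every subcarrier.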
Once the channel matrix is obtained, the
channel estimation process takes place. The
channel estimator essentially computes the inverse
of the channel matrix for every subcarrier.
Matrix inversion is a computationally intensive
calculation and, in order to implement it
efficiently, QR decomposition is performed on the
channel matrix before inversion.
Fig. 5 shows the matrix inversion process. The
channel matrix is decomposed into two separate
matrices, a Q matrix and an upper triangular matrix
R. The inverse of the channel matrix is calculated
by multiplying the inverse of the R matrix by the
(conjugate) transpose of the Q matrix.
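The identity behind this step follows from the
properties of the decomposition. As a short
derivation, assuming Q is unitary (so that for
complex data the transpose is read as the conjugate
transpose, written Q^H here):

```latex
\begin{align*}
H &= Q R, \qquad Q^{H} Q = I \\
H^{-1} &= (Q R)^{-1} = R^{-1} Q^{-1} = R^{-1} Q^{H}
\end{align*}
```

Only the triangular matrix R needs an explicit
inverse, which reduces to back-substitution.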
 
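[Figure 5 diagram: H enters the QRD block, which outputs Q and R; R passes through the INV block to give R^-1; a multiply block combines R^-1 with Q^T to give H^-1.]
Figure 5: Matrix Inversion Process.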
 
The channel matrix H is decomposed into a Q
matrix and an upper triangular matrix R using a
large systolic array of CORDIC elements. QR
decomposition is achieved by implementing the
three-angle complex rotation algorithm. Both
rotational and vectoring CORDICs are
implemented using ALUTs and registers. Fig. 6
shows a block diagram of the QRD circuit used to
compute the R matrix.
 
[Figure 6 diagram: triangular systolic array. Each boundary cell contains two vectoring CORDICs, CORDIC(V), that take Re(b), Im(b) and |a| and output the angles -θb and -θ1. Each internal cell contains three rotation CORDICs, CORDIC(R), that take Re(b), Im(b) and the angles and output Re(a'), Im(a'), Re(b') and Im(b').]
Figure 6: QR Decomposition Circuit To Compute R Matrix.
 
This array consists of four boundary cells and
six internal cells. The boundary cells each contain
two CORDICs that operate in vectoring mode. The
internal cells each consist of three CORDICs that
operate in rotation mode.
The boundary cells perform the vectoring 
calculation by rotating the complex value to the x-
axis. The angles that have been rotated through to 
do this are passed horizontally through the systolic 
array. The internal cells rotate the complex values 
that travel from top to bottom by the angles that are 
passed from left to right.  
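The two CORDIC modes can be modelled in floating
point as below. This is a generic textbook CORDIC,
not the authors' fixed-point cell; the iteration
count of 20 is an assumption chosen to echo a
one-iteration-per-stage pipeline, and the function
names are hypothetical.

```python
import math

N_ITER = 20  # assumption: one micro-rotation per pipeline stage
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))


def cordic_vectoring(x, y):
    """Rotate (x, y) onto the positive x-axis; return (magnitude, angle)."""
    z = 0.0
    if x < 0:  # pre-rotate by pi so the iterations converge
        x, y, z = -x, -y, (math.pi if y >= 0 else -math.pi)
    for i, a in enumerate(ANGLES):
        d = -1.0 if y > 0 else 1.0  # always drive y towards zero
        x, y, z = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i, z - d * a
    return x / GAIN, z


def cordic_rotation(x, y, angle):
    """Rotate (x, y) counter-clockwise by angle (radians, within +/- pi)."""
    z = angle
    if z > math.pi / 2:
        x, y, z = -x, -y, z - math.pi
    elif z < -math.pi / 2:
        x, y, z = -x, -y, z + math.pi
    for i, a in enumerate(ANGLES):
        d = 1.0 if z >= 0 else -1.0  # always drive the residual angle to zero
        x, y, z = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i, z - d * a
    return x / GAIN, y / GAIN
```

A boundary cell corresponds to `cordic_vectoring`,
which produces the angle, and an internal cell to
`cordic_rotation` applied with the negated angle, so
rotating the same input by `-angle` leaves it on the
x-axis.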
The systolic array for calculating the R matrix
is connected directly to the array for calculating the
Q matrix. The structure of this array is shown in
Fig. 7.
 
[Figure 7 diagram: rectangular systolic array of internal cells. Each cell contains three rotation CORDICs, CORDIC(R), with the same ports as the internal cells of Fig. 6: inputs Re(b), Im(b) and outputs Re(a'), Im(a'), Re(b'), Im(b').]
Figure 7: QR Decomposition Circuit To Compute Q Matrix.
 
The channel matrix data enters the systolic
array from the top in the pattern illustrated in
Fig. 8. Each CORDIC element is pipelined to
maintain a high clock speed and has a latency of
20 clock cycles. The QRD circuit therefore has
a data-path latency of 440 clock cycles.
[Figure 8 diagram: the channel matrix elements H00 to H33 are input to the R matrix array and the elements of a 4x4 identity matrix to the Q matrix array, in a staggered pattern.]
Figure 8: QR Matrix Data Flow.
 
A scheduler has been implemented which 
controls the read address of the channel matrix 
memories and multiplexes the outputs of these 
memories into the systolic array. Initially data is 
only read from H00 memory and input to the first 
column of the QRD array. The first 20 addresses 
are read in, corresponding with the CORDIC 
latency. On the next clock cycle, data from H01 
memory is passed into the first QRD array column 
and data from H10 memory is passed into the 
second column. 
Once address 20 of memory H33 has been
entered into the first column, the address pointer of
QR array column 0 points back to memory H00 and
the first element of the channel matrix for
subcarrier 21 is accessed. At this point an init
signal is set that resets all the feedback elements
of the current QRD cell. This init signal propagates
through the QRD array's data-path to ensure the
calculations are fully synchronous.
The R matrices are captured and passed to the 
R inverse calculation block. The R inverse block 
calculates the following equations: 
\begin{align*}
R^{-1}(3,3) &= 1/R(3,3) \\
R^{-1}(2,2) &= 1/R(2,2) \\
R^{-1}(2,3) &= -R(2,3)\,R^{-1}(3,3)/R(2,2) \\
R^{-1}(1,1) &= 1/R(1,1) \\
R^{-1}(1,2) &= -R(1,2)\,R^{-1}(2,2)/R(1,1) \\
R^{-1}(1,3) &= -\bigl(R(1,2)\,R^{-1}(2,3) + R(1,3)\,R^{-1}(3,3)\bigr)/R(1,1) \\
R^{-1}(0,0) &= 1/R(0,0) \\
R^{-1}(0,1) &= -R(0,1)\,R^{-1}(1,1)/R(0,0) \\
R^{-1}(0,2) &= -\bigl(R(0,1)\,R^{-1}(1,2) + R(0,2)\,R^{-1}(2,2)\bigr)/R(0,0) \\
R^{-1}(0,3) &= -\bigl(R(0,1)\,R^{-1}(1,3) + R(0,2)\,R^{-1}(2,3) + R(0,3)\,R^{-1}(3,3)\bigr)/R(0,0)
\end{align*}
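These ten equations are the 4x4 case of
back-substitution for an upper triangular matrix.
The sketch below is a software reference model of
that recurrence, not the pipelined circuit, and the
function name is hypothetical. It computes the
terms in the same bottom-up order.

```python
def invert_upper_triangular(r):
    """Invert an upper triangular matrix given as a list of rows."""
    n = len(r)
    inv = [[0j] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):  # bottom row first, as in the equations
        inv[i][i] = 1 / r[i][i]
        for j in range(i + 1, n):
            # Uses entries of rows below i, which are already available.
            acc = sum(r[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = -acc / r[i][i]
    return inv
```

The dependency of each off-diagonal term on rows
below it is what forces the hardware to delay some
operands with shift registers.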
 
This circuit is heavily pipelined. Many shift
registers are required because some of the terms
require more computation than others and because
the calculation of some matrix terms requires the
result of other matrix terms (e.g. the R^{-1}(2,3)
calculation requires R^{-1}(3,3)).
The R inverse matrices and Q transpose
matrices are stored in an array of memories. Once
all subcarrier matrices have been calculated,
they are streamed out one subcarrier at a time and
fed into a 4x4 matrix multiplication block. Each
resulting matrix is the inverse of the channel
matrix, which is itself stored in a 4x4 array of
dual-port memory blocks.
The entire channel estimation process has a
very long latency, so OFDM data frames are
buffered in FIFOs while it runs. MIMO decoding
takes place after the channel estimation process
completes. OFDM data is read out of the four
channel FIFOs. The corresponding channel
estimation matrix is read out of the corresponding
subcarrier location of the 16 channel estimate
memories. The OFDM data and the channel
estimation data are multiplied together in the form
of a matrix multiplication. This multiplication
results in the equalized OFDM data. There are now
four equalized and independent OFDM data
streams.
OFDM frames are stored in a FIFO at the output
of the FFT block. This acts as a buffer so no data
is lost whilst the channel estimates are computed.
Once the channel estimates are all computed, data
is read out of the FFT output FIFO. The
corresponding channel estimate is read from the
channel estimation memory block and equalization
is performed on a carrier-by-carrier basis via a
single complex multiplication.
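The MIMO decoding step, multiplying the received
vector by the stored inverse channel matrix for one
subcarrier, can be sketched as a matrix-vector
product. This form of detection is commonly called
zero-forcing; the paper does not use that name, and
the function name is hypothetical.

```python
def mimo_detect(h_inv, r):
    """Equalise one subcarrier: y = H^-1 r.

    h_inv: inverse channel matrix as a list of rows.
    r: received frequency-domain samples, one per RX channel.
    """
    return [sum(h_inv[i][j] * r[j] for j in range(len(r)))
            for i in range(len(h_inv))]


# 2x2 illustration: H = [[1, 2], [3, 4]] and transmitted x = [1+1j, -1].
h_inv = [[-2.0, 1.0], [1.5, -0.5]]
r = [-1 + 1j, -1 + 3j]  # r = H x
y = mimo_detect(h_inv, r)
```

Here `y` recovers the transmitted pair `[1+1j, -1]`.
The 4x4 case is the same operation with 16
complex multiplications per subcarrier.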
The pilot tones are extracted and de-
scrambled. The average value of the pilot tones is 
calculated and phase correction is performed on 
the entire OFDM symbol by multiplying each 
subcarrier by the pilot tone average. 
The next step on the receiver data-path is to 
perform feed forward timing synchronization. Again 
the (now phase-corrected) pilot tones must be 
extracted. Each pilot tone is divided by its 
subcarrier number and then the average is 
calculated to determine the feed-forward time 
synchronization value, Tau. 
Each subcarrier must be time corrected by 
adding the relevant Tau value to the real 
component and by subtracting it from the 
imaginary component. The relevant Tau value for 
each subcarrier is simply the subcarrier number 
multiplied by the Tau value. In order to simplify this 
process, a running adder is used. Each clock 
cycle, as the time correction is performed on each 
incrementing subcarrier, the Tau value is also 
incremented using a feedback adder. 
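The running adder can be sketched as below. The
model uses integers to stand in for fixed-point
values and assumes the subcarrier number starts at
zero, which the text does not state; the function
name is hypothetical.

```python
def time_correct(symbol, tau):
    """Apply the per-subcarrier time correction with an accumulator.

    symbol: list of (re, im) pairs ordered by subcarrier number.
    tau: feed-forward time synchronization value.
    The accumulator holds k * tau for subcarrier k, so no multiplier
    is needed.
    """
    acc = 0
    corrected = []
    for re, im in symbol:
        corrected.append((re + acc, im - acc))
        acc += tau  # feedback adder: advance to the next subcarrier
    return corrected
```

For example, `time_correct([(10, 10)] * 4, 3)`
applies corrections of 0, 3, 6 and 9 to the four
subcarriers.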
The symbol demapper is implemented using a 
decoder-multiplexer structure. The symbol 
demapper can be set up to perform hard or soft 
symbol demapping. 
The output of the symbol demapper is fed into
a block de-interleaver. The block de-interleaver
has the same structure as the interleaver on the
transmitter, except that the read and write address
patterns are reversed. The de-interleaver has an
additional generic parameter, since it must be able
to store either the soft or the hard bit
representation of the data in every bit location.
Error correction is performed using the Viterbi 
decoder.  
 
V. IMPLEMENTATION 
 
A. Transmitter 
Table 1 contains synthesis figures for the
MIMO transmitter when configured with 16-QAM
and 64-point OFDM. Table 2 shows the resource
utilization for each of the main processing blocks
within the MIMO transmitter. Each element within
this circuit is very similar to that of the SISO
system. The greater resources required are simply
due to replication for the four channels.
Again, for a 512-point OFDM system the IFFT 
and interleaver will require eight times as many 
resources. The system will require approximately 
eight times as many memory bits. 
A clock frequency of 100 MHz is again 
achieved for the MIMO transmitter. 
Table 1: MIMO Transmitter Synthesis Results.
Resource            Used      Available    % Used
ALUTs               33,423    424,960      7.8
Registers           12,320    424,960      2.9
Memory bits         265,408   21,233,664   1.2
18-bit DSP blocks   32        1,024        3.1
Table 2: Resource Utilization By Entity.
Function            ALUTs    Registers   Memory bits   18-bit DSP
Conv encoder        32       136         0             0
Block interleaver   28,016   1,730       0             0
IFFT                3,854    9,152       8,896         32
Cyclic prefix       40       128         0             0
 
 
B. Receiver 
Table 3 contains synthesis figures for the 
MIMO receiver when configured with 16-QAM, and 
64-point OFDM. Table 4 shows the resource 
utilisation of the MIMO receiver by entity. The 
channel estimation and equalisation blocks (R 
matrix inverse, MIMO decoder, QR decomposition 
and QR multiplier) account for 86% of the ALUTS 
and 77% of the DSP multipliers within the circuit. 
The size and complexity of the channel 
estimation and equalisation blocks will remain 
constant with respect to OFDM frame size. 
However, for larger OFDM frame sizes the 
processing latency will increase so that for a 512-
point OFDM system, the number of memory bits 
required increases by a factor of approximately 
eight. 
There are plenty of memory resources 
available on the FPGA to accommodate a 512-
point OFDM system. A clock frequency of 100 MHz 
is achieved for the MIMO receiver system. 
Table 3: MIMO Receiver Synthesis Results.
Resource            Used      Available    % Used
ALUTs               183,957   424,960      43.2
Registers           173,335   424,960      40.7
Memory bits         367,060   21,233,664   1.72
18-bit DSP blocks   896       1,024        87.5
Table 4: Resource Utilization By Entity.
Function              ALUTs     Registers   Memory bits   18-bit DSP
Block deinterleaver   13,772    1,772       0             0
FFT                   3,196     9,650       10,736        64
Time synchroniser     3,557     8,983       0             128
Viterbi decoder       5,028     2,848       18,460        0
R matrix inverse      55,431    31,711      6,226         56
MIMO decoder          1,036     768         0             128
QR decomposition      101,697   109,447     322           248
QR multiplier         1,368     1,169       0             256
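The shares quoted for the channel estimation and
equalisation blocks can be cross-checked from the
table figures. The short calculation below assumes
the percentages are taken against the Table 3
totals, which the text does not state explicitly.

```python
TOTALS = {"aluts": 183_957, "dsp": 896}  # from Table 3

# (ALUTs, 18-bit DSP blocks) from Table 4
ESTIMATION_BLOCKS = {
    "R matrix inverse": (55_431, 56),
    "MIMO decoder": (1_036, 128),
    "QR decomposition": (101_697, 248),
    "QR multiplier": (1_368, 256),
}

alut_share = sum(a for a, _ in ESTIMATION_BLOCKS.values()) / TOTALS["aluts"]
dsp_share = sum(d for _, d in ESTIMATION_BLOCKS.values()) / TOTALS["dsp"]
print(f"ALUTs: {alut_share:.1%}, DSP: {dsp_share:.1%}")
```

Under that assumption the four blocks account for
roughly 86.7% of the ALUTs and 76.8% of the DSP
multipliers, in line with the 86% and 77% quoted in
the text.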
 
