FPGA Prototyping of A High Data Rate LTE Uplink Baseband Receiver by Wang, Guohui et al.
Guohui Wang, Bei Yin, Kiarash Amiri, Yang Sun, Michael Wu, Joseph R. Cavallaro
Department of Electrical and Computer Engineering
Rice University, Houston, TX 77005
Email: {wgh, by2, kiaa, ysun, mbw2, cavallar}@rice.edu
Abstract—The Third Generation Partnership Project (3GPP)
Long Term Evolution (LTE) standard is becoming the appropriate
choice to pave the way for the next generation wireless
and cellular standards. While the popular OFDM technique
has been adopted and implemented in previous standards and
also in the LTE downlink, it suffers from a high peak-to-average
power ratio (PAPR). High PAPR requires more sophisticated
power amplifiers (PAs) in the handsets and results in lower
PA efficiency. To combat these effects, the LTE uplink
uses the Single Carrier Frequency Division Multiple Access
(SC-FDMA) scheme, which has lower PAPR due to its inherent
signal structure. While reducing the PAPR, SC-FDMA requires
a more complicated detector structure in the base station for
multi-antenna and multi-user scenarios. Since these scenarios are
critical parts of the LTE standard for delivering high performance
and data rate, it is important to design novel architectures that
ensure high reliability and data rate in the receiver. In this paper,
we propose a flexible architecture for a high data rate LTE uplink
receiver with multiple receive antennas and implement a single
FPGA prototype of this architecture. The architecture is verified
on WARPLab (a software defined radio platform based on the
Rice Wireless Open-access Research Platform) and tested in a
real over-the-air indoor channel.
I. INTRODUCTION
The uplink transmission in the 3rd Generation Partnership
Project (3GPP) Long Term Evolution (LTE) [1] is based
on single carrier frequency division multiple access (SC-FDMA),
which is a promising technique for high data rate
and low peak-to-average-power ratio (PAPR) uplink
communications in future cellular systems [2], [3]. Multiple-input
multiple-output (MIMO) wireless communication systems are
capable of providing data transmission at very high data
rates and reliability. However, the high data rate and high
complexity of LTE uplink receivers complicate the hardware
implementation. Hence, the system architecture should be
carefully designed to achieve a high data rate and good error-rate
performance.
This paper presents an architecture and an FPGA prototype
of an LTE uplink MIMO receiver. This work, to the best of the
authors' knowledge, is the first FPGA prototype of the LTE
uplink receiver that integrates several advanced algorithms and
features. The rest of the paper is organized as follows: Section
II presents the system model. Section III gives the algorithm and
structure of the MIMO detector in the LTE uplink receiver system.
The architecture of this receiver and the system
verification are described in Section IV. Sections V and VI
show the FPGA implementation of the proposed architecture
and briefly discuss system performance
and scalability. Finally, we conclude this paper in Section VII.
[Figure 1: block diagram. Blocks shown: Source, Outer Encoder,
Modulator, N-point DFT, Subcarrier mapping, M-point IDFT, Channel,
M-point DFT, Subcarrier Demapping/Equalization, N-point IDFT,
Inner Detector, Outer Decoder, Hard Decision, Binary Sink.]
Fig. 1. Linear model of MIMO system in LTE uplink receiver.
II. SYSTEM MODEL
Consider a MIMO LTE uplink system with $N_t$ transmit
antennas and $N_r$ receive antennas [4]. The vector of information
bits is first encoded with an error correcting code
and then interleaved to obtain the coded bits. We assume a
system with $2^Q$-ary modulation with $C_h$ symbols per block per
antenna. After modulation, the data sequence is multiplexed
into $N_t$ transmission blocks, each containing $C_h$ symbols. Let
$d^i_{p,n}$ ($i = 0, \cdots, C_h - 1$) denote the $i$th unit-energy symbol in
a subsequence transmitted by the $n$th antenna. Then a DFT
transforms the transmission blocks into the frequency domain
subsequence $\{D^i_{p,n}\}$ ($i = 0, \cdots, C_h - 1$). These frequency
domain symbols are then mapped onto subcarriers. Finally, the
frequency domain sequence is transformed back to the time domain
by an IDFT before transmission. The overall channel memory is
assumed to be $N$. The received signal vector at the $m$th sampling
time within a block, $\mathbf{y}^m$, is given by

$$\mathbf{y}^m = \sum_{n=1}^{N_t} \sum_{i=1}^{N} \mathbf{h}^i_n d^{m-i}_n + \mathbf{n}^m, \tag{1}$$

where $\mathbf{h}^i_n$ is the overall channel impulse response with respect
to $d^{m-i}_n$, and $\mathbf{n}^m$ represents the additive white Gaussian noise
vector with zero mean and variance $\sigma^2$.
978-1-4244-5827-1/09/$26.00 ©2009 IEEE. Asilomar 2009
Assume the channel information is perfectly known by the receiver.
The received signal sequence is first transformed into the frequency
domain by a DFT. The frequency domain vector on the $m$th subcarrier
is given by

$$\mathbf{Y}^m = \sum_{n=1}^{N_t} \mathbf{H}^m_n D^m_n + \mathbf{N}^m, \tag{2}$$

where $\mathbf{Y}^m$, $\mathbf{H}^m_n$, $D^m_n$, and $\mathbf{N}^m$ denote the DFT of $\mathbf{y}^m$, $\mathbf{h}^i_n$,
$d^i_n$, and $\mathbf{n}^m$, respectively.
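The transmit chain of Fig. 1 (N-point DFT, subcarrier mapping, M-point IDFT) and its inverse at the receiver can be sketched for a single antenna as follows. This is a minimal illustration, not the authors' implementation: it assumes an ideal channel with no noise and no cyclic prefix, a localized (contiguous) subcarrier mapping, and unitary DFT scaling. The function names and the `first_subcarrier` parameter are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT with unitary 1/sqrt(n) scaling in both directions."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = n ** -0.5
    return [scale * sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                        for k in range(n))
            for m in range(n)]

def scfdma_modulate(symbols, m_points, first_subcarrier=0):
    """N-point DFT, localized subcarrier mapping, then M-point IDFT."""
    spread = dft(symbols)
    assert first_subcarrier + len(spread) <= m_points, "allocation exceeds grid"
    grid = [0j] * m_points                      # unoccupied subcarriers stay zero
    grid[first_subcarrier:first_subcarrier + len(spread)] = spread
    return dft(grid, inverse=True)

def scfdma_demodulate(samples, n_points, first_subcarrier=0):
    """M-point DFT, subcarrier demapping, then N-point IDFT."""
    grid = dft(samples)
    occupied = grid[first_subcarrier:first_subcarrier + n_points]
    return dft(occupied, inverse=True)

# Eight unit-energy QPSK symbols on 8 of 16 subcarriers (toy sizes).
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
symbols = [qpsk[i % 4] / 2 ** 0.5 for i in range(8)]
tx = scfdma_modulate(symbols, m_points=16, first_subcarrier=4)
rx = scfdma_demodulate(tx, n_points=8, first_subcarrier=4)
max_error = max(abs(a - b) for a, b in zip(symbols, rx))
print("round-trip error:", max_error)
```

With a real channel, the equalization step of Section III would sit between the demapping and the N-point IDFT.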
III. MIMO DETECTION FOR LTE UPLINK
A. MMSE-FDE
There are three major types of equalizers: time domain
equalizers, frequency domain equalizers [5] and combined
equalizers [6]. For high ISI channels, time domain equalizers
have high complexity and become unattractive to implement.
Among frequency domain equalizers (FDE), the zero-forcing FDE
(ZF-FDE) and the minimum mean-square error FDE (MMSE-FDE)
are the simplest ones. The MMSE-FDE
has better performance than the ZF-FDE. Some
equalizers belong to the third type. For example, the block
MMSE equalizer [2] operates in
both the time and frequency domains. This equalizer can achieve
better bit error rate (BER) performance at the cost of much higher
algorithmic complexity. Because of its simple architecture and
relatively good performance, the MMSE-FDE is chosen in our
implementation.
MMSE-FDE minimizes the mean square error between its
output and the symbols transmitted from the transmitter. The
equation for an MMSE-FDE is:

$$\mathbf{Y}' = (\hat{\mathbf{H}}^H \hat{\mathbf{H}} + \sigma^2 \mathbf{I})^{-1} \hat{\mathbf{H}}^H \mathbf{Y}, \tag{3}$$

where $\hat{\mathbf{H}}$ is an estimated channel matrix of $\mathbf{H}$ for each
subcarrier. $\hat{\mathbf{H}}$ is the output of the channel estimation module.
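Equation (3) can be evaluated directly for one subcarrier of a 2×2 system, as in the floating-point sketch below. It is meant only to show the order of operations; the helper names and the numerical channel values are made up for illustration, and no fixed-point effects are modeled.

```python
def hermitian(a):
    """Conjugate transpose of a 2x2 complex matrix."""
    return [[a[0][0].conjugate(), a[1][0].conjugate()],
            [a[0][1].conjugate(), a[1][1].conjugate()]]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[a[r][0] * b[0][c] + a[r][1] * b[1][c] for c in range(2)]
            for r in range(2)]

def matvec(a, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]

def inverse(a):
    """Closed-form inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def mmse_fde(h_est, y, noise_var):
    """Y' = (H^H H + sigma^2 I)^-1 H^H Y for one subcarrier, as in (3)."""
    hh = hermitian(h_est)
    gram = matmul(hh, h_est)
    gram[0][0] += noise_var
    gram[1][1] += noise_var
    return matvec(inverse(gram), matvec(hh, y))

# Illustrative channel estimate and transmitted frequency-domain symbols.
h_est = [[1 + 0.5j, 0.2 - 0.1j],
         [0.3j, 0.8 - 0.2j]]
d = [1 + 1j, -1 + 1j]
y = matvec(h_est, d)                        # noiseless observation Y = H D

zf_like = mmse_fde(h_est, y, noise_var=0.0)   # sigma^2 = 0 recovers D
regularized = mmse_fde(h_est, y, noise_var=0.1)
print(zf_like, regularized)
```

With `noise_var=0.0` the result coincides with zero-forcing; a positive `noise_var` regularizes the inversion and biases the estimate away from the transmitted symbols.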
B. MIMO Detection
Maximum likelihood (ML) search is the optimum detection
method, which minimizes the BER. This scheme performs
an exhaustive search over the set $\Lambda$ of all possible transmitted
symbol vectors for the minimum square error:

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s} \in \Lambda} \|\mathbf{H}\mathbf{s} - \mathbf{y}\|^2. \tag{4}$$

However, the complexity of a full ML search is too high. Even
with modern silicon technology the full ML search is still
not feasible, especially for MIMO detection with multiple
antennas and high modulation orders [7].
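The exhaustive search in (4) is easy to write down, which makes its cost visible: the candidate set grows as the constellation size raised to the number of transmit antennas (16 vectors for 2×2 QPSK, 256 for 16-QAM, 4096 for 64-QAM). The sketch below uses unnormalized QPSK and made-up channel values; the function name is hypothetical.

```python
from itertools import product

# Unnormalized QPSK constellation, used here only for illustration.
QPSK = [complex(re, im) for re in (-1, 1) for im in (-1, 1)]

def ml_detect(h, y, constellation):
    """Exhaustive search for argmin ||H s - y||^2 over all symbol vectors."""
    n_rx, n_tx = len(h), len(h[0])
    best, best_metric = None, float("inf")
    for s in product(constellation, repeat=n_tx):
        metric = sum(abs(sum(h[r][c] * s[c] for c in range(n_tx)) - y[r]) ** 2
                     for r in range(n_rx))
        if metric < best_metric:
            best, best_metric = s, metric
    return best, best_metric

h = [[1 + 0.5j, 0.2 - 0.1j],
     [0.3j, 0.8 - 0.2j]]
sent = (QPSK[3], QPSK[0])                      # (1+1j, -1-1j)
y = [sum(h[r][c] * sent[c] for c in range(2)) for r in range(2)]

detected, metric = ml_detect(h, y, QPSK)
n_candidates = len(QPSK) ** 2                  # vectors visited by the search
print(detected, metric, n_candidates)
```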
C. LLR Function for APP Detection
The outer soft decoder calculates the maximum a posteriori
(MAP) or a posteriori probability (APP) values. The soft APP
information is exchanged between inner detector and outer
decoder, and it is used as additional a priori knowledge in
the form of a vector of log-likelihood ratio (LLR) values. The
magnitude of the LLR value corresponds to the reliability of
the decision. The larger the LLR is, the more reliable the
decision for a decoded bit is.
Fig. 2. The BER performance of LTE uplink receiver, 2×2 MIMO, 16-QAM.
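Sphere detection (SD) solves the complexity problem of
ML detection with some acceptable performance loss. In [8],
the authors proposed a computationally efficient SD and list
sphere detection (LSD) to achieve near-capacity performance
on a MIMO system. The candidate list $\mathcal{L}$ is used to compute
the APP information for each transmitted bit $x_k$. It is assumed
that iterative detection and decoding of the bits $\mathbf{x}$ that correspond
to one channel use are performed. Then the LLR value of
the bit $x_k$, $k = 0, \cdots, M \cdot M_C - 1$, conditioned on the received
vector symbol $\mathbf{y}$ can be expressed as:

$$L_E(x_k|\mathbf{y}) \approx \frac{1}{2} \max_{\mathbf{x} \in \mathcal{L} \cap \mathbb{X}_{k,+1}} \left\{ -\frac{1}{\sigma^2} \|\mathbf{y} - \mathbf{H} \cdot \mathbf{s}\|^2 + \mathbf{x}_{[k]}^T \cdot \mathbf{L}_{A,[k]} \right\} - \frac{1}{2} \max_{\mathbf{x} \in \mathcal{L} \cap \mathbb{X}_{k,-1}} \left\{ -\frac{1}{\sigma^2} \|\mathbf{y} - \mathbf{H} \cdot \mathbf{s}\|^2 + \mathbf{x}_{[k]}^T \cdot \mathbf{L}_{A,[k]} \right\}, \tag{5}$$

where $M$ is the number of transmit antennas, and $M_C$ is the
number of bits per constellation symbol. Following [8], $\mathbb{X}_{k,\pm 1}$ is
the set of bit vectors with $x_k = \pm 1$, $\mathbf{x}_{[k]}$ is $\mathbf{x}$ with its $k$th
element omitted, and $\mathbf{L}_{A,[k]}$ is the corresponding vector of
a priori LLR values.
Based on (5), we designed an APP unit to calculate the APP
information used by the inner detector and outer decoder with
reduced hardware complexity. This APP unit is reconfigurable
and can be used in different soft MIMO detectors.
Notice that in (5), the term $-\frac{1}{\sigma^2} \|\mathbf{y} - \mathbf{H} \cdot \mathbf{s}\|^2$ has already been
calculated as a partial Euclidean distance (PED) in the sphere detector.
Therefore, the APP unit can take full advantage of the soft
information from the inner detector to reduce the complexity
of the hardware system.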
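For the non-iterative case, where the a priori term in (5) is zero, the LLR of each bit reduces to half the difference between the best metric among candidates with that bit at +1 and the best metric among candidates with it at -1. A minimal sketch follows. The candidate format (a tuple of ±1 bits plus a distance) and the clipping value used when one hypothesis has no candidate in the list are assumptions for illustration, not details taken from the paper.

```python
def max_log_llr(candidates, noise_var, n_bits, clip=8.0):
    """Evaluate (5) with zero a priori information.

    candidates: list of (bits, ped), where bits is a tuple of +1/-1 values
                and ped is the Euclidean distance ||y - H s||^2 of that
                candidate (hypothetical format).
    clip:       LLR magnitude used when the list holds no candidate for one
                of the two hypotheses (an assumed convention).
    """
    llrs = []
    for k in range(n_bits):
        best = {+1: None, -1: None}
        for bits, ped in candidates:
            metric = -ped / noise_var
            if best[bits[k]] is None or metric > best[bits[k]]:
                best[bits[k]] = metric
        if best[+1] is None:
            llrs.append(-clip)
        elif best[-1] is None:
            llrs.append(clip)
        else:
            llrs.append(0.5 * (best[+1] - best[-1]))
    return llrs

# Three made-up candidates for a 2-bit vector.
candidates = [((+1, +1), 0.2), ((+1, -1), 1.0), ((-1, +1), 2.2)]
llrs = max_log_llr(candidates, noise_var=0.5, n_bits=2)
one_sided = max_log_llr([((+1, -1), 0.3)], noise_var=0.5, n_bits=2)
print(llrs, one_sided)
```

Each bit position is processed independently of the others, which is what allows one processing element per bit to run in parallel.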
D. Simulation Results
The 2×2 MIMO receiver is verified on the Rice WARPLab
platform [9]. WARPLab is a scalable and extensible
programmable wireless platform based on software radio to
prototype advanced wireless networks. Signals generated in
MATLAB can be transmitted in real time over the air using
WARP nodes. This facilitates rapid prototyping of physical
layer algorithms. The goal of this simulation is to verify that
our receiver system satisfies the system requirements of the
LTE standard.
Fig. 2 shows the BER performance for the 2×2 MIMO receiver,
where 16-QAM modulation is used. The lengths of the DFT and
IDFT are 128 and 72, respectively. The codeword length
for the LDPC decoder is 576. This system is tested in a real
over-the-air indoor channel to verify the algorithm performance.
[Figure 3: block diagram. Blocks shown: CP Removal (×2), DFT (×2),
Subcarrier demapping (×2), Channel Estimation Block, MMSE Equalizer,
IFFT (×2), DEMUX, Sphere Detector (SD), SD Buffer, APP Unit,
LDPC decoder (×2), all inside one FPGA.]
Fig. 3. The system diagram of LTE uplink receiver. This system is a 2×2
MIMO system.
TABLE I
PARAMETERS FOR A 2×2 MIMO LTE UPLINK RECEIVER
Parameter                             Description
Channel bandwidth                     1.4 MHz ∼ 20 MHz
Modulation order                      QPSK, 16-QAM, 64-QAM
Number of symbols (per antenna)       72 ∼ 1200
Number of subcarriers (per antenna)   128 ∼ 2048
Coding                                LDPC code, rate = 1/2,
                                      length = 576 bits ∼ 2304 bits
Equalizer                             MMSE-FDE
IV. ARCHITECTURE
A. Overall Architecture of LTE Uplink Receiver
As can be seen in Fig. 3, a physical layer prototype
system for the LTE uplink receiver is designed, including
IDFT, MMSE-FDE, sphere detector [10], APP unit and LDPC
decoder [11]. Table I shows the implementation parameters
in detail.
B. MMSE-FDE
If the MMSE-FDE is built directly from (3), a large
wordlength is needed to achieve good precision. This is because the
range of values of $\hat{\mathbf{H}}^H \hat{\mathbf{H}} + \sigma^2 \mathbf{I}$ is much larger than that of the original
$\hat{\mathbf{H}}$. Researchers use different approaches to solve this problem.
In [12], the authors use more bits for the matrix
inversion than for other operations to guarantee no overflow in
the inversion. In [13], blockwise matrix inversion is used to
break a large inversion into a few small inversions. In [14],
the authors use a modified Gram-Schmidt QR decomposition
with a dynamic scaling algorithm, which enhances numerical
stability. All of the above approaches are based on either
inverting the matrix $\hat{\mathbf{H}}^H \hat{\mathbf{H}} + \sigma^2 \mathbf{I}$ [12], [13], or performing the
QR decomposition on an extended matrix, which is larger than
$\hat{\mathbf{H}}$ [14]. To further minimize the area and increase
the speed, we propose a new method. Equation (3) is
converted into the following form:

$$(\hat{\mathbf{H}}^H \hat{\mathbf{H}} + \sigma^2 \mathbf{I})^{-1} \hat{\mathbf{H}}^H \mathbf{Y} = (\hat{\mathbf{H}} + \sigma^2 (\hat{\mathbf{H}}^H)^{-1})^{-1} \mathbf{Y}. \tag{6}$$

This follows from the factorization
$\hat{\mathbf{H}}^H \hat{\mathbf{H}} + \sigma^2 \mathbf{I} = \hat{\mathbf{H}}^H (\hat{\mathbf{H}} + \sigma^2 (\hat{\mathbf{H}}^H)^{-1})$, which holds when $\hat{\mathbf{H}}$
is square (as in the 2×2 system considered here) and invertible.
Compared with (3), equation (6) only needs to invert $\hat{\mathbf{H}}$,
which has a much smaller range of values. This corresponds
to a smaller wordlength during inversion. In addition, the equalizer
does not become unstable when $\sigma^2$ goes to zero: because (6)
converges to $\hat{\mathbf{H}}^{-1}$, the MMSE-FDE gradually converges
to a less accurate ZF-FDE. Another simplification is to reduce
the number of multipliers in complex multiplication by using
strength reduction.
[Figure 4: block diagram. Blocks shown: Pre-Processing, Flex
Demodulator, MUX, and four LLR Computation Processor Elements (PE).
Signals shown: complex symbol, bit stream, bit, partial Euclidean
distance (PED), LLR value.]
Fig. 4. The top level diagram of the architecture of APP unit. The soft SD
in this architecture is Flex SD.
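One common form of strength reduction for the complex multiplications in the MMSE-FDE trades one real multiplier for extra adders: the product is computed with three real multiplications and five additions instead of four multiplications and two additions. The paper does not say which variant was used, so the three-multiplier form below is shown only as an assumed example, and the function name is hypothetical.

```python
def cmul3(a_re, a_im, b_re, b_im):
    """Complex product (a_re + j*a_im)(b_re + j*b_im) with 3 real multiplies."""
    k1 = b_re * (a_re + a_im)
    k2 = a_re * (b_im - b_re)
    k3 = a_im * (b_re + b_im)
    # k1 - k3 = a_re*b_re - a_im*b_im, k1 + k2 = a_re*b_im + a_im*b_re
    return k1 - k3, k1 + k2

# (3 + 4j) * (1 - 2j) = 11 - 2j
product_re, product_im = cmul3(3, 4, 1, -2)
print(product_re, product_im)
```

On an FPGA this maps to three multipliers rather than four per complex product, at the cost of additional adder logic and slightly wider intermediate values.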
C. APP Function Unit
The APP unit without a feedback loop can be implemented
based on (5). The inputs to the APP unit are the output
candidates from the sphere detector. We use a flex sphere
detector (Flex SD) in our implementation [10] since it can handle
different modulations and antenna configurations with low
overhead.
The Flex SD receives the equalized symbols from the
MMSE-FDE and produces a list of candidates for each channel
use; in a 2×2 MIMO system, the Flex SD
produces candidates for 2 MIMO symbols per clock cycle. For a
16-QAM scenario, the Flex SD outputs 8 candidates in every
cycle, and in 8 consecutive clock cycles it generates 64
candidates in all for 2 transmitted MIMO symbols [10]. Each
candidate contains 2 MIMO symbols and a partial Euclidean
distance (PED) value.
Fig. 4 shows the top level architecture of the APP unit.
The main parts include the preprocessing module, a flexible
n-QAM demodulator, a multiplexing module, and several LLR
computation processor elements (PE). The preprocessing module
retrieves the PED values and two complex symbols from the
input. The complex symbols are then demodulated by
the flexible n-QAM demodulator, which supports QPSK, 16-QAM
and 64-QAM. The PEs calculate the LLR value for each
coded bit using the demodulated bits and the corresponding PED
values. A fully parallel architecture is used: to process
all $M \cdot M_C$ bits of two symbols in parallel, we need
$M \cdot M_C$ PEs working simultaneously.
Several changes are made to the interface between the detector
and the decoder to enable the iteration loop, which provides an extra
performance improvement. Fig. 5 shows the architecture of the
APP unit with a feedback loop.
V. FPGA IMPLEMENTATION
We implemented most of the block units in the system of
Fig. 3 using Xilinx System Generator. The system is designed
as a 2×2 MIMO receiver for the LTE uplink in which 2048
subcarriers are transmitted and 1200 subcarriers are occupied
by symbols. 16-QAM and a 2304-bit LDPC code are used.
Note that since the system is reconfigurable, we could
easily change the parameters of the receiver so that it could
support other profiles in the LTE standard.
[Figure 5: block diagram. Blocks shown: Sphere detector, Candidate
memory, MUX, APP Unit without feedback, APP Unit with feedback,
Soft value buffer, Feedback soft value buffer, LDPC decoder.]
Fig. 5. The architecture of the APP unit with iterative loop. Two buffers,
one candidate memory and a multiplexer have been added.
Fig. 6. Performance comparison between fixed-point implementation and
floating-point for MMSE-FDE block.
A. MMSE-FDE
The performance comparison between the fixed-point
implementation and floating-point is shown in Fig. 6. Below 30 dB,
the two curves are almost the same. The performance loss
above 30 dB occurs because $\sigma^2$ becomes zero and the MMSE-FDE
reduces to the less accurate ZF-FDE, as mentioned in
Section IV.
Xilinx System Generator is used to implement the proposed
MMSE-FDE. Table II shows the Xilinx ISE synthesis results
of the FPGA implementation.
TABLE II
FPGA RESOURCE UTILIZATION SUMMARY OF THE PROPOSED
MMSE-FDE FOR THE XILINX VIRTEX-4, XC4VFX100-10FF1517
DEVICE
Resource                  Used / Available (Utilization)
Number of Slices          5,145 / 42,176 (12%)
Number of 4 input LUTs    9,019 / 84,352 (10%)
Number of DSP48s          64 / 160 (40%)
Max. Frequency            99.461 MHz
Max. Data Rate            397.844 Mbps
B. APP Function Unit
We use Xilinx System Generator to implement the proposed
APP unit architecture. The APP unit processes the input data
in parallel, and can be extended to support higher modulation
orders and more antennas. By replacing multiplications with
shift and addition operations, we reduce the number of
multipliers required and achieve a higher maximum frequency. Table III
shows the Xilinx ISE synthesis results of the APP function unit.
TABLE III
FPGA RESOURCE UTILIZATION SUMMARY OF THE PROPOSED APP UNIT
FOR THE XILINX VIRTEX-4, XC4VFX100-10FF1517 DEVICE
Resource                  Used / Available (Utilization)
Number of Slices          2,393 / 42,176 (5%)
Number of 4 input LUTs    4,426 / 84,352 (5%)
Number of DSP48s          0 / 160 (0%)
Max. Frequency            208 MHz
Max. Data Rate            1.628 Gbps
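The zero DSP48 count of the APP unit is consistent with its multiplications being replaced by shifts and additions. The paper does not state which constants are involved, so the sketch below only illustrates the general technique for multiplying an integer by a non-negative integer constant; the function name is hypothetical.

```python
def shift_add_multiply(x, constant):
    """Multiply integer x by a non-negative integer constant.

    Only shifts and additions are used: one shifted copy of x is added
    for every set bit in the constant.
    """
    acc = 0
    shift = 0
    while constant:
        if constant & 1:
            acc += x << shift
        constant >>= 1
        shift += 1
    return acc

# 10 = 0b1010, so 13 * 10 = (13 << 1) + (13 << 3)
example = shift_add_multiply(13, 10)
print(example)
```

In hardware the shifts are free wiring, so a constant with few set bits costs only a handful of adders.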
C. Other Block Units
Table IV and Table V show the FPGA implementation
results for the IDFT and sphere detector blocks,
respectively. All of these parts are reconfigurable. We use the Xilinx
DFT core [15] to perform the DFT and IDFT operations in our
system.
TABLE IV
FPGA RESOURCE UTILIZATION SUMMARY OF IDFT BLOCK FOR THE
XILINX VIRTEX-4
Resource                  Used / Available (Utilization)
Number of Slices          3,748 / 42,176 (8%)
Number of 4 input LUTs    5,699 / 84,352 (6%)
Number of DSP48s          16 / 160 (10%)
Max. Frequency            234 MHz
Max. Data Rate            936 Mbps
TABLE V
FPGA RESOURCE UTILIZATION SUMMARY OF SPHERE DETECTOR BLOCK
FOR THE XILINX VIRTEX-4, XC4VFX100-10FF1517 DEVICE
Resource                  Used / Available (Utilization)
Number of Slices          7,780 / 42,176 (18%)
Number of 4 input LUTs    14,300 / 84,352 (16%)
Number of DSP48s          81 / 160 (50%)
Max. Frequency            220 MHz
Max. Data Rate            220 Mbps
VI. SYSTEM PERFORMANCE AND IMPLEMENTATION
CONSIDERATIONS
A. System Performance
In our LTE uplink receiver system, the parameters are
set as follows: 2 receive antennas, 2048 subcarriers, 1200
occupied subcarriers, 16-QAM, and a 2304-bit LDPC code for a 20 MHz
channel bandwidth. By using two clock domains, with the MMSE-FDE,
IDFT, APP unit and LDPC decoder in one slower clock
domain, and the sphere detector in the other, faster clock domain,
the current system can achieve a data rate of up to 220 Mbps.
This data rate is much higher than the requirement given
by the LTE standard, which is 115.2 Mbps for a 20 MHz channel
bandwidth under the 2×2 16-QAM scenario [1].
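The 220 Mbps system figure can be read as the rate of the slowest block in the chain. The sketch below takes the maximum data rates reported in Tables II to V and picks the minimum; the LDPC decoder is left out because no rate is reported for it here, so treating it as non-limiting is an assumption. The sphere detector rate also appears consistent with the Flex SD description: one vector of two 16-QAM symbols (8 bits) every 8 cycles at 220 MHz. The function name is hypothetical.

```python
# Maximum data rates reported in Tables II-V, in Mbps.
block_rates_mbps = {
    "MMSE-FDE": 397.844,
    "APP unit": 1628.0,
    "IDFT": 936.0,
    "sphere detector": 220.0,
}

def detector_rate_mbps(clock_mhz, bits_per_symbol, n_streams, cycles_per_vector):
    """Bits delivered per second when one symbol vector takes several cycles."""
    return clock_mhz * bits_per_symbol * n_streams / cycles_per_vector

# 16-QAM (4 bits), 2 streams, 8 clock cycles per symbol vector at 220 MHz.
sd_rate_mbps = detector_rate_mbps(220.0, 4, 2, 8)

bottleneck_block = min(block_rates_mbps, key=block_rates_mbps.get)
system_rate_mbps = block_rates_mbps[bottleneck_block]
lte_requirement_mbps = 115.2                  # 20 MHz, 2x2, 16-QAM
margin = system_rate_mbps / lte_requirement_mbps
print(bottleneck_block, system_rate_mbps, round(margin, 2))
```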
B. Higher Data Rate
The LTE standard specifies signal transmissions in six possible
channel bandwidths ranging from 1.4 MHz to 20 MHz [1],
[3]. There are 72 occupied subcarriers available in a 1.4 MHz
channel and 1200 occupied subcarriers available in a 20 MHz
channel. The data rate of a 20 MHz channel for a 2×2 64-QAM
uplink system is 172 Mbps. To support this configuration, the
overall architecture of our system does not need to change. We
only need to configure the sphere detector and APP unit to
64-QAM mode. Accordingly, because the size of the transmission
signal sequence becomes larger, we should increase the size
of the buffers between blocks.
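The quoted rates can be reproduced from the number of occupied subcarriers. The sketch below assumes 12 data-carrying SC-FDMA symbols per 1 ms subframe (14 symbols with normal cyclic prefix, two of them used for reference signals); this frame-structure figure comes from the LTE specification rather than from the paper, and the function name is hypothetical.

```python
def uplink_raw_rate_bps(occupied_subcarriers, bits_per_symbol, n_streams,
                        data_symbols_per_subframe=12,
                        subframes_per_second=1000):
    """Raw (coded) bit rate carried on the occupied subcarriers."""
    return (occupied_subcarriers * bits_per_symbol * n_streams
            * data_symbols_per_subframe * subframes_per_second)

rate_16qam = uplink_raw_rate_bps(1200, 4, 2)   # 115_200_000 bit/s
rate_64qam = uplink_raw_rate_bps(1200, 6, 2)   # 172_800_000 bit/s
print(rate_16qam / 1e6, rate_64qam / 1e6)
```

Under this assumption the 16-QAM case gives exactly 115.2 Mbps and the 64-QAM case gives 172.8 Mbps, in line with the 172 Mbps quoted in the text.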
C. System Scalability
During implementation, in order to simplify the design
process, we assume that the input data comes from one user.
However, we could extend our receiver to support multi-user
access. As depicted in Fig. 3, most of the blocks
do not need to change except for a few modifications. The
first difference is that, instead of using one N-point IDFT block,
we should replicate several shorter IDFT blocks,
one for each user. Another modification is to
add LDPC decoders for multiple users. In order to separate
the codewords of each user, a de-multiplexer is required
between the APP computation unit and the LDPC decoders.
Note that by replicating blocks for different users, we do
not modify the overall architecture or redesign the function
unit blocks. More hardware resources and more chip area are
required when extending the system by replicating function
blocks. However, there are opportunities to reduce the usage
of hardware resources. For example, we could exploit the
potential reuse of the IDFT blocks, and it is probable that some
blocks could share specific hardware resources.
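The per-user IDFT replication described above can be sketched as follows: each user occupies its own contiguous range of subcarriers, and the receiver runs one short IDFT per range instead of a single N-point IDFT. The allocation format, user names, and sizes are hypothetical, and the frequency grid is assumed to be already equalized.

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT with unitary scaling, so the inverse undoes the forward pass."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = n ** -0.5
    return [scale * sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                        for k in range(n))
            for m in range(n)]

def demap_users(grid, allocations):
    """Run one short IDFT per user over that user's subcarrier range.

    allocations maps a user name to (first_subcarrier, n_subcarriers).
    """
    return {user: dft(grid[start:start + n], inverse=True)
            for user, (start, n) in allocations.items()}

# Two users sharing a 16-subcarrier grid with different allocation sizes.
allocations = {"user_a": (0, 4), "user_b": (4, 8)}
sent = {
    "user_a": [1 + 0j, -1 + 0j, 1j, -1j],
    "user_b": [complex(1, 1)] * 4 + [complex(-1, -1)] * 4,
}
grid = [0j] * 16
for user, (start, n) in allocations.items():
    grid[start:start + n] = dft(sent[user])     # each user's DFT-spread block

recovered = demap_users(grid, allocations)
worst_error = max(abs(a - b)
                  for user in sent
                  for a, b in zip(sent[user], recovered[user]))
print("worst per-user error:", worst_error)
```

The same short IDFT could be time-shared across users rather than replicated, which is the kind of reuse the text mentions.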
VII. CONCLUSION AND FUTURE WORK
This paper proposed a flexible architecture for a high data
rate LTE uplink receiver, which integrates several advanced
algorithms and features. A single FPGA prototype of this
architecture is presented. It supports different numbers of
antennas and modulation orders. The prototype is implemented
using Xilinx System Generator and is verified on the
WARPLab platform with channels generated by the Azimuth
channel emulator. We also verified the system in real
over-the-air indoor channels.
The FPGA prototype we built fits in one Xilinx
Virtex-4 FX140 FPGA. It supports data rates up to 220 Mbps,
which is much higher than the data rate requirement of the
LTE standard. The prototype of our LTE uplink receiver can
be configured to support the different transmission bandwidths
specified by the LTE standard.
Future work is to extend our LTE uplink receiver
to support multiple access from different users. We will
further optimize the whole system to reduce the usage of
hardware resources by balancing the resource usage among
different parts of the system. For example, we could reuse
some modules and match the rates of different modules.
Furthermore, with this platform, we will investigate more
complicated algorithms with potentially better performance,
such as the block MMSE equalizer and more sophisticated
iterative detection-decoding schemes. We will also try to
increase the reconfigurability of the receiver so that it
can be configured on the fly.
ACKNOWLEDGMENTS
The authors would like to thank Nokia, Nokia Siemens
Networks (NSN), Xilinx, Azimuth Systems, and US Na-
tional Science Foundation (under grants CCF-0541363, CNS-
0551692, CNS-0619767, EECS-0925942 and CNS-0923479)
for their support of the research.
REFERENCES
[1] 3rd Generation Partnership Project, “3GPP TS 36.211 - technical
specification group radio access network; evolved universal terrestrial radio
access (E-UTRA); physical channels and modulation (Release 8),” Nov
2007.
[2] P. Radosavljevic, “Sphere detection and LDPC decoding algorithms and
architectures for wireless systems,” scholarship.rice.edu, Jan 2008.
[3] H. G. Myung and D. J. Goodman, “Single carrier FDMA: a new air
interface for long term evolution,” p. 185, Jan 2008.
[4] Y. Wu, X. Zhu, and A. Nandi, “Low complexity adaptive turbo space-
frequency equalization for single-carrier multiple-input multiple-output
systems,” IEEE Transactions on Wireless Communications, vol. 7, no. 6,
pp. 2050 – 2056, Jun 2008.
[5] D. Falconer, S. Ariyavisitakul, A. Benyamin-Seeyar, and B. Eidson,
“Frequency domain equalization for single-carrier broadband wireless
systems,” IEEE Communications Magazine, vol. 40, no. 4, pp. 58 – 66,
Apr 2002.
[6] R. Koetter, A. Singer, and M. Tuchler, “Turbo equalization,” IEEE Signal
Processing Magazine, vol. 21, no. 1, pp. 67 – 80, Jan 2004.
[7] D. Garrett, L. Davis, S. ten Brink, and B. Hochwald, “Silicon complexity
for maximum likelihood MIMO detection using spherical decoding,”
IEEE Journal of Solid-State Circuits, Jan 2004.
[8] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-
antenna channel,” IEEE Transactions on Communications, vol. 51, no. 3,
pp. 389 – 399, Mar 2003.
[9] “Wireless open access research platform.” [Online]. Available:
http://warp.rice.edu/
[10] K. Amiri, C. Dick, R. Rao, and J. Cavallaro, “Novel sort-free detector
with modified real-valued decomposition (M-RVD) ordering in MIMO
systems,” 2008 IEEE Global Telecommunications Conference, pp. 1 –
5, Jan 2008.
[11] Y. Sun, M. Karkooti, and J. Cavallaro, “VLSI decoder architecture for
high throughput, variable block-size and multi-rate LDPC codes,” 2007
IEEE International Symposium on Circuits and Systems, pp. 2104 –
2107, Apr 2007.
[12] S. Yoshizawa, Y. Yamauchi, and Y. Miyanaga, “VLSI implementation of
a complete pipeline MMSE detector for a 4 x 4 MIMO-OFDM receiver,”
IEICE Transactions on Fundamentals of Electronics, Jan 2008.
[13] J. Eilert, D. Wu, and D. Liu, “Implementation of a programmable
linear MMSE detector for MIMO-OFDM,” 2008 IEEE International
Conference on Acoustics, Speech and Signal Processing, pp. 5396 –
5399, Jan 2008.
[14] J. Bhatia, K. Mohammed, A. Shah, and B. Daneshrad, “A practical,
hardware friendly MMSE detector for MIMO-OFDM-based systems,”
EURASIP Journal on Advances in Signal Processing, Jan 2008.
[15] “Discrete fourier transform v3.0.” [Online]. Available:
http://www.xilinx.com/products/ipcenter/DFT.htm
