A Customized Lattice Reduction Multiprocessor for MIMO Detection by Shahabuddin, Shahriar et al.
arXiv:1501.04860v1 [cs.IT] 17 Jan 2015
Reference Matlab code is available at sites.google.com/site/shahriarshahabuddin/matlab simulator
A Customized Lattice Reduction Multiprocessor for
MIMO Detection
Shahriar Shahabuddin, Janne Janhunen,
Zaheer Khan, and Markku Juntti
Centre for Wireless Communications
University of Oulu, Finland
Email: firstname.lastname@ee.oulu.fi
Amanullah Ghazi
Department of Computer Science and Engineering
University of Oulu, Finland
Email: firstname.lastname@ee.oulu.fi
Abstract—Lattice reduction (LR) is a preprocessing technique
for multiple-input multiple-output (MIMO) symbol detection to
achieve better bit error-rate (BER) performance. In this paper,
we propose a customized homogeneous multiprocessor for LR.
The processor cores are based on transport triggered architecture
(TTA). We propose some modifications to the popular LR
algorithm, Lenstra-Lenstra-Lovász (LLL), for high throughput.
The TTA cores are programmed with a high level language. Each
TTA core consists of several special function units to accelerate
the program code. The multiprocessor takes 187 cycles to reduce
a single matrix for LR. The architecture is synthesized on 90 nm
technology and takes 405 kgates at 210 MHz.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) is a key technology
to utilize the available radio spectrum efficiently. The basic
idea of MIMO is to send multiple independent data streams
from multiple antennas in the same frequency band. These
independent streams need to be separated at the receiver to
identify the symbol that is being transmitted by using a MIMO
detector. Maximum likelihood (ML) detection is the optimal solution
for the MIMO detection problem; it compares the incoming
symbol with every possible symbol in the constellation.
However, the ML algorithm is too complex for practical real-time
implementations. Linear detection is popular for practical
implementations. The linear MIMO detection algorithms are
less complex, but suffer from a degraded bit error-rate (BER)
performance.
Lattice reduction (LR) is a preprocessing technique that can
be used with the linear detection to significantly improve the
BER performance and reduce the gap between the traditional
linear detectors and optimal ML. LR transforms the MIMO
channel matrix to a near orthogonal matrix and thus helps
to achieve a better BER performance. The most used LR
algorithm is the Lenstra-Lenstra-Lovász (LLL) algorithm,
named after its inventors [1]. The LLL
algorithm poses many challenges due to its nondeterministic
execution time and high computational complexity. We propose
a modified LLL (MLLL) algorithm that is based on the
original LLL algorithm in the complex domain. We use a fixed
structure for the LLL based on [2]. Instead of using the Lovász
condition, a less complex Siegel condition is applied [3]. An
early termination technique is used as proposed in [4]. We
demonstrate by Matlab simulation that the BER performance
loss of the hardware friendly MLLL algorithm is negligible.
Several hardware accelerators have been proposed in [4],
[5], [6], [7] for different LR algorithms. The fixed hardware
implementations provide high data rates and consume less
silicon area compared to customized application specific
processors (ASIPs). The drawback of a fixed hardware
implementation is that it operates only on a fixed set of parameters
due to the hardwired control path and it is not possible to
modify the control path in the future. An ASIP customized
for a small set of algorithms is an attractive solution in terms
of cost, silicon area and high throughput. Most importantly, an
ASIP reduces the design risk with an instruction memory that
can be used to load new programs or control instructions. The
control instructions can be easily obtained by a retargetable
compiler for that particular customized architecture.
A customized very long instruction word (VLIW) processor
is implemented in [8] for the LR. We take a different approach
and design a customized multiprocessor based on the transport
triggered architecture (TTA) paradigm. TTA is a processor
design philosophy where the programmer can control the
internal data transports between different function units of the
processor. TTA exploits the instruction level parallelism (ILP)
by processing several instructions in a single clock cycle. The
TTA based codesign environment (TCE) tool is used in this
work to design the TTA processor cores. TCE enables the
designer to write an application with a high level language
and design the target processor in a graphical user interface at
the same time. A turbo decoder and a MIMO detector design
using TCE can be found in [9] and [10]. In this work, every
core of the proposed multiprocessor is programmed in the C
language to shorten the time-to-market. The multiprocessor
takes 187 cycles and achieves a maximum clock frequency of
210 MHz on 90 nm technology. To our knowledge, this is the
first TTA based customized architecture for LR.
II. SYSTEM MODEL
A. Conventional MIMO Detection
Consider a MIMO system consisting of $M_T$ transmit antennas,
which send data over the channel, and $N_R$ receive antennas,
which receive the transmitted signal from the channel. The
modulation scheme used here is quadrature amplitude
modulation (QAM), and we assume $N_R \ge M_T$. The
received signal $\mathbf{y}$ can be represented as
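$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$
where $\mathbf{y} \in \mathbb{C}^{N_R}$ is the received signal vector, $\mathbf{x} \in \mathbb{C}^{M_T}$ is
the transmit symbol vector, $\mathbf{H} \in \mathbb{C}^{N_R \times M_T}$ is the channel
matrix and $\mathbf{n} \in \mathbb{C}^{N_R}$ is the circularly symmetric complex
white Gaussian noise vector with zero mean and variance $\sigma^2$.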
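In the receiver, the linear zero forcing (ZF) detector applies
the pseudoinverse of the channel matrix to the received signal to estimate the
transmitted symbol vector, which can be expressed as
$$\tilde{\mathbf{x}} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{H}^{\dagger}\mathbf{y}, \tag{2}$$
where $\mathbf{H}$ is the channel matrix and $(\cdot)^{\dagger}$ denotes the pseudoinverse.
Typically, the channel matrix $\mathbf{H}$ is QR decomposed into
two parts as $\mathbf{H} = \mathbf{Q}\mathbf{R}$. Here $\mathbf{Q} \in \mathbb{C}^{N_R \times N_R}$ denotes a unitary
matrix and $\mathbf{R} \in \mathbb{C}^{N_R \times N_R}$ denotes an upper triangular matrix.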
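As a small numerical illustration of the system model and the ZF detector, the following pure-Python sketch builds a received vector from a channel, a symbol vector and a noise vector, and then applies the pseudoinverse. The 2×2 channel, the symbols and the noise values are made up for illustration, and the helper names (`hermitian`, `matmul`, `inv2`, `zf_detect`) are hypothetical; none of them come from the paper or its Matlab code.

```python
def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[complex(A[r][c]).conjugate() for r in range(len(A))]
            for c in range(len(A[0]))]

def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    """Inverse of a 2x2 matrix (enough for two transmit antennas)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def zf_detect(H, y):
    """Zero forcing estimate (H^H H)^-1 H^H y for a channel with 2 columns."""
    Hh = hermitian(H)
    pinv = matmul(inv2(matmul(Hh, H)), Hh)
    return [row[0] for row in matmul(pinv, [[v] for v in y])]

# Illustrative values only (assumptions, not taken from the paper).
H = [[2, 1], [1, 3]]          # channel matrix
x = [1, -1]                   # transmitted symbols
n = [0.1, -0.2]               # noise samples
y = [sum(H[i][j] * x[j] for j in range(2)) + n[i] for i in range(2)]

x_zf = zf_detect(H, y)                    # soft ZF estimate
x_hat = [round(v.real) for v in x_zf]     # slice to the nearest integer symbol
```

With these values the soft estimate is close to the transmitted vector, so slicing recovers it. For a poorly conditioned channel the pseudoinverse amplifies the noise, which is the weakness of ZF that LR preprocessing targets.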
B. Lattice Reduction
A lattice is a periodic arrangement of discrete points. A
lattice can be characterized in terms of a set of basis vectors,
where any point of the lattice can be represented by a
superposition of integer multiples of the basis vectors. For
simplicity, we call the set $\mathbf{B} = (\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n)$ the basis of
the lattice.
A complex valued lattice in the n-dimensional complex
space $\mathbb{C}^n$ can be defined as
$$L = \{\boldsymbol{\upsilon} \mid \boldsymbol{\upsilon} = \mathbf{B}\boldsymbol{\omega}\}, \tag{3}$$
where $\mathbf{B}$ is the basis of the lattice and $\boldsymbol{\omega} = [\omega_1, \omega_2, \ldots, \omega_n]$.
Note that in (3), $\boldsymbol{\upsilon}$, $\boldsymbol{\omega}$ and the matrix $\mathbf{B}$ can be replaced with
$\mathbf{y}$, $\mathbf{x}$ and $\mathbf{H}$ respectively to obtain $L = \{\mathbf{y} \mid \mathbf{y} = \mathbf{H}\mathbf{x}\}$. In this
case, the lattice $L$ is the set of all possible undisturbed
received signal points. There are many bases that can span the
lattice $L$, and the aim of the LR algorithm is to find a
basis with the least correlated and shortest basis vectors [11].
C. LR-based MIMO Detection
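LR finds an improved basis for the lattice induced by the
channel. The original basis and the reduced basis are related
by a unimodular matrix, $\mathbf{T}$. Therefore, LR aided detection
finds the received symbol in the new reduced basis and
afterwards transfers the signal back to the original lattice. The new
channel matrix after the LR can be expressed as $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$,
and the transmitted signal is treated as multiplied by $\mathbf{T}^{-1}$,
which gives $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$ for the reduced basis. The received signal
$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$ can be expressed as
$$\mathbf{y} = \mathbf{H}\mathbf{T}\mathbf{T}^{-1}\mathbf{x} + \mathbf{n} = \tilde{\mathbf{H}}\mathbf{z} + \mathbf{n}. \tag{4}$$
The LR aided detection operates on $\tilde{\mathbf{H}}$ and $\mathbf{z}$ instead of $\mathbf{H}$
and $\mathbf{x}$. The LR aided ZF detector can be expressed as
$$\tilde{\mathbf{z}} = (\tilde{\mathbf{H}}^H\tilde{\mathbf{H}})^{-1}\tilde{\mathbf{H}}^H\mathbf{y} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y}, \tag{5}$$
and the estimate in the original basis is recovered through $\mathbf{x} = \mathbf{T}\mathbf{z}$.
The LR algorithm is applied on the QR decomposed $\mathbf{H}$ to
obtain the modified $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$. Afterwards, the lattice reduced
channel matrix can be obtained as $\tilde{\mathbf{H}} = \tilde{\mathbf{Q}}\tilde{\mathbf{R}}$.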
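The change of basis can be checked numerically. The sketch below uses a hand-picked integer channel and a hand-picked unimodular matrix (both are illustrative assumptions, not the output of any LR algorithm). It verifies that the reduced channel applied to the transformed symbols gives the same noiseless received vector as the original channel applied to the original symbols, and that the original symbols are recovered by multiplying with the unimodular matrix.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Illustrative values only (assumptions, not taken from the paper).
H = [[2, 3], [1, 2]]        # original channel (basis of the lattice)
T = [[1, -2], [0, 1]]       # unimodular: integer entries, determinant 1
T_inv = [[1, 2], [0, 1]]    # its inverse is also an integer matrix
x = [1, -3]                 # transmitted symbols

H_red = matmul(H, T)        # reduced channel, H~ = H T
z = matvec(T_inv, x)        # symbols seen in the reduced basis, z = T^-1 x

y_original = matvec(H, x)       # noiseless y = H x
y_reduced = matvec(H_red, z)    # noiseless y = H~ z, same lattice point
x_back = matvec(T, z)           # return to the original basis, x = T z
```

Here the second column of the reduced channel is shorter than the second column of the original channel, while both matrices generate the same lattice because the transformation and its inverse are integer matrices.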
III. LATTICE REDUCTION ALGORITHM
The LLL algorithm is widely used to compute a suitable
unimodular matrix T and to obtain a reduced lattice basis.
LLL was originally proposed for real valued LR [1].
However, the channel matrix is naturally complex valued and
therefore, the complex version of LLL (CLLL) is used to reduce
the complexity.
The CLLL algorithm suffers from irregular dataflow,
which eventually leads to higher latency. Therefore, a fixed-complexity
LLL (fcLLL) algorithm is proposed in [2]. The
fcLLL alters the signal flow of the CLLL to follow a deterministic
structure. It is possible to use the less complex Siegel
condition instead of the more complex Lovász condition [3]. It is
also very important to use an early termination mechanism to
meet the strict requirements. Applying all these modifications,
we propose a modified-LLL (MLLL) algorithm for LR with
less complexity and negligible BER performance loss. The
MLLL implemented in this paper is summarized in Algorithm 1.
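Algorithm 1 Modified CLLL Algorithm (MLLL)
INPUT: $\mathbf{Q} \in \mathbb{C}^{N_R \times N_R}$, $\mathbf{R} \in \mathbb{C}^{N_R \times N_R}$, $\delta$
1: Initialization: $\tilde{\mathbf{Q}} := \mathbf{Q}$, $\tilde{\mathbf{R}} := \mathbf{R}$, $\mathbf{T} := \mathbf{I}_{M_T}$
2: $k := 2$
3: while $k \le$ iterations
4:   for $l = k-1$ to $1$ step $-1$
5:     $\mu = \lceil \tilde{R}(l,k)/\tilde{R}(l,l) \rfloor$ (rounded to the nearest integer)
6:     if $\mu \ne 0$
7:       $\tilde{R}(1:l,k) := \tilde{R}(1:l,k) - \mu\tilde{R}(1:l,l)$
8:       $T(:,k) := T(:,k) - \mu T(:,l)$
9:     end
10:   end
11:   if $\delta\tilde{R}(k-1,k-1)^2 > \tilde{R}(k,k)^2$
12:     Swap columns $k-1$ and $k$ in $\tilde{\mathbf{R}}$ and $\mathbf{T}$
13:     $\boldsymbol{\Theta} = \begin{bmatrix} \alpha & \beta \\ -\beta & \alpha \end{bmatrix}$ with $\alpha = \frac{\tilde{R}(k-1,k-1)}{\|\tilde{R}(k-1:k,k-1)\|}$ and $\beta = \frac{\tilde{R}(k,k-1)}{\|\tilde{R}(k-1:k,k-1)\|}$
14:     $\tilde{R}(k-1:k,k-1:k) := \boldsymbol{\Theta}\tilde{R}(k-1:k,k-1:k)$
15:     $\tilde{Q}(:,k-1:k) := \tilde{Q}(:,k-1:k)\boldsymbol{\Theta}^{T}$
16:     $k := \max\{k-1, 2\}$
17:   else
18:     $k := k+1$
19:   end
20: end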
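The steps of Algorithm 1 can be sketched in pure Python as follows. This is a reading of the listing under several stated assumptions, not the authors' reference code. The coefficient µ is rounded to the nearest Gaussian integer, as in standard LLL size reduction. The loop bound is interpreted as a fixed budget of passes (`max_passes`, a hypothetical parameter) combined with stopping once k passes the last column. The Givens matrix uses complex conjugates so that it stays unitary for complex entries. The rotation is applied to all trailing columns of the two affected rows so that the product of the updated Q and R keeps matching the original channel times T.

```python
import math

def cround(v):
    """Round real and imaginary parts separately (nearest Gaussian integer)."""
    v = complex(v)
    return complex(round(v.real), round(v.imag))

def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mlll(Q, R, delta=0.75, max_passes=5):
    """Sketch of Algorithm 1 with 0-based indices; returns (Q~, R~, T)."""
    n = len(R)
    Q = [[complex(v) for v in row] for row in Q]
    R = [[complex(v) for v in row] for row in R]
    T = [[complex(i == j) for j in range(n)] for i in range(n)]
    k, passes = 1, 0
    while k < n and passes < max_passes:     # assumed early-termination rule
        passes += 1
        for l in range(k - 1, -1, -1):       # size reduction of column k
            mu = cround(R[l][k] / R[l][l])
            if mu != 0:
                for i in range(l + 1):
                    R[i][k] -= mu * R[i][l]
                for i in range(n):
                    T[i][k] -= mu * T[i][l]
        if delta * abs(R[k - 1][k - 1]) ** 2 > abs(R[k][k]) ** 2:  # Siegel test
            for M in (R, T):                 # swap columns k-1 and k
                for row in M:
                    row[k - 1], row[k] = row[k], row[k - 1]
            a, b = R[k - 1][k - 1], R[k][k - 1]
            norm = math.hypot(abs(a), abs(b))
            alpha, beta = a / norm, b / norm
            # Conjugated Givens matrix (assumption for complex entries).
            th = [[alpha.conjugate(), beta.conjugate()], [-beta, alpha]]
            for c in range(k - 1, n):        # rotate rows k-1 and k of R
                u, v = R[k - 1][c], R[k][c]
                R[k - 1][c] = th[0][0] * u + th[0][1] * v
                R[k][c] = th[1][0] * u + th[1][1] * v
            for r in range(n):               # Q := Q * Theta^H on columns k-1, k
                u, v = Q[r][k - 1], Q[r][k]
                Q[r][k - 1] = u * th[0][0].conjugate() + v * th[0][1].conjugate()
                Q[r][k] = u * th[1][0].conjugate() + v * th[1][1].conjugate()
            k = max(k - 1, 1)
        else:
            k += 1
    return Q, R, T

# Illustrative input (an assumption): an already upper triangular channel, so Q = I.
Q0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
R0 = [[1, 2.4 + 1.2j, -3.1], [0, 0.5, 1.3 - 2.2j], [0, 0, 0.25]]
Q_red, R_red, T = mlll(Q0, R0)
```

Whatever the pass budget, three properties should hold after the call: the product of `Q_red` and `R_red` equals the original channel times `T`, `R_red` is upper triangular up to rounding error, and `T` contains only Gaussian integers.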
The BER performance of the traditional ZF, original CLLL
aided ZF, MLLL aided ZF and the optimal ML is simulated
for various signal-to-noise ratio (SNR) values in a Matlab simulator. An
additive white Gaussian noise (AWGN) channel is used with
16-QAM modulation and the BER is averaged over 10 000
Monte-Carlo trials. Fig. 1 shows the results for the MLLL algorithm with 5
iterations. It can be seen that the performance loss is negligible
compared to the original CLLL algorithm.
[Figure: bit error rate (BER), logarithmic axis from 10^-4 to 10^0, versus average SNR per receive antenna [dB], from 0 to 40, for ZF, CLLL, MLLL (5 iterations) and ML.]
Fig. 1. BER performance of MLLL algorithm.
IV. TTA MULTIPROCESSOR FOR MLLL
A. Special Function Units
Six special function units (SFUs) are designed and written in
VHSIC hardware description language (VHDL) to accelerate
each iteration of the MLLL algorithm. One SFU
is designed to support the complex multiplication (CMUL)
operation. Data level parallelism (DLP) is applied in the design
by packing the 16-bit real part and the 16-bit imaginary part into
a 32-bit complex variable. Therefore, CMUL uses four 16-bit
multipliers, a single 16-bit adder and a single 16-bit subtractor to
support the complex multiplication.
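A behavioral model makes the packing and the operator count concrete. The sketch below is an assumption-laden illustration, not the VHDL of the CMUL unit: the placement of the real part in the upper half-word, the optional `frac_bits` scaling for fixed-point fractions and the wraparound on overflow are all choices made here for the example.

```python
def to_s16(v):
    """Interpret the low 16 bits of v as a signed two's-complement value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def pack(re, im):
    """Pack a 16-bit real part (upper half) and imaginary part (lower half)."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def unpack(word):
    """Split a 32-bit packed word into signed (real, imaginary) parts."""
    return to_s16(word >> 16), to_s16(word)

def cmul(a, b, frac_bits=0):
    """Complex multiply of two packed words using four products."""
    ar, ai = unpack(a)
    br, bi = unpack(b)
    # Four multipliers.
    p_rr, p_ii, p_ri, p_ir = ar * br, ai * bi, ar * bi, ai * br
    re = (p_rr - p_ii) >> frac_bits   # one subtractor for the real part
    im = (p_ri + p_ir) >> frac_bits   # one adder for the imaginary part
    return pack(re, im)               # results wrap to 16 bits (assumption)

# (3 + 4j) * (1 - 2j) = 11 - 2j
product = unpack(cmul(pack(3, 4), pack(1, -2)))
```

With `frac_bits=8`, the same function multiplies values held with eight fractional bits, for example 1.5 (stored as 384) times 0.5 (stored as 128) gives 0.75 (stored as 192).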
Two SFUs for µ calculation and size reduction are designed
according to [8]. These SFUs are single-cycle and multiplier-less.
It is observed from the Matlab simulations that the value
of µ has a range of [−4, 4], and thus, the dynamic range of
the SFUs is set accordingly. Another simple SFU is designed
to compute the Siegel criterion. Instead of multiplying the
input by 0.75 with a multiplier, the SIEGEL SFU calculates the value with a
combination of two shifters and one adder.
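The shift-and-add evaluation of 0.75·x can be sketched on integers as x/2 + x/4. The function names below are hypothetical, and the comparison function only illustrates how such a scaled value could feed the Siegel test; the exact operand format used by the SIEGEL SFU is not given in the text.

```python
def times_three_quarters(x):
    """Approximate 0.75 * x with two right shifts and one addition."""
    return (x >> 1) + (x >> 2)

def siegel_swap_needed(r_prev_sq, r_cur_sq):
    """True when 0.75 * |R(k-1,k-1)|^2 exceeds |R(k,k)|^2 (integer inputs)."""
    return times_three_quarters(r_prev_sq) > r_cur_sq

scaled = times_three_quarters(1024)        # 512 + 256
swap = siegel_swap_needed(1024, 700)       # 768 > 700
no_swap = siegel_swap_needed(1024, 800)    # 768 > 800 is false
```

Each shift truncates, so the result can fall slightly below the exact product when the input is not a multiple of four. The result is exact for multiples of four.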
Fig. 2. 4-cycle CORDIC architecture.
The most complex SFU that lies in the datapath of a single
TTA core is the CORDIC SFU. A master-slave CORDIC is
considered in this work [4]. The master-slave CORDIC is
a combination of two CORDIC blocks in vectoring mode
and rotation mode respectively. By setting the inputs of the
rotation mode CORDIC to 1 and 0, it is possible to
calculate the cosine and sine values directly. Therefore, the
angle calculation done in a conventional CORDIC block is
not needed here. In every stage, the signs (signums) determined in
the vectoring mode decide whether to add or subtract in the
rotation mode. As we need a 16-bit CORDIC, there are two
options to design it. The first is an iterative CORDIC that uses registers
and iterates 16 times over a 1-stage datapath. However, it
takes 16 clock cycles to compute the output. For a processor
based implementation, a 16-cycle SFU is costly, as there will
be 15 NOP operations in the assembly code. The second is to
fully unroll the CORDIC block without any registers, but then
the critical path of the CORDIC block becomes too long. We
find a compromise between the two approaches and design a
4-stage CORDIC datapath that is reused four times to
create a 4-cycle master-slave CORDIC. The block diagram of
the master-slave CORDIC is presented in Fig. 2, where the
ellipse contains a single stage of the datapath. An ARRANGE
SFU is designed to rearrange the 32-bit variables.
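The master-slave idea can be modeled behaviorally. In the sketch below, the master runs in vectoring mode and drives the second coordinate to zero, while the slave applies the opposite micro-rotations to the starting vector (1, 0), so that it ends at the cosine and sine of the input angle without an explicit angle accumulator. Floating point is used for readability instead of 16-bit fixed point. The loop nesting of four cycles with four stages each, the pre-scaling by the CORDIC gain and the restriction to a non-negative first coordinate are assumptions of this example.

```python
import math

STAGES_PER_CYCLE = 4   # micro-rotations evaluated per clock cycle
CYCLES = 4             # the 4-stage datapath is reused four times

# CORDIC gain accumulated over all 16 micro-rotations.
GAIN = 1.0
for _i in range(STAGES_PER_CYCLE * CYCLES):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * _i))

def master_slave_cordic(x, y):
    """Return (norm, cos, sin) for the vector (x, y), assuming x >= 0."""
    c, s = 1.0 / GAIN, 0.0          # slave input (1, 0), pre-scaled by the gain
    i = 0
    for _cycle in range(CYCLES):
        for _stage in range(STAGES_PER_CYCLE):
            shift = 2.0 ** (-i)
            sigma = 1.0 if y >= 0 else -1.0   # sign decided by the master
            # Master (vectoring mode): rotate (x, y) towards the x axis.
            x, y = x + sigma * y * shift, y - sigma * x * shift
            # Slave (rotation mode): rotate the other way with the same sign.
            c, s = c - sigma * s * shift, s + sigma * c * shift
            i += 1
    return x / GAIN, c, s

norm, cos_t, sin_t = master_slave_cordic(3.0, 4.0)
```

For the input (3, 4) the three outputs approach 5, 0.6 and 0.8. The cosine and sine are the quantities α and β needed for the Givens rotation of Algorithm 1 when the entries are real.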
B. High Level Architecture of the multiprocessor
A 32-bit fixed point TTA processor is designed to support a
single iteration of the MLLL algorithm and five of these TTA
cores are connected in a pipelined fashion to compute the LR
matrix. Part of a single TTA processor core is illustrated inside
the dotted block of Fig. 3. For readability, the whole processor
is not given. The blocks in the upper part of the core represent
the function units and register files of the processor. The black
horizontal straight lines represent the buses of the processor.
The vertical rectangular blocks represent the sockets.
Fig. 3. The multiprocessor architecture.
Each core includes the load/store unit (LSU), arithmetic
logic unit (ALU), global control unit (GCU), register files,
several conventional function units and SFUs. The Q, R
and T matrices are read from three separate first-in-first-out
(FIFO) memory buffers by using the function units called
STREAM. The STREAM units can read one input sample
per clock cycle. Three STREAM units are used to read the
inputs simultaneously. Three STREAM units are used to write
the outputs to the memory buffers.
Ten register files are used to save the intermediate results. A
single Boolean register file is included in the processor design.
When the registers are not enough, the processor is able to
access the data memory through the LSU to temporarily store data.
The SFUs can be called by macros to accelerate the
program code.
Eight buses are used in a single core and therefore the core is
able to process eight instructions in a single clock cycle. As the
datapath is exposed to the programmer in the TTA architecture,
it is possible to remove the unused or less frequently used
connections between the function units and buses. Thus, several
connections of the processor are removed to reduce the cost of
a core. The connections between function units and buses are
illustrated by black spots in the sockets of Fig. 3.
The cores are connected to one another by FIFO buffer
memories. In this way, five iterations of MLLL can be
processed in parallel. The cores are identical and, except for the first
core, all cores use the same program image.
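The effect of chaining cores through FIFOs can be illustrated with a toy software model. The sketch below is an abstraction: a "slot" is an arbitrary time step in which every core with pending input handles one item, not a clock cycle of the TTA cores, and `run_pipeline` and `tag_stage` are hypothetical names. The stage function only records which cores an item visited.

```python
from collections import deque

def run_pipeline(items, stage_fn, num_cores=5):
    """Push items through num_cores stages connected by FIFO buffers."""
    fifos = [deque() for _ in range(num_cores + 1)]
    fifos[0].extend(items)
    total = len(fifos[0])
    outputs, slots = [], 0
    while len(outputs) < total:
        # Visit cores from last to first so an item advances one stage per slot.
        for core in range(num_cores - 1, -1, -1):
            if fifos[core]:
                fifos[core + 1].append(stage_fn(core, fifos[core].popleft()))
        while fifos[num_cores]:
            outputs.append(fifos[num_cores].popleft())
        slots += 1
    return outputs, slots

def tag_stage(core, item):
    """Stand-in for one MLLL iteration: record the core that handled the item."""
    name, visited = item
    return name, visited + [core]

matrices = [("H0", []), ("H1", []), ("H2", [])]
results, slots = run_pipeline(matrices, tag_stage)
```

In this model three matrices leave the five-stage pipeline after seven slots rather than fifteen, because up to five matrices are in flight at once, each in a different iteration.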
V. RESULTS AND DISCUSSION
It can be seen from the TCE tool that the TTA multiprocessor
takes 187 cycles to compute the MLLL algorithm. Some of
the operations executed during a single iteration are presented
in Table I. Conventional operations such as additions and shifts
are not shown in the table.
TABLE I
NUMBER OF OPERATIONS
Operation         No. of Ops
ARRANGE           18
CORDIC            9
CMUL              72
STREAM            84
SIEGEL            3
SIZE REDUCTION    34
The multiprocessor is synthesized using UMC 90 nm
standard cell library (fsd0k generic core1d0vtc). Synopsys
Design Compiler is used to estimate gate count and maximum
achievable clock frequency. The operating conditions (temperature,
operating voltage, manufacturing process quality) for
synthesis are set to their default values (TCCOM). The maximum
clock frequency achieved during the synthesis for the multiprocessor
is 210 MHz. The total gate count of the multiprocessor
at 210 MHz is around 405 kgates.
A comparison with other implementations of LR
is presented in Table II. Two important VLSI architectures
for the LR algorithm can be found in [4] and [6] with low
latency and area. The authors implemented the reverse-Siegel
LLL (RS-LLL) and hardware-optimized LLL (HOLL) in [4]
and [6] respectively. The VLSI architecture for Clarkson's
algorithm is provided in [5]. That architecture provides less
throughput than our architecture even though it uses a hardwired
control path. The latency of the VLSI architecture of [7] is the
lowest in clock cycles, but at the price of a very low maximum achievable
clock frequency of 37 MHz. Though most of the VLSI
implementations take fewer cycles and less area, the architectures
suffer from inflexibility, and as a consequence later field
updates are not possible.
As different variants of the LLL algorithm are proposed in the
literature, a flexible implementation is a necessity. Our
customized multiprocessor is an example of such a flexible
implementation with moderate latency and cost. It is possible
to support different variants of LLL algorithms by updating
the instruction memory with a new binary program image. The
updated binary instructions can be obtained by compiling the
other LLL algorithms for our particular architecture with the
help of a retargetable compiler.
The programmable VLIW core [8] takes fewer clock cycles
and is flexible. That implementation includes not only LR
but also QR decomposition and detection. However, the
amount of area needed for the LR alone is not clear. The total
area is very high compared to the other implementations even
at 40 nm technology.
TABLE II
IMPLEMENTATION COMPARISON
Reference   Architecture/tech.   Area       Max. freq.   Cycles
[4]         0.13 µm              107 kGE    333 MHz      14
[5]         Virtex-II Pro        N/A        100 MHz      420
[6]         0.13 µm              125 kGE    352 MHz      40
[7]         90 nm                200 kGE    37 MHz       5
[8]         VLIW (40 nm)         6364 kGE   700 MHz      21
Proposed    TTA (90 nm)          405 kGE    210 MHz      187
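Cycle counts alone do not show how long a reduction takes, since the designs run at very different clock frequencies. The short calculation below divides the cycle count of each entry in Table II by its maximum frequency. It assumes that every design runs at its reported maximum frequency and ignores differences in technology node, matrix size and algorithm variant, so it is only a rough reading of the table.

```python
# (maximum frequency in Hz, cycles per reduced matrix), copied from Table II.
designs = {
    "[4]": (333e6, 14),
    "[5]": (100e6, 420),
    "[6]": (352e6, 40),
    "[7]": (37e6, 5),
    "[8]": (700e6, 21),
    "Proposed": (210e6, 187),
}

# Time per matrix in nanoseconds at the maximum clock frequency.
time_ns = {name: cycles / freq * 1e9 for name, (freq, cycles) in designs.items()}

for name, t in sorted(time_ns.items(), key=lambda kv: kv[1]):
    print(f"{name:9s} {t:8.1f} ns")
```

For the proposed multiprocessor this gives 187 / 210 MHz, which is about 0.89 µs per matrix, compared with 4.2 µs for the FPGA implementation of [5].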
VI. CONCLUSION
We propose a modified LLL algorithm for LR. We simulate
the performance of the algorithm in Matlab and propose a
customized multiprocessor architecture for the MLLL. The cores
are programmable with the help of a retargetable compiler. The
flexible implementation shows great promise to support later
field updates and provides high throughput at a moderate
cost.
REFERENCES
[1] A. K. Lenstra, H. W. Lenstra, and L. Lovász, "Factoring polynomials
with rational coefficients," in Math. Ann, vol. 261, pp. 515 - 534, 1982.
[2] H. Vetter, V. Ponnampalam, M. Sandell, and P. A. Hoeher, "Fixed
complexity LLL algorithm," in IEEE Trans. on Signal Processing, vol.
57, no. 4, pp. 1634 - 1637, April 2009.
[3] M. Seysen, "Simultaneous reduction of a lattice basis and its reciprocal
basis," in Combinatorica, vol. 13, no. 3, pp. 363 - 376, 1993.
[4] L. Bruderer, C. Studer, M. Wenk, D. Seethaler, and A. Burg, "VLSI
implementation of a low-complexity LLL lattice reduction algorithm for
MIMO detection," in Proc. IEEE Intl. Symp. of Circuits and Systems,
May 2010.
[5] L. G. Barbero, D. L. Milliner, T. Ratnarajah, J. R. Barry, and C. Cowan,
"Rapid prototyping of Clarkson's lattice reduction for MIMO detection,"
in Proc. IEEE Intl. Conference on Communications, June 2009.
[6] M. Shabany, A. Youssef, and G. Gulak, "High-throughput .13-µm
CMOS lattice reduction core supporting 880 Mb/s detection," in IEEE
Trans. on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 5,
pp. 848 - 861, 2013.
[7] C.-F. Liao and Y.-H. Huang, "Power saving 4 × 4 lattice-reduction
processor for MIMO detection with redundancy checking," in IEEE
Trans. on Circuits and Systems II, vol. 58, no. 2, pp. 95 - 99, Feb.
2011.
[8] U. Ahmad et al., "Exploration of lattice reduction aided soft-output
MIMO detection on a DLP/ILP baseband processor," in IEEE Trans. on
Signal Processing, vol. 61, no. 23, pp. 5878 - 5892, Oct. 2013.
[9] S. Shahabuddin, J. Janhunen, M. Juntti, A. Ghazi, and O. Silvén, "Design
of a transport triggered vector processor for turbo decoding," in Analog
Integrated Circuits and Signal Processing, vol. 78, no. 3, pp. 611 - 622,
March 2014.
[10] S. Shahabuddin, J. Janhunen, E. Suikkanen, H. Steendam, and M. Juntti,
"An adaptive detector implementation for MIMO-OFDM downlink,"
in International Conference on Cognitive Radio Oriented Wireless
Networks, Oulu, Finland, June 2014.
[11] D. Wubben, D. Seethaler, J. Jalden, and G. Matz, "Lattice reduction,"
in IEEE Signal Processing Magazine, vol. 28, no. 4, pp. 70 - 91, April
2011.
