FPGA-based Tabu Search for Detection in Large-Scale MIMO Systems by Wu, Yun & McAllister, John
Wu, Y., & McAllister, J. (2014). FPGA-based Tabu Search for Detection in Large-Scale MIMO Systems. In
Proceedings of 2014 IEEE Workshop on Signal Processing Systems (SiPS). (pp. 1-6). Institute of Electrical and
Electronics Engineers (IEEE). DOI: 10.1109/SiPS.2014.6986073
Published in:
 Proceedings of 2014 IEEE Workshop on Signal Processing Systems (SiPS)
Document Version:
Peer reviewed version
Publisher rights
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
FPGA-based Tabu Search for Detection in
Large-Scale MIMO Systems
Yun Wu, John McAllister
Queen’s University Belfast
UK
Email: {yun.wu, jp.mcallister}@qub.ac.uk
Abstract—The increasing scale of Multiple-Input Multiple-
Output (MIMO) topologies employed in forthcoming wireless
communications standards presents a substantial implementation
challenge to designers of embedded baseband signal processing
architectures for MIMO transceivers. Specifically the increased
scale of such systems has a substantial impact on the perfor-
mance/cost balance of detection algorithms for these systems.
Whilst in small-scale systems Sphere Decoding (SD) algorithms
offer the best quasi-ML performance/cost balance, in larger
systems heuristic detectors, such as Tabu Search (TS) detectors,
are superior. This paper addresses a dearth of research in
architectures for TS-based MIMO detection, presenting the first
known realisations of TS detectors for 4 × 4 and 10 × 10 MIMO
systems. To the best of the authors’ knowledge, these are the
largest single-chip detectors on record.
I. INTRODUCTION
The continually increasing demand for higher data rates
in modern wireless communication systems has led to the
increased adoption of MIMO systems, where multiple antennas
are employed at both the transmitter and receiver ends of the
communication channel [1], in wireless standards such as IEEE
802.11n [2] and 802.16e. MIMO schemes achieve superior
channel capacity, throughput and diversity over single antennas
at the cost of increased baseband signal processing complexity
in transceivers. This places huge demand on embedded pro-
cessing resources in terms of throughput, latency and resource
utilization, particularly as the number of antennas increases
[3].
The baseband signal processing complexity of MIMO
systems is particularly apparent in the embedded detection
problem. To achieve close to Maximum Likelihood (quasi-ML)
performance, the detection complexity scales in a super-linear
fashion. A substantial body of work has addressed the design
of embedded detectors which employ Sphere Decoding (SD)
algorithms [4], [5], [6] for this operation, motivated directly
by the ability of SD algorithms to enable quasi-ML detection
performance for relatively small MIMO topologies of 2, 4 or
8 antennas.
However, as the size of MIMO topologies employed increases,
the dynamics of performance and cost of different
detection algorithms change. Specifically, [3] compares how
the complexity and performance of linear detectors, SDs and
heuristic detectors based on Tabu Search (TS) vary with the
scale of the MIMO system. One major conclusion which may
be drawn is that as the number of antennas increases from
a few (for instance, 2 or 4) to many, for instance 40, TS
becomes substantially more computationally efficient than the
Fixed Complexity Sphere Decoder (FSD) [7], a leading highly
efficient SD algorithm and even exhibits detection efficiency
beyond that of simple linear detectors. It appears that TS-based
detectors have an important role to play in the large-scale
MIMO receivers of tomorrow.
Whilst research on SD realisations in general and FSD
realisations in particular is well recorded, no comparable effort
has been invested in TS-type detectors and it is not clear that
the computational efficiency gains observed extend to gains in
physical cost and performance. This paper resolves this issue,
making three contributions:
1) FPGA-based TS MIMO detectors are presented for
4 × 4 and 10 × 10 MIMO - the largest
such TS detectors on record.
2) The largest single-chip FSD detector on
record is presented: a detector for 10 × 10 4-QAM
MIMO.
3) It is shown that, despite exhibiting detection effi-
ciency several orders of magnitude better than FSD,
TS-based detection lags FSD in terms of both cost
and performance, specifically exhibiting reductions in
absolute throughput and throughput per unit resource
of 47% and 82% respectively.
The remainder of the paper is organized as follows. Section
II introduces related background on TS for MIMO detection,
before Section III analyzes the performance and cost
of TS relative to FSD. Section IV extends this comparative
analysis to incorporate FPGA architecture performance and
cost.
II. BACKGROUND
MIMO communication systems adopt multiple antennas at
both transmit and receive terminals, as shown in Fig. 1. An
$M$-element transmit antenna array transmitting a signal $s \in \mathbb{C}^{M}$
via a multi-path fading channel $H \in \mathbb{C}^{N \times M}$ results in
a received symbol vector $y \in \mathbb{C}^{N}$. Assuming Additive White
Gaussian Noise (AWGN) $w \in \mathbb{C}^{N}$ at the receiver, $y$ can be
modelled as
$$ y = H \cdot s + w. \tag{1} $$
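As an illustration of the model in (1), the following minimal sketch draws one channel use in plain Python. The unit-energy 4-QAM normalisation, the i.i.d. complex Gaussian (Rayleigh) channel entries, and the names `QAM4` and `mimo_channel` are assumptions made for this example; they are not specified in the text.

```python
import random

# 4-QAM constellation with unit average energy (normalisation is an assumption).
QAM4 = [complex(a, b) / 2 ** 0.5 for a in (-1, 1) for b in (-1, 1)]


def mimo_channel(M, N, noise_std, rng):
    """Draw s, H and w and return (s, H, w, y) with y = H*s + w."""
    # One transmitted symbol per transmit antenna.
    s = [rng.choice(QAM4) for _ in range(M)]
    # N x M channel matrix with i.i.d. complex Gaussian entries (assumed fading model).
    H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) / 2 ** 0.5 for _ in range(M)]
         for _ in range(N)]
    # Additive white Gaussian noise at each receive antenna.
    w = [complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std)) for _ in range(N)]
    # Received vector, one entry per receive antenna.
    y = [sum(H[n][m] * s[m] for m in range(M)) + w[n] for n in range(N)]
    return s, H, w, y


rng = random.Random(0)
s, H, w, y = mimo_channel(M=4, N=4, noise_std=0.1, rng=rng)
```

A detector sees only `y` and `H`; `s` and `w` are returned here so that an estimate can be compared against the transmitted vector in simulation.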
Detection algorithms form an estimate $\tilde{s}$ of the transmitted
symbol vector $s$ given the received symbol vector $y$ and
knowledge of the channel in the form of the channel matrix
$H$. Detection algorithms typically belong to one of three
categories: linear detectors such as Zero Forcing (ZF) or
Minimum Mean Square Error (MMSE), non-linear detectors
such as Sphere Decoders (SD), and heuristic approaches. The
use and realisation of linear and SD algorithms are very well
studied, see e.g. [1], [4], [5], [6], [7]. The use of heuristic
approaches is not so well developed.

Fig. 1: Generic MIMO Communication System
Heuristic detection methods include Genetic Algorithms
(GAs) [8], Ant Colony Algorithms (ACAs) [9], [10], Simulated
Annealing (SA) [11] and TS [11]. Among these, TS is
the only approach which has proven capable of quasi-ML
performance [12], [13], [14], [15]. Figure 2 shows the tree structure
of a TS detection algorithm. TS iteratively updates a global
Tabu List (TL) and a record $f$ stemming from $\tilde{y}$, an equalized version
of $y$; equalization may adopt various strategies, including ZF,
MMSE or OSIC, and in this study we restrict this to ZF. During
each iteration $i$, $f^{(i)}$ is updated via four steps: initialization,
candidate enumeration, tabu move and TL updating. The
iterations of the TS algorithm repeat until a specified stopping
criterion is met.
Fig. 2: Generic TS Detector Tree Structure (initialization, followed by I iterations of candidate enumeration, tabu move and tabu update)
1) Initialization: An initial solution vector, $x^{(0)} = \tilde{s}_{0}$, is
derived from $\tilde{y}_{ZF}$, the ZF equalized version of $y$. A simplified
ML cost calculation [13] is performed by defining $y_{MF} = H^{H} \cdot y$
and $\hat{H} = H^{H} \cdot H$; the initial vector, $f^{(0)}$, for ML cost
calculation is given by [13]
$$ f^{(0)} = \hat{H} \cdot x^{(0)} - y_{MF}. \tag{2} $$
In addition, the TL is initialized while the iteration number
and the Tabu Period are set to $I$ and $P$ respectively.
2) Candidate Enumeration: The candidate symbol vectors
at each iteration are derived from $\tilde{s}_{ZF}$ by selecting the $K$
nearest symbols in the modulation constellation set for each antenna,
giving $K \cdot M$ symbol vectors [13]. Let $z^{(i)}(m, k)$
represent one of the enumerated candidates at the $i$th iteration,
where $m$ denotes the antenna index and $k$ denotes the selected
neighbor symbol index; then only the $m$th symbol in $\tilde{s}_{ZF}$ is
changed to its $k$th neighbor in the modulation constellation set.
For each iteration, the starting solution is updated to $\tilde{s}$ from
the last iteration.
The detection metric for TS measures the ML cost increment [13]
$$ t(m, k) = 2 \cdot \Re\left(e_m^{*} \cdot f_m^{(i)}\right) + |e_m|^{2} \cdot \hat{H}(m, m) \tag{3} $$
where $e = z^{(i)}(m, k) - x^{(i)}$ is the differential vector between
the current starting symbol vector and the enumeration candidate,
$e_m$ is its $m$th element, $(\bullet)^{*}$ denotes complex conjugation
and $\Re(\bullet)$ denotes the real part of a complex value.
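The metric in (3) can be checked numerically. The sketch below, in plain Python, assumes the record is $f = \hat{H} \cdot x - H^{H} \cdot y$ (the matched-filter form used in [13]) and compares the increment against a direct recomputation of the ML cost $\|y - H \cdot x\|^{2}$. The function names and the small numeric example are illustrative assumptions, not values from the paper.

```python
def hermitian_gram(H):
    """Return H_hat = H^H * H as a list of rows."""
    N, M = len(H), len(H[0])
    return [[sum(H[n][a].conjugate() * H[n][b] for n in range(N)) for b in range(M)]
            for a in range(M)]


def matched_filter(H, y):
    """Return y_mf = H^H * y."""
    N, M = len(H), len(H[0])
    return [sum(H[n][m].conjugate() * y[n] for n in range(N)) for m in range(M)]


def ml_cost(H, y, x):
    """Full ML cost ||y - H*x||^2, recomputed from scratch."""
    return sum(abs(y[n] - sum(H[n][m] * x[m] for m in range(len(x)))) ** 2
               for n in range(len(y)))


def cost_increment(f, H_hat, m, e_m):
    """Equation (3): change in ML cost when only symbol m moves by e_m."""
    return 2 * (e_m.conjugate() * f[m]).real + abs(e_m) ** 2 * H_hat[m][m].real


# Small fixed 2 x 2 example (values are illustrative only).
H = [[1 + 1j, 0.5 - 0.2j], [0.3j, -1 + 0.4j]]
y = [0.7 - 0.1j, -0.2 + 0.9j]
x = [1 + 1j, -1 + 1j]

H_hat = hermitian_gram(H)
y_mf = matched_filter(H, y)
f = [sum(H_hat[a][b] * x[b] for b in range(2)) - y_mf[a] for a in range(2)]

# Move the symbol on antenna index 1 to a neighbouring constellation point.
m, new_symbol = 1, 1 - 1j
e_m = new_symbol - x[m]
z = list(x)
z[m] = new_symbol

delta_fast = cost_increment(f, H_hat, m, e_m)
delta_full = ml_cost(H, y, z) - ml_cost(H, y, x)
```

The two values agree up to rounding. The incremental form needs only one element of $f$ and one diagonal entry of $\hat{H}$ per candidate, whereas the direct form recomputes a full matrix-vector product.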
3) Tabu Move: As shown in Figure 2, the $K \cdot M$ candidate
symbol vectors are evaluated and the starting symbol vector for
the next iteration is selected by
$$ x^{(i+1)} = \arg\min_{j \in [1, K \cdot M],\; z^{(i)} \notin TL} (t_j). \tag{4} $$
If (4) fails to find a suitable candidate, then the starting
symbol vector remains the same as in the last iteration, hence
$x^{(i+1)} = x^{(i)}$.
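The selection rule in (4), including the fallback when every candidate is tabu, can be sketched as follows. The representation of candidates as `(cost, vector)` pairs and the boolean `tabu_flags` list are assumptions made for this example.

```python
def tabu_move(x, candidates, tabu_flags):
    """Pick the lowest-cost non-tabu candidate, as in (4).

    x          : current starting symbol vector.
    candidates : list of (cost, vector) pairs, one per enumerated neighbour.
    tabu_flags : list of booleans, True when the matching candidate is in the TL.
    Returns (next_x, index); index is None when every candidate is tabu.
    """
    best = None
    for j, ((cost, _vec), is_tabu) in enumerate(zip(candidates, tabu_flags)):
        if is_tabu:
            continue  # candidates recorded in the TL are skipped
        if best is None or cost < candidates[best][0]:
            best = j
    if best is None:
        return x, None  # no admissible move: keep the current vector
    return candidates[best][1], best


x = [1 + 1j, -1 - 1j]
candidates = [
    (0.5, [1 - 1j, -1 - 1j]),
    (-0.2, [-1 + 1j, -1 - 1j]),  # lowest cost, but marked tabu below
    (0.1, [1 + 1j, 1 - 1j]),
]
next_x, chosen = tabu_move(x, candidates, [False, True, False])
stuck_x, stuck = tabu_move(x, candidates, [True, True, True])
```

Note that a non-tabu candidate is accepted even when its cost increment is positive, which is what lets the search leave a local minimum.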
4) Tabu List Updating: The TL is an integer matrix of size
$(M \cdot M_c) \times K$ [13]. Letting $c$ denote the symbol index in the
modulation constellation set $\Omega$ at the $\hat{m}$th antenna for $x^{(i+1)}(\hat{m}, \hat{k})$, the
position of $x^{(i+1)}(\hat{m}, \hat{k})$ in the TL is at row $(\hat{m} - 1) \cdot M_c + c$ and
column $\hat{k}$. The TL is updated after each
tabu move by assigning the element in the TL for $x^{(i+1)}(\hat{m}, \hat{k})$ to
$P + 1$ if a successful tabu move is found and decrementing all
non-zero elements in the TL by 1 [13]. $f^{(i+1)}$ is given by
$$ f^{(i+1)} = f^{(i)} + e_{\hat{m}} \cdot \hat{H}_{\hat{m}} \tag{5} $$
where $\hat{H}_{\hat{m}}$ denotes the $\hat{m}$th column of $\hat{H}$.
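A minimal sketch of the two updates in this step is given below: the TL bookkeeping and the rank-one update of $f$ in (5). The list-of-lists layout of the TL and the function names are assumptions for illustration; the caller is assumed to supply the TL position of the accepted move.

```python
def update_tabu_list(TL, P, move=None):
    """Update the tabu list in place after one iteration.

    TL   : list of rows of non-negative integers (remaining tabu periods).
    P    : tabu period.
    move : (row, col) of the accepted move, or None if no move was accepted.
    """
    if move is not None:
        row, col = move
        TL[row][col] = P + 1  # becomes P after the decrement below
    for row_values in TL:
        for c, remaining in enumerate(row_values):
            if remaining > 0:
                row_values[c] = remaining - 1


def update_f(f, H_hat, m, e_m):
    """Equation (5): add e_m times column m of H_hat to f."""
    return [f[a] + e_m * H_hat[a][m] for a in range(len(f))]


TL = [[0, 0], [1, 0], [0, 2], [0, 0]]
update_tabu_list(TL, P=3, move=(0, 1))
# TL is now [[0, 3], [0, 0], [0, 1], [0, 0]]

H_hat = [[2 + 0j, 1j], [-1j, 3 + 0j]]
f_next = update_f([1 + 0j, 2 + 0j], H_hat, m=1, e_m=2 + 0j)
# f_next is [1 + 2j, 8 + 0j]
```

Because (5) touches one column of $\hat{H}$ only, the per-iteration cost of maintaining $f$ grows linearly with the number of antennas rather than quadratically.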
These iterative steps are performed until the iteration
threshold $I$ is met. TS is promising for large-scale MIMO
detection because of its favourable trade-off between complexity
and detection performance; nevertheless, no TS detection
architecture for such systems has been recorded to date.
Whilst TS-based detectors have been recorded for some
time, there has been little record of their realisation. However,
[3] points to a prominent future for these approaches as the
size of antenna topologies increases. Specifically, it shows that
when the number of antennas employed increases to around 40,
the detection efficiency of TS-type algorithms is beyond that of
linear detectors such as MMSE and of SD algorithms. Whilst
maintaining the same computational
complexity, the Bit Error Rate (BER) achieved by TS is an
order of magnitude lower than that achieved by MMSE,
two orders of magnitude lower than that achieved by FSD and
a factor of up to 3 lower than that enabled by MMSE with
successive interference cancellation. However, these metrics do
not translate directly to implementation cost and performance
and as such the implementation efficiency of these detectors
is, as yet, unverified. In Sections III and IV we address this
shortfall by comparing the detection performance and real-time
performance and cost of TS and FSD-based detectors.
III. TS DETECTION - BEHAVIOUR AND PERFORMANCE
TS detection performs iterative candidate searching until
reaching a final solution under the stopping criterion. This
enables a trade-off between detection performance and
computational complexity by varying the number of iterations
and, via the modulation scheme employed, the neighbourhood size.
Dividing TS detection into three stages, Fig. 3 shows
the dataflow diagram of Fig. 2 for three iterations and two
neighboring candidate enumerations.
Fig. 3: TS Dataflow for I = 3, K = 2
Enum produces $K$ candidate symbol vectors for different
antennas, NB performs the ML cost calculations for the
candidate symbol vectors, and Tabu min checks the TL
for each candidate and selects the candidate for the next iteration.
The dashed box indicates the iterative part of TS detection,
which can be repeated for any given number of iterations.
By changing the number of iterations and
neighboring candidates, different levels of detection performance
can be achieved.
The variation in computational complexity of FSD and TS
is described in Fig. 4. This measures arithmetic complexity as
the number of antennas varies between 4 and 40, and when the
number of TS iterations $I$ takes the values 16 or 300 (TS16 and
TS300 respectively). The complexity of each TS stage is also
captured in Table I, where $I$ is the number of iterations, $N_b$ the
number of enumerated neighbours, $M_t$ the number of transmit
antennas and $M$ the constellation size. A number of trends are
apparent: as anticipated from the linear increase in complexity
with the number of TS iterations, TS300 generally exhibits
complexity two orders of magnitude higher than TS16. Further,
the relative complexities of TS and FSD also vary with the
number of antennas: initially FSD is the least complex, but its
superlinear complexity growth with the number of antennas means
that it is more complex than TS16 and TS300 for more than 10
and 25 antennas respectively.
Fig. 4: Complexity Comparison of TS and FSD (arithmetic operations versus antenna number; curves: TS with 16 iterations, TS with 300 iterations, FSD)
TABLE I: Fixed Iteration TS Detection Complexity
Stage      ± (×I·Mt)                 × (×I·Mt)          Compare (×I·Mt)
Enum       2(Nb − 1) + 5 (Nb > 1)    6 + 5 (Nb > 1)     6 + 3 (Nb > 1)
NB         8Nb                       10Nb               0
Tabu Min   9 + 3Mt + M·Nb/I          4                  2Nb + 2/I
Given that the case where $M = 10$ antennas denotes the
first point where FSD is more computationally demanding than
TS16, 4 × 4 and 10 × 10 MIMO topologies employing 4-QAM
are chosen as comparison points. The BER performance of
FSD, TS16, TS300 and a linear MMSE detector is outlined
in Fig. 5. As Fig. 5a shows, for 4 × 4 MIMO TS300 detection has
close-to-FSD performance, while TS16 detection has 3 dB lower SNR
gain than FSD but over 10 dB larger SNR gain than MMSE.
Despite the much higher complexity of the
TS schemes, FSD generally performs much better. A similar
pattern is repeated in Fig. 5b, with the increased complexity
of FSD translating to much superior BER performance. In
particular it is notable that the performance of FSD is beyond
that of TS16, despite its relatively higher complexity.
Hence, although TS detection does not achieve ML
performance, as the antenna number grows its computational
complexity falls below that of FSD whilst it maintains
considerable BER performance. Section IV describes the
realization of TS detection on FPGA and studies the
impact of these relative computational complexity variations
on performance and cost for FPGA implementations.
IV. FPGA-BASED TABU SEARCH
A. The FPE - an ASIP for FPGA Signal Processing
The Xilinx-based FPE processor described in [4] and
illustrated in Fig. 6 is used as the foundation for the proposed
detection architectures. The FPE exploits a RISC load-store
architecture with a seven-stage pipeline and enables very high
performance with low cost relative to other FPGA soft proces-
sors by emphasising three key architecture design principles:
Fig. 5: MIMO Detection BER Performance Comparison. Panels: (a) 4 × 4, (b) 10 × 10 MIMO, both plotting BER (Bit Error Rate) against SNR (dB) for MMSE, TS-16, TS-300 and FSD; panel (a) also includes a TS-300(avg258)-adapt curve.
1) Lean Architecture: The FPE contains only com-
ponents critical to ensuring software programmabil-
ity: a Program Counter (PC) and Program Memory
(PM), Instruction Decoder (ID), Register File (RF),
Branch Detection (BD), Data Memory (DM) and an
Arithmetic Logic Unit (ALU) based around the on-
chip DSP48E slice. These are augmented only with
components enabling high performance processing
on FPGA at low resource cost: Immediate Memory
(IMM) for ROM storage of constant data, an off-
FPE Communication Module (COMM) and custom
coprocessors if required; these features support an
instruction set described in Table II.
2) Scalability: The FPE is designed to be combined
in very high numbers. FPEs can be combined into
SIMD FPE Processing Units (FPUs) to exploit data
parallelism with the added benefit of amortizing the
cost of several components across multiple FPEs (see
Fig. 6b); furthermore, large-scale MIMD networks of
FPUs communicating via FIFO queues of any size
may be created for the application at hand.
3) Configurability: The FPE is highly configurable -
almost every characteristic may be tuned to the
application at hand; frequently components can be
omitted if they are not used. Extending to enable
configuration of the number of FPU ways, this config-
urability ensures that there is no barrier to achieving
the absolute lowest cost architecture required for a
specific application. The configurable characteristics
of the FPU are described in Table III.
(a) FPGA Processing Element (FPE)
(b) FPGA Processing Unit (FPU)
Fig. 6: The FPU Soft Processing Architecture
TABLE II: FPE Instruction Set
        Instruction             Function
CTRL    BEQ, BGT, BLT           branch if equal/greater/less
        JMP                     jump
IPC     GET, PUT                load/push data from/to channel
        GETCH, CLRCH            load data from/clear channels
        NOP                     no operation
ALU     MUL/ADD/SUB             multiply/add/subtract
        MULADD/MULSUB (FWD)     multiply-add/subtract (& forward)
        COPROC                  coprocessor access
MEM     LD/ST                   load/store data from/to memory
        LDIMM/STIMM             load/store data from/to IMM
This combination of features is unique in FPGA-based
processors and can enable uniquely powerful architectures;
when realised on Xilinx Virtex 5 VLX110T FPGA, the
computational capability and cost of six FPE configurations -
16 bit Real (16R), 32 bit Complex (32C) and 32 bit Real
(32R) variants - are as described in Table IV (see footnote 1). To the best
of the authors' knowledge, both the absolute performance and
resource metrics quoted in Table IV are the leading metrics
amongst recorded soft processors.

TABLE III: FPE Configuration Parameters
Parameter             Meaning               Values
n_ways                FPU Width             1 - unlimited
data_ws               Data wordsize (bits)  16/32
data_type             Data type             Real/complex
alu_ndsp              No. DSP48E slices     1 - 4
pm_depth, pm_width    PM Capacity/width     Unlimited
imm_depth             IMM Capacity/width    Unlimited
dm_depth, rf_depth    DM/RF Capacity        Unlimited
no_tx, no_rx          No. Tx/Rx ports       ≤1024
TABLE IV: FPE Arithmetic Performance
Config   LUTs   DSP48Es   Latency (Cycles)   Clock (MHz)   Throughput (MMACs/s)
16 R     90     1         4                  483           483
16 C     132    1         7                  476           119
         172    2         5                  453           226.5
         140    4         5                  474           474
32 R     185    2         6                  431           215.5
         182    3         7                  431           431
B. FPE-based Tabu-Search
The FPGA TS architecture for the TS16 4 × 4 4-QAM
detector operating on 16-subcarrier OFDM is shown in Fig. 7. As
this shows, 32 16-way FPUs and 92 banks of 16 FIFOs are
employed in a pipelined chain architecture (see footnote 2). The FPU behaviours
are interleaved: odd-numbered FPUs (the SIMD units of Fig. 7) process the candidate
search stage schedule illustrated in Fig. 8, including Enum
and NB, while even-numbered FPUs perform candidate
selection via Tabu min, with updated TL. Notice that a TL
of size $M \cdot M_t \cdot N_b$ is transferred between successive FPUs for
candidate selection.
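The interleaving described above can be made concrete with a small mapping from FPU index to pipeline role. This sketch assumes that each TS iteration occupies one odd/even pair of FPUs, an inference from the figure of 32 FPUs for 16 iterations rather than something stated explicitly; the function name `fpu_schedule` is hypothetical.

```python
def fpu_schedule(num_iterations):
    """Map each FPU in the chain (numbered from 1) to (iteration, stage).

    Assumes one odd/even FPU pair per TS iteration.
    """
    schedule = {}
    for fpu in range(1, 2 * num_iterations + 1):
        iteration = (fpu + 1) // 2  # FPUs 1,2 -> iteration 1; 3,4 -> iteration 2; ...
        if fpu % 2 == 1:
            stage = "candidate search (Enum + NB)"
        else:
            stage = "candidate selection (Tabu min)"
        schedule[fpu] = (iteration, stage)
    return schedule


schedule = fpu_schedule(16)
for fpu in (1, 2, 31, 32):
    print(fpu, schedule[fpu])
```

Under this assumption the chain length grows linearly with the iteration count, so a TS300 detector laid out the same way would need 600 FPUs, which illustrates why the iteration count bears directly on resource cost.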
Table V reports the performance and cost of the 4 × 4 and
10 × 10 TS architectures and the 10 × 10 FSD architecture.
It gives synthesis results for TS detection for 4 × 4 MIMO with 4-QAM
modulation on 16 OFDM subcarriers, and for both TS and FSD
detection for 10 × 10 MIMO on 8 OFDM subcarriers, on Xilinx
Virtex-6 FPGA. The 10 × 10 architectures reported in Table V
are the largest single-chip MIMO detectors on record.

TABLE V: 4 × 4 / 10 × 10 4-QAM MIMO Detector
Detection Scheme           TS        TS         FSD
MIMO Scheme                4 × 4     10 × 10    10 × 10
Clock (MHz)                293       301        280
Throughput (Mbps)          42.6      23.9       45.8
SIMDs                      32        32         20
DSP48E1                    512       384        288
LUTs (×10^3)               123.6     236.6      80.3
T/LUT (×10^2 Mbps)         3.4       1.0        5.7
T/DSP48E1 (×10^3 Mbps)     0.8       0.6        1.6

Note the variation in performance and cost across the
TS detectors and between the TS and FSD detectors. In
particular, it is worth noting that the 10 × 10 TS detector
has higher cost than FSD of a similar scale, despite having
lower computational complexity. Indeed, LUT and DSP48E1
cost have increased by factors of 2.9 and 1.3 respectively,
whilst throughput has decreased by 47%. Hence throughput
per unit LUT and DSP resource has decreased by 82.5%
and 62.5% respectively. These reductions in absolute
performance and efficiency, together with the increased cost, are compounded by reductions in
detection capability, since FSD 10 × 10 offers consistently
superior BER performance relative to both TS16 and TS300.

Footnote 1: All synthesis results are post place-and-route, employing flat criteria, with
neither speed nor area prioritized.
Footnote 2: The 10 × 10 TS16 and FSD detectors are simple modifications of the
architectures in Fig. 7 and [4] respectively and are not included here for
brevity.
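The ratios quoted for the 10 × 10 detectors can be reproduced from the raw rows of Table V. The short calculation below uses only the clock-independent entries (throughput, LUTs, DSP48E1); the variable names are illustrative. It also shows that the 82.5% and 62.5% figures correspond to the rounded T/LUT and T/DSP48E1 rows, while the unrounded entries give slightly different values (about 82.3% and 60.9%).

```python
# Figures copied from Table V (10 x 10, 4-QAM).
ts = {"throughput_mbps": 23.9, "luts": 236.6e3, "dsp48e1": 384}
fsd = {"throughput_mbps": 45.8, "luts": 80.3e3, "dsp48e1": 288}

# Cost of TS relative to FSD, and the relative loss in throughput.
lut_factor = ts["luts"] / fsd["luts"]
dsp_factor = ts["dsp48e1"] / fsd["dsp48e1"]
throughput_drop = 1 - ts["throughput_mbps"] / fsd["throughput_mbps"]


def per_unit_drop(resource):
    """Relative drop in throughput per unit of the given resource (TS vs FSD)."""
    ts_eff = ts["throughput_mbps"] / ts[resource]
    fsd_eff = fsd["throughput_mbps"] / fsd[resource]
    return 1 - ts_eff / fsd_eff


lut_eff_drop = per_unit_drop("luts")
dsp_eff_drop = per_unit_drop("dsp48e1")

# The same drops computed from the rounded T/LUT and T/DSP48E1 rows.
lut_eff_drop_rounded = 1 - 1.0 / 5.7
dsp_eff_drop_rounded = 1 - 0.6 / 1.6

print(f"LUT factor {lut_factor:.2f}, DSP48E1 factor {dsp_factor:.2f}")
print(f"Throughput drop {throughput_drop:.1%}")
print(f"T/LUT drop {lut_eff_drop:.1%} (rounded rows: {lut_eff_drop_rounded:.1%})")
print(f"T/DSP48E1 drop {dsp_eff_drop:.1%} (rounded rows: {dsp_eff_drop_rounded:.1%})")
```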
It is probable that the cause of these relatively poor results
is the maintenance of a single centralized TL and the iterative
nature of TS. These have two major effects. First, the requirement
to communicate this list between processing nodes has a
high resource cost. Second, the iterative nature of the TS
algorithm constrains throughput, since each stage may only be
performed on completion of the last.
Accordingly it is apparent that, despite promising high
performance detection at low computational complexity,
TS-based detection for large-scale MIMO systems has
some way to go before it can be considered a practical
reality. The lack of research into the embedded realisation of
these algorithms, coupled with the inherent throughput and data
management constraints imposed by the iterative nature of
TS and the maintenance of a centralized TL, are significant
bottlenecks which constrain the performance and increase the
cost of current realisations.
V. CONCLUSION
As the scale of MIMO systems increases, the balance of
efficiency amongst families of detection algorithms changes.
In particular, the work in [3] has shown that the efficiency
of heuristic detectors, specifically TS detectors, is beyond that
of quasi-ML SD algorithms and even simple linear detectors,
offering in some cases a 66% reduction in BER for the same
computational complexity over MMSE-SIC. The gains relative
to SD algorithms are even more spectacular. However, this
paper shows that this gain in efficiency does not translate in
a straightforward manner to a gain in implementation cost.
Specifically, it has shown that, in the case of a 10 × 10 MIMO
system employing 8 OFDM subcarriers, implementation cost
increases dramatically relative to even FSD: when
deployed on Xilinx Virtex-6 FPGA, LUT and DSP48E1
cost increase by factors of 2.9 and 1.33 respectively, whilst
throughput reduces by 47.8%. It is apparent that the
maintenance of a single global Tabu List has serious throughput
implications. If the potential of TS algorithms as detectors
in large scale MIMO systems is to be realised, substantial
research is required to overcome this limitation.
Fig. 7: Architecture of TS I = 16, K = 3 in 4 × 4 MIMO OFDM System, 4-QAM (chain of SIMD units SIMD1 to SIMD32, each containing FPE1 to FPE16)
Fig. 8: Candidate Search Schedule (four rows, each running Enum followed by NB1, NB2 and NB3)
ACKNOWLEDGMENT
This work was sponsored by the UK Engineering and
Physical Sciences Research Council (EPSRC) under contract
number EP/H051155/1. The authors are grateful to Dr. Peng
Wang for his assistance in deriving the FPGA multi-SIMD
architectures.
REFERENCES
[1] G. J. Foschini, “Layered Space-Time Architecture for Wireless Com-
munication in a Fading Environment When Using Multi-Element An-
tennas,” Bell Labs Technical Journal, vol. 1, no. 2, pp. 41–59, 1996.
[2] IEEE802.11n, “802.11n-2009 IEEE Local and metropolitan area
networks–Specific requirements Part 11: Wireless LAN Medium Access
Control (MAC) and Physical Layer (PHY) Specifications Amendment
5: Enhancements for Higher Throughput,” p. 536, 2009.
[3] F. Rusek, D. Persson, B. K. Lau, E. Larsson, T. Marzetta, O. Edfors,
and F. Tufvesson, “Scaling Up MIMO: Opportunities and Challenges
with Very Large Arrays,” Signal Processing Magazine, IEEE, vol. 30,
no. 1, pp. 40–60, Jan 2013.
[4] X. Chu and J. McAllister, “Software-Defined Sphere Decoding for
FPGA-Based MIMO Detection,” Signal Processing, IEEE Transactions
on, vol. 60, no. 11, pp. 6017–6026, Nov. 2012.
[5] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and
H. Bolcskei, “VLSI Implementation of MIMO Detection using the
Sphere Decoding Algorithm,” Solid-State Circuits, IEEE Journal of,
vol. 40, no. 7, pp. 1566–1577, July 2005.
[6] X. Huang, C. Liang, and J. Ma, “System Architecture and Imple-
mentation of MIMO Sphere Decoders on FPGA,” Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 2, pp.
188–197, Feb. 2008.
[7] L. Barbero and J. Thompson, “Fixing the Complexity of the Sphere
Decoder for MIMO Detection,” Wireless Communications, IEEE Trans-
actions on, vol. 7, no. 6, pp. 2131–2142, June 2008.
[8] L. Hanzo and T. Keller, OFDM and MC-CDMA: A PrimerMIMO-
OFDM for LTE, WiFi and WiMAX. Wiley-IEEE Press, April 2011.
[9] C. Xu, L.-L. Yang, R. Maunder, and L. Hanzo, “Near-Optimum Soft-
Output Ant-Colony-Optimization Based Multiuser Detection for the
DS-CDMA Uplink,” in IEEE International Conference on Communi-
cations, 2008. ICC ’08., 2008, pp. 795–799.
[10] C. Xu, R. Maunder, L.-L. Yang, and L. Hanzo, “Near-Optimum Mul-
tiuser Detectors Using Soft-Output Ant-Colony-Optimization for the
DS-CDMA Uplink,” IEEE Signal Processing Letters, vol. 16, no. 2,
pp. 137–140, 2009.
[11] T. Abrao, F. Ciriaco, L. Oliveira, B. Angelico, P. J. E. Jeszensky, and
F. Casadevall, “GA, SA, and TS Near-optimum Multiuser Detectors
for s/MIMO MC-CDMA Systems,” in Wireless Communication and
Sensor Networks, 2008. WCSN 2008. Fourth International Conference
on, 2008, pp. 173–178.
[12] H. Zhao, H. Long, and W. Wang, “Tabu Search Detection for MIMO
Systems,” in IEEE 18th International Symposium on Personal, Indoor
and Mobile Radio Communications, 2007. PIMRC 2007., 2007, pp. 1–5.
[13] N. Srinidhi, S. K. Mohammed, A. Chockalingam, and B. Sundar Rajan,
“Near-ML Signal Detection in Large-Dimension Linear Vector Chan-
nels Using Reactive Tabu Search,” Online arXiv:0911.4640v1 [cs.IT].,
November 2009.
[14] B. Rajan, S. Mohammed, A. Chockalingam, and N. Srinidhi, “Low-
Complexity Near-ML Decoding of Large Non-Orthogonal STBCs
Using Reactive Tabu Search,” in IEEE International Symposium on
Information Theory, 2009. ISIT 2009., 2009, pp. 1993–1997.
[15] A. Chockalingam, “Low-Complexity Algorithms for Large-MIMO De-
tection,” in 2010 4th International Symposium on Communications,
Control and Signal Processing (ISCCSP), 2010, pp. 1–6.
