Implementation and Complexity Analysis of List Sphere Detector for MIMO-OFDM systems by Myllylä, Markus et al.
Implementation and Complexity Analysis of List
Sphere Detector for MIMO-OFDM systems
Markus Myllylä and Markku Juntti
Centre for Wireless Communications
P.O. Box 4500, FIN-90014 University of Oulu, Finland
{markus.myllyla, markku.juntti}@ee.oulu.fi

Joseph R. Cavallaro
Dept. of Electrical & Computer Engineering
Rice University, Houston, TX 77251-1892, USA
cavallar@rice.edu
Abstract—A list sphere detector (LSD) is an enhancement of
a sphere detector (SD) that can be used to approximate the
soft output maximum a posteriori probability (MAP) detector
used in the detection of the multiple-input multiple-output
(MIMO) signals. The LSD consists of three different parts: the
preprocessing unit, the LSD algorithm unit and the log-likelihood
ratio (LLR) calculation unit. Architecture design is the key point
to enable an efﬁcient implementation of the LSD. In this paper,
we design the architecture for the whole detector structure and
exploit the parallelism and pipelining possibilities of the presented
architecture units. The designed architecture is implemented in a
ﬁeld programmable gate array (FPGA) using Mentor Graphics
Catapult C tool. We show that a scalable architecture can be
designed for the LSD. The LSD is also shown to be feasible
for practical implementation, and the implementation complexity
and latency results are presented.
I. INTRODUCTION
The ever increasing data rates in wireless communication
systems require the use of the available bandwidth as efﬁ-
ciently as possible to maximize the capacity of the system. The
multiple-input multiple-output (MIMO) concept in combina-
tion with orthogonal frequency division multiplexing (MIMO-
OFDM) has been adapted to multiple wireless telecommu-
nication standards, such as the 3rd generation partnership
project (3GPP) long term evolution (LTE) and IEEE 802.16e.
The optimal joint detection and decoding of a MIMO signal
with forward error coding (FEC) can be approximated with
an iterative (turbo type) receiver with separate detector and
decoder [1], where the optimal soft output detector is the
maximum a posteriori probability (MAP) detector. However,
the computational complexity of the MAP detector is an
exponential function of the number of transmit antennas and
modulation levels, and, thus, it is typically not feasible for
practical implementation. A list sphere detector (LSD) [1] is
a variant of the sphere detector (SD) [2], [3] that can be used
to approximate the MAP detector with much lower computational
complexity [1], [4].
The architecture design is a key point in the efficient
implementation of an algorithm. In this paper, we identify and
introduce the key functional units of the LSD, and design a
highly parallel and scalable architecture for MIMO-OFDM
systems. The possibilities for parallelism and pipelining in
the microarchitecture units are introduced and analyzed. The
designed architecture is implemented on a Virtex-IV field
programmable gate array (FPGA) chip for a 4 × 4 MIMO
system with 16-quadrature amplitude modulation (16-QAM).
The implementation is done using Mentor Graphics' Catapult
C Synthesis tool [5] with the high-level ANSI C++ language,
which is then completely synthesized to produce the resulting
register transfer level (RTL) description. We present the
complexity and latency results of the implementation and
analyze the major challenges of the implementation.
The paper is organized as follows. The MIMO signal model,
the SD principles, and the LSD are presented in Section II.
The list sphere detector architecture details are introduced in
Section III. The LSD implementation trade-offs and results are
presented in Section IV. Conclusions are drawn in Section V.
II. MIMO SIGNAL DETECTION
An OFDM based multiple-antenna system with $N_T$ transmit
(TX) antennas and $N_R$ receive (RX) antennas is considered
under the assumption $N_R \geq N_T$ and a QAM constellation. The
received signal at baseband can be expressed in terms of
symbol interval as
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \boldsymbol{\eta}, \qquad (1)$$
where the received signal vector $\mathbf{y} \in \mathbb{C}^{N_R \times 1}$, the transmit
symbol vector $\mathbf{x} \in \Omega^{N_T} \subset \mathbb{C}^{N_T \times 1}$ and the noise vector
$\boldsymbol{\eta} \in \mathbb{C}^{N_R \times 1}$ are defined in the frequency domain. The elements
of $\boldsymbol{\eta}$ are independent and complex zero-mean Gaussian with
equal power $\sigma^2$ for both real and imaginary parts. The channel
matrix $\mathbf{H} \in \mathbb{C}^{N_R \times N_T}$ contains complex Gaussian fading
coefficients with unit variance. The entries of $\mathbf{x}$ are chosen
independently from a complex QAM constellation $\Omega$ with
$Q$ bits per symbol, i.e., the uncoded transmission rate is
$R = N_T Q$ bits per channel use (bpcu). The complex system
model in (1) can be reduced into an equivalent real model
with new real dimensions $M_T = 2N_T$, $M_R = 2N_R$ [3]. We
assume a practical system with forward error coding
(FEC) and with a separate soft-input soft-output (SISO) detector
and decoder at the receiver as shown in Figure 1. The turbo
principle can be applied in the receiver so that the detector
and decoder exchange information in an iterative fashion to
approximate the optimal joint detector and decoder [1].
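The reduction of the complex model (1) to a real model with $M_T = 2N_T$ and $M_R = 2N_R$ can be illustrated with a short Python sketch. The stacking order used here (real parts first, then imaginary parts) is one common convention and is an assumption of this example; [3] may order the entries differently. The helper names `to_real_model` and `matvec` and the numeric values are purely illustrative.

```python
def to_real_model(H, y):
    """Map the complex model y = Hx + n to an equivalent real-valued one.

    Assumed ordering (not taken from the paper): real parts first, then
    imaginary parts, i.e. x_real = [Re(x); Im(x)].
    """
    n_r, n_t = len(H), len(H[0])
    H_real = [[0.0] * (2 * n_t) for _ in range(2 * n_r)]
    for r in range(n_r):
        for c in range(n_t):
            h = H[r][c]
            H_real[r][c] = h.real              # Re(H) block
            H_real[r][c + n_t] = -h.imag       # -Im(H) block
            H_real[r + n_r][c] = h.imag        # Im(H) block
            H_real[r + n_r][c + n_t] = h.real  # Re(H) block
    y_real = [v.real for v in y] + [v.imag for v in y]
    return H_real, y_real


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Noiseless 2 x 2 toy example with 16-QAM symbols (illustrative values).
H = [[1 + 2j, 0.5 - 1j], [-0.3 + 0.1j, 2 - 0.5j]]
x = [1 + 3j, -3 - 1j]
y = matvec(H, x)

H_real, y_real = to_real_model(H, y)
x_real = [v.real for v in x] + [v.imag for v in x]
y_check = matvec(H_real, x_real)   # should match y_real up to rounding
```

With this mapping each complex 16-QAM symbol becomes two real symbols taken from the real alphabet {-3, -1, 1, 3}, which is why the tree search later operates on $M_T = 2N_T$ real layers.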
A. Sphere Detection
The sphere detectors (SDs) achieve the hard output maxi-
mum likelihood (ML) solution of x with a reduced number of
considered candidate symbol vectors in the search compared
to traditional exhaustive search algorithms. Then the sphere
1852978-1-4244-2941-7/08/$25.00 ©2008 IEEE Asilomar 2008
Authorized licensed use limited to: Rice University. Downloaded on June 30, 2009 at 16:57 from IEEE Xplore.  Restrictions apply.
Fig. 1. A coded MIMO-OFDM system model with an iterative receiver
structure. [Block diagram not reproduced. Recoverable block labels:
Encoder, Int, QAM Mod, S/P, IFFT, Channel, FFT, LSD detector, DeInt,
Decoder; signals x, y, L_D, L_A, L_E.]
search is done by limiting the search to points that lie inside
an $M_R$-dimensional hypersphere $S(\mathbf{y}, \sqrt{C_0})$ centered at $\mathbf{y}$.
After QR decomposition (QRD) of the channel matrix $\mathbf{H}$, the
condition can be written as [3]
$$\|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{x}\|_2^2 \leq C_0, \qquad (2)$$
where $C_0$ is the squared radius of the sphere, $\mathbf{R} \in \mathbb{R}^{M_R \times M_T}$
is an upper triangular matrix with positive diagonal elements,
$\mathbf{Q} \in \mathbb{R}^{M_R \times M_R}$ is an orthogonal matrix, and $\tilde{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}$.
Due to the upper triangular form of $\mathbf{R}$, the values of $\mathbf{x}$ can
be solved from (2) level by level using the back-substitution
algorithm. Let $\mathbf{x}_i^{M_T} = (x_i, x_{i+1}, \ldots, x_{M_T})^T$ denote the last
$M_T - i + 1$ components of the vector $\mathbf{x}$. The squared partial
Euclidean distance (PED) of $\mathbf{x}_i^{M_T}$ can be calculated as [4]
$$d(\mathbf{x}_i^{M_T}) = d(\mathbf{x}_{i+1}^{M_T}) + \Big| \tilde{y}_i - \sum_{j=i}^{M_T} R_{i,j} x_j \Big|^2 = d(\mathbf{x}_{i+1}^{M_T}) + \big| b_{i+1}(\mathbf{x}_{i+1}^{M_T}) - R_{i,i} x_i \big|^2, \qquad (3)$$
where $d(\mathbf{x}_{M_T+1}^{M_T}) = 0$, $b_{i+1}(\mathbf{x}_{i+1}^{M_T}) = \tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j} x_j$, $R_{i,j}$
is the $(i, j)$th term of $\mathbf{R}$ and $i = M_T, \ldots, 1$. Depending on the
search strategy and the channel realization, the SD searches
a variable number of nodes in the tree structure, and aims to
find the point $\mathbf{x} = \mathbf{x}_1^{M_T}$, also called a leaf node, for which the
Euclidean distance (ED) $d(\mathbf{x}_1^{M_T})$ is minimum.
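The level-by-level PED recursion in (3) can be sketched in a few lines of Python. This is a minimal illustration under the assumption of a square real-valued upper triangular R; the function name `partial_distances` and the numeric values are hypothetical and not taken from the paper.

```python
def partial_distances(R, y_tilde, x):
    """Return the PEDs for layers i = M_T, ..., 1 (in that order).

    R is upper triangular (list of rows), y_tilde and x are real vectors.
    """
    m_t = len(x)
    d = 0.0                 # distance before any layer has been fixed
    peds = []
    for i in range(m_t - 1, -1, -1):   # 0-based index, bottom row first
        # Part of the increment that does not depend on the new symbol x[i]
        b = y_tilde[i] - sum(R[i][j] * x[j] for j in range(i + 1, m_t))
        d += (b - R[i][i] * x[i]) ** 2  # add the squared increment of layer i
        peds.append(d)
    return peds


# Illustrative 4-layer example with real symbols from {-3, -1, 1, 3}.
R = [[2.0, 0.5, -0.3, 0.1],
     [0.0, 1.5, 0.2, -0.4],
     [0.0, 0.0, 1.2, 0.6],
     [0.0, 0.0, 0.0, 0.9]]
y_tilde = [0.3, -4.0, 2.5, -1.0]
x = [1, -3, 3, -1]

peds = partial_distances(R, y_tilde, x)

# Direct evaluation of the squared norm in (2) for comparison.
full = sum((y_tilde[i] - sum(R[i][j] * x[j] for j in range(4))) ** 2
           for i in range(4))
```

The last entry of `peds` equals the full squared distance, and the sequence is nondecreasing, which is the property the tree search relies on when it discards partial candidates.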
B. List sphere detector
The performance of a channel coded system may suffer
significantly with a hard output detector compared to the optimal
soft output MAP detector. The list sphere detector (LSD) [1]
can be used for obtaining a list of the most probable candidate
symbol vectors $\mathcal{L} \in \mathbb{Z}^{N_{cand} \times N_T}$ as an output, where $N_{cand}$ is
the size of the candidate list so that $1 \leq N_{cand} \leq 2^{Q N_T}$. The
list can then be used to approximate the soft output MAP
solution with reduced complexity. The list size $N_{cand}$
provides a tradeoff between the performance and the
computational complexity. At a high level, the list sphere
detector consists of the preprocessing unit, the
LSD algorithm unit and the LLR calculation unit.
The preprocessing unit decomposes the channel matrix H
into upper triangular form as in (2), which enables the
symbol-by-symbol tree search. Typically, QR decomposition (QRD) is
assumed in the literature to decompose the channel matrix
into an upper triangular matrix R and an orthogonal
matrix Q, which are given as an input with the received signal y
to the LSD algorithm. However, it has been shown that the
detection order of the transmitted spatial streams affects
the number of visited nodes, i.e., the algorithm complexity
[3], [6]. We assume the use of sorted QRD (SQRD) [7]
as preprocessing, where the ordering of the spatial layers is
incorporated into the modified Gram-Schmidt decomposition process.
The SQRD algorithm leads to a close to optimal detection order
so that the strongest signal is located at the top of the sphere
search tree.
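A sorted QR decomposition based on modified Gram-Schmidt can be sketched as follows. This is a minimal pure-Python illustration in the spirit of [7], assuming a real-valued, full-rank H and the minimum-norm column selection rule; the function name `sorted_qr`, the data layout (Q stored as a list of columns) and the example matrix are choices made for this sketch only.

```python
import math


def sorted_qr(H):
    """Sorted QR via modified Gram-Schmidt for a real matrix H (list of rows).

    Returns (q, R, perm): q is a list of orthonormal columns, R is upper
    triangular, and column c of Q*R equals column perm[c] of H.
    """
    m, n = len(H), len(H[0])
    q = [[H[r][c] for r in range(m)] for c in range(n)]   # copy columns of H
    R = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    norms = [sum(v * v for v in col) for col in q]        # squared column norms
    for i in range(n):
        # Select the remaining column with the smallest squared norm.
        k_min = min(range(i, n), key=lambda k: norms[k])
        q[i], q[k_min] = q[k_min], q[i]
        norms[i], norms[k_min] = norms[k_min], norms[i]
        perm[i], perm[k_min] = perm[k_min], perm[i]
        for row in R:
            row[i], row[k_min] = row[k_min], row[i]
        R[i][i] = math.sqrt(norms[i])
        inv = 1.0 / R[i][i]            # one reciprocal, then multiplications
        q[i] = [v * inv for v in q[i]]
        for k in range(i + 1, n):      # update the not yet processed columns
            R[i][k] = sum(a * b for a, b in zip(q[i], q[k]))
            q[k] = [b - R[i][k] * a for a, b in zip(q[i], q[k])]
            norms[k] -= R[i][k] ** 2
    return q, R, perm


# Illustrative 4 x 3 example; column 1 has the smallest norm.
H_example = [[2.0, 0.1, -1.0],
             [0.5, 0.3, 0.4],
             [-1.0, 0.2, 0.3],
             [0.3, -0.1, 2.0]]
q_cols, R_s, perm = sorted_qr(H_example)
```

Because the weakest remaining column is processed first at every step, the columns with larger norms tend to end up last in the permutation, i.e., at the top of the search tree where detection starts.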
The LSD algorithm unit executes the tree search and gives
the candidate list L as an output. In this paper, we consider
the increasing radius (IR)-LSD algorithm [8], [9], which
is a modification of Dijkstra's algorithm [10] into an LSD
algorithm. Dijkstra's algorithm is optimal in the sense of the
number of visited nodes in the tree structure [10], [9], and the
output candidate list L includes the most probable candidates.
The algorithm operates in a sequential fashion, and always extends
the partial candidate $s = \mathbf{x}_{i+1}^{M_T}$ and the father candidate
$s_f = \mathbf{x}_{i+2}^{M_T}$ with the next best admissible nodes $x_i$ and $x_{i+1}$, if
an admissible node exists [3]. The algorithm uses two memory
sets, where the extended partial or final candidates are stored:
the final candidate memory L, which has a size of $N_{cand}$
candidates, and the partial candidate memory S, whose size
depends on the number of executed algorithm iterations. After each
iteration, the algorithm continues with the partial candidate
with the minimum PED from S until the PED is larger than
the radius $C_0$.
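The best-first search with two candidate memories can be illustrated with a simplified Python sketch. Note the simplification: instead of extending the partial candidate and its father candidate with only their next best admissible nodes, this version expands all child nodes of the selected partial candidate at once, which visits the same leaves but stores more partial candidates. The radius is taken as the largest ED in the full final list. The function name `list_search` and the example values are hypothetical.

```python
import heapq


def list_search(R, y_tilde, omega, n_cand):
    """Best-first tree search returning the n_cand leaves with smallest ED.

    Returns a sorted list of (distance, symbols) pairs.
    """
    m_t = len(R)
    partial = [(0.0, ())]    # min-heap S of (PED, symbols of the layers below)
    final = []               # max-heap L, realised by storing negated EDs
    while partial:
        ped, tail = heapq.heappop(partial)
        radius = -final[0][0] if len(final) == n_cand else float("inf")
        if ped > radius:
            break            # no stored partial candidate can enter the list
        i = m_t - 1 - len(tail)          # 0-based layer to be extended
        b = y_tilde[i] - sum(R[i][i + 1 + j] * s for j, s in enumerate(tail))
        for sym in omega:
            d = ped + (b - R[i][i] * sym) ** 2
            cand = (sym,) + tail
            if i > 0:
                heapq.heappush(partial, (d, cand))      # still a partial candidate
            elif len(final) < n_cand:
                heapq.heappush(final, (-d, cand))       # list not yet full
            elif d < -final[0][0]:
                heapq.heapreplace(final, (-d, cand))    # replace the worst leaf
    return sorted((-neg_d, cand) for neg_d, cand in final)


# Illustrative 3-layer example with a real 4-level alphabet.
R = [[1.8, 0.4, -0.7],
     [0.0, 1.3, 0.5],
     [0.0, 0.0, 0.9]]
y_tilde = [2.1, -3.4, 0.8]
omega = (-3, -1, 1, 3)
result = list_search(R, y_tilde, omega, n_cand=5)
```

Because the PED never decreases along a path in the tree, the search can stop as soon as the smallest PED in the partial candidate memory exceeds the current radius.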
The approximation of the soft output information $L_D(b_k)$ is
calculated in the log-likelihood ratio (LLR) calculation unit
using the given candidate list. The a posteriori LLR
can be decomposed by using Bayes' theorem
as [1]
$$L_D(b_k) = \ln \frac{P(b_k = 1)}{P(b_k = 0)} + \ln \frac{p(\mathbf{y} \mid b_k = 1)}{p(\mathbf{y} \mid b_k = 0)} = L_A(b_k) + L_E(b_k \mid \mathbf{y}), \qquad (4)$$
where $L_A(b_k)$ is the a priori information and $L_E(b_k \mid \mathbf{y})$ is the
extrinsic information of the bits provided by the detector
or decoder. The probability $p(\mathbf{y} \mid b_k)$ can be determined for
a system containing Gaussian noise directly from the cost
information known about the candidates, and then the
max-log-MAP approximation can be calculated as [11]
$$L_E(b_k \mid \mathbf{y}) \approx \max_{\mathbf{x} \in \chi_{k,1}} \left( \frac{-d(\mathbf{x})}{2\sigma^2} \right) - \max_{\mathbf{x} \in \chi_{k,0}} \left( \frac{-d(\mathbf{x})}{2\sigma^2} \right), \qquad (5)$$
where $\chi_{k,1} = \{\mathbf{x} \mid b_k = 1\}$ is the set of bit vectors $\mathbf{x}$ in L
having $b_k = 1$, and $\chi_{k,0}$ is defined correspondingly for $b_k = 0$.
The performance loss due to the max-log-MAP
approximation is rather small compared to the more complex
log-MAP algorithm.
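The max-log-MAP approximation in (5) amounts to two nested loops over the bits and over the list entries. The following Python sketch illustrates this; the function name `max_log_llr` and the toy list are hypothetical. The fallback to the clipping value when one bit value never occurs in the list is an assumption of this sketch, not something specified in the text, and the default limit of 8 merely mirrors the LLR limit used later in the simulations.

```python
def max_log_llr(candidates, sigma2, n_bits, clip=8.0):
    """Max-log-MAP extrinsic LLRs following (5).

    candidates: list of (d, bits), where d is the Euclidean distance of a
                candidate and bits is a tuple of 0/1 labels of length n_bits.
    clip:       LLR magnitude limit; also used when one bit value is missing
                from the list (an assumption of this sketch).
    """
    scale = 1.0 / (2.0 * sigma2)           # one reciprocal, reused for all EDs
    metrics = [(-d * scale, bits) for d, bits in candidates]
    llrs = []
    for k in range(n_bits):                # loop over the bits
        best = {0: None, 1: None}
        for metric, bits in metrics:       # loop over the list entries
            if best[bits[k]] is None or metric > best[bits[k]]:
                best[bits[k]] = metric
        if best[1] is None:
            llr = -clip
        elif best[0] is None:
            llr = clip
        else:
            llr = best[1] - best[0]
        llrs.append(max(-clip, min(clip, llr)))
    return llrs


# Toy list with three candidates and two bits per candidate.
toy_list = [(1.0, (1, 0)), (3.0, (0, 0)), (2.0, (1, 1))]
toy_llrs = max_log_llr(toy_list, sigma2=0.5, n_bits=2)
```

For the toy list, the scaling factor is 1, so the first bit obtains the value (-1) - (-3) = 2 and the second bit obtains (-2) - (-1) = -1.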
1RUPFDOF
_KL_L 07
,QSXWGDWD
2XWSXW
GDWD
45
RUGHU
5$0
45
&RQWUROORJLF
5RZLLWHUDWLYHRUGHULQJ
5$0UHDGZULWHFRQWURO
,WHUDWLYHXSGDWH
N L07
5LN4N_KN_
&DOFXODWLRQRI
5LL VTUW_KL_
4L  4L GLY5LL
5HJLVWHU
5HJLVWHUV
Fig. 2. A high level architecture of a sorted QR decomposition unit.
III. ARCHITECTURE
A. SQRD
The SQRD algorithm architecture is illustrated in Figure
2. The architecture operates in a sequential fashion, and
calculates one row of $\mathbf{R}$ and one column $\mathbf{q}_i$ of $\mathbf{Q}$
at a time.
The norm calculation unit calculates the channel matrix
column norms, which are used to determine the initial
permutation order of the columns. The norm calculation requires
a total of $M_T^2$ multiplication (MUL) operations, and, thus,
different levels of parallelism and pipelining can be applied
for the microarchitecture of the unit. The control logic unit
defines the permutation order of the columns at iteration $i$,
$i = 1, \ldots, M_T$, and controls the calculation units and the
memory access. The memory unit is used for storing the $\mathbf{Q}$
and $\mathbf{R}$ matrices during the decomposition. The registers are
used to temporarily store the currently used rows of $\mathbf{R}$ and
columns of $\mathbf{Q}$, and the norm values. The actual calculation
of the diagonal element $R_{i,i}$ and the column $\mathbf{q}_i$ is executed in
the calculation unit, which requires a square root, a reciprocal
(division) operation and $M_R$ MULs. Parallelism and pipelining
can be applied in the MUL operations. The iterative update
unit updates the elements $R_{i,k}$, the columns $\mathbf{q}_k$, and the
norm values $|\mathbf{h}_k|^2$, where $k = 1, \ldots, M_T$. The update of the
variables can be carried out by multiply-and-add (MAC) units,
but the number of computations depends on the current iteration
$i$. We designed an efficient time-sharing microarchitecture for
the calculations, which enables different levels of parallelism
and pipelining, and it is illustrated in Figure 3. The parallel
MAC units are time-shared to calculate first the $R_{i,k}$ variable
with given $k$, and then the column $\mathbf{q}_k$ and the norm value $|\mathbf{h}_k|^2$
are updated. The architecture calculates iteratively all $k$ values.
As the number of different values assigned for the parameter
$k$ varies depending on the decomposition phase, the maximum
efficient level of parallelism is to use $M_T$ MAC units.
0D[07SDUDOOHO
0$&XQLWV
[ 
TN
TL
TN
5LN
_KN_
_KN_

Fig. 3. The designed microarchitecture for SQRD iterative update unit.
Fig. 4. A parallel and scalable architecture for the IR-LSD algorithm.
[Block diagram not reproduced. Recoverable block labels: input data
(y, R, N_cand, Ω_R, M_T), control logic unit, two units for the
calculation of b_i with SEE and PED calculation, partial candidate
memory (min-heap), final candidate memory (max-heap), output data
(final list L).]
B. IR-LSD
The architecture for the soft-output IR-LSD algorithm is
designed to include parallel and pipelined operations and it is
scalable for different antenna and constellation configurations.
The architecture, which is designed to have as much
parallel processing as possible, operates in a sequential fashion,
and the main units and the connections between the units are
illustrated in Figure 4. The SEE and PED units define and
extend the selected partial candidate and its father node with
the next best admissible nodes and calculate the PEDs of the
updated candidates. The partial candidate memory unit is used
to store the already extended partial candidates, while up to
$N_{cand}$ leaf candidates with the lowest EDs are stored in the final
candidate memory unit. The logic unit defines the candidate(s)
to be extended and stored in the next algorithm iteration. This
means that the candidates extended in the iteration D = 1 are
stored to the memory at the same time as the candidates of the
next iteration round are extended. The storing of the partial
candidates to the partial memory unit and the storing of the final
candidate to the final memory unit are thus executed in parallel
with the SEE and PED units, and the total latency of
one algorithm iteration is equal to the latency of the
highest latency parallel unit plus the latency of the control
unit. The total latency of one signal vector detection process,
i.e., one algorithm run, depends on the required number
of algorithm iterations, which is also related to the number of
checked nodes in the search tree. After the algorithm search,
the output final list L is given to the LLR calculation unit.
1) SEE and PED unit: The two SEE and PED units in
the IR-LSD architecture execute the extension of the partial
candidate and the father candidate of the algorithm description
in parallel. Each SEE and PED unit is divided into two sub-units
as shown in detail in Figure 5. The first unit calculates
$b_{i+1}(\mathbf{x}_{i+1}^{M_T})$, which is the part of the PED calculation that
is independent of the new symbol $x_i$, as in (3). A total
of $M_T - i - 1$ MULs, which can be implemented with
different levels of parallelism, are required in the calculation of
$b_{i+1}(\mathbf{x}_{i+1}^{M_T})$, where $i$ is the current layer in the search tree and
$i_{max} = M_T$. The Schnorr-Euchner enumeration (SEE), which
is done in the second unit, is designed in a slightly modified
fashion from the way presented in (14) in [3]. Instead of
calculating the costly and high latency division operation,
we calculate the absolute value in (3) with $|\Omega_R|$ different
symbols $x_i$. The calculation can be done with different levels
of parallelism, i.e., with 1 to $|\Omega_R|$ separate parallel MAC units. The
desired nth best node is determined by first defining the
node, i.e., symbol, with the minimum PED. This information, together
with the sign of the value, is used to determine the desired nth best
node [3], and the PED is calculated by a square operation and
then added to the PED of the previous nodes.
Fig. 5. The designed microarchitecture for the SEE and PED unit.
[Diagram not reproduced. Recoverable labels: parallel MUL units,
|Ω| parallel MAC units, definition of the nth minimum |e_i|.]
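The division-free enumeration described above can be sketched in Python as follows. This reflects our reading of the description (evaluate the error for every real symbol, pick the minimum, then use the sign of that error to fix the zigzag direction); the exact selection logic of the hardware is not given in the text. The sketch assumes a sorted, uniformly spaced real alphabet and a positive diagonal element, and the name `see_order` is hypothetical.

```python
def see_order(b, r_ii, omega):
    """Order the symbols of one layer from best to worst without a division.

    b:     interference-cancelled value of the layer (b_{i+1} in (3))
    r_ii:  positive diagonal element of R for the layer
    omega: sorted, uniformly spaced real alphabet, e.g. [-3, -1, 1, 3]
    Returns a list of (symbol, squared increment) pairs.
    """
    errs = [b - r_ii * sym for sym in omega]          # one MAC per symbol
    k0 = min(range(len(omega)), key=lambda k: abs(errs[k]))
    step = 1 if errs[k0] > 0 else -1                  # sign fixes the zigzag direction
    order = [k0]
    for n in range(1, len(omega)):
        for k in (k0 + step * n, k0 - step * n):
            if 0 <= k < len(omega):                   # skip symbols outside the alphabet
                order.append(k)
    return [(omega[k], errs[k] ** 2) for k in order]


alphabet = [-3, -1, 1, 3]
ranked = see_order(1.7, 1.2, alphabet)
ranked_edge = see_order(-2.5, 1.0, alphabet)
```

The nth best node is then simply the nth entry of the returned list, and its squared increment is added to the PED of the parent candidate.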
2) Memory units: The memory units are designed as binary
heap [12] data structures, which keep the stored elements
ordered according to a selected criterion. The partial candidate
memory set S is implemented as a min-heap, where the stored
partial candidates are ordered so that the candidate with the
minimum PED is always at the top of the heap.
The final memory set L is implemented as a max-heap, where
the candidate with the maximum ED is at the top of the heap.
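The role of the max-heap for the final candidate memory can be illustrated with a small Python sketch: the root always holds the worst stored leaf, so it directly provides the current radius and is the element to replace when a better leaf arrives. Python's `heapq` module only provides a min-heap, so distances are stored negated here. The class name `FinalList` and its methods are hypothetical and only serve this illustration.

```python
import heapq


class FinalList:
    """Keeps the n_cand candidates with the smallest distance."""

    def __init__(self, n_cand):
        self.n_cand = n_cand
        self._heap = []          # entries are (-distance, candidate)

    def radius(self):
        """Largest stored distance once the list is full, else infinity."""
        if len(self._heap) < self.n_cand:
            return float("inf")
        return -self._heap[0][0]

    def offer(self, distance, candidate):
        """Try to store a leaf candidate; return True if it was stored."""
        if len(self._heap) < self.n_cand:
            heapq.heappush(self._heap, (-distance, candidate))
            return True
        if distance < -self._heap[0][0]:
            heapq.heapreplace(self._heap, (-distance, candidate))
            return True
        return False

    def items(self):
        """Stored candidates as (distance, candidate), best first."""
        return sorted((-neg_d, cand) for neg_d, cand in self._heap)


final_list = FinalList(n_cand=2)
accepted = [final_list.offer(d, c)
            for d, c in [(5.0, "a"), (3.0, "b"), (4.0, "c"), (9.0, "d")]]
```

In this toy run the candidate with distance 5.0 is replaced by the one with distance 4.0, and the last candidate is rejected because it lies outside the current radius.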
C. LLR calculation
The soft output information $L_D(b_k)$ is calculated from
the IR-LSD algorithm output list L by using the max-log-MAP
approximation as in (5). The LLR calculation unit
microarchitecture is illustrated in Figure 6. The architecture
can be divided into two main parts: the scaling of the ED
values and the search of the maximum values for each bit. The
units can be pipelined to increase the execution speed.
The ED values in the candidate list L are scaled by
multiplying them with the inverse of the noise variance $1/(2\sigma^2)$,
i.e., one reciprocal (division) operation and a total of $N_{cand}$ MULs are
required. Different levels of parallelism and pipelining can
be applied for the MUL operations in order to speed up the
calculations. The max-log-MAP approximation is calculated
for each bit $b_k$ and the calculation requires that all the $N_{cand}$
^G[`P
^[`P
N 417
/'EN
P _/_
,)
EN 
3DUDOOHO
08/
XQLWV
[
x
1
−
( )ba,max

( )ba,max
EN 
EN 

Fig. 6. The designed microarchitecture for LLR calculation unit that uses
the max-log-MAP approximation.
ED values in the candidate list L are checked for each of the $Q N_T$
bits in order to determine the maximum values for both bit
values. Thus, two sequential logic loops are required in
the calculation, which are illustrated with the m and k variables in
the architecture description. The latency of the loops can be
decreased by applying parallel and pipelined logic to check
multiple ED values or bits in parallel.
IV. IMPLEMENTATION
The FPGA implementation of the IR-LSD architecture is
done for an $N_R = N_T = 4$ system with a 16-QAM constellation.
The design is written in ANSI C++ and then synthesized with the
Mentor Graphics Catapult C Synthesis tool [5] to produce
bit-accurate, parallel hardware. Catapult is used to create complex,
high-performance hardware, and allows quick experimentation
with different design specifications for an application specific
integrated circuit (ASIC) or an FPGA.
A. Trade-offs and word lengths
The IR-LSD algorithm is a sequential search algorithm and
requires a variable number of algorithm iterations to execute
the tree search depending on the channel realization. The
number of visited nodes can be fixed in the hardware
implementation in order to determine the hardware resources and latency
of the implementation. An effective and straightforward way
to fix the number of iterations is to define a maximum limit
$D_{max}$ for the algorithm iterations [13]. We performed Monte
Carlo simulations in order to verify the performance of the
IR-LSD with the limited search and fixed-point word lengths.
A 1/2 rate [13,15] turbo coded MIMO-OFDM system was
assumed with $N_T = N_R = 4$ and a 16-QAM constellation in
an uncorrelated typical urban (UNC) 6 tap channel with a
velocity of 120 km/h. The receiver includes an IR-LSD with
a list size $N_{cand} = 15$, where the absolute values of the soft
output LLRs are limited to $|L_D(b_k)| < 8$, and a max-log-MAP
turbo decoder with 8 iterations. The performance of the
IR-LSD based receiver with different detector configurations
is presented in the left subplot in Figure 7. The effect of limited
IR-LSD search is studied by setting a maximum value $D_{max}$
for the number of executed algorithm iterations. The results
show that the IR-LSD works also with limited search and
max-log-MAP approximation, and the required maximum and
average algorithm iterations $D_{max}$ and $D_{avg}$ with different
SNR are shown in the right subplot in Figure 7. The determined
Fig. 7. Throughput vs SNR: Performance of the real IR-LSD based receiver
in a 4 × 4 antenna system with 16-QAM constellation (left). The required
number of LSD algorithm iterations for target FER (right).
[Plots not reproduced. Left panel: "4x4 MIMO, 16-QAM, Ncand=15, UNC
channel", throughput (Mbps) vs. ES/N0 (dB), curves: Log-MAP, no limit,
float; Max-Log, no limit, float; Max-Log, no limit, fixed; Max-Log,
limit=100, fixed; Max-Log, limit=80, fixed; Max-Log, limit=60, fixed.
Right panel: "Number of required iterations for target FER=0.1",
number of iterations vs. ES/N0 (dB), curves: Dmax, Davg.]
TABLE I
WORD LENGTHS FOR THE IR-LSD IN A 4 × 4 SYSTEM WITH 16-QAM.
SQRD       H        R        Q        norm     sqrt()   div()
(W,I)      (11,3)   (26,4)   (26,3)   (27,5)   (23,3)   (21,3)
LSD alg.   y˜       R        Ω        bi(s)    d(s)
(W,I)      (10,4)   (9,3)    (8,1)    (12,5)   (10,5)
LLR        σ        LLR      LD(bk)
(W,I)      (8,0)    (10,5)   (6,4)
word lengths for the IR-LSD are listed in Table I, where
W and I refer to the total number of bits and
to the number of integer bits of the representation, respectively.
It can be seen that the SQRD requires internal word lengths of up
to 27 bits to produce an accurate enough decomposition of the real 8 × 8
channel matrix H. We note that the high word lengths could
be decreased by introducing internal scaling of the SQRD
variables. At most 12 bits are needed in the IR-LSD
internal processing, and the LLR calculation requires 10 bits.
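The effect of a (W, I) word length can be explored with a small quantizer sketch. The number format assumed here is signed two's complement with the sign bit counted among the I integer bits, rounding to nearest and saturation at the range limits; the text does not specify signedness, rounding or overflow handling, so these are assumptions of the example. The function name `quantize` is hypothetical.

```python
import math


def quantize(value, w, i, signed=True):
    """Quantize value to a fixed-point format with w total and i integer bits.

    Assumed (not stated in the paper): two's complement with the sign bit
    included in i, rounding to nearest, saturation on overflow.
    """
    frac_bits = w - i
    scaled = math.floor(value * 2.0 ** frac_bits + 0.5)   # round to nearest
    if signed:
        lo, hi = -(1 << (w - 1)), (1 << (w - 1)) - 1
    else:
        lo, hi = 0, (1 << w) - 1
    scaled = max(lo, min(hi, scaled))                      # saturate
    return scaled / 2.0 ** frac_bits


# (10,4) format: 6 fractional bits, step 1/64, range [-8, 8 - 1/64].
q_small = quantize(1.23456, 10, 4)
q_high = quantize(100.0, 10, 4)
q_low = quantize(-100.0, 10, 4)
# (10,5) format: 5 fractional bits, step 1/32.
q_step = quantize(0.02, 10, 5)
```

Under these assumptions a (10,4) word has a resolution of 1/64, and a (10,5) word has a resolution of 1/32 with twice the range.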
B. Implementation results
The RTL output of the Catapult C Synthesis tool was synthesized
with Mentor Graphics Precision RTL. The design was
targeted for a Xilinx Virtex-4 chip, and the device utilization of
the FPGA chip and the latencies of the main units are shown
in Table II. The resource allocations are listed in configurable logic
block (CLB) slices, block random access memories (BRAMs),
and DSP48 units. The SQRD implementation, which used
two parallel MULs in the main calculation units, is able to
calculate 110k QRD operations per second. The maximum
throughput of the IR-LSD algorithm implementation, where
two parallel MULs and four parallel MACs were used in the
SEE and PED unit, is 13.3 Mbps. The throughput of the LLR
calculation unit implementation, where two parallel MULs,
full parallelism for the one-bit max calculation and pipelining
were used, is 75.5 Mbps. It should be noted that parallel units can
be used in an OFDM system to obtain a higher total throughput.
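The relation between the latencies and the throughput figures reported in Table II can be checked with simple arithmetic. The sketch below assumes that each unit processes one symbol vector (or one channel matrix) at a time without overlap between consecutive vectors, and that one 4 × 4, 16-QAM symbol vector carries $Q N_T = 16$ coded bits; the iteration count derived at the end is implied by the reported numbers and is not stated in the text.

```python
Q, N_T = 4, 4                       # 16-QAM, four transmit antennas
bits_per_vector = Q * N_T           # 16 coded bits per detected symbol vector


def throughput_mbps(latency_us):
    """Bits per microsecond equals Mbit/s for one vector per latency period."""
    return bits_per_vector / latency_us


# LLR unit: 0.212 us per symbol vector.
llr_mbps = throughput_mbps(0.212)

# SQRD unit: 9.06 us per decomposition.
sqrd_per_second = 1e6 / 9.06

# LSD algorithm: 0.133 us per iteration; number of iterations D that
# would correspond to the reported 13.3 Mbps.
implied_iterations = bits_per_vector / (0.133 * 13.3)
```

The results are approximately 75.5 Mbps for the LLR unit and 110k decompositions per second for the SQRD unit, and the reported 13.3 Mbps of the LSD algorithm unit corresponds to roughly nine algorithm iterations per symbol vector.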
V. CONCLUSIONS
We designed and introduced parallel and scalable architecture
units for the IR-LSD. It was shown that the main
TABLE II
THE DEVICE UTILIZATION FOR XILINX VIRTEX-IV CHIP AND LATENCIES.
Resource     SQRD           LSD alg.            LLR calc.
Slices       2848           1595                1841
BRAMs        2              6                   0
DSP48s       24             6                   3
Latency      9.06 μs        0.133 μs · D        0.212 μs
Throughput   110k oper./s   13.3 Mbps @ 19 dB   75.5 Mbps
operations of the algorithm can be run in parallel and are
scalable to different configurations with minor changes. The
designed architecture was implemented for a 4 × 4 system with
16-QAM on a Virtex-IV FPGA chip. The results show that
the LSD is feasible for practical implementation.
ACKNOWLEDGEMENTS
This work was done in the MITSE project, which was supported
by Elektrobit, Nokia, Nokia-Siemens Networks, Texas Instruments
and the Finnish Funding Agency for Technology and
Innovation, Tekes. The authors thank Mentor Graphics for the
possibility to evaluate the Catapult C Synthesis tool.
REFERENCES
[1] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-
antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, Mar. 2003.
[2] U. Fincke and M. Pohst, “Improved methods for calculating vectors
of short length in a lattice, including a complexity analysis,” Math.
Comput., vol. 44, no. 5, pp. 463–471, May 1985.
[3] M. O. Damen, H. E. Gamal, and G. Caire, “On maximum–likelihood
detection and the search for the closest lattice point,” IEEE Trans.
Inform. Theory, vol. 49, no. 10, pp. 2389–2402, Oct. 2003.
[4] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and
H. Bölcskei, “VLSI Implementation of MIMO Detection Using the
Sphere Decoding Algorithm,” IEEE Journal of Solid-State Circuits,
vol. 40, no. 7, Jul. 2005.
[5] M. G. Datasheet, “Catapult Synthesis,” Mentor Graphics, Tech. Rep.,
2008, http://www.mentor.com/products/esl/high_level_synthesis/catapult_synthesis/index.cfm.
[6] M. Myllylä, M. Juntti, and J. Cavallaro, “The Effect of Preprocessing to
the Complexity of List Sphere Detector Algorithms,” in Proc. Int. Symp.
Wireless Pers. Multimedia Commun. (WPMC), Saariselkä, Finland, 8-11
September 2008.
[7] D. Wübben, R. Böhnke, V. Kühn, and K. Kammeyer, “MMSE extension
of V-BLAST based on sorted QR decomposition,” in Proc. IEEE Veh.
Technol. Conf. (VTC), vol. 1, Orlando, Florida, Oct. 6–9 2003, pp. 508–
512.
[8] M. Myllylä, J. Cavallaro, and M. Juntti, “A List Sphere Detector based
on Dijkstra’s Algorithm for MIMO-OFDM Systems,” in Proc. IEEE Int.
Symp. Pers., Indoor, Mobile Radio Commun. (PIMRC), Athens, Greece,
Sep 12 - 19, 2007.
[9] W. Xu, Y. Wang, Z. Zhou, and J. Wang, “A Computationally Efﬁcient
Exact ML Sphere Decoder,” in Proc. IEEE Global Telecommun. Conf.
(GLOBECOM), vol. 4, Nov. 29–Dec. 3 2004, pp. 2594–2598.
[10] E. W. Dijkstra, “A note on two problems in connexion with graphs,”
in Numerische Mathematik, vol. 1, Mathematisch Centrum, Amsterdam,
Netherlands, 1959, pp. 269–271.
[11] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and
sub-optimal MAP decoding algorithms operating in the log domain,”
Proc. IEEE Int. Conf. Commun. (ICC), pp. 1009–1013, 1995.
[12] D. Knuth, The Art of Computer Programming, Volume 3: Sorting and
Searching, Third Edition. Addison-Wesley, 1997.
[13] M. Myllylä, M. Juntti, and J. Cavallaro, “Implementation Aspects of
List Sphere Detector Algorithms,” in Proc. IEEE Global Telecommun.
Conf. (GLOBECOM), Washington, D.C., USA, Nov 26 - 30, 2007.
