ARCHITECTURE DESIGN AND IMPLEMENTATION OF THE INCREASING RADIUS - LIST
SPHERE DETECTOR ALGORITHM
Markus Myllylä∗, Markku Juntti
Centre for Wireless Communications
FIN-90014 University of Oulu, Finland
{markus.myllyla, markku.juntti}@ee.oulu.fi
Joseph R. Cavallaro
Dept. of Electrical & Computer Engineering
Rice University, Houston, TX 77251, USA
cavallar@rice.edu
ABSTRACT
A list sphere detector (LSD) is an enhancement of a sphere detector (SD)
that can be used to approximate the optimal MAP detector. In this paper,
we introduce a novel architecture for the increasing radius (IR)-LSD
algorithm, which is based on Dijkstra's algorithm. The presented
architecture exploits the parallelism available in the algorithm and is
scalable to different multiple-input multiple-output (MIMO) systems. The
novel architecture is implemented on a Virtex-IV field programmable gate
array (FPGA) chip using the Catapult C Synthesis tool from Mentor
Graphics, which is based on the high-level ANSI C++ language. The word
lengths used, the latency of the design, and the required resources are
presented and analyzed for a 4 × 4 MIMO system with 16-quadrature
amplitude modulation (QAM). The detector implementation achieves a
maximum throughput of 12.1 Mbps at high signal-to-noise ratio (SNR).
Index Terms— Sphere, LSD, architecture, implementation
1. INTRODUCTION
Multiple-input multiple-output (MIMO) channels offer improved capacity
and significant potential for improved reliability compared to single
antenna channels. A sphere detector (SD) calculates the hard-output
maximum likelihood (ML) solution with reduced complexity compared to
full-complexity ML detectors [1]. A list sphere detector (LSD) [2] is a
variant of the sphere detector that can be used to approximate, with
lower complexity, a MAP detector, which is the optimal detector for
systems with forward error correction (FEC) coding [2, 3, 4].
The sphere algorithms are often divided into breadth-first search
algorithms, such as the K-best algorithm [4], and sequential search
algorithms, such as the Schnorr-Euchner enumeration (SEE) based
algorithms [3]. The architecture design is important for an efficient
implementation, and architecture solutions with different levels of
parallelism have been introduced for the most common K-best and SEE
sphere algorithms, e.g., in [3, 4, 5]. In this paper, we consider a
sequential search algorithm, namely the increasing radius (IR)-LSD
algorithm, which is a modification of Dijkstra's algorithm [6] into an
LSD algorithm and is optimal in the sense of the number of visited nodes
in the sphere search tree structure [6, 7]. We identify and introduce
the key functional units of the IR-LSD algorithm, and design a novel,
highly parallel and scalable architecture for the algorithm.
∗This work was done in the MITSE project, which was supported by
Elektrobit, Nokia, Nokia-Siemens Networks, Texas Instruments and the
Finnish Funding Agency for Technology and Innovation, Tekes. The authors
would like to thank Mentor Graphics for the possibility to evaluate the
Catapult C Synthesis tool.
The designed architecture is implemented on a Virtex-IV field
programmable gate array (FPGA) chip for a 4 × 4 MIMO system with
16-quadrature amplitude modulation (QAM). The implementation is written
in the high-level ANSI C++ language and synthesized completely through
Mentor Graphics' Catapult C Synthesis tool to produce the resulting RTL.
We present the complexity and latency of the implementation and describe
the major challenges.
The paper is organized as follows. The MIMO signal detection and the
IR-LSD algorithm are presented in Section 2. The designed architecture
is presented in Section 3. The implementation of the algorithm is
presented in Section 4. The conclusions are drawn in Section 5.
2. MIMO SIGNAL DETECTION
A narrowband system with $N_T$ transmit and $N_R$ receive antennas is
considered, with the assumption $N_R \geq N_T$ and a QAM constellation.
The received signal can be expressed in the real domain as [1]

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \boldsymbol{\eta}, \tag{1}$$

where the received signal vector $\mathbf{y} \in \mathbb{R}^{2N_R \times 1}$,
the transmit symbol vector
$\mathbf{x} \in \Omega_R^{2N_T} \subset \mathbb{R}^{2N_T \times 1}$ and the
Gaussian noise vector $\boldsymbol{\eta} \in \mathbb{R}^{2N_R \times 1}$ are
defined in the frequency domain. The channel matrix
$\mathbf{H} \in \mathbb{R}^{2N_R \times 2N_T}$ contains real Gaussian
coefficients with unit variance. The complex QAM constellation $\Omega$
is transformed into a real symbol alphabet $\Omega_R \subset \mathbb{Z}$
with $Q_R$ bits per symbol, e.g., $\Omega_R = \{-3, -1, 1, 3\}$ in the
case of 16-QAM. We assume a practical system with FEC and with a
separate soft-output detector and decoder at the receiver, where the
detector generates soft output information $L_{D1}(b_k)$ for each
transmitted bit $b_k$ [2].
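The real-domain form of (1) can be illustrated with a short Python sketch. The paper does not state how the real and imaginary parts are ordered, so the stacking used here (all real parts first, then all imaginary parts) is an assumption, as are the helper names `to_real_model` and `matvec` and the example numbers.

```python
def to_real_model(H, y):
    """Map a complex model y = Hx to an equivalent real one (assumed stacking)."""
    n_r, n_t = len(H), len(H[0])
    H_r = [[0.0] * (2 * n_t) for _ in range(2 * n_r)]
    for r in range(n_r):
        for c in range(n_t):
            h = H[r][c]
            H_r[r][c] = h.real               # top-left block: Re(H)
            H_r[r][c + n_t] = -h.imag        # top-right block: -Im(H)
            H_r[r + n_r][c] = h.imag         # bottom-left block: Im(H)
            H_r[r + n_r][c + n_t] = h.real   # bottom-right block: Re(H)
    y_r = [v.real for v in y] + [v.imag for v in y]
    return H_r, y_r


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical noiseless 2 x 2 example; 16-QAM symbols have real and
# imaginary parts taken from the real alphabet {-3, -1, 1, 3}.
H = [[1 + 2j, 0.5 - 1j], [-1j, 2 + 0j]]
x = [3 - 1j, -1 + 3j]
y = [sum(h * s for h, s in zip(row, x)) for row in H]

H_r, y_r = to_real_model(H, y)
x_r = [s.real for s in x] + [s.imag for s in x]
```

With this stacking, `matvec(H_r, x_r)` reproduces `y_r`, so a detector working on the real model searches over $2N_T$ real symbols instead of $N_T$ complex ones.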
978-1-4244-2354-5/09/$25.00 ©2009 IEEE, ICASSP 2009

The SDs find the ML solution of $\mathbf{x}$ with reduced complexity
compared to exhaustive search algorithms. The sphere search is done by
limiting the search to the points inside a $2N_R$-dimensional
hyper-sphere $S(\mathbf{y}, \sqrt{C_0})$ centered at $\mathbf{y}$. After
QR decomposition (QRD) of the channel matrix $\mathbf{H}$, the condition
can be written as [1]

$$\|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{x}\|_2^2 \leq C_0, \tag{2}$$

where $C_0$ is the squared radius of the sphere,
$\mathbf{R} \in \mathbb{R}^{2N_R \times 2N_T}$ is an upper triangular
matrix with positive diagonal elements,
$\mathbf{Q} \in \mathbb{R}^{2N_R \times 2N_R}$ is an orthogonal matrix,
and $\tilde{\mathbf{y}} = \mathbf{Q}^T\mathbf{y}$. Due to the upper
triangular form of $\mathbf{R}$, the values of $\mathbf{x}$ can be solved
from (2) level by level using the back-substitution algorithm. Let
$\mathbf{x}_i^{2N_T} = (x_i, x_{i+1}, \ldots, x_{2N_T})^T$ denote the last
$2N_T - i + 1$ components of the vector $\mathbf{x}$. The squared partial
Euclidean distance (PED) of $\mathbf{x}_i^{2N_T}$ can be calculated as [3]

$$d(\mathbf{x}_i^{2N_T}) = d(\mathbf{x}_{i+1}^{2N_T}) + \Big|\tilde{y}_i - \sum_{j=i}^{2N_T} R_{i,j} x_j\Big|^2 = d(\mathbf{x}_{i+1}^{2N_T}) + \big|b_{i+1}(\mathbf{x}_{i+1}^{2N_T}) - R_{i,i} x_i\big|^2, \tag{3}$$

where $d(\mathbf{x}_{2N_T}^{2N_T}) = 0$,
$b_{i+1}(\mathbf{x}_{i+1}^{2N_T}) = \tilde{y}_i - \sum_{j=i+1}^{2N_T} R_{i,j} x_j$,
$R_{i,j}$ is the $(i,j)$th term of $\mathbf{R}$ and $i = 2N_T, \ldots, 1$.
Depending on the search strategy and the channel realization, the SD
searches a variable number of nodes in the tree structure and aims to
find the point $\mathbf{x} = \mathbf{x}_1^{2N_T}$, also called a leaf
node, for which the Euclidean distance (ED) $d(\mathbf{x}_1^{2N_T})$ is
minimum.
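The layer-by-layer PED update in (3) can be sketched in Python as follows. This is an illustration only: layers are indexed from 0 instead of 1, and the function names `ped_update` and `full_distance` as well as the example values are assumptions made for this sketch.

```python
def ped_update(d_parent, y_t, R, x, i):
    """PED of layers i..n-1, given the PED of layers i+1..n-1 (0-based)."""
    n = len(R)
    # b is the part of the update that does not depend on the new symbol x[i].
    b = y_t[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
    return d_parent + (b - R[i][i] * x[i]) ** 2


def full_distance(y_t, R, x):
    """Accumulate the PED from the last layer up to the first one."""
    d = 0.0
    for i in range(len(R) - 1, -1, -1):
        d = ped_update(d, y_t, R, x, i)
    return d


# Hypothetical 2-layer example with an upper triangular R.
R = [[2.0, 1.0],
     [0.0, 1.0]]
y_t = [0.5, -2.0]
x = [1, -3]
d = full_distance(y_t, R, x)   # (-2 + 3)^2 + (0.5 + 3 - 2)^2 = 1 + 2.25
```

The accumulated value equals the squared norm of `y_t - R x` computed directly, which is why a partial candidate whose PED already exceeds $C_0$ can be discarded without visiting the layers below it.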
The list sphere detector (LSD) [2] can be used for obtaining a list of
candidates of the transmitted symbol vectors and the corresponding EDs,
$\mathcal{L} \in \mathbb{Z}^{N_{cand} \times 2N_T}$, as an output, where
$N_{cand}$ is the size of the candidate list so that
$1 \leq N_{cand} \leq 2^{Q_R 2N_T}$. The LSD output candidate list can
then be used to approximate the log-likelihood ratio (LLR) of the
transmitted data as a soft output. The increasing radius (IR)-LSD
algorithm is listed as Algorithm 1. The algorithm operates in a
sequential fashion starting from the root layer, and extends the partial
candidate $s = \mathbf{x}_{i+1}^{2N_T}$ with the next best admissible
node $x_i$. The father candidate $s_f = \mathbf{x}_{i+2}^{2N_T}$ is also
extended if an admissible node $x_{i+1}$ exists. The variables $n_1$ and
$n_2$ indicate the order number of the next best node, i.e., they tell
how many nodes have been checked, for the child and father candidates,
respectively. The algorithm uses two memory sets: the final candidate
memory $\mathcal{L}$, which holds $N_{cand}$ candidates, and the partial
candidate memory $S$, whose size depends on the number of algorithm
iterations, i.e., while loop repetitions. The partial candidate
information $\mathcal{N}(s, d(s), n_2, i)$, which is stored to $S$,
includes the partial candidate, the PED, the number of extended father
nodes, and the current layer, respectively. The information stored to
the final list $\mathcal{L}$ includes only the candidate and the ED, and
$C_0$ is updated according to the largest ED in $\mathcal{L}$. After
each iteration, the algorithm continues with the partial candidate
$\mathcal{N}$ with the minimum PED from $S$ as long as $d(s) < C_0$.
Algorithm 1 [L] = IR-LSD(ỹ, R, N_cand, Ω_R, N_T)
 1: Initialize sets S and L, and set C_0 = ∞, m = 0, n_1 = 1
 2: Initialize N(s = x_{2N_T}^{2N_T}, d(s) = 0, n_2 = 2, i = 2N_T − 1)
 3: while d(s) < C_0 do
 4:   Determine the n_1th best node x_i for s_c = (x_i, x_{i+1}^{2N_T})^T
      and calculate d(s_c)
 5:   Determine the n_2th best node x_{i+1} for father candidate
      s_f = (x_{i+1}, x_{i+2}^{2N_T})^T and calculate d(s_f) if n_2 ≤ |Ω_R|
 6:   if d(s_c) < C_0 then
 7:     if s_c is a leaf node then
 8:       Store N_F(s_c, d(s_c)) in {L}_m
 9:       Set m = m + 1 or, if the set is full, set m according to
          {L}_m with max ED and C_0 = d(s)_m
10:       Continue with N(s, d(s), n_1++, 1) if n_1 + 1 ≤ |Ω_R|
11:     else
12:       Store N_c(s_c, d(s_c), n_2 = 2, i−−) in S
13:     end if
14:   end if
15:   if N_f calculated and d(s_f) < C_0 then
16:     Store N_f(s_f, d(s_f), n_2++, i) in S
17:   end if
18:   Continue with N with min PED from S and set n_1 = 1
19: end while
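The search in Algorithm 1 can be illustrated with a simplified software sketch in Python. It is not the authors' hardware-oriented formulation: it assumes 0-based layers, keeps the father candidate and sibling rank in each stored element instead of tracking $n_1$ and $n_2$, and re-sorts the children on demand. The names `ir_lsd` and `brute_force` and the example $\mathbf{R}$ and $\tilde{\mathbf{y}}$ are hypothetical. What it keeps from Algorithm 1 is the min-PED-first visiting order, the child and father extensions, the two memory sets, and the radius update from the largest ED in the final list.

```python
import heapq
from itertools import product


def ir_lsd(y_t, R, n_cand, omega):
    """Best-first list search sketch; returns [(ED, x), ...] sorted by ED."""
    n = len(R)

    def children(s, d, i):
        # Extend s = (x_{i+1}, ..., x_{n-1}) with every symbol x_i and
        # return (PED, candidate) pairs sorted by PED.
        b = y_t[i] - sum(R[i][j] * s[j - i - 1] for j in range(i + 1, n))
        return sorted((d + (b - R[i][i] * x) ** 2, (x,) + s) for x in omega)

    final = []               # final list as a max-heap on ED (negated keys)
    c0 = float("inf")        # squared sphere radius C_0
    # Elements of S: (PED, candidate, sibling rank, father PED, father).
    S = [(0.0, (), len(omega) - 1, 0.0, ())]   # root has no sibling
    while S:
        d, s, rank, fd, fs = heapq.heappop(S)  # candidate with minimum PED
        if d >= c0:
            break
        # Father extension: store the next best sibling of s.
        if rank + 1 < len(omega):
            sd, sib = children(fs, fd, n - len(s))[rank + 1]
            if sd < c0:
                heapq.heappush(S, (sd, sib, rank + 1, fd, fs))
        # Child extension.
        i = n - len(s) - 1
        kids = children(s, d, i)
        if i > 0:
            if kids[0][0] < c0:
                heapq.heappush(S, (kids[0][0], kids[0][1], 0, d, s))
            continue
        for kd, leaf in kids:   # leaves: take next best ones while inside C_0
            if kd >= c0:
                break
            if len(final) < n_cand:
                heapq.heappush(final, (-kd, leaf))
            else:
                heapq.heapreplace(final, (-kd, leaf))  # drop the largest ED
            if len(final) == n_cand:
                c0 = -final[0][0]
    return sorted((-nd, x) for nd, x in final)


def brute_force(y_t, R, n_cand, omega):
    """Exhaustive reference: the n_cand smallest EDs over all vectors."""
    n = len(R)

    def ed(x):
        return sum((y_t[i] - sum(R[i][j] * x[j] for j in range(i, n))) ** 2
                   for i in range(n))

    return sorted((ed(x), x) for x in product(omega, repeat=n))[:n_cand]


# Hypothetical example: 4 real layers (a 2 x 2 complex system), 16-QAM.
OMEGA = (-3, -1, 1, 3)
R = [[1.9, 0.3, -0.4, 0.2],
     [0.0, 1.4, 0.5, -0.1],
     [0.0, 0.0, 1.1, 0.6],
     [0.0, 0.0, 0.0, 0.8]]
Y_T = [2.3, -1.7, 0.4, 2.9]
result = ir_lsd(Y_T, R, 5, OMEGA)
reference = brute_force(Y_T, R, 5, OMEGA)
```

Because the PED never decreases along a path and the candidate with the minimum PED is always expanded first, anything discarded has a PED of at least the final $C_0$, so the returned EDs match the exhaustive reference.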
Fig. 1. An architecture for the IR-LSD algorithm.
3. ARCHITECTURE
A list sphere detector consists of the QRD, the LSD algorithm, and the
log-likelihood ratio (LLR) calculation units. The QRD unit decomposes
the channel matrix $\mathbf{H}$ into $\mathbf{R}$ and $\mathbf{Q}$, which
are given together with $\mathbf{y}$ as inputs to the LSD algorithm unit.
The LSD algorithm unit executes the sphere tree search and determines
the output candidate list $\mathcal{L}$. The approximation of
$L_D(b_k)$ is calculated in the LLR calculation unit using the candidate
list $\mathcal{L}$. In this paper, we focus on the architecture of the
IR-LSD algorithm.
The architecture for the IR-LSD algorithm is shown in Figure 1. It
consists of two SEE and PED units, a partial candidate memory unit, a
final candidate memory unit, and a logic unit. The SEE and PED units
extend the selected partial candidate and its father node with the next
best admissible nodes and calculate the PEDs of the updated candidates.
The partial candidate memory unit stores the already extended partial
candidates, while the leaf candidates are stored in the final candidate
memory unit. The logic unit determines the candidate(s) to be extended
and stored in the next algorithm iteration.
The IR-LSD algorithm architecture in Figure 1 is designed to have as
much parallel processing as possible to decrease the overall latency of
one algorithm iteration. In one iteration, the algorithm studies one or
two new nodes of the sphere search tree, depending on whether the father
node is extendable. The two SEE and PED units are designed to execute
lines 4 and 5 of the algorithm description in parallel. The control
logic unit executes the logic in lines 6-18 and determines the
candidates to be stored and the candidate to be extended in the next
iteration. This means that the candidates extended in iteration round 1
are stored to the memory in iteration round 2. The storing of the
partial candidates in lines 12 and 16 to the partial memory unit and the
storing of the final candidate in line 8 to the final memory unit are
then done in parallel with the SEE and PED units. Thus, the total
latency of one algorithm iteration is equal to the latency of the
slowest parallel unit plus the latency of the control unit. The total
latency of one signal vector detection process, i.e., one algorithm run,
depends on the required number of algorithm iterations, which is in turn
related to the number of checked nodes in the search tree. The required
number of iterations depends on the system configuration and the channel
environment.
3.1. SEE and PED Unit
There are two similar SEE and PED units in the IR-LSD architecture, as
shown in Figure 1. The units execute the partial candidate extensions in
lines 4 and 5 of the algorithm description separately and in parallel.
Both units are not needed in every iteration, but in practice both are
occupied more than 90% of the time. Each SEE and PED unit is divided
into two subunits, as shown in Figure 1.

Fig. 2. A min-heap memory architecture.
The first subunit calculates $b_{i+1}(\mathbf{x}_{i+1}^{2N_T})$, which is
the part of the PED calculation in (3) that is independent of the new
symbol $x_i$. The subunit can be implemented with different levels of
parallelism to speed up the multiplication (MUL) operations. The number
of required multiplications in the calculation of
$b_{i+1}(\mathbf{x}_{i+1}^{2N_T})$ is $2N_T - i - 1$, where $i$ is the
current layer in the search tree and $i_{max} = 2N_T$. It should be noted
that the average layer in the search process is less than half of the
tree height $N_T$, because a larger share of the search process is done
in the upper part of the tree.
The second subunit executes the SEE, i.e., determines the $n_1$th best
node $x_i$, and calculates the PED of the extended candidate
accordingly. The SEE is done in a slightly modified fashion compared to
(14) in [1]. Instead of performing the costly, high-latency division
operation, we evaluate (3) for the $|\Omega_R|$ different symbols $x_i$,
which can be implemented with 1 to $|\Omega_R|$ parallel MUL and
subtraction (SUB) operations. The $|\Omega_R|$ values are then sorted
and the PED is calculated with the $n_1$th best node. Alternatively, the
architecture could be designed to first determine the symbol with the
minimum PED and then determine the $n_1$th best node using logic in the
same way as presented in [1].
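The division-free enumeration described above can be sketched in a few lines of Python. The function name `nth_best_symbol` and the numeric values are assumptions for illustration; in hardware the $|\Omega_R|$ evaluations would run in parallel rather than in a loop.

```python
def nth_best_symbol(b, r_ii, d_parent, omega, n1):
    """Return (PED, x_i) for the n1-th best symbol, n1 = 1 being the best.

    Evaluates the PED update of (3) for every symbol of the real alphabet
    with one multiplication and one subtraction each, then sorts, so no
    division b / r_ii is needed.
    """
    peds = sorted((d_parent + (b - r_ii * x) ** 2, x) for x in omega)
    return peds[n1 - 1]


OMEGA_R = (-3, -1, 1, 3)   # real alphabet of 16-QAM
best = nth_best_symbol(b=3.5, r_ii=2.0, d_parent=1.0, omega=OMEGA_R, n1=1)
second = nth_best_symbol(b=3.5, r_ii=2.0, d_parent=1.0, omega=OMEGA_R, n1=2)
# best   -> (3.25, 1):  1.0 + (3.5 - 2.0 * 1) ** 2
# second -> (7.25, 3):  1.0 + (3.5 - 2.0 * 3) ** 2
```

The order obtained this way agrees with the zig-zag order around `b / r_ii = 1.75` that a division-based enumeration would produce (1, 3, -1, -3).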
3.2. Memory Units
The memory units are designed as binary heap data structures [8], which
keep the stored elements in order according to a selected criterion. The
partial candidate memory set $S$ is implemented as a min-heap, where the
stored elements $\mathcal{N}(s, d(s), n_2, i)$ are ordered so that the
candidate with the minimum PED is always at the top of the heap. The
final memory set $\mathcal{L}_F$ is implemented as a max-heap, where the
stored elements $\mathcal{N}(s, d(s))$ are ordered so that the candidate
with the maximum PED is at the top of the heap. A binary min-heap tree
structure is illustrated in Figure 2, where the value of each memory
slot is shown inside the circle and the memory address underneath it.
The operations used with the heap memory are read-min/max,
extract-min/max, and insert. The running time of the read-min/max
operation is $O(1)$ [8], i.e., it requires just a memory read of the
base address. The running time of the insert and extract-min/max
operations is $O(\log_2(k))$ in the worst case [8], where $k$ is the
size of the memory and $\lfloor \log_2(k) \rfloor$ is the height of the
tree. In the insert operation, we store the information to the next
available memory slot, which is illustrated as inserting the value X to
address 8 in Figure 2. The value is then moved to the correct level with
the up-heap operation. The operation requires at least one read and one
write operation of the memory, and it is repeated a maximum of
$\lfloor \log_2(k) \rfloor$ times until the new element is in its
correct place. The extract-min/max operation extracts the base address
element and replaces it with either a new element or the last element of
the memory. Then the down-heap operation is executed to move the element
to the right position. The down-heap operation, which requires at least
two memory reads and one memory write, is also repeated a maximum of
$\lfloor \log_2(k) \rfloor$ times until the newly added element is at
the correct position.

Table 1. Determined word lengths for the real IR-LSD algorithm.
Signal   ỹ        R       Ω       b_i(s)   d(s)
(W,I)    (10,4)   (9,3)   (8,1)   (12,5)   (10,5)
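The heap operations described above can be sketched in Python with an array whose base address is 1, so that the father of address `a` is `a // 2`. The class name `MinHeap`, the swap counter, and the stored values are illustrative assumptions, and the sketch expects a non-empty heap for `read_min` and `extract_min`. A max-heap for the final list follows by reversing the comparisons.

```python
class MinHeap:
    """Array-based binary min-heap with the root at base address 1."""

    def __init__(self):
        self.mem = [None]          # address 0 is unused

    def read_min(self):
        return self.mem[1]         # O(1): one read of the base address

    def insert(self, value):
        """Store at the next free address, then up-heap; returns the swaps."""
        self.mem.append(value)
        a = len(self.mem) - 1
        swaps = 0
        while a > 1 and self.mem[a // 2] > self.mem[a]:
            self.mem[a // 2], self.mem[a] = self.mem[a], self.mem[a // 2]
            a //= 2
            swaps += 1
        return swaps

    def extract_min(self):
        """Remove the root, move the last element to it, then down-heap."""
        top = self.mem[1]
        last = self.mem.pop()
        if len(self.mem) > 1:
            self.mem[1] = last
            a, k = 1, len(self.mem) - 1
            while 2 * a <= k:
                c = 2 * a                       # left child
                if c + 1 <= k and self.mem[c + 1] < self.mem[c]:
                    c += 1                      # right child is smaller
                if self.mem[a] <= self.mem[c]:
                    break
                self.mem[a], self.mem[c] = self.mem[c], self.mem[a]
                a = c
        return top


heap = MinHeap()
for ped in (4, 9, 6, 12, 10, 8, 7):   # fills addresses 1..7
    heap.insert(ped)
swaps = heap.insert(1)                # stored at address 8, rises to the root
smallest = heap.read_min()
ordered = [heap.extract_min() for _ in range(8)]
```

In this example the new minimum inserted at address 8 needs three swaps, which is the worst case $\lfloor \log_2(8) \rfloor = 3$ for a memory of eight elements.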
In the IR-LSD architecture, the sizes of $\mathcal{L}_F$ and $S$ are
equal to the required list size $N_{cand}$ and to the maximum number of
algorithm iterations, respectively. In the worst case, one extract-min
and one insert operation are required in the partial memory unit in each
iteration. The partial memory size can be decreased by introducing a
separate sphere constraint for the stored candidates. With a proper
choice of the sphere constraint, the required memory size is decreased
without any performance loss. The sphere constraint can be determined to
be, e.g., relative to the previous largest candidate in the final list
or relative to the partial candidate search level.
3.3. Scalability
The architecture can be used as such in systems with different numbers
of transmit antennas $N_T$ and different constellations $\Omega$. The
number of transmit antennas $N_T$ and the constellation $\Omega$ affect
the size of the search tree and, thus, the number of algorithm
iterations required to detect the transmitted signal vector
$\mathbf{x}$. The maximum number of required iterations is equal to the
number of required elements in the partial candidate memory. The
operations required by the SEE in the SEE and PED unit also depend on
the constellation size. The proper final list size also varies with the
system configuration.
4. IMPLEMENTATION
The IR-LSD algorithm architecture was implemented with the real signal
model for a 4 × 4 MIMO system with 16-QAM. The performance of a turbo
coded system was studied with a real IR-LSD, sorted QRD (SQRD) [9]
preprocessing and log-MAP LLR calculation in an uncorrelated (UNC)
channel, as shown in the left subplot of Figure 3. The LSD candidate
list size was selected as $N_{cand} = 15$ and the fixed-point word
lengths used are listed in Table 1, where $W$ and $I$ refer to the total
number of bits and the number of bits used for the integer part,
respectively. We also studied the maximum and average number of
iterations, $D_{max}$ and $D_{avg}$, required by the LSD algorithm to
reach a 10% target frame error rate (FER) at different SNRs, as shown in
the right subplot of Figure 3. It can be seen that the required maximum
number of iterations $D_{max}$ decreases with increasing SNR, and the
IR-LSD algorithm requires as few as $D_{min} = 9$ iterations, i.e., 18
studied nodes, in a high SNR environment to reach the target FER. The
size of the partial candidate memory is selected as $D_{max} = 80$ to
support the lower SNR operation.
4.1. Complexity and Latency
The RTL output of the Catapult C Synthesis tool was synthesized with
Mentor Graphics Precision RTL, and the FPGA place and route was done
with Xilinx ISE software for a Xilinx Virtex-IV chip with a clock
frequency of $f_s = 150$ MHz. The device utilization of the Xilinx
Virtex-IV chip and the latencies of the units in clock cycles (cc),
including the total iteration latency $\Delta_{tot}$, are shown in
Table 2. The latencies of the memory units are calculated according to
the average number of heap operations, and the units are implemented in
dual-port RAM, which enables two parallel read/write operations. The
total latency of one detector iteration consists of the latency of the
slowest parallel unit, which is the SEE and PED unit, and the latency of
the control logic unit.
[Figure 3. Left panel: throughput (Mbps) versus E_S/N_0 for 4 × 4 MIMO,
16-QAM, real IR-LSD with LLR clip 8, UNC channel, 120 km/h; curves:
L=64, L=32 and L=15 (no limit, float), L=15 (no limit, fixed), and L=15
with limit=100, 80 and 60 (fixed). Right panel: number of iterations for
target FER versus E_S/N_0; curves: D_max and D_avg.]
Fig. 3. Throughput vs SNR (left) and alg. iter. vs SNR (right).
Table 2. Device utilization for Xilinx Virtex-IV chip and latencies.
Unit            CLB Slices   BRAMs   DSP48s   Latency
SEE&PED x2      536          0       3        17 cc
Partial mem.    522          4       0        16 cc
Final mem.      291          2       0        11 cc
Control logic   144          0       0        5 cc
Total           2029         6       6        22 cc
The total guaranteed throughput of the implementation can be calculated
as

$$\text{Throughput} = \frac{2 N_T Q_R f_s}{\Delta_{tot} D_{avg}} \ \text{bits/s}. \tag{4}$$

The maximum guaranteed throughput is then 12.1 Mbps at $\gamma = 21$ dB
for the 10% target FER with $D_{avg} = 9$. However, the guaranteed
implementation throughput drops to 1.6 Mbps at $\gamma = 13$ dB, which
can be considered the worst case scenario with $D_{avg} = 70$. Thus, the
throughput is mainly dependent on the number of iterations $D_{avg}$.
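The throughput figures can be reproduced from (4) and the latencies in Table 2 with a few lines of Python. The function name `throughput_bps` is an assumption, and $Q_R = 2$ is inferred from the four-level real alphabet of 16-QAM rather than stated explicitly in the text.

```python
def throughput_bps(n_t, q_r, f_s, delta_tot, d_avg):
    """Guaranteed throughput of (4) in bits/s."""
    return 2 * n_t * q_r * f_s / (delta_tot * d_avg)


# Iteration latency: slowest parallel unit plus the control logic (Table 2).
delta_tot = max(17, 16, 11) + 5          # clock cycles

N_T, Q_R, F_S = 4, 2, 150e6              # 4 x 4 MIMO, 16-QAM, 150 MHz
best = throughput_bps(N_T, Q_R, F_S, delta_tot, d_avg=9)
worst = throughput_bps(N_T, Q_R, F_S, delta_tot, d_avg=70)

print(f"iteration latency: {delta_tot} cc")
print(f"high SNR:  {best / 1e6:.1f} Mbps")
print(f"low SNR:   {worst / 1e6:.1f} Mbps")
```

Each detected vector carries $2 N_T Q_R = 16$ bits and takes $\Delta_{tot} D_{avg}$ clock cycles, so the throughput scales inversely with the average number of iterations.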
4.2. Discussion and Comparison to Other Work
The main limiting factor for higher throughput is the latency of one
algorithm iteration. The total number of iterations can only be lowered
with lattice reduction techniques or by sacrificing the performance of
the detector. The latency of one algorithm iteration is currently
limited by the SEE and PED unit, and it could be lowered by, e.g., an
ASIC implementation. As far as the authors know, no architecture designs
or implementations of the IR-LSD algorithm have been presented in the
literature.
The parallel nature of the introduced architecture makes the algorithm
implementation competitive against the current state-of-the-art
depth-first algorithm or soft-output detectors [10, 11]. Some
parallelism can be added by dividing the search into separate real and
imaginary branches, as in the hard-output work in [10]. However, the
increased throughput results in approximately double the complexity and
decreased performance. The complex signal model leads to higher maximum
throughput in [11], but results in more complex units and a higher
average number of visited nodes. The main interest in practice is the
performance and the complexity of the implementation in the worst case
scenario, because the implementation has to be able to work in those
conditions. As the other works typically present maximum throughput
results, a direct comparison with other work is difficult, but the
authors believe that, with a possible ASIC implementation, the IR-LSD
algorithm is a very good competing alternative.
5. CONCLUSIONS
We designed and introduced a novel and scalable architecture for the
IR-LSD algorithm with parallel processing units. The architecture was
designed so that the main operations of the algorithm can be run in
parallel, and it is scalable to different configurations with minor
changes. An implementation of the architecture was presented for a
4 × 4 system with 16-QAM on a Virtex-IV FPGA chip. The complexity and
the latency of the implementation were presented and analyzed. The
throughput of the current implementation could be enhanced, e.g., with
an ASIC implementation.
6. REFERENCES
[1] M. O. Damen, H. El Gamal, and G. Caire, “On maximum–
likelihood detection and the search for the closest lattice point,”
IEEE Trans. Inform. Theory, vol. 49, no. 10, pp. 2389–2402,
Oct. 2003.
[2] B. Hochwald and S. ten Brink, “Achieving near-capacity on a
multiple-antenna channel,” IEEE Trans. Commun., vol. 51, no.
3, Mar. 2003.
[3] Andreas Burg, Moritz Borgmann, Markus Wenk, Martin Zell-
weger, Wolfgang Fichtner, and Helmut Bölcskei, “VLSI Im-
plementation of MIMO Detection Using the Sphere Decoding
Algorithm,” IEEE Journal of Solid-State Circuits, vol. 40, no.
7, July 2005.
[4] Z. Guo and P. Nilsson, “Algorithm and implementation of the
K-best sphere decoding for MIMO detection,” IEEE J. Select.
Areas Commun., vol. 24, no. 3, Mar. 2006.
[5] B. Widdup, G. Woodward, and G. Knagge, “A Highly-Parallel
VLSI Architecture for a List Sphere Detector,” in Proc. IEEE
Int. Conf. Commun. (ICC), Paris, France, 20-24 June 2004, pp.
2720–2725.
[6] E. W. Dijkstra, “A note on two problems in connexion with
graphs,” in Numerische Mathematik, Mathematisch Centrum,
Amsterdam, Netherlands, 1959, vol. 1, pp. 269–271.
[7] W. Xu, Y. Wang, Z. Zhou, and J. Wang, “A Computationally
Efficient Exact ML Sphere Decoder,” in Proc. IEEE
Global Telecommun. Conf. (GLOBECOM), Nov. 29–Dec. 3
2004, vol. 4, pp. 2594–2598.
[8] D. Knuth, The Art of Computer Programming, Volume 3: Sort-
ing and Searching, Third Edition, Addison-Wesley, 1997.
[9] D. Wübben, R. Böhnke, V. Kühn, and K. Kammeyer, “MMSE
extension of V-BLAST based on sorted QR decomposition,” in
Proc. IEEE Veh. Technol. Conf. (VTC), Orlando, Florida, Oct.
6–9 2003, vol. 1, pp. 508–512.
[10] X. Huang, C. Liang, and J. Ma, “System architecture and
implementation of MIMO sphere decoders on FPGA,” IEEE
Trans. VLSI Syst., vol. 16, no. 2, pp. 188 – 197, Feb. 2008.
[11] C. Studer, A. Burg, and H. Bolcskei, “Soft-output sphere de-
coding: algorithms and VLSI implementation,” IEEE J. Select.
Areas Commun., vol. 26, no. 2, pp. 290 – 300, Feb. 2008.