VLSI Architecture for MIMO Soft-Input Soft-Output Sphere Detection by Esther P. Adeva et al.
J Sign Process Syst (2013) 70:125–143
DOI 10.1007/s11265-012-0709-z
VLSI Architecture for MIMO Soft-Input Soft-Output
Sphere Detection
Esther P. Adeva · Tobias Seifert · Gerhard Fettweis
Received: 20 April 2012 / Revised: 1 October 2012 / Accepted: 4 October 2012 / Published online: 18 December 2012
© The Author(s) 2012. This article is published with open access at SpringerLink.com
Abstract Achieving good detection performance while
incurring low complexity is known to be one of
the major challenges in multiple-input multiple-output
(MIMO) communications based on spatial multiplex-
ing. The tuple search detector (TSD) was recently in-
troduced, improving this trade-off with regard to other
tree-search-based algorithms (e.g. single tree search or
list sphere detector). Motivated by the tremendous gain
achievable through the turbo principle and based on a
previously developed soft-output (SO) TSD implemen-
tation, this work presents the first soft-input soft-output
(SISO) TSD realization, scalable in constellation size
and number of antennas and mapped to a highly par-
allel and pipelined VLSI architecture. The proposed
SISO-TSD VLSI realization is instantiated for 4 × 4
MIMO transmission and 64-QAM constellation in 65-
nm CMOS technology. For a given BER↔complexity
trade-off, the throughput ranges from 57.3 Mbps (it-
erative detection-decoding with 3 iterations) to 403.6
Mbps (non-iterative detection-decoding) at a clock fre-
quency of 454 MHz. The BER↔complexity trade-off
can be moreover adjusted according to transmission
conditions, reaching >1 Gbps in high SNR scenarios.
A silicon area of 0.14 mm2 (97.7 kGEs) is occupied
by the SISO-TSD core, reporting low power dissipa-
tion (58.2 mW – 73.9 mW) under typical case operat-
E. P. Adeva (B) · T. Seifert · G. Fettweis
Vodafone Chair Mobile Communications Systems,






ing conditions. The proposed detector implementation
achieves hence high throughput with reasonable hard-
ware complexity, representing a very competitive strat-
egy with regard to relevant state-of-the-art realizations.
Keywords MIMO · Iterative detection-decoding ·
Soft-input soft-output (SISO) detection · Sphere
detection · Tuple search detector · SIMD ·
Pipeline-interleaving · VLSI architecture
1 Introduction
Future mobile communication systems will make use
of multiple-input multiple-output (MIMO) techniques
in combination with high constellation orders to en-
hance spectral efficiency. The turbo principle (i.e. it-
erative detection-decoding) is in this regard foreseen
as a strategy to approach the full potential of MIMO.
Additionally, transmission of spatially multiplexed data
streams allows increasing data rates as well as di-
versity. As it is widely known, in such systems the
potential search space grows exponentially with in-
creasing number of transmit antennas (NT) and con-
stellation size (Q). As a consequence, the inherent
high detection complexity of most tree-search-based
detection strategies represents a limiting factor towards
efficient detector implementation. Low-complex detec-
tion strategies provide poor bit error rate (BER) per-
formance (e.g. linear detector, successive/parallel inter-
ference cancelation—SIC/PIC detector, . . . ), whereas
implementations achieving high BER performance
present unusable high complexity (e.g. search ap-
proaches like unclipped single tree search sphere
126 J Sign Process Syst (2013) 70:125–143
detector—STS-SD). Besides the mentioned difficulties
concerning complexity, most of today’s communication
standards define several transmission modes (e.g. up
to 8 × 8 MIMO and up to 64-QAM in Long Term
Evolution Advanced—LTE-A [1, 2]), thus becoming
flexibility and scalability of detection approaches an
additional challenge to be addressed. The tuple search
detector (TSD) introduced in [17] has demonstrated to
outperform the complexity↔BER-performance trade-
off of comparable tree search detection strategies like
STS [25] or list sphere detection (LSD) [31]. Addition-
ally, TSD’s search complexity presents a roughly linear
trend with NT and Q [4], hence becoming especially
well suited for high-order transmission scenarios (NT ≥
4, Q > 16). In contrast to breadth-f irst approaches
presenting fixed search complexity, the complexity of
the so-called depth-f irst detection algorithms (like STS,
LSD and TSD) is variable. This fact represents an ad-
ditional challenge towards practical realizations, since
the inherent irregular and data-dependent control flow
typically frustrates efficient algorithm parallelization,
thus considerably limitig the achievable throughput en-
hancement. In this work, the first soft-input soft-output
(SISO) TSD realization capable of reaching LTE-A
data rates is presented and compared to a previously
developed soft-output (SO) TSD implementation [3].
The turbo principle [10] allows significant further im-
provement of BER performance, but it also generally
leads to further complexity increase resulting from the
multiple detection-decoding runs per received symbol.
Both benefit and cost of processing a-priori information
from channel decoder will be hence evinced throughout
this work.
The considered communications system model is in-
troduced in Section 2, followed by the description of
the complexity-reduced SISO-TSD algorithm (Section
3) representing the basis of our implementation. In
order to minimize the required resources and speed
up the computations, fixed-point arithmetic is utilized,
as described in Section 4. Analysis of the computa-
tional complexity and the critical path is presented
in Section 5. Considered parallelism approaches for
further throughput enhancement are detailed in Sec-
tion 6. In Section 7 the processor architecture is pre-
sented and relevant implementation challenges are in-
troduced. Corresponding VLSI implementation results
are presented in Section 8 and discussed in Section 9.
As summarized in Section 10, low-complexity, scalabil-
ity and high throughput (up to 1 Gbps)1 characterize
1At 454 MHz, for 4 × 4 transmission and 64-QAM in high SNR
scenarios, non-iterative case.
the proposed SISO-TSD implementation, resulting in
a very competitive detection strategy with regard to
relevant state-of-the-art realizations.
2 System Model
Throughout this paper, we consider a NT × NR MIMO
system based on a bit-interleaved coded modulation
(BICM) transmission strategy with NT transmit and
NR receive antennas, as depicted in Fig. 1. A vector
u of i.i.d. information bits is encoded by the outer
channel encoder with rate R. The resulting stream of
vectors c′ is bit-interleaved with a random interleaver
and portioned into blocks c of NT · L bits, where L
denotes the number of bits per transmit symbol. For
the transmission, the corresponding bits c ∈ C , covered
in the set of permitted bit vectors, are mapped (e.g.
Gray mapping) onto complex constellation symbols
x(c) = [x0, ...xNT−1]T = map(c) with xi ∈ X , being X
the set of valid transmit symbols xi with cardinality
#X = #C = 2L = Q. The transmit energy is normal-
ized so that E {xxH} = ES/NTI, where Es represents the
average transmit energy of the transmitter.
Regarding the transmission, we consider an uncorre-
lated, flat fading channel and an additive noise vector
n ∈ CNR×1 at the receiver with complex components of
zero mean i.i.d. gaussian random variables of variance
N0/2 per real dimension (E {nnH} = N0I). The consid-
ered passive channel is represented by H ∈ CNR×NT and
is assumed to be perfectly known at the receiver. The
received signal y is therefore given by
y = Hx + n
and the signal-to-noise-ratio (SNR = Es/N0) at the
receiver applied to the energy of one information bit
can be stated as Eb/N0 = Es NR/N0 NTLR.
At the receiver side, the detection process is carried
out by a complex-valued TSD algorithm in conjunction
with a PCCC (Parallel Concatenated Convolutional
Code) channel decoder, which can operate in iterative
detection-decoding fashion. In order to ensure compa-
rability of results, a setup equivalent to the one used in
e.g. [11, 32] has been used for our simulations.2
2Channel decoder is BCJR-based (Bahl, Cocke, Jelinek and
Raviv) and uses (7R, 5) convolutional codes, 8 internal iterations
and rate 1/2. Information block size of 9216 bits (including tail
bits) and Gray mapping are used.
































Figure 1 Communications system model with BICM transmitter and iterative receiver.
3 Complexity-Reduced Tuple Search
MIMO Detection
3.1 Fundamentals
Task of the focused detector is the determination of
the bits c most likely sent as well as of reliability in-
formation for these bits. This can be accomplished by



















{λ0} + 1N0 minc|cm,l=−1 {λ0} ,
(1)
where Eq. 1 results from application of the max-log
approximation [23]. The l-th bit of a symbol sent by the












represents the metric for a set of received symbols
y, a given c and the a-priori knowledge La. xˆ corre-
sponds to a possible transmitted symbol. Consequently,
besides the most likely sent symbol arg min
xˆ(c)|c∈C
{λ0} (i.e.
the detection hypothesis) and its corresponding metric
λ0(cML), the detector has to determine also the counter-
hypotheses arg min
xˆ(c)|c∈C ,cm,l =cMLm,l
{λ0} and their metrics for each
bit.
3.2 Tree Search Basics
Since brute force (full Max-Log A Posteriori
Probability—APP) detection of Eq. 1 is known to
be of exponentially growing computational complexity
with the number of transmit antennas and order of
the constellation, several close to optimal detection
approaches have been lately proposed, some of the
most promising are based on tree search strategies.
Transforming the detection problem is permitted by
the QR-decomposition (QRD) of H = QR, where Q
is unitary and R an upper triangular matrix [30]. With
modified received symbols y′ = QHy, determination of
the squared Euclidean distance
∥
∥y′ − Rxˆ(c)∥∥2 (3)
can be interpreted as a tree search. Resulting from this,

























It should be noticed that λa(xˆi) contribution may in-
crease or decrease λi, depending on both ci, j and the
sign of La(ci, j). This may lead to exploration of unfa-
vorable nodes during the tree search as well as to ex-
clusion of favorable ones. In order to avoid this effect,
128 J Sign Process Syst (2013) 70:125–143













with ci, j =sign(La(ci, j))
(6)
The search is carried out in depth-f irst fashion, succes-
sively extending the selected nodes by analyzing their
child nodes. As introduced in [20] and described in
Section 3.5, a regularized control flow and the parallel
calculation of sibling parent nodes as well as of leaf
nodes permit a so called one-node-per-cycle implemen-
tation [6]. Based on this, the average number of node
extensions n performed in the detection is taken as
search complexity measure throughout this paper.
3.3 Tuple Search and Bit-Specific Candidate
Determination
Computing the L-values in Eq. 1 requires the deter-
mination of a detection hypothesis and all counter-
hypotheses as described in Section 3.1. Explicit search
for all the needed minimums leads to impractically high
n [12]. Therefore, instead of searching all possible
minima, TSD [17] searches a subset of T most likely
leaves, similarly to the LSD approach. The metrics
λ0 of these leaves are stored in a search tuple T :=
{λ0 (c1) , λ0 (c2) , . . . , λ0 (cT)}, defining the sphere radius







The tuple search is additionally combined with sepa-
rated bit-specific storage of information for the L-value
calculation [17]. The resulting TSD achieves much
better BER performance than LSD, at a significantly
reduced n compared to STS detection. Additionally,
adjustment of n ↔BER-performance trade-off is en-
abled by varying the size of the tuple T.
3.4 Complexity Reduction Approaches
Strategies like sorted QR decomposition (SQRD)
[30], MMSE preprocessing [32] as well as sphere and
L-values clipping [9] are widely applied approaches
towards n reduction. Novel approaches like search se-
quence determination (SSD) [18] and metric estimation
(ME) [4] contribute to further reduce the computa-
tional complexity, as described in the following.
3.4.1 Search Sequence Determination (SSD)
The SSD strategy introduced in [18] replaces the costly
metric calculations in Eq. 4 and sorting operations
required by SE enumeration by few inexpensive basic
operations. The SSD approach is based on geometrical
position analysis relative to reference nodes xr, deter-
mined by




(with ·	 representing a rounding operation to the
closest constellation symbol). Resulting enumeration is
defined by fixed sequences which are mapped to con-
stellation symbols during the detection, based on the
relative position between y′′′i and x
r. The computational
complexity of this strategy can be further decreased
by reducing the amount and length of the considered
sequences. As proposed in [18], in this work only m = 2
sequences of n = 14 partially combined elements have
been used.
It should be noticed that the SSD strategy represents
a very good approximation of the ideal SE enumeration
under the assumption that λi depends exclusively on the
Euclidean distance (as it occurs in SO detection). As
introduced in [18] and further investigated in [24], when
a-priori information is present (i.e. in SISO detection),
the node enumeration resulting from the SSD approach
may differ significantly from the ideal SE enumera-
tion. This effect leads to significant BER performance
degradation, especially if specif ic leaf sequences (Fig.
2) are used. In [18] the Min-Search approach is pro-
posed to mitigate this effect. By means of this strategy
the detection performance is significantly enhanced,
but the computational complexity increases due to the
additional metric computations and sorting operations
required. In this work an adaptive hypothesis strategy is
applied. In contrast to Min-Search, this approach makes
use of the already analyzed symbols instead of search-
ing for additional ones. The set of nodes B considered
by the specif ic leaf sequence comprises seven elements
(one possible hypothesis xr and six candidate counter-
hypotheses in real and imaginary directions of xr) [18].
In case that B contains a symbol xr′ with lower λ0 than
the initially selected xr, the proposed approach rectifies
the potential leaf hypothesis as xr′, whereas remaining
symbols in B (including the originally selected hypoth-
esis xr) are treated as potential counter-hypotheses, as
illustrated in Fig. 2.
J Sign Process Syst (2013) 70:125–143 129
xr
xr
Figure 2 Candidate symbols after application of the adaptive
hypothesis approach (64-QAM constellation).
Even though the adaptive hypothesis approach
presents a slightly worse BER↔ n trade-off than Min-
Search [24], it does not require the costly additional
metric computation and sorting operations incurred
by the latter. Adaptive hypothesis represents therefore
a very attractive approach towards efficient hardware
implementation, as shown in Section 7.2.
3.4.2 Metric Estimation
The computationally expensive operations required for
the metric calculation (Eq. 4) can be considerably sim-
plified by an estimation based on the SSD’s geometrical
approach [4]. Euclidean distances in Eq. 4 are replaced
by predefined geometrical distances dm,n correspond-





∥2 ≈ r2iid2m,n. (9)
It is additionally possible to precalculate r2ii as well as
r2iid
2
m,n, hence reducing the complex products in Eq. 4
to a single real multiplication or even completely dis-
solving it. This considerable reduction in computational
complexity makes this technique especially interesting
for realizable hardware implementations.
3.5 Algorithm Partitioning and Regularization
The TS-based detection process can be decomposed,
as proposed in [20], in an arbitrary number of regular-
ized loops. The operations performed within each loop
have been partitioned into the task blocks §a . . . §n,
as illustrated in Fig. 3. Selection of first node to be
extended (Eq. 8) is performed in §a. SSD’s geometric
position analysis is subsequently carried out in §b ,
determining the corresponding search sequence and
the second node to be extended. Metric (Eq. 9) and
interference (Eq. 4) associated to these nodes are de-
termined by §c, § f, §g and §h. Subsequently, the radius
tuple is updated (Eq. 7) by §d. Final task within the
loop is the selection of the tree layer to be processed
during next loop (§e). As mentioned in Section 3.2,
a so called one-node-per-cycle(-loop) realization is
desired. Consequently, sibling parent nodes and leaf
nodes are assumed to be processed in parallel [18]. For
this purpose, it is necessary to define separated blocks
which process first (§a, §c, § f ), second (§b , §g, §h) and
subsequent nodes (§i, § j, §k) individually. Likewise,
definition of task blocks processing candidate symbols
(§l, §m) for later L-values calculation (Eq. 1) in §n is
required. In order to extend the SO-TSD functionality
to process soft-information from the channel decoder
(SISO-TSD), the task blocks computing metric dis-
tances (§c, §g, § j and §l, as highlighted in Fig. 3) have
been modified to include a-priori information accord-
ing to Eq. 6. Additionally, the task block §m has been
modified to include the adaptive hypothesis strategy
described in Section 3.4.1.
















1st. node selection 2nd. Node sel. /
search sequence det.




















130 J Sign Process Syst (2013) 70:125–143
4 Fixed-Point Representation
Fixed-point arithmetic is commonly applied in order
to limit the latency and hardware resources required
in a practical implementation. However, overflow and
quantization errors incurred due to the inherent limited
precision typically lead to BER performance degrada-
tion in the context of MIMO detection. It is there-
fore necessary to define the bit-resolution appropri-
ately, according to the trade-off between BER perfor-
mance and hardware complexity. In [3] a maximum
word width of 8 bits3 was proposed, whereas ∼ 0.3 dB
SNR loss has to be accepted. In this work, suitable
fixed-point representation of La(ci, j) has been addi-
tionally investigated and required considerations taken
into account, as described throughout the following
sections.
4.1 Normalization
As investigated in [3], parameters involved in Eq. 1
and hence in Eq. 4 (y′i, R = {rii, rij}i = j and La(ci, j), with
i, j = {1, . . . , (NT − 1)}) possess a dynamic range fluctuating
with the received energy Er. In order to define fixed
ranges, independent of transmission conditions, elim-
ination of Er dependency is required. For this pur-
pose, a scaling factor sf = f√Er (with f represent-
ing a constant factor) is utilized to normalize pa-
rameters r˜ii = rii/sf and L˜a(ci, j) = La(ci, j)/s2f , leading








/s2f . Concerning y
′
i and rij, normalization with
sf is not necessary since application of the SSD strategy
(Eq. 8) eliminates the energy dependency. A certain
scaling factor f ′ is nevertheless required in order to
accommodate their dynamic range to the proposed
8-bit quantization. Based on this, y˜ = y′i/ f ′ and r˜ij =
rij/ f ′ are additionally defined. To ensure flexibility
of the proposed detector implementation, fixed-point
representation has to be additionally independent of
the system configuration. For this purpose, adjust-
ing f and f ′ according to the transmission scheme
is proposed [3], as shown in Table 1.4 As a result,
definition of different fixed-point representations de-
3Integer and fractional word lengths have been individually de-
termined for each of the variables involved in the detection.
Integer word lengths have been selected to cover the variables
range, while fractional lengths have been determined based on
BER performance analysis through simulations. Further details
can be found in [16].
4Values determined through simulation for the given system
model.
Table 1 Scaling factors f and f ′, to enable flexibility of the
proposed fixed-point representation.
Q 4 16 64
f 12 6 3
f ′ 8 16 32
pending on transmission characteristics (as e.g. in [27])
is avoided. By applying the described normalization,
8-bit fixed-point representation is enabled, indepen-
dent of transmission conditions and adaptable to dif-
ferent NT and Q by simple adjustment of f and f ′. The
resulting detector presents lower hardware complexity
and higher flexibility than comparable sphere detector
(SD) realizations [21, 27], typically requiring greater
precision (10–16 bits) to achieve comparable BER
performance.
4.2 Simulation Results
Figure 4 depicts SNR↔ n achieved at 10−5 BER, using
different transmission and detection schemes operat-
ing in a non-iterative receiver.5 Results correspond-
ing to unclipped (Lmax = ∞) STS(FLP) detection are
depicted representing full max-log-APP SNR perfor-
mance bound. SIC(FXP) detection has been implemented
using the proposed TSD(FXP) realization with limited
n = NT, hence representing the lowest n bound. Re-
sults corresponding to the proposed TSD for a given
SNR↔ n trade-off are also included [3]. As depicted,
fixed-point arithmetic leads to ∼0.3dB SNR degrada-
tion with respect to the analogous TSD floating point
approach. Metric underestimations resulting from the
limited precision lead, in addition, to a n reduction of
∼20 %. It should be noticed that SNR↔ n difference
between TSD(FLP) and TSD(FXP) is approximately con-
stant for the considered transmission configurations,
hence validating the flexible normalization approach
proposed in Section 4.1. Additonally, it should be ob-
served that SNR↔ n relationship is adjustable accord-
ing to transmission requirements, ranging from close to
full max-log-APP SNR performance at significantly re-
duced n, to SIC SNR performance and complexity. For
the depicted SNR↔ n trade-offs, TSD(FXP) achieves
considerable complexity reduction with regard to un-
clipped STS(FLP) (∼1–6 orders of magnitude less nodes
are extended during the tree search), while providing
considerable SNR performance gain with respect to
SIC (∼4dB in most of the considered scenarios).
5FLP denotes floating point, while FXP stands for fixed point.
J Sign Process Syst (2013) 70:125–143 131













STS(uncl.), ZF − FLP
TSD−SSD−ME, MMSE − FLP
TSD−SSD−ME, MMSE − FXP















Figure 4 n required for 10−5 BER, corresponding to different
transmission and detection schemes (It = 1).
As long as a-priori information is considered, the
complexity of the metric computation increases with
increasing Q (e.g. for 64-QAM modulation 6 further
additions are required for each partial metric λi, as
derived from Eq. 4). It should be additionally noticed
that due to the inherent range limitation, n increases
as the bit-width of La(ci, j) is reduced, as illustrated
in Fig. 5b. In order to keep the added computational
complexity as low as possible at no SNR degradation
(Fig. 5a), 5-bit quantization is selected. A comparison
between non-iterative (It = 1) and iterative detection
with 3 iterations (It = 4) is depicted in Fig. 6 for differ-
ent T values. For the SNR↔ n trade-off given by T =





















T=16 T=8 T=4 T=2
2.8 dB
x 7.0
Figure 6 n required for 10−5 BER, corresponding to different
tuple sizes T for a 4 × 4 64-QAM transmission with 3 (It = 4)
and without (It = 1) detection-decoding iterations.
8, a detection performance gain of ∼3 dB is achieved
by performing 3 detection-decoding iterations (It = 4)
with respect to the non-iterative case (It = 1). This
significant performance improvement comes at the cost
of ∼7 times higher n. Note that despite this increase,
complexity of SISO-TSD with It = 4 is still ∼3 orders
of magnitude lower than unclipped STS with It = 1
(Fig. 4). As previously observed for the non-iterative
case [4], increasing T beyond a certain value (e.g.
32) does not lead to significant further performance
improvement.









































FXP − 5 bits
FXP − 4 bits
FXP − 3 bits
(a) BER (b) Complexity (  n)
Figure 5 Impact of the fixed-point representation of La(ci, j) on BER performance (SISO-TSD, 4 × 4 MIMO, 64-QAM, It = 4).
132 J Sign Process Syst (2013) 70:125–143
5 Computational Complexity and Critical Path
Estimation
In order to estimate the hardware implementation
costs, computational complexity and latency will be
analyzed in the subsequent. Resulting from the appli-
cation of the complexity reduction techniques in Sec-
tion 3.4 only basic operations (addition, comparison
and shift operations) are required to carry out the de-
tection (Eqs. 1, 4–9). The cost of comparison and addi-
tion operations is assumed to be approximately equiv-
alent, whereas the shift operation cost is neglegted.
Table 2 summarizes the amount of computations c
(add-equivalent operations, i.e. addition and compar-
ison) required by each task block for the SISO-TSD.
Despite inclusion of a-priori processing (Eq. 6), the
metric determination tasks (§c, §g, § j) still present the
lowest computational complexity c. However, depen-
dance on Q is now observed, in contrast to SO-TSD.
Except task block §l (approx. 50 % more complex due
to processing of a-priori information), c of remaining
task blocks remains unaffected in comparison to SO-
TSD, presenting §l and §m (i.e. handling the bit-specific
metrics) the highest c. Computational complexity de-
pends hence in general on system order (Q, NT), except
for the radius update task §d (growing linearly with
T), the mentioned metric determination tasks §c, §g, § j
(only depending on Q) and the node selection task
blocks §a, §b , §i (independent of NT, Q and T). Re-
garding the resulting overall computational complexity
ctotal, an increase of 10 % is observed in comparison
with SO-TSD (except for 2 × 2 64-QAM, where the
Table 2 Computational complexity c (in number of add-
equivalent operations) and latency of each task block, for differ-
ent transmission schemes.
Task 4 × 4 4 × 4 4 × 4 2 × 2 8 × 8 l
4-QAM 16-QAM 64-QAM 64-QAM 64-QAM
(T = 4) (T = 8) (T = 16) (T = 8) (T = 16)
§a 4 4 4 4 4 1
§b 11 11 11 11 11 1
§c 4 6 8 8 8 1
§d 9 13 21 13 21 1
§e 18 18 18 6 42 1
§ f 14 20 32 12 72 4
§g 3 5 7 7 7 1
§h 14 20 32 12 72 4
§i 15 15 15 15 15 1
§ j 3 5 7 7 7 1
§k 14 20 32 12 72 4
§l 33 62 102 98 110 6
§m 55 125 227 131 419 1
§n 8 16 24 12 12 1
ctotal 205 340 540 348 872 lcp
increase reaches up to 30 %). It should be noticed that
doubling NT or Q implies increasing ctotal by a factor of
∼1.6 (ctotal ≈ 4/5NT ≈ 4/5Q), which represents a
marginal increase compared to the SO-TSD implemen-
tation (where a factor ∼1.5 was observed).
Concerning latency l, the cost of shift operations is
neglected, while the cost of addition and comparison
operations is assumed to be equivalent to l = 1 clock
cycle. This consideration is a quite conservative design
constraint, which ensures achieving the target through-
put (depicted in Section 8.1) in a worst-case hard-
ware implementation. For compatibility with the SO-
TSD implementation (especially concerning the task
scheduling and pipeline-interleaving schemes shown in
Section 7) the computations included for processing of
a-priori information are also condensed in one clock cy-
cle. This assumption relaxes the previously mentioned
conservative design constraint for the metric computa-
tion blocks (§c, §g, § j and §l). As a consequence, slight
reduction of the maximum achievable clock frequency
should be expected. The latency l = 1 can be hence
assumed for the metric determination in both SO and
SISO detectors. According to the data dependency
analysis,6 most of the computations enclosed within
each task block may be performed in parallel. Con-
sidering sufficient parallelization level (varying with
NT, Q and T), fixed l can be assumed (shown in Table
2). Resulting from this, latency of the regularized loop
(Fig. 3) becomes independent of the system order (NT,
Q) and T, in contrast to computational complexity.
Loop latency can be further reduced by partially over-
lapping execution of the comprehended task blocks, re-
sulting in the task scheduling scheme illustrated in Fig.
7c. Since output from blocks §h, §l, §m and §n is never
required in the immediately subsequent loop [16], the
critical path is comprised by blocks §a − §e. Consid-
ering the latencies given in Table 2, the critical path
presents therefore a latency lcp = 5 clock cycles. Similar
assumptions and analysis can be applied to SD and
SIC detector implementations, resulting in the schemes
shown in Fig. 7a and b, respectively [16]. As illus-
trated, lcp of the proposed (8-bit precision) SO/SISO-
TSD implementation is <1/4 of lcp corresponding to
typical 16-bit precision SD realization and <1/2 of 16-
bit precision SIC detector lcp. In this regard, SSD and
ME approaches represent, together with the proposed
task scheduling, the main strategies contributing to this
considerable reduction of the critical path latency.
6The l = 1 assumption takes into account existing data dependen-
cies. Detailed data dependency analysis is considered to be out of
the scope of this work and thus is not presented.
J Sign Process Syst (2013) 70:125–143 133







(a) Conventional sphere detector [16 bits]
(b) Conventional SIC detector [16 bits]
Interference calc. (next nodes)Minimum-Search (next nodes)
Metric calculation


























(c)TSD-SSD-ME [8 bits] (this work)
6 Parallelization Strategies for Throughput
Enhancement
Implementing the bit- and instruction-level parallelism
mentioned in Section 5 does not require major consid-
erations besides dependency analysis.6 The detection
throughput is dominated by (1) the number of loops
required to complete the detection (or equivalently, n)
and (2) the loop latency lcp (for a given frequency).
In order to enhance throughput, further parallelization
approaches are considered in this work.
6.1 Pipeline-Interleaving
The pipeline-interleaving technique [13] allows increas-
ing the throughput provided that the processed data
streams are independent. In this work, this strategy
is applied in a similar way as done in [7, 14], as fur-
ther described in the following. Task blocks defined
in Section 3.5 can be grouped according to their exe-
cution times tex, illustrated in Fig. 7c. Thereby 5 sets
of tasks Sx, ∨ x = {0, . . . , 4} are defined, each assigned
to a different processing element (PE). Notice that
those tasks executed after lcp cycles (§m and §n) are
scheduled together with the tasks corresponding to
the subsequent loop. Based on this, the throughput
can be enhanced by interleaving the execution of sets
Sx corresponding to different detection paths d, in
pipelined fashion. Figure 8b illustrates the pipeline-
interleaving scheme of Sxd, where e.g. S32 represents
the execution of task set 3 corresponding to detection
path 2 (i.e. third interleaved detection path). Through-
put and resource utilization are maximized when the
number of interleaved detections equals the maximum
number of possible pipeline stages P, i.e. P = 5 for the
identified critical path (lcp = 5). Assuming in addition
that a new detection starts as soon as another has
finished, no PE is idle after filling in the pipe. By ap-
plying these considerations, resulting average detection
throughput τ is enhanced by a factor of nearly 5 with
regard to a non-pipelined implementation. It should be
noticed that the operations required for processing the
a-priori information have been distributed throughout
the described pipeline stages, in order to avoid detri-
menting the maximum achievable clock frequency with
respect to the SO detector implementation in [3].
6.2 SIMD Vectorization
As introduced in [20], vectorization of n-variable
SD (e.g. for parallel orthogonal frequency-division
multiplexing (OFDM) subcarrier processing) results
in increased average n (n) per vector element, in
contrast to n-fixed detectors (M-algorithm, K-best
. . . ). Assuming that a vector is completely processed
as soon as the execution of all its q parallel slices
have finished, the overall vector latency is deter-
mined by the slowest path (i.e. greater n). Result-
ing from this, n increases with the degree of vector-
ization q, thereby dramatically affecting the achiev-
able speedup. A simple approach to ease this prob-
lem consists in bounding n ≤ nmax [20], but early
termination of paths with n > nmax leads to BER
performance loss. It is nevertheless possible to find a
suitable nmax, e.g. nmax = 1.25 × n, causing negligi-
ble BER performance loss while allowing to nearly
achieve the maximum feasible speedup q [20]. The
simplicity and effectiveness of this approach make it
134 J Sign Process Syst (2013) 70:125–143
Table 3 Memory
specifications for
NT = NR = 4 and Q = 64
(b f = 40, q = 1).
Memory Width Length Size Content
(bits) ( words) (bytes)
IMEM 64 40 320 Program code
SMEM 64 15 120 System and channel info.
VMEMI 64 40 320 Received symbols (y)
VLAMEM 128 40 640 Soft input (a-priori info.)
VMEMO 192 40 960 Soft output (L-values)
especially interesting for vectorization of n-variable
SD, like TSD. The control flow regularization de-
scribed in Section 3.5 enables almost straightforward
application of SIMD vectorization to SO/SISO-TSD.
Assuming accordingly vectorized memory access (de-
scribed in Section 7.1) and bounding n appropriately,
τ may be enhanced by a factor of nearly q (τ q-SIMD ≈
q × τ ).
7 VLSI Architecture
The hardware realization presented in this work is
based on the SO-TSD ASIP model and implementation
concepts proposed in [3, 5], extended in order to sup-
port processing of a-priori information (SISO-TSD).
The design is based on the so-called synchronous trans-
fer architecture (STA) template [8] and is controlled
by means of very long instruction words (VLIWs). The
architecture is comprised of basic modules known as
functional units (FUs). The output ports are buffered
with registers so that data produced by a FU can be
directly consumed by connected FUs. The proposed
design is depicted in Fig. 9 and described through next
sections. It is mainly comprised of:
– Control unit: the instruction decoder gets VLIWs
read from the instruction memory (IMEM) and
maps the corresponding operations onto the FUs
comprising the design. A Flow Control Unit (FCU)
generates IMEM addresses [3], assisting the se-
quencer in handling the conditional execution flow
(as further detailed in Section 7.1.2).
– Data path: it contains data memories (SMEM,
VMEMI, VLAMEM, VMEMO), banks of data and
address registers and the MIMO detection mod-




The design comprises an instruction memory (IMEM)
and the four data memories specified in Table 3.
VMEMI, VLAMEM and VMEMO are q-fold vector
memories with regard to SIMD vectorization, while
SMEM is scalar. These memories represent buffers of
length b f words, with each word containing the infor-
mation of one detection path. Notice that VLAMEM
has been included in the SISO-TSD implementation in
order to enable processing a-priori information. Con-
sidering 4 × 4 MIMO transmission with 64-QAM con-
stellation, using buffers of size b f = 40 words (i.e. 40
individual detection paths) and disregarding vectoriza-
tion (q = 1), a total of ∼2 KB data memory is required.7
7.1.2 Memory Access
The latency cost inherent in memory access frequently
penalizes the design achievable throughput. In order
to ensure data availability without increasing the de-
tection latency, data access is performed in parallel to
the execution of detection tasks. As depicted in Fig. 8a
(where Detection Loop represents the regularized loop
of Fig. 3), the memory access is conditional, randomly
triggered by detection termination and channel update
events. Consequently, joint control flow of the inter-
leaved detection paths is frustrated. In order to satisfy
the individual data access demands of each data path
without disrupting the pipelined execution of remain-
ing ones, the pipeline-interleaving scheme illustrated
in Fig. 8b is employed, combined with the following
approaches [3]:
– Distributed AGU (Address Generation Unit): due
to the different nature and frequency of mem-
ory access, the address generation functionality has
been separately defined and distributively imple-
mented for each of the defined memories. VMEMI,
VLAMEM and SMEM access is sequentially per-
formed, thus relying on simple pointers and a regis-
ter file for address storage. Moreover, VMEMI and
VLAMEM present identical access patterns and
consequently, data can be loaded simultaneously
based on a unique pointer. As proposed in [3],
7Values stored in data memories are represented using the fixed-
point representation described in Section 4.
J Sign Process Syst (2013) 70:125–143 135































































































































LD – Load operation






VMEMO addresses are generated based on tags
which ensure that y and the corresponding L-values
occupy homologous memory areas in VMEMI and
VMEMO, respectively.
– SMEM cache: a partial copy of the highly frequently
accessed SMEM into a register file is proposed in
[3], thereby allowing faster data access and saving
costly memory access operations. SMEM access is
thus only required when channel information has to
be updated (depending on the channel coherence
time).
– VMEMI/VLAMEM prefetching: periodic and un-
conditional read operations are performed during
the complete detection process, temporary storing
prefetched data (24 bytes) in registers. Internal
signaling detects when data has been effectively
consumed (i.e. a new detection begins) in order
to update VMEMI/VLAMEM pointer accordingly
and prefetch new data.
– Flow Control Unit (FCU): despite the above de-
scribed considerations, VMEMO access (triggered
by termination of up to P = 5 different detection
paths) as well as SMEM access (depending on the
channel coherence time) are still conditional. Re-
sulting from this, multiple conditional branch op-
erations have to be performed by the sequencer,
as illustrated by the arrows in Fig. 8b. Based on
control signals received from the detector mod-
ule, the FCU (Fig. 9) identifies the trigger events
and determines the corresponding IMEM address,
subsequently transferring it to the sequencer. As
a result, the sequencer controls the execution
flow by uniquely performing unconditional branch
operations.
By applying these strategies, no overhead (i.e. addi-
tional clock cycles) is caused and control flow regular-
ization (Section 3.5) is not disrupted due to memory
access. It should be noticed that the control-flow and
pipeline-interleaving schemes illustrated in Fig. 8 have
been extended with respect to [5] in order to include
the VLAMEM access operations required for process-
ing a-priori information (SISO-TSD). Resulting latency
depends uniquely on the detection process, as described
in Section 8.1. The achievable speedup provided by
the parallelization approaches considered in Section 6
remains therefore unaffected.
7.2 MIMO Detector Module
The task blocks §a − §n (Fig. 3) defined by the pro-
posed algorithm partitioning (Section 3.5) are directly
mapped to the STA FUs comprising the MIMO detec-
tor module (Fig. 9) as detailed in the following.
7.2.1 Node Enumeration Unit (NEU) and Interference
Computation Unit (ICU)
The SSD strategy (Eq. 8) is implemented by the NEU
(comprised by task blocks §a, §b and §i in Fig. 3). The
predefined sequences are stored in small (∼32 bytes)
look-up-tables (LUTs). The operations involved in
the determination of the interference-reduced received
136 J Sign Process Syst (2013) 70:125–143







































































signal y′′ (Eq. 5) are performed in the ICU (comprised
by task blocks § f , §h and §k).
7.2.2 Metric Computation Unit (MCU)
As described in Section 3.4.2, r2iid
2
m,n values are pre-
calculated and stored in SMEM, thus reducing the
complexity of Eq. 4 to a simple addition operation
(performed in task blocks §c, §g and § j in Fig. 3). This
unit has been extended with respect to [5], in order to
include the up to L addition operations (Eq. 6) required
to take a-priori information from channel decoder into
account, as shown in Fig. 10.
7.2.3 Radius Administration Unit (RAU) and Level
Determination Unit (LDU)
The RAU (block §d in Fig. 3) is responsible for updat-
ing and sorting the list of leaf metrics λ0 comprising
the search tuple T . The search radius R is adapted
according to Eq. 7 by this entity, as detailed in Section 3.
Subsequently, the LDU (block §e) makes a decision on
the next tree level to be explored, implementing depth-
first tree traversal and prunning subtrees which lead to
metrics exceeding the search radius R. Both entities























si,j sign of La(ci,j)
Metric Computation Unit (MCU)
0 0 0 0 0
Figure 10 Block diagram of the proposed MCU architecture for SISO-TSD (dimensioned for 64-QAM).
J Sign Process Syst (2013) 70:125–143 137
Figure 11 Block diagram of the hypothesis update procedure with adaptive hypothesis.
7.2.4 Soft-output Administration Unit (SAU) and
L-values Computation Unit (LCU)
SAU (blocks §l and §m in Fig. 3) and LCU (block §n)
enable generation of soft-information. In block §l those
symbols representing the most favorable candidates
for (counter-)hypotheses are firstly selected (based on
the specif ic leaf sequences mentioned in Section 3.4.1)
and their corresponding metrics are computed. Subse-
quently, in block §m the previously stored hypothesis
and counter-hypotheses are replaced by those candi-
dates presenting lower metrics. Resulting metric values
are fed to the LCU (§n), calculating the L-values ac-
cording to Eq. 1. These entities are internally pipelined
according to the 5-stages scheme described in Section
6.1.
Block §m has been extended with respect to [5]
in order to support the adaptive hypothesis approach
described in Section 3.4.1. The update procedure of
the hypothesis is depicted in Fig. 2 (analogous proce-
dure is applied to update the counter-hypotheses). The
metrics λ0 of the previously stored hypothesis and the
new candidate hypothesis (xhyp. and xr′, respectively)
are firstly compared. Subsequently, λ0 and the binary
representation (bitmap(x)) of the symbol with the low-
est metric are stored. It should be noticed that in [5],
the closest-to-y′′′i constellation symbol (x
r in Fig. 2) is
directly taken as the candidate hypothesis, whereas in
this work the symbol xr′ ∈ B with the lowest metric
is considered instead (i.e. the candidate hypothesis is
rectified, as explained in Section 3.4.1). For this pur-
pose, the find_min block shown in Fig. 11 has been
included. In this block the metrics λ0 of all candidate
xpc are compared in parallel against λ0 of all other x
q
c
contained in B (with p, q = {0, 1, . . . , L}), requiring
(L + 1)2 comparators. This results in a matrix of binary
flags b p,q of size (L + 1) × (L + 1), where the row p
containing (L + 1) flags set to “1”8 corresponds to the
8Notice that the diagonal elements of the flag matrix (b p,q with
p = q) are forced to “1” in order to satisfy this condition.
candidate symbol xpc ∈ B with minimum λ0. By these
means, the increase in latency of block §m is minimized,
at the cost of increasing its area in ∼2 kGE (8.5 kGE in
[5], 10.6 kGE in this work).
7.2.5 Synchronous-Transfer Network
The FUs are connected according to the previously
introduced STA principle. The information is interme-
diately stored in the FUs registered output ports, which
act as pipeline registers. The data transfer is synchro-
nously performed in feed-forward fashion among con-
nected FUs. In particular, two data flow paths can be
distinguished. Main data stream is fed forward across
the units involved in the tree traversal and metric
computation (NEU → MCU → RAU → LDU) and
subsequently fed backward (LDU → NEU) closing the
detection loop in Fig. 3. The number of ports of each
of these blocks has been extended with respect to [5] to
include storage and forwarding of a-priori information.
In parallel to this, data is fed forward to SAU and
finally to LCU for L-values computation.
8 Implementation Results
In order to assess the implementation complexity of
the proposed detector implementation, the described
VLSI architecture has been modeled in Verilog HDL
and synthesized9 using Synopsis Design Compiler.
RTL and gate-level netlists are verified against the
same test vectors generated from a MATLAB/C++
fixed-point model. For the analysis, a detector design
configured for NT = NR = 4 with 64-QAM constella-
tion has been instantiated. A maximum clock frequency
9Due to a version migration concerning Synopsis Design Com-
piler and synthesis libraries, the SO-TSD design publicated in [5]
has been re-synthesized for this work. Consequently, area and
power consumption results differ slightly from those presented in
[5].
138 J Sign Process Syst (2013) 70:125–143





















4x4, 64−QAM, @ 454 MHz
 
 
SO−TSD (It = 1) 
SISO−TSD (It = 4) − overall throughput
SISO−TSD (It = 4) − throughput per iteration
T=2
(112.3 Mbps @ 12.4 dB)
T=32
(162.2 Mbps @ 13.85 dB)
T=16
(279.4 Mbps @ 13.9 dB)
T=32
(30.3 Mbps @ 10.95 dB)
T=8
(403.6 Mbps @ 14.1 dB)
T=2
(1.1 Gbps @ 15.55 dB)
T=4
(681 Mbps @ 14.7 dB)
Figure 12 TSD average throughput at fmax = 454MHz, for dif-
ferent tuple sizes (T) at 10−5 BER (4 × 4 MIMO, 64-QAM).
fmax = 454 MHz is reached under typical case operat-
ing conditions.10
8.1 Throughput
Based on the considered pipeline-interleaving strategy
(Section 6.1) and disregarding SIMD vectorization (q =
1), average detection throughput τ at frequency f is
given by
τ TSD = NT · L
n
f [bits/s].
Regarding iterative receivers, formal determination of
the system throughput should take into account (1) the
throughput of the MIMO detector (variable with the
number of detection-decoding iterations [19, 24]) and
(2) the throughput of the channel decoder. For results
comparability and since channel decoder realization is
out of the scope of this work, the decoder throughput
will be in the following disregarded.
Figure 12 depicts the throughput corresponding to
the proposed SISO-TSD realization, both in an itera-
tive detection-decoding system with 3 iterations (It =
4) as well as in a non-iterative system (It = 1). In both
cases, different throughput↔SNR-performance trade-
offs are shown (adjustable through the tuple size T).
As expected from the complexity trend observed in
Fig. 6, the throughput of SISO detection with It = 4
decreases notably with regard to SO detection (It = 1),
10TSMC 65 nm low-power CMOS libraries are used. Typical case:
(1.2 V, 25 ◦C). Worst case: (1.08 V, 125 ◦C).
Table 4 Area breakdown of SO and SISO TSD at 454 MHz (4 ×
4 MIMO, 64-QAM).
SO-TSD SISO-TSD
kGE mm2 % kGE mm2 %
Total 133.9 0.193 100.0 164.6 0.237 100.0
Memory 49.8 0.072 37.2 66.9 0.096 40.6
TSD core 84.1 0.121 62.8 97.7 0.141 59.4
Ctrl. path 3.1 0.004 2.3 3.2 0.005 2.0
Data path 81.0 0.117 60.5 94.5 0.136 57.4
NEU 9.9 0.014 7.4 13.8 0.020 8.4
ICU 16.6 0.024 12.4 16.4 0.024 10.0
MCU 5.8 0.008 4.3 7.7 0.011 4.6
RAU 5.9 0.008 4.4 6.9 0.010 4.2
LDU 4.8 0.007 3.6 6.1 0.009 3.7
SAU 24.2 0.035 18.1 28.9 0.042 17.6
LCU 3.9 0.006 2.9 3.9 0.006 2.4
Register file 8.2 0.012 6.1 8.2 0.012 5.0
Other logic 1.8 0.003 1.4 2.7 0.004 1.7
moreover depending on the value of T. For small T
size (T < 4) a factor of approx. 10 is observed. For
medium T values (T = 8 − 16) throughput of SO-TSD
(SISO-TSDIt=1) is ≈6–7 times greater than throughput
of SISO-TSDIt=4. For large T (T > 16), throughput of
SO-TSD is ≈5 times greater. To sum up, throughput
of SISO-TSDIt=4 is between 5–10 times lower than that
of SO-TSD, being the throughput degradation smaller
for higher T values. It should be additonally noticed
that while a throughput improvement by a factor of ≈7
can be achieved by varying T in SO-TSD, only half of
this improvement (≈3) is achievable by SISO-TSDIt=4.
8.2 Area
Table 4 shows the area breakdown of the proposed
SISO-TSD design, extracted from pre-layout synthesis
reports. In order to provide technology-independent
area characterization, the number of gate-equivalents
(GEs) is additionally specified.11 A total area of 0.24
mm2 is required, representing an increase of ∼16 %
with regard to the SO-TSD implementation (disregard-
ing memory). In both SO/SISO-TSD, ∼40 % of the
area corresponds to memory (Section 7.1) and ∼60
% to the SISO-TSD core (Section 7.2). The control
path represents small area overhead (∼2 %). Regard-
ing the SISO-TSD core, obtained results evince that
those modules involved in the soft-output computation
(SAU+LCU) present the highest hardware complex-
ity, as expected from the computational complexity
analysis in Section 5. Similar result is observed in [25]
11One GE corresponds to the area of a two input NAND gate
synthesized using TSMC 65nm libraries.
J Sign Process Syst (2013) 70:125–143 139
Table 5 Total power consumption of SO and SISO TSD under
typical case operating conditions, at 454 MHz (4 × 4 MIMO, 64-
QAM).
SO-TSD SISO-TSD
mW % mW %
Total 76.27 100.0 107.41 100.0
Memory 18.90 24.8 34.93 32.5
TSD core 58.20 76.3 73.90 68.8
Ctrl. path 1.11 1.5 1.27 1.2
Data path 57.09 74.9 72.63 67.6
NEU 8.18 10.7 12.18 11.3
ICU 12.20 16.0 12.50 11.6
MCU 4.61 6.0 6.73 6.3
RAU 4.48 5.9 5.90 5.5
LDU 4.42 5.8 5.93 5.5
SAU 15.06 19.7 20.03 18.6
LCU 2.83 3.7 2.99 2.8
Register file 4.39 5.8 4.55 4.2
Other logic 0.92 1.2 1.81 1.7
comparing area breakdowns of hard-output and soft-
output sphere detector realizations. The area of ICU
and NEU represents ∼1/2 of the area required by
the soft-output computation modules. LDU, RAU and
MCU modules present the lowest area complexity.
8.3 Power Consumption
Power analysis is performed using Synopsis PrimeTime,
based on switching activity annotated during post-
synthesis simulation into a value change dump (VCD)
file. Simulations12 are carried out in Mentor ModelSim
under typical case operating conditions, at fmax = 454
MHz. The reported total power consumption is detailed
in Table 5. As illustrated, memory consumes ∼30 % of
the total power for SISO-TSD. A slightly lower propor-
tion is observed regarding SO-TSD, since VLAMEM is
not required. Concerning the TSD core, the observed
power consumption results are in line with the obtained
results on area. The greatest dissipation value corre-
sponds to the soft-output computation modules (∼20
%), followed by ICU and NEU (∼10 %). Remaining
entities consume <7 % of the total power. In compar-
ison with the SO-TSD, those units forwarding (NEU,
RAU, LDU) and/or processing (MCU, SAU) a-priori
information (Section 7.2) consume ∼30 %–50 % more
power. This represents an overall increase of ∼27 %
for the SISO-TSD core, while ∼40 % higher power
consumption is observed for the complete design (in-
cluding memory), mainly contributed by the inclusion
of VLAMEM.
12Operation of the SISO-TSD core is simulated for a sufficiently
high number of input signals.
9 Results Discussion
In this work the first implementation of a SISO-
TSD detector has been presented, combining the low
complexity and good BER performance of the TSD
algorithm with the high throughput of highly paral-
lel and pipelined architectures. In order to ratify the
efficiency of the proposed realization, a comparative
analysis including relevant state-of-the-art tree-search-
based MIMO detectors from literature [22, 27–29] is
summarized in Table 6. Even though comparing BER
performance of the considered detection algorithms is
not the focus of this analysis, BER↔SNR operating
points are included as reference.13 It should be no-
ticed that NT and Q focused by some of the consid-
ered references differ from those considered in this
work. For comparability reasons, results corresponding
to 4 × 4 MIMO with 64-QAM modulation have been
extracted from literature, as far as available. Note that
area results extracted from [27] correspond to 4 × 4
MIMO with 64-QAM modulation, whereas fmax and τ
are only reported for 16-QAM. In [28] only 16-QAM is
analyzed. Analogously, It = 4 has been considered for
the SISO scenario except in [29], where only It = 2 is
examined. It is additionally worth mentioning that in
order to allow fair comparison of designs based on dif-
ferent implementation technologies, presented results
have been appropriately scaled (as detailed in Table 6).
For simplicity and comparability with results from lit-
erature, hardware cost of the input and output memory
buffers (Section 7.1) is in the following disregarded.
9.1 Throughput
In order to provide reliable throughput comparison,
throughput normalized to fmax (τ/ fmax) is considered.
It should be noticed that the implementations proposed
in this work as well as in [27] present variable through-
put (depending on SNR), in contrast to realizations in
[28, 29] and [22] (fixed throughput). Variable through-
put is hence compared by considering average through-
put at 10−5 BER (1 % FER in [27]) as indicated in
Table 6. As illustrated, SO-TSD outperforms in terms
of τ/ fmax all the considered SO-detectors, with excep-
tion of [22]. In this regard, it should be noticed that
[22] reports peak throughput. Considering SO-TSD
13It should be noticed that presented SNR↔BER-performance
trade-offs differ not only depending on the detection algorithm,
but also on dissimilarities of the considered communications
system model (e.g. channel fading, channel correlation, decoder
type, code rate, . . . ). For further details, the corresponding cita-
tions are referred.
140 J Sign Process Syst (2013) 70:125–143
Table 6 Design comparison of different MIMO detector realizations.
Detector [this work] Witte et al. [27] Wu and Masera [28, 29] Patel et al. [22]
DF-SD (TSD, T = 8) DF-SD (STS) BF-FSD BF-SD (K-Best)
MIMO 4 × 4 2 × 2–8 × 8 4 × 4 4 × 4
system
Modulation 64-QAM 64-QAM QPSK, 16-,64-QAM 16-QAM 16-,64-QAM 64-QAM
Turbo it. SO (It = 1)[5] SISO (It = 4) SO (It = 1) SISO (It = 4) SO (It = 1)[28] SISO (It = 2)[29] SO (It = 1)
BER↔SNR BER 10−5 BER 10−5 FER 0.01 FER 0.01 BER 10−5 BER 10−5 BER 10−3
performance @ 14.1 dB @ 11.3 dB @ 15 dB @ 11.2 dB @ 6.5 dB @ 9.25 dB @ 27 dBa
Technology 65 nm 65 nm 90 nm 90 nm 130 nm 130 nm 65 nm
fmax (MHz) 454 454 379 250 400 384.6 833
τ (Mbps) 403.6 40 40 5 213.3 279.7 2000
τ/ fmax 0.89 0.09 0.11 0.02 0.53 0.73 2.4
Area (mm2) 0.12 0.14 − 0.18 (0.045b ) − 0.57
Area (kGE) 84 98 105 185 30 121 298
NHEc 0.21 2.45 1.90 26.72 0.07 0.22 0.15
(kGE/Mbps)
Powerd (mW) 58.2 73.9 − − 239e
Energy / bit 53.4 67.8 120e
(pJ/b)
a Uncoded system.
b Technology scaling by the factor (130/65)2.
c Normalized Hardware Efficiency [15].
d Power consumption under typical case operating conditions (1.2 V, 25 ◦C).
e Patel et al. [22] reported 280 mW and 140 pJ/b @ 1.3 V, Vdd scaling by the factor (1.3/1.2)2 is applied here.
DF Depth-First tree traversal, BF Breadth-First tree traversal, FSD Fixed Sphere Detector.
maximum throughput of 1.1 Gbps, a τ/ fmax ratio close
to that of [22] is then obtained. Moreover, [22] achieves
a very high throughput at the cost of ∼3 times greater
gate count compared to the proposed TSD. Regarding
SISO-realizations, τ/ fmax of SISO-TSD is ∼4.5 times
better than that of [27]. It should be noticed that in [29]
It = 2 is considered, hence achieving higher throughput
than that of [27] and this work (It = 4). SISO-TSD with
It = 2 achieves τ/ fmax ≈ 0.3, approaching that of [29].
9.2 Area
Table 6 compares the silicon area (in mm2) as well as
the gate count (in kGEs) corresponding to the consid-
ered detectors. Both SO-TSD and SISO-TSD consume
less area (gate count) than considered realizations, with
exception of [28]. In this case it should be noticed
that [28] supports only 16-QAM modulation, while this
work supports 64-QAM. Based on the overall compu-
tational complexity ctotal depicted in Table 2 for 16-
and 64-QAM, gate count of SISO-TSD can be roughly
estimated14 as 84 × 540340 ≈ 53 kGE for 16-QAM, very
closely approaching the area of the efficient implemen-
tation proposed in [28].
14Estimation based uniquely in the overall computational com-
plexity, disregarding register savings. Further area reduction
should be expected if the latter was taken into account.
9.3 Normalized Hardware Efficiency (NHE)
In order to perform a fair comparison, the normalized
hardware efficiency (NHE) is additionally examined. It
is defined as the ratio between the core area (kGE)
and the throughput (scaled to the same technology)
[15]. SO-TSD presents a good efficiency ratio, outper-
forming the implementation in [27]. Patel et al. [22] and
[28] present a slightly better ratio due to the consider-
ation of peak throughput and the simpler modulation
scheme (16-QAM) employed, respectively. Regarding
SISO detection, TSD outperforms by far the efficiency
ratio of [27]. Wu and Masera [29] presents a better
ratio since only It = 2 is considered, leading to higher
throughput than for It = 4. For SISO-TSD with It = 2,
a more comparable ratio is observed (0.7 KGE/Mbps).
9.4 Power Consumption
Due to lack of results on power consumption reported
in literature, comparison is here limited to SO-detector
realizations. The architecture presented in [22] reaches
a very high clock frequency (833 MHz) and inher-
ently very high (peak) throughput (2000 Mbps), con-
sequently leading to high power consumption. Corre-
sponding SO-TSD and SISO-TSD power consumption
is 4 and 3 times lower, respectively. Measurement re-
sults of a system on a chip (SoC) including the SO-TSD
module can be found in [26].
J Sign Process Syst (2013) 70:125–143 141
9.5 SISO vs. SO Detectors
As explained in previous sections, the existing SO-TSD
implementation [5] has been extended in this work in
order to enable processing of a-priori information. In
contrast to comparable SISO-detectors ([27, 29]), fmax
of the proposed SISO-TSD realization is not degraded
in comparison to the SO-TSD. Likewise, the area in-
crease (14 kGE) due to processing of a-priori informa-
tion is small (16 %) in comparison to the designs re-
ported in literature (∼76 % in [27] and ∼50 % in [29]).
Regarding the power consumption (and inherently the
energy per bit), an increase of ∼27 % is observed.
As additionally depicted, SISO-TSD achieves 10 times
lower throughput than SO-TSD for the considered T =
8. It is nevertheless possible to achieve beyond 100
Mbps with It = 4 and more than 1 Gbps with It = 1,
by adjusting T of SISO-TSD as described in Section
8.1. Additionally, strategies further reducing n (i.e.
further enhancing the throughput) have been recently
proposed for the SISO-TSD in [24].
10 Conclusion and Future Work
In this work key strategies enabling the first efficient
implementation of the SISO-TSD algorithm have been
presented. A novel flexible 8-bit fixed-point represen-
tation has been utilized, significantly reducing the com-
plexity compared to state-of-the-art implementations
requiring >10 bits (e.g. [27]). Presented regularization
and architectural concepts permit efficient application
of pipelining and parallelization approaches. The costs
of processing a-priori information have been addition-
ally analyzed. Considering a 4 × 4 MIMO system with
64-QAM, the SISO-TSD implementation presents only
16 % larger area and 27 % higher power consump-
tion than the previous SO-TSD realization, whereas
a SNR gain of up to ∼3.2 dB is provided for It =
4. The proposed VLSI design incorporates sophis-
ticated concepts to reduce the algorithm computa-
tional complexity as well as parallelization strategies
for throughput enhancement. In combination with the
definition of a suitable architecture, a high-throughput,
low-hardware-complexity and low-power-consumption
implementation is enabled. Resulting implementation
has been shown to be well suited for iterative receiver
architectures, outperforming in most aspects similar ap-
proaches (e.g. STS-SD) and achieving data rates com-
parable to or even greater than less complex and more
parallelizable detection strategies (e.g. K-Best, FSD).
The presented realization has in general demonstrated
to reduce the hardware cost with regard to comparable
state-of-the-art realizations. The clearly outstanding
implementation results, together with scalability and
parallelizability, make the proposed design a very fa-
vorable candidate for MIMO detection in a wide range
of applications.
Future work will target further throughput improve-
ment by exploiting SIMD vectorization and possibly
increasing frequency by optimizing the critical path. In
addition to this, architecture optimization for further
reduction of area and power consumption are planned.
Acknowledgements This work has been supported by the Ger-
man Federal Ministry of Education and Research (BMBF) under
grant 13N11810.
Open Access This article is distributed under the terms of
the Creative Commons Attribution License which permits any
use, distribution, and reproduction in any medium, provided the
original author(s) and the source are credited.
References
1. 3GPP LTE, evolved universal terrestrial radio access
(E-UTRA), base station (BS) radio transmission and
reception. Technical Specification 36.104 Release 10.
http://www.3gpp.org.
2. 3GPP LTE, Evolved universal terrestrial radio access
(E-UTRA), user equipment (UE) radio transmission and
reception. Technical Specification 36.101 Release 10.
http://www.3gpp.org.
3. Adeva, E.P., Mennenga, B., Fettweis, G. (2011). Scalable
ASIP implementation and parallelization of a MIMO sphere
detector. In Proceedings of the 11th international conference
on embedded computer systems: architectures, modeling, and
simulation (SAMOS XI). Samos, Greece.
4. Adeva, E.P., Mennenga, B., Fettweis, G. (2011). Survey on
an efficient, low-complex tuple search based sphere detector.
In Proceedings of the ieee 34th sarnof f symposium. Princeton,
USA.
5. Adeva, E.P., Shah, M.A., Mennenga, B., Fettweis, G. (2011).
VLSI architecture for soft-output tuple search sphere decod-
ing. In Proceedings of the IEEE workshop on signal process-
ing systems (SIPS’11). Beirut, Lebanon.
6. Burg, A., Borgmann, M., Wenk, M., Zellweger, M., Fichtner,
W., Bölcskei, H. (2005). VLSI implementation of MIMO de-
tection using the sphere decoding algorithm. IEEE Journal of
Solid-State Circuits, 40, 1566–1577.
7. Burg, A., Wenk, M., Fichtner, W. (2006). VLSI implemen-
tation of pipelined sphere decoding with early termination.
In Proceedings of the 14th european signal processing confer-
ence, 2006 (EUSIPCO 2006). Florence, Italy.
8. Cichon, G. (2004). A novel compiler-friendly micro-
architecture for rapid development of high-performance
and low-power DSPs. Dissertation, Technische Universität
Dresden.
9. de Jong, Y., & Willink, T. (2005). Iterative tree search de-
tection for MIMO wireless systems. IEEE Transactions on
Communications, 53(6), 930–935.
10. Hagenauer, J. (2002). The turbo principal in mobile com-
munications. In Proceedings of the international symposium
on information theory and its applications (ISITA’02). Xi’an,
China.
142 J Sign Process Syst (2013) 70:125–143
11. Hochwald, B., & ten Brink, S. (2003). Achieving near-
capacity on a multiple-antenna channel. IEEE Transactions
on Communications, 51, 389–399.
12. Jaldén, J., & Ottersten, B. (2005). Parallel implementation
of a soft output sphere decoder. In Proceedings of asilomar
conference on signals, systems, and computers.
13. Lee, E., & Messerschmitt, D. (1987). Pipeline interleaved
programmable DSP’s: architecture. IEEE Transactions
on Acoustics, Speech and Signal Processing, 35, 1320–1333.
14. Lee, J. (2006). Area efficient pipelined vlsi implementation of
list sphere decoder. In Proceedings of the Asia-Pacif ic Con-
ference on Communications, 2006 (APCC ’06). Busan, Korea.
15. Mahdavi, M., & Shabany, M. (2012). Novel MIMO detec-
tion algorithm for high-order constellations in the complex
domain. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems. doi:10.1109/TVLSI.2012.2196296.
16. Mennenga, B. (2010). Aufwandsgünstige Detektion in
Mehrantennensystemen mittels komplexitätsreduzierter
Baumsuchverfahren. Dissertation, Technische Universität
Dresden.
17. Mennenga, B., von Borany, A., Fettweis, G. (2009). Complex-
ity reduced soft-in soft-out sphere detection based on search
tuples. In Proceedings of the IEEE international conference
on communications (ICC’09). Dresden, Germany.
18. Mennenga, B., & Fettweis, G. (2009). Search sequence de-
termination for tree search based detection algorithms. In
Proceedings of the IEEE sarnof f symposium 2009. Princeton,
USA.
19. Mennenga, B., Fritzsche, R., Fettweis, G. (2009). Iterative
soft-in soft-out sphere detection for MIMO systems. In Pro-
ceedings of the IEEE 69th vehicular technology conference,
(VTC’09-Spring). Barcelona, Spain.
20. Mennenga, B., Matus, E., Fettweis, G. (2009). Vectorization
of the sphere detection algorithm. In Proceedings of the IEEE
international symposium on circuits and systems (ISCAS’09).
Taipei, Taiwan.
21. Myllyla, M., Cavallaro, J.R., Juntti, M. (2010). Architecture
design and implementation of the metric first list sphere de-
tector algorithm. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 19, 895–899.
22. Patel, D., Smolyakov, V., Shabany, M., Gulak, P. (2010).
VLSI iumplementation of a WiMAX/LTE Compliant low-
complexity high-throughput soft-output K-best MIMO detec-
tor. In Proceedings of the IEEE international symposium on
circuits and systems 2010, (ISCAS’10).
23. Robertson, P., Villebrun, E., Hoeher, P. (1995). A compar-
ison of optimal and sub-optimal map decoding algorithms
operating in the log domain. In Proceedings of the IEEE
international conference on communications, 1995 (ICC ’95).
24. Seifert, T., Adeva, E.P., Fettweis, G. (2013). Towards
complexity-reduced soft-input soft-output sphere detection.
In Submitted to the 9th international itg conference on sys-
tems, communications and coding 2013, (SCC’13). Munich,
Germany.
25. Studer, C., Burg, A., Bölcskei, H. (2008). Soft-output sphere
decoding: algorithms and VLSI implementation. IEEE Jour-
nal on Selected Areas in Communications, 26, 290–300.
26. Winter, M., Kunze, S., Adeva, E.P., Mennenga, B.,
Matus, E., Fettweis, G., Eisenreich, H., Ellguth, G., Höppner,
S., Scholze, S., Schüffny, R., Kobori, T. (2012). A 335Mb/s
3.9mm2 65nm CMOS flexible MIMO detection-decoding en-
gine achieving 4g wireless data rates. In Proceedings of the
59th international solid-state circuits conference (ISSCC’12).
San Francisco, USA.
27. Witte, E.M., Borlenghi, F., Ascheid, G., Leupers, R., Meyr,
H. (2010). A scalable VLSI architecture for soft-input soft-
output single tree-search sphere decoding. IEEE Transac-
tions on Circuits and Systems II: Express Briefs, 57, 706–710.
28. Wu, B., & Masera, G. (2010). A novel VLSI architecture of
fixed-complexity sphere decoder. In 13th euromicro confer-
ence on digital system design: architectures, methods and tools
(DSD).
29. Wu, B., & Masera, G. (2012) Efficient VLSI implementation
of soft-input soft-output fixed-complexity sphere decoder. In-
stitution of Engineering and Technology (IET) Communica-
tions, 6, 1111–1118.
30. Wübben, D., Böhnke, R., Rinas, J., Kühn, V., Kammeyer, K.
(2001). Efficient algorithm for decoding layered space-time
codes. Electronics Letters, 37, 1348–1350.
31. Yee, M.S. (2005). Max-log-MAP sphere decoder. In Proceed-
ings of the IEEE international conference on acoustics, speech,
and signal processing (ICASSP ’05), (Vol. 3).
32. Zimmermann, E., & Fettweis, G. (2006). Unbiased MMSE
tree search detection for multiple antenna systems. In Pro-
ceedings of the International symposium on wireless personal
multimedia communications (WPMC’06). San Diego, USA.
Esther P. Adeva received her Dipl.-Ing. in Electrical Engineer-
ing from the Miguel Hernández University (Spain) in June 2009.
She developed her diploma thesis at the Technische Universität
Dresden, focusing on the analysis and implementation of tree-
search-based MIMO detection algorithms. She is currently pursu-
ing a PhD at the Vodafone Chair and she continues her research
on the exploration of efficient architectures for MIMO detection.
Tobias Seifert received his Dipl.-Ing. in Electrical Engineering
from the Technische Universität Dresden in January 2012. In his
diploma thesis he dealt with implementation aspects of MIMO
sphere-search-based detectors for iterative detection-decoding.
He is currently pursuing a PhD at the Vodafone Chair and
continues his research on strategies for efficient MIMO detection
implementation.
J Sign Process Syst (2013) 70:125–143 143
Gerhard Fettweis earned his Ph.D. from RWTH Aachen in 1990.
Thereafter he was at IBM Research and TCSI Inc., California.
Since 1994 he is Vodafone Chair Professor at TU Dresden,
Germany, with main research interest on wireless transmission
and chip design. He is IEEE Fellow and an honorary doctorate
of TU Tampere.
