Optimized Configurable Architectures for Scalable Soft-Input Soft-Output
  MIMO Detectors with 256-QAM by Mansour, Mohammad M. & Jalloul, Louay M. A.
IEEE TRANSACTIONS ON SIGNAL PROCESSING, DRAFT (16/06/2015) 1
Optimized Configurable Architectures for
Scalable Soft-Input Soft-Output MIMO
Detectors with 256-QAM
Mohammad M. Mansour, Senior Member, IEEE, and Louay M.A. Jalloul, Senior Member, IEEE
Abstract
This paper presents an optimized low-complexity and high-throughput multiple-input multiple-output (MIMO)
signal detector core for detecting spatially-multiplexed data streams. The core architecture supports various layer
configurations up to 4, while achieving near-optimal performance, as well as configurable modulation constellations
up to 256-QAM on each layer. The core is capable of operating as a soft-input soft-output log-likelihood ratio
(LLR) MIMO detector which can be used in the context of iterative detection and decoding. High area-efficiency
is achieved via algorithmic and architectural optimizations performed at two levels. First, distance computations and
slicing operations for an optimal 2-layer maximum a posteriori (MAP) MIMO detector are optimized to eliminate the
use of multipliers and reduce the overhead of slicing in the presence of soft-input LLRs. We show that distances can
be easily computed using elementary addition operations, while optimal slicing is done via efficient comparisons with
soft decision boundaries, resulting in a simple feed-forward pipelined architecture. Second, to support more layers, an
efficient channel decomposition scheme is presented that reduces the detection of multiple layers into multiple 2-layer
detection subproblems, which map onto the 2-layer core with a slight modification using a distance accumulation
stage and a post-LLR processing stage. Various architectures are accordingly developed to achieve a desired detection
throughput and run-time reconfigurability by time-multiplexing of one or more component cores. The proposed core
is applied as well to design an optimal multi-user MIMO detector for LTE. The core occupies an area of 1.58 MGE
and achieves a throughput of 733 Mbps for 256-QAM when synthesized in 90 nm CMOS.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) systems have become mainstream technology for achieving high spectral
efficiencies in wireless communications standards such as IEEE 802.11ac [1] and the 3GPP Long-Term Evolution
(LTE) [2]. Detection of spatially-multiplexed MIMO streams plays a key role in receiver design, both in terms of
performance and complexity, and remains an active area of research [3]–[7]. The focus has been on
developing area-/energy-efficient VLSI implementations of MIMO detectors that are capable of achieving close to
optimal performance.
M. M. Mansour is with the Department of Electrical and Computer Engineering at the American University of Beirut, Lebanon, e-mail:
mmansour@ieee.org. L. Jalloul is with Qualcomm Inc., San Jose, CA, e-mail: jalloul@ieee.org.
arXiv:1506.04644v1 [cs.IT] 8 Jun 2015
A plethora of MIMO detectors have appeared in the literature on this subject, offering various performance-
complexity tradeoffs. Suboptimal zero-forcing (ZF) and minimum mean-squared error (MMSE) detectors [5], as well
as nonlinear parallel and successive interference cancellation schemes [8]–[11], require relatively low complexity but
sacrifice performance. On the other hand, tree-search or list-based detectors require substantially higher complexity
but can offer (near-)ML performance, such as the well-known sphere decoding algorithm [12]–[19]. Other tree-
search schemes, such as the K-Best algorithm [20]–[26], address the non-deterministic throughput aspects of sphere
decoders. Practical implementation aspects have been investigated in [18], [23], [25]–[39].
Subspace detection based on channel decomposition offers a good compromise between performance and complexity
(e.g., see [40]–[43]). In these schemes, the effective MIMO channel matrix is decomposed into parallel
subchannels that can be used to detect subsets of streams in parallel. By allowing subspaces to overlap, additional
diversity can be gathered by placing a low-reliability data stream in several detection sets. The LORD algorithm
proposed in [44], [45] can be viewed as a special class of subspace MIMO detectors. It achieves ML performance
(in the max-log-MAP [46] sense) on 2 transmit antennas, but its performance degrades when the number of
antennas increases. In [47], the LORD algorithm was generalized to 4-transmit antennas by using matrix inversion
to decompose the channel into single streams.
Support for ever increasing data rates has come through an increase in the number of supported spatial streams,
or through the use of more bandwidth via carrier aggregation [48]. LTE-Advanced uses up to 8 spatial streams,
or the aggregation of five component carriers for a bandwidth of 100 MHz, which leads to data rates of
over 1 Gbps. While the receiver complexity of detecting 8 spatial layers remains very challenging, especially
for dense constellations, the use of carrier aggregation with distinct or separate physical layers and convergence
at higher layers seems more tractable. Since each physical layer of a component carrier is required to support 2
or 4 spatial layers, the need for the hardware optimization of these MIMO detector cores becomes paramount,
especially if near-ML performance is desired, higher-order modulations such as 256-QAM are to be supported, and
high-throughput processing is a must.
Contributions: We propose in this work an optimized and configurable 2×2 soft-input soft-output maximum a
posteriori (MAP) MIMO detector, and use it as a basic building block for constructing high-throughput detectors for
higher-order layers. The key features and advantages of the proposed detector core are: 1) scalability in supporting
multiple layers, 2) flexibility in accommodating multiple layer-configurations and detection of subsets of layers,
3) configurability of supported constellations per layer, 4) support for soft-input log-likelihood ratios (LLRs) from
channel decoder, 5) near-ML performance, and 6) reduced-complexity and high-throughput operation. We develop
extensive optimizations at both the algorithmic and architectural levels targeted for a 2×2 soft-input soft-output
MAP MIMO detector, as well as its extension to support more spatial layers. In particular, optimizations of
distance computations (to eliminate multipliers and simplify slicing) are shown to result in substantial reduction in
computational complexity when supporting constellations up to 256-QAM. Furthermore, the complexity of a 1D
slicer is shown to play a key role in the overall complexity of the detector, when soft-input LLRs are supported.
To this end, an efficient slicing scheme based on soft decision boundaries is presented. Moreover, a low-complexity
scheme that decomposes a MIMO channel into multiple subsets of decoupled streams is proposed. It is shown that
decoupled streams can be detected efficiently and in parallel using the optimized 2×2 core. Moreover, the 2×2
core is applied in the context of multi-user (MU-MIMO) for joint modulation classification and data detection. The
core has been implemented on an FPGA, and synthesized as well using a generic 90 nm ASIC CMOS library.
The rest of the paper is organized as follows. After introducing the system model in Section II, Section III
presents the optimizations targeted for a 2-layer MAP MIMO detector in terms of distance computations and
slicing. Key equations for distances and soft decision boundaries are derived assuming both zero and non-zero
input LLRs. Section IV proposes a matrix decomposition scheme to support detection of more spatial streams.
We show that the key distance equations scale in a straightforward fashion from the 2-layer case, where only
new distance-accumulation and post-LLR processing phases are needed. In Section V, single and multi-core
detector architectures are developed. In Section VI, the core is applied as part of MU-MIMO detection for constellation
estimation and data detection. Synthesis and simulation results are reported in Section VII. Finally, Section VIII
ends with concluding remarks.
II. SYSTEM MODEL
Consider a MIMO system with $N$ transmit and $M \ge N$ receive antennas. The equivalent complex baseband
input-output system relation can be modeled as $\tilde{\mathbf{y}} = \mathbf{H}\mathbf{x} + \mathbf{n}$, where $\tilde{\mathbf{y}} \in \mathbb{C}^{M\times 1}$ is the received complex signal vector,
$\mathbf{H} \in \mathbb{C}^{M\times N}$ is the complex channel matrix, $\mathbf{x} = [x_1\ x_2\ \cdots\ x_N]^T \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_N$ is the $N\times 1$ transmitted
complex symbol vector, and $\mathbf{n} \in \mathbb{C}^{M\times 1}$ is a zero-mean complex Gaussian circularly symmetric random noise vector
with covariance $\sigma^2\mathbf{I}_M$. Each symbol $x_n$ belongs to a complex constellation $\mathcal{X}_n$ of size $Q_n = 2^{q_n}$, and is associated
via the map $\mathbf{b}(\cdot)$ with a coded bit-interleaved vector $\mathbf{b}(x_n) = \mathbf{b}_n = [b_{n,1}\ b_{n,2}\ \cdots\ b_{n,q_n}]^T$ of length $q_n$ over the
set $\{-1,+1\}$, where binary 0 maps to $+1$. Let $|\mathcal{X}| = Q = 2^q$, and denote the binary vector associated with the
overall symbol vector $\mathbf{x}$ as $\mathbf{b}(\mathbf{x}) = [\mathbf{b}_1; \cdots; \mathbf{b}_N] = [b_{n,j}]$, for $n = 1, \cdots, N$, and $j = 1, \cdots, q_n$. Motivated by recent
standards, we assume rectangular QAM constellations, where $\mathcal{X}_n = \mathcal{P}_n \times \mathcal{P}_n$, and $\mathcal{P}_n$ is a 1D $P_n$-PAM constellation
with $P_n = \sqrt{Q_n}$.
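We assume $\mathbf{H}$ is known to the receiver, has full column rank and is decomposed as $\mathbf{H} = \mathbf{Q}\mathbf{L}$, where $\mathbf{Q} \in \mathbb{C}^{M\times N}$
is a unitary matrix and $\mathbf{L} \in \mathbb{C}^{N\times N}$ is a lower triangular matrix (LTM) with positive and real diagonal elements.
Since $\mathbf{Q}$ is unitary, it preserves Euclidean norm as well as noise statistics. Hence we use the transformed relation
$\mathbf{y} \triangleq \mathbf{Q}^*\tilde{\mathbf{y}} = \mathbf{L}\mathbf{x} + \mathbf{Q}^*\mathbf{n} \in \mathbb{C}^{N\times 1}$ to model the MIMO system.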
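The transformed model can be checked numerically. The following is a minimal sketch, assuming a square 2×2 channel with arbitrary illustrative values and a plain Gram-Schmidt QL factorization (the helper name `qld` is hypothetical and not taken from the paper); it confirms that the distance computed with $\mathbf{L}$ matches the one computed with $\mathbf{H}$.

```python
def qld(H):
    """QL decomposition by Gram-Schmidt, processing the last column first.

    Returns Q as a list of orthonormal columns and a lower-triangular L
    with real positive diagonal such that H = Q L.
    """
    M, N = len(H), len(H[0])
    cols = [[H[r][c] for r in range(M)] for c in range(N)]
    Q = [None] * N
    L = [[0j] * N for _ in range(N)]
    for n in reversed(range(N)):
        v = cols[n][:]
        for m in range(n + 1, N):
            # component of column n along the already-built column q_m
            L[m][n] = sum(q.conjugate() * h for q, h in zip(Q[m], cols[n]))
            v = [vi - L[m][n] * qi for vi, qi in zip(v, Q[m])]
        norm = sum(abs(vi) ** 2 for vi in v) ** 0.5
        L[n][n] = complex(norm)
        Q[n] = [vi / norm for vi in v]
    return Q, L

# Illustrative (assumed) channel, symbols and noise.
H = [[1.2 + 0.3j, -0.4 + 0.8j], [0.5 - 0.6j, 0.9 + 0.1j]]
x = [1 + 1j, -1 + 1j]
noise = [0.1 - 0.2j, -0.05 + 0.1j]
y_tilde = [sum(H[r][c] * x[c] for c in range(2)) + noise[r] for r in range(2)]

Q, L = qld(H)
y = [sum(q.conjugate() * v for q, v in zip(Q[n], y_tilde)) for n in range(2)]  # y = Q^* y~

d_H = sum(abs(y_tilde[r] - sum(H[r][c] * x[c] for c in range(2))) ** 2 for r in range(2))
d_L = sum(abs(y[n] - sum(L[n][c] * x[c] for c in range(2))) ** 2 for n in range(2))
```

For $M > N$ the two distances would differ by a term that does not depend on $\mathbf{x}$, so the minimizer is unchanged.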
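A hard-decision (HD) maximum a posteriori (MAP) MIMO detector achieves log-max [46] optimal performance
by finding the symbol vector $\mathbf{x}$ in $\mathcal{X}$ that is closest to the received vector $\mathbf{y}$ under the unscaled “distance” metric [16]:
$$d(\mathbf{x}) \triangleq \|\mathbf{y} - \mathbf{L}\mathbf{x}\|^2 - \mathbf{b}^T(\mathbf{x})\boldsymbol{\lambda}, \tag{1}$$
where $\boldsymbol{\lambda} = [\boldsymbol{\lambda}_1; \cdots; \boldsymbol{\lambda}_N] = [\lambda_{n,j}]$ is a column vector of a priori LLR values $\lambda_{n,j} \in \mathbb{R}$ associated with the bits in
$\mathbf{b}(\mathbf{x})$, assuming these bits are statistically independent:
$$\lambda_{n,j} = \frac{1}{\sigma^2} \ln \frac{\mathrm{Prob}(b_{n,j} = +1)}{\mathrm{Prob}(b_{n,j} = -1)}.$$
The subvector $\boldsymbol{\lambda}_n = [\lambda_{n,1}, \cdots, \lambda_{n,q_n}]^T$ is associated with the bits $\mathbf{b}(x_n)$ of the $n$th symbol $x_n$. The hard-decision
MAP solution of the MIMO detection problem is given by¹
$$d^{\mathrm{MAP}} = \min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}) \quad\text{and}\quad \mathbf{x}^{\mathrm{MAP}} = \arg\min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}). \tag{2}$$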
For joint iterative MIMO detection and decoding however, soft-input soft-output MIMO detectors are required.
A log-max optimal soft-input soft-output MAP MIMO detector computes $2q$ other minimum distance metrics as
follows:
$$\Lambda^{\mathrm{MAP}}_{n,j} = \min_{\mathbf{x}\in\mathcal{X}^{(+1)}_{n,j}} d(\mathbf{x}) - \min_{\mathbf{x}\in\mathcal{X}^{(-1)}_{n,j}} d(\mathbf{x}), \tag{3}$$
for $n = 1, \cdots, N$ and $j = 1, \cdots, q_n$, where $\mathcal{X}^{(+1)}_{n,j} = \{\mathbf{x}\in\mathcal{X} : b_{n,j} = +1\}$ and $\mathcal{X}^{(-1)}_{n,j} = \{\mathbf{x}\in\mathcal{X} : b_{n,j} = -1\}$ are the
subsets of symbol vectors in $\mathcal{X}$ that have their corresponding $j$th bit in the $n$th symbol $+1$ and $-1$, respectively.
III. OPTIMIZED MIMO MAP DETECTION FOR 2 LAYERS
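Finding the MAP solutions in (2) and (3) requires computing $\prod_{n=1}^{N} Q_n$ distance metrics. When $N = 2$, a
simplification [44] can be applied to reduce the number of computations from $Q_1 \cdot Q_2$ to $Q_1 + Q_2$. Triangularizing
the channel matrix as $\mathbf{H} = \mathbf{Q}\mathbf{L}$ with $\mathbf{Q}$ being unitary, we obtain:
$$\mathbf{y} - \mathbf{L}\mathbf{x} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} - \begin{bmatrix} \alpha & 0 \\ \gamma & \beta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$
where $\mathbf{y} = \mathbf{Q}^*\tilde{\mathbf{y}}$, with $\alpha, \beta \in \mathbb{R}^+$ and $\gamma \in \mathbb{C}$. Then (1) becomes
$$d(\mathbf{x}) = f_1(x_1) + f_2(x_2 \,|\, x_1), \tag{4}$$
where
$$f_1(x_1) = |y_1 - \alpha x_1|^2 - \mathbf{b}^T(x_1)\boldsymbol{\lambda}_1, \quad\text{and}$$
$$f_2(x_2 \,|\, x_1) = |y_2 - \gamma x_1 - \beta x_2|^2 - \mathbf{b}^T(x_2)\boldsymbol{\lambda}_2.$$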
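The minimum distance in (2) can then be computed as
\begin{align}
\min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}) &= \min_{x_1\in\mathcal{X}_1,\, x_2\in\mathcal{X}_2} \{ f_1(x_1) + f_2(x_2 \,|\, x_1) \} \tag{5} \\
&= \min_{x_1\in\mathcal{X}_1} \Big\{ f_1(x_1) + \min_{x_2\in\mathcal{X}_2} f_2(x_2 \,|\, x_1) \Big\} \notag \\
&= \min_{x_1\in\mathcal{X}_1} \{ f_1(x_1) + f_2(\hat{x}_2(x_1) \,|\, x_1) \} \notag \\
&= \min_{x_1\in\mathcal{X}_1} d(x_1, \hat{x}_2(x_1)) \tag{6}
\end{align}
where
$$\hat{x}_2(x_1) = \arg\min_{x_2\in\mathcal{X}_2} \big\{ |y_2 - \gamma x_1 - \beta x_2|^2 - \mathbf{b}^T(x_2)\boldsymbol{\lambda}_2 \big\}. \tag{7}$$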
Denote the set of sliced symbol vectors for all $x_1$ in (6) by
$$\mathcal{O}_1 = \big\{ [x_1\ \ \hat{x}_2(x_1)]^T : x_1 \in \mathcal{X}_1 \big\}. \tag{8}$$
The bit LLRs of symbol $x_1$, for $j = 1, \cdots, q_1$, are given by
$$\Lambda^{\mathrm{MAP}}_{1,j} = \min_{x_1\in\mathcal{X}^{(+1)}_{1,j}} d(x_1, \hat{x}_2(x_1)) - \min_{x_1\in\mathcal{X}^{(-1)}_{1,j}} d(x_1, \hat{x}_2(x_1)). \tag{9}$$
¹The quantities $d^{\mathrm{MAP}}$ in (2) and $\Lambda^{\mathrm{MAP}}_{n,j}$ in (3) need to be scaled by $\sigma^2/2$.
[Figure: Gray-map bit labels along the real and imaginary axes]
Fig. 1. Gray-coded mapping for 64-QAM in LTE [2].
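To obtain the bit LLRs of $x_2$ however, we triangularize $\mathbf{H}$ as $\mathbf{Q}'\mathbf{L}'$ so that a zero appears in the upper left
corner:
$$\mathbf{y}' - \mathbf{L}'\mathbf{x} = \begin{bmatrix} y'_1 \\ y'_2 \end{bmatrix} - \begin{bmatrix} 0 & \alpha' \\ \beta' & \gamma' \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$
where $\mathbf{y}' = \mathbf{Q}'^*\tilde{\mathbf{y}}$; $\alpha', \beta' \in \mathbb{R}^+$ and $\gamma' \in \mathbb{C}$. Then (1) becomes
$$d(\mathbf{x}) = f'_2(x_2) + f'_1(x_1 \,|\, x_2),$$
where
$$f'_2(x_2) = |y'_1 - \alpha' x_2|^2 - \mathbf{b}^T(x_2)\boldsymbol{\lambda}_2, \quad\text{and}$$
$$f'_1(x_1 \,|\, x_2) = |y'_2 - \gamma' x_2 - \beta' x_1|^2 - \mathbf{b}^T(x_1)\boldsymbol{\lambda}_1,$$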
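and the minimum distance in (2) can be computed as
\begin{align}
\min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}) &= \min_{x_2\in\mathcal{X}_2} \Big\{ f'_2(x_2) + \min_{x_1\in\mathcal{X}_1} f'_1(x_1 \,|\, x_2) \Big\} \notag \\
&= \min_{x_2\in\mathcal{X}_2} \{ f'_2(x_2) + f'_1(\hat{x}_1(x_2) \,|\, x_2) \} \notag \\
&= \min_{x_2\in\mathcal{X}_2} d(\hat{x}_1(x_2), x_2), \tag{10}
\end{align}
where $\hat{x}_1(x_2) = \arg\min_{x_1\in\mathcal{X}_1} f'_1(x_1 \,|\, x_2)$. Denote the set of sliced symbol vectors for all $x_2$ in (10) by
$$\mathcal{O}_2 = \big\{ [\hat{x}_1(x_2)\ \ x_2]^T : x_2 \in \mathcal{X}_2 \big\}.$$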
The bit LLRs of symbol $x_2$, for $j = 1, \cdots, q_2$, are given by
$$\Lambda^{\mathrm{MAP}}_{2,j} = \min_{x_2\in\mathcal{X}^{(+1)}_{2,j}} d(\hat{x}_1(x_2), x_2) - \min_{x_2\in\mathcal{X}^{(-1)}_{2,j}} d(\hat{x}_1(x_2), x_2). \tag{11}$$
Since $\mathbf{Q}$ and $\mathbf{Q}'$ are unitary, the MAP solutions in (6) and (10) are identical. To find the hard-decision (HD)
MAP solution, only a 1-sided QL decomposition (QLD) is needed on either layer 1 or 2. If $Q_1 \le Q_2$, a list of $Q_1$ distances
$\mathcal{D}_1 = \{d(x_1, \hat{x}_2(x_1)) : x_1 \in \mathcal{X}_1\}$ is generated by enumerating all symbols $x_1 \in \mathcal{X}_1$ and the minimum is selected. If
$Q_2 < Q_1$, a list of $Q_2$ distances $\mathcal{D}_2 = \{d(\hat{x}_1(x_2), x_2) : x_2 \in \mathcal{X}_2\}$ is generated and the minimum is selected.
However, to generate soft LLRs, 2-sided QLDs are needed, and both lists of distances must be generated to select
the appropriate minima according to (9) and (11).
A. Distance Metric Optimizations
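For efficient distance computations, we separate the real and imaginary parts of all complex variables, and exploit
the fact that the real and imaginary parts of each QAM symbol are mapped independently into 1D PAM symbols,
i.e., some bits are used only for mapping of the real part and some only for the imaginary part (see Fig. 1). Note
that this mapping is used in, e.g., the IEEE 802.11ac [1] and LTE [2] standards. Under this assumption, we can
split the bias term $\mathbf{b}^T(x_n)\boldsymbol{\lambda}_n$ into a part $\mathbf{b}^T_{nR}\boldsymbol{\lambda}_{nR} \triangleq \mathbf{b}^T(x_{nR})\boldsymbol{\lambda}_{nR}$ associated with the bits of the real part of the
QAM symbol, and a part $\mathbf{b}^T_{nI}\boldsymbol{\lambda}_{nI} \triangleq \mathbf{b}^T(x_{nI})\boldsymbol{\lambda}_{nI}$ associated with the bits of the imaginary part. Let $\gamma = \gamma_R + j\gamma_I$,
$x_n = x_{nR} + jx_{nI}$, $y_n = y_{nR} + jy_{nI}$ for $n = 1, 2$. Then the distance in (4) becomes
$$d(\mathbf{x}) = f_{1R}(x_{1R}) + f_{1I}(x_{1I}) + f_{2R}(x_{2R}|x_1) + f_{2I}(x_{2I}|x_1), \tag{12}$$
where the terms on the righthand side are given by
\begin{align*}
f_{1R}(x_{1R}) &= (y_{1R} - \alpha x_{1R})^2 - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R} \\
f_{1I}(x_{1I}) &= (y_{1I} - \alpha x_{1I})^2 - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I} \\
f_{2R}(x_{2R}|x_1) &= (y_{2R} - \gamma_R x_{1R} + \gamma_I x_{1I} - \beta x_{2R})^2 - \mathbf{b}^T_{2R}\boldsymbol{\lambda}_{2R} \\
f_{2I}(x_{2I}|x_1) &= (y_{2I} - \gamma_R x_{1I} - \gamma_I x_{1R} - \beta x_{2I})^2 - \mathbf{b}^T_{2I}\boldsymbol{\lambda}_{2I}.
\end{align*}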
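Expanding (12), minimizing with respect to $x_{2R}$ and $x_{2I}$, and removing irrelevant terms, we obtain the following
key equation:
$$\bar{d}(\mathbf{x}) = \bar{f}_{1R}(x_{1R}) + \bar{f}_{1I}(x_{1I}) + \min_{x_{2R}\in\mathcal{P}_2} \bar{f}_{2R}(x_{2R} \,|\, x_1) + \min_{x_{2I}\in\mathcal{P}_2} \bar{f}_{2I}(x_{2I} \,|\, x_1), \tag{13}$$
where $\mathcal{P}_2$ is the 1D PAM constellation in $\mathcal{X}_2$ of layer 2, and
\begin{align}
\bar{f}_{1R}(x_{1R}) &= A x_{1R}^2 + C x_{1R} - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R} \tag{14} \\
\bar{f}_{1I}(x_{1I}) &= A x_{1I}^2 + D x_{1I} - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I} \tag{15} \\
\bar{f}_{2R}(x_{2R} \,|\, x_1) &= (E x_{1R} + F x_{1I})\, x_{2R} + (B x_{2R}^2 + G x_{2R} - \mathbf{b}^T_{2R}\boldsymbol{\lambda}_{2R}) \tag{16} \\
\bar{f}_{2I}(x_{2I} \,|\, x_1) &= (E x_{1I} - F x_{1R})\, x_{2I} + (B x_{2I}^2 + H x_{2I} - \mathbf{b}^T_{2I}\boldsymbol{\lambda}_{2I}). \tag{17}
\end{align}
The constant coefficients in (14)-(17) are given by
\begin{align}
A &= \alpha^2 + |\gamma|^2, \quad B = \beta^2, \tag{18} \\
C &= -2\,(\alpha y_{1R} + \gamma_R y_{2R} + \gamma_I y_{2I}), \notag \\
D &= -2\,(\alpha y_{1I} - \gamma_I y_{2R} + \gamma_R y_{2I}), \notag \\
E &= +2\beta\gamma_R, \quad F = -2\beta\gamma_I, \quad G = -2\beta y_{2R}, \quad H = -2\beta y_{2I}, \tag{19}
\end{align}
and can be precomputed off-line from $\mathbf{H}$ and $\mathbf{y}$. The HD-MAP solution is obtained by populating all $Q_1$ distances
in (13) and selecting the minimum. The same applies for the LLRs.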
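As a sanity check on (14)-(19), the sketch below evaluates the expanded expression and the original distance (4) for every 16-QAM pair with zero priors, using arbitrary illustrative values for $\alpha$, $\beta$, $\gamma$, $y_1$, $y_2$ (all assumed, as are the function names). The two should differ only by the dropped constant $|y_1|^2 + |y_2|^2$.

```python
def constants(alpha, beta, gamma, y1, y2):
    """Coefficients (18)-(19) of the triangularized 2x2 system."""
    A = alpha ** 2 + abs(gamma) ** 2
    B = beta ** 2
    C = -2 * (alpha * y1.real + gamma.real * y2.real + gamma.imag * y2.imag)
    D = -2 * (alpha * y1.imag - gamma.imag * y2.real + gamma.real * y2.imag)
    E = 2 * beta * gamma.real
    F = -2 * beta * gamma.imag
    G = -2 * beta * y2.real
    H = -2 * beta * y2.imag
    return A, B, C, D, E, F, G, H

def d_full(x1, x2, alpha, beta, gamma, y1, y2):
    """Distance (4) with zero prior LLRs."""
    return abs(y1 - alpha * x1) ** 2 + abs(y2 - gamma * x1 - beta * x2) ** 2

def d_bar(x1, x2, coeffs):
    """Sum of (14)-(17) for a given pair (x1, x2), zero prior LLRs."""
    A, B, C, D, E, F, G, H = coeffs
    return (A * x1.real ** 2 + C * x1.real
            + A * x1.imag ** 2 + D * x1.imag
            + (E * x1.real + F * x1.imag) * x2.real + B * x2.real ** 2 + G * x2.real
            + (E * x1.imag - F * x1.real) * x2.imag + B * x2.imag ** 2 + H * x2.imag)

PAM4 = [-3, -1, 1, 3]
QAM16 = [complex(a, b) for a in PAM4 for b in PAM4]

# Illustrative (assumed) values.
alpha, beta, gamma = 1.3, 0.7, 0.4 - 0.9j
y1, y2 = 0.5 + 2.1j, -1.7 + 0.3j
coeffs = constants(alpha, beta, gamma, y1, y2)

offsets = [d_full(x1, x2, alpha, beta, gamma, y1, y2) - d_bar(x1, x2, coeffs)
           for x1 in QAM16 for x2 in QAM16]
```

Because the offset is the same for every candidate, minimizing the expanded form selects the same symbols as minimizing (4).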
B. Slicing Assuming Zero Prior LLRs
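Assuming the input LLRs $\boldsymbol{\lambda}$ are zero, the rightmost term in (1) vanishes and the MAP detection problem reduces
to a least-squares integer ML problem. Then $\hat{x}_2$ in (7) can be obtained by slicing $(y_2 - \gamma x_1)/\beta \in \mathbb{C}$ to the nearest
constellation point in $\mathcal{X}_2$ using the operator $\lfloor u \rceil_{\mathcal{X}_n} \triangleq \arg\min_{x\in\mathcal{X}_n} |u - x|$:
$$\hat{x}_2 = \lfloor (y_2 - \gamma x_1)/\beta \rceil_{\mathcal{X}_2} \in \mathcal{X}_2. \tag{20}$$
By separating the real and imaginary parts as $\hat{x}_2 = \hat{x}_{2R} + j\hat{x}_{2I}$, the slicing operation in (20) splits into:
\begin{align}
\hat{x}_{2R} &= \lfloor (y_{2R} - \gamma_R x_{1R} + \gamma_I x_{1I})/\beta \rceil_{\mathcal{P}_2} \in \mathcal{P}_2, \tag{21} \\
\hat{x}_{2I} &= \lfloor (y_{2I} - \gamma_R x_{1I} - \gamma_I x_{1R})/\beta \rceil_{\mathcal{P}_2} \in \mathcal{P}_2, \tag{22}
\end{align}
where $\mathcal{P}_2 = \{p_1, p_2, \cdots, p_{P_2}\}$ is the $P_2$-PAM constellation, and $P_2 = \sqrt{Q_2}$. The operations in (21)-(22) reduce to
simple comparisons with the (deterministic) decision boundaries of $\mathcal{P}_2$ as follows. Let $z_2 = y_2 - \gamma x_1 = z_{2R} + jz_{2I}$
where
\begin{align}
z_{2R} &= y_{2R} - \gamma_R x_{1R} + \gamma_I x_{1I}, \tag{23} \\
z_{2I} &= y_{2I} - \gamma_R x_{1I} - \gamma_I x_{1R}. \notag
\end{align}
Assume the constellation points are ordered such that $p_i < p_k$ if $i < k$. Then $\hat{x}_{2R}$ maps to the point $p_i$ that satisfies
$$\beta\,\frac{p_{i-1} + p_i}{2} \le z_{2R} < \beta\,\frac{p_i + p_{i+1}}{2} \tag{24}$$
for $i = 1, \cdots, P_2$, where $p_0 = -\infty$ and $p_{P_2+1} = +\infty$. Similarly for $\hat{x}_{2I}$. Hence the actual distances $f_2(x_2|x_1)$
themselves need not be computed for all $x_2$ and a given $x_1$ in order to find the symbol $x_2$ that minimizes $f_2(x_2|x_1)$
in (7). Therefore, (6) requires only $|\mathcal{X}_1| = Q_1$ distance computations. By the same argument, (10) requires only
$|\mathcal{X}_2| = Q_2$ distance computations.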
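The saving described above can be illustrated with a short sketch. It assumes 16-QAM on both layers, zero prior LLRs, and illustrative channel and noise values; the function names are hypothetical. The sliced search evaluates $Q_1 = 16$ distances, while the exhaustive reference evaluates $Q_1 Q_2 = 256$.

```python
PAM4 = [-3, -1, 1, 3]                       # ordered 4-PAM points p_1 < ... < p_4
QAM16 = [complex(a, b) for a in PAM4 for b in PAM4]

def slice_pam(z, beta, pam=PAM4):
    """Return the point p_i whose interval in (24) contains z."""
    for i in range(len(pam) - 1):
        if z < beta * (pam[i] + pam[i + 1]) / 2:
            return pam[i]
    return pam[-1]

def dist(x1, x2, alpha, beta, gamma, y1, y2):
    """Distance (4) with zero prior LLRs."""
    return abs(y1 - alpha * x1) ** 2 + abs(y2 - gamma * x1 - beta * x2) ** 2

def hd_map_sliced(alpha, beta, gamma, y1, y2):
    """Enumerate x1 and slice x2 per (21)-(22); one distance per x1."""
    best = None
    for x1 in QAM16:
        z = y2 - gamma * x1                  # z2 of (23)
        x2 = complex(slice_pam(z.real, beta), slice_pam(z.imag, beta))
        d = dist(x1, x2, alpha, beta, gamma, y1, y2)
        if best is None or d < best[0]:
            best = (d, x1, x2)
    return best[1], best[2]

def hd_map_brute(alpha, beta, gamma, y1, y2):
    """Exhaustive reference over all Q1*Q2 symbol pairs."""
    pairs = [(x1, x2) for x1 in QAM16 for x2 in QAM16]
    return min(pairs, key=lambda p: dist(p[0], p[1], alpha, beta, gamma, y1, y2))

# Illustrative (assumed) channel entries, transmitted symbols and noise.
alpha, beta, gamma = 1.1, 0.8, 0.3 + 0.5j
x_true = (1 - 3j, -3 + 1j)
y1 = alpha * x_true[0] + (0.05 - 0.02j)
y2 = gamma * x_true[0] + beta * x_true[1] + (-0.03 + 0.04j)

sliced = hd_map_sliced(alpha, beta, gamma, y1, y2)
brute = hd_map_brute(alpha, beta, gamma, y1, y2)
```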
C. Slicing Assuming Non-Zero Prior LLRs
When the prior terms are included in the distance computations, slicing cannot be directly applied in (7) since
the decision boundaries now depend on the bias term bT(x2)λ2. We develop next an optimal scheme that enables
efficient slicing similar to (24) based on [49]. In [50], a scheme that computes suboptimal slicing boundaries was
presented. Compared to our approach, [50] incurs a performance loss with equivalent complexity.
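The real part of $\hat{x}_2$ in (7) is given by
$$\hat{x}_{2R} = \arg\min_{x_{2R}\in\mathcal{P}_2} \big\{ (z_{2R} - \beta x_{2R})^2 - \mathbf{b}^T(x_{2R})\boldsymbol{\lambda}_{2R} \big\}.$$
To decide in favor of $p_i \in \mathcal{P}_2$, then $\forall\, k \ne i$, we must have
$$(z_{2R} - \beta p_i)^2 - \mathbf{b}^T(p_i)\boldsymbol{\lambda}_{2R} < (z_{2R} - \beta p_k)^2 - \mathbf{b}^T(p_k)\boldsymbol{\lambda}_{2R}. \tag{25}$$
This condition can be formulated in terms of decision boundaries $R(p_i, p_k) = R(p_k, p_i)$:
$$R(p_i, p_k) = B\cdot(p_i + p_k) - \frac{\mathbf{b}^T(p_i) - \mathbf{b}^T(p_k)}{p_i - p_k}\,\boldsymbol{\lambda}_{2R}, \quad \forall\, k \ne i, \tag{26}$$
between $p_i$ and all other $p_k$'s in $\mathcal{P}_2$. Assuming the points in $\mathcal{P}_2$ are ordered such that $p_i < p_k$ if $i < k$, then
for $p_1$ to satisfy (25), we must have $2\beta z_{2R} < R(p_1, p_k)$ for all $p_k > p_1$. For $p_{P_2}$ to satisfy (25), we must have
$2\beta z_{2R} > R(p_{P_2}, p_k)$ for all $p_k < p_{P_2}$. For any other internal point $p_i$, $i \ne 1$, $i \ne P_2$, we must have $2\beta z_{2R} < R(p_i, p_k)$
for all $p_k > p_i$, and $2\beta z_{2R} > R(p_i, p_k)$ for all $p_k < p_i$. These conditions can be combined into a single condition
for $i = 1, \cdots, P_2$, as follows:
$$\max_{k=0,\cdots,i-1} R(p_i, p_k) \le 2\beta z_{2R} < \min_{k=i+1,\cdots,P_2+1} R(p_i, p_k), \tag{27}$$
where $p_0 = -\infty$, $p_{P_2+1} = +\infty$, $\mathbf{b}(p_0) = \mathbf{b}(p_{P_2+1}) = \mathbf{0}_{1\times q_2/2}$. Note that (26) and (27) reduce to (24) when
$\boldsymbol{\lambda}_{2R} = \mathbf{0}_{1\times q_2/2}$.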
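The boundary test in (26)-(27) can be compared against a direct minimization of the biased distance. The sketch below assumes 4-PAM with a hypothetical Gray labeling `BITS` (the actual LTE labeling may differ) and illustrative values for $\beta$ and the prior LLRs; the virtual points $p_0$ and $p_{P_2+1}$ are handled by letting the empty maximum and minimum default to $-\infty$ and $+\infty$.

```python
import math

PAM4 = [-3, -1, 1, 3]
# Hypothetical Gray labels in {-1,+1} form, used only for illustration.
BITS = {-3: (-1, -1), -1: (-1, 1), 1: (1, 1), 3: (1, -1)}

def bias(p, lam):
    """Prior term b^T(p) * lambda for one PAM point."""
    return sum(b * l for b, l in zip(BITS[p], lam))

def boundary(pi, pk, beta, lam):
    """Soft decision boundary R(p_i, p_k) of (26), with B = beta^2."""
    return beta ** 2 * (pi + pk) - (bias(pi, lam) - bias(pk, lam)) / (pi - pk)

def slice_soft(z, beta, lam):
    """Return the p_i satisfying (27) for the real-valued input z."""
    for i, pi in enumerate(PAM4):
        lo = max((boundary(pi, pk, beta, lam) for pk in PAM4[:i]), default=-math.inf)
        hi = min((boundary(pi, pk, beta, lam) for pk in PAM4[i + 1:]), default=math.inf)
        if lo <= 2 * beta * z < hi:
            return pi
    return None  # not expected: some interval always contains 2*beta*z

def slice_brute(z, beta, lam):
    """Reference: minimize (z - beta*p)^2 - b^T(p)*lambda over all points."""
    return min(PAM4, key=lambda p: (z - beta * p) ** 2 - bias(p, lam))

beta, lam = 0.8, (1.5, -2.0)                 # illustrative (assumed) values
grid = [-5 + 0.37 * k for k in range(28)]    # sweep of z values
soft = [slice_soft(z, beta, lam) for z in grid]
brute = [slice_brute(z, beta, lam) for z in grid]
zero_prior = [slice_soft(z, beta, (0.0, 0.0)) for z in grid]
```

With strong priors some points can have an empty interval in (27), in which case they are never selected; this matches the behavior of the exhaustive reference.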
Substituting (23) for $z_{2R}$ in (27), using the constants (18)-(19), and accounting for sign change, we obtain the
following slicing condition that is suitable for hardware implementation:
$$\max_{k=i+1,\cdots,P_2+1} R(p_i, p_k) - G \le E x_{1R} + F x_{1I} < \min_{k=0,\cdots,i-1} R(p_i, p_k) - G. \tag{28}$$
Note that in (28), the maximum on the lefthand side is now taken over all points $p_k \in \mathcal{P}_2$ that are greater than $p_i$
as opposed to smaller than $p_i$ as was done in (27) due to the change in sign. Similarly for the minimum on the
righthand side in (28).
A similar analysis applied to compute $\hat{x}_{2I} = \arg\min_{x_{2I}\in\mathcal{P}_2} f_{2I}(x_{2I}|x_1)$ leads to the decision boundaries $I(p_i, p_k)$:
$$I(p_i, p_k) = B\cdot(p_i + p_k) - \frac{\mathbf{b}^T(p_i) - \mathbf{b}^T(p_k)}{p_i - p_k}\,\boldsymbol{\lambda}_{2I}, \tag{29}$$
using now $\boldsymbol{\lambda}_{2I}$, and the associated slicing condition:
$$\max_{k=i+1,\cdots,P_2+1} I(p_i, p_k) - H \le E x_{1I} - F x_{1R} < \min_{k=0,\cdots,i-1} I(p_i, p_k) - H. \tag{30}$$
Note that by construction of the decision boundaries in (27) (and their imaginary counterparts), the proposed
approach is optimal. The approach in [50] however is suboptimal because it employs heuristics to compute simplified
but suboptimal decision boundaries.
IV. EXTENSION TO HIGHER-ORDER LAYERS
The previous optimizations cannot be directly extended to N ≥ 3 layers because the structure of the lower
triangular matrix L includes off-diagonal terms that prevent searching for the MAP solution by enumerating symbols
in one layer and finding the minima through slicing individually on all other layers in parallel. More specifically,
in Fig. 2(a), the presence of the demarked entries in the LTM implies that determining the MAP solution requires
enumerating symbols on the first N−1 layers and slicing only on the last layer, as is typically done in tree-search
detectors (e.g. [30]), and hence still requiring $O(\prod_n Q_n)$ complexity rather than $O(\sum_n Q_n)$.
One desirable structure of H for a 4-layer MIMO system would be as shown in Fig. 2(b), in which the demarked
entries are zeroed out. Here, by enumerating symbols on layer 1, the minimum distances and associated symbols
on layers 2 to 4 can be searched for in parallel through slicing only on the corresponding layers, similar to the
2-layer system. This suffices to compute the LLRs associated with the bits on layer-1 symbol. A similar process is
repeated by decomposing H according to the structures shown in Figs. 2(c)-(e) [47] to compute the LLRs for bits
associated with layers 2 to 4.
Fig. 2. 4×4 structures: (a) Full; (b)-(e) punctured structures for every layer.
Other “punctured” structures are also possible for a 4×4 system as shown in Fig. 3. They differ in 1) the number
of layers over which symbols are enumerated (enumeration or detection set), 2) the submatrix structure used to
propagate these enumerated symbols and cancel their interference effect from the remaining layers (interference
cancellation set), and 3) the number of layers in which the minimum distance and associated symbol can be obtained
by slicing after interference cancellation (slicer set). Let U denote the size of the enumeration set, S the size of
the slicer set, and S×U the size of the interference cancellation set. We refer to this structure using the triplet
(U, S×U, S). For example, in Fig. 3(a), we enumerate over U = 1 layer only, cancel interference from this layer
to the 3 other layers using a 3×1 structure, and slice over S=3 layers. In the structure in Fig. 3(b), we enumerate
over U=2 layers, cancel interference using a 2×2 structure, and slice over S=2 layers.
LLR values are generated for bits in symbols included in the detection set only. Complementary structures that
enumerate symbols on other decoupled layers are required to generate their respective LLRs. For example, the
(1, 3×1, 3) structure requires 3 similar structures to generate LLRs for layers 2 to 4 (Fig. 2(c)-(e)). When U >1,
decoupled layers can overlap by placing a stream with low reliability in multiple detection sets.
[Figure: three punctured structures, each marking its enumeration set, interference-cancellation block, and slicer set]
Fig. 3. (a) (1, 3×1, 3), (b) (2, 2×2, 2), (c) (3, 1×3, 1) punctured structures
A. WL Decomposition (WLD) Scheme
In [51], a decomposition scheme was introduced to transform H into a punctured LTM L with a desired structure
via a projection matrix W. In this section, we extend the scheme to handle soft-input MIMO detection using prior
LLRs fed from a soft-input-soft-output channel decoder. We assume N=M .
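We seek a matrix $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_N] \in \mathbb{C}^{N\times N}$ such that $\mathbf{W}^*\mathbf{H} = \mathbf{L}$ is a punctured LTM and $\mathbf{L} = [l_{ij}] \in \mathbb{C}^{N\times N}$
with $l_{ii} \in \mathbb{R}^+$. In general, if $\mathbf{L}$ is punctured, then $\mathbf{W}$ is non-unitary and hence does not preserve Euclidean norm:
\begin{align}
\mathbf{y} &\triangleq \mathbf{W}^*\tilde{\mathbf{y}} = \mathbf{L}\mathbf{x} + \mathbf{W}^*\mathbf{n} \tag{31} \\
g(\mathbf{x}) &\triangleq \|\mathbf{y} - \mathbf{L}\mathbf{x}\|^2 \ne \|\tilde{\mathbf{y}} - \mathbf{H}\mathbf{x}\|^2 = d(\mathbf{x}) \tag{32}
\end{align}
However, if we impose the condition
$$\mathrm{diag}(\mathbf{W}^*\mathbf{W}) = [1\ 1\ \cdots\ 1]^T_{1\times N},$$
then the transformed noise vector $\mathbf{W}^*\mathbf{n}$ has an unaltered covariance matrix $E[\mathbf{W}^*\mathbf{n}\mathbf{n}^*\mathbf{W}] = \sigma^2\mathbf{I}_N$.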
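To induce a specific pattern of zeros below the main diagonal in $\mathbf{L}$, we choose the columns of $\mathbf{W}$ to be
orthogonal to the columns of $\mathbf{H} = [\mathbf{h}_1\ \mathbf{h}_2\ \cdots\ \mathbf{h}_N]$ where these zeros are to be introduced. More specifically, let
$\mathcal{I}_n$, $n = 1, \cdots, N$, be the column index sets where puncturing is desired in each row $n$ of $\mathbf{H}$. Denote $\mathbf{H}_{\mathcal{I}_n}$ the
submatrix formed by the columns of $\mathbf{H}$ whose index belongs to set $\mathcal{I}_n$. Define the column vector $\tilde{\mathbf{w}}_n = \mathbf{P}^{\perp}_{\mathcal{I}_n}\mathbf{h}_n$,
where
$$\mathbf{P}^{\perp}_{\mathcal{I}_n} = \mathbf{I}_N - \mathbf{H}_{\mathcal{I}_n}\big(\mathbf{H}^*_{\mathcal{I}_n}\mathbf{H}_{\mathcal{I}_n}\big)^{-1}\mathbf{H}^*_{\mathcal{I}_n}, \tag{33}$$
and $\mathbf{H}_{\mathcal{I}_n} = \{\mathbf{h}_m \,|\, m \in \mathcal{I}_n\}$. Then the column vectors of $\mathbf{W}$ are given by
$$\mathbf{w}_n = \frac{\tilde{\mathbf{w}}_n}{\|\tilde{\mathbf{w}}_n\|} = \frac{\mathbf{P}^{\perp}_{\mathcal{I}_n}\mathbf{h}_n}{\sqrt{\mathbf{h}^*_n\mathbf{P}^{\perp}_{\mathcal{I}_n}\mathbf{h}_n}}.$$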
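The construction of $\mathbf{W}$ can be sketched for a 4×4 channel and the first-layer structure of Fig. 2(b). This is an illustrative sketch only: it applies the projector of (33) through Gram-Schmidt orthogonalization of the punctured columns (mathematically equivalent to the matrix-inverse form, but not the modified QL procedure of [51]), and the channel values and function names are assumptions.

```python
import random

def inner(u, v):
    """Hermitian inner product u^* v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def project_out(h, basis_cols):
    """Apply the projector of (33): remove from h its part in span(basis_cols)."""
    ortho = []
    for c in basis_cols:                     # Gram-Schmidt on the punctured columns
        for q in ortho:
            coef = inner(q, c)
            c = [ci - coef * qi for ci, qi in zip(c, q)]
        n = inner(c, c).real ** 0.5
        ortho.append([ci / n for ci in c])
    for q in ortho:
        coef = inner(q, h)
        h = [hi - coef * qi for hi, qi in zip(h, q)]
    return h

def wld_first_layer(H):
    """Return W (list of columns) and L = W^* H for the (1, S x 1, S) structure."""
    N = len(H)
    cols = [[H[r][c] for r in range(N)] for c in range(N)]
    W = []
    for n in range(N):
        punct = [m for m in range(1, N) if m != n]   # index set I_n (0-based)
        w = project_out(cols[n], [cols[m] for m in punct])
        norm = inner(w, w).real ** 0.5
        W.append([wi / norm for wi in w])
    L = [[inner(W[n], cols[c]) for c in range(N)] for n in range(N)]
    return W, L

# Assumed, well-conditioned test channel (diagonally dominant).
rng = random.Random(1)
H = [[(3 if r == c else 0) + 0.5 * complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
      for c in range(4)] for r in range(4)]
W, L = wld_first_layer(H)
```

The resulting `L` keeps only its first column and its diagonal, and each column of `W` has unit norm, as required by the condition on $\mathrm{diag}(\mathbf{W}^*\mathbf{W})$.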
Furthermore, it was shown in [51] that L and W∗y˜ can be derived using a simple modification to the standard
QL decomposition procedure [52]. This avoids the need for expensive matrix inversion operations in (33). On
modern vector digital signal processors (DSPs), matrix QLD operations are natively supported and optimized as part
of the instruction set. For example, on a CEVA XC-4210 processor [53], QL decomposition of a 4×4 complex
matrix requires only 12 clock cycles. Hence, we assume that the channel matrix H has been preprocessed by a
similar DSP, and detection is performed based on the transformed system in (31). Note that because of (32), the
solution to the detection problem is no longer optimal (but still achieves near-optimal performance as demonstrated
in Section VII).
B. Optimized Detection Algorithm Using WLD
We next present an optimized detection algorithm based on the WLD scheme, by extending the N = 2 case of
Section III. For simplicity, we only consider decompositions of the form (1, S×1, S), similar to Fig. 2. The N
layers are decoupled by first circularly shifting the columns of H, and then performing WLD on the permuted H.
We refer to the decomposition whose detection set is the mth layer as the mth WLD of H. To simplify notation,
we describe the detection steps for m = 1. The same steps apply to detect the other layers with an appropriate
adjustment to the layer indices. Let
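$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}, \quad
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}, \quad
\mathbf{L} = \begin{bmatrix}
\alpha & 0 & 0 & \cdots & 0 \\
\gamma_2 & \beta_2 & 0 & \cdots & 0 \\
\gamma_3 & 0 & \beta_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_N & 0 & \cdots & 0 & \beta_N
\end{bmatrix}, \tag{34}$$
be the transmitted symbol vector, received signal vector, and the WL-decomposed channel matrix in normal order,
respectively, where: $y_n \in \mathbb{C}$, $x_n \in \mathcal{X}_n$ for $n = 1, \cdots, N$; $\alpha, \beta_n \in \mathbb{R}^+$ and $\gamma_n \in \mathbb{C}$ for $n = 2, \cdots, N$. Then the distance
metric $g(\mathbf{x})$ of $\mathbf{x}$ from $\mathbf{y}$ based on $\mathbf{L}$ in (32) can be written as
$$g(\mathbf{x}) = f_1(x_1) + \sum_{n=2}^{N} f_n(x_n \,|\, x_1), \tag{35}$$
where
$$f_1(x_1) = |y_1 - \alpha x_1|^2 - \mathbf{b}^T(x_1)\boldsymbol{\lambda}_1, \quad\text{and}$$
$$f_n(x_n|x_1) = |y_n - \gamma_n x_1 - \beta_n x_n|^2 - \mathbf{b}^T(x_n)\boldsymbol{\lambda}_n,$$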
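for $n = 2, \cdots, N$. We next minimize $g(\mathbf{x})$ similar to (5):
\begin{align}
g^{\mathrm{WL}} &\triangleq \min_{\mathbf{x}\in\mathcal{X}} \Big\{ f_1(x_1) + \sum_{n=2}^{N} f_n(x_n \,|\, x_1) \Big\} \tag{36} \\
&= \min_{x_1\in\mathcal{X}_1} \Big\{ f_1(x_1) + \sum_{n=2}^{N} \min_{x_n\in\mathcal{X}_n} f_n(x_n \,|\, x_1) \Big\} \tag{37} \\
&= \min_{x_1\in\mathcal{X}_1} \Big\{ f_1(x_1) + \sum_{n=2}^{N} f_n(\hat{x}_n(x_1) \,|\, x_1) \Big\} \notag \\
&= \min_{x_1\in\mathcal{X}_1} g(x_1, \hat{x}_2(x_1), \cdots, \hat{x}_N(x_1)) \tag{38}
\end{align}
where
$$\hat{x}_n(x_1) = \arg\min_{x_n\in\mathcal{X}_n} \big\{ |y_n - \gamma_n x_1 - \beta_n x_n|^2 - \mathbf{b}^T(x_n)\boldsymbol{\lambda}_n \big\}.$$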
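Denote the set of sliced symbol vectors for all possible $x_1$ in (38) by (defined similar to (8) but for any $N \ge 2$)
$$\mathcal{O}_1 = \big\{ [x_1\ \ \hat{x}_2(x_1)\ \cdots\ \hat{x}_N(x_1)]^T : x_1 \in \mathcal{X}_1 \big\}.$$
The symbol vector that minimizes (35) is denoted as
$$\mathbf{x}^{\mathrm{WL}} \triangleq \arg\min_{\mathbf{x}\in\mathcal{O}_1} g(\mathbf{x}). \tag{39}$$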
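The enumeration-plus-slicing search of (36)-(39) can be sketched as follows, assuming $N = 3$ QPSK layers, zero prior LLRs, and an already punctured $\mathbf{L}$ with illustrative entries (all values and function names are assumptions). The exhaustive search over $4^3 = 64$ vectors serves as the reference; the sliced search evaluates only 4 distances.

```python
from itertools import product

PAM = [-1, 1]
QPSK = [complex(a, b) for a in PAM for b in PAM]

def slice_pam(z, beta):
    """Nearest PAM point to z/beta via the boundaries of (24)."""
    for i in range(len(PAM) - 1):
        if z < beta * (PAM[i] + PAM[i + 1]) / 2:
            return PAM[i]
    return PAM[-1]

def g(x, y, alpha, betas, gammas):
    """Distance (35) with zero priors; betas[k-1], gammas[k-1] belong to layer k+1."""
    d = abs(y[0] - alpha * x[0]) ** 2
    for k, (b, c) in enumerate(zip(betas, gammas), start=1):
        d += abs(y[k] - c * x[0] - b * x[k]) ** 2
    return d

def detect_wl(y, alpha, betas, gammas):
    """Enumerate x1, slice every other layer independently, keep the minimum."""
    best = None
    for x1 in QPSK:
        x = [x1]
        for k, (b, c) in enumerate(zip(betas, gammas), start=1):
            z = y[k] - c * x1                # interference from layer 1 removed
            x.append(complex(slice_pam(z.real, b), slice_pam(z.imag, b)))
        d = g(x, y, alpha, betas, gammas)
        if best is None or d < best[0]:
            best = (d, tuple(x))
    return best

# Illustrative (assumed) punctured-L entries and received vector.
alpha, betas, gammas = 1.0, [0.9, 0.6], [0.4 - 0.2j, -0.3 + 0.7j]
y = [0.3 - 1.2j, 1.1 + 0.4j, -0.8 - 0.5j]

g_wl, x_wl = detect_wl(y, alpha, betas, gammas)
g_brute = min(g(x, y, alpha, betas, gammas) for x in product(QPSK, repeat=3))
```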
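To efficiently determine $g^{\mathrm{WL}}$, we optimize the distance computations in (36) by splitting the complex quantities
into their real and imaginary components:
\begin{align*}
f_1(x_1) &= f_{1R}(x_{1R}) + f_{1I}(x_{1I}) \\
f_{1R}(x_{1R}) &= (y_{1R} - \alpha x_{1R})^2 - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R} \\
f_{1I}(x_{1I}) &= (y_{1I} - \alpha x_{1I})^2 - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I}
\end{align*}
and
\begin{align*}
f_n(x_n|x_1) &= f_{nR}(x_{nR}|x_1) + f_{nI}(x_{nI}|x_1) \\
f_{nR}(x_{nR}|x_1) &= (y_{nR} - \gamma_{nR} x_{1R} + \gamma_{nI} x_{1I} - \beta_n x_{nR})^2 - \mathbf{b}^T_{nR}\boldsymbol{\lambda}_{nR} \\
f_{nI}(x_{nI}|x_1) &= (y_{nI} - \gamma_{nR} x_{1I} - \gamma_{nI} x_{1R} - \beta_n x_{nI})^2 - \mathbf{b}^T_{nI}\boldsymbol{\lambda}_{nI}
\end{align*}
for $n \ge 2$. Substituting back in (37), expanding terms, minimizing w.r.t. $x_{nR}$ and $x_{nI}$, and eliminating irrelevant
terms, we obtain
$$\bar{g}(\mathbf{x}) = \bar{f}_{1R}(x_{1R}) + \bar{f}_{1I}(x_{1I}) + \sum_{n=2}^{N} \Big( \min_{x_{nR}\in\mathcal{P}_n} \bar{f}_{nR}(x_{nR}|x_1) + \min_{x_{nI}\in\mathcal{P}_n} \bar{f}_{nI}(x_{nI}|x_1) \Big) \tag{40}$$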
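where
\begin{align*}
\bar{f}_{1R}(x_{1R}) &= A x_{1R}^2 + C x_{1R} - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R} \\
\bar{f}_{1I}(x_{1I}) &= A x_{1I}^2 + D x_{1I} - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I} \\
\bar{f}_{nR}(x_{nR}|x_1) &= (E_n x_{1R} + F_n x_{1I})\, x_{nR} + (B_n x_{nR}^2 + G_n x_{nR} - \mathbf{b}^T_{nR}\boldsymbol{\lambda}_{nR}) \\
\bar{f}_{nI}(x_{nI}|x_1) &= (E_n x_{1I} - F_n x_{1R})\, x_{nI} + (B_n x_{nI}^2 + H_n x_{nI} - \mathbf{b}^T_{nI}\boldsymbol{\lambda}_{nI})
\end{align*}
Similar to (18)-(19), the constants above are given by:
\begin{align*}
A &= \alpha^2 + \sum_{n=2}^{N} |\gamma_n|^2, \quad B_n = \beta_n^2 \\
C &= -2\alpha y_{1R} - 2\sum_{n=2}^{N} (\gamma_{nR} y_{nR} + \gamma_{nI} y_{nI}) \\
D &= -2\alpha y_{1I} + 2\sum_{n=2}^{N} (\gamma_{nI} y_{nR} - \gamma_{nR} y_{nI}) \\
E_n &= +2\beta_n\gamma_{nR}, \quad F_n = -2\beta_n\gamma_{nI}, \quad G_n = -2\beta_n y_{nR}, \quad H_n = -2\beta_n y_{nI}.
\end{align*}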
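Using $g(\mathbf{x})$ (or $\bar{g}(\mathbf{x})$), the LLRs of the bits in layer 1 are
$$\Lambda^{\mathrm{WL}}_{1,j} = \min_{x_1\in\mathcal{X}^{(+1)}_{1,j}} g(x_1, \hat{x}_2(x_1), \cdots, \hat{x}_N(x_1)) - \min_{x_1\in\mathcal{X}^{(-1)}_{1,j}} g(x_1, \hat{x}_2(x_1), \cdots, \hat{x}_N(x_1)). \tag{41}$$
The bit-LLRs in the remaining $N-1$ layers are similarly obtained by using the other $N-1$ complementary WL
structures of $\mathbf{H}$ (see Fig. 2). Finally, equations (26)-(28) for $N = 2$ can be used to slice $\hat{x}_{nR} = \arg\min_{x_{nR}\in\mathcal{P}_n} f_{nR}(x_{nR}|x_1)$,
and (29)-(30) to slice $\hat{x}_{nI} = \arg\min_{x_{nI}\in\mathcal{P}_n} f_{nI}(x_{nI}|x_1)$, but with the constants $B, E, F, G, H$ replaced by $B_n, E_n, F_n, G_n, H_n$,
and $\mathcal{P}_2, \boldsymbol{\lambda}_{2R}, \boldsymbol{\lambda}_{2I}$ by $\mathcal{P}_n, \boldsymbol{\lambda}_{nR}, \boldsymbol{\lambda}_{nI}$.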
C. Post LLR Processing
Since $g(\mathbf{x}) \ne d(\mathbf{x})$, there is no guarantee that the $g^{\mathrm{WL}}$ and $\mathbf{x}^{\mathrm{WL}}$ obtained in (38) and (39) using one WLD
structure of $\mathbf{H}$, are the same ones obtained using the other $N-1$ WLD structures with the columns of $\mathbf{H}$ permuted.
To avoid confusion, we refer to the quantities in (35), (38), and (39) pertaining to the $m$th layer WL decomposition
using the subscript $m$: $g_m(\mathbf{x})$, $g^{\mathrm{WL}}_m$, $\mathbf{x}^{\mathrm{WL}}_m$.
The “WL-minimal” HD solution, denoted as $g^{\mathrm{WL}}_{\min}$ and $\mathbf{x}^{\mathrm{WL}}_{\min}$, corresponds to the minimum of the $N$ various $g^{\mathrm{WL}}_m$
values:
$$g^{\mathrm{WL}}_{\min} = \min_m g^{\mathrm{WL}}_m, \qquad \mathbf{x}^{\mathrm{WL}}_{\min} = \arg\min_{\mathbf{x}^{\mathrm{WL}}_m} g^{\mathrm{WL}}_m.$$
A similar minimization is required as well to adjust the LLR values $\Lambda^{\mathrm{WL}}_{n,j}$ relative to the global minimum $g^{\mathrm{WL}}_{\min}$
and the bits of its corresponding symbol vector $\mathbf{x}^{\mathrm{WL}}_{\min}$. This adjustment cannot be done by comparing the individual
$\Lambda^{\mathrm{WL}}_{n,j}$'s alone. One simple way is based on the list of distances $g_m(\mathbf{x})$ generated from all decompositions for
$m = 1, \cdots, N$, together with their corresponding symbol vectors. Let $\mathcal{O}_m$ denote the set of symbol vectors
$$\mathcal{O}_m = \big\{ [\hat{x}_1(x_m)\ \cdots\ \hat{x}_{m-1}(x_m)\ \ x_m\ \ \hat{x}_{m+1}(x_m)\ \cdots\ \hat{x}_N(x_m)]^T : x_m \in \mathcal{X}_m \big\}, \quad m = 1, \cdots, N, \tag{42}$$
[Figure: inputs from DSP (y, H, constants, prior LLRs); distance blocks f1R, f1I, f2R, f2I; two slicers; adders; Distance Buffer; Sliced Symbol Buffer; LLR Processing; support for higher-order layers]
Fig. 4. Block diagram of a parallel one-sided 2×2 MAP detector core, with input and output interfaces
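where the $n$th sliced symbol in the $m$th WLD is
$$\hat{x}_n(x_m) = \arg\min_{x_n\in\mathcal{X}_n} \big\{ |y_{n,m} - \gamma_{n,m} x_m - \beta_{n,m} x_n|^2 - \mathbf{b}^T(x_n)\boldsymbol{\lambda}_n \big\},$$
for $n \ne m$. Here $y_{n,m}$, $\gamma_{n,m}$, and $\beta_{n,m}$ are defined as in (34) but relative to the $m$th WLD of $\mathbf{H}$. Next, define the
partitions on $\mathcal{O}_m$: $\mathcal{O}^{(+1)}_{n,j,m} = \{\mathbf{x} \in \mathcal{O}_m : b_{n,j} = +1\}$ and $\mathcal{O}^{(-1)}_{n,j,m} = \{\mathbf{x} \in \mathcal{O}_m : b_{n,j} = -1\}$. Then the “WL-minimal”
LLRs are given by
$$\Lambda^{\mathrm{WL}}_{n,j,\min} = \min_m \Big[ \min_{\mathbf{x}\in\mathcal{O}^{(+1)}_{n,j,m}} g_m(\mathbf{x}) \Big] - \min_m \Big[ \min_{\mathbf{x}\in\mathcal{O}^{(-1)}_{n,j,m}} g_m(\mathbf{x}) \Big]. \tag{43}$$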
D. Discussion
The key equations for the general N -layer case derived above reduce to the optimal equations derived in Section III
for N = 2. A comparison between the two shows that the same operations applied to compute d¯(x) in (13) are
applied to compute g¯(x) in (40), but using the respective constants of layer n instead of layer 2. Hence, a 2×2
MAP detector can be viewed as a building block for constructing detectors for higher-order layers, with a simple
modification to account for the extra accumulated sum terms in (40), in addition to the LLR processing of (43) at
the output stage. A parallel architecture will be developed next and its complexity analyzed.
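The LLR processing of (43) at the output stage amounts to a minimum over the union of all candidate lists. A minimal sketch follows, assuming each decomposition supplies a list of (distance, bit-vector) pairs; the data layout, function name, and numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def wl_minimal_llrs(lists, n_bits):
    """Compute the LLRs of (43).

    lists[m] holds the candidates of the m-th decomposition as
    (g_m(x), bits(x)) pairs, with bits(x) a tuple over {-1, +1}.
    Raises ValueError if some bit value never occurs in any list.
    """
    llrs = []
    for j in range(n_bits):
        d_plus = min(d for lst in lists for d, bits in lst if bits[j] == +1)
        d_minus = min(d for lst in lists for d, bits in lst if bits[j] == -1)
        llrs.append(d_plus - d_minus)
    return llrs

# Toy candidate lists from two decompositions (made-up distances).
lists = [
    [(1.0, (+1, +1)), (2.5, (+1, -1)), (4.0, (-1, +1)), (3.0, (-1, -1))],
    [(1.5, (+1, +1)), (0.5, (-1, +1)), (2.0, (+1, -1)), (5.0, (-1, -1))],
]
llrs = wl_minimal_llrs(lists, n_bits=2)
```

In this toy case the best candidate overall comes from the second list, and the first-bit LLR computed from the first list alone would be $1.0 - 3.0 = -2.0$ rather than $0.5$, which illustrates why the minimization must run across all decompositions.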
V. PARALLEL 2-LAYER DETECTOR ARCHITECTURE
Figure 4 shows a block diagram of a parallel 2×2 MAP detector core that implements the key equations in (13)-
(17). For flexibility and scalability to higher-order layers, the constellations supported on each layer are configurable
from BPSK up to 256-QAM, and can be distinct on each layer. We assume the input constants (18)-(19) to the
detector are supplied by an external DSP. The outputs are two lists of distances D1,D2 and their associated lists
of symbol vectors O1,O2, which are fed to a post LLR processing stage to extract the LLR values depending on
the number of layers.
A. Optimized Implementation of Distance Expressions
A careful inspection of expressions (14)-(17) shows that d¯(x) can be evaluated without using multipliers, assuming
the constants are pre-processed and fed as inputs to the detector. The reason is that the variables x1R, x1I, x2R, and
x2I are integers that belong to a PAM constellation. More specifically, in LTE [2], they are odd integers in the set
$\mathcal{P}_2=\{2m+1 \mid m=-P_2/2,\cdots,0,\cdots,P_2/2-1\}$ with $P_2=\sqrt{Q_2}$. Hence the terms that involve the products of x1R,
x1I, x2R, x2I in (14)-(17) with the constants in (18)-(19) are simply integer multiples of these constants. These
product terms can be computed using basic addition operations with appropriate power-of-2 manipulations of the
operands without using expensive multipliers. Also from symmetry, only positive multiples need to be computed.
Table I summarizes the number of various distinct product terms that need to be computed for various PAM
constellation sizes.
TABLE I
DISTINCT PRODUCT TERMS TO BE COMPUTED: $x, y, z\in\mathcal{P}_2$; $r, s\in\mathbb{R}$.
# distinct terms              2-PAM   4-PAM   8-PAM   16-PAM
r·|x|                           1       2       4        8
r·|x|·|y|                       1       3      10       33
r·x^2                           1       2       4        8
(r·|x| ± s·|y|)·|z|             2      14     116      914
|b_{1R}^T λ_{1R}|               1       2       4        8
Moreover, the dot products $\mathbf{b}_{1R}^T\boldsymbol{\lambda}_{1R}$ between the input LLR vectors and all the bit vectors are simply all linear
binary combinations of the $q_1/2=(\log_2 Q_1)/2$ individual input LLRs $\lambda_i$ of $\boldsymbol{\lambda}_{1R}$:
$$\pm\lambda_1\pm\lambda_2\pm\cdots\pm\lambda_{q_1/2}.$$
Also from symmetry, only half of these sums actually need to be computed, giving a total of $2^{q_1/2-1}$ different
sums. The same applies to other dot product terms in (15)-(17).
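Next, as $x_{1R}$ runs over the $P_1$ integers in $\mathcal{P}_1$, the expression $\left(Ax_{1R}^2+Cx_{1R}-\mathbf{b}^T(x_{1R})\boldsymbol{\lambda}_{1R}\right)$ takes $P_1$ different
values. However, because of the Gray mapping of the bits, $\mathbf{b}^T(-x_{1R})\boldsymbol{\lambda}_{1R}\neq-\mathbf{b}^T(x_{1R})\boldsymbol{\lambda}_{1R}$ in general, and hence there
is no symmetry that can be exploited to save computations here. The same argument applies to the three other
expressions $\left(Ax_{1I}^2+Dx_{1I}-\mathbf{b}_{1I}^T\boldsymbol{\lambda}_{1I}\right)$, $\left(Bx_{2R}^2+Gx_{2R}-\mathbf{b}_{2R}^T\boldsymbol{\lambda}_{2R}\right)$, and $\left(Bx_{2I}^2+Hx_{2I}-\mathbf{b}_{2I}^T\boldsymbol{\lambda}_{2I}\right)$ in (15)-(17).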
Finally, for the remaining sum of products of cross terms (Ex1R+Fx1I)x2R, as x2R cycles through the P2 integers
in P2, the expression takes P2 different values for every pair (x1R, x1I). However, for all possible (x1R, x1I),
repetitions occur. The number of unique values of (Ex1R +Fx1I)x2R is twice that of (E|x1R|±F |x1I|)|x2R|
(summarized in Table I). By symmetry, these are also the same values taken by the other sub-expression (Ex1I−
Fx1R)x2I in (17).
Therefore, hardware complexity will be measured in terms of number of adders, in addition to number of (2:1)-
multiplexers (muxes) needed to steer operands to these adders. Wider (n:1)-muxes can be constructed using n− 1
(2:1)-muxes.
We next determine the actual number of adders required to compute each of the unique terms in (14)-(17),
assuming 256-QAM and its underlying 1D 16-PAM constellation. The same analysis applies to other constellations.
The required multiples $Ax_{1R}^2$ for 16-PAM are {9, 25, 49, 81, 121, 169, 225}×A, which can be generated using 11
adders as follows:
9A=8A+A,    25A=16A+9A,   41A=32A+9A,   169A=128A+41A,
15A=16A−A,  49A=64A−15A,  81A=32A+49A,
7A=8A−A,    121A=128A−7A,
31A=32A−A,  225A=256A−31A.
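The shift-and-add chain above can be checked with a short script. The sketch below is only an illustration: the CHAIN table and the function name are ours, and Python integers stand in for fixed-point words. Each entry forms one new multiple as a power-of-2 shift of A plus or minus an already available multiple, so the number of entries equals the number of adders.

```python
# (target multiple, left shift applied to A, sign, previously built multiple)
CHAIN = [
    (9, 3, +1, 1),     # 9A   = 8A   + A
    (25, 4, +1, 9),    # 25A  = 16A  + 9A
    (15, 4, -1, 1),    # 15A  = 16A  - A
    (49, 6, -1, 15),   # 49A  = 64A  - 15A
    (81, 5, +1, 49),   # 81A  = 32A  + 49A
    (7, 3, -1, 1),     # 7A   = 8A   - A
    (121, 7, -1, 7),   # 121A = 128A - 7A
    (41, 5, +1, 9),    # 41A  = 32A  + 9A
    (169, 7, +1, 41),  # 169A = 128A + 41A
    (31, 5, -1, 1),    # 31A  = 32A  - A
    (225, 8, -1, 31),  # 225A = 256A - 31A
]

def square_multiples(A):
    """Return ({k: k*A}, adder_count) built with shifts and one add/subtract per entry."""
    multiples = {1: A}
    for target, shift, sign, operand in CHAIN:
        multiples[target] = (A << shift) + sign * multiples[operand]
    return multiples, len(CHAIN)

multiples, adders = square_multiples(3)
needed = {x * x for x in range(1, 16, 2)}   # squares of the 16-PAM magnitudes
```

The multiples 7A, 15A, 31A, and 41A are intermediate results only; they are what brings the count to 11 adders for the 7 non-trivial squares.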
Similarly, the 8 multiples C |x1R| can be generated using 7 adders. For bT1Rλ1R, 8 values can be generated as
(λ1+λ2)±(λ3+λ4),
(λ1−λ2)±(λ3+λ4),
(λ1+λ2)±(λ3−λ4),
(λ1−λ2)±(λ3−λ4)
with 12 adders. The other 8 are their negatives.
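The two-level grouping above can be sketched as follows, for the 16-PAM case with four input LLRs. The function name is an assumption for this example: four adders form the pairwise sums and differences, eight more combine them, and the remaining eight values follow by negation.

```python
def llr_dot_products(l1, l2, l3, l4):
    """All 16 values of +-l1 +-l2 +-l3 +-l4 using 12 additions plus negations."""
    s12, d12 = l1 + l2, l1 - l2          # 2 adders
    s34, d34 = l3 + l4, l3 - l4          # 2 adders
    half = []
    for p in (s12, d12):
        for q in (s34, d34):
            half.append(p + q)           # 4 adders in total
            half.append(p - q)           # 4 adders in total
    return half + [-v for v in half]     # negation needs no extra adder here

values = llr_dot_products(1, 2, 4, 8)
```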
To generate the unique elements of (E|x1R|±F |x1I|)|x2R|, we first generate all unique sums with x2R =1, i.e.
(E|x1R|±F |x1I|), such that gcd(|x1R| , |x1I|)=1, and then generate all their multiples. The number of unique sums
of the form (E|x1R|±F |x1I|) with co-prime coefficients |x1R| and |x1I| from the set {1, 3, · · · , 15} is 49. We next
enumerate the unique multiples from each of these 49 classes. For (|x1R| , |x1I|) = (1, 1), there are 33×2 distinct
multiples of (E±F ). For (|x1R| , |x1I|)=(1, 3) or (3, 1), there are 18×2 distinct multiples of (E±3F ) and 18×2
of (3E±F ). For (|x1R| , |x1I|)=(1, 5), (5, 1), (3, 5), or (5, 3), there are 13×2 distinct multiples of each. Finally,
for the remaining 42 classes, there are 8×2 distinct multiples of each. Summing all distinct multiples we obtain
914.
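The class-by-class count can be cross-checked by brute-force enumeration. The sketch below assumes generic real constants E and F, so that two terms coincide only when their integer coefficient pairs and signs coincide; the function names are ours. It reproduces the r·|x|·|y| and (r·|x|±s·|y|)·|z| rows of Table I.

```python
def odd_levels(P):
    """Positive amplitudes of a P-PAM constellation: 1, 3, ..., P-1."""
    return range(1, P, 2)

def count_products(P):
    """Number of distinct values of |x|*|y| (row r*|x|*|y| of Table I)."""
    odd = odd_levels(P)
    return len({a * b for a in odd for b in odd})

def count_cross_terms(P):
    """Number of distinct values of (E|x| +- F|y|)|z| for generic E, F."""
    odd = odd_levels(P)
    pairs = {(a * z, b * z) for a in odd for b in odd for z in odd}
    return 2 * len(pairs)                # one value per sign choice

table_row_products = [count_products(P) for P in (2, 4, 8, 16)]
table_row_cross = [count_cross_terms(P) for P in (2, 4, 8, 16)]
```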
Table II summarizes the various constants that appear in the computation of (E|x1R|±F |x1I|)|x2R| for 16-PAM,
and how they are generated using addition operations involving powers-of-2 operands and other already computed
constants. First, the odd multiples 3E, 5E, · · · , 15E, and 3F, 5F, · · · , 15F , require 14 adders. The term (E±F )
and its 33×2 distinct multiples require all the 36 constants in Table II and hence need (36+1)×2 adders. The term
(E±3F ) requires 18 constants {1, 3, 5, 7, 9, 11, 13, 15, 21, 25, 27, 33, 35, 39, 45, 55, 65, 75}, and hence needs 2×18
adders. The same count is needed for (3E±F ). On the other hand, the term (E±5F ) and its 13×2 distinct multiples
require only 13 constants {1, 3, 5, 7, 9, 11, 13, 15, 21, 27, 33, 39, 45} and hence need 2×13 adders. The same applies
for (5E±F ), (3E±5F ), and (5E±3F ). For the remaining 42 classes, only 8 constants {1, 3, 5, 7, 9, 11, 13, 15}
appear and hence need 2×8×42 adders. Summing all counts results in a total of 936 adders. Finally note that
(E|x1R|±F |x1I|)|x2R| takes the same values as (E|x1R|∓F |x1I|)|x2I| but in a different order.
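For reference, the adder tally in the preceding paragraph can be written out term by term, using only the counts stated there:
$$\underbrace{14}_{3E,\dots,15E,\ 3F,\dots,15F}+\underbrace{2(36+1)}_{(E\pm F)}+\underbrace{2\cdot(2\cdot 18)}_{(E\pm 3F),\ (3E\pm F)}+\underbrace{4\cdot(2\cdot 13)}_{(E\pm 5F),\ (5E\pm F),\ (3E\pm 5F),\ (5E\pm 3F)}+\underbrace{42\cdot(2\cdot 8)}_{\text{remaining classes}}=14+74+72+104+672=936.$$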
B. Minimization by Exhaustive Search
One approach to implement the minimizations in (13) is by exhaustive search. In (16), for every pair (x1R, x1I),
16 out of 914 distinct values of (E|x1R|±F |x1I|)|x2R| pertaining to the 16 different x2R’s are added to Bx22R+
Gx2R−bT2Rλ2R, and the minimum is selected. Hence a total of 16×256 adders are needed to generate all possible
values of $\bar f_{2R}(x_{2R}|x_1)$. The same holds for $\bar f_{2I}(x_{2I}|x_1)$. To find the minimum among $P_2$ quantities, a binary
tree of comparators comprising $P_2-1$ adders and $P_2-1$ (2:1)-multiplexers is needed. A total of 2×256 such
comparator trees are needed. Finally, the 256 minima from each case are added to complete the sum for $\bar d(x)$ in (13).
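A minimal software model of one such comparator tree is sketched below. The function name and the returned tuple are assumptions for illustration; each compare-select step corresponds to one subtractor (adder) and one (2:1)-multiplexer in the count above.

```python
def tree_min(values):
    """Return (minimum, its index, number of compare-select steps) using a binary tree."""
    nodes = [(v, i) for i, v in enumerate(values)]
    steps = 0
    while len(nodes) > 1:
        nxt = []
        for k in range(0, len(nodes) - 1, 2):
            a, b = nodes[k], nodes[k + 1]
            nxt.append(a if a[0] <= b[0] else b)   # one comparison, one selection
            steps += 1
        if len(nodes) % 2:                         # odd element passes through
            nxt.append(nodes[-1])
        nodes = nxt
    return nodes[0][0], nodes[0][1], steps

distances = [9, 4, 7, 12, 3, 15, 8, 6, 11, 2, 14, 5, 13, 10, 16, 1]
minimum, index, steps = tree_min(distances)
```

For 16 inputs the tree uses 15 compare-select steps, so the 2×256 trees account for 2×3840 adders and as many muxes.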
TABLE II
CONSTANTS THAT APPEAR IN (E|x1R|±F |x1I|)|x2R| FOR 16-PAM
3=2+1        5=4+1         7=8−1        9=8+1
11=8+3       13=16−3       15=16−1      21=16+5
25=16+9      27=32−5       33=32+1      35=32+3
39=32+7      45=32+13      49=64−15     55=64−9
63=64−1      65=64+1       75=64+11     77=64+13
81=32+49     91=64+27      99=64+35     41=32+9
105=64+41    117=128−11    121=128−7    135=128+7
143=128+15   37=33+4       165=128+37   169=128+41
61=64−3      195=256−61    31=32−1      225=256−31
To generate the hard-decision MAP solution, the minimum among the 256 distances d¯(x) must be taken and
the corresponding constellation symbol be identified. This requires a total of 255 adders and 255 muxes. On the
other hand, to compute the output LLRs of the bits in x1 according to (9), the 256 distances in (13) must be
minimized over two complementary sets for every bit and their difference be taken. The 256-QAM constellation
points can be viewed as 16 columns each containing 16 points, or as 16 rows each containing 16 points. In LTE,
the 4 bits corresponding to the real part of the constellation points do not change in every column, and the 4 bits
corresponding to the imaginary part do not change in every row. Hence it suffices to take the minimum distances
among all points in each row and among all points in each column independently. The column minima are used
to compute the LLRs of the real bits by partitioning the columns into two groups of 8 columns depending on
whether the bit is +1 or −1 in the column. The minimum distance among each group of columns is taken, and
the difference of the two minima generates the LLR of that bit. The same applies to the imaginary bits and the
row minima. Hence a total of 2×16 16-point comparators are needed, amounting to 480 adders and 480 muxes, to
extract the minima, followed by 8 adders to take the differences.
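The row/column shortcut can be illustrated as follows for the real-part bits. This is a sketch under stated assumptions: the bit labelling is a generic reflected Gray code with bit value 0 mapped to +1, which is not necessarily the LTE mapping, and the distances are arbitrary test values. The LLR sign convention (minimum over the +1 set minus minimum over the −1 set) follows (43) and (46).

```python
import random

def gray_bits(index, nbits=4):
    """Bits (+1/-1) of a reflected Gray label for a PAM index (assumed mapping)."""
    g = index ^ (index >> 1)
    return [+1 if ((g >> (nbits - 1 - j)) & 1) == 0 else -1 for j in range(nbits)]

def real_bit_llrs(dist):
    """dist[r][c] is the distance of the point in row r and column c (16 x 16)."""
    col_min = [min(dist[r][c] for r in range(16)) for c in range(16)]  # 16 trees
    llrs = []
    for j in range(4):
        plus = min(col_min[c] for c in range(16) if gray_bits(c)[j] == +1)
        minus = min(col_min[c] for c in range(16) if gray_bits(c)[j] == -1)
        llrs.append(plus - minus)        # one subtraction per bit
    return llrs

rng = random.Random(0)
dist = [[rng.random() for _ in range(16)] for _ in range(16)]
llrs = real_bit_llrs(dist)
```

The same routine applied to the row minima, with rows and columns swapped, would give the LLRs of the imaginary-part bits.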
Table III summarizes the core complexity using exhaustive search. The core requires 18290 adders and 8160
muxes.
C. Minimization by Slicing
We next analyze the complexity of computing $\min_{x_{2R}\in\mathcal{P}_2}\bar f_{2R}(x_{2R}|x_1)$ in (13) via the slicing approach by first
determining $\hat x_{2R}=\arg\min_{x_{2R}\in\mathcal{P}_2}\bar f_{2R}(x_{2R}|x_1)$ followed by evaluating $\bar f_{2R}(\hat x_{2R}|x_1)$, for all possible $x_1$. To minimize
$\bar f_{2R}(x_{2R}|x_1)$, the decision boundaries $R(x_{2R},\bar x_{2R})$ in (26) must be computed for all $x_{2R}\neq\bar x_{2R}\in\mathcal{P}_2$, and appropriate
minima and maxima must be extracted from these boundaries according to (28) and compared to $Ex_{1R}+Fx_{1I}$.
Similarly, to minimize $\bar f_{2I}(x_{2I}|x_1)$, the decision boundaries $I(x_{2I},\bar x_{2I})$ in (29) must be computed for all $x_{2I}\neq\bar x_{2I}\in\mathcal{P}_2$,
and appropriate minima and maxima must be extracted from these boundaries according to (30) and compared
to $Ex_{1I}-Fx_{1R}$.
TABLE III
RESOURCES OF DETECTOR CORE USING EXHAUSTIVE SEARCH
# adders (& muxes)                                        2-PAM   4-PAM   8-PAM   16-PAM
$Ax_{1R}^2$                                                 0       1       4       11
$C|x_{1R}|$                                                 0       1       3        7
$|\mathbf{b}_{1R}^T\boldsymbol{\lambda}_{1R}|$              0       2       6       12
$\bar f_{1R}(x_{1R})$                                       4       8      16       32
$D|x_{1I}|$                                                 0       1       3        7
$|\mathbf{b}_{1I}^T\boldsymbol{\lambda}_{1I}|$              0       2       6       12
$\bar f_{1I}(x_{1I})$                                       4       8      16       32
$\bar f_1(x_1)=\bar f_{1R}(x_{1R})+\bar f_{1I}(x_{1I})$     4      16      64      256
$Bx_{2R}^2$                                                 0       1       4       11
$G|x_{2R}|$                                                 0       1       3        7
$|\mathbf{b}_{2R}^T\boldsymbol{\lambda}_{2R}|$              0       2       6       12
$Bx_{2R}^2+Gx_{2R}-\mathbf{b}_{2R}^T\boldsymbol{\lambda}_{2R}$   4   8      16       32
$H|x_{2I}|$                                                 0       1       3        7
$|\mathbf{b}_{2I}^T\boldsymbol{\lambda}_{2I}|$              0       2       6       12
$Bx_{2I}^2+Hx_{2I}-\mathbf{b}_{2I}^T\boldsymbol{\lambda}_{2I}$   4   8      16       32
$(E|x_{1R}|\pm F|x_{1I}|)|x_{2R}|$                          2      16     122      936
$\bar f_{2R}(x_{2R}|x_1)$                                   8      64     512     4096
$m_{2R}=\min\{\bar f_{2R}(x_{2R}|x_1)\}$                    4      48     448     3840
  muxes →                                                   4      48     448     3840
$\bar f_{2I}(x_{2I}|x_1)$                                   8      64     512     4096
$m_{2I}=\min\{\bar f_{2I}(x_{2I}|x_1)\}$                    4      48     448     3840
  muxes →                                                   4      48     448     3840
$\bar f_1+m_{2R}+m_{2I}$                                    8      32     128      512
HD solution: $\min\{\bar f_1+m_{2R}+m_{2I}\}$               3      15      63      255
  muxes →                                                   3      15      63      255
soft-output LLRs                                            6      28     118      488
  muxes →                                                   4      24     112      480
Total adders (soft-output)                                 60     346    2460    18290
Total muxes (soft-output)                                  12     120    1008     8160
By analogy, it suffices to analyze the complexity of (26) and (28). Since $R(x_{2R},\bar x_{2R})=R(\bar x_{2R},x_{2R})$, only
$P_2(P_2-1)/2=120$ decision boundaries need to be computed (see Fig. 5). The sum $|x_{2R}+\bar x_{2R}|$ takes $P_2-2$ distinct
non-zero values $(2,4,\cdots,2P_2-4)$, and hence the product term $B|x_{2R}+\bar x_{2R}|$ in (26) requires 6 adders. Similarly, the
difference $|x_{2R}-\bar x_{2R}|$ takes $P_2-1$ distinct non-zero values $(2,4,\cdots,2P_2-2)$. For the division of $(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}$
by these constants, where $\bar{\mathbf{b}}_{2R}=\mathbf{b}(\bar x_{2R})$, the term $(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}$ takes 80 distinct values, 40 of which can be
obtained by negation. The other 40 values require 22 adders. The required ratios $(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}/(x_{2R}-\bar x_{2R})$ take
only 40 distinct values, and require divisions by 3, 5, 7, 9, 11, 13, 15. However, each value of $(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}$
need not be divided by all these 7 constants. By going over all various combinations, it is easy to show that the
number of divisions by the various values of $|x_{2R}-\bar x_{2R}|$ is as follows (constant : count):
2 : 4,   4 : 3,   6 : 5,   8 : 2,   10 : 5,  12 : 3,  14 : 4,  16 : 1,
18 : 3,  20 : 2,  22 : 3,  24 : 1,  26 : 2,  28 : 1,  30 : 1.
Divisions by powers-of-2 are trivial. Division by 3 covers division by 6=3×2, 12=3×4, and 24=3×8, and hence
is needed 9 times. Division by 5 covers division by 10 and 20, and hence is needed 7 times. In a similar fashion,
division by 7 is needed 5 times, by 9 is needed 3 times, by 11 is needed 3 times, by 13 is needed 2 times, and by 15
is needed once. The total number of such non-trivial divisions is 30. The complexity of a division-by-small-constant
circuit is roughly equivalent to a small number of adders for small bit-widths. Specifically, a divide-by-3 is equivalent
to 1 adder; by 5, 7, 9, and 11 are equivalent to 2 adders; and by 13 and 15 are equivalent to 3 adders. Hence, the
ratios in (26) can be computed using 54 adders. Finally, computing all 120 decision boundaries by adding/subtracting
the various 14 non-zero values of $B|x_{2R}+\bar x_{2R}|$ to the various 40 distinct ratios $(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}/(x_{2R}-\bar x_{2R})$
requires 112 adders ($B|x_{2R}+\bar x_{2R}|=0$ in 8 cases out of the 120).
Fig. 5. Computation of decision boundaries
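The division bookkeeping in this paragraph can be reproduced from the (constant : count) list and the per-divider adder equivalents quoted in the text. The sketch below only re-tallies those published numbers; the dictionary and function names are ours, and the counts themselves are taken as given rather than re-derived from the bit mapping.

```python
# (constant : count) pairs for |x2R - x2R_bar|, as listed in the text.
DIVISION_COUNTS = {2: 4, 4: 3, 6: 5, 8: 2, 10: 5, 12: 3, 14: 4, 16: 1,
                   18: 3, 20: 2, 22: 3, 24: 1, 26: 2, 28: 1, 30: 1}
# Adder-equivalent cost of each divide-by-constant circuit, as stated in the text.
ADDERS_PER_DIVIDER = {3: 1, 5: 2, 7: 2, 9: 2, 11: 2, 13: 3, 15: 3}

def odd_part(n):
    """Strip factors of 2, since division by a power of 2 is a free shift."""
    while n % 2 == 0:
        n //= 2
    return n

def division_cost():
    per_divisor = {}
    for constant, count in DIVISION_COUNTS.items():
        d = odd_part(constant)
        if d > 1:                                   # skip pure powers of 2
            per_divisor[d] = per_divisor.get(d, 0) + count
    n_divisions = sum(per_divisor.values())
    n_adders = sum(ADDERS_PER_DIVIDER[d] * c for d, c in per_divisor.items())
    return per_divisor, n_divisions, n_adders

per_divisor, n_divisions, n_adders = division_cost()
```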
Moving to (28), a subset of P2−1 minimum and P2−1 maximum regions must be extracted from these boundaries
for every hypothesis point x2R w.r.t. all other P2−1 points in P2. These can be obtained using a set of P2 comparator
trees, comprising a total of 14×15 = 210 adders and 210 (2:1)-MUXs. Next, G is subtracted from each of the
P2−1 min and P2−1 max boundaries using 30 adders. Finally, comparisons between Ex1R +Fx1I and these
min/max boundaries are required to determine xˆ2R according to (28). Each comparison requires 30 adders. Only
128 such comparisons are needed for |Ex1R±Fx1I|, requiring a total of 3840 adders. The other 128 are derived
by symmetry. Figure 6 shows the architecture of the slicer block in Fig. 4.
Based on the results from the slicers, the xˆ2R’s are used to evaluate f¯2R(xˆ2R|x1). This is done by selecting
the appropriate multiples |(Ex1R±Fx1I)xˆ2R| to be added to Bxˆ22R+Gxˆ2R−bT(xˆ2R)λ2R. Hence 256 adders are
needed, in addition to 128 (8:1)-MUXES and 256 (16:1)-MUXES.
Table IV summarizes the complexity resources of the slicer-based detector. The architecture requires 11246 adders
and 10372 muxes, which amount to a 38.51% reduction in adders and a 27.11% increase in muxes compared
with the previous architecture using exhaustive search minimization. The internal pipeline registers, output buffers
and accumulators in Fig. 4 are the same in the two architectures, and thus are not included in the comparisons.
D. Multi-Core Detector Architectures
Depending on the target throughput and the number of antennas N in the MIMO system, multiple detector cores
similar to Fig. 4 can be configured to build a MIMO detector. Figure 7 shows a 2-sided fully parallel 2×2 MIMO
detector architecture that uses 2 separate cores to detect the two streams. Since the detection algorithm in this case
is optimal, the post LLR processing stage simply implements (9) and (11), without the need for distance buffering
and accumulation.
Figure 8 shows a 4-sided fully parallel 4×4 MIMO detector that uses 4 cores to process the 4 streams. Here distance
buffering and accumulation are needed before LLR processing in order to adjust the individual LLRs according
to (43). In this case, the WLD matrix inputs for all 4 streams using the decompositions in (34) are needed. If chip
area is the constraining factor, a MIMO detector can be built using a single core that is time-multiplexed among
the 4 streams.
Fig. 6. Block diagram of optimized slicer architecture
Fig. 7. Block diagram of 2-sided MAP detector
VI. APPLICATION TO MU-MIMO DETECTION
Multi-user MIMO (MU-MIMO) has been proposed as a method for increasing the capacity of wireless networks
[54], [55]. In MU-MIMO, multiple users are scheduled on the same physical resource blocks (PRBs).
Several receiver processing methods have been proposed in the literature for MU-MIMO [55]–[58]. We consider
an optimal MU-MIMO detection method based on the joint constellation estimation of the interfering user and data
detection. The optimal MU-MIMO detector can be efficiently implemented with a slight modification of the MAP
MIMO detector developed in Section III.
TABLE IV
RESOURCES OF DETECTOR CORE USING SLICERS
# adders (& muxes)                                                          2-PAM   4-PAM   8-PAM   16-PAM
$\bar f_1(x_1)=\bar f_{1R}(x_{1R})+\bar f_{1I}(x_{1I})$                       4      16      64      256
$B|x_{2R}+\bar x_{2R}|$                                                       0       0       2        6
$(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}$          0       2       8       22
$(\mathbf{b}_{2R}-\bar{\mathbf{b}}_{2R})^T\boldsymbol{\lambda}_{2R}/(x_{2R}-\bar x_{2R})$   0   1   10   54
$R(x_{2R},\bar x_{2R})$                                                       0       4      24      112
min/max boundaries                                                            0       6      42      210
  (MUXES)                                                                     0       6      42      210
min/max boundaries − G                                                        2       6      14       30
$|Ex_{1R}\pm Fx_{1I}|\cdot|x_{2R}|$                                           2      16     122      936
Compare $|Ex_{1R}\pm Fx_{1I}|$ and min/max boundaries − G                     4      48     448     3840
$\bar f_{2R}(\hat x_{2R}|x_1)=|Ex_{1R}\pm Fx_{1I}|\cdot|\hat x_{2R}|+(B\hat x_{2R}^2+G\hat x_{2R}-\mathbf{b}^T(\hat x_{2R})\boldsymbol{\lambda}_{2R})$   4   16   64   256
  (MUXES)                                                                     4      56     544     4736
$(\mathbf{b}_{2I}-\bar{\mathbf{b}}_{2I})^T\boldsymbol{\lambda}_{2I}$          0       2       8       22
$(\mathbf{b}_{2I}-\bar{\mathbf{b}}_{2I})^T\boldsymbol{\lambda}_{2I}/(x_{2I}-\bar x_{2I})$   0   1   10   54
$I(x_{2I},\bar x_{2I})$                                                       0       4      24      112
min/max boundaries                                                            0       6      42      210
  (MUXES)                                                                     0       6      42      210
min/max boundaries − H                                                        2       6      14       30
Compare $|Ex_{1I}\mp Fx_{1R}|$ and min/max boundaries − H                     4      48     448     3840
$\bar f_{2I}(\hat x_{2I}|x_1)=|Ex_{1I}\mp Fx_{1R}|\cdot|\hat x_{2I}|+(B\hat x_{2I}^2+H\hat x_{2I}-\mathbf{b}^T(\hat x_{2I})\boldsymbol{\lambda}_{2I})$   4   16   64   256
  (MUXES)                                                                     4      56     544     4736
$\bar f_1(x_1)+\bar f_{2R}(\hat x_{2R}|x_1)+\bar f_{2I}(\hat x_{2I}|x_1)$     4      32     128      512
soft-output LLRs                                                              6      28     118      488
  muxes →                                                                     4      24     112      480
Total adders                                                                 36     258    1654    11246
Total (2:1)-MUXES                                                            12     148    1284    10372
A. MU-MIMO System Model
We consider a practical OFDM-based MU-MIMO system where 2 users are co-scheduled on the same PRBs,
and each UE has 2 receive antennas. Let K be the number of tones in each PRB. Also let user 1 denote the user of
interest with known constellation $\mathcal{X}_S$, while user 2 denotes the interfering user whose constellation $\mathcal{X}_I$ is unknown
to user 1's receiver. The received frequency-domain complex signal vector $\mathbf{y}[k]\in\mathbb{C}^{2\times 1}$ at the UE of interest on the
kth resource element (RE) over which the 2 users are scheduled is given by
$$\mathbf{y}[k]=\mathbf{H}[k]\mathbf{x}[k]+\mathbf{n}[k]=\mathbf{h}_1[k]x_1[k]+\mathbf{h}_2[k]x_2[k]+\mathbf{n}[k],\quad k=1,\cdots,K,$$
where $\mathbf{H}[k]=[\mathbf{h}_1[k]\ \mathbf{h}_2[k]]\in\mathbb{C}^{2\times 2}$ is the complex channel matrix with $\mathbf{h}_1[k]$ and $\mathbf{h}_2[k]$ representing the cascade
of the channel and precoders of user 1 and user 2, respectively; $\mathbf{x}[k]=[x_1[k]\ x_2[k]]^T$ denotes the transmitted 2×1
QAM symbol vector where $x_1[k]\in\mathcal{X}_S$, $x_2[k]\in\mathcal{X}_I$; and $\mathbf{n}[k]\in\mathbb{C}^{2\times 1}$ is the noise vector at the kth RE modeled as a
zero-mean complex Gaussian random vector with variance $\sigma^2$.
Fig. 8. Block diagram of 4-sided MAP detector
B. ML MU-MIMO Detection
The maximum likelihood estimate of the constellation of the interfering user based on y[1], · · · ,y[K] is given
by
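$$\hat{\mathcal{X}}_I=\arg\max_{\mathcal{X}_I\in\mathcal{M}}\ p\left(\{\mathbf{y}[k]\}_{k=1}^{K}\ \middle|\ \{\mathbf{H}[k]\}_{k=1}^{K},\mathcal{X}_S,\mathcal{X}_I\right),$$
where $K$ is the number of REs over which $\mathcal{X}_I$ is constant, and
$$\mathcal{M}\triangleq\{\text{4-QAM},\ \text{16-QAM},\ \text{64-QAM},\ \text{256-QAM}\}$$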
denotes the set of allowable constellations for the interferer. Assuming that x1[k], x2[k],n[k] are independent for
all k=1, · · · ,K, the ML estimate of the interferer’s constellation can then be written as
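$$\hat{\mathcal{X}}_I=\arg\max_{\mathcal{X}_I\in\mathcal{M}}\ \frac{1}{|\mathcal{X}_I|^{K}}\prod_{k=1}^{K}\ \sum_{x_1[k]\in\mathcal{X}_S}\ \sum_{x_2[k]\in\mathcal{X}_I} p\left(\mathbf{y}[k]\mid\mathbf{H}[k],\mathcal{X}_S,\mathcal{X}_I,x_1[k],x_2[k]\right), \qquad (44)$$
where $|\mathcal{X}_I|$ denotes the size of the interfering user's constellation, under the assumption that
$$P(x_1[k])=\frac{1}{|\mathcal{X}_S|},\quad\text{and}\quad P(x_2[k])=\frac{1}{|\mathcal{X}_I|},\quad k=1,\cdots,K.$$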
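Letting $d(\mathbf{x}[k])=\|\mathbf{y}[k]-\mathbf{H}[k]\mathbf{x}[k]\|^2/\sigma^2$, we can then write (44) as
$$\hat{\mathcal{X}}_I=\arg\max_{\mathcal{X}_I\in\mathcal{M}}\ \frac{1}{|\mathcal{X}_I|^{K}}\prod_{k=1}^{K}\ \sum_{\mathbf{x}[k]\in\mathcal{X}_S\times\mathcal{X}_I}\exp\left(-d(\mathbf{x}[k])\right).$$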
Using the log-max approximation [59], we can approximate the ML estimate XˆI by [60]
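$$\hat{\mathcal{X}}_I\approx\arg\min_{\mathcal{X}_I\in\mathcal{M}}\left(K\log\left(|\mathcal{X}_I|\right)+\sum_{k=1}^{K}\ \min_{\mathbf{x}[k]\in\mathcal{X}_S\times\mathcal{X}_I} d(\mathbf{x}[k])\right), \qquad (45)$$
where log(·) is the natural logarithm.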
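A brute-force software model of the classifier in (45) is sketched below to make the metric concrete. It is not the hardware architecture: it searches every symbol pair exhaustively, uses unnormalized odd-integer QAM grids for all constellations (a simplifying assumption), and all function names and the toy channel are ours.

```python
import math

def qam(order):
    """Square QAM points with odd-integer coordinates (unnormalized, assumed)."""
    p = math.isqrt(order)
    pam = [2 * m + 1 - p for m in range(p)]
    return [complex(a, b) for a in pam for b in pam]

def min_distance(y, H, XS, XI, sigma2):
    """min over x in XS x XI of ||y - Hx||^2 / sigma2 for one tone."""
    return min(
        (abs(y[0] - H[0][0] * x1 - H[0][1] * x2) ** 2
         + abs(y[1] - H[1][0] * x1 - H[1][1] * x2) ** 2) / sigma2
        for x1 in XS for x2 in XI)

def classify_interferer(ys, Hs, XS, orders, sigma2):
    """Evaluate K*log|X_I| + sum_k min d(x[k]) per hypothesis and return the argmin."""
    K = len(ys)
    metrics = {}
    for order in orders:
        XI = qam(order)
        metrics[order] = K * math.log(len(XI)) + sum(
            min_distance(y, H, XS, XI, sigma2) for y, H in zip(ys, Hs))
    return min(metrics, key=metrics.get), metrics

# Noiseless toy example: user 1 sends 4-QAM, the interferer sends 16-QAM points.
H = [[1.0, 0.5], [0.3, 1.0]]
x1s = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
x2s = [3 + 3j, -3 + 1j, 1 - 3j, -3 - 3j]
ys = [[H[0][0] * a + H[0][1] * b, H[1][0] * a + H[1][1] * b] for a, b in zip(x1s, x2s)]
best, metrics = classify_interferer(ys, [H] * 4, qam(4), (4, 16, 64, 256), sigma2=0.1)
```

In this toy case the larger hypotheses (64- and 256-QAM) also reach zero distance because their grids contain the 16-QAM points, and it is the bias term K log|X_I| that separates them from the correct hypothesis.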
Once the co-scheduled user's constellation, $\hat{\mathcal{X}}_I$, is estimated, the LLR of the jth bit of the desired user QAM
symbol $x_1[k]$ on the kth RE is given by [44]
$$\Lambda^{\mathrm{ML}}_{k,j}\approx\min_{\substack{x_1[k]\in\mathcal{X}^{(+1)}_{S,j}\\ x_2[k]\in\hat{\mathcal{X}}_I}} d(\mathbf{x}[k])\ -\min_{\substack{x_1[k]\in\mathcal{X}^{(-1)}_{S,j}\\ x_2[k]\in\hat{\mathcal{X}}_I}} d(\mathbf{x}[k]), \qquad (46)$$
where $\mathcal{X}^{(+1)}_{S,j}=\{x\in\mathcal{X}_S : b_j=+1\}$ and $\mathcal{X}^{(-1)}_{S,j}=\{x\in\mathcal{X}_S : b_j=-1\}$. As seen from (46), computing the LLRs
involves the same distance computations as those needed for the co-scheduled user's constellation estimation in
(45). This fact is exploited in the architecture of a joint constellation classifier and data MU-MIMO detector shown
in Fig. 9, which uses an optimized one-sided MAP MIMO detector as its core. The MIMO detector processes the
received signal y[k] assuming all 4 possible choices of the interferer’s constellation. It generates 4 corresponding lists
of minimum distance metrics d(x[k]) and their associated symbol vectors x[k] for all the |M| possible hypotheses
of the interferer’s constellation, with x1[k]∈XS. These distances and symbols are stored in 4 buffers each of size
|XS| as shown in Fig. 9.
For each tone, the minimum distance from each list is passed to an adder that accumulates the minimum distances
over a span of K tones, during which the interferer modulation is assumed to be static. The resulting 4 minimum
accumulated distances for each interferer hypothesis are stored in a buffer. The minimum from this buffer is used
to identify the interferer’s constellation, and the corresponding stored distances in the buffers are selected and
forwarded for LLR processing according to (46).
Note that since the interferer’s modulation constellation remains static over K tones for a duration of 1 subframe
in LTE (14 OFDM symbols), the particular choice of K = 12 results in substantial savings in computations. The
detector only needs to run in the above mode to identify the interferer’s constellation for one OFDM symbol in the
subframe. It can then switch back to normal ML detection mode (without modulation classification) to generate the
LLRs for the remaining 13 OFDM symbols for the user of interest x1[k].
Taking the LTE scenario for hardware complexity analysis, the total number of possible tones in 1 PRB in a
subframe is 12×14 = 168. Of these tones, 28 are reserved for pilots (for cell specific reference signals and for
UE specific pilots to support the MU-MIMO transmission mode), and 140 for data. In the hardware architecture
of Fig. 9, the total number of distance computations needed to generate the LLRs from these 140 data tones is
$(140+12\times 5)\times|\mathcal{X}_S|$. This corresponds to an increase of only 42.86% compared to the distances computed by an
ML detector with perfect knowledge of the interferer.
Fig. 9. Block diagram of a MU-MIMO detector
Figure 10 shows the results when XS is 64-QAM, with XI being 4-, 16-, and 64-QAM using K= 24 resource
elements. The plots show that the ML classification method has a 5 dB gain over the basic nulling approach when
XS is 4-QAM, and 2 dB gain in the case of XS being 64-QAM. Therefore, the gain of the ML classification method
is largest for small constellation sizes of the desired signal, i.e., the largest gain is attained when the receiver
complexity is minimal.
Figure 11 shows the performance of the joint ML classification and detection method as compared to an ML
receiver that has perfect knowledge of the interfering user’s constellation. Also shown in the figure is the performance
of the linear MMSE receiver that only uses the knowledge of the interfering user’s channel and does not exploit
knowledge of the interferer’s constellation. Both users use 64-QAM, with the turbo code of [61] and encoding
rate 1/2 using block size 6144 bits. The pedestrian-A (Ped-A) [62] multi-path fading channel with high antenna
correlation was used. The effective channel matrix is given by $\mathbf{H}=\mathbf{R}_t^{1/2}\mathbf{H}_c\mathbf{R}_r^{1/2}$, where $\mathbf{H}_c$ is the channel matrix whose
entries are uncorrelated and generated according to the Ped-A model, and $\mathbf{R}_t$ and $\mathbf{R}_r$ are the transmit and receive
antenna 2×2 correlation matrices, respectively, which have 1 on the diagonal entries and 0.9 on the off-diagonal.
As seen from Fig. 11, the joint ML classification and detection receiver is only 0.1 dB away from an ML receiver
that has perfect knowledge of the co-scheduled user constellation. The MMSE method has a significant performance
degradation as compared to the joint ML classification and detection receiver.
Fig. 10. Probability of correct interferer modulation constellation detection versus SNR [60]. Solid lines are for nulling approach, dashed are for
the ML approach. Desired user constellation is fixed to 64-QAM and the co-scheduled user constellation is 4-, 16-, and 64-QAM. The channel
is i.i.d. block fading.
VII. IMPLEMENTATION AND SIMULATION RESULTS
The proposed 2 × 2 reconfigurable MIMO detector architecture was modeled in VHDL and synthesized on a
Xilinx Virtex-6 FPGA. The core was also synthesized using a 90 nm CMOS ASIC library. The experimental
simulations below evaluate the coded bit-error rate (BER) performance of the proposed detection algorithm and the
implemented core, assuming a MIMO system employing either 2 transmit and 2 receive antennas, or 4 transmit
and 4 receive antennas. The channel encoder is based on the LTE turbo encoder specification [2] with interleaver
length 1024, using 16-QAM, 64-QAM, and 256-QAM modulation constellations. The channel entries are assumed
to be i.i.d. complex Gaussian random variables with unit variance. At the receiver end, we assume perfect channel
knowledge. The turbo decoder implements the true A Posteriori Probability algorithm, and performs 4 full decoding
iterations. Also, the detector and turbo decoder perform up to 4 outer joint detection and decoding iterations. Channel
decomposition is performed externally by a pre-processing stage and the coefficients in (18)-(19) are fed as input.
A. Performance Results
The bit-precision of the detector architecture can be configured to enable tradeoff analysis between gate complexity
and tolerable degradation in BER performance due to quantization noise. Figure 12 compares the BER performance
of the detector core for 2 layers and 64-QAM under various integer and fractional bit-widths, versus floating-point
performance, at SNR = 14 dB. The x-axis denotes the number of joint detection and decoding iterations. The
top figure corresponds to a fixed-point representation of (I.F) = {8.6, 8.7, 8.8, 8.9}, where I denotes integer
bit-precision while F denotes fractional bit-precision. The bottom figure corresponds to the representation of (I.F) =
{9.6, 9.7, 9.8, 9.9}. As can be seen, the BER starts to degrade when F drops to 6. There is no significant
improvement in BER in going beyond I=9 integer bits, as demonstrated also in Fig. 13.
Fig. 11. BLER versus per-tone per-antenna SNR (dB). EPA channel, high correlation (0.9), 64-QAM for both users and code-rate 1/2.
Fig. 12. BER vs. outer detection-decoding iterations for various bit-precisions at SNR=14 dB for a 2×2 MIMO system with 64-QAM.
Fig. 13. BER vs. SNR for various bit-precisions and up to 4 outer detection-decoding iterations, for a 2×2 MIMO system with 64-QAM.
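To make the (I.F) notation concrete, the sketch below quantizes a real value to a fixed-point grid with F fractional bits and saturates it to the range of I integer bits. This is an illustration only: treating I as including the sign bit, rounding to nearest, and saturating on overflow are assumptions of this example, since the text does not specify those details of the implemented core.

```python
def quantize(value, int_bits, frac_bits):
    """Round to a multiple of 2**-frac_bits and saturate to a signed I.F range.

    Assumed convention: int_bits includes the sign bit, so the representable
    range is [-2**(int_bits-1), 2**(int_bits-1) - 2**-frac_bits].
    """
    step = 2.0 ** -frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - step
    q = round(value / step) * step
    return min(max(q, lo), hi)

coarse = quantize(0.3, 9, 6)      # grid spacing 1/64
fine = quantize(0.3, 9, 9)        # grid spacing 1/512
clipped = quantize(1000.0, 9, 6)  # saturates at the top of the range
```

Under these assumptions, each additional fractional bit halves the worst-case rounding error, while additional integer bits only matter when values would otherwise saturate.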
Figure 14 compares the BER performance of the core using 16-QAM, 64-QAM, and 256-QAM. The plots
demonstrate that most of the coding gain is attained after 3 outer iterations, assuming the inner turbo decoder
performs at most 4 full turbo decoding iterations.
In Figs. 15 and 16, the BER performance of a 4×4 MIMO system using the proposed WLD scheme is simulated.
In Fig. 15, the plots compare the BER versus SNR of the proposed WLD scheme with E = 1 and 2 structures
(Fig. 3a-3b), versus ML, zero-forcing (ZF), the approach of [47], and the sphere decoder with radius clipping [30], for
16-QAM. Both overlapping and non-overlapping subsets are considered. Two scenarios for distance computations
in (32) are followed; one based on H and one on L. The plots demonstrate that WLD with E = 2 using H
distances with overlapping subsets performs virtually as ML, and is less than 0.1 dB away from ML with no
overlapping. Also, for single streams, L distances perform better than H distances. The plots correspond to one
outer detection-decoding iteration, and 4 full internal turbo decoder iterations.
Figure 16 compares the BER performance for 64-QAM. The plots demonstrate again that the WLD scheme with
E = 2 using H distances and overlapping subsets performs very close to ML. Figure 17 shows the results for
256-QAM.
Fig. 14. BER vs. SNR for a 2×2 MIMO system with 16-, 64-, and 256-QAM.
Fig. 15. BER vs. SNR plots for a 4×4 MIMO system with 16-QAM.
Fig. 16. BER vs. SNR plots for a 4×4 MIMO system with 64-QAM.
Fig. 17. BER vs. SNR plots for a 4×4 MIMO system with 256-QAM.
Fig. 18. Hardware complexity of various synthesized detector cores.
B. Architecture Synthesis Results
Various architecture configurations for the 2 × 2 core with different algorithmic features and architectural
optimizations were synthesized, assuming 17-bit datapaths. The datapaths are pipelined with 6 stages and clocked
at 275 MHz. The input LLRs fed from the turbo decoder are 8 bits wide. The output LLRs from the detector are
passed to a dynamic scaling block (not included in this work) that scales the bit-widths down to 8 bits before
feeding them to the turbo decoder.
Figure 18 shows the gate complexity of 8 different architectures. Four architectures support reconfigurable
constellations up to 64-QAM, while the other four support up to 256-QAM. For the 64-QAM case, two architectures
are designed to support soft-outputs only without soft-inputs (i.e. ML detection, see Section III-B): one based
on distance minimizations using exhaustive search (Section V-B), and one based on minimization via slicing
(Section V-C). The other two 64-QAM architectures support both soft-outputs and soft-inputs (i.e. MAP detection,
see Section III-C), one with minimization based on exhaustive search and one via slicing. The other four 256-QAM
architectures are similar. All architectures have the same input/output interfaces, external buffers, and control logic.
The reported gate counts in gate-equivalent (GE) are for the core logic only.
The plots demonstrate that there is a significant increase in complexity (between 6.35x-6.82x) when supporting
256-QAM compared to 64-QAM. Furthermore, the slicer-based architectures using the proposed scheme in
Section V-C offer significant reduction in complexity compared to distance minimization by search (between
19.58%-26.22% for 64-QAM, and between 24.28%-30.35% for 256-QAM). Finally, for slicer-based architectures, supporting
soft-inputs for MAP detection comes with an increase in gate count between 8.49%-9.83% compared to
soft-output-only ML detection. For minimization-by-search architectures, the overhead of supporting soft-inputs is only between
0.51%-1.71%. The gate counts predicted by the theoretical analyses in Section V are also plotted in Fig. 18. The
error ranges between 8%-11%, which supports the validity of the model used and the theoretical analysis performed.
Fig. 19. Hardware complexity as a function of bit-width.
Figure 19 plots the gate complexity of the slicer-based MAP cores as a function of bit-width. The complexity
increases by roughly 5.2% to 5.9% for every added bit. A similar trend was observed when synthesizing the
256-QAM core with soft-outputs only on a Virtex-6 FPGA: the area increases from 317937 LUTs (33%) for 18
bits to 337210 LUTs (35%) for 19 bits, and jumps to 403498 LUTs (42%) when the integer bit-width is
increased to 12 bits.
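As a quick arithmetic check on the FPGA figures quoted above, the following minimal sketch computes the relative LUT growth between the reported configurations. The variable and function names are illustrative only. The second comparison assumes the 19-bit configuration is the baseline for the 12-bit integer-width case, which the text does not state explicitly.

```python
# LUT counts quoted in the text for the soft-output-only 256-QAM core (Virtex-6 FPGA).
luts_18_bit = 317937
luts_19_bit = 337210
luts_12_bit_integer = 403498  # integer part widened to 12 bits


def relative_increase(before, after):
    """Return the percentage increase going from `before` to `after`."""
    return 100.0 * (after - before) / before


# Growth for one added datapath bit (18 -> 19 bits).
growth_18_to_19 = relative_increase(luts_18_bit, luts_19_bit)

# ASSUMPTION: the 19-bit design is taken as the baseline for the wider integer part.
growth_to_12_int = relative_increase(luts_19_bit, luts_12_bit_integer)

print(f"18 -> 19 bits: +{growth_18_to_19:.2f}% LUTs")
print(f"19 bits -> 12-bit integer part: +{growth_to_12_int:.2f}% LUTs")
```

The one-bit step works out to about 6.1% more LUTs, which is of the same order as the 5.2% to 5.9% per-bit growth reported for the synthesized gate counts.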
The core achieves an average SNR-independent throughput of 2.2 Gbps for 2 layers with 256-QAM when running
in soft-input soft-output mode. In 4×4 mode, the core achieves a throughput of 733 Mbps and consumes 320.56 mW
of power. This compares favorably with other detectors in the literature, which report 757 Mbps at 410 kGE [11],
772 Mbps at 212 kGE [33], 1.2 Gbps at 1097 kGE for 16-QAM [35], and 2.2 Gbps at 555 kGE [36], all for
constellations up to 64-QAM only. Table V provides a comparative summary of our implemented detector and the
detectors in [11], [33], [35], [36].
VIII. CONCLUSIONS
A configurable 2-layer soft-input soft-output MIMO detector core has been proposed as a basic building block for
constructing detectors with more spatial streams. Optimizations targeting distance computations and slicing
operations reduce the overall complexity when supporting constellations up to 256-QAM. By appropriately decomposing
the MIMO channel, multi-layer detection is cast in terms of multiple parallel 2-layer detection problems, which
can be mapped onto the 2-layer core. Various architectures have been developed to achieve a high target detection
throughput. The proposed core has also been applied to the design of an optimal MU-MIMO detector for LTE.
The core occupies an area of 1.58 MGE and achieves a throughput of 733 Mbps with 320.56 mW of power for
256-QAM when synthesized in 90 nm CMOS. Future work will target expanding the core to handle 1024-QAM.
TABLE V
SUMMARY AND COMPARISON OF IMPLEMENTATION RESULTS
Reference                                  This work               [11]           [33]           [35]             [36]
Antennas                                   ≤ 4×4                   ≤ 4×4          ≤ 4×4          ≤ 4×4            4×4
Modulation [QAM]                           ≤ 256                   ≤ 64           ≤ 64           16               64
Algorithm                                  WLD                     MMSE-PIC       STS-SD         Trellis search   FSD
Iterative                                  YES                     YES            YES            YES              YES
Technology [nm]                            90                      90             90             65               90
Core area [kGE] (a)                        1580                    410 (b)        212            1097             555
Clock freq. [MHz]                          275                     568            193            320              370
Maximum throughput [Mbps]                  2200 (2×2), 733 (4×4)   757            772            1200 (c)         2200
Normalized hardware efficiency [kGE/Mbps]  0.72 (2×2), 2.16 (4×4)  0.54           0.28           0.91             0.25
Power consumption [mW] @ [Mbps]            320.56 @ 733            189.1 @ 757    87.62 @ 772    —                335.8 @ 2200
Energy efficiency [nJ/bit]                 0.44                    0.25           0.11           —                0.15
(a) One gate-equivalent corresponds to a 2-input drive-1 NAND gate.
(b) Includes preprocessing circuitry.
(c) Technology scaling to 90 nm CMOS technology according to A ∼ 1/s, tpd ∼ 1/s, and Pdyn ∼ (1/s)(Vdd/V′dd) [11].
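The two derived rows of Table V follow directly from the area, throughput, and power entries. The sketch below recomputes them from the tabulated values; the dictionary layout and function names are illustrative assumptions, not part of the original work. Entries without a listed power figure are stored as None. Because the inputs are themselves rounded, a recomputed value may differ from the table in the last digit (for example, [33] recomputes to about 0.27 kGE/Mbps versus the tabulated 0.28).

```python
# Values copied from Table V: (core area [kGE], throughput [Mbps], power [mW] or None).
designs = {
    "This work (2x2)": (1580, 2200, None),   # power is only listed for the 4x4 mode
    "This work (4x4)": (1580, 733, 320.56),
    "[11]": (410, 757, 189.1),
    "[33]": (212, 772, 87.62),
    "[35]": (1097, 1200, None),              # no power figure listed
    "[36]": (555, 2200, 335.8),
}


def hardware_efficiency(area_kge, throughput_mbps):
    """Normalized hardware efficiency in kGE/Mbps (lower is better)."""
    return area_kge / throughput_mbps


def energy_per_bit_nj(power_mw, throughput_mbps):
    """Energy per bit in nJ/bit: mW / Mbps = 1e-3 W / 1e6 bit/s = 1e-9 J/bit."""
    return power_mw / throughput_mbps


metrics = {}
for name, (area, tput, power) in designs.items():
    eff = hardware_efficiency(area, tput)
    energy = energy_per_bit_nj(power, tput) if power is not None else None
    metrics[name] = (eff, energy)
    energy_text = f"{energy:.2f} nJ/bit" if energy is not None else "n/a"
    print(f"{name:16s} {eff:.2f} kGE/Mbps  {energy_text}")
```

Both metrics are ratios against throughput, so the same core scores differently in 2×2 and 4×4 modes: the area is fixed at 1580 kGE while the throughput drops from 2200 Mbps to 733 Mbps.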
REFERENCES
[1] IEEE Draft Standard - Part 11: Wireless LAN Medium Access Control and Physical Layer Specifications - Amendment 4: Enhancements
for Very High Throughput for operation in bands below 6 GHz, IEEE Std. P802.11ac/D7.0, Dec 2013. [Online]. Available:
http://www.ieee.org
[2] Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Channels and Modulation, 3GPP Std. TS 36.211. [Online]. Available:
http://www.3gpp.org
[3] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless Communications. Cambridge, U.K.: Cambridge Univ. Press,
2003.
[4] G. B. Giannakis et al., Space-Time Coding for Broadband Wireless Communications. New York: John Wiley and Sons, 2006.
[5] E. Biglieri et al., MIMO Wireless Communications. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[6] C. Oestges and B. Clerckx, MIMO Wireless Communications. Oxford, U.K.: Elsevier Academic Press, 2007.
[7] A. Chockalingam and B. S. Rajan, Large MIMO Systems. Cambridge University Press, 2014.
[8] B. Hassibi, “An efficient square-root algorithm for BLAST,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process. (ICASSP),
Istanbul, Turkey, Jun. 2000, pp. 5–9.
[9] G. D. Golden et al., “Detection algorithm and initial laboratory results using V-BLAST space-time communication architecture,” IEE
Electronics Letters, vol. 35, no. 1, pp. 14–15, Jan. 1999.
[10] D. Wübben, R. Böhnke, V. Kühn, and K. Kammeyer, “MMSE extension of V-BLAST based on sorted QR decomposition,” in Proc. IEEE
Vehic. Technol. Conf. (VTC), Orlando, Florida, Oct. 2003, pp. 508–512.
[11] C. Studer, S. Fateh, and D. Seethaler, “ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference
cancellation,” IEEE J. Solid-State Circuits, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.
[12] E. Viterbo and E. Biglieri, “A universal decoding algorithm for lattice codes,” in 14ème Colloque GRETSI, Juan-Les-Pins, France, Sep.
1993, pp. 611–614.
[13] E. Viterbo and J. Boutros, “A universal lattice code decoder for fading channels,” IEEE Trans. Inf. Theory, vol. 45, no. 5, pp. 1639–1642,
Jul. 1999.
[14] O. Damen, A. Chkeif, and J.-C. Belfiore, “Lattice code decoder for space-time codes,” IEEE Commun. Lett., vol. 4, no. 5, pp. 161–163,
May 2000.
[15] E. Agrell et al., “Closest point search in lattices,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.
[16] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, pp.
389–399, Mar. 2003.
[17] B. Hassibi and H. Vikalo, “On sphere decoding algorithm. I. Expected complexity,” IEEE Trans. Signal Process., vol. 53, no. 8, pp.
2806–2818, Aug. 2005.
[18] J. Jaldén and B. Ottersten, “On the complexity of sphere decoding in digital communications,” IEEE Trans. Signal Process., vol. 53, no. 4,
pp. 1474–1484, Apr. 2005.
[19] D. Seethaler, J. Jaldén, C. Studer, and H. Bölcskei, “On the complexity distribution of sphere decoding,” IEEE Trans. Inf. Theory, vol. 57,
no. 9, pp. 5754–5768, Sep. 2011.
[20] K.-W. Wong, C.-Y. Tsui, R. S.-K. Cheng, and W.-H. Mow, “A VLSI architecture of a K-best lattice decoding algorithm for MIMO
channels,” in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), vol. 3, Scottsdale, Arizona, May 2002, pp. 273–276.
[21] M. Wenk et al., “K-best MIMO detection VLSI architectures achieving up to 424 Mbps,” in Proc. IEEE Int. Symp. on Circuits and Systems
(ISCAS), Island of Kos, Greece, May 2006, pp. 1151–1154.
[22] S. Mondal, A. Eltawil, C.-A. Shen, and K. Salama, “Design and implementation of a sort free K-best sphere decoder,” IEEE Trans. VLSI
Syst., vol. 18, no. 10, pp. 1497–1501, Oct. 2010.
[23] L. Liu, F. Ye, X. Ma, T. Zhang, and J. Ren, “A 1.1-Gb/s 115-pJ/bit configurable MIMO detector using 0.13µm CMOS technology,” IEEE
Trans. Circuits Syst. II, vol. 57, no. 9, pp. 701–705, Sep. 2010.
[24] C.-A. Shen, A. Eltawil, and K. Salama, “Evaluation framework for K-best sphere decoders,” J. of Circuits, Systems and Computers,
vol. 19, no. 5, pp. 975–995, Aug. 2010.
[25] M. Shabany and P. Gulak, “A 675 Mbps, 4× 4 64-QAM K-Best MIMO detector in 0.13µm CMOS,” IEEE Trans. VLSI Syst., vol. 20,
no. 1, pp. 135–147, Jan. 2012.
[26] M. Mahdavi and M. Shabany, “Novel MIMO detection algorithm for high-order constellations in the complex domain,” IEEE Trans. VLSI
Syst., vol. 21, no. 5, pp. 834–847, May 2013.
[27] D. Garrett et al., “Silicon complexity for maximum likelihood MIMO detection using spherical decoding,” IEEE J. Solid-State Circuits,
vol. 39, no. 9, pp. 1544–1552, Sep. 2004.
[28] Z. Guo and P. Nilsson, “A VLSI architecture of the Schnorr-Euchner decoder for MIMO systems,” in Proc. IEEE CAS Symp. Emerging
Technologies, vol. 1, Shanghai, China, May 2004, pp. 65–68.
[29] A. Burg et al., “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40,
no. 7, pp. 1566–1577, Jul. 2005.
[30] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoder: Algorithms and VLSI implementation,” IEEE J. Sel. Areas Commun.,
vol. 26, no. 2, pp. 290–300, Feb. 2008.
[31] C.-H. Yang and D. Markovic, “A flexible DSP architecture for MIMO sphere decoding,” IEEE Trans. Circuits Syst. I, vol. 56, no. 10, pp.
2301–2314, Oct. 2009.
[32] ——, “A 2.89 mW 50 GOPS 16×16 16-core MIMO sphere decoder in 90 nm CMOS,” in European Solid-State Circuits Conf. (ESSCIRC),
Athens, Greece, Sep. 2009, pp. 344–347.
[33] F. Borlenghi et al., “A 772 Mbit/s 8.81 bit/nJ 90 nm CMOS soft-input soft-output sphere decoder,” in IEEE Asian Solid State Circuits Conf.
(A-SSCC), Jeju, Korea, Nov. 2011, pp. 297–300.
[34] L. Liu, J. Lofgren, and P. Nilsson, “Area-efficient configurable high-throughput signal detector supporting multiple MIMO modes,” IEEE
Trans. Circuits Syst. I, vol. 59, no. 9, pp. 2085–2096, Sep. 2012.
[35] Y. Sun and J. R. Cavallaro, “Trellis-search based soft-input soft-output MIMO detector: Algorithm and VLSI architecture,” IEEE Trans.
Signal Process., vol. 60, no. 5, pp. 2617–2627, May 2012.
[36] X. Chen, G. He, and J. Ma, “VLSI implementation of a high-throughput iterative fixed-complexity sphere decoder,” IEEE Trans. Circuits
Syst. II, vol. 60, no. 5, pp. 272–276, May 2013.
[37] M. M. Mansour, S. Alex, and M. Jalloul, “Reduced complexity soft-output MIMO sphere detectors – Part I: Algorithmic optimizations,”
IEEE Trans. Signal Process., vol. 62, no. 21, pp. 5505–5520, Nov. 2014.
[38] ——, “Reduced complexity soft-output MIMO sphere detectors – Part II: Architectural optimizations,” IEEE Trans. Signal Process., vol. 62,
no. 21, pp. 5521–5535, Nov. 2014.
[39] M.-Y. Huang and P.-Y. Tsai, “Toward multi-gigabit wireless: Design of high-throughput MIMO detectors with hardware-efficient
architecture,” IEEE Trans. Circuits Syst. I, vol. 61, no. 2, pp. 613–624, Feb. 2014.
[40] Y. Jiang, J. Li, and W. W. Hager, “Joint transceiver design for MIMO communications using geometric mean decomposition,” IEEE Trans.
Signal Process., vol. 53, no. 10, pp. 3791–3803, Oct. 2005.
[41] ——, “Uniform channel decomposition for MIMO communications,” IEEE Trans. Signal Process., vol. 53, no. 11, pp. 4283–4294, Nov.
2005.
[42] S. Ariyavisitakul, J. Zheng, E. Ojard, and J. Kim, “Subspace beamforming for near-capacity MIMO performance,” IEEE Trans. Signal
Process., vol. 56, no. 11, pp. 5729–5733, Nov. 2008.
[43] Y. Chen and S. Brink, “Near-capacity MIMO subspace detection,” in Proc. IEEE Int. Symp. Personal Indoor and Mobile Radio Commun.
(PIMRC), Toronto, Canada, Sep. 2011, pp. 1733–1737.
[44] M. Siti and M. P. Fitz, “A novel soft-output layered orthogonal lattice detector for multiple antenna communications,” in Proc. IEEE Int.
Conf. Commun. (ICC), vol. 4, Istanbul, Turkey, Jun. 2006, pp. 1686–1691.
[45] ——, “On layer ordering techniques for near-optimal MIMO detectors,” in Proc. IEEE Wireless Commun. and Netw. Conf. (WCNC), Hong
Kong, Mar. 2007, pp. 1199–1204.
[46] M. S. Yee, “Max-log-MAP sphere decoder,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process. (ICASSP), vol. 3, Philadelphia,
PA, Mar. 2005, pp. 1013–1016.
[47] E. Ojard and S. Ariyavisitakul, “Method and system for approximate maximum likelihood (ML) detection in a multiple input multiple
output (MIMO) receiver,” US Patent 12/207,721, Mar. 19, 2009. [Online]. Available: http://www.google.com/patents/US20090074114
[48] L. A. C. Zhang and T. Meixia, “LTE-advanced and 4G wireless communications [Guest Editorial],” IEEE Commun. Mag., vol. 50, no. 2,
pp. 102–103, Feb. 2012.
[49] A. Gomaa and L. Jalloul, “Efficient soft-input soft-output detection of dual-layer MIMO systems,” IEEE Wireless Commun. Lett., vol. 3,
no. 5, pp. 541–544, Oct. 2014.
[50] C. K. Yeung, J. Lee, and S. Kim, “A simple slicer for soft detection in Gray-coded QAM-modulated MIMO OFDM systems,” in IEEE 35th
Sarnoff Sympos. (SARNOFF), Newark, NJ, May 2012, pp. 1–5.
[51] M. M. Mansour, “A near-ML MIMO subspace detection algorithm,” IEEE Signal Process. Lett., vol. 22, no. 4, pp. 408–412, Apr. 2015.
[52] G. H. Golub and C. F. V. Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[53] “CEVA-XC4210 DSP processors.” [Online]. Available: http://www.ceva-dsp.com/CEVA-XC4210
[54] J. Lee, J.-K. Huan, and J. Zhang, “MIMO technologies in 3GPP LTE and LTE-Advanced,” EURASIP J. on Wireless Commun. and Netw.,
vol. 2009, no. 1, pp. 1–10, 2009.
[55] J. Duplicy et al., “MU-MIMO in LTE systems,” EURASIP J. on Wireless Commun. and Netw., vol. 2011, no. 1, pp. 1–13, 2011.
[56] Z. Bai et al., “On the equivalence of MMSE and IRC receiver in MU-MIMO systems,” IEEE Commun. Lett., vol. 15, no. 12, pp. 1288–1290,
Dec. 2011.
[57] R. Ghaffar and R. Knopp, “Interference sensitivity for multiuser MIMO in LTE,” in IEEE Workshop on Sig. Proc. Advances in Wireless
Commun. (SPAWC), Jun. 2011, pp. 506–510.
[58] ——, “Interference-aware receiver structure for multiuser MIMO and LTE,” EURASIP J. on Wireless Commun. and Netw., vol. 40, pp.
1–17, 2011.
[59] A. Viterbi, “An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes,” IEEE J. Sel. Areas
Commun., vol. 16, pp. 260–264, Feb. 1998.
[60] A. Gomaa et al., “Multi-user MIMO receivers with partial state information,” IEEE Trans. Veh. Technol., Jan. 2015, (under review).
[Online]. Available: http://arxiv.org/abs/1502.00212
[61] Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding, 3GPP Std. TS 36.212. [Online]. Available:
http://www.3gpp.org
[62] High Speed Downlink Packet Access: UE Radio Transmission and Reception FDD, 3GPP Std. TR 25.890. [Online]. Available:
http://www.3gpp.org
Mohammad M. Mansour (S’97-M’03-SM’08) received the B.E. (Hons.) and the M.E. degrees in computer and
communications engineering from the American University of Beirut (AUB), Beirut, Lebanon, in 1996 and 1998,
respectively, and the M.S. degree in mathematics and the Ph.D. degree in electrical engineering from the University
of Illinois at Urbana-Champaign (UIUC), Champaign, IL, USA, in 2002 and 2003, respectively.
He was a Visiting Researcher at Broadcom, Sunnyvale, CA, USA, from 2012 to 2014, where he worked on
the physical layer SoC architecture and algorithm development for LTE-Advanced. He was on research leave with
Qualcomm Flarion Technologies in Bridgewater, NJ, USA, from 2006 to 2008, where he worked on modem design and
implementation for 3GPP-LTE, 3GPP2-UMB, and peer-to-peer wireless networking physical layer SoC architecture and algorithm development.
He was a Research Assistant at the Coordinated Science Laboratory (CSL), UIUC, from 1998 to 2003. He worked at National Semiconductor
Corporation, San Francisco, CA, with the Wireless Research group in 2000. He was a Research Assistant with the Department of Electrical and
Computer Engineering, AUB, in 1997, and a Teaching Assistant in 1996. He joined the Department of Electrical and
Computer Engineering, AUB, as a faculty member in 2003, where he is currently an Associate Professor. His research interests are in the area of energy-efficient
and high-performance VLSI circuits, architectures, algorithms, and systems for computing, communications, and signal processing.
Prof. Mansour is a member of the Design and Implementation of Signal Processing Systems (DISPS) Technical Committee Advisory Board of
the IEEE Signal Processing Society. He served as a member of the DISPS Technical Committee from 2006 to 2013. He served as an Associate
Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II (TCAS-II) from 2008 to 2013. He has served as an Associate Editor of
the IEEE TRANSACTIONS ON VLSI SYSTEMS since 2011, and as an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS since 2012.
He served as the Technical Co-Chair of the IEEE Workshop on Signal Processing Systems in 2011, and as a member of the Technical Program
Committee of various international conferences and workshops. He was the recipient of the Phi Kappa Phi Honor Society Award twice, in
2000 and 2001, and the recipient of the Hewlett Foundation Fellowship Award in 2006. He has six issued U.S. patents.
Louay M.A. Jalloul (M’91-SM’00) received the B.S. degree from the University of Oklahoma, Norman, OK, USA,
in 1985; the M.S. degree from the Ohio State University, Columbus, OH, USA, in 1988; and the Ph.D. degree from
Rutgers, The State University of New Jersey, Piscataway, NJ, USA, in 1993, all in electrical engineering. He was a
Research Associate with the ElectroScience Laboratory, Ohio State University; and the Wireless Information Networks
Laboratory (WINLAB), Rutgers.
He is currently a Technical Director with Broadcom Corporation, Sunnyvale, CA, USA. Prior to that, he was a Senior
Director of Technology with Beceem Communications Inc. (a Silicon Valley startup providing solutions for mobile
broadband wireless communication systems). From September 2004 to September 2005, he was an Associate Professor with the Department of
Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon. In February 2001, he joined MorphICs Technology Inc.,
Campbell, CA (acquired by Infineon Technologies AG in April 2003) as the Director of Systems Architecture, where he led his team in the
development of the code-division multiple access (CDMA) cellular digital signal processor for the third-generation wideband CDMA standard.
From 1993 to 2001, he was with Motorola Inc., taking on various functions in research and development. He contributed to the early concepts
of high-speed downlink packet access and IS-2000 evolution to voice and data (1XEV-DV).
Dr. Jalloul has 57 issued U.S. patents and received numerous engineering awards for his innovations to Motorola products. He is a member
of Eta Kappa Nu.
