Data Detection in Large Multi-Antenna Wireless Systems via Approximate
  Semidefinite Relaxation by Castañeda, Oscar et al.
APPEARED IN IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I
Data Detection in Large Multi-Antenna Wireless
Systems via Approximate Semidefinite Relaxation
Oscar Castañeda, Tom Goldstein, and Christoph Studer
Abstract—Practical data detectors for future wireless systems
with hundreds of antennas at the base station must achieve
high throughput and low error rate at low complexity. Since
the complexity of maximum-likelihood (ML) data detection is
prohibitive for such large wireless systems, approximate methods
are necessary. In this paper, we propose a novel data detection
algorithm referred to as Triangular Approximate SEmidefinite
Relaxation (TASER), which is suitable for two application sce-
narios: (i) coherent data detection in large multi-user multiple-
input multiple-output (MU-MIMO) wireless systems and (ii)
joint channel estimation and data detection in large single-input
multiple-output (SIMO) wireless systems. For both scenarios, we
show that TASER achieves near-ML error-rate performance at
low complexity by relaxing the associated ML-detection problems
into a semidefinite program, which we solve approximately using
a preconditioned forward-backward splitting procedure. Since
the resulting problem is non-convex, we provide convergence
guarantees for our algorithm. To demonstrate the efficacy of
TASER in practice, we design a systolic architecture that enables
our algorithm to achieve high throughput at low hardware
complexity, and we develop reference field-programmable gate
array (FPGA) and application-specific integrated circuit (ASIC)
designs for various antenna configurations.
Index Terms—FPGA and ASIC design, data detection, joint
channel estimation and data detection, large single-input and
multiple-input multiple-output (SIMO and MIMO) wireless sys-
tems, semidefinite relaxation.
I. INTRODUCTION
Large multiple-input multiple-output (MIMO) and single-input multiple-output (SIMO) wireless technologies, where
the base station (BS) is equipped with hundreds or thousands
of antennas, are widely believed to play a major role in
fifth-generation (5G) cellular communication systems [2]–[7]. Such
large wireless systems promise improved spectral efficiency,
coverage, and range compared to traditional small-scale systems.
However, the extremely large number of BS antennas requires
the design of high-performance data-detection algorithms that
can be implemented efficiently in very-large scale integration
(VLSI) circuits [8]. In fact, data detection is among the most
critical baseband-processing tasks in terms of implementation
complexity, power consumption, throughput, and error-rate
performance for such systems [9], [10].
O. Castañeda and C. Studer are with the School of ECE, Cornell University,
Ithaca, NY; e-mail: oc66@cornell.edu, studer@cornell.edu
T. Goldstein is with the Department of CS, University of Maryland, College
Park, MD; e-mail: tomg@cs.umd.edu
A short version of this paper summarizing the TASER FPGA design for
large MU-MIMO data detection has been presented at the IEEE International
Symposium on Circuits and Systems (ISCAS) 2016 [1].
The system simulator for TASER used in this paper will be available on
GitHub: https://github.com/VIP-Group/TASER
To enable high-throughput uplink communication for massive
multi-user (MU) MIMO wireless systems (where tens of user
terminals transmit data to a BS with hundreds of antennas), a
variety of low-complexity data-detection algorithms [11]–[19],
as well as a few field-programmable gate array (FPGA) imple-
mentations [8], [20]–[22] and application-specific integrated
circuit (ASIC) designs [23] have been proposed recently. To
date, all data detectors that have been implemented in VLSI for
such high-dimensional problems rely on (approximate) linear
data detection [8], [20]–[23]. Such linear methods are known to
suffer from a significant error-rate performance loss in more
realistic systems where the number of BS antennas is not so large
or where the number of user terminals is comparable to the
number of BS antennas [8]. Furthermore, the literature
on large MU-MIMO data detection almost exclusively relies
on the assumption of perfect channel state information (CSI)
at the BS—an assumption that cannot be satisfied in practice.
A. Contributions
In this paper, we propose a novel data detection algo-
rithm and corresponding VLSI designs for large wireless
systems. Our algorithm, referred to as Triangular Approximate
SEmidefinite Relaxation (TASER), can be deployed in two
different application scenarios: (i) coherent data detection in
massive MU-MIMO wireless systems and (ii) joint channel
estimation and data detection (JED) in large SIMO wireless
systems. Our detector builds upon semidefinite relaxation [24],
which enables near maximum-likelihood (ML) data detection
performance at polynomial (in the number of transmit antennas
or time slots) complexity for systems that communicate with
low-rate, constant-modulus modulation schemes [25]. TASER
approximates the semidefinite relaxation (SDR) formulation
of both the coherent ML and the JED ML problems using a
Cholesky factorization, and solves the resulting non-convex
problem using a preconditioned forward-backward splitting
(FBS) procedure [26], [27]. We provide theoretical convergence
guarantees for our algorithm, and we develop a corresponding
systolic array that enables high-throughput data detection at
low silicon area in an energy-efficient manner. We provide
reference VLSI implementation results for a Xilinx Virtex-7
FPGA and for a 40 nm CMOS technology, and we perform an
extensive comparison in terms of performance and complexity
with recently-proposed data detector implementations for large
MU-MIMO wireless systems [8], [20]–[23].
B. Relevant Prior Art
1) Data detection in large MU-MIMO: The literature on data
detection in large (or massive) MU-MIMO wireless systems
describes only a few algorithms that are able to achieve near-
optimal error-rate performance [11], [13], [18]. For these
algorithms, however, no hardware designs have been described
in the open literature. So far, only sub-optimal, linear data
detection algorithms have been integrated in FPGAs [8], [20]–
[22] or ASICs [23]. Unfortunately, such linear data detection
algorithms suffer from a significant error-rate performance loss
in “square” systems, where the number of users is comparable
to the number of BS antennas [8]. In contrast, the proposed
TASER algorithm achieves near-optimal error-rate performance,
even in symmetric large MU-MIMO systems where the BS-to-
user-antenna ratio is one.
2) SDR-based data detection: SDR is a well-known tech-
nique for achieving near-ML performance in multi-user code
division multiple access (MU-CDMA) [28], [29] and traditional,
small-scale MIMO [24], [30]–[36] wireless systems. Most
results on SDR-based data detection rely on computationally
inefficient, general-purpose convex solvers that require either
the solution to a linear system or an eigenvalue decomposition
per iteration—both of these operations entail prohibitive
complexity when implemented in hardware. As an exception,
the algorithm in [36] relies on block-coordinate descent, which
avoids the solution to a full linear system per iteration. While
computationally efficient, this method exhibits stringent data
dependencies, requires a high number of multiplications per
iteration, and consumes a large amount of memory, which
renders corresponding VLSI designs inefficient. TASER, in
contrast, is highly parallelizable and hardware friendly, and
is—to the best of our knowledge—the first SDR-based data
detector that has been successfully implemented in VLSI.
3) Joint channel estimation and data detection: JED is
known to significantly outperform traditional data detection
schemes that separate channel estimation from data detection.
We believe that JED is a promising solution for large cellular
systems, where pilot-contamination (i.e., pilot-based training for
users in adjacent cells interferes with the training pilots in the
current cell) poses a fundamental performance bottleneck [2].
The computational complexity of exact JED via an exhaustive
search grows exponentially in the number of transmission time
slots [37]. Hence, sphere-decoding (SD)-based methods have
been proposed for JED in the SIMO [37]–[40] and MIMO [41]
literature to reduce the computational complexity. Nevertheless,
the design of hardware implementations of high-throughput
sphere-decoders is challenging, and most existing designs only
achieve a few hundred Mb/s for small MIMO systems (see
[10], [42] for more details on SD-based data detectors). In
addition—to the best of our knowledge—no hardware design
for JED has been proposed in the open literature. In this paper,
we show that (i) JED can be performed using SDR and (ii)
TASER enables near-optimal, high-throughput JED for realistic
large SIMO wireless systems.
C. Notation
Lowercase boldface letters stand for column vectors; uppercase
boldface letters denote matrices. For a matrix $\mathbf{A}$, we
denote its transpose, adjoint, and trace by $\mathbf{A}^T$, $\mathbf{A}^H$, and $\mathrm{Tr}(\mathbf{A})$,
respectively. We use $A_{k,\ell}$ for the entry in the $k$th row and
$\ell$th column of the matrix $\mathbf{A}$; the $k$th entry of a vector $\mathbf{a}$ is
denoted by $a_k = [\mathbf{a}]_k$. The Frobenius norm of the matrix $\mathbf{A}$
is $\|\mathbf{A}\|_F = \sqrt{\sum_{k,\ell} |A_{k,\ell}|^2}$ and the $\ell_2$-norm of the vector $\mathbf{a}$ is
$\|\mathbf{a}\|_2 = \sqrt{\sum_k |a_k|^2}$. The identity matrix and all-ones vector
are denoted by $\mathbf{I}$ and $\mathbf{1}$, respectively. The real and imaginary
parts of a complex-valued matrix $\mathbf{A}$ are denoted by $\Re(\mathbf{A})$ and
$\Im(\mathbf{A})$, respectively.
D. Paper Outline
The rest of the paper is organized as follows. Section II
introduces the large MU-MIMO and SIMO system models
and discusses coherent ML data detection as well as JED.
Section III introduces the TASER algorithm and provides a
theoretical convergence analysis. Section IV details our systolic
architecture. Section V shows reference implementation results
and provides a comparison with existing data detectors for large
MU-MIMO. Concluding remarks are presented in Section VI.
II. DATA DETECTION IN LARGE MULTI-ANTENNA
WIRELESS SYSTEMS
The algorithm and VLSI designs proposed in this paper
are suitable for two application scenarios: (i) coherent data
detection in large MU-MIMO systems and (ii) JED in large
SIMO systems. We next describe the corresponding system
models and show how both problems can be relaxed to a
semidefinite program (SDP) of the same form.
A. Coherent Data Detection for Large MU-MIMO Systems
The first application scenario is data detection in the large
(or massive) MU-MIMO wireless uplink with B BS antennas
and U user antennas. We consider the standard input-output
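relation to model a narrow-band$^1$ MIMO wireless channel [43]:
$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$. Here, $\mathbf{y} \in \mathbb{C}^{B}$ is the BS receive-vector,
$\mathbf{H} \in \mathbb{C}^{B \times U}$ is the MIMO channel matrix, $\mathbf{s} \in \mathcal{O}^{U}$ is the transmit
vector containing the data symbols from all users ($\mathcal{O}$ refers to
the constellation set), and $\mathbf{n} \in \mathbb{C}^{B}$ is i.i.d. circularly-symmetric
Gaussian with variance $N_0$ per entry. Assuming that an estimate
of the channel matrix $\mathbf{H}$ was acquired during a dedicated
training phase, ML data detection corresponds to the following
problem [44]:
$$\hat{\mathbf{s}}^{\mathrm{ML}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{U}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2. \tag{1}$$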
A number of computationally efficient sphere-decoding algo-
rithms have been proposed to solve the combinatorial problem
in (1) for conventional, small-scale MIMO systems [10], [45]–
[47]. Unfortunately, the worst-case and average computational
complexity of these exact methods still scales exponentially
with the number of users U [48], [49]. For large MU-MIMO
systems, where the BS-to-user-antenna ratio exceeds a factor
of two, recently-developed linear algorithms have been shown
to achieve near-ML performance [4], [5], [8]. For systems with
a large number of users where the BS-to-user-antenna ratio
1Our algorithm and circuit designs are also suitable for frequency-selective
channels in combination with orthogonal frequency-division multiplexing
(OFDM), where we consider the same input-output relation per subcarrier.
O. CASTAÑEDA ET AL.
is close to one, however, linear methods are known to deliver
poor error-rate performance [8].
To enable near-optimal error-rate performance at low com-
plexity for such scenarios, we can relax the ML problem in (1)
into an SDP [24]. This relaxation step requires us to reformulate
the ML detection problem as follows. By assuming constant-
modulus QAM constellations, such as BPSK and QPSK, we
first perform the real-valued decomposition of the system model
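$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$ using the following definitions, where underlined
symbols denote real-valued quantities:
$$
\underline{\mathbf{y}} = \begin{bmatrix} \Re(\mathbf{y}) \\ \Im(\mathbf{y}) \end{bmatrix}, \quad
\underline{\mathbf{H}} = \begin{bmatrix} \Re(\mathbf{H}) & -\Im(\mathbf{H}) \\ \Im(\mathbf{H}) & \Re(\mathbf{H}) \end{bmatrix}, \quad
\underline{\mathbf{s}} = \begin{bmatrix} \Re(\mathbf{s}) \\ \Im(\mathbf{s}) \end{bmatrix}, \quad
\underline{\mathbf{n}} = \begin{bmatrix} \Re(\mathbf{n}) \\ \Im(\mathbf{n}) \end{bmatrix}.
$$
This decomposition enables us to reformulate the ML problem
in (1) into the following equivalent form:
$$\bar{\mathbf{s}}^{\mathrm{ML}} = \arg\min_{\tilde{\mathbf{s}} \in \mathcal{X}^{N}} \mathrm{Tr}(\tilde{\mathbf{s}}^T \mathbf{T} \tilde{\mathbf{s}}). \tag{2}$$
For QPSK, the matrix $\mathbf{T} = [\underline{\mathbf{H}}^T\underline{\mathbf{H}}, -\underline{\mathbf{H}}^T\underline{\mathbf{y}}; -\underline{\mathbf{y}}^T\underline{\mathbf{H}}, \underline{\mathbf{y}}^T\underline{\mathbf{y}}]$ is
of dimension $N \times N$ with $N = 2U + 1$ and $\mathcal{X} = \{-1, +1\}$
with $\tilde{\mathbf{s}} = [\Re(\mathbf{s}); \Im(\mathbf{s}); 1]$. The solution $\bar{\mathbf{s}}^{\mathrm{ML}}$ can then be
converted back into the complex-valued ML solution as
$[\hat{\mathbf{s}}^{\mathrm{ML}}]_i = [\bar{\mathbf{s}}^{\mathrm{ML}}]_i + j[\bar{\mathbf{s}}^{\mathrm{ML}}]_{i+U}$ for $i = 1, \ldots, U$. For BPSK,
the matrix $\mathbf{T} = [\underline{\mathbf{H}}^T\underline{\mathbf{H}}, -\underline{\mathbf{H}}^T\underline{\mathbf{y}}; -\underline{\mathbf{y}}^T\underline{\mathbf{H}}, \underline{\mathbf{y}}^T\underline{\mathbf{y}}]$ is of dimension
$N \times N$ with $N = U + 1$ and $\tilde{\mathbf{s}} = [\Re(\mathbf{s}); 1]$. Here, we define
the $2B \times U$ matrix $\underline{\mathbf{H}} = [\Re(\mathbf{H}); \Im(\mathbf{H})]$. Since $\Im(\mathbf{s}) = \mathbf{0}$ in
this case, $[\hat{\mathbf{s}}^{\mathrm{ML}}]_i = [\bar{\mathbf{s}}^{\mathrm{ML}}]_i$ for $i = 1, \ldots, U$. In Section II-C,
we detail how the problem in (2) can be relaxed into an SDP.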
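As an illustration of the reformulation in (2), the following NumPy sketch builds the matrix T for QPSK and checks numerically that the lifted objective equals the squared residual of (1). The function name `build_T_coherent_qpsk`, the variable names, the dimensions, and the random test data are assumptions made for this example; they are not taken from the authors' simulator.

```python
import numpy as np

def build_T_coherent_qpsk(H, y):
    """Build the (2U+1) x (2U+1) matrix T of (2) for QPSK from complex H and y."""
    # Real-valued decomposition of y = Hs + n.
    Hr = np.block([[H.real, -H.imag],
                   [H.imag,  H.real]])
    yr = np.concatenate([y.real, y.imag])
    # T = [Hr^T Hr, -Hr^T yr; -yr^T Hr, yr^T yr]
    top = np.hstack([Hr.T @ Hr, -(Hr.T @ yr)[:, None]])
    bottom = np.hstack([-(yr @ Hr), [yr @ yr]])
    return np.vstack([top, bottom])

# Hypothetical test scenario: B = 8 BS antennas, U = 4 users, QPSK symbols.
rng = np.random.default_rng(0)
B, U = 8, 4
H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
s = rng.choice([-1.0, 1.0], U) + 1j * rng.choice([-1.0, 1.0], U)
y = H @ s + 0.1 * (rng.standard_normal(B) + 1j * rng.standard_normal(B))

T = build_T_coherent_qpsk(H, y)
s_tilde = np.concatenate([s.real, s.imag, [1.0]])  # [Re(s); Im(s); 1]
objective = s_tilde @ T @ s_tilde                  # Tr(s~^T T s~)
residual = np.linalg.norm(y - H @ s) ** 2          # ||y - Hs||_2^2
```

The two quantities `objective` and `residual` agree up to floating-point error for any candidate vector, which is why minimizing (2) over the sign vector is equivalent to solving (1).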
B. Joint Channel Estimation and Data Detection
The second application scenario is JED in large SIMO
wireless uplink systems where one single-antenna user commu-
nicates over K + 1 time slots with B BS antennas. We use the
following input-output relation to model the (narrow-band and
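flat-fading) SIMO wireless channel [37]–[40]: $\mathbf{Y} = \mathbf{h}\mathbf{s}^H + \mathbf{N}$.
Here, $\mathbf{Y} \in \mathbb{C}^{B \times (K+1)}$ contains the received vectors acquired
over all $K + 1$ time slots, $\mathbf{h} \in \mathbb{C}^{B}$ is the unknown SIMO
channel vector that is assumed to be block fading, i.e., constant
over $K + 1$ time slots, $\mathbf{s}^H \in \mathcal{O}^{1 \times (K+1)}$ is the transmit vector
containing the data symbols from all $K + 1$ time slots, and
$\mathbf{N} \in \mathbb{C}^{B \times (K+1)}$ is i.i.d. circularly-symmetric Gaussian with
variance $N_0$ per entry. By assuming that $\mathbf{h}$ is a deterministic
but unknown channel vector with unknown prior statistics, we
can formulate the following ML JED problem [40]:
$$\{\hat{\mathbf{s}}^{\mathrm{JED}}, \hat{\mathbf{h}}\} = \arg\min_{\mathbf{s} \in \mathcal{O}^{K+1},\, \mathbf{h} \in \mathbb{C}^{B}} \|\mathbf{Y} - \mathbf{h}\mathbf{s}^H\|_F. \tag{3}$$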
It is important to note that there exists a phase ambiguity
between both outputs of JED because $\hat{\mathbf{h}}e^{j\phi}$ is also a solution
whenever $\hat{\mathbf{s}}^{\mathrm{JED}}e^{j\phi} \in \mathcal{O}^{K+1}$ for some phase $\phi$. As a consequence,
one may convey information either as phase changes
in the vector $\mathbf{s}^H$ over time slots (known as differential encoding)
or "pin down" the phase of one entry of the transmit vector;
in what follows, we assume that the first transmitted entry is
known to the receiver.$^2$
By assuming that the entries in $\mathbf{s}$ are constant modulus (e.g.,
BPSK or QPSK), the ML JED estimate of the transmit vector
reduces to [40]:
$$\hat{\mathbf{s}}^{\mathrm{JED}} = \arg\max_{\mathbf{s} \in \mathcal{O}^{K+1}} \|\mathbf{Y}\mathbf{s}\|_2, \tag{4}$$
and $\hat{\mathbf{h}} = \mathbf{Y}\hat{\mathbf{s}}^{\mathrm{JED}}$ is the estimate of the channel vector. For
a small number of time slots $K + 1$, the problem in (4)
can be solved exactly at low average complexity using SD
methods [40]. For systems with a large number of time slots,
however, the computational complexity of such algorithms
becomes prohibitive. In contrast to the coherent ML detection
problem described in (2), linear methods that approximate (4)
are unavailable as relaxing the constraint $\mathbf{s} \in \mathcal{O}^{K+1}$ to
$\mathbf{s} \in \mathbb{C}^{K+1}$ causes the entries of $\mathbf{s}$ to grow without bound.
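The step from (3) to (4) can be made explicit with a short least-squares argument. The following is a sketch under the assumption that all entries of the transmit vector have the same modulus, so that its squared norm is a constant that does not depend on the particular symbol sequence; the notation $\hat{\mathbf{h}}(\mathbf{s})$ for the best channel estimate given a fixed candidate $\mathbf{s}$ is introduced here for illustration only.

```latex
\begin{align*}
\hat{\mathbf{h}}(\mathbf{s})
  &= \arg\min_{\mathbf{h} \in \mathbb{C}^{B}} \|\mathbf{Y} - \mathbf{h}\mathbf{s}^H\|_F^2
   = \frac{\mathbf{Y}\mathbf{s}}{\|\mathbf{s}\|_2^2}, \\
\|\mathbf{Y} - \hat{\mathbf{h}}(\mathbf{s})\,\mathbf{s}^H\|_F^2
  &= \|\mathbf{Y}\|_F^2 - \frac{\|\mathbf{Y}\mathbf{s}\|_2^2}{\|\mathbf{s}\|_2^2}.
\end{align*}
```

Since neither $\|\mathbf{Y}\|_F^2$ nor $\|\mathbf{s}\|_2^2$ depends on the choice of constant-modulus symbols, minimizing the residual over $\mathbf{s}$ amounts to maximizing $\|\mathbf{Y}\mathbf{s}\|_2$, which is the problem in (4). The resulting channel estimate agrees with $\mathbf{Y}\hat{\mathbf{s}}^{\mathrm{JED}}$ up to the positive scale factor $1/\|\mathbf{s}\|_2^2$.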
We now show how the ML JED problem in (4) can be
transformed into the same structure of the coherent ML problem
in (2), which enables SDR. Since the receiver is assumed to
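know the first transmitted symbol $s_0$, we rewrite the objective in
(4) as $\|\mathbf{Y}\mathbf{s}\|_2 = \|\mathbf{y}_0 s_0 + \mathbf{Y}_r\mathbf{s}_r\|_2$, where $\mathbf{Y}_r = [\mathbf{y}_1, \ldots, \mathbf{y}_K]$
and $\mathbf{s}_r = [s_1, \ldots, s_K]^T$. Similarly to the coherent ML problem,
we perform the real-valued decomposition by defining the real-valued quantities
$$
\underline{\mathbf{y}} = \begin{bmatrix} \Re(\mathbf{y}_0 s_0) \\ \Im(\mathbf{y}_0 s_0) \end{bmatrix}, \quad
\underline{\mathbf{H}} = \begin{bmatrix} \Re(\mathbf{Y}_r) & -\Im(\mathbf{Y}_r) \\ \Im(\mathbf{Y}_r) & \Re(\mathbf{Y}_r) \end{bmatrix}, \quad
\underline{\mathbf{s}} = \begin{bmatrix} \Re(\mathbf{s}_r) \\ \Im(\mathbf{s}_r) \end{bmatrix},
$$
which allows us to rewrite $\|\mathbf{y}_0 s_0 + \mathbf{Y}_r\mathbf{s}_r\|_2 = \|\underline{\mathbf{y}} + \underline{\mathbf{H}}\,\underline{\mathbf{s}}\|_2$. We
can now reformulate (4) in a form that is equivalent to (2) as
$$\bar{\mathbf{s}}^{\mathrm{JED}} = \arg\min_{\tilde{\mathbf{s}} \in \mathcal{X}^{N}} \mathrm{Tr}(\tilde{\mathbf{s}}^T \mathbf{T} \tilde{\mathbf{s}}). \tag{5}$$
For QPSK, the matrix $\mathbf{T} = -[\underline{\mathbf{H}}^T\underline{\mathbf{H}}, \underline{\mathbf{H}}^T\underline{\mathbf{y}}; \underline{\mathbf{y}}^T\underline{\mathbf{H}}, \underline{\mathbf{y}}^T\underline{\mathbf{y}}]$ is of
dimension $N \times N$ with $N = 2K + 1$ and $\mathcal{X} = \{-1, +1\}$
with $\tilde{\mathbf{s}} = [\Re(\mathbf{s}_r); \Im(\mathbf{s}_r); 1]$; for BPSK, the matrix
$\mathbf{T} = -[\underline{\mathbf{H}}^T\underline{\mathbf{H}}, \underline{\mathbf{H}}^T\underline{\mathbf{y}}; \underline{\mathbf{y}}^T\underline{\mathbf{H}}, \underline{\mathbf{y}}^T\underline{\mathbf{y}}]$ is of dimension $N \times N$ with
$N = K + 1$ and $\tilde{\mathbf{s}} = [\Re(\mathbf{s}_r); 1]$. Here, we define the $2B \times K$
matrix as $\underline{\mathbf{H}} = [\Re(\mathbf{Y}_r); \Im(\mathbf{Y}_r)]$. Analogously to the coherent
ML case, the solution $\bar{\mathbf{s}}^{\mathrm{JED}}$ can then be used to construct the
complex-valued ML JED solution of (4).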
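The JED reformulation in (5) can be checked numerically in the same way as the coherent case. The sketch below builds T for QPSK from a received matrix and the known first symbol, and compares the lifted objective with the objective of (4). The function name `build_T_jed_qpsk`, the dimensions, and the random data are assumptions made for illustration.

```python
import numpy as np

def build_T_jed_qpsk(Y, s0):
    """Build the (2K+1) x (2K+1) matrix T of (5) for QPSK from Y and the known symbol s0."""
    y0s0 = Y[:, 0] * s0                      # contribution of the known first symbol
    Yr = Y[:, 1:]                            # columns belonging to the K unknown symbols
    Hr = np.block([[Yr.real, -Yr.imag],
                   [Yr.imag,  Yr.real]])
    yr = np.concatenate([y0s0.real, y0s0.imag])
    top = np.hstack([Hr.T @ Hr, (Hr.T @ yr)[:, None]])
    bottom = np.hstack([yr @ Hr, [yr @ yr]])
    # The leading minus sign turns the maximization in (4) into a minimization.
    return -np.vstack([top, bottom])

# Hypothetical test scenario: B = 16 BS antennas, K + 1 = 6 time slots.
rng = np.random.default_rng(1)
B, K = 16, 5
Y = rng.standard_normal((B, K + 1)) + 1j * rng.standard_normal((B, K + 1))
s = rng.choice([-1.0, 1.0], K + 1) + 1j * rng.choice([-1.0, 1.0], K + 1)

T = build_T_jed_qpsk(Y, s[0])
s_tilde = np.concatenate([s[1:].real, s[1:].imag, [1.0]])  # [Re(s_r); Im(s_r); 1]
lifted = -(s_tilde @ T @ s_tilde)                          # negated objective of (5)
direct = np.linalg.norm(Y @ s) ** 2                        # ||Ys||_2^2 from (4)
```

Because `lifted` and `direct` coincide for every candidate symbol vector, the minimizer of (5) is a maximizer of (4).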
Evidently, the problems described in (2) and (5) exhibit the
same structure—we next show how both of these problems can
be solved approximately using the same SDR-based method.
C. Semidefinite Relaxation of the Problems in (2) and (5)
SDR is a well-known approximation to the coherent ML
problem [24], [28]–[30] and enables significantly lower (i.e.,
polynomial) computational complexity for systems employing
BPSK and QPSK constellations.3 SDR not only provides near-
ML performance, but also achieves the same diversity order as
the ML detector [25]. In contrast, the use of SDR for solving
2For SIMO systems, this approach resembles that of pilot-based
transmission—the difference to JED is, however, that we also use all transmitted
information symbols to improve the channel estimate and hence, to improve
the error-rate performance.
3SDR methods for higher-order constellations (such as 16-QAM) exist; see,
e.g., [32], [33] for more details.
the ML JED problem as proposed in Section II-B appears to
be novel.
SDR-based data detection starts by reformulating the problems
in (2) and (5) in the following equivalent form [24]:
$$\hat{\mathbf{S}} = \arg\min_{\mathbf{S} \in \mathbb{R}^{N \times N}} \mathrm{Tr}(\mathbf{T}\mathbf{S}) \quad \text{subject to } \mathrm{diag}(\mathbf{S}) = \mathbf{1},\ \mathrm{rank}(\mathbf{S}) = 1. \tag{6}$$
Here, we used $\mathrm{Tr}(\mathbf{s}^T\mathbf{T}\mathbf{s}) = \mathrm{Tr}(\mathbf{T}\mathbf{s}\mathbf{s}^T) = \mathrm{Tr}(\mathbf{T}\mathbf{S})$, where
$\mathbf{S} = \mathbf{s}\mathbf{s}^T$ is a rank-1 matrix and $\mathbf{s} \in \mathcal{X}^{N}$ is of appropriate
dimension $N$. Unfortunately, the rank-one constraint in (6)
makes this problem at least as hard as the original two problems
in (2) and (5). The key idea of SDR is to relax this rank
constraint, which results in an SDP that can be solved in
polynomial time. Specifically, SDR applied to (6) results in
the following well-known optimization problem [24]:
$$\hat{\mathbf{S}} = \arg\min_{\mathbf{S} \in \mathbb{R}^{N \times N}} \mathrm{Tr}(\mathbf{T}\mathbf{S}) \quad \text{subject to } \mathrm{diag}(\mathbf{S}) = \mathbf{1},\ \mathbf{S} \succeq 0, \tag{7}$$
where the constraint $\mathbf{S} \succeq 0$ ensures that the matrix $\mathbf{S}$ is positive
semidefinite (PSD). If the result of the problem in (7) is rank
one, then $\hat{\mathbf{S}} = \hat{\mathbf{s}}\hat{\mathbf{s}}^H$ where $\hat{\mathbf{s}}$ contains the exact estimate to (2)
and (5), i.e., SDR solves the original problem optimally. If the
resulting matrix $\hat{\mathbf{S}}$ has a higher rank, then an estimate of the
ML solution can be obtained by taking the signs of the leading
eigenvector of $\hat{\mathbf{S}}$ or by using randomization schemes [24].
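To make the lifting step and the eigenvector-based rounding concrete, the following sketch checks the trace identity used in (6) and recovers the signs from a rank-one solution. The helper name `round_leading_eigenvector` and the explicit sign normalization with respect to the last entry are assumptions of this example; the text only states that the signs of the leading eigenvector are taken.

```python
import numpy as np

def round_leading_eigenvector(S):
    """Return a {-1,+1} estimate of the first N-1 entries from an N x N SDR solution S."""
    _, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    v = eigvecs[:, -1]               # eigenvector of the largest eigenvalue
    v = v * np.sign(v[-1])           # assumption: resolve the sign so the appended constant is +1
    return np.sign(v[:-1])

s_tilde = np.array([1.0, -1.0, -1.0, 1.0, 1.0])  # last entry is the appended constant 1
S = np.outer(s_tilde, s_tilde)                   # rank-one matrix with diag(S) = 1
T = np.arange(25, dtype=float).reshape(5, 5)
T = T + T.T                                      # any symmetric T works for the identity

lhs = np.trace(T @ S)                            # Tr(TS)
rhs = s_tilde @ T @ s_tilde                      # s^T T s
s_hat = round_leading_eigenvector(S)
```

For a rank-one solution the rounding returns the original sign vector; for higher-rank solutions it only provides an approximation.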
While (7) can be solved exactly using interior-point meth-
ods [24], such algorithms typically require (i) a large number
of iterations, where each iteration requires either the solution
to a linear system or an eigenvalue decomposition, and (ii)
high numerical precision, which renders fixed-point hardware
challenging. We believe that these are the main reasons why—
until now—no VLSI design of an SDR-based data detector
has been proposed in the open literature.
III. TASER:
TRIANGULAR APPROXIMATE SEMIDEFINITE RELAXATION
We now detail TASER, a novel algorithm for approximately
solving the SDP presented in (7) using hardware accelerators.
A. Triangular SDP Formulation
The key idea of TASER builds on the fact that real-valued
PSD matrices $\mathbf{S} \succeq 0$ can be factorized using the Cholesky
decomposition $\mathbf{S} = \mathbf{L}^T\mathbf{L}$, where $\mathbf{L}$ is an $N \times N$ lower-triangular
matrix with non-negative entries on the main diagonal. With
this result, we can reformulate the SDP shown in (7) using the
following equivalent form:
$$\hat{\mathbf{L}} = \arg\min_{\mathbf{L}} \mathrm{Tr}(\mathbf{L}\mathbf{T}\mathbf{L}^T) \quad \text{subject to } \|\boldsymbol{\ell}_k\|_2 = 1,\ \forall k. \tag{8}$$
Here, we replaced the constraint $\mathrm{diag}(\mathbf{L}^T\mathbf{L}) = \mathbf{1}$ of (7) by
the equivalent $\ell_2$-norm constraint on the $k$th column
$\boldsymbol{\ell}_k = [\mathbf{L}]_k$. To obtain (approximate) solutions to the ML or JED
ML problems in either (2) or (5), respectively, we can take
the signs of the last row of the solution matrix $\hat{\mathbf{L}}$ from (8).
In fact, if the solution matrix $\hat{\mathbf{S}} = \hat{\mathbf{L}}^T\hat{\mathbf{L}}$ has rank one (this
implies that TASER identified the ML solution), then the last
row of $\hat{\mathbf{L}}$ must contain the associated eigenvector as this is
the only vector of dimension $N$. If, however, the solution
matrix $\hat{\mathbf{S}} = \hat{\mathbf{L}}^T\hat{\mathbf{L}}$ has a higher rank, an approximate ML
solution must be extracted somehow. As suggested in [50],
[51], taking the last row of the Cholesky decomposition results
in accurate rank-one approximations of PSD matrices. Our own
simulations in Section V-A confirm that this approximation
yields excellent error-rate performance, i.e., close to that of the
exact SDR detector followed by an eigenvalue decomposition.
We emphasize that this approach avoids costly eigenvalue
decompositions and randomization strategies that are required
by conventional solvers that compute $\hat{\mathbf{S}}$ exactly using SDR.
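The role of the last row can be seen in a small numerical example. The sketch below constructs a lower-triangular factor whose only non-zero row is the last one, which corresponds to the rank-one case, and verifies that the unit-diagonal constraint of (7) coincides with the unit column-norm constraint of (8). The example vector is hypothetical.

```python
import numpy as np

s_tilde = np.array([-1.0, 1.0, 1.0, 1.0])  # hypothetical rank-one solution, last entry +1
N = s_tilde.size

L = np.zeros((N, N))
L[-1, :] = s_tilde                         # only the last row is non-zero, so L is lower-triangular
S = L.T @ L                                # factorization S = L^T L

col_norms_sq = np.sum(L ** 2, axis=0)      # squared l2-norm of every column of L
s_hat = np.sign(L[-1, :-1])                # readout from the last row of L
```

Here `S` equals the outer product of `s_tilde` with itself, the diagonal of `S` equals `col_norms_sq`, and the signs of the last row reproduce the first N-1 entries of `s_tilde`.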
B. Forward-Backward Splitting
We now develop a computationally efficient algorithm that
directly solves the triangular SDP formulation in (8). Unfor-
tunately, the problem described in (8) is non-convex in the
matrix L and hence, computing an optimal solution is difficult.
For TASER, we apply FBS [27], a computationally efficient
method to solve convex optimization problems, to the non-
convex problem in (8). While this approach is not guaranteed
to converge to the optimal solution of the non-convex problem
posed by (8), we show in Section III-E that TASER converges
to a critical point of (8). Furthermore, our simulation results
in Section V demonstrate near-ML error-rate performance.
FBS is an efficient, iterative method to solve convex optimization
problems of the form $\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x})$,
where the function $f$ is smooth and convex, and $g$ is convex
but not necessarily smooth or bounded. FBS performs the
following steps for $t = 1, 2, \ldots$ [26], [27]:
$$\mathbf{x}^{(t)} = \mathrm{prox}_g\big(\mathbf{x}^{(t-1)} - \tau^{(t-1)}\nabla f(\mathbf{x}^{(t-1)});\ \tau^{(t-1)}\big)$$
until convergence or a maximum number of iterations $t_{\max}$ is
reached. Here, $\{\tau^{(t)} > 0\}$ is a suitably-chosen sequence of step
size parameters, $\nabla f(\mathbf{x})$ is the gradient of the function $f$, and
the so-called proximal operator for the function $g$ is defined
as [26], [27]:
$$\mathrm{prox}_g(\mathbf{z}; \tau) = \arg\min_{\mathbf{x}} \left\{ \tau g(\mathbf{x}) + \tfrac{1}{2}\|\mathbf{x} - \mathbf{z}\|_2^2 \right\}. \tag{9}$$
In order to approximately solve (8) using FBS, we define
$f(\mathbf{L}) = \mathrm{Tr}(\mathbf{L}\mathbf{T}\mathbf{L}^T)$ and incorporate the constraint using
$g(\mathbf{L}) = \chi(\|\boldsymbol{\ell}_k\|_2 = 1, \forall k)$, where $\chi$ is the characteristic
function (which is zero if the constraint is met and infinity
otherwise). The gradient is given by $\nabla f(\mathbf{L}) = \mathrm{tril}(2\mathbf{L}\mathbf{T})$, where
$\mathrm{tril}(\cdot)$ extracts the lower-triangular part of the argument. Even
though the function $g$ is non-convex, the proximal operator
defined in (9) has a closed form solution and is given by
$\mathrm{prox}_g(\boldsymbol{\ell}_k; \tau) = \boldsymbol{\ell}_k/\|\boldsymbol{\ell}_k\|_2$, $\forall k$; in words, the proximal operator
simply rescales the columns of $\mathbf{L}$ to have unit $\ell_2$-norm.
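The two building blocks of the FBS iteration for (8) are short enough to write down directly. The following NumPy sketch implements the gradient and the proximal operator as described above and verifies the gradient with a finite-difference check along a lower-triangular direction. The function names, the random test matrices, and the step size of 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def f(L, T):
    """Objective of (8): Tr(L T L^T)."""
    return np.trace(L @ T @ L.T)

def grad_f(L, T):
    """Gradient of f restricted to lower-triangular matrices (T symmetric)."""
    return np.tril(2.0 * L @ T)

def prox_g(V):
    """Proximal operator of the constraint: rescale every column to unit l2-norm."""
    return V / np.linalg.norm(V, axis=0, keepdims=True)

rng = np.random.default_rng(2)
N = 5
A = rng.standard_normal((N, N))
T = A.T @ A                                  # symmetric PSD test matrix
L = np.tril(rng.standard_normal((N, N)))     # current lower-triangular iterate
E = np.tril(rng.standard_normal((N, N)))     # lower-triangular perturbation direction

# Central finite difference of f along E versus the inner product with the gradient.
eps = 1e-3
fd = (f(L + eps * E, T) - f(L - eps * E, T)) / (2 * eps)
inner = np.sum(grad_f(L, T) * E)

# One forward (gradient) step followed by one backward (proximal) step.
L_next = prox_g(L - 0.1 * grad_f(L, T))
```

After the proximal step every column of `L_next` has unit norm and the matrix remains lower-triangular, so the iterate stays feasible for (8).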
In order to arrive at a hardware-friendly algorithm, we avoid
sophisticated step size rules such as the ones proposed in [27].
We use a fixed step size proportional to the reciprocal of
the Lipschitz constant of the gradient ∇f(L) as proposed
in [26]. Our step size corresponds to $\tau = \alpha/\|\mathbf{T}\|_2$, where $\|\mathbf{T}\|_2$
is the spectral norm of the matrix $\mathbf{T}$ and $0 < \alpha < 1$ is a
system-dependent tuning parameter that we use to improve the
empirical convergence rate when running TASER for a small
number of iterations (see Section III-E for a discussion).
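Algorithm 1 TASER
1: inputs: $\tilde{\mathbf{T}}$, $\mathbf{D}$, and $\tau = \alpha/\|\tilde{\mathbf{T}}\|_2$
2: initialization: $\tilde{\mathbf{L}}^{(0)} = \mathbf{D}$
3: for $t = 1, \ldots, t_{\max}$ do
4:   $\mathbf{V}^{(t)} = \tilde{\mathbf{L}}^{(t-1)} - \mathrm{tril}(2\tau\tilde{\mathbf{L}}^{(t-1)}\tilde{\mathbf{T}})$
5:   $\tilde{\mathbf{L}}^{(t)} = \mathrm{prox}_{\tilde{g}}(\mathbf{V}^{(t)})$
6: end for
7: outputs: $s_k = \mathrm{sign}(\tilde{L}^{(t_{\max})}_{N,k})$, $k = 1, \ldots, N - 1$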
C. Jacobi Preconditioning
To improve the convergence rate of FBS, we precondition the
problem presented in (8). To this end, we compute a diagonal
scaling matrix $\mathbf{D} = \mathrm{diag}(\sqrt{T_{1,1}}, \ldots, \sqrt{T_{N,N}})$, which we
use to scale the matrix $\mathbf{T}$ as $\tilde{\mathbf{T}} = \mathbf{D}^{-1}\mathbf{T}\mathbf{D}^{-1}$ so that $\tilde{\mathbf{T}}$
has an all-ones main diagonal. The purpose of this so-called
Jacobi preconditioner is to improve the condition number of
the original PSD matrix $\mathbf{T}$ [52]. We then run FBS to recover
a normalized version$^4$ of the lower-triangular matrix $\tilde{\mathbf{L}} = \mathbf{L}\mathbf{D}$.
We emphasize that preconditioning also requires us to modify
the proximal operator, which turns out to be
$\mathrm{prox}_{\tilde{g}}(\tilde{\boldsymbol{\ell}}_k) = D_{k,k}\tilde{\boldsymbol{\ell}}_k/\|\tilde{\boldsymbol{\ell}}_k\|_2$, where $\tilde{\boldsymbol{\ell}}_k$ is the $k$th column of $\tilde{\mathbf{L}}$. Since we
only rely on the signs of the last row of $\hat{\mathbf{L}}$ to obtain an estimate
of the ML problems, we can simply take the signs of the
normalized triangular matrix $\tilde{\mathbf{L}}$.
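The preconditioning step and the modified proximal operator can be sketched as follows. The code assumes that T has a strictly positive main diagonal, as is the case for the PSD matrix of the coherent problem; the function names `jacobi_precondition` and `prox_g_tilde` and the deliberately badly scaled test matrix are assumptions of this example.

```python
import numpy as np

def jacobi_precondition(T):
    """Return T_tilde = D^{-1} T D^{-1} and the diagonal d of D = diag(sqrt(T_kk))."""
    d = np.sqrt(np.diag(T))          # assumes T[k, k] > 0 for all k
    T_tilde = T / np.outer(d, d)     # entry-wise form of D^{-1} T D^{-1}
    return T_tilde, d

def prox_g_tilde(V, d):
    """Modified proximal operator: rescale column k to have l2-norm D_kk."""
    return V * (d / np.linalg.norm(V, axis=0))

rng = np.random.default_rng(3)
# Hypothetical test matrix with very different column scales.
A = rng.standard_normal((12, 6)) * np.array([1, 5, 0.2, 3, 1, 10])
T = A.T @ A

T_tilde, d = jacobi_precondition(T)
D_inv = np.diag(1.0 / d)

V = np.tril(rng.standard_normal((6, 6)))
L_tilde = prox_g_tilde(V, d)
```

After scaling, `T_tilde` has an all-ones main diagonal, and each column of `L_tilde` has norm equal to the corresponding diagonal entry of D.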
D. The TASER Algorithm
We now have all the necessary ingredients for TASER, which
is summarized in Algorithm 1. The inputs of the algorithm
are the preconditioned matrix $\tilde{\mathbf{T}}$, the scaling matrix $\mathbf{D}$, and
the step size $\tau$. We initialize the FBS procedure by $\tilde{\mathbf{L}}^{(0)} = \mathbf{D}$,
which resulted in excellent performance for all considered
scenarios. The main loop of TASER then performs the gradient
and proximal steps as discussed in Sections III-B and III-C
until a maximum number of iterations $t_{\max}$ is reached. For
most situations, only a few iterations are sufficient to achieve
near-ML error rate performance (see Section V for numerical
results). The TASER algorithm computes an estimate for the
coherent ML and ML JED problems in (2) and (5), respectively.
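A compact floating-point rendering of Algorithm 1 is given below as a minimal sketch. It folds the Jacobi preconditioning into the function, and the defaults `alpha=0.99` and `t_max=20`, the function name `taser`, and the coherent QPSK test scenario are assumptions made for illustration; they do not reproduce the authors' fixed-point design or their simulator.

```python
import numpy as np

def taser(T, alpha=0.99, t_max=20):
    """Sketch of Algorithm 1; returns the sign estimates and the final iterate."""
    d = np.sqrt(np.diag(T))                          # Jacobi preconditioner, D = diag(d)
    T_tilde = T / np.outer(d, d)                     # D^{-1} T D^{-1}
    tau = alpha / np.linalg.norm(T_tilde, 2)         # line 1: step size from the spectral norm
    L = np.diag(d)                                   # line 2: initialization with D
    for _ in range(t_max):                           # line 3
        V = L - np.tril(2.0 * tau * (L @ T_tilde))   # line 4: gradient step
        L = V * (d / np.linalg.norm(V, axis=0))      # line 5: proximal step
    return np.sign(L[-1, :-1]), L                    # line 7: signs of the last row

# Hypothetical coherent QPSK scenario with B = 32 BS antennas and U = 4 users.
rng = np.random.default_rng(4)
B, U = 32, 4
H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
s = rng.choice([-1.0, 1.0], U) + 1j * rng.choice([-1.0, 1.0], U)
y = H @ s + 0.1 * (rng.standard_normal(B) + 1j * rng.standard_normal(B))

# Matrix T of (2) for QPSK, built from the real-valued decomposition.
Hr = np.block([[H.real, -H.imag], [H.imag, H.real]])
yr = np.concatenate([y.real, y.imag])
T = np.block([[Hr.T @ Hr, -(Hr.T @ yr)[:, None]],
              [-(yr @ Hr)[None, :], np.array([[yr @ yr]])]])

s_bar, L_final = taser(T)
s_hat = s_bar[:U] + 1j * s_bar[U:]                   # back to complex-valued symbols
```

Each iteration consists of one triangular matrix product and one column rescaling, which is the structure that the systolic architecture exploits. Comparing `s_hat` with `s` over many random realizations gives an empirical error rate for a chosen number of iterations.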
E. Convergence Theory
The TASER algorithm tries to solve a non-convex problem
using FBS. Hence, our approach raises two questions, namely
(i) whether we should expect the minimization algorithm to
converge, and (ii) whether the local minima of the non-convex
problem correspond to minimizers of the convex SDP. We now
investigate both of these questions.
While the application of FBS for minimizing the proposed
semidefinite program is new, the convergence of FBS for non-
convex problems is well-studied. Reference [53] presents condi-
tions for which FBS converges with non-convex constraints. In
particular, the problem must be semi-algebraic, meaning both
the constraints and the epigraph of the objective can be written
$^4$In the conference paper [1], we mistakenly stated $\tilde{\mathbf{L}} = \mathbf{D}\mathbf{L}$.
as the set of solutions to a system of polynomial equations.$^5$
Fortunately, such results apply to the formulation (8). The
following result makes this statement rigorous.
Proposition 1. Suppose we apply FBS (Algorithm 1) to solve
the problem stated in (8). If we use the step size $\tau = \alpha/\|\mathbf{T}\|_2$
with $0 < \alpha < 1$, then the sequence of iterates $\{\mathbf{L}^{(t)}\}$ converges
to a critical point of the problem in (8).
Proof: The function $\|\boldsymbol{\ell}_k\|_2^2$ is a polynomial in the entries
of $\mathbf{L}$. The constraint set in (8) is the solution to the polynomial
system $\|\boldsymbol{\ell}_k\|_2^2 = 1$, $\forall k$ and is thus semi-algebraic. The objective
function, being a quadratic form, is also trivially semi-algebraic.
By Theorem 5.3 of [53], we know that the sequence of iterates
$\{\mathbf{L}^{(t)}\}$ converges, provided the step size is bounded from above
by the inverse of the Lipschitz constant of the gradient of the
objective. For our quadratic objective, the Lipschitz constant
is merely the spectral radius ($\ell_2$-norm) of $\mathbf{T}$.
Note that the Jacobi preconditioner in Section III-C results
in a problem of the same form as (8), but with constraints
of the form $\|\tilde{\boldsymbol{\ell}}_k\|_2^2 = D_{k,k}^2$ and the step size $\tau = \alpha/\|\tilde{\mathbf{T}}\|_2$.
Consequently, Proposition 1 still applies. Note that this result
has the caveat that we are not guaranteed to find a (global)
minimizer, but rather stationary points, although we generally
observe minimizers in practice. Nonetheless, this convergence
guarantee is considerably stronger than what is known for
other low-complexity SDP methods, such as those inspired by
Burer and Monteiro [54], which rely on non-convex augmented
Lagrangian schemes for which no guarantees currently exist.
The second question to ask is whether the local minima of
our non-convex formulation in (8) correspond to minimizers of
the convex SDP shown in (7). Interestingly, when the factors $\mathbf{L}$
and $\mathbf{L}^T$ are not constrained to be triangular, local minimizers
of (8) are known to yield optimal minimizers for the SDP (7)
(see [55], Corollary 3.6). Nevertheless, we have found that it
is better to enforce the triangular constraint in practice as it
substantially simplifies the architecture detailed next.
IV. VLSI ARCHITECTURE
We now propose a systolic VLSI architecture that implements
TASER and enables high-throughput data detection at low
hardware complexity.
A. Architecture Overview
Figure 1 shows the proposed triangular systolic array consisting
of $N(N + 1)/2$ processing elements (PEs), which mainly
perform multiply-accumulate (MAC) operations. Each PE is
associated with an entry $\tilde{L}^{(t-1)}_{i,j}$ of the lower-triangular
matrix $\tilde{\mathbf{L}}^{(t-1)}$ and stores $\tilde{L}^{(t-1)}_{i,j}$ as well as the value $V^{(t)}_{i,j}$ of the
$\mathbf{V}^{(t)}$ matrix (cf. Algorithm 1). All PEs that are part of the same
column receive data from a column-broadcast unit (CBU); all
PEs that are part of the same row receive data from a
row-broadcast unit (RBU).
In the $k$th cycle during the $t$th TASER iteration, the $i$th RBU
sends the value $\tilde{L}^{(t-1)}_{i,k}$ to all PEs on row $i$, while the $j$th CBU
5The authors of [53] actually prove results for the broader class of Kurdyka-
Łojasiewicz functions, of which semi-algebraic functions are a special case.
Fig. 1. High-level block diagram of TASER. We use a systolic array of
processing elements (PEs) for the diagonal (D) and off-diagonal (OD) elements,
which enables high throughput at low hardware complexity.
sends $\hat{T}_{k,j}$ to all PEs on column $j$. We assume that the (scaled)
matrix $\hat{\mathbf{T}} = 2\tau\tilde{\mathbf{T}}$ has been computed in a pre-processing step
and is stored in a memory (see Section IV-C for more details
on the memory implementation). The $\tilde{L}^{(t-1)}_{i,k}$ value coming
from each RBU is taken from the $(i, k)$ PE and sent to all
other PEs in the same row.
With the data received from the CBU and the RBU, each PE
performs MAC operations until the result $\tilde{\mathbf{L}}^{(t-1)}\hat{\mathbf{T}}$ on line 4 of
Algorithm 1 is computed. To include the subtraction on line 4,
the operation $\tilde{L}^{(t-1)}_{i,j} - \tilde{L}^{(t-1)}_{i,1}\hat{T}_{1,j}$ is performed in the first cycle
of each TASER iteration and stored in the accumulator. During
subsequent cycles, the products $\tilde{L}^{(t-1)}_{i,k}\hat{T}_{k,j}$, with $2 \le k \le N$,
are sequentially subtracted from the accumulator. Since the
matrix $\tilde{\mathbf{L}}$ is lower-triangular, we have $\tilde{L}_{i,k'} = 0$ if $i < k'$.
Hence, we avoid the subtraction of the products $\tilde{L}^{(t-1)}_{i,k'}\hat{T}_{k',j}$ as they are
zero. This implies that the $V^{(t)}_{i,j}$ values of the PEs in the $i$th row
of the systolic array are computed after only $i$ clock cycles,
so the matrix $\mathbf{V}^{(t)}$ on line 4 is completed after $N$ cycles.
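The MAC schedule described above can be emulated in software to confirm that it produces line 4 of Algorithm 1. The following sketch walks through the N cycles and updates one accumulator per PE; it models only the arithmetic, not the broadcast units or the fixed-point behavior, and the function name `gradient_step_systolic` and the random test data are assumptions of this example.

```python
import numpy as np

def gradient_step_systolic(L, T_hat):
    """Emulate the per-cycle MAC schedule that forms V = L - tril(L @ T_hat)."""
    N = L.shape[0]
    V = np.zeros_like(L)
    for k in range(N):                 # cycle k + 1 of the iteration
        for i in range(k, N):          # rows above row k would only multiply by zeros and stay idle
            for j in range(i + 1):     # PEs exist only on and below the main diagonal
                if k == 0:
                    # First cycle: load L[i, j] and subtract the first product.
                    V[i, j] = L[i, j] - L[i, 0] * T_hat[0, j]
                else:
                    # Later cycles: subtract the next product from the accumulator.
                    V[i, j] -= L[i, k] * T_hat[k, j]
    return V

rng = np.random.default_rng(5)
N = 4
L = np.tril(rng.standard_normal((N, N)))
A = rng.standard_normal((N, N))
T_hat = A + A.T                        # stands in for the scaled matrix 2 * tau * T_tilde

V_cycle = gradient_step_systolic(L, T_hat)
V_ref = L - np.tril(L @ T_hat)
```

In this emulation, row i (counting from one) is active for exactly i cycles, which matches the statement that the full matrix is available after N cycles.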
An example for an $N = 3$ array is shown in Figure 2(a).
In the first cycle of the $t$th iteration, the $(1, 1)$ PE has access
to the values $\tilde{L}^{(t-1)}_{1,1}$ and $\hat{T}_{1,1}$, so it can compute
$V^{(t)}_{1,1} = \tilde{L}^{(t-1)}_{1,1} - \tilde{L}^{(t-1)}_{1,1}\hat{T}_{1,1}$. In the same cycle, the PEs on the second
row perform their first MAC operation, which leaves
$\tilde{L}^{(t-1)}_{2,j} - \tilde{L}^{(t-1)}_{2,1}\hat{T}_{1,j}$ in their accumulators.
In the second cycle, the PEs on the second row receive $\tilde{L}^{(t-1)}_{2,2}$
via the RBU and $\hat{T}_{2,j}$ via the CBUs (see Figure 2(b)), so they
can finish computing $V^{(t)}_{2,j} = \tilde{L}^{(t-1)}_{2,j} - \tilde{L}^{(t-1)}_{2,1}\hat{T}_{1,j} - \tilde{L}^{(t-1)}_{2,2}\hat{T}_{2,j}$.
In addition, in this same cycle, the $(1, 1)$ PE can use its MAC
unit to square its $V^{(t)}_{1,1}$ value. This result will be available and
sent to the next PE in the same column on the following cycle,
which is represented with the green arrow in Figure 2(c).
[Figure 2, panels: (a) First cycle; (b) Second cycle; (c) Third cycle; (d) Seventh cycle]
Fig. 2. Different cycles for the $t$th iteration on an N = 3 TASER array. The
symbols inside the PEs correspond to the quantities of interest, for each cycle,
stored in each PE, while the symbols inside the RBUs and CBUs correspond
to the quantities being transmitted by these units. All the $\tilde{L}$ values correspond
to the $(t-1)$th iteration, while the $V$ values are from the $t$th iteration.
Fig. 3. Architecture details of the column-scale unit (CSU), the column-broadcast
unit (CBU), and the off-diagonal (OD) and diagonal (D) processing
elements (PEs).
In the third cycle, the (2, 1) PE has access to $(V^{(t)}_{1,1})^2$ (from the
(1, 1) PE) and $V^{(t)}_{2,1}$ (stored internally), so it can use its MAC
unit to square $V^{(t)}_{2,1}$ and add it to $(V^{(t)}_{1,1})^2$ (see Figure 2(c)). The
result will be the sum of the squares of the first two elements
of the first column of V(t) and, in the next cycle, the result
will be available and sent to the next PE in the same column
(for this example, the (3, 1) PE), so it can repeat the same
procedure. This process is replicated in all the columns and
repeated until all the PEs of the array have completed their
calculations. By doing so, the squared $\ell_2$-norm of each column
of $\mathbf{V}^{(t)}$ is computed after N + 1 clock cycles, just one cycle
after $\mathbf{V}^{(t)}$ is completed. In the (N + 2)th cycle, the squared
$\ell_2$-norm for the jth column is passed to a scale unit, which
computes its inverse square root and multiplies the result with
$D_{j,j}$. This operation takes two clock cycles to complete, so its
result is ready in the (N + 4)th cycle.
In the (N + 4)th cycle, the scaling factor $D_{j,j}/\|\mathbf{v}_j\|_2$
(where $\mathbf{v}_j$ is the jth column of $\mathbf{V}^{(t)}$) is sent to all the PEs in
the same column via the CBU, as shown in Figure 2(d). Then,
in the (N + 5)th and final cycle of the iteration, all PEs multiply
their associated $V^{(t)}_{i,j}$ value by the received scaling factor to
obtain the next iterate $\tilde{L}^{(t)}_{i,j}$, thus completing the proximal step
on line 5 of Algorithm 1.
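The norm accumulation and scaling described above amount to rescaling every column of V to a prescribed norm. The following is a minimal functional sketch in plain Python, under the assumptions that V is stored as a nested list, that only the entries with i ≥ j are used, and that no column is all zero; the name `proximal_step` and the example values are hypothetical.

```python
import math


def proximal_step(V, D):
    """Rescale every column j of lower-triangular V to have l2-norm D[j].

    V: N x N lower-triangular matrix (list of lists) with nonzero columns.
    D: list holding the diagonal entries D[j][j].
    """
    n = len(V)
    L_next = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Running sum of squares, passed from PE to PE down the column.
        sq_norm = 0.0
        for i in range(j, n):
            sq_norm += V[i][j] * V[i][j]
        # Scale unit: inverse square root times D[j][j].
        factor = D[j] / math.sqrt(sq_norm)
        # Final cycle: every PE in the column applies the same factor.
        for i in range(j, n):
            L_next[i][j] = factor * V[i][j]
    return L_next


# Hypothetical 2 x 2 example.
V = [[3.0, 0.0], [4.0, 2.0]]
L_next = proximal_step(V, D=[1.0, 0.5])
```

After this step, column j of the result has ℓ2-norm D[j], which is the property the hardware enforces with one scale unit per column.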
Prior to decoding the next symbol, line 2 of Algorithm 1
must be executed; this is accomplished using the CBUs, which
send the $D_{j,j}$ values to the diagonal PEs, while the off-diagonal
PEs clear their $\tilde{L}^{(t-1)}_{i,j}$ registers.
B. Processing Element
We use two slightly distinct types of PEs in our systolic
array: (i) off-diagonal (OD) PEs and (ii) diagonal (D) PEs (see
Figure 3). Both PE types support the following four operation
modes:
1) Initialization of $\tilde{\mathbf{L}}$: This mode is used for line 2 of
Algorithm 1. All off-diagonal PEs initialize $\tilde{L}^{(t-1)}_{i,j} = 0$; the
diagonal PEs initialize their states with $D_{j,j}$ received from
the CBU.
2) Matrix multiplication: This mode is used to compute
line 4 of Algorithm 1. The multiplier uses the inputs from
both broadcast signals. In the first cycle of the matrix-matrix
multiplication procedure, the multiplier’s output is subtracted
from $\tilde{L}^{(t-1)}_{i,j}$; in all other cycles, it is subtracted from the
accumulator. Since each PE stores its own $\tilde{L}^{(t-1)}_{i,j}$ value, in the
kth cycle, all the PEs in the kth column use their internal $\tilde{L}^{(t-1)}_{i,k}$
value to feed the multiplier, instead of the signals coming from
the RBU.
3) Squared $\ell_2$-norm calculation: This mode is used for line
5 of Algorithm 1. Both of the multiplier’s inputs are $V^{(t)}_{i,j}$. For
the D-PEs, the result is passed to the next PE in the same
column. For the OD-PEs, the output of the multiplier is added
to the $\sum_{n=j}^{i-1} (V^{(t)}_{n,j})^2$ value from the preceding PE in the same
column; the result $\sum_{n=j}^{i} (V^{(t)}_{n,j})^2$ is sent to the next PE or to
the scale unit, if the PE is in the last row.
4) Scaling: This mode completes line 5 of Algorithm 1.
One of the multiplier’s inputs is $V^{(t)}_{i,j}$ and the other is the value
$D_{j,j}/\|\mathbf{v}_j\|_2$, which was computed previously by the scale unit
and received through the CBU. The result is $\tilde{L}^{(t)}_{i,j}$ and is stored
in every PE as the $\tilde{L}^{(t-1)}_{i,j}$ of the next iteration.
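The four operation modes can be summarized with a small behavioral model. This is only a sketch under assumptions: the class name `ProcessingElement`, the register names `l_reg` and `acc`, and the method names are hypothetical, and the model ignores fixed-point effects and pipeline timing. The example drives the first column of an N = 2 array, where the (2, 2) entry is held constant outside the PEs.

```python
import math


class ProcessingElement:
    """Behavioral sketch of one PE; names are illustrative only."""

    def __init__(self, diagonal):
        self.diagonal = diagonal
        self.l_reg = 0.0  # L[i][j] of the previous iteration
        self.acc = 0.0    # accumulator; holds V[i][j] after mode 2

    def initialize(self, d_jj):
        # Mode 1: diagonal PEs load D[j][j]; off-diagonal PEs clear.
        self.l_reg = d_jj if self.diagonal else 0.0

    def multiply(self, l_in, t_in, first_cycle):
        # Mode 2: subtract the product from L (first cycle) or from acc.
        base = self.l_reg if first_cycle else self.acc
        self.acc = base - l_in * t_in

    def square_accumulate(self, partial_sum=0.0):
        # Mode 3: square V; off-diagonal PEs add the sum from the PE above.
        own = self.acc * self.acc
        return own if self.diagonal else own + partial_sum

    def scale(self, factor):
        # Mode 4: next iterate, kept for the following iteration.
        self.l_reg = self.acc * factor
        return self.l_reg


# First column of a hypothetical N = 2 array.
d11, l22 = 2.0, 3.0            # D[1][1] and the constant (2, 2) entry
t11, t21 = 0.25, 0.5           # first column of T
pe11, pe21 = ProcessingElement(True), ProcessingElement(False)
pe11.initialize(d11)
pe21.initialize(d11)
pe11.multiply(pe11.l_reg, t11, first_cycle=True)   # uses its own L value
pe21.multiply(pe21.l_reg, t11, first_cycle=True)   # uses its own L value
pe21.multiply(l22, t21, first_cycle=False)         # L[2][2] via the RBU
sq_norm = pe21.square_accumulate(pe11.square_accumulate())
factor = d11 / math.sqrt(sq_norm)                  # scale unit
column = [pe11.scale(factor), pe21.scale(factor)]
```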
C. Implementation Details
To demonstrate the efficacy of TASER and the proposed
triangular systolic array, we implemented FPGA and ASIC
reference designs for various array sizes N . All designs were
developed and optimized in Verilog on register-transfer level
(RTL). The implementation details are as follows:
1) Fixed-point design parameters: To minimize the hardware
complexity while maintaining near-optimal error-rate performance,
all our designs use 14-bit fixed-point numbers. All PEs,
except for the ones in the bottom row of the triangular array,
use 8 fraction bits to represent $\tilde{L}^{(t-1)}_{i,j}$ and $V^{(t)}_{i,j}$; the PEs in
the bottom row use 7 fraction bits. For the element $\tilde{L}_{N,N}$, we
do not use a PE and store the value (which remains constant)
in a register with 5 fraction bits.
2) Inverse square-root computation: The inverse square-root
operation in the scale unit is implemented using a look-up table
(LUT), which we synthesized using random logic. Each LUT
consists of $2^{11}$ entries with 14 bits per word, of which 13 are
fraction bits.
3) T̂-matrix memories: For the FPGA designs, the T̂k,j
memories are implemented with LUTs used as distributed
RAM (i.e., no block RAMs were used); for the ASIC designs,
we use latch arrays built from standard cells [56] in order to
minimize the circuit area.
[Figure 5: panels (a) BPSK and (b) QPSK. Horizontal axis: min. SNR [dB] required to achieve 1% VER; vertical axis: FPGA throughput [Mb/s]. Curves: TASER with B = 32, B = 64, B = 128, and B = 256. Reference lines: SIMO lower bound and MMSE detector.]
Fig. 5. Throughput for the FPGA design vs. performance trade-off for a
32-user system. Vertical solid lines represent the SIMO lower bound; dashed
lines represent linear MMSE performance. TASER outperforms linear detectors
in almost all operation regimes. The numbers next to the markers correspond
to the number of TASER iterations tmax.
4) RBU and CBU design: The RBUs are implemented
differently for the FPGA and ASIC designs. For the FPGA
designs, the RBU of the ith row is an i-input multiplexer that
receives data from all the PEs on its row, and also sends the
appropriate $\tilde{L}^{(t-1)}_{i,k}$ to these PEs. For the ASIC designs, the
RBU consists of a bidirectional bus, where each PE on its row
uses a tri-state buffer to send data through it one at a time,
while all the PEs on the same row acquire data from it. A
similar approach is used for the CBUs: We use multiplexers
for the FPGA designs and busses for the ASIC designs. For
both target architectures, the output of the ith RBU connects to
i PEs. This path suffers from large fan-out for large values of i,
eventually becoming the critical path for large systolic arrays.
The same behavior applies to the CBUs. In order to shorten
these critical paths in our architecture, we place stage registers
at the inputs and outputs of the respective broadcast units.
While this approach entails two penalty cycles per TASER
iteration, the overall detection throughput is increased as we
achieve a substantially higher clock frequency.
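The cycle budget of one TASER iteration can be tallied from the schedule described in this section. The sketch below is a plain bookkeeping exercise with a hypothetical function name; the observation that the result coincides with the minimum latency values listed in Tables II and IV is an inference from the numbers, not a statement made in the text.

```python
def cycles_per_iteration(n, penalty_cycles=2):
    """Clock cycles for one iteration on an N x N triangular array."""
    v_done = n                     # matrix V complete after N cycles
    norms_done = v_done + 1        # squared column norms after N + 1
    scale_ready = norms_done + 3   # handed over in N + 2, ready in N + 4
    scaling_done = scale_ready + 1 # PEs apply the factor in cycle N + 5
    # Stage registers at the broadcast units add the penalty cycles.
    return scaling_done + penalty_cycles


for n in (9, 17, 33, 65):
    print(n, cycles_per_iteration(n))
```

For N = 9, 17, 33, and 65 this gives 16, 24, 40, and 72 cycles, respectively.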
V. IMPLEMENTATION RESULTS AND COMPARISON
We now provide error-rate performance results for coherent
data detection in massive MU-MIMO systems and for JED in
massive SIMO systems. We then show reference FPGA and
ASIC implementation results, which we compare to existing
designs for massive MU-MIMO systems.
[Figure 4: panels (a) BPSK and (b) QPSK. Horizontal axis: average SNR per receive antenna [dB]; vertical axis: vector error rate (VER), logarithmic scale from 10^-3 to 10^0. Curve groups: 128 × 8, 64 × 16, and 32 × 32. Legend: SIMO lower bound, exact SDR detector, TASER, TASER (fixed point), MMSE detector, K-best detector, ML detector.]
Fig. 4. Vector error rate (VER) for three different B × U large-MIMO system configurations. Even for square massive MU-MIMO systems (see the 32 × 32
system), TASER achieves near-optimal VER performance (close to ML and the SIMO lower bound) and approaches the performance of the exact SDR detector.
For systems with more BS antennas than users (see the 128 × 8 system), all of the considered data detectors approach optimal performance.
A. Error-Rate Performance
1) Coherent massive MU-MIMO data detection: Figures 4(a)
and 4(b) show vector error rate (VER) simulation results for
TASER with BPSK and QPSK modulation, respectively.6 We
show simulation results for coherent data detection with i.i.d.
flat Rayleigh fading in tall 128× 8 and 64× 16 systems, as
well as a square 32× 32 large MU-MIMO system (we use the
notation B × U ). We show the performance of ML detection
(only for the U = 8 and U = 16 systems; computed using
the sphere-decoding algorithm in [10]), exact SDR detection
from (6), linear MMSE detection, and the K-best algorithm as
detailed in [57] with K = 5. As a baseline, we also include
the performance of the SIMO lower bound.
For the 128 × 8 massive MIMO system, we see that all
detectors approach optimal performance (even the SIMO lower
bound); this is a well-known result from the large MIMO
literature [2]–[5]. For the 64 × 16 massive MIMO system,
only the linear MMSE detector suffers from a (rather small)
performance loss; all the other detectors perform equally well.
6 The vector error rate (VER) corresponds to $P[\mathbf{s} \neq \hat{\mathbf{s}}]$, which is the
probability of detecting a vector $\hat{\mathbf{s}}$ that differs from the transmitted vector $\mathbf{s}$.
For the more challenging square 32 × 32 massive MIMO
system, we see that TASER achieves near-ML performance
and outperforms linear MMSE detection and the K-best algo-
rithm (note that, even with the sphere decoder, ML detection
exhibits prohibitive complexity). We also show the fixed-point
performance of our TASER hardware design, denoted by “fixed
point” in Figures 4(a) and 4(b), which demonstrates a small
implementation loss (less than 0.2 dB SNR at 1% VER).
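The VER defined in footnote 6 can be estimated from simulation output as the fraction of detected vectors that differ from the transmitted ones in at least one entry. The following is a minimal sketch with hypothetical function names and toy data; it is not the simulation code used for Figure 4. The per-entry error rate is included for comparison and coincides with the bit error rate only for BPSK, where each symbol carries one bit.

```python
def vector_error_rate(sent, detected):
    """Fraction of vectors with at least one wrong entry."""
    errors = sum(1 for s, s_hat in zip(sent, detected)
                 if list(s) != list(s_hat))
    return errors / len(sent)


def entry_error_rate(sent, detected):
    """Fraction of wrong entries over all vectors."""
    wrong = sum(a != b for s, s_hat in zip(sent, detected)
                for a, b in zip(s, s_hat))
    total = sum(len(s) for s in sent)
    return wrong / total


# Toy BPSK example: four transmitted vectors for U = 3 users.
sent = [[1, -1, 1], [1, 1, 1], [-1, -1, 1], [-1, 1, -1]]
detected = [[1, -1, 1], [1, -1, 1], [-1, -1, 1], [1, 1, 1]]
ver = vector_error_rate(sent, detected)
eer = entry_error_rate(sent, detected)
```

In this toy example two of the four vectors are wrong, but only three of the twelve entries are, which illustrates why the VER is the stricter metric.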
Figures 5(a) and 5(b) show the trade-off between the through-
put of TASER for the FPGA design (see Section V-C for the
details) and the minimum SNR required to achieve 1% VER
for the coherent data detection in large-MIMO systems. We
also include the SIMO lower bound and the performance of
linear MMSE detection as a reference. The MMSE detector
serves as a fundamental performance limit of the conjugate
gradient least-squares (CGLS) detector [20], the Neumann-
series detector [8], the optimized coordinate-descent (OCD)
detector [21], and the Gauss-Seidel (GS) detector [22]. The
maximum number of TASER iterations tmax enables us to
tune the performance/complexity trade-off; only a few iter-
ations are sufficient to outperform linear detection. We also
see that TASER delivers near-ML performance and achieves
throughputs from 10 Mb/s to 80 Mb/s for the FPGA design.
[Figure 6: panels (a) BPSK and (b) QPSK. Horizontal axis: average SNR per receive antenna [dB]; vertical axis: bit error rate (BER), logarithmic scale from 10^-3 to 10^0. Legend: SIMO (perfect CSIR), SIMO (CHEST), SDR, TASER, ML JED.]
Fig. 6. Bit error rate (BER) for a SIMO system with 16 BS antennas with
transmission over 16 time slots. TASER-based JED achieves near-optimal
BER performance (close to perfect CSIR) and achieves performance similar to
the exact SDR detector and ML JED; channel estimation (CHEST) followed
by SIMO detection entails a 3 dB SNR loss.
2) JED in massive SIMO systems: Figures 6(a) and 6(b)
show BER simulation results for TASER with BPSK and QPSK
modulation, respectively. The simulations are for a SIMO system
with B = 16 BS antennas and 16 time slots; we perform tmax = 20
iterations and use an i.i.d. flat Rayleigh block-fading channel
model. We include the performance of the SIMO detection
with both perfect receiver channel state information (CSIR) and
channel estimation (CHEST), exact SDR detection from (6),
and ML JED detection (which is computed using the algorithm
proposed in [41]). We see that TASER achieves near-optimal
performance, as it is close to a system with perfect CSIR,
and outperforms detection via SIMO CHEST, while achieving
similar performance as ML JED and exact SDR detection at
manageable complexity. We note that the trade-offs between
throughput and SNR performance behave analogously to the
massive MU-MIMO case.
B. Computational Complexity
We now compare the computational complexity of TASER
with other large-scale MIMO data-detection algorithms proposed
in the literature, namely the CGLS detector [20], the
Neumann-series detector [8], the OCD detector [21], and the
GS detector [22]. Table I shows the number of real-valued
multiplications for $t_{\max}$ iterations. We see that the complexity
of TASER (for BPSK and QPSK) and the Neumann-series
detector scales with $t_{\max}U^3$, whereas TASER is slightly more
complex; CGLS and GS both scale with $t_{\max}U^2$, whereas GS
is slightly more complex; OCD scales with $t_{\max}BU$. Evidently,
the near-ML performance of TASER comes at the cost of
high computational complexity. In contrast, CGLS, OCD, and
GS are rather inexpensive, but also perform poorly in square
systems (see the 32 × 32 results in Figure 4). We finally note
that TASER can be used for JED; the other (approximate)
linear detectors cannot be used for this application.
TABLE I
COMPUTATIONAL COMPLEXITY OF DIFFERENT DATA DETECTION
ALGORITHMS FOR MASSIVE MIMO SYSTEMS
Algorithm | Computational complexity^a
BPSK TASER | $t_{\max}(\frac{1}{3}U^3 + \frac{5}{2}U^2 + \frac{37}{6}U + 4)$
QPSK TASER | $t_{\max}(\frac{8}{3}U^3 + 10U^2 + \frac{37}{3}U + 4)$
CGLS [20] | $(t_{\max} + 1)(4U^2 + 20U)$
Neumann [8] | $(t_{\max} - 1)\,2U^3 + 2U^2 - 2U$
OCD [21] | $t_{\max}(8BU + 4U)$
GS [22] | $t_{\max}\,6U^2$
^a The complexity is measured by the number of real-valued multiplications
for $t_{\max}$ iterations. Complex-valued multiplications are assumed to require four
real-valued multiplications. All results ignore the preprocessing complexity.
TABLE II
IMPLEMENTATION RESULTS ON A XILINX VIRTEX-7
XC7VX690T FPGA FOR DIFFERENT TASER ARRAY SIZES
Array size | N = 9 | N = 17 | N = 33 | N = 65
BPSK users / time slots | 8 | 16 | 32 | 64
QPSK users / time slots | 4 | 8 | 16 | 32
Slices | 1 467 | 4 350 | 13 787 | 60 737
LUTs | 4 790 | 13 779 | 43 331 | 149 942
FFs | 2 108 | 6 857 | 24 429 | 91 829
DSP48s | 52 | 168 | 592 | 2 208
Max. clock freq. [MHz] | 232 | 225 | 208 | 111
Min. latency [clock cycles] | 16 | 24 | 40 | 72
Max. throughput [Mb/s] | 116 | 150 | 166 | 98
Power estimate^a [W] | 0.6 | 1.3 | 3.6 | 7.3
^a Statistical power estimation at max. clock freq. and 1.0 V supply voltage.
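To make the expressions in Table I concrete, they can be evaluated for a given configuration. The sketch below simply transcribes the table into plain Python; the function names are hypothetical, and exact fractions are used so that no rounding is introduced. The chosen configuration (U = 32 users, B = 32 antennas, three iterations) is only an example.

```python
from fractions import Fraction as F


def taser_bpsk(U, t_max):
    return t_max * (F(1, 3) * U**3 + F(5, 2) * U**2 + F(37, 6) * U + 4)


def taser_qpsk(U, t_max):
    return t_max * (F(8, 3) * U**3 + 10 * U**2 + F(37, 3) * U + 4)


def cgls(U, t_max):
    return (t_max + 1) * (4 * U**2 + 20 * U)


def neumann(U, t_max):
    return (t_max - 1) * 2 * U**3 + 2 * U**2 - 2 * U


def ocd(B, U, t_max):
    return t_max * (8 * B * U + 4 * U)


def gauss_seidel(U, t_max):
    return t_max * 6 * U**2


# Real-valued multiplications for an example 32 x 32 system, t_max = 3.
U, B, t_max = 32, 32, 3
counts = {
    "BPSK TASER": taser_bpsk(U, t_max),
    "QPSK TASER": taser_qpsk(U, t_max),
    "CGLS": cgls(U, t_max),
    "Neumann": neumann(U, t_max),
    "OCD": ocd(B, U, t_max),
    "GS": gauss_seidel(U, t_max),
}
for name, value in counts.items():
    print(f"{name:>10}: {int(value)}")
```

Note that the QPSK expression equals the BPSK expression evaluated at 2U, so a QPSK system with 16 users has the same count as a BPSK system with 32 users.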
C. FPGA Implementation Results
To demonstrate the effectiveness of TASER, we developed
several FPGA designs for systolic array sizes of N = 9,
N = 17, N = 33, and N = 65. The FPGA designs were
implemented using Xilinx Vivado Design Suite and optimized
for a Xilinx Virtex-7 XC7VX690T FPGA. The associated
implementation results are shown in Table II. As expected, the
resource utilization increases quadratically with the array size
N . For the N = 9 and N = 17 arrays, the critical path is in
the PEs’ MAC unit; for the N = 33 and N = 65 arrays, the
critical path is in the row broadcast multiplexers, which limits
the throughput of the N = 65 array. In Table III, we compare
TASER to the few existing large MIMO data detector designs,
namely the CGLS detector [20], the Neumann-series detector [8],
the OCD detector [21], and the GS detector [22]. All of these
detectors have been implemented on the same FPGA and for
a 128× 8 large-MIMO system. TASER achieves comparable
throughput to the CGLS and GS designs and significantly lower
latency than the Neumann-series and OCD detectors. In terms
of the hardware-efficiency (measured in terms of throughput
per FPGA LUTs), our design performs similarly to the CGLS,
Neumann, and GS designs, and is inferior to the OCD design. For the
128 × 8 massive MIMO system, all detectors achieve near-ML
performance. However, when considering the 32 × 32 large
MIMO system (see Figures 4(a) and 4(b)), TASER significantly
outperforms all of these reference designs in terms of error-rate
performance. We conclude by noting that the CGLS, Neumann,
OCD, and GS detectors are able to support 64-QAM, whereas
TASER is limited to either BPSK or QPSK. This limitation
negatively affects the throughput and hardware-efficiency of
TASER, as the throughput of the other (approximate) methods
scales linearly in the number of bits per symbol; hence, the provided
throughput and hardware-efficiency results favor the CGLS,
Neumann, OCD, and GS detectors.
TABLE III
COMPARISON OF 128 × 8 LARGE-MIMO DETECTORS ON A XILINX VIRTEX-7 XC7VX690T FPGA
Detection algorithm | TASER | TASER | CGLS [20] | Neumann [8] | OCD [21] | GS [22]
Error-rate performance | Near-ML | Near-ML | Near-MMSE | Near-MMSE | Near-MMSE | Near-MMSE
Modulation scheme | BPSK | QPSK | 64-QAM | 64-QAM | 64-QAM | 64-QAM
Preprocessing | Not included | Not included | Included | Included | Included | Included
Iterations tmax | 3 | 3 | 3 | 3 | 3 | 1
Slices | 1 467 (1.35 %) | 4 350 (4.02 %) | 1 094 (1 %) | 48 244 (44.6 %) | 13 447 (12.4 %) | n.a.
LUTs | 4 790 (1.11 %) | 13 779 (3.18 %) | 3 324 (0.76 %) | 148 797 (34.3 %) | 23 955 (5.53 %) | 18 976 (4.3 %)
FFs | 2 108 (0.24 %) | 6 857 (0.79 %) | 3 878 (0.44 %) | 161 934 (18.7 %) | 61 335 (7.08 %) | 15 864 (1.8 %)
DSP48s | 52 (1.44 %) | 168 (4.67 %) | 33 (0.9 %) | 1 016 (28.3 %) | 771 (21.5 %) | 232 (6.3 %)
BRAM18s | 0 (0 %) | 0 (0 %) | 1 (0.03 %) | 32^a (1.08 %) | 1 (0.03 %) | 12^a (0.41 %)
Clock frequency [MHz] | 232 | 225 | 412 | 317 | 263 | 309
Latency [clock cycles] | 48 | 72 | 951 | 196 | 795 | n.a.
Throughput [Mb/s] | 38 | 50 | 20 | 621 | 379 | 48
Throughput/LUTs | 7 933 | 3 629 | 6 017 | 4 173 | 15 821 | 2 530
^a These designs use BRAM36s, which are equal to two BRAM18s.
TABLE IV
ASIC IMPLEMENTATION RESULTS FOR DIFFERENT TASER ARRAY SIZES
Array size | N = 9 | N = 17 | N = 33
BPSK users / time slots | 8 | 16 | 32
QPSK users / time slots | 4 | 8 | 16
Core area [µm^2] | 149 738 | 482 677 | 1 382 318
Core density [%] | 69.86 | 68.89 | 72.89
Cell area [GE^a] | 148 264 | 471 238 | 1 427 962
Max. clock freq. [MHz] | 598 | 560 | 454
Min. latency [clock cycles] | 16 | 24 | 40
Max. throughput [Mb/s] | 298 | 374 | 363
Power estimate^b [mW] | 41 | 87 | 216
^a One gate equivalent (GE) refers to the area of a unit-sized NAND2 gate.
^b Post-place-and-route power estimation at max. clock freq. and 1.1 V.
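The FPGA throughput values in Tables II and III are consistent with a simple relation: the number of bits in one detected vector, times the clock frequency, divided by the number of clock cycles spent on that vector. This relation is inferred from the table entries and is not stated explicitly in the text; the function name and argument names below are hypothetical.

```python
def throughput_mbps(users, bits_per_symbol, f_clk_mhz, cycles_per_iter, t_max):
    """Detected bits per second (in Mb/s) for one vector every
    t_max * cycles_per_iter clock cycles."""
    bits_per_vector = users * bits_per_symbol
    return bits_per_vector * f_clk_mhz / (cycles_per_iter * t_max)


# Table II (one iteration): N = 9 and N = 33 arrays with BPSK.
n9_max = throughput_mbps(8, 1, 232, 16, t_max=1)
n33_max = throughput_mbps(32, 1, 208, 40, t_max=1)

# Table III (three iterations): 8 users with BPSK (N = 9) and QPSK (N = 17).
bpsk_8 = throughput_mbps(8, 1, 232, 16, t_max=3)
qpsk_8 = throughput_mbps(8, 2, 225, 24, t_max=3)
print(n9_max, n33_max, bpsk_8, qpsk_8)
```

The same relation shows directly why throughput drops in proportion to the number of iterations tmax, which is the trade-off plotted in Figure 5.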
D. ASIC Implementation Results
We also developed reference ASIC designs for systolic array
sizes of N = 9, N = 17 and N = 33. The ASIC designs
were implemented using Synopsys DC and IC Compiler and
optimized for a TSMC 40 nm CMOS process. The associated
implementation results are shown in Table IV. As for our FPGA
designs, the silicon area increases quadratically with the array
size N. This can be verified both visually in Figure 7 and
numerically in Table V, where we see that the unit areas
of each PE and scale unit remain nearly constant, while the
total area of the PEs increases with $N^2$. As expected, the unit
area of the $\hat{T}_{k,j}$ memories increases with N, as each one of
these memories contains a column of an N × N matrix. The
critical paths for the N = 9, N = 17, and N = 33 arrays are
within the PE’s MAC unit, the inverse square root LUT, and
the row broadcasting bus, respectively.
[Figure 7, panels: (a) N = 9; (b) N = 17; (c) N = 33]
Fig. 7. Layout of the TASER ASIC designs for N = 9, N = 17 and
N = 33 array sizes. The different modules of the design were colored in the
following way: PEs are colored in blue, memories in red, the scale units in
green, the busses of the RBUs and CBUs in purple, and the control unit in
yellow. Light and dark versions of the same color are alternated according to
the order in which the modules appear in the hardware description code.
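The quadratic growth of the PE area can be cross-checked with a simple count. The sketch below assumes one PE per lower-triangular entry of the N × N array, minus the (N, N) entry that is held in a register instead of a PE; this count is an inference, and the function name is hypothetical. The unit and total PE areas are copied from Table V.

```python
def num_processing_elements(n):
    """PEs in an N x N lower-triangular array without the (N, N) entry."""
    return n * (n + 1) // 2 - 1


# (unit area, total area) of the PEs in gate equivalents, from Table V.
pe_area = {9: (2391, 105198), 17: (2404, 365352), 33: (2086, 1168254)}
for n, (unit, total) in pe_area.items():
    implied = total / unit  # number of PEs implied by the area figures
    print(n, num_processing_elements(n), round(implied, 2))
```

Under this assumption the counts are 44, 152, and 560 PEs, which agree with the ratios of total to unit area in Table V after rounding.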
In Table VI, we compare our TASER ASIC implementation
to the Neumann-series detector in [23], which is—to the
best of our knowledge—the only ASIC design for massive
MU-MIMO systems. While the latter offers a significantly
higher throughput than our design, TASER’s reduced area
and power consumption result in superior hardware-efficiency
(measured in throughput per cell area) and power-efficiency
(measured in energy per bit). Furthermore, TASER enables
near-ML performance for massive MU-MIMO systems where
the number of users is in the same range as the number
of BS antennas (see Figures 4(a) and 4(b)). We note that
the comparison presented in Table VI is not entirely fair.
TASER does not include preprocessing circuitry, whereas the
Neumann-series detector [23] includes preprocessing circuitry
and was optimized for wideband systems that use single-carrier
frequency-division multiple access (SC-FDMA).
TABLE V
AREA BREAKDOWN FOR DIFFERENT TASER ASIC ARRAY SIZES IN GATE EQUIVALENTS (GES)
Element | N = 9 unit area | N = 9 total area | N = 17 unit area | N = 17 total area | N = 33 unit area | N = 33 total area
PEs | 2 391 (1.6 %) | 105 198 (70.9 %) | 2 404 (0.5 %) | 365 352 (77.5 %) | 2 086 (0.1 %) | 1 168 254 (81.8 %)
Scale units | 6 485 (4.4 %) | 25 941 (17.5 %) | 6 315 (1.3 %) | 50 521 (10.7 %) | 5 945 (0.4 %) | 95 125 (6.6 %)
$\hat{T}_{k,j}$ memories | 734 (0.5 %) | 5 873 (4.0 %) | 1 451 (0.3 %) | 23 220 (4.9 %) | 2 888 (0.2 %) | 92 426 (6.5 %)
Control unit | 459 (0.3 %) | 459 (0.3 %) | 728 (0.2 %) | 728 (0.2 %) | 1 259 (0.1 %) | 1 259 (0.1 %)
Miscellaneous | – | 10 793 (7.3 %) | – | 31 417 (6.7 %) | – | 70 898 (5.0 %)
TABLE VI
COMPARISON OF DATA DETECTION ASICS FOR
128 BS ANTENNAS, 8 USERS LARGE-MIMO SYSTEMS
Detection algorithm | TASER | TASER | Neumann [23]
Error-rate performance | Near-ML | Near-ML | Near-MMSE
Modulation scheme | BPSK | QPSK | 64-QAM
Preprocessing | Not included | Not included | Included
Iterations | 3 | 3 | 3
CMOS technology [nm] | 40 | 40 | 45
Supply voltage [V] | 1.1 | 1.1 | 0.81
Clock freq. [MHz] | 598 | 560 | 1 000 (1 125^a)
Throughput [Mb/s] | 99 | 125 | 1 800 (2 025^a)
Core area [mm^2] | 0.150 | 0.483 | 11.1 (8.77^a)
Core density [%] | 69.86 | 68.89 | 73.00
Cell area^b [kGE] | 142.4 | 448.0 | 12 600
Power^c [mW] | 41.25 | 87.10 | 8 000 (13 114^a)
Throughput/cell area^a [b/(s×GE)] | 695 | 279 | 161
Energy/bit^a [pJ/b] | 417 | 697 | 6 476
^a Technology scaling to 40 nm and 1.1 V assuming: $A \sim 1/\ell^2$, $t_{pd} \sim 1/\ell$,
and $P_{dyn} \sim 1/(V^2 \ell)$ [58].
^b Excluding the gate count of memories.
^c At maximum clock frequency and given supply voltage.
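The scaled values in Table VI can be reproduced with a few lines of arithmetic. The sketch below rests on an interpretation of footnote a that is an assumption: ℓ is taken as the ratio of the original to the target feature size (45/40), and V as the ratio of the original to the target supply voltage (0.81/1.1). With this reading, the parenthesized entries of the Neumann column and the efficiency figures follow from the unscaled values; the function names are hypothetical.

```python
def scale_technology(f_mhz, area_mm2, power_mw, node_nm, vdd,
                     target_nm=40.0, target_vdd=1.1):
    """Scale clock frequency, area, and power to the target technology."""
    l = node_nm / target_nm   # feature-size ratio
    v = vdd / target_vdd      # supply-voltage ratio
    return f_mhz * l, area_mm2 / l**2, power_mw / (v**2 * l)


def energy_per_bit_pj(power_mw, throughput_mbps):
    # mW / (Mb/s) equals nJ/b; multiply by 1000 to obtain pJ/b.
    return 1000.0 * power_mw / throughput_mbps


# Neumann-series detector [23]: 45 nm, 0.81 V (unscaled Table VI values).
f, area, power = scale_technology(1000.0, 11.1, 8000.0, node_nm=45.0, vdd=0.81)
throughput = 1800.0 * f / 1000.0           # throughput scales with the clock
neumann_energy = energy_per_bit_pj(power, throughput)
neumann_eff = throughput * 1e6 / 12600e3   # b/(s x GE)

# TASER (already at 40 nm and 1.1 V), BPSK and QPSK designs.
taser_energy = [energy_per_bit_pj(41.25, 99.0), energy_per_bit_pj(87.10, 125.0)]
taser_eff = [99.0e6 / 142.4e3, 125.0e6 / 448.0e3]
```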
We finally note that there exists a plethora of data-detector
ASICs for traditional, small-scale MIMO systems (see [9], [10],
[42], [47], [59]–[63] and the references therein). While most of
these designs achieve near-ML performance and/or throughputs
in the Gb/s regime in small-scale MIMO systems, their efficacy
for large MIMO is unexplored—a corresponding algorithm and
hardware-level comparison is left for future work.
VI. CONCLUSIONS
We have proposed—to the best of our knowledge—the first
data-detector implementation that uses semidefinite relaxation.
Our novel algorithm, referred to as Triangular Approximate
SEmidefinite Relaxation (TASER), is suitable for coherent data
detection in massive MU-MIMO systems, as well as joint chan-
nel estimation and data detection (JED) in large SIMO systems.
We have developed a corresponding systolic VLSI architecture
and implemented FPGA and ASIC reference designs. Our
results have shown that TASER achieves hardware efficiency
comparable to that of existing massive MU-MIMO data detectors, while
providing near-ML performance, even for systems where the
number of users is comparable to the number of BS antennas.
Hence, for systems supporting a large number of low-rate users
(e.g., 16 users or more) where BPSK and QPSK transmission is
sufficient, TASER provides a viable alternative to sub-optimal,
linear data-detection methods, or optimal but computationally
expensive non-linear methods. We also note that TASER can be
used in so-called overloaded systems, i.e., systems with more
users than BS antennas—such a scenario may be of interest in
large sensor networks or for the internet of things (IoT).
There are many avenues for future work. Traditional SDR-
based data detection only supports BPSK and QPSK trans-
mission and hard-output data detection. Extending TASER to
support higher-order modulation schemes using the methods
in [32], [33] is the subject of ongoing research. Furthermore,
developing efficient ways to compute soft-output values (in
the form of log-likelihood ratio values) within TASER is left
for future work. Finally, SDR-based data detection for JED in
MU-MIMO systems is an interesting open research topic.
ACKNOWLEDGMENTS
The authors would like to thank Prof. C. Batten for his sup-
port with the ASIC design flow. C. Studer would like to thank
Prof. W. Xu for insightful discussions on JED. O. Castañeda
would like to thank Cornell University’s CienciAmerica Program
that enabled this research project. O. Castañeda and
C. Studer were supported in part by Xilinx Inc., and by the
US National Science Foundation (NSF) under grants ECCS-
1408006 and CCF-1535897. T. Goldstein was supported in
part by the US NSF under grant CCF-1535902 and by the US
Office of Naval Research under grant N00014-15-1-2676.
REFERENCES
[1] O. Castañeda, T. Goldstein, and C. Studer, “FPGA design of approximate
semidefinite relaxation for data detection in large MIMO wireless
systems,” in Proc. IEEE Intl. Conf. on Circuits and Systems (ISCAS),
May 2016.
[2] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers
of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no.
11, pp. 3590–3600, Nov. 2010.
[3] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors,
and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges
with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp.
40–60, Jan. 2013.
[4] J. Hoydis, S. Ten Brink, and M. Debbah, “Massive MIMO in the UL/DL
of cellular networks: How many antennas do we need?,” IEEE Journal
on Selected Areas in Communications, vol. 31, no. 2, pp. 160–171, Feb.
2013.
[5] E. Larsson, O. Edfors, F. Tufvesson, and T. Marzetta, “Massive MIMO
for next generation wireless systems,” IEEE Communications Magazine,
vol. 52, no. 2, pp. 186–195, Feb. 2014.
[6] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong,
and J. C. Zhang, “What will 5G be?,” IEEE Journal on Selected Areas
in Communications, vol. 32, no. 6, pp. 1065–1082, June 2014.
[7] L. Lu, G. Li, A. Swindlehurst, A. Ashikhmin, and R. Zhang, “An
overview of massive MIMO: Benefits and challenges,” IEEE Journal of
Selected Topics in Signal Processing, vol. 8, no. 5, pp. 742–758, Oct.
2014.
[8] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer,
“Large-scale MIMO detection for 3GPP LTE: algorithms and FPGA
implementations,” IEEE J. Sel. Topics in Sig. Proc., vol. 8, no. 5, pp.
916–929, Oct. 2014.
[9] K. Wong, C. Tsui, R. Cheng, and W. Mow, “A VLSI architecture of a
K-best lattice decoding algorithm for MIMO channels,” in Proc. IEEE
International Conference on Circuits and Systems (ISCAS), May 2002,
vol. 3, pp. 273–276.
[10] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and
H. Bölcskei, “VLSI implementation of MIMO detection using the sphere
decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, no. 7, pp.
1566–1577, Jul. 2005.
[11] K. Vardhan, S. Mohammed, A. Chockalingam, and B. Rajan, “A low-
complexity detector for large MIMO systems and multicarrier CDMA
systems,” IEEE Journal on Selected Areas in Communications, vol. 26,
no. 3, pp. 473–485, Apr. 2008.
[12] H. Prabhu, J. Rodrigues, O. Edfors, and F. Rusek, “Approximative matrix
inverse computations for very-large MIMO and applications to linear
pre-coding systems,” in Proc. IEEE WCNC, 2013, pp. 2710–2715.
[13] S. Wu, L. Kuang, Z. Ni, J. Lu, D. Huang, and Q. Guo, “Low-complexity
iterative detection for large-scale multiuser MIMO-OFDM systems using
approximate message passing,” IEEE Journal of Selected Topics in Signal
Processing, vol. 8, no. 5, pp. 902–915, Oct. 2014.
[14] P. Svac, F. Meyer, E. Riegler, and F. Hlawatsch, “Soft-heuristic detectors
for large MIMO systems,” IEEE Transactions on Signal Processing, vol.
61, no. 18, pp. 4573–4586, Sep. 2013.
[15] Y. Hu, Z. Wang, X. Gaol, and J. Ning, “Low-complexity signal detection
using CG method for uplink large-scale MIMO systems,” in Proc. IEEE
ICCS, Nov 2014, pp. 477–481.
[16] B. Yin, M. Wu, J. R. Cavallaro, and C. Studer, “Conjugate gradient-based
soft-output detection and precoding in massive MIMO systems,” in Proc.
IEEE GLOBECOM, Dec 2014, pp. 4287–4292.
[17] L. Liu, C. Yuen, Y. L. Guan, Y. Li, and Y. Su, “A low-complexity
Gaussian message passing iterative detector for massive MU-MIMO
systems,” in Proc. IEEE ICICS, Dec 2015.
[18] C. Jeon, R. Ghods, A. Maleki, and C. Studer, “Optimality of large MIMO
detection via approximate message passing,” in IEEE International
Symposium on Information Theory (ISIT), June 2015, pp. 1227–1231.
[19] K. Li, B. Yin, M. Wu, J. R. Cavallaro, and C. Studer, “Accelerating
massive MIMO uplink detection on GPU for SDR systems,” in IEEE
Dallas Circuits and Systems Conference (DCAS), Oct. 2015.
[20] B. Yin, M. Wu, J. Cavallaro, and C. Studer, “VLSI Design of Large-
Scale Soft-Output MIMO Detection Using Conjugate Gradients,” in
Proc. IEEE ISCAS, May 2015, pp. 1498–1501.
[21] M. Wu, C. Dick, J. Cavallaro, and C. Studer, “FPGA design of a
coordinate-descent detector for large-MIMO,” in IEEE International
Symposium on Circuits and Systems (ISCAS), May 2016, pp. 1894–1897.
[22] Z. Wu, C. Zhang, Y. Xue, S. Xu, and Z. You, “Efficient architecture
for soft-output massive MIMO detection with Gauss-Seidel method,” in
Proc. IEEE Intl. Conf. on Circuits and Systems (ISCAS), May 2016.
[23] B. Yin, M. Wu, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, “A
3.8 Gb/s large-scale MIMO detector for 3GPP LTE-Advanced,” in Proc.
IEEE ICASSP, May 2014, pp. 3907–3911.
[24] Z.-Q. Luo, W.-k. Ma, A. M.-C. So, Y. Ye, and S. Zhang, “Semidefinite
relaxation of quadratic optimization problems,” IEEE Sig. Proc. Mag.,
vol. 27, no. 3, pp. 20–34, May 2010.
[25] J. Jaldén and B. Ottersten, “The diversity order of the semidefinite
relaxation detector,” IEEE Transactions on Information Theory, vol. 54,
no. 4, pp. 1406–1422, Apr. 2008.
[26] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding
algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1,
pp. 183–202, Jan. 2009.
[27] T. Goldstein, C. Studer, and R. G. Baraniuk, “A field guide to forward-
backward splitting with a FASTA implementation,” arXiv preprint:
1411.3406, Nov. 2014.
[28] P. H. Tan and L. K. Rasmussen, “The application of semidefinite
programming for detection in CDMA,” IEEE Journal on Selected Areas
in Communications, vol. 19, no. 8, pp. 1442–1449, Aug. 2001.
[29] W.-K. Ma, T. N. Davidson, K. M. Wong, Z.-Q. Luo, and P.-C.
Ching, “Quasi-maximum-likelihood multiuser detection using semi-
definite relaxation with application to synchronous CDMA,” IEEE
Transactions on Signal Processing, vol. 50, no. 4, pp. 912–922, Apr.
2002.
[30] B. Steingrimsson, Z.-Q. Luo, and K. M. Wong, “Soft quasi-maximum-
likelihood detection for multiple-antenna wireless channels,” IEEE Trans.
Sig. Proc., vol. 51, no. 11, pp. 2710–2719, Nov. 2003.
[31] J. Jaldén, C. Martin, and B. Ottersten, “Semidefinite programming
for detection in linear systems-optimality conditions and space-time
decoding,” in Proc. IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2003, vol. 4.
[32] A. Wiesel, Y. C. Eldar, and S. Shamai, “Semidefinite relaxation for
detection of 16-QAM signaling in MIMO channels,” IEEE Sig. Proc.
Letters, vol. 12, no. 9, pp. 653–656, Sep. 2005.
[33] N. Sidiropoulos and Z.-Q. Luo, “A semidefinite relaxation approach to
MIMO detection for high-order QAM constellations,” IEEE Sig. Proc.
Letters, vol. 13, no. 9, pp. 525–528, Sep. 2006.
[34] Y. Yang, C. Zhao, P. Zhou, and W. Xu, “MIMO detection of 16-QAM
signaling based on semidefinite relaxation,” IEEE Signal Processing
Letters, vol. 11, no. 14, pp. 797–800, 2007.
[35] W.-K. Ma, C.-C. Su, J. Jaldén, T.-H. Chang, and C.-Y. Chi, “The
equivalence of semidefinite relaxation MIMO detectors for higher-order
QAM,” IEEE Journal of Selected Topics in Signal Processing, vol. 3,
no. 6, pp. 1038–1052, 2009.
[36] H.-T. Wai, W.-K. Ma, and A. M.-C. So, “Cheap semidefinite relaxation
MIMO detection using row-by-row block coordinate descent,” in
Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2011, pp. 3256–3259.
[37] H. Vikalo, B. Hassibi, and P. Stoica, “Efficient joint maximum-likelihood
channel estimation and signal detection,” IEEE Transactions on Wireless
Communications, vol. 5, no. 7, pp. 1838–1845, 2006.
[38] P. Stoica and G. Ganesan, “Space–time block codes: Trained, blind, and
semi-blind detection,” Elsevier Digital Signal Processing, vol. 13, no. 1,
pp. 93–105, Jan. 2003.
[39] P. Stoica, H. Vikalo, and B. Hassibi, “Joint maximum-likelihood channel
estimation and signal detection for SIMO channels,” in Proc. IEEE
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2003, vol. 4, pp. IV–13.
[40] H. A. J. Alshamary, M. F. Anjum, T. Al-Naffouri, A. Zaib, and W. Xu,
“Optimal non-coherent data detection for massive SIMO wireless systems
with general constellations: A polynomial complexity solution,” arXiv
preprint:1507.02319, 2015.
[41] W. Xu, M. Stojnic, and B. Hassibi, “On exact maximum-likelihood
detection for non-coherent MIMO wireless systems: a branch-estimate-
bound optimization framework,” in Proc. IEEE International Symposium
on Information Theory (ISIT), 2008, pp. 2017–2021.
[42] C. Studer, M. Wenk, and A. Burg, “VLSI implementation of hard- and
soft-output sphere decoding for wide-band MIMO systems,” in VLSI-
SoC: Forward-Looking Trends in IC and Systems Design, pp. 128–154.
Springer, 2010.
[43] D. Gesbert, M. Shafi, D.-s. Shiu, P. J. Smith, and A. Naguib, “From
theory to practice: an overview of MIMO space-time coded wireless
systems,” IEEE Journal on Selected Areas in Communications, vol. 21,
no. 3, pp. 281–302, 2003.
[44] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless
Communications, Cambridge University Press, New York, USA, 2008.
[45] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in
lattices,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, 2002.
[46] B. M. Hochwald and S. ten Brink, “Achieving near-capacity on a
multiple-antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, pp.
389–399, March 2003.
[47] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoding:
Algorithms and VLSI implementation,” IEEE J. Sel. Areas Commun.,
vol. 26, no. 2, pp. 290–300, Feb. 2008.
[48] J. Jaldén and B. Ottersten, “On the complexity of sphere decoding in
digital communications,” IEEE Trans. Signal Process., vol. 53, no. 4,
pp. 1474–1484, Apr. 2005.
[49] D. Seethaler, J. Jaldén, C. Studer, and H. Bölcskei, “On the complexity
distribution of sphere decoding,” IEEE Trans. Inf. Theory, vol. 57, no.
9, pp. 5754–5768, Sept. 2011.
[50] F. R. Bach and M. I. Jordan, “Predictive low-rank decomposition for
kernel methods,” in Proc. 22nd International Conference on Machine
Learning (ICML), Aug. 2005, pp. 33–40.
[51] H. Harbrecht, M. Peters, and R. Schneider, “On the low-rank
approximation by the pivoted Cholesky decomposition,” Elsevier Applied Numerical
Mathematics, vol. 62, no. 4, pp. 428–440, Apr. 2012.
[52] M. Benzi, “Preconditioning techniques for large linear systems: a survey,”
Elsevier Journal of Computational Physics, vol. 182, no. 2, pp. 418–477,
Nov. 2002.
[53] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods
for semi-algebraic and tame problems: proximal algorithms, forward–
backward splitting, and regularized Gauss–Seidel methods,” Mathematical
Programming, vol. 137, no. 1-2, pp. 91–129, Feb. 2013.
[54] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for
solving semidefinite programs via low-rank factorization,” Mathematical
Programming, vol. 95, no. 2, pp. 329–357, Feb. 2003.
[55] N. Boumal, “A Riemannian low-rank method for optimization over
semidefinite matrices with block-diagonal constraints,” arXiv preprint:
1506.00575, June 2015.
[56] P. Meinerzhagen, C. Roth, and A. Burg, “Towards generic low-power
area-efficient standard cell based memory architectures,” in 53rd IEEE
Intern. Midwest Symposium on Circuits and Systems (MWSCAS), Aug.
2010, pp. 129–132.
[57] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, “K-best
MIMO detection VLSI architectures achieving up to 424 Mbps,” in
IEEE International Symposium on Circuits and Systems (ISCAS), May
2006, pp. 1151–1154.
[58] B. Razavi, Design of Analog CMOS Integrated Circuits, New York:
McGraw-Hill, 2002.
[59] C. H. Liao, T. P. Wang, and T. D. Chiueh, “A 74.8 mW soft-output
detector IC for 8×8 spatial-multiplexing MIMO communications,” IEEE
Journal of Solid-State Circuits, vol. 45, no. 2, pp. 411–421, Feb. 2010.
[60] C. H. Yang, T. H. Yu, and D. Marković, “A 5.8 mW 3GPP-LTE compliant
8×8 MIMO sphere decoder chip with soft-outputs,” in 2010 Symposium
on VLSI Circuits, June 2010, pp. 209–210.
[61] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, “A
scalable VLSI architecture for soft-input soft-output single tree-search
sphere decoding,” IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 57, no. 9, pp. 706–710, Sep. 2010.
[62] C.-F. Liao, J.-Y. Wang, and Y.-H. Huang, “A 3.1 Gb/s 8×8 sorting
reduced K-Best detector with lattice reduction and QR decomposition,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.
22, no. 12, pp. 2675–2688, Feb. 2014.
[63] C. Senning, L. Bruderer, J. Hunziker, and A. Burg, “A lattice reduction-
aided MIMO channel equalizer in 90 nm CMOS achieving 720 Mb/s,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61,
no. 6, pp. 1860–1871, June 2014.
