1-bit Massive MU-MIMO Precoding in VLSI by Castañeda, Oscar et al.
TO APPEAR IN THE IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS
1-bit Massive MU-MIMO Precoding in VLSI
Oscar Castañeda, Sven Jacobsson, Giuseppe Durisi,
Mikael Coldrey, Tom Goldstein, and Christoph Studer
Abstract—Massive multiuser (MU) multiple-input multiple-
output (MIMO) will be a core technology in fifth-generation (5G)
wireless systems as it offers significant improvements in spectral
efficiency compared to existing multi-antenna technologies. The
presence of hundreds of antenna elements at the base station (BS),
however, results in excessively high hardware costs and power
consumption, and requires high interconnect throughput between
the baseband-processing unit and the radio unit. Massive MU-
MIMO that uses low-resolution analog-to-digital and digital-to-
analog converters (DACs) has the potential to address all these
issues. In this paper, we focus on downlink precoding for massive
MU-MIMO systems with 1-bit DACs at the BS. The objective
is to design precoders that simultaneously mitigate multi-user
interference (MUI) and quantization artifacts. We propose two
nonlinear 1-bit precoding algorithms and corresponding very-
large scale integration (VLSI) designs. Our algorithms rely on
biconvex relaxation, which enables the design of efficient 1-bit pre-
coding algorithms that achieve superior error-rate performance
compared to that of linear precoding algorithms followed by
quantization. To showcase the efficacy of our algorithms, we
design VLSI architectures that enable efficient 1-bit precoding
for massive MU-MIMO systems in which hundreds of antennas
serve tens of user equipments. We present corresponding field-
programmable gate array (FPGA) reference implementations to
demonstrate that 1-bit precoding enables reliable and high-rate
downlink data transmission in practical systems.
Index Terms—Biconvex relaxation, digital-to-analog converter
(DAC), field-programmable gate array (FPGA), massive multi-
user multiple-input multiple-output (MU-MIMO), precoding,
quantization, very large-scale integration (VLSI).
I. INTRODUCTION
MASSIVE multiuser (MU) multiple-input multiple-output (MIMO) is widely believed to be a core technology in
fifth-generation (5G) wireless systems as it enables substantial
improvements in spectral efficiency and reliability compared
to traditional, small-scale MIMO technology [2]–[4]. These
advantages are a result of equipping the base station (BS)
with hundreds or thousands of antennas, which enables fine-grained
beamforming to serve tens of user equipments (UEs) in
the same time-frequency resource. However, the large number
of antenna elements and radio frequency (RF) chains at the
BS results in a significant increase in hardware complexity,
system costs, and circuit power consumption. Furthermore,
massive MU-MIMO requires high interconnect and chip
input/output (I/O) bandwidth between the baseband-processing
unit at the BS and the radio units [5], [6]. As a consequence,
a successful deployment of this technology in 5G wireless
systems requires novel design approaches that jointly reduce
system costs, power consumption, and interconnect bandwidth
without degrading the spectral efficiency and link reliability.

O. Castañeda and C. Studer are with the School of Electrical and Computer
Engineering, Cornell University, Ithaca, NY (e-mail: oc66@cornell.edu,
studer@cornell.edu; web: vip.ece.cornell.edu).
S. Jacobsson is with Ericsson Research and Chalmers University of
Technology, Gothenburg, Sweden (e-mail: sven.jacobsson@ericsson.com).
G. Durisi is with Chalmers University of Technology, Gothenburg, Sweden
(e-mail: durisi@chalmers.se).
M. Coldrey is with Ericsson Research, Gothenburg, Sweden (e-mail:
mikael.coldrey@ericsson.com).
T. Goldstein is with the Department of Computer Science, University of
Maryland, College Park, MD (e-mail: tomg@cs.umd.edu).
The C1PO algorithm implemented in this paper builds upon the 1-bit
precoding algorithm presented at the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP) [1]; in contrast to the
algorithm in [1], C1PO directly operates in the complex domain, comes with
convergence guarantees, and can be implemented efficiently in VLSI.
A MATLAB simulator for the precoders proposed in this paper is available
on GitHub: https://github.com/quantizedmassivemimo/1bit_precoding_VLSI
A. Massive MU-MIMO with 1-bit DACs
We consider the massive MU-MIMO downlink in which the
BS is equipped with 1-bit digital-to-analog converters (DACs)
and transmits data to multiple UEs in the same time-frequency
resource. In traditional multi-antenna BSs, each RF port is
connected to a pair of high-resolution DACs (e.g., with 10-bit
precision). Scaling such architectures to massive MIMO
BSs with hundreds or thousands of antennas would result
in prohibitively high power consumption and system costs.
The deployment of 1-bit DACs at the BS would mitigate this
problem. In addition, the use of 1-bit DACs enables one to
lower the linearity and noise requirements of the surrounding
RF circuitry, which has the potential to additionally reduce
the circuit power consumption. Another benefit of using 1-bit
DACs is the fact that lowering their resolution also reduces the
interconnect bandwidth between the baseband-processing unit
and the radio unit, as only one bit per sample is required by
each DAC. This aspect is of practical relevance for deployment
scenarios in which these two units are not co-located [5], [6].
The key challenges of 1-bit massive MU-MIMO systems are
to maintain high spectral efficiency and reliability. The work
in [7] demonstrates that the performance degradation caused
by 1-bit DACs in the downlink diminishes as the number of
BS antennas increases. Furthermore, as shown in [1], [7]–[10],
the use of 1-bit DACs in the downlink enables reliable data
transmission if sophisticated precoding algorithms that simulta-
neously mitigate multi-user interference (MUI) and quantization
artifacts are used. While conventional linear precoding methods,
such as zero-forcing (ZF) or minimum mean-squared error
(MMSE) precoding followed by quantization, require low com-
putational complexity [11]–[14], more sophisticated, nonlinear
methods are necessary to enable reliable communication at high
spectral efficiency. Such precoding methods, however, typically
require high computational complexity. As a consequence, a
successful deployment of 1-bit massive MU-MIMO calls for
the design of novel and efficient precoding algorithms that
can be implemented in hardware and reliably achieve high
throughput at low power consumption.
arXiv:1702.03449v2 [cs.IT] 7 Nov 2017
B. Contributions
In this paper, we develop novel, computationally efficient
precoding algorithms for 1-bit massive MU-MIMO systems
and corresponding very-large-scale integration (VLSI) designs.
Our main contributions can be summarized as follows:
• We use biconvex relaxation (BCR) [15] to design a non-
linear 1-bit precoding algorithm. Our algorithm, referred
to as C1PO (short for biConvex 1-bit PrecOding), enables
reliable, high-rate downlink transmission in 1-bit massive
MU-MIMO systems for medium-sized antenna arrays.
• We propose a scalable and low-complexity algorithm vari-
ant, referred to as C2PO, which enables high-performance
1-bit precoding for massive MU-MIMO systems with
hundreds or thousands of antenna elements.
• For C1PO and C2PO, which both solve nonconvex
problems, we provide analytical convergence guarantees.
• We develop two massively parallel VLSI architectures that
implement C1PO and C2PO, and achieve high throughputs
in a hardware-efficient manner. Our architectures support
various BS and UE antenna configurations.
• We present reference designs on a Xilinx Virtex-7 field-
programmable gate array (FPGA) for various antenna
configurations that demonstrate the efficacy of our algo-
rithms and VLSI architectures.
• We compare our designs to a baseline precoder that uses
maximum ratio transmission (MRT) followed by quanti-
zation (MRT-Q), a method that achieves high hardware-
efficiency at the cost of poor error-rate performance.
• We study the trade-offs between error-rate performance
and hardware efficiency (in terms of throughput per area)
for the proposed FPGA designs.
Our results demonstrate the practical feasibility of 1-bit
precoding in massive MU-MIMO systems, supporting reliable
and high-rate downlink data transmission.
C. Relevant Prior Art
A number of papers have studied the use of low-resolution
analog-to-digital converters (ADCs) for the massive MU-MIMO
uplink (UEs transmit data to the BS) with particular focus on
the 1-bit case; see, e.g., [16]–[20] and the references therein.
All these results have shown that the use of 1-bit ADCs is
sufficient for reliable low-rate uplink transmission and that 4
to 6 bits are sufficient to close the gap to the infinite-precision
case in most scenarios. In contrast to the uplink, the quantized
downlink has gained attention only recently. Precoding in the
downlink with 1-bit DACs is a more challenging problem
as both MUI and quantization artifacts must be mitigated
simultaneously. The results in [11]–[14] have shown that so-
called linear-quantized precoders, which perform traditional
linear precoding followed by quantization, enable reliable
downlink transmission for very large BS antenna arrays in
the high signal-to-noise ratio (SNR) regime, even for systems
that use 1-bit DACs. More sophisticated, nonlinear precoding
algorithms have been proposed only recently in [1], [7]–[10]
and significantly outperform linear-quantized methods in the
presence of 1-bit DACs. The computational complexity of
these algorithms, however, is typically high, which prevents
an efficient implementation in practical systems. In contrast to
these precoding methods, we propose two novel nonlinear
precoding algorithms and VLSI designs that achieve high
throughput in a hardware-efficient manner.
While a large number of VLSI designs for data detection
in the massive MU-MIMO uplink have been proposed in the
literature (see, e.g., [21]–[24] and references therein), only
a handful of precoder designs for multi-antenna downlink
systems exist [5], [25]–[28]. Reference [25] proposes a VLSI
design for vector-perturbation precoding in small-scale MIMO
systems with high-precision DACs. The papers [26] and [27]
discuss hardware implementations for approximate linear
and ZF/MRT-based precoding, respectively, for massive MU-MIMO
systems with high-precision DACs. Unfortunately,
neither of these publications provides detailed FPGA
implementation results. Reference [28] describes an application-specific
integrated circuit (ASIC) design of a ZF precoder;
reference [5] presents a decentralized ZF precoder on a graphics
processing unit (GPU) cluster. Both of these precoders are,
however, designed for high-precision DACs and not for 1-
bit massive MU-MIMO systems. Hence, to the best of our
knowledge, the VLSI designs proposed in this paper are the
first hardware implementations reported in the open literature
that are suitable for precoding in the 1-bit massive MU-MIMO
downlink.
D. Notation
Lowercase and uppercase boldface letters designate column
vectors and matrices, respectively. For a matrix $\mathbf{A}$, we denote its
transpose, Hermitian transpose, complex conjugate, and matrix
$\ell_2$-norm by $\mathbf{A}^T$, $\mathbf{A}^H$, $\mathbf{A}^*$, and $\|\mathbf{A}\|_{2,2}$, respectively; the entry
on the $k$th row and on the $\ell$th column of $\mathbf{A}$ is $[\mathbf{A}]_{k,\ell}$. The $M \times M$
identity matrix is denoted by $\mathbf{I}_M$ and the $M \times N$ all-zeros
matrix is denoted by $\mathbf{0}_{M \times N}$. For a vector $\mathbf{a}$, the $k$th entry is $[\mathbf{a}]_k$
and we use $\|\mathbf{a}\|_2$ to denote the $\ell_2$-norm of the vector $\mathbf{a}$. The real
and imaginary parts of a complex vector $\mathbf{a}$ are $\Re\{\mathbf{a}\}$ and $\Im\{\mathbf{a}\}$,
respectively. The signum function $\mathrm{sgn}(\cdot)$ is defined as $\mathrm{sgn}(a) = +1$
for $a \geq 0$ and $\mathrm{sgn}(a) = -1$ for $a < 0$ and is applied entry-wise
to vectors. The multivariate complex-valued circularly-symmetric
Gaussian probability density function (PDF) with
covariance matrix $\mathbf{K}$ is denoted by $\mathcal{CN}(\mathbf{0}, \mathbf{K})$. We use $\mathbb{E}_{\mathbf{x}}[\cdot]$
to denote expectation with respect to the random vector $\mathbf{x}$.
E. Paper Outline
The rest of this paper is organized as follows. In Section II,
we introduce the system model and formulate the precoding
problem for systems with 1-bit DACs. In Section III, we
propose two new 1-bit precoding algorithms, namely C1PO
and C2PO. In Section IV and Section V, we detail our
VLSI architectures for C1PO and C2PO, respectively. In
Section VI, we show numerical simulations, reference FPGA
implementation results, and a comparison with an MRT-based
baseline precoder. We conclude the paper in Section VII. All
proofs are relegated to Appendices A and B.
O. CASTAÑEDA ET AL. 3
[Figure 1: block diagram with the labels "map." (symbol mappers), "1-bit precoder", "1b-DAC", "RF", "frequency-flat wireless channel", and "det." (detectors).]
Fig. 1. Overview of an uncoded massive MU-MIMO downlink system with 1-bit DACs. Left: B-antenna massive MU-MIMO BS containing a 1-bit precoder
that mitigates multi-user interference and quantization artifacts in the 1-bit DACs; Right: U single-antenna UEs.
II. SYSTEM MODEL AND 1-BIT PRECODING
We start by introducing the downlink system model and then
provide the necessary details about optimal precoding in 1-bit
massive MU-MIMO systems.
A. Downlink System Model
We focus on the downlink of a single-cell, narrowband
massive MU-MIMO system as illustrated in Figure 1. The
system consists of a $B$-antenna BS that serves $U \leq B$ single-antenna¹
UEs simultaneously and in the same frequency band.
We use the standard input-output relation $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$ to
model the narrowband downlink channel [2]. Here, the vector
$\mathbf{y} = [y_1, \ldots, y_U]^T$ contains the received signals at all UEs,
where $y_u \in \mathbb{C}$ is the signal received at the $u$th UE. The
matrix $\mathbf{H} \in \mathbb{C}^{U \times B}$ represents the downlink channel. The
so-called precoded vector is denoted by $\mathbf{x} \in \mathcal{X}^B$, where $\mathcal{X}$
represents the transmit alphabet; this set coincides with the
set $\mathbb{C}$ of complex numbers in the case of infinite-precision
DACs. In 1-bit massive MU-MIMO systems, the in-phase
and quadrature components are generated separately using a
pair of 1-bit DACs running at Nyquist rate and hence, the
per-antenna quaternary transmit alphabet is $\mathcal{X} = \{\pm\ell \pm j\ell\}$
for a given (and fixed) $\ell > 0$ that determines the transmit
power. The vector $\mathbf{n} \in \mathbb{C}^U$ models i.i.d. circularly-symmetric
complex Gaussian noise with variance $N_0$ per complex entry,
i.e., $n_u \sim \mathcal{CN}(0, N_0)$, for $u = 1, \ldots, U$. In what follows, we
assume that the realization of the channel matrix $\mathbf{H}$ and the
noise variance $N_0$ are perfectly known at the BS.²
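The input-output relation above can be illustrated with a minimal simulation sketch. The dimensions, the seed, the value of the amplitude `ell`, and the choice of i.i.d. Gaussian channel entries are assumptions made only for this example; the function names are hypothetical.

```python
import math
import random

def cgauss(rng, var):
    """Draw one circularly-symmetric complex Gaussian sample with variance var."""
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def downlink(H, x, N0, rng):
    """Return y = Hx + n with i.i.d. CN(0, N0) noise entries."""
    return [yu + cgauss(rng, N0) for yu in matvec(H, x)]

rng = random.Random(0)                 # fixed seed, illustrative only
U, B, ell, N0 = 2, 8, 0.25, 0.01       # assumed toy parameters
# Quaternary per-antenna alphabet {+-ell +- j*ell}
alphabet = [complex(a * ell, b * ell) for a in (1, -1) for b in (1, -1)]
# Assumed channel model for the example: i.i.d. CN(0, 1) entries
H = [[cgauss(rng, 1.0) for _ in range(B)] for _ in range(U)]
x = [rng.choice(alphabet) for _ in range(B)]   # arbitrary 1-bit transmit vector
y = downlink(H, x, N0, rng)
power = sum(abs(xb) ** 2 for xb in x)          # equals 2 * B * ell**2 for any such x
```

Every vector drawn from the quaternary alphabet has the same squared norm, which is why the amplitude alone fixes the transmit power.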
B. Precoding Basics
The main purpose of precoding is to transmit the constellation
points $s_u \in \mathcal{O}$ to each UE $u = 1, \ldots, U$, where $\mathcal{O}$
is the constellation set (e.g., 16-QAM). The BS uses the
available channel state information (CSI) to precode the symbol
vector $\mathbf{s} = [s_1, \ldots, s_U]^T$ into the precoded vector $\mathbf{x} \in \mathcal{X}^B$.
Throughout the paper, we assume that the precoded vector $\mathbf{x}$
must satisfy an instantaneous power constraint $\|\mathbf{x}\|_2^2 = P$; this
leads to $\mathcal{X} = \{\pm\ell \pm j\ell\}$ with $\ell = \sqrt{P/(2B)}$.

¹For simplicity, we focus on single-antenna UEs; the model can easily be
expanded to support multi-antenna UEs.
²Knowledge of $\mathbf{H}$ is typically acquired via training in the uplink in a time-division
duplexing system [2]. As discussed in [7], channel estimation errors
yield only a small performance loss. Knowledge of the noise variance $N_0$ at
the BS can be obtained by explicit feedback from the UEs to the BS.
Coherent transmission of data using multiple BS antennas
leads to an array gain, which depends on the realization of the
fading channel and the precoding method. As in [7], [8], we
assume that the $u$th UE is able to rescale its received signals $y_u$
by a factor³ $\beta_u \in \mathbb{C}$ in order to compute an estimate $\hat{s}_u = \beta_u y_u$
for $u = 1, \ldots, U$ of the transmitted symbol $s_u \in \mathcal{O}$.
Since the UEs cannot perform joint processing to recover the
transmitted data, precoding must simultaneously reduce MUI
and increase signal power at all UEs [29]. To accomplish these
goals, there exist multiple formulations of this optimization
problem based on different performance metrics, e.g., sum-rate
throughput or error-rate (see [30] for a survey). As in [7],
[8], we will focus exclusively on precoders that minimize
the mean-squared error (MSE) between the estimated symbol
vector $\hat{\mathbf{s}} = [\hat{s}_1, \ldots, \hat{s}_U]^T = \beta\mathbf{y}$ and the transmitted symbol
vector $\mathbf{s}$ given by
\[ \mathbb{E}_{\mathbf{n}}\big[\|\mathbf{s} - \hat{\mathbf{s}}\|_2^2\big] = \|\mathbf{s} - \beta\mathbf{H}\mathbf{x}\|_2^2 + |\beta|^2 U N_0, \tag{1} \]
where we restrict ourselves to the case in which the precoder
results in the same precoding factor $\beta$ for all UEs. Hence, in
the remainder of this paper we shall assume that $\beta_u = \beta$ for
$u = 1, \ldots, U$. With this assumption, the MSE after precoding
will roughly be the same for all UEs, which guarantees a
certain degree of fairness among the UEs; see [7] for more
details. In [8] it is shown that the UEs are able to accurately
estimate the precoding factor $\beta$ using pilot-based transmission
in block-fading scenarios.
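For completeness, the MSE expression in (1) can be obtained in a few lines. The sketch below only uses the model $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$, the common scaling $\hat{\mathbf{s}} = \beta\mathbf{y}$, and the noise statistics $\mathbb{E}_{\mathbf{n}}[\mathbf{n}] = \mathbf{0}$ and $\mathbb{E}_{\mathbf{n}}[\|\mathbf{n}\|_2^2] = U N_0$, which follow from $n_u \sim \mathcal{CN}(0, N_0)$:

```latex
\begin{align*}
\mathbb{E}_{\mathbf{n}}\big[\|\mathbf{s} - \hat{\mathbf{s}}\|_2^2\big]
 &= \mathbb{E}_{\mathbf{n}}\big[\|\mathbf{s} - \beta\mathbf{H}\mathbf{x} - \beta\mathbf{n}\|_2^2\big] \\
 &= \|\mathbf{s} - \beta\mathbf{H}\mathbf{x}\|_2^2
    - 2\,\Re\big\{\beta\,(\mathbf{s} - \beta\mathbf{H}\mathbf{x})^H\,\mathbb{E}_{\mathbf{n}}[\mathbf{n}]\big\}
    + |\beta|^2\,\mathbb{E}_{\mathbf{n}}\big[\|\mathbf{n}\|_2^2\big] \\
 &= \|\mathbf{s} - \beta\mathbf{H}\mathbf{x}\|_2^2 + |\beta|^2 U N_0 .
\end{align*}
```

The cross term vanishes because the noise is zero-mean, so the noise contributes only through the scaled noise power $|\beta|^2 U N_0$.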
In the infinite-precision case, an MSE-optimal linear precoder
multiplies the symbol vector $\mathbf{s}$ with a precoding matrix $\mathbf{P} \in \mathbb{C}^{B \times U}$
so that (1) is minimized on average over all possible
transmit vectors $\mathbf{s}$ subject to the power constraint. This
problem, which has been studied extensively for the case of
infinite-precision DACs [31], [32], enables the design of low-complexity
linear precoding algorithms [2].

³In contrast to references [7], [8], which assumed real-valued factors $\beta_u$,
$u = 1, \ldots, U$, we allow these factors to be complex-valued.
C. MSE-Optimal 1-bit Precoding Problem
In the 1-bit case, linear-quantized precoders first perform
linear precoding and then quantize the result to the finite
transmit set $\mathcal{X}^B$ as
\[ \mathbf{x} = \sqrt{\frac{P}{2B}} \Big( \mathrm{sgn}(\Re\{\mathbf{P}\mathbf{s}\}) + j\,\mathrm{sgn}(\Im\{\mathbf{P}\mathbf{s}\}) \Big) \]
for a given precoding matrix $\mathbf{P}$. Linear-quantized precoders
can be analyzed theoretically and typically exhibit low complexity [7].
However, as recently shown in [1], [7]–[10],
significant performance improvements can be obtained by using
sophisticated nonlinear precoding methods.
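As an illustration of the linear-quantized rule above, the following sketch uses the MRT choice $\mathbf{P} = \mathbf{H}^H$ (any positive scaling of $\mathbf{P}$ leaves the signs unchanged). The channel and symbol values are made up for the example, and the function names are hypothetical.

```python
import math

def sgn(a):
    """Signum as defined in the Notation section: +1 for a >= 0, -1 otherwise."""
    return 1.0 if a >= 0 else -1.0

def quantize(v, power):
    """Map each entry of v to {+-ell +- j*ell} with ell = sqrt(power / (2B))."""
    ell = math.sqrt(power / (2 * len(v)))
    return [complex(ell * sgn(e.real), ell * sgn(e.imag)) for e in v]

def mrt_q(H, s, power):
    """Linear-quantized precoding with the MRT matrix P = H^H."""
    U, B = len(H), len(H[0])
    # Entry b of H^H s is the sum over users of conj(H[u][b]) * s[u]
    Ps = [sum(H[u][b].conjugate() * s[u] for u in range(U)) for b in range(B)]
    return quantize(Ps, power)

# Toy example: U = 2 UEs, B = 4 BS antennas, QPSK symbols (all values made up)
H = [[1 + 1j, 0.5 - 1j, -1 + 0.2j, 0.3 + 0.7j],
     [0.2 - 0.4j, -1 - 1j, 0.6 + 0.1j, 1 - 0.5j]]
s = [1 + 1j, -1 + 1j]
x = mrt_q(H, s, power=1.0)
```

Replacing `Ps` by the output of another linear precoder (e.g., ZF) gives the corresponding quantized variant; only the sign pattern of the linear output is kept.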
One way to design such nonlinear precoders is to solve the
following MSE-optimal 1-bit precoding problem (OPP), which
simultaneously finds the optimal precoding vector $\mathbf{x}^{\text{OPP}}$ and
the associated precoding factor $\beta^{\text{OPP}}$:
\[ \text{(OPP)} \qquad \underset{\mathbf{x} \in \mathcal{X}^B,\; \beta \in \mathbb{C}}{\text{minimize}} \;\; \|\mathbf{s} - \beta\mathbf{H}\mathbf{x}\|_2^2 + |\beta|^2 U N_0. \]
We emphasize that for a fixed value of $\beta$, the problem (OPP)
is a closest vector problem that is known to be NP-hard [33]–[35];
this implies that there exists no known algorithm to solve
it efficiently for large values of $B$. In [7], [8], approximate
methods for solving (OPP) using convex relaxation have been
proposed, such as the squared-infinity norm Douglas-Rachford
splitting (SQUID) algorithm. Such relaxation-based methods,
however, still require high computational complexity, which
prevents their deployment in practical systems.
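To make the combinatorial nature of (OPP) concrete, the sketch below enumerates all $4^B$ candidate vectors for a toy system. It is not a method from the paper. For each candidate, $\beta$ is set to the least-squares minimizer $\beta = (\mathbf{H}\mathbf{x})^H\mathbf{s} / (\|\mathbf{H}\mathbf{x}\|_2^2 + U N_0)$, a standard closed form derived here for the example rather than taken from the text. All numerical values and function names are assumptions.

```python
import itertools
import math

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def opp_cost(H, s, x, N0):
    """Objective of (OPP) for a given x, with beta set to its minimizer."""
    u = matvec(H, x)
    U = len(s)
    num = sum(ui.conjugate() * si for ui, si in zip(u, s))   # (Hx)^H s
    den = sum(abs(ui) ** 2 for ui in u) + U * N0             # ||Hx||^2 + U*N0 > 0
    beta = num / den
    err = sum(abs(si - beta * ui) ** 2 for si, ui in zip(s, u))
    return err + abs(beta) ** 2 * U * N0, beta

def brute_force_opp(H, s, power, N0):
    """Enumerate all 4^B candidates; only feasible for very small B."""
    B = len(H[0])
    ell = math.sqrt(power / (2 * B))
    points = [complex(a * ell, b * ell) for a in (1, -1) for b in (1, -1)]
    best = None
    for cand in itertools.product(points, repeat=B):
        cost, beta = opp_cost(H, s, cand, N0)
        if best is None or cost < best[0]:
            best = (cost, list(cand), beta)
    return best

# Toy example with B = 4, i.e., 4^4 = 256 candidates (all values made up)
H = [[1 + 1j, 0.5 - 1j, -1 + 0.2j, 0.3 + 0.7j],
     [0.2 - 0.4j, -1 - 1j, 0.6 + 0.1j, 1 - 0.5j]]
s = [1 + 1j, -1 + 1j]
cost, x_opt, beta_opt = brute_force_opp(H, s, power=1.0, N0=0.1)
```

The candidate count grows as $4^B$, so already $B = 32$ would require about $1.8 \times 10^{19}$ evaluations, which is why exhaustive search is ruled out for massive MU-MIMO.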
III. 1-BIT PRECODING VIA BICONVEX RELAXATION
Since the problem (OPP) is of combinatorial nature, a brute-
force search for a solution is intractable in massive MU-MIMO
systems with hundreds of BS antennas. We next propose two
nonlinear precoding algorithms that yield approximate but
accurate solutions at low computational complexity.
A. Approximating (OPP)
To solve (OPP) efficiently, we use the BCR framework
put forward in [15], which was initially proposed for solving
large semidefinite programs that appear in computer vision. In
order to use this framework, we first simplify the objective
function of (OPP) by assuming that $N_0 \to 0$, i.e., we assume
that the system operates in the high-SNR regime. Note that
we make this assumption solely for the purpose of deriving
computationally efficient algorithms; we show in Section VI-A
that our algorithms also work well in the low-SNR regime. As
in [1, Eq. 3], we then use the approximation
\[ \min_{\mathbf{x} \in \mathcal{X}^B} \min_{\beta \in \mathbb{C}} \|\mathbf{s} - \beta\mathbf{H}\mathbf{x}\|_2^2 \approx \min_{\mathbf{x} \in \mathcal{X}^B} \min_{\alpha \in \mathbb{C}} \|\alpha\mathbf{s} - \mathbf{H}\mathbf{x}\|_2^2. \tag{2} \]
This approximation can be justified by noting that if we can
find a precoded vector $\mathbf{x} \in \mathcal{X}^B$ for which $\mathbf{s} = \mathbf{H}\mathbf{x}$, then both
problems in (2) are indeed equivalent. These approximations
allow us to rewrite (OPP) as follows:
\[ \text{(OPP}^{*}\text{)} \qquad \hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathcal{X}^B,\; \alpha \in \mathbb{C}}{\arg\min} \;\; \|\alpha\mathbf{s} - \mathbf{H}\mathbf{x}\|_2^2. \]
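We next eliminate the parameter $\alpha$ in (OPP$^{*}$). For a fixed $\mathbf{x}$,
the optimal parameter $\hat{\alpha}(\mathbf{x})$ that minimizes the objective
function of (OPP$^{*}$) is given by
\[ \hat{\alpha}(\mathbf{x}) = \arg\min_{\alpha \in \mathbb{C}} \|\alpha\mathbf{s} - \mathbf{H}\mathbf{x}\|_2^2 = \frac{\mathbf{s}^H\mathbf{H}\mathbf{x}}{\|\mathbf{s}\|_2^2}. \]
By inserting $\hat{\alpha}(\mathbf{x})$ into the objective function of (OPP$^{*}$), we
obtain
\[ \|\hat{\alpha}(\mathbf{x})\mathbf{s} - \mathbf{H}\mathbf{x}\|_2^2 = \|\mathbf{A}\mathbf{x}\|_2^2 \tag{3} \]
with
\[ \mathbf{A} = \mathbf{Q}\mathbf{H} \quad \text{and} \quad \mathbf{Q} = \mathbf{I}_U - \frac{\mathbf{s}\mathbf{s}^H}{\|\mathbf{s}\|_2^2}, \tag{4} \]
where the matrix $\mathbf{Q} \in \mathbb{C}^{U \times U}$ is a projection onto the orthogonal
complement of the space spanned by the symbol vector $\mathbf{s}$.
Using (3), the problem (OPP$^{*}$) can be simplified to
\[ \text{(OPP}^{**}\text{)} \qquad \hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathcal{X}^B}{\arg\min} \;\; \|\mathbf{A}\mathbf{x}\|_2^2, \]
which remains a closest vector problem. Nevertheless, the
specific form of (OPP$^{**}$) enables us to use BCR to efficiently
compute approximate but accurate solutions.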
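The step from the optimal $\hat{\alpha}(\mathbf{x})$ to the identity (3) can be spelled out as follows. The first line is the stationarity condition of the least-squares problem in $\alpha$; the second substitutes the result and uses the definitions in (4):

```latex
\begin{align*}
\mathbf{s}^H\big(\alpha\mathbf{s} - \mathbf{H}\mathbf{x}\big) = 0
 \;&\Longrightarrow\;
 \hat{\alpha}(\mathbf{x}) = \frac{\mathbf{s}^H\mathbf{H}\mathbf{x}}{\|\mathbf{s}\|_2^2}, \\
\hat{\alpha}(\mathbf{x})\,\mathbf{s} - \mathbf{H}\mathbf{x}
 &= \frac{\mathbf{s}\mathbf{s}^H}{\|\mathbf{s}\|_2^2}\mathbf{H}\mathbf{x} - \mathbf{H}\mathbf{x}
  = -\Big(\mathbf{I}_U - \frac{\mathbf{s}\mathbf{s}^H}{\|\mathbf{s}\|_2^2}\Big)\mathbf{H}\mathbf{x}
  = -\mathbf{Q}\mathbf{H}\mathbf{x}
  = -\mathbf{A}\mathbf{x}.
\end{align*}
```

Taking the squared $\ell_2$-norm of both sides removes the sign and yields $\|\hat{\alpha}(\mathbf{x})\mathbf{s} - \mathbf{H}\mathbf{x}\|_2^2 = \|\mathbf{A}\mathbf{x}\|_2^2$.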
B. Biconvex Relaxation (BCR)
To solve (OPP$^{**}$) using BCR, we first introduce a copy $\mathbf{z}$
of the vector $\mathbf{x}$, and replace (OPP$^{**}$) with the approximation
\[ \hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathcal{X}^B,\; \mathbf{z} \in \mathbb{C}^B}{\arg\min} \;\; \|\mathbf{A}\mathbf{z}\|_2^2 + \gamma\|\mathbf{z} - \mathbf{x}\|_2^2, \]
where $\gamma > 0$ is a (fixed) regularization parameter. We next
relax the nonconvex alphabet constraint $\mathbf{x} \in \mathcal{X}^B$ to its convex
envelope given by
\[ \mathcal{B}^B = \Big\{ \mathbf{c} \in \mathbb{C}^B \;\Big|\; |\Re\{c_b\}| \leq \sqrt{\tfrac{P}{2B}},\; |\Im\{c_b\}| \leq \sqrt{\tfrac{P}{2B}},\; b = 1, \ldots, B \Big\}. \tag{5} \]
This relaxation allows us to convexify the precoding problem
as follows:
\[ \hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathcal{B}^B,\; \mathbf{z} \in \mathbb{C}^B}{\arg\min} \;\; \|\mathbf{A}\mathbf{z}\|_2^2 + \gamma\|\mathbf{z} - \mathbf{x}\|_2^2, \]
which enables the design of algorithms that converge quickly.
Unfortunately, solving this optimization problem yields, in
general, the all-zeros vector, i.e., $\mathbf{x} = \mathbf{0}_{B \times 1}$. One of the key
ideas of BCR is to force the solution of this new problem
to satisfy the constraints in (5) with equality. This can be
accomplished by including a nonconvex regularization term in
the objective that promotes large values of $\mathbf{x}$. As suggested
in [15], we use a negative $\ell_2$-norm term to obtain the following
biconvex relaxation optimization problem:
\[ \text{(BCR}^{*}\text{)} \qquad \hat{\mathbf{x}}^{\text{BCR}} = \underset{\mathbf{x} \in \mathcal{B}^B,\; \mathbf{z} \in \mathbb{C}^B}{\arg\min} \;\; \|\mathbf{A}\mathbf{z}\|_2^2 + \gamma\|\mathbf{z} - \mathbf{x}\|_2^2 - \delta\|\mathbf{x}\|_2^2, \]
where $\delta > 0$ is a (fixed) regularization parameter. If $\delta < \gamma$, then
the formulation (BCR$^{*}$) is biconvex (i.e., the minimization with
respect to $\mathbf{x}$ is convex when $\mathbf{z}$ is fixed, and vice versa). Robust
parameter choices are $\gamma = \|\mathbf{A}^H\mathbf{A}\|_{2,2}$ and $\gamma/\delta = 2$; see [15]
for more details. In practice, we use numerical simulations to
tune the parameters $\gamma$ and $\delta$ to further improve the empirical
performance of our algorithms.
C. C1PO: biConvex 1-bit PrecOding
We have noted above that the (BCR$^{*}$) problem is biconvex,
meaning that minimization with respect to $\mathbf{x}$ alone (with $\mathbf{z}$ fixed)
or $\mathbf{z}$ alone (with $\mathbf{x}$ fixed) is convex. Hence, as done in [15], we
can solve the (BCR$^{*}$) problem approximately using alternating
minimization. Since the problem is nonconvex, initialization
critically affects the performance of our algorithm. We initialize
our algorithm with the MRT precoded vector $\mathbf{x}^{(1)} = \mathbf{H}^H\mathbf{s}$,
which yields excellent performance in practice and can be
computed efficiently. Then, we solve for $\mathbf{z}$ while holding $\mathbf{x}$
fixed; afterwards, we solve for $\mathbf{x}$ while holding $\mathbf{z}$ fixed.
Specifically, we repeat the following procedure:
\[ \mathbf{z}^{(t+1)} = \underset{\mathbf{z} \in \mathbb{C}^B}{\arg\min} \;\; \|\mathbf{A}\mathbf{z}\|_2^2 + \gamma\|\mathbf{z} - \mathbf{x}^{(t)}\|_2^2 \]
\[ \mathbf{x}^{(t+1)} = \underset{\mathbf{x} \in \mathcal{B}^B}{\arg\min} \;\; \gamma\|\mathbf{z}^{(t+1)} - \mathbf{x}\|_2^2 - \delta\|\mathbf{x}\|_2^2 \]
for $t = 1, 2, \ldots, t_{\max}$, where $t_{\max}$ is the maximum number of
iterations. Both steps are convex optimization problems that
can be solved efficiently in closed form. Hence, the above
iterative procedure reduces to the following simple algorithm,
which we call C1PO (short for biConvex 1-bit PrecOding).
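The two closed forms can be verified directly. In the sketch below, $w$ denotes the real or imaginary part of one entry of $\mathbf{z}^{(t+1)}$, $v$ the corresponding part of the entry of $\mathbf{x}$, and $\ell = \sqrt{P/(2B)}$; the symbols $v$ and $w$ are introduced only for this derivation.

```latex
\begin{align*}
\mathbf{A}^H\mathbf{A}\mathbf{z} + \gamma\big(\mathbf{z} - \mathbf{x}^{(t)}\big) = \mathbf{0}
 \;&\Longrightarrow\;
 \mathbf{z}^{(t+1)} = \big(\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A}\big)^{-1}\mathbf{x}^{(t)}, \\
\min_{|v| \leq \ell}\; \gamma (w - v)^2 - \delta v^2
 \;&\Longrightarrow\;
 v = \mathrm{sgn}(w)\,\min\Big\{\frac{\gamma}{\gamma - \delta}\,|w|,\; \ell\Big\}.
\end{align*}
```

The first line sets the gradient of the $\mathbf{z}$-subproblem to zero. The second uses that the $\mathbf{x}$-subproblem separates into scalar problems that are convex for $\delta < \gamma$, so the unconstrained minimizer $\gamma w / (\gamma - \delta)$ only needs to be clipped to $[-\ell, \ell]$.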
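Algorithm 1 (C1PO). Set $\mathbf{A}$ as in (4), initialize
$\mathbf{x}^{(1)} = \mathbf{H}^H\mathbf{s}$, and fix the parameters $\delta$ and $\gamma$ so that
$0 < \delta < \gamma$. Then, for every iteration $t = 1, 2, \ldots, t_{\max}$,
compute
\[ \mathbf{z}^{(t+1)} = (\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A})^{-1}\mathbf{x}^{(t)} \tag{6} \]
\[ \mathbf{x}^{(t+1)} = \mathrm{proj}(\mathbf{z}^{(t+1)}). \tag{7} \]
Here, the expansion-reprojection operator $\mathrm{proj}(\cdot)$ is
\[ \mathrm{proj}(z) = \mathrm{sgn}(\Re\{z\}) \min\Big\{ \frac{\gamma}{\gamma - \delta}|\Re\{z\}|,\; \sqrt{\frac{P}{2B}} \Big\} + j\,\mathrm{sgn}(\Im\{z\}) \min\Big\{ \frac{\gamma}{\gamma - \delta}|\Im\{z\}|,\; \sqrt{\frac{P}{2B}} \Big\} \]
and is applied element-wise to the vector $\mathbf{z}^{(t+1)}$. In
the last iteration $t_{\max}$, the output $\mathbf{x}^{(t_{\max}+1)}$ of C1PO is
quantized to the quaternary alphabet $\mathcal{X} = \{\pm\ell \pm j\ell\}$
with $\ell = \sqrt{P/(2B)}$ as follows:
\[ \hat{\mathbf{x}} = \sqrt{\frac{P}{2B}} \Big( \mathrm{sgn}\big(\Re\{\mathbf{x}^{(t_{\max}+1)}\}\big) + j\,\mathrm{sgn}\big(\Im\{\mathbf{x}^{(t_{\max}+1)}\}\big) \Big). \tag{8} \]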
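A floating-point reference sketch of Algorithm 1 is given below; it is not the authors' MATLAB simulator. It assumes $U \geq 2$ so that $\mathbf{A} \neq \mathbf{0}$. As an untuned default, it sets $\gamma$ to the squared Frobenius norm of $\mathbf{A}$ (an upper bound on $\|\mathbf{A}^H\mathbf{A}\|_{2,2}$, chosen here only to avoid an eigenvalue computation) and $\delta = \gamma/2$. All function names and numerical values are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def herm(M):
    """Hermitian transpose of a matrix stored as a list of rows."""
    return [[M[i][j].conjugate() for i in range(len(M))] for j in range(len(M[0]))]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for small B)."""
    n = len(M)
    aug = [[complex(e) for e in M[i]] + [1 + 0j if i == j else 0j for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [e / piv for e in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [er - f * ec for er, ec in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def c1po(H, s, power, t_max, gamma=None, delta=None):
    U, B = len(H), len(H[0])
    ell = math.sqrt(power / (2 * B))
    ns = sum(abs(si) ** 2 for si in s)
    # A = QH with Q = I_U - s s^H / ||s||^2, as in (4)
    Q = [[(1.0 if i == j else 0.0) - s[i] * s[j].conjugate() / ns for j in range(U)]
         for i in range(U)]
    A = matmul(Q, H)
    if gamma is None:   # assumed default: Frobenius bound on ||A^H A||_{2,2}
        gamma = sum(abs(e) ** 2 for row in A for e in row)
    if delta is None:   # ratio gamma / delta = 2
        delta = gamma / 2.0
    AhA = matmul(herm(A), A)
    G = inverse([[(1.0 if i == j else 0.0) + AhA[i][j] / gamma for j in range(B)]
                 for i in range(B)])                    # preprocessing
    gain = gamma / (gamma - delta)
    clip = lambda v: math.copysign(min(gain * abs(v), ell), v)
    sgn = lambda a: 1.0 if a >= 0 else -1.0
    x = matvec(herm(H), s)                              # x^(1) = H^H s
    for _ in range(t_max):
        z = matvec(G, x)                                # step (6)
        x = [complex(clip(e.real), clip(e.imag)) for e in z]   # step (7)
    return [complex(ell * sgn(e.real), ell * sgn(e.imag)) for e in x]  # (8)

# Toy example (all values made up)
H = [[1 + 1j, 0.5 - 1j, -1 + 0.2j, 0.3 + 0.7j],
     [0.2 - 0.4j, -1 - 1j, 0.6 + 0.1j, 1 - 0.5j]]
s = [1 + 1j, -1 + 1j]
x_hat = c1po(H, s, power=1.0, t_max=10)
```

The inverse is computed once before the loop, which mirrors the split between preprocessing and per-iteration work in the algorithm.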
Because C1PO decreases the objective function (BCR$^{*}$) on
every variable update, and the objective is bounded from below,
the objective values corresponding to the iterates $\{\mathbf{x}^{(t)}, \mathbf{z}^{(t)}\}$
form a convergent sequence. However, by exploiting the
biconvex structure of our problem, we can prove the following
stronger result; the proof is given in Appendix A.
Theorem 1. Any limit point of the sequence $\{\mathbf{x}^{(t)}, \mathbf{z}^{(t)}\}$
generated by C1PO is a stationary point of (BCR$^{*}$).
The main computations performed by C1PO in Algorithm 1
are (i) the $B \times B$ matrix inversion $\mathbf{G} = (\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A})^{-1}$,
which can be computed once during a preprocessing
stage, and (ii) the per-iteration matrix-vector multiplication
$\mathbf{z}^{(t+1)} = \mathbf{G}\mathbf{x}^{(t)}$ in step (6); the complexity of the projection
in step (7) is negligible. Unfortunately, the complexity of
the matrix inversion, evaluated in terms of operations,⁴ scales
roughly with $B^3$ and the complexity of the per-iteration matrix-vector
product with $B^2$. Both of these tasks are particularly
inefficient for massive MU-MIMO systems with a large number
of BS antennas. Therefore, we next propose an algorithmic
variant that avoids both of these issues and whose complexity
scales more favorably with the number of BS antennas.
D. Fast Algorithm for Very-Large Systems: C2PO
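To obtain our alternative algorithm, we start from the BCR
formulation in (BCR$^{*}$) but rather than introducing the auxiliary
variable $\mathbf{z}$, we attempt to directly solve the following nonconvex
optimization problem:⁵
\[ \hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathcal{B}^B}{\arg\min} \;\; \frac{1}{2}\|\mathbf{A}\mathbf{x}\|_2^2 - \frac{\delta}{2}\|\mathbf{x}\|_2^2. \tag{9} \]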
We use forward-backward splitting (FBS) [36]–[38], a computationally
efficient method to solve large convex problems.
Since the problem in (9) is nonconvex, FBS is not guaranteed
to converge to the optimal solution. Nevertheless, as shown in
Section VI, the proposed algorithm performs well in practice.
FBS is an efficient iterative procedure to solve convex
optimization problems of the form
\[ \hat{\mathbf{x}} = \underset{\mathbf{x}}{\arg\min} \;\; f(\mathbf{x}) + g(\mathbf{x}), \]
where the function $f$ is smooth and convex, and the function $g$
is convex but not necessarily smooth or bounded. FBS consists
of the following iteration [36], [37]:
\[ \mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \tau^{(t)}\nabla f(\mathbf{x}^{(t)}) \]
\[ \mathbf{x}^{(t+1)} = \mathrm{prox}_g\big(\mathbf{z}^{(t+1)}; \tau^{(t)}\big) \]
for $t = 1, 2, \ldots, t_{\max}$ or until convergence. Here, $\nabla f(\mathbf{x})$ is the
gradient of the smooth function $f$, and the so-called proximal
operator for the function $g$ is defined as follows [39]:
\[ \mathrm{prox}_g(\mathbf{z}; \tau) = \underset{\mathbf{x}}{\arg\min} \Big\{ \tau g(\mathbf{x}) + \frac{1}{2}\|\mathbf{x} - \mathbf{z}\|_2^2 \Big\}. \]
The sequence $\{\tau^{(t)} > 0\}$ contains suitably chosen step-size
parameters. For the problem (9), we show below that FBS
monotonically decreases the objective (9) for any constant step
size that satisfies $\tau^{(t)} = \tau < \|\mathbf{A}^H\mathbf{A}\|_{2,2}^{-1}$.

⁴For simplicity, we count the number of complex-valued multiplications to
characterize the operation count.
⁵To simplify notation, we have divided both terms in the objective function
by a factor of two; this scaling does not affect the result.
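In order to approximately solve (9) using FBS, we set
\[ f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x}\|_2^2 \quad \text{and} \quad g(\mathbf{x}) = \chi\big(\mathbf{x} \in \mathcal{B}^B\big) - \frac{\delta}{2}\|\mathbf{x}\|_2^2, \tag{10} \]
where $\chi$ is a characteristic function that is zero if the condition
$\mathbf{x} \in \mathcal{B}^B$ is met and infinity otherwise. For this choice of $f$ and $g$,
the gradient is given by $\nabla f(\mathbf{x}) = \mathbf{A}^H\mathbf{A}\mathbf{x}$ and the proximal
operator is given by the expansion-reprojection operation
\[ \mathrm{prox}_g(z) = \mathrm{sgn}(\Re\{z\}) \min\Big\{ \frac{1}{1 - \tau\delta}|\Re\{z\}|,\; \sqrt{\frac{P}{2B}} \Big\} + j\,\mathrm{sgn}(\Im\{z\}) \min\Big\{ \frac{1}{1 - \tau\delta}|\Im\{z\}|,\; \sqrt{\frac{P}{2B}} \Big\}, \tag{11} \]
which is valid for $\tau\delta < 1$ and applied element-wise to vectors.
By using FBS with the above-mentioned ingredients, we obtain
the following simple algorithm, which we call C2PO.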
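The form of the proximal operator in (11) follows from the definition of $\mathrm{prox}_g$ applied to $g$ in (10). With $w$ denoting the real or imaginary part of one entry of $\mathbf{z}$, $v$ the corresponding part of $\mathbf{x}$, and $\ell = \sqrt{P/(2B)}$ (symbols used only in this sketch), the problem separates into scalar problems:

```latex
\begin{align*}
\min_{|v| \leq \ell}\; -\frac{\tau\delta}{2}\,v^2 + \frac{1}{2}(v - w)^2
 \;&\Longrightarrow\;
 (1 - \tau\delta)\,v - w = 0
 \;\Longrightarrow\;
 v = \frac{w}{1 - \tau\delta}, \\
 &\text{clipped to } [-\ell, \ell]:\quad
 v = \mathrm{sgn}(w)\,\min\Big\{\frac{1}{1 - \tau\delta}\,|w|,\; \ell\Big\}.
\end{align*}
```

The scalar objective has curvature $1 - \tau\delta$, so it is strictly convex exactly when $\tau\delta < 1$; this is the condition under which clipping the unconstrained minimizer gives the constrained one.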
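Algorithm 2 (C2PO). Set $\mathbf{A}$ as in (4). Initialize
$\mathbf{x}^{(1)} = \mathbf{H}^H\mathbf{s}$ and fix the parameters $\delta$ and $\tau$ so that $\tau\delta < 1$.
Then, for every iteration $t = 1, 2, \ldots, t_{\max}$ compute:
\[ \mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \tau\mathbf{A}^H\mathbf{A}\mathbf{x}^{(t)} \tag{12} \]
\[ \mathbf{x}^{(t+1)} = \mathrm{prox}_g(\mathbf{z}^{(t+1)}; \tau). \tag{13} \]
Here, the $\mathrm{prox}_g$ operator is given in (11) and is applied
element-wise to the vector $\mathbf{z}^{(t+1)}$. In the last iteration
$t_{\max}$, the output $\mathbf{x}^{(t_{\max}+1)}$ of C2PO is quantized to the
quaternary alphabet $\mathcal{X}$ as in (8).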
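A floating-point sketch of Algorithm 2 follows. It evaluates $\mathbf{A}^H\mathbf{A}\mathbf{x}$ as $\mathbf{H}^H\mathbf{Q}\mathbf{H}\mathbf{x}$, which is valid because $\mathbf{Q}$ is a projection. The step size $\tau = 0.9/\|\mathbf{H}\|_F^2$ and the choice $\tau\delta = 0.5$ are conservative assumptions for the example, not tuned values from the paper. The function also records the objective (9) after each iteration so that its behavior can be inspected. Names and numbers are hypothetical.

```python
import math

def c2po(H, s, power, tau, delta, t_max):
    U, B = len(H), len(H[0])
    ell = math.sqrt(power / (2 * B))
    ns = sum(abs(si) ** 2 for si in s)
    gain = 1.0 / (1.0 - tau * delta)               # requires tau * delta < 1

    def aha_times(x):
        """A^H A x = H^H Q H x, computed without forming A."""
        u = [sum(H[k][b] * x[b] for b in range(B)) for k in range(U)]   # H x
        c = sum(si.conjugate() * ui for si, ui in zip(s, u)) / ns       # s^H H x / ||s||^2
        q = [ui - c * si for ui, si in zip(u, s)]                       # Q H x
        return [sum(H[k][b].conjugate() * q[k] for k in range(U)) for b in range(B)]

    def clip(v):
        return math.copysign(min(gain * abs(v), ell), v)

    def objective(x):
        """Objective (9): 0.5 * ||Ax||^2 - 0.5 * delta * ||x||^2."""
        g = aha_times(x)
        quad = sum((xb.conjugate() * gb).real for xb, gb in zip(x, g))  # x^H A^H A x
        return 0.5 * quad - 0.5 * delta * sum(abs(xb) ** 2 for xb in x)

    x = [sum(H[k][b].conjugate() * s[k] for k in range(U)) for b in range(B)]  # H^H s
    history = []
    for _ in range(t_max):
        g = aha_times(x)
        z = [xb - tau * gb for xb, gb in zip(x, g)]                     # step (12)
        x = [complex(clip(e.real), clip(e.imag)) for e in z]            # step (13)
        history.append(objective(x))
    sgn = lambda a: 1.0 if a >= 0 else -1.0
    x_hat = [complex(ell * sgn(e.real), ell * sgn(e.imag)) for e in x]  # as in (8)
    return x_hat, history

# Toy example (all values made up)
H = [[1 + 1j, 0.5 - 1j, -1 + 0.2j, 0.3 + 0.7j],
     [0.2 - 0.4j, -1 - 1j, 0.6 + 0.1j, 1 - 0.5j]]
s = [1 + 1j, -1 + 1j]
fro2 = sum(abs(e) ** 2 for row in H for e in row)   # ||H||_F^2 >= ||A^H A||_{2,2}
tau = 0.9 / fro2
x_hat, history = c2po(H, s, power=1.0, tau=tau, delta=0.5 / tau, t_max=20)
```

Since $\|\mathbf{A}^H\mathbf{A}\|_{2,2} \leq \|\mathbf{A}\|_F^2 \leq \|\mathbf{H}\|_F^2$, this step size satisfies the bound on $\tau$ without computing a spectral norm, at the price of smaller steps than necessary.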
The following result shows that C2PO is well behaved,
provided that the step size is chosen appropriately; the proof
is given in Appendix B.
Theorem 2. Suppose the step size used in C2PO satisfies
$\tau < \|\mathbf{A}^H\mathbf{A}\|_{2,2}^{-1}$, and $\tau\delta < 1$. Then, C2PO decreases the
objective (9) monotonically, and any limit point of the iterates
$\{\mathbf{x}^{(t)}\}$ is a stationary point.
The most complex operation of C2PO (Algorithm 2) is the
matrix-vector multiplication in step (12). In contrast to C1PO
(Algorithm 1), however, this step requires a minimal amount
of preprocessing and can be computed efficiently, especially
for large BS antenna arrays. To see this, we rewrite $\mathbf{A}^H\mathbf{A}$,
where $\mathbf{A}$ was given in (4), as follows:
\[ \mathbf{A}^H\mathbf{A} = \mathbf{H}^H\mathbf{H} - \frac{\mathbf{H}^H\mathbf{s}\mathbf{s}^H\mathbf{H}}{\|\mathbf{s}\|_2^2} = \mathbf{H}^H\mathbf{H} - \mathbf{v}\mathbf{v}^H = \bar{\mathbf{H}}^{\Upsilon}\bar{\mathbf{H}}. \]
Here, $\mathbf{v} = \mathbf{H}^H\mathbf{s}/\|\mathbf{s}\|_2$ is a normalized version of the MRT
vector; the augmented matrices $\bar{\mathbf{H}} = [\mathbf{H}; \mathbf{v}^H]$ and
$\bar{\mathbf{H}}^{\Upsilon} = [\mathbf{H}^H, -\mathbf{v}]$ are of dimension $(U + 1) \times B$ and $B \times (U + 1)$,
respectively. With these definitions, we can now simplify
step (12) to
\[ \mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \tau\bar{\mathbf{H}}^{\Upsilon}\bar{\mathbf{H}}\mathbf{x}^{(t)}, \tag{14} \]
where we first compute $\mathbf{w} = \bar{\mathbf{H}}(\tau\mathbf{x}^{(t)})$, then $\mathbf{w}' = \bar{\mathbf{H}}^{\Upsilon}\mathbf{w}$, and
finally $\mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \mathbf{w}'$. As a result, we conclude that each
iteration of C2PO requires only two matrix-vector products
with a cost of roughly $2B(U + 1)$ operations (in contrast
to $B^2$ operations for C1PO). In addition, the preprocessing
stage of this algorithm only needs to compute the normalized
MRT vector $\mathbf{v}$, which requires roughly $BU$ operations (in
contrast to $B^3$ operations for C1PO). Hence, the complexity of
C2PO can be significantly lower than that of C1PO, especially
since the antenna configurations of typical massive MU-MIMO
systems satisfy $U \ll B$. As we will show in Section VI,
the hardware efficiency of C2PO is superior to that of C1PO
for large BS antenna arrays and the error-rate performance is
comparable.
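The equivalence between step (12) and the two-product form in (14) can be checked numerically. The sketch below builds the augmented matrices from their definitions and compares both evaluations on made-up data; the function names are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def augmented_matrices(H, s):
    """Build H_bar = [H; v^H] and H_bar_ups = [H^H, -v], v = H^H s / ||s||_2."""
    U, B = len(H), len(H[0])
    norm_s = math.sqrt(sum(abs(si) ** 2 for si in s))
    v = [sum(H[k][b].conjugate() * s[k] for k in range(U)) / norm_s for b in range(B)]
    H_bar = [list(row) for row in H] + [[vb.conjugate() for vb in v]]       # (U+1) x B
    H_bar_ups = [[H[k][b].conjugate() for k in range(U)] + [-v[b]]
                 for b in range(B)]                                         # B x (U+1)
    return H_bar, H_bar_ups

def step_14(H_bar, H_bar_ups, x, tau):
    w = matvec(H_bar, [tau * xb for xb in x])     # w  = H_bar (tau x)
    w2 = matvec(H_bar_ups, w)                     # w' = H_bar_ups w
    return [xb - wb for xb, wb in zip(x, w2)]     # z  = x - w'

def step_12_reference(H, s, x, tau):
    """Direct evaluation of x - tau * A^H A x with A = QH, for comparison."""
    U, B = len(H), len(H[0])
    ns = sum(abs(si) ** 2 for si in s)
    u = matvec(H, x)
    c = sum(si.conjugate() * ui for si, ui in zip(s, u)) / ns
    q = [ui - c * si for ui, si in zip(u, s)]                               # Q H x
    g = [sum(H[k][b].conjugate() * q[k] for k in range(U)) for b in range(B)]
    return [xb - tau * gb for xb, gb in zip(x, g)]

# Toy data (all values made up)
H = [[1 + 1j, 0.5 - 1j, -1 + 0.2j, 0.3 + 0.7j],
     [0.2 - 0.4j, -1 - 1j, 0.6 + 0.1j, 1 - 0.5j]]
s = [1 + 1j, -1 + 1j]
x = [0.3 - 0.1j, -0.2 + 0.4j, 0.1 + 0.1j, -0.35 - 0.35j]
H_bar, H_bar_ups = augmented_matrices(H, s)
z_fast = step_14(H_bar, H_bar_ups, x, tau=0.05)
z_ref = step_12_reference(H, s, x, tau=0.05)
max_diff = max(abs(a - b) for a, b in zip(z_fast, z_ref))
```

Neither evaluation forms the $B \times B$ matrix $\mathbf{A}^H\mathbf{A}$; the augmented form additionally folds the rank-one correction into the same two matrix-vector products.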
E. Alternative Derivation of C2PO
It is interesting to note that there is a strong connection
between Algorithm 1 and Algorithm 2. In fact, one can obtain
C2PO directly from C1PO using the following well-known
series expansion. Let $\|\mathbf{A}^H\mathbf{A}\|_{2,2} < \gamma$. Then, we have the
following Neumann series expansion [40]:
\[ (\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A})^{-1} = \sum_{n=0}^{\infty} (-\gamma^{-1}\mathbf{A}^H\mathbf{A})^n. \]
As suggested in [21], we can approximate the inverse by
truncating the series to the first two terms:
\[ (\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A})^{-1} \approx \mathbf{I}_B - \gamma^{-1}\mathbf{A}^H\mathbf{A}. \]
By using this approximation in step (6) of Algorithm 1, we
immediately obtain Algorithm 2 after setting $\gamma^{-1} = \tau$. Note
that the Neumann series expansion is only convergent for
$\|\mathbf{A}^H\mathbf{A}\|_{2,2} < \gamma$, which corresponds to the step size restriction
$\tau < \|\mathbf{A}^H\mathbf{A}\|_{2,2}^{-1}$; this is exactly the same step size requirement
as in Theorem 2.
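The correspondence can be made explicit for both steps of the two algorithms. Substituting the truncated inverse into step (6) and setting $\tau = \gamma^{-1}$ gives the gradient step (12), and the same substitution turns the expansion factor of $\mathrm{proj}(\cdot)$ into that of $\mathrm{prox}_g$ in (11):

```latex
\begin{align*}
\mathbf{z}^{(t+1)}
 &= \big(\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A}\big)^{-1}\mathbf{x}^{(t)}
 \approx \mathbf{x}^{(t)} - \gamma^{-1}\mathbf{A}^H\mathbf{A}\mathbf{x}^{(t)}
 = \mathbf{x}^{(t)} - \tau\mathbf{A}^H\mathbf{A}\mathbf{x}^{(t)}, \\
\frac{\gamma}{\gamma - \delta}
 &= \frac{1}{1 - \delta/\gamma}
 = \frac{1}{1 - \tau\delta}.
\end{align*}
```

The condition $0 < \delta < \gamma$ of Algorithm 1 likewise maps to $\tau\delta < 1$ in Algorithm 2.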
IV. VLSI DESIGN FOR C1PO
We now present a high-throughput VLSI architecture for
C1PO as in Algorithm 1. We then discuss the key optimizations
performed in our FPGA implementation.
A. Architecture Overview
The proposed VLSI architecture that implements C1PO as
detailed in Algorithm 1 is shown in Figure 2. Our architecture
consists of a linear array of B identical processing elements
(PEs) that share a common control unit. The PEs essentially
compute the complex-valued matrix-vector product in (6),
using a variant of Cannon’s algorithm [41], followed by the
projection operation in (7). Each PE b = 1, 2, . . . , B consists
of three main building blocks: (i) a gb-memory, (ii) a complex-
valued multiply-accumulate (MAC) unit, and (iii) a projection
unit. For the bth PE, the gb-memory stores the bth row of
the matrix G = (IB + γ−1AHA)−1, which we assume was
computed during a separate preprocessing stage. As mentioned
in Section III-B, simulations are used to tune the parameter γ in
order to improve the error-rate performance; the optimal value
of γ depends on the antenna configuration. The complex-valued
Fig. 2. High-level block diagram of the VLSI architecture for C1PO. We use
a linear array of B processing elements (PEs) that enables us to achieve high
throughput at low hardware complexity.
MAC unit is used by each PE to sequentially compute an entry
of the output vector z(t+1) on line (6), while the entries of the
vector x(t) are exchanged between the PEs in a cyclic fashion;
this is done to avoid an architecture with a centralized x(t)
memory that would suffer from a high fan-out because the
memory’s output has to be distributed to all the PEs. The
projection unit implements the expansion-reprojection operator
proj(·) on line (7) in a hardware-friendly manner. The outputs
of the projection unit are also used to generate the quaternary
outputs of the 1-bit precoder; to this end, each PE simply takes
the sign bits of the complex-valued output vector x(t+1).
B. Architecture Operation
We now detail the (rather technical) operation of the C1PO
architecture illustrated in Figure 2. In the first iteration (i.e., at
$t = 1$), each PE $b$ is initialized with the $b$th entry of the vector
$\mathbf{x}^{(1)}$. Furthermore, the entries of the $\mathbf{g}_b$-memory are stored so
that the first memory address corresponds to $[\mathbf{G}]_{b,b}$, the second
address to $[\mathbf{G}]_{b,b+1}$, and so forth (addresses wrap around).
In the first clock cycle, each PE $b$ computes $[\mathbf{G}]_{b,b} \cdot [\mathbf{x}^{(t)}]_b$
and the result is stored in the accumulator. As shown on the
left side of Figure 2, in the same clock cycle the $b$th PE passes
the value $[\mathbf{x}^{(t)}]_b$ to PE $(b-1)$, while it receives the value
$[\mathbf{x}^{(t)}]_{b+1}$ from PE $(b+1)$; PE 1 passes its value to PE $B$. In
the second clock cycle, since the exchange operation made
the element $[\mathbf{x}^{(t)}]_{b+1}$ available at PE $b$, each PE computes
$[\mathbf{G}]_{b,b+1} \cdot [\mathbf{x}^{(t)}]_{b+1}$ and uses the accumulator to add it to the
result of the previous cycle. Once again, in the same clock
cycle, the $b$th PE passes the $\mathbf{x}^{(t)}$ entry that is currently being
multiplied on its MAC unit to PE $(b-1)$; PE 1 passes its
value to PE $B$. Consequently, in the third clock cycle, the
$b$th PE will use the values $[\mathbf{G}]_{b,b+2}$ and $[\mathbf{x}^{(t)}]_{b+2}$ to continue
performing MAC operations. By repeating this procedure $B$
times, each entry of the vector $\mathbf{x}^{(t)}$ circulates through all
the PEs exactly once, enabling each PE $b = 1, 2, \ldots, B$ to
compute $[\mathbf{z}^{(t+1)}]_b$. Thus, the matrix-vector multiplication on
line (6) is completed. Since the complex-valued MAC unit
contains three pipeline stages, two clock cycles are required to
flush the pipeline. Hence, the matrix-vector operation requires
a latency of $B + 2$ clock cycles. After $B + 2$ clock cycles, the
vector $\mathbf{z}^{(t+1)}$ is available at the outputs of the MAC units.
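The cyclic exchange described above can be illustrated with a short behavioral model. This is only a sketch under simplifying assumptions: it uses 0-based PE indices, floating-point complex arithmetic, and ignores pipelining; the function name is hypothetical.

```python
def cyclic_matvec(G, x):
    """Behavioral model of B PEs computing z = G x with cyclic exchange of x."""
    B = len(x)
    # PE b stores row b of G rotated so that address 0 holds G[b][b].
    g_mem = [[G[b][(b + k) % B] for k in range(B)] for b in range(B)]
    x_reg = list(x)      # entry of x currently held by each PE
    acc = [0j] * B       # accumulators of the MAC units
    for cycle in range(B):
        for b in range(B):
            acc[b] += g_mem[b][cycle] * x_reg[b]
        # PE b passes its entry to PE b-1; the first PE wraps around to the last.
        x_reg = x_reg[1:] + x_reg[:1]
    return acc

# Small assumed example (values chosen only for illustration).
G = [[1 + 1j, 2, 0], [0, 1j, 3], [4, 0, 1 - 1j]]
x = [1, 1j, -1]
z = cyclic_matvec(G, x)
```

In cycle k, PE b holds entry (b + k) mod B of x and reads the matching entry of its rotated row, so after B cycles every accumulator holds one entry of the matrix-vector product.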
In the subsequent clock cycle, each PE projects its respective
entry of the vector $\mathbf{z}^{(t+1)}$. According to our simulation
results, the choice $\gamma/\delta = 5$, which implies that $\gamma/(\gamma - \delta) = 1.25$,
works well for all the considered antenna configurations.
Furthermore, to reduce the hardware complexity, we assume
$P = 2B$ so that the clipping threshold of the expansion-reprojection
operator proj(·) is 1. As a result, the proj(·)
operator in (7) is implemented by applying the following
operations independently to the real and imaginary parts of
$[\mathbf{z}^{(t+1)}]_b$: We multiply the real (or imaginary) part of $[\mathbf{z}^{(t+1)}]_b$
by 1.25; this is accomplished by adding the $[\mathbf{z}^{(t+1)}]_b$ value to
a copy of itself that is right-shifted by two bits. At the same time, the real
(or imaginary) part of $[\mathbf{z}^{(t+1)}]_b$ is compared to −0.8 and +0.8.
If the real (or imaginary) part of $[\mathbf{z}^{(t+1)}]_b$ is between these two
numbers, then the projection unit outputs $1.25 \cdot [\mathbf{z}^{(t+1)}]_b$. If it
is smaller than −0.8, then the projection unit generates −1;
if it is larger than +0.8, it generates +1. The result from this
projection is stored as the next iterate $[\mathbf{x}^{(t+1)}]_b$ in the input
register of the complex-valued MAC unit, which completes
one C1PO iteration. Since the projection requires an additional
clock cycle, one C1PO iteration is completed in exactly $B + 3$
clock cycles.
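The shift-and-add projection can be sketched on integers as follows. This is an illustrative model, not the authors' RTL: it assumes 8 fraction bits (the value reported for the C1PO projection unit), a truncated fixed-point representation of the 0.8 threshold, and hypothetical function names.

```python
FRAC_BITS = 8                      # assumed number of fraction bits
ONE = 1 << FRAC_BITS               # fixed-point representation of 1.0
THRESH = int(0.8 * ONE)            # 0.8 is not exactly representable; truncation is an assumption

def to_fixed(v):
    """Convert a real number to the assumed fixed-point format."""
    return int(round(v * ONE))

def proj_component(z_fx):
    """Apply the projection to one real or imaginary part (fixed-point integer)."""
    if z_fx < -THRESH:
        return -ONE                # clip to -1
    if z_fx > THRESH:
        return ONE                 # clip to +1
    return z_fx + (z_fx >> 2)      # 1.25 * z via an arithmetic right shift by two bits

def proj(z_re_fx, z_im_fx):
    """Project a complex entry by treating both parts independently."""
    return proj_component(z_re_fx), proj_component(z_im_fx)
```

For example, an input of 0.5 (128 in this format) maps to 160, which is 0.625, whereas inputs beyond ±0.8 are clipped to ±1 (±256).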
C. FPGA Implementation Details
To minimize the FPGA implementation complexity of C1PO,
we exclusively use fixed-point arithmetic; see Section VI-A for
the fixed-point error-rate performance of C1PO. To represent
the entries of the vector x(t), we use 12-bit signed fixed-point
values with 5 fraction bits. The entries of the G matrix are
represented using 10 bits with 9 fraction bits, and we use
FPGA look-up tables (LUTs) as a distributed RAM to store
these values. The complex-valued MAC unit uses 18 bits with
11 fraction bits; the projection unit uses 15 bits with 8 fraction
bits. In our C1PO design, all adders and multipliers do not
saturate, but wrap around; number resizing uses truncation.
All complex-valued multipliers consist of four real-valued
multipliers and two adders; we use the built-in DSP48 units
for these operations.
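The wrap-around and truncation behavior, together with the four-multiplier complex product, can be modeled on plain integers. The sketch below is an assumption-laden illustration: the helper names and the way word lengths are passed are hypothetical and do not reflect the actual Verilog.

```python
def wrap(value, bits):
    """Wrap an integer into the two's-complement range of `bits` bits (no saturation)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value

def resize(value, frac_in, frac_out, bits_out):
    """Change the number of fraction bits by truncation, then wrap to the output width."""
    if frac_in >= frac_out:
        value >>= frac_in - frac_out       # arithmetic shift drops the low-order bits
    else:
        value <<= frac_out - frac_in
    return wrap(value, bits_out)

def complex_mul(a_re, a_im, b_re, b_im):
    """Complex product built from four real multiplications and two additions."""
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    return re, im
```

As a usage example, `wrap(2048, 12)` yields −2048, which shows how an overflow in a 12-bit word wraps instead of saturating.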
V. VLSI DESIGN FOR C2PO
We now present a high-throughput VLSI architecture for
C2PO as in Algorithm 2. We then discuss the key optimizations
used in our FPGA implementation.
A. Architecture Overview
The proposed VLSI architecture that implements C2PO
(Algorithm 2) is shown in Figure 3. In what follows, we assume
that B is a multiple of U and B ≥ U. Our architecture consists
of B/U linear arrays; each array consists of U + 1 PEs and a
control unit. The architecture divides the operation in (14) into
Fig. 3. High-level block diagram of the VLSI architecture for C2PO. We use B/U linear arrays, each consisting of U + 1 processing elements (PEs), which
enable us to achieve high throughput at low hardware complexity.
two separate matrix-vector products: (i) $\mathbf{w} = \mathbf{H}(\tau\mathbf{x}^{(t)})$ and (ii)
$\mathbf{w}' = \mathbf{H}^{\Upsilon}\mathbf{w}$; see also the discussion at the end of Section III-D.
We assume that $\mathbf{H}$ was computed in a separate preprocessing
stage. Note that for the first matrix-vector product, the matrix
$\mathbf{H}$ has more columns ($B$) than rows ($U + 1$); for the second
matrix-vector product, the matrix $\mathbf{H}^{\Upsilon}$ has more rows than
columns. Therefore, we will refer to the first matrix-vector
product as the wide product, while the second one will be
identified as the tall product. The final subtraction required to
compute $\mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \mathbf{w}'$ in (14) is incorporated into the
tall-product operation; see Section V-B for more details.
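To perform both the wide and tall products in a single
computation unit, the matrix $\mathbf{H}$ is divided into $B/U$ sub-matrices
$\tilde{\mathbf{H}}_w \in \mathbb{C}^{(U+1) \times U}$, $w = 1, 2, \ldots, B/U$, so that
$\mathbf{H} = [\tilde{\mathbf{H}}_1, \tilde{\mathbf{H}}_2, \ldots, \tilde{\mathbf{H}}_{B/U}]$. In the same way, the matrix $\mathbf{H}^{\Upsilon}$ is
divided into $B/U$ sub-matrices $\tilde{\mathbf{H}}_w^{\Upsilon} \in \mathbb{C}^{U \times (U+1)}$, where
$w = 1, 2, \ldots, B/U$, and
$\mathbf{H}^{\Upsilon} = [\tilde{\mathbf{H}}_1^{\Upsilon}; \tilde{\mathbf{H}}_2^{\Upsilon}; \ldots; \tilde{\mathbf{H}}_{B/U}^{\Upsilon}]$. Note that
$\tilde{\mathbf{H}}_w^{H}$ and $\tilde{\mathbf{H}}_w^{\Upsilon}$ are the same matrices, except for a sign flip of the
last column. Analogously to the matrices case, the vector $\mathbf{x}^{(t)}$ is
divided into $B/U$ sub-vectors $\tilde{\mathbf{x}}_w^{(t)} \in \mathbb{C}^{U}$, $w = 1, 2, \ldots, B/U$,
so that $\mathbf{x}^{(t)} = [\tilde{\mathbf{x}}_1^{(t)}; \tilde{\mathbf{x}}_2^{(t)}; \ldots; \tilde{\mathbf{x}}_{B/U}^{(t)}]$. We next outline the
architectural principle of the wide and tall products.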
(i) Wide Product: Each linear array takes one sub-matrix $\tilde{\mathbf{H}}_w$
and the associated sub-vector $\tilde{\mathbf{x}}_w^{(t)}$ as its inputs and computes
their product in a sequential, column-by-column manner. This
operation is analogous to that of the C1PO architecture (cf. Section IV-B)
and within each linear array, the entries of the scaled
sub-vector $\tau\tilde{\mathbf{x}}_w^{(t)}$ are cyclically exchanged among the PEs. The
resulting vectors $\mathbf{w}_w = \tilde{\mathbf{H}}_w(\tau\tilde{\mathbf{x}}_w^{(t)})$ are then added to obtain
$$\mathbf{w} = \mathbf{H}(\tau\mathbf{x}^{(t)}) = \sum_{w=1}^{B/U} \tilde{\mathbf{H}}_w(\tau\tilde{\mathbf{x}}_w^{(t)}),$$
which completes the wide product. Each entry $[\mathbf{w}]_u$ of the
resulting vector $\mathbf{w}$ is then stored in PE $u$ of all linear arrays.
(ii) Tall Product: With the $\mathbf{w}$ vector available in all the
linear arrays, each array now computes $U$ entries of $\mathbf{z}^{(t+1)}$
by implementing $\mathbf{z}_w^{(t+1)} = \tilde{\mathbf{x}}_w^{(t)} - \tilde{\mathbf{H}}_w^{\Upsilon}\mathbf{w}$. Here, however, we
use a sequential procedure in which the accumulated results
are exchanged between PEs of the same array. This procedure
is, once again, a variant of Cannon’s algorithm [41]; see
Section V-B for a detailed explanation. As a result, each linear
array can project its computed $\mathbf{z}_w^{(t+1)}$ entries to generate the
next iterate $\tilde{\mathbf{x}}_w^{(t+1)}$, which is then used by the same linear
array to proceed with the next iteration. The sign bits of the
new vector $\mathbf{x}^{(t+1)}$ correspond to the outputs of the C2PO
architecture.
As in the C1PO architecture, each PE $u = 1, 2, \ldots, U + 1$ is
formed by three main units. The first unit is an $\tilde{\mathbf{h}}_u^{[w]}$-memory,
which stores the $u$th row of the $\tilde{\mathbf{H}}_w$ sub-matrix; $\tilde{\mathbf{H}}_w^{\Upsilon}$ can
be derived directly from $\tilde{\mathbf{H}}_w$. The second unit is a complex-valued
MAC unit, which supports (i) multiplications of $a \times b$
and $a \times b^*$, (ii) accumulation by addition or subtraction, and
(iii) initialization of the accumulator with a non-zero value.
The third unit is the projection unit, which is equivalent to the
one of C1PO, although it is merged with the accumulator of
the MAC unit.
B. Architecture Operation
We now provide the (rather technical) operation details of
the C2PO architecture illustrated in Figure 3. Without loss of
generality, we focus our description on the $w$th linear array
of PEs, which operates on the $\tilde{\mathbf{H}}_w$ sub-matrix and the $\tilde{\mathbf{x}}_w^{(t)}$
sub-vector. In the first iteration (i.e., at $t = 1$), the entry
$[\tilde{\mathbf{x}}_w^{(1)}]_u$ and its scaled version $\tau[\tilde{\mathbf{x}}_w^{(1)}]_u$ are stored in PE
$u = 1, 2, \ldots, U$ in two different registers: The value $[\tilde{\mathbf{x}}_w^{(1)}]_u$ is stored
in the register labeled with “a” in Figure 3, which will later
be used to initialize the accumulator in the complex-valued
MAC unit; the value $\tau[\tilde{\mathbf{x}}_w^{(1)}]_u$ is stored at the input register
of the MAC unit labeled with “b.” We restrict the stepsize
$\tau$ to be of the form $2^{-\alpha}$ for some fixed $\alpha \in \mathbb{N}^+$, which
enables us to acquire $\tau[\tilde{\mathbf{x}}_w^{(1)}]_u$ from a simple arithmetic
right-shifted version of $[\tilde{\mathbf{x}}_w^{(1)}]_u$; we used numerical simulations to
optimize the error-rate performance by selecting an optimal
value for $\tau$. The $(U+1)$th PE stores the same value $[\tilde{\mathbf{x}}_w^{(1)}]_1$ as
that in PE 1. Similar to the C1PO architecture, the entries of
the $\tilde{\mathbf{h}}_u^{[w]}$-memory are stored so that the first memory address
contains $[\tilde{\mathbf{H}}_w]_{u,u}$, the second address $[\tilde{\mathbf{H}}_w]_{u,u+1}$, and so forth
(addresses wrap around). For the $(U+1)$th PE, the first address
of the $\tilde{\mathbf{h}}_{U+1}^{[w]}$-memory contains $[\tilde{\mathbf{H}}_w]_{U+1,1}$, the second address
contains $[\tilde{\mathbf{H}}_w]_{U+1,2}$, etc.
(i) Wide Product: In the first clock cycle, each PE $u = 1, 2, \ldots, U$
computes $[\tilde{\mathbf{H}}_w]_{u,u} \cdot [\tau\tilde{\mathbf{x}}_w^{(t)}]_u$ and stores the result
in the accumulator. The $(U+1)$th PE computes
$[\tilde{\mathbf{H}}_w]_{U+1,1} \cdot [\tau\tilde{\mathbf{x}}_w^{(t)}]_1$. As shown in the upper left side of Figure 3, in the
same clock cycle, the $u$th PE passes the value $[\tau\tilde{\mathbf{x}}_w^{(t)}]_u$ to
PE $(u-1)$, while it receives the value $[\tau\tilde{\mathbf{x}}_w^{(t)}]_{u+1}$ from PE
$(u+1)$; PE 1 passes its value to PE $U$, while PE $(U+1)$
does not pass anything. In the second clock cycle, since the
cyclic exchange operation made the entry $[\tau\tilde{\mathbf{x}}_w^{(t)}]_{u+1}$ available
at PE $u$, each PE computes $[\tilde{\mathbf{H}}_w]_{u,u+1} \cdot [\tau\tilde{\mathbf{x}}_w^{(t)}]_{u+1}$ and uses
the accumulator to add it to the result of the previous cycle.
The $(U+1)$th PE uses the same $\tau\tilde{\mathbf{x}}_w^{(t)}$ entry as PE 1; hence,
it can compute $[\tau\tilde{\mathbf{x}}_w^{(t)}]_2 \cdot [\tilde{\mathbf{H}}_w]_{U+1,2}$. Again, in the same clock
cycle, the $u$th PE passes the $\tau\tilde{\mathbf{x}}_w^{(t)}$ entry that is currently being
multiplied on its MAC unit to PE $(u-1)$; PE 1 passes its
value to PE $U$, while PE $(U+1)$ does not pass anything.
Consequently, in the third clock cycle, the $u$th PE will use
the values $[\tilde{\mathbf{H}}_w]_{u,u+2}$ and $[\tau\tilde{\mathbf{x}}_w^{(t)}]_{u+2}$ to continue performing
MAC operations. During this third cycle, the $(U+1)$th PE
will calculate the product $[\tilde{\mathbf{H}}_w]_{U+1,3} \cdot [\tau\tilde{\mathbf{x}}_w^{(t)}]_3$. By repeating
this procedure $U$ times, each entry of the sub-vector $\tau\tilde{\mathbf{x}}_w^{(t)}$
cycles through all the PEs exactly once, enabling the $w$th linear
array of PEs to compute $\tilde{\mathbf{H}}_w(\tau\tilde{\mathbf{x}}_w^{(t)})$. Since the complex-valued
MAC unit contains three pipeline stages, two clock cycles are
required to flush the pipeline. Hence, the previous matrix-vector
operation has a latency of $U + 2$ clock cycles. To complete the
wide product, the vectors $\tilde{\mathbf{H}}_w(\tau\tilde{\mathbf{x}}_w^{(t)})$ must be added. We use a
binary adder tree with $\log_2(B/U)$ pipeline stages. Hence, the
vector $\mathbf{w}$ is computed after $U + \log_2(B/U) + 2$ clock cycles.
The $u$th PE in each linear array stores the entry $[\mathbf{w}]_u$ in the
MAC unit’s input register labeled with “b” in Figure 3.
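The wide product, including the extra PE and the adder tree, can be modeled behaviorally as follows. This is a sketch with 0-based indices and floating-point arithmetic; the function names are hypothetical, pipelining is ignored, and the adder tree assumes that B/U is a power of two.

```python
def wide_product_array(H_sub, tx):
    """One linear array: H_sub is (U+1) x U, tx is the scaled sub-vector of length U."""
    U = len(tx)
    # PEs 0..U-1 store their rows rotated; the extra PE stores its row in natural order.
    mem = [[H_sub[u][(u + k) % U] for k in range(U)] for u in range(U)]
    mem.append(list(H_sub[U]))
    reg = list(tx)
    acc = [0j] * (U + 1)
    for cycle in range(U):
        for u in range(U):
            acc[u] += mem[u][cycle] * reg[u]
        acc[U] += mem[U][cycle] * reg[0]      # the extra PE sees the same entry as the first PE
        reg = reg[1:] + reg[:1]               # cyclic exchange among the first U PEs
    return acc

def adder_tree(vectors):
    """Pairwise summation of the partial results (number of vectors assumed a power of two)."""
    level = list(vectors)
    while len(level) > 1:
        level = [[a + b for a, b in zip(level[i], level[i + 1])]
                 for i in range(0, len(level), 2)]
    return level[0]

def wide_product(H, tx, U):
    """H is (U+1) x B and tx has length B; returns w = H tx."""
    B = len(tx)
    partial = [wide_product_array([row[w * U:(w + 1) * U] for row in H],
                                  tx[w * U:(w + 1) * U])
               for w in range(B // U)]
    return adder_tree(partial)
```

Each array produces one partial vector of length U + 1, and the tree sums the B/U partial vectors in log2(B/U) levels, mirroring the pipeline stages mentioned in the text.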
(ii) Tall Product: In the next clock cycle, the computation
of the tall product starts. During the first clock cycle
of the tall product computation, the PE $u = 1, 2, \ldots, U$ has
available $[\mathbf{w}]_u$, as well as $[\tilde{\mathbf{H}}_w]_{u,u}$, the first entry in its memory.
The PE can then compute
$[\tilde{\mathbf{H}}_w]_{u,u}^* \cdot [\mathbf{w}]_u = [\tilde{\mathbf{H}}_w^{H}]_{u,u} \cdot [\mathbf{w}]_u = [\tilde{\mathbf{H}}_w^{\Upsilon}]_{u,u} \cdot [\mathbf{w}]_u$.
Using the accumulator, this product is
then subtracted from $[\tilde{\mathbf{x}}_w^{(t)}]_u$, which was stored during the
initialization phase in the register labeled with “a” in Figure 3.
During the same clock cycle, the $u$th PE sends its accumulated
result to the $(u-1)$th PE; PE 1 sends its accumulated result
to PE $U$. Also, in the same clock cycle, the $(U+1)$th PE
multiplies the conjugate of the first entry of its memory with
its $\mathbf{w}$ entry. In words, the product
$[\tilde{\mathbf{H}}_w]_{U+1,1}^* \cdot [\mathbf{w}]_{U+1} = [\tilde{\mathbf{H}}_w^{H}]_{1,U+1} \cdot [\mathbf{w}]_{U+1} = -[\tilde{\mathbf{H}}_w^{\Upsilon}]_{1,U+1} \cdot [\mathbf{w}]_{U+1}$
is computed.
The result is sent to the $U$th PE. In the following clock cycles,
this result will cycle through the linear array using the same
wires and registers that were previously used to transfer the
$\tau\tilde{\mathbf{x}}_w^{(t)}$ entries. In the second clock cycle, the $(u-1)$th PE
multiplies the value $[\mathbf{w}]_{u-1}$ with $[\tilde{\mathbf{H}}_w]_{u-1,u}^* = [\tilde{\mathbf{H}}_w^{\Upsilon}]_{u,u-1}$.
The product is then subtracted from the accumulated value
received from the $u$th PE during the previous cycle, and
the new accumulated value is passed to the $(u-2)$th PE.
In the same clock cycle, the $(U+1)$th PE multiplies the
value $[\mathbf{w}]_{U+1}$ with $[\tilde{\mathbf{H}}_w]_{U+1,2}^* = -[\tilde{\mathbf{H}}_w^{\Upsilon}]_{2,U+1}$ and sends the
result to the $U$th PE, so it can cycle through the linear array.
Furthermore, PE $U$ passes the product $-[\tilde{\mathbf{H}}_w^{\Upsilon}]_{1,U+1} \cdot [\mathbf{w}]_{U+1}$ (previously
received from the $(U+1)$th PE) to PE $(U-1)$. In the third
clock cycle, the $(u-2)$th PE calculates $[\tilde{\mathbf{H}}_w^{\Upsilon}]_{u,u-2} \cdot [\mathbf{w}]_{u-2}$,
subtracts it from the accumulated result received on the second
cycle from the $(u-1)$th PE and passes the result to the
$(u-3)$th PE. In the same clock cycle, the $(U+1)$th PE computes
$-[\tilde{\mathbf{H}}_w^{\Upsilon}]_{3,U+1} \cdot [\mathbf{w}]_{U+1}$ and sends it to PE $U$. Meanwhile,
$-[\tilde{\mathbf{H}}_w^{\Upsilon}]_{2,U+1} \cdot [\mathbf{w}]_{U+1}$ is passed from PE $U$ to PE $(U-1)$ and
$-[\tilde{\mathbf{H}}_w^{\Upsilon}]_{1,U+1} \cdot [\mathbf{w}]_{U+1}$ is passed from PE $(U-1)$ to PE $(U-2)$.
After repeating this procedure for $U$ clock cycles, each PE
$u = 1, 2, \ldots, U$ will contain the accumulated result for the $u$th
entry of $\tilde{\mathbf{x}}_w^{(t)} - \tilde{\mathbf{H}}_w^{\Upsilon}\mathbf{w}$, received from the $(u+1)$th PE during
the previous cycle. However, this accumulated result is missing
the product $-[\tilde{\mathbf{H}}_w^{\Upsilon}]_{u,U+1} \cdot [\mathbf{w}]_{U+1}$, which was computed and
sent by the $(U+1)$th PE. Nonetheless, in the $(U+1)$th cycle
of the tall product procedure, the $u$th PE receives the missing
$-[\tilde{\mathbf{H}}_w^{\Upsilon}]_{u,U+1} \cdot [\mathbf{w}]_{U+1}$ value from the $(u+1)$th PE. The received
product is accumulated with the remaining data by addition
instead of subtraction. Thus, the $\mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \mathbf{H}^{\Upsilon}\mathbf{w}$ entries
are calculated after $U + 1$ cycles. Since the complex MAC unit
is used again, two additional clock cycles are required to flush
its pipeline. Hence, $U + 3$ cycles are used for the tall product.
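The movement of the accumulated results and of the products from the extra PE can be followed in a small behavioral model. The sketch below uses 0-based indices, ignores pipelining, and uses hypothetical names; it is meant only to illustrate the data flow described above.

```python
def tall_product_array(H_sub, x_sub, w):
    """Compute z = x_sub - H_sub^Upsilon w for one array; H_sub is (U+1) x U, w has U+1 entries."""
    U = len(x_sub)
    mem = [[complex(H_sub[u][(u + k) % U]) for k in range(U)] for u in range(U)]
    mem.append([complex(v) for v in H_sub[U]])
    acc = [0j] * U     # accumulated result currently held by each PE
    side = [0j] * U    # products of the extra PE travelling through the array
    for cycle in range(U):
        new_acc = [0j] * U
        for u in range(U):
            prod = mem[u][cycle].conjugate() * w[u]
            start = x_sub[u] if cycle == 0 else acc[u]
            # Subtract and send the result to the previous PE (the first PE wraps to the last).
            new_acc[(u - 1) % U] = start - prod
        extra = mem[U][cycle].conjugate() * w[U]
        side = side[1:] + [extra]   # shift towards lower indices; new product enters at the last PE
        acc = new_acc
    # Final cycle: each PE adds the product that has just arrived from the extra PE.
    return [acc[u] + side[u] for u in range(U)]

# Small assumed example (U = 2).
H_sub = [[1 + 1j, 2], [1j, 3], [1, 1 - 1j]]
x_sub = [1, 1j]
w_vec = [1, 2, 1j]
z_sub = tall_product_array(H_sub, x_sub, w_vec)
```

After U cycles every accumulated result has visited all U PEs and is back at its home PE, and the product destined for that PE arrives at the same time, which is why a single additional cycle with an addition completes the computation.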
Finally, in the subsequent clock cycle after the tall product
is completed, the projection operator is applied in a similar
fashion as for the C1PO architecture. As in the C1PO case,
our simulation results show that the choice of τλ = 0.2
(which implies that 1/(1 − τλ) = 1.25) works well for all
the considered antenna configurations. Therefore, the only
difference between the projection units of C1PO and C2PO is
that, in the C2PO architecture, the accumulator of the MAC unit
is used to multiply the real and imaginary parts of each z(t+1)
entry with 1.25, by adding each $\mathbf{z}^{(t+1)}$ entry with a 2× right-shifted
version of itself. The result from this projection is stored
as the next iterate $[\tilde{\mathbf{x}}_w^{(t+1)}]_u$ in the two initialization locations
previously mentioned, completing one C2PO iteration. Since
the projection operation requires an additional clock cycle, a
full C2PO iteration is completed in exactly $2U + \log_2(B/U) + 6$
clock cycles.
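The per-iteration cycle counts of both architectures follow directly from the steps above and can be summarized in a few lines. This is a minimal sketch with hypothetical function names; it assumes that B/U is a power of two.

```python
from math import log2

def c1po_latency(B):
    """B MAC cycles + 2 pipeline-flush cycles + 1 projection cycle."""
    return B + 3

def c2po_latency(B, U):
    """Wide product (U + log2(B/U) + 2) + tall product (U + 3) + 1 projection cycle."""
    wide = U + int(log2(B // U)) + 2
    tall = U + 3
    return wide + tall + 1

# Cycle counts for U = 16 UEs and the four antenna configurations considered here.
latencies = {B: (c1po_latency(B), c2po_latency(B, 16)) for B in (32, 64, 128, 256)}
```

The C1PO latency grows linearly in B, whereas the C2PO latency grows only logarithmically in B for fixed U.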
C. FPGA Implementation Details
As for the C1PO FPGA implementation, we exclusively
use fixed-point arithmetic for the C2PO FPGA design; see
Section VI-A for the fixed-point error-rate performance of
C2PO. To represent the entries of the vector x(t), we use 12-
bit signed fixed-point values with 5 fraction bits. For the scaled
τx(t) values, we use 12-bit signed fixed-point values with 11
fraction bits. The entries of the H matrix consist of 10 bits with
8 fraction bits, and we use FPGA look-up tables (LUTs) as
a distributed RAM to store these values. The complex-valued
MAC unit uses 18 bits with 15 fraction bits when doing the
wide product and 11 fraction bits when doing the tall product;
the projection unit uses 18 bits with 11 fraction bits. The adder
tree uses 21 bits with 15 fraction bits. Identical to the C1PO
implementation, all adders and multipliers do not saturate, but
wrap around; number resizing uses truncation. All complex-
valued multipliers are built with four real-valued multipliers
and two adders; we use DSP48 units for these operations.
VI. RESULTS
We now provide error-rate performance results for massive
MU-MIMO systems and show reference FPGA implementation
results for C1PO and C2PO.
A. Simulation Results
Figure 4 shows uncoded bit error-rate (BER) curves versus
the normalized transmit power ρ = P/N0 for massive MU-MIMO
downlink systems with U = 16 UEs for various precoding
algorithms. In Figure 4(a) we consider the case of B = 32
BS antennas with BPSK, whereas in Figure 4(b) we consider
the case of B = 128 BS antennas with 16-QAM. For both
systems, we use Gray mapping, generate i.i.d. Rayleigh fading
channel matrices, and average the BER over 10 000 Monte–
Carlo trials. We compare ZF followed by quantization (ZF-Q),
MRT followed by quantization (MRT-Q), the nonlinear SQUID
algorithm proposed in [7], as well as C1PO and C2PO, for
systems with 1-bit DACs. As a baseline, we also include ZF and
MRT with infinite-precision DACs (denoted by “Inf. prec. ZF”
and “Inf. prec. MRT”, respectively). SQUID runs tmax = 50
iterations; C1PO and C2PO both run tmax = 24 iterations. For
all algorithms, the curves represent MATLAB floating-point
performance; for C1PO and C2PO, the markers correspond
to fixed-point performance of our hardware designs. Clearly,
the fixed-point implementation loss of our hardware designs is
negligible, i.e., less than 0.15 dB normalized transmit power ρ
at 1% uncoded BER for both considered scenarios.
For the 16 × 32 system (we use the notation U × B to
refer to a downlink scenario with U users and B BS antennas)
with BPSK, Figure 4(a) shows that all nonlinear precoders
significantly outperform the linear-quantized precoders (ZF-Q
and MRT-Q), which exhibit a high error floor. Furthermore,
we see that C1PO and C2PO achieve similar performance as
that of SQUID. At low values of normalized transmit power ρ,
SQUID is marginally better, whereas C2PO achieves the best
performance at high values of ρ, closely followed by C1PO
and SQUID. It can also be seen that, at high values of ρ,
1-bit nonlinear precoders significantly outperform the error-rate
performance of MRT with infinite-precision DACs.
For the 16×128 system with 16-QAM, Figure 4(b) shows a
similar trend, i.e., nonlinear precoders significantly outperform
linear-quantized precoders. SQUID outperforms C1PO and
C2PO (which perform equally well) by about 0.5 dB normalized
transmit power ρ at 1% BER. However, we note that the
complexity (in terms of operation counts) of SQUID is more
than 2× higher than that of C1PO and C2PO; SQUID also involves
the sorting of B-dimensional vectors, which is difficult to
implement efficiently in VLSI. We also observe that nonlinear
precoders enable reliable transmission of higher-order
modulation schemes (such as 16-QAM), which is not possible
with linear-quantized methods; the error-rate performance of
nonlinear 1-bit precoders for higher-order modulation schemes
is studied in more detail in [8]. We also see that nonlinear
precoders do not exhibit an error floor in the considered BER
range, which is in contrast to the linear-quantized ones. We note
that a detailed theoretical analysis of the error-rate performance
of nonlinear 1-bit precoders is an open research problem.
Remark 1. Our results are limited to a narrowband downlink
channel, in which we assume that the BS has perfect knowledge
of the channel matrix H and the noise variance N0. We also
assume that all the UEs have approximately the same large-
scale fading gains, and we restricted ourselves to a single
precoding factor β for all UEs. Furthermore, we have ignored
real-world hardware impairments and synchronization aspects.
Hence, the provided simulation results are not necessarily
representative for other, more realistic system scenarios. To
enable interested readers to perform their own simulations with
different system parameters, we made our MATLAB simulation
framework available for download from GitHub:
https://github.com/quantizedmassivemimo/1bit_precoding_VLSI.
B. FPGA Implementation Results
To demonstrate the efficacy of C1PO and C2PO, we imple-
mented several FPGA designs for different antenna configura-
tions, namely for 32, 64, 128, and 256 BS antennas; all designs
support downlink transmission to 16 UEs for modulation
schemes ranging from BPSK to 16-QAM. The FPGA designs
were developed on register transfer level (RTL) using Verilog,
implemented using Xilinx Vivado Design Suite, and optimized
for a Xilinx Virtex-7 XC7VX690T FPGA. Table I shows
reference FPGA implementation results for C1PO and C2PO.
We see that the logic area (in terms of slices, logic LUTs,
flipflops, and DSP48 units) for all designs increases roughly
linearly with the number of BS antennas; this is a result of
using a linear array of PEs. The only exception is the memory
requirement of C1PO (in terms of memory LUTs), which
scales roughly quadratically in the number of BS antennas;
this is a result of having to store the entire B×B matrix G in
contrast to storing only the augmented (U + 1)×B matrix H
for C2PO. We also see that the logic area for C1PO is 20%
to 50% smaller than that of C2PO for all array sizes; the
memory area of C1PO, however, is significantly larger for
128 and 256 BS antennas. This is because the architecture for
C1PO is slightly simpler than that of C2PO, but the memory
requirements of C1PO scale quadratically in B whereas the
memory requirements of C2PO only scale linearly in B.
The maximum clock frequency for C1PO is slightly higher
than that of C2PO, which is due to the slightly simpler archi-
tecture of C1PO. As expected, the maximum clock frequency
slowly decreases with B, since the FPGA routing overhead
increases with B. In fact, after implementing our designs, the
critical paths are typically in interconnect networks. Before
mapping our designs to the FPGA, however, the critical path
for the C1PO designs is in the real-valued multipliers that form
part of the complex multiplier, while for the C2PO designs
it is in the adders that form part of the complex multiplier.
The latency of one C1PO iteration is significantly larger than
that of C2PO for 64, 128, and 256 antennas. This results in
[Figure 4: two panels plotting the uncoded bit error-rate (BER) versus the normalized transmit power ρ [dB] for Inf. prec. ZF, Inf. prec. MRT, 1-bit ZF-Q, 1-bit MRT-Q, 1-bit SQUID, 1-bit C1PO, and 1-bit C2PO.]
(a) B = 32, U = 16, and BPSK.
(b) B = 128, U = 16, and 16-QAM.
Fig. 4. Uncoded bit error-rate (BER) for various 1-bit precoders as a function of the normalized transmit power ρ and for different antenna configurations and
modulation schemes. C1PO and C2PO achieve similar performance to SQUID [7] and significantly outperform linear-quantized precoders, such as quantized
zero-forcing (ZF-Q) and MRT (MRT-Q). The performance of ZF and MRT precoding with infinite-precision DACs is included as a reference.
TABLE I
IMPLEMENTATION RESULTS FOR C1PO AND C2PO FOR MU-MIMO SYSTEMS WITH U = 16 UES ON A XILINX VIRTEX-7 XC7VX690T FPGA

Algorithm                       |              C1PO              |              C2PO
BS antennas B                   |     32      64     128     256 |     32      64     128     256
--------------------------------+--------------------------------+--------------------------------
Slices                          |  2 700   5 187  10 324  21 951 |  3 375   6 519  12 690  24 748
LUTs                            |  6 671  13 305  30 979  71 817 | 10 817  21 920  43 710  85 323
– LUTs as logic                 |  6 031  12 025  25 939  51 897 | 10 069  20 424  40 718  79 339
– LUTs as memory                |    640   1 280   5 040  19 920 |    748   1 496   2 992   5 984
Flipflops                       |  6 830  13 624  26 683  52 175 |  5 677  12 461  26 083  53 409
DSP48 units                     |    128     256     512   1 024 |    136     272     544   1 088
Max. clock frequency [MHz]      |    285     264     244     205 |    222     206     208     193
Min. latency^a [clock cycles]   |     35      67     131     259 |     39      40      41      42
Max. throughput^b [Msymbols/s]  |    130      63      30      13 |     91      82      81      74
Power consumption^c [W]         |   1.13    1.97    3.43    5.74 |   1.04    1.70    3.17    5.80
Max. throughput/LUTs            | 19 529   4 733     962     177 |  8 413   3 756   1 853     862

^a The minimum latency is measured for one algorithm iteration.
^b The throughput corresponds to the total number of symbols precoded per unit of time. In this case, the maximum throughput is equal to (Uf)/d,
where f is the maximum clock frequency and d the minimum latency.
^c Statistical power estimation at maximum clock frequency and 1.0 V supply voltage.
significantly higher throughput of C2PO for these BS antenna
array sizes. In summary, C2PO is more efficient in terms of
throughput per area for large BS antenna array sizes (e.g., 128
BS antennas or more), whereas C1PO is more efficient for
small array sizes.
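The throughput figures in Table I can be reproduced from the clock frequency and latency using the relation (Uf)/d given in the table footnote. The following sketch uses hypothetical function names; since the tabulated frequencies are rounded, the results agree with the table only up to rounding.

```python
def max_throughput_msym(U, f_mhz, latency_cycles):
    """Maximum throughput in Msymbols/s: U symbols every `latency_cycles` cycles."""
    return U * f_mhz / latency_cycles

def throughput_per_lut(U, f_mhz, latency_cycles, luts):
    """Throughput per LUT in symbols/s/LUT."""
    return U * f_mhz * 1e6 / latency_cycles / luts

# Values taken from Table I for B = 32 and B = 256 (U = 16).
c1po_32 = max_throughput_msym(16, 285, 35)     # about 130 Msymbols/s
c2po_256 = max_throughput_msym(16, 193, 42)    # about 74 Msymbols/s
c1po_32_eff = throughput_per_lut(16, 285, 35, 6671)
```

For B = 256, the same relation gives roughly 13 Msymbols/s for C1PO versus roughly 74 Msymbols/s for C2PO, which reflects the difference between a latency of 259 and 42 cycles.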
We finally note that the implementation results in Table I
ignore the preprocessing complexity in order to compare the
complexity of the precoding stage alone. For C1PO, preprocess-
ing requires a B×B matrix inversion, which is computationally
demanding, exhibits stringent data dependencies, and requires
high numerical precision, especially for large BS antenna
arrays [21]. In stark contrast, preprocessing for C2PO only
requires the computation of the scaled MRT output, which
requires a multiplication of a B×U matrix by a U -dimensional
vector. As a result, we consider C2PO to be the preferred 1-bit
precoding method for most practical BS antenna array sizes.
C. Comparison with MRT-based Precoding
While the papers [5], [26]–[28] propose hardware designs for
precoding in massive MU-MIMO systems with high-precision
DACs, none of them provides detailed FPGA implementation
results. Reference [27] describes an FPGA-based testbed that
uses MRT and ZF-based precoding but does not report area
and clock frequency results; [26] and [28] only provide ASIC
TABLE II
IMPLEMENTATION RESULTS FOR A MRT-Q-BASED PRECODER FOR MU-MIMO SYSTEMS WITH U = 16 UES ON A XILINX VIRTEX-7 XC7VX690T FPGA

BS antennas B        |     32      64     128     256
---------------------+--------------------------------
Slices               |  2 543   5 097   9 444  17 630
LUTs                 |  7 842  15 617  32 476  64 446
– LUTs as logic      |  7 010  13 953  29 148  57 790
– LUTs as memory     |    832   1 664   3 328   6 656
Flipflops            |  5 711  11 419  21 902  42 764
Clock freq. [MHz]    |    412     410     388     359
Latency [cycles]     |     18      18      18      18
TP^a [Msymbols/s]    |    366     365     345     319
Power^b [W]          |   0.79    1.25    1.84    3.16
Throughput/LUTs      | 46 665  23 356  10 621   4 945

^a The throughput is calculated as (Uf)/d, where f is the clock frequency
and d the latency.
^b Statistical power estimation at max. clock freq. and 1.0 V supply voltage.
implementation results, and [5] reports implementation results
on a GPU cluster. Furthermore, all of these implementations
were designed for high-precision DACs. Consequently, to
enable a fair comparison of conventional precoders with C1PO
and C2PO, we developed a baseline design that implements
MRT followed by quantization (MRT-Q).
Our MRT-Q baseline design is essentially a stripped-down
and heavily optimized version of C2PO with only the necessary
circuitry to implement MRT-Q. More specifically, our archi-
tecture corresponds to B/U linear arrays, each one with U
PEs and a control unit. The arrays and PEs are organized as in
Figure 3, with the exception that the projection unit is removed
from the PEs. In addition, no multipliers are required as MRT-Q
computes $\mathbf{H}^H\mathbf{s}$ with $\mathbf{s} \in \mathcal{O}^U$, and hence all multiplications
are with constants (given by the constellation set $\mathcal{O}$) and can
be implemented with adders and shifters.
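For BPSK, the observation that MRT-Q needs no multipliers can be made concrete: every product with a symbol reduces to adding or subtracting a conjugated channel entry. The sketch below is a behavioral illustration with a hypothetical function name; it omits any power normalization and assumes that a zero value is mapped to +1.

```python
def mrt_q_bpsk(H, s):
    """MRT followed by 1-bit quantization for BPSK symbols s[u] in {+1, -1}.

    H is the U x B channel matrix (list of rows). Returns B outputs in {±1 ± 1j}.
    """
    U, B = len(H), len(H[0])
    out = []
    for b in range(B):
        acc = 0j
        for u in range(U):
            h_conj = complex(H[u][b]).conjugate()
            # Multiplying by +1 or -1 is an addition or a subtraction.
            acc = acc + h_conj if s[u] > 0 else acc - h_conj
        re = 1.0 if acc.real >= 0 else -1.0   # sign bit of the real part
        im = 1.0 if acc.imag >= 0 else -1.0   # sign bit of the imaginary part
        out.append(complex(re, im))
    return out

# Small assumed example with U = 2 UEs and B = 2 antennas.
H_example = [[1 + 1j, 3 + 2j], [2 - 1j, 1j]]
x_q = mrt_q_bpsk(H_example, [1, -1])
```

For higher-order constellations, the constants would take a few more values, which is where the shifters mentioned in the text come in.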
The FPGA implementation results for the MRT-Q baseline
designs are reported in Table II. Note that these designs do not
require any DSP48 units as the multiplications with constants
are carried out with conventional logic. In comparison to the
1-bit precoder designs reported in Table I, we see that MRT-Q-
based precoding is roughly 5× to 6× more efficient than
C2PO and up to about 28× more efficient than C1PO (in terms
of throughput/LUTs). This efficiency advantage comes at a
significant loss in terms of error-rate performance (cf. Figure 4).
We note, however, that for massive MU-MIMO systems with
significantly more BS antennas than UEs (e.g., more than 8×),
MRT-Q is a viable low-complexity alternative—a well-known
fact in the massive MU-MIMO literature [2]–[4].
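The efficiency gap can be read off directly from the throughput/LUTs rows of Tables I and II. The short sketch below simply forms the ratios of the tabulated values; the variable names are hypothetical.

```python
# Throughput/LUTs values copied from Tables I and II for B = 32, 64, 128, 256.
mrt_q = [46665, 23356, 10621, 4945]
c1po = [19529, 4733, 962, 177]
c2po = [8413, 3756, 1853, 862]

# Efficiency advantage of MRT-Q over each 1-bit precoder, per array size.
ratio_c1po = [m / c for m, c in zip(mrt_q, c1po)]
ratio_c2po = [m / c for m, c in zip(mrt_q, c2po)]
```

The ratios with respect to C2PO stay between about 5.5 and 6.2 for all array sizes, whereas the ratios with respect to C1PO grow with B, from about 2.4 at B = 32 to about 28 at B = 256.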
D. Performance–Complexity Trade-Offs
In Figure 5, we provide the performance–complexity trade-
offs between C1PO (dashed lines with circle markers) and
C2PO (dotted lines with square markers) for various BS antenna
array sizes. This trade-off is characterized in terms of the
[Figure 5: throughput/LUTs [Gsymbols/s/LUTs] versus the minimum normalized transmit power ρ [dB] that achieves 1% BER, with C1PO and C2PO curves for B = 32, B = 64, B = 128, and B = 256, and MRT-Q operating points.]
Fig. 5. Performance–complexity trade-offs for C1PO and C2PO. The numbers
next to the curves correspond to the number of iterations tmax. For tmax = 0,
we directly take the outputs from the initialization step $\mathbf{x}^{(1)} = \mathbf{H}^H\mathbf{s}$, which
is an approach equivalent to MRT-Q. The vertical lines show the performance
of ZF precoding with infinite-precision DACs. C1PO outperforms C2PO for
small BS antenna arrays (B = 32 and B = 64); C2PO outperforms C1PO
for large antenna arrays (B = 128 and B = 256). MRT-Q achieves higher
throughput per LUT at the cost of rather poor performance.
minimum normalized transmit power ρ required to achieve
1% uncoded BER for BPSK (as in Figure 4(a)); the hardware
efficiency is characterized by the throughput per area (in terms
of billion symbols per second per FPGA LUT). As a reference,
we also show the performance for ZF precoding with infinite-
precision DACs (vertical lines). As in Figure 4, we consider a
transmission to U = 16 UEs. Figure 5 shows that, for scenarios
with a high normalized transmit power ρ, only a few iterations
of our algorithms are required to meet 1% uncoded BER. As
the value of ρ decreases, more iterations are needed, which
reduces the throughput and, hence, the hardware efficiency
of the circuit. We see that for small antenna arrays (i.e., for
B = 32 and B = 64), C1PO outperforms C2PO, while for
large antenna arrays (i.e., for B = 128 and B = 256), C2PO
significantly outperforms C1PO. We note, however, that the
reported hardware efficiency does not take into account the
fact that the preprocessing complexity of C1PO would be
substantially higher than that of C2PO; see our discussion in
Section VI-B. We also observe that only a small number of
iterations are required (e.g., 2 to 4 iterations) for such large
BS antenna arrays to achieve the error-rate performance limits
of our algorithms.
In Figure 5, we additionally show the trade-off achieved by
the MRT-Q baseline design reported in Section VI-C. Clearly,
MRT-Q achieves higher throughput per LUT than C1PO and
C2PO for large BS arrays (B = 128 and B = 256); this
gain comes, however, at the cost of rather poor error-rate
performance. For small BS antenna arrays (B = 32 and B =
64), MRT-Q is unable to achieve the target BER of 1%. Hence,
MRT-Q is only suitable for massive MU-MIMO systems with
very high BS-to-UE-antenna ratios in which best-in-class error-
rate performance is not the main design objective.
Remark 2. The latency of C1PO and C2PO could be reduced
by modifying the architectures proposed in Section IV and
Section V. While the proposed architectures only pass one
element of the vector $\mathbf{x}^{(t)}$ (for C1PO) and one element of the
sub-vector $\tilde{\mathbf{x}}_w^{(t)}$ (for C2PO) per clock cycle, both architectures
could process two or more elements per clock cycle. Such an
approach would significantly decrease the latency and improve
the throughput at the cost of increased silicon area.
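As a rough illustration of the trade-off described in Remark 2, the following Python sketch counts clock cycles under a deliberately simplified model. The model (one pass over all B elements per iteration, no pipeline-fill or initialization overhead) and all function names and numbers are assumptions made for illustration; they are not taken from the architectures in Section IV and Section V, and the area increase is not modeled.

```python
import math


def cycles_per_iteration(num_elements: int, elements_per_cycle: int) -> int:
    """Cycles needed to stream all vector elements once (toy model)."""
    return math.ceil(num_elements / elements_per_cycle)


def latency_cycles(num_elements: int, elements_per_cycle: int, iterations: int) -> int:
    """Total cycles for a fixed number of algorithm iterations (toy model)."""
    return iterations * cycles_per_iteration(num_elements, elements_per_cycle)


# Hypothetical configuration: B = 128 BS antennas and 4 iterations.
B = 128
iterations = 4

baseline = latency_cycles(B, elements_per_cycle=1, iterations=iterations)
doubled = latency_cycles(B, elements_per_cycle=2, iterations=iterations)

# In this idealized model, latency scales inversely with the number of
# elements processed per clock cycle.
speedup = baseline / doubled
print(baseline, doubled, speedup)
```

In this idealized setting, doubling the number of elements processed per cycle halves the cycle count; a real design would see a smaller gain once control overhead and the extra area are accounted for.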
VII. CONCLUSIONS
We have proposed two nonlinear precoding algorithms,
namely C1PO and C2PO, which achieve excellent error-rate
performance in 1-bit massive MU-MIMO systems at low
computational complexity. To substantiate this claim, we have
designed corresponding VLSI architectures—to the best of our
knowledge, the first for 1-bit precoding in the downlink of
massive MU-MIMO systems—and we have presented FPGA
reference implementations for a variety of BS antenna array
configurations. Our results demonstrate that nonlinear precoding
for 1-bit massive MU-MIMO systems is feasible from a
hardware implementation perspective, even for antenna arrays
with hundreds of BS antennas. As a result, our hardware designs
pave the way for enabling BS antenna arrays with 1-bit DACs
to reliably transmit high-rate data to multiple UEs, which has
the potential to keep hardware complexity, system costs, and
circuit power consumption within manageable limits.
There are many avenues for future work. Beyond the convergence
results established in this paper, a theoretical error-rate performance
analysis of C1PO and C2PO is a challenging open research
topic. Hardware implementations of other nonlinear precoders,
such as SQUID [7], which perform better than C1PO and C2PO
at low normalized transmit power ϱ, are left for future work. The
study of 1-bit nonlinear precoders using more realistic system
models and a comprehensive cost, power, and performance
analysis are interesting research directions. Specifically, the
design of 1-bit precoding algorithms and hardware accelerators
for wideband massive MU-MIMO systems that use orthogonal
frequency-division multiplexing (OFDM) is the subject of
ongoing work; preliminary results are reported in [42].
APPENDIX A
PROOF OF THEOREM 1
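Let $E(\mathbf{z},\mathbf{x}) = \|\mathbf{A}\mathbf{z}\|_2^2 + \gamma\|\mathbf{z}-\mathbf{x}\|_2^2 - \delta\|\mathbf{x}\|_2^2$ denote the
objective (BCR∗) minimized by C1PO. Because $\mathcal{B}^B$ is bounded,
the sequence of iterates $\{(\mathbf{z}^{(t)},\mathbf{x}^{(t)})\}$ remains bounded and thus
contains a convergent sub-sequence. Denote the limit of this
sub-sequence by $(\mathbf{z}^\star,\mathbf{x}^\star)$ and set $E^\star = E(\mathbf{z}^\star,\mathbf{x}^\star)$. Consider
the point
\[
\hat{\mathbf{z}}^\star = \arg\min_{\mathbf{z}} E(\mathbf{z},\mathbf{x}^\star) = (\mathbf{I}_B + \gamma^{-1}\mathbf{A}^H\mathbf{A})^{-1}\mathbf{x}^\star.
\]
If $\hat{\mathbf{z}}^\star \neq \mathbf{z}^\star$, then we have the strict inequality
\[
E((\hat{\mathbf{z}}^\star + \mathbf{z}^\star)/2,\mathbf{x}^\star) < \frac{1}{2}E(\hat{\mathbf{z}}^\star,\mathbf{x}^\star) + \frac{1}{2}E(\mathbf{z}^\star,\mathbf{x}^\star) = E^\star
\]
because $E$ is strongly convex in $\mathbf{z}$. However, this contradicts the
fact that $\hat{\mathbf{z}}^\star = \arg\min_{\mathbf{z}} E(\mathbf{z},\mathbf{x}^\star)$, and so it must be the case
that $\hat{\mathbf{z}}^\star = \mathbf{z}^\star$. Because $\delta < \gamma$, $E$ is strongly convex in $\mathbf{x}$, and
a similar argument shows that $\mathbf{x}^\star = \arg\min_{\mathbf{x}\in\mathcal{B}^B} E(\mathbf{z}^\star,\mathbf{x})$.
Hence, $(\mathbf{z}^\star,\mathbf{x}^\star)$ minimizes $E$ with respect to $\mathbf{z}$ and $\mathbf{x}$ separately;
this, combined with the fact that $E$ is differentiable, and $\mathcal{B}$
coordinate-wise separable, guarantees that $(\mathbf{z}^\star,\mathbf{x}^\star)$ satisfies the
first-order conditions for (BCR∗); see Theorem 2 in [43] and
similar arguments in [44].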
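The closed-form minimizer over z and the strict midpoint inequality used in the proof above can be checked numerically. The following NumPy sketch is only an illustration: the matrix A is a random stand-in, the values of γ and δ are arbitrary choices with δ < γ, and the function names are hypothetical.

```python
import numpy as np


def bcr_objective(A, z, x, gamma, delta):
    """E(z, x) = ||A z||^2 + gamma ||z - x||^2 - delta ||x||^2."""
    return (np.linalg.norm(A @ z) ** 2
            + gamma * np.linalg.norm(z - x) ** 2
            - delta * np.linalg.norm(x) ** 2)


def z_minimizer(A, x, gamma):
    """Closed-form minimizer over z: (I + A^H A / gamma)^{-1} x."""
    B = A.shape[1]
    return np.linalg.solve(np.eye(B) + (A.conj().T @ A) / gamma, x)


rng = np.random.default_rng(1)
U, B = 4, 16                      # illustrative dimensions (assumption)
A = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
x = rng.choice([-1.0, 1.0], B) + 1j * rng.choice([-1.0, 1.0], B)
gamma, delta = 2.0, 1.0           # arbitrary values with delta < gamma

z_hat = z_minimizer(A, x, gamma)
z_other = z_hat + 0.1 * (rng.standard_normal(B) + 1j * rng.standard_normal(B))

e_hat = bcr_objective(A, z_hat, x, gamma, delta)
e_other = bcr_objective(A, z_other, x, gamma, delta)
e_mid = bcr_objective(A, 0.5 * (z_hat + z_other), x, gamma, delta)

# First-order condition in z: A^H A z + gamma (z - x) = 0 at the minimizer.
residual = A.conj().T @ (A @ z_hat) + gamma * (z_hat - x)
```

Because E is a strongly convex quadratic in z, any perturbed point has a strictly larger objective value, and the midpoint of two distinct points lies strictly below the average of their objective values.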
APPENDIX B
PROOF OF THEOREM 2
Let $E(\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 - \delta\|\mathbf{x}\|_2^2$ denote the objective (9)
minimized by C2PO. Let $f$ and $g$ be defined as in (10). Using
the definition of the proximal operator (11) together with (13),
the second update (12) of C2PO can be written as
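\[
\begin{aligned}
\mathbf{x}^{(t+1)} &= \arg\min_{\mathbf{x}} \; g(\mathbf{x}) + \frac{1}{2\tau}\|\mathbf{x} - (\mathbf{x}^{(t)} - \tau\nabla f(\mathbf{x}^{(t)}))\|^2 \\
&= \arg\min_{\mathbf{x}} \; g(\mathbf{x}) + f(\mathbf{x}^{(t)}) + \langle \mathbf{x} - \mathbf{x}^{(t)}, \nabla f(\mathbf{x}^{(t)}) \rangle + \frac{1}{2\tau}\|\mathbf{x} - \mathbf{x}^{(t)}\|^2.
\end{aligned}
\]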
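Observe that, whenever $\tau < \|\mathbf{A}^T\mathbf{A}\|_{2,2}^{-1}$, the inequality
\[
f(\mathbf{x}) \le f(\mathbf{x}^{(t)}) + \langle \mathbf{x} - \mathbf{x}^{(t)}, \nabla f(\mathbf{x}^{(t)}) \rangle + \frac{1}{2\tau}\|\mathbf{x} - \mathbf{x}^{(t)}\|^2
\]
holds for all $\mathbf{x}$. Using this observation, we can write
\[
\begin{aligned}
E(\mathbf{x}^{(t+1)}) &= g(\mathbf{x}^{(t+1)}) + f(\mathbf{x}^{(t+1)}) \\
&\le g(\mathbf{x}^{(t+1)}) + f(\mathbf{x}^{(t)}) + \langle \mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}, \nabla f(\mathbf{x}^{(t)}) \rangle + \frac{1}{2\tau}\|\mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}\|^2 \\
&= \min_{\mathbf{x}} \; g(\mathbf{x}) + f(\mathbf{x}^{(t)}) + \langle \mathbf{x} - \mathbf{x}^{(t)}, \nabla f(\mathbf{x}^{(t)}) \rangle + \frac{1}{2\tau}\|\mathbf{x} - \mathbf{x}^{(t)}\|^2 \\
&\le g(\mathbf{x}^{(t)}) + f(\mathbf{x}^{(t)}) = E(\mathbf{x}^{(t)}).
\end{aligned}
\]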
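This shows that the sequence $\{E(\mathbf{x}^{(t)})\}$ is monotonically
decreasing. Since the sequence is bounded below, there is
some limit $L = \lim_{t\to\infty} E(\mathbf{x}^{(t)})$. Let $\{\mathbf{x}^{(t_k)}\}$ be a convergent
sub-sequence of iterates (which must exist because the iterates
are bounded) with limit point $\mathbf{x}^\star$. Let
\[
\bar{\mathbf{x}}^\star = \arg\min_{\mathbf{x}} \; g(\mathbf{x}) + f(\mathbf{x}^\star) + \langle \mathbf{x} - \mathbf{x}^\star, \nabla f(\mathbf{x}^\star) \rangle + \frac{1}{2\tau}\|\mathbf{x} - \mathbf{x}^\star\|^2 \tag{15}
\]
be the result of applying the C2PO iteration starting at $\mathbf{x}^\star$.
Observing that $E(\mathbf{x}^{(t_k+1)}) \le E(\mathbf{x}^{(t_k)}) \le E(\mathbf{x}^{(t_k-1)})$, and
letting $k \to \infty$, we find that $E(\bar{\mathbf{x}}^\star) = E(\mathbf{x}^\star) = L$, and
so $\mathbf{x}^\star$ is a minimizer of (15). This is only possible if
$\mathbf{0} \in \partial g(\mathbf{x}^\star) + \nabla f(\mathbf{x}^\star)$, in which case $\mathbf{x}^\star$ is a stationary point.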
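The monotone decrease of the objective established above can be observed numerically with a small proximal-gradient loop. The sketch below rests on several assumptions that are not confirmed by this section: it takes f(x) = ½‖Ax‖² and g(x) equal to −(δ/2)‖x‖² plus the indicator of the box ‖x‖∞ ≤ 1, works with real-valued quantities for simplicity, and uses a random stand-in for A. Under these assumptions and τδ < 1, the proximal step reduces to scaling followed by clipping. All names are hypothetical.

```python
import numpy as np


def energy(A, x, delta):
    """f(x) + g(x) for x inside the box, where the indicator is zero."""
    return 0.5 * np.linalg.norm(A @ x) ** 2 - 0.5 * delta * np.linalg.norm(x) ** 2


def prox_gradient_step(A, x, tau, delta):
    """One forward-backward step under the assumed split of f and g."""
    z = x - tau * (A.T @ (A @ x))                        # gradient step on f
    return np.clip(z / (1.0 - tau * delta), -1.0, 1.0)   # proximal step on g


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32)) / np.sqrt(32)   # random stand-in (assumption)

lipschitz = np.linalg.norm(A.T @ A, 2)   # spectral norm of A^T A
tau = 0.9 / lipschitz                    # step size with tau < 1 / ||A^T A||
delta = 0.5 / tau                        # chosen so that tau * delta = 0.5 < 1

x = np.clip(rng.standard_normal(32), -1.0, 1.0)  # feasible starting point
energies = [energy(A, x, delta)]
for _ in range(50):
    x = prox_gradient_step(A, x, tau, delta)
    energies.append(energy(A, x, delta))
energies = np.array(energies)
```

With the step size below the inverse spectral norm of AᵀA, the recorded objective values should never increase from one iteration to the next, and the iterates stay inside the box.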
ACKNOWLEDGMENTS
The authors would like to thank O. Tirkkonen for insightful
discussions on 1-bit precoding. The authors also thank A. Burg
for discussions on the hardware architecture and R. Manohar
for pointing us to its connection to Cannon’s algorithm. The
work of O. Castañeda and C. Studer was supported in part
by Xilinx, Inc. and by the US National Science Foundation
(NSF) under grants ECCS-1408006, CCF-1535897, CAREER
CCF-1652065, and CNS-1717559. The work of S. Jacobsson
and G. Durisi was supported by the Swedish Foundation for
Strategic Research under grant ID14-0022, and by the Swedish
Governmental Agency for Innovation Systems (VINNOVA)
within the center ChaseOn. The work of T. Goldstein was
supported in part by the US NSF under grant CCF-1535902
and by the US Office of Naval Research under grant N00014-
17-1-2078.
REFERENCES
[1] O. Castañeda, T. Goldstein, and C. Studer, “POKEMON: a non-linear
beamforming algorithm for 1-bit massive MIMO,” in IEEE Intl. Conf.
on Acoustics, Speech, and Sig. Proc. (ICASSP), New Orleans, LA, Mar.
2017.
[2] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors,
and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with
very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp.
40–60, Jan. 2013.
[3] E. G. Larsson, F. Tufvesson, O. Edfors, and T. L. Marzetta, “Massive
MIMO for next generation wireless systems,” IEEE Commun. Mag.,
vol. 52, no. 2, pp. 186–195, Feb. 2014.
[4] L. Lu, G. Ye Li, A. L. Swindlehurst, A. Ashikhmin, and R. Zhang,
“An overview of massive MIMO: Benefits and challenges,” IEEE J. Sel.
Topics Signal Process., vol. 8, no. 5, pp. 742–758, Oct. 2014.
[5] K. Li, R. Sharan, Y. Chen, T. Goldstein, J. R. Cavallaro, and C. Studer,
“Decentralized beamforming for massive MU-MIMO on a GPU cluster,”
in 4th IEEE Global Conf. on Sig. and Info. Proc. (GlobalSIP), Washington,
D.C., Dec. 2016.
[6] ——, “Decentralized baseband processing for massive MU-MIMO
systems,” Feb. 2017. [Online]. Available: arXiv:1702.04458
[7] S. Jacobsson, G. Durisi, M. Coldrey, T. Goldstein, and C. Studer,
“Quantized precoding for massive MU-MIMO,” IEEE Trans. Comm.;
arXiv preprint: 1610.07564, Jul. 2016.
[8] ——, “Nonlinear 1-bit precoding for massive MU-MIMO with higher-
order modulation,” in Proc. Asilomar Conf. Signals, Syst., Comput.,
Pacific Grove, CA, Nov. 2016, pp. 763–767.
[9] H. Jedda, J. A. Nossek, and A. Mezghani, “Minimum BER precoding in
1-bit massive MIMO systems,” in IEEE Sensor Array and Multichannel
Sig. Proc. Workshop (SAM), Rio de Janeiro, Brazil, Jul. 2016.
[10] O. Tirkkonen and C. Studer, “Subset-codebook precoding for 1-bit
massive multiuser MIMO,” in Conf. on Info. Sciences and Systems
(CISS), Baltimore, MD, Mar. 2017.
[11] A. Mezghani, R. Ghiat, and J. A. Nossek, “Transmit processing with low
resolution D/A-converters,” in Proc. IEEE Int. Conf. Electron., Circuits,
Syst. (ICECS), Yasmine Hammamet, Tunisia, Dec. 2009, pp. 683–686.
[12] A. K. Saxena, I. Fijalkow, and A. L. Swindlehurst, “On one-bit quantized
ZF precoding for the multiuser massive MIMO downlink,” in IEEE
Sensor Array and Multichannel Sig. Proc. Workshop (SAM), Rio de
Janeiro, Brazil, Jul. 2016.
[13] R. D. J. Guerreiro and P. Montezuma, “Use of 1-bit digital-to-analogue
converters in massive MIMO systems,” IEEE Electron. Lett., vol. 52,
no. 9, pp. 778–779, Apr. 2016.
[14] O. B. Usman, H. Jedda, A. Mezghani, and J. A. Nossek, “MMSE precoder
for massive MIMO using 1-bit quantization,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), Shanghai, China, Mar. 2016,
pp. 3381–3385.
[15] S. Shah, A. K. Yadav, C. D. Castillo, D. W. Jacobs, C. Studer, and
T. Goldstein, “Biconvex relaxation for semidefinite programming in
computer vision,” in European Conf. on Comp. Vision (ECCV). Springer,
Sep. 2016, pp. 717–735.
[16] C. Risi, D. Persson, and E. G. Larsson, “Massive MIMO with 1-bit
ADC,” Apr. 2014. [Online]. Available: http://arxiv.org/abs/1404.7736
[17] S. Jacobsson, G. Durisi, M. Coldrey, U. Gustavsson, and C. Studer, “One-
bit massive MIMO: Channel estimation and high-order modulations,” in
Proc. IEEE Int. Conf. Commun. Workshop (ICCW), London, U.K., June
2015, pp. 1304–1309.
[18] Y. Li, C. Tao, G. Seco-Granados, A. Mezghani, A. L. Swindlehurst,
and L. Liu, “Channel estimation and performance analysis of one-bit
massive MIMO systems,” IEEE Trans. Signal Process., vol. 65, no. 15,
pp. 4075–4089, May 2016.
[19] C. Mollén, J. Choi, E. G. Larsson, and R. W. Heath Jr., “Uplink
performance of wideband massive MIMO with one-bit ADCs,”
IEEE Trans. Wireless Commun., vol. 16, no. 1, pp. 87–100, 2017.
[20] C. Studer and G. Durisi, “Quantized massive MU-MIMO-OFDM uplink,”
IEEE Trans. Commun., vol. 64, no. 6, pp. 2387–2399, Jun. 2016.
[21] M. Wu, B. Yin, G. Wang, C. Dick, J. Cavallaro, and C. Studer,
“Large-scale MIMO detection for 3GPP LTE: Algorithm and FPGA
implementation,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp.
916–929, Oct. 2014.
[22] Z. Wu, C. Zhang, Y. Xue, S. Xu, and X. You, “Efficient architecture
for soft-output massive MIMO detection with Gauss-Seidel method,” in
IEEE Int. Symp. on Circuits and Systems (ISCAS), Montreal, Canada,
Aug. 2016, pp. 1886–1889.
[23] M. Wu, C. Dick, J. R. Cavallaro, and C. Studer, “High-throughput
data detection for massive MU-MIMO-OFDM using coordinate descent,”
IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 63, no. 12,
pp. 2357–2367, Nov. 2016.
[24] O. Castañeda, T. Goldstein, and C. Studer, “Data detection in large
multi-antenna wireless systems via approximate semidefinite relaxation,”
IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 63, no. 12,
pp. 2334–2346, Nov. 2016.
[25] M. Barrenechea, L. Barbero, M. Mendicute, and J. Thompson, “Design
and hardware implementation of a low-complexity multiuser vector
precoder,” in Conf. on Design and Architectures for Sig. and Image Proc.
(DASIP), Oct. 2010, pp. 160–167.
[26] H. Prabhu, O. Edfors, J. Rodrigues, L. Liu, and F. Rusek, “Hardware
efficient approximative matrix inversion for linear pre-coding in massive
MIMO,” in IEEE Intl. Symp. on Circuits and Systems (ISCAS), June
2014, pp. 1700–1703.
[27] C. Shepard, N. Anand, and L. Zhong, “Practical performance of MU-
MIMO precoding in many-antenna base stations,” in Proc. of the 2013
workshop on Cellular networks: operations, challenges, and future design.
ACM, June 2013, pp. 13–18.
[28] H. Prabhu, O. Edfors, J. Rodrigues, L. Liu, and F. Rusek, “A 60 pJ/b
300 Mb/s 128×8 massive MIMO precoder-detector in 28nm FD-SOI,”
in IEEE Intl. Solid-State Circuits Conf. (ISSCC), San Francisco, CA,
Feb. 2017, pp. 60–61.
[29] E. Björnson, M. Bengtsson, and B. Ottersten, “Optimal multiuser transmit
beamforming: A difficult problem with a simple solution structure,” IEEE
Signal Process. Mag., vol. 31, no. 4, pp. 142–148, Jul. 2014.
[30] E. Björnson and E. Jorswieck, “Optimal resource allocation in coordinated
multi-cell systems,” Foundations and Trends in Communications and
Information Theory, vol. 9, no. 2-3, pp. 113–381, 2013.
[31] M. Joham, W. Utschick, and J. A. Nossek, “Linear transmit processing in
MIMO communications systems,” IEEE Trans. Signal Process., vol. 53,
no. 8, pp. 2700–2712, Aug. 2005.
[32] S. Shi, M. Schubert, and H. Boche, “Downlink MMSE transceiver
optimization for multiuser MIMO systems: Duality and sum-MSE
minimization,” IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5436–
5446, Nov. 2007.
[33] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in
lattices,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug.
2002.
[34] U. Fincke and M. Pohst, “Improved methods for calculating vectors of
short length in a lattice, including a complexity analysis,” Math. Comput.,
vol. 44, no. 170, pp. 463–471, Apr. 1985.
[35] S. Verdú, “Computational complexity of multiuser detection,” Algorith-
mica, vol. 4, no. 1, pp. 303–312, 1989.
[36] T. Goldstein, C. Studer, and R. G. Baraniuk, “A field guide to
forward-backward splitting with a FASTA implementation,” Nov. 2014.
[Online]. Available: http://arxiv.org/abs/1411.3406
[37] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding
algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1,
pp. 183–202, Jan. 2009.
[38] T. Goldstein and S. Setzer, “High-order methods for basis pursuit,” UCLA
CAM Report, pp. 10–41, 2010.
[39] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends®
in Optimization, vol. 1, no. 3, pp. 127–239, Jan. 2014.
[40] G. H. Golub and C. F. van Loan, Matrix Computations, 3rd ed. The
Johns Hopkins Univ. Press, 1996.
[41] L. Cannon, “A cellular computer to implement the Kalman filter
algorithm,” Ph.D. dissertation, Montana State University, United States,
1969.
[42] S. Jacobsson, G. Durisi, M. Coldrey, and C. Studer, “Massive MU-
MIMO-OFDM downlink with one-bit DACs and linear precoding,” in
Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Singapore, Dec.
2017.
[43] P. Tseng, “Convergence of a block coordinate descent method for
nondifferentiable minimization,” J. of Opt. Theory and Applications,
vol. 109, no. 3, pp. 475–494, June 2001.
[44] P. Richtárik and M. Takáč, “Iteration complexity of randomized
block-coordinate descent methods for minimizing a composite function,”
Mathematical Programming, vol. 144, no. 1-2, pp. 1–38, 2014.
