Feedforward Architectures for Decentralized Precoding in Massive MU-MIMO
  Systems by Li, Kaipeng et al.
Kaipeng Li¹, Charles Jeon², Joseph R. Cavallaro¹, and Christoph Studer²
¹Department of Electrical and Computer Engineering, Rice University, Houston, TX
²School of Electrical and Computer Engineering, Cornell University, Ithaca, NY
Abstract—Massive multi-user multiple-input multiple-output
(MU-MIMO) enables significant gains in spectral efficiency and
link reliability compared to conventional small-scale MIMO
technology. Furthermore, linear precoders, e.g., using zero forcing
or Wiener filter (WF) precoding, are sufficient to achieve excellent
error-rate performance in the massive MU-MIMO downlink.
However, these methods necessitate centralized processing at the
base-station (BS), which causes (i) excessively high interconnect
and chip input/output data rates, and (ii) high implementation
complexity. We propose two feedforward architectures and corre-
sponding decentralized WF precoders that parallelize precoding
across multiple computing fabrics, effectively mitigating the
issues of centralized approaches. To demonstrate the efficacy of
our decentralized precoders, we provide implementation results
on a multi-GPU system, which show that our solutions achieve
throughputs in the Gbit/s regime while achieving (near-)optimal
error-rate performance in the massive MU-MIMO downlink.
I. INTRODUCTION
Massive multi-user (MU) multiple-input multiple-output
(MIMO) will be among the core technologies of fifth-generation
(5G) cellular wireless systems [1]. The key idea of this
technology is to equip the infrastructure base-stations (BSs)
with hundreds to thousands of antenna elements while serving
tens of user equipments (UEs) at the same time and in the
same frequency band. The fine-grained nature of beamforming
enabled by massive MU-MIMO antenna arrays and coherent
transmission yields significantly improved spectral efficiency,
coverage, and range compared to that of traditional, small-
scale multi-antenna wireless systems [2], [3]. Unfortunately,
the advantages of massive MU-MIMO come at the cost of
significant practical implementation challenges, which must be
solved to realize the gains of this technology in practice.
A. Interconnect Bandwidth and Complexity Bottlenecks
As discussed in [4]–[7], the excessively high amount of data
that must be transferred between the baseband processing unit
and the antenna array is among the most critical challenges. For
example, the raw baseband data rates from and to the radio-frequency
(RF) chains of a 128-antenna massive MU-MIMO system with 10-bit
digital-to-analog converters (DACs) for a bandwidth of 40 MHz exceed
200 Gbit/s, which poses significant challenges not only for existing
interconnect technology, such as the Common Public Radio Interface
(CPRI) [8], but also for the input/output (I/O) interfaces of existing
computing fabrics, such as application-specific integrated circuits
(ASICs), field-programmable gate arrays (FPGAs), or graphics
processing units (GPUs). While maximum ratio transmission (MRT)
enables fully distributed channel estimation (CHEST) and beamforming
in the downlink (BS transmits to the UEs), which alleviates the
bandwidth and I/O bottlenecks, MRT is unable to fully exploit
the spectral-efficiency advantages of massive MU-MIMO [2].
More sophisticated precoding strategies, such as zero-forcing
(ZF) or Wiener filter (WF) precoding [9], enable near-optimal
spectral efficiency. However, such precoding methods require
centralized baseband processing, which results in high interconnect
bandwidth, I/O data rates, and complexity [2].

This work was supported in part by Xilinx, Inc., the US NSF under
grants CNS-1265332, ECCS-1232274, ECCS-1408370, CNS-1717218,
ECCS-1408006, CCF-1535897, CCF-1652065, CNS-1717559, and with hardware
and software support from the Texas Advanced Computing Center and the
Nvidia Technology Center (the PSG Cluster) with DGX-1 multi-GPU systems.
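The text does not spell out how the 200 Gbit/s figure for the 128-antenna example is obtained. One plausible accounting is sketched below. The assumptions are ours, not the paper's: complex baseband sampling at a rate equal to the bandwidth, separate 10-bit words for the in-phase and quadrature components, and equal rates towards and from the RF chains. The function name is hypothetical.

```python
# Back-of-the-envelope raw baseband data rate for the 128-antenna example.
# Assumptions (not stated in the text): one complex sample per second per Hz
# of bandwidth, 10 bits each for I and Q, and the same rate in both directions.
def raw_baseband_rate_bps(antennas, bits_per_component, bandwidth_hz, directions=2):
    samples_per_second = bandwidth_hz           # assumed sampling rate
    bits_per_sample = 2 * bits_per_component    # I and Q components
    return antennas * samples_per_second * bits_per_sample * directions

rate = raw_baseband_rate_bps(antennas=128, bits_per_component=10, bandwidth_hz=40e6)
print(f"{rate / 1e9:.1f} Gbit/s")
```

Under these assumptions the aggregate rate lands slightly above the 200 Gbit/s quoted in the text; oversampling or framing overhead would push it higher.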
B. Decentralized Baseband Processing
To mitigate the bandwidth and complexity bottlenecks
of centralized baseband processing algorithms, a variety of
solutions have been proposed recently. For example, existing
massive MU-MIMO testbeds parallelize the computations
across orthogonal subcarriers [10], [11]. Parallelization across
subcarriers, however, exhibits dependence between the subcarri-
ers and all antenna elements, which prevents its straightforward
use for arrays with thousands of antennas. More recently, the
papers [12], [13] proposed decentralized baseband processing
(DBP), an approach that decentralizes the key computations
required for baseband processing in massive MU-MIMO
systems in order to alleviate the bandwidth and complexity
bottlenecks. The idea of DBP is to partition the antenna array
into clusters, each associated with separate RF circuitry and
processing fabrics. Each cluster then only communicates with
the associated processing fabrics that carry out (de-)modulation,
channel estimation, data detection, and precoding, while only
exchanging a small amount of consensus information among
the clusters. However, as it has been demonstrated with real-
world implementations on GPU clusters in [5], the exchange
of consensus information among the clusters negatively affects
the processing latency and throughput. As an effective remedy,
reference [14] proposed the use of feedforward architectures for
equalization in the uplink (UEs transmit to BS). Such architectures
have, up to this point, not been studied for the downlink.
arXiv:1804.10987v1 [cs.IT] 29 Apr 2018
C. Contributions
We propose two new feedforward architectures and corre-
sponding algorithms for decentralized precoding in the massive
MU-MIMO downlink. Both architectures are compatible with
the ones proposed for the uplink in [14] and avoid the repeated
exchange of consensus information, which reduces latency and
increases throughput. For both architectures, we propose linear
precoding algorithms that build upon the WF precoder, which
minimizes the mean-square error (MSE) at the UE side. We
show that the WF precoder for the partially decentralized (PD)
architecture achieves the same performance as the centralized
WF precoder; the WF precoder for the fully decentralized (FD)
architecture further reduces the interconnect bandwidth at a
small error-rate performance loss. We demonstrate the efficacy
of our feedforward architectures and precoding algorithms using
real-world implementations on a multi-GPU system. Our
implementation results reveal that decentralized precoding with
feedforward architectures reaches throughputs in the Gbit/s regime
while achieving (near-)optimal error-rate performance.
D. Notation
Lowercase and uppercase boldface letters denote column
vectors and matrices, respectively. The transpose and Hermitian
transpose of the matrix $\mathbf{A}$ are denoted by $\mathbf{A}^T$ and $\mathbf{A}^H$,
respectively. The $M \times M$ identity matrix is denoted by $\mathbf{I}_M$.
We use $\mathrm{tr}(\mathbf{A})$ to denote the trace of the matrix $\mathbf{A}$ and
$\mathbb{E}_{\mathbf{a}}[\cdot]$ to denote expectation with respect to the random vector $\mathbf{a}$.
II. SYSTEM MODEL AND CENTRALIZED PRECODING
We now introduce the system model and discuss the basics
of centralized precoding for massive MU-MIMO systems.
A. Downlink System Model
We focus on the massive MU-MIMO downlink. The
system consists of a base-station (BS) with $B$ antennas
serving $U$ single-antenna user-equipments (UEs) in the same
time-frequency resource. We consider a block-fading and
narrowband¹ scenario modeled as follows:
$$\mathbf{y}[k] = \mathbf{H}\mathbf{x}[k] + \mathbf{n}[k], \quad k = 1, \ldots, K. \tag{1}$$
Here, the $U$-dimensional vector $\mathbf{y}[k] = [y_1[k], \ldots, y_U[k]]^T$
contains the signals received at all $U$ UEs with $y_u[k] \in \mathbb{C}$
corresponding to the signal received at UE $u$ in time slot $k$.
The matrix $\mathbf{H} \in \mathbb{C}^{U \times B}$ represents the downlink MIMO channel
and is assumed to remain constant for $K$ time slots. The vector
$\mathbf{n}[k] \in \mathbb{C}^U$ models additive noise and is assumed to be i.i.d.
circularly-symmetric complex Gaussian with variance $N_0$ per
complex entry. We assume the channel matrix $\mathbf{H}$ and noise
variance $N_0$ to be known perfectly at the BS. The precoded
vector $\mathbf{x}[k] \in \mathbb{C}^B$ at time slot $k$ is given by the function
$$\mathbf{x}[k] = \mathcal{P}(\mathbf{s}[k], \mathbf{H}, N_0, \rho^2),$$
which depends on the transmit signal vector $\mathbf{s}[k] \in \mathcal{O}^U$, where $\mathcal{O}$ is
the constellation set (e.g., 64-QAM), the channel matrix $\mathbf{H}$, the
noise variance $N_0$, and the power constraint $\rho^2$. The precoded
vector is assumed to satisfy the average power constraint
$$\mathbb{E}_{\mathbf{s}}\big[\|\mathbf{x}[k]\|^2\big] \le \rho^2, \quad k = 1, \ldots, K, \tag{2}$$
and the vector $\mathbf{s}[k] = [s_1[k], \ldots, s_U[k]]^T$ contains the
information symbols $s_u[k] \in \mathcal{O}$ to be transmitted to UE $u$ in time
slot $k$. In what follows, we omit the time-slot index $k$.

¹An extension to wideband systems that use orthogonal frequency division
multiplexing (OFDM) is straightforward and considered in Section IV.
B. Linear Wiener Filter Precoding
Since the UEs are unable to perform joint signal processing,
the BS has to precode the information symbols with the goal
of mitigating multi-user interference (MUI). The literature
describes numerous optimization criteria for precoding, such as
sum-rate throughput or error probability [15]. In what follows,
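we focus on linear precoders of the form $\mathbf{x} = \mathbf{P}\mathbf{s}$ that minimize
the mean-square error (MSE) between the estimated symbol
vector $\hat{\mathbf{s}}$ and the transmit signal vector $\mathbf{s}$. Since coherent
transmission with a multi-antenna BS leads to an array gain, we
assume that the UEs are able to scale the received signals $y_u$,
$u = 1, \ldots, U$, by a precoding factor $\beta_u \in \mathbb{C}$, i.e., the UEs
compute symbol estimates as follows:
$$\hat{s}_u = \beta_u y_u.$$
While each UE $u$ would be able to estimate its own precoding
factor $\beta_u$, we design precoders that minimize the MSE for a
joint² precoding factor $\beta \in \mathbb{C}$ defined as [9]
$$\mathrm{MSE} = \mathbb{E}_{\mathbf{s},\mathbf{n}}\big[\|\mathbf{s} - \hat{\mathbf{s}}\|^2\big] = \mathbb{E}_{\mathbf{s},\mathbf{n}}\big[\|\mathbf{s} - \beta\mathbf{y}\|^2\big] = \mathbb{E}_{\mathbf{s}}\big[\|\mathbf{s} - \beta\mathbf{H}\mathbf{x}\|^2\big] + |\beta|^2 U N_0.$$
The resulting MSE-optimal linear precoding matrix $\mathbf{P} \in \mathbb{C}^{B \times U}$
can be designed by solving the following optimization problem:
$$\{\mathbf{P}^{\mathrm{WF}}, \beta^{\mathrm{WF}}\} = \begin{cases} \underset{\mathbf{P},\,\beta}{\text{minimize}} & \mathbb{E}_{\mathbf{s}}\big[\|\mathbf{s} - \beta\mathbf{H}\mathbf{P}\mathbf{s}\|^2\big] + \beta^2 U N_0 \\ \text{subject to} & \mathbb{E}_{\mathbf{s}}\big[\|\mathbf{x}\|^2\big] \le \rho^2. \end{cases} \tag{3}$$
The solution to this optimization problem is known as the
Wiener filter (WF) precoder [9] and is summarized by the
following result; a short proof is given in Appendix A.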
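Theorem 1. The Wiener filter (WF) precoding matrix $\mathbf{P}^{\mathrm{WF}}$ is
given by $\mathbf{P}^{\mathrm{WF}} = \frac{1}{\beta^{\mathrm{WF}}}\mathbf{Q}^{\mathrm{WF}}$, where we define the matrix
$$\mathbf{Q}^{\mathrm{WF}} = \big(\mathbf{H}^H\mathbf{H} + \kappa^{\mathrm{WF}}\mathbf{I}_B\big)^{-1}\mathbf{H}^H. \tag{4}$$
The associated regularization parameter $\kappa^{\mathrm{WF}}$ and precoding
factor $\beta^{\mathrm{WF}}$ are defined as
$$\kappa^{\mathrm{WF}} = \frac{U N_0}{\rho^2} \quad \text{and} \quad \beta^{\mathrm{WF}} = \sqrt{\frac{\mathrm{tr}\big((\mathbf{Q}^{\mathrm{WF}})^H\mathbf{Q}^{\mathrm{WF}}\big)\,E_s}{\rho^2}}. \tag{5}$$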
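As an illustration of Theorem 1, the following NumPy sketch builds the WF precoding matrix from (4) and (5) and checks that it meets the power constraint (2) with equality. It is a minimal sketch, not the authors' implementation. It assumes that $E_s$ denotes the average energy per transmit symbol (the text uses $E_s$ without defining it), and the function name and toy dimensions are hypothetical.

```python
import numpy as np

def wf_precoder(H, N0, rho2, Es=1.0):
    """Centralized WF precoder per (4) and (5); H has shape (U, B)."""
    U, B = H.shape
    kappa = U * N0 / rho2                                      # regularization in (5)
    HH = H.conj().T
    Q = np.linalg.solve(HH @ H + kappa * np.eye(B), HH)        # (4), B x B system
    beta = np.sqrt(np.trace(Q.conj().T @ Q).real * Es / rho2)  # precoding factor in (5)
    return Q / beta, beta

# Hypothetical toy setup: i.i.d. Rayleigh-fading channel, unit symbol energy.
rng = np.random.default_rng(0)
U, B, N0, rho2 = 4, 16, 0.1, 1.0
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
P, beta = wf_precoder(H, N0, rho2)

# With Es = 1, the average transmit power is E[||x||^2] = tr(P^H P).
avg_power = np.trace(P.conj().T @ P).real
print(P.shape, np.isclose(avg_power, rho2))
```

Solving the linear system instead of forming the explicit inverse is a numerical convenience of this sketch; the cost is still dominated by the $B \times B$ system.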
A straightforward computation of the precoding factor $\beta^{\mathrm{WF}}$
in (5) involves the inversion of a $B \times B$ matrix followed
by a number of matrix-matrix multiplications. We propose a
computationally efficient alternative that applies once the
$U \times U$ whitening matrix $\mathbf{A}^{-1}$, with $\mathbf{A} = \mathbf{H}\mathbf{H}^H + \kappa^{\mathrm{WF}}\mathbf{I}_U$ as
defined in Appendix B, has been precomputed; a proof is given in
Appendix B.

²Designing precoders for the case of having individual precoding factors
$\beta_u$, $u = 1, \ldots, U$, is challenging and left for future work.

[Figure 1: block diagrams of the two architectures. Both show per-cluster
CHEST, precoding, and RF blocks; (a) additionally contains a centralized
whitening block fed by an adder (+).]
(a) Partially decentralized (PD) precoding architecture.
(b) Fully decentralized (FD) precoding architecture.
Fig. 1. Partially decentralized (PD) and fully decentralized (FD) precoding
architectures for the massive MU-MIMO downlink with C clusters. (a) PD performs
decentralized channel estimation (CHEST) in the uplink and sums the partial
Gram matrices $\mathbf{G}_c$ while feeding them to the centralized whitening unit; the
⊕ operator denotes matrix additions. In the downlink, precoding is performed in
two steps: centralized whitening followed by decentralized precoding in each
cluster. (b) FD performs decentralized CHEST in the uplink. In the downlink,
precoding is performed locally at each cluster in a fully decentralized manner.
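Lemma 2. The precoding factor $\beta^{\mathrm{WF}}$ of the WF precoder
in (5) can be computed efficiently as follows:
$$\beta^{\mathrm{WF}} = \sqrt{\frac{E_s}{\rho^2}\Big(\mathrm{tr}\big(\mathbf{A}^{-1}\big) - \kappa^{\mathrm{WF}}\|\mathbf{A}^{-1}\|_F^2\Big)}. \tag{6}$$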
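Lemma 2 can be checked numerically. The sketch below compares the direct evaluation of (5), which needs a $B \times B$ system, with the evaluation of (6), which only needs the $U \times U$ matrix $\mathbf{A}^{-1}$. It assumes $\mathbf{A} = \mathbf{H}\mathbf{H}^H + \kappa^{\mathrm{WF}}\mathbf{I}_U$ as defined in Appendix B and a unit symbol energy $E_s$; the dimensions and random channel are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
U, B, N0, rho2, Es = 8, 64, 0.5, 1.0, 1.0   # hypothetical toy parameters
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
kappa = U * N0 / rho2

# Direct route via (4) and (5): requires a B x B system.
Q = np.linalg.solve(H.conj().T @ H + kappa * np.eye(B), H.conj().T)
beta_direct = np.sqrt(np.trace(Q.conj().T @ Q).real * Es / rho2)

# Lemma 2 route via (6): only the U x U whitening matrix A^{-1} is needed.
A_inv = np.linalg.inv(H @ H.conj().T + kappa * np.eye(U))
beta_lemma = np.sqrt(
    Es / rho2 * (np.trace(A_inv).real - kappa * np.linalg.norm(A_inv, "fro") ** 2)
)

print(np.isclose(beta_direct, beta_lemma))
```

Once $\mathbf{A}^{-1}$ is available, (6) needs only a sum over the diagonal entries and a sum over all squared magnitudes of $\mathbf{A}^{-1}$.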
III. DECENTRALIZED PRECODING
We next propose two decentralized precoding schemes that
rely on feedforward architectures and linear WF precoding.
A. System Model for Decentralized Precoding
We now extend the feedforward architecture put forward
in [14] for the uplink by the capability to perform downlink
precoding. To this end, we partition the BS antenna array into
$C \ge 1$ clusters, each associated with $B_c = w_c B \in \mathbb{N}^+$ BS
antennas so that $w_c \in (0, 1]$ and $\sum_{c=1}^{C} w_c = 1$. Each cluster
contains local RF circuitry and requires access to only local
channel state information (CSI) acquired in the uplink via
reciprocity. By omitting the time-slot index $k$, we can rewrite
the downlink system model in (1) as
$$\mathbf{y} = \sum_{c=1}^{C} \mathbf{H}_c\mathbf{x}_c + \mathbf{n}, \tag{7}$$
where $\mathbf{H} = [\mathbf{H}_1, \ldots, \mathbf{H}_C]$ with $\mathbf{H}_c \in \mathbb{C}^{U \times B_c}$ and
$\mathbf{x}^T = [\mathbf{x}_1^T, \ldots, \mathbf{x}_C^T]$ with $\mathbf{x}_c \in \mathbb{C}^{B_c}$. We see that each
cluster $c = 1, \ldots, C$ has to generate a precoding vector $\mathbf{x}_c$ with
information of the per-cluster channel matrix $\mathbf{H}_c$, the noise
variance $N_0$, the power constraint $\rho^2$, and the transmit symbol
vector $\mathbf{s}$, i.e., the precoding functions are as follows:
$$\mathbf{x}_c = \mathcal{P}_c(\mathbf{s}, \mathbf{H}_c, N_0, \rho^2), \quad c = 1, \ldots, C. \tag{8}$$
Since each of these functions only depends on local CSI
contained in $\mathbf{H}_c$, the exchange of CSI is reduced significantly;
the vector $\mathbf{s}$ is the only signal that must be broadcast to all
clusters. We now present two distinct feedforward architectures
that perform decentralized precoding, differing in the amount
of CSI that must be exchanged during the training phase.
B. Partially-Decentralized WF Precoding
The first feedforward architecture is illustrated in Fig. 1(a)
and called partially decentralized WF (PD-WF) architecture.
The operating principle can be derived directly from (16),
which results in the precoding rule
$$\mathbf{x} = \frac{1}{\beta^{\mathrm{WF}}}\mathbf{H}^H\mathbf{A}^{-1}\mathbf{s}.$$
The idea of PD-WF precoding is to first whiten the transmit
vector $\mathbf{s}$ at a centralized whitening node as follows:
$$\mathbf{z} = \frac{1}{\beta^{\mathrm{WF}}}\mathbf{A}^{-1}\mathbf{s}.$$
The whitened transmit vector $\mathbf{z}$ is then transmitted to each
cluster, which independently computes $\mathbf{x}_c = \mathbf{H}_c^H\mathbf{z}$.
Clearly, this PD-WF architecture requires the whitening
matrix $\mathbf{A}^{-1}$ as well as the precoding factor $\beta^{\mathrm{WF}}$ to be calculated
at the centralized whitening node; both of these quantities
require the computation of the Gram matrix $\mathbf{G}$. To compute
this matrix in a decentralized architecture, we follow the
approach for PD equalization put forward in [14], where each
cluster $c = 1, \ldots, C$ first computes the local Gram matrix
$\mathbf{G}_c = \mathbf{H}_c\mathbf{H}_c^H$ after estimating the channel in the uplink phase,
followed by computing the (centralized) Gram matrix
$\mathbf{G} = \sum_{c=1}^{C}\mathbf{G}_c$ in a feedforward adder tree; see the blue feedback
path in Fig. 1(a). The centralized whitening node then computes
the whitening matrix $\mathbf{A}^{-1}$ and the precoding factor $\beta^{\mathrm{WF}}$ as
detailed in Section II-B. Since we have that
$$\sum_{c=1}^{C}\mathbf{H}_c\mathbf{x}_c = \frac{1}{\beta^{\mathrm{WF}}}\sum_{c=1}^{C}\mathbf{H}_c\mathbf{H}_c^H\mathbf{A}^{-1}\mathbf{s} = \frac{1}{\beta^{\mathrm{WF}}}\mathbf{G}\mathbf{A}^{-1}\mathbf{s} = \mathbf{H}\mathbf{P}^{\mathrm{WF}}\mathbf{s},$$
the PD-WF architecture implements exactly the centralized
WF precoder from Theorem 1 but in a decentralized fashion.
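The PD-WF data flow can be mimicked in a few lines of NumPy. The sketch below sums local Gram matrices, whitens centrally using (6), precodes per cluster, and compares the stacked result with the centralized WF precoder of Theorem 1. It is only a single-process illustration under assumed toy dimensions with unit symbol energy; the function name `pd_wf_precode` is hypothetical and no inter-cluster communication is modeled.

```python
import numpy as np

def pd_wf_precode(H_clusters, s, N0, rho2, Es=1.0):
    """PD-WF sketch: H_clusters is a list of (U, Bc) local channel matrices."""
    U = s.shape[0]
    kappa = U * N0 / rho2
    # Training phase: each cluster forms its local Gram matrix; the sum is G.
    G = sum(Hc @ Hc.conj().T for Hc in H_clusters)
    # Centralized whitening node: A^{-1}, beta via (6), and z.
    A_inv = np.linalg.inv(G + kappa * np.eye(U))
    beta = np.sqrt(
        Es / rho2 * (np.trace(A_inv).real - kappa * np.linalg.norm(A_inv, "fro") ** 2)
    )
    z = A_inv @ s / beta
    # Decentralized precoding: each cluster only needs its own Hc and z.
    return [Hc.conj().T @ z for Hc in H_clusters]

rng = np.random.default_rng(2)
U, B, C, N0, rho2 = 4, 32, 4, 0.1, 1.0      # hypothetical toy parameters
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
s = (rng.choice([-1.0, 1.0], U) + 1j * rng.choice([-1.0, 1.0], U)) / np.sqrt(2)
x_pd = np.concatenate(pd_wf_precode(np.split(H, C, axis=1), s, N0, rho2))

# Centralized reference from Theorem 1.
kappa = U * N0 / rho2
Q = np.linalg.solve(H.conj().T @ H + kappa * np.eye(B), H.conj().T)
x_wf = Q @ s / np.sqrt(np.trace(Q.conj().T @ Q).real / rho2)
print(np.allclose(x_pd, x_wf))
```

The only quantities that cross cluster boundaries in this sketch are the $U \times U$ matrices $\mathbf{G}_c$ and the $U$-dimensional vector $\mathbf{z}$, which matches the feedforward structure described above.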
[Figure 2: six panels plotting the uncoded bit error rate (BER, logarithmic
axis from 10^-4 to 10^0) versus the normalized transmit power [dB] (from -10
to 30) for MRT, WF, PD-WF, FD-WF, and ADMM-WF. Measured latency / throughput
per panel:]
(a) B = 256, Bc = 128, C = 2. PD-WF: 0.879 ms / 0.979 Gb/s; FD-WF: 0.789 ms / 1.091 Gb/s
(b) B = 256, Bc = 64, C = 4. PD-WF: 0.678 ms / 1.270 Gb/s; FD-WF: 0.571 ms / 1.507 Gb/s
(c) B = 256, Bc = 32, C = 8. PD-WF: 0.607 ms / 1.418 Gb/s; FD-WF: 0.472 ms / 1.824 Gb/s
(d) B = 64, Bc = 32, C = 2. PD-WF: 0.532 ms / 1.618 Gb/s; FD-WF: 0.441 ms / 1.952 Gb/s
(e) B = 128, Bc = 32, C = 4. PD-WF: 0.559 ms / 1.540 Gb/s; FD-WF: 0.451 ms / 1.909 Gb/s
(f) B = 256, Bc = 32, C = 8. PD-WF: 0.607 ms / 1.418 Gb/s; FD-WF: 0.472 ms / 1.824 Gb/s
Fig. 2. Uncoded bit error-rate, latency, and throughput results for decentralized
baseband processing with U = 16 users. Top row: fixed number of BS antennas
B = 256, varying cluster size Bc and number of clusters C. Bottom row: fixed
cluster size Bc = 32, varying number of BS antennas B and number of clusters C.
PD-WF achieves the same error-rate performance as centralized precoding; FD-WF
achieves near-WF performance for cluster sizes Bc ≥ 32; the ADMM-based WF
method outperforms FD-WF but requires iterative exchange of consensus
information resulting in higher latency.
C. Fully-Decentralized WF Precoding
The second feedforward architecture, called fully decentralized
WF (FD-WF) architecture, is illustrated in Fig. 1(b) and
avoids transmitting partial Gram matrices to the centralized
whitening node. The key idea of this architecture is to first
broadcast the transmit vector $\mathbf{s}$ to each cluster, and then
compute the local precoding vector as $\mathbf{x}_c = \mathbf{P}_c\mathbf{s}$.
In order to adhere to the (total) power constraint in (2), we
have to define a per-cluster power constraint $\mathbb{E}\big[\|\mathbf{x}_c\|^2\big] \le \rho_c^2$
for which $\sum_{c=1}^{C}\rho_c^2 = \rho^2$. In what follows, we allocate the
same amount of power³ to each cluster, i.e., $\rho_c^2 = \rho^2/C$, which
results in the following precoder carried out at each cluster:
$$\mathbf{x}_c = \sqrt{\frac{\rho_c^2}{\mathrm{tr}\big(\mathbf{Q}_c^H\mathbf{Q}_c\big)\,E_s}}\;\mathbf{Q}_c\mathbf{s}.$$

³We investigated a number of strategies that allocate different power levels to
each cluster. Such methods did not provide significant performance advantages
in massive MU-MIMO, but may, for example, be critical for cell-free massive
MU-MIMO systems in which the clusters are spread over a large area [16].

The remaining piece is to identify a suitable choice of the
regularization parameters $\kappa_c$ that are used to calculate the
matrices $\mathbf{Q}_c$. A straightforward way would be to assume that
each cluster operates independently and to set the regularization
parameter to $U N_0/\rho_c^2$. In practice, however, it turns out that
this choice of the regularization parameter is overly pessimistic.
Since computing an optimal set of regularization parameters
is difficult in the decentralized scenario, we simply set
$$\kappa_c = \tau_c\,\frac{U N_0}{\rho_c^2}, \quad c = 1, \ldots, C, \tag{9}$$
and tune the parameters $\tau_c \in [0, \infty)$ for best error-rate
performance via numerical simulations. Specifically, we use
$$\mathbf{Q}_c = \begin{cases} \big(\mathbf{H}_c^H\mathbf{H}_c + \kappa_c\mathbf{I}_{B_c}\big)^{-1}\mathbf{H}_c^H & \text{if } B_c < U \\ \mathbf{H}_c^H\big(\mathbf{H}_c\mathbf{H}_c^H + \kappa_c\mathbf{I}_U\big)^{-1} & \text{if } B_c \ge U, \end{cases}$$
where the case distinction selects the smaller of the two matrix
inversions ($B_c \times B_c$ or $U \times U$) and thereby reduces the
computational complexity depending on the number of antennas per cluster.
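The local FD-WF computation at one cluster can be sketched as follows. The code follows (9) and the two-case definition of $\mathbf{Q}_c$, uses equal power allocation $\rho_c^2 = \rho^2/C$, and takes $\tau_c = 0.125$, the value reported in Section III-D. The function name, the unit symbol energy, and the toy dimensions are assumptions of this sketch and are not taken from the authors' implementation.

```python
import numpy as np

def fd_wf_local_matrix(Hc, N0, rho2_c, tau_c=0.125, Es=1.0):
    """Local FD-WF precoding matrix P_c for one cluster; Hc has shape (U, Bc)."""
    U, Bc = Hc.shape
    kappa_c = tau_c * U * N0 / rho2_c                  # regularization in (9)
    HcH = Hc.conj().T
    if Bc < U:   # the Bc x Bc system is the smaller one
        Qc = np.linalg.solve(HcH @ Hc + kappa_c * np.eye(Bc), HcH)
    else:        # the U x U inverse is the smaller one
        Qc = HcH @ np.linalg.inv(Hc @ HcH + kappa_c * np.eye(U))
    scale = np.sqrt(rho2_c / (np.trace(Qc.conj().T @ Qc).real * Es))
    return scale * Qc

rng = np.random.default_rng(3)
U, B, C, N0, rho2 = 4, 32, 4, 0.1, 1.0      # hypothetical toy parameters
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
s = (rng.choice([-1.0, 1.0], U) + 1j * rng.choice([-1.0, 1.0], U)) / np.sqrt(2)

# Each cluster works only with its own Hc and the broadcast vector s.
P_locals = [fd_wf_local_matrix(Hc, N0, rho2 / C) for Hc in np.split(H, C, axis=1)]
x = np.concatenate([Pc @ s for Pc in P_locals])

# With Es = 1, the per-cluster average powers tr(P_c^H P_c) add up to rho2.
total_power = sum(np.trace(Pc.conj().T @ Pc).real for Pc in P_locals)
print(x.shape, np.isclose(total_power, rho2))
```

No Gram matrices leave a cluster in this sketch; the price is that each cluster regularizes and normalizes using only its own channel, which is where the tuning parameter $\tau_c$ enters.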
D. Simulation Results
We now show uncoded bit error-rate (BER) simulation
results for a Rayleigh fading massive MU-MIMO system with
64-QAM. Figs. 2 (a), (b), (c) show the BER for B = 256
BS antennas, with varying cluster sizes Bc = 128, 64, 32,
and number of clusters C = 2, 4, 8. Figs. 2 (d), (e), (f)
show the BER for a fixed cluster size Bc = 32, with
a varying number of BS antennas B = 64, 128, 256, and
number of clusters C = 2, 4, 8. For each antenna configuration,
we compare the performance of the proposed decentralized
solutions PD-WF and FD-WF, as well as existing methods
including centralized WF precoding, fully-distributed MRT, and
the fully-decentralized ADMM-based WF precoder from [5].
Evidently, PD-WF achieves the same BER as the centralized
WF precoder for all antenna configurations. In contrast, FD-
WF suffers a moderate BER loss if Bc is small. To minimize
the performance loss of FD-WF precoding, we have tuned
the parameter τc in (9). Concretely, we found that τc = 0.125
performs well for a broad range of antenna and cluster config-
urations; a corresponding theoretical analysis is left for future
work. In addition, we see that the fully decentralized ADMM-
based WF precoder proposed in [5] is able to outperform
FD-WF precoding but requires multiple iterations of consensus
exchange (we use two ADMM iterations), which dramatically
reduces the throughput due to the typically high interconnect
latency between antenna clusters; see [5] for the details.
IV. MULTI-GPU IMPLEMENTATION
As a proof-of-concept, we now present a multi-GPU im-
plementation of PD-WF and FD-WF precoding, and show
corresponding throughput and latency results.
A. System Architecture
We implemented PD-WF and FD-WF precoding on an
Nvidia DGX-1 multi-GPU system [17], as illustrated in Fig. 3.
The architecture consists of two 20-core Intel Xeon E5-2698
v4 CPUs and eight Tesla V100 Volta GPUs with 300 GB/s
bi-directional NvLink GPU-to-GPU communication links. Each
Tesla V100 GPU contains 5120 CUDA cores with 16 GB high
bandwidth memory (HBM). For PD-WF and FD-WF precoding,
we use the message passing interface (MPI) library OpenMPI to
generate a total of C processes in the multi-GPU system, where
each process controls a GPU for accelerating the decentralized
local workload using CUDA v9.1 [18]. While FD-WF only requires
broadcasting of the transmit signals s[k] across GPUs prior to
the precoding computations, PD-WF necessitates gathering of the
local Gram matrices from all GPUs via sum reduction at the
centralized whitening unit (in the master GPU as shown in Fig. 3),
and broadcasting of the whitened vectors z[k]. These message
passing operations are implemented using the NVIDIA Collective
Communications Library (NCCL) [19] v2.1 that builds on MPI for
high efficiency over NvLink.
B. Implementation Details
To increase the throughput on the multi-GPU system, we
need to feed each GPU a sufficiently large workload to
fully exploit the available resources. In what follows, we
assume downlink transmission with orthogonal frequency
division multiplexing (OFDM) with $N_{sc}$ subcarriers. Each
OFDM subcarrier corresponds to an independent narrowband
block-fading downlink system as in (1), where we assume that
the channel remains constant across $K$ OFDM symbols. The
vector $\mathbf{s}^w[k]$ indicates the transmit vector $\mathbf{s}[k]$ on subcarrier $w$
in time slot $k$. In our implementations, we aggregate the
precoding workloads for $K$ OFDM symbols, each including $N_{sc}$
subcarriers, and process them together in parallel to improve
the GPU occupancy and throughput. In what follows, we omit
the superscript $w$ as well as the OFDM symbol index $k$.
[Figure 3: block diagram of the platform with eight V100 GPUs, two CPUs, and
PCIe switches; the legend distinguishes NVLink, PCIe, and QuickPath Interface
links.]
Fig. 3. Overview of the experimental platform with up to eight Tesla Volta
GPUs and high speed NvLink GPU-to-GPU interconnect [17]. The upper-left
GPU is the master GPU that collects results from other GPUs, performs
centralized computations for PD-WF, and broadcasts the transmit vectors s[k]
for FD-WF or the whitened vectors z[k] for PD-WF to other GPUs.
1) PD-WF Implementation: For PD-WF, we invoke C
MPI processes that control C GPUs, and each process
initializes computation of the local Gram matrix $\mathbf{G}_c$ using
the local channel $\mathbf{H}_c$ on its GPU. Within each
GPU, we calculate $N_{sc}$ such $\mathbf{G}_c$ matrices to achieve high
throughput. These matrix multiplications can be efficiently
implemented using the cuBLAS [20] library; specifically, we
use the cublasCgemmBatched function for complex-valued
floating-point arithmetic. Once all local $\mathbf{G}_c$ matrices have
been computed, we sum the $\mathbf{G}_c$ from all C GPUs (resulting in
the global Gram matrix $\mathbf{G}$) at the master
GPU using the NCCL ncclReduce function. The NCCL
library leverages CUDA-aware MPI [21] for direct GPU-to-
GPU memory copy over high-speed NvLink interconnect.
We compute $\mathbf{A} = \mathbf{G} + \kappa^{\mathrm{WF}}\mathbf{I}_U$ for $N_{sc}$ subcarriers
(corresponding to a given OFDM symbol $k$) at the master
GPU in parallel. We then invert this matrix using the
cuBLAS cublasCgetrfBatched LU factorization,
followed by cublasCgetriBatched that implements forward
and backward substitution operations to obtain $\mathbf{A}^{-1}$.
Since the local channel matrix $\mathbf{H}_c$ is assumed to remain
constant for $K$ OFDM symbols, we store $\mathbf{A}^{-1}$ for a given
OFDM symbol $k$, and reuse this matrix for all $K$ OFDM
symbols. To compute the whitened vector $\mathbf{z} = \frac{1}{\beta^{\mathrm{WF}}}\mathbf{A}^{-1}\mathbf{s}$,
we first multiply the transmit vector $\mathbf{s}$ with the matrix
$\mathbf{A}^{-1}$ using the cublasCgemmBatched function for a total
of $N_{sc} \times K$ subcarriers. We then calculate the precoding
factor $\beta^{\mathrm{WF}}$. As shown in (6), $\beta^{\mathrm{WF}}$ depends on $\mathrm{tr}(\mathbf{A}^{-1})$ and
$\|\mathbf{A}^{-1}\|_F^2$, which involve sum reduction operations across the
diagonal entries or all entries of the matrix $\mathbf{A}^{-1}$. To resolve such
data dependencies efficiently, we design a customized kernel
function to calculate $\beta^{\mathrm{WF}}$, where we take advantage of fast local
registers and shared memories for inter-thread communications.
Specifically, we invoke this kernel with $N_{sc}$ thread-blocks to
calculate $N_{sc}$ different $\beta^{\mathrm{WF}}$ values in parallel. In each thread-
block, we generate $U \times U$ threads to access each entry of the
$U \times U$ matrix $\mathbf{A}^{-1}$, and perform inter-thread shuffle of local
register values within a warp using warp shuffle [22], and inter-
thread communication across different warps within this thread-
block using shared memory, to realize the sum reductions
required to compute $\mathrm{tr}(\mathbf{A}^{-1})$ and $\|\mathbf{A}^{-1}\|_F^2$. Analogously to
the computations for $\mathbf{A}^{-1}$, we can reuse the parameter $\beta^{\mathrm{WF}}$
across $K$ OFDM symbols, and compute the whitened vector
$\mathbf{z}$. For PD-WF, whitening happens at the master GPU in a
centralized manner, and therefore we need to broadcast the
whitened vector $\mathbf{z}$ to all GPUs using NCCL ncclBcast.
We finally compute the local precoding vector $\mathbf{x}_c = \mathbf{H}_c^H\mathbf{z}$ using
the cublasCgemmBatched function for all $N_{sc} \times K$ subcarriers
at each GPU in a decentralized fashion.
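The batching pattern described above can be illustrated on the CPU with NumPy's stacked linear algebra, which here merely stands in for the batched cuBLAS calls, the NCCL reduction, and the custom reduction kernel. This is not CUDA code and not the authors' implementation: the array layouts, the function name `batched_whitening`, the unit symbol energy, and the toy sizes are assumptions of this sketch.

```python
import numpy as np

def batched_whitening(G, kappa, rho2, Es=1.0):
    """G has shape (Nsc, U, U); returns A^{-1} and beta per subcarrier via (6)."""
    U = G.shape[-1]
    A_inv = np.linalg.inv(G + kappa * np.eye(U))          # batched over subcarriers
    tr = np.trace(A_inv, axis1=-2, axis2=-1).real         # reduction over the diagonal
    fro2 = np.sum(np.abs(A_inv) ** 2, axis=(-2, -1))      # reduction over all entries
    return A_inv, np.sqrt(Es / rho2 * (tr - kappa * fro2))

rng = np.random.default_rng(4)
C, Nsc, K, U, Bc = 2, 12, 7, 4, 8                         # hypothetical toy sizes
N0, rho2 = 0.1, 1.0
kappa = U * N0 / rho2
shape = (C, Nsc, U, Bc)
H = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
Hh = np.conj(np.swapaxes(H, -1, -2))                      # H_c^H, shape (C, Nsc, Bc, U)

G = np.sum(H @ Hh, axis=0)                                # stands in for the sum reduction
A_inv, beta = batched_whitening(G, kappa, rho2)           # computed once, reused for all K

S = (rng.choice([-1.0, 1.0], (Nsc, U, K))
     + 1j * rng.choice([-1.0, 1.0], (Nsc, U, K))) / np.sqrt(2)
Z = (A_inv @ S) / beta[:, None, None]                     # whitened vectors, (Nsc, U, K)
X = Hh @ Z                                                # local precoding, (C, Nsc, Bc, K)
print(X.shape)
```

Stacking the $K$ symbol vectors as columns lets a single batched matrix product cover all $N_{sc} \times K$ precoding problems, which mirrors the workload aggregation described in Section IV-B.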
2) FD-WF Implementation: For FD-WF, we use cuBLAS
and customized kernels as for PD-WF in order to implement the
local WF precoder corresponding to $B_c$ BS antennas with the
power constraint $\rho_c^2 = \rho^2/C$. To invoke the FD-WF precoder, we
broadcast the transmit vectors $\mathbf{s}$ to the C MPI processes, each
running a local WF precoder on a separate GPU to compute
the local precoding vectors $\mathbf{x}_c$ in parallel.
C. Implementation Results
Fig. 2 shows the latency and throughput results of PD-WF
and FD-WF measured on the multi-GPU system for various BS
antenna configurations and U = 16 UEs. For all configurations,
we set $N_{sc} = 1200$ and $K = 7$, corresponding to a slot of a
20 MHz LTE signal with OFDM and 64-QAM transmission.
In the top row of Fig. 2, we fix the number of BS antennas to
B = 256, and increase the number of clusters C = 2, 4, 8 (and,
hence, the number of used GPUs) while decreasing the cluster
size Bc = 128, 64, 32. By decreasing Bc, the throughput of
PD-WF and FD-WF precoding increases as a smaller local workload
is processed on each of the parallel GPUs. FD-WF achieves a higher
data rate than PD-WF, since FD-WF only requires broadcasting the
transmit vector $\mathbf{s}$, which scales with $N_{sc} \times K \times U$, while PD-WF
requires a higher amount of message passing, which includes (i)
gathering of local Gram matrices $\mathbf{G}_c$ (scaling with $N_{sc} \times U \times U$)
and (ii) broadcasting of the whitened vector $\mathbf{z}$ (scaling with
$N_{sc} \times K \times U$).
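To put the scaling terms into numbers, the short calculation below counts the complex-valued entries exchanged for the configuration used here ($N_{sc} = 1200$, $K = 7$, $U = 16$). It is a rough sketch that only counts entries per cluster; how NCCL schedules the reduction and broadcast, and the number of bytes per entry, are left out and would be additional assumptions.

```python
# Complex-valued entries exchanged per cluster for Nsc = 1200, K = 7, U = 16.
Nsc, K, U = 1200, 7, 16
broadcast_entries = Nsc * K * U     # s for FD-WF, or z for PD-WF
gram_entries = Nsc * U * U          # local Gram matrices G_c (PD-WF only)

fd_total = broadcast_entries
pd_total = gram_entries + broadcast_entries
print(fd_total, pd_total, round(pd_total / fd_total, 2))
```

In this count the Gram-matrix reduction alone involves more entries than the broadcast, which is consistent with the lower throughput measured for PD-WF.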
In the bottom row of Fig. 2, we fix the number of antennas
per cluster to Bc = 32, and increase B = 64, 128, 256 by
scaling the number of clusters C = 2, 4, 8. We observe that
the throughput only degrades slightly with increasing B and
C for both PD-WF and FD-WF, indicating that the message-
passing latency remains nearly constant; this is a result of the
direct GPU-to-GPU gathering/broadcasting communications
realized by NCCL. This observation also implies that we can
increase the number of BS antennas with only a small loss
in throughput using the proposed decentralized feedforward
architecture. For all configurations shown in Fig. 2, our designs
achieve throughputs in the Gb/s regime with latencies below
1 ms. We also see that FD-WF outperforms PD-WF in terms
of throughput due to the reduced amount of message passing,
while PD-WF achieves superior error-rate performance.
V. CONCLUSIONS
We have proposed two novel feedforward architectures and
corresponding decentralized precoding algorithms based on
the linear Wiener filter (WF) precoder. We have demonstrated
that the partially-decentralized (PD) WF precoder achieves the
error-rate performance of the centralized WF precoder, while
significantly reducing the interconnect and chip input/output
bandwidths. To further reduce the interconnect bandwidth, we
have proposed a fully-decentralized (FD) WF precoder that
incurs only a small error-rate performance loss compared to the
PD-WF precoder. To showcase the efficiency and scalability
of PD-WF and FD-WF to large antenna arrays, we have
presented implementations on a multi-GPU system. Our results
demonstrate that throughputs in the Gb/s regime at latencies
below 1 ms are achievable. These results indicate that the
proposed decentralized precoders are a solution to combat
the interconnect and complexity bottlenecks while being able
to fully exploit the spectral efficiency and link reliability
advantages provided by massive MU-MIMO systems.
There are many avenues for future work. A theoretical
analysis of the optimal regularization parameter τc for FD-
WF precoding in (9) is an open research question. Combining
decentralized feedforward precoding with data detection as
in [14] may further reduce the processing latency and increase
the throughput as a large number of quantities can be re-
used between the uplink and downlink. The development and
analysis of feedforward architectures for cell-free massive MU-
MIMO as put forward in [16] is part of ongoing work.
APPENDIX A
PROOF OF THEOREM 1
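The precoder resulting from (3) is known as the
Wiener filter (WF) precoder [9] and can be derived as follows.
Let us first form the Lagrangian
$$L(\mathbf{P}, \beta, \lambda) = \mathbb{E}_{\mathbf{s}}\big[\|\mathbf{s} - \beta\mathbf{H}\mathbf{P}\mathbf{s}\|^2\big] + \beta^2 U N_0 + \lambda\big(\mathbb{E}_{\mathbf{s}}\big[\|\mathbf{x}\|^2\big] - \rho^2\big).$$
We can now formulate the optimality conditions for $\mathbf{P}$ and $\beta$
by using the Wirtinger derivative as follows. For the precoding
matrix $\mathbf{P}$, we have the following optimality condition:
$$\frac{\partial}{\partial\mathbf{P}^H}L(\mathbf{P}, \beta, \lambda) = 0 \implies \beta^2\mathbf{H}^H\mathbf{H}\mathbf{P} + \lambda\mathbf{P} = \beta\mathbf{H}^H. \tag{10}$$
For the precoding factor $\beta$, we compute $\frac{\partial}{\partial\beta^*}L(\mathbf{P}, \beta, \lambda) = 0$
and obtain the following optimality condition:
$$\beta\,\mathrm{tr}(\mathbf{P}^H\mathbf{H}^H\mathbf{H}\mathbf{P}) + \beta\,\frac{U N_0}{E_s} = \mathrm{tr}(\mathbf{H}^H\mathbf{P}^H). \tag{11}$$
From the power constraint, it follows that
$$\mathbb{E}_{\mathbf{s}}\big[\|\mathbf{x}\|^2\big] = \rho^2 \implies \mathrm{tr}(\mathbf{P}^H\mathbf{P}) = \frac{\rho^2}{E_s}. \tag{12}$$
To derive the optimal value for the Lagrange multiplier $\lambda$, we
apply the following steps to the optimality condition in (10):
$$\begin{aligned} \beta^2\mathbf{H}^H\mathbf{H}\mathbf{P} + \lambda\mathbf{P} &= \beta\mathbf{H}^H \\ \beta^2\mathbf{H}^H\mathbf{H}\mathbf{P}\mathbf{P}^H + \lambda\mathbf{P}\mathbf{P}^H &= \beta\mathbf{H}^H\mathbf{P}^H \\ \beta^2\,\mathrm{tr}(\mathbf{H}^H\mathbf{H}\mathbf{P}\mathbf{P}^H) + \lambda\,\mathrm{tr}(\mathbf{P}\mathbf{P}^H) &= \beta\,\mathrm{tr}(\mathbf{H}^H\mathbf{P}^H) \\ \beta^2\,\mathrm{tr}(\mathbf{P}^H\mathbf{H}^H\mathbf{H}\mathbf{P}) + \lambda\,\frac{\rho^2}{E_s} &= \beta\,\mathrm{tr}(\mathbf{H}^H\mathbf{P}^H), \end{aligned} \tag{13}$$
where the last step results from (12). We now multiply both
sides of the optimality condition in (11) with $\beta$ to obtain
$$\begin{aligned} \beta\,\mathrm{tr}(\mathbf{P}^H\mathbf{H}^H\mathbf{H}\mathbf{P})E_s + \beta U N_0 &= \mathrm{tr}(\mathbf{H}^H\mathbf{P}^H)E_s \\ \beta^2\,\mathrm{tr}(\mathbf{P}^H\mathbf{H}^H\mathbf{H}\mathbf{P}) + \beta^2\,\frac{U N_0}{E_s} &= \beta\,\mathrm{tr}(\mathbf{H}^H\mathbf{P}^H). \end{aligned} \tag{14}$$
Subtracting (14) from (13) yields the Lagrange multiplier
$$\lambda = \beta^2\,\frac{U N_0}{\rho^2}. \tag{15}$$
Dividing (10) by $\beta^2$ and inserting (15), it follows that the WF
precoding matrix is given by $\mathbf{P}^{\mathrm{WF}} = \frac{1}{\beta^{\mathrm{WF}}}\mathbf{Q}$ with the matrix
$$\mathbf{Q} = \Big(\mathbf{H}^H\mathbf{H} + \frac{U N_0}{\rho^2}\mathbf{I}\Big)^{-1}\mathbf{H}^H.$$
The remaining piece is to identify the WF precoding factor
$\beta^{\mathrm{WF}}$. To this end, we plug $\mathbf{P}^{\mathrm{WF}}$ into (12), which leads to
$$\frac{1}{\beta^2}\,\mathrm{tr}(\mathbf{Q}^H\mathbf{Q}) = \frac{\rho^2}{E_s} \implies \frac{1}{\beta^{\mathrm{WF}}} = \frac{\rho}{\sqrt{\mathrm{tr}(\mathbf{Q}^H\mathbf{Q})\,E_s}}.$$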
APPENDIX B
PROOF OF LEMMA 2
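To reduce the computational complexity of computing $\beta^{\mathrm{WF}}$
in (5), we first use the matrix inversion lemma [23] to arrive
at an equivalent expression of (4) given by
$$\mathbf{Q}^{\mathrm{WF}} = \mathbf{H}^H\big(\mathbf{H}\mathbf{H}^H + \kappa^{\mathrm{WF}}\mathbf{I}_U\big)^{-1},$$
which requires the inversion of a $U \times U$ matrix. By precomputing
the $U \times U$ Gram matrix $\mathbf{G} = \mathbf{H}\mathbf{H}^H$ and inverting the
regularized Gram matrix defined as $\mathbf{A} = \mathbf{G} + \kappa^{\mathrm{WF}}\mathbf{I}_U$, we have
$$\mathbf{Q}^{\mathrm{WF}} = \mathbf{H}^H\mathbf{A}^{-1} \tag{16}$$
and consequently, $\mathrm{tr}(\mathbf{Q}^H\mathbf{Q}) = \mathrm{tr}\big(\mathbf{A}^{-1}\mathbf{G}\mathbf{A}^{-1}\big)$. A direct
evaluation of this expression still requires two matrix-matrix
multiplications of dimension $U \times U$. We can further reduce
complexity by noting that the following equivalence holds:
$$\mathrm{tr}\big(\mathbf{A}^{-1}\mathbf{G}\mathbf{A}^{-1}\big) = \mathrm{tr}\big(\mathbf{A}^{-1}\big) - \kappa^{\mathrm{WF}}\|\mathbf{A}^{-1}\|_F^2,$$
where we used a Searle-type identity [24] and a matrix version
of partial fraction expansion to finally obtain (6).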
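For readers who want to verify the last equivalence by hand, one direct route is written out below. It uses only $\mathbf{G} = \mathbf{A} - \kappa^{\mathrm{WF}}\mathbf{I}_U$ and the fact that $\mathbf{A}$, and hence $\mathbf{A}^{-1}$, is Hermitian; it is offered as an elementary check and is not necessarily the route via the identities cited above.

```latex
\begin{align*}
\mathbf{A}^{-1}\mathbf{G}\mathbf{A}^{-1}
  &= \mathbf{A}^{-1}\big(\mathbf{A} - \kappa^{\mathrm{WF}}\mathbf{I}_U\big)\mathbf{A}^{-1}
   = \mathbf{A}^{-1} - \kappa^{\mathrm{WF}}\mathbf{A}^{-1}\mathbf{A}^{-1}, \\
\mathrm{tr}\big(\mathbf{A}^{-1}\mathbf{A}^{-1}\big)
  &= \mathrm{tr}\big((\mathbf{A}^{-1})^{H}\mathbf{A}^{-1}\big)
   = \|\mathbf{A}^{-1}\|_F^2, \\
\mathrm{tr}\big(\mathbf{A}^{-1}\mathbf{G}\mathbf{A}^{-1}\big)
  &= \mathrm{tr}\big(\mathbf{A}^{-1}\big) - \kappa^{\mathrm{WF}}\|\mathbf{A}^{-1}\|_F^2 .
\end{align*}
```

Inserting the last line into the expression for $\beta^{\mathrm{WF}}$ in (5) gives (6).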
REFERENCES
[1] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K.
Soong, and J. C. Zhang, “What Will 5G Be?,” IEEE J. Sel. Areas
Commun., vol. 32, no. 6, pp. 1065–1082, June 2014.
[2] J. Hoydis, S. ten Brink, and M. Debbah, “Massive MIMO in the UL/DL
of Cellular Networks: How Many Antennas Do We Need?,” IEEE J. Sel.
Areas Commun., vol. 31, no. 2, pp. 160–171, Feb. 2013.
[3] L. Lu, G. Y. Li, A. L. Swindlehurst, A. Ashikhmin, and R. Zhang, “An
Overview of Massive MIMO: Benefits and Challenges,” IEEE J. Sel.
Topics in Sig. Proc., vol. 8, no. 5, pp. 742–758, Oct. 2014.
[4] A. Puglielli, N. Narevsky, P. Lu, T. Courtade, G. Wright, B. Nikolic,
and E. Alon, “A scalable massive MIMO array architecture based on
common modules,” in IEEE Intl. Conf. Commun. Workshop (ICCW).
IEEE, June 2015, pp. 1310–1315.
[5] K. Li, R. R. Sharan, Y. Chen, T. Goldstein, J. R. Cavallaro, and C. Studer,
“Decentralized Baseband Processing for Massive MU-MIMO Systems,”
IEEE J. Emerg. and Sel. Topics in Circ. and Sys., vol. 7, no. 4, pp.
491–507, Dec 2017.
[6] S. Jacobsson, G. Durisi, M. Coldrey, T. Goldstein, and C. Studer,
“Quantized Precoding for Massive MU-MIMO,” IEEE Trans. on Comm.,
vol. 65, no. 11, pp. 4670–4684, Nov 2017.
[7] E. G. Larsson, E. Bertilsson, and O. Gustafsson, “A distributed processing
architecture for modular and scalable massive MIMO base stations,”
arXiv preprint: 1801.07967, Jan. 2018.
[8] http://www.cpri.info, Common public radio interface.
[9] M. Joham, W. Utschick, and J. A. Nossek, “Linear transmit processing
in MIMO communications systems,” IEEE Trans on Sig. Proc., vol. 53,
no. 8, pp. 2700–2712, Aug 2005.
[10] Q. Yang, X. Li, H. Yao, J. Fang, K. Tan, W. Hu, J. Zhang, and Y. Zhang,
“BigStation: Enabling Scalable Real-time Signal Processing in Large
MU-MIMO Systems,” in ACM SIGCOMM, Oct. 2013, pp. 399–410.
[11] J. Vieira, S. Malkowsky, K. Nieman, Z. Miers, N. Kundargi, L. Liu,
I. Wong, V. Öwall, O. Edfors, and F. Tufvesson, “A flexible 100-antenna
testbed for Massive MIMO,” in IEEE Globecom, Dec. 2014, pp. 287–293.
[12] K. Li, Y. Chen, R. Sharan, T. Goldstein, J. R. Cavallaro, and C. Studer,
“Decentralized data detection for massive MU-MIMO on a Xeon Phi
cluster,” in Asilomar Conf. Sig. Sys. Comp., Nov. 2016, pp. 468–472.
[13] K. Li, R. Sharan, Y. Chen, J. R. Cavallaro, T. Goldstein, and C. Studer,
“Decentralized beamforming for massive MU-MIMO on a GPU cluster,”
in IEEE GlobalSIP, Dec. 2016, pp. 590–594.
[14] C. Jeon, K. Li, J. R. Cavallaro, and C. Studer, “On the achievable rates
of decentralized equalization in massive MU-MIMO systems,” in 2017
IEEE Int. Symp. on Info. Theory, June 2017, pp. 1102–1106.
[15] N. Fatema, G. Hua, Y. Xiang, D. Peng, and I. Natgunanathan, “Massive
MIMO Linear Precoding: A Survey,” IEEE Sys. Journal, vol. PP, no.
99, pp. 1–12, 2017.
[16] H. Q. Ngo, A. Ashikhmin, H. Yang, E. G. Larsson, and T. L. Marzetta,
“Cell-free massive MIMO versus small cells,” IEEE Trans. Wireless
Commun., vol. 16, no. 3, pp. 1834–1850, Jan. 2017.
[17] https://www.nvidia.com/en-us/data-center/dgx-1, Nvidia DGX-1 system.
[18] http://docs.nvidia.com/cuda, Nvidia CUDA programming guide.
[19] https://developer.nvidia.com/nccl, NVIDIA Collective Communications
Library (NCCL).
[20] http://docs.nvidia.com/cuda/cublas, Nvidia cuBLAS library.
[21] https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi,
CUDA-aware MPI.
[22] https://devblogs.nvidia.com/faster-parallel-reductions-kepler/, Parallel
reduction using warp shuffle.
[23] N. J. Higham, Accuracy and stability of numerical algorithms, Siam,
2002.
[24] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, Technical
University of Denmark, Nov. 2012.
