A Distributed Processing Architecture for Modular and Scalable Massive
  MIMO Base Stations by Bertilsson, Erik et al.
Erik Bertilsson, Student Member, IEEE, Oscar Gustafsson, Senior Member, IEEE,
and Erik G. Larsson, Fellow, IEEE
Abstract—In this work, a scalable and modular architecture
for massive MIMO base stations with distributed processing
is proposed. New antennas can readily be added by adding
a new node as each node handles all the additional involved
processing. The architecture supports conjugate beamforming,
zero-forcing, and MMSE, where for the two latter cases a central
matrix inversion is required. The impact of the time required for
this matrix inversion is carefully analyzed along with a generic
frame format. As part of the contribution, careful computational,
memory, and communication analyses are presented. It is shown
that all computations can be mapped to a single computational
structure and that a processing node consisting of a single such
processing element can handle a broad range of bandwidths and
number of terminals.
Index Terms—Massive MIMO, Distributed processing, Archi-
tecture, Scalable, Conjugate beamforming, Zero-forcing, MMSE
I. INTRODUCTION
The ever-increasing demand for higher data rates in wireless communication opens up many opportunities
and challenges in the fifth generation (5G) wireless infrastructure
[1], [2]. One such opportunity is the use of many antennas on the
base station side, commonly referred to as Massive MIMO
or Very Large Scale MIMO [3]–[7]. By using many antennas,
compared to the number of terminals, the transmit and receive
power of each antenna and the processing associated per
antenna can be reduced. Furthermore, the total energy per
bit per user can potentially be reduced at the system level
compared to traditional few antenna solutions.
However, while massive MIMO is a promising technology
there are still obstacles to overcome before systems of this type
can be deployed. A number of demonstrators [8]–[11]
have been used to show the feasibility of the
techniques. These demonstrators typically have
between 64 and 128 antennas and support up to 12 terminals.
In addition, some work has been done to implement parts or all
of the involved processing [12]–[17], either using a centralized
node with all processing [12], [13], [15] or distributing the
processing to several nodes [14], [16], [17]. For the centralized
node architectures, the case of 128 antennas and 8
terminals is typically considered.
Of more interest in our current context are the examples of
distributed processing. In [14], a base station system design
constructed from identical modules is proposed. The
baseband processing is distributed among the modules, which
are connected in an array. The modules contain RF front ends,
digital baseband processing, and digital interconnection links to
all four neighbors. The system design is analyzed along with cost and
power consumption issues. However, there are no
details on how the baseband processing should be performed
or on the impact of timing constraints.

The authors are with the Department of Electrical Engineering,
Linköping University, SE-581 83 Linköping, Sweden. Emails: {erik.bertilsson,
oscar.gustafsson, erik.g.larsson}@liu.se
Manuscript received January 25, 2018.
In [16], [17], the authors propose distributed processing
based on COTS processors. However, the timing constraints
of the considered LTE frame structure are not taken into consideration.
Additionally, the systems are not dimensioned to
meet the maximum obtainable throughputs of the considered
specifications.
Compared to a centralized architecture, in a distributed
architecture, the number of antennas at the base station can
more easily be scaled. For the centralized architecture, in-
creasing the number of antennas more or less requires a
complete redesign of the system. In a distributed architecture,
the number of antennas can be increased by adding another
node that contains the antenna and circuitry for the associated
processing. Furthermore, a distributed architecture enables
performing the computations close to the antenna, possibly
integrated on the same chip as the radio. In the case of
component failure, the modularity allows a single node to be
replaced instead of replacing a large centralized unit.
Additionally, for centralized implementations the required
data rate to read all uplink data from the ADCs and to feed the
downlink data to the DACs grows with the number of antennas,
making it very high for systems with many antennas. Finally,
a higher manufacturing yield can be expected since each chip
is smaller for a distributed architecture.
In the current work, a node and system architecture is
proposed that is distributed, modular, and scalable. It supports
conjugate beamforming (CB), zero forcing (ZF) [18], and
minimum mean-square error (MMSE) [19], where in the two
latter cases a matrix inversion is performed in a central unit.
The computational difference between ZF and MMSE is that
in MMSE, a regularization term is added before performing the
matrix inverse. This is also done at the central unit. Therefore,
we only discuss CB and ZF explicitly, as the node processing
for MMSE is the same as for ZF. The main contributions in
this work are:
• Analysis of distributed and modular processing in a
massive MIMO-OFDM system
• Node and system architecture for a distributed, modular,
and scalable MIMO-OFDM system
• Computation, memory, and communication analysis for
the nodes and system
• Analysis of timing constraints and their effects on resource
requirements
• Deterministic scheduling/control of the nodes and system
• Design space exploration showing that the proposed node
architecture can be used in a rather large set of scenarios

Fig. 1. Proposed system architecture consisting of a central control unit
(CCU) and the scalable antenna nodes. [Figure labels: Fixed part, Scalable part, Central control unit.]

arXiv:1801.07967v1 [eess.SP] 24 Jan 2018

A preliminary version of the current manuscript was presented
in [20]. Compared to the distributed architecture in [14],
we suggest using a tree interconnection of the nodes, although
the proposed approach can also be used with other interconnection
topologies. In particular, we analyze the
computational and timing requirements and propose a detailed
node architecture, along with scheduling of the computations
and inter-node communication. Compared to the distributed
architectures in [16], [17], we propose an optimized node
architecture instead of using generic processors. Additionally,
the timing constraints of the selected frame format are carefully
analyzed.
II. PROPOSED SYSTEM ARCHITECTURE
In this work, the proposed system architecture consists
of one central control unit (CCU) and a scalable part, as
illustrated in Fig. 1. The CCU is responsible for performing
operations such as error correction coding/decoding and opera-
tions associated with the other network layers, such as medium
access control (MAC). The scalable part is responsible for
the channel estimation, linear precoding, and linear decoding
of symbols transmitted to and from the base station. Every
node in the scalable part contains computational blocks for
the associated antenna(s) and inter-node communication links.
One or more nodes can be combined into a chip for different
granularity. The main difference is the latency of the inter-node
communication, which within a chip will be one or a few clock
cycles, while inter-chip may be in the range of one hundred
clock cycles assuming a serial link and a clock frequency
of hundreds of MHz. Here it is assumed that the nodes are
clocked synchronously. In downlink operation, the CCU feeds
modulated symbols for each terminal to the nodes. In uplink
operation, the nodes compute estimates of the symbols transmitted
from the terminals and send them to the CCU.
In this work, we propose to connect the nodes in a K-ary
tree; however, the nodes can be connected in other topologies
as well. It is worth pointing out that, independently of the
interconnection topology, during the accumulation of data
from all nodes, each node will only transmit data to one other
node on the way to the CCU to avoid duplicate transmissions
and accumulations. This means that data will
always be accumulated in a tree structure, independently of the
interconnection topology. Modifying the interconnection
topology changes the number of hops when accumulating data.
Trees have some inherent advantages and disadvantages
compared to array topologies. One of the most significant
advantages of the tree structure is that the number of routing
hops, $N_{\mathrm{hops}}$, grows logarithmically with the number of nodes
in the tree, as opposed to proportionally to the square root for
arrays. Figure 2 shows two different arrays and a tree topology.
For systems with a large number of antennas this is a major
benefit when using ZF or MMSE processing, as the latency
of propagating data through the tree affects the system design.
For CB processing, low latency is not as important.

Fig. 2. Three different node topologies. The external controller is connected
to the grey nodes. The black node is (one of) the furthest. The number of
routing hops is: (a) $N_{\mathrm{hops}} \propto \frac{3}{2}\sqrt{M}$, (b) $N_{\mathrm{hops}} \propto \sqrt{M}$, (c) $N_{\mathrm{hops}} \propto \log_2 M$.
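
As a rough illustration of the scaling argument above, the following sketch compares the hop-count growth terms for an array and a binary tree. Only the growth terms $\sqrt{M}$ and $\log_2 M$ named in the text are used. The function names are hypothetical, and the constant factors, which depend on where the CCU is attached, are deliberately left out.

```python
from math import log2, sqrt


def array_hops_scale(m):
    """Growth term for array topologies: proportional to sqrt(M)."""
    return sqrt(m)


def tree_hops_scale(m):
    """Growth term for a binary tree: proportional to log2(M)."""
    return log2(m)


# Compare the two growth terms for a few example system sizes.
for m in (64, 256, 1024):
    print(m, array_hops_scale(m), tree_hops_scale(m))
```

The gap between the two terms widens with the number of nodes, which is why the latency argument matters mostly for large ZF or MMSE systems.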
Another design trade-off that needs to be considered is fault
tolerance. In an ordinary tree structure, if a node fails during
operation that entire branch will not be able to communicate
with the rest of the tree. In an array topology, this could be
mitigated by routing data past the failing node. This however
increases the node complexity since a routing mechanism must
be implemented.
Additionally, there is the aspect of physical antenna place-
ment and cable routing. In systems where antennas are placed
in an array, the array-based node topologies have the advantage
of simpler cable routing. In systems where the antennas are
scattered in some irregular pattern this advantage is lost.
For the remainder of the article, for ease of exposition,
complete binary trees are considered where each chip contains
one node.
A. System Specification
The considered setup is a TDD based system that utilizes
OFDM. A generalized frame structure can be seen in Fig. 3.
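Fig. 3. Generalized frame structure. [Frame layout over $T_{\mathrm{frame}}$: $N_{\mathrm{UL},1}$ uplink (UL) symbols, pilot (P), $N_{\mathrm{UL},2}$ UL symbols, guard (G), $N_{\mathrm{DL}}$ downlink (DL) symbols, guard (G).]

TABLE I
SYSTEM PARAMETERS.
Name | Description
$K$ | Number of terminals
$N_{\mathrm{FFT}}$ | OFDM DFT/IDFT length
$N_{\mathrm{SC}}$ | OFDM subcarriers utilized
$N_{\mathrm{UL},1}$ | Uplink OFDM symbols before pilot
$N_{\mathrm{UL},2}$ | Uplink OFDM symbols after pilot
$N_{\mathrm{DL}}$ | Downlink OFDM symbols
$N_{\mathrm{hops}}$ | Number of hops to the furthest node
$f_{\mathrm{sample}}$ | Sample rate
$T_{\mathrm{OFDM}}$ | Duration of one OFDM symbol
$T_{\mathrm{frame}}$ | Duration of one frame
$T_{\mathrm{inv}}$ | Time to compute matrix inverse
$T_{\mathrm{link}}$ | Latency of sending one value over the link
$W_{\mathrm{comp}}$ | Word length of partial results
$W_{\mathrm{symbol}}$ | Word length of QAM modulated symbols
$W_{\mathrm{ADC}}$ | Word length of ADC
$W_{\mathrm{DAC}}$ | Word length of DAC
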
The frame starts with $N_{\mathrm{UL},1}$ uplink OFDM symbols where the
terminals transmit data to the base station. Then comes the
uplink pilot symbol, where all terminals transmit a unique pilot
sequence that is used to estimate the uplink radio channel.
Another $N_{\mathrm{UL},2}$ uplink OFDM symbols are sent after the
pilot. Then comes a guard interval to switch from uplink
to downlink operation. The base station then transmits $N_{\mathrm{DL}}$
OFDM symbols to the terminals. The frame duration is
$$T_{\mathrm{frame}} = (N_{\mathrm{UL},1} + N_{\mathrm{UL},2} + N_{\mathrm{DL}} + 3)\,T_{\mathrm{OFDM}}, \tag{1}$$
where $T_{\mathrm{OFDM}}$ is the duration of one OFDM symbol.
Generally, it is favorable to place the pilot close to the
middle of the frame to reduce the time between sending the
pilot and the data.
Here, the synchronization between transmitters and re-
ceivers is not considered. The system parameters are shown
in Table I.
III. COMPUTATIONAL TASKS
To utilize the proposed architecture efficiently, the algo-
rithms used must be expressed in a distributed manner. The
processing can be divided into three phases: channel estima-
tion, uplink data decoding and downlink data precoding.
A. Channel Estimation
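Here, a channel estimation based on least squares is considered. Let
$$\mathbf{x}^{k}_{\mathrm{pilot}} = \begin{bmatrix} \mathbf{0}_{1\times(k-1)} & p & \mathbf{0}_{1\times(K-k)} \end{bmatrix} \tag{2}$$
be the pilot vector transmitted by terminal $k$. The scalar $p$
is computed statically at design time. Each node receives the
signal vector
$$\mathbf{y}_{i,\mathrm{pilot}} = \mathbf{h}_i \mathbf{X}_p + \mathbf{N}_p \in \mathbb{C}^{1\times K}, \tag{3}$$
where $\mathbf{X}_p \in \mathbb{C}^{K\times K}$ is the pilot matrix and $\mathbf{N}_p \in \mathbb{C}^{1\times K}$ is a
noise vector. The pilot matrix $\mathbf{X}_p$ is given by
$$\mathbf{X}_p = \begin{bmatrix} \left(\mathbf{x}^{1}_{\mathrm{pilot}}\right)^{\mathsf{T}} & \left(\mathbf{x}^{2}_{\mathrm{pilot}}\right)^{\mathsf{T}} & \cdots & \left(\mathbf{x}^{K}_{\mathrm{pilot}}\right)^{\mathsf{T}} \end{bmatrix}. \tag{4}$$
When the pilot signals have been received at node $i$, it has
all data necessary to estimate the channels to the $K$
users, without any inter-node communication. This is done
by multiplying the received pilot signals by the scalar $1/p$. The
local channel estimate vector is
$$\hat{\mathbf{h}}_i = \frac{\mathbf{y}_{i,\mathrm{pilot}}}{p} \in \mathbb{C}^{1\times K}. \tag{5}$$
Assuming that the channels are frequency flat, the entire
channel estimate matrix $\mathbf{H} \in \mathbb{C}^{M\times K}$ can be written as
$$\mathbf{H} = \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,K} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ h_{M,1} & h_{M,2} & \cdots & h_{M,K} \end{bmatrix}, \tag{6}$$
where $h_{i,j} \in \mathbb{C}$ is the channel coefficient between antenna $i$
and terminal $j$. After the locally performed channel estimation,
node $i$ has computed and stored row $i$ of the channel matrix.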
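
The local estimation step in (2)–(5) can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the values of `K`, `p`, and the channel row `h_true` are made up, and the noise term is omitted so that the estimate can be compared directly with the channel.

```python
# Least-squares channel estimation at one node, following (2)-(5).
K = 4                      # number of terminals (assumed value)
p = 2.0 + 0j               # pilot scalar (assumed value)
h_true = [0.3 - 0.1j, -0.5 + 0.2j, 0.1 + 0.9j, 0.7 + 0.0j]  # row i of H (made up)

# Pilot matrix X_p = p * I, since terminal k transmits p only in position k.
X_p = [[p if r == c else 0j for c in range(K)] for r in range(K)]

# Noise-free received pilot vector y = h X_p (the noise term N_p is left out here).
y_pilot = [sum(h_true[r] * X_p[r][c] for r in range(K)) for c in range(K)]

# Local estimate (5): one multiplication by 1/p per terminal, i.e. K operations.
inv_p = 1 / p
h_hat = [y * inv_p for y in y_pilot]
```

No data from other nodes enters the computation, which is the property the text relies on.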
B. Linear Decoding and Precoding Matrices
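In the uplink data transmission, the base station separates
the received signal vector $\mathbf{y} \in \mathbb{C}^{M\times 1}$ into $K$ streams of
symbols $\tilde{\mathbf{y}} \in \mathbb{C}^{K\times 1}$. This is done by multiplication with
a linear detection matrix $\mathbf{A} \in \mathbb{C}^{K\times M}$. For the considered
algorithms, the decoding matrix is
$$\mathbf{A} = \begin{cases} \mathbf{H}^{\mathsf{H}}, & \text{for CB} \\ \left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\mathbf{H}^{\mathsf{H}}, & \text{for ZF.} \end{cases} \tag{7}$$
In the downlink data transmission, the symbol vector $\mathbf{q} \in \mathbb{C}^{K\times 1}$
is precoded into $\mathbf{x} \in \mathbb{C}^{M\times 1}$, which is sent from the $M$ antennas.
This is done by multiplication with a linear precoding matrix
$\mathbf{W} \in \mathbb{C}^{M\times K}$. For the considered algorithms, the precoding
matrix is
$$\mathbf{W} = \begin{cases} \mathbf{H}^{*}, & \text{for CB} \\ \mathbf{H}^{*}\left(\mathbf{H}^{\mathsf{T}}\mathbf{H}^{*}\right)^{-1}, & \text{for ZF.} \end{cases} \tag{8}$$
For CB, the linear detection matrix $\mathbf{A}$ and the linear
precoding matrix $\mathbf{W}$ are obtained directly from the channel
estimation. Each node then has access to one column of the
decoding matrix.
For the ZF algorithm, calculating $\mathbf{A}$ and $\mathbf{W}$ involves
computing a pseudoinverse of the channel matrix $\mathbf{H}$. The
ZF precoding matrix is
$$\mathbf{W} = \mathbf{H}^{*}\left(\mathbf{H}^{\mathsf{T}}\mathbf{H}^{*}\right)^{-1} = \mathbf{H}^{*}\left(\left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\right)^{*}. \tag{9}$$
Let
$$\mathbf{D} = \left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}. \tag{10}$$
For the ZF case, the matrices can then be rewritten as
$$\mathbf{W} = \mathbf{H}^{*}\mathbf{D}^{*} = (\mathbf{H}\mathbf{D})^{*} \tag{11}$$
and
$$\mathbf{A} = \mathbf{D}\mathbf{H}^{\mathsf{H}}. \tag{12}$$
Since $\mathbf{H}^{\mathsf{H}}\mathbf{H}$ is Hermitian, its
inverse is also Hermitian. With the Hermitian property ($\mathbf{D} = \mathbf{D}^{\mathsf{H}}$),
the decoding matrix can be written as
$$\mathbf{A} = \mathbf{D}^{\mathsf{H}}\mathbf{H}^{\mathsf{H}} = (\mathbf{H}\mathbf{D})^{\mathsf{H}} = \mathbf{W}^{\mathsf{T}}. \tag{13}$$
Since the decoding and precoding matrices are each other's
transpose, the local decoding column vector and the precoding
row vector will be identical.
To calculate $\mathbf{A}$ and $\mathbf{W}$, $\mathbf{D}$ must be known. The Gram
matrix of the channel estimates, $\mathbf{H}^{\mathsf{H}}\mathbf{H}$, can be calculated in
a distributed manner across all nodes. The inversion is then
performed in the CCU. Let $\mathbf{B} = \mathbf{H}^{\mathsf{H}}\mathbf{H}$.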
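$$\mathbf{H}^{\mathsf{H}}\mathbf{H} = \sum_{i=1}^{M} \mathbf{h}_i^{\mathsf{H}}\mathbf{h}_i = \begin{bmatrix} h^{*}_{1,1} & \cdots & h^{*}_{M,1} \\ h^{*}_{1,2} & \cdots & h^{*}_{M,2} \\ \vdots & \ddots & \vdots \\ h^{*}_{1,K} & \cdots & h^{*}_{M,K} \end{bmatrix} \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,K} \\ \vdots & \vdots & \ddots & \vdots \\ h_{M,1} & h_{M,2} & \cdots & h_{M,K} \end{bmatrix} \tag{14}$$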
The matrix $\mathbf{h}_i^{\mathsf{H}}\mathbf{h}_i$ is the Gram matrix of the local channel
estimate vector in node $i$, and can be computed locally
without any inter-node communication since the required data
is obtained from the channel estimation. It is a Hermitian
matrix, thus only $\frac{K(K+1)}{2}$ entries must be computed. The
computation performed in node $i$ is
$$\mathbf{B}_i = \mathbf{h}_i^{\mathsf{H}}\mathbf{h}_i + \mathbf{B}_{\text{left child}} + \mathbf{B}_{\text{right child}}. \tag{15}$$
The local contributions are added together as the matrices are
propagated upwards in the tree to form the Gram matrix of
the channel estimates.
This reduces the computational complexity of the CCU
and reduces the amount of data to be sent in the tree.
Instead of propagating a matrix with $M \times K$ values, only
$\frac{K(K+1)}{2}$ values need to be propagated, due to the
Hermitian property. However, the computational load in each node is
increased, since $\mathbf{B}_i$ must be computed at the node.
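When the $\mathbf{D}$ matrix has been computed in the CCU, it is
propagated downwards in the tree structure to all nodes. The
nodes can then calculate their local detection and precoding
vectors by multiplying the inverted matrix with the conjugate
transpose of their local channel estimate vector, which gives
column $i$ of $\mathbf{A} = \mathbf{D}\mathbf{H}^{\mathsf{H}}$ in (12). The computation performed in node
$i$ is
$$\mathbf{A}_i = \mathbf{D}\mathbf{h}_i^{\mathsf{H}} \in \mathbb{C}^{K\times 1}, \tag{16}$$
where $\mathbf{A}_i$ is the local decoding vector and $\mathbf{W}_i = \mathbf{A}_i^{\mathsf{T}}$ is the
local precoding vector.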
The process of determining the local precoding/decoding
vector, $\mathbf{A}_i$, is illustrated in Fig. 4. The leaf nodes, 1 and
2, compute their local contributions to the Gram matrix, $\mathbf{B}_1$
and $\mathbf{B}_2$ respectively, and send them to the parent node, 3.
Node 3 computes its own local contribution, adds the
contributions from the child nodes to form $\mathbf{B}_3$, and
sends it upwards to the CCU. The CCU performs the matrix
inversion and redistributes the result downwards in the tree.
When each node receives the inverted matrix, it computes its
local precoding/decoding vector.

Fig. 4. Data transfers and partitioning of computations when determining
the precoding/decoding vector for M = 3. [Figure: binary tree with leaf nodes 1 and 2, parent node 3, and the CCU. $\mathbf{B}_1$ and $\mathbf{B}_2$ are sent from the leaves to node 3, $\mathbf{B} = \mathbf{B}_3$ is sent to the CCU, where $\mathbf{D} = \mathbf{B}^{-1}$ is computed, and $\mathbf{D}$ is propagated back down to all nodes.]
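
The three-node flow described above can be mimicked in plain Python for K = 2. This is a minimal sketch under stated assumptions: the channel rows are made up, the helper names are hypothetical, the CCU inversion is replaced by a closed-form 2x2 inverse, and the local vector is formed with the conjugate transpose of the stored row so that it matches column i of A = D H^H in (12).

```python
def conj_outer(h):
    """Local Gram contribution h^H h for a 1xK row vector h."""
    return [[a.conjugate() * b for b in h] for a in h]


def mat_add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def inv2(b):
    """Closed-form inverse of a 2x2 matrix (stands in for the CCU inversion)."""
    (a, c), (d, e) = b
    det = a * e - c * d
    return [[e / det, -c / det], [-d / det, a / det]]


def local_vector(d_mat, h):
    """Column i of A = D H^H: D times the conjugate transpose of row h."""
    return [sum(d_mat[r][c] * h[c].conjugate() for c in range(len(h)))
            for r in range(len(d_mat))]


# Made-up channel rows for leaf nodes 1 and 2 and parent node 3 (K = 2).
h1 = [1 + 1j, 0.5 - 0.2j]
h2 = [0.3 + 0j, -1 + 0.4j]
h3 = [-0.6 + 0.8j, 0.2 + 0.1j]
rows = (h1, h2, h3)
K = 2

B1, B2 = conj_outer(h1), conj_outer(h2)          # computed in the leaves
B3 = mat_add(conj_outer(h3), mat_add(B1, B2))    # parent node, as in (15)
D = inv2(B3)                                     # computed in the CCU
A_cols = [local_vector(D, h) for h in rows]      # computed in each node

# For ZF, A H should equal the identity matrix.
AH = [[sum(A_cols[i][r] * rows[i][c] for i in range(len(rows)))
       for c in range(K)] for r in range(K)]
```

Only the K x K matrices B_i and D travel through the tree; the channel rows never leave their nodes.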
C. Uplink Linear Decoding
The decoding process is performed by multiplying the
received signal vector $\mathbf{y} \in \mathbb{C}^{M\times 1}$ with the decoding matrix,
$\mathbf{A}$. During the decoding, each node has access to one column
of the decoding matrix and one sample of the received signal
vector. The symbol vector estimate $\tilde{\mathbf{y}}$ is
$$\tilde{\mathbf{y}} = \mathbf{A}\mathbf{y} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,M} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{K,1} & a_{K,2} & \cdots & a_{K,M} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} a_{1,1}y_1 \\ a_{2,1}y_1 \\ \vdots \\ a_{K,1}y_1 \end{bmatrix} + \begin{bmatrix} a_{1,2}y_2 \\ a_{2,2}y_2 \\ \vdots \\ a_{K,2}y_2 \end{bmatrix} + \cdots + \begin{bmatrix} a_{1,M}y_M \\ a_{2,M}y_M \\ \vdots \\ a_{K,M}y_M \end{bmatrix}. \tag{17}$$
By multiplying the local sample with the local decoding
column, a local contribution to the received symbol vector
is computed. When the local contributions are calculated in
each node, they are sent upwards in the tree structure. The
contributions are added together as they propagate to the CCU.
Since the local contribution can be calculated using only the
local sample and one column of the decoding matrix, the entire
decoding matrix does not need to be available in all nodes. The
computation performed for each subcarrier in node $i$ is
$$\tilde{\mathbf{y}}_i = \mathbf{A}_i y_i + \tilde{\mathbf{y}}_{\text{left child}} + \tilde{\mathbf{y}}_{\text{right child}}, \tag{18}$$
where $\mathbf{A}_i$ is the local decoding vector. This is computed
similarly to computing $\mathbf{B}$ in Fig. 4.
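
The accumulation in (18) can be sketched as a recursion over a complete binary tree. The heap-style node numbering (children of node n at 2n+1 and 2n+2), the function name, and all numeric values below are assumptions made for illustration; the check at the end compares the value arriving at the CCU with the direct product A y from (17).

```python
def accumulate(node, local, n_nodes):
    """Return local[node] plus the sums received from both children, as in (18)."""
    total = list(local[node])
    for child in (2 * node + 1, 2 * node + 2):   # heap-style child indices (assumed layout)
        if child < n_nodes:
            received = accumulate(child, local, n_nodes)
            total = [t + r for t, r in zip(total, received)]
    return total


M, K = 7, 2  # seven nodes in a complete binary tree, two terminals (assumed)

# Made-up decoding columns A_i (K entries each) and received samples y_i.
A_cols = [[complex(i + 1, -k) for k in range(K)] for i in range(M)]
y = [complex(1, i) for i in range(M)]

# Each node multiplies its sample with its decoding column (K multiplications).
local = [[a * y[i] for a in A_cols[i]] for i in range(M)]

y_tilde = accumulate(0, local, M)   # vector arriving at the CCU from the root
direct = [sum(A_cols[i][k] * y[i] for i in range(M)) for k in range(K)]
```

Each link carries only K values per subcarrier, regardless of how many nodes sit below it.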
D. Downlink Linear Precoding
The precoding process is done by multiplying the symbol
vector, $\mathbf{q} \in \mathbb{C}^{K\times 1}$, with the precoding matrix, $\mathbf{W} \in \mathbb{C}^{M\times K}$.
During the precoding, each node has access to the symbol
vector and one row of the precoding matrix.
$$\mathbf{x} = \mathbf{W}\mathbf{q} = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,K} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ w_{M,1} & w_{M,2} & \cdots & w_{M,K} \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_K \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{K} w_{1,j}q_j \\ \sum_{j=1}^{K} w_{2,j}q_j \\ \vdots \\ \sum_{j=1}^{K} w_{M,j}q_j \end{bmatrix}. \tag{19}$$
The value transmitted at node $i$ is the inner product between
row $i$ of the precoding matrix and the symbol vector $\mathbf{q}$.
Similarly to the decoding case, each node only requires one
row of the precoding matrix to perform the precoding. Thus,
the entire matrix does not need to be distributed to all nodes.
The computation performed for each subcarrier in node $i$ is
$$x_i = \sum_{j=1}^{K} w_{i,j}q_j, \tag{20}$$
where $w_{i,j}$ is element $j$ of the local precoding vector $\mathbf{W}_i$. The symbol vector, $\mathbf{q}$,
is distributed to the nodes similarly to $\mathbf{D}$, and the computation
of $x_i$ is performed similarly to that of $\mathbf{A}_i$ in Fig. 4.
E. OFDM Modulation and Demodulation
In a massive MIMO OFDM system, the OFDM modulation
and demodulation are performed for each antenna. Therefore,
one FFT/IFFT must be performed in the node for each
OFDM symbol (pilot, uplink, and downlink). The length of the
FFT/IFFT is $N_{\mathrm{FFT}}$, while the number of subcarriers utilized is
$N_{\mathrm{SC}}$.
F. Processing Element
As is shown in Section VIII, having one processing element
that performs all computations in the node is enough to support
a large range of different combinations of the number of termi-
nals and channel bandwidth. Therefore, it is beneficial to find
a common structure of the involved computations discussed
earlier. The channel estimation only requires multiplications
with 1/p as shown in Fig. 5(a). For uplink decoding, each node
performs a multiplication and adds data from the other nodes
further down the tree, for a binary tree as shown in Fig. 5(b).
For the downlink precoding, a sum-of-products is locally
computed, which consists of multiplication and accumulation,
as shown in Fig. 5(c).
Finally, the FFT and IFFT consist of butterfly operations
and twiddle factor multiplications. Considering the operations
in Figs. 5(a)–(c), it makes sense to use a radix-2 decimation
in time (DIT) algorithm. This algorithm has the property that
each butterfly operation has a twiddle factor multiplication in
front of one of the inputs [21], as shown in Fig. 5(d). Although
there exist many other radix-2 algorithms, the radix-2 DIT
algorithm is the only one with this property for each and every
butterfly. As a note, it is often believed that DIT corresponds
to bit-reversed input order and normal output order. However,
this is not the case, as the butterfly computation order, and
hence the data dependency, is independent of the algorithm
selection. A conflict-free memory access scheme with low
hardware overhead can be found in e.g. [22].

TABLE II
COMPUTATIONAL TASKS AND THE NUMBER OF OPERATIONS.
Name | Description | #PE operations | Operation
CE | Channel estimation, (5) | $K$ | Fig. 5(a)
$\mathbf{B}_i$ | Local contribution for Gram matrix, (15) | $\frac{K(K+1)}{2}$ | Fig. 5(b)
$\tilde{\mathbf{y}}_i$ | Local $\tilde{\mathbf{y}}$ contribution and add contributions from child nodes, for all subcarriers, (18) | $N_{\mathrm{SC}}K$ | Fig. 5(b)
$x_i$ | Precoded symbol $x_i$ for all subcarriers, (20) | $N_{\mathrm{SC}}K$ | Fig. 5(c)
$\mathbf{W}_i/\mathbf{A}_i$ | Local precoding and decoding vector, (16) | $K^2$ | Fig. 5(c)
FFT/IFFT | FFT/IFFT for one OFDM symbol | $\frac{N_{\mathrm{FFT}}}{2}\log_2(N_{\mathrm{FFT}})$ | Fig. 5(d)

Fig. 5. Arithmetic operations performed in the nodes. [Panels: (a) multiply, (b) multiply and add, (c) multiply and accumulate, (d) multiplication and butterfly operation.]
These operations can be efficiently mapped to a processing
element as shown in Fig. 6. The number of operations for each
task and the type of operation are summarized in Table II.
In cases where multiple processing elements are used, the
processing element selection may be reconsidered. In this case,
it might be beneficial to map different computational tasks to
different processing elements, enabling specialized structures
for the given task. Similarly, if the computational requirements
per antenna are low, it may be beneficial to interleave the
computations for more than one antenna on a single processing
element.
Fig. 6. Proposed processing element capable of performing all the operations
in Fig. 5. [Figure labels: inputs W, A, B, C; outputs Y1, Y2.]
G. Computational Partitioning
So far, all computations that can be performed in a
distributed manner have been assumed to be distributed. However, this
does not need to be the case. Consider ZF processing, where
the uplink decoding is performed as
$$\tilde{\mathbf{y}} = \mathbf{A}\mathbf{y} = \left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\mathbf{H}^{\mathsf{H}}\mathbf{y}. \tag{21}$$
So far, the decoding matrix, $\mathbf{A} = \left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\mathbf{H}^{\mathsf{H}}$, is computed
once every frame. This requires that the inverted matrix is
redistributed to the nodes before the decoding process can
start. Another possibility is to compute $\mathbf{H}^{\mathsf{H}}\mathbf{y}$ in the nodes,
just like in the conjugate beamforming case, and multiply with
the inverted matrix once the results reach the CCU. The
distributed parts of the decoding could then be started
independently of the matrix inversion.
Similarly for the downlink precoding, the ZF processing is
performed according to
$$\mathbf{x} = \mathbf{W}\mathbf{q} = \mathbf{H}^{*}\left(\left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\right)^{*}\mathbf{q}, \tag{22}$$
where the precoding matrix $\mathbf{W} = \mathbf{H}^{*}\left(\left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\right)^{*}$ is
computed once every frame. This requires that the inverted
matrix is available in the node before the precoding can start.
By multiplying the complex conjugate of the inverted matrix
with the symbol vectors, $\left(\left(\mathbf{H}^{\mathsf{H}}\mathbf{H}\right)^{-1}\right)^{*}\mathbf{q}$, in the CCU before
they are sent to the nodes, the inverted matrix itself is not
needed at each node for the precoding step.
By partitioning the computations in this way, the inverted
matrix does not need to be redistributed to the nodes. However,
the computational load in the CCU is significantly increased.
The computational load of each node is only slightly reduced,
since precoding and decoding are still performed distributedly.
The only difference is that the A/W computation does not
need to be performed.
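
The two uplink partitionings give the same result, which can be checked numerically. The sketch below is an illustration only: the channel rows and samples are made up, K = 2 so that a closed-form inverse can stand in for the CCU inversion, and the variable names `opt1` and `opt2` are hypothetical.

```python
# Made-up channel rows (M = 3 nodes, K = 2 terminals) and received samples.
h_rows = [[1 + 1j, 0.5 - 0.2j], [0.3 + 0j, -1 + 0.4j], [-0.6 + 0.8j, 0.2 + 0.1j]]
y = [0.7 - 0.3j, -1.2 + 0.5j, 0.4 + 0.9j]
K = 2


def matvec(mat, vec):
    """Matrix-vector product for lists of complex numbers."""
    return [sum(mat[r][c] * vec[c] for c in range(len(vec))) for r in range(len(mat))]


# Gram matrix B = H^H H and its closed-form 2x2 inverse D.
B = [[sum(h[a].conjugate() * h[b] for h in h_rows) for b in range(K)] for a in range(K)]
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
D = [[B[1][1] / det, -B[0][1] / det], [-B[1][0] / det, B[0][0] / det]]

# Partitioning 1: each node holds A_i = D h_i^H and contributes A_i y_i.
opt1 = [0j] * K
for h, y_i in zip(h_rows, y):
    a_i = matvec(D, [v.conjugate() for v in h])
    opt1 = [s + a * y_i for s, a in zip(opt1, a_i)]

# Partitioning 2: each node contributes h_i^H y_i (as for CB); the CCU applies D.
matched = [0j] * K
for h, y_i in zip(h_rows, y):
    matched = [s + v.conjugate() * y_i for s, v in zip(matched, h)]
opt2 = matvec(D, matched)
```

In the second partitioning the nodes never need D, at the price of one K x K matrix-vector product per subcarrier and symbol in the CCU.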
IV. COMPLEXITY ANALYSIS
In this section, the computational, memory, and communi-
cation complexity is analyzed.
A. Computational Complexity
The computational complexity of each task is shown in
Table II. With the selected frame format, there are two major
limitations on the computational resources. First, since the
frame is repeated cyclically, all computations for one frame
must be performed in the duration of one frame, $T_{\mathrm{frame}}$. This
yields the average number of operations per sample received,
$N_{\mathrm{OPS,avg}}$. The number of operations that need to be performed
to obtain the precoding/decoding vector is
$$N_{\mathrm{op,weights}} = \frac{N_{\mathrm{FFT}}}{2}\log_2(N_{\mathrm{FFT}}) + K + \frac{K(K+1)}{2} + K^2, \tag{23}$$
where the first term corresponds to demodulating the OFDM
pilot symbol (FFT), the second term to estimating the $K$ channels
(CE), and the third term to computing the local contribution to
the channel Gram matrix ($\mathbf{B}_i$). The fourth term comes from
multiplying the inverted matrix with the local channel estimates to
create the local decoding and precoding vector. The number
of operations performed for each uplink OFDM symbol is
$$N_{\mathrm{op,UL}} = \frac{N_{\mathrm{FFT}}}{2}\log_2(N_{\mathrm{FFT}}) + KN_{\mathrm{SC}}, \tag{24}$$
where the first term corresponds to demodulating the OFDM
symbol, and the second term to computing the local contribution
to the received symbol vector. The number of operations
required for each downlink OFDM symbol is
$$N_{\mathrm{op,DL}} = KN_{\mathrm{SC}} + \frac{N_{\mathrm{FFT}}}{2}\log_2(N_{\mathrm{FFT}}), \tag{25}$$
where the first term corresponds to performing the precoding
for each subcarrier utilized, and the second term to performing
the OFDM modulation. The number of operations
performed for an uplink and a downlink OFDM symbol is
the same:
$$N_{\mathrm{op,OFDM}} = N_{\mathrm{op,UL}} = N_{\mathrm{op,DL}}. \tag{26}$$
The total number of operations per sample on average over
an entire frame is then
$$N_{\mathrm{OPS,avg}} = \frac{N_{\mathrm{op,weights}} + (N_{\mathrm{UL}} + N_{\mathrm{DL}})\,N_{\mathrm{op,OFDM}}}{T_{\mathrm{frame}} f_{\mathrm{sample}}}, \tag{27}$$
where $N_{\mathrm{UL}} = N_{\mathrm{UL},1} + N_{\mathrm{UL},2}$ is the total number of uplink
OFDM symbols, and $N_{\mathrm{DL}}$ is the number of downlink OFDM
symbols. Without considering data dependencies or critical
paths, this is the theoretical lower bound on the number of
operations per sample that the node must be able to perform.
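
The operation counts in (23)–(27) are easy to evaluate directly. The sketch below does so for one made-up parameter set; the function names are hypothetical, `n_fft` is assumed to be a power of two, and the numbers (2048-point FFT, 1200 subcarriers, 8 terminals, 2192 samples per OFDM symbol) are illustrative assumptions rather than values taken from the paper.

```python
from math import log2


def n_op_fft(n_fft):
    """Butterfly count of one radix-2 FFT/IFFT; n_fft is assumed to be a power of two."""
    return n_fft // 2 * int(log2(n_fft))


def n_op_weights(n_fft, k):
    """Operations to obtain the local precoding/decoding vector, as in (23)."""
    return n_op_fft(n_fft) + k + k * (k + 1) // 2 + k * k


def n_op_ofdm(n_fft, n_sc, k):
    """Operations per uplink or downlink OFDM symbol, as in (24)-(26)."""
    return n_op_fft(n_fft) + k * n_sc


def n_ops_avg(n_fft, n_sc, k, n_ul1, n_ul2, n_dl, t_ofdm, f_sample):
    """Frame-average operations per sample, as in (27), with T_frame from (1)."""
    t_frame = (n_ul1 + n_ul2 + n_dl + 3) * t_ofdm
    total = n_op_weights(n_fft, k) + (n_ul1 + n_ul2 + n_dl) * n_op_ofdm(n_fft, n_sc, k)
    return total / (t_frame * f_sample)


# Illustrative, assumed parameter values.
f_sample = 30.72e6
t_ofdm = 2192 / f_sample   # 2192 samples per OFDM symbol (assumed)
avg = n_ops_avg(2048, 1200, 8, n_ul1=3, n_ul2=3, n_dl=5, t_ofdm=t_ofdm, f_sample=f_sample)
print(n_op_weights(2048, 8), n_op_ofdm(2048, 1200, 8), avg)
```

For this parameter set the per-symbol cost is dominated by the FFT and the K N_SC multiplications, while the weight computation adds only a small one-time cost per frame.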
The other limitation is that the downlink symbols must be
processed before their respective deadlines. In practice there
will be $N_{\mathrm{DL}}$ critical paths in the schedule for one frame.
Figure 7 shows the computational tasks performed in each
node, the critical paths in the frame, the important times, and
the possibility to buffer or process the uplink OFDM symbols.
Each critical path runs from receiving the
pilot symbol, through estimating the channels, computing the local
contribution to the Gram matrix, performing the centralized
matrix inversion, and computing the local precoding/decoding
vector, to finally performing the precoding for each downlink
OFDM symbol. The number of operations on the critical path
for downlink symbol $i$ is
$$N_{\mathrm{op,CP},i} = N_{\mathrm{op,weights}} + i\,N_{\mathrm{op,OFDM}}, \quad i \in \{1, 2, \ldots, N_{\mathrm{DL}}\}. \tag{28}$$
The time available to perform the operations on the critical
path for downlink symbol $i$ is
$$T_{\mathrm{CP},i} = T_{\mathrm{OFDM}}\left(N_{\mathrm{UL},2} + i\right). \tag{29}$$
Between receiving the pilot symbol and transmitting downlink
symbol $i$, there are $(N_{\mathrm{UL},2} + i)$ OFDM symbols, including
the guard interval. However, no computations on the critical
paths can be performed while the local Gram
matrices are propagated to the CCU, the inverse is computed, and
the result is redistributed to the nodes, which in total takes
$T_{\mathrm{inv}} + 2N_{\mathrm{hops}}T_{\mathrm{link}}$. Hence, the worst-case average number of
computations per sample on the critical paths is
$$N_{\mathrm{OPS,critical}} = \max_i \left(\frac{N_{\mathrm{op,CP},i}}{\left(T_{\mathrm{CP},i} - T_{\mathrm{inv}} - 2N_{\mathrm{hops}}T_{\mathrm{link}}\right) f_{\mathrm{sample}}}\right). \tag{30}$$
Fig. 7. Critical paths and their respective deadlines, for the asymptotic case
($N_{\mathrm{OPS}} = N_{\mathrm{OPS,asymptotic}}$). Here, CE includes both channel estimation and
computing $\mathbf{B}_i$. Dotted segments of the critical path times indicate that no computations on the critical path can be performed during the period. Gray boxes
illustrate uplink OFDM symbols that must be stored before processing. [Figure rows: OFDM symbols, computations, buffered symbols. Marked quantities: $T_{\mathrm{CP},1}, \ldots, T_{\mathrm{CP},N_{\mathrm{DL}}}$, $T_{\mathrm{inv}}$, $T_{\mathrm{slack}}$, $N_{\mathrm{UL,PB}}$, $N_{\mathrm{UL,buffered}}$.]
Consequently, the computational requirements are determined by
$$N_{\mathrm{OPS}} = \max\left(N_{\mathrm{OPS,avg}}, N_{\mathrm{OPS,critical}}\right). \tag{31}$$
This means that the time to perform the matrix inversion and the
inter-node communication latency may affect the computational
requirements.
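
A minimal sketch of (28)–(31) follows. The function names and argument names are hypothetical, and the check values use normalized units (one OFDM symbol lasting one time unit, one sample per time unit) purely to keep the arithmetic easy to follow. Returning infinity when the available time is not positive is a choice made for this sketch; the text only states that no critical-path computation is possible during that period.

```python
def n_ops_critical(n_w, n_o, n_ul2, n_dl, t_ofdm, t_inv, n_hops, t_link, f_sample):
    """Worst-case operations per sample over all N_DL critical paths, as in (30)."""
    worst = 0.0
    for i in range(1, n_dl + 1):
        n_cp = n_w + i * n_o                                     # (28)
        t_avail = t_ofdm * (n_ul2 + i) - t_inv - 2 * n_hops * t_link
        if t_avail <= 0:
            return float("inf")   # deadline cannot be met (sketch convention)
        worst = max(worst, n_cp / (t_avail * f_sample))
    return worst


def n_ops_required(n_ops_avg, n_ops_crit):
    """Computational requirement per (31)."""
    return max(n_ops_avg, n_ops_crit)


# Normalized, made-up example: T_OFDM = 1, f_sample = 1, no inversion or link delay.
crit = n_ops_critical(n_w=11372, n_o=20864, n_ul2=3, n_dl=5,
                      t_ofdm=1.0, t_inv=0.0, n_hops=0, t_link=0.0, f_sample=1.0)
print(crit)
```

With zero inversion and link delay the last downlink symbol sets the requirement, which agrees with the later discussion of the case where the inversion time is small.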
If the system specifications are kept, but the number of
uplink and downlink OFDM symbols is increased, the average
number of operations per sample over an entire frame increases
as well. This is due to the two guard intervals becoming
less significant with an increasing number of OFDM symbols.
When the number of uplink and downlink OFDM symbols is
large, the number of operations per sample is
$$N_{\mathrm{OPS,asymptotic}} = \lim_{N_{\mathrm{UL}}, N_{\mathrm{DL}} \to \infty} \max\left(N_{\mathrm{OPS,avg}}, N_{\mathrm{OPS,critical}}\right) = \frac{N_{\mathrm{op,OFDM}}}{T_{\mathrm{OFDM}} f_{\mathrm{sample}}}, \tag{32}$$
meaning that one OFDM symbol must be processed in the
duration of one OFDM symbol.
As seen from (30) and (31), the matrix inversion time, $T_{\mathrm{inv}}$,
and the total inter-node communication latency, $2N_{\mathrm{hops}}T_{\mathrm{link}}$,
may affect the computational requirements. For fixed
inter-node communication latency¹ this behavior is displayed in
Fig. 8. There are two inversion times marked in Fig. 8. The
first time, $T_{\mathrm{inv},A}$, is the time when the critical path requires
equally many operations per sample as the frame average
($N_{\mathrm{OPS,critical}} = N_{\mathrm{OPS,avg}}$). The second time, $T_{\mathrm{inv},B}$, is when the
number of operations per sample on the critical path grows larger than
the number of operations per sample in the asymptotic case.

¹ Note that the same behavior occurs when varying the inter-node communication
latency, with fixed $T_{\mathrm{inv}}$, or the sum of both.

Fig. 8. Number of required PE operations per sample depending on the time
to perform the matrix inversion. [Plot: operations per sample versus $T_{\mathrm{inv}}$, with $T_{\mathrm{inv},A}$ and $T_{\mathrm{inv},B}$ marked. Curves: critical path, first DL symbol; critical path, last DL symbol; frame average; asymptotic frame average.]
Figure 9 shows how the number of operations per sample
for varying $T_{\mathrm{inv}}$ changes when the number of OFDM symbols
in a frame is changed. In Fig. 9(a) the number of OFDM
symbols is small. In this case there is a significant gap between
$N_{\mathrm{OPS,avg}}$ and $N_{\mathrm{OPS,asymptotic}}$ and between $T_{\mathrm{inv},A}$ and $T_{\mathrm{inv},B}$. When
the number of OFDM symbols increases, these gaps decrease,
as shown in Fig. 9(b). Additionally, it can be seen in Fig. 9(b)
that when the number of OFDM symbols is large, the time
$T_{\mathrm{inv},B}$ acts as a deadline for the matrix inversion. If the inverse
is received later than $T_{\mathrm{inv},B}$, the required number of operations
per sample grows rapidly.

Fig. 9. Number of operations per sample versus $T_{\mathrm{inv}}$ for (a) small and (b)
large number of OFDM symbols. [Two plots of operations per sample versus $T_{\mathrm{inv}}$. Curves: critical path, first DL symbol; critical path, last DL symbol; frame average; asymptotic frame average.]
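It can be seen in Fig. 9 that the critical path for the last
downlink symbol is the first to cross the frame average line.
Combining (1), (27), (29), and (30) leads to
$$\begin{aligned} T_{\mathrm{inv},A} &= T_{\mathrm{CP},N_{\mathrm{DL}}} - \frac{N_{\mathrm{op,CP},N_{\mathrm{DL}}}}{N_{\mathrm{op,total}}}\,T_{\mathrm{frame}} - 2N_{\mathrm{hops}}T_{\mathrm{link}} \\ &= T_{\mathrm{OFDM}}\left(N_{\mathrm{UL},2} + N_{\mathrm{DL}} - \left(N_{\mathrm{UL}} + N_{\mathrm{DL}} + 3\right)\frac{N_{\mathrm{op,weights}} + N_{\mathrm{DL}}N_{\mathrm{op,OFDM}}}{N_{\mathrm{op,weights}} + \left(N_{\mathrm{UL}} + N_{\mathrm{DL}}\right)N_{\mathrm{op,OFDM}}}\right) - 2N_{\mathrm{hops}}T_{\mathrm{link}}, \end{aligned} \tag{33}$$
where $N_{\mathrm{op,total}} = N_{\mathrm{op,weights}} + (N_{\mathrm{UL}} + N_{\mathrm{DL}})N_{\mathrm{op,OFDM}}$ is the numerator of (27).
The point $T_{\mathrm{inv},B}$ is given by the equation
$$N_{\mathrm{OPS,asymptotic}} = \frac{N_{\mathrm{op,CP},i}}{\left(T_{\mathrm{CP},i} - T_{\mathrm{inv},B} - 2N_{\mathrm{hops}}T_{\mathrm{link}}\right) f_{\mathrm{sample}}}, \tag{34}$$
for downlink symbol $i$. Using (28), (29), and (32), $T_{\mathrm{inv},B}$ can
be expressed as
$$T_{\mathrm{inv},B} = T_{\mathrm{OFDM}}\left(N_{\mathrm{UL},2} - \frac{N_{\mathrm{op,weights}}}{N_{\mathrm{op,OFDM}}}\right) - 2N_{\mathrm{hops}}T_{\mathrm{link}}. \tag{35}$$
It can be seen that $T_{\mathrm{inv},B}$ is identical for all downlink symbols.
The number of operations per sample is the same for all critical
paths at the point $T_{\mathrm{inv},B}$. If $T_{\mathrm{inv}} < T_{\mathrm{inv},B}$, the
critical path to the last downlink symbol will always require
the highest number of operations per sample of all critical
paths. Similarly, if $T_{\mathrm{inv}} > T_{\mathrm{inv},B}$, the critical path to the first
downlink symbol requires the highest number of operations
per sample.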
To keep up with the computational requirements, the num-
ber of operations per sample, NOPS, the number of processing
elements, NPE, and the clock frequency, fclk, must satisfy
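$$
N_{\mathrm{PE}} \frac{f_{\mathrm{clk}}}{f_{\mathrm{sample}}} \geq N_{\mathrm{OPS}} = \max \left( N_{\mathrm{OPS,avg}}, N_{\mathrm{OPS,critical}} \right). \tag{36}
$$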
Selecting fclk as an integer multiple of fsample, the number
of operations per sample that can be performed with NPE
processing elements is
$$
\hat{N}_{\mathrm{OPS}} = N_{\mathrm{PE}} \frac{f_{\mathrm{clk}}}{f_{\mathrm{sample}}}. \tag{37}
$$
In most cases NOPS will not be an integer. However, NˆOPS
will, and, hence, there is a slack time that can be used to
increase the number of terminals, K, the number of antennas,
M , and/or the matrix inversion time, Tinv. If Tinv < Tinv,A,
the slack time can be used to process some uplink symbols,
say NUL,PB, before the downlink symbols, as discussed be-
low. Alternatively, the pilot symbol can be moved closer to
the downlink symbols, i.e., decrease NUL,2, as discussed in
Section II-A.
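
The dimensioning rule in (36) and (37) can be illustrated with a minimal sketch. The function name and its arguments are hypothetical and not part of the proposed architecture. The example inputs are the values reported for the LTE-like system in Section VII, and the clock is assumed to be the smallest integer multiple of fsample that satisfies (36).

```python
import math


def select_pe_config(n_ops_avg, n_ops_critical, f_sample_hz, n_pe=1):
    """Pick the smallest integer clock multiple that satisfies (36).

    Returns (n_ops_hat, f_clk_hz, slack), where slack is the unused
    number of operations per sample, n_ops_hat - N_OPS.
    """
    n_ops = max(n_ops_avg, n_ops_critical)   # right-hand side of (36)
    multiple = math.ceil(n_ops / n_pe)       # f_clk / f_sample, an integer by choice
    n_ops_hat = n_pe * multiple              # (37)
    f_clk_hz = multiple * f_sample_hz
    return n_ops_hat, f_clk_hz, n_ops_hat - n_ops


# Example inputs: N_OPS,avg = 9.95, N_OPS,critical = 11.28, f_sample = 30.72 MHz.
n_ops_hat, f_clk_hz, slack = select_pe_config(9.95, 11.28, 30.72e6)
print(n_ops_hat, f_clk_hz / 1e6, round(slack, 2))
```

The returned slack is the margin that could be spent on a larger K, a larger M, or a longer Tinv, as described above.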
While this section focuses on ZF, the same analysis can
be made for CB processing. This yields similar results, but
with one significant difference. The precoding and decoding
vector is obtained directly from the channel estimation, which
means that the computational tasks Bi, the central matrix
inversion, and Wi/Ai are not performed. This results in
$$
N_{\mathrm{op,CP},i} = \frac{N_{\mathrm{FFT}}}{2} \log_2 \left( N_{\mathrm{FFT}} \right) + K + i N_{\mathrm{op,OFDM}}, \tag{38}
$$
$$
N_{\mathrm{OPS,critical}} = \max_i \left( \frac{N_{\mathrm{op,CP},i}}{T_{\mathrm{CP},i} f_{\mathrm{sample}}} \right), \tag{39}
$$
and
$$
N_{\mathrm{OPS,avg}} = \frac{\frac{N_{\mathrm{FFT}}}{2} \log_2 \left( N_{\mathrm{FFT}} \right) + K + (N_{\mathrm{UL}} + N_{\mathrm{DL}}) N_{\mathrm{op,OFDM}}}{T_{\mathrm{frame}} f_{\mathrm{sample}}} \tag{40}
$$
for CB. Hence, the number of operations to perform locally
does not decrease significantly, but the latency issues of
performing centralized computations vanish.
B. Memory Complexity
Dimensioning the memories in the node will in part depend
on the frame structure that is chosen, and in part on the
scheduling of computations and inter-node communication. In
Fig. 7 the gray boxes illustrate uplink OFDM symbols that
must be stored locally in the node before they are processed.
The number of symbols that must be stored is
NUL,buffered = NUL −NUL,PB (41)
and the number of bits required to store these symbols is
Nbits,buffered = NUL,bufferedNFFTWADC. (42)
For an uplink OFDM symbol, the number of variables
during its lifetime in the node is shown in Fig. 10. Between times
T0 and T1 the OFDM symbol is sampled from the antenna and
stored in memory in the node. During this period, the number
of variables grows to NFFT. The duration between T0 and T1
is slightly shorter than one OFDM symbol, due to the cyclic
prefix not being stored. At time T2 the OFDM demodulation
starts and is finished at time T3. The FFT computation can be
made in-place, meaning that no additional memory is strictly
required. However, towards the end of the FFT computation,
some variables can be discarded due to only NSC subcarriers
being utilized. When the decoding starts at time T4, there
will be a data expansion by a factor K, since each subcarrier
is multiplied with the decoding vector. When the decoding
is finished, there are KNSC variables. This is the number
of variables that will exist during the lifetime of one uplink
symbol, but not all of them must be stored.

Fig. 10. Number of existing variables over the lifetime of an uplink OFDM
symbol in the node. (Vertical axis: number of memory variables, MV, with
levels NFFT, NSC, and KNSC; horizontal axis: time, with T0 to T5 marked;
phases: UL symbol, FFT, decoding.)

C. Communication Complexity
One of the advantages of distributing the computations
among multiple nodes is that the number of values that need
to be sent to the centralized structure in the system grows
proportionately to the number of terminals, K, rather than
the number of antennas in the system, M. In massive MIMO
systems where M ≫ K this is clearly advantageous.
The number of bits that need to be sent upwards in the
tree structure during one frame is
$$
N_{\mathrm{bits,up}} = \left( \frac{K(K+1)}{2} + (N_{\mathrm{UL},1} + N_{\mathrm{UL},2}) N_{\mathrm{SC}} \right) W_{\mathrm{comp}}, \tag{43}
$$
which corresponds to the local contributions to the Gram
matrix, Bi, and the symbol vector estimates, ỹi. These values
are all used for computations, and thus the longer word length
Wcomp is required. The required upwards link data rate is
$$
R_{\mathrm{up}} = \frac{N_{\mathrm{bits,up}}}{T_{\mathrm{frame}}}. \tag{44}
$$
The downwards propagation differs in that the word length
of the modulated symbols is much shorter. Downwards, only
the raw symbols are propagated to all nodes, using the
shorter word length Wsymbol. However, the inverted matrix still
needs to be represented with Wcomp. The number of bits sent
downwards is
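$$
N_{\mathrm{bits,down}} = \frac{K(K+1)}{2} W_{\mathrm{comp}} + N_{\mathrm{DL}} N_{\mathrm{SC}} W_{\mathrm{symbol}}. \tag{45}
$$
The required downwards link data rate is then
$$
R_{\mathrm{down}} = \frac{N_{\mathrm{bits,down}}}{T_{\mathrm{frame}}}. \tag{46}
$$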
However, this is only the minimum required data rate. If the
data is not sent between the nodes at the same rate as it is
consumed, buffers (which may incur a significant increase in
die area) are needed, as discussed in Section IV-D.
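
As an illustration, (43) to (46) can be evaluated directly. The sketch below is only an example: the function name is hypothetical, the inputs are the Table IV values of the LTE-like system, and a word length such as "12 + 12 bits" is assumed to mean 24 bits per complex value.

```python
def min_link_rates(k, n_ul1, n_ul2, n_dl, n_sc, w_comp, w_symbol, t_frame):
    """Evaluate (43)-(46): bits per frame and minimum link data rates."""
    gram_values = k * (k + 1) // 2                               # K(K+1)/2 term
    bits_up = (gram_values + (n_ul1 + n_ul2) * n_sc) * w_comp    # (43)
    bits_down = gram_values * w_comp + n_dl * n_sc * w_symbol    # (45)
    rate_up = bits_up / t_frame                                  # (44), bit/s
    rate_down = bits_down / t_frame                              # (46), bit/s
    return bits_up, bits_down, rate_up, rate_down


# Table IV values; W_comp = 24 and W_symbol = 4 bits per complex value (assumed).
bits_up, bits_down, rate_up, rate_down = min_link_rates(
    k=20, n_ul1=0, n_ul2=2, n_dl=2, n_sc=1200,
    w_comp=24, w_symbol=4, t_frame=0.5e-3,
)
print(bits_up, bits_down, rate_up / 1e6, rate_down / 1e6)
```

These are frame-averaged minimum rates only. A link that merely meets them would require the buffering mentioned above.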
The reduced number of values sent from the antennas to
the central unit is often used as an argument for performing
distributed processing. While this is indeed the case, it must
also be noted that the word lengths of the data are different.
For a centralized architecture, the word length depends on
the ADC, so the number of bits is proportional to MWADC.
For a distributed architecture, ỹ and B are transmitted, so the
number of bits is proportional to KWcomp. Since these are
sums of products, where one factor in each product is the sample
value, one may expect that in general Wcomp > WADC. However,
as M ≫ K, the total number of bits transmitted to the
central unit should still be significantly smaller. Furthermore,
it is important that the intermediate values are properly scaled
as more and more terms are added along the path to the central
unit.

Fig. 11. Number of stored memory variables during the lifetime of one
uplink OFDM symbol. (Vertical axis: number of memory variables, MV, with
levels NFFT, NSC, and KNSC; horizontal axis: time, with T3 to T6 marked;
phases: FFT, decoding, send to parent.)

D. Balancing Computations, Communication, and Memories
To obtain an optimized architecture the different types of
resources must be balanced. Here, the processing, communi-
cation and memory capabilities are included.
Considering the inter-node communication for one uplink
OFDM symbol, the number of stored variables in each node
can be seen in Fig. 11. From sampling the radio until the FFT
is finished, the number of stored variables is the same as the
number of existing variables in Fig. 10. The output data from
the decoding process has no further data dependencies in the
current node. These variables need to be sent to the parent
node, so it can perform its own decoding process.
In Fig. 11, the time T4 to T5 is again the time taken to
perform the decoding. The time T4 to T6 is the time taken
to send the local contributions to the decoded signal vectors
to the parent node. It can be noted that T6 ≥ T5. When the
decoding starts, the number of variables that need to be stored
locally increases due to the data expansion of the decoding,
but at the same time decreases as variables are sent to
the parent node and thus no longer need to be stored.
There are two extreme cases of this behavior. The first is if
T6 tends to infinity. In this case, all variables must be stored
locally, since none of them are sent over the link. Clearly,
this solution is not feasible. The other extreme is if T6 = T5.
This means that all output variables are sent directly to the
parent node and do not need to be stored locally.
As described earlier, the decoding on the parent node cannot
be performed until the decoding output has been sent over
the link. The implication of this is that the system should
be designed such that the rate of processing and the rate of
sending variables between nodes are the same.
The requirements on the downwards communication are, however,
not as strict. The data that is propagated from the
CCU to the nodes in the tree is not processed on the way
downwards, but rather just forwarded to the next node. This
has the implication that it is not required to send the data at the
same rate as it is processed. Doing so does, however, avoid the need
for large buffers in each node, which makes it desirable. Feeding the
nodes with data is a rather straightforward trade-off between
link data rate and buffer size.

Fig. 12. Schedule of the computational tasks in the asymptotic case. The
white blocks correspond to one frame.

V. SCHEDULING
The computational tasks and data dependencies when using
ZF processing can be seen in Fig. 7. This schedule is not
correctly scaled, but rather made to illustrate the data depen-
dencies and the need for a better realization. It is, e.g., clear
that there is time at the end of the frame where no operations
are currently performed. Therefore, it makes sense to move
parts of the computations there to obtain a better utilization of
the processing elements. This will come at a cost of memory
as the data must be stored rather than processed directly.
Here, a node with only one processing element is con-
sidered. The processing element is assumed to support the
required number of operations per sample for the asymptotic
case, i.e., NˆOPS ≥ NOPS,asymptotic. The schedule is created as in
Fig. 12. Initially, the node will wait for the pilot OFDM symbol.
The computations for determining the precoding/decoding
vector are then started. This includes an FFT, performing the
channel estimation and computing the Bi matrix. When the
inverted matrix is received, the precoding/decoding vector is
computed. After this stage, the uplink and downlink symbols
can be processed. In order to reduce the number of uplink
symbols that need to be buffered before processing, NUL,PB
uplink symbols are processed first. All downlink symbols are then
processed in order to meet their deadlines. When the downlink
symbols are finished, the node processes NUL,1 + 3 uplink
symbols. Two uplink symbols can be processed
while the last downlink symbol is transmitted and during
the guard interval. Another uplink symbol can be processed
while the pilot of the next frame is sampled. The remaining
NUL−NUL,PB−NUL,1−3 uplink symbols are processed while
the node waits for the inverted matrix for the next frame.
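
The task order described above can be sketched as a small table generator. This is only an illustration of the ordering in the text: the function name, the task labels, and the example frame parameters are hypothetical, and the check for a negative number of remaining uplink symbols is an added assumption marking frames that are too short for this schedule.

```python
def asymptotic_schedule(n_ul, n_ul1, n_dl, n_ul_pb):
    """Return the per-frame task order as (kind, task, symbol_count) tuples."""
    remaining_ul = n_ul - n_ul_pb - n_ul1 - 3
    if remaining_ul < 0:
        # Assumption: such frames need the non-asymptotic schedule instead.
        raise ValueError("too few uplink symbols for the asymptotic schedule")
    return [
        ("pilot", "FFT, channel estimation, B_i", 1),
        ("weights", "W_i/A_i once the inverted matrix has arrived", 1),
        ("UL", "uplink symbols processed before the downlink", n_ul_pb),
        ("DL", "downlink symbols", n_dl),
        ("UL", "uplink symbols processed after the downlink", n_ul1 + 3),
        ("UL", "remaining uplink symbols, while awaiting the next inverse",
         remaining_ul),
    ]


# Hypothetical long frame: 20 uplink symbols (5 before the pilot), 10 downlink.
schedule = asymptotic_schedule(n_ul=20, n_ul1=5, n_dl=10, n_ul_pb=2)
uplink_total = sum(count for kind, _, count in schedule if kind == "UL")
for kind, task, count in schedule:
    print(f"{count:2d} x [{kind}] {task}")
```

Summing the three uplink entries recovers NUL, which is a quick consistency check on the symbol counts in the schedule.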
As can be seen, the processing is fully deterministic for
the asymptotic case, and, hence, a simple control unit can
be implemented, where the different system parameters can
be configured. For the non-asymptotic case, the same gen-
eral structure is implemented. However, as the processing is
possibly distributed differently within the frame, a slightly
more flexible control unit is required. Alternatively, the control
signals can be stored in a RAM acting as an instruction
memory.
A. System Level Scheduling
For the computational tasks ỹi and Bi in Table II there
are inter-node data dependencies, as described in Section III.
Before the local PE operation can be performed, the corresponding
contributions from the child nodes must be sent over
the inter-node link. The latency of sending a value over the link
is Tlink. For each level in the tree, the ỹi and Bi computations
must be skewed by this amount in order for the parent node
to receive the data before processing.
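
A minimal sketch of this skew, assuming that level 0 denotes the leaf nodes and that every level adds exactly one link latency. The function name and the level numbering are illustrative and not taken from the architecture description.

```python
def level_start_offsets_us(n_levels, t_link_us):
    """Start offset, in microseconds, of the y_i / B_i accumulation per level.

    Level 0 is the leaf level (an assumption); each level towards the root
    starts one link latency later than the level below it.
    """
    return [level * t_link_us for level in range(n_levels)]


# Example: four tree levels with Tlink = 0.5 us (the Table IV value).
offsets = level_start_offsets_us(4, 0.5)
print(offsets)  # [0.0, 0.5, 1.0, 1.5]
```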
VI. ARCHITECTURE
In this section, an architecture for the node is proposed.
The main components in the architecture are the off-chip I/O,
processing core, memory system and the RF-chain.
As seen later on in Section VIII many system scenarios
can be covered with a single processing element in each
node. Hence, we focus on that here. Further inspection of the
arithmetic operations in Fig. 5 reveals that each input port of
the processing element is connected to only a few specific types
of data. This means that not every type of data needs to be fed
into every port of the PE. For instance, only the twiddle factors,
channel estimates, or the precoding/decoding vector are connected to
one input of the multiplier. Taking this into account leads to
the proposed node architecture shown in Fig. 13. The node
architecture uses a processing element as shown in Fig. 6.
The twiddle factor memory can be implemented as a ROM,
since the twiddle factors are static. The channel estimates and
precoding/decoding vector memories are single port memories,
that can either be written or read in one cycle. Although
during precoding and decoding, only the precoding/decoding
vector is required, the channel estimates must be stored until
all precoding/decoding values are computed, and, hence, both
must be stored. For simplicity, we choose to have a separate
memory allocation for the channel estimates, instead of using,
e.g., the sample memory.

Fig. 13. Proposed node architecture (not all connections shown). (Blocks:
radio, sample memory, twiddle factors, channel estimates, precoding/decoding
vector, and processing element; links: left in, right in, parent in, parent
out.)

Fig. 14. Structure of the sample memory in Fig. 13. (Radio input buffer, FFT
processing buffer, and radio output buffer, with word lengths WADC, Wcomp,
and WDAC.)

The sample memory is more complex
and is divided into three separate memories as shown in
Fig. 14. The first memory is the radio input buffer, which stores
raw data from the AD converter. The size of this memory is
Meminput = NUL,bufferedNFFTWADC bits. (47)
The FFT processing buffer is used when performing the
FFT/IFFT computations and its size is
Memprocessing = NFFTWcomp bits. (48)
The last memory is the radio output buffer which holds the
finished downlink OFDM symbols that are to be sent to the
DA converter. The cyclic prefix of the OFDM symbol is also
fetched from the memory. Its size is
Memoutput = NFFTWDAC bits. (49)
In Fig. 14, the memories are shown as two-port memories
being able to read and write simultaneously. In many cases,
it may be beneficial to use two single-port memories of
half the size instead. For the input and output buffers, it
is straightforward to use memories alternating reading and
writing. For the FFT processing buffer, it is also possible using
e.g. the approach in [22].
All memory sizes and word lengths for the architecture in
Fig. 13 are summarized in Table III. The exact required word
lengths should be based on system level simulations, which is
left for future work.

TABLE III
MEMORY SIZES FOR A NODE.

Memory                      #Words               Word length
Input buffer                NUL,buffered NFFT    WADC
FFT processing buffer       NFFT                 Wcomp
Output buffer               NFFT                 WDAC
Channel estimates           K                    Wcomp
Precoding/decoding vector   K                    Wcomp
Twiddle factors (ROM)       NFFT/2               WTF

TABLE IV
SPECIFICATIONS OF THE LTE-LIKE SYSTEM IN THE EXAMPLE.

Name     Value      Name      Value
K        20         Tframe    0.5 ms
NFFT     2048       Tinv      40 µs
NSC      1200       Wcomp     12 + 12 bits
NUL,1    0          Wsymbol   2 + 2 bits
NUL,2    2          WADC      6 + 6 bits
NDL      2          WDAC      6 + 6 bits
Tlink    0.5 µs     fsample   30.72 MHz

VII. EXAMPLE: LTE-LIKE SYSTEM SPECIFICATIONS
Here, the requirements for an LTE-like system using ZF
processing are considered. This is the typical specification
considered in most earlier work. The system specifications can
be seen in Table IV. For this specification, the throughput is
$$
\frac{N_{\mathrm{SC}} N_{\mathrm{DL/UL}} K W_{\mathrm{symbol}}}{T_{\mathrm{frame}}} = 384~\text{Mb/s} \tag{50}
$$
in each direction. The centralized matrix inversion can be
performed either using an exact [23] or an approximate [12],
[24], [25] algorithm. As shown in [26], the complexity is
similar for the best exact algorithm and a Neumann series
approximation with three terms. In both cases a 20 × 20
matrix inversion can be performed in less than 40 µs using
one processing element running at 200 MHz.
Based on these specifications and (27) and (30), we get
NOPS,avg = 9.95 (51)
and
NOPS,critical = max (9.23, 11.28) = 11.28. (52)
In this case, Tinv,A = 8.27 µs and Tinv,B = 110 µs. Therefore,
based on (31) and (36), selecting one PE running at 12fsample =
368.64 MHz is sufficient, leading to NˆOPS = 12. In this case,
the second downlink OFDM symbol imposes the limit on the
number of computational resources. Hence, NUL,PB = 0 and
NUL,buffered = 2.
A schedule for the computations in the LTE-like system is
derived and can be seen in Fig. 15. Deriving the schedule is
rather straightforward since all tasks are performed sequen-
tially. The slack time between determining the local precod-
ing/decoding vector, Wi/Ai, and the start of the precoding,
xi, can be utilized to modify the specification, as discussed
below.
Fig. 15. Schedule for the LTE-like system with ZF processing. (Horizontal
axis: clock cycle, ×10^4, from 0 to 18, spanning the pilot, UL1, UL2, guard,
DL1, DL2, and guard intervals; operations shown: FFT, CE, Bi, INV, W/A, Xi,
IFFT, Yi.)

The size of the radio input buffer is
Meminput = NUL,bufferedNFFTWADC = 48 kb, (53)
the size of the FFT processing buffer is
Memprocessing = NFFTWcomp = 48 kb, (54)
and the size of the radio output buffer is
Memoutput = NFFTWDAC = 24 kb. (55)
Hence, a total of 120 kb of memory is required in each node.
In addition, 960 more bits are needed for the channel estimates
and precoding/decoding vector.
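
The memory figures above follow from (47) to (49) and Table III, and can be reproduced with a short sketch. The function and key names are hypothetical. The sketch assumes that "kb" denotes 1024 bits and that the word lengths in Table IV are per complex value (e.g., 6 + 6 = 12 bits for WADC).

```python
def node_memory_bits(n_ul_buffered, n_fft, k, w_adc, w_comp, w_dac):
    """Memory sizes in bits for one node, following Table III."""
    return {
        "input_buffer": n_ul_buffered * n_fft * w_adc,   # (47)
        "fft_processing_buffer": n_fft * w_comp,         # (48)
        "output_buffer": n_fft * w_dac,                  # (49)
        "channel_estimates": k * w_comp,
        "precoding_decoding_vector": k * w_comp,
    }


# LTE-like example: N_UL,buffered = 2, N_FFT = 2048, K = 20,
# W_ADC = W_DAC = 12 bits and W_comp = 24 bits per complex value (assumed).
mem = node_memory_bits(2, 2048, 20, w_adc=12, w_comp=24, w_dac=12)
sample_memory_kb = sum(
    mem[name]
    for name in ("input_buffer", "fft_processing_buffer", "output_buffer")
) / 1024  # assumes 1 kb = 1024 bits
vector_bits = mem["channel_estimates"] + mem["precoding_decoding_vector"]
print(sample_memory_kb, vector_bits)  # 120.0 960
```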
As was shown in Section IV, the computations and
communication need to be performed at the same rate. With the
selected number of operations performed per sample, NˆOPS,
the data rates are
Rup = NˆOPSfsampleWcomp ≈ 8.847 Gbps (56)
and
Rdown = NˆOPSfsampleWsymbol ≈ 1.475 Gbps. (57)
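
The two rates in (56) and (57) can be checked with a few lines. The function name is hypothetical, and the word lengths are again assumed to be 24 and 4 bits per complex value for Wcomp and Wsymbol, respectively.

```python
def matched_link_rates(n_ops_hat, f_sample_hz, w_comp, w_symbol):
    """Link rates, in bit/s, when one value is sent per PE operation."""
    values_per_second = n_ops_hat * f_sample_hz
    rate_up = values_per_second * w_comp        # (56)
    rate_down = values_per_second * w_symbol    # (57)
    return rate_up, rate_down


r_up, r_down = matched_link_rates(12, 30.72e6, w_comp=24, w_symbol=4)
print(round(r_up / 1e9, 3), round(r_down / 1e9, 3))  # 8.847 1.475
```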
The available slack time can be used to modify the specifi-
cations of the system. By tweaking the parameters and redoing
the calculations, we can investigate which configurations are
supported with NˆOPS = 12. For example, the matrix inversion
time can be increased up to 54.2 µs, with exactly the same
node architecture. Alternatively, the number of users can be
increased to K = 21, assuming that the matrix inversion time
increases cubically with K. For K = 30 and the same assumption,
NˆOPS can be selected as 28. In this case, either one PE
running at 860.16 MHz or two PEs running at 430.08 MHz
can be used (see footnote 2). Alternatively, for the example,
Nhops can be increased up to 22, leading to a maximum of
M = 2^22 − 1 antennas, assuming a binary tree. Even though this
can be further increased by increasing NˆOPS, this should not pose a
limitation in most cases.
If we want to process one uplink symbol before the first
downlink symbol, i.e., NUL,PB = 1, we must select NˆOPS = 17.
Moving the pilot symbol one symbol closer to the downlink
symbols, i.e., NUL,1 = 1 and NUL,2 = 1, again requires NˆOPS = 17,
although this equality does not hold in general. Halving the
matrix inversion time leads to NˆOPS = 15 in both cases. This
illustrates that when the critical paths are limiting, increasing
the computational capabilities in the CCU, i.e., decreasing
the matrix inversion time, leads to reduced computational
requirements in the nodes.
Naturally, any valid combination of these modifications can
be realized.

Footnote 2: Naturally, any combination of NPE and fclk is valid as long as
(36) holds. However, if multiple PEs are used, the memory architecture may
need to be modified.

Fig. 16. Bound on bandwidth and terminals using a single processing element
at a given fclk for the LTE-like case. Configurations with (a) large and (b)
small number of OFDM symbols. (Horizontal axis: bandwidth, MHz; vertical
axis: K; curves for fclk = 1 GHz, 500 MHz, and 250 MHz.)

VIII. DESIGN SPACE EXPLORATION
The clock frequency required in the LTE-like example is
not a problem to achieve in a modern process technology
through, e.g., pipelining, which is straightforward since the
execution is deterministic. Hence, it is possible to change
the bandwidth and/or the number of terminals. Here, we
consider three different clock frequencies up to 1 GHz for
a system otherwise as in the LTE-like case. Figure 16 shows
the bounds on bandwidth and number of terminals for a given
clock frequency. In Fig. 16(a) the asymptotic case is shown,
NOPS = NOPS,asymptotic, meaning that the number of OFDM
symbols in each frame is large. In Fig. 16(b) the frame format
in the LTE-case is used. In both cases the average number
of computations over an entire frame is used. Thus, it is
assumed that the matrix inversion is performed fast enough to
not influence the required number of operations per second,
i.e., Tinv ≤ Tinv,A.
It is noted from Fig. 16 that increasing the channel bandwidth
by a factor of two roughly requires that the number of
simultaneous terminals is reduced by the same factor.
Here, the length of the FFT, NFFT, and the number of
subcarriers utilized, NSC, are scaled linearly with the bandwidth
of the channel. This is usually not the case, since the FFT
length is favorably selected as a power of two. The plots still
give a good estimate of the available design space.
IX. CONCLUSIONS
In this work, a scalable system architecture using distributed
processing was proposed for the base station in a massive
MIMO system. It was shown that the computations associated
with each antenna can be distributed and in most of the earlier
studied use cases only a simple single processing element
running at a few hundred MHz and a modest amount of
memory are required. It was further shown that it is feasible
to have a simple synchronous control of the nodes and that
the inter-node communication can be handled by one or a few
high-speed serial links. All computations required by adding
an antenna are handled by the introduced additional node.
The case of connecting the nodes as a binary tree was
primarily studied, although the architecture is readily extended
to a K-ary tree. It is here worth noting that an array architec-
ture with static scheduling will behave as a binary or ternary
tree, and, hence, the same concept can be used for an array
interconnect with additional simple routing logic surrounding
the processing node. As the processing core is so small, it is
also of interest to possibly have more than one node in a chip,
reducing the amount of inter-chip communication channels.
The exact granularity is left for future work.
The architecture supports conjugate beamforming, zero
forcing, and MMSE processing. In the latter two cases, a
matrix inversion is performed in a central control unit, but all
other computations are distributed. The impacts of the matrix
inversion latency and of the pilot position on the computational
requirements in the node were studied and related.
REFERENCES
[1] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski,
“Five disruptive technology directions for 5G,” IEEE Commun. Mag.,
vol. 52, no. 2, pp. 74–80, Feb. 2014.
[2] P. K. Agyapong, M. Iwamura, D. Staehle, W. Kiess, and A. Benjebbour,
“Design considerations for a 5G network architecture,” IEEE Commun.
Mag., vol. 52, no. 11, pp. 65–75, Nov. 2014.
[3] J. Hoydis, S. ten Brink, and M. Debbah, “Massive MIMO in the UL/DL
of cellular networks: How many antennas do we need?” IEEE J. Sel.
Areas Commun., vol. 31, no. 2, pp. 160–171, Feb. 2013.
[4] L. Lu, G. Y. Li, A. L. Swindlehurst, A. Ashikhmin, and R. Zhang,
“An overview of massive MIMO: Benefits and challenges,” IEEE J. Sel.
Topics Signal Process., vol. 8, no. 5, pp. 742–758, Oct. 2014.
[5] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, “Massive
MIMO for next generation wireless systems,” IEEE Commun. Mag.,
vol. 52, no. 2, pp. 186–195, Feb. 2014.
[6] E. Björnson, E. G. Larsson, and T. L. Marzetta, “Massive MIMO: ten
myths and one critical question,” IEEE Commun. Mag., vol. 54, no. 2,
pp. 114–123, Feb. 2016.
[7] T. L. Marzetta, E. G. Larsson, H. Yang, and H. Q. Ngo, Fundamentals
of Massive MIMO. Cambridge University Press, 2016.
[8] C. Shepard, H. Yu, N. Anand, E. Li, T. Marzetta, R. Yang, and L. Zhong,
“Argos: practical many-antenna base stations,” in Proc. Int. Conf. Mobile
Comput. Networking. ACM, 2012, pp. 53–64.
[9] J. Vieira, S. Malkowsky, K. Nieman, Z. Miers, N. Kundargi, L. Liu,
I. Wong, V. Öwall, O. Edfors, and F. Tufvesson, “A flexible 100-antenna
testbed for massive MIMO,” in Proc. IEEE Globecom Workshops, Dec.
2014, pp. 287–293.
[10] X. Yang, W.-J. Lu, N. Wang, K. Nieman, S. Jin, H. Zhu, X. Mu, I. Wong,
Y. Huang, and X. You, “Design and implementation of a TDD-based
128-antenna massive MIMO prototyping system,” 2016.
[11] F.-L. Luo and C. Zhang, Signal Processing for 5G: Algorithms and
Implementations. Wiley-IEEE Press, 2016, ch. Massive MIMO for
5G: Theory, Implementation and Prototyping, pp. 616–.
[12] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer,
“Large-scale MIMO detection for 3GPP LTE: Algorithms and FPGA
implementations,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp.
916–929, Oct. 2014.
[13] H. Prabhu, J. Rodrigues, L. Liu, and O. Edfors, “Algorithm and hardware
aspects of pre-coding in massive MIMO systems,” in Proc. Asilomar
Conf. Systems Computers, Nov. 2015, pp. 1144–1148.
[14] A. Puglielli, A. Townley, G. LaCaille, V. Milovanović, P. Lu,
K. Trotskovsky, A. Whitcombe, N. Narevsky, G. Wright, T. Courtade, E. Alon,
B. Nikolić, and A. M. Niknejad, “Design of energy- and cost-efficient
massive MIMO arrays,” Proc. IEEE, vol. 104, no. 3, pp. 586–606, Mar.
2016.
[15] H. Prabhu, J. N. Rodriguez, L. Liu, and O. Edfors, “A 60pJ/b 300Mb/s
128×8 massive MIMO precoder-decoder in 28nm FD-SOI,” in Proc.
IEEE Solid-State Circuit Conf., 2017.
[16] K. Li, R. Sharan, Y. Chen, J. Cavallaro, and C. Studer, “Decentralized
data detection for massive MU-MIMO on a Xeon Phi cluster,” in Proc.
Asilomar Conf. Systems Computers, 2016.
[17] K. Li, R. Sharan, Y. Chen, J. R. Cavallaro, T. Goldstein, and C. Studer,
“Decentralized beamforming for massive MU-MIMO on a GPU cluster,”
in Proc. IEEE Global Conf. Signal Inform. Process. IEEE, 2016, pp.
590–594.
[18] H. Yang and T. L. Marzetta, “Performance of conjugate and zero-
forcing beamforming in large-scale antenna systems,” IEEE J. Sel. Areas
Commun., vol. 31, no. 2, pp. 172–179, Feb. 2013.
[19] C. B. Peel, B. M. Hochwald, and A. L. Swindlehurst, “A
vector-perturbation technique for near-capacity multiantenna multiuser
communication-part I: channel inversion and regularization,” IEEE
Trans. Commun., vol. 53, no. 1, pp. 195–202, Jan. 2005.
[20] E. Bertilsson, O. Gustafsson, and E. G. Larsson, “A scalable architecture
for massive MIMO base stations using distributed processing,” in Proc.
Asilomar Conf. Systems Computers, 2016.
[21] L. Wanhammar, DSP Integrated Circuits. Academic Press, 1999.
[22] Y. Ma and L. Wanhammar, “A hardware efficient control of memory
addressing for high-performance FFT processors,” IEEE Trans. Signal
Process., vol. 48, no. 3, pp. 917–921, Mar. 2000.
[23] C. Ingemarsson and O. Gustafsson, “Hardware architecture for posi-
tive definite matrix inversion based on LDL decomposition and back-
substitution,” in Proc. Asilomar Conf. Systems Computers, 2016.
[24] X. Liang, C. Zhang, S. Xu, and X. You, “Coefficient adjustment matrix
inversion approach and architecture for massive MIMO systems,” in
Proc. IEEE Int. Conf. ASIC, Nov. 2015.
[25] S. M. Abbas and C.-Y. Tsui, “Low-latency approximate matrix inversion
for high-throughput linear pre-coders in massive MIMO,” in Proc.
IFIP/IEEE Int. Conf. Very Large Scale Integration, Sep. 2016.
[26] O. Gustafsson, E. Bertilsson, J. Klasson, and C. Ingemarsson, “Approx-
imate Neumann series or exact matrix inversion for massive MIMO?”
in Proc. IEEE Symp. Comput. Arithmetic, 2017, invited paper.
