In this work, a scalable and modular architecture for massive MIMO base stations with distributed processing is proposed. New antennas can readily be added by adding a new node as each node handles all the additional involved processing. The architecture supports conjugate beamforming, zero-forcing, and MMSE, where for the two latter cases a central matrix inversion is required. The impact of the time required for this matrix inversion is carefully analyzed along with a generic frame format. As part of the contribution, careful computational, memory, and communication analyses are presented. It is shown that all computations can be mapped to a single computational structure and that a processing node consisting of a single such processing element can handle a broad range of bandwidths and number of terminals.
I. INTRODUCTION
THE ever-increasing demand for higher data rates in wireless communication opens up many opportunities and challenges in the fifth generation (5G) wireless infrastructure [1], [2]. One such opportunity is the use of many antennas on the base station side, commonly referred to as Massive MIMO or Very Large Scale MIMO [3]-[7]. By using many antennas compared to the number of terminals, the transmit and receive power of each antenna and the processing associated with each antenna can be reduced. Furthermore, the total energy per bit per user can potentially be reduced at the system level compared to traditional few-antenna solutions.
However, while massive MIMO is a promising technology, there are still obstacles to overcome before systems of this type can be deployed. A number of demonstrators [8]-[11] have been used to demonstrate the feasibility of the techniques. These demonstrators typically have between 64 and 128 antennas and support up to 12 terminals. In addition, some work has been done to implement parts or all of the involved processing [12]-[17], either using a centralized node with all processing [12], [13], [15] or distributing the processing to several nodes [14], [16], [17]. For the centralized node architectures, the case of 128 antennas and 8 terminals is typically considered.
Of more interest in our current context are the examples of distributed processing. In [14], a base station system design constructed from identical modules is proposed. The baseband processing is distributed among the modules, which are connected in an array. The modules contain RF front ends, digital baseband processing, and a digital interconnection link to all four neighbors. The system design is analyzed along with cost and power consumption issues. However, there are no details on how the baseband processing should be performed or on the impact of timing constraints.
In [16], [17], the authors propose distributed processing based on COTS processors. However, the timing constraints of the considered LTE frame structure are not taken into consideration. Additionally, the systems are not dimensioned to meet the maximum obtainable throughputs of the considered specifications.
Compared to a centralized architecture, in a distributed architecture, the number of antennas at the base station can more easily be scaled. For the centralized architecture, increasing the number of antennas more or less requires a complete redesign of the system. In a distributed architecture, the number of antennas can be increased by adding another node that contains the antenna and circuitry for the associated processing. Furthermore, a distributed architecture enables performing the computations close to the antenna, possibly integrated on the same chip as the radio. In the case of component failure, the modularity allows a single node to be replaced instead of replacing a large centralized unit.
Additionally, for centralized implementations the required data rate to read all uplink data from the ADCs and to feed the downlink data to the DACs grows with the number of antennas, making it very high for systems with many antennas. Finally, a higher manufacturing yield can be expected since each chip is smaller for a distributed architecture.
In the current work, a node and system architecture is proposed that is distributed, modular, and scalable. It supports conjugate beamforming (CB), zero forcing (ZF) [18] , and minimum mean-square error (MMSE) [19] , where in the two latter cases a matrix inversion is performed in a central unit. The computational difference between ZF and MMSE is that in MMSE, a regularization term is added before performing the matrix inverse. This is also done at the central unit. Therefore, we only discuss CB and ZF explicitly, as the node processing for MMSE is the same as for ZF. The main contributions in this work are:
• Analysis of distributed and modular processing in a massive MIMO-OFDM system
• Node and system architecture for a distributed, modular, and scalable MIMO-OFDM system
• Computation, memory, and communication analysis for the nodes and system
• Analysis of timing constraints and their effects on resource requirements
• Deterministic scheduling/control of the nodes and system
• Design space exploration showing that the proposed node architecture can be used in a rather large set of scenarios
A preliminary version of the current manuscript was presented in [20]. Compared to the distributed architecture in [14], we suggest using a tree interconnection of the nodes, although the proposed approach can also be used with other interconnection topologies. In particular, we perform an analysis of the computational and timing requirements and propose a detailed node architecture, along with scheduling of the computations and inter-node communication. Compared to the distributed architectures in [16], [17], we propose an optimized node architecture instead of using generic processors. Additionally, the timing constraints of the selected frame format are carefully analyzed.
II. PROPOSED SYSTEM ARCHITECTURE
In this work, the proposed system architecture consists of one central control unit (CCU) and a scalable part, as illustrated in Fig. 1. The CCU is responsible for performing operations such as error correction coding/decoding and operations associated with the other network layers, such as medium access control (MAC). The scalable part is responsible for the channel estimation, linear precoding, and linear decoding of symbols transmitted to and from the base station. Every node in the scalable part contains computational blocks for the associated antenna(s) and inter-node communication links. One or more nodes can be combined into a chip for different granularity. The main difference is the latency of the inter-node communication, which within a chip will be one or a few clock cycles, while the inter-chip latency may be in the range of one hundred clock cycles, assuming a serial link and a clock frequency of hundreds of MHz. Here it is assumed that the nodes are clocked synchronously. In downlink operation, the CCU feeds modulated symbols for each terminal to the nodes. In uplink operation, the nodes compute estimates of the symbols transmitted from the terminals and send them to the CCU.
In this work, we propose to connect the nodes in a K-ary tree; however, the nodes can be connected in other topologies as well. It is worth pointing out that, independently of the interconnection topology, during the accumulation of data from all nodes, each node will only transmit data to one other node on the way to the CCU to avoid duplicate transmissions and accumulations. This means that accumulating data will always be performed in a tree structure, independently of the interconnection topology. By modifying the interconnection topology, the number of hops when accumulating data is changed. Trees have some inherent advantages and disadvantages compared to array topologies. One of the most profound advantages of the tree structure is that the number of routing hops, $N_{hops}$, grows logarithmically with the number of nodes in the tree, as opposed to proportionally to the square root of the number of nodes for arrays. Figure 2 shows two different arrays and a tree topology. For systems with a large number of antennas this is a major benefit when using ZF or MMSE processing, as the latency of propagating data through the tree affects the system design. For CB processing, low latency is not as important.
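The difference in growth can be illustrated with a small Python sketch that counts worst-case hops for a complete tree and for a square array. The function names, the attachment of the CCU to a corner node of the array, and the convention of counting one extra hop from the root (or corner) node to the CCU are assumptions made for illustration; they are not necessarily the definition of $N_{hops}$ used in this work.

```python
import math


def tree_hops(num_nodes: int, arity: int = 2) -> int:
    """Worst-case hops to the CCU in a complete `arity`-ary tree.

    Assumption: the root node is one hop away from the CCU.
    """
    depth = 0          # depth of the deepest level currently counted
    level_size = 1     # number of nodes on that level
    capacity = 1       # nodes that fit in levels 0..depth
    while capacity < num_nodes:
        level_size *= arity
        capacity += level_size
        depth += 1
    return depth + 1


def array_hops(num_nodes: int) -> int:
    """Worst-case hops to the CCU in a square array with nearest-neighbor links.

    Assumption: the CCU is attached to a corner node and data is routed
    along rows and columns (Manhattan distance).
    """
    side = math.isqrt(num_nodes - 1) + 1   # ceil(sqrt(num_nodes))
    return 2 * (side - 1) + 1


for m in (16, 64, 256, 1024):
    print(m, tree_hops(m), array_hops(m))
```

Under these assumptions the hop count of the binary tree increases by one each time the number of nodes doubles, while the hop count of the array roughly doubles each time the number of nodes is quadrupled.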
Another design trade-off that needs to be considered is fault tolerance. In an ordinary tree structure, if a node fails during operation that entire branch will not be able to communicate with the rest of the tree. In an array topology, this could be mitigated by routing data past the failing node. This however increases the node complexity since a routing mechanism must be implemented.
Additionally, there is the aspect of physical antenna placement and cable routing. In systems where antennas are placed in an array, the array-based node topologies have the advantage of simpler cable routing. In systems where the antennas are scattered in some irregular pattern this advantage is lost.
For the remainder of the article, for ease of exposition, complete binary trees are considered where each chip contains one node.
A. System Specification
The considered setup is a TDD-based system that utilizes OFDM. A generalized frame structure can be seen in Fig. 3.
Fig. 3. Generalized frame structure.
The frame starts with $N_{UL,1}$ uplink OFDM symbols where the terminals transmit data to the base station. Then comes the uplink pilot symbol, where all terminals transmit a unique pilot sequence that is used to estimate the uplink radio channel. Another $N_{UL,2}$ uplink OFDM symbols are sent after the pilot. Then comes a guard interval to switch from uplink to downlink operation. The base station then transmits $N_{DL}$ OFDM symbols to the terminals. The frame duration is
where T OFDM is the duration of one OFDM symbol. Generally, it is favorable to place the pilot close to the middle of the frame to reduce the time between sending the pilot and the data.
Here, the synchronization between transmitters and receivers is not considered. The system parameters are shown in Table I .
III. COMPUTATIONAL TASKS
To utilize the proposed architecture efficiently, the algorithms used must be expressed in a distributed manner. The processing can be divided into three phases: channel estimation, uplink data decoding and downlink data precoding.
A. Channel Estimation
Here, a channel estimation based on least squares is considered. Let
be the pilot vector transmitted by terminal k. The scalar p is computed statically at design time. Each node receives the signal vector
where $X_p \in \mathbb{C}^{K \times K}$ is the pilot matrix and $N_p \in \mathbb{C}^{1 \times K}$ is a noise vector. The pilot matrix $X_p$ is given by
When the pilot signals have been received at node $i$, it has all the data necessary to estimate the channels to the $K$ users, without any inter-node communication. This is done by multiplying the received pilot signals by the scalar $1/p$. The local channel estimate vector is
Assuming that the channels are frequency flat, the entire channel estimate matrix $H \in \mathbb{C}^{M \times K}$ can be written as
where $h_{i,j} \in \mathbb{C}$ is the channel coefficient between antenna $i$ and terminal $j$. After the locally performed channel estimation, node $i$ has computed and stored row $i$ of the channel matrix.
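The local estimation step can be sketched in Python as follows. The sketch assumes that the pilot matrix is a scaled identity, $X_p = p I$, so that the received pilot vector at node $i$ is $p h_i$ plus noise; this assumption is suggested by the multiplication with $1/p$ but is not confirmed by the text. The function name and the numerical values are hypothetical.

```python
def estimate_channel(received_pilots, p):
    """Least-squares channel estimate at one node.

    Assumes X_p = p * I, so entry k of `received_pilots` is p * h_{i,k}
    plus noise, and the estimate is obtained by scaling with 1/p.
    """
    inv_p = 1.0 / p            # can be precomputed at design time
    return [y * inv_p for y in received_pilots]


# Hypothetical example with K = 3 terminals.
p = 2.0
h_true = [0.5 + 0.2j, -0.3 + 0.9j, 1.1 - 0.4j]
noise = [0.01j, -0.02 + 0j, 0.005 + 0.005j]
y_p = [p * h + n for h, n in zip(h_true, noise)]

h_hat = estimate_channel(y_p, p)
print(h_hat)
```

No data from other nodes enters the computation, which is why the channel estimation needs no inter-node communication.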
B. Linear Decoding and Precoding Matrices
In the uplink data transmission, the base station separates the received signal vector $y \in \mathbb{C}^{M \times 1}$ into $K$ streams of symbols $\tilde{y} \in \mathbb{C}^{K \times 1}$. This is done by multiplication with a linear detection matrix $A \in \mathbb{C}^{K \times M}$. For the considered algorithms, the decoding matrix is
In the downlink data transmission, the symbol vector $q \in \mathbb{C}^{K \times 1}$ is precoded into the vector $x \in \mathbb{C}^{M \times 1}$, which is sent from the $M$ antennas. This is done by multiplication with a linear precoding matrix $W \in \mathbb{C}^{M \times K}$. For the considered algorithms, the precoding matrix is
For CB, the linear detection matrix A and the linear precoding matrix W are obtained directly from the channel estimation. Each node then has access to one column of the decoding matrix.
For the ZF algorithm, calculating A and W involves performing a pseudo inverse of the channel matrix H. The ZF precoding matrix is
Let $D = (H^H H)^{-1}$. For the ZF case, the matrices can then be rewritten as
and
Given the fact that $H^H H$ is Hermitian, we know that its inverse is also Hermitian. With the Hermitian property ($D = D^H$), the decoding matrix can be written as
Since the decoding and precoding matrices are each other's transpose, the local decoding column vector and the precoding row vector will be identical.
To calculate $A$ and $W$, $D$ must be known. The Gram matrix of the channel estimates, $H^H H$, can be calculated in a distributed manner across all nodes. The inversion is then performed in the CCU. Let $B = H^H H$.
The matrix $h_i^H h_i$ is the Gram matrix of the local channel estimate vector in node $i$, and can be computed locally without any inter-node communication since the required data is obtained from the channel estimation. It is a Hermitian matrix, thus only $K(K+1)/2$ entries must be computed. The computation performed in node $i$ is
The local contributions are added together as the matrices are propagated upwards in the tree to form the Gram matrix of the channel estimates. This reduces the computational complexity of the CCU and reduces the amount of data to be sent in the tree. Instead of propagating a matrix with $M \times K$ values, due to the Hermitian property only $K(K+1)/2$ values need to be propagated. However, the computational load in each node is increased, since $B_i$ must be computed at the node.
When the D matrix has been computed in the CCU, it is propagated downwards in the tree structure to all nodes. The nodes can then calculate their local detection and precoding vectors by multiplying the inverted matrix with their local channel estimate vector. The computation performed in node i is
where $A_i$ is the local decoding vector and $W_i = A_i$ is the local precoding vector. The process of determining the local precoding/decoding vector, $A_i$, is illustrated in Fig. 4. The leaf nodes, 1 and 2, compute their local contributions to the Gram matrix, $B_1$ and $B_2$ respectively, and send them to the parent node, 3. Node 3 computes its own local contribution, $B_3$, and sums it together with the contributions from the child nodes before sending the sum upwards to the CCU. The CCU performs the matrix inversion and redistributes the result downwards in the tree. When each node receives the inverted matrix, it computes its local precoding/decoding vector.
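The flow in Fig. 4 can be sketched in Python for a hypothetical system with $M = 3$ nodes and $K = 2$ terminals. The helper names and channel values are illustrative assumptions, and the closed-form 2x2 inverse merely stands in for whatever inversion method the CCU uses. The local vector is computed as $A_i = D h_i^H$, following the description above.

```python
def local_gram(h):
    """B_i = h_i^H h_i for a 1xK row vector h (K x K Hermitian matrix)."""
    k = len(h)
    return [[h[r].conjugate() * h[c] for c in range(k)] for r in range(k)]


def mat_add(x, y):
    """Element-wise sum, as performed when contributions move up the tree."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def inv2x2(b):
    """Closed-form inverse of a 2x2 matrix (stand-in for the CCU inversion)."""
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    return [[b[1][1] / det, -b[0][1] / det],
            [-b[1][0] / det, b[0][0] / det]]


def local_vector(d, h):
    """A_i = D h_i^H, the local decoding (and precoding) vector of length K."""
    k = len(h)
    return [sum(d[r][c] * h[c].conjugate() for c in range(k)) for r in range(k)]


# Hypothetical channel estimates: row i of H is stored in node i.
h = {1: [1 + 1j, 2 + 0j], 2: [0.5j, 1 - 1j], 3: [-1 + 0j, 0.5 + 0.5j]}

# Leaves 1 and 2 send B_1 and B_2 to node 3, which adds its own B_3.
b = mat_add(mat_add(local_gram(h[1]), local_gram(h[2])), local_gram(h[3]))

d = inv2x2(b)                                      # computed in the CCU
a = {i: local_vector(d, h[i]) for i in h}          # computed in each node

# Sanity check: A H should equal the identity matrix for ZF.
a_times_h = [[sum(a[i][r] * h[i][c] for i in h) for c in range(2)]
             for r in range(2)]
print(a_times_h)
```

Only the $K \times K$ matrices $B$ and $D$ travel through the tree; the rows of $H$ never leave their nodes.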
C. Uplink Linear Decoding
The decoding process is performed by multiplying the received signal vector $y \in \mathbb{C}^{M \times 1}$ with the decoding matrix, $A$. During the decoding, each node has access to one column of the decoding matrix and one sample of the received signal vector. The symbol vector estimate $\tilde{y}$ is
By multiplying the local sample with the local decoding column, a local contribution to the received symbol vector is computed. When the local contributions are calculated in each node, they are sent upwards in the tree structure. The contributions are added together as they propagate to the CCU.
Since the local contribution can be calculated using only the local sample and one column of the decoding matrix, the entire decoding matrix does not need to be available in all nodes. The computation performed for each subcarrier in node i is
where A i is the local decoding vector. This is computed similarly to computing B in Fig. 4 .
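A minimal Python sketch of the accumulation for one subcarrier is given below. Each node scales its local decoding vector by its received sample and adds the partial sums received from its children. The tree description as a dictionary of child lists, the function names, and the numerical values are assumptions for illustration only.

```python
def local_contribution(a_i, y_i):
    """K-element contribution A_i * y_i of one node for one subcarrier."""
    return [a * y_i for a in a_i]


def accumulate(node, contributions, children):
    """Sum the contribution of `node` and of all nodes below it in the tree."""
    total = list(contributions[node])
    for child in children.get(node, ()):
        partial = accumulate(child, contributions, children)
        total = [t + p for t, p in zip(total, partial)]
    return total


# Hypothetical 3-node tree as in Fig. 4: leaves 1 and 2, parent 3, K = 2.
children = {3: [1, 2]}
decoding_vectors = {1: [1 - 1j, 2 + 0j], 2: [0.5j, 1 + 1j], 3: [-1 + 0j, 0.5 - 0.5j]}
samples = {1: 1 + 0j, 2: 2j, 3: -1 + 1j}

contributions = {n: local_contribution(decoding_vectors[n], samples[n])
                 for n in decoding_vectors}
y_tilde = accumulate(3, contributions, children)   # value sent to the CCU
print(y_tilde)
```

Every link carries exactly $K$ values per subcarrier, independently of how many nodes are located further down in the tree.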
D. Downlink Linear Precoding
The precoding process is done by multiplying the symbol vector, $q \in \mathbb{C}^{K \times 1}$, with the precoding matrix, $W \in \mathbb{C}^{M \times K}$. During the precoding, each node has access to the symbol vector and one row of the precoding matrix.
The value transmitted at node i is the inner product between row i of the precoding matrix and the symbol vector q.
Similarly to the decoding case, each node only requires one row of the precoding matrix to perform the precoding. Thus, the entire matrix does not need to be distributed to all nodes. The computation performed for each subcarrier in node i is
where $W_i$ is the local precoding vector. The symbol vector, $q$, is distributed to the nodes similarly to $D$, and the computation of $x_i$ is performed similarly to that of $A_i$ in Fig. 4.
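The per-node precoding amounts to one sum of products per subcarrier, as the following Python sketch shows. The function names are hypothetical, and giving each subcarrier its own precoding row is an assumption of the sketch.

```python
def precode(w_i, q):
    """Transmit value x_i = W_i q for one subcarrier (sum of K products)."""
    return sum(w * s for w, s in zip(w_i, q))


def precode_symbol(w_rows, q_vectors):
    """Apply `precode` to every utilized subcarrier of one OFDM symbol."""
    return [precode(w_i, q) for w_i, q in zip(w_rows, q_vectors)]


# Hypothetical example with K = 2 terminals and two subcarriers.
w_rows = [[1 - 1j, 2j], [0.5 + 0j, -1j]]
q_vectors = [[1 + 0j, -1 + 0j], [2 + 0j, 1j]]
x_i = precode_symbol(w_rows, q_vectors)
print(x_i)
```

The result is one complex value per subcarrier, which is then passed to the IFFT of that antenna.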
E. OFDM Modulation and Demodulation
In a massive MIMO OFDM system, the OFDM modulation and demodulation are performed for each antenna. Therefore, one FFT/IFFT must be performed in the node for each OFDM symbol (pilot, uplink, and downlink). The length of the FFT/IFFT is $N_{FFT}$, while the number of subcarriers utilized is $N_{SC}$.
F. Processing Element
As is shown in Section VIII, having one processing element that performs all computations in the node is enough to support a large range of different combinations of the number of terminals and channel bandwidth. Therefore, it is beneficial to find a common structure of the involved computations discussed earlier. The channel estimation only requires multiplications with 1/p as shown in Fig. 5(a) . For uplink decoding, each node performs a multiplication and adds data from the other nodes further down the tree, for a binary tree as shown in Fig. 5(b) . For the downlink precoding, a sum-of-products is locally computed, which consists of multiplication and accumulation, as shown in Fig. 5(c) .
Finally, the FFT and IFFT consist of butterfly operations and twiddle factor multiplications. Considering the operations in Figs. 5(a)-(c), it makes sense to use a radix-2 decimation in time (DIT) algorithm. This algorithm has the property that each butterfly operation has a twiddle factor multiplication in front of one of the inputs [21], as shown in Fig. 5(d). Although there exist many other radix-2 algorithms, the radix-2 DIT algorithm is the only one with this property for each and every butterfly. As a note, it is often believed that DIT corresponds to bit-reversed input order and normal output order. However, this is not the case, as the butterfly computation order and, hence, the data dependency are independent of the algorithm selection. A conflict-free memory access scheme with low hardware overhead can be found in, e.g., [22]. These operations can be efficiently mapped to a processing element as shown in Fig. 6. The number of operations for each task and the type of operations are summarized in Table II.
[Table II, rows: $x_i$, precoded symbol for all subcarriers, (20); $W_i/A_i$, local precoding and decoding vector, (16).]
In cases where multiple processing elements are used, the processing element selection may be reconsidered. In this case, it might be beneficial to map different computational tasks to different processing elements, enabling specialized structures for the given task. Similarly, if the computational requirements per antenna are low, it may be beneficial to interleave the computations for more than one antenna on a single processing element.
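The butterfly structure of the radix-2 DIT algorithm can be sketched in Python as below. Each butterfly multiplies one of its inputs by a twiddle factor and then forms a sum and a difference, which is the multiply-then-add/subtract pattern shared with the other node computations. The recursive formulation is chosen for readability and assumes a power-of-two length; it says nothing about the in-place, conflict-free memory organization of [22].

```python
import cmath


def fft_dit(x):
    """Radix-2 decimation-in-time FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor multiplication in front of one butterfly input ...
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        # ... followed by the butterfly (one addition and one subtraction).
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


x = [complex(i, (3 * i) % 5) for i in range(8)]
max_error = max(abs(a - b) for a, b in zip(fft_dit(x), dft(x)))
print(max_error)
```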
G. Computational Partitioning
So far, all computations that can be performed in a distributed manner have been assumed to be performed that way. However, this does not need to be the case. Consider ZF processing, where the uplink decoding is performed as
So far, the decoding matrix $A = (H^H H)^{-1} H^H$ has been computed once every frame. This requires that the inverted matrix is redistributed to the nodes before the decoding process can start. Another possibility is to compute $H^H y$ in each node, just as in the conjugate beamforming case, and multiply with the inverted matrix once the results reach the CCU. The distributed parts of the decoding could then be started independently of the matrix inversion. Similarly for the downlink precoding, the ZF processing is performed according to
where the precoding matrix $W = H^{*} \left( (H^H H)^{-1} \right)^{*}$ is computed once every frame. This requires that the inverted matrix is available in the node before the precoding can start. By multiplying the complex conjugate of the inverted matrix with the symbol vectors, $\left( (H^H H)^{-1} \right)^{*} q$, in the CCU before they are sent to the nodes, the inverted matrix itself is not needed at each node for the precoding step.
By partitioning the computations in this way, the inverted matrix does not need to be redistributed to the nodes. However, the computational load in the CCU is significantly increased. The computational load of each node is only slightly reduced, since precoding and decoding are still performed in a distributed manner. The only difference is that the $A$/$W$ computation does not need to be performed.
IV. COMPLEXITY ANALYSIS
In this section, the computational, memory, and communication complexities are analyzed.
A. Computational Complexity
The computational complexity of each task is shown in Table II. With the selected frame format, there are two major limitations on the computational resources. First, since the frame is repeated cyclically, all computations for one frame must be performed within the duration of one frame, $T_{frame}$. This yields the average number of operations per received sample, $N_{op,avg}$. The number of operations that need to be performed to obtain the precoding/decoding vector is
where the first term corresponds to demodulating the OFDM symbol (FFT), the second term to estimating the $K$ channels (CE), and the third term to computing the local contribution to the channel Gram matrix ($B_i$). The fourth term comes from multiplying the inverted matrix with the local channel estimates to create the local decoding and precoding vector. The number of operations performed for each uplink OFDM symbol is
where the first term corresponds to demodulating the OFDM symbol, and the second term to computing the local contribution to the received symbol vector. The number of operations required for each downlink OFDM symbol is
where the first term corresponds to performing the precoding for each subcarrier utilized, and the second term to performing the OFDM modulation. The number of operations performed for the uplink and downlink OFDM symbols is the same:
The total number of operations per sample on average over an entire frame is then
where N UL = N UL,1 + N UL,2 is the total number of uplink OFDM symbols, and N DL is the number of downlink OFDM symbols. Without considering data dependencies or critical paths, this is the theoretical lower bound on the number of operations per sample that the node must be able to perform. The other limitation is that the downlink symbols must be processed before their respective deadlines. In practice there will be N DL critical paths in the schedule for one frame. Figure 7 shows the computational tasks performed in each node, the critical paths in the frame, the important times and the possibility to buffer or process the uplink OFDM symbols. The critical paths in the computations are from receiving the pilot symbol, estimating the channels, computing the local contribution to the Gram matrix, performing the centralized matrix inversion, computing the local precoding/decoding vector, and finally performing the precoding for each downlink OFDM symbol. The number of operations on the critical path for downlink symbol i is
The time available to perform the operations on the critical path for downlink symbol i is
Between receiving the pilot symbol and transmitting downlink symbol $i$, there are $(N_{UL,2} + i)$ OFDM symbols, including the guard interval. However, during the time in which the local Gram matrices are propagated to the CCU, the inverse is computed, and the result is redistributed to the nodes, which in total takes $T_{inv} + 2 N_{hops} T_{link}$, no computations on the critical paths can be performed. Hence, the worst-case average number of computations per sample on the critical paths is
This means that the time to perform matrix inversion and internode communication latency may affect the computational requirements.
If the system specifications are kept, but the number of uplink and downlink OFDM symbols are increased, the average number of operations per sample over an entire frame increases as well. This is due to the two guard intervals becoming less significant with an increasing number of OFDM symbols. When the number of uplink and downlink OFDM symbols is large, the number of operations per sample is
meaning that one OFDM symbol must be processed in the duration of one OFDM symbol. As seen from (30) and (31), the matrix inversion time, $T_{inv}$, and the total inter-node communication latency, $2 N_{hops} T_{link}$, may affect the computational requirements. For fixed inter-node communication latency, this behavior is displayed in Fig. 8. There are two inversion times marked in Fig. 8. The first time, $T_{inv,A}$, is the time when the critical path requires equally many operations per sample as the frame average ($N_{OPS,critical} = N_{OPS,avg}$). The second time, $T_{inv,B}$, is when the number of operations on the critical path grows larger than the number of operations per sample in the asymptotic case.
[Figure: y-axis "Operations per sample"; legend: critical path, first DL symbol; critical path, last DL symbol; frame average; asymptotic frame average.]
Figure 9 shows how the number of operations per sample for varying $T_{inv}$ changes when the number of OFDM symbols in a frame is changed. In Fig. 9(a) the number of OFDM symbols is small. In this case there is a significant gap between $N_{OPS,avg}$ and $N_{OPS,asymptotic}$ and between $T_{inv,A}$ and $T_{inv,B}$. When the number of OFDM symbols increases, these gaps decrease, as shown in Fig. 9(b). Additionally, it can be seen in Fig. 9(b) that when the number of OFDM symbols is large, the time $T_{inv,B}$ acts as a deadline for the matrix inversion. If the inverse is received later than $T_{inv,B}$, the required number of operations per sample grows rapidly. It can be seen in Fig. 9 that the critical path for the last downlink symbol is the first to cross the frame average line. Combining (1), (27), (29), and (30) leads to
The point T inv,B is given by the equation
for downlink symbol i. Using (28), (29), and (32), T inv,B can be expressed as
It can be seen that T inv,B is identical for all downlink symbols. 
Selecting $f_{clk}$ as an integer multiple of $f_{sample}$, the number of operations per sample that can be performed with $N_{PE}$ processing elements is
In most cases $N_{OPS}$ will not be an integer. However, $\hat{N}_{OPS}$ will, and, hence, there is a slack time that can be used to increase the number of terminals, $K$, the number of antennas, $M$, and/or the matrix inversion time, $T_{inv}$. If $T_{inv} < T_{inv,A}$, the slack time can be used to process some uplink symbols, say $N_{UL,PB}$, before the downlink symbols, as discussed below. Alternatively, the pilot symbol can be moved closer to the downlink symbols, i.e., $N_{UL,2}$ can be decreased, as discussed in Section II-A. While this section focuses on ZF, the same analysis can be made for CB processing. This yields similar results, but with one significant difference. The precoding and decoding vector is obtained from the channel estimation, which means that the computational tasks $B_i$, the central matrix inversion, and $W_i/A_i$ are not performed. This results in
T frame f sample (40) for CB. Hence, the number of operations to perform locally does not decrease significantly, but the latency issues of performing centralized computations vanish.
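The relation between clock frequency, number of processing elements, and slack can be sketched in Python. The sketch assumes that each processing element completes one operation per clock cycle, so that the available number of operations per sample is $N_{PE} f_{clk} / f_{sample}$. This assumption, the function names, and the numerical values are illustrative and are not taken from the analysis above.

```python
import math


def ops_per_sample_capacity(n_pe, f_clk_hz, f_sample_hz):
    """Operations per sample offered by `n_pe` processing elements.

    Assumes one operation per processing element per clock cycle and that
    f_clk is an integer multiple of f_sample.
    """
    if f_clk_hz % f_sample_hz != 0:
        raise ValueError("f_clk is assumed to be an integer multiple of f_sample")
    return n_pe * (f_clk_hz // f_sample_hz)


def slack_ops_per_sample(n_ops_required, n_pe, f_clk_hz, f_sample_hz):
    """Unused operations per sample, available for larger K, M, or T_inv."""
    return ops_per_sample_capacity(n_pe, f_clk_hz, f_sample_hz) - n_ops_required


def min_clock_multiple(n_ops_required, n_pe=1):
    """Smallest integer ratio f_clk / f_sample that meets the requirement."""
    return math.ceil(n_ops_required / n_pe)


# Hypothetical numbers: 30.72 MHz sample rate, 245.76 MHz clock, one element.
print(ops_per_sample_capacity(1, 245_760_000, 30_720_000))
print(slack_ops_per_sample(6.3, 1, 245_760_000, 30_720_000))
print(min_clock_multiple(6.3))
```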
B. Memory Complexity
Dimensioning the memories in the node will in part depend on the frame structure that is chosen, and in part on the scheduling of computations and inter-node communication. In Fig. 7 the gray boxes illustrate uplink OFDM symbols that must be stored locally in the node before they are processed. The number of symbols that must be stored is
and the number of bits required to store these symbols is
For an uplink OFDM symbol, the number of variables during its lifetime in the node is seen in Fig. 10 . Between times T 0 and T 1 the OFDM symbol is sampled from the antenna and stored in memory in the node. During this period, the number of variables grows until N FFT . The duration between T 0 and T 1 is slightly shorter than one OFDM symbol, due to the cyclic prefix not being stored. At time T 2 the OFDM demodulation starts and is finished at time T 3 . The FFT computation can be made in-place, meaning that no additional memory is strictly required. However, towards the end of the FFT computation, some variables can be discarded due to only N SC subcarriers being utilized. When the decoding starts at time T 4 , there will be a data expansion by a factor K, since each subcarrier is multiplied with the decoding vector. When the decoding is finished, there are KN SC variables. These are the number of variables that will exist during the lifetime of one uplink symbol, but not all of them must be stored.
C. Communication Complexity
One of the advantages of distributing the computations among multiple nodes is that the number of values that need to be sent to the centralized structure in the system grows proportionally to the number of terminals, $K$, rather than the number of antennas in the system, $M$. In massive MIMO systems where $M \gg K$, this is clearly advantageous. The number of bits that need to be sent upwards in the tree structure during one frame is
$$N_{bits,up} = \left( \frac{K(K+1)}{2} + (N_{UL,1} + N_{UL,2}) \right) N_{SC} W_{comp}, \tag{43}$$
which corresponds to the local contributions to the Gram matrix, $B_i$, and the symbol vector estimates, $\tilde{y}_i$. These values are all used for computations and thus the longer word length $W_{comp}$ is required. The required upwards link data rate is
The downwards propagation differs in that the word length of the modulated symbols is much shorter. Downwards, only the raw symbols are propagated to all nodes, using the shorter word length W symbol . However, the inverted matrix still needs to be represented with W comp . The number of bits sent downwards is
The required downwards link datarate is then
However, this is only the minimum required data rate. If the data is not sent between the nodes at the same rate as it is consumed, buffers (which may incur a significant increase in die area) are needed, as discussed in Section IV-D. The reduced number of values sent from the antennas to the central unit is often used as an argument for performing distributed processing. While this is indeed the case, it must also be noted that the word lengths of the data are different. For a centralized architecture, the word length depends on the ADC, so the number of bits is proportional to $M W_{ADC}$. For a distributed architecture, $\tilde{y}$ and $B$ are transmitted, so the number of bits is proportional to $K W_{comp}$. Since these are sums of products in which one factor of each product is the sample value, one may expect that in general $W_{comp} > W_{ADC}$. However, as $M \gg K$, the total number of bits transmitted to the central unit should still be significantly smaller. Furthermore, it is important that the intermediate values are properly scaled as more and more terms are added along the path to the central unit.
Fig. 11. Number of stored memory variables during the lifetime of one uplink OFDM symbol. [Plot labels: FFT, Decoding MV, Send to parent; x-axis: Time.]
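The word-length argument can be made concrete with a short Python sketch that compares the two proportionalities, $M W_{ADC}$ for a centralized architecture and $K W_{comp}$ for a distributed one. The function name and all numerical values are hypothetical and only serve to illustrate the comparison per received sample or subcarrier.

```python
def bits_to_central_unit(m, k, w_adc, w_comp):
    """Return (centralized, distributed) bit counts, up to a common factor.

    Centralized: every antenna forwards one ADC word, i.e. M * W_ADC.
    Distributed: K accumulated values in the longer word length, K * W_comp.
    """
    centralized = m * w_adc
    distributed = k * w_comp
    return centralized, distributed


# Hypothetical values: 128 antennas, 8 terminals, 12-bit ADC, 24-bit words.
centralized, distributed = bits_to_central_unit(m=128, k=8, w_adc=12, w_comp=24)
print(centralized, distributed, centralized / distributed)
```

With these assumed numbers, doubling the word length of the accumulated values still leaves the distributed case at one eighth of the centralized bit count.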
D. Balancing Computations, Communication, and Memories
To obtain an optimized architecture the different types of resources must be balanced. Here, the processing, communication and memory capabilities are included.
Considering the inter-node communication for one uplink OFDM symbol, the number of stored variables in each node can be seen in Fig. 11. From sampling the radio until the FFT is finished, the number of stored variables is the same as the number of existing variables in Fig. 10. The output data from the decoding process has no further data dependencies in the current node. These variables need to be sent to the parent node, so it can perform its own decoding process.
In Fig. 11, the time $T_4$ to $T_5$ is again the time taken to perform the decoding. The time $T_4$ to $T_6$ is the time taken to send the local contributions to the decoded signal vectors to the parent node. It can be noted that $T_6 \geq T_5$. When the decoding starts, the number of variables that need to be stored locally increases due to the data expansion of the decoding, but at the same time decreases due to variables being sent to the parent node, and thus not needing to be stored.
There are two extreme cases of this behavior. The first is if $T_6$ tends to infinity. In this case, all variables must be stored locally, due to none of them being sent over the link. Clearly, this solution is not feasible. The other extreme is if $T_6 = T_5$. This means that all output variables are sent directly to the parent node and do not need to be stored locally.
As described earlier, the decoding on the parent node cannot be performed until the decoding output has been sent over the link. The implication of this is that the system should be designed such that the rate of processing and the rate of sending variables between nodes are the same.
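The two extremes can be illustrated with a deliberately simple model. The sketch below assumes that the decoding produces its output variables at a constant rate between T 4 and T 5 and that the link drains them at a constant rate between T 4 and T 6 . This linear model and the function name are assumptions made for illustration, not the exact behavior shown in Fig. 11 .

```python
def peak_output_backlog(n_out, t4, t5, t6):
    """Peak number of decoded variables waiting for the link.

    Assumes (for illustration) that decoding produces n_out variables at a
    constant rate over [t4, t5] and that the link drains them at a constant
    rate over [t4, t6], with t4 < t5 <= t6.
    """
    if not (t4 < t5 <= t6):
        raise ValueError("expected t4 < t5 <= t6")
    # Production is at least as fast as draining, so the backlog grows
    # until decoding ends at t5 and only shrinks afterwards.
    sent_by_t5 = n_out * (t5 - t4) / (t6 - t4)
    return n_out - sent_by_t5


print(peak_output_backlog(1200, 0, 10, 10))  # matched rates: 0.0
print(peak_output_backlog(1200, 0, 10, 20))  # link at half rate: 600.0
```

In this model the backlog vanishes only when T 6 = T 5 , and it approaches the full number of output variables as T 6 grows.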
The requirements on the downward communication, however, are not as strict. The data that is propagated from the CCU to the nodes in the tree is not processed on the way down, but simply forwarded to the next node. This has the implication that it is not required to send the data at the same rate as it is processed. Doing so does, however, avoid the need for large buffers in each node, making it desirable. Feeding the nodes with data is a rather straightforward trade-off between link data rate and buffer size.
V. SCHEDULING
The computational tasks and data dependencies when using ZF processing can be seen in Fig. 7 . This schedule is not correctly scaled, but rather made to illustrate the data dependencies and the need for a better realization. It is, e.g., clear that there is time at the end of the frame where no operations are currently performed. Therefore, it makes sense to move parts of the computations there to obtain a better utilization of the processing elements. This will come at a cost of memory as the data must be stored rather than processed directly.
Here, a node with only one processing element is considered. The processing element is assumed to support the required number of operations per sample for the asymptotic case, i.e., N OPS ≥ N OPS,asymptotic . The schedule is created as in Fig. 12 . Initially, the node waits for the pilot OFDM symbol. The computations for determining the precoding/decoding vector are then started. These include an FFT, the channel estimation, and the computation of the B i matrix. When the inverted matrix is received, the precoding/decoding vector is computed. After this stage, the uplink and downlink symbols can be processed. In order to reduce the number of uplink symbols that need to be buffered before processing, N UL,PB uplink symbols are processed first. All downlink symbols are then processed in order to meet their deadlines. When the downlink symbols are finished, the node processes N UL,1 + 3 uplink symbols. Two uplink symbols can be processed while the last downlink symbol is transmitted and during the guard interval. Another uplink symbol can be processed while the pilot of the next frame is sampled. The remaining N UL − N UL,PB − N UL,1 − 3 uplink symbols are processed while the node waits for the inverted matrix for the next frame.
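The split of the uplink symbols over the frame described above can be sketched as follows. The function name and the example values (N UL = 7, N UL,PB = 1, N UL,1 = 1) are hypothetical and only illustrate the bookkeeping.

```python
def uplink_schedule(n_ul, n_ul_pb, n_ul_1):
    """Split the N_UL uplink symbols of a frame into processing phases.

    Returns a list of (phase description, number of uplink symbols).
    """
    after_downlink = n_ul_1 + 3
    while_waiting = n_ul - n_ul_pb - after_downlink
    if min(n_ul_pb, n_ul_1, while_waiting) < 0:
        raise ValueError("parameters do not fit in one frame")
    return [
        ("before the downlink symbols", n_ul_pb),
        ("after the downlink symbols", after_downlink),
        ("while waiting for the next inverted matrix", while_waiting),
    ]


# Hypothetical frame parameters, for illustration only.
for phase, count in uplink_schedule(n_ul=7, n_ul_pb=1, n_ul_1=1):
    print(f"{count} uplink symbol(s) {phase}")
```

The three counts always sum to N UL , so every uplink symbol is processed exactly once per frame.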
As can be seen, the processing is fully deterministic for the asymptotic case, and, hence, a simple control unit can be implemented, where the different system parameters can be configured. For the non-asymptotic case, the same general structure is implemented. However, as the processing is possibly distributed differently within the frame, a slightly more flexible control unit is required. Alternatively, the control signals can be stored in a RAM acting as an instruction memory.
A. System Level Scheduling
For the computational tasks ỹ i and B i in Table II there are inter-node data dependencies, as described in Section III. Before the local PE operation can be performed, the corresponding contributions from the child nodes must be sent over the inter-node link. The latency of sending a value over the link is T link . For each level in the tree, the ỹ i and B i computations must be skewed by this amount in order for the parent node to receive the data before processing.
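A minimal sketch of this skew is given below. It assumes that the levels are counted from the leaf nodes, that every link has the same latency, and that only the link latency separates corresponding operations on adjacent levels. The function name and the numerical values are hypothetical.

```python
def skewed_start_offsets(n_levels, t_link):
    """Start offset of the same y~_i / B_i operation on each tree level.

    Level 0 holds the leaf nodes; level n_levels - 1 is closest to the CCU.
    Each level is delayed by one link latency relative to its children.
    """
    return [level * t_link for level in range(n_levels)]


# Hypothetical example: four levels and a link latency of 0.5 time units.
print(skewed_start_offsets(n_levels=4, t_link=0.5))  # [0.0, 0.5, 1.0, 1.5]
```

The total skew between the leaves and the node closest to the CCU therefore grows linearly with the depth of the tree.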
VI. ARCHITECTURE
In this section, an architecture for the node is proposed. The main components in the architecture are the off-chip I/O, processing core, memory system and the RF-chain.
As seen later on in Section VIII, many system scenarios can be covered with a single processing element in each node. Hence, we focus on that here. Further inspection of the arithmetic operations in Fig. 5 reveals that each input port of the processing element is connected only to a few specific types of data. This means that not all types of data need to be fed into every port of the PE. For instance, only the twiddle factors, channel estimates, or the precoding/decoding vector are connected to one input of the multiplier. Taking this into account leads to the proposed node architecture shown in Fig. 13 . The node architecture uses a processing element as shown in Fig. 6 .
The twiddle factor memory can be implemented as a ROM, since the twiddle factors are static. The channel estimates and precoding/decoding vector memories are single-port memories that can be either written or read in one cycle. Although only the precoding/decoding vector is required during precoding and decoding, the channel estimates must be stored until all precoding/decoding values are computed, and, hence, both must be stored. For simplicity, we choose to have a separate memory allocation for the channel estimates, instead of using, e.g., the sample memory. The sample memory is more complex and is divided into three separate memories as shown in Fig. 14. The first memory is the radio input buffer, which stores raw data from the AD converter. The size of this memory is
The FFT processing buffer is used when performing the FFT/IFFT computations and its size is
The last memory is the radio output buffer which holds the finished downlink OFDM symbols that are to be sent to the DA converter. The cyclic prefix of the OFDM symbol is also fetched from the memory. Its size is
In Fig. 14, the memories are shown as two-port memories that can be read and written simultaneously. In many cases, it may be beneficial to use two single-port memories of half the size instead. For the input and output buffers, it is straightforward to use memories that alternate between reading and writing. For the FFT processing buffer, this is also possible using, e.g., the approach in [22] . All memory sizes and word lengths for the architecture in Fig. 13 are summarized in Table III . The exact required word 
VII. EXAMPLE: LTE-LIKE SYSTEM SPECIFICATIONS
Here, the requirements for an LTE-like system using ZF processing are considered. This is the typical specification considered in most earlier work. The system specifications can be seen in Table IV . For this specification, the throughput is
in each direction. The centralized matrix inversion can be performed either using an exact [23] or an approximate [12] , [24] , [25] algorithm. As shown in [26] , the complexity is similar for the best exact algorithm and a Neumann series approximation with three terms. In both cases a 20 × 20 matrix inversion can be performed in less than 40 µs using one processing element running at 200 MHz. Based on these specifications and (27) A schedule for the computations in the LTE-like system is derived and can be seen in Fig. 15 . Deriving the schedule is rather straightforward since all tasks are performed sequentially. The slack time between determining the local precoding/decoding vector, W i /A i , and the start of the precoding, x i , can be utilized to modify the specification, as discussed below.
The size of the radio input buffer is the size of the FFT processing buffer is
and the size of the radio output buffer is
Hence, a total of 120 kb of memory is required in each node.
In addition, 960 more bits are needed for the channel estimates and precoding/decoding vector.
As was shown in Section IV, the computations and communication need to be performed at the same rate. With the selected number of operations performed per sample, N OPS , the data rates are
The available slack time can be used to modify the specifications of the system. By tweaking the parameters and redoing the calculations, we can investigate which configurations are supported with N OPS = 12. For example, the matrix inversion time can be increased up to 54.2 µs with exactly the same node architecture. Alternatively, the number of users can be increased to K = 21, assuming that the matrix inversion time increases cubically with K. For K = 30 and the same assumption, N OPS = 28 can be selected. In this case, either one PE running at 860.16 MHz or two PEs running at 430.08 MHz can be used 2 . Alternatively, for the example, N hops can be increased up to 22, leading to a maximum of M = 2^22 − 1 antennas, assuming a binary tree. Even though this can be further increased by increasing N OPS , this should not pose a limitation in most cases.
If we want to process one uplink symbol before the first downlink symbol, i.e., N UL,PB = 1, we must select N OPS = 17. To move the pilot symbol one symbol closer to the downlink symbols, i.e., N UL,1 = 1 and N UL,2 = 1, we again need N OPS = 17, although this equality does not hold in general. Halving the matrix inversion time leads to N OPS = 15 in both cases. This illustrates that when the critical paths are limiting, increasing the computational capabilities in the CCU, i.e., decreasing the matrix inversion time, leads to reduced computational requirements in the nodes.
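The clock frequencies quoted above are consistent with a PE that executes one operation per clock cycle at a sample rate of 30.72 MHz. Both are assumptions here: the sample rate is inferred from 860.16 MHz / 28 rather than read from Table IV. Under those assumptions, a minimal sketch of the arithmetic, together with the node count of a full binary tree, could look as follows. The function names are hypothetical.

```python
# Assumed sample rate in Hz, inferred from 860.16 MHz / 28 operations per sample.
SAMPLE_RATE_HZ = 30_720_000


def pe_clock_hz(n_ops, n_pes=1, sample_rate_hz=SAMPLE_RATE_HZ):
    """Clock frequency per PE when n_ops operations per sample are shared
    equally by n_pes PEs, assuming one operation per PE per clock cycle."""
    return n_ops * sample_rate_hz / n_pes


def max_nodes_binary_tree(n_hops):
    """Number of nodes in a full binary tree with n_hops levels."""
    return 2 ** n_hops - 1


print(f"{pe_clock_hz(28) / 1e6:.2f} MHz")           # 860.16 MHz
print(f"{pe_clock_hz(28, n_pes=2) / 1e6:.2f} MHz")  # 430.08 MHz
print(max_nodes_binary_tree(22))                    # 4194303
```

With one antenna per node, the last value corresponds to the maximum M for 22 hops.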
Naturally, any valid combination of these modifications can be realized.
VIII. DESIGN SPACE EXPLORATION
The clock frequency required in the LTE-like example is not a problem to achieve in a modern process technology through, e.g., pipelining, which is straightforward since the execution is deterministic. Hence, it is possible to change the bandwidth and/or the number of terminals. Here, we consider three different clock frequencies up to 1 GHz for a system otherwise as in the LTE-like case. Figure 16 shows the bounds on bandwidth and number of terminals for a given clock frequency. In Fig. 16 (a) the asymptotic case is shown, N OPS = N OPS,asymptotic , meaning that the number of OFDM symbols in each frame is large. In Fig. 16 (b) the frame format of the LTE-like case is used. In both cases the average number of computations over an entire frame is used. Thus, it is assumed that the matrix inversion is performed fast enough to not influence the required number of operations per second, i.e., T inv ≤ T inv,A .
It is noted from Fig. 16 that increasing the channel bandwidth by a factor of two roughly requires that the number of simultaneous terminals is reduced by the same factor.
Here, the length of the FFT, N FFT , and the number of subcarriers utilized, N SC , are scaled linearly with the bandwidth of the channel. This is usually not the case in practice, since the FFT length is favorably selected as a power of two. The plots still give a good estimate of the available design space.
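Since practical FFT lengths are usually powers of two, a linearly scaled length would typically be rounded up. A minimal sketch of that rounding is given below. The helper name is hypothetical, and the example lengths are chosen only for illustration.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


# Scaling a 2048-point FFT by 1.5 gives 3072, which rounds up to 4096.
print(next_power_of_two(3072))  # 4096
print(next_power_of_two(2048))  # 2048
```

The rounding makes the real requirement change in steps, whereas the linear scaling used for the plots varies smoothly with the bandwidth.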
IX. CONCLUSIONS
In this work, a scalable system architecture using distributed processing was proposed for the base station in a massive MIMO system. It was shown that the computations associated with each antenna can be distributed and in most of the earlier studied use cases only a simple single processing element running at a few hundred MHz and a modest amount of memory are required. It was further shown that it is feasible to have a simple synchronous control of the nodes and that the inter-node communication can be handled by one or a few high-speed serial links. All computations required by adding an antenna are handled by the introduced additional node.
The case of connecting the nodes as a binary tree was primarily studied, although the architecture is readily extended to a K-ary tree. It is here worth noting that an array architecture with static scheduling will behave as a binary or ternary tree, and, hence, the same concept can be used for an array interconnect with additional simple routing logic surrounding the processing node. As the processing core is so small, it is also of interest to possibly have more than one node in a chip, reducing the amount of inter-chip communication channels. The exact granularity is left for future work.
The architecture supports conjugate beamforming, zero-forcing, and MMSE processing. In the latter two cases, a matrix inversion is performed in a central control unit, but all other computations are distributed. The impacts of the matrix inversion latency and the pilot position on the computational requirements in the node are studied and related to each other.
