Massive multi-user (MU) multiple-input multiple-output (MIMO) provides high spectral efficiency by means of spatial multiplexing and fine-grained beamforming. However, conventional base-station (BS) architectures for systems with hundreds of antennas that rely on centralized baseband processing inevitably suffer from (i) excessive interconnect data rates between radio-frequency circuitry and processing fabrics, and (ii) prohibitive complexity at the centralized baseband processor. Recently, decentralized baseband processing (DBP) architectures and algorithms have been proposed, which mitigate the interconnect bandwidth and complexity bottlenecks. This paper systematically explores the design trade-offs between error-rate performance, computational complexity, and data transfer latency of DBP architectures under different system configurations and channel conditions. Considering architecture, algorithm, and numerical precision aspects, we provide practical guidelines for selecting the DBP architecture and algorithm that realize the full benefits of massive MU-MIMO in the uplink and downlink.
I. INTRODUCTION
Massive multi-user (MU) multiple-input multiple-output (MIMO) will be a key technology component in fifth-generation (5G) and future wireless communication systems [1]. The idea of this technology is to equip the infrastructure base-stations (BSs) with hundreds of antenna elements while serving tens of user equipments (UEs) simultaneously and in the same frequency band. The presence of a large number of antennas at the BSs enables fine-grained beamforming, which provides higher spectral efficiency than traditional, small-scale MIMO systems [2], [3]. However, naïvely scaling up small-scale MIMO systems to large antenna arrays will inevitably result in a range of practical implementation challenges [4], which must be resolved before deploying massive MU-MIMO in practice.
A. Challenges with Centralized Baseband Processing
The excessively large amount of raw baseband data that must be transferred between the BS antenna array and the baseband processing backhaul is among the most critical challenges that arise with large antenna arrays [4]-[7]. As an example, a 256-BS-antenna massive MU-MIMO system with 12-bit digital-to-analog converters (DACs) supporting a bandwidth of 80 MHz requires raw baseband data rates from and to the radio-frequency (RF) chains that approach 1 Tb/s. Such high data rates not only exceed the limits of existing interconnect technology by large margins, e.g., that of the Common Public Radio Interface (CPRI) [8], but also push chip input/output (I/O) interfaces, power dissipation, and processing capabilities of modern computing fabrics, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), or application-specific integrated circuits (ASICs), to their limits. Figure 1 illustrates the interconnect bandwidth and complexity bottlenecks (highlighted in red) in the massive MU-MIMO uplink (UEs transmit to BS) with conventional centralized baseband processing architectures. Note that we face the same bottlenecks in the downlink (BS transmits to UEs). Although one could resort to maximum ratio combining (MRC) for fully distributed data detection and maximum ratio transmission (MRT) for fully distributed precoding, MRC and MRT result in low spectral efficiency compared to that of more complex centralized algorithms, such as minimum mean-square error (MMSE) equalization [2] or linear Wiener filter precoding [9]. Existing massive MU-MIMO testbeds, such as the BigStation [10] and the Lund testbed [11], incorporate centralized MMSE and zero-forcing (ZF) algorithms while parallelizing the computation workload across subcarriers. While this approach mitigates the complexity bottleneck, it still suffers from the interconnect and chip I/O bottlenecks as one must transfer raw baseband data from and to all antennas.
The work was supported in part by Xilinx, Inc. and by the US NSF under grants ECCS-1408370, CNS-1717218, CNS-1827940, ECCS-1408006, CCF-1535897, CCF-1652065, and CNS-1717559.
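The order of magnitude of the data rate quoted above can be checked with a short calculation. The following is a minimal sketch, assuming one complex (I and Q) sample per antenna per sample period, a sampling rate equal to the 80 MHz bandwidth, and no oversampling or framing overhead; these assumptions and the function name are illustrative and not taken from the text.

```python
def raw_baseband_rate_bps(num_antennas, bits_per_sample, sample_rate_hz):
    """Raw I/Q data rate in bit/s for one link direction.

    Assumes one complex sample (I and Q, each `bits_per_sample` wide)
    per antenna per sample period, without oversampling or overhead.
    """
    return num_antennas * 2 * bits_per_sample * sample_rate_hz


# 256 antennas and 12-bit converters are from the text;
# the 80 MS/s sampling rate is an assumption.
one_way = raw_baseband_rate_bps(256, 12, 80e6)
both_ways = 2 * one_way

print(f"one direction:   {one_way / 1e9:.1f} Gb/s")
print(f"both directions: {both_ways / 1e12:.2f} Tb/s")
```

Under these assumptions, each direction carries roughly half a terabit per second, so the combined traffic from and to the RF chains is close to 1 Tb/s.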
B. Decentralized Baseband Processing
To effectively avoid the bottlenecks of centralized baseband processing and to enable scalability to massive MU-MIMO architectures with hundreds or even thousands of antenna elements, recent work [4], [7], [12] introduced decentralized baseband processing (DBP). DBP partitions the antenna array into smaller clusters, each associated with separate RF circuitry and baseband processing fabrics. Each antenna cluster only connects with the associated computing fabrics, which perform local baseband processing tasks, such as channel estimation, data detection in the uplink, and precoding in the downlink. While consensus-sharing based methods have been proposed for these tasks [4], [13], [14], the iterative exchange of information among clusters suffers from data transfer latency, which negatively affects the design efficiency. Recently, the references [15]-[17] and [9] proposed feedforward DBP architectures for uplink detection and downlink precoding, respectively. Such feedforward architectures avoid the repeated exchange of information among clusters, which mitigates data transfer latency issues [9]. Furthermore, the theoretical analysis in [16] revealed that feedforward equalization architectures with linear algorithms are able to achieve the same or similar spectral efficiency as their centralized counterparts.
C. Contributions
While the literature describes a number of feedforward DBP architectures and algorithms [7], [9], [16], a systematic trade-off analysis under different system configurations and channel conditions is missing. Such a trade-off analysis, however, is critical to making design decisions for practical massive MU-MIMO systems that rely on DBP. This paper focuses on such trade-offs at different levels for feedforward DBP architectures and algorithms that avoid iterative consensus exchange. In Section II, we analyze the performance and data transfer bandwidth trade-offs of two feedforward architectures as a function of system configuration and channel conditions, and we show that the channel coherence time is a critical design factor for architecture-level trade-offs. In Section III, we investigate the performance and complexity trade-offs of different feedforward equalization and precoding algorithms, and we show that channel coherence time and channel reciprocity are important design factors for algorithm-level trade-offs. In Section IV, we study the performance and efficiency trade-offs when reducing the arithmetic precision of data transfers and numerical computations. In Section V, we conclude the paper and summarize practical design guidelines.
II. ARCHITECTURE TRADE-OFFS
We now study the performance and data transfer bandwidth trade-offs in the uplink and downlink for DBP feedforward architectures dependent on the channel's coherence time.
A. System Models and Architectures
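We now detail the uplink and downlink channel models and provide details on DBP feedforward architectures.
1) Uplink System: In the uplink, the BS antenna array is partitioned into $C$ clusters, where cluster $c$ is equipped with $B_c$ antennas. The input-output relation at each cluster is
$$\mathbf{y}_c^{\mathrm{ul}} = \mathbf{H}_c^{\mathrm{ul}} \mathbf{s}^{\mathrm{ul}} + \mathbf{n}_c^{\mathrm{ul}}, \quad c = 1, \ldots, C.$$
Here, $\mathbf{y}_c^{\mathrm{ul}} \in \mathbb{C}^{B_c}$ is the local receive vector, $\mathbf{s}^{\mathrm{ul}}$ is the transmit data vector of the $U$ UEs, $\mathbf{H}_c^{\mathrm{ul}} \in \mathbb{C}^{B_c \times U}$ is the local channel matrix, a sub-matrix of $\mathbf{H}^{\mathrm{ul}}$, and $\mathbf{n}_c^{\mathrm{ul}} \in \mathbb{C}^{B_c}$ represents the local noise vector at cluster $c$. We focus on two distinct feedforward DBP architectures put forward in [15]: the partially-decentralized (PD) architecture and the fully-decentralized (FD) architecture shown in Fig. 2. The PD architecture computes local Gram matrices and local MRC outputs at the clusters and fuses them at the centralized processing unit to produce the equalization output $\hat{\mathbf{x}}^{\mathrm{ul}}$. The FD architecture performs local equalization and requires data fusion of the local equalization outputs $\hat{\mathbf{x}}_c^{\mathrm{ul}}$ and local noise variances $(\sigma_c^{\mathrm{ul}})^2$ to produce the equalization output $\hat{\mathbf{x}}^{\mathrm{ul}}$. See [15], [16] for the details.
2) Downlink System: In the downlink, the BS computes the precoding vector $\mathbf{x}^{\mathrm{dl}} = \mathbf{P}^{\mathrm{dl}} \mathbf{s}^{\mathrm{dl}}$, where $\mathbf{x}^{\mathrm{dl}} \in \mathbb{C}^{B}$, $\mathbf{P}^{\mathrm{dl}} \in \mathbb{C}^{B \times U}$ is the precoding matrix, and $\mathbf{s}^{\mathrm{dl}} \in \mathcal{O}^{U}$ is the transmit data vector. At the UEs, the receive vector $\mathbf{y}^{\mathrm{dl}} \in \mathbb{C}^{U}$ is given by $\mathbf{y}^{\mathrm{dl}} = \mathbf{H}^{\mathrm{dl}} \mathbf{x}^{\mathrm{dl}} + \mathbf{n}^{\mathrm{dl}}$, where $\mathbf{H}^{\mathrm{dl}} \in \mathbb{C}^{U \times B}$ and $\mathbf{n}^{\mathrm{dl}} \in \mathbb{C}^{U}$ are the downlink channel matrix and noise vector, respectively.
Analogous to the uplink, reference [9] proposed feedforward DBP architectures for the downlink, where each of the $C$ antenna clusters uses the local downlink channel matrix $\mathbf{H}_c^{\mathrm{dl}} = (\mathbf{H}_c^{\mathrm{ul}})^T$ in order to form the local beamforming vector $\mathbf{x}_c^{\mathrm{dl}}$. See [9] for the details.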
B. PD Architecture vs. FD Architecture
We now compare the PD architecture and FD architecture and explore the trade-offs for architecture selection. We first focus on a case study of the linear MMSE data detector in the uplink with the PD and FD architectures, to showcase how architecture selection for a certain algorithm affects the error rate performance, computational complexity, and data transfer size under different antenna configurations and channel conditions, such as channel coherence time. We later extend our analysis to precoding in the downlink and discuss more algorithm variations and their trade-offs in Section III.
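Linear uplink data detection can be formulated as the following optimization problem (we omit the superscript $\mathrm{ul}$):
$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathbb{C}^{U}} \; \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 + \rho \|\mathbf{x}\|_2^2.$$
Here, $\rho = N_0 / E_s$ is used for linear MMSE equalization, where $E_s$ is the per-UE transmit power. The corresponding closed-form solution for (centralized) MMSE detection is given by
$$\hat{\mathbf{x}} = (\mathbf{G} + \rho \mathbf{I})^{-1} \mathbf{y}^{\mathrm{MRC}},$$
where $\mathbf{G} = \mathbf{H}^H \mathbf{H}$ is the Gram matrix and $\mathbf{y}^{\mathrm{MRC}} = \mathbf{H}^H \mathbf{y}$ is the MRC output. For PD-MMSE detection, we first compute the local Gram matrix $\mathbf{G}_c = \mathbf{H}_c^H \mathbf{H}_c$ and the local MRC output $\mathbf{y}_c^{\mathrm{MRC}} = \mathbf{H}_c^H \mathbf{y}_c$ at each cluster. We then fuse them into $\mathbf{G} = \sum_{c=1}^{C} \mathbf{G}_c$ and $\mathbf{y}^{\mathrm{MRC}} = \sum_{c=1}^{C} \mathbf{y}_c^{\mathrm{MRC}}$ at the centralized processor. Finally, we calculate the matrix inversion to perform equalization at the centralized processor, which yields the global equalization output $\hat{\mathbf{x}}$ and noise variance $\sigma^2$. For FD-MMSE detection, we first compute $\mathbf{G}_c$ and $\mathbf{y}_c^{\mathrm{MRC}}$ locally, the same as for PD-MMSE. We then calculate a matrix inversion and perform equalization locally to obtain the local estimate $\hat{\mathbf{x}}_c$ and noise variance $\sigma_c^2$. Finally, we fuse the local estimates $\hat{\mathbf{x}}_c$ together with the local noise variances $\sigma_c^2$ at the centralized processor, which yields the global equalization output $\hat{\mathbf{x}}$. See [15], [16] for the details.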
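The reason PD fusion loses nothing relative to centralized processing is that the Gram matrix and MRC output of the full channel are the sums of their per-cluster counterparts. The following is a minimal sketch that checks this numerically on random data using only the Python standard library; the small sizes and all helper names are illustrative assumptions, not part of the referenced designs.

```python
import random


def conj_transpose(A):
    """Hermitian transpose of a matrix stored as a list of rows."""
    rows, cols = len(A), len(A[0])
    return [[A[i][j].conjugate() for i in range(rows)] for j in range(cols)]


def matmul(A, B):
    """Matrix-matrix product for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


rng = random.Random(0)
C, Bc, U = 3, 4, 2  # small illustrative sizes, not the paper's configuration


def rand_c():
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))


H_clusters = [[[rand_c() for _ in range(U)] for _ in range(Bc)] for _ in range(C)]
y_clusters = [[rand_c() for _ in range(Bc)] for _ in range(C)]

# Local computations at each cluster: G_c = H_c^H H_c and y_c^MRC = H_c^H y_c.
G_local = [matmul(conj_transpose(Hc), Hc) for Hc in H_clusters]
y_mrc_local = [matvec(conj_transpose(Hc), yc)
               for Hc, yc in zip(H_clusters, y_clusters)]

# Fusion at the centralized processor: element-wise sums over clusters.
G_fused = [[sum(Gc[i][j] for Gc in G_local) for j in range(U)] for i in range(U)]
y_mrc_fused = [sum(v[i] for v in y_mrc_local) for i in range(U)]

# Centralized reference: stack all clusters and compute directly.
H_full = [row for Hc in H_clusters for row in Hc]
y_full = [v for yc in y_clusters for v in yc]
G_central = matmul(conj_transpose(H_full), H_full)
y_mrc_central = matvec(conj_transpose(H_full), y_full)

gram_error = max(abs(G_fused[i][j] - G_central[i][j])
                 for i in range(U) for j in range(U))
mrc_error = max(abs(a - b) for a, b in zip(y_mrc_fused, y_mrc_central))
print(gram_error, mrc_error)
```

Both errors are at the level of floating-point rounding, which illustrates why the fused quantities can feed the same equalizer as the centralized ones.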
From the above explanations, we see that the timing complexity, which we measure as the number of real-valued multiplications, is approximately the same for both architectures.
The key factor that can lead to different efficiency of the PD and FD architectures lies in the data transfer size at the data fusion stage. For the PD architecture, data fusion requires the transfer of both the local $\mathbf{G}_c$, a $U \times U$ complex-valued Hermitian matrix consisting of $U^2$ unique real values, and the local $\mathbf{y}_c^{\mathrm{MRC}}$, a $U$-dimensional complex-valued vector that contains $2U$ real values, which leads to a total of $C \times (U^2 + 2U)$ real values for each data symbol. If we consider a typical scenario for which the estimated channel in the uplink is static across $N_{\mathrm{coh}}$ contiguous symbols, then the local Gram matrix $\mathbf{G}_c$, which depends only on the local channel matrix $\mathbf{H}_c$, needs to be transferred only once every $N_{\mathrm{coh}}$ symbols, while the transfer of $\mathbf{y}_c^{\mathrm{MRC}}$ is required for every symbol. Therefore, the average data transfer size $m_{\mathrm{PD}}$ for each symbol at the fusion stage of the PD architecture is as follows:
$$m_{\mathrm{PD}} = \frac{C\,(U^2 + 2 N_{\mathrm{coh}} U)}{N_{\mathrm{coh}}}. \tag{4}$$
In contrast, for the FD architecture, data fusion requires the transfer of the local $\hat{\mathbf{x}}_c$, a $U$-entry complex-valued vector, and $\sigma_c^2$, a $U$-entry real-valued vector, which leads to a total of $3U$ real values per cluster that must be transferred for every symbol regardless of the channel coherence time. Therefore, the average data transfer size $m_{\mathrm{FD}}$ for each symbol in the FD architecture is
$$m_{\mathrm{FD}} = 3CU. \tag{5}$$
When extended to a multi-subcarrier transmission, e.g., using orthogonal frequency-division multiplexing (OFDM), the quantities $m_{\mathrm{PD}}$ and $m_{\mathrm{FD}}$ represent the average data transfer size per symbol on each subcarrier. From (4) and (5), we see that the channel coherence time determines whether the PD or FD architecture requires more or less data to be transferred.
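The dependence on the coherence time can be made concrete by evaluating the two per-symbol transfer sizes described in this subsection. The following is a minimal sketch; the function names are illustrative, and the formulas follow the counts stated in the text (a Gram matrix with $U^2$ real values sent once per coherence interval plus $2U$ real values per symbol for PD, and $3U$ real values per symbol for FD, each multiplied by the number of clusters $C$).

```python
def m_pd_uplink(C, U, n_coh):
    """Average real values per symbol fused in the PD architecture."""
    return C * (U**2 + 2 * n_coh * U) / n_coh


def m_fd_uplink(C, U):
    """Real values per symbol fused in the FD architecture."""
    return 3 * C * U


C, U = 4, 16  # cluster count and UE count used in the paper's case study
for n_coh in (4, 16, 64):
    pd, fd = m_pd_uplink(C, U, n_coh), m_fd_uplink(C, U)
    if pd < fd:
        smaller = "PD"
    elif pd > fd:
        smaller = "FD"
    else:
        smaller = "tie"
    print(f"N_coh={n_coh:3d}  m_PD={pd:7.1f}  m_FD={fd:5d}  smaller: {smaller}")
```

For this configuration the two sizes coincide at a coherence interval equal to the number of UEs; shorter intervals favor FD and longer intervals favor PD in terms of transfer size.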
C. Trade-off Analysis
While the data transfer size and efficiency of the PD and FD architectures depend on system parameters and channel conditions, the PD architecture always outperforms the FD architecture in terms of error-rate performance, as it is able to achieve the same performance as centralized data detection [15], [16]. Figure 3(a) compares the uncoded bit error-rate (BER) of PD-MMSE and FD-MMSE data detection for a single-carrier system with a cluster size of $B_c = 32$, $C = 4$ clusters, and $U = 16$ UEs with 16-QAM. We simulate the BER performance with a simple i.i.d. Rayleigh fading channel, and a more realistic urban micro-campus non-line-of-sight (NLOS) Quadriga channel [18], where we place the UEs randomly in a 120° sector at a distance of 50 to 100 from the BS, which is using a uniform linear array.
We see that PD-MMSE clearly outperforms FD-MMSE under both channel environments in terms of uncoded BER. To showcase the trade-off between BER and data transfer size for PD and FD, Fig. 3(b) compares the average per-symbol data transfer sizes $m_{\mathrm{PD}}$ and $m_{\mathrm{FD}}$ of the PD and FD architectures, respectively, under the same antenna configuration but for different channel coherence times characterized by $N_{\mathrm{coh}}$. We find that for small coherence times, i.e., if $N_{\mathrm{coh}} < U$, we have $m_{\mathrm{FD}} < m_{\mathrm{PD}}$, which implies that one should select the FD rather than the PD architecture if the data fusion efficiency has higher priority than BER performance. If $N_{\mathrm{coh}} > U$, then $m_{\mathrm{PD}} < m_{\mathrm{FD}}$, and the PD architecture is preferred for both better data fusion efficiency and BER performance.
Similarly to the uplink, we can use the PD and FD architectures to realize decentralized precoding in the downlink, e.g., using linear Wiener filter (WF) precoding [9]. For PD-WF precoding, the total data transfer size, which consists of both the fusion of local Gram matrices $\mathbf{G}_c$ and the broadcasting of a centralized whitened vector (scaled with $U$), is always larger than that of FD-WF precoding, which only requires the broadcasting of the transmit vector $\mathbf{s}$ (scaled with $U$) and a scalar power allocation value. We can calculate similar average per-symbol data transfer sizes for downlink precoding as
$$m_{\mathrm{PD}} = \frac{C\,(U^2 + 2 N_{\mathrm{coh}} U)}{N_{\mathrm{coh}}} \quad \text{and} \quad m_{\mathrm{FD}} = \frac{C\,(1 + 2 N_{\mathrm{coh}} U)}{N_{\mathrm{coh}}}.$$
Clearly, the FD architecture is more efficient in terms of data transfer since $m_{\mathrm{FD}} < m_{\mathrm{PD}}$ holds in typical scenarios. Only if $N_{\mathrm{coh}} \gg U^2$, i.e., for nearly static channels, do we have $m_{\mathrm{FD}} \approx m_{\mathrm{PD}}$. However, the BER performance of PD-WF is never worse than that of FD-WF. Therefore, selecting the PD or FD architecture for the downlink is solely determined by the system designer's preference, i.e., whether BER or data transfer is the critical factor.
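The downlink transfer sizes stated above can be evaluated in the same way as in the uplink. The following is a minimal sketch with illustrative function names; it also checks that the gap between the two sizes equals $C(U^2 - 1)/N_{\mathrm{coh}}$, which follows directly from subtracting the two expressions and explains why the gap only becomes negligible for very long coherence intervals.

```python
def m_pd_downlink(C, U, n_coh):
    """Average real values per symbol transferred for PD precoding."""
    return C * (U**2 + 2 * n_coh * U) / n_coh


def m_fd_downlink(C, U, n_coh):
    """Average real values per symbol transferred for FD precoding."""
    return C * (1 + 2 * n_coh * U) / n_coh


C, U = 4, 16  # same illustrative configuration as in the uplink case study
for n_coh in (1, 16, 256, 4096):
    pd = m_pd_downlink(C, U, n_coh)
    fd = m_fd_downlink(C, U, n_coh)
    gap = pd - fd  # equals C * (U**2 - 1) / n_coh
    print(f"N_coh={n_coh:5d}  m_PD={pd:8.2f}  m_FD={fd:8.2f}  gap={gap:8.2f}")
```

The FD size stays below the PD size for every coherence interval in this sweep, and the two converge toward the per-symbol broadcast cost of $2CU$ as the interval grows.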
III. ALGORITHM TRADE-OFFS
We now investigate the selection of data detection or precoding algorithms, given that either the PD or FD architecture has been already chosen. We first discuss the trade-offs for various decentralized algorithms, where we focus on explicit and implicit matrix-inversion-based uplink equalization methods with the PD architecture. A similar trade-off analysis applies for the FD architecture and downlink systems. We also explore situations in which we are able to store and reuse computation results from the uplink in order to facilitate downlink precoding, which enables additional complexity reduction in a time-division duplex (TDD) system.
A. Explicit vs. Implicit Algorithms
As discussed in Section II, a closed-form solution for (centralized) MMSE data detection is given by $\hat{\mathbf{x}} = (\mathbf{G} + \rho \mathbf{I})^{-1} \mathbf{y}^{\mathrm{MRC}}$ and requires (i) Gram matrix computation and (ii) matrix inversion of the regularized Gram matrix, both of which are computationally intensive but only depend on the channel matrix $\mathbf{H}$. For the PD or FD architectures, the timing complexity of Gram matrix computation at each cluster is reduced because of the smaller matrix size ($B_c \times U$), while the dimension of the complete Gram matrix is still the same (i.e., $U \times U$). This implies that the matrix inversion timing complexity is the same as that of the centralized MMSE equalizer. To reduce complexity, we can (i) take advantage of channel coherence, which allows us to reuse intermediate computation results across $N_{\mathrm{coh}}$ symbols, and (ii) avoid an explicit computation of the matrix inversion.
Consider PD-MMSE as an example. At each cluster, if we compute $\mathbf{G}_c$ only once and reuse it across $N_{\mathrm{coh}}$ receive symbols, then the average timing complexity (number of real-valued multiplications) per receive symbol for computing $\mathbf{G}_c + \rho \mathbf{I}$ can be as low as $2 B_c U^2 / N_{\mathrm{coh}}$. However, the computation of $\mathbf{y}_c^{\mathrm{MRC}} = \mathbf{H}_c^H \mathbf{y}_c$ relies not only on the channel $\mathbf{H}_c$ but also on the receive symbol $\mathbf{y}_c$, which leads to a timing complexity of $4 B_c U$ per receive symbol. For explicit Cholesky-based matrix inversion of a $U \times U$ matrix, which only depends on $\mathbf{H}$, we can again compute the matrix inversion only once and reuse it across $N_{\mathrm{coh}}$ receive symbols, so the average complexity per receive symbol is $(\frac{10}{3} U^3 - \frac{4}{3} U) / N_{\mathrm{coh}}$ [19]. Finally, to compute the equalization output $\hat{\mathbf{x}}$, the matrix-vector multiplication requires an additional $4U^2$ operations per receive symbol. Therefore, the timing complexity $n_{\mathrm{ex}}$ of all above steps to obtain $\hat{\mathbf{x}}$, averaged per symbol, for the explicit matrix inversion based PD-MMSE is
$$n_{\mathrm{ex}} = \frac{2 B_c U^2 + \frac{10}{3} U^3 - \frac{4}{3} U}{N_{\mathrm{coh}}} + 4 B_c U + 4 U^2. \tag{6}$$
The literature describes a number of implicit methods that can be used for directly computing $\hat{\mathbf{x}}$ while avoiding an explicit matrix inversion, such as the decentralized conjugate gradient method [13] or the decentralized coordinate descent method [20]. Such iterative methods typically obtain an approximate result, which entails a small BER loss. Furthermore, such implicit methods are unable to exploit the benefits of channel coherence since all iterative updates need to be computed for every symbol. We therefore propose to integrate implicit Cholesky-based MMSE detection [19] with the PD architecture, which not only computes the exact linear MMSE equalizer but also realizes low complexity compared to other iterative methods, especially for large $N_{\mathrm{coh}}$ when the intermediate Cholesky decomposition results can be reused. The implicit PD-MMSE algorithm computes the local Gram matrix $\mathbf{G}_c$ and MRC vector $\mathbf{y}_c^{\mathrm{MRC}}$ and fuses them to obtain the global Gram matrix $\mathbf{G}$ and MRC vector $\mathbf{y}^{\mathrm{MRC}}$ similarly to explicit PD-MMSE. After fusion, the implicit PD-MMSE method factorizes the regularized Gram matrix $\mathbf{A} = \mathbf{G} + \rho \mathbf{I}$ by the Cholesky decomposition $\mathbf{A} = \mathbf{L}\mathbf{L}^H$, where $\mathbf{L}$ represents a lower triangular matrix. One can then solve $\mathbf{L}\mathbf{z} = \mathbf{y}^{\mathrm{MRC}}$ and finally $\mathbf{L}^H \hat{\mathbf{x}} = \mathbf{z}$, by forward and backward substitution, respectively, in order to obtain $\hat{\mathbf{x}}$. While the forward and backward substitutions have to be carried out for every symbol with a total of $4U^2$ operations (including real-valued multiplications and divisions), the Cholesky decomposition, which dominates the complexity at $\frac{2}{3} U^3 - \frac{2}{3} U$, needs to be computed only once every $N_{\mathrm{coh}}$ symbols. Therefore, the resulting timing complexity $n_{\mathrm{im}}$ per symbol for the implicit Cholesky-based PD-MMSE approach is
$$n_{\mathrm{im}} = \frac{2 B_c U^2 + \frac{2}{3} U^3 - \frac{2}{3} U}{N_{\mathrm{coh}}} + 4 B_c U + 4 U^2. \tag{7}$$
In OFDM systems, the quantities $n_{\mathrm{ex}}$ and $n_{\mathrm{im}}$ represent the detection complexity on each subcarrier.
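The factor-once, solve-per-symbol pattern of the implicit method can be sketched in a few lines. The following is a minimal, unoptimized illustration in pure Python of a complex Cholesky factorization followed by forward and backward substitution; the 2 × 2 matrix stands in for a hypothetical regularized Gram matrix, and all function names are illustrative rather than taken from [19].

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L L^H (A Hermitian positive definite)."""
    n = len(A)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        diag = A[j][j].real - sum(abs(L[j][k]) ** 2 for k in range(j))
        L[j][j] = complex(math.sqrt(diag))
        for i in range(j + 1, n):
            acc = A[i][j] - sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = acc / L[j][j]
    return L


def forward_substitution(L, b):
    """Solve L z = b for lower-triangular L."""
    z = [0j] * len(b)
    for i in range(len(b)):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    return z


def backward_substitution(L, z):
    """Solve L^H x = z using the stored lower-triangular L."""
    n = len(z)
    x = [0j] * n
    for i in reversed(range(n)):
        acc = z[i] - sum(L[k][i].conjugate() * x[k] for k in range(i + 1, n))
        x[i] = acc / L[i][i].conjugate()
    return x


# Hypothetical regularized Gram matrix A = G + rho*I (Hermitian, positive definite).
A = [[2.5 + 0j, 1j], [-1j, 2.5 + 0j]]
L = cholesky(A)  # computed once per coherence interval

# Hypothetical fused MRC vectors for two symbols within the same interval.
symbols = [[1 + 1j, 2 - 1j], [0.5j, -1 + 0j]]
worst_residual = 0.0
for y_mrc in symbols:
    x_hat = backward_substitution(L, forward_substitution(L, y_mrc))
    residual = max(abs(sum(A[i][k] * x_hat[k] for k in range(2)) - y_mrc[i])
                   for i in range(2))
    worst_residual = max(worst_residual, residual)
print(worst_residual)
```

Only the two triangular solves run inside the per-symbol loop, which is the part of the cost that does not shrink with the coherence interval.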
By comparing (6) and (7), we see that the $U^3$ term dominates the complexity of both $n_{\mathrm{ex}}$ and $n_{\mathrm{im}}$, and that $n_{\mathrm{im}} < n_{\mathrm{ex}}$ since $n_{\mathrm{im}}$ has a smaller constant associated with this term; this indicates that the implicit method can reduce complexity at no loss in terms of BER compared to the explicit method. Figure 4 shows an example for a system with $B_c = 32$, $U = 16$, and $C = 4$, where we compare the timing complexity of explicit PD-MMSE and implicit PD-MMSE depending on the coherence time. We observe that the implicit method always achieves lower complexity, whereas the complexity is similar to that of the explicit method for large values of $N_{\mathrm{coh}}$.
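The comparison can be reproduced numerically from the operation counts given in this subsection. The following is a minimal sketch, assuming that the per-symbol complexity is simply the sum of the individual terms listed in the text (Gram computation and factorization or inversion amortized over the coherence interval, plus the per-symbol MRC and solve costs); the function names are illustrative.

```python
def n_explicit(Bc, U, n_coh):
    """Per-symbol real multiplications for explicit-inversion PD-MMSE."""
    once_per_interval = 2 * Bc * U**2 + (10 * U**3 - 4 * U) / 3
    every_symbol = 4 * Bc * U + 4 * U**2
    return once_per_interval / n_coh + every_symbol


def n_implicit(Bc, U, n_coh):
    """Per-symbol real operations for implicit Cholesky-based PD-MMSE."""
    once_per_interval = 2 * Bc * U**2 + (2 * U**3 - 2 * U) / 3
    every_symbol = 4 * Bc * U + 4 * U**2
    return once_per_interval / n_coh + every_symbol


Bc, U = 32, 16  # configuration used in the example of the text
for n_coh in (1, 4, 16, 64, 256):
    ex, im = n_explicit(Bc, U, n_coh), n_implicit(Bc, U, n_coh)
    print(f"N_coh={n_coh:4d}  n_ex={ex:9.1f}  n_im={im:9.1f}  ratio={im / ex:.3f}")
```

Under this reading, the implicit variant is cheaper for every coherence interval, and the ratio approaches one as the interval grows because the per-symbol terms are identical.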
We conclude by noting that we can similarly leverage both channel coherence and implicit inversion to reduce the complexity for FD-based uplink data detection. However, for the downlink, implicit inversion is not particularly helpful for the PD-WF or FD-WF precoding algorithms, which require the computation of an optimal scaling factor that depends on the explicit matrix inversion result of a regularized Gram matrix [9] .
As an alternative, one can resort to ZF-based algorithms (e.g., PD-ZF and FD-ZF), which enable implicit methods for both uplink detection and downlink precoding. Here, for ZF precoding, we can simply use the normalized ZF precoding vector $\hat{\mathbf{x}} / \|\hat{\mathbf{x}}\|_2$, scaled to satisfy the same transmit power constraint as WF precoding, without the need for explicit matrix inversion.
B. Reusing Uplink Results for Downlink
Due to channel reciprocity in TDD systems, we have $\mathbf{H}^{\mathrm{dl}} = (\mathbf{H}^{\mathrm{ul}})^T$, and therefore
$$\mathbf{G}^{\mathrm{dl}} = \mathbf{H}^{\mathrm{dl}} (\mathbf{H}^{\mathrm{dl}})^H = (\mathbf{H}^{\mathrm{ul}})^T (\mathbf{H}^{\mathrm{ul}})^* = \left((\mathbf{H}^{\mathrm{ul}})^H \mathbf{H}^{\mathrm{ul}}\right)^T = (\mathbf{G}^{\mathrm{ul}})^T = (\mathbf{G}^{\mathrm{ul}})^*,$$
where $(\cdot)^*$ indicates the entrywise complex conjugate operation and $\mathbf{G}^{\mathrm{ul}}$ is a Hermitian matrix. This indicates that the Gram matrix computed in the uplink can also be reused in the downlink by a simple transpose. We can also take advantage of channel reciprocity under decentralized architectures. For example, in FD-MMSE, the local Gram matrix $\mathbf{G}_c^{\mathrm{ul}}$ is computed. If we perform FD-WF precoding in the downlink, and each cluster computes the local precoding vector $\hat{\mathbf{x}}_c = \frac{1}{\beta_c} (\mathbf{H}_c^{\mathrm{dl}})^H (\mathbf{G}_c^{\mathrm{dl}} + \kappa_c \mathbf{I})^{-1} \mathbf{s}$, where $\frac{1}{\beta_c}$ is the scaling factor to satisfy the transmit power constraint and $\kappa_c$ regularizes $\mathbf{G}_c^{\mathrm{dl}}$ as detailed in [9], then we can store the uplink matrix $\mathbf{G}_c^{\mathrm{ul}}$ and reuse it as $\mathbf{G}_c^{\mathrm{dl}} = (\mathbf{G}_c^{\mathrm{ul}})^T$ for FD-WF precoding, even across $N_{\mathrm{coh}}$ symbols. By reusing intermediate results, this approach yields an additional complexity reduction of $2 B_c U^2 / N_{\mathrm{coh}}$ per symbol for a system that integrates FD-MMSE detection and FD-WF precoding, compared to the total complexity of computing them individually. A similar approach can be used for PD-MMSE detection and PD-WF precoding: in PD-MMSE detection, we aggregate the local $\mathbf{G}_c^{\mathrm{ul}}$ to compute the global $\mathbf{G}^{\mathrm{ul}}$ at the centralized node, and transpose it for PD-WF precoding.
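The reuse of the uplink Gram matrix can be checked numerically. The following is a minimal sketch on a random channel using only the Python standard library; it assumes the downlink Gram matrix is defined as the product of the downlink channel with its Hermitian transpose, which matches the dimensions of the precoder expression above, and all helper names and sizes are illustrative.

```python
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def conj(A):
    return [[a.conjugate() for a in row] for row in A]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(1)
B, U = 6, 3  # illustrative sizes only
H_ul = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(U)]
        for _ in range(B)]
H_dl = transpose(H_ul)  # channel reciprocity: H_dl = (H_ul)^T

G_ul = matmul(conj(transpose(H_ul)), H_ul)  # (H_ul)^H H_ul, from the uplink
G_dl = matmul(H_dl, conj(transpose(H_dl)))  # H_dl (H_dl)^H, for the downlink

err_transpose = max_abs_diff(G_dl, transpose(G_ul))
err_conjugate = max_abs_diff(G_dl, conj(G_ul))
print(err_transpose, err_conjugate)
```

Both differences are at rounding level, so a stored uplink Gram matrix can be turned into the downlink one by a transpose (or, equivalently, a conjugation) without any multiplications.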
If we consider ZF detection and precoding, where the Gram regularization coefficients are $\rho = 0$ and $\kappa = 0$, respectively, we can even store the matrix inversion result $(\mathbf{G}^{\mathrm{ul}})^{-1}$ computed in ZF detection and reuse it for ZF precoding, which requires
$$(\mathbf{G}^{\mathrm{dl}})^{-1} = \left((\mathbf{G}^{\mathrm{ul}})^T\right)^{-1} = \left((\mathbf{G}^{\mathrm{ul}})^{-1}\right)^T.$$
When $(\mathbf{G}^{\mathrm{ul}})^{-1}$ is computed explicitly, we can also use it to compute the scaling factor $\beta$ for ZF precoding in a manner similar to WF precoding [9]. When it is computed implicitly, we should store and reuse the Cholesky decomposition result $\mathbf{G}^{\mathrm{ul}} = \mathbf{L}\mathbf{L}^H$ rather than the matrix inversion result for implicit ZF precoding, which relies on $\mathbf{G}^{\mathrm{dl}} = (\mathbf{L}\mathbf{L}^H)^T = \mathbf{L}^* (\mathbf{L}^*)^H$, and finally scale the normalized ZF precoding vector $\hat{\mathbf{x}} / \|\hat{\mathbf{x}}\|_2$ to meet the power constraint. Similarly, under decentralized scenarios (PD or FD architecture), for example, with an integrated pipeline of PD-ZF detection and PD-ZF precoding, we can reuse such explicit or implicit inversion results to realize further complexity reduction compared to the integrated pipeline of PD-MMSE detection and PD-WF precoding, at the cost of BER performance degradation.
As an example, Fig. 5(a) and Fig. 5(b) compare the uncoded BER performance of PD-MMSE vs. PD-ZF for uplink detection, and PD-WF vs. PD-ZF for downlink precoding, respectively, in a single-carrier system. We see that PD-ZF methods entail a BER performance loss compared to PD-MMSE and PD-WF, especially under the realistic Quadriga free-space channel [18]. Fig. 5(c) compares the timing complexity of different pipelines of uplink detection and downlink precoding at different $N_{\mathrm{coh}}$. Using the total complexity of individually computed explicit PD-MMSE detection and PD-WF precoding as the baseline, we show that the integration of PD-MMSE and PD-WF by reusing $\mathbf{G}^{\mathrm{ul}}$ effectively reduces complexity, and the integration of PD-ZF detection and PD-ZF precoding achieves further complexity reduction as expected, especially when combined with implicit methods. When $N_{\mathrm{coh}}$ increases, the difference among those complexity curves decreases, indicating that channel coherence plays a more important role in complexity reduction than channel reciprocity at large $N_{\mathrm{coh}}$, while at small $N_{\mathrm{coh}}$, exploiting channel reciprocity is more critical.
IV. DATA PRECISION TRADE-OFFS
Data precision is another factor in the design space. Given a decentralized architecture and algorithm, reducing the data precision can improve efficiency on modern computing fabrics due to fewer compiled machine instructions and memory transactions, and smaller inter-cluster data transfer sizes. For example, when using 8-bit floating point (fp8), we can pack four fp8 values into an fp32 value and execute a vectorized computation instruction (such as vectorized addition, multiplication, etc.) in a single-instruction-multiple-data (SIMD) manner to process four fp8 values in parallel within a single instruction. This results in 4× fewer instructions and memory transactions on the processor, and also reduces the inter-cluster bandwidth requirement by 4× compared to the corresponding fp32 design. However, low precision sacrifices numerical accuracy and thus reduces the BER performance of the corresponding detection or precoding algorithms. Fig. 6 compares the uncoded BER performance at fp32 vs. fp8 precision for PD-MMSE and FD-MMSE detectors under a realistic NLOS Quadriga channel [18] with a system configuration of $C = 4$, $U = 16$, $B_c = 32$, and 16-QAM modulation. Here, an fp8 value contains 1 sign bit, 2 mantissa bits, and 5 exponent bits, while an fp32 value contains 1 sign bit, 23 mantissa bits, and 8 exponent bits. We see in Fig. 6 that fp8 precision only entails a small BER performance loss compared to fp32 precision. In practice, the selection of data precision depends on the trade-off between BER performance and efficiency for the given system and environment configuration.
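The storage side of this idea can be illustrated with the Python standard library. The following is a minimal sketch, assuming that an fp8 code with 1 sign bit, 5 exponent bits, and 2 mantissa bits is obtained by truncating an IEEE half-precision encoding to its upper byte (rounding toward zero); the rounding mode, the function names, and the big-endian packing order are assumptions for illustration and do not describe the implementation evaluated in Fig. 6. The sketch shows packing only, not SIMD execution.

```python
import struct


def to_fp8(value):
    """Quantize a float to a 1-5-2 fp8 code by truncating its half-precision bits.

    Raises OverflowError for magnitudes beyond the half-precision range.
    """
    half = struct.pack(">e", value)  # 2 bytes: 1 sign, 5 exponent, 10 mantissa
    return half[0]                   # upper byte: 1 sign, 5 exponent, 2 mantissa


def from_fp8(code):
    """Expand an fp8 code back to a Python float via half precision."""
    return struct.unpack(">e", bytes([code, 0]))[0]


def pack4(values):
    """Pack four fp8 codes into one 32-bit word."""
    return int.from_bytes(bytes(to_fp8(v) for v in values), "big")


def unpack4(word):
    """Recover the four quantized values from a 32-bit word."""
    return [from_fp8(code) for code in word.to_bytes(4, "big")]


samples = [1.0, -0.5, 3.14159, 0.1]
word = pack4(samples)
restored = unpack4(word)
print(hex(word), restored)
```

Values that need more than two mantissa bits lose precision in the round trip (for instance, 3.14159 maps to 3.0 under truncation), which is the kind of quantization error that the BER comparison trades against the 4× reduction in storage and transfer size.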
V. CONCLUSIONS
We have discussed the design trade-offs across architecture, algorithm, and data precision levels for decentralized baseband processing (DBP) in massive MU-MIMO systems, and proposed a practical design flow that jointly considers critical metrics for DBP including computational complexity, data transfer sizes, and error-rate performance. As summarized in Fig. 7, given certain system configurations and channel conditions, one should first select the PD or FD architecture by trading off BER vs. data transfer size, and then decide on the detection and precoding algorithms with the selected architecture according to the BER vs. complexity trade-off. Finally, one can lower the numerical precision if higher efficiency is more important than BER performance. To realize minimal computational complexity and data transfer size at little or no loss of BER, we have provided insights on taking advantage of both channel reciprocity and channel coherence properties by reusing intermediate results. In the future, we expect to build reconfigurable massive MU-MIMO software-defined radios based on programmable computing fabrics, such as GPUs or FPGAs, in order to dynamically adapt to time-varying system parameters and realize effective trade-offs.
[Fig. 7: Design flow with three stages: architecture selection, algorithm selection, and precision selection.]
