Abstract-Millimeter wave (mmW) communications is viewed as the key enabler of 5G cellular networks due to vast spectrum availability that could boost peak rate and capacity. Due to increased propagation loss in the mmW band, transceivers with massive antenna arrays are required to meet the link budget, but their power consumption and cost become limiting factors for commercial systems. Radio designs based on hybrid digital and analog array architectures and the use of radio frequency (RF) signal processing via phase shifters have emerged as potential solutions to improve radio energy efficiency and deliver performance close to that of conventional digital antenna arrays. In this paper, we provide an overview of state-of-the-art mmW massive antenna array designs and a comparison among three array architectures, namely digital array, partially-connected hybrid array (sub-array), and fully-connected hybrid array. The comparison of performance, power, and area for these three architectures is performed for three representative 5G downlink use cases, which cover a range of pre-beamforming signal-to-noise ratios (SNR) and multiplexing regimes. This is the first study to comprehensively model and quantitatively analyze all design aspects and criteria, including: 1) optimal linear precoder, 2) impact of quantization error in digital-to-analog converters (DAC) and phase shifters, 3) RF signal distribution network, 4) power and area estimation based on state-of-the-art mmW circuits including baseband digital precoding, digital signal distribution network, high-speed DACs, oscillators and mixers, phase shifters, RF signal distribution network, and power amplifiers. Our simulation results show that the fully-digital array is the most power and area efficient when compared against the optimal design for each architecture. Our analysis shows that the digital array benefits greatly from multi-user multiplexing. The analysis also reveals that the sub-array is limited by reduced beamforming gain due to array partitioning, and that the system bottleneck of the fully-connected hybrid architecture is the excessively complicated and power-hungry RF signal distribution network.
The abundant spectrum facilitates key performance indicators (KPI) of 5G, including 10Gbps peak rate and 1000 times higher traffic throughput than the current cellular system [4]. As shown in theory and measurements, mmW signals suffer higher free-space transmission loss [5] and are vulnerable to blockage [6]. As a consequence, radios require beamforming (BF) with large antenna arrays at both base station (BS) and user equipment (UE) to combat severe propagation loss [7]. This makes the reliable communication range short and, as a consequence, mmW BSs will be deployed in an ultra-dense manner with inter-site distance on the order of hundreds of meters [8], [9]. Due to these facts, performance, energy, and cost efficiency in future mmW BS radios become more important than ever before.
Implementation and deployment of transceiver arrays in sub-6GHz bands have shown great success. In the 4G Long Term Evolution Advanced (LTE-A) system, the BS supports up to 8 antennas [10], and arrays of even larger size are being actively prototyped [11] and will soon be available in LTE-A PRO (the pre-5G standard). Those systems exclusively use a digital array architecture based on a dedicated radio-frequency transceiver chain, with data converter and up/down-conversion, for each antenna, and rely on digital baseband for array processing. Many implementation challenges arise in scaling up the array size [12] by the order of magnitude or more required for mmW bands. System designers are also concerned about the high cost and power consumption of a digital array architecture with a massive number of RF-chains and ultra-wide processing bandwidth [13].
Recently, an emerging concept of hybrid array has been proposed. A hybrid array uses two-stage array processing. The analog beamforming implemented with variable phase shifters (PS) provides beamforming gain, and the digital beamforming in the baseband provides flexibility for multiplexing multiple user streams [14], [15]. As a result, hybrid arrays support an RF transceiver count that is smaller than the array size. Such an architecture intends to reduce the power and cost penalty due to numerous transceivers. Based on the connectivity between RF-chains and antennas, there are two major variations: the fully-connected hybrid array and the partially-connected hybrid array. Although both architectures were used for radar applications [16] and were introduced for telecommunication applications as early as a decade ago [17], they have recently gained much attention for mmW radios. Signal processing techniques using the hybrid architecture, including channel estimation and beamforming, have been comprehensively studied [18]. Proposals for using hybrid architectures in mmW 5G have been considered in standardization organizations [19].
This paper is a preprint (IEEE "accepted" status). IEEE copyright notice (© 2018 IEEE).
A handful of comparative analyses exist for different mmW array architectures, with an emphasis on the signal processing algorithms [19] [20] [21] [22]. The authors of [23] discussed circuit design challenges in implementing energy-efficient digital arrays. The relationship between spectral efficiency (SE) and energy efficiency in the partially-connected hybrid architecture is studied in [15], [24], [25]. The works [20], [26] provided comparisons among array architectures and concluded that hybrid architectures can achieve higher energy efficiency than fully digital ones in the regime of point-to-point communication. Future 5G systems, however, will certainly use multiuser multiplexing to provide higher network throughput. Moreover, existing works did not study trade-offs among array size, transmit power, and specifications of key circuit blocks in the three architectures. System designers need to understand these trade-offs and hardware implications to develop energy and cost efficient mmW systems [27].
This work aims to fill this gap. We compare different array architectures in a comprehensive manner by considering trade-offs among capacity, energy, and area efficiency. Specifically, we compare array architectures under the criterion of achieving the same capacity. All design trade-offs are carefully considered in reaching the most efficient design for each architecture that meets the requirements of typical 5G use cases. Power consumption, including analog processing energy and digital computation energy, and IC area are then compared based on state-of-the-art circuits. We provide several design insights on scaling laws and the bottlenecks in each architecture, which allow us to predict a trend for future wireless demands and technology scaling.
The paper is organized as follows. In Section II, we briefly introduce emerging mmW array architectures and typical 5G use cases. In Section III, we discuss design trade-offs in all array architectures and the designs used for comparison. In Section IV, we study implementation issues in antenna arrays and their impact on different architectures. In Section V, we present the state-of-the-art specifications of mmW beamforming circuit blocks and the system level power consumption and IC area of the three architectures. This leads us to the general conclusions in Section VI.
II. COMPARATIVE FRAMEWORK
In this work, we focus on the comparison of transmitter antenna array architectures in a 5G mmW BS. We first introduce three commonly considered array architectures and summarize recent silicon implementations. Then, we describe the metrics used for comparison of the three architectures.
A. Array architectures
There are three transmitter array architectures that are considered for adoption in 5G mmW systems. Figure 1 depicts block diagrams of the digital array and two variations of the hybrid array: the partially-connected hybrid array (denoted as sub-array in this work) and the fully-connected hybrid array. Key design parameters for each architecture are:
• Transmit power in all array elements: $P^{(\mathrm{out})}$
• Number of antennas: $N$
• Number of RF-chains: $M$
• Number of simultaneous streams: $U$ ($U \le M$)
• Number of bits in digital-to-analog converter (DAC): $B$
• Number of bits in phase shifter: $Q$. This only applies to hybrid arrays.
In the rest of the paper, we use DA, SA, and FH when referring to the digital array, sub-array, and fully-connected hybrid array architectures, respectively. Mathematical symbols with a subscript indicate parameters associated with the specific architecture, e.g., $N_{DA}$ represents the number of antennas in the digital array. The main differences among the three array architectures are:
• Digital Array: As shown in Figure 1 (a), N DA antennas in DA are connected to M DA RF-chains, i.e., N DA = M DA . The beamformer precoding occurs in the baseband (BB) digital signal processor (DSP).
• Sub-Array: SA consists of multiple phased arrays. As shown in Figure 1 (b), $N_{SA}$ antennas are partitioned into $M_{SA}$ groups, each of which has one dedicated RF-chain, $K_{SA}$ phase shifters (PS), variable gain ampli- [39] , and high definition active antenna system [40] . Similar to SA, the FH architecture uses phase shifters for analog beamforming and DSP for digital beamforming. However, FH has different connecting structures between RF-chains and phase shifters. As shown in Figure 1(
Recent integrated circuit (IC) implementations of all three architectures are summarized in Table I. Apart from arrays in the 28GHz band, Table I includes implementations in the 60GHz band for mmW indoor access, mmW backhaul, and radar, because they share the same array architectures. Directly comparing array architectures from the table is difficult, because they use different silicon technologies, and not all circuit components, e.g., local oscillator (LO) and associated up/down-conversion circuits, low noise amplifier (LNA), and PA, are integrated. It is worth noting that the SA and FH architectures in Table I implement phase shifters in the RF domain. A comprehensive survey of phase shifter implementations, including phase shifters in the analog baseband, LO, and RF domains, is given in [41]. Moreover, system level prototyping of 28GHz arrays together with field tests can be found in [19], [42].
There are other architectures that have been recently proposed, e.g., switch based antenna array [43] and lens antenna array [44] . Due to the lack of implementation details available in the literature, we do not include quantitative analysis of them in this work.
B. Comparison metrics under 5G use cases
5G is characterized by a wide variety of use cases having different environments, communication distances, and performance requirements. Performance, in turn, depends on connectivity density (defined as the number of simultaneous connections for one wireless service operator in a given area), peak rate, and network traffic throughput. It is our vision that the mmW BS should be capable of using the same radio front-end arrays to handle various use cases and meet their demands.
We choose three representative use cases [45]: Dense Urban Mobile Broadband (MBB), 50+Mbps Everywhere, and Self-Backhauling. They cover different MIMO processing schemes of the transmitter array.
• Dense Urban MBB: In dense urban areas, a large number of UEs require high-speed connections for applications like streaming, high-definition video, and file download. According to the 5G KPI requirement [45], the connection density is expected to be 150,000 connections per square kilometer, while the traffic throughput is up to 3.75Tbps/km² in such a scenario. A typical 5G mmW BS deployment setting has an inter-site distance (ISD) of 200m, and each BS has 3 radio sectors [46]. With 850MHz of spectrum in the 28GHz band, the required SE in this use case is up to 58.8bps/Hz. Such a scenario often involves a line-of-sight (LOS) environment, and relatively good SNR is expected for each UE, so that SE greatly benefits from high multiplexing. We anticipate that at least 8 simultaneous streams are required.¹
• 50+Mbps Everywhere: mmW electromagnetic waves are extremely vulnerable to blockage. Despite this, BSs in the 5G mmW network need to sustain baseline performance (up to 100Mbps data rate [45]), even for those UEs under unfavorable propagation conditions. The 5G KPI requirement [45] also indicates that the connection density is up to 2,500 connections per square kilometer. With the same BS deployment assumption as in the previous use case, the required SE is 4.7bps/Hz. Due to the non-LOS (NLOS) environment, severe propagation loss exists and more than 20dB of beamforming gain is required to close the link budget. Due to the requirement of high beamforming gain, we anticipate that up to 8 simultaneous streams are adopted in this use case.
• Self-Backhauling: To facilitate ultra-dense mmW BS deployment, BSs are required to connect to the core network through a backhaul link. Since the large array allows interference isolation in the spatial domain, it is expected that a 5G BS is capable of using the same spectrum for both access and backhauling, which is referred to as self-backhauling. Self-backhauling using the 5G access radio significantly reduces the cost of setting up high-speed fiber. We consider a scenario where the mmW BS transmits uplink data of its local network to a macro-BS receiver which connects to the core network. With the assumption of one macro-BS deployed in every square kilometer, the self-backhauling link has up to 707m communication distance [50]. In this use case, a LOS environment is assumed and a 10Gbps rate is targeted with a single data stream.
For a fair comparison of power consumption and area among array architectures, each array architecture has to deliver the same target SE. Table II summarizes the system parameters and link budgets, together with a set of possible numbers of data streams $U$ and the corresponding signal to interference plus noise ratios (SINR) that reach the SE objectives. In Section III, we study the impact of design parameters on the SE performance of different architectures and mainly focus on the number of streams $U$, array size $N$, and required transmit power $P^{(\mathrm{out})}$. The power consumption and hardware resource comparisons are then presented based on state-of-the-art device specifications.
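The use-case figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that each BS serves a square area of ISD × ISD and that the macro-BS sits at the centre of a 1km × 1km square. Neither assumption is stated in the text, but both reproduce the quoted 58.8bps/Hz and 707m values.

```python
import math


def required_se_bps_per_hz(traffic_bps_per_km2, isd_m, sectors, bandwidth_hz):
    """Per-sector spectral efficiency needed to carry the area traffic."""
    # Assumption: each BS serves a square cell of area ISD^2.
    cell_area_km2 = (isd_m / 1000.0) ** 2
    per_sector_bps = traffic_bps_per_km2 * cell_area_km2 / sectors
    return per_sector_bps / bandwidth_hz


def max_backhaul_distance_m(macro_cell_side_m=1000.0):
    """Distance from the centre of a square macro cell to its corner."""
    # Assumption: one macro-BS at the centre of each 1 km x 1 km square.
    return macro_cell_side_m * math.sqrt(2.0) / 2.0


# Dense Urban MBB: 3.75 Tbps/km^2, ISD 200 m, 3 sectors, 850 MHz.
se_mbb = required_se_bps_per_hz(3.75e12, 200.0, 3, 850e6)
d_backhaul = max_backhaul_distance_m()
print(f"Dense Urban MBB SE: {se_mbb:.1f} bps/Hz")
print(f"Max self-backhaul distance: {d_backhaul:.0f} m")
```

The same helper can be reused for other traffic densities, provided the cell-area assumption is adjusted to the deployment geometry actually intended.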
III. TRANSMITTER ARRAY DESIGN PARAMETERS
In this section, we discuss the impact of array design parameters on the SE performance of multi-user multi-input multi-output (MU-MIMO) mmW system. We provide the design specification of components in array architectures to meet the SE requirement for each use case.
A. System Model of mmW MU-MIMO
We consider a mmW system where a BS of interest transmits data to multiple UEs in mmW access or to a hub in mmW self-backhauling. Both transmitter and receiver are equipped with antenna arrays. Linear precoding techniques over a flat fading channel are considered. In the case of a frequency selective channel, the precoding can be extended using orthogonal frequency-division multiplexing (OFDM) by considering per sub-carrier precoding. In the baseband equivalent model, the received symbol at the u th UE is denoted as beamforming at the u th receiver. $B$ and $R$ denote the precoding scheme in the baseband and RF domain on the transmitter side, respectively. The transmit noise due to DAC quantization error is denoted as $z_t$, and the thermal noise at the receiver is $z_r$. Operation $a^H$ is the Hermitian transpose of $a$. In the DA architecture, the precoding occurs entirely in digital baseband and therefore there is no analog processing, i.e., $R_{DA} = I$. The digital precoder $B_{DA}$ has dimension $N_{DA} \times U$.
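In the SA architecture, the digital precoder $B_{SA}$ has dimension $M_{SA} \times U$ due to the $M_{SA}$ RF chains. The RF precoder $R_{SA}$ has dimension $N_{SA} \times M_{SA}$. Because every $K_{SA}$ phase shifters connect to one RF-chain, $R_{SA}$ is a block diagonal matrix
$$R_{SA} = \begin{bmatrix} r_{SA,1} & 0 & \cdots & 0 \\ 0 & r_{SA,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_{SA,M_{SA}} \end{bmatrix},$$
where column vector $r_{SA,m}$ with length $K_{SA}$ represents the $K_{SA}$ phase shifters that connect to the $m$-th RF-chain. Each element of $r_{SA,m}$ has unit magnitude.² We define the set 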
each element in R FH has unit magnitude. We make the following assumptions. Firstly, the channel information H u is known to both transmitter and receivers. A practical way of channel estimation can be found in [18] . Secondly, each UE receiver is equipped with a phased array with only one RF-chain. As a consequence, BS assigns one data stream to each UE receiver. Thirdly, all receivers have the same pre-beamforming SNR and BS assigns equal power among data streams. Fourthly, the combining vector of each receiver w u is chosen as the primary left eigenvector of channel matrix H u after magnitude normalization in each element.
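The block-diagonal structure of the SA RF precoder and the phase-only receive combiner assumed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the channel is taken to have shape (receive antennas × transmit antennas), and the combiner is scaled to unit norm, a normalization that the text does not specify.

```python
import numpy as np


def subarray_rf_precoder(phases):
    """Build the block-diagonal SA RF precoder.

    phases: array of shape (M, K) holding the phase (radians) of the K
    phase shifters attached to each of the M RF-chains.
    Returns an (M*K, M) matrix whose m-th column is non-zero only on the
    K antennas of group m, with unit-magnitude entries.
    """
    n_chains, n_per_chain = phases.shape
    precoder = np.zeros((n_chains * n_per_chain, n_chains), dtype=complex)
    for m in range(n_chains):
        rows = slice(m * n_per_chain, (m + 1) * n_per_chain)
        precoder[rows, m] = np.exp(1j * phases[m])
    return precoder


def phase_only_combiner(channel):
    """Primary left singular vector of the channel, kept phase-only.

    channel: array of shape (N_rx, N_tx).
    Assumption: the result is scaled to unit norm after every element is
    normalized to the same magnitude.
    """
    left_vectors, _, _ = np.linalg.svd(channel)
    phase_only = np.exp(1j * np.angle(left_vectors[:, 0]))
    return phase_only / np.sqrt(channel.shape[0])


rng = np.random.default_rng(0)
R_sa = subarray_rf_precoder(rng.uniform(-np.pi, np.pi, size=(2, 4)))
H_u = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))
w_u = phase_only_combiner(H_u)
```

Here `R_sa` has eight rows and two columns, and each column touches only the four antennas of its own group, which is what distinguishes SA from the fully-connected case.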
The SINR at the u th receiver array is denoted as
where the signal power gain $g_u$ is given by $g_u = \arg\min_g \mathbb{E}\,|y_u - g s_u|^2$. All signal, noise, and interference powers are relative powers, referenced to the 46dBm transmit power in Table II. As a consequence, the receiver thermal noise power $\sigma^2_{\mathrm{n,rx}}$ is treated as constant in each use case. The multiuser interference is $\sigma^2_{\mathrm{int}}$
In the remainder of this section, we discuss how to design array parameters for each architecture to reach the targeted SINR for the three use cases.
B. Array size and transmit power gain
In principle, increasing transmit power $P^{(\mathrm{out})}$ and array size $N$ both improve the signal power gain $g_u$ in (4). Effectively, they provide higher equivalent isotropic radiated power (EIRP) and help achieve the target SINR from Table II. In DA and FH, the output power of each PA, $P^{(\mathrm{out})}/N$, is split into $U$ parts due to multiplexing and even power allocation. Thus each stream in each PA has output power $P^{(\mathrm{out})}/(NU)$. The coherent summation of $N$ elements via beamforming provides an $N^2$ times increase in power. In SA, however, PAs are partitioned into groups to amplify different streams. For each stream, each PA element outputs $P^{(\mathrm{out})}_{SA}/N_{SA}$, while the beamforming gain is $N^2_{SA}/U^2$. As a consequence, the maximum output signal power after beamforming in each architecture is
$$G_{DA} = \frac{N_{DA} P^{(\mathrm{out})}_{DA}}{U}, \qquad G_{SA} = \frac{N_{SA} P^{(\mathrm{out})}_{SA}}{U^2}, \qquad G_{FH} = \frac{N_{FH} P^{(\mathrm{out})}_{FH}}{U}. \tag{5}$$
It is clear that SA is at a disadvantage in terms of signal power gain. SA requires more array elements, more output power, or both to reach an output power comparable to that of the DA and FH architectures.
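The power budget argument of this subsection can be expressed in decibels as a short sketch. The function name and the example values (256 antennas, 46dBm, 8 streams) are chosen for illustration only; the scaling follows the per-PA power split and coherent beamforming gain described above, i.e., $NP^{(\mathrm{out})}/U$ for DA and FH and $NP^{(\mathrm{out})}/U^2$ for SA.

```python
import math


def max_signal_power_dbm(arch, n_antennas, p_out_dbm, n_streams):
    """Maximum per-stream signal power after beamforming, in dBm.

    DA and FH: N * P_out / U.  SA: N * P_out / U^2.
    """
    n_db = 10.0 * math.log10(n_antennas)
    u_db = 10.0 * math.log10(n_streams)
    if arch in ("DA", "FH"):
        return p_out_dbm + n_db - u_db
    if arch == "SA":
        return p_out_dbm + n_db - 2.0 * u_db
    raise ValueError(f"unknown architecture: {arch}")


# Illustrative values only: same array size and transmit power everywhere.
gains = {a: max_signal_power_dbm(a, 256, 46.0, 8) for a in ("DA", "SA", "FH")}
sa_penalty_db = gains["DA"] - gains["SA"]
print(gains, f"SA penalty: {sa_penalty_db:.2f} dB")
```

With equal array size and transmit power, the SA penalty equals $10\log_{10}U$ dB, so it grows by about 3dB every time the number of streams doubles.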
C. Precoder design
Given the maximum signal output power $G$, the precoder determines the actual signal power $g_u$ and multiuser interference $\sigma^2_{\mathrm{int}}$ in (4). In this subsection, we discuss precoding techniques for the three architectures.
In the DA architecture, maximum ratio transmission (MRT) and zero-forcing (ZF) are two commonly used linear precoding approaches. The former maximizes the signal strength at the destination and approaches the maximum gain discussed in Section III-B, while the latter eliminates multiuser interference. It is commonly believed that because mmW signals suffer from severe propagation loss, interference is generally less troublesome than in sub-6GHz systems. However, the interference from transmitted sidelobes, if not properly handled, can still affect the achievable rate at receivers. In this work, we propose to use regularized zero-forcing beamforming [53], where the introduced regularization coefficient $\alpha_{DA}$ facilitates controlling both signal strength and interference at the receiver.
In the above equation, G DA is the post-combining multiuser channel with the u th row as
The regularization coefficient $\alpha_{DA}$ controls the behavior of the precoder, i.e., MRT when it approaches positive infinity and ZF when it approaches zero. One can expect SINR maximization when $\alpha_{DA}$ is selected to be the largest value under the constraint that $\sigma^2_{\mathrm{int}} \ll \sigma^2_{\mathrm{n,rx}}$. The power scaling parameter $\kappa_{DA}$ is used to guarantee the total transmit power constraint $\|B_{DA}\|^2 = P^{(\mathrm{out})}_{DA}$.
Precoding approaches for the SA and FH architectures are under active investigation and mostly target systems where the analog beamformer has phase-only tuning capability. The optimal hybrid precoding is a mixed integer programming problem, and its optimal solution must be found via a potentially exhaustive search. Many sub-optimal methods have been proposed for near optimal performance, e.g., the work in [54] for the FH architecture. In [54], the analog precoder is selected to point beams towards the directions of the intended receivers. The digital precoder is then used to handle the associated interference among beams synthesized by the phase shifters. In the following paragraphs regarding precoding algorithms for SA and FH, we adopt the assumption of a phase-only analog precoder.
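The regularized zero-forcing precoder described for the DA architecture can be illustrated numerically. The sketch below assumes the textbook form $B = \kappa\, G^H (G G^H + \alpha I)^{-1}$ with a Frobenius-norm power constraint; this matches the limiting behaviour stated above (ZF as $\alpha \to 0$, MRT as $\alpha \to \infty$), but the exact expression and normalization used in this paper are not reproduced here, and the function name is hypothetical.

```python
import numpy as np


def rzf_precoder(eff_channel, alpha, p_out):
    """Regularized zero-forcing precoder (textbook form, assumed here).

    eff_channel: (U, N) post-combining multiuser channel, one row per UE.
    alpha: regularization coefficient (> 0).
    p_out: total transmit power; the result satisfies ||B||_F^2 = p_out.
    Returns an (N, U) precoding matrix.
    """
    n_users = eff_channel.shape[0]
    gram = eff_channel @ eff_channel.conj().T + alpha * np.eye(n_users)
    # gram is Hermitian, so G^H gram^{-1} equals (gram^{-1} G)^H.
    unscaled = np.linalg.solve(gram, eff_channel).conj().T
    kappa = np.sqrt(p_out) / np.linalg.norm(unscaled, "fro")
    return kappa * unscaled


rng = np.random.default_rng(0)
G = (rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16))) / np.sqrt(2)
B_zf_like = rzf_precoder(G, alpha=1e-9, p_out=1.0)   # behaves like ZF
B_mrt_like = rzf_precoder(G, alpha=1e9, p_out=1.0)   # behaves like MRT
```

In this sketch `G @ B_zf_like` is numerically diagonal (no multiuser interference), while `B_mrt_like` is proportional to the conjugate transpose of `G`. Intermediate values of `alpha` trade signal gain against residual interference.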
In the SA architecture, we propose to use the following approach as a modification of the FH beamforming in [54]; the scheme is illustrated in Figure 2 (a). We first merge $M_{SA}/U$ adjacent phase shifter groups in SA into one virtual group. This leads to $N_{SA}/U$ array elements within each virtual group in an ideal scenario.³ The input signals of the RF-chains within a virtual group are exactly the same. Let us denote by $V_u$ the set that contains the indices of the physical array groups within the $u$-th virtual group. The analog beamformer is chosen to synthesize beams towards the primary propagation directions of the $U$ receivers
In the above equation, $\angle(\{a\}_{S_m})$ selects elements from vector $a$ according to the indices in set $S_m$ and finds the phases of the selected elements. Let us denote the effective channel as $G_{SA}$, which contains the effect of the receiver combiner and RF precoder in the multiuser channel. The $m$-th row is defined as
Note the effective channel G SA is the channel between digitally precoded stream and UEs. As a consequence, the digital precoding problem in SA can be solved in the regularized-ZF framework
The power scaling coefficient $\kappa_{SA}$ is used to meet the total output power constraint, i.e., $\|R_{SA} B_{SA}\|^2 = P^{(\mathrm{out})}_{SA}$. Similar to precoding in the digital array, the regularization coefficient $\alpha_{SA}$ is chosen to maximize SINR.
The precoding scheme in the FH architecture is illustrated in Figure 2 (b). Only $U$ out of the $M_{FH}$ RF-chains are turned on to provide $U$ streams. Without loss of generality, the first $U$ RF-chains are active and the analog precoder is
The digital precoder in FH is a regularized zero-forcing over G FH , the effective channel that contains the receiver combining and RF precoding in the multiuser channel
The u th row is defined as
Similar to precoding in the SA architecture, κ FH is the power scaling coefficient for
and α FH is the regularization coefficient.
D. DAC precision
The transmit noise in (4) comes from the quantization error due to DACs with finite precision. A practical system design uses sufficient quantization precision such that the transmit noise level stays well below the receiver thermal noise. Different architectures require different values of effective number of bits (ENOB) for this goal. The required ENOB in the three architectures are
for transmit noise to be D dB lower than AWGN. In the above equation, PAPR represents the peak to average power ratio of the input signal of each DAC. Note that these expressions are accurate when DAC quantization errors are uncorrelated, which may not be valid with a small number of bits, e.g., B = 1 bit. Derivations of (11) are provided in Appendix A. Equation (11) together with (5) indicates the following facts. Firstly, with fixed signal power gain $G_{DA}$, DAC precision in the DA architecture can be reduced by increasing array size and decreasing transmit power. For SA and FH, however, the transmit noise remains constant regardless of the source of signal power gain. Secondly, with the same signal power gain and transmit power, the DA architecture has a lower requirement on DAC quantization as compared to SA and FH.
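As background for the role of PAPR in the ENOB requirement, the following sketch evaluates the textbook signal-to-quantization-noise ratio of a single uniform quantizer whose full scale is set by the signal peak. This is not equation (11): it ignores all array-level effects (number of antennas, streams, and beamforming gain) and only illustrates the per-DAC relation under the usual assumption of uniformly distributed, uncorrelated quantization error. The function names are hypothetical.

```python
import math


def sqnr_db(enob_bits, papr_db):
    """Textbook SQNR of one uniform quantizer with full scale at the peak.

    Noise power is step^2 / 12 with step = 2 * peak / 2^B, and the signal
    power is peak^2 / PAPR, giving SQNR = 3 * 2^(2B) / PAPR.
    """
    return 20.0 * math.log10(2.0) * enob_bits + 10.0 * math.log10(3.0) - papr_db


def required_enob(target_sqnr_db, papr_db):
    """Invert sqnr_db: bits needed for a target per-DAC SQNR."""
    return (target_sqnr_db + papr_db - 10.0 * math.log10(3.0)) / (20.0 * math.log10(2.0))


# Example with the 10 dB PAPR used for the simulated data streams.
sqnr_5bit = sqnr_db(5, 10.0)
print(f"5-bit DAC, 10 dB PAPR: SQNR = {sqnr_5bit:.2f} dB")
```

Each additional bit buys about 6dB of SQNR, while every decibel of PAPR costs one decibel, which is why the PAPR term appears in the required ENOB.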
E. Phase shifter precision
In both SA and FH architectures, finite resolution of phase shifters leads to a changed power level of sidelobes and shifted locations of nulls, as compared to a system using ideal devices. More importantly, the location of the main lobe varies and the associated signal gain drops. One might expect that highly precise phase shifters are required to accurately control beams. In this subsection, we discuss the impact of finite resolution of phase shifters on the SA and FH architectures.
The former issue regarding the distorted sidelobes is less troublesome in both the SA and FH transmitter array architectures. Sidelobes lead to multi-user interference, as seen from the off-diagonal elements in the effective channels $G_{SA}$ and $G_{FH}$. When the system is aware of potential interference, the digital precoding stage can be used to effectively suppress it. A practical way to acquire the effective channel is via a training procedure where the BS and UE use quantized analog beamformers to exchange pilot symbols and estimate the effective channels $G_{SA}$ and $G_{FH}$. This training procedure is similar to the multi-beam scheme proposed for the next generation of mmW indoor systems [49]. Meanwhile, the gain reduction due to finite phase shifter resolution is not severe either. In fact, the gain degradation is upper bounded by 0.68dB, 0.16dB, and 0.04dB with Q = 3, 4, and 5 bits of phase shifter quantization, respectively, and does not scale with the array size or multiplexing level. An analysis that supports these numbers is provided in Appendix B. Equivalently, the gain degradation is bounded by 0.16dB so long as the angle errors of the phase shifters are no larger than 11.25 degrees. Such specifications are not difficult to meet in state-of-the-art devices, as discussed in Section V-C.
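The quoted gain degradation figures can be reproduced with a simple worst-case argument, sketched below. The sketch assumes a uniform Q-bit phase quantizer, so that every phase error is at most $\pi/2^Q$, and uses the fact that a coherent sum of $N$ unit phasors with errors within $\pm\delta$ has magnitude at least $N\cos\delta$. This is an illustrative derivation and not necessarily the analysis of Appendix B; the function names are hypothetical.

```python
import numpy as np


def worst_case_gain_loss_db(q_bits):
    """Upper bound on beamforming gain loss with Q-bit phase shifters."""
    max_phase_error = np.pi / 2 ** q_bits
    return -20.0 * np.log10(np.cos(max_phase_error))


def simulated_gain_loss_db(q_bits, n_antennas, rng):
    """Gain loss for one random draw of ideal phases, after rounding."""
    ideal = rng.uniform(-np.pi, np.pi, n_antennas)
    step = 2.0 * np.pi / 2 ** q_bits
    error = np.round(ideal / step) * step - ideal
    gain = np.abs(np.sum(np.exp(1j * error))) ** 2 / n_antennas ** 2
    return -10.0 * np.log10(gain)


rng = np.random.default_rng(0)
for q in (3, 4, 5):
    bound = worst_case_gain_loss_db(q)
    trial = simulated_gain_loss_db(q, 256, rng)
    print(f"Q={q}: bound {bound:.3f} dB, random trial {trial:.3f} dB")
```

The bound depends only on the maximum phase error, which is consistent with the statement that the degradation does not scale with array size. For Q = 4 the maximum error is 11.25 degrees, matching the equivalent angle-error condition above.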
F. Simulation results
In this subsection, simulation results are presented to show the required design parameters to reach SE target in three array architectures.
In the simulation, 3D mmW MIMO channels between the BS and $U$ UEs are generated according to the mmW sparse scattering model [54]. The channel between the BS and each UE consists of 20 multi-path rays in 3 multipath clusters, and the LOS cluster, if it exists, is 10dB stronger than the rest. The angle of arrival (AOA) and angle of departure (AOD) of clusters are uniform random variables within the azimuth range $[-60^\circ, 60^\circ]$ and elevation range $[-30^\circ, 30^\circ]$. The azimuth and elevation AOA and AOD of rays within a cluster have random deviations from the cluster specific AOA and AOD, and they follow a zero mean Laplacian distribution with $10^\circ$ standard deviation. In Dense Urban MBB, a scheduler is assumed such that the LOS paths of all target receivers are unique [55]. The mean SE is evaluated by taking the average of the SINR in (4) over $U$ UEs and using the Shannon capacity formula, i.e., $\mathrm{SE} = \sum_{u=1}^{U} \log_2(1 + \mathrm{SINR}_u)$. The data streams used in the simulation are Gaussian distributed and their magnitudes are truncated such that the PAPR is 10dB.
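The angle statistics and the SE formula of the simulation setup can be sketched as follows. The sketch is illustrative and covers only departure (or arrival) angle sampling and the sum-rate evaluation. The function names are hypothetical, the number of rays per cluster (7) is an assumption because the text does not state how the 20 rays are divided among clusters, and the Laplacian scale is set to $\sigma/\sqrt{2}$ so that the standard deviation equals $10^\circ$.

```python
import numpy as np


def sample_ray_angles(rng, n_clusters=3, rays_per_cluster=7,
                      az_limit_deg=60.0, el_limit_deg=30.0, spread_std_deg=10.0):
    """Draw azimuth/elevation angles (degrees) for every ray.

    Cluster centres are uniform within the given limits; ray offsets are
    zero-mean Laplacian with the requested standard deviation.
    """
    laplace_scale = spread_std_deg / np.sqrt(2.0)  # std of Laplace = sqrt(2) * scale
    shape = (n_clusters, rays_per_cluster)
    az_centre = rng.uniform(-az_limit_deg, az_limit_deg, size=(n_clusters, 1))
    el_centre = rng.uniform(-el_limit_deg, el_limit_deg, size=(n_clusters, 1))
    azimuth = az_centre + rng.laplace(0.0, laplace_scale, size=shape)
    elevation = el_centre + rng.laplace(0.0, laplace_scale, size=shape)
    return azimuth, elevation


def sum_spectral_efficiency(sinr_db):
    """SE = sum over UEs of log2(1 + SINR_u), with SINR given in dB."""
    sinr_linear = 10.0 ** (np.asarray(sinr_db, dtype=float) / 10.0)
    return float(np.sum(np.log2(1.0 + sinr_linear)))


rng = np.random.default_rng(0)
az, el = sample_ray_angles(rng)
se_example = sum_spectral_efficiency([0.0, 0.0])  # two UEs at 0 dB SINR
```

Two UEs at 0dB SINR each contribute one bit per second per hertz, so `se_example` evaluates to 2.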
With ideal hardware, the required transmit power P (out) to reach SE target with various antenna size N and number of data streams U in three architectures are shown in Figure 3 .
We first focus on how transmit power changes with parameters $N$ and $U$. Increasing array size $N$ is effective in reducing transmit power in all scenarios, since it improves both signal gain and interference control through narrower beams. When interference from multiple beams is negligible, the transmit power saving from increasing $U$ depends on the difference between the SINR target reduction in Table II and the signal gain drop in (5). For example, when $U$ increases from 2 to 4 and from 4 to 8 in MBB, the SINR requirement reduces by 5.2dB and 4dB. Meanwhile, the signal gain changes by 3dB, 6dB, and 3dB in DA, SA, and FH, respectively. Therefore DA and FH save around 2.2dB and 1dB of $P^{(\mathrm{out})}$, and SA is forced to use around 0.8dB and 2dB higher $P^{(\mathrm{out})}$. This also holds in the high-$N$ regime of DA and FH in Dense Urban MBB. When $U$ increases from 8 to 16 and from 16 to 32, the SINR requirement reduces by 11.4dB and 6.6dB. Therefore the power saving at $N = 1024$ is around 8.4dB and 3.3dB for both DA and FH. Power saving is more difficult to predict when the system needs to trade power gain for interference control. Therefore the transmit power saving from increasing $U$ with a smaller number of antennas $N$ and a large multiplexing level $U$ is predicted less accurately by the above analysis.
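The back-of-the-envelope prediction used in this paragraph (SINR target reduction minus signal gain drop) can be written as a short helper. The sketch assumes negligible multi-beam interference, as stated above, and uses the gain scaling of $1/U$ for DA and FH and $1/U^2$ for SA. The function name is hypothetical and the SINR reductions are the values quoted in the text.

```python
import math


def transmit_power_saving_db(sinr_reduction_db, u_from, u_to, arch):
    """Predicted transmit power saving (dB) when U grows from u_from to u_to.

    A positive value is a saving, a negative value means that more
    transmit power is required.
    """
    exponent = 2 if arch == "SA" else 1  # gain ~ 1/U^2 for SA, 1/U otherwise
    gain_drop_db = exponent * 10.0 * math.log10(u_to / u_from)
    return sinr_reduction_db - gain_drop_db


for arch in ("DA", "SA", "FH"):
    step1 = transmit_power_saving_db(5.2, 2, 4, arch)
    step2 = transmit_power_saving_db(4.0, 4, 8, arch)
    print(f"{arch}: U 2->4 saves {step1:+.1f} dB, U 4->8 saves {step2:+.1f} dB")
```

For DA and FH this reproduces the savings of roughly 2.2dB and 1dB, and for SA the extra 0.8dB and 2dB of required transmit power.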
Then we focus on the comparison between array architectures. There is one universal conclusion that holds true for DA and FH in all scenarios. DA and FH have the same maximum signal gain when $P^{(\mathrm{out})}$ and $N$ are the same, according to (5). In simulation, FH actually requires nearly 1dB higher $P^{(\mathrm{out})}$ than DA in all scenarios. This gap is due to the loss from the two-stage precoding of FH. Further exploiting hardware capability, e.g., using phase-and-magnitude analog precoders, and designing better hybrid precoding algorithms for FH would reduce this gap.
Next, we compare array architectures in each use case. First, in self-backhauling, where the number of data streams $U$ is constrained by the point-to-point environment, SA has the same performance as FH, as both architectures become the same in model (1). They both require 1dB higher transmit power than DA. Secondly, in 50+Mbps Everywhere, the difference in required transmit power between architectures can be analyzed by (5). Equation (5) reveals that SA has $U$ times lower power gain than the other architectures, and the figure shows that SA requires $U$ times higher $P^{(\mathrm{out})}$ than FH for the same performance. Equation (5) predicts the gap between curves well, since there is negligible interference with a small number of beams. Thirdly, in Dense Urban MBB, the required transmit power gap between SA and FH matches (5) when $N$ is large, i.e., SA requires 9, 12, and 15dB higher $P^{(\mathrm{out})}$ (factors of 8, 16, and 32), but the gap deviates from what (5) predicts when $N$ is small. This deviation is due to the trade-off between power gain and interference control. Dense Urban MBB features a large number of simultaneous data streams, and the mutual interference among streams becomes the system bottleneck when the beam-width is not small enough. With U = 8, the transmit power gap between SA and FH increases from 9dB to 13dB when $N$ reduces from 1024 to 64. The additional 4dB gap is the cost of controlling interference in SA, because SA uses a nearly $U$ times wider beam to carry each data stream as compared to FH. Further, the BB precoding of SA is forced to sacrifice more gain for interference control. With U = 32, the gap reduces from 9dB to 6dB when $N$ reduces from 1024 to 64. One may expect that each data stream in SA is carried by wide beams with N/U = 2 antennas and conclude the opposite. However, with U = 32 data streams, each RF-chain is connected to at most N/U = 2 antennas, and such an architecture is effectively a digital array. The BB precoding stage in SA then allows each stream to be transmitted by nearly all antenna elements, which improves the signal gain. In this regime, the intuition behind the hybrid precoding approach [54] may not hold, and a better hybrid precoding scheme tailored for this regime would provide additional power saving for SA.
With finite precision in the baseband precoding, DAC, and phase shifters, the SE performance is shown in Figure 4 and Figure 5. For clarity, all array architectures use 256 antenna elements, and the transmit power $P^{(\mathrm{out})}$ in each architecture is chosen such that it delivers the same SE performance as in the quantization free cases. Figure 4 shows the required quantization bits in baseband precoding and DAC, which match the analysis. According to (11), the required ENOB for transmit noise to be D = 15dB lower than AWGN in Dense Urban MBB with U = 8 streams is 5.1, 8.0, and 7.7 in the DA, SA, and FH architectures, respectively. The SE improvement in Figure 4 saturates once the DAC quantization bits are beyond these values. Equation (11) also precisely matches the Self-Backhauling use case, where DA, SA, and FH require 5.8, 10.0, and 10.0 ENOB, respectively. It is worth noting that the additive quantization error model becomes inaccurate when the analytical ENOB from (11) is small. For example, equation (11) suggests that the system requires 1 to 4 bits in most scenarios of 50+Mbps Everywhere, while the required ENOB from simulation is close to 5 bits. A rule of thumb is to use at least 5 bits. Note that this inaccurate regime of (11) does not affect the power consumption estimation of the system, because the direct current (DC) power of the DAC is not effectively reduced by using fewer than 5 bits due to the fixed hardware overhead, as discussed in detail in Section V-A. Moreover, the precision requirement in baseband precoding and DAC of DA is in general lower than that of the hybrid architectures throughout all scenarios, which suggests a system level power consumption saving. Last, Figure 5 shows that with the hybrid precoding approach in Section III-C, the SE performance is negligibly affected by phase shifter quantization, which matches our analysis in Section III-E.
In summary, for the same target SE performance, DA requires a reduced transmit power or number of array elements as compared to SA and FH. Besides, the DAC quality of DA is relaxed as compared to the hybrid architecture. A fair comparison among architectures cannot overlook these factors by restricting architectures to use the same transmit power, number of array elements, or specification of hardware components. The design parameter trade-off analyzed in this section leads to a more practical comparison in Section VI.
IV. HARDWARE DESIGN CHALLENGES OF TRANSMITTER ARRAY
In this section, we discuss practical hardware design of mmW arrays with different architectures. We first introduce the distributed array processor module. Then, the necessary circuit blocks for baseband and RF signal distribution are discussed.
A. Distributed array module
The conventional MIMO system integrates the array processing module in a single IC and delivers RF signals to the antennas. Such a centralized design may not be practical in a mmW system with a massive number of antennas. With a compact and centralized IC, mmW signals routed to hundreds of array elements suffer severe insertion loss 4 . Besides, heat dissipation becomes a concern for a centralized solution. Moreover, array size scalability becomes challenging since adding more elements requires a completely new processing module.
A practical solution is to implement the processing hardware for antenna arrays in a distributed manner [23] . In DA and SA, each IC in a processing module integrates the processing circuits for K DA and K SA antennas, respectively, and is located close to these antennas. Although a centralized digital processor is still necessary for some baseband functionality, e.g., symbol mapping and channel coding, the digital baseband precoding can be implemented in each distributed module. With such a design, the system needs to deliver U digital signal streams rather than M digitally precoded signal streams to the processing modules [23] . This offers a significant saving in baseband signal distribution throughput given M ≫ U in DA and SA. The DAC, upconverter, and RF signal processing are also included in the processing module. The digital signals from the central processor are routed and recovered through a Serializer/Deserializer (SerDes) sub-system in each of the processing modules. Note that the exact number of elements integrated in an IC affects system area and energy, but that discussion is beyond the scope of this work. The patch antenna is directly attached to the printed circuit board (PCB).
The distributed DA hardware implementation is illustrated in Figure 6 . In the remainder of the paper, the power consumption and cost estimation of the DA system is based on a design where each module contains K DA = 8 antenna elements and associated processing circuits. Each DA module contains SerDes, a voltage controlled oscillator (VCO) within a phase-locked loop (PLL), RF-chains, and T/Rx multiplexers.
The power amplifiers for 5G mmW applications are expected to be built in non-silicon material, as shown in Section V-D, and they are placed next to the DA processing IC.
The SA implementation is also illustrated in Figure 6 . In SA, each module has processing circuits for K SA antenna elements. Each module contains SerDes, VCO, and phase shifter networks.
There is no prior work on FH implementation with more than 8 antennas. RF signal routing is a challenging task in the FH architecture, because the input signal for each antenna element is a combination of signals from all RF-chains. The most viable approach we could anticipate is illustrated in Figure 6 . Unlike in the DA and SA architectures, routing loss cannot be reduced by placing RF-chains closer to the antennas, since their outputs must be delivered across the entire PCB. In the proposed design, each array module integrates a combining network and delivers the combined signal to nearby antenna elements. It also contains RF amplifiers to compensate for insertion loss during RF signal routing and combining.
In all array architectures, routing digital baseband and RF signals plays a critical role. We discuss the associated challenges and solutions in the next subsections.
B. BB signal distribution
The digitally precoded sample streams need to be routed into each processing module by serial-link transceivers in all array architectures. State-of-the-art SerDes supports data rates over 50Gb/s using PAM-4 signaling in wireline chip-to-chip communication. The specific design of the SerDes system is beyond the scope of this work. In Section V, we use the specifications of ultra-high-speed transceivers.
C. RF signal distribution
Multiple circuit components introduce non-negligible insertion losses that need to be carefully handled by system designers.
• PCB and Inter-Connectors Loss: RF signal suffers from interconnect loss between the silicon chip RF ports and the antenna elements. On low-loss PCB material, such as the RO 3000 and 4000 series, a 28GHz signal experiences 1.25dB/inch insertion loss. Besides, each IC needs to be placed on an organic or ceramic substrate (interposer) to distribute the chip ports to a ball-grid array, which adds 1-2dB distribution loss. This implementation loss needs to be pre-compensated before the RF signal is fed into the antenna.
• Intra-Chip Transmission Lines Losses: RF signal loss in silicon is significant at mmW band. According to [36] , there is up to 0.6dB/mm transmission line loss at 28GHz. The length of the transmission line is proportional to the IC size, but the exact value is determined by the actual IC design. According to a 60GHz array design [56] , the phase shifters and Wilkinson RF splitters take most of the IC area. The intra-chip routing loss can be roughly estimated by taking into account the required area of those components. With the practical component sizes in Section V, the loss in an SA module with K SA = 32 phase shifters is less than 1dB, but it is up to 3-4dB for FH since each RF-chain distributes signals into hundreds of phase shifters that require dozens of square millimeters of area.
• Power Splitters and Combiners Loss: In the analog beamforming stage of the SA and FH architectures, output signals of RF-chains need to be fed into the phase shifter network for phase rotation. Wilkinson power splitters are commonly used for this purpose [28] , [29] , [56] . Moreover, the fully-connected hybrid architecture uses the same Wilkinson structure to combine multiple RF signals before power amplification. An ideal power splitter/combiner introduces 3dB insertion loss in each one-to-two splitter (1:2) or two-to-one combiner (2:1) unit. A practical design often has an additional 1dB implementation loss. This results in a 4 log 2 (K SA ) dB power drop in the SA architecture. For the FH architecture, the splitters and combiners introduce a total loss of 4 log 2 (N FH M FH ) dB.
All the above RF insertion losses lead to a reduced EIRP at the antenna and therefore need to be properly compensated. The detailed distribution budget in all architectures is discussed in Section V-D.
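The splitter and combiner loss expressions above can be evaluated with a short sketch. The function names are hypothetical and introduced only for illustration; the 4dB per-stage figure (3dB ideal plus 1dB implementation loss) is taken from the text, and the example array sizes are arbitrary.

```python
import math


def wilkinson_tree_loss_db(n_ports: int, loss_per_stage_db: float = 4.0) -> float:
    """Loss of a binary Wilkinson tree with n_ports branches.

    Each 1:2 (or 2:1) stage costs 3 dB ideal + 1 dB implementation loss,
    and a tree with n_ports branches has log2(n_ports) stages.
    """
    return loss_per_stage_db * math.log2(n_ports)


def sa_splitter_loss_db(k_sa: int) -> float:
    """SA: one RF-chain is split to K_SA phase shifters, 4*log2(K_SA) dB."""
    return wilkinson_tree_loss_db(k_sa)


def fh_split_combine_loss_db(n_fh: int, m_fh: int) -> float:
    """FH: splitting plus combining, 4*log2(N_FH * M_FH) dB in total."""
    return wilkinson_tree_loss_db(n_fh * m_fh)


if __name__ == "__main__":
    print(f"SA, K_SA = 32: {sa_splitter_loss_db(32):.1f} dB")
    print(f"FH, N_FH = 64, M_FH = 8: {fh_split_combine_loss_db(64, 8):.1f} dB")
```

Because the loss grows with the logarithm of the product N FH M FH, adding RF-chains to FH increases the loss that must be compensated at every antenna path, not only the component count.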
V. HARDWARE POWER AND COST MODELING
In this section, we first provide the power and cost model of the necessary circuit blocks based on a survey of state-of-the-art circuit designs and measurements. The power consumption covers the DSP module for precoding, SerDes, mixed signal components, and RF components. Note that other hardware blocks, such as the power supply and active cooling, may consume considerable power [57] . We omit them in this work since they are a constant hardware overhead. Then, examples of signal distribution budget calculation are provided in order to determine the RF amplifiers necessary to compensate insertion loss. Finally, we summarize the total power and cost formulas for all architectures operating with different design parameters.
A. Digital signal processing power
Due to the large bandwidth, the array processing in the digital baseband needs to support high throughput. The DSP for array processing mainly consumes power for digital precoding and digital signal routing. Note that tasks such as channel coding and higher layer processing in the communication standard stack are not included since they have equal power consumption for all architectures. Channel estimation and precoder computation are also omitted since they occur at a time scale that is several orders of magnitude longer than the symbol duration.
The DSP power estimation covers linear precoding and a 4096-point inverse fast Fourier transform 5 (IFFT). The precoding requires multiplication of an M × U complex matrix with a U × 1 complex vector, which takes 6U M fixed-point operations per sample. Note that the number of operations does not change with the choice of N FFT , because the number of precoder slices across sub-carriers and the symbol duration change accordingly. The latter consists of log 2 (N FFT ) = 12 complex multiplications per sample per RF-chain, which results in 6 × 12M × BW operations per second. We use FOM DSP = 13GOPS/mW in 40nm CMOS as the state-of-the-art fixed-point digital computation efficiency [59] . As a consequence, the power consumption in the digital precoding is
where BW is the signal bandwidth. The power consumption P Precoding has unit Watt.
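As a rough numerical illustration of the operation counts above, the following sketch adds the precoding and IFFT operation rates and divides by FOM DSP. This combined form is an assumption made here for illustration and is not a restatement of the paper's expression. The function name is hypothetical, and the 850MHz bandwidth in the example is assumed from BW OS = 1.7GS/s with an oversampling ratio of 2.

```python
def precoding_power_w(num_streams: int,
                      num_rf_chains: int,
                      bandwidth_hz: float,
                      log2_nfft: int = 12,
                      fom_gops_per_mw: float = 13.0) -> float:
    """Estimate DSP power (in watts) for linear precoding plus IFFT.

    Assumed model: total operations per second divided by the DSP
    figure of merit. One complex multiplication counts as 6 operations.
    """
    precoding_ops = 6 * num_streams * num_rf_chains  # per sample
    ifft_ops = 6 * log2_nfft * num_rf_chains         # per sample
    gops = (precoding_ops + ifft_ops) * bandwidth_hz / 1e9
    return gops / fom_gops_per_mw * 1e-3             # mW -> W


if __name__ == "__main__":
    # Illustrative digital array: U = 8 streams, M = 256 RF-chains.
    p = precoding_power_w(num_streams=8, num_rf_chains=256, bandwidth_hz=850e6)
    print(f"Estimated DSP power: {p:.2f} W")
```

Under these assumptions the precoding term grows with U while the IFFT term does not, so raising the multiplexing level increases only part of the DSP load.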
The power of SerDes system is modeled in the following equation
In the above, ENOB is the required precision in the digital precoding and DAC of the mmW transmitter, and its value is determined according to the analysis in (11) and Figure 4 . P SerDes scales with the number of independent data streams U due to the distributed digital precoding. The figure-of-merit of SerDes is adopted as FOM SerDes = 10mW/(Gb/s) [60] in this work. Note that BW OS denotes the oversampled data rate with an oversampling ratio of 2, i.e., BW OS = 1.7GS/s.
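To show how the SerDes figure-of-merit translates into power, the sketch below assumes that one serial link carries U streams of I and Q samples, each with ENOB bits, at BW OS samples per second. This per-link bit-rate model, the factor of 2 for I/Q, and the function name are assumptions for illustration only and may differ from the paper's SerDes equation.

```python
def serdes_link_power_w(num_streams: int,
                        enob: float,
                        bw_os_sps: float = 1.7e9,
                        fom_mw_per_gbps: float = 10.0,
                        iq_factor: int = 2) -> float:
    """Power (in watts) of one serial link under an assumed bit-rate model.

    bit rate = iq_factor * U * ENOB * BW_OS, converted with the
    SerDes figure of merit in mW/(Gb/s).
    """
    bit_rate_gbps = iq_factor * num_streams * enob * bw_os_sps / 1e9
    return bit_rate_gbps * fom_mw_per_gbps * 1e-3


if __name__ == "__main__":
    p = serdes_link_power_w(num_streams=8, enob=5)
    print(f"Assumed per-link SerDes power: {p:.2f} W")
```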
B. Power model of mixed signal components
In Section III-D, we analyze the impact of DAC quantization in different array architectures. The DAC power consumption is mainly determined by the sampling frequency and the effective number of bits. The total power consumption of each DAC is computed using the following equation

$$P_{\mathrm{DAC}} = \mathrm{FOM}_{\mathrm{DAC}} \times 2^{\mathrm{ENOB}} \times \mathrm{BW}_{\mathrm{OS}} + P_{\mathrm{buffer}} \quad (14)$$

where P DAC has unit Watt, and ENOB and BW OS are defined as in (13) . The state-of-the-art specification of DAC is FOM DAC = 0.08pJ/conversion [61] . A constant hardware overhead for signal amplification is modeled as P buffer = 10mW for −14dBm output signal power. Therefore, further reducing the precision has limited power saving benefits when P buffer dominates.
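The effect of the fixed buffer overhead can be checked numerically. The sketch below assumes the common DAC figure-of-merit form, P DAC = FOM DAC × 2^ENOB × BW OS + P buffer, with the parameter values quoted in the text. The function name is hypothetical.

```python
def dac_power_w(enob: float,
                bw_os_sps: float = 1.7e9,
                fom_j_per_conv: float = 0.08e-12,
                p_buffer_w: float = 10e-3) -> float:
    """DAC power in watts: FOM * 2^ENOB * sampling rate + buffer overhead."""
    return fom_j_per_conv * (2 ** enob) * bw_os_sps + p_buffer_w


if __name__ == "__main__":
    for bits in (3, 5, 8, 10):
        print(f"ENOB = {bits:2d}: {dac_power_w(bits) * 1e3:6.1f} mW")
```

With these values the conversion term at 5 bits is about 4.4mW, below the 10mW buffer, which is consistent with the observation that going below roughly 5 bits saves little DAC power.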
C. Power model of RF signal components
In this section, we estimate the required power consumption in the RFIC, including the power for signal amplification and analog array processing for the hybrid architectures. The components are the phase shifter, the local oscillator using a phase-locked loop (PLL), the mixer, the RF amplifier for gain compensation, and the power amplifier for transmission.
• Local oscillator (LO) and mixer: The phase noise of an oscillator is inversely proportional to the power dissipated [23] . State-of-the-art VCO designs [62] [63] [64] [65] achieve phase noise lower than -110dBc/Hz at 1MHz offset with less than 30mW DC power consumption, and system performance is not affected by such a noise specification [66] . Considering the required buffer at the output, the power consumption of the VCO block can be P VCO = 60mW for each element. Mixers can be built from active or passive devices. In practice, passive mixers are easier to implement and have better linearity and noise performance. Mixers require sufficient LO signal power to be driven. In this work, we select the input LO power to be at least -5dBm and the power consumption of the mixer is P Mixer = 10mW. The total power consumption of the LO is P LO = 70mW.
• Phase shifter: RF phase shifting can be implemented in various ways, see [41] for a comprehensive survey. State-of-the-art work uses the reflective-type phase shifter (RTPS) and the switch-type phase shifter (STPS) as the main approaches to passive PS [29] , [30] , [67] [68] [69] . Such approaches use a delay line with controllable length to generate the desired phase shift. Although nearly zero DC power consumption is required, a passive PS often has high insertion loss and large IC area due to the delay line. The active approach uses a vector modulator (VM), which consists of variable gain amplifiers in both the in-phase and quadrature RF paths to generate a complex gain for magnitude adjustment and phase shifting. A VM requires active devices and has higher power consumption than STPS or RTPS. Meanwhile, a VM requires less IC area [28] , [56] , [70] . In this work, we use the VM as the building block of the hybrid architectures and the power model is P PS = 10mW with 2dB gain.
D. RF signal amplification power
RF signal amplification falls into two categories: gain compensation amplifiers and power amplifiers.
• RF amplifier: Gain compensation amplifiers are used to compensate insertion loss in the analog beamforming for hybrid architectures. As discussed in Section IV, hybrid architectures need to distribute the up-converted RF signal into phase shifter networks. During this procedure, insertion loss is introduced by the power splitters, transmission lines, and power combiners. These losses need to be properly compensated in order to deliver sufficient radiated signal power at the antenna. From the cost perspective, it is better to provide the gain before power splitting occurs since this requires fewer amplifiers. However, it raises linearity concerns for CMOS amplifiers. As shown in the next subsection, a large hybrid array has more than 20dB insertion loss in the distribution route, and pre-compensating such loss immediately after up-conversion leads to severe nonlinear distortion in the RF signals. A practical design typically places amplifiers in a hierarchical manner along the RF signal distribution route [56] . Besides, the gain compensation amplifiers need to be carefully designed and their power consumption cannot be overlooked. The power model adopted in this work considers the gain compensation amplifier design from [36] , where each amplifier has up to 15dB gain with P Amp = 40mW power consumption. Note that active combining [28] is an alternative approach that combines RF signals in current mode using low-noise amplifiers. Although insertion loss can be avoided, there is power consumption in each combiner. We do not discuss this approach in detail.
• Power amplifier (PA): Power amplifiers consume a large amount of power in current base stations operating in the sub-6GHz band. In mmW BS system design there are two conflicting scaling directions. On one hand, the transmit power of each PA is relaxed due to the use of a massive antenna array for a similar total power. On the other hand, the power amplifier efficiency is lower than that of PAs designed for the sub-6GHz band. In Figure 7 , specifications of state-of-the-art mmW power amplifiers at 28GHz are shown. Specifically, the power-added efficiency (PAE) at saturated output power and the associated saturated output power are presented. Different semiconductor technologies, e.g., CMOS, BiCMOS, Gallium Arsenide (GaAs), and Gallium Nitride (GaN), are included. State-of-the-art CMOS or SiGe BiCMOS PAs are not suitable due to their low saturated output power. Assuming a 10dB PAPR margin, even with an extremely large array of 1024 elements, the 46dBm total transmitter power leads to 16dBm output power for each element. Thus the PA is likely to require a saturation point of 26dBm, which is a challenging target for PAs suitable for deployment in arrays. GaAs PAs are generally cheaper than GaN PAs and are expected to be used for 5G array applications without operating in the strongly nonlinear region. In the proposed PA power consumption model, a PA efficiency of η PA = 0.185 is adopted. Specifically, the calculation of PA efficiency is based on 0.3 peak PAE, 10dB power back-off, and a Doherty PA architecture 6 . Accordingly, the power consumption in each PA element is
where the number of array elements N and output power P (out) are from Figure 3 in each architecture.
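The per-element numbers quoted above (46dBm total power over 1024 elements giving about 16dBm per element) can be reproduced with a short sketch. It assumes that the total output power is split equally across N elements and that the DC power of each PA is its output power divided by η PA. Both the function names and this simple model are assumptions made for illustration.

```python
import math


def dbm_to_watt(p_dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10 ** (p_dbm / 10) / 1000


def per_element_output_dbm(total_out_dbm: float, num_elements: int) -> float:
    """Output power per element when total power is split equally."""
    return total_out_dbm - 10 * math.log10(num_elements)


def pa_dc_power_w(total_out_dbm: float, num_elements: int,
                  eta_pa: float = 0.185) -> float:
    """Assumed DC power of one PA: per-element output power / efficiency."""
    p_out_w = dbm_to_watt(per_element_output_dbm(total_out_dbm, num_elements))
    return p_out_w / eta_pa


if __name__ == "__main__":
    n = 1024
    per_elem = per_element_output_dbm(46.0, n)
    print(f"Per-element output power: {per_elem:.1f} dBm")
    print(f"Saturation point with 10 dB margin: {per_elem + 10:.1f} dBm")
    print(f"DC power per PA: {pa_dc_power_w(46.0, n) * 1e3:.0f} mW")
```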
E. Summary of specifications of circuits blocks for transmitter array architectures
In Figure 8 , we present an example signal distribution budget for the three array architectures with 64 elements. Specifically, we focus on the insertion loss in the PCB, silicon, and RF devices as modeled in Section IV. There is more than 10dB loss for every two stages of Wilkinson splitters/combiners plus the associated transmission line. As a consequence, RF amplifiers can be placed to compensate such loss in SA as shown in Figure 8 (b). For FH, multi-stage compensation is required to avoid saturation as shown in Figure 8 (c). Such a design is commonly adopted in implementations of phased arrays [56] . Moreover, the combining network in FH also needs a similar design. For a splitting or combining network with N wilk ports, we use an approximate number of $\sum_{n=1}^{\infty} N_{\mathrm{wilk}}/4^{n} \approx N_{\mathrm{wilk}}/3$ amplifiers for simplicity. Therefore, FH requires a total of U N FH /3 amplifiers in both the splitting and combining networks. Moreover, for all architectures, we assume a 5dBm signal strength is required at the input of the PA [79] , [80] . The output of each mixer is -6dBm. For a Wilkinson splitter or combiner with N wilk ports, a total of N wilk − 1 splitting (1:2) or combining (2:1) units are required. As a consequence, the required numbers of Wilkinson units are (K SA − 1)N SA /K SA and U (N FH −1)+N FH (U −1) in the SA and FH architectures, respectively. The specifications of the circuit blocks, the total number of blocks in each architecture, and the required number of blocks per antenna element are summarized in Table III .
6 In Doherty PA, the PAE remains constant when the instantaneous output magnitude $a$ is no more than 3dB weaker than the peak magnitude $a_{\max}$, i.e., $\mathrm{PAE}(a) = \mathrm{PAE}_{\max}$ for $a \ge a_{\max}/2$. Otherwise, the PAE drops as a linear function of the instantaneous output magnitude, i.e., $\mathrm{PAE}(a) = \frac{2a}{a_{\max}}\mathrm{PAE}_{\max}$ for $a < a_{\max}/2$. Thus, the average efficiency is $\eta_{\mathrm{PA}} = \int_a f_A(a)\,\mathrm{PAE}(a)\,da$, where $f_A(a)$ is the probability distribution of the signal magnitude. When $\mathrm{PAE}_{\max} = 0.3$ and the signal magnitude is Rayleigh distributed with average power 10dB below the peak, the PA efficiency is $\eta_{\mathrm{PA}} = 0.185$.
Notes to Table III:
a. We do not focus on varying the number of elements in a module. K DA = 8 and K SA = 16 are treated as constants.
b. It refers to a 1:2 or 2:1 Wilkinson splitting or combining unit.
c. We use 0.89mm² [60] and 0.32mm² [83] for the SerDes receiver and transmitter, respectively. They are fabricated in 28nm and 16nm CMOS.
d. Specification is from [61] ; the DAC has 8 bits precision and uses 28nm fabrication.
e. Specifications of the 28GHz LO and mixer are from [64] ; 65nm CMOS fabrication is used.
f. Specification is estimated from a figure in [56] . 0.18µm BiCMOS is used for fabrication.
g. Specification is estimated from a figure in [56] and scaled by wavelength due to its direct impact on the Wilkinson divider.
h. Assuming a SerDes module is used for each module defined in Section IV-A.
with FOM DSP,area = 500GOPS/mm² , 10 times scaling from [84] due to potential advanced CMOS process.
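The component counts stated above can be tabulated with a small sketch. The function names are hypothetical, and the example sizes (N SA = 256, K SA = 16, N FH = 64, U = 8) are chosen only for illustration.

```python
def sa_wilkinson_units(n_sa: int, k_sa: int) -> int:
    """SA: each of the N_SA/K_SA sub-arrays needs K_SA - 1 splitting units."""
    return (k_sa - 1) * n_sa // k_sa


def fh_wilkinson_units(n_fh: int, num_streams: int) -> int:
    """FH: U splitters with N_FH ports plus N_FH combiners with U ports."""
    return num_streams * (n_fh - 1) + n_fh * (num_streams - 1)


def amplifiers_per_network(n_wilk: int, terms: int = 30) -> float:
    """Truncated geometric series sum_{n>=1} N_wilk / 4^n, close to N_wilk / 3."""
    return sum(n_wilk / 4 ** n for n in range(1, terms + 1))


def fh_amplifiers(n_fh: int, num_streams: int) -> float:
    """FH amplifier count used in the text: U * N_FH / 3."""
    return num_streams * n_fh / 3


if __name__ == "__main__":
    print("SA Wilkinson units:", sa_wilkinson_units(256, 16))
    print("FH Wilkinson units:", fh_wilkinson_units(64, 8))
    print("Amplifiers for a 64-port network:", round(amplifiers_per_network(64), 2))
    print("FH amplifiers:", round(fh_amplifiers(64, 8), 1))
```

The 1/4^n series reflects one amplifier stage for every two splitting stages, which is why the count converges to one third of the port count.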
VI. COMPARISON RESULTS
In this section, we present the power and hardware cost comparison among three architectures. Then, we discuss the scalability of these architectures for future trends. Specifically, we focus on the impact of increased throughput requirement and improved energy efficiency in digital computation due to silicon scaling.
A. Power consumption of mmW array architectures
The required power consumption in the three use cases is presented in Figures 9 to 11. All designs meet the SE requirement, and the quantizations in DSP, SerDes, DAC, and PS are optimized. We observe that the system power consumption is a convex (U-shaped) function of array size, with a few exceptions that will be discussed in later paragraphs. The convexity comes from the trade-off between PA power and processing power in other circuit blocks for different antenna array sizes. In the figures, the range of antenna element number N for all scenarios is chosen to be close to the green point, i.e., the array size that minimizes system power consumption.
Taking a closer look at the Dense Urban MBB use case in Figure 9 , we draw the following conclusions. Firstly, DA and FH have a similar green point of array size when the same number of streams U is used, while the green point of SA is much larger. This is due to the inefficiency of array gain (5) when SA splits the antennas into sub-groups. The exception occurs in SA with U = 32 streams. When SA uses a small number of antennas and a high multiplexing level, it effectively becomes a digital array. In fact, the green point for SA with U = 32 streams occurs at N = 32. It requires each RF-chain to be connected to one antenna, which makes SA a fully digital array. In the rest of the comparison, we focus on the regime where each RF-chain is connected to K SA = 8 antennas and do not further consider the regime of N < 256 with U = 32 streams. Secondly, increasing U reduces system power consumption in DA and SA. With fixed N , increasing U reduces the required transmit power and thus saves DC power of the PA. Besides, increasing U does not require additional hardware resources except baseband precoding and SerDes throughput. With the benefit of the reduced quantization requirement from Figure 4 and high DSP efficiency, the negative impact of the additional hardware resources is marginal. Thirdly, the transmit power and the power consumption of the PA reduce when FH uses higher U , but the system does not necessarily benefit. Part of the reason is that the power in other circuit blocks scales linearly with the number of streams, and they become the system bottleneck in the high-U regime. Another important fact is that a power efficient design tends to reduce N to save processing power when U increases. This implies that FH needs to deal with higher interference from the increased beam-width. In fact, FH with N = 16 cannot meet the SE requirement when using U = 32 beams. Finally, comparing the best designs of all architectures, we conclude that DA is the most power efficient architecture. The best design of SA becomes DA, and the best design of FH still requires 240% more power than DA.
The system power consumption in 50+Mbps Everywhere is shown in Figure 10 . We have the following findings. Firstly, the benefits of using higher multiplexing are not as prominent as in the MBB case. According to Section III and the corresponding analysis, this is mainly caused by the smaller target SINR relaxation by reducing U. In fact, SA needs higher transmit power and thus higher DC power of the PA. Secondly, a large array size N is required for a power efficient system. Overall, the system requires more hardware and power consumption than in Dense Urban MBB, which reflects the intrinsic disadvantage of mmW in providing ubiquitous connection even with a small cell size. Finally, DA remains the most efficient architecture, and the best design of the hybrid architectures requires nearly 50% more power. This is a surprising result. One may expect that hybrid architectures outperform DA when the system is optimized for beamforming rather than multiplexing in this NLOS environment. With U = 2, we do observe comparable power consumption. However, DA further reduces its power by leveraging increased U with negligible additional processing power consumption. Hybrid architectures require either higher transmit power, e.g., SA, or excessive processing power, e.g., FH, to increase U .
The only use case in our survey in which hybrid architectures outperform DA is Self-backhauling, where the multiplexing level is limited due to the point-to-point communication environment of the LOS channel. In Figure 11 , the DA requires 18% more power as compared to the hybrid architectures. This power margin is small because the DA requires nearly 4 fewer quantization bits than the hybrid architectures according to Figure 4 , which prevents excessive power consumption in BB precoding, SerDes, and DAC. Overall in this use case, the SA and FH have similar power consumption. In fact, SA and FH have the same number of phase shifters when using the same number of antenna elements. The difference between them lies in the power consumption of signal routing. The SA has more RF-chains than FH and therefore requires more power in high precision DACs and VCOs. The FH has only one RF-chain but requires more power for RF signal distribution than SA.
In Figures 9 to 11, DAC and BB precoding power account for a small proportion of the DA system power, even when high multiplexing or a large array size is used. Part of the reason is the ENOB requirement relaxation according to Section III. A more important factor is the DSP energy efficiency. Our study is based on the assumption that baseband processing is implemented on application-specific integrated circuits (ASIC). In deploying mmW DA, a programmable DSP or Field-Programmable Gate Array (FPGA) based BB processor provides the flexibility of reconfiguring the BB precoding scheme, at the cost of an order of magnitude more power consumption [84] . In Figure 12 , the system power of all architectures is compared when different DSP efficiencies are used. Throughout all cases, all design parameters are optimized such that the SE target is reached with the lowest power consumption, and the required array size N and multiplexing level U are labeled in the figure. We have the following findings. Firstly, DA is most sensitive to decreased DSP efficiency. An efficient design would use a smaller array size when BB precoding becomes the bottleneck since this effectively reduces the DSP burden. SA is less sensitive due to a much smaller number of RF-chains, except in Dense Urban MBB where SA effectively behaves as a digital array. FH is least sensitive to DSP efficiency. Secondly, with 3.2mW/GOPS, a FOM that can be reached by a reconfigurable digital processor using a 90 to 130nm process [84] , DA remains the best architecture in Dense Urban MBB. In the remaining use cases, DA becomes less competitive in power consumption.
B. IC Areas and cost of mmW array architectures
In Figure 13 , the required IC area is presented as a function of array size. Note that increasing the multiplexing capability forces DA to have a more powerful and larger DSP, and it also forces FH to have more RF-chains and a more complicated distribution network. Since a maximum multiplexing of U = 16 does not significantly affect the optimal design for power consumption, we use U = 16 for DA and FH and U = 32 for SA. As shown in the figure, the largest contributor in DA is the DSP, which is expected to shrink further so long as Moore's law reduces silicon area. SA remains competitive with DA in IC area. However, the PAs, which are likely to be fabricated in another material, are likely to add cost for SA because it requires a larger number of antennas to be power efficient. FH requires the largest IC area due to the full connection between RF-chains and a large number of antenna elements.
VII. DISCUSSIONS ON OPEN RESEARCH CHALLENGES
Admittedly, the power and IC area analyses for the three array architectures are preliminary estimates based on the surveyed literature. In particular, the effect of the extra digital processing on power consumption and area depends on the actual design and is hard to analyze at this point. Besides, some open research questions remain that were not covered in this paper. The first is the issue of synchronization among a large number of array elements. In the centralized LO distribution architecture, each element re-generates the clock from the same reference, but global LO distribution may not be area and energy efficient [85] . Under a distributed LO scheme, independent LOs help reduce the impact of phase noise [86] , but the system needs to be calibrated periodically to avoid loss of coherency across elements. The second issue is related to compensation of PA nonlinearity. Digital predistortion (DPD) is important in massive transmitter array design. Conventionally, DPD is designed for DA and the DSP is implemented for each pair of transmitter chain and PA. Due to the increased processing and power of DPD, the overall gains in power efficiency for arrays with a large number of antennas need to be analyzed and optimized. DPD for SA [87] , [88] and FH [89] is actively investigated by researchers. Thirdly, other design variations, including phase-and-magnitude analog precoders and active RF splitters and combiners [90] , can help reduce the complexity and power consumption of hybrid arrays. Last, our survey reveals the benefits of using a high multiplexing level for power saving in the hardware. However, high multiplexing brings additional burden to higher layers of the system, e.g., the network layer faces more challenges in scheduling users with non-overlapping propagation paths, and this impact needs to be incorporated in a more comprehensive study.
In this work, we reveal that the conventional belief that the hybrid array architecture is more cost and energy efficient than the digital architecture is not necessarily true when the hardware blocks are modeled comprehensively and the system adopts optimized design parameters. Similar findings were reported for the receiver array during the period when this work was written [91] , [92] . It is worth noting that these works, including ours, focus on the additive uniformly distributed quantization error model and the linear MIMO processing model. However, such a quantization error model becomes less precise when data samples and quantization error are correlated, which occurs when data converters have a very small number of bits. Besides, linear MIMO processing is not optimal. In fact, for the receiver array a variety of nonlinear combining and decoding algorithms have been proposed, e.g., successive interference cancellation based combining [25] and approximate message passing [93] . Besides, the precision requirements of DAC and analog-to-digital converter (ADC) devices are strongly dependent on the processing algorithms, e.g., algorithms tailored for 1-bit ADC [94] . How to use advanced signal processing to further reduce the power consumption and cost of mmW arrays remains an open research question. The Matlab code for simulation and the data for system level power comparison are released in [95] for readers who are interested in results with different design choices and hardware specifications.
VIII. CONCLUSION
Building energy and cost efficient massive arrays is one of the major challenges in implementing and deploying mmW networks in the 5G era. In this work, we study and compare three array architecture candidates, the digital architecture and two variations of the analog-digital hybrid architecture, and discuss various hardware design trade-offs. Specifically, the required power and IC area of the circuit blocks are modeled as functions of key design parameters in each architecture. Based on state-of-the-art circuit designs and measurement results, we evaluate the power and IC area of the circuit blocks. We compare the three array architectures when the associated design parameters are optimized to meet the spectral efficiency targets in three representative 5G-NR use cases in the most efficient manner. The results show that the digital architecture is the most efficient in power and area. The key intuition is that the digital array can effectively save system power and area by leveraging high multi-user multiplexing, which effectively reduces the requirements on array size, transmit power, and hardware specifications in the RF-chains. The hybrid architectures require additional power to support more simultaneous spatial beams, either via additional transmit power to compensate for the loss of array gain, or via severely increased processing power. Besides, we reveal that the bottleneck of the hybrid architectures is the RF signal distribution network in their RF beamforming stages.
IX. ACKNOWLEDGMENT
This work is partially supported by National Science Foundation through grant 1718742. Authors would like to acknowledge Dr. Jefferey Lee and Dr. Zhao Yan for their helpful comments and discussions.
APPENDIX

A. Required DAC quantization bits
In this subsection, we provide analysis of transmit noise σ 
Note that the above power is normalized by the input signal power of each DAC. In the DA architecture, the input signal power of each DAC is amplified to $P^{(\mathrm{out})}_{\mathrm{DA}}/N_{\mathrm{DA}}$. As a consequence, the transmitter noise power at the output of each PA is P

In the SA architecture, because the DACs in a virtual group have identical input signals, their quantization noise is identical as well. The quantization noises are coherent at the outputs of the $N_{\mathrm{SA}}/U$ PAs within a virtual group, and each has power $P^{(\mathrm{out})}_{\mathrm{SA}}\,\mathrm{DAC}(B_{\mathrm{SA}})/N_{\mathrm{SA}}$. As a consequence, the transmit noise is $\sigma^2_{\mathrm{n,tx}} = P^{(\mathrm{out})}_{\mathrm{SA}} N_{\mathrm{SA}}\,\mathrm{DAC}(B_{\mathrm{SA}})/U^2$. In the FH architecture, the quantization noise from each DAC is amplified to $P^{(\mathrm{out})}_{\mathrm{FH}}/(NU)$ in each PA. As a result, the total transmitter noise power is σ

Consider a linear phased array system with $N$ antenna elements that steers a beam towards direction $\gamma$ in a 2D plane. The beamforming vector is given by $[e^{j\phi_1}, \cdots, e^{j\phi_N}]$, where $\phi_n = (n-1)\pi \sin(\gamma)$. Next, we derive the beamforming gain at the main lobe for systems with ideal and non-ideal phase shifters.
Let us denote the signal at the $n$-th element as $w_n$, with $|w_n| = 1/\sqrt{N}, \forall n$, when all phase shifters are ideal. Clearly, the phase shifters need to be set such that the signals add constructively in the intended direction, i.e., $w_n e^{j\phi_n} = 1/\sqrt{N}$, and the beamforming gain is

$$G = \left| \sum_{n=1}^{N} w_n e^{j\phi_n} \right|^2 = N.$$

When all phase shifters are non-ideal, the signal at the $n$-th element is denoted as $\tilde{w}_n = w_n \exp(j\psi_n)$, where $\psi_n$ is the phase error due to quantization and random implementation impairment. With $Q$-bit quantization, the phase error $\psi_n$ is bounded as $|\psi_n| \le \epsilon$, where $\epsilon = \pi/2^Q$. The corresponding beamforming gain is

$$\tilde{G} = \left| \sum_{n=1}^{N} \tilde{w}_n e^{j\phi_n} \right|^2 = \frac{1}{N} \left| \sum_{n=1}^{N} e^{j\psi_n} \right|^2 \ge \frac{1}{N} \left( \sum_{n=1}^{N} \cos(\psi_n) \right)^2 \ge N \cos^2(\epsilon),$$

where the second inequality is valid so long as $Q \ge 1$, i.e., $|\psi_n| \le \pi/2, \forall n$.

Therefore, the gain reduction is bounded by

$$10 \log_{10} \left( \frac{G}{\tilde{G}} \right) \le -20 \log_{10} \cos\left( \frac{\pi}{2^Q} \right) \ \text{[dB]}.$$

The above derivation shows that the gain drop in the main lobe is at most approximately 0.69 dB, 0.17 dB, and 0.04 dB with $Q = 3$, 4, and 5 bits of quantization, respectively. Moreover, these values are independent of the array size $N$. Equivalently, when the phase shifter implementation error is less than $\epsilon = 22.5^\circ$, $11.25^\circ$, and $5.625^\circ$, the gain drop is bounded by the same three values, respectively.

Footnote 7: Correlation among the quantization errors of the DACs becomes non-negligible when the number of quantization levels is very small, e.g., one bit. Dithering is a technique to de-correlate them but is beyond the scope of this work.
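The bound on the main-lobe gain drop can be checked numerically. The following Python sketch evaluates $-20 \log_{10} \cos(\pi/2^Q)$ and compares it with random trials. It assumes, for illustration only, that the phase errors are independent and uniformly distributed in $[-\epsilon, \epsilon]$; the function names and the choice of 64 elements are hypothetical and not taken from the released code.

```python
import cmath
import math
import random


def gain_drop_bound_db(q_bits):
    """Worst-case main-lobe gain drop, -20*log10(cos(pi / 2**Q)), in dB."""
    eps = math.pi / 2 ** q_bits
    return -20.0 * math.log10(math.cos(eps))


def simulated_gain_drop_db(n_elements, q_bits, rng):
    """Gain drop in dB for one random draw of phase errors.

    Assumption: errors are i.i.d. uniform in [-eps, eps] with eps = pi / 2**Q.
    """
    eps = math.pi / 2 ** q_bits
    # After ideal co-phasing, element n contributes exp(j*psi_n) / sqrt(N).
    total = sum(cmath.exp(1j * rng.uniform(-eps, eps)) for _ in range(n_elements))
    gain = abs(total) ** 2 / n_elements
    # The ideal gain is N, so the drop is 10*log10(N / gain).
    return 10.0 * math.log10(n_elements / gain)


if __name__ == "__main__":
    rng = random.Random(0)
    for q in (3, 4, 5):
        worst = max(simulated_gain_drop_db(64, q, rng) for _ in range(200))
        print(f"Q={q}: bound {gain_drop_bound_db(q):.3f} dB, "
              f"worst simulated drop {worst:.3f} dB")
```

Because every simulated phase error satisfies $|\psi_n| \le \epsilon$, each simulated drop should stay below the bound for any array size, which is consistent with the bound being independent of $N$.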
