This article presents the first application of a digital-intensive, intrinsically linear, digitally controlled class-E technique in a Doherty configuration. By careful nonlinear segmentation and multiphase RF-clocking along with overdrive-voltage control and automatic duty-cycle correction, it is shown that even the nonlinearities related to Doherty operation can be fully handled by the underlying design such that digital predistortion (DPD) can be, in principle, omitted. The nonlinear behavior of the whole digital Doherty power amplifier (PA) is analyzed, and closed-form equations are given to predict the AM-AM and AM-phase modulation (PM) curves. In addition, the time/phase mismatch between the peak and main branches and between the AM and PM signals is accurately compensated. To achieve the maximum intrinsic linearity, two separate chips with the same architecture, but different design parameters, are fabricated as the main and peak amplifiers in 40-nm bulk CMOS. To achieve a large RF bandwidth and high passive combiner efficiency, a differential low-loss, wideband Marchand balun-based Doherty power combiner, implemented using reentrant coupled lines with independent second-harmonic control, is proposed and, together with the matching network, fabricated on a two-layer PCB. The measured peak/6-dB power backoff $P_{OUT}$, drain efficiency/power-added efficiency at 2.4 GHz are 17.5 dBm/12.2 dBm, 57%/52% and 36%/25% with $V_{DD}$ main/peak = 0.6 V/0.7 V. Measured results without using DPD show −41-dBc adjacent channel power ratio (ACPR) and −36-dB error vector magnitude (EVM) for a 16-MHz OFDM signal at 2.5 GHz. With DPD, the measured ACPR and EVM for 16-MHz/32-MHz OFDM signals are −52 dBc/−48 dBc and −50 dB/−48 dB, respectively.
I. INTRODUCTION
THE biggest challenge in designing a transmitter (TX) for wideband mobile applications is to achieve high energy efficiency combined with high spectral purity. Highly efficient but not necessarily linear power amplifiers (PAs) often use switch-mode operation such as class-E, F, or D$^{-1}$ [1]-[7]. In order to benefit from the advances in digital CMOS process technology, it is highly desirable to push the digital/analog boundary in mixed-mode RF circuits toward the antenna interface as much as possible. A switch-mode digital-PA (DPA), implemented as an array of small sub-PA cells, is therefore a logical candidate for such a transmitter, as it can be directly driven by digital (i.e., square-wave) signals [8]-[18]. In a digital-intensive polar TX, as shown in Fig. 1, the input complex I/Q data are converted to amplitude $AM[n] = \sqrt{I[n]^2 + Q[n]^2}$ and phase $\phi[n] = \arctan(Q[n]/I[n])$. The conversion from the Cartesian domain to the polar domain is a highly nonlinear operation, which can limit the maximum signal bandwidth in practical implementations. However, since there is only one RF path per PA in a polar TX, the efficiency is normally higher than that of its Cartesian counterpart. On the other hand, high data-rate signals normally have a high peak-to-average power ratio (PAPR), which compels a PA to operate in deep power backoff (PBO), thus reducing its power efficiency if no efficiency enhancement technique is applied. Among efficiency enhancement techniques such as envelope tracking [19], [20] and Doherty [7], [21]-[31], the Doherty PA [21] is still one of the most widely used because of its relatively simple and low-cost implementation. Using an off-chip matching network reduces the passive losses, thus increasing the efficiency, especially at PBO, compared with implementations with an on-chip matching network [27]-[29].
0018-9480 © 2019 IEEE.
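The Cartesian-to-polar conversion described above can be illustrated with a minimal floating-point sketch. The function name and the sample values are illustrative assumptions; a hardware implementation would typically use a CORDIC block rather than library calls, and `atan2` is used here as the four-quadrant form of $\arctan(Q/I)$.

```python
import math

def cartesian_to_polar(i: float, q: float) -> tuple[float, float]:
    """Return (AM, phi) for one complex baseband sample I + jQ."""
    am = math.hypot(i, q)    # sqrt(I^2 + Q^2)
    phi = math.atan2(q, i)   # four-quadrant arctan(Q/I), in radians
    return am, phi

# Hypothetical I/Q samples, chosen only to exercise different quadrants.
samples = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
polar = [cartesian_to_polar(i, q) for i, q in samples]

for (i, q), (am, phi) in zip(samples, polar):
    print(f"I={i:+.1f} Q={q:+.1f} -> AM={am:.4f}, phi={math.degrees(phi):+.1f} deg")
```

Even for a band-limited I/Q pair, the square root and arctangent spread the spectrum of AM and φ, which is the bandwidth expansion the text refers to.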
Conventional TX design approaches are often based on using a nonlinear PA to achieve high efficiency and then linearize it by applying digital predistortion (DPD) techniques [7] , [21] - [24] , [26] , [28] , [29] , [31] . Furthermore, as will be discussed in Section IV, even with an ideal DPD, due to the highly nonlinear operation mode of class-E DPAs, it is not possible to achieve maximum spectral purity and minimum error vector magnitude (EVM) for a given number of bits with a conventional uniform DPA structure [32] . The bandwidth of the digital AM signal is mostly limited by the sampling rate and not by analog blocks as in an analog-intensive polar TX [33] , [34] ; thus, it can handle a higher signal bandwidth [9] - [12] , [16] - [18] , [29] , [30] .
In addition to high video bandwidth, high RF bandwidth is also of great importance. There are three main challenges in increasing the RF bandwidth of a class-E Doherty PA, namely, class-E PA limited bandwidth, impedance converter limited bandwidth, and the balun limited bandwidth, which can be mitigated using three different techniques, namely, reactance compensation [35] , a shunt open-circuit λ/8 section [26] , [31] , [36] parallel to the load, and compensated Marchand balun with reentrant coupled lines [18] , [37] - [39] , respectively.
In this article, for the first time, a linear digital-intensive polar class-E Doherty PA is demonstrated, in which the linearity is significantly enhanced using circuit-level linearization techniques with automatic duty-cycle correction (DCC). Wideband efficiency enhancement is achieved by using a reactance-compensated parallel-circuit class-E PA along with a wideband impedance inverter and a novel wideband Marchand balun-based Doherty power combiner, implemented using reentrant coupled lines with independent second-harmonic control. Nonlinear sizing, multiphase RF-clocking, and overdrive-voltage control have recently been used successfully to linearize single PAs with both on-chip [16], [17] and off-chip [18] matching networks at the circuit level without using DPD.
In the following, a wideband class-E Doherty PA and a digital Doherty PA are discussed in Sections II and III, respectively. System-level design considerations are discussed in Section IV, and the circuit-level linearization techniques are described in Section V. The final design and implementation are explained in Section VI, followed by the measurement results and conclusion in Sections VII and VIII, respectively.
II. WIDEBAND CLASS-E DOHERTY PA
In a symmetric Doherty PA, as shown in Fig. 2(a), there are the main (or carrier) and peak (or auxiliary) PAs, where the peak PA is only active beyond the 6-dB PBO point, resulting in an additional peak in the efficiency, as shown in Fig. 2(b). The output powers are combined using an impedance inverter. To maintain linearity, efficiency is typically compromised at the high-efficiency power backoff point to ease DPD [7], [21]-[24], [26]-[28], [31]. To achieve higher efficiency, switch-mode PAs can also be used as the branch amplifiers. Among the switch-mode PAs, the class-E PA has one of the simplest load networks and can theoretically provide up to 100% drain efficiency (DE) while absorbing the drain capacitance in its load network [1], [2], [5]-[7]. In this section, different techniques to mitigate the bandwidth-limiting factors, as highlighted in Fig. 2(a), are described.
A. Reactance Compensated Parallel-Circuit Class-E PA
The general topology of a push-pull class-E PA with finite dc-feed inductance is shown in Fig. 3(a). The normalized resonance frequency of the parallel resonator is defined as $q_D = 1/(\omega_0 \sqrt{L_D C_D})$. It has been shown that for $q_D = 1.412$, the output power for a given $V_{DD}$ and $R_L$ is maximum, and the series reactance $X$ can be zero [35], [40], [41]. Such a structure, known as parallel-circuit class-E, has a higher maximum operating frequency and a higher load resistance [41]. To have wideband RF operation, the load angle seen by the intrinsic drain should remain constant over the required bandwidth. This can be done through reactance compensation [35], [40]. By properly choosing the parameters of the series resonator, a constant load angle, as shown in Fig. 3(b), can be achieved over a wide frequency band, resulting in the optimum $Q_{series} = 1.026$ [40]. However, in an ideal class-E PA, a high $Q_{series}$ is required to block all the harmonics in the series resonator; otherwise, the efficiency drops. By using a push-pull configuration with a differential matching network, the orthogonality between odd and even terminations can be used to ensure a very high even-mode (second-harmonic) impedance and, as such, relax the $Q_{series}$ requirement of the series resonator, achieving wideband operation without compromising the class-E efficiency.
B. Compensated Impedance Inverter
Doherty implementations normally use a quarter-wave transmission line (QWTL) or its lumped equivalent as the impedance inverter [ Fig. 4 (a)] [7] , [21] - [24] , [27] - [30] . As can be seen from Fig. 4 (b)-(c), the magnitude and phase of the impedance Z m seen by the main PA are highly sensitive to frequency. By adding an open-circuit compensation half-wave TL (HWTL) in parallel to the load [26] , [31] , [36] , as shown in Fig. 4 (a), the input impedance is given by
which shows smaller variations of the magnitude and phase of Z m over a larger bandwidth. This structure can be employed in Doherty configuration as depicted in Fig. 4(d) to expand the efficiency bandwidth.
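The effect of the open-circuit half-wave stub can be checked numerically with standard lossless transmission-line theory. The sketch below is a simplified single-ended model, not the article's expression: the 50-Ω inverter, the 25-Ω backoff load, and in particular the stub impedance `Z_STUB` (picked here so that the first-order reactance cancels in this model) are assumptions made only for illustration.

```python
import cmath
import math

def line_input_impedance(z0, z_load, theta):
    """Input impedance of a lossless line with characteristic impedance z0 and electrical length theta."""
    t = math.tan(theta)
    return z0 * (z_load + 1j * z0 * t) / (z0 + 1j * z_load * t)

def main_pa_impedance(f_norm, z0=50.0, r_load=25.0, z_stub=None):
    """Z_m seen through a quarter-wave inverter; f_norm = f / f_center.

    If z_stub is given, an open-circuited half-wave stub is placed in
    parallel with the load (stub admittance = j * tan(theta) / z_stub).
    """
    y_load = 1.0 / r_load
    if z_stub is not None:
        y_load += 1j * math.tan(math.pi * f_norm) / z_stub
    return line_input_impedance(z0, 1.0 / y_load, 0.5 * math.pi * f_norm)

# Assumed values: 50-ohm inverter, 25-ohm load (Z_m = 100 ohm at band center).
Z0, R_LOAD = 50.0, 25.0
r = R_LOAD / Z0
Z_STUB = 2 * r**2 * Z0 / (1 - r**2)  # hypothetical first-order choice, about 33.3 ohm

for f_norm in (0.9, 1.0, 1.1):
    plain = main_pa_impedance(f_norm, Z0, R_LOAD)
    comp = main_pa_impedance(f_norm, Z0, R_LOAD, Z_STUB)
    print(f"f/f0={f_norm:.1f}  uncompensated: |Z|={abs(plain):6.1f}, "
          f"phase={math.degrees(cmath.phase(plain)):+6.1f} deg   "
          f"compensated: |Z|={abs(comp):6.1f}, "
          f"phase={math.degrees(cmath.phase(comp)):+6.1f} deg")
```

At the center frequency the stub presents an open circuit and leaves $Z_m$ unchanged; away from it, the stub susceptance partly cancels the reactance introduced by the quarter-wave line, which is why the phase of $Z_m$ stays closer to zero in this model.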
C. Compensated Marchand Balun With Reentrant Coupled Lines and Second-Harmonic Control
1) Compensated Marchand Balun: The planar Marchand balun, shown in Fig. 5(a), is one of the best TL-based topologies for offering wideband amplitude and phase balance while having a relatively simple implementation [39], [42]. A conventional Marchand balun is constructed from two λ/4 coupled lines with short and open terminations at their specific ports, ideally providing a balanced loading condition from a single-ended load. However, in practice, due to the unequal even-mode and odd-mode phase velocities, the conversion from single-ended to balanced operation is not perfect and results in some imbalance. To correct for this imbalance, a compensation technique [42] can be adopted, where an extra compensation line section is added between the two λ/4 coupled line sections [Fig. 5(b)], whose parameters can be calculated as follows:
where $Z_o$ and $Z_e$ are the characteristic impedances and $\theta_o$ and $\theta_e$ are the electrical lengths of the odd and even modes, respectively. $Z_{cm}$ and $\theta_{cm}$ are the characteristic impedance and the electrical length of the compensation section, respectively.
2) Reentrant Coupled Lines: The bandwidth of a Marchand balun at low (odd-mode) impedance levels depends on the $Z_{0e}/Z_{0o}$ ratio. A high ratio is very challenging to achieve in practice with single-layer transmission lines, as a very small horizontal gap between the coupled lines is required. However, reentrant coupled lines, as shown in Fig. 5(c), can achieve very tight coupling without strict fabrication requirements [39]. In the odd mode, $Z_{0o} = Z_{0,1}$, where $Z_{0,1}$ is the impedance between the transmission lines and the floating layer. In the even mode, $Z_{0e} = Z_{0,1} + 2Z_{0,2}$, where $Z_{0,2}$ is the impedance between the floating layer and the bottom plate. In this case, the coupling factor $K = (Z_{0e} - Z_{0o})/(Z_{0e} + Z_{0o}) = 1/(1 + Z_{0,1}/Z_{0,2})$ mostly depends on the $Z_{0,1}/Z_{0,2}$ ratio rather than the horizontal spacing between the coupled lines, thus relaxing the dimensional requirements in fabrication. In general, a low $Z_{0,1}/Z_{0,2}$ ratio is preferred. Therefore, with an upper layer that has a larger dielectric constant but a smaller thickness than the lower layer ($\varepsilon_{r1} > \varepsilon_{r2}$ and $H_1 < H_2$), strong coupling can be expected, resulting in a low-loss and wideband balun. Furthermore, since the effective dielectric constants of the even and odd modes are different, the wavelengths $\lambda = \lambda_{air}/\sqrt{\varepsilon_{r,eff}}$ of these modes, namely, $\lambda_{even}$ and $\lambda_{odd}$, are also different.
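The relations between the layer impedances and the coupling factor can be evaluated directly. The short sketch below uses the even- and odd-mode impedances quoted for the main DPA in Section VI-B (71 Ω and 7.5 Ω) as example inputs; the function name and the back-calculation of the two layer impedances from those values are illustrative assumptions.

```python
def reentrant_coupling(z01: float, z02: float) -> tuple[float, float, float]:
    """Return (Z0e, Z0o, K) for a reentrant coupled-line section.

    z01: impedance between the top lines and the floating layer.
    z02: impedance between the floating layer and the bottom plate.
    """
    z0o = z01
    z0e = z01 + 2.0 * z02
    k = (z0e - z0o) / (z0e + z0o)
    return z0e, z0o, k

# Example: back out the layer impedances from Z0e = 71 ohm, Z0o = 7.5 ohm.
Z0E_TARGET, Z0O_TARGET = 71.0, 7.5
z01 = Z0O_TARGET
z02 = (Z0E_TARGET - Z0O_TARGET) / 2.0

z0e, z0o, k = reentrant_coupling(z01, z02)
k_from_ratio = 1.0 / (1.0 + z01 / z02)   # equivalent form from the text

print(f"Z01 = {z01:.2f} ohm, Z02 = {z02:.2f} ohm, K = {k:.3f}")
```

Both forms of $K$ agree, and the result depends only on the ratio $Z_{0,1}/Z_{0,2}$, not on the horizontal gap between the top-layer lines.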
3) Second-Harmonic Control: In a differential PA, the even harmonics see an open circuit at the input of the balun. However, the use of a QWTL impedance transformer at the input can provide very low impedance levels for the even harmonics at the PA reference plane, which conflicts with the loading conditions of class-E PA operation. To address this issue, the center of the floating metal layer in the reentrant λ/4 sections is connected to ground by a via at a distance of $\lambda_{even}/8$ from the PA, as depicted in Fig. 6(a). Thanks to the tight coupling between the top and floating metal layers, the TL is therefore ac-grounded in the even mode and is thus seen as an open circuit by the PA at the second harmonic, as shown by the electromagnetic (EM) simulation in Fig. 6(b). In the odd mode, the center of the floating metal is a virtual ground, so the via barely affects the odd-mode impedance levels.
III. DIGITALLY CONTROLLED CLASS-E DOHERTY PA
In an RFDAC-based class-E digital Doherty PA, the output amplitude is directly modulated by changing the effective width, or $R_{ON}$, of the final PA stage, as shown in Fig. 7(a) for a 10-bit Doherty DPA with two 9-bit DPAs. The input amplitude-control word (ACW) varies between 0 and $ACW_{Max} = 1022$, and the ACWs of the main ($ACW_M$) and peak ($ACW_P$) DPAs both have a range of 0 to $ACW_{MP,Max} = 511$. For $ACW \le ACW_{MP,Max}$, we have $ACW_M = ACW$ and $ACW_P = 0$. For $ACW > ACW_{MP,Max}$, as shown in Fig. 7(b), the main DPA is fully on ($ACW_M = ACW_{MP,Max}$) and the peak DPA starts turning on ($ACW_P = ACW - ACW_{MP,Max}$).
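The mapping from the input ACW to the main and peak control words follows directly from the two cases above. The function and constant names in this minimal sketch are illustrative and are not taken from the article.

```python
ACW_MP_MAX = 511            # full-scale code of each 9-bit DPA
ACW_MAX = 2 * ACW_MP_MAX    # 1022, full-scale code of the Doherty DPA

def split_acw(acw: int) -> tuple[int, int]:
    """Map the input ACW to (ACW_M, ACW_P) for the main and peak DPAs."""
    if not 0 <= acw <= ACW_MAX:
        raise ValueError(f"ACW must be in 0..{ACW_MAX}, got {acw}")
    if acw <= ACW_MP_MAX:
        return acw, 0                       # only the main DPA is active
    return ACW_MP_MAX, acw - ACW_MP_MAX     # main fully on, peak turning on

for code in (0, 300, 511, 512, 1022):
    print(code, split_acw(code))
```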
The total drain capacitance ($C_D$, including the transistor and interconnect parasitics) is tuned for class-E operation at $ACW_M$ ($ACW_P$) = $ACW_{MP,Max}$. Therefore, when the main (peak) DPA is fully on, it operates in class-E. As the number of switching transistors in the main (peak) DPA decreases (at PBO), the fundamental impedances become complex with positive reactances, and the second-harmonic impedances become mostly negative reactances (capacitive), so the DPA operates similarly to a class-J PA [43]-[45]. For small $ACW_M$ ($ACW_P$) (<30), the voltage swing on the drain of the main (peak) DPA is small, so the DPA operates similarly to a current source with an almost linear behavior [17]. The change in $C_D$ is rather small since all the devices in the output stage are in parallel all the time. Only their gate potential changes, which affects $C_D$ to a small extent (varying by 110 fF in total from ACW = 1 to ACW = 511 for each DPA, equivalent to less than a 3% change). Consequently, the variations of $C_D$ for $ACW_M$ ($ACW_P$) > 30 do not change the intended class-E operation significantly. However, the efficiency drops due to the increased $R_{ON}$.
Using an analysis approach similar to [17], the single-ended Norton equivalent linear time-invariant (LTI) model of the DPAs in the odd mode is shown in Fig. 8(a)-(d). Since the HWTL section of the compensated impedance inverter does not alter the impedances seen by the main and peak PAs at the center frequency $f_C$ (except for a phase offset), the conventional QWTL is used for theoretical simplicity. The switching transistors are replaced by a set of parallel current sources representing the harmonics of the drain current.
For theoretical simplicity, the amplitude of the fundamental is assumed to be proportional to the total effective (switched-on) width. The output resistance is modeled in parallel to the current sources and is inversely proportional to the total effective width. Ideally, the series resonator only allows the fundamental component $I_{H1}$ to pass through. Therefore, by neglecting the higher harmonics and using the superposition principle, the output signal equals $V_{OUT} = V_{OUT,M} + V_{OUT,P}$, where $V_{OUT,M}$ and $V_{OUT,P}$ are the contributions of the main and peak DPAs to the output signal, given in (3) and (4). $R_L$ is the load resistance seen from the matching network, $R_{D0}$ is the output resistance of a unit transistor with width $W_0$, and $K_M$ and $K_P$ are the ratios of the total width of the activated sub-PA cells of the main and peak DPAs to that of the unit transistor, respectively. $L_D$ is the dc-feed inductance (implemented by wirebonds in this work), and $\omega_0$ is the radian center frequency. Since the variation of the transistors' total output capacitance is small (<3%), we consider $C_D$ as constant. Therefore, the ACW-AM and ACW-phase modulation (PM) functions can be easily calculated.
In contrast to the output phase, the normalized output amplitude is not a strong function of $q_D$; therefore, by assuming $q_D = 1$ for theoretical simplicity, the output amplitude can be calculated as follows:
By assuming $K_M, K_P \ll R_{D0}/R_L$, the normalized ACW-AM can be approximated by
where $K_{NL} = R_L/R_{D0}$ is defined as the nonlinearity factor. In a linearly sized array, for $ACW \le ACW_{MP,Max}$, $K_M = ACW$ and $K_P = 0$; otherwise, $K_M = ACW_{MP,Max}$ and $K_P = ACW - ACW_{MP,Max}$. The calculated ACW-AM/PM curves and the full-circuit (differential class-E DPA with real transistor model and TL-based Marchand balun) simulation results are plotted in Fig. 9, showing reasonable (ACW-PM) to good (ACW-AM) agreement between the proposed model and the full-circuit simulation results. As can be seen, although a switch-mode (class-E) DPA is a nonlinear time-variant circuit, the proposed LTI model provides good insight into the nonlinear behavior of the Doherty DPA in the fundamental band.
IV. SYSTEM-LEVEL DESIGN CONSIDERATION
As explained in Section III, a Doherty DPA with a conventional uniform array and single-phase RF-clocking is highly nonlinear, as characterized by its static ACW-AM and ACW-PM curves. Such nonlinearities are typically corrected using DPD, which can lead to a nonuniform quantization effect [17], [32]. In addition, in a polar PA, since the AM and PM signal paths are different from each other, they will have different time delays, thus requiring delay adjustments before reaching the final stage of the DPA. Furthermore, in a Doherty configuration, the paths of the main and peak PAs are also different from each other, thus requiring timing alignment between these two branches. In the following, these system-level design considerations are explained in more detail.
A. Nonuniform Quantization
Although DPD is very common for linearizing a nonlinear PA, the cascade of DPD and an $N$-bit DPA with a highly nonlinear ACW-AM curve constitutes a nonuniform quantizer. Such a nonuniform quantizer cannot achieve the dynamic range (DR) and linearity expected from an ideal $N$-bit quantizer (i.e., the DPA). Fig. 10 plots the ACW-AM curve of a 10-bit Doherty DPA with and without an ideal DPD, the inverse of the ACW-AM curve, the probability distribution function (PDF) of a typical quadrature-amplitude modulation (QAM) signal, and a zoomed-in view around the transition point where the peak DPA starts operating. As can be seen, the quantization steps at small ACWs for both the main and peak DPAs are much larger than at larger ACWs. This is because the slope of the nonlinearized ACW-AM curve decreases significantly as the ACW increases. The PDF of a QAM signal has its peak around the transition point, where the slope suddenly increases as the peak DPA turns on. Therefore, the rms power of the quantization noise varies dynamically with the signal's amplitude, degrading the output spectral purity. In Fig. 10(b), the effect of this phenomenon on the output spectrum is shown and compared with an ideal 10-bit quantizer and a nonlinear 13-bit DPA after ideal DPD. Compensating for such nonidealities therefore requires about 2-3 extra bits in the DPA and in all preceding digital processing blocks, increasing the complexity, area, and power consumption.
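The nonuniform-quantizer argument can be illustrated with a small numerical experiment. The sketch below does not use the article's Doherty model: it assumes a single 9-bit DPA with a generic compressive curve of the form $K/(1 + K_{NL} K)$, and the value of `K_NL` is hypothetical. It compares the largest output step that remains after ideal DPD with the step of a uniform quantizer of the same resolution.

```python
import math

def am_curve(k: int, k_max: int, k_nl: float) -> float:
    """Assumed compressive ACW-AM curve, normalized so that am_curve(k_max) == 1."""
    return (k / (1 + k_nl * k)) * ((1 + k_nl * k_max) / k_max)

N_BITS = 9
K_MAX = 2**N_BITS - 1
K_NL = 0.01   # hypothetical nonlinearity factor, not a value from the article

# Output amplitudes the DPA can actually produce; ideal DPD can only choose among these.
levels = [am_curve(k, K_MAX, K_NL) for k in range(K_MAX + 1)]
steps = [hi - lo for lo, hi in zip(levels, levels[1:])]

uniform_step = 1 / K_MAX
worst_ratio = max(steps) / uniform_step   # coarsest step relative to a uniform quantizer
extra_bits = math.log2(worst_ratio)       # resolution needed to match the uniform step

print(f"largest step / uniform step = {worst_ratio:.2f}")
print(f"equivalent extra bits       = {extra_bits:.2f}")
```

Under these assumptions the coarsest steps occur at small codes, where the curve is steepest, and matching the uniform step size there would take between two and three additional bits, which is in line with the estimate given in the text.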
B. AM-PM Timing Mismatch
The AM and PM signals in a polar TX are separated from each other. After the CORDIC block at the input, the baseband digital AM signal can be directly applied to the DPA array, while the digital baseband phase data are first upconverted to the RF carrier signal by a phase modulator, thus becoming a passband signal, and then applied to the DPA cells. Consequently, these two signals pass through totally different channels with different timing delays. Because of the bandwidth expansion of the AM and PM signals, any timing mismatch will significantly degrade the adjacent channel power ratio (ACPR) and EVM. Increasing the input signal bandwidth makes it even more challenging to achieve a good linearity since it directly increases the impact of time alignment errors, as shown in Fig. 11(a) and (b). For example, for a signal bandwidth of 32 MHz, the timing mismatch should be less than 100 ps to have enough margin for ACPR<−50 dBc and EVM<−45dB after linearization. Therefore, as shown in Fig. 11(d) for a single DPA, tunable delay cells should be used in the AM/PM signal paths to correct for the timing mismatch between them.
C. Main-Peak Timing Mismatch
In a Doherty PA, the output signals of the main and peak DPA pass through transmission lines with different lengths, thus seeing different delays, which can degrade the ACPR and EVM significantly. The simulated effect of such a timing mismatch on ACPR and EVM is shown in Fig. 11(a) and (c). In a typical analog Doherty PA, shown in Fig. 2(a), the output of the main PA passes through a QWTL, while the input of the peak PA passes through a QWTL. Therefore, the overall input-output signals of the main and peak PAs are automatically self-aligned, and there is ideally no timing mismatch between them (note that in practical implementations, the different impedance terminations of the lines can strongly degrade this property). However, in a digital Doherty PA, as shown in Fig. 11(d), while the phase of the carrier signals is corrected by applying a 90° phase offset, the output signals are not automatically self-aligned. Therefore, this delay difference should be compensated accurately, which can be done in the digital domain. Furthermore, it is interesting to see in Fig. 11(a) that for an OFDM signal, EVM and ACPR are more sensitive to main-peak timing mismatch than to AM-PM timing mismatch.
V. CIRCUIT-LEVEL LINEARIZATION
As explained in Section IV-A, a digital Doherty PA is, in fact, so nonlinear that even with an ideal DPD, the nonlinearity lowers the effective number of bits, thus reducing the DR of the output signal [32]. In this work, the DPAs are made intrinsically linear by using three different circuit-level techniques: nonlinear sizing and overdrive-voltage control for ACW-AM correction, and multiphase RF-clocking for ACW-PM correction [16]-[18]. As a result, not only is the burden on DPD for strict cellular wireless standards reduced, but the ACW-AM and ACW-PM distortions are also corrected well enough to pass the WiFi mask even without using DPD. In the following, these techniques are described in detail.
A. ACW-AM Correction
In a conventional DPA, as shown in Fig. 7(a), the sub-PA cells in the array are sized linearly, meaning that as the input ACW increases, the effective width of the total active cells ($W_{Eff}$) increases linearly. Linear sizing can result in substantial ACW-AM distortion, as shown in Fig. 7(b). Assuming a width of $W_0$ for a unit cell, the effective width of the array is $ACW \cdot W_0$. In this work, as shown in Fig. 12(a), in order to linearize the ACW-AM conversion, the sub-PA cells in both the main and peak DPAs are sized nonlinearly, meaning that as the ACW increases, the effective size of the total active cells increases nonlinearly. Assuming an $N$-bit fully thermometer-coded array comprising $2^N - 1$ cells, the transistors corresponding to small ACWs are sized smaller than $W_0$, and the transistors corresponding to large ACWs are sized larger than $W_0$. This yields a linear ACW-AM conversion, as shown in Fig. 12(b). By calculating the inverse function of (6) for the main and peak DPAs, and then scaling its maximum to the same total width of $ACW_{MP,Max} \cdot W_0$, the widths of the main DPA transistors corresponding to each $ACW_M$ are initially calculated by
The widths of the peak DPA transistors corresponding to each $ACW_P$ are given in (8), where $F_{WP}$ is given in (9).
Due to the impact of other nonidealities, it is more practical to extract $W_{Eff}$ by actually simulating the ACW-AM curve of a linearly sized DPA, then calculating the inverse curve and scaling its maximum to $ACW_{MP,Max} \cdot W_0$. Using (7), the width of each transistor corresponding to each ACW is calculated by $W_{Eff,NL}[ACW] - W_{Eff,NL}[ACW - 1]$. Obviously, for a nonlinearly sized $N$-bit DPA, this results in $2^N - 1$ different transistor sizes, requiring full thermometer coding, which results in high power consumption in the driver stages. Therefore, in order to benefit from the well-known binary-unary segmentation [46] to reduce the array complexity and power consumption, segmented nonlinear sizing is used in this work. In a segmented nonlinearly sized DPA, as shown in Fig. 13(a), the array is divided into N segments with the same ACW range but different total sizes. Although the effective size of the active cells inside each segment increases linearly, the overall effective size of the total active cells increases nonlinearly such that the resulting $W_{Eff}[ACW]$ curve is a piecewise linear version of the original fully nonlinearly sized $W_{Eff}[ACW]$ curve. Since the cells inside each segment are sized linearly, it is possible to apply binary-unary segmentation to reduce the power consumption of the drivers. Increasing the number of segments improves the overall linearity. It has been shown in [17] that eight segments are enough to lower the ACPR and EVM to an acceptable level with enough margin for other sources of nonlinearity. The simulated ACW-AM curve of a segmented nonlinearly sized Doherty DPA with eight segments in each of the main and peak DPAs is plotted in Fig. 13(b), showing significant improvement in ACW-AM linearity over a Doherty DPA using uniformly sized arrays.
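The extraction procedure described above (tabulate the ACW-AM curve of a linearly sized array, invert it, take differences, then group the result into segments) can be sketched as follows. The tabulated curve here is a hypothetical compressive placeholder standing in for a simulated curve, `K_NL` is an assumed value, widths are expressed in units of $W_0$, and the rounding of segment boundaries is a simplification.

```python
from bisect import bisect_left

def am_linear_array(k, k_max, k_nl):
    """Placeholder ACW-AM curve of a linearly sized array (stand-in for a simulated curve)."""
    return (k / (1 + k_nl * k)) * ((1 + k_nl * k_max) / k_max)

def inverse_interp(curve, y):
    """Fractional index x such that curve(x) = y, by linear interpolation.

    curve must be strictly increasing and start at 0.
    """
    i = bisect_left(curve, y)
    if i == 0:
        return 0.0
    if i >= len(curve):
        return float(len(curve) - 1)
    y0, y1 = curve[i - 1], curve[i]
    return (i - 1) + (y - y0) / (y1 - y0)

def nonlinear_sizing(am_lin, w0=1.0):
    """Return (W_eff_NL per ACW, per-cell widths) that make ACW-AM linear."""
    k_max = len(am_lin) - 1
    top = am_lin[-1]
    # For each ACW, find the effective width that gives a linearly increasing amplitude.
    w_eff = [w0 * inverse_interp(am_lin, top * acw / k_max) for acw in range(k_max + 1)]
    # Width of the cell switched on at each ACW: W_eff_NL[ACW] - W_eff_NL[ACW - 1].
    cells = [hi - lo for lo, hi in zip(w_eff, w_eff[1:])]
    return w_eff, cells

def segment_totals(w_eff, n_seg=8):
    """Total width of each segment in a piecewise-linear (segmented) approximation."""
    k_max = len(w_eff) - 1
    edges = [round(j * k_max / n_seg) for j in range(n_seg + 1)]
    return [w_eff[hi] - w_eff[lo] for lo, hi in zip(edges, edges[1:])]

K_MAX, K_NL = 511, 0.01   # K_NL is hypothetical
am_lin = [am_linear_array(k, K_MAX, K_NL) for k in range(K_MAX + 1)]
w_eff, cells = nonlinear_sizing(am_lin)
segments = segment_totals(w_eff)

print("first/last cell width (in W0):", round(cells[0], 3), round(cells[-1], 3))
print("segment totals (in W0):", [round(s, 1) for s in segments])
```

With this placeholder curve, the cells at small ACWs come out narrower than $W_0$ and those at large ACWs wider, while the total width stays at $ACW_{MP,Max} \cdot W_0$. Cells inside a segment would then share the same width, which is what makes binary-unary coding possible within each segment.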
As can be seen from (7) and (8) for a nonlinearly sized array, the optimum sizes of the sub-PA cells depend on the nonlinearity factor $K_{NL} = R_L/R_{D0}$. However, after fabrication or during operation of the chip, the load or frequency may change, which will change $R_L$. In addition, process/voltage/temperature (PVT) variations will change $R_{D0}$, thus shifting $K_{NL}$ from its design value. Consequently, the resulting ACW-AM curve will deviate somewhat from its optimum linearity in a practical implementation. To correct for this, $K_{NL}$ can be tuned through the ON-resistance of the transistors, which is in turn controlled by the overdrive voltage $V_{OD}$ [47]. To facilitate this, the $V_{DD}$ of the buffers that drive the output transistors is tuned. This changes the peak voltage of the RF clock and hence the overdrive voltage $V_{OD} = V_{GS} - V_{TH}$, so that the ACW-AM curve can be linearized back to its desired level. The details of the circuit-level implementation are described in Section VI.
B. ACW-PM Correction
In a conventional DPA with single-phase RF-clocking, all of the sub-PA cells are driven by the same (modulated) RF clock. In energy-efficient class-E-like DPAs that rely on reactive loading, the variation of the ON-resistance of the transistor with ACW yields significant ACW-PM distortion, as shown in Fig. 9(b). To correct for this, a concurrent multiphase RF-clocking technique is used to reduce the ACW-PM conversion. For a single DPA line-up, the resulting AM-PM curve using five multiphase RF clocks is shown in Fig. 14. In this technique, different phases of the RF clock are applied to various segments of the DPA array. The output currents of these segments are summed; thus, the overall output phase is averaged, resulting in a considerable reduction of ACW-PM distortion.
The delayed RF clocks are generated by a bank of delay lines. Since their delay can change due to PVT variations, or the ACW-PM curve itself can also change due to variations in the load or frequency, the delay lines are designed to be partly digitally programmable in order to compensate for the PVT/load/frequency variations. Once the ACW-PM is flattened, the normalized ACW-AM curve will be still almost identical to that of a single-phase nonlinearly sized DPA. Therefore, no dynamic modification is needed for each ACW, and once the delay-offsets are programmed, they are fixed during the normal operation. The required phase-offsets are roughly proportional to the phase error of each segment with respect to the output phase at maximum ACW. In practice, during the design process or chip operation, the delay-offsets can be found using an iterative algorithm as proposed in [17] . 
VI. IMPLEMENTATION AND FABRICATION

A. CMOS Chips
Since the load seen by the main and peak DPAs is not the same (except at peak power), their ACW-AM and ACW-PM curves are also different. Therefore, two chips with the same structure but different nonlinearly sized segments and delay-offsets have been designed. The overall block diagram of the chips as well as the conceptual layout of the nonlinearly sized array are shown in Fig. 15. Since the sub-PA cells of the eighth segment are very large, they are implemented in two parallel rows, each with half the size of segment 8, as shown in Fig. 15(b). The arrays of both the main and peak DPAs are 9 bit, each with a total width of 2.555 mm distributed over eight segments with different sizings, as shown in Fig. 15(c). Each segment consists of 16 thermometer-coded MSB cells and three LSB cells, which are 1/16 and 1/64 the total size of each segment, respectively. In order to control the overdrive voltage, a programmable on-chip low-dropout (LDO) voltage regulator has been designed, as depicted in Fig. 16(a) [17]. The input reference voltage of the LDO is controlled by a 6-bit R-2R digital-to-analog converter (DAC), while the output voltage supplies the positive dc voltage of the buffers that drive the output transistors. In each chip, there is only one LDO for the whole array. The LDO is capable of driving 50 mA with a resolution of 10-12 mV. The input RF-clock and BB-clock are amplified by on-chip differential amplifiers and then converted to single-ended clocks. Although the input RF-clock amplifier and the digital buffers are designed to have a 50% duty-cycle, in practice, due to PVT variations, the duty-cycle might change, degrading the output power/efficiency or linearity. Therefore, an on-chip 6-bit programmable automatic DCC circuit, shown in Fig. 16(b), has been designed to compensate for such practical nonidealities.
The DCC monitors the dc voltage of the RF-clock and compares it with a reference voltage supplied by a 6-bit R-2R DAC, then adjusts the dc voltage of the RF-clock path. Because of the voltage clipping caused by the digital buffers, changing the offset voltage of the RF-clock modifies the duty-cycle within a control range of 33%-66%. The output of the DCC is applied to the multiphase RF-clocking generator, which consists of five fine-resolution single-ended delay lines. The output of the first to fifth delay-offsets is applied to the segments 1-2, 3-4, 5-6, and 7-8, respectively. The required resolution of the delay-offsets is less than 6 ps, which is realized by 4-bit programmable Vernier (relative) delay lines to cover the PVT/frequency/load variations [17]. The outputs of the delay lines are converted to differential signals before being applied to the DPA array. Furthermore, clock gating is applied in the paths of the RF clocks to reduce the drivers' power consumption in PBO. In order to correct for the timing mismatch between the AM and PM paths, a digital ten-tap finite-impulse response (FIR) filter is implemented on-chip as a fractional delay cell [48] in the path of the ACW data, as depicted in Fig. 15(a). The coefficients of the filter are given by $h[n] = \sin[\pi(n - \Delta)]/[\pi(n - \Delta)]$, in which $n$ is the tap index and $\Delta$ is the desired delay as a fraction of the sampling time $T_S = 1/F_S$, which is the group delay of the FIR filter. For example, for a delay of 200 ps with a 500-MHz sampling rate, the impulse response (coefficients) and frequency response of the FIR filter are plotted in Fig. 17(a) and (b). The chips are fabricated in 40-nm bulk CMOS. The core area of each DPA including the multiphase RF-clocking and LDO blocks is 0.8 mm × 0.3 mm. The die micrograph of the two chips (main and peak DPA) is shown in Fig. 18. The LDO settings, delay-offsets, and coefficients of the FIR filter are programmed via an SPI interface.
The input ACW data are also loaded via the SPI interface to an on-chip 4-K sample SRAM memory. During the normal operation, the stored ACW data words are read out in a loop to be fed to the DPA array using the BB-clock.
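As an illustration of the fractional-delay FIR filter described above, the following Python sketch evaluates truncated-sinc coefficients for the 200-ps, 500-MHz example. The function names, the bulk delay of four samples, and the absence of any windowing are assumptions made for this sketch; the text does not state how the ten taps are centered or windowed on chip.

```python
import math

def fractional_delay_taps(num_taps, delay_samples):
    """Truncated-sinc taps h[n] = sin(pi*(n - D)) / (pi*(n - D))."""
    taps = []
    for n in range(num_taps):
        x = n - delay_samples
        if abs(x) < 1e-12:
            taps.append(1.0)  # limit of sin(pi*x)/(pi*x) as x -> 0
        else:
            taps.append(math.sin(math.pi * x) / (math.pi * x))
    return taps

def group_delay_at_dc(taps):
    """Group delay (in samples) of a real FIR filter at zero frequency."""
    return sum(n * h for n, h in enumerate(taps)) / sum(taps)

F_S = 500e6                    # sampling rate from the text
T_S = 1.0 / F_S                # 2-ns sampling period
fraction = 200e-12 / T_S       # 200 ps expressed in samples (0.1)
BULK_DELAY = 4                 # assumption: integer delay that centers the sinc

taps = fractional_delay_taps(10, BULK_DELAY + fraction)
dc_delay = group_delay_at_dc(taps)

for n, h in enumerate(taps):
    print(f"h[{n}] = {h:+.5f}")
print(f"DC group delay = {dc_delay:.3f} samples")
```

With these assumed settings the largest tap sits at n = 4 and the DC group delay evaluates to about 4.1 samples, that is, the assumed bulk delay plus the 0.1-sample fraction.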
B. Balun and Matching Network
In this work, the compensated impedance inverter is combined with a Marchand balun with reentrant coupled lines to form the wideband load network of the proposed Doherty DPA, as depicted in Fig. 19(a). The reentrant coupled lines are adopted to achieve tight coupling without violating the stringent fabrication design rules. The design parameters for the reentrant-type coupled lines are ε r1 = 10.2, H 1 = 0.13 mm and ε r2 = 3.0, H 2 = 0.75 mm. The width of the top layer metal lines is W 1 = 1.5 mm with S = 0.2 mm spacing, and the width of the middle metal layer is W 2 = 3.2 mm, resulting in Z 0e = 71 Ω and Z 0o = 7.5 Ω impedances for the main DPA. The even- and odd-mode wavelengths at f 0 = 2.5 GHz are around λ even = 59 mm and λ odd = 36 mm. The λ odd /4 and λ odd /2 reentrant coupled (differential) TL sections are placed in front of the main and peak DPA, respectively (as described in Section II-B), to make a wideband compensated impedance inverter, and also to connect them to the Marchand balun.
The Marchand balun is optimized to compensate for the imperfect ground (via inductance) and port transitions. A λ/4 transmission line is placed after the Marchand balun to match to 50 Ω. Moreover, since it is not practically easy to make blind vias (e.g., from the middle layer to the bottom), two islands with a through via from the top metal layer to the bottom ground plate, as shown in Fig. 19(c), are placed at an optimized distance in front of the main and peak DPA to provide a second-harmonic open impedance. Due to nonideal effects, in practical situations, this distance is slightly different from λ even /8. Furthermore, besides the use of the compensated λ odd /4-λ odd /2 Doherty power combiner, the succeeding cascaded impedance-stepped TL sections further increase the bandwidth.
C. Overall Implementation
The main and peak chips are mounted on an FR-4 PCB, while the Marchand balun has been implemented separately on a two-layer Rogers material, as shown in Fig. 20(a). The top layer of the matching network is Rogers-3010, and the bottom layer is Rogers-3003. Both of the PCBs are mounted on an FR-4 substrate as the base. The area of the matching network PCB is 41.4 mm × 32 mm. The inductances of the shunt and series resonators are implemented by three and four parallel wirebonds, respectively, as shown in Fig. 20(b). Chip capacitors are used to complete the implementation of the series resonator and to realize the decoupling capacitors of the dc feed. The assembly structure of the PCBs with DPA chips, wirebonds, and the chip capacitors for RF and dc feed connection and decoupling is shown in Fig. 20(b). Transformer-based RF baluns are used on the FR-4 PCB to convert the single-ended clocks to differential ones before feeding them to the DPA chips.
VII. MEASUREMENT RESULTS
A. Static Measurements
Two different measurement setups are used for the static and dynamic measurements as shown in Fig. 21(a) and (b), respectively. In the static measurements, a signal generator with a power divider is used to provide the BB-clock to both DPAs, and another signal generator with a hybrid power divider is used to provide two RF clocks with a 90° phase difference for the main and peak DPAs. The BB-clock frequency is 500 MHz, which is limited by the readout speed of the SRAM used to store the data, and the RF clock varies between 2 and 3 GHz.
1) Power/Efficiency Measurement: The output power (P OUT ), DE, and power-added efficiency (PAE) are measured with different VDDs ranging from 0.5 to 0.7 V, for continuous-wave (CW) output signals over the frequency range of 2-3 GHz, both at full power (ACW M = ACW P = 511) and backoff power (ACW M = 511, ACW P = 0). The ACW data are generated in MATLAB and then loaded into the on-chip SRAMs. The output power is measured using a power meter. The PAE includes the power consumption of all the main building blocks on the chips, such as the sub-PA drivers, digital decoders/encoders, multiphase RF clocking circuit, DCC, and LDO. The measured peak and backoff output powers over the 2-3-GHz range are shown in Fig. 22(a); they range from 16 to 19 dBm and from 10.6 to 13.6 dBm, respectively.
The measured peak and backoff DE over the 2-3-GHz range are shown in Fig. 22(b) . As can be seen, DE is more than 50% within the 2.35-2.8-GHz frequency range. The efficiency at backoff power is within 10% of its maximum value over a 750-MHz span, equivalent to 30% relative bandwidth. At F C = 2.4 GHz with V DD Main =0.6 V and V DD Peak =0.7 V, the peak/backoff DE and PAE are 57%/52% and 36%/25%, respectively. The DE and PAE are plotted versus output power in Fig. 22(b) and (c) showing a well-shaped Doherty efficiency curve.
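The figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below converts the reported 2.4-GHz output powers to milliwatts, computes the relative bandwidth, and estimates the dc power implied by the DE and PAE values. It assumes DE = P OUT /P DC,drain and PAE = P OUT /P DC,total (the RF input power is not separated out) and takes 2.5 GHz as the center frequency; these definitions and all variable names are assumptions of this sketch rather than statements from the measurements.

```python
def dbm_to_mw(p_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (p_dbm / 10)

def implied_dc_power_mw(p_out_mw, efficiency):
    """DC power implied by an output power and an efficiency (0..1)."""
    return p_out_mw / efficiency

# Reported values at 2.4 GHz (VDD main/peak = 0.6 V/0.7 V).
p_peak_mw = dbm_to_mw(17.5)
p_backoff_mw = dbm_to_mw(12.2)
backoff_db = 17.5 - 12.2

# 750-MHz span around an assumed 2.5-GHz center frequency.
relative_bw = 750e6 / 2.5e9

# Assumed definitions: DE uses drain power only, PAE uses total chip power.
drain_peak_mw = implied_dc_power_mw(p_peak_mw, 0.57)
total_peak_mw = implied_dc_power_mw(p_peak_mw, 0.36)
drain_backoff_mw = implied_dc_power_mw(p_backoff_mw, 0.52)
total_backoff_mw = implied_dc_power_mw(p_backoff_mw, 0.25)

overhead_peak_mw = total_peak_mw - drain_peak_mw
overhead_backoff_mw = total_backoff_mw - drain_backoff_mw

print(f"Peak: {p_peak_mw:.1f} mW, backoff: {p_backoff_mw:.1f} mW "
      f"({backoff_db:.1f} dB below peak)")
print(f"Relative bandwidth: {100 * relative_bw:.0f}%")
print(f"Implied overhead power: {overhead_peak_mw:.1f} mW (peak), "
      f"{overhead_backoff_mw:.1f} mW (backoff)")
```

Under these assumptions the implied overhead power (drivers, clocking, decoders, DCC, and LDO) comes out lower at backoff than at peak power, which is consistent with the clock gating applied in the RF-clock paths.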
2) Linearity Measurement: Using a similar measurement setup, the static linearity is measured using a spectrum analyzer at the output to downconvert and digitize the output signal to digital baseband. Since the input signal to the DPAs is digital, it is trivial to generate a perfect quantized triangle (or ramp) signal for measuring the ACW-AM and ACW-PM conversion curves. For this purpose, a 4096-sample triangle signal is generated in MATLAB, from which the main ACW M and peak ACW P signals are created and then loaded into the on-chip SRAMs. These SRAMs are read out in a loop with a 500-MHz clock frequency, creating a 122.07-kHz triangle waveform as the input signal for the DPA branches. The digitally downconverted output data are processed in MATLAB to extract the ACW-AM and ACW-PM curves. The delay mismatch between the main and peak branches is also measured. The integer part is compensated in MATLAB, while the fractional part is programmed into the chips. The resulting static linearity curves, measured at F C = 2.5 GHz, are plotted in Fig. 23. As can be seen, compared to a Doherty DPA with conventional segmentation, the proposed Doherty DPA shows a significant improvement in linearity without using any kind of DPD.
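The repetition rate of the test waveform follows directly from the SRAM depth and the readout clock. The Python sketch below generates a 4096-sample quantized triangle and computes that rate. The symmetric up/down shape, the 9-bit full scale of 511, and the function name are assumptions for illustration; the split of the triangle into the main and peak code words is not modeled here.

```python
F_BB = 500e6        # BB-clock (SRAM readout) frequency from the text
N_SAMPLES = 4096    # SRAM depth used for the triangle
MAX_CODE = 511      # assumption: full scale of a 9-bit amplitude code

def triangle_codes(n_samples, max_code):
    """Quantized symmetric triangle: ramp up over half the samples, then down."""
    half = n_samples // 2
    ramp_up = [round(max_code * k / (half - 1)) for k in range(half)]
    return ramp_up + ramp_up[::-1]

codes = triangle_codes(N_SAMPLES, MAX_CODE)

# One full pass through the SRAM is one period of the triangle.
f_triangle = F_BB / N_SAMPLES

print(f"{len(codes)} samples, codes from {min(codes)} to {max(codes)}")
print(f"Triangle repetition rate: {f_triangle / 1e3:.2f} kHz")
```

Reading 4096 samples at 500 MHz gives 500 MHz / 4096 ≈ 122.07 kHz, matching the value quoted above.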
B. Modulated Signal Measurements
The proposed Doherty DPA is also measured with modulated signals using the measurement setup shown in Fig. 21(b). The input I/Q signal is converted to digital AM and PM signals in MATLAB with F S = 500 MHz. A 12-GSa/s arbitrary waveform generator (AWG) is used for generating phase-modulated RF signals. For this purpose, the phase data are upconverted in MATLAB to a 2.5-GHz sine wave and then loaded into the AWG. The AM data are converted to ACW M and ACW P and loaded into the on-chip SRAM memories running at 500 MHz. The BB-clocks are generated using a signal generator with a power divider. Similar to the static measurements, the delay mismatch between the main and peak branches, as well as between the AM and PM signals, is compensated in MATLAB for the integer part and in the on-chip FIR filters for the fractional part. The output spectrum of a 16-MHz OFDM signal with PAPR = 8.1 dB is measured with and without using DPD as shown in Fig. 24(a) and (b). The measured ACPR and EVM without DPD are −41 dBc and −36 dB, respectively. By using a simple DPD based on iterative learning control (ILC) with LUT [32], the measured ACPR and EVM are −52 dBc and −50 dB, respectively. The measured ACW-AM and ACW-PM curves of the 16-MHz OFDM signal, with and without ILC DPD, are shown in Fig. 24(c) and (d). The output spectrum of a 32-MHz OFDM signal is also measured with the ILC DPD, as shown in Fig. 25. The measured ACPR and EVM with ILC DPD are −48 dBc and −48 dB, respectively. Table I summarizes the performance of this work and compares it with the state of the art in digital Doherty PAs.
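The I/Q-to-polar conversion that precedes these measurements can be sketched as follows. This Python example is illustrative only: the function names, the use of a four-quadrant arctangent, and the 9-bit amplitude quantization against a chosen full scale are assumptions, and the mapping from the amplitude code to the separate main and peak code words is deliberately left out.

```python
import cmath
import math

def iq_to_polar(i, q):
    """Return (amplitude, phase in radians) of one I/Q sample."""
    z = complex(i, q)
    return abs(z), cmath.phase(z)   # phase() is a four-quadrant arctangent

def papr_db(samples):
    """Peak-to-average power ratio of complex samples, in dB."""
    powers = [abs(z) ** 2 for z in samples]
    return 10 * math.log10(max(powers) / (sum(powers) / len(powers)))

def quantize_amplitude(am, full_scale, bits=9):
    """Map an amplitude to an unsigned code, clipped to the available range."""
    top = 2 ** bits - 1
    code = round(am / full_scale * top)
    return max(0, min(top, code))

# Small hypothetical example: three I/Q samples.
iq = [complex(3, 4), complex(0, 1), complex(-1, 0)]
full_scale = max(abs(z) for z in iq)
for z in iq:
    am, pm = iq_to_polar(z.real, z.imag)
    print(f"AM = {am:.3f}, PM = {pm:+.3f} rad, "
          f"code = {quantize_amplitude(am, full_scale)}")
print(f"PAPR = {papr_db(iq):.2f} dB")
```

A four-quadrant arctangent is used here so that the phase remains well defined when I is zero or negative.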
VIII. CONCLUSION
A highly linear wideband Class-E CMOS digital Doherty PA is presented. Closed-form equations are extracted to predict the AM-AM and AM-PM curves. By using a wideband Marchand balun with reentrant coupled lines for the output matching network, more than 50% DE at ∼6-dB PBO within the 2.35-2.8-GHz frequency range is achieved. The DE at 6-dB PBO is within 10% of its maximum value over a 750-MHz span, equivalent to 30% relative bandwidth. The measured peak/6-dB PBO P OUT , DE, and PAE at 2.4 GHz are 17.5 dBm/12.2 dBm, 57%/52%, and 36%/25% with VDD main/peak = 0.6 V/0.7 V. The linearity has been significantly improved by nonlinear sizing of the DPA arrays along with overdrive-voltage control and concurrent multiphase RF clocking, as well as by accurately compensating the time/phase mismatch between the peak and main branches and between the AM and PM signals. In order to achieve the maximum intrinsic linearity, two different chips with the same architecture, but with different design parameters, are fabricated as the main and peak amplifiers. Measured results show −41-dBc ACPR and −36-dB EVM for a 16-MHz OFDM signal at 2.5 GHz without using DPD. By using DPD, the measured ACPR and EVM of the 16-MHz OFDM signal are −52 dBc and −50 dB, respectively. For a 32-MHz OFDM signal, the measured ACPR and EVM with DPD are −48 dBc and −48 dB, respectively.
The proposed concept in this article is scalable to higher power levels. The future versions will include on-chip phase modulators and complete (both integer and fractional) delay calibration blocks, eliminating the need for any off-chip PM or signal processing.
