I. INTRODUCTION
Recently, orthogonal frequency division multiplexing (OFDM) has been considered a promising candidate for high-data-rate transmission in mobile environments. One serious drawback is the high peak-to-average power ratio (PAPR) of OFDM signals, which results in poor efficiency of the linear power amplifier (PA). Linear amplification with nonlinear components (LINC) [1] has been proposed to maintain PA efficiency along with good linearity, but the overall efficiency is still degraded by the isolated combiner as the outphasing angle between the two branches increases. Nonisolated combiners have been proposed to improve the combining efficiency, but at the cost of linearity [2]. Accordingly, uneven multi-level LINC (UMLINC) [3] has been suggested to improve the combining efficiency by reducing the probability of large outphasing angles. To apply these techniques, PA implementations with two gain modes have been proposed [4], [5]. However, these high-efficiency systems still require a low-power signal component separator (SCS) that separates the signal into two constant-envelope signals and provides the corresponding gain-mode controls.
Designing an SCS that achieves accurate separation with low power overhead is challenging. For high separation accuracy, a digital implementation [6] is generally considered the best choice. Unfortunately, a digital SCS requires four DACs, and both the DACs and the DSP consume considerable power because the nonlinear separation process demands a high operating speed. Several analog designs have been proposed to reduce the power cost [7], [8], but nonlinear mathematical functions are difficult to realize in analog circuits. Moreover, the published chips still consume tens of milliwatts, a significant overhead for the transmitter. A sub-mW SCS was proposed in our previous work [9]. However, the state-of-the-art solutions consider only conventional LINC and cannot be applied to UMLINC systems.
Accordingly, this work presents an all-digital SCS for UMLINC with minimal power overhead. First, this work discusses how branch mismatch limits the PA gain levels and finds the optimal level for maximal efficiency. For the SCS implementation, a multi-level phase calculator (MLPC) calculates the required phases and the corresponding PA gain-mode controls. To reduce the power of the high-speed calculations, the DSP functions operate at 0.5 V. Instead of four DACs, a power-of-two-based digitally controlled phase shifter (DCPS) pair with a continuous process-voltage-temperature (PVT) monitor is proposed to generate the phase-modulated signals accurately. A source gating scheme is applied in the DCPS to reduce the power overhead without sacrificing linearity. In addition, the SCS provides mismatch compensation to enhance the system performance.
This paper is organized as follows. Section II introduces the system behavior. Section III describes the chip implementation details. Section IV shows the system evaluation results, followed by the conclusion in Section V.
II. SYSTEM OVERVIEW

A. UMLINC Principle
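Denote the baseband signal as $S(t) = S_i(t) + jS_q(t)$; the transmitted signal $S(t)$ can then be generally expressed as
$$S(t) = A(t)\,e^{j\theta(t)},$$
where $A(t)$ and $\theta(t)$ are the envelope and phase given by
$$A(t) = \sqrt{S_i^2(t) + S_q^2(t)}, \qquad \theta(t) = \tan^{-1}\!\left(\frac{S_q(t)}{S_i(t)}\right). \tag{1}$$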
As shown in Fig. 1(a), the concept of UMLINC is to separate the original signal $S(t)$ into two phase-only-modulated signals $S_1(t)$ and $S_2(t)$ with minimal outphasing angles to achieve maximal efficiency [3]. This work assumes that the high-efficiency PAs provide two gain modes: a high gain $V_H$ and a low gain $V_L$ that is a fixed fraction of $V_H$ (the gain ratio). Denote $S_1(t)$ and $S_2(t)$ as
(2)
where the outphasing angles $\phi_1(t)$ and $\phi_2(t)$ can be derived:
(3)
$V_1(t)$ and $V_2(t)$, representing the PA gains, are $V_H$ or $V_L$ as decided by the envelope regions shown in Fig. 1(b). Since $S_1(t)$ and $S_2(t)$ only contain phase information, they can be generated directly by two phase modulators (PMs) with the calculated phases $\theta(t) + \phi_1(t)$ and $\theta(t) - \phi_2(t)$.
978-1-4577-0704-9/10/$26.00 ©2011 IEEE
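The separation step in this subsection can be illustrated numerically. The following is a minimal sketch, not the paper's exact formulation: it assumes the combiner output is the half-sum of the two branches, each being a unit-magnitude phase-modulated signal scaled by its PA gain, and it obtains the outphasing angles from the law of cosines on the triangle with sides 2A, V1, and V2. The function name `umlinc_separate` and the half-sum model are illustrative assumptions.

```python
import cmath
import math


def umlinc_separate(s, v1, v2):
    """Split a complex sample s into two unit-magnitude phase-only signals.

    Assumed combiner model (an assumption, not taken from the source):
        s = (v1 * s1 + v2 * s2) / 2
    with s1 = exp(j(theta + phi1)) and s2 = exp(j(theta - phi2)).
    """
    a = abs(s)                # envelope A
    theta = cmath.phase(s)    # phase theta
    # The pair (v1, v2) can only reach envelopes in this range.
    if not (abs(v1 - v2) / 2 <= a <= (v1 + v2) / 2):
        raise ValueError("envelope outside the range covered by (v1, v2)")
    if a == 0.0:
        # Only reachable when v1 == v2: the two branches cancel exactly.
        phi1 = phi2 = math.pi / 2
    else:
        # Law of cosines on the triangle with sides 2a, v1, v2.
        c1 = (4 * a * a + v1 * v1 - v2 * v2) / (4 * a * v1)
        c2 = (4 * a * a + v2 * v2 - v1 * v1) / (4 * a * v2)
        # Clamp to guard against rounding just outside [-1, 1].
        phi1 = math.acos(max(-1.0, min(1.0, c1)))
        phi2 = math.acos(max(-1.0, min(1.0, c2)))
    s1 = cmath.exp(1j * (theta + phi1))
    s2 = cmath.exp(1j * (theta - phi2))
    return s1, s2, phi1, phi2
```

With equal gains the expressions reduce to the conventional LINC angle, arccos(A/V), which is a quick sanity check on the assumed model.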
B. Branch Mismatch Issue and Optimal Level Decision
Branch mismatch degrades the linearity significantly, and the issue is more complex in UMLINC because the mismatches differ when the gain mode switches. Since $V_H$ and $V_L$ reflect the PA gains, gain mismatches between the two branches or the two gain modes affect the correctness of the separation. Taking the upper branch in high-gain mode as the reference, the gain mismatches of the other cases are denoted $\kappa_{L1}$, $\kappa_{H2}$, and $\kappa_{L2}$, so the exact gains are $V_{L1} = \kappa_{L1}V_L$, $V_{H2} = \kappa_{H2}V_H$, and $V_{L2} = \kappa_{L2}V_L$. Therefore, $V_1(t)$ and $V_2(t)$ in (3) should be set to these calibrated gain values so that the gain mismatch is accounted for during the signal separation [9]. Similarly, the phase mismatch can be compensated by adding the corresponding phase offsets in the different gain modes. Another consideration is the region cover range: the boundaries $V_{LT}$ and $V_{HT}$ should also be adjusted to ensure the correct cover range. Therefore, this work proposes modified region boundaries that take the calibrated gains into account.
To achieve maximum efficiency, the choice of gain ratio is important, but the region cover range must also be considered to avoid signal distortion. One limitation is that the minimal magnitude of Region 2, $(V_H - V_L)/2$, should be smaller than or equal to the maximal magnitude of Region 1, $V_L$; hence the gain ratio $V_L/V_H$ should be at least 1/3. Using a 64-QAM 64-point OFDM signal as the source, Fig. 2(a) shows the average combining efficiency for different gain ratios, and the optimal gain ratio is 1/3. To avoid cover range violations caused by gain mismatch, the gain ratio is set to 0.42 in this work (2.01 dB gain mismatch tolerance), corresponding to a maximum average efficiency of 44.82%. Fig. 2(b) compares the average combining efficiency of LINC and UMLINC: it improves from 13.08% to 44.82%, roughly a 3.4x improvement.
Figure 3 presents the chip block diagram and signal paths. The source signal from the baseband modem, with 5 MHz bandwidth, is first 8x interpolated by an interpolation filter (a linear-phase 81-tap Kaiser FIR filter) to extend the operating bandwidth for the nonlinear calculation. The MLPC, operating at 40 MHz, then calculates the 8-bit phase codewords and the gain-mode controls for the PMs and PAs, respectively. The mapper converts the phase codewords to control codewords for the DCPS pair, which acts as the PMs in this work, so that the desired phase-modulated signals are generated at the 80 MHz IF. The gain and phase mismatch values of the different PA gain modes can be programmed into registers through the serial peripheral interface (SPI), and the proposed SCS can compensate for the mismatch during transmission.
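The gain-ratio figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below assumes that the mismatch tolerance is the margin, in dB, between the chosen gain ratio and the 1/3 lower bound; the function names are illustrative and do not come from the paper.

```python
import math


def region2_reaches_region1(ratio, v_h=1.0):
    """Cover-range condition: (V_H - V_L)/2 <= V_L, with V_L = ratio * V_H."""
    v_l = ratio * v_h
    return (v_h - v_l) / 2 <= v_l


def mismatch_tolerance_db(chosen_ratio, min_ratio=1.0 / 3.0):
    """Margin between the chosen gain ratio and the lower bound, in dB.

    Interpreting the quoted tolerance as this margin is an assumption.
    """
    return 20.0 * math.log10(chosen_ratio / min_ratio)


if __name__ == "__main__":
    print(region2_reaches_region1(0.42))   # the chosen ratio satisfies the bound
    print(region2_reaches_region1(0.30))   # a ratio below 1/3 leaves a gap
    print(round(mismatch_tolerance_db(0.42), 2))
```

Under this interpretation, a gain ratio of 0.42 sits about 2.01 dB above the 1/3 bound, which agrees with the tolerance stated in the text.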
III. CHIP IMPLEMENTATION
To achieve a low-power and accurate SCS, the chip is partitioned into two independent power domains with different supply voltages. The default domain uses 1.0 V to interface with the IO pads, and the DCPS pair also uses 1.0 V to achieve accurate results. The high-speed DSP, including the interpolation filter, the MLPC, and the mapper, operates from a 0.5 V supply to reduce the power cost. To apply voltage scaling within a standard-cell-based design flow, the cell behavior and timing information under a 0.5 V supply are simulated and re-characterized, and the cell library is then rebuilt from the cells that still function correctly. With the rebuilt cell library, the DSP functions can be implemented using the standard-cell-based design flow.
A. MLPC Design
Figure 4 is the detailed block diagram of the MLPC. The main task of the MLPC is to dynamically calculate $\theta(t)$, $\phi_1(t)$, $\phi_2(t)$, and the gain-mode controls for the PAs according to the input signal magnitude. Before the outphasing angles $\phi_1(t)$ and $\phi_2(t)$ are calculated, the level values and region boundaries that account for gain mismatch are computed in the boundary calculation block. The level decision block then selects the PA gain mode according to the region of the input signal magnitude. It also outputs the corresponding $V_1(t)$ and $V_2(t)$ for the phi calculation, which is the most complex operation in the MLPC. Because voltage scaling increases the propagation delay, the critical path becomes a design challenge, especially in the long-bit division. To shorten the critical path, three pipeline stages are inserted into the division to meet the timing requirements. The arc-cosine function is implemented with a look-up table (LUT) to reduce complexity, and the theta calculation in (1) is implemented in the same way.
B. Low Power DCPS Pair with Continuous PVT Monitor
Figure 5 shows the proposed DCPS architecture, which consists of two-stage delay lines with a 10-bit coarse-tune codeword $C$ and a 4-bit fine-tune codeword $F$ to achieve adequate delay range and accuracy. The delay line is based on a power-of-two architecture to avoid a complex codeword encoder, and source gating is proposed to reduce wasted power. An AND cell is added before the delay cells in each coarse-tune sub-stage to prevent unnecessary switching when the sub-stage delay is unused. An AND gate on the non-delay path is also necessary to keep the power-of-two delay property and to eliminate the loading difference between the two paths. The fine-tune stage achieves ps-level resolution by tuning the gate-capacitance loading. In addition, the DCPS pair can provide a specified phase offset to eliminate the phase imbalance according to the gain modes from the MLPC.
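The power-of-two delay line with source gating can be captured in a small behavioral model. This is a sketch under stated assumptions: sub-stage k of the coarse stage is taken to hold 2^k identical delay cells that are switched in when bit k of the coarse codeword is set, every sub-stage adds the same AND-gate delay on either path, and the step sizes passed in are hypothetical values rather than measured ones.

```python
COARSE_BITS = 10  # coarse-tune codeword width
FINE_BITS = 4     # fine-tune codeword width


def dcps_delay_ps(c, f, coarse_step_ps, fine_step_ps, and_gate_ps=0.0):
    """Behavioral delay of a two-stage power-of-two delay line, in ps.

    Step sizes and the AND-gate delay are hypothetical parameters.
    """
    if not 0 <= c < (1 << COARSE_BITS):
        raise ValueError("coarse codeword out of range")
    if not 0 <= f < (1 << FINE_BITS):
        raise ValueError("fine codeword out of range")
    # Each sub-stage passes through one AND gate on either path, so the
    # gates contribute a constant offset that is independent of the codeword.
    delay = COARSE_BITS * and_gate_ps
    for k in range(COARSE_BITS):
        if (c >> k) & 1:
            delay += (1 << k) * coarse_step_ps  # sub-stage k holds 2**k cells
    delay += f * fine_step_ps
    return delay


def gated_coarse_cells(c):
    """Number of coarse delay cells kept idle by source gating."""
    total_cells = (1 << COARSE_BITS) - 1
    active_cells = sum((1 << k) for k in range(COARSE_BITS) if (c >> k) & 1)
    return total_cells - active_cells
```

Because the active cell count equals the codeword value itself, small codewords leave most coarse cells gated, which matches the observation that smaller codewords consume less power.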
The mapper converts the 8-bit phase codeword $P$ to the 14-bit $\{C, F\}$ using $\{C^{(M)}, F^{(M)}\}$, the code set corresponding to one IF clock period, and the resolution ratio of the two stages.
Both $\{C^{(M)}, F^{(M)}\}$ and the resolution ratio vary with PVT, so a continuous monitor is required to ensure accuracy. Fig. 6(a) shows the block diagram of the proposed PVT monitor. Two delay lines are duplicated for continuous monitoring without affecting normal SCS operation, and a phase detector (PD) is added to form a delay-locked loop (DLL). By applying the specified codewords $\{C_3, F_3\}$ and $\{C_4, F_4\}$ as shown in Fig. 6(b), the required parameters can be detected continuously by repeating the tracking states. Each repeated tracking takes 5 to 25 cycles to converge, depending on the amount of PVT variation. Because the coarse-tune control codewords of DCPS4 are always small, most of its coarse-tune delay sub-stages are unused and can be gated, so the power overhead is low.
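One plausible way to realize the phase-codeword mapping described in this subsection is sketched below. It is not the paper's mapping equation: it assumes that one IF period spans `c_m * ratio + f_m` fine steps, that the phase codeword selects a proportional fraction of that span, and that the resolution ratio is an integer no larger than 16 so the remainder fits the 4-bit fine codeword. The names `map_phase`, `c_m`, `f_m`, and `ratio` are hypothetical.

```python
PHASE_BITS = 8   # width of the phase codeword P
FINE_BITS = 4    # width of the fine-tune codeword F


def map_phase(p, c_m, f_m, ratio):
    """Map an 8-bit phase codeword to (C, F) under the assumptions above.

    c_m, f_m: codes that span one IF clock period (from the PVT monitor).
    ratio:    coarse step expressed in fine steps (assumed integer).
    """
    if not 0 <= p < (1 << PHASE_BITS):
        raise ValueError("phase codeword out of range")
    if not 1 <= ratio <= (1 << FINE_BITS):
        raise ValueError("ratio must fit the fine-tune range")
    period_fine_steps = c_m * ratio + f_m
    # Round to the nearest fine step using integer arithmetic only.
    half = 1 << (PHASE_BITS - 1)
    target = (p * period_fine_steps + half) >> PHASE_BITS
    coarse, fine = divmod(target, ratio)
    return coarse, fine
```

For example, with `c_m = 100`, `f_m = 0`, and `ratio = 8`, the mid-scale codeword 128 maps to half of the period, that is 50 coarse steps and 0 fine steps. Since the monitor refreshes `c_m`, `f_m`, and `ratio`, the same phase codeword keeps pointing at the same fraction of the period as PVT conditions drift.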
IV. EXPERIMENTAL RESULTS
The proposed SCS chip is fabricated in a 90 nm 1P9M standard CMOS process. Table I shows the DCPS measurement results under different supply voltages; the output delay covers the target period of 12.5 ns. A minimum resolution of 13.86 ps is typically achieved, which implies that a phase compensation step of 0.4° is possible. Although the DCPS performance is affected by the operating environment, the continuous PVT monitor detects the correct $C^{(M)}$, $F^{(M)}$, and resolution ratio for the codeword mapper with a low power overhead of 140 μW. Fig. 7(a) shows the measured DCPS output delay for different phase codewords, providing 8-bit resolution with a root-mean-square (RMS) error of 11.57 ps (0.33°). Fig. 7(b) presents the power consumption of a DCPS operating at 80 MHz. Smaller codewords consume less power because of the source gating scheme on the unused sub-stages. Since the phase codeword is approximately normally distributed during normal operation, the average power of a DCPS is typically 74 μW.
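The delay and phase figures above are related through the 80 MHz IF period. The short sketch below shows the conversion; the function name is illustrative, and the default frequency is the IF stated in the text.

```python
def delay_to_phase_deg(delay_ps, freq_hz=80e6):
    """Convert a time delay in ps to a phase in degrees at freq_hz."""
    period_ps = 1e12 / freq_hz      # 12.5 ns period at 80 MHz
    return 360.0 * delay_ps / period_ps


if __name__ == "__main__":
    print(round(delay_to_phase_deg(13.86), 2))   # minimum resolution
    print(round(delay_to_phase_deg(11.57), 2))   # RMS error
```

The same relation gives the ideal step of an 8-bit phase codeword as 360/256 degrees, or one 256th of the IF period in time.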
To verify the UMLINC system behavior with the proposed SCS, a verification platform is constructed as shown in Fig. 8. The pattern generator (Agilent 16902A) provides the 64-QAM 64-point OFDM baseband signal with 5 MHz bandwidth and 8-bit quantization to the SCS chip, and the branch mismatch values are programmed into the chip registers through the SPI interface from a microcontroller unit (MSP430). The measured SCS output waveforms, including the phase-modulated signals and gain controls, are sampled and stored by the oscilloscope (LeCroy 4000A), which provides a maximum sampling rate of 20 GHz. These data are then fed to a front-end behavioral model (including the filters and the dual-gain PAs with branch mismatch) to evaluate the system performance. Fig. IV shows the output spectrum with 1 dB gain and 10° phase mismatches. By using the proposed SCS with mismatch compensation, the out-of-band radiation is reduced, and a system error vector magnitude (EVM) of -31.06 dB is achieved.
The power of the DSP blocks operating at 40 MHz is reduced to 359 μW by scaling the supply voltage to 0.5 V, a 74.17% power reduction. The chip summary and comparisons are shown in Table II. Only this work presents an SCS implementation for UMLINC systems to further enhance the amplification efficiency. This work also extends the mismatch compensation in [9] to UMLINC for linearity improvement. Although this work integrates more functions and achieves better performance, its power is still lower than that of [9] owing to the modified circuit designs. By using source gating for the DCPSs and voltage scaling for the DSP blocks, the power is reduced by 81.32%, as shown in Fig. 11. The overall SCS chip consumes only 647 μW, a significant improvement compared with state-of-the-art SCS designs.
V. CONCLUSION
An all-digital SCS for UMLINC systems is presented in this work. The optimal gain level and a separation process that accounts for branch mismatch are proposed. The required separation DSP functions are implemented, and their power is reduced by voltage scaling. In addition, the phase-modulated signals are generated by a low-power DCPS pair without DACs. Therefore, an accurate SCS with low power overhead is achieved for high-efficiency UMLINC systems.
