High-Throughput Signal Component Separator for Asymmetric Multi-Level Outphasing Power by Yan Li et al.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 48, NO. 2, FEBRUARY 2013 369
High-Throughput Signal Component Separator
for Asymmetric Multi-Level Outphasing
Power Ampliﬁers
Yan Li, Zhipeng Li, Oguzhan Uyar, Yehuda Avniel, Alexandre Megretski, and Vladimir Stojanović
Abstract—This paper presents an energy-efficient, high-throughput and high-precision signal component separator (SCS) chip design for the asymmetric-multilevel-outphasing (AMO) power amplifier. It uses a fixed-point piece-wise linear functional approximation developed to improve the hardware efficiency of the outphasing signal processing functions. The chip is fabricated in a 45 nm SOI CMOS process and the SCS consumes an active area of 1.5 mm². The new algorithm enables the SCS to run at a throughput of 3.4 GSamples/s, producing the phases with 12-bit accuracy. Compared to traditional low-throughput AMO SCS implementations, at 0.8 GSamples/s this design improves the area efficiency by 25× and the energy efficiency by 2×. This fastest high-precision SCS to date enables a new class of high-throughput mm-wave and base station transmitters that can operate at high area, energy and spectral efficiency.
Index Terms—Application speciﬁc integrated circuits (ASIC),
asymmetric multi-level outphasing (AMO) power ampliﬁer, base-
band, energy efﬁciency, linear ampliﬁcation by nonlinear compo-
nent (LINC), Signal component separator (SCS), throughput.
I. INTRODUCTION
HIGH-THROUGHPUT wireless communication systems working at the millimeter-wave (mm-wave) frequency range from 60 GHz to 90 GHz [1]–[7] have recently become the focus of research and development activity. The availability of large chunks of bandwidth and the maturity of CMOS process technology provide the opportunity to address several large markets with bandwidth-demanding communication applications. Meanwhile, these mm-wave applications place great challenges on the transceiver design, due to factors such as power-amplifier (PA) efficiency and linearity, high wireless channel loss and multipath, increasing parasitics for passive components, limited amplifier gain, etc. Even in cellular base stations, the drive toward flexible, multi-standard radio chips increases the need for high-precision, high-throughput and energy-efficient backend processing. The desire to best leverage the available spectrum for these high-throughput applications creates the demand for high-efficiency and high-linearity PAs. While these conflicting PA design requirements have been satisfied in the past at low system throughputs by designing smart digital back-ends, the multi-GSamples/s throughput required in new applications puts a significant challenge on digital baseband system design to perform the necessary modulation and predistortion operations at negligible power overhead.
Manuscript received September 12, 2012; revised October 08, 2012; accepted October 17, 2012. Date of current version January 24, 2013. This paper was approved by Associate Editor Ichiro Fujimori.
The authors are with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2012.2229071
This desire for a high-throughput, energy-efficient digital baseband becomes especially prominent for the outphasing PAs designed to improve the efficiency while satisfying the high-linearity requirements for higher-order signal constellations. At low throughputs (10–100 MSamples/s), the outphasing PAs would rely on complex digital signal processing to generate the outphasing vectors and make it possible to use simple, high-efficiency switching PAs on each path. Examples of the outphasing PAs include the linear-amplification-by-nonlinear-component (LINC) PA proposed by Cox in [8], and its more recent modification: the asymmetric-multilevel-outphasing (AMO) PA [9]–[11]. At high (multi-GSamples/s) throughputs, however, a radical redesign of the signal component separator (SCS) digital signal processing implementations is needed to prevent degradation in net power efficiency due to a significant increase of digital baseband power consumption.
The conventional LINC SCS has traditionally been implemented in both analog and digital designs [12]–[14]. The analog versions of the SCS are not suitable for high-speed and high-precision applications, so we only consider the digital SCS implementations. The SCS decomposes the original sample signal into two signals as required by the LINC/AMO, and the decomposition involves the computation of several nonlinear functions. For a digitally implemented SCS, a look-up table (LUT) is the most common way to realize the nonlinear functions. Considering that past signal separators mainly work below 100 MSamples/s with low to medium precision, the LUT is indeed the simplest and most energy-efficient approach. Even for the recent AMO architecture, the LUT is still a preferable choice for operation under 100 MSamples/s [15]. However, the traditional LUT-based function map quickly becomes infeasible when the throughput and precision requirements go up to the multi-GSamples/s and more-than-10-bit range. The LUT size becomes prohibitively large for on-chip implementation and incurs a penalty in both area and speed. Besides, the number of LUTs used in the AMO SCS is significantly larger than in the LINC SCS, so the LUT solutions that can barely work for LINC render AMO implementations infeasible. On the other hand, at these high throughputs a direct nonlinear function synthesis through iterative algorithms like CORDIC [16] or nonlinear filters [17] proves to be more area compact, but with a prohibitive power footprint for the overall power efficiency of the PA.
0018-9200/$31.00 © 2012 IEEE
TABLE I
LINC AND AMO SCS EQUATIONS
In this paper, we present the function synthesis algorithms and a corresponding chip implementation, designed using an alternative approach to compute the nonlinear functions, which is both more area-efficient and more energy-efficient than state-of-the-art methods like LUTs, CORDIC or nonlinear filters. The chip results demonstrate an AMO SCS working at 3.4 GSamples/s with 12-bit accuracy and over 2× energy savings and 25× area savings compared to a traditional AMO SCS implementation.
The new approach is based on the piece-wise linear (PWL) approximation of a nonlinear function. The approximation consists of the computations of LUT, add, and multiply. In order to minimize the computational cost while maintaining high accuracy and throughput, we propose a novel algorithm to find the fixed-point representation of the approximation. The idea of the fixed-point version of the approximation is to use as few operations as possible and minimize the number of input bits to all the operations so as to achieve high throughput. With these considerations, we are able to achieve a fixed-point representation of typical LINC or AMO nonlinear functions, which consists of one small LUT, one adder and one multiplier. The hardware architecture derived from this special algorithm achieves a good balance among area, energy efficiency, throughput and computation accuracy, which will be presented in detail in the rest of the paper.
The paper is organized as follows. In Section II, we present
the basic principles of LINC and AMO PA architectures and
their corresponding SCSs. In Section III, we introduce the pro-
posed approximation algorithm and an example to illustrate its
derivations and advantages. In Section IV, we present the chip
design of the digital baseband system and the microarchitecture
of each block, followed by the chip measurement results. We
conclude the paper in Section V.
II. SYSTEM OVERVIEW
Both LINC and AMO PAs are outphasing PA architectures
and their digital basebands perform similar computations. The
LINC PA architecture is proposed by Cox in [8] with the moti-
vation to relieve the ever existing trade-off between the power
efﬁciency and linearity performances of the PA. By decom-
posing the transmitted signal to two constant-amplitude signals,
high-efﬁciency PAs can be used to amplify the two decomposed
signals without sacrificing the linearity. The AMO PA architecture, proposed in [9]–[11], improves the average power efficiency further by allowing the two PAs to switch among a discrete set of power supplies rather than fixing on a single supply level.
Fig. 1. (a) LINC, AMO SCS. (b) AMO PA system overview.
Fig. 1(a) shows the working schemes of the LINC SCS and the AMO SCS for an arbitrary IQ sample (I, Q). The SCS decomposes (I, Q) into two signals, each with its own phase and amplitude; for LINC the two amplitudes are equal. The outphasing angles for both architectures are derived from the equations summarized in Table I. In the AMO equations, the two amplitudes denote the power supplies of the two PAs, respectively, and are restricted to a set of four supply-voltage levels. The equations in (amo4) of Table I are included in the signal decomposition process simply because of the architecture requirement of the digital-to-RF-phase-converter (DRFPC) [18], which converts the digital outputs to RF modulated signals and takes a function of the phase as its input. Generally, the computations in (amo4) depend on the type of the modulator and may be different from what we present here.
The typical low-throughput LINC SCS and recent AMO implementations [12]–[15], [19] usually involve the use of the coordinate rotation digital computer (CORDIC) [16] and LUT maps for the nonlinear functions in Table I [14], [20]. The maturity of the CORDIC algorithm and the simplicity of the LUT approach make them suitable for LINC SCS applications whose throughput is below 100 MSamples/s and with low to medium resolution (≤ 8 bits, for example). However, these approaches become less attractive or even prohibitive for our target mm-wave wideband applications, where the throughput is in the multi-GSamples/s range with high phase resolution (≥ 10 bits, for example). In the next section, we show our proposed solution: using fixed-point PWL approximations of the nonlinear functions, which provides a balance among accuracy, power and area.
III. PROPOSED PIECE-WISE LINEAR APPROXIMATION
A. Algorithm
The motivation for a new approach to the nonlinear function computation is simple: avoid complex computations and replace them with simple and energy-efficient ones. For example, table look-up with LUTs of reasonable sizes, adders and multipliers are the favorable computations to perform. We also observe that all functions involved in the SCS computations are smooth over almost the whole input range. Hence, they are suitable to be approximated by functions with simply structured basis functions, such as polynomials, splines, etc. These considerations lead us to the PWL function approximation of the nonlinear functions.
Fig. 2(a) shows the general application of the PWL approximation to any smooth nonlinear function. The input is divided into several intervals, and a linear function is constructed in each interval to approximate the actual function value in that range. With this approximation, the computation of the nonlinear function only consists of the linear function computation in each interval (add and multiply), plus a relatively small LUT for the linear function parameters in each interval. In terms of accuracy, for any function which has a continuous second-order derivative, the approximation error is bounded by the interval length and the second-order derivative, and does not depend on higher-order derivatives, as shown in [21],
(1)
Here, the bound involves only the boundaries of the interval and the second-order derivative of the function within the interval. We observe that the approximation error can be made arbitrarily small as we increase the number of approximation intervals. These initial examinations of the computational complexity and approximation accuracy of the piece-wise linear approximation make it an appealing alternative technique for the LINC and AMO SCS designs.
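For reference, a standard bound of this kind can be written out explicitly. The form below is the textbook error bound for a chord p(x) that interpolates f at the two interval endpoints a and b; the symbols f, p, a and b are notation assumed here, and the exact expression and constant used in (1) may differ:

```latex
\max_{a \le x \le b} \lvert f(x) - p(x) \rvert
  \;\le\; \frac{(b-a)^2}{8}\, \max_{a \le x \le b} \lvert f''(x) \rvert
```

Under this bound, halving the interval length reduces the worst-case error by a factor of four, which is why adding LUT address bits buys accuracy quickly.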
In order to benefit from the good properties of the PWL approximation, we need to tailor it to be hardware-implementation friendly. Most importantly, all the arithmetic computations have to be converted to their fixed-point counterparts, and the question is whether the resulting fixed-point computations are able to operate at multi-GSamples/s throughputs with high accuracy. The most obvious solution is a direct quantization of the parameters in the floating-point representation of the approximation formula. However, this may not be an optimal solution if throughput is the major concern and bottleneck, because the operands of the add and multiply are then quantized to as many bits as the output, and these long-word arithmetic operations are likely to be in the critical timing path. Further optimization of the long multiplication would only add complexity to the design. In what follows, we present a modified formulation of the fixed-point PWL approximation and show its capability of running at a much higher throughput than the direct quantization version of the approximation.
Fig. 2. (a) The general concept of PWL approximation. (b) Proposed fixed-point PWL approximation.
The setup of our problem is to compute a nonlinear function with an m-bit output from an m-bit input, using the PWL approximation. The m-bit input can be decomposed into two parts: its MSBs and its remaining LSBs. Naturally, the MSB part divides the input range into intervals and serves as the index of those intervals. Fig. 2(b) shows an enlargement of one interval of the approximation, in which the MSB part takes a fixed value and the LSB part ranges from 0 up to its maximum value. Under this setup, we have our proposed fixed-point PWL approximation (2), in which all the parameters are fixed-point numbers.
The underlying idea of this formulation is to compute the m-bit output part by part. In the linear function of each interval, we use one term to represent the most significant bits of the function value, and a second term to achieve the lower-significant bits of accuracy. The output is then simply the concatenation of the two parts. The procedures to find the fixed-point representations of the three parameters in (2) are described in the following steps.
Step 1: Obtain the Floating-Point Version of the PWL Approximation: The optimal real coefficients of the linear function in each interval, in terms of the L2 norm, can be found by the least-square optimization (3), where the design variables are the two floating-point (real-valued) coefficients of the line in that interval, and the remaining quantities are defined as in (2).
(3)
The approximation error bound in (1) shows that the error is proportional to the square of the interval length. In the fixed-point input case, this means that when about half of the m input bits are used to index the intervals, it is possible to realize the required m-bit output accuracy with only about 2^(m/2) intervals. Since the number of intervals determines the number of address bits of the LUT that stores the parameters of the linear function in each interval, this LUT (2^(m/2) entries) is considerably smaller than a direct map from input to output (2^m entries). The following steps determine the fixed-point parameter values, i.e., the content of the LUT.
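As an illustration of Step 1, the following minimal sketch fits a least-squares line in each interval and reports the worst-case error. It is not the authors' optimization code: the function names `ls_line` and `max_pwl_error`, the use of arctan on [0, 1) as the test function, and the choice of 33 sample points per interval are all assumptions made for this example.

```python
import math

def ls_line(xs, ys):
    """Closed-form least-squares fit y ~ k*x + b over the given samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, my - k * mx

def max_pwl_error(f, n_intervals, pts=33):
    """Worst-case |f - line| over [0, 1) split into equal intervals."""
    worst = 0.0
    h = 1.0 / n_intervals
    for i in range(n_intervals):
        xs = [i * h + h * j / (pts - 1) for j in range(pts)]
        ys = [f(x) for x in xs]
        k, b = ls_line(xs, ys)
        worst = max(worst, max(abs(y - (k * x + b)) for x, y in zip(xs, ys)))
    return worst

if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(n, max_pwl_error(math.atan, n))
```

Each doubling of the interval count should shrink the reported error by roughly a factor of four, in line with the quadratic dependence on interval length discussed above.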
Step 2: Obtain the Fixed-Point Value of the Constant Term: The fixed-point constant term is obtained simply by quantizing its floating-point counterpart. As we mentioned before, the m-bit output is constructed part by part, with the constant term representing the major part of the function value in the interval. As long as the increment of the function value within each interval is small enough, which is a condition on the derivative of the function, it is enough to use the MSBs of the floating-point constant term to represent the MSBs of the output.
Step 3: Obtain the Fixed-Point Value of the Offset: Since Step 2 yields a constant term with a bounded quantization error, an extra offset parameter is introduced to compensate for this accuracy loss. Its fixed-point counterpart is derived as in (4)
(4)
The number of bits of the offset is determined such that the required output accuracy is met. From our experience with the functions involved in the SCS design, the offset usually needs about as many bits as the LUT address, or a few (i.e., 2–4) more, depending on the derivative of the function in each interval.
Step 4: Obtain the Fixed-Point Value of the Slope: The slope of the function in each interval can also be obtained by simply quantizing its floating-point counterpart from the optimization procedure in Step 1. As shown in (2), the slope term contributes to the second part of the output, the LSBs. The slope must therefore carry enough bits for its product to resolve the LSBs of the output.
The above procedure not only provides a way to obtain the three fixed-point parameters of the linear function in each interval, but also provides a benefit in the high-throughput hardware micro-architecture design. Fig. 3(a) shows the micro-architecture of the approximation and Fig. 3(b) shows more clearly how the computations are carried out.
Fig. 3. (a) Micro-architecture of the PWL approximation. (b) Illustration of the computations in the PWL approximation.
There are essentially three operations involved: one LUT access, one addition, and one multiplication. The LUT takes the MSBs of the input as the address and outputs the parameters of the corresponding interval. Then the linear function computations follow accordingly. From Fig. 3(a), we notice that for all arithmetic computations, the operands are short words rather than the full m bits of the input. As we discussed in Step 1, it is a good choice to address the LUT with about half of the input bits; hence, with operands of roughly m/2 bits in all computations, we are able to achieve the m-bit output.
This implies two important improvements in hardware efficiency: storage and throughput. For a function implemented as a direct LUT, if both the input and output have m bits, the storage required is m·2^m bits. With the proposed scheme, only the parameter LUT is stored; when the LUT is addressed by m/2 bits (m even) and the offset needs only a few (up to 4) extra bits, this amounts to 2^(m/2) entries of three short parameters. A comparison of the storage usage between the direct LUT map and the fixed-point PWL approximation approach is illustrated in Table II, for the practical range of m from 10 to 16. The last column of the table shows the ratio of the LUT size from the approximation versus the one from the direct LUT map, which reflects storage savings of 10–100× for the range of m values of interest. The net area advantage of our approach versus the direct LUT will depend on the actual technology and throughput specifications, since these dictate the type of storage elements being used. For example, in high-throughput applications, register-based LUTs are needed, while in lower-throughput conditions SRAM-based LUTs can be used. Under both types of LUT implementations, the additional area consumed by one adder and one multiplier is almost negligible compared to the LUT area. For example, in 45 nm SOI technology, the direct LUT implementation of a 16-bit in/out arccos function consumes an area of 19 mm² in a register-based implementation and 0.7 mm² in an SRAM implementation. With the PWL approximation, the area consumption reduces to 46200 µm² with the register implementation and 9784 µm² with SRAM. The adder and multiplier consume roughly 1280 µm² in total, which is only a small portion of the overall area consumption. The PWL approximation thus has a large advantage in storage size, and the advantage becomes more prominent as the input and output size increases.
TABLE II
STORAGE COMPARISON EXAMPLES BETWEEN A DIRECT LUT MAP APPROACH AND FIXED-POINT PIECE-WISE LINEAR APPROXIMATION APPROACH
As for the throughput, because of the short operands and LUT address, the whole chain of operations (LUT, add and multiply) can be easily pipelined into a few stages depending on the process and throughput requirement. For example, with a 45 nm SOI process, we use two pipeline stages: table lookup and add in the first pipeline stage and multiply in the second pipeline stage. This structure can sustain roughly a 2-GSamples/s throughput to compute a 15-bit input and output nonlinear function.
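The storage argument above can be made concrete with a small bit-count calculation. The sketch below is only a rough model: the entry layout (two parameters of m/2 bits plus an offset of m/2 + 3 bits) is an assumption taken from the 16-bit arccos example in this paper (two 8-bit terms and an 11-bit offset), and the function names are hypothetical. The actual entries of Table II may use different widths.

```python
def direct_lut_bits(m):
    """Direct map: 2**m entries, each holding an m-bit output."""
    return m * (1 << m)

def pwl_lut_bits(m, extra=3):
    """PWL parameter table under the assumed entry layout.

    Assumes the LUT is addressed by m/2 bits and each entry stores two
    (m/2)-bit parameters plus one offset of (m/2 + extra) bits.
    """
    m1 = m // 2
    entry_width = m1 + m1 + (m1 + extra)
    return (1 << m1) * entry_width

if __name__ == "__main__":
    for m in (10, 12, 14, 16):
        d, p = direct_lut_bits(m), pwl_lut_bits(m)
        print(f"m={m}: direct={d} bits, PWL={p} bits, ratio={d / p:.1f}")
```

Under these assumed widths, the saving grows quickly with m because the direct map scales with 2^m while the parameter table scales with 2^(m/2).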
As a side note, an alternative way to write our formulation (2) is
(5)
To compare the two formulations, we consider the following two aspects: storage size and arithmetic computation complexity. In terms of storage size, formulation (2) requires slightly more storage than (5); however, it brings the advantage of shorter operands for the add operation. In terms of arithmetic operation complexity, formulation (2) requires an adder and a multiplier with short operands, while (5) requires an m-bit full adder in addition to its multiplier. As m gets large, the long adder in (5) may need further pipelining, which complicates the design at high throughput. Furthermore, the optimization lets one parameter represent the first bits of the output, while it chooses the other two parameters in (2) so that their term exactly represents the rest of the bits, which avoids any overflow and an additional adder. Our design is throughput-limited rather than area-limited; therefore, with the above considerations, we choose to use formulation (2) to achieve a higher throughput with more compact arithmetic hardware.
B. Piece-Wise-Linear Design Example
In this section, we show an example of computing a normalized 16-bit input, 16-bit output arccosine function using the proposed PWL approximation approach. This function is one of the functions in the actual AMO SCS design.
First, we obtain a floating-point representation of the approximation, in which the MSBs of the input act as the address for the LUT. The optimal floating-point parameters are evaluated by their maximum absolute error over the bulk of the input range. For inputs close to 1, the PWL approximation does not behave as well because of the large derivative value when the input approaches 1. However, this case only happens when the input sample vector nearly aligns with the two decomposed vectors, that is, when the outphasing angles approach zero. One solution is to redefine the threshold values such that those samples use a higher set of power supply levels, so as to avoid these situations.
Then, we quantize the two linear-function terms to 8 bits, and use (4) to obtain the offset. It turns out that the offset parameter uses 11 bits. The resulting accuracy after all the quantization is reported in terms of maximum absolute error.
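The difficulty near an input of 1 can be seen with a short floating-point experiment. The sketch below uses simple endpoint (chord) interpolation rather than the least-squares fit of Step 1, with 256 uniform intervals to match an 8-bit LUT address; the function names, the sampling density and the 0.9 cut-off are assumptions made for illustration.

```python
import math

def chord_error(f, lo, hi, pts=64):
    """Worst-case error of the chord through (lo, f(lo)) and (hi, f(hi))."""
    y0, y1 = f(lo), f(hi)
    worst = 0.0
    for j in range(pts + 1):
        x = min(lo + (hi - lo) * j / pts, hi)
        approx = y0 + (y1 - y0) * (x - lo) / (hi - lo)
        worst = max(worst, abs(f(x) - approx))
    return worst

def interval_errors(n=256):
    """Chord error of arccos on each of n uniform intervals of [0, 1]."""
    return [chord_error(math.acos, i / n, (i + 1) / n) for i in range(n)]

if __name__ == "__main__":
    errs = interval_errors()
    interior = max(errs[:230])    # intervals that lie below an input of 0.9
    print("worst error for inputs below 0.9:", interior)
    print("error in the last interval      :", errs[-1])
```

The last interval, where the derivative of arccos grows without bound, dominates the error by orders of magnitude, which motivates keeping the operating point away from that region as described above.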
Table III shows the place and route results of the hardware implementation with the proposed approximation approach, as well as other approaches for comparison. Two versions of the approximation approach are shown, with different ways of handling the LUT: one version has the LUT programmable and the other version has it hardwired. The approaches shown for comparison include CORDIC and a 6th-order polynomial approximation. CORDIC [22] is a general iterative approach to implement the trigonometric functions. However, due to its general purpose, it is much less energy-efficient and has lower throughput compared to our PWL approximation. The polynomial approximation, as another alternative to approximate the nonlinear functions, requires many more multipliers than the PWL approximation, and hence is also less energy-efficient. In summary, the proposed PWL approximation provides a 6–20× improvement in energy efficiency with significant area savings over the competing approaches.
IV. CHIP IMPLEMENTATION
A. Overall Chip Design
The baseband design uses the 64-QAM modulation scheme
and has the target symbol throughput of 1–2 GSym/s. The
system has an oversampling rate of 4 or 2, resulting in a system
sample throughput of 4 GSam/s. The baseband needs to provide
at 60 dB adjacent channel power ratio (ACPR). In order to
meet this speciﬁcation while overcoming the nonlinearity in
the phase modulator DAC [18], the baseband is designed to
achieve 65 dB ACPR with 12-bit phase quantization.
The baseband system has a block diagram as shown in Fig. 4. It includes two parts of the design: supporting blocks and the AMO SCS. The supporting blocks upsample and pulse-shape the input symbol sequence from the 64-QAM constellation to appropriate sample sequences, which are then fed to the AMO SCS blocks. As shown in Fig. 4, the 3-bit I and Q symbols first pass through a LUT-based nonlinear predistorter and produce I/Q symbols with 12-bit accuracy in each dimension. The system is not designed to have a powerful nonlinear predistorter, so this simple predistortion table is added only for preliminary symbol-space predistortion. The table size is chosen such that the predistorter has some memory while fitting in the die area. Then the 12-bit I and Q symbols pass through a pulse shaping filter, which oversamples the symbols and produces 12-bit I and Q samples with a shaped spectrum. Interleaving is used here to achieve even higher throughput. The shaping filter produces one sample at the positive edge of the clock and another at the negative edge. Therefore, two copies of the AMO SCS blocks follow the even and odd outputs of the filter.
TABLE III
COMPARISON BETWEEN PWL, CORDIC IMPLEMENTATIONS OF THE 16-bit INPUT, OUTPUT FUNCTION
Fig. 4. The block diagram of the chip.
The AMO SCS part, the zoomed-in part in the bottom of Fig. 4, consists of four main sub-blocks: the Cartesian-to-polar block, the Amplitude-selection block, the Outphasing-angle-computation block, and the angle function block. The Cartesian-to-polar block computes the amplitude square and the angle of the I/Q samples in polar coordinates, corresponding to equation (amo1) in Table I.
The Amplitude-selection block then takes the value of the amplitude square and selects the pair of power supplies for the PAs in the two paths. Recall that the initial motivation to modify the LINC architecture to the AMO architecture is to introduce more supply levels to minimize the combiner loss, especially when the outphasing angle is large. Therefore, the choice of the power supplies directly affects the average power efficiency. According to the Wilkinson combiner's efficiency [9] at a given sample amplitude and the two PAs' supply voltages
(7)
we design the criterion shown in Table IV to select the pair of power supplies, where
(8)
defines the selection thresholds in terms of the four available power supply levels. The criterion is designed to maximize the combiner's efficiency (7) by using the smallest pair of power supplies whose levels are still large enough to form the transmitted sample.
Fig. 5. The hardware block diagram of the SCS system.
TABLE IV
CRITERION FOR POWER SUPPLY PAIR SELECTION.
Obviously, more than the 7 levels used here can be designed from 4 supply levels. An important factor that motivates the choice of the 7 levels is the consideration of minimizing the number of switching events of each power supply. Power supply switching is accompanied by ringing and slewing, which introduce nonlinear and memory effects into the system and cause spectral regrowth and degradation in the linearity performance of the overall transmitter. The rules in (8) allow only one power supply to change, and only to an adjacent level, when the sample amplitude jumps from one region to an adjacent region. This is what happens most of the time, because the pulse-shaping filter smooths the I/Q symbol transitions and limits the jumps between I/Q samples.
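The selection rule described above can be sketched as follows. This is a hypothetical illustration, not the criterion of Table IV or the thresholds of (8), which are not reproduced here: the normalized supply levels in `LEVELS`, the restriction to equal or adjacent level pairs (which yields 7 pairs from 4 levels), and the reachability condition that the sample amplitude must not exceed the mean of the two supplies are all assumptions.

```python
LEVELS = [0.25, 0.5, 0.75, 1.0]   # hypothetical normalized supply voltages

def allowed_pairs(levels):
    """Equal or adjacent supply pairs, ordered from smallest to largest."""
    pairs = []
    for i, v in enumerate(levels):
        pairs.append((v, v))
        if i + 1 < len(levels):
            pairs.append((v, levels[i + 1]))
    return pairs

def select_pair(amplitude, levels=LEVELS):
    """Pick the smallest pair that can still form the given amplitude.

    Assumes a pair (a1, a2) can form amplitudes up to (a1 + a2) / 2.
    """
    for a1, a2 in allowed_pairs(levels):
        if amplitude <= (a1 + a2) / 2:
            return a1, a2
    raise ValueError("amplitude exceeds the largest supply pair")

if __name__ == "__main__":
    for amp in (0.1, 0.3, 0.45, 0.6, 0.7, 0.8, 0.95):
        print(amp, select_pair(amp))
```

With this ordering, moving between neighboring amplitude regions changes the supply of only one PA, and only by one level, which is the property the text relies on to limit switching events.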
The Outphasing-angle-computation block computes the two angles between the decomposed and transmitted vectors, corresponding to equations (amo2) and (amo3) in Table I. The steps of the computations are divided into four sub-blocks in Fig. 4. Sub-blocks I and II compute the argument of the arccosine function, including square-root, inverse of square-root and summation operations. The constant terms in sub-block II are programmable and are selected after the two supply levels are determined. Then sub-block III computes the arccosine function and sub-block IV computes the final outphasing angles.
The last block of computation prepares the input signals for the DRFPC, corresponding to (amo4) in Table I.
B. SCS Blocks Design
In this section, we show details of the micro-architecture of each block in the SCS system. Fig. 5 shows the overall pipelined hardware block diagram. It is roughly a direct translation from the conceptual block diagram in Fig. 4. The I/Q samples generated by the shaping filter first pass through the getTheta block, which produces the amplitude square and the angle of the sample. The following getAlpha block then takes the amplitude square, selects the two power supplies and computes the two outphasing angles. This roughly corresponds to the Amplitude-selection and Outphasing-angle-computation blocks in Fig. 4. The outphasing angles, together with the sample angle, are inputs to the getPhi block, which computes the angle function required by the phase modulator. This represents the angle function block in Fig. 4. The final outputs of the SCS system are, for each of the two paths, a quadrant indicator of the phase, the angle function computed with the phase converted to the first quadrant, and the digital code that controls the PA power supply switches. Next, we see how each sub-block accomplishes its tasks.
TABLE V
SUMMARY OF ARITHMETIC OPERATIONS IN EACH FUNCTIONAL BLOCK OF THE AMO SCS
Fig. 6. (a) The hardware block diagram of the getTheta block. (b) The hardware block diagram of the getPhi block.
1) getTheta Block: Fig. 6(a) shows the micro-architecture of the getTheta block, which has two main operations: division and arctan. With the PWL approximation algorithm discussed in Section III-A, both functions can be realized with the micro-architecture in Fig. 3. Before applying the approximation, it is important to carefully examine the input and output range of the function, because of the nature of the fixed-point computation. In order to have good accuracy with the approximation, it is desirable to have an input range where the function behaves smoothly and has a nicely bounded derivative. Consider as an example the division function Q/I. The division function has two input variables, while the presented algorithm assumes a single-variable function. So the computation of Q/I is divided into the inversion 1/I, followed by the multiplication of Q by 1/I. The inversion function has a discontinuity at zero, and its derivative becomes large as its input approaches zero. In order to use the PWL approximation with good accuracy, several preprocessing steps are necessary to condition the input before doing the approximation of the inversion function. We implement the following treatments on the input, corresponding to the divPrep block in Fig. 6(a):
• Step (1): (I, Q) are first transformed to the first quadrant by taking the absolute values of I and Q. Use a flag of two bits to indicate whether each component of the current sample is actually negative or not.
• Step (2): Swap I and Q if Q is larger than I, so that the resulting ratio lies between 0 and 1. The boundary values of 0 and 1 are computed as special cases separately. Again, use a flag to indicate whether the swap is performed on the current sample.
• Step (3): Shift the input of the inversion such that it falls between 1 and 2. The shift operation is always valid because the shaping filter coefficients are programmable and can be designed such that the samples stay within the required range. This step just means shifting the bits to the left until the MSB is 1. Record the number of shifted bits for each sample.
Although it is obvious that after the transformations,
is different from the desired output , these preprocessing
steps can be compensated. Speciﬁcally, the swap in Step (2)
and the absolute operation in Step (1) are taken care of after the
computation of ; and the shift operation in Step (3) are taken
care of after the computation of .
• Step (1): Shift back accordingly after the computation of
the inversion. This operation is included in the divPost block,
together with the multiplication that forms the ratio.
• Step (2): After the computation of the arctangent, for samples whose flag
indicates that a swap has happened, the angle is replaced by its complement
($\pi/2$ minus the computed angle);
otherwise it is left unchanged. This is included in the atanPost block in
Fig. 6(a).
• Step (3): After Step (2), we need to check further whether a quadrant
change has happened to the current sample, and adjust
the angle accordingly. This is also a part of the atanPost block.
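The pre- and post-processing steps above can be sketched in floating point as follows. This is only an illustration of the data flow under stated assumptions: the function name `get_theta_sketch` and the inputs `i`, `q` are hypothetical, the exact `1/x` and `math.atan` calls stand in for the divApprox and atanApprox blocks, and the normalization allows shifts in both directions, whereas the text describes left shifts only.

```python
import math

def get_theta_sketch(i, q):
    """Angle of the point (i, q), computed with the abs/swap/shift steps.

    Hypothetical illustration: exact 1/x and atan stand in for the
    PWL-approximated divApprox and atanApprox blocks.
    """
    if i == 0 and q == 0:
        raise ValueError("angle is undefined for the zero sample")

    # Pre-step (1): fold into the first quadrant, keep a two-bit sign flag.
    neg_i, neg_q = i < 0, q < 0
    a, b = abs(i), abs(q)

    # Pre-step (2): swap so that the ratio num/den lies in [0, 1].
    swapped = b > a
    num, den = (a, b) if swapped else (b, a)

    # Pre-step (3): normalize the divisor into [1, 2) by a power of two
    # and record the shift (den == den_norm * 2**shift).
    mant, exp = math.frexp(den)          # mant in [0.5, 1)
    den_norm, shift = mant * 2.0, exp - 1

    inv = 1.0 / den_norm                 # stand-in for divApprox

    # Post-step (1): undo the shift and form the ratio (divPost).
    ratio = num * math.ldexp(inv, -shift)

    theta = math.atan(ratio)             # stand-in for atanApprox

    # Post-step (2): undo the swap, since atan(1/r) = pi/2 - atan(r) for r > 0.
    if swapped:
        theta = math.pi / 2 - theta

    # Post-step (3): restore the original quadrant from the sign flag.
    if neg_i:
        theta = math.pi - theta
    if neg_q:
        theta = -theta
    return theta
```

With exact stand-ins the result agrees with `math.atan2(q, i)` up to rounding, which makes the sketch a convenient reference when checking an approximated version of the two inner functions.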
With properly designed preprocessing, the input of the inversion
function takes the range (1, 2), and the input of the arctangent function
takes the range (0, 1). In these ranges, the functions
have nicely bounded derivatives, making them suitable
for the fixed-point PWL approximation. The two functions'
approximation computations are represented by the blocks
divApprox and atanApprox in Fig. 6(a), whose micro-architecture
follows the one in Fig. 3(a). The overall getTheta block is able
to achieve a throughput of 2 GSamples/s in the place-and-route
timing analysis. The look-up tables that store the PWL parameters
for the two functions have sizes as summarized in the first two
lines of Table VI. The table also gives a size comparison with the
LUTs that would be needed to map the nonlinear functions directly.
There, we can see that our fixed-point PWL approximation approach
saves orders of magnitude in LUT size. The accuracy
column also shows that an output accuracy of 14 bits is
achieved.
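To see why bounded derivatives on (1, 2) and (0, 1) matter, the following minimal sketch builds a generic piece-wise linear table and measures its error. It is not the fixed-point formulation of Fig. 3(a): it assumes uniform segments, chord (endpoint) interpolation, floating-point arithmetic and two stored values per segment, and the helper names are hypothetical.

```python
import math

def build_pwl_table(f, lo, hi, segments):
    """Return one (slope, intercept) pair per uniform segment of [lo, hi]."""
    width = (hi - lo) / segments
    table = []
    for k in range(segments):
        x0 = lo + k * width
        x1 = x0 + width
        slope = (f(x1) - f(x0)) / width
        table.append((slope, f(x0) - slope * x0))
    return table

def pwl_eval(table, lo, hi, x):
    """Evaluate the PWL approximation: one table look-up, one multiply-add."""
    segments = len(table)
    k = min(int((x - lo) / (hi - lo) * segments), segments - 1)
    slope, intercept = table[k]
    return slope * x + intercept

def max_abs_error(f, table, lo, hi, samples=10000):
    """Largest absolute error over a uniform grid of sample points."""
    worst = 0.0
    for n in range(samples):
        x = lo + (hi - lo) * n / samples
        worst = max(worst, abs(pwl_eval(table, lo, hi, x) - f(x)))
    return worst
```

For chord interpolation the error is bounded by max|f''| h^2 / 8, where h is the segment width. On (1, 2) the second derivative of 1/x is at most 2, so 64 segments already keep the error below about 6.1e-5, while a direct LUT of comparable accuracy would need on the order of one entry per output quantization step.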
2) getAlpha Block: Fig. 7 shows the detailed micro-architecture
of the getAlpha block of Fig. 5, which also corresponds
to the conceptual sub-blocks I, II and III of the Outphasing-angle-computation
part in Fig. 4. The computations in this block
include two parts: obtaining the argument of the final function
and calculating the function itself. In order to obtain the
argument, we rearrange the terms as
(9)
Fig. 7. The hardware block diagram of the getAlpha block.
TABLE VI
SUMMARY OF ACCURACY AND LUT SIZE OF THE
PWL APPROXIMATED FUNCTION BLOCKS
where the constants are programmable values that are
selected according to the selection of power supplies. The problem
with using the original formula is the
long-bit division it requires. On
the other hand, (9) involves no computations with operands of
that order.
The computations to obtain the terms in (9) include
approximations of two functions whose input
is a sum of two terms. As discussed for the
division computation, certain input preprocessing is necessary
to avoid the large derivatives near the discontinuity point at 0. The
SqrtPrep block of Fig. 7 serves this purpose by scaling the input
to the range [1/4, 1), namely shifting two bits at a time either
to the left or to the right until the input fits in the range. Then the
approximations of the two functions are performed, followed
by the postprocessing parts that compensate for the shifting
operations done to the inputs. With two more multipliers and one
adder, the computations of (9) are accomplished. The final
function then takes the resulting arguments and produces the angles,
as already shown in the previous example.
The LUT sizes and accuracy for the three functions
are summarized in Table VI.
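The two-bit shifting performed by SqrtPrep, and its compensation afterwards, can be sketched as below. The sketch assumes that the two approximated functions are the square root and the inverse square root (suggested by the block name and by the discontinuity at 0, but not stated explicitly in this section), and it uses exact `math.sqrt` in place of the PWL approximation. All function names are hypothetical.

```python
import math

def sqrt_prep(x):
    """Scale x > 0 into [1/4, 1) by shifting two bits at a time.

    Returns (m, k) such that x == m * 4**k, with 1/4 <= m < 1.
    """
    if x <= 0:
        raise ValueError("input must be positive")
    k = 0
    while x >= 1.0:        # shift right by two bits
        x /= 4.0
        k += 1
    while x < 0.25:        # shift left by two bits
        x *= 4.0
        k -= 1
    return x, k

def sqrt_via_prep(x):
    """sqrt(m * 4**k) = sqrt(m) * 2**k: each two-bit shift is undone by one bit."""
    m, k = sqrt_prep(x)
    return math.ldexp(math.sqrt(m), k)

def rsqrt_via_prep(x):
    """1/sqrt(m * 4**k) = (1/sqrt(m)) * 2**(-k)."""
    m, k = sqrt_prep(x)
    return math.ldexp(1.0 / math.sqrt(m), -k)
```

For example, `sqrt_prep(5.0)` returns `(0.3125, 2)`, since 5 = 0.3125 × 4². Shifting in steps of two bits is what keeps the compensation a pure shift: a factor of 4 at the input becomes a factor of 2 (or 1/2) at the output.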
3) getPhi Block: Shown in Fig. 6(b) and as the final block
in Fig. 5, getPhi takes the outputs of the previous
getAlpha and getTheta blocks and produces the final
outphasing-angle outputs. The getPhi block first computes
the outphasing angles in the sub-block ftanPrep; a subsequent
block then computes the final outputs. Nominally,
the digital baseband SCS's tasks end after ftanPrep, which
delivers the outphasing angles themselves. However, there may be
additional signal processing tasks at the interface between the
digital baseband and the DRFPC phase modulator. In our case,
the phase modulator we intend to use requires a function
of the outphasing angle as its input.
After obtaining the two outphasing angles,
we convert them to the first quadrant and use 2-bit
flags to indicate their original quadrants. This conversion
is necessary both to meet the phase modulator's input
requirement and as a preprocessing step for the
following functional approximation. By limiting the input to the
first quadrant, the function has a nicely bounded
derivative over its input range; otherwise,
the function has a discontinuity. So it is suitable
to apply the PWL approximation to this function as well. The
hardware cost in terms of LUT size is again summarized in
Table VI.
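A minimal sketch of the first-quadrant conversion with a 2-bit quadrant flag is given below. The folding convention used here (subtracting a multiple of π/2) is an assumption; the section does not say whether the hardware rotates or reflects the angle, and the function names are hypothetical.

```python
import math

def fold_to_first_quadrant(phi):
    """Map an angle to [0, pi/2] plus a 2-bit quadrant index (0..3)."""
    phi = phi % (2 * math.pi)                       # wrap into one full turn
    quadrant = min(int(phi // (math.pi / 2)), 3)    # 2-bit flag
    return phi - quadrant * (math.pi / 2), quadrant

def unfold(phi_fq, quadrant):
    """Recover an angle equivalent to the original from the folded pair."""
    return phi_fq + quadrant * (math.pi / 2)
```

Only the folded angle needs to be passed to the approximated function; the flag travels alongside it so that the downstream block can restore the quadrant.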
C. Experimental Results
With all nonlinear functions properly approximated and pa-
rameters quantized, the tested SCS output produces the signal
spectrum as shown in Fig. 8(a). Compared with the spectrum
at the shaping filter’s output, the SCS block degrades the ACPR
by 2 dB, from 67 dB to 65 dB, due to the approximation and
quantization errors. Fig. 8(b) shows the 64-QAM constellation
diagram comparing the SCS output with the ideal input, illustrating that
the SCS introduces an EVM of 0.08%.
The digital AMO SCS system is fabricated in a 45 nm SOI
process, with 448578 gates occupying an area of 1.56 mm².
The chip runs at up to 1.7 GHz (3.4 GSamples/s) at a 1.1 V supply. As
shown in the shmoo plot of Fig. 9, lowering the power supply
voltage decreases the dynamic power of the SCS digital system
until it hits the minimum-energy point at lower throughput,
where leakage energy takes over. The minimum-energy point
of 58 pJ per sample, or 19 pJ per bit in 64-QAM transmission
(assuming 2× oversampling), is measured at 800 MSamples/s
throughput. For a typical PA efficiency of 40% and a throughput
of 800 MSamples/s, at a peak output power level of 1.8 W, the
total peak PAE is affected by less than 1% (46 mW/(46 mW +
1.8 W/0.4)) by this 64-QAM capable AMO SCS backend.
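The figures quoted in this paragraph follow from simple arithmetic, reproduced in the sketch below. The constants are taken from the text; the function names are hypothetical, and 6 bits per symbol is the standard value for 64-QAM.

```python
ENERGY_PER_SAMPLE_J = 58e-12   # minimum-energy point, per sample
THROUGHPUT_SPS = 800e6         # samples per second at that point
BITS_PER_SYMBOL = 6            # 64-QAM carries log2(64) = 6 bits per symbol
OVERSAMPLING = 2               # samples per symbol
PA_EFFICIENCY = 0.40           # typical PA efficiency assumed in the text
PEAK_OUTPUT_W = 1.8            # peak output power

def scs_power_w():
    """SCS power at the minimum-energy point (about 46 mW)."""
    return ENERGY_PER_SAMPLE_J * THROUGHPUT_SPS

def energy_per_bit_j():
    """Energy per transmitted bit (about 19 pJ)."""
    return ENERGY_PER_SAMPLE_J * OVERSAMPLING / BITS_PER_SYMBOL

def scs_share_of_dc_power():
    """SCS power as a fraction of total DC power (PA input plus SCS)."""
    pa_dc_w = PEAK_OUTPUT_W / PA_EFFICIENCY
    return scs_power_w() / (scs_power_w() + pa_dc_w)

def efficiency_with_scs():
    """Overall efficiency once the SCS power is added to the PA DC power."""
    pa_dc_w = PEAK_OUTPUT_W / PA_EFFICIENCY
    return PEAK_OUTPUT_W / (pa_dc_w + scs_power_w())
```

With these inputs the SCS draws about 46.4 mW, the parenthesized ratio evaluates to roughly 1.0% of the total DC power, and the overall efficiency moves from 40% to about 39.6%, a drop of roughly 0.4 percentage points.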
Fig. 8. Spectrum and EVM of the SCS. (a) Spectrum comparison of the SCS
output and shaping ﬁlter output. (b) EVM comparison of the SCS output and
ideal input.
Fig. 9. Throughput and energy with supply scaling for AMO SCS.
Fig. 10. Chip photograph.
Fig. 11. (a) Power breakdown of the AMO SCS design. (b) Area breakdown
of the AMO SCS design.
TABLE VII
COMPARISON WITH OTHER WORKS
The chip photograph is shown in Fig. 10, with annotated
blocks and sizes. The power breakdown of the AMO SCS is
illustrated in Fig. 11(a), which shows the estimated contribution of each
block to the total AMO SCS power at 2 GHz operation, based on the
reported post-place-and-route power estimation values. The
large proportion of the clocking power is in part due to the
latency-matching register stages on the amplitude paths required to
compensate for the depth of the phase computations, and the
leakage power of the getPhi block is due to its programmable
LUT. The area breakdown of the AMO
SCS is illustrated in Fig. 11(b), which shows the areas of the major
functional blocks of the three main functions of the SCS. The
computation of one of these functions takes over two thirds of the
area due to its programmable LUTs. A comparison of our work
with other digital/analog implementations of LINC/AMO SCS
is summarized in the first 5 columns of Table VII. Our work
demonstrates a design with the highest throughput and phase
accuracy to date. For a fairer comparison with other
digital AMO SCS work, we scaled the design in [15] to
the same phase accuracy, technology node and throughput. The
scaled performances are summarized in the last 3 columns of
Table VII, and our design shows more than 2× improvement
in energy efficiency and 25× improvement in area. As a general
guideline, for applications with low/medium accuracy requirements (e.g.,
less than 8-bit phase resolution) and low/medium
throughput (e.g., up to hundreds of MSamples/s), a LUT is still
a good design choice because of its good energy efficiency,
reasonable size and low design complexity. On the other hand, our
proposed approach is more suitable for applications with high
accuracy (e.g., greater than 10-bit phase resolution) and high
throughput (e.g., around GSamples/s) requirements.
V. CONCLUSION
In this paper, we present a chip design of a high-throughput
(3.4 GSamples/s) SCS for the AMO PA architecture. In order
to achieve energy- and area-efficient high-throughput operation,
we developed a new fixed-point piece-wise linear approximation
algorithm for the computations of the nonlinear functions
in SCS design. This new algorithm and the corresponding
implementation achieve over 2× improvement in energy efficiency
and 25× improvement in area efficiency over the traditional
AMO SCS implementations. The algorithm has the desirable
properties of few and simple arithmetic operations, short arithmetic
operands and small-sized look-up tables, and can be easily
pipelined to run at multi-GSamples/s throughputs. Designed in
45 nm SOI technology, this SCS implementation is the fastest
SCS implementation demonstrated to date. Though we demonstrate
the application of the approximation algorithm with the
AMO SCS, the approximations are directly applicable to LINC
SCS, and enable a new class of wideband wireless mm-wave
communication system designs with high energy and spectral
efficiency.
ACKNOWLEDGMENT
The authors would like to thank Mark M. Tobenkin, and Joel
Dawson, David Ricketts, Wei Tai, Zhen Li, and Gilad Yahalom
from the MIT-CMU ElastX team, for their useful discussions
and kind help.
REFERENCES
[1] J. Laskar, S. Pinel, S. Sarkar, B. Perumana, and P. Sen, “The next
wireless wave is a millimeter wave,” Microwave J., vol. 50, no. 8, pp.
22–36, Aug. 2007.
[2] S. Pinel, P. Sen, S. Sarkar, B. Perumana, D. Dawn, D. Yeh, F. Barale,
M. Leung, E. Juntunen, P. Vadivelu, K. Chuang, P. Melet, G. Iyer, and
J. Laskar, “60 GHz single-chip CMOS digital radios and phased array
solutions for gaming and connectivity,” IEEE J. Sel. Areas Commun.,
vol. 27, no. 8, pp. 1347–1357, Oct. 2009.
[3] C. Marcu, D. Chowdhury, C. Thakkar, L.-K. Kong, M. Tabesh, J.-D.
Park, Y. Wang, B. Afshar, A. Gupta, A. Arbabian, S. Gambini, R. Za-
mani, A. Niknejad, and E. Alon, “A 90 nm CMOS low-power 60 GHz
transceiver with integrated baseband circuitry,” in IEEE Int. Solid-State
Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2009, pp. 314–315.
[4] M. Tabesh, J. Chen, C. Marcu, L. Kong, S. Kang, E. Alon, and A.
Niknejad, “A 65 nm CMOS 4-element sub-34 mW/element 60 GHz
phased-array transceiver,” in IEEE Int. Solid-State Circuits Conf.
(ISSCC) Dig. Tech. Papers, Feb. 2011, pp. 166–168.
[5] S. Nicolson, K. Yau, S. Pruvost, V. Danelon, P. Chevalier, P. Garcia,
A. Chantre, B. Sautreuil, and S. Voinigescu, “A low-voltage SiGe
BiCMOS 77-GHz automotive radar chipset,” IEEE Trans. Microw.
Theory Tech., vol. 56, no. 5, pp. 1092–1104, May 2008.
[6] R. Ben Yishay, R. Carmon, O. Katz, and D. Elad, “A high gain wideband
77 GHz SiGe power amplifier,” in Proc. IEEE Radio Frequency
Integrated Circuits Symp. (RFIC), May 2010, pp. 529–532.
[7] A. Arbabian, B. Afshar, J.-C. Chien, S. Kang, S. Callender, E. Adabi,




[8] D. Cox, “Linear ampliﬁcation with nonlinear components,” IEEE
Trans. Commun., vol. COM-22, no. 12, pp. 1942–1945, Dec. 1974.
[9] S. Chung, P. Godoy, T. Barton, E. Huang, D. Perreault, and J. Dawson,
“Asymmetric multilevel outphasing architecture for multi-standard
transmitters,” in Proc. IEEE Radio Frequency Integrated Circuits
Symp., Jun. 2009, pp. 237–240.
[10] P. Godoy, S. Chung, T. Barton, D. Perreault, and J. Dawson, “A
2.5-GHz asymmetric multilevel outphasing power ampliﬁer in 65-nm
CMOS,” in Proc. IEEE Topical Conf. Power Ampliﬁers for Wireless
and Radio Applications (PAWR), Jan. 2011, pp. 57–60.
[11] J. Hur, O. Lee, K. Kim, K. Lim, and J. Laskar, “Highly efﬁcient un-
even multi-level LINC transmitter,” Electron. Lett., vol. 45, no. 16, pp.
837–838, 30 2009.
[12] B. Shi and L. Sundström, “A 200-MHz IF BiCMOS signal component
separator for linear LINC transmitters,” IEEE J. Solid-State Circuits,
vol. 35, no. 7, pp. 987–993, Jul. 2000.
[13] L. Panseri, L. Romano, S. Levantino, C. Samori, and A. Lacaita, “Low-power
signal component separator for a 64-QAM 802.11 LINC transmitter,”
IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1274–1286,
May 2008.
[14] W. Gerhard and R. Knoechel, “LINC digital component separator
for single and multicarrier W-CDMA signals,” IEEE Trans. Microw.
Theory Tech., vol. 53, no. 1, pp. 274–282, Jan. 2005.
[15] T.-W. Chen, P.-Y. Tsai, D. De Moitie, J.-Y. Yu, and C.-Y. Lee, “A low
power all-digital signal component separator for uneven multi-level
LINC systems,” in Proc. ESSCIRC, Sep. 2011, pp. 403–406.
[16] J. E. Volder, “The Cordic trigonometric computing technique,” IRE
Trans. Electron. Comput., vol. EC-8, no. 3, pp. 330–334, Sep. 1959.
[17] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Sys-
tems. Malabar, FL: Krieger Publishing Co., 2006, vol. 1.
[18] P. Eloranta, P. Seppinen, S. Kallioinen, T. Saarela, and A. Parssinen,
“A multimode transmitter in 0.13 µm CMOS using direct-digital
RF modulator,” IEEE J. Solid-State Circuits, vol. 42, no. 12, pp.
2774–2784, Dec. 2007.
[19] T.-W. Chen, P.-Y. Tsai, J.-Y. Yu, and C.-Y. Lee, “A sub-mW all-digital
signal component separator with branch mismatch compensation for
OFDM LINC transmitters,” IEEE J. Solid-State Circuits, vol. 46, no.
11, pp. 2514–2523, Nov. 2011.
[20] C. Conradi, J. McRory, and R. Johnston, “Low-memory digital signal
component separator for LINC transmitters,” Electron. Lett., vol. 37,
no. 7, pp. 460–461, Mar. 2001.
[21] K. Eriksson and D. Esten, Applied Mathematics: Body and Soul:
Volume 2: Integrals and Geometry in $\mathbb{R}^n$. New York: Springer,
2010, vol. 2.
[22] P. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, “50
years of cordic: Algorithms, architectures, and applications,” IEEE
Trans. Circuits Syst. I: Reg. Papers, vol. 56, no. 9, pp. 1893–1907,
Sep. 2009.
[23] B. Shi and L. Sundstrom, “An IF CMOS signal component separator
chip for LINC transmitters,” in Proc. IEEE Custom Integrated Circuits
Conf. (CICC), 2001, pp. 49–52.
Yan Li received the B.E. degree in electrical engineering
from University of Science and Technology
of China in 2004, and the M.A.Sc. degree in electrical
engineering from McMaster University, Canada in
2006. She is currently pursuing the Ph.D. degree in
electrical engineering and computer science from
the Massachusetts Institute of Technology (MIT),
Cambridge, MA.
Her current research interests include the design
of high-speed energy-efﬁcient digital systems, non-
linear systems modeling and compensation, and op-
timization for analog integrated circuits.
Zhipeng Li received the S.B. in physics, S.B.
in electrical engineering, M.Eng., and Electrical
Engineer degrees from the Massachusetts Institute
of Technology (MIT), Cambridge, MA, in 2009,
2009, 2010, and 2011, respectively, where he is
currently pursuing the Ph.D. degree. His research
interests include design and implementation of
energy-efﬁcient digital circuits.
Oguzhan Uyar was born in Yozgat, Turkey, in 1986.
He received the M.S. degree in electrical engineering
from Massachusetts Institute of Technology (MIT),
Cambridge, MA, in 2011 and the B.S. degree in electrical
engineering from Bogazici University in 2009.
His main areas of interest are high-speed mixed
signal circuits, gigabit serial links, equalization and
wireless transceivers.
Yehuda Avniel received the Ph.D. degree in
electrical engineering and computer science from
Massachusetts Institute of Technology (MIT),
Cambridge, MA, in 1985 in the area of control. He
is a Research Afﬁliate at MIT and a contributor
in numerous DARPA initiatives in the applica-
tions of robust and hierarchical optimization in
nanotechnology, integrated circuit design, and
optimization-based non-linear model reduction.
Alexandre Megretski is currently a Professor
of Electrical Engineering and Computer Science
with the Laboratory for Information and Decision
Systems, Massachusetts Institute of Technology
(MIT), Cambridge, MA. He was a Researcher with
both the Royal Institute of Technology, Stockholm,
Sweden, and the University of Newcastle, NSW,
Australia, and a Faculty Member with Iowa State
University, Ames. His current research interests
include nonlinear dynamical systems (identiﬁcation,
analysis, and design), validation of hybrid control
algorithms, optimization, applications to analog circuits, control of animated
objects, and relay systems.
Vladimir Stojanović received the Ph.D. degree
in electrical engineering from Stanford University,
Stanford, CA, in 2005, and the Dipl. Ing. degree
from the University of Belgrade, Serbia, in 1998.
He is the Emanuel E. Landsman Associate
Professor of Electrical Engineering and Computer
Science at Massachusetts Institute of Technology
(MIT), Cambridge, MA. He was with Rambus,
Inc., Los Altos, CA, from 2001 through 2004. His
research interests include design, modeling and op-
timization of integrated systems, from CMOS-based
VLSI blocks and interfaces to system design with emerging devices like NEM
relays and silicon-photonics. He is also interested in design and implementation
of energy-efﬁcient electrical and optical networks, and digital communication
techniques in high-speed interfaces and high-speed mixed-signal IC design.
Prof. Stojanović received the 2006 IBM Faculty Partnership Award, and the
2009 NSF CAREER Award as well as the 2008 ICCAD William J. McCalla,
2008 IEEE TRANSACTIONS ON ADVANCED PACKAGING, and 2010 ISSCC Jack
Raper best paper awards. He is an IEEE Solid-State Circuits Society Distin-
guished Lecturer for the 2012–2013 term.