I. INTRODUCTION
ULTRA-LOW-POWER (ULP) wireless transceivers (TRXs) for the Internet of Things (IoT) are a subject of intensive research in both industry and academia [1]-[10]. Bluetooth low energy (BLE) [11] is currently the most popular standard for short-range IoT communications. BLE is an extension of the conventional Bluetooth (BT) that specifies an increased channel spacing of 2 MHz and a relaxed interference tolerance to allow for low-power (LP) implementations. This paper focuses on implementing such a TRX in the most advanced low-cost bulk CMOS technology node: the LP polysilicon-gate version of 28-nm CMOS with nine metal layers. The key objective is to minimize the system cost by fully integrating all the RF TRX building blocks, including the traditionally troublesome antenna-interfacing circuitry, such as the power amplifier (PA) matching network and transmit/receive (T/R) switch, while also minimizing the power consumption.
To address the above objectives of full system integration at ULP consumption, including amenability to integration with digital processors [6] in the face of a strong push toward sub-threshold operation [12], the proposed TRX exploits all-digital and digitally intensive architectures for the frequency synthesizer, transmitter (TX), and receiver (RX) [13], [14]. A time-to-digital converter (TDC) in an all-digital PLL (ADPLL) employs a string of inverters to convert a time difference between reference and variable (RF) clocks into a digital phase error. Power consumption and resolution of the TDC improve with technology scaling. Furthermore, at the same area, device matching improves, thus reducing TDC nonlinearity and the level of fractional spurs. On the RX side, most of the signal processing and filtering is done using discrete-time (DT) passive switched-capacitor circuits. Waveforms required for driving the switches are also generated using digital logic. To provide signal gain, DT techniques use inverter-based g_m cells that are always compatible with digital technology. As the technology scales, MOS switches become faster with lower parasitic capacitance [15], [16]. Consequently, digital waveform generators also become faster and more power efficient. In addition, metal capacitor density improves by migrating to more advanced technology, resulting in a reduced area. The above reasoning justifies the use of the relatively advanced 28-nm bulk CMOS technology node in this paper, especially given the upcoming introduction of embedded nonvolatile memory.
Consequently, we have revisited the basic operation of the major TRX building blocks, namely the local oscillator (LO), TX, and RX, from the standpoint of power consumption for the relatively relaxed BLE performance (and IoT in general), and attempted to rearchitect the RF circuitry, given the objectives of advanced CMOS technology and full monolithic integration [9]. The LO and TX parts are largely based on a standalone TX published recently in [17] and [18], so only a quick LO/TX summary and new features are covered here. Fig. 1 shows a block diagram of the BLE TRX. It integrates all the required RF and IF building blocks and further adds a T/R antenna switch with adjustable digital PA (DPA) and low-noise transconductance amplifier (LNTA) matching networks such that the RF input/output (RFIO) pin can be directly connected to the antenna.
II. BRIEF OVERVIEW OF TRANSCEIVER ARCHITECTURE
The LO shown in Fig. 2 is an ADPLL based on a switched current-source digitally controlled oscillator (DCO) [19] and a phase-predictive TDC [18]. The sensitivity of the RF oscillator to common-mode noise (e.g., flicker noise of the oscillator transistors and supply modulation) is the main origin of the oscillator's flicker noise upconversion and frequency pushing. It is relatively well known that oscillators with a lower 1/f^3 phase noise (PN) corner are also less sensitive to supply/load modulation and thus to frequency pushing and pulling [20]. While the reasons for the low flicker noise upconversion of the switching current-source oscillator have already been discussed in depth in [18] and [21], this paper deals with its low supply-induced frequency pushing. Contrary to traditional cross-coupled oscillators, supply perturbations here cannot directly modulate the gate-source voltage V_gs of the M1-M4 devices (see [18, Fig. 7]). Note that the V_B biasing does not consume any dc current; therefore, realizing an on-chip V_B voltage reference with a good PSRR would be quite straightforward. Consequently, supply perturbations cannot modulate the oscillator's dc current or the nonlinear gate-source capacitance (C_gs) of the M1-M4 devices. Furthermore, the upper-pair transistors operate only in the cutoff, subthreshold, and saturation regions. The variation of the gate-drain capacitance is almost negligible in these regions. As a result, the frequency pushing here is at least an order of magnitude better than in traditional structures.
The lower frequency pushing allows the ADPLL loop to be frozen after acquiring lock since the packet duration of ∼500 μs (and up to several milliseconds) will not incur any significant frequency drift.
The TX uses a dedicated DCO switchable capacitor bank (Mod M/L in Fig. 2) to perform direct GFSK modulation, while providing an option of also feeding the GFSK data to the second compensating feed (Data FCW) if the ADPLL is not frozen (i.e., two-point modulation). The DPA is realized as a switched-mode class-E/F_2 topology with a transformer-based matching network, which was found to maximally enhance its efficiency at low supply voltage [22]. The TX architecture is also described in detail in [18]. This paper introduces a number of architecture- and circuit-level innovations to minimize the power and cost (i.e., maximum integration and lowest die area) of the BLE TRX. The transformers for the DCO and DPA are redesigned to incorporate the associated active devices underneath them with no significant loss of performance or power efficiency. This leads to a >30% reduction in the TX area.
The key innovation in the RX architecture was derived from realizing that the best devices and basic building blocks in low-voltage deep-nanoscale CMOS are logic gates, transistor switches, inverter-like g_m transconductors, and metal-oxide-metal (MOM) capacitors [15]. Hence, the most logical topology would be a charge-domain switched-capacitor network operating in DT. However, to minimize power consumption, the MOS devices would need to be remarkably small, which would invariably increase their flicker noise corner. To mitigate that, we propose to increase the RX intermediate frequency (IF) to just beyond the flicker corner frequency and to filter the IF signal using complex-domain cascaded bandpass filters (BPFs). The first of these filters is clocked at multiples of the LO frequency, performing sufficient filtering and harmonic rejection. System-level aspects of the BLE DT-RX are beyond the scope of this paper and are discussed in [38].
The remainder of this paper is organized as follows. Section III gives an overview of the ULP aspects of the ADPLL and TX and discusses the vertical RF integration of passive and active components. Section IV details the DT-RX implementation. Section V presents the RFIO matching and switching. In Section VI, the experimental results are discussed.
III. ALL-DIGITAL PHASE-LOCKED LOOP AND TRANSMITTER ARCHITECTURE
The LO signal for the BLE TX and RX is derived from a DCO running at 4.1-5.1 GHz (see Fig. 2). This promotes smaller on-chip transformers with maximized quality factor (Q). Two separate ÷2 dividers generate the required quadrature signals (CKV_0-3 in Fig. 2) for the RX mixer and the predictive TDC [23], which supports ADPLL locking for any crystal oscillator frequency reference (FREF) between 1 and 40 MHz. In the TX mode, the DPA feeds the RFIO pin with P_out ≤ 3 dBm, which may be controlled in 16 steps (−5 to 3 dBm range with finer resolution at higher power levels [18]) to save power and to limit interference to other users. In this TRX IC chip, the Gaussian frequency shift keying (GFSK) modulating data are stored in SRAM.
For IoT applications, flexible power management is essential to improving the battery life. Several techniques are exploited here to reduce the TX power consumption. First, supply voltage reduction leads to significant power savings for the RF-critical circuits (i.e., oscillator and power amplifier) [18] and makes the TX more compatible with energy harvesters, e.g., photovoltaic cells. The nominal supply voltage for core devices in a 28-nm LP CMOS technology is 1.05 V but is projected to scale toward 0.5 V within several generations of FinFET technology. Second, a dynamically programmable reference clock is employed to scale down the ADPLL update rate from 40 MHz to 5 MHz. The ADPLL digital logic power consumption decreases in proportion to the frequency down-scaling ratio, while the in-band PN (i.e., between 1 and 100 kHz) deteriorates by the same ratio. Finally, after the ADPLL has settled, the DCO can be put into open-loop modulation for large savings in power consumption. However, if the dynamically reduced FREF rate or the open-loop DCO approach is to be used, the most important factor to be considered is the system tolerance to frequency drift, which must be well below the BLE limit of 400 Hz/μs [11], [24]. Besides, as CMOS technology is scaled down, the relative area occupied by on-chip transformers/inductors increases because the transformers cannot scale down with the minimum feature size of the technology. In this paper, both a DCO and a DPA with active devices underneath the transformers are reported. This achieves a 30% area reduction compared with a previous design [18] that used lateral separation of passives and actives.
Supported by the low flicker noise of the oscillator, the ADPLL first acquires lock in closed loop and then freezes the DCO tuning words while mostly shutting itself down [i.e., only the red bold circuits in Figs. 2 and 3 are operational], letting the DCO free-run and generate the LO signal for TX (FM-modulated) and RX (CW) packets in open loop. Alternatively, the blue bold circuits in Fig. 2 allow the effective reference clock to be dynamically scaled down in closed-loop operation to save digital power. For more details, see [18] and Section VI.
A. Vertical Layout Integration
The area of the proposed TX is dominated by two large passive components: a tuning LC tank transformer in the switched current-source DCO and a matching-network transformer in the class-E/F_2 DPA. The DCO transformer needs to be optimized for the largest Q-factor possible in a given technology, since both its spot PN L(f) and figure of merit (FoM) are proportional to 1/Q^2. Likewise, the Q-factor of the DPA transformer needs to be maximized to achieve the highest reasonable efficiency, η_p(max), as derived in [25]
where Q_p and Q_s are, respectively, the Q-factors of the DPA transformer's primary and secondary windings and k_m is the magnetic coupling factor between the windings. Inspection of the inductors available in this process design kit reveals that there exists a certain minimum inductor area below which the Q-factor starts dropping. Further, as shown in [26], at the same inductance value, multiturn windings occupy a smaller area but have a much lower Q-factor than a single turn. We surmise that this observation extrapolates to transformers. This reasoning leads to the conclusion that further area reduction in the transformer-/inductor-dominated TX would only be possible by making the active devices (i.e., the remainder of the constituting components) somehow "disappear" beneath these passives [27]. Naturally, this needs to be done without degrading the precious Q-factor and other performance parameters.
Fig. 4 illustrates this idea of vertical integration. In a conventional TX layout implementation, such as that in [18], the circuitry associated with each of the two transformers is placed laterally nearby [see Fig. 4(a)]. In the proposed implementation, shown in Fig. 4(b), these associated components are placed underneath their respective transformers. Specifically, the DCO transformer shares its die area vertically with its cross-coupled transistors, switchable tuning capacitor banks, output buffer, and divider, whereas the DPA transformer shares its die area vertically with its clock network, switching transistors, and matching switchable-capacitor banks. In this way, the 30% area reduction of the TX has been achieved with no detrimental effects on the performance, as will be explained in the following. Fig. 4(c) reveals the vertical arrangement of the TX within the multilayer metal stackup.
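The sensitivity of the DPA matching-network efficiency to Q_p, Q_s, and k_m, noted at the start of this subsection, can be explored numerically. The sketch below assumes the widely quoted closed form for the maximum efficiency of a transformer-based matching network; the expression derived in [25] is not reproduced in this text, so the formula, the function name `transformer_peak_efficiency`, and the sample Q and k_m values are illustrative assumptions rather than the design equations or measured values of this work.

```python
import math


def transformer_peak_efficiency(q_p, q_s, k_m):
    """Peak passive efficiency of a transformer-based matching network.

    Assumed closed form (not taken from this paper's text):
        eta = 1 / (1 + 2*sqrt(x*(1 + x)) + 2*x),  with x = 1/(Q_p*Q_s*k_m^2)
    """
    x = 1.0 / (q_p * q_s * k_m ** 2)
    return 1.0 / (1.0 + 2.0 * math.sqrt(x * (1.0 + x)) + 2.0 * x)


if __name__ == "__main__":
    # Hypothetical winding Q-factors and coupling factors, for illustration only.
    for q_p, q_s, k_m in [(10, 10, 0.75), (15, 12, 0.75), (15, 12, 0.90)]:
        eta = transformer_peak_efficiency(q_p, q_s, k_m)
        print(f"Q_p={q_p}, Q_s={q_s}, k_m={k_m}: eta_max = {eta:.3f}")
```

Under this assumed form, the efficiency rises monotonically with the product Q_p·Q_s·k_m^2, which is why any Q-factor loss caused by circuitry placed under the transformer would translate directly into lost TX efficiency.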
In this 28-nm digital CMOS technology, the upper Cu metal layers M8, M9, and AP are, respectively, 9.5×, 40×, and 31× thicker than the thin AT lower Cu metal layers M1-M7 of uniform thickness (i.e., 1×). To obtain high quality factors (e.g., Q > 11) at the resonance frequency, the two transformers (T-DCO and T-DPA) occupy the dedicated three upper thick-metal layers (M8-AP). The main coils are routed at M9, with M8 and AP serving only for cross routing. The core devices occupy metal layers M5 and lower, allowing placement underneath the transformers. A signal routing channel for non-RF signals is implemented at M6, which is largely kept empty but available to promote future integration. The patterned ground shield (PGS) implemented in M7 is placed between the transformers and the ADPLL/TX circuitry to improve isolation from the "noisy" active circuitry at M1-M6. The optimal vertical integration arrangement was determined through extensive electromagnetic (ADS™/PeakView LEM™) simulations. Keeping M8 largely empty turned out to favorably reduce the capacitance between M7 (PGS) and M9 (transformer).
To gain deeper understanding, Fig. 5(a) illustrates the simplified equivalent circuit of the transformer model without the PGS and is compared with that of the transformer model in (2), based on a method in [27]. The secondary winding inductance L_s can be analyzed similarly as
in which the equivalent parallel resistance R_X is
where L_p is the transformer's primary winding inductance, r_p is its series metal resistance, C_OX is the thick-oxide capacitance, C_sub is the substrate capacitance, R_sub is the substrate resistance, and C_OX,m79 is the oxide capacitance between the PGS (M7) and M9. In Fig. 5, the impedance seen by the primary winding inductance L_p and presented by the combination of C_OX, C_sub, and R_sub is replaced by a parallel equivalent resistance R_X in (3), which is frequency dependent. However, when the PGS is added, this equivalent resistance becomes very large, as given by R_X in (4). This is shown in Fig. 5(b): since the PGS at M7 is shorted to ground, C_OX, C_sub, and R_sub can be ignored, thus increasing the winding's Q-factor. However, M7 brings the bottom plate of the shunt parasitic capacitance closer to the inductor; this capacitance now appears as C_OX,m79 and somewhat decreases the self-resonance frequency of the winding.
A comparison of the transformer's parameters obtained through EM simulations between the cases of "without PGS" and "with PGS" is shown in Table I . The results confirm that the transformer's substrate loss "with PGS" is lower than that of "without PGS" but at a cost of a bit lower self-resonant frequency.
The simulated Q of the transformers, shown in Fig. 6, further illustrates better Q-factors with the PGS than without it at the operating frequencies of interest. This confirms that the insertion of the PGS and the use of proper empty layers (M8) allow the same or better Q-factors compared with [18], while preventing any significant coupling from the TX subblocks underneath, as verified via performance measurements in Section VI. The peak Q_p and Q_s of the resonator with the components underneath are ∼17 and 13, which are ∼3% higher than the ∼16.5 and 12.5, respectively, of the transformer with the components outside.
IV. DISCRETE-TIME RECEIVER
Recent ULP RXs for BLE achieve significant power reduction [3], [28] and a higher level of integration [5], [8], [10], primarily using sliding-IF and low-IF continuous-time architectures. To reduce the RX power consumption beyond the state of the art while maintaining adequate performance margin with the stated goal of direct antenna connection (see Fig. 1: shared single TX/RX RFIO pin and no external RF components, such as a bulky and costly antenna band-selection filter), we propose a DT high-IF or superheterodyne RX architecture with complex-signaling BPFs and a progressively reduced sampling rate. The approach exploits the charge-sharing (CS) BPF recently introduced for a high-performance cellular 4G RX [15], [16] and now adapted for the first time for ULP applications.
To be able to eliminate the external RF components, out-of-band (OOB) blockers are attenuated here using a combination of a charge-sampling mixer and a full-rate (∼4 × 2.45 Gsamples/s) CS-BPF, which are protected from aliasing only by selectivity of a narrow-band LNTA. An additional OOB blocking protection margin is offered by the fully integrated TX/RX matching network. The full-rate DT circuitry is followed by two cascaded CS-BPFs that operate at 16× decimated sampling clocks for power reduction, thus preparing the received signal for moderate dynamic range and aliasing requirements of the following ADC.
A. Full-Rate Sampling Strip
The front-end section (strip) of the DT-RX is presented in Fig. 7. It consists of the narrow-band LNTA, a single-ended-to-differential quadrature sampling mixer, and a DT 4/4 CS-BPF. The 4/4 CS-BPF is implemented with four rotating capacitors (C_R) sampled at four phases with 25% duty cycle (D).
There are two key principles here. First, with no interstage decimation, both the mixer and the DT filter operate at the same effective sampling frequency, 4× higher than the LO (f_s = 4f_LO), giving an oversampling ratio (OSR) of more than 2, which avoids aliasing up to the third clock harmonic. Second, the low quality factor (Q) of the complex 4/4 CS-BPF is mapped to a high Q of RF input filtering by means of frequency translation by the current-mode sampling mixer [29], [30]. This beneficial mixer transparency effect was first exploited for a CS-BPF architecture in a highly linear software-defined radio for cellular 4G applications [16]. For the ULP application designed here, the high-performance but power-hungry architecture composed of a differential LNTA, 8-phase sampling mixer, and 8/16 CS-BPF with harmonic rejection introduced in [16] is converted into a ULP structure. A single-ended narrow-band LNTA, a 25% duty-cycle quadrature passive mixer, and a 4/4 CS-BPF are presented in Fig. 7.
The LNTA is composed of two stages: a single-input/single-output common-source cascode low-noise amplifier (LNA) and a common-source transconductance (g_m) amplifier. Both stages operate in moderate inversion (g_m/I_d = 18 and 12 V^-1), as opposed to the strong-inversion operation in prior reports, in order to reduce power consumption, with I_d = 400 and 100 μA, respectively, biased with current mirrors (not shown in Fig. 7). Capacitors C_g and C_d are 4-b programmable to tune the LNTA input matching network and its tank load over process, voltage, and temperature (PVT), as well as over package parasitics.
Besides converting from single-ended into differential, and hence implicitly acting as a balun, the D = 25% quadrature passive mixer operation results in a low quadrature imbalance, low noise, and high linearity [31] . The sampling mixer 
where C_R is the rotating capacitor, C_H is the history capacitor, and α = C_H/(C_H + C_R). The Q-factor of the 4/4 CS-BPF is directly related to the output resistance of the previous cell (R_out). When R_out ≥ 3R_eq, where R_eq is the input resistance of the filter, the best Q of 0.5 is obtained. R_eq and the center frequency (f_c) of the filter are given by the following and offer a tradeoff between capacitor size (area), sampling frequency (power), and noise:
The gain of the first stage is given simply by G_m × R_eq, the product of the effective transconductance of the LNTA and its resistive load, R_eq. In this ULP application, the strategy is to reduce C_R as much as possible in order to increase R_eq and, consequently, the gain of the first stage, reducing the overall noise figure (NF) of the full-rate strip. The LNTA output impedance is increased by designing the second-stage transistor with L = 100 nm, which results in R_out = 3.5 kΩ. Here, C_R was selected as 100 fF; a smaller C_R value would improve noise while reducing filter quality. To reduce area, the C_H capacitors are implemented differentially as their series equivalent C_H/2, as noted in Fig. 7. To account for PVT variations and offer flexibility in f_c tuning, C_H and C_R are implemented as 5-bit binary-weighted capacitor banks. The combined banks enable programming of f_c from 1 to 14 MHz and R_eq from 667 Ω to 4.67 kΩ. Fig. 7(b) plots the transfer function (TF) of the infinite impulse response (IIR) filter of the 4/4 CS-BPF (5). As expected of any DT filter, the TF reveals repetition peaks (replicas) at multiples of f_s (≈9.8 GHz). The repetition peaks are folded to dc, but not before being attenuated by the windowed integration sampling (WIS) effect of the current-mode sampling, which creates a built-in sinc filter response, also shown on the plot. The combination of these two effects creates a filtering shape nearly independent of the repetition [32], [33]. Due to the mixing operation, odd harmonics of the LO (±3f_LO, ±5f_LO, ...) are also translated to dc after being attenuated by the 4/4 CS-BPF with more than 60 dB of protective filtering. It should be noted that BLE requires protection from −35 dBm OOB blockers, which means 50 dB of attenuation to keep the RX SNR at ≥15 dB.
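The numbers quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes the textbook switched-capacitor relation R_eq = 1/(f_s·C_R) for the rotating capacitor; the filter equations themselves are not reproduced in this text, so this relation, the function name `cs_bpf_input_resistance`, and the choice of 2.45 GHz as a representative LO frequency are assumptions.

```python
def cs_bpf_input_resistance(c_r_farads, f_s_hz):
    """Equivalent input resistance of a rotating (switched) capacitor.

    Assumes the textbook relation R_eq = 1 / (f_s * C_R).
    """
    return 1.0 / (f_s_hz * c_r_farads)


F_LO = 2.45e9     # Hz, representative mid-band LO frequency (assumption)
F_S = 4 * F_LO    # Hz, full-rate sampling clock, about 9.8 GHz
C_R = 100e-15     # F, rotating capacitor value quoted in the text
R_OUT = 3.5e3     # ohm, LNTA output resistance quoted in the text

r_eq = cs_bpf_input_resistance(C_R, F_S)
ratio = R_OUT / r_eq

print(f"R_eq = {r_eq:.0f} ohm, R_out / R_eq = {ratio:.2f}")
print("R_out >= 3 * R_eq satisfied:", ratio >= 3.0)
```

With these values, R_eq comes out close to 1 kΩ, inside the quoted 667 Ω to 4.67 kΩ programming range, and the R_out ≥ 3R_eq condition for the best filter Q is met with a small margin.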
B. IF Filtering at Decimated Clock
The back-end strip of analog DT-RX conditions the IF signal for the 20-Msamples/s 9-bit SAR ADC. The pass-band IF signal is amplified while in-band interferers are sufficiently attenuated such that they do not saturate the ADC nor alias due to its sampling. This strip is composed of programmable inverter-like gain stages and two 4/8 CS-BPFs [16] . Schematics of both 4/4 and 4/8 CS-BPFs are shown in 
where α = C_H/(C_H + C_R). C_H and C_R are also implemented using 5-b programmable capacitor banks to allow for IF and input impedance adjustments. Fig. 8(c) compares the TFs of both filters according to (5) and (7), showing 5 dB of additional image rejection in the 4/8 CS-BPF case.
Since the 4/8 CS-BPF is composed of 4× more switches than its counterpart, it could suffer from higher power consumption. This effect is alleviated by clocking the filter at a reduced ÷16 rate. This decimation process is trivially implemented by clock reduction, allowing multiple input samples to be integrated onto the rotating capacitors, which creates a finite impulse response filter described by (8) [33], [34]
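Integrating N consecutive input samples onto one capacitor before it is read out behaves like an N-tap moving-sum (boxcar) FIR filter followed by decimation by N. The sketch below illustrates that generic behavior under the assumption of ideal, equally weighted charge packets; the FIR expression of the paper is not reproduced in this text, so the boxcar response, the helper names, and the 2.45-GHz LO value are assumptions.

```python
import cmath
import math


def boxcar_gain_db(f_hz, f_s_hz, n_taps):
    """Normalized magnitude (dB) of an n_taps moving-sum FIR at frequency f_hz."""
    w = 2.0 * math.pi * f_hz / f_s_hz
    h = sum(cmath.exp(-1j * w * k) for k in range(n_taps)) / n_taps
    mag = abs(h)
    return 20.0 * math.log10(mag) if mag > 1e-12 else float("-inf")


def integrate_and_dump(samples, n_taps):
    """Sum each block of n_taps samples: FIR filtering plus decimation by n_taps."""
    return [sum(samples[i:i + n_taps])
            for i in range(0, len(samples) - n_taps + 1, n_taps)]


F_S = 4 * 2.45e9   # Hz, full-rate clock (assumed mid-band LO of 2.45 GHz)
N = 16             # decimation ratio used for the 4/8 CS-BPFs
IF = 5e6           # Hz, intermediate frequency quoted in the text

print("gain at IF          :", boxcar_gain_db(IF, F_S, N), "dB")
print("gain at f_s/N       :", boxcar_gain_db(F_S / N, F_S, N), "dB")
print("gain at f_s/N + IF  :", boxcar_gain_db(F_S / N + IF, F_S, N), "dB")
print("decimated output    :", integrate_and_dump([1] * 32, N))
```

The boxcar response is essentially flat at the IF and has nulls at multiples of f_s/N, which are exactly the frequencies that fold onto dc after decimation; signals offset from those nulls by the IF see finite attenuation, so the preceding full-rate filter still has to provide most of the alias protection.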
Gain stages in the RX path are implemented using highly linear inverter-like transconductors presented in Fig. 9(a). Programmable gains of 1.7, 7.1, 10.5, and 12.5 dB are supported considering a load of 3.27 kΩ presented by the input resistance of the 4/8 CS-BPF. In order to increase the output impedance and produce a current output, the g_m-stage devices are sized with L = 200 nm. Fig. 9(b) presents gain and NF simulations for the programmable gains. The g_m-cell inverters are biased in moderate inversion (g_m/I_d = 18 V^-1) with a current of 12 μA to reduce power consumption. Biasing of the pMOS devices is implemented using a common-mode feedback (CMFB) loop with 86° of phase margin. DC-blocking capacitors at the input and output of the g_m cells form high-pass filters with the bias resistors and CMFB resistors, respectively. The corner frequency of the resulting filter was designed at 1 MHz and can be observed in the simulation results [Fig. 9(b)].
C. Clock Generation Circuitry
The D = 25% and 12.5% duty-cycle clock phases needed for the DT-RX operation are generated in several steps. First, a differential clock of 2f_LO ≈ 4.9 GHz from the DCO is divided by 2 to generate D = 50% quadrature clocks [Fig. 10(a)] using high-speed differential D-latches based on tri-state inverters designed with low-V_t devices [Fig. 10(b)]. In the second step, the D = 50% quadrature clocks are processed to generate D = 25% clocks, which are separately buffered for the mixer and the 4/4 CS-BPF, using customized CMOS topologies with low-V_t devices, as presented in Fig. 10(c). Finally, the D = 12.5% clocks are generated using standard CMOS cells available in the technology and independently buffered for the two 4/8 CS-BPFs [Fig. 10(d)]. In order to minimize the power consumed in parasitic routing capacitance, the D = 25% generation blocks are located very close to the mixer and the 4/4 CS-BPF.
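The second step can be pictured with simple Boolean logic: ANDing each D = 50% quadrature clock with a neighboring (or inverted) quadrature phase yields four non-overlapping D = 25% phases. The sketch below is a behavioral illustration only; the gate-level choice and all function names are assumptions, and the actual circuit is the customized CMOS topology of Fig. 10(c).

```python
def quadrature_50(steps_per_period):
    """Return I (0 deg) and Q (90 deg) clocks with 50% duty cycle over one period."""
    n = steps_per_period
    i_clk = [1 if k < n // 2 else 0 for k in range(n)]
    q_clk = [1 if n // 4 <= k < 3 * n // 4 else 0 for k in range(n)]
    return i_clk, q_clk


def _invert(bits):
    """Logical NOT of a sampled clock waveform."""
    return [1 - b for b in bits]


def _and(a, b):
    """Logical AND of two sampled clock waveforms."""
    return [x & y for x, y in zip(a, b)]


def phases_25(i_clk, q_clk):
    """Combine 50% quadrature clocks into four non-overlapping 25% phases."""
    return [
        _and(i_clk, _invert(q_clk)),           # phase 0: first quarter period
        _and(i_clk, q_clk),                    # phase 1: second quarter period
        _and(_invert(i_clk), q_clk),           # phase 2: third quarter period
        _and(_invert(i_clk), _invert(q_clk)),  # phase 3: fourth quarter period
    ]


i_clk, q_clk = quadrature_50(8)
phases = phases_25(i_clk, q_clk)
for idx, phase in enumerate(phases):
    print(f"phase {idx}: {phase}")
```

Exactly one phase is high at any instant, which is the property the passive mixer and the 4/4 CS-BPF rely on to avoid charge sharing between rotating capacitors.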
D. Strategies for Low-Power Discrete-Time Receiver Operation
The ULP operation achieved by this DT-RX is a consequence of the higher impedance of the DT filters and an aggressive clock decimation. In [15] and [16], the cellular DT-RXs feature a higher IF due to the required higher bandwidth and lower NF. Here, the filters operate at IF = 5 MHz, which is just beyond the flicker noise corner of the active devices, and through (6), the capacitance C_R can be reduced with a consequent increase in the input impedance of the filters (R_eq). Since the RX gains are of the form G_m × R_eq, this increase enables high gain with lower g_m and hence with a lower current in the transconductors. The increase of the input impedances also allows for the use of smaller switches (with higher resistances) both in the mixer and in the filters, with a consequent reduction in the power consumption of the clock generation. Additionally, the less challenging NF requirement of the BLE standard allows for the adoption of lower gain and power in the LNA, with a consequent increase of the LNTA output impedance. This in turn allows for an increase in the input impedance of the first filter, which must be around three times smaller than the LNTA's output impedance.
The relaxed linearity requirements of BLE, especially when compared with [15] and [16], enable the LNA and g_m cells to work in moderate inversion, thus reducing the power consumption of the g_m cells and LNA when compared with the alternative strong-inversion implementation.
The next ULP technique is decimation, which is carried out as early as possible in the RX path. The first BPF stage runs at the full rate since it uses the same 4×f_LO sampling clock rate as the mixer. This is done to increase blocker protection and to reduce noise folding. The second BPF stage implements a decimation by 16 to drastically reduce the power consumption of the clock generator. It is protected against aliasing by the first filter, which offers 55 dB (plus 13 dB due to the LNTA attenuation) of protection at a 4×f_LO/16 ≈ 612.5-MHz offset, which is enough to avoid any impact on the RX sensitivity from folding of a blocker at that frequency [Fig. 7(b)]. From the standard requirements, the protection should be higher than 58 dB = −30 dBm (OOB blocker) − (−67 dBm) (required sensitivity) + 21 dB (co-channel interference defined by the standard) [11]. A decimation higher than 16 in the second stage would leave little margin for the filter implementation. Between the second and third filter stages, there is no further decimation in order to avoid any additional clock generation circuitry, since the power consumption of these blocks is already very low (around 160 μA in simulations for the clock generation of both 4/8 CS-BPFs, including buffers).
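The alias-protection budget above is plain dB arithmetic and can be reproduced as follows. All levels are the ones quoted in the text; the variable names and the 2.45-GHz LO used to evaluate the alias offset are assumptions.

```python
F_LO = 2.45e9                 # Hz, representative LO frequency (assumption)
DECIMATION = 16               # decimation ratio of the second BPF stage

f_s = 4 * F_LO                # full-rate sampling clock
alias_offset_hz = f_s / DECIMATION

blocker_dbm = -30.0           # OOB blocker level
wanted_dbm = -67.0            # required sensitivity level
co_channel_db = 21.0          # co-channel interference ratio from the standard

required_db = blocker_dbm - wanted_dbm + co_channel_db

filter_db = 55.0              # protection from the first CS-BPF at the alias offset
lnta_db = 13.0                # additional attenuation from the narrow-band LNTA
margin_db = filter_db + lnta_db - required_db

print(f"alias offset        : {alias_offset_hz / 1e6:.1f} MHz")
print(f"required protection : {required_db:.0f} dB")
print(f"available protection: {filter_db + lnta_db:.0f} dB")
print(f"margin              : {margin_db:.0f} dB")
```

The resulting margin of about 10 dB is what a larger decimation ratio would erode, since the first alias would then move closer to the passband where the first filter attenuates less.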
V. RADIO-FREQUENCY INPUT-OUTPUT SWITCHING AND MATCHING
Fig. 12(a) illustrates the proposed implementation of the on-chip matching networks with a soft T/R switch (i.e., without any explicit switches carrying RF signals) between the TX and RX paths. In the RX mode [see Fig. 12(b)], the PA transistors are OFF and, consequently, the TX is simplified to the PA transformer-based TX matching network (TXMN) acting as a second-order resonator. In this mode, the ultimate goal is to alleviate the side effects of the TXMN on the RX NF and input return loss. To analyze the system in the RX mode, the Thevenin equivalent circuits of the TXMN and LNA are employed as shown in Fig. 12(c). The RX noise factor (F) can be calculated by
where V^2_n,RX and Z_RX are, respectively, the equivalent input noise and input impedance of the LNTA at the operating frequency ω_0 and may be estimated by the following equations [35]:
and
Furthermore, Z_PA,RX and V^2_n,TX are, respectively, the output impedance and equivalent output noise of the PA's matching network and may be calculated by (12) and (13), where k_m is the magnetic coupling factor of the transformer and r_p and r_s model the equivalent series resistance of the primary L_p and secondary L_s inductances [36]. To reduce the side effect of the TXMN on the RX's noise factor, the last term in (9) should be minimized. By employing (12) and (13)
As shown in Fig. 11(a), there exists a global optimum frequency ω_opt that minimizes the contribution of the TXMN to the system NF. It can be shown that (14) reaches its minimum at
To achieve the minimum NF penalty, one should tune the C_1 switchable capacitor to roughly adjust ω_opt to near ω_0. The optimum V^2_n,TX/|Z_PA,RX|^2 is then obtained by inserting (15)
As a result, the noise factor penalty reduces with increasing Q_s and k_m, which fortunately coincides with efforts to optimize the efficiency of the PA's matching network [18]. However, a step-down transformer must be employed for the PA's matching network to scale up the load resistance seen by the PA's transistor in order to achieve the highest possible efficiency at a relatively low output power of 3 dBm. This works against the noise factor optimization, as evident from (16), and clearly demonstrates a tradeoff between TX efficiency and RX noise factor. The total noise factor may be estimated by inserting (10), (11), and (16) into (9)
By considering L_s = 880 pH, Q_s = 11, and k_m = 0.75, the noise factor penalty in (17) can be as low as 0.22.
On the other hand, the input impedance of the RX must be matched to the antenna impedance. The input matching of the LNTA is quite sensitive to the imaginary part of the impedance seen from the output pad toward the TX. Hence, it is desired that the main resonant frequency of the PA's matching network be roughly adjusted to the operating frequency, ω_0. This also facilitates designing the PA and LNTA more independently. The fundamental resonant frequency of the transformer-based resonator may be estimated by [36]
Note that L_p C_1 should be chosen to satisfy (15) in order to achieve the lowest noise factor. Consequently, one should tune the switchable capacitors C_2 to adjust the resonant frequency of the PA's matching network to ω_0. By inserting (15) into (18), we obtain
Consequently, to simultaneously achieve the lowest NF and input insertion loss, one should adjust the C_1 and C_2 switchable capacitors to satisfy (15) and (19), respectively. Under this condition, the TX's output impedance becomes purely resistive [see Fig. 11(b)] and may be estimated by
As a result, the input matching can be realized by adjusting the transconductance gain of LNTA via
Now, moving attention to the TX mode, the LNTA's transistor is OFF, and consequently, the RX path can be simplified to a series RLC network (RXMN) as shown in Fig. 12(d) .
In this mode, the ultimate goal is to alleviate the side effects of RXMN on the efficiency of the PA. To analyze this efficiency drop, it is more convenient to replace the RLC series network with its equivalent parallel capacitance (C RX ) and resistance (R RX ), as illustrated in Fig. 12(d) . It can be shown that
where ω RX and Q RX are, respectively, the RXMN's resonant frequency 1/(L 1 + L G )C gs 1/2 and its quality factor. Due to R RX power dissipation, the PA's efficiency scales down with
As a result, the side effect of RXMN on the TX efficiency can be minimized by having a larger R RX . As can be gathered from (22), this can be achieved by pushing the resonant frequency of the LNTA's matching network to a much lower or higher frequency than ω 0 via the C gs switchable capacitor bank. By simultaneously considering η RX optimization over PVT variations and the quality factor degradation of switchable capacitors in the on-state, C gs,max /C gs,min is chosen to be ∼4, resulting in 1.5 GHz ≤ ω RX /2π ≤ 3 GHz. At the lower boundary of ω RX , the PA sees the RX path as a small negative capacitor in parallel with R RX ≈ 1 kΩ, which models the LNTA matching network losses. This negative capacitance is absorbed by the PA's matching network, while R RX presents a large resistance path for the TX signal (compared with the 50-Ω load), which leads to a negligible penalty (<5%) in the efficiency of the TX.
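The series-to-parallel view of the RXMN described above can be illustrated with a short numerical sketch. The component values below (total series inductance, series loss resistance) are hypothetical and were picked only so that the numbers land near the quoted 1.5-3 GHz range and R RX ≈ 1 kΩ; they are not taken from the design. The efficiency factor assumes a simple power split between R RX and the 50-Ω load, which may differ from the expression used in the paper.

```python
import math

def series_rlc_to_parallel(r_s, l, c, omega):
    """Return (R_p, C_p) of the parallel RC network whose admittance equals
    that of a series R-L-C branch at angular frequency omega."""
    x = omega * l - 1.0 / (omega * c)   # net series reactance
    mag2 = r_s ** 2 + x ** 2            # |Z|^2 of the series branch
    r_p = mag2 / r_s                    # 1 / Re{Y}
    c_p = -x / (mag2 * omega)           # Im{Y} / omega (negative if inductive)
    return r_p, c_p

def efficiency_factor(r_rx, r_load=50.0):
    """ASSUMPTION: fraction of the delivered power that reaches r_load
    when r_rx appears in parallel with it."""
    return r_rx / (r_rx + r_load)

# Hypothetical values, for illustration only.
OMEGA_0 = 2 * math.pi * 2.45e9    # mid-band operating frequency (assumed)
L_TOTAL = 5e-9                    # L1 + LG (assumed)
R_SERIES = 2.3                    # series loss resistance (assumed)
F_RX_MIN = 1.5e9                  # lower bound of the RXMN resonance

# Capacitance that places the RXMN resonance at 1.5 GHz, and a 4:1 bank ratio.
C_GS_MAX = 1.0 / ((2 * math.pi * F_RX_MIN) ** 2 * L_TOTAL)
C_GS_MIN = C_GS_MAX / 4.0
f_rx_max = 1.0 / (2 * math.pi * math.sqrt(L_TOTAL * C_GS_MIN))

# TX mode: C_gs at its maximum, RXMN evaluated at the operating frequency.
r_rx, c_rx = series_rlc_to_parallel(R_SERIES, L_TOTAL, C_GS_MAX, OMEGA_0)

print(f"f_RX range: {F_RX_MIN / 1e9:.2f} to {f_rx_max / 1e9:.2f} GHz")
print(f"R_RX = {r_rx:.0f} ohm, C_RX = {c_rx * 1e12:.2f} pF")
print(f"efficiency factor = {efficiency_factor(r_rx):.3f}")
```

Because the resonant frequency scales with the inverse square root of the capacitance, a 4:1 capacitance ratio gives the 2:1 frequency range. With the RXMN resonating below ω 0 , the branch is inductive, which is why the equivalent parallel capacitance comes out negative.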
In the TX mode, the highest efficiency is achieved by adjusting the switchable C gs to its maximum value and then tuning the C 1 and C 2 capacitors to satisfy the required matching network of class-E/F 2 operation. As explained in depth in [18] , class-E/F 2 tuning exhibits the lowest systematic drain current and thus P out at the same V DD and load resistance among various flavors of differential switched-mode PAs. Consequently, this PA needs a smaller impedance transformation ratio for P out ≤ 3 dBm, which results in a lower insertion loss for its matching network and thus higher system efficiency.
Fig. 13(a) shows the die photo of the proposed TRX implemented in TSMC 1P9M 28-nm digital CMOS. The total core area, including empty space between the sub-blocks, is merely 1.9 mm 2 . Fig. 13(b) shows the corresponding layout with area breakdown of the constituting blocks, which totals 0.97 mm 2 . To save mask costs, only core devices are used with an exception of 1.05 V, 28-nm low-V T (250 mV) transistors.
VI. EXPERIMENTAL RESULTS
Fig. 14(a) plots a representative PN at fractional-N BLE channels. When used as an LO with FREF divided down to 5 MHz, the closed-loop ADPLL consumes 1.4 mW with an integrated PN of 1.06°. It exhibits an in-band PN of −92 dBc/Hz, which corresponds to an average TDC resolution of ∼12 ps. Fig. 14(b) shows the ADPLL in-band PN at a 10-kHz offset, its PN performance at a 1-MHz offset, and the oscillator's FoM across the ADPLL tuning range (TR) of 2.05-2.55 GHz. The oscillator's average FoM is 188 dB and varies ∼2 dB across the TR. Moreover, the average in-band PN is −92 dBc/Hz with a 1-dB variation across the TR. The reason the input reference clock is divided down from 40 to 5 MHz, despite the 10 log 10 (40/5) = 9-dB in-band PN degradation, is to save the digital power of the ADPLL by 85%, as indicated in Fig. 14(c) .
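The link between the ∼12-ps TDC resolution, the −92 dBc/Hz in-band PN, and the 9-dB penalty of the lower reference rate can be checked with a short sketch. It assumes the commonly used uniform-quantization model of TDC noise in an ADPLL, L = ((2π)²/12)·(Δt res /T V )²/f REF , and a mid-band carrier of 2.45 GHz; neither the model nor the carrier value is stated in this section, so both are assumptions here.

```python
import math

def tdc_inband_pn_dbc_hz(t_res, f_ref, f_out):
    """In-band phase noise (dBc/Hz) set by TDC quantization alone.
    ASSUMPTION: uniform-quantization model,
    L = (2*pi)^2 / 12 * (t_res / T_V)^2 / f_ref, with T_V = 1 / f_out."""
    t_v = 1.0 / f_out
    l_lin = (2 * math.pi) ** 2 / 12.0 * (t_res / t_v) ** 2 / f_ref
    return 10 * math.log10(l_lin)

T_RES = 12e-12     # average TDC resolution quoted in the text
F_OUT = 2.45e9     # assumed mid-band carrier frequency

pn_5mhz = tdc_inband_pn_dbc_hz(T_RES, 5e6, F_OUT)
pn_40mhz = tdc_inband_pn_dbc_hz(T_RES, 40e6, F_OUT)
penalty_db = pn_5mhz - pn_40mhz    # equals 10*log10(40/5)

print(f"in-band PN at 5-MHz FREF : {pn_5mhz:.1f} dBc/Hz")
print(f"in-band PN at 40-MHz FREF: {pn_40mhz:.1f} dBc/Hz")
print(f"penalty of dividing FREF : {penalty_db:.2f} dB")
```

Under these assumptions the 5-MHz case evaluates to roughly −92 dBc/Hz, in line with the measured value, and the difference between the two reference rates is the 9 dB quoted above.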
To achieve simultaneous fast locking and power savings, the loop bandwidth is dynamically controlled via a gearshift technique [37] . During frequency acquisition, the loop operates in type-I, with a wide bandwidth of 2 MHz. It is then switched to type-II with a fourth-order IIR filter and a 500-kHz loop bandwidth when it enters the tracking mode. Finally, the loop bandwidth is reduced to 200 kHz to optimize the ADPLL integrated jitter. The measured lock-in time is <15 μs for f REF of 5 MHz, as shown in Fig. 16(a) . After settling, the rest of the ADPLL can be frozen (shut down) to improve the power efficiency of the BLE TX and/or RX.
Fig. 15 verifies the 1.0-Mb/s GFSK modulation of the TX. Fig. 15(a) and (b) show the eye pattern and modulation spectrum, measured without any intentional disturbances (e.g., on the supply line) at the midrange BLE channel of 2.456 GHz with a Rohde & Schwarz (R&S) vector signal analyzer. The measured GFSK modulation deviation for a 11110000 data pattern [i.e., without any intersymbol interference (ISI), which also corresponds to Δf1 in the BT standard] over the entire BLE range is shown in Fig. 15(c) . This measurement was retaken at 2.456 GHz for four IC samples and is shown in Fig. 15(d) . The average Δf1 frequency deviation is 250 kHz (versus the specification of 225-275 kHz) and the worst-case RMS FSK error is less than 3%. Similarly, the average measured modulation deviation Δf2, corresponding to the alternating "10" data pattern, which creates the most ISI, is 220 kHz [ Fig. 16(a) ]. This is close to the theoretical value with an ideal modulation. It leaves a 35-kHz margin above the 185-kHz specification; this margin needs to cover 3σ , i.e., three standard deviations of the rms frequency noise (as dictated by the 99.9% probability requirement of the BLE specification). Here, σ is calculated as ∼3 kHz from (25) and (26), which leaves enough margin.
The maximum frequency drift between the 0/1 symbol at the start of the BLE packet and the 0/1 symbols at any time within the packet payload shall be less than ±50 kHz. Fig. 16(a) shows that this is satisfied here with ample margin even in open-loop operation. Thanks to the DCO's low flicker PN and frequency pushing, its frequency drift is extremely small. The Δf1 frequency deviation here is ±247 kHz and the worst-case frequency drift is less than 8.5 Hz/μs within a single packet of 376 μs. We believe this technique can also handle multiple concatenated packets in the just-released BLE version 5, which extends its packet length to as long as 17 ms. Under this condition, the oscillator's residual FM noise due to its lowest frequency components (<1 MHz) together with its frequency drift due to voltage and temperature variations should be safely less than ±50 kHz to satisfy requirements for the open-loop operation of a BLE TX. The oscillator's residual FM can be calculated by
where L( f ) is the oscillator's PN at the offset frequency of f from the carrier. The lower integration bound f a is inversely proportional to the BLE packet length (50 Hz for BLE 5, worst case). f b is the bandwidth of the postdemodulation low-pass filter typically set a bit higher than the FSK symbol rate (i.e., 1 MHz). The PN of the proposed oscillator was reported in [18] . Its 1/f 3 PN corner is ∼100 kHz and its PN is −116 dBc/Hz at the 1-MHz offset from the carrier. Consequently, the oscillator's PN can be expressed by
As a result, the residual FM, Δf FM , will be ∼3 kHz. The PN of the proposed oscillator, therefore, appears to be good enough even for the 17-ms length of future BLE. On the other hand, as reported in [18] , the frequency drift of the proposed oscillator is ∼10.5 kHz during 3.4 ms of open-loop operation. By extrapolating the measured result, the frequency drift would be ∼50 kHz for the 17-ms BLE packet, which meets the specification, but with no margin left. Hence, the voltage- and temperature-related drift components must be reduced in future implementations in order to operate in the open-loop mode for such a long packet.
The average second and third harmonic levels are −51 and −48 dBm, respectively. The harmonic levels remain well below the −41 dBm regulatory limit and are plotted in Fig. 16(b) .
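The ∼3-kHz residual FM quoted above can be reproduced with a short sketch. It assumes the textbook residual-FM definition, Δf FM = sqrt(2·∫ f²·L(f) df) integrated from f a to f b , and a simple asymptotic PN model, L(f) = L(1 MHz)·(1 MHz/f)²·(1 + f c /f), built from the quoted −116 dBc/Hz at 1 MHz and the ∼100-kHz 1/f³ corner. Both are assumed to correspond to the paper's expressions, which are not reproduced in this text.

```python
import math

L_1MHZ_DBC = -116.0   # PN at 1-MHz offset (from the text)
F_CORNER = 100e3      # 1/f^3 corner frequency (from the text)
F_A = 50.0            # lower bound for a 17-ms packet (from the text)
F_B = 1e6             # post-demodulation filter bandwidth (from the text)

def pn_linear(f):
    """ASSUMED PN model: 1/f^2 region anchored at 1 MHz, plus a 1/f^3
    region below F_CORNER. Returns L(f) as a linear power ratio per Hz."""
    l_1mhz = 10 ** (L_1MHZ_DBC / 10)
    return l_1mhz * (1e6 / f) ** 2 * (1 + F_CORNER / f)

def residual_fm_closed_form(f_a, f_b):
    """sqrt(2 * integral of f^2 * L(f)) evaluated analytically for the model."""
    k = 10 ** (L_1MHZ_DBC / 10) * 1e12          # f^2 * L(f) in the 1/f^2 region
    integral = k * ((f_b - f_a) + F_CORNER * math.log(f_b / f_a))
    return math.sqrt(2 * integral)

def residual_fm_numeric(f_a, f_b, n=20000):
    """Same quantity via trapezoidal integration on a log-spaced grid."""
    ratio = (f_b / f_a) ** (1.0 / n)
    total = 0.0
    f_lo = f_a
    for _ in range(n):
        f_hi = f_lo * ratio
        g_lo = f_lo ** 2 * pn_linear(f_lo)
        g_hi = f_hi ** 2 * pn_linear(f_hi)
        total += 0.5 * (g_lo + g_hi) * (f_hi - f_lo)
        f_lo = f_hi
    return math.sqrt(2 * total)

fm_closed = residual_fm_closed_form(F_A, F_B)
fm_numeric = residual_fm_numeric(F_A, F_B)
print(f"residual FM (closed form): {fm_closed / 1e3:.2f} kHz")
print(f"residual FM (numeric)    : {fm_numeric / 1e3:.2f} kHz")
```

With this model, the 1/f² region and the 1/f³ region contribute roughly equally to the integral over 50 Hz to 1 MHz, so lengthening the packet further (lowering f a ) increases the residual FM only logarithmically.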
Functionality of the RX was verified on several channels covering the complete BLE band. Fig. 17(a) shows that the RX average performance figures are 46 dB of gain and >42 dB of image rejection (versus the BLE requirement of 31 dB). The finite image rejection is due to an uncompensated mismatch between I and Q clock phases. With no such imperfection, the rejection should be theoretically infinite for a superheterodyne RX [16] . Fig. 17(b) shows the average 6.5 dB of NF and −19 dBm of IIP3 from the first to the last BLE channel. The measured RX NF and filter characteristics are plotted. The OOB IIP2 of 50 dBm was measured for several channels in the BLE band in a two-tone test with frequencies of 2.5 and 2.505 GHz. A typical IIP2 measurement curve and its variation over BLE channels are presented in Fig. 18(a) and (b) . Fig. 19(a) shows the BLE RX packet error rate (PER) versus the input signal power. It was measured using an R&S CBTgo BT tester with help from an R&S FSW signal analyzer and a signal generator. The sensitivity is −95 dBm at 30.8% PER. Fig. 19(b) reveals merely a 1-dB sensitivity variation (under PER ≤30.8%) across BLE channels. For the OOB blocking measurement shown in Fig. 19(c) , the desired BLE signal is fixed at channel 12 with an input power of −67 dBm. Both the desired signal and the OOB CW blocker are injected into the RX. The OOB blocker power is recorded when the PER reaches 30.8%. The results corroborate the proposed full-rate DT-RX strategy and show that the RX is able to tolerate the OOB BLE blocker mask shown in Fig. 19(c) , thus eliminating the need for an expensive surface acoustic wave (SAW) or ceramic filter. Fig. 20 shows the TRX's RF port matching in both modes of operation. In the RX mode, the return loss S 11 is below −15 dB across the ISM band of 2400-2483.5 MHz, while in the TX mode it is between −19 and −13 dB.
The power consumption is summarized in Fig. 21 . The supply voltage for the DCO and DPA is 0.5 V, while it is 1.05 V for the rest of the circuitry. The continuous current consumption of each RF/analog building block is individually measured from its supply pin. The TX consumes 3.7 mW at 0-dBm RF output. During actual TX and RX packets, most of the ADPLL is shut down immediately after settling. The DCO tuning word is then maintained on its update port, while the second port is used to perform an open-loop modulation (see Fig. 2 ). This reduces the LO power from 1.4 to 0.6 mW. The RX consumes 2.75 mW at maximum gain, the setting at which the −95 dBm sensitivity is measured from the PER curve at PER = 30.8% [8] .
Table II summarizes the proposed TRX and compares it with recent state-of-the-art BLE designs. It is the first to be implemented in the 28-nm CMOS node. It reaches a similar RX performance (NF, linearity, and sensitivity) and a better TX performance (max P out , PLL PN) but at a much lower power consumption, even better than [3] and [6] , which use an off-chip matching network and T/R switch. Compared with the other two designs with a fully integrated on-chip T/R switch [5] , [8] , the power efficiency is over 2× better for both the TX and RX.
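As a quick sanity check of the power figures quoted above, the following sketch converts the 0-dBm output to milliwatts and derives two ratios. It assumes that the 3.7-mW figure is the total DC power drawn in TX mode, which the text does not state explicitly, so the resulting efficiency is only indicative.

```python
def dbm_to_mw(p_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (p_dbm / 10)

P_OUT_DBM = 0.0        # RF output power (from the text)
P_TX_DC_MW = 3.7       # TX consumption; ASSUMED to be total DC power in TX mode
P_LO_CLOSED_MW = 1.4   # LO power with the ADPLL running closed-loop
P_LO_OPEN_MW = 0.6     # LO power with most of the ADPLL shut down

tx_efficiency = dbm_to_mw(P_OUT_DBM) / P_TX_DC_MW
lo_saving = 1 - P_LO_OPEN_MW / P_LO_CLOSED_MW

print(f"TX efficiency at 0 dBm: {tx_efficiency * 100:.0f}%")
print(f"LO power saving       : {lo_saving * 100:.0f}%")
```

Under that assumption, roughly a quarter of the DC power reaches the antenna port at 0 dBm, and freezing the ADPLL after settling removes a little over half of the LO power.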
VII. CONCLUSION
A single-chip ULP TRX for IoT applications, fully compliant with the BLE standard, is demonstrated in a digital 28-nm CMOS technology. The main objectives of this paper are: 1) full monolithic integration and 2) maximum power efficiency. Toward the first goal, active devices associated with a DCO and a PA are placed underneath their passive RF components to promote vertical integration of passive/active components as opposed to their almost exclusive lateral monolithic integration done conventionally. The TX and RX share a single pin for a direct connection to an antenna. Toward the second goal, we have implemented several power-saving techniques, taking advantage of the relaxed specifications defined in the standard. The TX directly modulates the DCO in an open-loop manner. The RX is a DT superheterodyne architecture performing amplification and filtering using CS complex-signaling BPFs.
