I. INTRODUCTION
Digital processors are exploring integrated voltage regulators (IVRs) with on-chip/on-package capacitors and/or inductors to reduce power supply noise and facilitate fast power-state transitions [1]-[10]. The design of inductive IVRs with small inductance (L) and capacitance (C) requires a higher switching frequency to control the output ripple and enhance the loop bandwidth. Digital pulsewidth modulation (DPWM) control for IVRs facilitates on-chip integration with digital cores [6], [11]. Digital controllers can exploit a faster digital process by clocking the compensator at a higher operating frequency (F SAMP) to achieve a higher bandwidth. However, the switching loss limits the maximum switching frequency (F SW) and the achievable loop unity gain frequency (UGF) for a single-sampled (F SW = F SAMP) IVR. Krishnamurthy et al. [6] have shown that the bandwidth of a single-sampled IVR can be increased by using phase-shifted clocks, but this requires a fast analog-to-digital converter (ADC) with a conversion time much lower than the sampling clock period. An alternative approach to enhance bandwidth for a given switching frequency is multi-sampling, which is widely used for low-frequency (∼1 MHz) voltage regulator modules (VRMs) with off-chip passives [12], [13]. However, multi-sampling for a high-frequency (>100 MHz) inductive IVR imposes strict timing constraints on the digital compensator. Therefore, enabling multi-sampling in high-frequency IVRs is an important problem to address.
Integrated inductor technologies such as magnetic interposers [14], magnetic thin films [15], and on-chip spiral inductors [2] can have higher variation than discrete (packaged) inductors. The variations in inductance and capacitance degrade the steady-state power quality and the responses to transient events such as a load step or a power-state change [16].
The challenge is exacerbated in IVRs where the switching frequency is limited by the power efficiency, as the filter poles (∼10 to 30 MHz) can lie close to the UGF, making the loop characteristics more sensitive to variation in the passives. Auto-tuning of the control loop is used in low-frequency (∼1 MHz) off-chip VRMs to tolerate variations [17], [18]. However, the prior approaches use complex algorithms to characterize and tune the frequency response of VRMs, which require significant hardware resources and are inefficient for realization in high-frequency (>100 MHz) IVRs. Hence, there is a need for auto-tuning algorithms that can be easily integrated with digitally controlled high-frequency IVRs to tolerate variations.
This paper presents a high-frequency inductive IVR with digital PWM control that enhances transient performance (bandwidth) while tolerating variations in the passives. An all-digital design approach is adopted to realize the different blocks of the proposed architecture (Fig. 1). The all-digital design facilitates the integration of the IVR with the digital cores (referred to as the digital load in Fig. 1) in advanced process nodes. The presented IVR has the following key attributes.
1) A reduced precision of coefficients is introduced to allow the digital compensator to use multi-sampling for bandwidth improvement without any timing failure.
2) A fast and compact (low area) auto-tuning architecture is presented to enhance tolerance to parametric variations in the filter passives without complex computation.
3) An all-digital discontinuous conduction mode (DCM) controller is proposed to sense a small negative inductor current and improve low-load efficiency of the IVR.
4) A resistive transient assist (RTA) scheme is proposed to improve the IVR's response to large load and power-state transients.
The architecture of the proposed design is shown in Fig. 1. The all-digital controller with the on-chip auto-tuning engine is designed and synthesized in 130-nm CMOS. The power stage of the prototype IVR uses bondwire inductance and on-die MIM (metal-insulator-metal) capacitance with a 125-MHz switching frequency, and the control loop uses a 250-MHz sampling frequency. The measurement results demonstrate improved transient response with multi-sampling and auto-tuning. The measurement results also demonstrate that the RTA can provide up to a 2.5× reduction in the settling time for a load transient and a reference transient. The measured peak efficiency of the IVR is 71%.
The rest of the paper is organized as follows: Section II presents the controller design; Section III presents the auto-tuning algorithm; Section IV presents the circuit details; Section V presents the measurement results; and Section VI summarizes the paper.
II. BANDWIDTH IMPROVEMENT
Fixed-point arithmetic with reduced-precision computation is used to enable multi-sampling even at a relatively high (80 FO4 delay in 130-nm CMOS) sampling frequency. Inductive IVRs use low output capacitance, which causes the capacitor equivalent series resistance (ESR) zero to reside at a higher frequency. Hence, to compensate such a power stage, a type III compensator is used as follows:
Given a constraint on the transient response of an IVR, we select the coefficients for the digital controller considering Our hypothesis is that a multi-cycle operation with reduced precision of the coefficients (b 0 , b 1 , b 2 ) will help the IVR to meet a tighter performance constraint indicating a higher bandwidth of the loop. First, the coefficients are optimized using a simplex optimization algorithm. Next, the digital controller is designed using reduced precision to meet 250-MHz timing, and the quantized coefficients are used to estimate the transient performance.
Our optimization process considers a 0.8 to 1 V output voltage step response. The target rise time (80% change in output voltage) is 20 ns, the target settling time (time required to reduce output swing below 18% of V DD ) is 50 ns, and the target overshoot and undershoot are 5% and 10% of the output voltage, respectively. The single-sampled system failed to achieve the target (best coefficients found were b 0 :1. Fig. 2(a) ]. Fig. 2(b) shows the architecture of the multi-sampled compensator. We quantized the coefficients b 1 and b 2 as multiples of 1/16 and 1/32, respectively (each represented using a 6-bit signed number), whereas b 0 is approximated as 1. The saturated compensated error (D P ) represents the pulsewidth of the PMOS, whereas N WIDTH , generated from the DCM engine, represents the pulsewidth of the NMOS. A 300-ps skew is used in the clock of the final stage to meet timing.
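The coefficient quantization described above can be illustrated with a minimal Python sketch. The step sizes (1/16 and 1/32) and the 6-bit signed width come from the text; the round-to-nearest rule, the two's-complement code range, the function name `quantize_coeff`, and the example coefficient values are assumptions made only for illustration.

```python
def quantize_coeff(value, step, bits=6):
    """Round `value` to the nearest multiple of `step` and saturate the
    integer code to a signed `bits`-bit range (assumed two's complement)."""
    lo = -(1 << (bits - 1))        # -32 for 6 bits
    hi = (1 << (bits - 1)) - 1     # +31 for 6 bits
    code = max(lo, min(hi, round(value / step)))
    return code, code * step


# Hypothetical optimizer outputs, not values from the paper.
b1_code, b1_q = quantize_coeff(-1.80, 1 / 16)  # b1 as a multiple of 1/16
b2_code, b2_q = quantize_coeff(0.83, 1 / 32)   # b2 as a multiple of 1/32
```

Under these assumptions a 6-bit code spans roughly -2 to +1.94 for b 1 and -1 to +0.97 for b 2; values outside that span saturate rather than wrap.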
III. ALL-DIGITAL AUTO-TUNING
The time-domain (stable steady-state and fast transient response) behavior of an IVR's output determines a system's (digital core + IVR) performance [16]. We develop a stability-figure-of-merit (SFOM) metric to quantify the time-domain behavior of the IVR's output, computed by performing a set of simple arithmetic operations on the output error, and the locations of the compensator's poles and zeros (dictated by the compensator's coefficients) are used as the tuning knobs [Fig. 3(a)]. The algorithm finds the minimum SFOM over a range of coefficients applied to the system while observing the output error (digitized difference between the output and the reference). The simplicity of the resulting tuning algorithm allows a light and fast tuning engine operating at F SW , and removes the requirement for storing any error samples [Fig. 3(b)]. The use of saturated adders approximates the SFOM for unstable/slow responses, and computes the SFOM accurately for near-optimal responses.
[Figure caption: Behavior of the individual components of SFOM for different response types.]
The SFOM metric uses the following quantities: 1) absolute error (AE), calculated as the accumulation of absolute values of the error signal; 2) signed error (SE), calculated as the accumulation of signed error values; and 3) convergence time (CT). The CT quantifies the recovery time of the output from a droop caused by a load step. The CT calculation starts when a load transient is induced at the midpoint of the evaluation phase. Recovery is declared when the output is observed to remain within a user-provided threshold (CT TH ) for ten consecutive cycles. Fig. 4 shows the output response of a baseline IVR for three different sets of coefficients, the corresponding digitized error, and the different components of the SFOM. We observe that the AE helps to reject responses that are unstable for a steady load current. For the given examples, the AE is significantly higher for response 3 than for responses 1 and 2. The SE helps to reject responses that show damped oscillations while recovering from a transient. If the output oscillation does not dampen due to a low phase margin (ϕ M ), a steady-state limit cycle oscillation (LCO), or complete instability, the SE remains low as the positive and negative errors cancel each other. For the given example, the SE is higher for response 2 than for responses 1 and 3. The CT selects responses with the shortest recovery time, which, for the given example, is response 1. For the example response set, the tuning engine will select response 1.
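The three quantities above can be sketched in Python as running accumulators that never store the error samples. This is only an illustrative model: the accumulator limit `acc_max`, the function name `sfom_components`, the exact point from which CT is counted, and the example error sequences are assumptions, and the way the paper combines AE, SE, and CT into a single SFOM value is not reproduced here.

```python
def sfom_components(error, step_index, ct_th, acc_max=4095, hold=10):
    """Return (AE, SE, CT) for a digitized error sequence.

    error      : digitized error, one integer per cycle
    step_index : cycle at which the load transient is applied
    ct_th      : convergence threshold (CT_TH in the text)
    acc_max    : saturation limit of the accumulators (assumed value)
    hold       : consecutive in-band cycles that define recovery
    """
    ae = 0
    se = 0
    for e in error:
        ae = min(acc_max, ae + abs(e))            # saturating absolute error
        se = max(-acc_max, min(acc_max, se + e))  # saturating signed error

    ct = len(error) - step_index                  # default: never recovered
    run = 0
    for i in range(step_index, len(error)):
        run = run + 1 if abs(error[i]) <= ct_th else 0
        if run == hold:
            # Cycles between the load step and the start of the in-band run.
            ct = i - hold + 1 - step_index
            break
    return ae, se, ct


# Hypothetical error traces: a clean recovery and an undamped oscillation.
recovering = [0] * 20 + [6, 5, 4, 3, 2, 1] + [0] * 14
oscillating = [0] * 20 + [4, -4] * 10
clean = sfom_components(recovering, step_index=20, ct_th=1)
ringing = sfom_components(oscillating, step_index=20, ct_th=1)
```

In this toy case the oscillating trace has a near-zero SE because the positive and negative errors cancel, while its AE and CT are large, which matches the qualitative roles of the three quantities described above.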
The SFOM metric for each set of coefficients, selected from a predefined set, is evaluated for a fixed number of sampling clocks. The feedback loop is then opened and the power stage is driven with a fixed duty cycle to ensure the same initial condition for the next set of coefficients. The steady-state responses are observed using two dc loads. A synthetic on-chip load generator is used to generate a fast transition between the two dc current levels.
The transition from open loop to closed loop is used to emulate the reference transition.
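The search procedure described in this section can be sketched as a running-minimum loop, which is why no error samples or per-pair results need to be stored. In the Python sketch below, `tune`, `evaluate_sfom`, and the toy cost function are hypothetical names and stand in for the on-chip evaluation phase; the grid of integer codes is likewise only an example.

```python
def tune(coefficient_pairs, evaluate_sfom):
    """Return the (b1, b2) pair with the smallest SFOM and that SFOM value.

    evaluate_sfom(pair) is assumed to run one evaluation phase in closed
    loop and then reset the regulator by opening the loop with a fixed
    duty cycle, so every pair starts from the same initial condition.
    """
    best_pair, best_sfom = None, None
    for pair in coefficient_pairs:
        sfom = evaluate_sfom(pair)
        if best_sfom is None or sfom < best_sfom:
            best_pair, best_sfom = pair, sfom   # keep only the running minimum
    return best_pair, best_sfom


# Toy stand-in for the hardware evaluation: the cost is smallest at (-29, 27).
def toy_sfom(pair):
    b1_code, b2_code = pair
    return abs(b1_code + 29) + abs(b2_code - 27)


grid = [(b1, b2) for b1 in range(-31, -26) for b2 in range(25, 30)]
best_pair, best_sfom = tune(grid, toy_sfom)
```

The strict `<` comparison keeps the first pair found when two pairs tie; how ties are resolved in the actual engine is not stated in the text.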
IV. SYSTEM IMPLEMENTATION
The detailed architecture of the proposed IVR is shown in Fig. 5. We used two bondwires in the package as the power-stage inductor and an on-chip MIM capacitance as the output filter. Based on the package datasheet, the total bond-wire inductance is estimated to be 11.6 nH [19]. A delay-line-based ADC is used for digitizing the output. The digital compensator, the auto-tuning engine, and a serial interface for programming are generated with digital synthesis tools. The compensator output is fed to a delay-locked loop (DLL)-based DPWM engine. The regulator can operate in an open-loop condition where the DPWM is driven by a fixed input word, generating gate signals with a constant duty cycle. An all-digital DCM engine and an RTA circuit are added to the IVR. A voltage-controlled oscillator generates the multi-sampling clock that is distributed to the ADC and the controller. The DPWM clock (F SW ) is derived from the compensator clock (F SAMP ) to ensure that the duty cycle commands from the controller (D N and D P ) change synchronously with F SW .
A. Delay-Line-Based ADC
The small-signal gain through an ADC in a digital controller can be expressed as (2) where T S is the ADC sampling clock, v LSB is the voltage change corresponding to 1 LSB change in the ADC output, and t D,ADC is the ADC conversion delay. The ADC requires a high sampling speed to reduce the feedback delay (t D,ADC ) and a higher resolution for more accurate binning of the output voltage. Multi-sampling imposes a more stringent constraint on the sampling speed, which can conflict with the resolution. Fig. 6 shows the delay-line-based ADC architecture [20] used in the proposed IVR. The entire design, except for the sense branch, is synthesizable. The current mirrored from the sense branch, consisting of a source-degenerated PMOS (M1), controls the delay of the current-starved inverter in each stage. A conversion cycle begins by initiating a pulse with a high duty cycle (IN 0 ) at the input of the chain and ends when IN 0 goes to logic zero. During the conversion time, depending on the delay of the cells, the input pulse crosses only part of the delay chain before IN 0 goes low. Each delay cell also contains an RS latch that samples and stores the intermediate node (X) once it goes low. The latches are reset using the signal RL in the middle of the conversion cycle. The latch outputs are fed to a thermometer-to-binary (T2B) encoder to obtain a binary output. The intermediate latches ensure that the computation delay through the T2B encoder does not affect the operation of the main delay chain. The T2B output is sensed by the compensator clock (COMP CLK ) right before RL goes low. Moreover, the delay between the positive edges of IN 0 and COMP CLK is less than the delay through the T2B encoder, ensuring no hold violations.
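The thermometer-to-binary step mentioned above amounts to counting how many delay cells the input pulse has crossed. The Python sketch below is a behavioral illustration only: the function name, the convention that a stored `1` means "cell has flipped", and the example latch pattern are assumptions and do not describe the synthesized encoder.

```python
def thermometer_to_binary(latches):
    """Count the leading run of flipped cells in the latch outputs.

    latches : list of 0/1 values ordered from the first delay cell to the
              last; 1 is assumed to mean the pulse has crossed that cell.
    """
    count = 0
    for flipped in latches:
        if not flipped:
            break          # the pulse did not reach this cell
        count += 1
    return count


# The pulse crossed the first five of eight cells in this hypothetical cycle.
code = thermometer_to_binary([1, 1, 1, 1, 1, 0, 0, 0])
```

Counting only the leading run, rather than all ones, makes the sketch insensitive to a stray `1` later in the chain; whether the actual encoder applies such bubble suppression is not stated in the text.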
B. Limit Cycle Oscillation
An LCO is a well-known problem in digitally controlled buck regulators and is caused by the finite granularity of the DPWM engine. An LCO is traditionally avoided by satisfying three conditions [21]: 1) a lower voltage resolution of the ADC than that of the DPWM engine; 2) setting the integral gain of the compensator to less than unity; and 3) ensuring loop stability with the highest small-signal gain across the ADC. In a single-sampled system, the ADC sampling clock always samples the output ripple at the same location, and therefore the ripple magnitude does not play any role in inducing an LCO. In a multi-sampled controller, the output voltage is sampled multiple times in one switching cycle, typically using a sample-and-hold (S&H) circuit. If the peak and trough of the output ripple are sampled and the peak-to-peak ripple magnitude maps to different ADC bins [Fig. 7(a) inset], an LCO can occur even if the other conditions are satisfied. This adds a fourth criterion for avoiding an LCO in multi-sampled controllers. The proposed ADC design does not use a dedicated S&H circuit; instead, it uses the latches within the delay cells as intermediate storage elements (Fig. 6), which addresses the fourth criterion. Consider that the IVR's output voltage (i.e., the ADC input) changes in the middle of a conversion cycle. Consequently, the delay of each of the delay elements will change [Fig. 7(a)], but the states of the delay elements that have already flipped are stored in the corresponding latches. This effect causes the effective sampling frequency of the output to be 1/delay of the unit cells (i.e., a higher effective bandwidth of the ADC gain). The total number of delay cells that changed state, i.e., the ADC output, depends on the average value of the sensed voltage during the ADC conversion cycle, instead of the value of the sensed voltage sampled at the beginning of the conversion cycle, as would be the case if a dedicated S&H were used.
This achieves the same effect as an antialiasing low-pass filter or a repetitive ripple estimation [12] without increasing the feedback loop delay. Fig. 7(a) also shows the delay of the consecutive cells during a voltage droop at the IVR output for the proposed ADC design for four consecutive ADC conversion cycles. We observe that as the output droops during cycles 1-3, the delay between consecutive cells increases during the conversion cycle. For cycle 4, the sensed voltage stays relatively constant, and therefore the delay does not change significantly between consecutive delay cells. The simulations also show that introducing an S&H at the input can cause an LCO at the output, but the proposed design (no S&H) does not show any LCO [Fig. 7(b)].
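The averaging effect described above can be illustrated with a small behavioral model in Python. The model assumes that the pulse propagation speed (cells per nanosecond) is proportional to the sensed voltage, which is a simplification and not the measured delay characteristic; the gain, ripple amplitude, and window length are illustrative numbers, and `averaged_code`, `sample_and_hold_code`, and `ripple` are hypothetical names.

```python
import math


def averaged_code(v_of_t, t_start, window, gain, steps=64):
    """Cells crossed when the cell speed tracks v(t) throughout the window."""
    dt = window / steps
    travelled = sum(gain * v_of_t(t_start + (k + 0.5) * dt) * dt
                    for k in range(steps))
    return math.floor(travelled)


def sample_and_hold_code(v_of_t, t_start, window, gain):
    """Cells crossed when v is frozen at the start of the window (S&H)."""
    return math.floor(gain * v_of_t(t_start) * window)


def ripple(t):
    """0.8-V output with an 80-mV peak-to-peak ripple and an 8-ns period."""
    return 0.8 + 0.04 * math.sin(2 * math.pi * t / 8.0)


GAIN = 3.90625   # cells per (V * ns), assumed
WINDOW = 4.0     # conversion window in ns, half of the ripple period
starts = [0.5 * k for k in range(16)]   # conversion start times over one period
avg_codes = {averaged_code(ripple, t, WINDOW, GAIN) for t in starts}
sh_codes = {sample_and_hold_code(ripple, t, WINDOW, GAIN) for t in starts}
```

With these assumed numbers the sample-and-hold model produces three different codes depending on where in the ripple the conversion starts, while the averaging model stays in a single bin, which is the qualitative behavior the text attributes to the latch-based delay line.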
The first and the third criteria for avoiding an LCO are satisfied during the design phase. An LCO can still occur from the selection of the compensator coefficients, or due to passive variations that increase the peak-to-peak output ripple so that the sampled outputs in one switching period map to different ADC bins. The proposed auto-tuning engine can correct for the second criterion; however, avoiding an LCO under increased ripple requires adjusting the scaling factor before the ADC.
V. ALL-DIGITAL DCM ENGINE AND POWER STAGE
An all-digital DCM engine is used to improve the light-load power efficiency of the IVR. The existing digital DCM controllers for high-frequency IVRs sense the V SW node after the NFET turns off and reduce the width of the NFET pulse until the sensed value becomes logic '0' [22]. However, a small negative I L results in a long rise time of the V SW node and may not be detected by digital sensing [Fig. 8(a)]. To address this challenge, we use the falling edge of the NFET to create multiple (delayed) sampling clocks which are combined using an OR gate [Fig. 8(b)]. Multiple sampling of the V SW node facilitates detection under parametric variation even at a small negative I L . Another layer of flip-flops, clocked by the inverted PFET gate signal, samples the outputs of the first layer of flip-flops clocked by the sampling clocks. This is necessary to prevent a false detection of the V SW node after the PFET has turned on for the next cycle. Sensing using only the last sample might not be accurate if the time interval when both FETs are turned off is not long enough to complete all four samplings. The DCM engine senses the OR'ed result and outputs a digital word (N WIDTH ), which keeps decreasing until the sensed value is logic '0'. Fig. 8(c) ] have been optimized to maximize power efficiency at 70-mA load current as well as the ability to pass a thin pulse (<1 ns) for skewed duty cycles. To improve the light-load efficiency, the power stage is split into two segments. In DCM mode, if D N falls below a threshold value ( A WIDTH ) for ten consecutive cycles, one of the power FET segments turns off, which reduces the driving loss [Fig. 8(b)]. Fig. 9(a) shows the simulated inductor current and the delayed samples of the V SW node for a CCM-to-DCM transition. As the NFET pulsewidth reduces, the later samples can detect the negative I L (although the initial samples fail) and help to reduce N WIDTH until detection fails for all samples. Fig.
9(b) shows that the N WIDTH values for different sampling options continuously oscillate. When the sensing logic detects a logic '0', N WIDTH starts increasing until a logic '1' is detected again. This causes N WIDTH to continuously toggle within a set of values. As more samples are used in the sensing process, the average N WIDTH value reduces and eventually saturates to a minimum value, improving power efficiency [Fig. 9(c)].
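The toggling of N WIDTH described above can be reproduced with a simple up/down model in Python. The sketch assumes a one-LSB step per switching cycle and represents the sensing capability by a single hypothetical parameter, `detect_threshold`, the smallest NFET pulsewidth at which negative inductor current is still detected; neither assumption comes from the paper.

```python
def simulate_n_width(start, detect_threshold, cycles):
    """Return the N_WIDTH trace of an up/down DCM loop.

    A negative inductor current is assumed to be detected (logic '1')
    whenever n_width exceeds detect_threshold; the width then shrinks by
    one LSB, otherwise it grows by one LSB.
    """
    n_width = start
    trace = []
    for _ in range(cycles):
        negative_current_seen = n_width > detect_threshold
        n_width += -1 if negative_current_seen else 1
        trace.append(n_width)
    return trace


# Fewer delayed samples -> poorer detection -> higher threshold (assumed).
few_samples = simulate_n_width(start=20, detect_threshold=12, cycles=20)
# More delayed samples -> smaller currents detected -> lower threshold.
more_samples = simulate_n_width(start=20, detect_threshold=9, cycles=20)
```

In this model a better detector lets N WIDTH settle into a toggle around a lower value, mirroring the reduction of the average N WIDTH with more samples reported in the text.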
A. Resistive Transient Assist
We present a resistive transient assist (RTA) circuit to reduce the response time for load or reference transients [Fig. 10(a)]. The proposed scheme uses transistors M T1 and M T2 to assist the output directly from the input voltage, bypassing the inductor and the control loop. The transistors are driven by digital comparators that compare the digitized error, already available from the compensator block, with two threshold values. If the error is positive, i.e., the output is less than the reference, and exceeds the threshold value of the corresponding comparator, the switch turns on and assists the output by supplying current directly from the input.
The current assist from the input to the output scales with the droop magnitude. We achieve this by using two assist devices of equal width, where both devices turn on only when the droop exceeds the larger threshold. No hysteresis was added to the comparators, to avoid overshoot of the output voltage beyond the maximum tolerable value. The threshold values are chosen to ensure that the maximum positive error during an LCO does not trigger the RTA. Fig. 10(b) shows that enabling the RTA (W MT1 = W MT2 = 100 μm) helps reduce the voltage droop due to a load step (20 to 100 mA) by 50 mV. Increasing the widths of the assist devices reduces the voltage droop, but the settling time increases. A stronger assist reduces the digitized error at the output, which reduces the control effort of the compensator, resulting in a lower output slew when the RTA turns off.
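The two-threshold assist logic described above can be written as a pair of hysteresis-free comparators. In the Python sketch below, the function name, the assignment of the lower threshold to M T1, and the numeric threshold values are assumptions for illustration; only the structure (two comparators on the digitized error, no hysteresis) is taken from the text.

```python
def rta_enables(error, th_low, th_high):
    """Return (m_t1_on, m_t2_on) for a digitized error sample.

    error is positive when the output is below the reference. Each switch
    turns on only while the error exceeds its own threshold; there is no
    hysteresis, so a switch turns off as soon as the error drops back.
    """
    m_t1_on = error > th_low    # first assist device (assumed lower threshold)
    m_t2_on = error > th_high   # second device joins only for deeper droops
    return m_t1_on, m_t2_on


# Hypothetical thresholds in ADC codes, chosen above the LCO error band.
small_droop = rta_enables(3, th_low=2, th_high=4)
large_droop = rta_enables(6, th_low=2, th_high=4)
```

Choosing both thresholds above the largest positive error seen during a limit cycle keeps the assist off in steady state, as the text requires.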
VI. MEASUREMENT RESULTS
The integrated buck regulator was fabricated in a 130-nm process and packaged using a wirebond ceramic leadless chip carrier (CLCC) (Fig. 11, Table I). The power stage operates at 125 MHz and can convert 1.2 V to a 0.45-1.05 V output. The minimum output is limited by the lower range of the ADC input. Scaling factors are appropriately adjusted to ensure that the scaled outputs are within the ADC range. The output characteristics and the control loop of the IVR are characterized by operating the power stage in the open-loop condition, varying the DPWM input in steps of 1 with zero load current (Fig. 12). The DPWM provides a linear input code (6-bit word) to output voltage profile. The average peak-to-peak ripple is measured to be 84 mV. The corresponding ADC output shows three key observations. First, the ADC goes through all possible states for increasing DPWM code, and multiple DPWM codes map to the same ADC bin for most of the ADC states. The result indicates that the ADC resolution is less than the DPWM resolution, which satisfies the first criterion for reducing LCO [21]. Second, the ADC shows non-linearity near a low output voltage. This is due to the non-linearity of the delay of the ADC delay elements with respect to the control voltage. The loop stability is not affected when the ADC gain is high (around 800-mV output), which satisfies the third criterion for reducing LCO [21]. Finally, the peak-to-peak ripple for a fixed DPWM code spans three different ADC bins; however, the ADC output remains fixed across multiple measurements, due to the use of the delay-line-based ADC and the absence of a dedicated S&H, as explained in Section IV-B.
A. Auto-Tuning of Coefficients
The power-stage inductance is formed by two bondwires in the 52-pin CLCC package, shorted externally through a PCB trace. Unlike the bond-wire formation in [10], the package and an external low-resistance PCB connection are included in the inductance path, similar to [8]. The average bondwire length is 5.5 mm, providing an average of 5.8 nH per bondwire, estimated based on the package datasheet [19]. The inductance offered by the package is expected to be minimal due to the absence of leads. The bondwires near the corner of the die are chosen to extract the maximum inductance and to keep the maximum distance between the inductance-forming bondwires to avoid the negative mutual inductance effect [8]. One of the adjacent bondwires for each inductance-forming bondwire is kept floating, also to reduce mutual inductance (Fig. 11). However, the other adjacent bondwires, which themselves are subject to variations, play a role in mutual inductance. The length of the bondwires can also vary during the bonding process. Therefore, as identified in [10], there can be appreciable variability in the effective inductance value. Huang et al. [10] address the impact of variability on power efficiency; however, the effect of variability on the transient performance of the IVR is not addressed. In this paper, we use auto-tuning to address the effect of inductance variation on the transient performance. Fig. 13 shows the behavior of the IVR output when the auto-tuning process is triggered using an external enable signal. The designed controller first disengages the feedback loop, followed by the tuning phase, where the optimum coefficients are found, and the loop is subsequently closed with the new coefficients. The measured waveforms show limit cycling at the output before tuning, which reduces after the coefficients are modified through the tuning process. A total of 25 different (b 1 , b 2 ) pairs are covered during the tuning. Fig.
14(a) shows the zoomed output voltage (six coefficient pairs and responses) during the tuning. The responses measured from the test-chip are more damped than that for the modeled system in Simulink [ Fig. 14(b) ] due to additional resistance in the power stage. Fig. 14(c) shows the measured step response of the converter for a set of designed coefficients and the coefficients found after auto-tuning.
We emulate a variation in the output inductance by adding an external PCB inductance in series with the bondwires. An extra 6-nH inductance emulates an approximately 50% variation in the expected L. During the auto-tuning process, the system becomes unstable for the first few coefficient pairs, caused by the shift of the LC poles to a lower frequency. The coefficients found for the previous configuration [Fig. 14(c)] were used to test the reference step response and load step response of the new configuration (config 2) with extra L. The updated coefficients for config 2 yield a better reference step response and a comparable load step response compared with the coefficients for config 1 (Fig. 15). This clearly shows that the tuning process needs to be performed separately for every system.
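The size of the pole shift mentioned above can be estimated with a short worked calculation. It assumes an ideal second-order LC filter with unchanged output capacitance C and takes L = 11.6 nH from the bond-wire estimate in Section IV; parasitic resistance and the inductance of the PCB connection itself are ignored.

```latex
\frac{\Delta L}{L} = \frac{6\,\text{nH}}{11.6\,\text{nH}} \approx 0.52,
\qquad
f_0 = \frac{1}{2\pi\sqrt{LC}}
\;\Rightarrow\;
\frac{f_0'}{f_0} = \sqrt{\frac{L}{L+\Delta L}}
= \sqrt{\frac{11.6}{17.6}} \approx 0.81
```

Under these assumptions the LC double pole moves down by roughly 19%, which reduces the phase available near the UGF for coefficients that were tuned for the original inductance.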
B. Transient Performance
The internal synthetic load generator is used to create fast load transients (75 mA/100 ps). The reference transient is generated by changing the digital reference word. The measurement results show that after enabling the RTA, the settling time for a 5 to 65-mA load transient improves by 2. Fig. 2(b) ]. When the RTA is enabled [Fig. 16(d)], the droop reduces; however, the settling time increases. This is caused by a slower increase of D P and N WIDTH , as the assist current is added directly at the V OUT node, reducing the control effort. This causes the regulator to take 7-8 cycles to come out of the DCM mode.
C. Power Efficiency
Fig. 17(a) shows that the open-loop load-line resistance increases from a simulated value of 1.3623 Ω to a measured value of 2.235 Ω. The change can be attributed to the resistance of the power distribution network, shifts in the bond-wire resistances, and the resistances of the package leads and the PCB connection. Fig. 17(b) shows the division of conduction loss, driving loss, and controller loss, and the corresponding design efficiency for a 0.8-V output across different load currents. The difference in load-line resistances between design and measurement is used to estimate the excess conduction loss and predict the efficiency [Fig. 17(b)], as in [1]. The measured efficiency [Fig. 18(a)] matches closely with the estimated value [Fig. 17(b)] and shows a peak of 71% at 0.8-V output and 50 mA. The DCM mode with the adaptive power FET improves light-load efficiency by up to 12% (600 mV V OUT at 10 mA). The adaptive driver width does not kick in for the 0.8-V output, as the threshold for the driver width (AW TH ) is set to 32 (6'b100000). For the 0.6-V output, the adaptive driver width kicks in and increases the improvement in light-load efficiency.
auto-tuning at high operating frequency, which has not been reported in the prior works. The measured efficiency of the proposed design is lower than that of the previous works. The model-to-hardware correlation analysis presented in Fig. 17 shows that the additional resistance in the power stage is the key reason for the efficiency change from simulation to measurement.
VII. CONCLUSION
An inductive IVR with a high-bandwidth multi-sampled compensator and a fast and lightweight auto-tuning engine is presented in 130-nm CMOS. The proposed all-digital architecture ensures seamless integration in a digital process node. Multi-sampling with reduced coefficient precision, a delay-line-based ADC, and an RTA improve the supply quality of the IVR. A lightweight, computationally simple auto-tuning engine ensures stability at dc load and improved response for large transients against parametric variation. A peak efficiency of 71% at 50-mA load current is achieved for 1.2 to 0.8 V conversion. More than 2× improvement in load and reference step response settling time is achieved by enabling the RTA. A voltage ramp rate of 2.9 V/μs is measured for a reference step, one of the highest for fully integrated inductive regulators.
