I. INTRODUCTION
A dominant limitation on computational performance in modern microprocessors and systems-on-chip is power consumption. Battery life, energy costs, and maximum operating temperature all impose a power envelope on digital ICs that commonly necessitates throttling computational performance. Consequently, performance-per-watt has become an increasingly important metric. Dynamic voltage and frequency scaling (DVFS) is a technique that has enabled improved performance-per-watt by reducing supply voltages during periods of low computational demand [1], but implementations stand to improve dramatically by reducing the time scales over which the supply voltage is positioned, allowing real-time optimization of power consumption in the presence of workload variability. For the case of chip multiprocessors and heterogeneous systems-on-chip (SoCs), it is natural to divide computational logic into individual voltage-frequency domains, allowing per-core or per-functional-block DVFS [2], [3]. Generally, a DVFS implementation with faster voltage transition times and smaller voltage-frequency domains is more energy-efficient. However, current methods for power supply regulation with board-level voltage regulator modules (VRMs) require tens of microseconds to transition voltages and are too bulky to deliver many independent power supplies in a cost-effective manner [4].

N. Sturcken, S. Warren, and K. L. Shepard are with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA (e-mail: nsturcken@ee.columbia.edu, sbw986@ee.columbia.edu, shepard@ee.columbia.edu).
M. Petracca, P. Mantovani, and L. P. Carloni are with the Department of Computer Science, Columbia University, New York, NY 10027 USA (e-mail: petracca@cs.columbia.edu, paolo@cs.columbia.edu, luca@cs.columbia.edu).
A. V. Peterchev is with the Departments of Psychiatry and Behavioral Sciences, Biomedical Engineering, and Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: angel.peterchev@duke.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2012.2196316

External VRMs present two other efficiency challenges. First, $I^2R$ losses in the power distribution network (PDN) are significant when highly scaled voltages are delivered from the board. In a typical PDN (Fig. 1) [5], a resistance from the VRM to the CPU's package of 0.7 mΩ dissipates 7 W of power for a 100 W load at 1 V. Second, VRMs require power supply margins that degrade energy efficiency. The high-frequency impedance of the PDN limits the VRM's ability to suppress voltage overshoot in the event of load current transients; consequently, modern VRM specifications stipulate that the supply voltage follow a load-line commonly given as $V_{CC} = V_{ZL} - R_{LL} I_O$, where $V_{CC}$ is the processor supply voltage, $V_{ZL}$ is the desired $V_{CC}$ at zero load, $R_{LL}$ is the desired load-line resistance, and $I_O$ is the load current. Implementation of load-line control reduces the VRM size and cost required to maintain the output voltage within the allowed tolerance during load transients. However, when the system is not operating at maximum power consumption, the load-line is a source of inefficiency, as $V_{CC}$ will be greater than the minimum supply voltage, $V_{MIN} = V_{ZL} - R_{LL} I_{MAX}$, where $I_{MAX}$ is the maximum load current. The wasted power will be $(V_{CC} - V_{MIN}) I_O = R_{LL} (I_{MAX} - I_O) I_O$. For a typical value for $R_{LL}$ of 1 mΩ [5], a CPU with $I_{MAX}$ of 100 A operating at 50 A and 1 V will waste 2.5 W in the load-line implementation. If the PDN impedance were smaller, the value of $R_{LL}$ and hence the load-line inefficiency could be reduced.
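The two numerical examples in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes the load-line has the usual form in which the supply falls linearly with load current, so that the excess voltage at a load current I_O is R_LL (I_MAX - I_O); the function and argument names are illustrative and do not come from the paper.

```python
def pdn_conduction_loss(load_power_w, supply_v, pdn_resistance_ohm):
    """I^2 * R loss in the PDN for a load drawing load_power_w at supply_v."""
    current_a = load_power_w / supply_v
    return current_a ** 2 * pdn_resistance_ohm


def load_line_waste(r_ll_ohm, i_max_a, i_load_a):
    """Power wasted because the load-line holds the supply above its minimum.

    Assumed model: excess voltage = R_LL * (I_MAX - I_O); power = excess * I_O.
    """
    return r_ll_ohm * (i_max_a - i_load_a) * i_load_a


if __name__ == "__main__":
    # 0.7 mOhm between VRM and package, 100 W load at 1 V (values from the text)
    print(pdn_conduction_loss(100.0, 1.0, 0.7e-3))
    # R_LL = 1 mOhm, I_MAX = 100 A, operating at 50 A (values from the text)
    print(load_line_waste(1e-3, 100.0, 50.0))
```

Under these assumptions the two calls reproduce the 7 W and 2.5 W figures quoted above. The load-line waste peaks at half of the maximum load current, which is exactly the operating point used in the example.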
Recent work has explored switch-mode integrated voltage regulators (IVRs) as a means to address these shortcomings in VRMs. In this case, energy is stored on or close to the integrated circuit in capacitors (switched-capacitor converters) or inductors (buck converters). Integrated switched-capacitor converters, taking advantage of high-density integrated capacitors, have shown high efficiency at reasonable current densities but have done so only at fixed conversion ratio and without addressing transient requirements [6]-[8]. Meanwhile, integrated buck converters have shown high current densities and efficiencies with a continuous range of conversion ratios but face challenges concerning the integration of high-quality inductors [9]-[16].

Fig. 1. Power distribution network for a modern high-performance microprocessor, from VRM to CPU package [5].
Until recently, integrated inductors that offered both low losses and high inductance density were unavailable. Planar spiral or other inductor topologies that can be constructed using the interconnects of a typical CMOS process are too resistive to provide efficient on-chip power conversion at reasonable current densities [16]. The efficient use of surface mount technology (SMT) air-core inductors, which can provide a current density up to A/mm² [17], has been successfully demonstrated [9]-[13]. However, the size and discrete nature of these devices hinder the scalability of any IVR incorporating discrete SMT inductors. Fortunately, advances have recently been made in the development of integrated magnetic-core power inductors that are highly scalable and capable of delivering current densities as high as 8 A/mm² [18]-[21]. These inductors have been included in IVR prototypes by on-chip integration [14] and chip stacking [15], demonstrating the eventual feasibility of highly scalable and efficient switched-inductor IVRs.
Another challenge in the development of switched-inductor IVRs is the integration of decoupling capacitance. While VRMs are able to augment voltage regulation at high frequencies by leveraging large amounts of inexpensive board-level decoupling capacitance, the integrated capacitance required in IVRs comes at much greater expense. In switched-inductor IVRs the dominant constraint on decoupling capacitance is set by the need to suppress voltage overshoot during fast load current transients. Extending the IVR controller bandwidth has the effect of reducing these decoupling capacitance requirements.
Some early switched-inductor IVRs address transient response by employing a multi-phase hysteretic controller to provide nearly instantaneous response to transients, effectively reducing the required output decoupling capacitance [9] , [10] . Unfortunately, the closed loop behavior of the multi-phase hysteretic controller is difficult to predict and the loose synchronization of phases produces an under-damped large-signal response. Also, hysteretic controllers do not operate at fixed switching frequency, and can therefore pose challenges when attempting to control EMI. Subsequent work has used more conventional, pulse-width modulation (PWM) controllers and has relied on abundant package-level decoupling capacitance to compensate for increased controller delay [11] , [12] . However, the dependence on package-level capacitance increases component and packaging cost and degrades scalability.
In contrast, the interleaved four-phase buck converter presented here, fabricated and tested in 45-nm SOI, employs an unlatched PWM modulator and nonlinear feedback to concurrently provide PWM-like synchronization of multiple phases, linear small-signal dynamics (ensuring stability and load-line regulation), and nearly instantaneous response to large-signal input-voltage and load-current transients without the need for large output decoupling. SMT inductors are employed for this initial implementation but the approaches used here can extend to integrated magnetic-core power inductors. The converter powers a realistic on-chip load composed of four parallel 64-node networks-on-chip (NoCs) along with a programmable current source capable of generating large load current steps for characterization of the controller. In Section II, we discuss the impact of controller bandwidth on the required output capacitance in IVRs, motivating our controller design. Section III describes the design and operation of the proposed control scheme, providing analysis for predicting the controller response. Section IV details the construction and operation of the integrated NoC, and Section V presents experimental results from the IC prototype.
II. CONSTRAINTS ON OUTPUT CAPACITANCE

A. Output Voltage Ripple
Candidate capacitor technologies for an IVR include low-inductance discrete ceramic capacitors, on-chip MOS capacitors, and on-chip deep-trench (DT) capacitors, each offering reduced effective series resistance (ESR) and effective series inductance (ESL) relative to the capacitors typically used with VRMs. The high-frequency impedance of low-inductance discrete (LID) capacitors such as land-grid-array or interdigitated capacitors is dominated by ESL, with self-resonant frequencies (SRF) around 30 MHz, where $f_{SRF} = 1/(2\pi\sqrt{L_{ESL} C})$ [22]. In contrast, the distributed nature of on-chip MOS and DT capacitance results in negligible ESL, with a high-frequency impedance dominated by ESR and time constants, $\tau_{ESR} = R_{ESR} C$, around 1 ps for MOS capacitors and 500 ps for DT capacitors, depending on the resistance of the on-chip PDN.
With the wide impedance variability of candidate IVR capacitor technologies, it is important to use a general model in determining the output voltage ripple and other design parameters that are dependent on the high-frequency output impedance. The total peak-to-peak inductor current ripple is

$$\Delta I_{L,pp} = \frac{V_{IN} T_{SW}}{L}\left(D - \frac{m}{N}\right)\left(m + 1 - N D\right) \qquad (1)$$

where $V_{IN}$ is the buck converter input supply voltage, $T_{SW}$ is the switching period, $N$ is the number of phases in a multi-phase converter, $D$ is the duty cycle, $m = \lfloor N D \rfloor$, and $L$ is the filter inductance of each phase [23]. The expression for output voltage ripple, including the effects of ESL, is given by (2), using a simple lumped RLC model for the output capacitor.
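The ripple cancellation that interleaving provides can be illustrated numerically. The sketch below uses the standard textbook expression for the summed inductor current ripple of an ideal N-phase interleaved buck converter, with m = floor(N D); this is assumed to match the form of (1), and the function name is hypothetical. The example values (1.5 V input, 80 MHz switching, 26 nH per phase, four phases) are taken from the test chip described later in the paper.

```python
import math


def total_ripple_pp(v_in, t_sw, l_phase, duty, n_phases):
    """Peak-to-peak ripple of the summed inductor currents, ideal N-phase buck.

    Assumed textbook form: (V_IN * T_SW / L) * (D - m/N) * (m + 1 - N*D),
    with m = floor(N * D). For N = 1 this reduces to V_IN * D * (1 - D) * T_SW / L.
    """
    m = math.floor(n_phases * duty)
    return (v_in * t_sw / l_phase) * (duty - m / n_phases) * (m + 1 - n_phases * duty)


if __name__ == "__main__":
    v_in, f_sw, l_phase = 1.5, 80e6, 26e-9
    for duty in (0.25, 0.5, 0.66):
        one = total_ripple_pp(v_in, 1.0 / f_sw, l_phase, duty, 1)
        four = total_ripple_pp(v_in, 1.0 / f_sw, l_phase, duty, 4)
        print(f"D={duty:.2f}: 1-phase {one:.3f} A, 4-phase {four:.3f} A")
```

The summed ripple vanishes whenever the duty cycle is a multiple of 1/N, and it is smaller than the single-phase ripple everywhere else, which is one reason the output capacitance in a multi-phase IVR tends to be set by transient response rather than by ripple.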
B. Load-Line Implementation
The low ESR of ceramic capacitors typically requires the output voltage to follow a dynamic load-line [23], as expressed in (3), where the output impedance is defined in (4). This remains the case for IVRs that use on-chip MOS or DT capacitance. Typically, the dynamic load-line is implemented by having the controller regulate the output impedance of the converter to the load-line resistance, $R_{LL}$, until the unity-gain frequency, $f_c$, at which point the output capacitance, $C_O$, must dominate the output impedance, constraining the output capacitance to

$$C_O \ge \frac{1}{2\pi f_c R_{LL}} \qquad (5)$$

It is desirable to achieve the highest possible $f_c$ in order to reduce the requirement on $C_O$. However, a well-accepted guideline for maximum loop-gain bandwidth that avoids instability in closed-loop operation is given by (6), where $f_{SW}$ is the switching frequency and $\alpha$ is a constant whose customary value is given in [24]. Switching losses can become appreciable at high frequencies, effectively constraining $f_{SW}$; nevertheless, it has been shown that IVRs can operate efficiently with $f_{SW}$ around 100 MHz [9]-[15]. Combining (5) and (6) produces the constraint on $C_O$ for load-line regulation with on-chip MOS or DT capacitance, given in (7).

True load-line regulation is not easily achieved with low-inductance discrete capacitors when $f_c$ exceeds the capacitor SRF, which is generally the case for IVRs. ESL will dominate at frequencies above the SRF, resulting in an appreciable first droop at the onset of a large load-current transient event [11]. However, dynamic load-line regulation is possible if series resistance is added to the low-inductance discrete capacitors. The discrete capacitance, in this case, is accompanied by an added series resistance and an additional on-chip capacitance. These values can be chosen according to (8) and (9). This option will not yield a reduction in the total capacitance; however, it may facilitate a balance between on-chip and off-chip decoupling capacitance that is cost-effective.
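As a rough illustration of how the load-line constraint trades bandwidth for capacitance, the sketch below assumes the constraint takes the familiar form in which the capacitor impedance 1/(2π f_c C) must not exceed the load-line resistance at the unity-gain frequency. The function name is hypothetical and the 20 MHz bandwidth is an arbitrary illustrative value; only the 125 mΩ load-line resistance is taken from the measured results reported later.

```python
import math


def min_load_line_capacitance(f_c_hz, r_ll_ohm):
    """Smallest C for which the capacitor impedance at f_c is at most R_LL.

    Assumed form of the constraint: C >= 1 / (2 * pi * f_c * R_LL).
    """
    return 1.0 / (2.0 * math.pi * f_c_hz * r_ll_ohm)


if __name__ == "__main__":
    r_ll = 0.125          # ohms, load-line resistance measured on the test chip
    f_c = 20e6            # Hz, hypothetical unity-gain frequency for illustration
    c_min = min_load_line_capacitance(f_c, r_ll)
    print(f"C_min = {c_min * 1e9:.1f} nF at f_c = {f_c / 1e6:.0f} MHz")
    # Doubling the loop bandwidth halves the required capacitance.
    print(min_load_line_capacitance(2 * f_c, r_ll) / c_min)
```

The inverse proportionality between bandwidth and capacitance is the reason the section treats loop bandwidth, and therefore switching frequency, as the lever for shrinking on-chip decoupling.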
C. Load Current Transient Response
While the load-line constraint on output capacitance results in the desired small-signal output impedance, the duty cycle, and hence the controller response, may saturate in the event of a large load-current step,
. In this case, the saturated response of the controller is unable to prevent the output voltage from overshooting the load-line; therefore, the output capacitor must provide additional support. The minimum capacitance (critical capacitance) that limits voltage overshoot to during worst-case load-current transients is (10) where is the load step time constant and is the delay time for the controller to saturate the duty cycle [23] . This expression is applicable to IVRs using on-chip decoupling capacitance, where typically . For the case of IVRs using low-inductance discrete capacitors with values selected in accordance with (8) and (9), the critical capacitance can still be determined from (10) if is used for .
D. Minimum Output Capacitance
The constraints on minimum capacitance as a function of switching frequency, $f_{SW}$, for an IVR that meets the specifications from Table I with conventional voltage mode feedback are shown in Fig. 2. The dominant constraint on minimum output capacitance using either LID or on-chip DT or MOS capacitance for this IVR is load-current transient response. For the case of an IVR with conventional linear feedback, the value of the controller delay time can be approximated as in (11). For the example used in Fig. 2 with on-chip DT capacitance, the controller delay dominates the numerator with a value of 154 ns, relative to 6.5 ns and 19 ps for the other two terms. This result indicates that controller delay is the primary bottleneck in reduction of the output capacitance for IVRs with conventional feedback controllers. Therefore, control techniques that extend controller bandwidth while maintaining stable operation enable a reduction in output capacitance. Load-current feedforward has been demonstrated as an effective means to extend bandwidth for VRMs [23]. However, in the integrated context, load-current estimation is especially challenging due to the distributed nature of decoupling capacitors, high variability of on-chip resistors and capacitors, and parasitic poles introduced by analog amplifiers at high frequencies. As a result, we employ a nonlinear, unlatched PWM controller that offers extended controller bandwidth during large load-current transients while maintaining stability.

Fig. 2. The minimum capacitance meeting constraints for an IVR with operating parameters defined in Table I. Voltage ripple, load-line regulation, and the saturating transient response are plotted versus converter switching frequency, $f_{SW}$, for low-inductance discrete capacitance (LID) and on-chip MOS or DT capacitance.

TABLE I PROPOSED IVR SPECIFICATIONS
III. DESIGN OF A NONLINEAR, UNLATCHED PWM CONTROLLER
A. Overview
The proposed control scheme is shown in Fig. 3 . A four-phase interleaved buck converter is composed of four identical hardware phases (HPs) along with clock generation circuitry that provides the switching frequency and phase for each of the HPs, . Within each HP, is superimposed onto a DC reference voltage, , by means of to create a triangle wave reference input to the controller, , that is centered at the desired DC output voltage (12) as shown in (12) and Fig. 4 . The feedback voltage, , is a superposition of the bridge switching node voltage, , at low frequencies and the output voltage, , at high frequencies. The comparison of and at the delay-optimized continuous comparator determines the steady state duty-cycle, , according to (13)
The DC output resistance, , of the IVR can be tuned by and
. As the load current increases, the feedback loop will cause the duty cycle to increase, compensating for the increase in voltage drop across the bridge switches. The duty cycle is buffered and drives , which subsequently causes to slightly increase, offsetting the increased voltage drop across the inductor resistance at higher current. This tuning of the DC output resistance follows the equation (14) where and are, respectively, the effective series resistance of NMOS and PMOS bridge switches for an HP and is the effective series resistance of a single inductor, such that 
B. Large-Signal Behavior
The time constant, , is designed to be slightly longer than such that in steady state, will slew behind as shown in Fig. 4 . In the event of a load current step, the resulting across couples through , and causes to cross . At this point, the comparators will switch state and the bridge will apply the appropriate voltage at . Each of the HPs responds asynchronously, such that the ensemble exerts the maximum within a fraction of the switching period. When an HP becomes unsynchronized, the difference between and is larger and the HP's sensitivity to is reduced, driving the HP back to proper synchronization. In this manner, the controller simultaneously provides near immediate asynchronous response to load transients and strong synchronization between HPs in steady state. 
C. Small Signal Dynamics
The small-signal dynamics can be determined using a combination of conventional linear circuit analysis and circuit averaging, if we assume that the frequency content of a small-signal perturbation,
, is sufficiently below for averaging to be valid. The small-signal, steady state gain, , of the comparator stage is similar to a conventional PWM modulator with the exception that both and have large signal components at in steady state (see Fig. 4) , and, hence, the effective PWM ramp signal is as shown in Fig. 5 inset.
is inversely proportional to the slope of where it intersects . Fig. 5 shows the feedback gain, the small signal change in the duty cycle, , as a function of . The discontinuity in the feedback gain occurs at , which is approximated as (18) where accounts for circuit delay through the continuous comparator, ZVS logic and bridge switches. When the gain through the comparator is linear and approximated as (19) for larger deviations, , the gain through the comparator is non-linear and increasing, which provides improved transient response. The instantaneous gain for is (20) The remainder of the loop transfer function can be determined with linear circuit analysis; the small signal model, transfer functions and output impedance are shown in Figs. 6 and 7 . Comparing the open-loop and closed-loop output impedances, we see that the controller regulates the output to a dynamic load-line. Assuming the output capacitor is implemented with on-chip MOS capacitance, the ESR zero occurs above 100 GHz, beyond the range of Fig. 7 .
D. Test Chip
The proposed control scheme achieves high feedback bandwidth using a combination of unlatched PWM modulation, nonlinear feedback gain, and high linear feedback bandwidth relative to the effective switching frequency. Controllers with such features can be sensitive to noise and/or prone to chaotic behavior, which can cause unpredictable switching, potentially degrading efficiency and output voltage regulation [23]. Extensive modeling and simulation of the proposed controller was conducted with Matlab and Spectre to verify stability and the absence of bifurcations and strange attractors in the converter operation [25]. Unfortunately, other factors such as inductor or device mismatch may upset the balance between HPs and cause multiple switching; thus, a four-phase buck converter was designed and fabricated on a test chip in a 45 nm SOI process to experimentally verify proper converter operation. The converter provides a regulated supply voltage to a digital load in the form of four 64-tile networks-on-chip (NoCs) and a programmable current source capable of generating load-current steps of 1 A with slew rates of A/100 ps. An image of the chip is shown in Fig. 8 with dimensions of 3 mm by 6 mm. The converter occupies 0.75 mm² including all input and output decoupling capacitance (0.32 mm² excluding these capacitors). It operates with a switching frequency MHz and mV. The down-converter supports a continuous range of conversion ratios from a 1.5 V supply with a load current as high as 1.25 A. The bridge switches are thick-oxide floating-body FETs whose widths have been optimized for 80 MHz switching and 300 mA per phase. A discretely programmable dead-time can be added to the NMOS turn-on transition, allowing zero voltage switching (ZVS) when the bridge switching node transitions from high to low. The continuous comparators have an adjustable hysteresis ranging from 5 mV to 30 mV to prevent chatter.
An independent 1 V supply powers the control circuitry and is isolated from the bridge power supply to prevent switching noise from disturbing the controller.
Four 26 nH, SMT-0402 air-core inductors are integrated on top of the chip by bondwire connections as shown in Fig. 9. The inductance value is chosen to limit current ripple such that the converter efficiently operates in continuous conduction mode at a switching frequency of 80 MHz and a load current of 500 mA. The total controller delay during a worst-case load transient is 685 ps according to simulation: 325 ps for the comparator input signals to cross, 160 ps for the comparators to switch, and 200 ps for the digital delay through ZVS logic and bridge buffers. With this short delay time, the output capacitance required for the specifications in Table I is only 20 nF according to (10). An IVR with the same power train and using a conventional feedback controller with latched PWM modulator would require nF. The total on the test chip is nF, including explicit MOS capacitors and non-switching gate capacitance from the digital load.

Fig. 9. Illustration of SMT inductor integration by bondwire connections.
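The delay budget and the choice of inductance can be sanity-checked with simple arithmetic. The sketch below sums the three simulated delay components and tests whether each phase keeps a positive inductor current, using the ideal single-phase ripple expression V_IN D (1 - D) / (L f_SW). It assumes that the 500 mA figure is the total load current shared equally by the four phases; that reading, and all function names, are assumptions rather than statements from the paper.

```python
# Simulated delay components from the text, in picoseconds.
DELAY_BUDGET_PS = {
    "comparator inputs cross": 325,
    "comparators switch": 160,
    "ZVS logic and bridge buffers": 200,
}


def total_delay_ps(budget=DELAY_BUDGET_PS):
    """Sum of the individual controller delay components."""
    return sum(budget.values())


def per_phase_ripple_pp(v_in, duty, l_phase, f_sw):
    """Ideal peak-to-peak inductor current ripple of one buck phase."""
    return v_in * duty * (1.0 - duty) / (l_phase * f_sw)


def stays_in_ccm(i_load_total, n_phases, v_in, duty, l_phase, f_sw):
    """True if the average phase current exceeds half the ripple (no reversal)."""
    i_phase = i_load_total / n_phases
    return i_phase > 0.5 * per_phase_ripple_pp(v_in, duty, l_phase, f_sw)


if __name__ == "__main__":
    print(total_delay_ps(), "ps total controller delay")
    # Worst-case ripple occurs at D = 0.5 for a single phase.
    print(per_phase_ripple_pp(1.5, 0.5, 26e-9, 80e6), "A peak-to-peak per phase")
    print(stays_in_ccm(0.5, 4, 1.5, 0.5, 26e-9, 80e6))
```

With a 1.5 V input, 26 nH per phase and 80 MHz switching, the worst-case per-phase ripple is roughly 0.18 A, so a 125 mA average phase current remains above half the ripple under the stated assumption.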
IV. NETWORK-ON-CHIP AS A REGULATED LOAD
Four independent 64-tile NoCs serve as a realistic digital load for the IVR; the NoC provides a highly scalable platform for exploring granular power distributions given the ease with which traffic patterns can be used to modulate load currents and transients. NoCs are becoming the basic interconnect infrastructure for complex SoCs. Since communication plays a key role in SoCs and given the very strict energy and performance requirements imposed on NoCs, recent designs have reserved a separate voltage-clock domain for the NoC alone [2] .
In future SoCs, NoCs will be required to support an increasing number of traffic classes and communication protocols. Adding virtual channels (VCs) to a NoC helps to avoid deadlock and optimize the bandwidth of the physical channels in exchange for a more complex design of the routers. Another, possibly alternative, approach is to build multiple parallel physical networks (multiplanes, MPs) with smaller channels and simpler router organizations. Yoon et al. compared the two approaches from a power-performance point of view and concluded that while VCs guarantee higher performance than MPs, MPs are more flexible and better suit applications that have a limited power budget [26]. We organized the NoC in this chip as MPs because they are easier to implement and they better represent an architecture designed with power being the primary concern. Further details of the NoC are provided in the Appendix.
V. EXPERIMENTAL RESULTS
The measured response of the test chip to a load current step from 0.6 A to 1.2 A in ps is shown in Fig. 10. The simulated behavior is determined from a time-domain Matlab model that is able to capture the nonlinear behavior of the control loop. The output voltage follows the load-line with a load-line resistance of 125 mΩ, so that if the converter were scaled to deliver 100 A, the load-line resistance would scale to 1.25 mΩ. The output voltage overshoots the load-line by only mV and closely matches simulated results with the exception of some ringing that occurs after the step. This ringing is attributed to oscillation between the output capacitance and the bondwire inductance on the ground return of the load. The estimated resonant frequency of this series LC, 75 MHz, is the same as the frequency of ringing in Fig. 10. Fig. 11 shows the input step-up response, where we see a settling time for the output voltage of ns.

In order to verify the controller switching stability and noise immunity in closed-loop operation, efficiency was measured while the converter operated in open- and closed-loop with the same operating conditions. The open-loop configuration bypasses the comparator to directly drive the bridge with a fixed duty cycle, producing an output voltage of 1 V with a load current of 1 A at a switching frequency of 80 MHz. The converter was subsequently configured to deliver the same output voltage and current at 80 MHz in a closed-loop configuration. In both open- and closed-loop configurations, the efficiency was 78% and the spectral content of the output voltage peaked at 320 MHz, which is the expected effective switching frequency for four phases at 80 MHz. The converter efficiency (Fig. 12) is hindered by the relatively high series resistance of 120 mΩ, which is dominated by bond wire resistance. The efficiency for mV is further adversely impacted by an ESD diode at the node that turns on with decreasing . Converter efficiency could be improved by removing the ESD diode and using an alternative packaging strategy that reduces this resistance, as demonstrated in [15].

The data in Figs. 10-12 were taken from a single unit. However, the efficiency of four units was measured as a check, exhibiting variation that was below the noise of the measurement, each achieving an efficiency of 83% at a current density of 1 A/mm² (2.35 A/mm² if decoupling capacitor area is not considered) and a 0.66 conversion ratio. The proposed control scheme allows for a reduction in the output capacitance relative to an IVR with a conventional control scheme. This corresponds to a 2.2× improvement in total current density for the IVR implementation described here, assuming the output capacitance is implemented with on-chip MOS capacitance.

Fig. 13 shows a breakdown of the test chip's power consumption with scaled NoC supply voltage and frequency (bandwidth). Fig. 13 also illustrates the dramatic decrease in the power consumption of the system when the power supply of the NoC is scaled; even when considering the inefficiency of the IVR, this serves as evidence for the potential power savings achieved with DVFS.
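The scaling and current-density figures quoted in this section follow from simple ratios. The sketch below assumes that scaling to 100 A means placing identical converters in parallel, each treated as a nominal 1 A unit, so that the load-line resistance divides by the number of copies, and that the 1 A/mm² figure corresponds to a 0.75 A load over the 0.75 mm² converter area given earlier. Both readings are inferences from the quoted numbers and the function names are illustrative.

```python
def scaled_load_line(r_ll_ohm, unit_current_a, target_current_a):
    """Load-line resistance of parallel identical converters (R divides by count)."""
    copies = target_current_a / unit_current_a
    return r_ll_ohm / copies


def current_density(load_current_a, area_mm2):
    """Output current per unit converter area, in A/mm^2."""
    return load_current_a / area_mm2


if __name__ == "__main__":
    print(scaled_load_line(0.125, 1.0, 100.0), "ohm at 100 A")
    print(current_density(0.75, 0.75), "A/mm^2 including decoupling capacitors")
    print(current_density(0.75, 0.32), "A/mm^2 excluding decoupling capacitors")
```

The second density evaluates to about 2.34 A/mm², which agrees with the quoted 2.35 A/mm² to within the rounding of the published areas.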
VI. CONCLUSION
We demonstrate a four-phase integrated buck converter with a novel control scheme that uses an unlatched PWM modulator and nonlinear feedback. The proposed controller provides predictable small-signal dynamics along with fast response to input and load-current steps, which facilitates a 2.2× improvement in current density. Combined with recent developments in inductive energy storage [18]-[21], such a converter could enable implementation of integrated power conversion on a large scale.
APPENDIX
The NoC has four independent planes, each organized as an eight-by-eight 2D-mesh NoC (Fig. 14). Each plane supports a different data parallelism: 128, 64, 32, and 32 bits, respectively. Each plane has an independent global clock, and all planes share the common power supply provided by the IVR. In aggregate, the entire NoC has 256 routers and a bisection bandwidth of 2 kbit/T, where T is the clock period. For instance, when all the NoCs run at a clock frequency of 500 MHz (T = 2 ns), the bisection bandwidth is 1 Tbit/s. All planes adopt traditional wormhole flow control and XY dimension-order routing, which is proven to be simple to implement and deadlock-free for 2D-mesh networks. The 2D-mesh topology is achieved using five-by-five routers (Fig. 15), where four I/O ports are attached to neighbor routers, and the fifth port is used for traffic injection/ejection. The router is a traditional input-queued router. A five-by-five crossbar connects each input to every output, and a simple per-output distributed round-robin arbitration resolves the contention when multiple input packets request to be forwarded towards the same output port.
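The aggregate figures and the routing rule in this paragraph can be sketched in a few lines. The code below counts the bisection bandwidth as the eight channels per plane that cross a cut through the middle of the mesh, taken in one direction, which is the counting that reproduces the quoted 2 kbit/T. The XY routing function resolves the X offset first and then the Y offset. Port names, the coordinate convention, and all function names are illustrative assumptions rather than details of the fabricated routers.

```python
PLANE_WIDTHS_BITS = (128, 64, 32, 32)   # data parallelism of the four planes
MESH_DIM = 8                            # each plane is an 8 x 8 mesh


def router_count(n_planes=len(PLANE_WIDTHS_BITS), k=MESH_DIM):
    """Total number of routers across all planes."""
    return n_planes * k * k


def bisection_bits_per_cycle(widths=PLANE_WIDTHS_BITS, k=MESH_DIM):
    """Bits per clock period crossing a mid-mesh cut, one direction per plane."""
    return k * sum(widths)


def xy_next_port(cur, dst):
    """Output port chosen by XY dimension-order routing at router `cur`."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"      # assumed convention: north increases y
    if dy < cy:
        return "S"
    return "LOCAL"      # fifth port: ejection at the destination


def xy_route(src, dst):
    """List of routers visited from src to dst under XY routing."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    path, cur = [src], src
    while (port := xy_next_port(cur, dst)) != "LOCAL":
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])
        path.append(cur)
    return path


if __name__ == "__main__":
    print(router_count(), "routers")
    bits = bisection_bits_per_cycle()
    print(bits, "bit/T;", bits * 500e6 / 1e12, "Tbit/s at 500 MHz")
    print(xy_route((0, 0), (2, 1)))
```

Because every packet finishes its X traversal before turning into Y, no packet ever turns from Y back into X, which is the property that removes cyclic channel dependencies in a mesh.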
We adopted Ack-Nack as link-level flow control between adjacent routers. In order to implement this protocol, we added two signals to the data bus that carries the flits. One signal validates the flit at a given clock cycle, while the other wire transports back-pressure information. Back-pressure is a way for the downstream router to signal congestion to the upstream router. Under congestion the input queue of a router tends to fill up, and when it is finally full the flit currently in flight on the link cannot be stored properly. The upstream router must then maintain the old flit on the output port in such a way that it can be correctly received by the downstream router once congestion is resolved. Under persistent congestion, since no new flits can be forwarded towards the busy output port, the input queue occupation in the upstream router might grow as well, and might require the back-pressure to be propagated backward, up to the traffic source when necessary.
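The hold-and-retry behavior described here can be illustrated with a small cycle-based model. In the sketch below the downstream router accepts a flit only if its input queue has room; otherwise it signals back-pressure and the upstream port keeps the same flit on its output for the next cycle. The model is deliberately simplified: the acknowledgment is seen in the same cycle, link latency and relay stations are ignored, and the class and function names are hypothetical.

```python
from collections import deque


class DownstreamRouter:
    """Input queue of finite depth that acks or nacks each offered flit."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = deque()

    def offer(self, flit):
        """Return True (ack) if the flit was stored, False (nack) if full."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append(flit)
        return True

    def drain(self):
        """Forward one flit onward, if any; models congestion clearing."""
        return self.queue.popleft() if self.queue else None


class UpstreamPort:
    """Output port that holds the current flit until it is acknowledged."""

    def __init__(self, flits):
        self.pending = deque(flits)

    def cycle(self, downstream):
        if not self.pending:
            return
        if downstream.offer(self.pending[0]):   # valid asserted with the flit
            self.pending.popleft()              # ack: advance to the next flit
        # nack: leave the same flit on the output and retry next cycle


def simulate(depth, n_flits, stall_cycles, total_cycles):
    """Downstream is blocked for stall_cycles, then drains one flit per cycle."""
    down, up, received = DownstreamRouter(depth), UpstreamPort(range(n_flits)), []
    for cycle in range(total_cycles):
        if cycle >= stall_cycles:
            flit = down.drain()
            if flit is not None:
                received.append(flit)
        up.cycle(down)
    return received


if __name__ == "__main__":
    print(simulate(depth=2, n_flits=5, stall_cycles=4, total_cycles=12))
```

While the downstream queue is full, the upstream port makes no progress, so its own input queue would fill in turn; chaining several such stages reproduces the backward propagation of back-pressure toward the traffic source.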
We used a constant depth of flits for all the input queues, in order to fit the desired topology in the form factor of the chip. Every router has a synchronous output, i.e., . As shown in Fig. 15, we adopted a bypassable input queue so that the zero-load latency of traversing one router is one clock cycle. Under no congestion, the incoming flit bypasses the input queue and is routed and stored directly in the appropriate output register. Only under congestion is the input queue used to store the incoming flits until congestion is resolved. We also installed relay stations (RSs) on the links between adjacent routers. RSs are synchronous, flow-control-aware repeaters, which on one side increase the modularity of the design and facilitate timing closure during layout, while on the other side act as distributed buffers, expanding the capacity of the router input queues and thus alleviating congestion [27]. Our layout is very regular, but in less regular NoCs, RSs promise to fix timing exceptions in a very flexible way, without requiring changes to the queue sizing within the routers or to the network topology. The traffic injected at each router is generated according to externally programmable parameters, supporting four synthetic random traffic patterns: uniform, tornado, transpose, and hot-spot. We obtained all the results presented in this paper by averaging across different traffic patterns and traffic injection rates.
