I. INTRODUCTION

Dynamic voltage and frequency scaling (DVFS) is a popular technique to improve energy efficiency in digital systems [1]. As performance requirements change over time, the voltage can be changed appropriately to maximize energy efficiency while meeting performance constraints.
DVFS is commonly implemented using off-chip voltage regulators, but off-chip regulation has a number of disadvantages. Increased parasitics and an on-chip to off-chip feedback loop cause slow mode transitions. Also, each voltage domain must still be supplied through the package separately, which limits the total number of voltages available and increases packaging cost and complexity. Lastly, off-chip regulators and supporting components, such as inductors, increase total system size and cost.
Integrating regulators on-chip, and tightly connecting power supply control with the microprocessor, offers significant advantages by reducing system cost and supporting much finer grain DVFS in terms of both operating mode period and voltage domain area. Transition times between modes can be reduced, providing additional energy savings through more frequent DVFS to better track changing performance requirements [2] . Supporting many smaller voltage domains provides better isolation between high and low performance regions, and supplying hundreds of independent voltage domains is desirable to improve the energy efficiency of many-core systems [3] [4] . Instead of requiring separate power grids for each domain or needing to support full power delivery requirements through each of a few shared voltage rails, on-chip switched-capacitor DC-DC (SC DC-DC) only requires the delivery of two supplies through the package, which simplifies package design and makes it less expensive. Finally, no off-chip components are necessary, providing significant platform size and cost reductions. Despite these numerous advantages, adoption has been limited because low converter efficiency has negated the many benefits of on-chip regulation.
Previous proposals for on-chip conversion include integrated low-drop-out (LDO) regulators, buck converters with off-chip inductors, and SC DC-DC converters. Wide voltage operation requires a regulator with a high efficiency across the full range of output voltages, and LDO regulators suffer from sub-50% efficiency at low operating voltages [5]. Buck converters with on-chip switches and off-chip inductors offer high efficiency but still require inductors to be integrated into the package or PCB [6]-[8]. Because the quality of integrated inductors is inherently worse than that of integrated capacitors [9], buck converters with on-chip inductors report lower efficiency [10]. By replacing inductors with capacitors, SC DC-DC converters can be fully integrated on-chip, but achieving high efficiency compared to designs with off-chip passives is challenging. Traditional interleaved SC DC-DC converters stabilize the output voltage to minimize frequency margining for supply variation [11]. Standalone converters have demonstrated high efficiency of 80%-90% [12]-[15]. However, full system implementations that use converters to drive real digital loads have reported limited efficiency of 52%-84% [16]-[18]. Table I provides a summary of prior work.
This paper implements a different switched-capacitor control approach, simultaneous-switching, to achieve high efficiency by switching all possible capacitance simultaneously and using an adaptive clock to maximize clock frequency for the resulting voltage ripple. The on-chip SC DC-DC converter powers a RISC-V [19] scalar microprocessor with vector accelerator, enabling improved DVFS with fast transitions between modes (20 ns), low area overhead (16%), simple package requirements (two voltages with no off-chip components), scalability to numerous voltage domains, and high efficiency.
Section II describes the reasons for the improved efficiency of simultaneous-switching over interleaved converters. Section III provides details about the design and implementation. Section IV analyzes measurement results from the chip and discusses different sources of efficiency loss.
II. SIMULTANEOUS-SWITCHING VERSUS INTERLEAVED SC DC-DC CONVERTERS
Maximizing conversion efficiency of DC-DC converters is essential for on-chip regulation, because low efficiency may cancel the energy efficiency gains of DVFS. Losses in SC DC-DC converters can be categorized into four separate components [15]: charge-sharing SC loss (P cfly), conduction loss (P cond), switching loss (P gate), and bottom-plate loss (P bottom). The contribution of each loss term to total losses for an interleaved and a simultaneous-switching SC converter is shown in Fig. 1. After design-time optimization of switch and capacitor size, the only parameters that change efficiency are the switching frequency (f sw) and the associated ripple size (ΔV). P cfly is inversely proportional to switching frequency, while P gate and P bottom are proportional to it; therefore, optimizing efficiency requires setting f sw such that the sum of losses is minimized. The efficiency differences between interleaved and simultaneous-switching SC DC-DC converters arise from the charge-sharing loss term. Fig. 2 compares the overall simultaneous-switching and interleaved approaches. Interleaved converters switch one unit cell at a time to stabilize the output voltage and remove losses due to unnecessarily high voltages for a fixed clock frequency, but unit cells share charge with each other and P cfly remains a significant loss component. Simultaneous-switching operation improves converter efficiency by switching all unit cells simultaneously to avoid charge-sharing losses, while an adaptive clock translates the rippling supply voltage into additional performance to eliminate system-level efficiency losses caused by the voltage ripple on the core supply [20]. For simultaneous-switching converters driving an ideal resistive load, perfect frequency adaptation would completely remove all charge-sharing loss (P cfly = 0). By removing the only loss component that is proportional to ripple size, the switching frequency can be decreased to further reduce the other loss terms.
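The frequency tradeoff described above can be illustrated with a minimal sketch. It assumes a lumped model in which the charge-sharing loss scales as 1/f sw, the gate and bottom-plate losses scale together as f sw, and the conduction loss is constant; the coefficient names and values (`K_CFLY`, `K_SW`, `P_COND`) are hypothetical and are not taken from the measured chip.

```python
import math

def total_loss(f_sw, k_cfly, k_sw, p_cond):
    """Sum of the loss terms (W) at switching frequency f_sw (Hz).

    k_cfly / f_sw : charge-sharing loss, inversely proportional to f_sw
    k_sw * f_sw   : gate-drive plus bottom-plate loss, proportional to f_sw
    p_cond        : conduction loss, treated as constant (simplifying assumption)
    """
    return k_cfly / f_sw + k_sw * f_sw + p_cond

def optimal_f_sw(k_cfly, k_sw):
    """Frequency minimizing k_cfly/f + k_sw*f (derivative set to zero)."""
    return math.sqrt(k_cfly / k_sw)

# Hypothetical coefficients, chosen only to make the trend visible.
K_CFLY = 4.0e5    # W*Hz
K_SW = 1.0e-11    # W/Hz
P_COND = 1.0e-3   # W

f_opt = optimal_f_sw(K_CFLY, K_SW)
# A 16x smaller charge-sharing coefficient moves the optimum to a 4x lower frequency.
f_opt_reduced = optimal_f_sw(K_CFLY / 16, K_SW)

print(f"optimum f_sw: {f_opt / 1e6:.0f} MHz, loss {total_loss(f_opt, K_CFLY, K_SW, P_COND) * 1e3:.1f} mW")
print(f"optimum f_sw with reduced charge sharing: {f_opt_reduced / 1e6:.0f} MHz")
```

In this toy model, shrinking the charge-sharing coefficient lowers both the optimal switching frequency and the minimum total loss, which is the qualitative argument made for simultaneous switching.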
In a real implementation, however, nonidealities cause a nonzero P cfly , and Section IV-D analyzes this loss further by using measured results. Fig. 3 shows the chip architecture. The 64 bit scalar core implements the free and open RISC-V instruction set [19] . A high-performance 64 bit vector accelerator improves energy efficiency by amortizing instruction fetch and control overhead for data-parallel operations. The processor boots Linux and executes compiled scalar and vector code. Two voltages, a 1.0 V core and 1.8 V I/O supply, are delivered to the on-chip converters. The SC DC-DC converter is partitioned into twenty-four 90 µm × 90 µm unit cells surrounding the core (16% area overhead) and generates four dynamically reconfigurable average ideal output voltages of 1.0, 0.9, 0.67, and 0.5 V. These fixed ratios were chosen in order to utilize common core and I/O voltages as inputs, and for their low output impedance coefficients [21] . Continuous voltage selection for DVFS is achieved by hopping between discrete SC DC-DC modes [20] , [22] , and these specific voltages were chosen as a tradeoff between DVFS tuning granularity and implementation complexity. A shared 
III. INTEGRATED SYSTEM IMPLEMENTATION
A. Scalar Core
The Rocket scalar core, shown in Fig. 4, is a 64 bit 5-stage single-issue in-order pipeline that executes the RISC-V instruction set architecture (ISA). It is carefully designed to minimize the impact of long clock-to-output delays of SRAM macros. For example, the pipeline resolves branches in the memory stage to shorten the critical path through the bypass path, but relies on extensive branch prediction (64 entry branch target buffer, 256 entry two-level branch history table, and a 2 entry return address stack) to mitigate the increased branch resolution penalty. The blocking 16 KB instruction cache is private to the scalar core, while the nonblocking 32 KB data cache is shared between the scalar core and the vector accelerator. The scalar core has a memory-management unit that supports page-based virtual memory. Both caches are virtually indexed and physically tagged, and have separate TLBs that are accessed in parallel with cache accesses. The core has an IEEE 754-2008 compliant floating-point unit that executes single- and double-precision floating-point operations, including fused multiply-add (FMA) operations, with hardware support for subnormal numbers. The resulting Rocket scalar core is competitive with industrial designs in terms of performance, power consumption, and area [23].
To reduce design complexity, the microprocessor is implemented as a tethered system. Unlike a standalone system, a tethered system depends on a host machine to boot, and lacks any I/O devices such as a console, mass storage, frame buffer, and network card. The host (e.g., an x86 laptop) is connected to the target tethered system via the host-target interface (HTIF), a simple protocol that lets the host machine read and write target memory and control registers. All I/O-related system calls are forwarded to the host machine using HTIF, where they are executed on behalf of the target. Programs that run on the scalar core are downloaded into the target's memory via HTIF. The resulting system is able to boot modern operating systems, such as Linux, utilizing I/O devices residing on the host machine, and can run standard complex applications such as the Python interpreter.
B. Vector Accelerator
The Hwacha vector accelerator, shown in Fig. 5 , is a decoupled single-lane vector unit optimized for ASIC designs. Hwacha executes vector operations temporally (split across subsequent cycles) rather than spatially (split across parallel datapaths), and has a vector length register that simplifies vector code generation and keeps the binary code compatible across different vector microarchitectures with different numbers of execution resources [24] .
The Rocket scalar core sends vector memory instructions and vector fetch instructions to the vector accelerator. A vector fetch instruction initiates execution of a block of vector arithmetic instructions. The vector execution unit (VXU) fetches instructions from the private vector instruction cache (VI$), decodes instructions, clears hazards, and then sequences vector instruction execution by sending multiple μops down the vector lane. The vector lane consists of a banked vector register file built out of 1R1W SRAM macros, operand registers, per-bank integer ALUs, and long-latency functional units. Multiple operands per cycle are read from the banked register file by exploiting the regular access pattern, with operand registers used as temporary space [23]. The long-latency functional units, such as the 64 bit integer multiplier and the single- and double-precision FMA units, are shared between the scalar core and the vector accelerator. The vector memory unit (VMU) supports unit-strided, constant-strided, and gather/scatter vector memory operations to the shared L1 data cache. Vector memory instructions are also sent to the vector runahead unit (VRU) by the scalar core. The VRU prefetches data blocks from memory and places them in the L1 data cache ahead of time to increase performance of vector memory operations executed by the VXU [24], [25].
The resulting vector accelerator is more similar to traditional Cray-style vector pipelines [26] than SIMD units such as those that execute ARM's NEON or Intel's SSE/AVX instruction sets, and delivers high performance and energy efficiency while remaining area efficient.
C. SC DC-DC Unit Cell
This system uses a reconfigurable DC-DC converter unit with a topology similar to [15], where separate networks of switches allow different conversion ratios for the same shared flying capacitor. Due to the availability of two different input voltages in the IO pads, two sets of switches are used: one for the configurations operating off a 1 V input and the other for configurations operating off a 1.8 V input. Four possible discrete SC DC-DC configurations, shown in Fig. 6, generate voltages between 0.5 and 1 V to enable a wide operating range. The converter has two phases: in the first phase φ 1, the flying capacitor is connected in series with the output, while in the second phase φ 2, the flying capacitor is connected in parallel. The 1 V input is divided with a 2:1 and 3:2 ratio to generate the 0.5 and 0.67 V modes, while the 1.8 V input is divided with a 2:1 ratio to generate the 0.9 V mode. All 1 V input switches are implemented as LVT devices to reduce their ON resistance, while the larger 1.8 V input switches are implemented as RVT devices to reduce their leakage. Additionally, the largest switches are forward-body-biased to reduce their ON resistance when they are active (i.e., in 1.8 V 1/2 mode). The flying capacitor is implemented as MOS capacitance with two layers of MOM capacitance above. Parasitic bottom-plate capacitance is reduced by using a series connection of the box, well, and substrate capacitances [27]. SC DC-DC converters are best suited for low-power-density applications where the limited capacitive density of on-chip capacitors is sufficient and the area overhead of converters is reasonable. While this implementation uses MOS capacitors to reduce cost, area overhead can be further reduced with MIM capacitors. Twenty-four unit cells were used in the design for a total flying capacitance of 2.1 nF.
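The ideal (unloaded) output voltage of each configuration follows directly from the input rail and the division ratio quoted above. The short sketch below tabulates them; the mode labels are illustrative names, the 1.0 V mode is modeled as a 1:1 pass-through, and the per-cell capacitance assumes the 2.1 nF is split evenly across the 24 unit cells, which the text does not state explicitly.

```python
from fractions import Fraction

# (input rail in volts, fraction of the input that appears at the output)
MODES = {
    "1.0 V 1:1": (Fraction(1), Fraction(1, 1)),      # pass-through (assumed 1:1)
    "1.8 V 2:1": (Fraction(18, 10), Fraction(1, 2)),
    "1.0 V 3:2": (Fraction(1), Fraction(2, 3)),
    "1.0 V 2:1": (Fraction(1), Fraction(1, 2)),
}

def ideal_output(mode):
    """Ideal average output voltage (V) of a mode, ignoring all losses."""
    v_in, ratio = MODES[mode]
    return float(v_in * ratio)

TOTAL_C_FLY_NF = 2.1
UNIT_CELLS = 24
# Assumes an even split of flying capacitance across unit cells.
c_fly_per_cell_pf = TOTAL_C_FLY_NF * 1000 / UNIT_CELLS

for name in MODES:
    print(f"{name}: {ideal_output(name):.2f} V")
print(f"flying capacitance per unit cell: {c_fly_per_cell_pf:.1f} pF")
```

The measured average outputs are lower than these ideal values, as discussed with the measurement results.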
For testing and measurement purposes, the bypass mode of the converter uses the 1 V mode to connect the regulator's 1 V input rail to V out of the microprocessor through power gates in the SC DC-DC unit cells, and the 1 V input rail is supplied by the desired bypass voltage to directly control the voltage of the microprocessor.
D. SC DC-DC Controller
The purpose of the SC DC-DC controller block is to trigger the switching of the converter unit cells in order to guarantee that the converter can provide the required current to the processor at all times. Analytically, the converter output current I out needs to equal the load current I L , which is assumed constant over one switching cycle T sw.
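The current balance described above can be written out explicitly. The expression below is a reconstruction, assuming that the charge delivered to the output in each switching cycle is α C fly ΔV, with α the topology proportionality constant, C fly the total flying capacitance, ΔV the ripple amplitude, and f sw = 1/T sw; it is a sketch consistent with those symbols rather than a verbatim copy of the authors' equation.

```latex
I_{L} = I_{out} = \frac{\alpha \, C_{fly} \, \Delta V}{T_{sw}} = \alpha \, C_{fly} \, \Delta V \, f_{sw}
```

Under this form, fixing ΔV makes f sw scale linearly with the load current, which matches the behavior attributed to the lower-bound controller.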
The topology proportionality constant (α) and the total amount of flying capacitance in the converter C fly are set at design time. During runtime, the SC DC-DC controller needs to maximize efficiency by appropriately controlling the amplitude of the voltage ripple (ΔV) and the converter switching frequency (f sw). This design implements a lower-bound (hysteretic) controller, shown in Fig. 7, that switches the cells when the output voltage V out drops below a reference voltage V ref, explicitly setting ΔV and implicitly modulating f sw in response to changing load current [28]. Lower-bound control was chosen for quick reaction to changes in the load current I L and to avoid switching the converter unnecessarily quickly.
The controller is composed of two main components: clocked comparators to detect when V out falls below V ref , and a finite-state machine (FSM) that generates the toggle signal for the unit cells. To guarantee that the toggle signal arrives simultaneously at all cells, the SC DC-DC controller is centralized, and the toggle signal is routed as a clock tree to minimize skew among cells.
Three separate StrongARM [29] comparators are used: the 1 V 2:1 mode uses a PMOS-input-based comparator (for the lowest common-mode input voltages), while the other modes use two NMOS-input-based comparators, with one operating on the rising edge of the clock and the other on the falling edge of the clock (for higher common-mode input voltages). A multiplexer changes V ref for different conversion ratios. In a lower-bound controller, the shortest achievable time between two switching events (t sw,min) is set by the propagation time of the toggling signal from the comparator output to the final power switches. The comparator clock frequency is set to 2 GHz to maximize power density by allowing all unit cells to toggle every t sw,min during high current loads, and to minimize the time that V out remains below V ref before detection triggers a toggle event.
An FSM, shown in Fig. 8, sends the toggle signal to the unit cells based on the comparator output. A rising edge on the comparator output signal, comparator_out, triggers a transition between the two converter phases. If comparator_out remains high for multiple cycles (because a large current spike keeps V out below V ref even after a switching event), a counter increments and forces a toggle when it reaches an overflow value. The overflow count is set to be slightly longer than the propagation time from the comparator through the toggle signal clock tree and to the switches, to avoid spurious switching events. The reset state is used during reset and during converter mode transitions.
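The toggle behavior described above can be sketched as a small cycle-based model. This is an illustrative approximation, not the implemented RTL: the class name `ToggleFSM`, the two-phase encoding, and the overflow value of 3 comparator cycles are assumptions made for the example.

```python
class ToggleFSM:
    """Toy model of the toggle-generation FSM (one step per comparator clock cycle)."""

    def __init__(self, overflow_count):
        self.overflow_count = overflow_count  # hypothetical value, in comparator cycles
        self.reset()

    def reset(self):
        """Reset state, also used during converter mode transitions."""
        self.phase = 0          # 0 -> phi1, 1 -> phi2
        self.prev_cmp = False
        self.counter = 0

    def step(self, comparator_out):
        """Advance one cycle; return True if the unit cells toggle this cycle."""
        comparator_out = bool(comparator_out)
        toggle = False
        if comparator_out and not self.prev_cmp:
            # Rising edge: V out has just dropped below V ref.
            toggle = True
            self.counter = 0
        elif comparator_out:
            # Still below V ref after a toggle: count toward a forced toggle.
            self.counter += 1
            if self.counter >= self.overflow_count:
                toggle = True
                self.counter = 0
        else:
            self.counter = 0
        if toggle:
            self.phase ^= 1
        self.prev_cmp = comparator_out
        return toggle

fsm = ToggleFSM(overflow_count=3)
comparator_trace = [0, 1, 0, 1, 1, 1, 1, 0]
toggles = [fsm.step(c) for c in comparator_trace]
print(toggles)
```

In this trace the first two toggles come from rising edges, and the third is forced by the counter after comparator_out stays high for three further cycles.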
E. Adaptive Clock
The adaptive clocking scheme, shown in Fig. 9 , changes the clock frequency on a cycle-by-cycle basis to ensure that the system operates at the maximum instantaneous frequency obtainable for the instantaneous voltage [30] . The rippling supply voltage from the SC DC-DC converters powers a tunable replica circuit (TRC), adjustable from 4 to 124 FO1 inverter delays with a delay setting register, to mimic the critical path delay at each instantaneous voltage level. When the TRC generates a pulse, the controller selects one of the 16 DLL phases to send to the core. Separate TRC paths control the high and low clock periods to set the duty cycle. This is a free-running clock, in which nothing determines the average frequency other than the average delay through the TRC.
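The benefit of a free-running clock that follows the replica delay can be illustrated numerically. The sketch below assumes an alpha-power-law delay model and a linear sawtooth ripple, both hypothetical, and ignores DLL phase quantization, the separate high and low TRC paths, and clock insertion delay. It compares the average adaptive frequency against a fixed clock margined for the lowest voltage in the ripple.

```python
def path_delay(v, k=1.0e-9, v_th=0.3, a=1.3):
    """Hypothetical alpha-power-law delay (s) of the replica path at supply v (V)."""
    return k * v / (v - v_th) ** a

def sawtooth(t, v_max, ripple, period):
    """Supply that jumps to v_max at each switching event, then droops linearly."""
    return v_max - ripple * ((t % period) / period)

def average_adaptive_frequency(v_max, ripple, period, steps=10000):
    """Accumulate fractional clock cycles over one ripple period in small time steps."""
    dt = period / steps
    cycles = 0.0
    for i in range(steps):
        v = sawtooth((i + 0.5) * dt, v_max, ripple, period)
        cycles += dt / path_delay(v)
    return cycles / period

# Hypothetical ripple: 0.95 V peak, 100 mV droop, 50 ns between switching events.
V_MAX, RIPPLE, PERIOD = 0.95, 0.10, 50e-9

f_adaptive = average_adaptive_frequency(V_MAX, RIPPLE, PERIOD)
f_fixed = 1.0 / path_delay(V_MAX - RIPPLE)  # fixed clock margined for the minimum voltage

print(f"adaptive: {f_adaptive / 1e6:.0f} MHz, fixed: {f_fixed / 1e6:.0f} MHz")
```

Because the modeled delay decreases monotonically with voltage, the adaptive average always lies between the frequencies at the minimum and maximum voltages, so it recovers performance that a fixed clock would leave as margin.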
During operation, the first TRC output pulse asynchronously resets the clock toggler flip-flop to generate the falling edge of the clock output. The second TRC output pulse synchronizes the rising edge of the adaptive clock with the DLL references. Level shifters are located between the TRC and the controller. Since the DLL references and the TRC output pulse are fully asynchronous, a watchdog block monitors the system for metastability. Fig. 10 shows the ability of the adaptive clock to track changes in voltage by using the bypass mode to measure average frequency for different delay settings for the TRC. Annotations above the plot indicate the approximate voltage ranges seen in each SC DC-DC mode. Because the inverter-based replica path delay characteristics do not match the critical paths of the processor, a single delay setting poorly tracks the processor critical path over the entire voltage range. However, manual calibration of specific delay settings for each 
F. Physical Design
A multivoltage and multiclock design flow was used to construct the processor. Fig. 11 shows the processor floorplan, with the dotted red line separating the large core voltage domain at the top from the small uncore voltage domain at the bottom and sides of the chip. The custom SRAMs were manually placed within the core voltage domain. The DC-DC unit cells surround the core to minimize voltage drop. Two layers of thick upper-layer metal were dedicated to a power grid, where V out and GND each utilize 25% of the chip area in each layer, and power rail analysis estimates a 2 mV voltage drop at 1 V and 100 mA (nominal operating condition). Ideally, converter power would come from bumps directly above the converter, but because only wire-bond packaging was available, all of the power is supplied through the pad frame in this implementation. Outside the core, V out rails are not necessary, so the input voltages to the converters (V DD,1.0 and V DD,1.8) use the majority of the power routing resources to connect power coming from the pad frame to the converters.
IV. EXPERIMENTAL RESULTS
A prototype system was designed and implemented [31] in 28 nm ultra-thin body and BOX fully depleted silicon-on-insulator (UTBB FDSOI) technology [32]. Fig. 23 and Table II show the die micrograph and chip summary, respectively.
A. Measurement Setup
The measurement setup is shown in Fig. 12 . The die is packaged using chip-on-board wire bonding to a small daughterboard. There is decoupling capacitance for the 1 and 1.8 V inputs to the converter both on the chip and on the daughterboard. A multimeter or oscilloscope connects to sense points on the daughterboard to measure the output voltage rail supplied by the SC DC-DC converter. The daughterboard is connected over FMC to a motherboard which generates the necessary clock, supplies, and reference voltages. Additional testpoints on the motherboard connect to a sourcemeter to measure the input power provided to the SC DC-DC converter. The chip is controlled from a Zedboard, which includes a network-accessible ARM core with FPGA to connect to main memory and emulate system call operations.
B. DVFS for Improved Energy Efficiency
The measured traces of the rippling core voltage domain for all four possible configurations are shown in Fig. 13 . The actual average output voltage is lower than the ideal divided output voltages due to charge sharing with the inherent decoupling capacitance of the core. (The relationship between ripple size and average output voltage is further discussed in Section IV-E.) For all possible converter topologies with adaptive clocking, the processor successfully boots Linux and Tight integration of the on-chip SC DC-DC converter with the processor enables extremely fine-grained DVFS. Fig. 14 shows that the processor can switch between operating modes in approximately 20 ns. These fast mode transitions enable new DVFS algorithms that can operate at much shorter time scales.
The main goal of on-chip conversion is to improve energy efficiency through DVFS. Fig. 15(a) shows the energy efficiency of the system, for both the baseline system with ideal off-chip regulation (bypass mode) and the four topologies. Energy efficiency is measured using a double-precision floating-point matrix multiplication kernel in terms of billions of floating-point operations per second per watt (GFLOPS/W), which is the inverse of energy per operation. Fig. 15(b) shows how different topologies change the absolute power and delay of the processor. Forward body bias (FBB) of the microprocessor in FDSOI enables threshold voltage control during runtime to trade off performance and power [33], as shown for this design in Fig. 15(c) and (d). By using the on-chip converter to generate the lowest output voltage, the system achieves a peak efficiency of 26.2 GFLOPS/W.
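Since GFLOPS/W is the inverse of energy per operation, the peak figure can be restated as energy per floating-point operation. The helper below is a small illustrative conversion; the function name is hypothetical.

```python
def energy_per_flop_pj(gflops_per_watt):
    """Convert GFLOPS/W to picojoules per floating-point operation.

    1 GFLOPS/W equals 1e9 FLOP per joule, so the inverse is joules per FLOP.
    """
    joules_per_flop = 1.0 / (gflops_per_watt * 1e9)
    return joules_per_flop * 1e12

peak_pj = energy_per_flop_pj(26.2)
print(f"{peak_pj:.2f} pJ per floating-point operation")
```

For the reported peak of 26.2 GFLOPS/W, this works out to roughly 38 pJ per double-precision operation.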
C. System Efficiency
The efficiency of voltage converters is generally computed by measuring the current and voltage on both the input and output of the converter to measure the ratio of power delivered to power supplied. However, for the proposed system, efficiency defined in this way is not easily measurable. First, it is difficult to measure on-chip voltage and current, because the voltage is rippling very quickly. Second, even if power output of the converter could be measured, this metric would ignore the impact of the adaptive clock, which is an important loss component. Therefore, a different method is required to measure the efficiency of the implemented system. This paper defines system efficiency with a metric that fairly accounts for the adaptive clock and does not require measuring on-chip voltage and current. To characterize the processor load, the bypass mode is used to directly supply the core with an ideal off-chip voltage source. A self-checking benchmark is run for a fixed number of cycles at different voltages, and a binary search is performed at each voltage point to find the maximum frequency. At the maximum frequency, the total elapsed time and total energy to run the fixed-length benchmark is measured, where the energy is computed by measuring the current drawn from the off-chip supply and the delivered voltage is measured from sense points on V out , to remove the voltage drop across the on-chip bypass-mode power gates from the efficiency calculation. This provides the blue curve in the figure of energy versus time, and represents a 100%-efficient off-chip regulator.
Then, for each DC-DC mode, the same benchmark is run for the same number of cycles, and the total elapsed time and energy are measured. Due to nonidealities of the converter, it takes more energy to perform the same task in the same amount of time. Therefore, system efficiency is defined as the ratio of the energy required in bypass mode to the energy required with the converter to finish the same workload in the same time. This metric includes all sources of overhead, including nonidealities in the adaptive clock. Fig. 16 shows that the measured system efficiency ranges from 80% to 86% across the different output voltage modes.
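The system efficiency metric can be sketched as a lookup on the bypass-mode energy-versus-time curve at the runtime achieved with the converter. The example below assumes linear interpolation between characterized bypass points; the function names and the data points are hypothetical and only illustrate the calculation.

```python
def interpolate_energy(curve, t):
    """Linearly interpolate bypass-mode energy at runtime t.

    curve: list of (runtime, energy) points sorted by increasing runtime.
    """
    for (t0, e0), (t1, e1) in zip(curve, curve[1:]):
        if t0 <= t <= t1:
            return e0 + (e1 - e0) * (t - t0) / (t1 - t0)
    raise ValueError("runtime outside the characterized range")

def system_efficiency(bypass_curve, t_converter, e_converter):
    """Bypass energy at the same runtime divided by the energy with the converter."""
    return interpolate_energy(bypass_curve, t_converter) / e_converter

# Hypothetical characterization: (runtime in ms, energy in mJ) in bypass mode.
BYPASS = [(1.0, 10.0), (2.0, 6.0), (4.0, 4.0)]

# Hypothetical converter-mode measurement: 3.0 ms runtime, 6.25 mJ drawn from the input.
eff = system_efficiency(BYPASS, 3.0, 6.25)
print(f"system efficiency: {eff:.0%}")
```

Because the comparison is made at equal runtime, any slowdown from imperfect clock tracking shows up as a lower efficiency, which is why the metric captures adaptive clock overhead as well as converter losses.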
D. Loss Analysis
The 14%-20% system efficiency losses are attributed to both converter losses and nonideal adaptive clocking based on measured results. 
1) Standalone Converter Losses:
The efficiency of the converter alone is estimated by characterizing the power at each voltage using a repetitive microbenchmark and numerically integrating the waveform at V out to determine the ratio of input to output power. These results are an approximate measure of efficiency, because the ripple measured from off-chip will not perfectly match the true on-chip voltage waveform. Fig. 17 shows that the converter alone achieves a maximum efficiency above 90%, and compares this efficiency to the measured system efficiency (and the corresponding power density of the benchmark) and the hypothetical efficiency for a system running with a fixed frequency clock at the minimum observed voltage. A wide range of power densities was measured by changing the proportion of 24 SC DC-DC unit cells that are enabled, which contributes to the discontinuities in the data.
2) Adaptive Clocking Losses: Analytical modeling of the adaptive clock, based on measured results, predicts a 5%-10% efficiency loss due to nonideal adaptive clocking. A simple experiment, illustrated in Fig. 18 , shows how clock frequency margins, required to compensate for imperfect adaptive clock generation, translate to system efficiency losses. First, the characteristic total energy versus total runtime of the core is plotted based on measured results. A hypothetical converter with 90% efficiency would require more total energy to complete the same workload in the same amount of time. If the hypothetical converter also increases the critical path delay by 5% due to any nonidealities, the curve shifts to the right due to the increase in runtime, and shifts slightly up due to increased leakage integration time. These shifts correspond to a decrease in efficiency, because an increase in runtime can also be interpreted as requiring a higher operating voltage to achieve the same overall runtime. In this case, a 5% increase in average delay would equate to an approximately 5% decrease in system efficiency. The exact translation from delay increase to efficiency depends on the slope of the energy-delay curve for a particular design and technology.
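The thought experiment above can be reproduced with a toy energy-delay curve. The sketch assumes a power-law curve with slope -1 on a log-log plot (energy inversely proportional to runtime), which is a hypothetical choice, and it ignores the extra leakage accumulated over the longer runtime; the function names are illustrative.

```python
def bypass_energy(t, e_ref=1.0, t_ref=1.0, slope=-1.0):
    """Hypothetical ideal-regulator energy-vs-runtime curve: E = e_ref * (t / t_ref) ** slope."""
    return e_ref * (t / t_ref) ** slope

def shifted_efficiency(converter_eff, delay_increase, t=1.0, slope=-1.0):
    """System efficiency after scaling energy by 1/converter_eff and runtime by (1 + delay_increase)."""
    t_actual = t * (1.0 + delay_increase)
    e_actual = bypass_energy(t, slope=slope) / converter_eff
    # Compare against the ideal regulator finishing in the same (longer) runtime.
    return bypass_energy(t_actual, slope=slope) / e_actual

eff_ideal_clock = shifted_efficiency(0.90, 0.0)
eff_margined = shifted_efficiency(0.90, 0.05)
print(f"ideal clock: {eff_ideal_clock:.3f}, with 5% delay margin: {eff_margined:.3f}")
```

With this slope, the 5% delay increase reduces system efficiency from 0.90 to about 0.857, close to a 5% relative loss. A steeper or shallower curve changes the translation, consistent with the dependence on the energy-delay slope noted in the text.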
The quantitative effect of nonideal adaptive clocking can be estimated with numerical simulation based on measured results. The simulation, shown in Fig. 19 , divides a voltage ripple into small time steps and tracks the progression of a signal through the replica, clock, and critical path based on the delay at each instantaneous voltage. The voltage ripple and voltage versus frequency characteristics of the replica and critical path are supplied by measured results, while the insertion delay is supplied by back-annotated timing analysis. Two main effects cause nonideal adaptive clock tracking. First, each path has different characteristic delay versus voltage tradeoffs due to different gate types or different relative contribution of gate or wire delay. Second, the insertion delay of the clock tree means that the replica and critical path see different voltages, but the clock tree itself will compensate to diminish this effect [34] . Therefore, after many simulated cycles there is a distribution of extra FO4 stages that could be computed by the core before the margined adaptive clock edge arrives. The average of the distribution corresponds to the overhead of the adaptive clock, and the numerical simulation predicts an average cycle time increase of 7%. The losses due to nonideal clocking are already included in the system efficiency measurement, so this prediction serves as an estimate of the relative contribution of nonideal clocking to total losses.
E. Effect of V ref on Efficiency
As discussed in Section III-D, the choice of V ref sets the size of the output voltage ripple, and the load current automatically modulates the switching frequency f sw. For the ideal case shown in Fig. 1, simultaneous-switching converters have essentially zero losses from P cfly, but in reality, a simultaneous-switching converter will still charge-share with the intrinsic capacitance of the load. Fig. 20 analytically compares the efficiency as a function of switching frequency for three 1 V 2:1 mode converters: a conventional interleaved converter, the proposed simultaneous-switching converter, and a hypothetical simultaneous-switching converter with no load capacitance. Because the interleaved converter has more charge-sharing losses, it incurs high losses for large ripple sizes at low switching frequencies, and therefore has a higher optimal switching frequency. A simultaneous-switching converter that closely matches the implemented system, with an output load capacitance equal to the converter capacitance, has charge-sharing losses with the output load that cause an approximate 5% efficiency loss versus an ideal simultaneous-switching converter. No explicit decoupling capacitance was added to the core in order to minimize charge-sharing losses.
Charge sharing also causes the average output voltage to fall below the ideal divided output voltage for each converter type. Measurements confirm that charge sharing with the processor's intrinsic capacitance causes the average output voltage to change for different V ref choices, as shown in Fig. 21 . While an optimal V ref maximizes efficiency, suboptimal V ref points could be chosen to achieve finer-grain control of the average output voltage (and therefore average performance) than switching between the fixed conversion topologies. As the load current changes, the optimal V ref will also change. 
V. CONCLUSION
The combination of the RISC-V architecture, low-voltage SRAM, and wide operating range DVFS enabled by on-chip voltage conversion and adaptive clocking achieves 26.2 GFLOPS/W with the 1 V 1/2 DC-DC configuration when computing double-precision matrix-multiplication using the vector accelerator. A simultaneous-switching SC DC-DC built with MOS capacitors and a centralized lower-bound controller reconfigures to provide four output voltages between 0.45 and 1 V, and achieves high converter efficiency by avoiding charge sharing. An adaptive clock translates high converter efficiency to high system efficiency by maximizing clock frequency for the voltage waveform to the core. Measurement results show that the system achieves 80%-86% system efficiency, with losses attributed to traditional converter switching losses, charge-sharing with the intrinsic capacitance of the core, and imperfect clock tracking. The simultaneous-switching approach described in this paper provides a low-cost and high-efficiency DVFS solution for low-power mobile devices.
