Abstract-This paper presents a RISC-V system-on-chip (SoC) with integrated voltage regulation, adaptive clocking, and power management implemented in a 28 nm fully depleted silicon-on-insulator process. A fully integrated simultaneous-switching switched-capacitor DC-DC converter supplies an application core using a clock from a free-running adaptive clock generator, achieving high system conversion efficiency (82%-89%) and energy efficiency (41.8 double-precision GFLOPS/W) while delivering up to 231 mW of power. A second core serves as an integrated power-management unit that can measure system state and actuate changes to core voltage and frequency, allowing the implementation of a wide variety of power-management algorithms that can respond at submicrosecond timescales while adding just 2.0% area overhead.
I. INTRODUCTION

ENERGY efficiency is the key constraint in modern systems-on-chip (SoCs). Server-class chips are thermally limited and require better energy efficiency to improve performance, while the utility of mobile and IoT devices depends substantially on low energy consumption to prolong battery life. These constraints demand advanced power-management systems that can improve energy efficiency without sacrificing performance by adapting to the demands and constraints of particular workloads. However, many modern workloads demonstrate rapid changes in program behavior [1]-[3]. Mobile and IoT devices, often driven by an unpredictable sensor or a user input, have especially bursty workloads, with short periods of high compute demand followed by long periods of inactivity [4]. Accordingly, the most effective power-management systems must be able to respond at microsecond timescales, rapidly adjusting voltage and frequency to track changes in workload and maximize energy savings while minimizing performance loss.
One barrier to implementing fine-grained power management is the ability to quickly switch between different voltage and frequency operating modes. Most systems supply a small number of voltage domains via off-chip regulators. These off-chip components generally have slow transition times due to the large time constants of the off-chip passives, and the additional component count can substantially increase system cost. Integrated regulation can achieve much faster transition times because the passive elements are much smaller. However, area limitations and the lower quality of integrated inductors make it difficult to achieve high conversion efficiencies or power densities in fully integrated systems, negating any energy savings [5]-[9]. Similarly, most clock generation for digital circuits is done by large power-hungry phase-locked loop (PLL) circuits. These PLLs typically take many cycles to lock to a new frequency target. Thus, traditional regulation and clocking circuits severely limit the speed of voltage and frequency changes.
0018-9200 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
While integrated regulation has given some systems high conversion efficiencies, integrated power control logic is also required to achieve fine-grained power management. Some previous systems have lacked power-management capability [10]. Others have supported power management, but actuate voltage and frequency changes only in response to operating system requests at millisecond timescales [11], [12]. Hardware-actuated power management has been proposed in [3] and [13]-[19], but only a system implementation that includes fast efficient voltage regulation, lightweight clock generation, and dedicated power-management hardware can demonstrate the efficacy of the approach. Table I provides a summary of prior work.
This paper implements a processor SoC with integrated voltage regulation and power management. Simultaneous-switching switched-capacitor DC-DC (SC-DCDC) converters are tightly integrated with an adaptive clock generator to achieve high system conversion efficiencies and power densities suitable for a mobile-class processor [10]. The generated voltage and clock supply a RISC-V [20] application processor with a vector accelerator that loads the SC-DCDC converters with up to 231 mW of power demand. A separate fixed-voltage RISC-V power-management unit (PMU) processor can actuate changes to each of these systems in response to changes in workload at submicrosecond timescales, enabling fully autonomous power management that can reduce core energy consumption by up to 39.6% with minimal performance penalty. This paper expands upon the initial presentation in [21] with more detailed design descriptions of the PMU processor, the adaptive clock generator, and top-level physical design. Additional measurement results show the result of diverse power management algorithms, demonstrating the utility and flexibility of the SoC.
Section II describes the system implementation in detail, and Section III details measured performance results of the system. Section IV describes the benefits of a variety of power-management algorithms running on the SoC, and Section V concludes this paper.
II. SoC DESIGN
Fig. 1 summarizes the components of the SoC. The system is partitioned into two voltage domains: 1) a variable-voltage core domain that contains the application processor and 2) a fixed 1 V uncore domain that contains the power measurement and control blocks. Because the core can operate at varying frequencies and voltages, all digital communication between the core and the uncore uses level shifters and asynchronous queues.
The core voltage domain is supplied by an integrated SC-DCDC converter. The reconfigurable converter can operate in four modes to downconvert external 1 V and 1.8 V supplies. Each conversion mode represents a configuration of the flying capacitors that results in voltage conversion ratios of one-half or two-thirds of the external supply voltages, generating voltages averaging 1 V (bypass mode), 0.9 V (1.8 V 1/2 mode), 0.67 V (1 V 2/3 mode), and 0.5 V (1 V 1/2 mode). The SC-DCDC converter is partitioned into forty-eight 90 × 90 μm unit cells and equipped with a lower-bound (hysteretic) controller. When the converter output voltage crosses a fixed external reference voltage, a toggle signal is generated and distributed to each unit cell in a clock tree balanced by the place-and-route tool. All unit cells switch simultaneously, and so the converter output voltage ripples significantly about the average value, improving the system conversion efficiency [10]. Because the generated voltage can ripple over a wide range, the core clock is produced by a free-running adaptive clock generator that modulates its frequency to track fine-grained voltage variations. Core SRAMs are implemented as custom 8T macros to enable low-voltage operation.
The frequencies of the core clock and the SC-DCDC toggle signal are recorded by counters that can be read by the PMU processor, which resides in the uncore domain. The PMU is fully programmable and executes power-management programs that read on-chip counters and alter the SC-DCDC output voltage mode while the application core is running. In addition, the threshold voltage of the logic in the core voltage domain can be manipulated by an integrated back-bias generator that can supply up to 1.8 V of forward body bias (FBB) [22]. Measurement circuits, such as a programmable current mirror load and an integrated SC-DCDC waveform reconstruction unit, allow for the full characterization of key system parameters of the SoC [23]. The uncore voltage domain also contains logic to serialize off-chip accesses for communication with the host FPGA system.
A. Compute Core
The core voltage domain contains an application processor based on the open-source Rocket processor [24] with a vector coprocessor similar to the implementations described in [10] and [25] . Rocket is a 64-bit in-order single-issue processor that implements the free and open RISC-V instruction set architecture (ISA) [20] . The scalar core implements a simple branch predictor, L1 caches, and a memory-management unit that supports page-based virtual memory and boots modern operating systems such as Linux.
A high-performance single-lane vector coprocessor accelerates compute-intensive workloads. The coprocessor implements a decoupled vector-fetch architecture that can achieve high compute efficiency via systolic dataflow [26] . The vector unit is tightly integrated with the Rocket core, which issues commands to the accelerator via a custom extension to the RISC-V ISA. The memory system, including the 32 KB L1 data cache, is shared between the scalar core and the vector accelerator, although the vector unit implements its own 8 KB instruction cache. Compared with the implementation in [10] , the vector accelerator is improved by increasing floating-point throughput to two single-precision and double-precision fused multiply-add (FMA) units supporting up to four floating-point operations per cycle [27] and by adding separate long-latency functional units dedicated to the accelerator.
B. Adaptive Clock Generator
An adaptive clock generator supplies the clock to the core logic, improving system energy efficiency by tracking the voltage ripple produced by the simultaneous-switching SC-DCDC converters and adjusting the clock period on a cycle-by-cycle basis. Fig. 2 shows the schematic of the adaptive clock generator. The delay units are composed of four tunable delay banks supplied by the same rippling voltage as the core logic. Each bank uses a different cell for its delay element, and the banks can be tuned independently by changing the control settings of the multiplexers in each bank. The first bank uses a custom buffer cell designed to balance rise and fall times. The remaining banks each contain a library standard cell that frequently occurred in the core netlist produced by the synthesis tool. Each bank uses a different combination of pMOS/nMOS ratio and gate length, as these characteristics affect the voltage-frequency relationship of the cells, and the replica paths must be able to track similar cell variations in the critical paths of the core. This combination of multiple delay cells tracks the core critical path more accurately than a single standard cell [28] . The two delay units are identical, but can be independently tuned to adjust the duty cycle of the generated clock. This free-running adaptive clock generator was implemented alongside the clock generator from [10] that selects edges from a fixed reference for direct comparison.
The presence of insertion delay in the core clock tree negatively impacts the ability of the adaptive clock to accurately track the supply voltage. In the ideal case, the generated clock would instantaneously propagate to all clock sinks and the voltage of the core logic would be the same as that of the replica paths in the adaptive clock generator. In practice, the clock tree causes the clock edges to be received by the register sinks only after some delay. During this delay period, the core voltage further decreases, so the core operates more slowly than the equivalent replica path when the edges were generated. While the progression of the clock edges through the clock tree also slows down due to the decrease in voltage [29], this clock-data compensation effect does not fully guard against insertion delay effects because the voltage-dependent delay characteristics of the clock tree and logic critical paths may not be the same. Large insertion delays require additional timing margin that reduces system efficiency, so care was taken during physical design to minimize the insertion delay of the core clock tree.
C. Counters and Power Measurement
For power-management algorithms to be effective, the power consumption of the system must be known or estimated. Some power-management schemes implement complex models based on the activity of a number of key signals to estimate power consumption [12] , [30] , while others place a sense resistor between the supply and the load [31] or halt the normal operation of the voltage regulators to measure current draw [32] . In contrast, the SC-DCDC system in this paper enables simple, accurate, and noninvasive measurement of core power because power correlates directly with the switching frequency of the regulators. The SC-DCDC voltage regulators toggle when the supplied voltage reaches a known fixed reference voltage, and the capacitance of the system is fixed; therefore, each DC-DC toggle transfers a fixed amount of energy to the core. As shown in Fig. 3 , a slow toggling frequency implies lower power consumption and an increased toggling frequency implies high power consumption. Therefore, we instrument the SC-DCDC toggle signal with a counter that compares its frequency with a fixed reference to measure power, while a second counter measures the core clock frequency.
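The counter-based measurement described above can be illustrated with a minimal sketch. It assumes a first-order model in which every toggle delivers the same energy, so that power is simply energy per toggle times toggle frequency. The function names, the 100 MHz reference clock, and the 2 nJ energy per toggle are hypothetical values chosen for illustration and are not taken from the measured system.

```python
def toggle_frequency_hz(toggle_count, ref_count, ref_clock_hz):
    """Toggle frequency inferred from two counters sampled over the same window.

    toggle_count: SC-DCDC toggles seen in the window
    ref_count:    cycles of the fixed reference clock seen in the same window
    """
    if ref_count <= 0:
        raise ValueError("ref_count must be positive")
    return toggle_count / ref_count * ref_clock_hz


def estimate_core_power_w(toggle_count, ref_count, ref_clock_hz, energy_per_toggle_j):
    """First-order estimate: each toggle transfers a fixed energy to the core."""
    return energy_per_toggle_j * toggle_frequency_hz(toggle_count, ref_count, ref_clock_hz)


# Hypothetical numbers: 250 toggles during 1000 cycles of a 100 MHz reference,
# with an assumed 2 nJ delivered per toggle.
power_w = estimate_core_power_w(250, 1000, 100e6, 2e-9)
print(f"estimated core power: {power_w * 1e3:.1f} mW")
```

In practice the energy per toggle would be characterized separately for each conversion mode, since the reference voltage and capacitor configuration differ between modes.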
D. Power Management Core
The PMU contains a 32-bit 3-stage single-issue in-order RISC-V processor based on Z-scale [33]. Implementing the PMU with a processor instead of fixed-function hardware enables experimentation with a wide range of power-management algorithms. The power-management core forgoes caches in favor of an 8 KB scratchpad memory, which is 128 bits wide and mapped to the upper half of the 4 GB physical memory space (the lower half is reserved for memory traffic from the application processor). The PMU supports the RV32IM instruction extensions and is fully programmable via the RISC-V software toolchain. Power-management programs are written into the scratchpad during system boot. Fig. 4 shows the pipelines of the two processors in the SoC, and Table II compares their key features.
The three-stage design minimizes gate count while providing sufficient performance for fine-grained power management. The first stage of the pipeline fetches an instruction out of a 128-bit instruction buffer (the value of the program counter is calculated in the previous cycle). The second stage of the pipeline decodes a RISC-V instruction, reads the register file, and executes an ALU instruction. Branches are resolved in the second stage, so the instruction in the fetch stage is flushed when a branch is taken. Writeback is isolated into the third stage of the pipeline, reducing the fanout delay on the write port of the register file. The result in the third stage is bypassed to the second stage, eliminating the need for stalls in some cases. Memory accesses and the multiply/divide operations are also handled in the third pipeline stage. The multiply and divide units compute 1 bit of the operation each cycle, requiring 32 cycles to complete a multiply or divide. Loads and stores directly access the scratchpad. An extra pipeline register was added in the arbiter between instruction and data memory requests to eliminate a long critical path and allow the PMU to meet the setup time constraint of the uncore clock domain.
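To make the 32-cycle latency concrete, the following sketch models an iterative multiplier that retires one multiplier bit per cycle. It assumes a plain shift-and-add scheme; the text states only the one-bit-per-cycle rate, so the exact hardware algorithm and the function name are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep results within a 32-bit register


def iterative_multiply(a, b):
    """Return (low 32 bits of a*b, cycles used) for a 1-bit-per-cycle multiplier."""
    a &= MASK32
    b &= MASK32
    product = 0
    cycles = 0
    for bit in range(32):
        # One cycle per multiplier bit: conditionally add the shifted multiplicand.
        if (b >> bit) & 1:
            product = (product + (a << bit)) & MASK32
        cycles += 1
    return product, cycles


result, cycles = iterative_multiply(6, 7)
print(result, cycles)
```

A serial unit of this kind trades latency for area, which fits the stated goal of minimizing PMU gate count.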
All system control registers that monitor and control the various subsystems of the chip, including the SC-DCDC converter voltage mode, are mapped into the PMU's control status register (CSR) space. Additionally, the application core and the PMU can communicate directly via inter-processor interrupts, and each has a register mapped directly into the CSR space of the other system, allowing arbitrary data to be communicated between the two cores. The PMU issues CSR reads to the SC-DCDC toggle counters and core clock counters to detect changes in workload. It can then issue CSR writes to actuate mode changes in a fast feedback loop as shown in Fig. 5 .
E. Custom Low-Voltage SRAMs
Operation at low voltages is critical to achieving maximum energy efficiency, but on-chip SRAM typically limits the minimum operating voltage of the entire system as the small transistors in SRAM bitcells are especially vulnerable to process variation. All SRAM arrays in the core voltage domain use the same custom 4 KB 8T-based SRAM macro shown in Fig. 6 , which is logically organized as 512 entries of 72 bits (64 bits + 8 possible error-correcting code bits) and physically organized as two arrays of 128 rows × 144 columns with two-to-one physical interleaving. Low-voltage operation is enabled by the 8T bitcell, where each transistor is larger than the equivalent high-density 6T bitcell, and by the fully depleted silicon-on-insulator (FD-SOI) process, which reduces the threshold voltage variation [34] , [35] . While the arrays also implemented a negative-bitline write assist, the assist was not necessary to achieve minimum voltage operation.
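As a quick consistency check, the logical and physical organizations quoted above describe the same number of bitcells. The short calculation below uses only the figures given in the text; the variable names are illustrative.

```python
ENTRIES = 512
BITS_PER_ENTRY = 72          # 64 data bits + 8 possible ECC bits
DATA_BITS_PER_ENTRY = 64
ARRAYS, ROWS, COLUMNS = 2, 128, 144
INTERLEAVE = 2               # two-to-one physical interleaving

logical_bits = ENTRIES * BITS_PER_ENTRY            # bits seen by the logic
physical_bits = ARRAYS * ROWS * COLUMNS            # bitcells in the macro
words_per_row = COLUMNS // BITS_PER_ENTRY          # entries sharing one physical row
data_kib = ENTRIES * DATA_BITS_PER_ENTRY / 8 / 1024

print(logical_bits, physical_bits, words_per_row, data_kib)
```

Each 144-column row therefore holds two interleaved 72-bit entries, and the 4 KB capacity refers to the data bits alone.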
F. Physical Design
The top-level physical design of the SoC was completed using a multiclock and multivoltage flow in a digital place-and-route tool. Fig. 7 shows the floorplan of the SoC. The design is partitioned into two voltage areas, with the core voltage area supplied by the SC-DCDC converters placed centrally. Numerous additional voltages and clocks are defined to supply both the core and the various analog and mixed-signal blocks that make up the SoC. As described in Section II-B, reducing the insertion delay from the adaptive clock generator to the core clock sinks is critical to improving system efficiency. Accordingly, the clock generator itself was placed near the center of the core area, and a "peninsula" of the uncore voltage domain was extended to allow the routing of control signals from the block to the top level of the design hierarchy. The location of the core clock multiplexer, which allowed the selection of different core clock sources for test, was specified explicitly and placed near the center of the core area. These improvements combined to reduce core insertion delay by several hundred picoseconds.
Decoupling capacitance was added to improve the integrity of the 1.0 V and 1.8 V inputs to the SC-DCDC unit cells and help offset power delivery issues caused by the wirebond packaging of the chip. A custom decoupling cell with MOS capacitors and a six-layer metal-oxide-metal 50 nm mesh adds 539 pF of capacitance to the 1.8 V supply and 802 pF of capacitance to the 1.0 V supply.
III. EXPERIMENTAL RESULTS
A prototype system was designed and implemented [21] in 28 nm ultrathin body and BOX FD-SOI technology [36] . Fig. 7 shows the die micrograph, and Table III summarizes the key features.
A. Measurement Setup
The test setup is similar to that presented in [10]. The die is wire bonded to a small test board that includes decoupling capacitance and pinouts for measurement, including measurement pins dedicated to the generated adaptive clock and the DC-DC toggle clock. This test board connects to a Zedboard that includes an FPGA and a network-accessible ARM core. The serialized 16-bit digital interface of the test chip is connected to programmable logic in the FPGA, which can route memory traffic to local DRAM and emulate system calls. Software running on the ARM core acts as a host controller for the tethered test chip processors. To bring up the system, external supplies are adjusted and the uncore brought out of reset. Next, the clocking system is initialized, including the adaptive clock generator, followed by all other IP blocks, including the SC-DCDC converters, which can be initialized to any operating mode. Finally, programs are loaded into memory and the cores are brought out of reset. The PMU begins program execution first, followed by the application core.
B. Integrated Voltage Regulation
Table IV shows the measured system conversion efficiency of the SoC operating under each SC-DCDC mode. Because the instantaneous voltage and current consumption of the SC-DCDC output voltage cannot be easily measured, system conversion efficiency is calculated by comparing the energy cost of a long-running computation with the energy cost under a fixed supply voltage and clock that takes the same time to complete [10]. In this calculation, 100% efficiency represents a lossless regulator supplying the core as it operates at the maximum frequency achievable at that voltage. The efficiency calculation therefore accounts for both regulator inefficiency and the energy cost of operating at a frequency less than the maximum possible.
The adaptive clock is tuned at each voltage setting by sweeping the settings of its replica delay path and choosing the fastest setting that still results in correct core functionality. The adaptive clocking system provides a large improvement in system conversion efficiency because the core is able to operate at a higher average frequency as the supply voltage ripples, reducing the amount of energy required to complete the same amount of work.
C. System Energy Efficiency
When supplied by the SC-DCDCs, the processor achieves a peak energy efficiency of 41.8 double-precision GFLOPS/W running an FMA microbenchmark on the vector coprocessor in 1 V 1/2 mode. The processor is able to boot Linux and run user programs while powered by the rippling supply voltage and adaptive clock. Fig. 9 shows the processor functionality across wide voltage and frequency ranges. The SC-DCDC converter is placed into bypass mode for characterization, allowing the measurement of processor performance under fixed voltage and frequency. The best energy efficiency in bypass mode of 54.0 double-precision GFLOPS/W is achieved at 500 mV and 40 MHz. Fig. 10 shows the best frequency achievable at each operating point and the total energy consumed by a fixed-duration matrix-multiply benchmark at that operating point. The application of FBB increases performance but results in higher leakage power.
Fig. 9. Measurement of core performance running a tuned matrix-multiply benchmark with the SC-DCDC in bypass mode. The number in each cell is the core energy efficiency in double-precision GFLOPS/W.
D. Clock Generation
Fig. 11 compares the generated frequency of the free-running adaptive clock generator with the design from [10]. Both designs reliably generate a core clock that can supply the core area while complex software such as Linux is executed on the application core. The two designs track voltage similarly at the slowest delay settings, but the tracking varies at the fastest setting because the free-running design oscillates entirely in the variable-voltage domain, while part of the timing loop in the edge-selecting design is level-shifted to the fixed 1 V supply.
The free-running generator achieves the same functionality as the adaptive clocking scheme from [10], while eliminating the need for the distribution of a fixed reference and reducing design area. Fig. 12 compares the voltage-dependent delay characteristics of the four delay banks, normalized to the delay of the custom buffer cells in bank 3. Each bank's result was measured by recording the generated clock's frequency after selecting the maximum delay through that bank and the minimum delay through the remaining banks. The cells with small pMOS/nMOS ratios and larger gate lengths have larger delays at lower voltages. The wide variation in voltage-dependent delays between the different delay banks (up to 18% at 0.45 V) validates the need for multiple different standard cells to achieve accurate critical path tracking.
E. Power Measurement Counter
of the three conversion modes, confirming the practicality of using this toggle frequency to estimate core power and system load.
IV. POWER MANAGEMENT
The programmable PMU allows the implementation of a wide variety of power management algorithms to improve energy efficiency. Several different experiments demonstrate the flexibility of the system in implementing common energy-saving techniques.
A. Frequency Locking
Operating systems or users may request a particular operating frequency target to match a particular performance demand. However, systems with SC-DCDC regulators operate most efficiently only at the maximum frequency of each discrete operating voltage mode. Nonetheless, the prototype system is able to lock to a particular frequency target via hopping, in which the PMU automatically adjusts the voltage and frequency of the system to achieve an average frequency that is indistinguishable from the desired frequency at longer timescales. Despite the small number of discrete voltage settings, an arbitrary effective clock frequency can be achieved by rapidly switching between two voltage configurations. To implement this algorithm, the PMU first calibrates the system by using the core clock counter to measure the average operating frequency at each voltage mode. Then a target frequency is provided to the PMU, which polls the core clock counter and dithers the voltage setting to achieve the target frequency in aggregate.
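The calibrate-then-dither procedure can be sketched as a small simulation. This is a minimal model under stated assumptions, not the program run on the PMU: the two mode frequencies stand in for values measured during calibration, each loop iteration represents one equal-length hopping interval, and all names and numbers are hypothetical.

```python
def high_mode_fraction(f_low_hz, f_high_hz, f_target_hz):
    """Fraction of time in the faster mode so the time-averaged frequency hits the target."""
    if not f_low_hz < f_high_hz:
        raise ValueError("f_low_hz must be below f_high_hz")
    if not f_low_hz <= f_target_hz <= f_high_hz:
        raise ValueError("target must lie between the two calibrated frequencies")
    return (f_target_hz - f_low_hz) / (f_high_hz - f_low_hz)


def lock_frequency(f_low_hz, f_high_hz, f_target_hz, intervals):
    """Closed-loop dithering over equal-length intervals.

    Each interval, the controller compares the cycles delivered so far with the
    cycles the target would have delivered, and selects the faster mode only
    when the core has fallen behind. Returns (mode trace, achieved average).
    """
    total_hz = 0.0
    modes = []
    for n in range(intervals):
        behind = total_hz < f_target_hz * n
        modes.append("high" if behind else "low")
        total_hz += f_high_hz if behind else f_low_hz
    return modes, total_hz / intervals


# Hypothetical calibration results: 200 MHz and 400 MHz modes, 250 MHz target.
duty = high_mode_fraction(200e6, 400e6, 250e6)
trace, achieved_hz = lock_frequency(200e6, 400e6, 250e6, intervals=1000)
print(f"ideal high-mode duty: {duty:.2f}, achieved: {achieved_hz / 1e6:.1f} MHz")
```

Because the accumulated error never exceeds one interval's worth of the frequency difference between the two modes, the average converges on the target as the number of intervals grows.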
The results of the experiment are shown in Fig. 15. Without dithering, the processor would need to operate only in the higher mode to guarantee that the performance target is met, which would consume up to 40% more energy than the dithered approach. The choice of hopping frequency presents a tradeoff between increased fidelity to the target effective frequency and the more frequent occurrence of transition overheads, which can increase energy consumption. In this paper, the energy cost associated with transitions between voltage modes is small because the processor continues to operate as the clock frequency adjusts during the mode transition. In high-to-low mode transitions, no charge is wasted, but in some low-to-high transitions, the flying capacitance is charged to 1 V, wasting energy.
In the worst-case transition from 1 V 2/3 mode to 1.8 V 1/2 mode, the flying capacitance is charged from roughly 0.33 V, resulting in an energy loss of roughly 1 nJ. At a typical core operating power of 50 mW, this loss is equivalent to the energy consumed by just 20 ns of normal operation. Accordingly, a hopping period of approximately 6 µs was chosen. This hopping rate is still quite fast, but is slow enough that transition energy costs can be neglected.
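The arithmetic behind this choice can be checked directly from the quoted figures. The sketch below additionally assumes, pessimistically, that one worst-case transition occurs in every 6 µs interval; that assumption is ours and gives an upper bound on the overhead rather than a measured value.

```python
E_LOSS_J = 1e-9        # worst-case loss per low-to-high transition (from the text)
CORE_POWER_W = 50e-3   # typical core operating power (from the text)
HOP_INTERVAL_S = 6e-6  # chosen hopping interval (from the text)

# Time of normal operation that consumes the same energy as one transition.
equivalent_time_s = E_LOSS_J / CORE_POWER_W

# Energy the core consumes during one hopping interval, and the share lost
# if a worst-case transition happened in every interval (an assumed upper bound).
energy_per_interval_j = CORE_POWER_W * HOP_INTERVAL_S
overhead_fraction = E_LOSS_J / energy_per_interval_j

print(f"{equivalent_time_s * 1e9:.0f} ns equivalent, "
      f"{overhead_fraction * 100:.2f}% worst-case overhead")
```

Even under this pessimistic assumption the transition loss is roughly a third of a percent of the energy consumed per interval, which supports neglecting it.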
B. Voltage Dithering
Many voltage generation schemes can efficiently generate only a few discrete output modes [37]. Nonetheless, a wide range of effective operating voltages can be achieved through voltage dithering, in which the voltage is rapidly switched between two operating modes to achieve an average voltage between them [38], [39]. Theoretical analysis has shown that voltage dithering between just a few modes can achieve near-optimal efficiency [40]. Fig. 16 shows the measured results of a PMU program for voltage dithering. The program rapidly switches between neighboring voltages at a ratio specified at runtime. Each measured point represents the completion of a fixed-length matrix-multiply benchmark. The dithering algorithm enables a wide operating range for the core, bounded only by the lowest and highest voltage settings of the SC-DCDC converter. Fig. 17 shows the conversion efficiency of the system at each operating point as calculated by the method described in [10]. Two different dithering programs were run on the PMU. The first program simply switches the voltage mode setting of the SC-DCDC converters, without changing any other system settings; the delay settings of the replica paths in the adaptive clock generator were tuned to the best setting that could function across both operating modes. The results are shown by the red points in Fig. 17. Because the best setting of the replica paths changes according to operating mode, the conversion efficiencies of this approach are less than optimal for part of the dithering range. The second program switches both the voltage mode and the delay settings of the replica paths according to a pre-characterization of the best adaptive clock setting associated with each voltage mode. This program is able to speed the generated clock at the higher voltage settings, leading to higher conversion efficiencies. In all, the efficiencies of the second program range from 70% to 100%, depending on the voltage mode and dithering ratio.
Conversion efficiencies while dithering are not as high as the efficiencies of the fixed operating modes in Table IV because the external voltage reference used by the comparator cannot be tuned for a particular SC-DCDC mode.
C. Power Envelope Tracking
A system-level absolute power limit is a common power management constraint. Fig. 18 shows the results of a power management algorithm executed on the PMU that maximizes core performance within a user-specified power budget. The power management program polls an externally writeable control register that stores the absolute power limit for the program. The PMU core then monitors the SC-DCDC toggle counters to continuously estimate core power using the quadratic model described in [23] with pre-characterized coefficients. If the estimated core power is above the specified limit, core frequency is decreased, and if it is below the limit, core frequency is increased. In this way, the best possible performance is automatically obtained while the user-specified power budget is respected.
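The control decision described above can be sketched as follows. The quadratic form mirrors the model attributed to [23], but the coefficient values, the function names, and the representation of core frequency as an index into an ordered list of operating points are all assumptions made for illustration.

```python
def estimate_power_w(toggle_hz, coeffs):
    """Quadratic power model in toggle frequency; coeffs = (a, b, c) are pre-characterized."""
    a, b, c = coeffs
    return a * toggle_hz ** 2 + b * toggle_hz + c


def next_operating_point(index, num_points, estimated_power_w, limit_w):
    """Step through operating points ordered from slowest (0) to fastest.

    Above the limit: step down one point. Below the limit: step up one point.
    The index is clamped to the available range.
    """
    if estimated_power_w > limit_w:
        return max(index - 1, 0)
    if estimated_power_w < limit_w:
        return min(index + 1, num_points - 1)
    return index


# Hypothetical coefficients and readings, not measured values.
coeffs = (0.0, 2e-9, 1e-3)
power_w = estimate_power_w(30e6, coeffs)  # estimated from a 30 MHz toggle rate
new_index = next_operating_point(2, 4, power_w, limit_w=50e-3)
print(f"estimated {power_w * 1e3:.0f} mW -> operating point {new_index}")
```

A simple controller of this kind hovers around the limit, briefly exceeding it before stepping back down; a guard band below the user-specified budget would be one way to avoid that if strict compliance were required.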
D. Fine-Grained Adaptive Voltage Scaling
The PMU can also use the integrated counters to coordinate fine-grained adaptive voltage scaling (AVS) on-chip without any explicit guidance from the programs running on the processor. Extremely fine-grained AVS algorithms can exploit much shorter periods of inactivity than software-based AVS algorithms. In the proposed algorithm, core power is used as a marker of program phase. When core power is higher, the core is likely to be executing a compute-intensive program region, and a high voltage is necessary to maximize performance. When the core power is lower, the core is likely to be waiting idle for off-chip communication in a memory-bound program region, and energy can be saved with minimal performance impact by reducing the voltage. For this experiment, the core runs a synthetic benchmark that alternates between the compute-intensive and idle phases at a timescale of tens of microseconds. Fig. 19 shows the core voltage measured during the execution of the benchmark and the AVS power-management algorithm. The algorithm switches the core voltage between the 1.8 V 1/2 mode and the 1 V 2/3 mode, actuated by core power estimates determined by continuously polling the SC-DCDC toggle counter. When the core voltage is high and the toggle rate drops below a threshold, this corresponds to an idle program period, so the PMU reduces the core voltage to save energy. When the core voltage is low and the toggle rate exceeds a threshold, the workload has increased and the PMU increases the core voltage. The system is able to detect changes in workload in less than 1 µs and adjust the core voltage in response. Without integrated voltage regulators and power management, the system would not be able to respond within the timescales of the workload variation. The results of the power-management algorithm are therefore compared against continuous operation in the higher voltage mode, which would otherwise be required to meet the same performance target. 
The power-management algorithm reduces the energy consumed by 39.8%, and the fast response incurs negligible (<0.2%) performance penalty compared with this baseline, demonstrating the efficacy of fine-grained AVS at improving energy efficiency with fast workload tracking.
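The threshold behavior of the AVS algorithm amounts to a two-state controller with hysteresis, which can be sketched as below. The mode labels, threshold values, and sample readings are hypothetical, and a real PMU program would read the toggle counter through CSR accesses rather than from a list.

```python
HIGH_MODE = "1.8 V 1/2"
LOW_MODE = "1 V 2/3"


def avs_step(mode, toggle_rate, idle_threshold, busy_threshold):
    """One polling iteration: return the voltage mode to use next."""
    if mode == HIGH_MODE and toggle_rate < idle_threshold:
        return LOW_MODE    # core looks idle: drop the voltage to save energy
    if mode == LOW_MODE and toggle_rate > busy_threshold:
        return HIGH_MODE   # workload has returned: raise the voltage
    return mode


def run_avs(toggle_rates, idle_threshold, busy_threshold, mode=HIGH_MODE):
    """Apply avs_step to a sequence of toggle-rate readings and record the modes."""
    trace = []
    for rate in toggle_rates:
        mode = avs_step(mode, rate, idle_threshold, busy_threshold)
        trace.append(mode)
    return trace


# Hypothetical readings in arbitrary units: busy, idle, idle, busy, busy.
trace = run_avs([30, 5, 4, 20, 25], idle_threshold=10, busy_threshold=15)
print(trace)
```

Separate thresholds are used for each direction because the energy delivered per toggle differs between the two modes, so the same toggle rate corresponds to different core power in each.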
V. CONCLUSION
This processor SoC couples an energy-efficient RISC-V core and vector accelerator with a power-management processor, integrated voltage regulators, and an adaptive clock generator, allowing for improvements in system energy efficiency through the use of power-management algorithms running in microsecond-scale feedback loops entirely on-die. The four discrete output voltages of the integrated SC-DCDC regulator can be dithered to produce a wide continuous range of effective core voltages and frequencies. An AVS algorithm can use integrated counters to track changes in program phase and respond at submicrosecond timescales, allowing for substantial energy savings with minimal performance penalty. Taken together, these technologies represent a powerful tool for the energy-efficient design of future mobile SoCs.
He is now with the Circuits Research Group, NVIDIA Corporation, Santa Clara, CA, USA. His current research interests include soft error resilience and energy-efficient digital design, with an emphasis on low-voltage SRAM design and variation tolerance.
