Abstract-Aggressive supply voltage scaling to below the device threshold voltage provides significant energy and leakage power reduction in logic and SRAM circuits. Consequently, it is a compelling strategy for energy-constrained systems with relaxed performance requirements. However, effects of process variation become more prominent at low voltages, particularly in deeply scaled technologies. This paper presents a 65 nm system-on-a-chip which demonstrates techniques to mitigate variation, enabling sub-threshold operation down to 300 mV. A 16-bit microcontroller core is designed with a custom sub-threshold cell library and timing methodology to address output voltage failures and propagation delays in logic gates. A 128 kb SRAM employs an 8 T bit-cell to ensure read stability, and peripheral assist circuitry to allow sub-V_t reading and writing. The logic and SRAM function in the range of 300 mV to 600 mV, consume 27.2 pJ/cycle at the optimal V_DD of 500 mV, and 1 μW standby power at 300 mV.
I. INTRODUCTION
VOLTAGE scaling is a compelling approach for energy reduction in digital circuits as it provides quadratic savings in the active energy. Although circuits exhibit slower speeds at low supply voltages, the trade-off remains attractive for energy-constrained systems with relaxed throughput constraints. As V_DD approaches the sub-threshold region, longer propagation delays eventually lead to a rise in the leakage energy per operation, since the leakage power must be integrated over increasing clock periods. These opposing trends in active and leakage energy give rise to a minimum energy point, which optimizes the energy per operation of a circuit [1], as illustrated in Fig. 1.
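The trade-off between active and leakage energy can be sketched numerically. The model below is a simplified illustration, not the characterization used in this work: it assumes purely exponential sub-threshold currents, ignores drain induced barrier lowering, and uses hypothetical values for `activity`, `leak_factor`, and `n_phi_t`. Energy is normalized to the effective switched capacitance.

```python
import math

def energy_per_op(vdd, activity=1.0, leak_factor=1e5, n_phi_t=0.039):
    """Normalized energy per operation at supply voltage vdd (volts).

    activity, leak_factor and n_phi_t are assumed, illustrative values.
    leak_factor lumps together logic depth and the ratio of idle to
    switching devices.
    """
    # Active energy scales quadratically with the supply voltage.
    active = activity * vdd ** 2
    # Sub-threshold delay grows as exp(-vdd / n_phi_t); leakage power
    # integrated over one (longer) cycle grows by the same factor.
    leakage = leak_factor * vdd ** 2 * math.exp(-vdd / n_phi_t)
    return active + leakage

def minimum_energy_point(v_lo=0.2, v_hi=0.7, step_mv=5):
    """Grid search for the supply voltage that minimizes energy per op."""
    steps = round((v_hi - v_lo) * 1000 / step_mv)
    candidates = [v_lo + i * step_mv / 1000 for i in range(steps + 1)]
    return min(candidates, key=energy_per_op)

v_opt = minimum_energy_point()
print(f"minimum energy near {v_opt:.3f} V")
```

With these assumed parameters the minimum falls in the sub-threshold to near-threshold range; a different `leak_factor` moves it, which mirrors how leakier or less active circuits have a higher optimal supply voltage.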
The previous argument assumes that the circuit can complete a task at exactly the optimal speed and then shut off, so that it consumes no leakage energy during idle periods. However, certain system components, such as SRAMs, must be powered for arbitrarily long periods unrelated to their own speed. In this case it is essential to also reduce their leakage power. Voltage scaling causes a decrease in leakage current by alleviating drain induced barrier lowering, which, combined with V_DD reduction from 1 V to 300 mV, can provide an order of magnitude leakage power savings (Fig. 1).
Previous research has demonstrated the energy advantage afforded by ultra-low-voltage operation. For example, a 180 mV, 0.18 μm FFT processor was presented in [2], while a 0.13 μm processor with 8-bit ALU, 32-bit accumulator, and a 2 kb SRAM functional down to 200 mV was implemented in [3]. Body biasing and several gate sizing strategies were examined in a 0.13 μm sub-V_t processor [4].
Looking forward, technology scaling enables reduced energy and increased density, but presents a new challenge in the form of heightened intra-die variation. In [5], a 65 nm 320 mV motion estimation accelerator achieving high throughput employed optimized datapath circuits to address weak I_on/I_off ratio and threshold voltage variation. For instance, registers contained non-ratioed, upsized keepers, and multiplexers with more than 3 inputs were remapped into 2:1 multiplexers. In [6], a 65 nm SRAM design with a 10 T bit-cell functions down to 400 mV.
This paper describes a 65 nm system-on-a-chip with a 16-bit microcontroller and a 128 kb SRAM operating down to 300 mV; both are powered by an integrated DC-DC converter as shown in Fig. 1. Variation-aware design approaches enable the core logic to function in deep sub-threshold. The sub-V_t SRAM employs an 8 T bit-cell and peripheral circuit assists to overcome process variation while maintaining density. The DC-DC converter addresses the critical need for efficient power delivery in micro-power systems. Featuring programmable gain settings and optimized control circuitry, the converter can deliver variable load voltage and power levels with high efficiency and low area overhead. This paper first discusses the challenges in microcontroller logic design and describes approaches to address process variation. Specific circuits and architectures to enable a low-voltage SRAM and a high-efficiency DC-DC converter are then presented. Finally, Section VI provides prototype measurement results.
II. SUB-THRESHOLD LOGIC DESIGN
A. Microcontroller Overview
Fig. 2 shows a block diagram of the core logic, which is based on the MSP430 microcontroller architecture [7]. The 16-bit RISC CPU supports 27 instructions and 7 addressing modes of the standard MSP430 instruction set. The microcontroller interfaces to 128 kb of unified instruction and data memory, implemented as a custom SRAM, as well as to a watchdog timer and general purpose I/O ports. Programming of the SRAM is performed at startup via a JTAG interface.
Targeting low power applications, the microcontroller provides several power management features as illustrated in Fig. 2 . The clock system, which distributes external clocks to the microcontroller logic, supports three low power modes. In the first mode (LPM0), the master clock (MCLK) going to the CPU is gated. At this time, the CPU does not perform any processing, although peripherals remain active. The high frequency clock for the peripherals, or the sub-system master clock (SMCLK), is disabled in the second low power mode (LPM2). However, the auxiliary clock (ACLK), the low frequency clock for peripherals, remains on so that peripherals can function with lower active power. In the standby mode (LPM4), all clocks are shut off. The microcontroller can wake up from any of these modes through an interrupt event generated by the watchdog timer or input port.
This implementation also contains two features not found in commercial versions of the MSP430 microcontroller. First, the memory interface contains a small cache to reduce the memory access power. One 64-bit row of memory, which contains four 16-bit CPU words, is fetched and stored at a time. Successive 16-bit accesses to the same row require no further memory activity. This provides up to 50% savings in the measured memory access power for applications with a high hit rate. Second, the logic is split into two power domains; the unused blocks shaded in Fig. 2 are power gated during standby mode. Key CPU states are retained such that the microcontroller can continue program execution upon emerging from standby. The on-chip sleep transistor is sized for approximately 5% delay penalty at 300 mV. Accounting for the energy overhead in turning this transistor on and off, the breakeven time for power gating is less than 100 μs. In other words, the microcontroller only needs to remain in standby for a short period of time in order for power gating to provide a net energy benefit.
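The behavior of the one-row cache can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the implemented hardware: the class name `RowCache`, word-indexed addressing, and the read-only interface are hypothetical, and write handling is not modeled.

```python
WORDS_PER_ROW = 4  # one 64-bit SRAM row holds four 16-bit CPU words

class RowCache:
    """Read-only behavioral model of a single-row memory interface cache."""

    def __init__(self, memory):
        self.memory = memory      # list of 16-bit words, indexed by word
        self.row_index = None     # index of the row currently held
        self.row = None           # the four words of that row
        self.sram_reads = 0       # number of actual SRAM row accesses

    def read(self, word_addr):
        row_index = word_addr // WORDS_PER_ROW
        if row_index != self.row_index:
            # Miss: fetch the whole 64-bit row from the SRAM.
            start = row_index * WORDS_PER_ROW
            self.row = self.memory[start:start + WORDS_PER_ROW]
            self.row_index = row_index
            self.sram_reads += 1
        # Hit (or freshly filled row): no further SRAM activity needed.
        return self.row[word_addr % WORDS_PER_ROW]

memory = list(range(100, 116))
cache = RowCache(memory)
values = [cache.read(addr) for addr in range(16)]
print(cache.sram_reads, "SRAM row reads for", len(values), "word reads")
```

For sequential accesses such as straight-line instruction fetch, only one in four word reads reaches the SRAM in this model; the measured power savings depend on the actual hit rate of the application.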
B. Sub-Threshold Logic Design Challenges
In addition to system-level power management features, voltage scaling is a key strategy in improving the microcontroller energy efficiency. As mentioned in Section I, the energy consumed by a digital circuit can be minimized by operating at the optimal V_DD, which often lies in the sub-threshold region. However, sub-threshold logic design in a deeply scaled technology node must address two factors which critically impact functionality. In this regime, logic gates exhibit degraded ratios of on to off currents (I_on/I_off). Moreover, random-dopant-fluctuation is a dominant source of local variation in sub-V_t, causing random, local threshold voltage shifts [8]. The resulting exponential changes in device currents, exacerbating the weak I_on/I_off, imply that static CMOS logic gates can fail to provide rail-to-rail output swings. The two combined effects are illustrated in Fig. 3 by the voltage transfer curve (VTC) of an inverter at 300 mV. Global variation, which weakens the NMOS relative to PMOS here, skews the VTC towards one side. Additionally, local variation randomly changes the strengths of PMOS and NMOS to cause perturbations in the VTC, in some cases severely degrading the logic levels.
These degraded logic levels can adversely impact functionality, even in typically robust static CMOS circuits. For example, reduced logic swing in inverters of Fig. 4 decreases the hold static noise margin (SNM) of latches in the classic transmission-gate register. Another failure mechanism is illustrated in the transient simulation of Fig. 4. Here, because the clock buffer has reduced output swing, the transistor cannot be completely turned off during the transparent mode of the slave latch. Consequently, a signal cannot propagate successfully from node N2 to N3. Issues such as these motivate the design of a custom library with functionality in the presence of sub-V_t variation as the primary goal.
C. Variation-Aware Logic Design
One approach to mitigate local variation is to upsize transistors, since the standard deviation of V_t varies inversely with the square root of the channel area [9]. However, in the interest of minimizing energy, transistors also should be kept as small as possible, to lower energy and leakage currents. To manage this trade-off, the butterfly plot is proposed as a design guideline in building a custom sub-V_t standard cell library.
The butterfly plot is formed by simulating two gates in a back-to-back configuration, as seen in the example of Fig. 5(a) . To illustrate the worst case, NAND and NOR are selected here for their inherently skewed VTCs. Because the VTC is input-dependent, all inputs are varied simultaneously to obtain the worst skew. The resulting plot in Fig. 5(b) consists of the VTC of one gate superimposed on the inverse VTC of the other. Intersection points represent stable voltage levels that can be supported by the circuit.
Conceptually, the back-to-back structure, when unrolled, is equivalent to an infinitely long chain of the two gates arranged in an alternating manner [10]. Having two bistable points in the butterfly plot implies that a signal at the input of the logic chain will eventually regenerate to either logic high or logic low. One way to model local variation in the back-to-back structure is to include it as two series noise sources, as shown in Fig. 5(a). Like process variation, these sources cause shifts in the VTCs. Now, when the back-to-back structure is unrolled, these sources affect every other gate in the long logic path in the same manner, shifting their VTCs in the butterfly plot. As shown in Fig. 5(c), the shift due to local variation can be so severe that the VTCs meet at only one monostable point. This implies that any input to the long logic path will ultimately converge to only one logic state, resulting in functional failure.
A functional criterion based on the above can be described as follows. Consider selecting two logic gates at random from a circuit, each gate with its associated local variation. The two gates are considered to function properly together if an infinitely long logic path constructed from them can support two logic states, or equivalently, if the butterfly plot contains two bistable points. This is a more stringent requirement than simply cascading the gates and verifying the output voltage after two stages.
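The criterion above can be illustrated by counting the intersections in a butterfly plot. The sketch below is a toy model, not the SPICE-based flow used for the library: it assumes a logistic curve for each gate VTC, and the values of `VDD`, `gain`, the nominal switching thresholds, and the noise shift `vn` are hypothetical. An intersection of one VTC with the inverse VTC of the other is a voltage x satisfying x = f2(f1(x)).

```python
import math

VDD = 0.3  # supply voltage in volts

def vtc(vin, vm, gain=60.0):
    """Toy inverting VTC: logistic curve with switching threshold vm."""
    return VDD / (1.0 + math.exp(gain * (vin - vm)))

def count_intersections(vn, steps=3000):
    """Count crossings of the two curves in the butterfly plot.

    vn models the series noise sources: it pushes the switching
    thresholds of the two gates apart in opposite directions.
    """
    vm1, vm2 = 0.14 - vn, 0.16 + vn

    def loop_error(x):
        # Zero when x is a consistent voltage for the back-to-back pair.
        return vtc(vtc(x, vm1), vm2) - x

    crossings = 0
    prev_positive = loop_error(0.0) >= 0.0
    for i in range(1, steps + 1):
        positive = loop_error(VDD * i / steps) >= 0.0
        if positive != prev_positive:
            crossings += 1
        prev_positive = positive
    return crossings

def is_functional(vn):
    """Three crossings: two stable states plus one metastable point."""
    return count_intersections(vn) >= 3

print(count_intersections(0.0), count_intersections(0.11))
```

In this model a small mismatch leaves three crossings (two stable logic levels), while a large opposing shift collapses the plot to a single crossing, which corresponds to the monostable failure case.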
D. Sub-V_t Standard Cell Library
To use the above metric in designing a standard cell library, the maximum fan-in of the library is first limited to three. A larger fan-in would require stacking many devices in series, significantly degrading I_on/I_off. The logic gate to be designed (e.g., inverter, 2-input NAND, 2-input NOR) is put back-to-back with 3-input NAND and 3-input NOR, whose skewed VTCs give the most stringent input-high and input-low requirements respectively. Sizing of the 3-input gates is fixed to provide a starting point for designing the remaining gates. Then, the V_t of transistors in the gate under test and the global (inter-die) process conditions are randomized according to models and data provided by the foundry. The Monte Carlo runs are in effect analogous to sampling logic gates across multiple chips. Following the above definition for logic functionality, the failure rate of the gate under test is found from Monte Carlo simulations while varying V_DD, device sizing, and temperature. Several trends were observed from the analysis. The failure rate decreases exponentially as either V_DD or device width is increased. This is shown in Fig. 6, which plots the failure rate caused by degraded output low voltage in an inverter. Starting from an inverter with minimum size devices, the NMOS width is increased at various V_DD. The arrow marks the region where all samples were functional in a 200 k-point simulation at 300 mV. Other logic primitives, such as two series NMOS in a NAND gate, exhibit similar behavior. Therefore, by increasing the device width or V_DD, the failure rate can be made sufficiently small.
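The trend of failure rate versus width and supply voltage can be reproduced qualitatively with a toy Monte Carlo experiment. This sketch does not use foundry models: the constants `N_PHI_T`, `SIGMA_VT_MIN`, and `MIN_LOG_RATIO` are assumed values, and the failure test is a simplified stand-in (a minimum ratio of NMOS on-current to PMOS leakage) for the butterfly-plot criterion.

```python
import math
import random

N_PHI_T = 0.039                 # slope factor times thermal voltage (V), assumed
SIGMA_VT_MIN = 0.05             # V_t sigma of a minimum-width device (V), assumed
MIN_LOG_RATIO = math.log(50.0)  # assumed current ratio needed for a valid output low

def failure_rate(vdd, nmos_width, samples=20000, seed=0):
    """Fraction of inverters whose pull-down cannot overpower PMOS leakage.

    nmos_width is in multiples of the minimum width; the PMOS stays minimum.
    """
    rng = random.Random(seed)
    # Standard deviation of V_t scales as 1/sqrt(channel area).
    sigma_n = SIGMA_VT_MIN / math.sqrt(nmos_width)
    failures = 0
    for _ in range(samples):
        dvt_n = rng.gauss(0.0, sigma_n)
        dvt_p = rng.gauss(0.0, SIGMA_VT_MIN)
        # ln(NMOS on-current / PMOS off-current) for exponential
        # sub-threshold currents with equal nominal thresholds.
        log_ratio = math.log(nmos_width) + (vdd - dvt_n + dvt_p) / N_PHI_T
        if log_ratio < MIN_LOG_RATIO:
            failures += 1
    return failures / samples

for width in (1, 2, 4):
    print(width, failure_rate(0.3, width))
```

Under these assumptions, widening the NMOS or raising the supply both reduce the failure rate steeply, which is the qualitative behavior described for Fig. 6.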
To examine the effects of temperature, Fig. 7(a) plots the nominal output low and output high voltages (V_OL and V_OH) of a sub-V_t inverter from 0 °C to 100 °C. The output voltage levels degrade slightly, but the overall effect is negligible. Fig. 7(b) plots the standard deviation of V_OL and V_OH with local variation. The spreads in V_OL and V_OH are seen to increase slightly at high temperature. These observations imply that, in the process technology being used, the high temperature corner is worst case for sub-threshold logic gate functionality.
With these considerations, a 62-cell library was designed which includes various logic functions and drive strengths. Each logic primitive was sized to give the same failure rate. The Monte Carlo simulation effort was reduced by reusing the sizing of logic primitives across several gates. For example, the required sizing for two series NMOS devices was found from the 2-input NAND, where two leaking parallel PMOS oppose the pull-down devices to give the worst case scenario. This sizing can then be reused in other gates with two series NMOS devices (e.g., AND-OR-INVERT).
Although excluding NAND3 and NOR3 from the library allows the remaining cells to be sized smaller, the number of gates needed to synthesize the design would increase. Synthesis results showed that the latter effect dominates in this design; eliminating NAND3 and NOR3 would cause the total transistor area in the logic to increase by approximately 15%.
As discussed in Section II-B, register design also merited special attention. Data retention of the registers under mismatch can be verified by measuring the hold static-noise-margin (SNM) of the master and slave latches while accounting for the voltage drop across transmission gates. As with logic gates, the percentage of latches displaying negative hold SNM, or failure to retain data, decreases exponentially with V_DD and device width in the inverters. Additionally, signal propagation issues were addressed by upsizing local clock and data buffers to ensure that their outputs are sufficiently close to V_DD and ground. Fortunately, transmission gates were more robust against variation and did not require special upsizing. Compared to a register optimized for the above-threshold region, the sub-threshold register has 2.4× larger area and 2.1× higher clock loading.
Fig. 8 plots results of the custom library design and motivates the need for variation-aware device sizing. Here, the worst case cells from an unoptimized above-V_t library are compared to the custom sub-V_t library at 300 mV. The left and right columns respectively plot distributions of the V_OL of 3-input NAND and the V_OH of 3-input NOR, under local variation and at global corners. As expected, the sub-V_t cells exhibit significantly lower output voltage variation at the cost of larger area. Typically, cells with the smallest drive strengths in the sub-V_t library are sized larger than their above-V_t counterparts, but the higher drive strengths can be kept unchanged. For cells such as the inverter, buffer, and 2-input NOR (NOR2), a 10% area increase is sufficient, while NOR3, NAND2, and NAND3 required 190%, 100%, and 270% increases respectively. Nevertheless, the logic synthesis tool was able to reduce the overall area cost by selecting upsized cells less frequently; for example, NAND3 comprised only 0.31% of the total gate count in this microcontroller.
III. SUB-THRESHOLD TIMING ANALYSIS
In addition to affecting functionality, process variation also increases delay uncertainty. In sub-V_t, local variation causes the delay distribution to widen further. Fig. 9(a) plots the normalized delay distributions of a microcontroller logic path, highlighting how variability increases by an order of magnitude at 300 mV compared to 1.2 V. Conventional static timing analysis approaches typically treat logic gate delay as deterministic, taking points at the tails of the distribution to represent the maximum and minimum delay under process variation. However, given the wide distributions in sub-V_t, such approaches would lead to unrealistic results. This motivates statistical timing analysis methodologies [11] which consider the entire delay distribution instead of only the tail points.
A. Variation-Aware Timing Methodology
Statistical static timing analysis in sub-V_t is complicated by several factors. Unlike in the above-threshold regime, the analysis cannot be easily simplified with linear models due to the exponential dependence of delay on V_t. Fig. 10(a) plots the relationship between delay and V_t shift for a sub-V_t inverter. Characterizing these relationships for all cells in a library, under different input and output conditions, naturally requires substantial effort. Borrowing techniques from the above-V_t regime, one might envision forming a piecewise linear approximation in order to reduce the characterization effort. Following the example of computer-aided design tools in using three points (best, typical, worst), the piecewise linear model plotted by the dashed line in Fig. 10(a) is constructed. An approximate delay distribution can then be derived from the model and the variation statistics. However, as shown in Fig. 10(b), the distribution obtained in this manner does not match well with Monte Carlo SPICE simulation results. Although adding more points to the piecewise linear model can improve accuracy at the expense of longer characterization time, statistical approaches which can capture the nonlinear delay-V_t relationship should also be considered. The exponential relationship implies that, when V_t is modeled as having a normal distribution, the resulting delay distribution for a logic gate will be lognormal. However, there is no closed form expression for adding lognormally distributed gate delays to obtain the logic path delay [12]. Instead, this must be done with iterative approaches [13] or analytical models, one example being the expression for the sum of identically distributed sub-V_t gate delays in [8]. Further, register hold time is often not well-approximated by standard distributions in sub-V_t. Hold time hinges on whether a change in the data input causes a glitch or transition that incorrectly disturbs the output. This, in turn, is influenced by the slew rates of clock and data signals and can be a nonlinear phenomenon. Fig. 9(c) plots the simulated hold time distribution for a register with asynchronous preset and reset. Here, neither the Gaussian nor lognormal models can accurately represent the simulation.
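The lognormal gate delay and the lack of a closed form for path delay can be illustrated by direct sampling. The sketch below is a simplified stand-in for Monte Carlo SPICE simulation: it assumes independent, identically distributed gates, and the constants `N_PHI_T` and `SIGMA_VT` are hypothetical.

```python
import math
import random
import statistics

N_PHI_T = 0.039   # slope factor times thermal voltage (V), assumed
SIGMA_VT = 0.03   # local V_t sigma per gate (V), assumed

def gate_delay(rng, nominal=1.0):
    """Delay of one gate: exponential in a normally distributed V_t shift,
    which makes the delay lognormally distributed."""
    dvt = rng.gauss(0.0, SIGMA_VT)
    return nominal * math.exp(dvt / N_PHI_T)

def path_delay_samples(depth, samples=5000, seed=1):
    """Sample the delay of a path made of `depth` independent gates."""
    rng = random.Random(seed)
    return [sum(gate_delay(rng) for _ in range(depth)) for _ in range(samples)]

def sigma_over_mu(values):
    """Relative variability (standard deviation over mean)."""
    return statistics.pstdev(values) / statistics.fmean(values)

single = path_delay_samples(1)
print("mean > median (right skew):",
      statistics.fmean(single) > statistics.median(single))
for depth in (1, 4, 16):
    print(depth, round(sigma_over_mu(path_delay_samples(depth)), 3))
```

The single-gate samples show the rightward skew of a lognormal distribution, and the relative spread shrinks as more stages are summed because independent variations partially average out.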
To capture these effects, this design employs an approach based on Monte Carlo simulation while using analytical methods to reduce the total simulation effort. Shown in Fig. 11, the timing analysis flow focuses on hold time violations because they cause functional errors independent of the clock period. An exhaustive timing report listing the data and clock paths is obtained from the placed and routed design. This report is generated under the worst case global conditions: at the fast process corner for verifying hold time, and at low V_DD where variation is most prominent. However, the report does not consider local variation. Known paths with very short logic delays (e.g., shift registers) are removed from the timing report and handled separately. The remaining paths are grouped into bins by the nominal hold time margin, and the bins are then analyzed to select paths of interest for further simulation. The hold time margin is derived by rearranging the standard hold time constraint (1), illustrated in Fig. 9(b), and is defined in (2). It should be greater than zero for proper functionality.
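The margin computation and binning step can be sketched as follows. The parameter names (`t_launch_clk`, `t_clk_to_q`, `t_logic`, `t_capture_clk`, `t_hold`) are hypothetical and follow the textbook form of the hold time constraint; they are not necessarily the symbols used in (1) and (2). The bin width is likewise an assumed parameter.

```python
from collections import defaultdict

def hold_margin(t_launch_clk, t_clk_to_q, t_logic, t_capture_clk, t_hold):
    """Hold time margin of one path; positive means the constraint is met.

    Data launched by the source register must arrive no earlier than the
    capture clock edge plus the hold time of the destination register.
    """
    data_arrival = t_launch_clk + t_clk_to_q + t_logic
    data_required = t_capture_clk + t_hold
    return data_arrival - data_required

def bin_paths(nominal_margins, bin_width):
    """Group path names into bins by nominal hold time margin."""
    bins = defaultdict(list)
    for path_name, margin in nominal_margins.items():
        bins[int(margin // bin_width)].append(path_name)
    return dict(bins)

margins = {
    "path_a": hold_margin(10, 5, 20, 12, 8),   # comfortable margin
    "path_b": hold_margin(10, 5, 2, 12, 8),    # violates hold time
}
print(margins, bin_paths(margins, 5))
```

Paths that land in the bins with the lowest nominal margin are the natural candidates for the more detailed statistical treatment described next.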
Within a bin, an algorithm selects paths with high variance, whose long distribution tails result in a higher probability of timing violation. To estimate the variance of path delays, the algorithm employs the standard deviation over mean (σ/μ), which decreases with larger device sizes and higher logic depth [8], as shown in Fig. 12 for a generic logic path with stacked devices. It is important to note, however, that the trends eventually reach diminishing returns, which must be considered in the timing methodology.
For logic gates, the σ/μ of delay is first characterized at the device sizes used in the standard cell library. Input slew and load capacitance also have a slight effect on σ/μ. These dependencies are summarized in lookup tables used by the algorithm. For a logic path, the relative variation becomes smaller as the logic depth increases, since variation tends to average out across stages. To account for this effect, the algorithm assigns a weighting factor to every path according to its logic depth. The factor is found empirically by simulating the delay variability in paths of different lengths and varied transistor sizes. The analysis did not consider spatial correlation since several studies [14]-[17] reported small spatial correlation coefficients, which showed weak or no discernible dependence on separation distance over the ranges of interest in this design. Further, it would be impractical to model the position dependence of spatial variation during the design phase, since this is very difficult to predict without the final layout [16]. The high variance paths are then selected to undergo Monte Carlo simulation with local variation and at the global fast corner. This gives an accurate hold time margin distribution, accounting for the local clock skew and the hold time requirement of the destination register. The probability of a hold time violation can then be estimated. The distribution of the hold time margin is generally neither Gaussian nor lognormal according to the Anderson-Darling test [18]. Nevertheless, the data was fitted to a Gaussian curve, since this gave on average a more pessimistic probability of violation compared to finding the percentage of violating samples in the raw data.
If the probability is above a set threshold, as determined by the number of paths in the design and the desired timing yield, then extra delay buffers are applied to increase the hold time margin. To be conservative, buffers are also applied to unsimulated paths belonging to the same bin. Paths requiring extra buffering were concentrated in small bins with low average hold time margin. It should be noted that a variation-aware approach typically results in fewer delay buffers inserted compared to worst case timing analysis. For instance, a common worst case methodology uses two deterministic values to model fast and slow delay in a cell under local variation. One such example would be to use tail points of the delay distribution as the slow and fast delays. The hold time constraint is verified by assuming that all cells in the data path have fast delays, while those in the capture clock path have slow delays, in order to obtain the worst case scenario. However, in reality, it is unlikely that all cells in the data path uniformly exhibit fast delay due to local variation. Because of this pessimism, the worst case methodology identified 929 timing paths for hold time fixing, several times more than the 151 paths selected by the variation-aware approach.
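The decision step can be sketched as a Gaussian fit followed by a comparison against a per-path budget. This is an illustration under assumptions, not the exact procedure used in this design: the function names are hypothetical, and the budget formula assumes that paths fail independently, which the text does not state.

```python
from statistics import NormalDist

def violation_probability(margin_samples):
    """Probability that the hold time margin is negative, estimated from
    a Gaussian fitted to Monte Carlo margin samples."""
    fit = NormalDist.from_samples(margin_samples)
    return fit.cdf(0.0)

def per_path_threshold(num_paths, timing_yield):
    """Largest per-path violation probability that still meets the target
    yield, assuming all paths fail independently (an assumption)."""
    return 1.0 - timing_yield ** (1.0 / num_paths)

def needs_buffering(margin_samples, num_paths, timing_yield):
    """True when a path should receive extra delay buffers."""
    threshold = per_path_threshold(num_paths, timing_yield)
    return violation_probability(margin_samples) > threshold

samples = [1.0, 2.0, 3.0]   # toy margin samples (arbitrary time units)
print(violation_probability(samples), per_path_threshold(1000, 0.99))
print(needs_buffering(samples, num_paths=1000, timing_yield=0.99))
```

The more paths a design contains, the smaller the budget available to each one, which is why even a low violation probability on a single path can trigger buffer insertion.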
B. Comprehensive Delay Variation Data
Apart from the analysis described above, comprehensive Monte Carlo SPICE simulations were performed for 30000 timing paths in the microcontroller over several months. The results serve to illustrate trends in sub-V_t delay variability. In Fig. 13(a), each horizontal cross section is the delay distribution of one timing path under local variation, at 300 mV and global fast corner. The rightward skew is typical of a lognormal distribution. Fig. 13(b) shows a scatter plot of the corresponding timing path statistics. Each point represents one path, with mean delay plotted on the x-axis and σ/μ shown on the y-axis. Initially, the lower range of σ/μ decreases with the mean delay, which reflects how variation tends to average out in longer paths. However, this quickly reaches diminishing returns, and σ/μ does not decrease far below 0.1, even for very long paths. The same trend is observed when logic depth, instead of mean delay, is plotted on the x-axis. Since σ/μ depends on both device sizes and logic depth, the lower bound observed reflects the inherent variability given the device sizes used in the standard cell library. Additionally, the upper range indicates that outliers with large amounts of variation occur less frequently in very long paths. However, when examining critical paths for hold time, it is important to consider both the shortest paths and slightly longer paths that may exhibit higher variability.
IV. ULTRA-LOW-VOLTAGE SRAM
Although the 6 T SRAM bit-cell provides a good balance between density, stability, and performance for conventional applications, its high sensitivity to variation leads to very unfavorable trade-offs for ultra-low-voltage applications (i.e., below 500 mV). Most critically, its read static noise margin (SNM) [19] is severely degraded by the greatly amplified effect of random dopant fluctuations (RDF) [20] , and up-sizing, to manage variation and ensure sufficient margin, leads to an impractically large bit-cell layout. Similarly, correct write operation requires that stored data be overwritten by the access devices; however, the relative device strengths necessary to ensure this cannot practically be guaranteed. Further, the increased sensitivity to variation also results in extremely low worst case read-current. The resulting effect on performance is drastic, but, even more importantly, the effect on functionality can be fatal, where the read-current can be exceeded by the aggregate bit-line leakage-current [21] .
A. Sub-V_t SRAM Design
In this ultra-low-voltage design, an SRAM based on the 8 T bit-cell shown in Fig. 14(a) is used to provide full operation down to 300 mV. Though the cell area is increased by the read-buffer, it obviates the stringent read SNM, which is less than 80 mV (with sigma of approximately 40 mV); the remaining hold SNM is over 130 mV (with sigma of approximately 30 mV). Meanwhile, write-margin is ensured by a peripheral assist which selectively weakens the PMOS loads, 3/4, and bit-line leakage, to enable a high level of column integration, is managed by a second assist which gates the sub-V_t leakage from unaccessed read-buffers. Though based on the techniques used in [22], the SRAM requires several design changes for this application: a new bit-cell provides larger read-current by taking advantage of the reverse-short-channel effect; shorter column configurations reduce the bit-line leakage and loading, ensuring reliable sensing with 10× lower access-time at 500 mV; and an interface buffer allows independent optimization of SRAM word-length and CPU word-length.
Since the read-buffer devices of an 8 T bit-cell have no impact on stability, they can be sized primarily for optimal cell read-current. In above-V_t designs, this typically leads to nearly minimum length devices, even though the cell layout height, which is limited by the other devices, permits longer lengths. However, longer read-buffer devices have the advantage of lower effective threshold voltage, through the reverse-short-channel effect [23], and less RDF variation, which is particularly critical in sub-V_t where its impact is greatly amplified. As a result, at ultra-low-voltages, both the mean and weak-cell read-current can be much higher [24]. This has been exploited to improve the cell's write-ability in sub-V_t SRAMs [25]; but in this design, it is applied by aggressively lengthening the read-buffer devices to increase read-current. It is worth noting that, despite the read-current improvement, there is no significant change in the sub-V_t bit-line leakage current, since the read-buffer gating described in [22] eliminates this source of leakage. Lastly, the increase in dynamic power to drive the more heavily loaded lines is negligible, as their wire capacitance greatly dominates over gate and diffusion loading.
Additionally, reducing the number of cells per bit-line, from 256 (in [22]) to 64, as shown in Fig. 14(b), mitigates both secondary bit-line leakage sources and bit-line loading. Consequently, the higher read-current and lower bit-line capacitance enable a performance increase of 10× at 500 mV, which is critical in this application since system clocking requires the SRAM to operate at a higher clock rate than the logic.
Lastly, a local buffer is used to provide an interface between the CPU and the SRAM. Peripheral read and write assists in the SRAM are critical to ensure robust low-voltage operation. However, amortizing their overhead leads to large SRAM access words. Accordingly, the local buffer abstracts these constraints from the CPU, providing optimal data alignment.
V. DC-DC CONVERTER
The previous sections have described the energy savings that can be achieved by reducing the V_DD of logic and memory circuits. To realize the full energy savings of sub-V_t operation, a DC-DC converter supplying ultra-low voltages at high efficiencies is essential. Since the power consumption of the logic and SRAM load circuits drops exponentially at sub-V_t voltages, the DC-DC converter was designed to deliver a maximum of 500 μW of load power. This reduced load power demand makes switched capacitor DC-DC conversion an ideal choice for this application. The switched capacitor (SC) DC-DC converter is based on [26], and makes use of 600 pF of total on-chip charge transfer (flying) capacitance to provide scalable load voltages from 300 mV to 1.1 V. The logic and SRAM circuits in this system utilize voltages up to 600 mV. Fig. 15 shows the architecture of the DC-DC converter. The converter uses an all-digital pulse frequency modulation (PFM) mode of control to regulate the output voltage. In this method of control, the converter stays idle until the load voltage falls below the reference voltage, at which point a clocked comparator enables the switch matrix to transfer one charge packet to the load. PFM mode control is crucial to achieving high efficiency for the extremely low power system being built. The switch matrix block contains the charge transfer switches and the charge transfer capacitors.
One of the main efficiency limiting mechanisms in a switched capacitor DC-DC converter is the linear conduction loss [26]. To maintain efficiency over the wide load voltage range of 300 mV to 1.1 V, this converter employs five different gain settings. Fig. 16 shows how the different gain settings are achieved from a total charge transfer capacitance of 600 pF. The external voltage input to the system is 1.2 V. Each gain setting at no-load provides an output voltage that is a fixed ratio of the input voltage. A suitable gain setting is chosen off-chip, depending on the proximity of its no-load voltage to the load voltage being delivered, and its ability to provide the load power demand [26]. Since the logic and SRAM load circuits utilize voltages up to 600 mV, in the actual testing of the chip, only gain modes G2BY3, G1BY2 and G1BY3 were used.
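The gain selection rule can be sketched for the three gain modes named above. This is a simplified illustration: it assumes the mode names encode the ratios 2/3, 1/2, and 1/3, it uses the ideal bound on efficiency set by linear conduction loss (load voltage divided by no-load voltage), and it ignores whether a setting can supply the required load power. The function name `pick_gain` is hypothetical.

```python
from fractions import Fraction

V_IN = 1.2  # external input voltage in volts

# Gain modes used in chip testing; ratios inferred from the mode names.
GAINS = {
    "G2BY3": Fraction(2, 3),
    "G1BY2": Fraction(1, 2),
    "G1BY3": Fraction(1, 3),
}

def pick_gain(v_load):
    """Choose the gain whose no-load voltage is closest above v_load.

    Returns the mode name and the ideal efficiency bound v_load / v_noload.
    """
    no_load = {name: V_IN * float(ratio) for name, ratio in GAINS.items()}
    usable = {name: v for name, v in no_load.items() if v > v_load}
    if not usable:
        raise ValueError("no listed gain setting can reach this load voltage")
    name = min(usable, key=usable.get)
    return name, v_load / usable[name]

for v in (0.3, 0.5):
    mode, bound = pick_gain(v)
    print(f"{v} V -> {mode}, ideal efficiency bound {bound:.1%}")
```

Choosing the no-load voltage just above the load voltage keeps the bound high; measured efficiency falls below this bound once switching and control losses are included.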
The switching losses in the converter are dominated by the energy expended in turning the charge transfer switches ON and OFF. The switch widths are designed such that the charge transfer capacitors just settle at the end of a charge transfer cycle. In order to scale switching losses with load power, the charge transfer switches have adjustable widths which are enabled by a control signal, as shown in the inset of Fig. 16. For any decrease (increase) in the load power by a factor of 2, the clock frequency (CLK) of the comparator is halved (doubled) and correspondingly, the width of the charge transfer switches is also halved (doubled). This helps to decrease the switching power by 4× when the load power decreases by 2×, leading to an increase in efficiency at lower load power levels. While this signal was set externally in this implementation, [26] describes a method to automatically determine it as the load power varies. In Fig. 20, the gain in efficiency as the load power decreases close to 380 μW and 200 μW is due to the scalable switch width design. However, at very low load power levels (sub-5 μW), leakage and other fixed losses in the control circuitry reduce the efficiency of the switched capacitor DC-DC converter.
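The 4× reduction follows from switching power being proportional to both clock frequency and total switch width. A minimal sketch of this arithmetic is given below; the proportionality constant `k` and the function names are hypothetical, and gate-drive loss is assumed to be the only switching loss.

```python
def switching_power(clock_hz, switch_width, k=1.0):
    """Gate-drive power: energy per cycle scales with switch width,
    and cycles per second scale with the comparator clock."""
    return k * clock_hz * switch_width

def scale_for_load(clock_hz, switch_width, load_factor):
    """Scale clock frequency and switch width together with load power."""
    return clock_hz * load_factor, switch_width * load_factor

full = switching_power(1e6, 8.0)
half_clock, half_width = scale_for_load(1e6, 8.0, 0.5)
halved = switching_power(half_clock, half_width)
print(halved / full)
```

Because the switching loss falls faster than the load power, its share of the total shrinks at lighter loads, which is consistent with the efficiency gains noted for Fig. 20.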
VI. PROTOTYPE MEASUREMENTS
A summary and die micrograph of the test chip, fabricated in 65 nm CMOS, are shown in Fig. 17. The DC-DC converter, including charge transfer capacitors, occupies just 0.12 mm². The minimum energy point of the microcontroller occurs at 500 mV, and functionality was verified down to 300 mV. Logic and memory together consume 27.2 pJ per clock cycle at 500 mV and 25 °C. The optimum energy does not vary much across 20 chips; the measurements have a σ/µ of 0.0897. Fig. 18(b) shows the energy consumption of the microcontroller core logic while it executes specific instructions. Generally, instructions for arithmetic or boolean operations (e.g., add, and, compare), executed on operands stored in CPU registers, require roughly the same amount of energy per cycle. Instructions that involve memory accesses for data (e.g., load/store, push/pop) exhibit higher energy consumption, as expected. The jump instruction, which generates high switching activity on the address bus, requires the most energy.
A. Active Energy and Performance
The energy consumed by the SRAM array per system clock cycle is shown in Fig. 19. The memory greatly influences the minimum energy point of the system since it consumes a major portion of the total system energy, highlighting the importance

Fig. 20. DC-DC converter efficiency while delivering 500 mV. The DC-DC converter is powered by a 1.2 V supply. Arrows mark efficiency gain from scalable switch width design as discussed in Section V.

The efficiency of the DC-DC converter delivering 500 mV is shown in Fig. 20. The converter achieves more than 75% efficiency across an order of magnitude change in load power, from 10 µW to 250 µW. With the microcontroller as a load, the converter provides 75% efficiency at 12 µW. When measured standalone, the converter reaches a peak efficiency of 78%. Fig. 21 plots the microcontroller performance versus supply voltage at 0 °C, 25 °C, and 75 °C. The measured frequency, accounting for logic and memory delays, is 434 kHz at 25 °C and 500 mV. The frequency ranges from 8.7 kHz to 1 MHz across the operating range of 0.3 V to 0.6 V. The σ/µ of measurements across 20 chips at 500 mV is 0.133.
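As a consistency check on the figures above, the measured energy per cycle and clock frequency at 500 mV and 25 °C can be multiplied to estimate the average load power. The sketch below does only that arithmetic, and then divides by the 75% converter efficiency quoted for the microcontroller load to estimate the power drawn from the 1.2 V supply. The function names are hypothetical, and treating the energy per cycle as constant is a simplifying assumption.

```python
def average_power_watts(energy_per_cycle_j, clock_hz):
    """Average power when every clock cycle costs the same energy."""
    return energy_per_cycle_j * clock_hz


def supply_power_watts(load_power_w, efficiency):
    """Power drawn from the converter input for a given load power."""
    return load_power_w / efficiency


ENERGY_PER_CYCLE_J = 27.2e-12  # 27.2 pJ per cycle (from the text)
CLOCK_HZ = 434e3               # 434 kHz (from the text)
EFFICIENCY = 0.75              # converter efficiency with this load (from the text)

load_w = average_power_watts(ENERGY_PER_CYCLE_J, CLOCK_HZ)
input_w = supply_power_watts(load_w, EFFICIENCY)

print(f"load power:   {load_w * 1e6:.1f} uW")   # prints 11.8 uW
print(f"supply power: {input_w * 1e6:.1f} uW")  # prints 15.7 uW
```

The result of about 11.8 µW is in line with the microcontroller load point quoted for the converter.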
B. Standby Power
The inclusion of a DC-DC converter enables the system to dynamically scale V_DD to 300 mV during standby mode, where memory and logic together consume less than 1 µW, as shown in Fig. 22. Accounting for the DC-DC converter efficiency loss at such low power levels, this represents a 2.1× reduction in leakage power compared to keeping V_DD constant at 500 mV during standby.
VII. CONCLUSIONS AND SUMMARY
Voltage scaling enables energy minimization and leakage power reduction in micro-power systems. However, design techniques and circuit assists are necessary to overcome process variation in the ultra-low-voltage regime. The 65 nm sub-threshold microcontroller presented here demonstrates several approaches to enable operation down to 300 mV.

A standard cell library design methodology addresses the degraded output low and high voltage levels in sub-threshold operation, which, at deeply scaled process nodes, can render logic gates non-functional. Circuit delays are similarly affected by variation, exhibiting an order of magnitude higher variability at low voltages. Conventional timing analysis approaches that treat delays as deterministic are insufficient. Instead, a variation-aware methodology combining Monte Carlo simulation and analysis was developed to verify hold time constraints.

The SRAM represents a dominant portion of area and power in this system. Therefore, energy and leakage reduction through voltage scaling is highly desirable. In conventional 6 T SRAMs, variation causes severely degraded read current and increased cell instability, limiting the minimum functional voltage. The SRAM in this system employs an 8 T bit-cell to address these limitations. Further, peripheral circuit assists enforce the relative device strengths needed for read and write functionality, despite significant variation.

The fully integrated, switched capacitor DC-DC converter provides highly efficient power delivery at the low voltage and power levels required by energy-constrained systems. Employing multiple gain settings and efficient control circuitry, the DC-DC converter achieves above 75% efficiency while supplying 500 mV across an order of magnitude change in load power.
