Abstract-An increasing number of energy-limited applications continues to drive the demand for systems with high energy efficiency. This tutorial covers the main building blocks of a system implementation, including digital logic, embedded memories, and analog-to-digital converters, and describes the challenges of and solutions for designing these blocks for low-voltage operation.
I. INTRODUCTION

ENERGY efficiency of circuits is a critical concern for a wide range of applications including mobile multimedia and biomedical monitoring. Energy available to these applications is often limited by the capacity of a battery or by the extent of energy harvested from the ambient environment. Therefore, circuit designers face the challenge of optimizing algorithms, architectures, and circuit topologies not only for the functionality and performance but also for achieving lower energy per operation.
Dynamic voltage scaling is an effective method of reducing energy consumption of circuits under a time-varying performance constraint [1], [2]. Specifically, lowering the supply voltage ($V_{DD}$) of the devices results in smaller energy per operation at the expense of slower performance. It has been shown that, for most digital circuits, the minimum energy point lies in the subthreshold region ($V_{DD} < V_T$), where devices operate in weak inversion [3]. However, the performance of circuits in the subthreshold region is on the order of kilohertz, which might not be suitable for various energy-constrained applications. Hence, recent work has focused on operating circuits at a voltage that is slightly above the $V_T$ of the devices to balance the tradeoff between energy efficiency and performance.
Operating circuits at low voltages is challenging as conventional topologies and circuit techniques are often optimized for operation at nominal voltage [4]. Smaller voltage headroom, reduced noise immunity, and increased effects of transistor variation at low voltages are some of the drivers for innovative designs and techniques that are required for low-voltage operation. The works in [5] and [6] are demonstrations of mixed-signal implementations targeted for low-voltage operation to improve energy efficiency.
To achieve maximum energy savings, voltage scaling should be supported by algorithms, architectures, and circuits optimized for the specific requirements of the target application [7]. Efficient usage of dedicated hardware accelerators, scalable and reconfigurable architectures, and custom-designed circuit solutions is an effective means to reduce energy consumption at different stages of the design process. The rest of this brief is structured to cover the key building blocks of a low-voltage system, starting with the digital logic and embedded memories and then focusing on the analog-to-digital converters (ADCs).
Manuscript received February 6, 2012.
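The existence of a minimum energy point under voltage scaling, as discussed in this introduction, can be illustrated with a first-order model. The sketch below is not taken from the cited works. It assumes that dynamic energy per operation scales as C·V² and that leakage energy per operation scales as C·V²·K·exp(−V/(n·v_th)), which follows from multiplying a fixed leakage current by a cycle time that grows exponentially as the supply drops in weak inversion. The constants N_SLOPE and LEAK_RATIO (K) are hypothetical values chosen only for illustration, and the exponential term is only meaningful in and near the subthreshold region.

```python
import math

N_SLOPE = 1.5        # subthreshold slope factor n (assumed value)
V_THERMAL = 0.026    # thermal voltage kT/q near room temperature, in volts
LEAK_RATIO = 1000.0  # hypothetical constant K lumping logic depth and activity

def energy_per_op(vdd, c_eff=1.0):
    """First-order energy per operation (arbitrary units) at supply vdd."""
    dynamic = c_eff * vdd ** 2
    # Leakage energy = leakage power x cycle time; the cycle time grows
    # exponentially as vdd is lowered, which this term captures.
    leakage = dynamic * LEAK_RATIO * math.exp(-vdd / (N_SLOPE * V_THERMAL))
    return dynamic + leakage

def minimum_energy_voltage(v_start=0.15, v_stop=1.0, step=0.001):
    """Sweep the supply voltage and return the value with the lowest energy."""
    n_steps = int(round((v_stop - v_start) / step))
    voltages = [v_start + i * step for i in range(n_steps + 1)]
    return min(voltages, key=energy_per_op)

v_opt = minimum_energy_voltage()
print(f"minimum-energy supply in this toy model: {v_opt:.3f} V")
```

With these assumed constants, the minimum falls at a few hundred millivolts. Below it, leakage energy dominates because each operation takes much longer; above it, the C·V² term dominates. The actual location depends strongly on the circuit and process.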
II. DIGITAL CIRCUIT BLOCKS
A. Architectures for Low-Voltage Systems
At the architecture level, a system can be designed to compensate for the reduced speeds of low-voltage circuits while preserving energy efficiency. In custom application-specific integrated circuits (ASICs), increasing concurrency through parallelism or pipelining allows the overall system to meet throughput constraints even though each part operates at a slow rate. This has been demonstrated in numerous systems, e.g., that in [8], where an ultra-wideband baseband processor uses 620 parallel correlators to support 500-MS/s throughput at 0.4 V. Parallelism can be selectively applied to system bottlenecks to mitigate area overhead. For example, in an H.264 video decoder [9], the motion compensation and deblocking filter blocks are bottlenecks in the system pipeline. The processing units in these blocks are parallelized, increasing the system area by 3%, but this enables the chip to decode a high-definition 720p video at 0.7 V with a modest core frequency of 14 MHz.
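As a rough illustration of why parallelism relaxes the speed requirement on each unit, the sketch below divides a target throughput evenly across identical units. The even split is an assumption made here for illustration; the actual scheduling of the correlators in [8] may differ.

```python
def per_unit_rate(total_throughput, num_units):
    """Rate each unit must sustain if work is split evenly across num_units."""
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    return total_throughput / num_units

# Figures quoted in the text for [8]: 500 MS/s across 620 parallel correlators.
rate = per_unit_rate(500e6, 620)
print(f"{rate / 1e6:.2f} MS/s per unit under an even split")
```

Under this idealized split, each unit only needs to run at under 1 MS/s, a rate compatible with the slow transistors available at 0.4 V.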
While custom ASICs are efficient at fixed tasks, some applications require the programmability of processors (e.g., that in [10]) to support changing algorithms. In these cases, hardware accelerators can improve the speed and energy efficiency of low-voltage processors [11]. As an example, the processor for wearable biomedical sensor nodes in [12] employs accelerators to process vital signs in real time while operating at 0.5 V. Based on software profiling of benchmark applications, four operations (FIR filtering, fast Fourier transform, median filtering, and math functions) are identified for hardware implementation. Over two complete EEG and EKG applications, the processor with both CPU and accelerators (Fig. 1) achieves 10× to 11× lower energy compared to using only the CPU.
Aside from reduced speeds, low-voltage designs must contend with increased sensitivity to variation. At the architecture level, timing errors can be addressed with error-tolerant techniques having low throughput and power overhead. For example, special flip-flops and a synthesis methodology are used in [13] and [14], respectively, to detect if critical paths are exercised as $V_{DD}$ is lowered below the point of first failure; failed operations are either re-executed or given more time to complete. Another class of techniques (e.g., that in [14]) exploits the inherent error tolerance of applications such as audio or video and aims not to correct all errors but to scale system reliability gracefully to maintain a desired signal-to-noise ratio.
B. Effect of Variations on Logic Timing
Operating circuits at low voltages exacerbates the effects of both global [15] and local variations. Local variations have long been known in analog and static random-access memory (SRAM) designs [16], [17]. With shrinking transistor geometries and ultralow-voltage operation ($V_{DD} \le 0.5$ V), local variations have become increasingly significant for logic timing as well. For low-voltage operation, these variations can result in timing path delays with a standard deviation comparable to the global corner delay, and they must be accounted for during timing closure in order to ensure a robust manufacturable design. Fig. 2 shows that, for a representative cell from a 28-nm CMOS library, the global corner delay increases by 15× as $V_{DD}$ is reduced from 1.0 to 0.5 V, whereas the total delay (corner + 3σ stochastic delay) increases by 36×.
SRAM and logic can be designed, by adding a sufficient timing margin to the corner-based analysis, to ensure reliable operation at very low voltage, but generally, this is at the expense of significant performance loss at high voltage. Statistical static timing analysis (SSTA) has emerged as a necessary tool to achieve a maximally efficient low-voltage operating point, with no or minimal loss of performance at high voltage.
At nominal voltage, it is usually accurate to assume that the circuit performance is linear in transistor variation [15], [19]. However, at low voltage ($V_{DD} \le 0.5$ V), circuit delay is a nonlinear function of the transistor random variables. This greatly complicates the statistical analysis because the probability density function of the circuit delay is no longer Gaussian. Several approaches for SSTA have been proposed, ranging from numerical integration techniques [20] to Monte-Carlo-based techniques [21], [22] to those based on probabilistic analysis. Several approaches have been developed for simulating the effects of global variations in the nonlinear non-Gaussian case. Most of these methods rely on Taylor-series-expansion-based polynomial representations to model the cell and timing path delays [23], [24]. High computational complexity of SSTA techniques often results in impractical run times. A computationally efficient approach, based on operating point analysis (OPA), is proposed in [18]. OPA-based timing analysis is used in [25] to perform timing closure for a full-scale digital signal processor (DSP) system-on-chip (SoC) test chip, designed using commercial 28-nm technology. The statistical timing analysis approach ensures reliable operation of the DSP logic elements down to $V_{DD} \le 0.5$ V.
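The loss of Gaussianity at low voltage can be illustrated with a small Monte Carlo experiment. The sketch below is a toy model rather than the OPA method of [18] or any cited SSTA technique. It assumes a single gate whose drive current follows a smooth weak-to-strong inversion interpolation, a Gaussian threshold-voltage variation, and hypothetical values for the slope factor, the mean $V_T$ (0.4 V), and its standard deviation (30 mV).

```python
import math
import random
import statistics

N_SLOPE = 1.5      # subthreshold slope factor (assumed value)
V_THERMAL = 0.026  # thermal voltage kT/q near room temperature, in volts

def gate_delay(vdd, vt):
    """Toy delay model (arbitrary units): delay ~ vdd / drive current."""
    u = (vdd - vt) / (2.0 * N_SLOPE * V_THERMAL)
    # Smooth interpolation: exponential in weak inversion, quadratic in strong.
    drive = math.log1p(math.exp(u)) ** 2
    return vdd / drive

def sample_delays(vdd, n_samples=20000, vt_mean=0.4, vt_sigma=0.03, seed=0):
    """Draw gate delays under Gaussian threshold-voltage variation."""
    rng = random.Random(seed)
    return [gate_delay(vdd, rng.gauss(vt_mean, vt_sigma))
            for _ in range(n_samples)]

def skewness(values):
    """Sample skewness; close to zero for a Gaussian distribution."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

skew_nominal = skewness(sample_delays(1.0))  # well above threshold
skew_low = skewness(sample_delays(0.3))      # below the assumed V_T
print(f"skewness at 1.0 V: {skew_nominal:.2f}, at 0.3 V: {skew_low:.2f}")
```

In this model, the delay distribution at 1.0 V is only mildly skewed, while at 0.3 V it develops a long right tail because delay depends almost exponentially on $V_T$. A mean-plus-3σ Gaussian estimate would therefore underestimate the slow tail at low voltage.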
As the technology continues to scale, statistical timing analysis approaches will be key enablers of robust ultralow-voltage operation to maximize energy efficiency.
III. EMBEDDED MEMORY
A. Low-Voltage SRAM Design
The workhorse of embedded memory is SRAM based on the six-transistor (6T) cell because of its straightforward design and area-efficient layout. SRAM cells are often designed with minimum-size devices to maximize area efficiency, and consequently, their operation is severely affected by process variation. As the effect of transistor variation is exacerbated at low voltages, operating SRAMs at lower $V_{DD}$ is a challenging design problem. Since a memory consists of a large number of bit cells, sense amplifiers, and row/column drivers, it is essential to consider the worst-case process, voltage, and temperature conditions for these circuits to ensure robust operation. Static noise margin (SNM) is a metric used to quantify the stability of a bit cell in the retention state and under read/write conditions [26]. For large memories with millions of bits, it is not uncommon to consider the 5σ to 6σ tails of the SNM distributions to ensure robust operation.
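The need to design for 5σ to 6σ tails can be seen from a simple yield estimate. The sketch below assumes that the per-cell margin is Gaussian, that cells fail independently, and that a cell fails when its margin falls beyond k standard deviations on one side. These are simplifying assumptions for illustration, and the array size of one million cells is a hypothetical example.

```python
import math

def tail_probability(k_sigma):
    """One-sided Gaussian tail probability P(X > k_sigma standard deviations)."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2.0))

def array_yield(n_cells, k_sigma):
    """Probability that none of n_cells independent cells lies in the tail."""
    p_fail = tail_probability(k_sigma)
    # log1p keeps precision when p_fail is very small.
    return math.exp(n_cells * math.log1p(-p_fail))

for k in (4, 5, 6):
    print(f"{k} sigma margin, 1e6 cells: yield = {array_yield(10**6, k):.4f}")
```

Under these assumptions, a one-million-cell array designed with only 4σ of margin would almost never be fully functional, a 5σ design yields roughly three out of four arrays, and a 6σ design exceeds 99.9%.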
At low voltages, the conventional 6T cell suffers from functional problems in which a read operation can alter the state of the bit cell and a write operation cannot overwrite the previous state of the bit cell. At low voltages, bit cells can even fail to retain data. Degradation of the $I_{ON}/I_{OFF}$ ratio of devices at low voltages also introduces a challenge for the sensing network to distinguish a logic "0" from a logic "1." Fig. 3 shows the minimum operating voltage ($V_{min}$) for 6T SRAMs on different process nodes. Going to more scaled technologies, increased local transistor mismatch results in higher $V_{min}$. Therefore, recent work has focused on alternative bit-cell topologies, peripheral assist circuits, and novel sensing techniques. Although these techniques often introduce area overhead over the conventional 6T SRAMs, they also enable low-voltage operation and provide lower $V_{min}$.
1) Alternative Bit-Cell Topologies: The eight-transistor (8T) bit cell [27] has gained significant popularity in recent years. In this topology, two extra transistors form a separate read port that is decoupled from the cell's write ports. This enables write ability and read disturbance problems to be addressed independently. The works in [28]-[30] use this topology to achieve low-voltage operation down to the subthreshold region. The work in [31] uses a novel seven-transistor bit cell with single-ended read and write ports. The extra transistor is used to break the feedback between cross-coupled inverters during accesses to ensure stability. The work in [32] uses a ten-transistor bit cell tailored for subthreshold operation where extra transistors are used to form a read buffer similar to that of the 8T topology, but transistor stacking minimizes bit-line leakage, which is particularly problematic in the subthreshold region due to the degraded $I_{ON}/I_{OFF}$ ratio.
2) Rowwise and Columnwise Assist Circuits: Although different cell topologies enable low-voltage operation, peripheral assist circuits are often needed to complement low-voltage functionality of the bit cell. The work in [33] uses a rowwise write assist circuit where word lines are overdriven to improve write ability of bit cells. Alternatively, the work in [34] uses a columnwise approach and modulates the supply voltage of active columns during a write operation to ensure successful overwriting of stored data in the bit cell. The work in [35] also uses a columnwise assist by boosting the bit lines below the ground level.
A recent work in [36] uses a standard high-density (0.12 μm²) 6T bit cell and supports it through peripheral assist circuits to enable low-voltage operation in 28-nm technology. Peripheral assists include read stability improvement through short local bit lines, word-line voltage boosting for write ability, a preread scheme to avoid the half-select problem, and large-signal sensing for improved low-voltage performance.
3) Sensing Topologies: Conventional 6T SRAMs utilize differential sense amplifiers to amplify a small-signal voltage swing on bit lines to a rail-to-rail signal. However, the work in [37] showed that large-signal sensing can provide better tradeoffs in deeply scaled CMOS processes. It should be noted that sensing network design should be considered together with the bit-cell topology and overall SRAM architecture. For example, 8T-bit-cell-based designs require a single-ended or pseudodifferential sensing scheme.
Reconfigurability has been pursued in one class of solutions targeting low-voltage operation. The design in [38] augments the sense-amplifier circuit with a mechanism to select from multiple voltage references. Therefore, although the variation from one sense amplifier to the next inserts different amounts of error to each column of cells, the resulting error can be reduced by measuring each sense amplifier and storing the optimum reference setting. The design in [28] implements two sense amplifiers per column of memory cells where one sense amplifier serves as a backup. The penalty of the reconfigurability scheme is added test time which must be evaluated and must not offset gains brought by reconfigurability.
Replica circuits embedded with the memory bypass this limitation of increased test time. By measuring important circuit properties locally, die-to-die variation and process, voltage and temperature variations over time can be tracked, and circuits on the edge of failure can be brought back to stable operating conditions. For example, the memory in [39] arranges the read devices in the memory cell so that the leakage of each column is independent of data. As a result, it is possible to generate the nominal "0" level with a dummy memory column and use this voltage to provide a virtual ground to sensing inverters of functional memory columns.
Finally, offset compensation techniques aim to eliminate the contribution of sense-amplifier variation to improve overall read access failure. The SRAM design in [30] implements a coupling capacitor between the bit line and the sense circuit to separate the signal from the amplifier bias conditions. This approach resembles the classic domino read path with dynamic p-channel MOS (PMOS) inverters, except that the variable sensing PMOS devices are effectively translated to a common level of bias that corresponds to the onset of turn-on, i.e., $V_{DD} - |V_{Tp}|$.
B. Alternative Low-Power Embedded Memories
Recent work at nominal voltages has replaced embedded SRAM with dynamic random-access memory (DRAM) to reduce cost [40]. The same opportunity exists at lower voltage, and recent work aims to implement DRAM designs in logic-compatible technologies that do not require out-of-range voltage biases [41], [42]. Although the logic-compatible low-voltage DRAM approach requires additional transistors in the cell (two or three transistors), the overall cost can be potentially lower than that of 6T SRAM. At even higher densities of integration or longer timescales of standby, it is ultimately desirable to have a high-speed nonvolatile RAM. The relaxed speed of low-voltage systems creates an opportunity to accommodate emerging cell technologies such as magnetic RAM, phase-change RAM, and resistive RAM.
IV. ANALOG-TO-DIGITAL CONVERSION
Nearly all systems require an ADC to interface between analog signals from the physical world and DSPs with rich processing capabilities. Operating analog circuits from the same low-voltage digital supply can simplify the system-level problem of efficient power conversion by eliminating the need for a separate analog supply. Moreover, compatibility with digital systems can lead to potential cost and form factor benefits by integrating analog processing together with DSP in a SoC. Therefore, this section provides an overview of ultralow-voltage ADC design.
Technology scaling can significantly reduce cost and increase performance of the digital portion of a SoC. However, device scaling requires a reduction in supply voltage, which poses the challenge of reduced voltage headroom for analog circuits. Moreover, the reduced intrinsic gain of transistors in scaled technologies makes designs that rely on op-amps more difficult (e.g., pipelined ADCs). Lastly, in order to maintain a desired dynamic range under a scaled supply voltage, circuit noise must be proportionately reduced, which necessitates an increase in power for purely noise-limited designs (≥12 bits) [43]. In general, analog power consumption is a complex function of factors such as dynamic range, gain, bandwidth, supply voltage, and circuit topology.
Despite the aforementioned challenges, voltage scaling remains a very effective method for improving the energy efficiency of ADCs, which is commonly captured by an empirical figure of merit (FOM) defined as $\mathrm{FOM} = P/(2^{\mathrm{ENOB}} f_S)$, where $P$ is the power consumption, $f_S$ is the sampling rate, and ENOB is the effective number of bits [44]. Therefore, there has been a recent trend toward low-voltage (< 1 V) low- to moderate-resolution designs (< 10 bits) using highly digital op-amp-free architectures such as flash ADCs [45], [46] and successive approximation register (SAR) ADCs [47], [48]. Current state-of-the-art low-voltage ADCs such as a 0.5-V 8-bit 1.1-MS/s SAR ADC in 40-nm CMOS [48] achieve FOMs as low as 6.3 fJ per conversion step by leveraging the superior transistor speed of scaled technologies to operate at near threshold, which is made possible at the architectural level by using only digital or passive circuits. Fig. 4 shows the energy per conversion versus the signal-to-noise-and-distortion ratio (SNDR) for ADCs published in the last five years, demonstrating that the most energy-efficient ADCs in the low- to moderate-resolution range operate from a sub-1-V supply.
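The figure of merit defined above is straightforward to evaluate. The sketch below implements it together with the standard conversion from SNDR to ENOB for a full-scale sine input, ENOB = (SNDR − 1.76 dB)/6.02 dB. The function names and the numerical inputs (1 μW, 8.0 effective bits, 1 MS/s) are hypothetical and are not measurements from any cited design.

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR in dB, assuming a full-scale sine."""
    return (sndr_db - 1.76) / 6.02

def adc_fom(power_w, enob, f_s_hz):
    """Energy per conversion step in joules: P / (2**ENOB * f_S)."""
    return power_w / (2.0 ** enob * f_s_hz)

# Hypothetical ADC: 1 uW at 1 MS/s with 8.0 effective bits.
fom = adc_fom(1e-6, 8.0, 1e6)
print(f"FOM = {fom * 1e15:.2f} fJ per conversion step")
```

Because the denominator doubles with each effective bit, losing one bit of ENOB at the same power and sampling rate doubles the FOM.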
A. Reconfigurable Voltage-Scalable SAR ADC
In a majority of applications, there are fixed ADC dynamic range and bandwidth requirements which depend on the signal characteristics and system requirements. However, applications such as sensor networks often have varying bandwidth and dynamic range requirements, making reconfigurable ADCs highly desirable. An example of a reconfigurable 5- to 10-bit SAR ADC whose power scales with resolution and sampling rate can be found in [50]. A resolution-reconfigurable digital-to-analog converter whose power scales exponentially with resolution is used to reduce $CV^2$ switching energy. Moreover, as the ADC resolution is decreased, the linearity requirement of the sampling network is relaxed, and quantization noise becomes much larger than the sampled ($kT/C$) thermal noise. Therefore, at lower resolutions where increased distortion and increased thermal noise (relative to the quantization noise) can be tolerated, the supply voltage is scaled to improve energy efficiency. Voltage scaling, however, places a limit on the maximum $f_S$, resulting in increased leakage energy per conversion. To address this, the ADC in [50] is duty cycled so that leakage power gating with a high-$V_T$ device can be used during the SLEEP state to further reduce power. These techniques allow the ADC to remain energy efficient over a wide range of resolutions and sample rates.
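The two scaling arguments in this subsection can be checked numerically. The sketch below assumes an ideal uniform quantizer (noise power LSB²/12), a sampled thermal noise power of kT/C, and a binary-weighted capacitor DAC whose total capacitance is 2^N unit capacitors. The full-scale voltage, sampling capacitor, unit capacitor, and temperature are hypothetical values and do not describe the design in [50].

```python
BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K

def quantization_noise_power(v_full_scale, n_bits):
    """Noise power of an ideal uniform quantizer, LSB**2 / 12, in V^2."""
    lsb = v_full_scale / 2 ** n_bits
    return lsb ** 2 / 12.0

def sampled_thermal_noise_power(c_sample, temperature_k=300.0):
    """kT/C noise power on a sampling capacitor, in V^2."""
    return BOLTZMANN * temperature_k / c_sample

def dac_switching_energy_scale(n_bits, c_unit, v_ref):
    """Order-of-magnitude C*V^2 energy of a binary-weighted capacitor array."""
    return (2 ** n_bits) * c_unit * v_ref ** 2

V_FS = 1.0       # hypothetical full-scale voltage in volts
C_SAMPLE = 1e-12 # hypothetical 1 pF sampling capacitor
thermal = sampled_thermal_noise_power(C_SAMPLE)
ratio_10b = quantization_noise_power(V_FS, 10) / thermal
ratio_5b = quantization_noise_power(V_FS, 5) / thermal
energy_ratio = (dac_switching_energy_scale(10, 1e-15, V_FS)
                / dac_switching_energy_scale(5, 1e-15, V_FS))
print(f"quantization/thermal noise: {ratio_10b:.0f} (10 b), {ratio_5b:.0f} (5 b)")
print(f"DAC energy, 10 b relative to 5 b: {energy_ratio:.0f}x")
```

With these values, dropping from 10 to 5 bits increases the quantization-to-thermal noise ratio by a factor of 1024 and reduces the array switching energy by a factor of 32, which is the headroom that allows the supply voltage to be lowered at low resolution.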
B. High-Performance Low-Voltage ADCs
Despite the efficiency of SAR ADCs, they fail to simultaneously achieve high speed, high resolution, and superior energy efficiency. For applications that require high throughput, parallelism can be exploited to achieve high speed at low voltages [51], [52]. In [51], a 250-MS/s SAR ADC is demonstrated at 0.8 V by interleaving 36 channels, but the resolution is limited to 5 bits. In order to simultaneously achieve high speed and resolution, low-voltage op-amp-free pipelined ADCs are being explored. For example, highly digital pipelined ADCs using comparator-based switched-capacitor circuits and zero-crossing detectors eliminate the need for op-amps, making low-voltage pipelined ADCs a promising new direction [53]. Finally, for low-speed high-accuracy applications, ultralow-voltage ΔΣ converters using body-input transconductance amplifiers [54] or inverter-based circuits [55] can achieve very high dynamic range.
V. CONCLUSION
Operating circuits at low voltages is an effective method to reduce energy consumption, but the challenges of low-voltage operation need to be addressed at different levels of the design. At the architecture level, reduced performance of transistors at low voltages can be mitigated with increased concurrency through parallelism or pipelining. At the circuit level, new topologies and techniques need to be tailored for the tradeoffs of low-voltage operation. Hardware reconfigurability can be used to adjust to the different requirements of various operating voltage ranges. Finally, energy savings can be maximized by optimizing algorithms, architectures, and circuits for the specific and key requirements of each target application.
