ULTRA ENERGY-EFFICIENT SUB-/NEAR-THRESHOLD COMPUTING: PLATFORM AND METHODOLOGY by ZHAO WENFENG
Ultra Energy-Efficient Sub-/Near-Threshold
Computing: Platform and Methodology
Zhao Wenfeng
(B.Eng, Huazhong University of Science and Technology, 2007)
(M.Eng, Huazhong University of Science and Technology, 2009)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2013
Declaration
I hereby declare that this thesis is my original work and it has been written by
me in its entirety. I have duly acknowledged all the sources of information which
have been used in the thesis.






Acknowledgements
My postgraduate study would not have been so exciting and memorable without
the guidance and help of many people in my life. I would like to take this
opportunity to express my sincere gratitude to them.
First of all, I would like to express my appreciation to my supervisor, Prof. Ha
Yajun, for his guidance, encouragement and trust during my PhD study. I am
grateful that he gave me a valuable opportunity to study and an interesting
research topic over the past years. He has always been willing to offer his insights
and suggestions whenever I encountered technical problems, and I would not have
been able to finish my research without his encouragement and inspiration. I also
thank him for involving me in several collaboration projects beyond my research
topic, which greatly extended my knowledge.
I would like to acknowledge the in-depth comments and feedback from my
doctoral committee members, Prof. Heng Chun Huat, Prof. Xu Yongping and
Prof. Lian Yong. I would also like to thank Prof. Massimo Alioto for his rigorous
review and discussions of a journal paper.
During my PhD, I worked with several group members under Prof. Ha Yajun.
I would like to thank Anastacia Alvarez first, for her contributions to several
research projects related to my PhD research. I also enjoyed the collaborations
with Loke Wei Ting, Dr. Syed Riswan and Dr. Do Thi Thu Trang, and the
discussions with Dr. Yu Heng, Dr. Wang Yi, Li Ang, Hoo Chin Hau, Luo Shaobo
and Chen Yongzhen.
It has been a pleasant experience in the VLSI lab in the company of many fellow
friends: Zhao Jianming, Liu Xu, Li Xuchuan, Liu Xiayun, Zhou Lianhong, Li Yong
Fu, Mahmood Khayatzadeh, Chua Dingjuan and Jerrin Pathrose. Also, I will never
forget the afternoon coffee group with Pan Rui and Wu Tong.
Finally, I would like to express my gratitude from the bottom of my heart




List of Abbreviations
ADC Analog to Digital Converter
AES Advanced Encryption Standard
AFE Analog Front End
AOF Area Overhead Free
ASIC Application Specific Integrated Circuit
BB Body Biasing
BSN Body Sensor Network
CAD Computer Aided Design
CCS Composite Current Source
CF Composite Field
CMOS Complementary Metal Oxide Semiconductor
CORDIC COordinate Rotation DIgital Computer
CPU Central Processing Unit
DCVS Differential Cascode Voltage Switch
DIBL Drain Induced Barrier Lowering
DoE Design of Experiments
DRAM Dynamic Random Access Memory
DSTA Deterministic Static Timing Analysis




EDA Electronic Design Automation
EDNM Effective Diode Network Model
FDC Frequency to Digital Converter
FDSOI Fully Depleted Silicon on Insulator
FFT Fast Fourier Transform
FIR Finite Impulse Response
GF Galois Field
GIDL Gate Induced Drain Leakage
GPU Graphics Processing Unit
IC Integrated Circuit
INWE Inverse Narrow Width Effect




LVSB Low Voltage Swapped Biasing
MC Monte Carlo
MCU Micro-Controller Unit
MEP Minimum Energy Point
MOMCAP Metal-Oxide-Metal Capacitor
MOSFET Metal Oxide Semiconductor Field Effect Transistor
MTCMOS Multiple Threshold CMOS




PDF Probability Density Function
PTAT Proportional to Absolute Temperature
RDF Random Dopant Fluctuation
RF Radio Frequency
RSCE Reverse Short Channel Effect
RVT Regular Threshold Voltage
S-Box Substitution Box
SFBB SelF-Body Biasing
SIMD Single Instruction Multiple Data
SMA Surrogate Modeling Adjustment
SoC System on Chip
SSTA Statistical Static Timing Analysis
Sub-Vth Sub-threshold
TDC Time to Digital Converter
TSD Temperature Sensitive Delay
TSRO Temperature Sensitive Ring Oscillator
ULP Ultra Low Power
ULV Ultra Low Voltage
VLSI Very Large Scale Integration
Vth Threshold Voltage
VT Thermal Voltage
VTC Voltage Transfer Characteristics
WSN Wireless Sensor Network




Contents
List of Abbreviations
Contents
Summary
List of Tables
List of Figures
1 Introduction
1.1 Background and Motivation
1.2 Thesis Contributions
1.3 Organization of the Thesis
2 Literature Review
2.1 Modeling and Technology Implications
2.2 Circuit Building Blocks
2.2.1 Logic
2.2.2 SRAM
2.2.3 Level Shifter
2.3 Circuit/Architecture Techniques and Design Automation Methodologies
2.3.1 Circuit Techniques
2.3.2 Design Automation Methodologies
2.4 SoC Designs
3 Near-Vth ASIC Design: Statistical Timing Analysis and Performance Boosting
3.1 Background and Motivation
3.1.1 ULV Timing Analysis Challenges
3.1.2 ULV Body Biasing Challenges
3.2 Proposed Surrogate Model Adjustment based SSTA (SMA-SSTA)
3.3 Area-Overhead-Free Body-Biasing Techniques
3.3.1 Conventional AOF-BB Schemes and Limitations
3.3.2 Proposed SelF-Body-Biasing Scheme
3.4 Case Study: Advanced Encryption Standard
3.4.1 Low-Cost AES Architectures and S-Box Implementation
3.4.2 Automated and Detailed SMA-SSTA Design Flow
3.4.3 Runtime for Local Variation Characterization and Comparison with SSTA
3.4.4 Performance and Design Margin Recovery
3.4.5 Physical Implementation Considerations
3.5 Testchip Measurement Results
3.5.1 Performance Measurement and Energy Comparison
3.5.2 Static and Dynamic Robustness of the Body Voltage Bias Point in SFBB
3.6 Conclusion and Summary
4 A 65nm 30.7fJ/bit Subthreshold Level Shifter Design
4.1 Introduction
4.2 State-of-the-Art Implementations
4.3 Proposed Level Shifter Design
4.3.1 NMOS-Diode Current Limiter based Level Shifter
4.3.2 Level Shifter Optimization with MTCMOS and Subthreshold Sizing
4.3.3 Comparative Analysis to Previous Implementations
4.4 Measurement Results and Discussions
4.5 Conclusion and Summary
5 Robust and Energy-Efficient Ultra-Low Voltage Standard Cell Design with Intra-Cell Mixed-Vth Methodology
5.1 Introduction
5.2 Related Work
5.2.1 Subthreshold Logic Robustness
5.2.2 MVT-LM Design Technique
5.3 MVT-ULV: Robustness-Driven Mixed-Vth for ULV Operation
5.4 Experimental Results: Iso-Area Constraint
5.5 Experimental Results: Iso-Yield Constraint
5.5.1 Cell Level Evaluation
5.5.2 Library Level Evaluation
5.6 Conclusion
6 Exploring Energy Efficiency in Embedded DRAM
6.1 Background
6.2 Hidden-Refresh Scheme
6.3 Circuit Design for Self-Refresh eDRAM
6.3.1 Bitcell Choice and Operation Principle
6.3.2 Write/Read Bitline Circuit Design
6.3.3 Wordline Driver Circuit Design
6.4 Power Metrics of the Hidden-Refresh eDRAM under Voltage Scaling
6.5 Conclusions
7 A 0.4V 280nW Nearly All-Digital Frequency Reference-less Hybrid Domain Temperature Sensor
7.1 Introduction
7.2 Ratioed-Current/Delay PTAT Sensor Core
7.3 Hybrid Domain Temperature Sensing Scheme
7.4 Circuit Implementation Details
7.5 Measurement Results
7.6 Conclusion
8 Conclusion and Future Work
8.1 Conclusion




Summary
The demand for low-power and energy-efficient computing platforms is the
driving force of modern integrated circuit technologies. Power consumption and
energy efficiency are becoming the primary design considerations for mobile
computing devices, wireless sensor networks, and wearable and implantable
biomedical devices, which are powered by either small-volume batteries or energy
scavenging devices.
Voltage scaling is an effective way to reduce power consumption. In particular,
aggressive voltage scaling to Ultra-Low Voltage (ULV) sub-/near-threshold
operation has been demonstrated as a viable solution for extremely
energy-constrained applications, paving the way for the realization of the
above-mentioned power-aware computing platforms. However, these opportunities
come with design challenges that are inevitable for ULV circuits, including logic
robustness, circuit functional yield, and excessive energy/area/throughput
design margins. Existing solutions achieve robust ULV operation, but at the cost
of significant energy/area overhead.
This thesis presents design methodologies and circuit optimizations that achieve
energy and area efficiency in ULV designs beyond the state-of-the-art solutions.
First, we present a design flow with a novel Surrogate Model Adjustment based
Statistical Static Timing Analysis (SMA-SSTA) framework and a novel
SelF-Body-Biasing (SFBB) scheme to reduce the excessive design margins and to
boost the performance of ULV ASIC designs. A 65nm AES encryption engine test chip
designed under this framework delivers 12.2Mbps at 1.65pJ/bit at 0.5V,
which is 22× and 7.8× better than the state-of-the-art AES design, respectively,
while reducing silicon area by 28%.
Second, we propose several custom-designed ULV circuit blocks. An NMOS-diode
current limiter based level shifter is proposed with MTCMOS and INWE-aware
sizing techniques. This level shifter achieves 25.1ns delay and 30.7fJ/bit
energy in a 65nm technology, outperforming all previous level shifter
implementations. For logic design, we propose a robust and energy-efficient
intra-cell mixed-Vth standard cell design methodology. This methodology identifies
the bottleneck devices that cause robustness degradation in ULV logic cells and
replaces them with low threshold voltage devices. A library-level comparison shows
that an average energy efficiency improvement of 30.1% can be achieved with the
proposed methodology. For memory design, we explore eDRAM as an alternative to
SRAM. We introduce a hidden-refresh scheme to handle the refresh operation, and
several circuit techniques to remove the need for the additional power supplies
required in conventional eDRAM designs. Experimental results confirm that
the hidden-refresh eDRAM offers higher density and lower access energy than
its SRAM counterpart.
Finally, we demonstrate the design of a 65nm 0.4V 280nW nearly all-digital
temperature sensor for wireless sensing platforms. A ratioed-current/delay PTAT
sensor core and a hybrid domain temperature sensing scheme are proposed to
eliminate the dependence on an external frequency reference. Eight measured
temperature sensor test chips show a maximum error of -1.6°C/1°C across the
0 to 100°C range after two-point calibration, with a sample rate of
40 samples/second at 0.4V.
List of Tables
3.1 Comparison of state-of-the-art Area-Efficient AES architectures
3.2 Comparison of state-of-the-art normalized S-Box design
3.3 Summary and comparison of state-of-the-art area-efficient AES designs
4.1 Summary of the transistor sizing
4.2 Comparison to state-of-the-art LS designs
5.1 Comparison of SNM (VDD=300mV, 25°C) and VDDmin of several logic cells under different logic design methods
5.2 Synthesis results of the ITC'99 benchmark circuits.
6.1 Comparison among SRAM, eDRAM and Hidden-Refresh eDRAM
7.1 Categories of the CMOS temperature sensor
7.2 Summary of state-of-the-art nW temperature sensor
List of Figures
1.1 Technology scaling trend of supply voltage.
1.2 Scope of the thesis.
3.1 Conceptual illustration of (a) feed-forward SSTA, (b) feed-back based SMA-SSTA and (c) detailed flow chart of SMA-SSTA.
3.2 Timing derating feature of logic datapath under local variations.
3.3 (a) Cross-section of the deep N-well technology and, (b) parasitic diode connections of a CMOS inverter under three AOF-BB schemes (ZBB, LVSB, proposed SFBB).
3.4 Illustration of the EDNM model of (a) an inverter cell and, (b) a 2-input NAND cell for SFBB scheme.
3.5 Illustration of the EDNM model for SFBB scheme of (a) an inverter cell, (b) a NAND cell, (c) a 3-stage ring oscillator, (d) simulation waveform of the self-bias node voltage fluctuation of the 3-stage RO SFBB and, (e) timing potentials of the LVSB and SFBB scheme over the ZBB scheme.
3.6 Illustration of the AOFBB well tap designs and body-biasing floorplan under a tap-less standard cell library technology.
3.7 State-of-the-art AES designs: energy vs. throughput and area (scaled to 65 nm node for the different adopted technologies).
3.8 The AES engine architecture with native round and key expansion S-Box.
3.9 Pareto set plot of DoE for delay sensitivity analysis toward different variation parameters.
3.10 +3σ setup time accuracy analysis of SMA-SSTA after model adjustment.
3.11 Statistical distribution of the delay (normalized to 3σ delay under SFBB at 0.6V) for different body biasing schemes and resulting clock frequency improvement over DSTA (top-right).
3.12 Die micrograph with annotated core and on-chip testing buffer and summary of the AES encryption engine at room temperature.
3.13 Measurement results of operating frequency and energy per bit of the testchip.
3.14 Leakage measurements of ZBB, LVSB and SFBB versus VDD under different temperatures in a single die.
3.15 Leakage measurements across 15 dice.
3.16 Measured self-bias node voltage at 0.5V supply.
3.17 Dynamic stability test of the self-bias node.
4.1 Conventional DCVS level shifter topology.
4.2 Proposed NMOS-diode current limiter based level shifter topology with reduced pull-down device size.
4.3 Simulated transient waveform of the MTCMOS DCVS level shifter and the proposed level shifter.
4.4 Schematic of the optimization to the Proposed LS with MTCMOS and INWE-aware sizing.
4.5 Transient simulation of the INWE effects on NMOS.
4.6 Normalized comparison of the delay, energy and energy-delay product of the proposed level shifter with adopted optimization techniques.
4.7 Transient simulation of the proposed LS and the previous designs in [57, 58].
4.8 Monte Carlo simulation of the proposed LS and the previous designs in [57, 58].
4.9 Die photo and layout view of the proposed level shifter.
4.10 Measured waveform of the proposed LS with a 60mV to 1.2V conversion.
4.11 Measured waveform of the proposed LS with a 60mV to 1.2V conversion.
4.12 Measured LS delay from a typical die.
4.13 Measured statistics of the proposed LS: delay, VDDmin, dynamic power and leakage power.
5.1 (a) schematics of commercial standard cells and subthreshold logic failure mechanism, (b) cross-coupled NAND/NOR pair and, (c) example of butterfly plot.
5.2 NAND/NOR pairs of (a) RVT, (b) LVT, (c) previous MVT-LM technique, and (d) butterfly plot of three pairs.
5.3 Previous MVT-LM flip-flop design, (a) Mixed-I, and (b) Mixed-II, and (c) flip-flop with reset function.
5.4 (a) MVT-ULV NAND cell design with other possible variant cells, and (b) Monte-Carlo simulation of the NAND-NOR Pair VTC curves.
5.5 MVT-ULV AOI21/OAI21 cell design with annotated bottleneck transistors.
5.6 MVT-ULV flip-flop cell with asynchronous reset.
5.7 Butterfly plots of four sets of 2-input NAND-NOR pairs.
5.8 Butterfly plots of variant design of 2-input NAND-NOR pairs.
5.9 Temperature effects on logic output swing.
5.10 Output swing failure rate of three sets of standard cells, RVT, MVT-LM, MVT-ULV under 27°C and 125°C.
5.11 Delay and power distribution of three 10-stage NAND-NOR chains (commercial, upsized and MVT-ULV).
6.1 Conceptual illustration of the hidden-refresh scheme for eDRAM.
6.2 Top-level block diagram of the hidden-refresh eDRAM.
6.3 Detailed timing diagram of the hidden-refresh eDRAM.
6.4 2T gain-cell eDRAM with simplified operation waveform.
6.5 Timing diagram of the eDRAM write, read and refresh operation.
6.6 Write/read bitline design of the hidden-refresh scheme.
6.7 Bootstrapped WWL driver and simulated waveform.
6.8 Illustration of the worst case read disturbance issue of the 2T gain-cell eDRAM.
6.9 Proposed tri-state RWL driver.
6.10 Schematic of the 1K-bit hidden-refresh eDRAM.
6.11 Power consumption of the hidden-refresh eDRAM.
6.12 Retention time and static power of the hidden-refresh eDRAM.
6.13 Read/write power with duty cycled refresh power of the hidden-refresh eDRAM.
7.1 Power consumption versus frequency of the state-of-the-art ultra-low power frequency reference for illustration of power overhead due to frequency reference.
7.2 Schematic of the proposed ratioed-current/delay PTAT sensor core.
7.3 Mathematical background of the operation principles of the proposed current-ratioed PTAT.
7.4 Optimal VGS vs. Adjusted-R² coefficient.
7.5 Timing diagram of the ratioed-current/delay temperature sensor.
7.6 Schematic of the hybrid-domain temperature sensor.
7.7 Simulated delay ratio of the ratioed-current/delay temperature sensor.
7.8 Simulated temperature error of the ratioed-current/delay temperature sensor.
7.9 Illustration of time domain (top left), hybrid domain (top right) processing and the hybrid domain processing benefits on TDC bandwidth (bottom left) and dynamic frequency scaling (bottom right, simulated).
7.10 Hybrid domain digital processing data format.
7.11 Die micrograph with annotated floorplan.
7.12 Measured delay ratio (left) and adjusted-R² coefficient (right) of 8 chips.




Chapter 1. Introduction
1.1 Background and Motivation
The advancement of solid-state circuit technologies has underpinned the
prosperity of the integrated circuit industry for the past 60 years. Technology
scaling, also known as Moore's law [1], is the driving force pushing the
limits of high-performance computing capability and system integration level.
Meanwhile, energy efficiency has improved through successive CMOS technology
generations [2], with major contributions from both supply voltage scaling and
parasitic reduction due to shrinking device geometries.
The benefits of both technology scaling and architecture/circuit innovations
have profoundly contributed to the diverse application scenarios of CMOS
technologies and to future generations of computing concepts. These emerging
applications and concepts will eventually accelerate the formation of the
corresponding business models in the economic ecosystem. For instance, battery-operated

























Fig. 1.1: Technology scaling trend of supply voltage.
leading to the rapid market growth of personal computing and mobile devices in
recent years, such as smart phones, tablets and ultrabooks, etc.
However, as the CMOS device feature size approaches the nanometer regime,
technology scaling encounters a challenging trend: the supply voltage has almost
ceased to scale beyond the 90nm technology node, as shown in Fig. 1.1. The main
reason is the dramatic increase in leakage current in advanced technology nodes.
As a consequence, constant electric field scaling no longer holds, and supply
voltage scaling alone can hardly improve energy efficiency. Meanwhile,
the transistor density doubles every generation, which raises significant concerns
about increasing power density [3].
The intuitive approach is to continue supply voltage scaling all the way to the
fundamental limit of CMOS technology, as described in [4, 5]. This is an
effective way to reduce the dynamic energy, owing to its quadratic dependence
on the supply voltage ($CV_{DD}^2$), and to reduce the leakage power through the
weakened Drain-Induced Barrier Lowering (DIBL) effect. On the other hand, the
power reduction of logic gates also comes with increased propagation delay under
aggressive voltage scaling. When the energy required for a given task is of
interest, the situation changes: the leakage energy increases markedly at low
voltages because of the increased logic delay. As a result, the total energy is
dominated by the leakage energy when aggressive voltage scaling is applied. This
leads to the fundamental concept of the Minimum Energy Point (MEP) in CMOS
digital circuits, which has been both analytically modeled [6] and
silicon-verified [7]. The supply voltage at the MEP is often found to be around
the transistor threshold voltage in modern CMOS technologies, providing more than
an order of magnitude energy reduction compared to operation at the nominal
supply voltage. Therefore, this technique is also referred to as
sub-/near-threshold (sub-/near-Vth) operation.
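The trade-off behind the MEP can be illustrated with a minimal numerical sketch.
It is not the analytical model of [6]: it assumes a normalized EKV-style current
expression, ignores DIBL, and uses hypothetical parameter values (slope factor,
threshold voltage, activity factor and the leakage weight K_LEAK), so only the
qualitative shape of the energy curve is meaningful.

```python
import math

# All values below are illustrative assumptions, not data from the thesis.
N_SLOPE = 1.5     # subthreshold slope factor
V_T = 0.026       # thermal voltage at room temperature (V)
V_TH = 0.4        # transistor threshold voltage (V)
ALPHA = 0.1       # switching activity factor
K_LEAK = 1000.0   # logic depth times the ratio of leaking to switching gates


def drain_current(v_gs):
    """Normalized EKV-style current, smooth from sub- to super-threshold."""
    x = (v_gs - V_TH) / (2.0 * N_SLOPE * V_T)
    return math.log1p(math.exp(x)) ** 2


def energy_per_op(v_dd):
    """Return (dynamic, leakage, total) energy per operation, normalized to C."""
    e_dyn = ALPHA * v_dd ** 2
    # Delay scales as C*V/I_on, so leakage energy I_off*V*delay
    # scales as V^2 * I_off / I_on.
    e_leak = K_LEAK * v_dd ** 2 * drain_current(0.0) / drain_current(v_dd)
    return e_dyn, e_leak, e_dyn + e_leak


def find_mep(v_min=0.15, v_max=1.2, step=0.005):
    """Grid search for the supply voltage that minimizes the total energy."""
    n_points = int(round((v_max - v_min) / step)) + 1
    grid = [v_min + i * step for i in range(n_points)]
    return min(grid, key=lambda v: energy_per_op(v)[2])


if __name__ == "__main__":
    v_mep = find_mep()
    print(f"MEP supply voltage: {v_mep:.3f} V")
    for v in (0.2, v_mep, 1.0):
        e_dyn, e_leak, e_tot = energy_per_op(v)
        print(f"VDD={v:.3f} V  dyn={e_dyn:.4f}  leak={e_leak:.4f}  total={e_tot:.4f}")
```

Under these assumed parameters, leakage energy dominates well below the threshold
voltage and dynamic energy dominates at the nominal voltage, so the total energy
has an interior minimum in the near-threshold region. The size of the saving
depends entirely on the assumed parameters.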
Sub-/near-Vth operation thus provides a viable solution to achieve
ultra-low power consumption and improved energy efficiency beyond what technology
scaling alone can offer. It also provides opportunities to refine previous
low-power design models and to resolve several challenging issues in emerging
applications. For example, one direct observation from sub-/near-Vth operation
is that reducing the supply voltage below the MEP is counterproductive, as
the energy efficiency degrades. System-level task scheduling for
energy minimization therefore needs to be revised relative to the classic
dynamic voltage scaling framework [8]. It is also strongly suggested to distinguish
the roles of energy and power, adopting the appropriate design strategy for each
building block according to its duty-cycle profile [9].
Nowadays, several emerging applications are constrained by power budgets
and energy efficiency, and sub-/near-Vth operation is a promising candidate
for meeting these design constraints. We categorize these emerging applications into
two broad scenarios, as listed below.
• Solutions to Dark Silicon with Improved Energy Efficiency
Thanks to the doubling of the transistor count in each technology generation,
high-performance computing platforms have entered the multi-core era, but with a
nearly constant power budget. In view of this, today's multi-core computing
architectures will eventually encounter the "multi-core apocalypse" (i.e.,
the dark silicon crisis) [10] in the near future. The dark silicon crisis implies
that future generations of multi-core CPUs/GPUs might deliver degraded
performance improvement due to the limited number of working cores.
Near-Vth operation achieves a good compromise between energy and
performance to mitigate the dark silicon crisis, turning "dark silicon"
into "dim silicon" [11] with improved energy efficiency [12]. Several recently
reported computing architectures also exploit near-threshold operation,
such as wide SIMD [13], 3-D many-core [14], as well as future heterogeneous
computer architectures with general-purpose processors and dedicated
accelerators [15].
• Empowering WSN, BSN and Implantable Electronics
The vision of the Internet of Things (IoT) advocates the connection of thousands
of sensor-based computing platforms to form wireless sensor networks
[16], with promising applications in smart cities/homes, ubiquitous environmental
monitoring, industrial control, etc. Further extending the IoT to body sensor
networks [17] enables remote e-health services such as healthcare assessment and
diagnosis, fall detection and vital human body signal monitoring (e.g., ECG).
In addition, medical implantable electronics offer profound possibilities for
improving patients' quality of life [18]. Early devices such as pacemakers and
cochlear prostheses have already been commercialized, and newly
emerged devices for deep brain stimulation, intraocular pressure monitoring
and retinal/neural prosthesis are in active research and prototyping.
The above-mentioned computing platforms, which include energy delivery
subsystems, sensors/analog front ends, digital processing/control circuits
and RF subsystems, are condensed into a vanishingly small volume (e.g.,
millimeter-scale) to cause negligible intrusion into either the environment or the
human body. The nature of these applications also requires long-lifetime,
autonomous and even perpetual operation, whereas the form factor of such systems
creates a practical conflict between energy sources with limited power
densities (e.g., lithium batteries and solar, piezoelectric or thermoelectric
energy harvesting devices) and the higher integration level of the system building
blocks. Therefore, given the voltage scaling trend in modern IC
technology, aggressive energy-aware and energy-efficient design considerations are
particularly necessary for the above-mentioned applications [19–22].
Although ultra-low voltage (ULV) operation is indeed a viable solution for
achieving ultra-low power/energy in CMOS circuits, it degrades circuit robustness.
Robust ULV operation therefore becomes the most challenging task,
whereas robustness is generally not a serious issue in super-Vth circuit designs.
Consequently, conventional super-Vth circuit topologies and design methodologies
are not directly applicable in the ULV domain, owing to several practical
challenges and limitations [9].
First, device characteristics are significantly compromised when the supply
voltage approaches the device threshold voltage. The conduction (on) current
degrades exponentially and causes a significant increase in logic delay. Therefore,
sub-Vth operation incurs significant performance loss compared to its super-Vth
counterpart. In addition, the on/off current ratio is reduced, which causes
non-ideal, ratioed operation of both sub-Vth logic and memory designs.
Second, process variability introduced by fabrication technologies has become
one of the major concerns for CMOS integrated circuits, and advanced technologies
often come with higher intra-die variations due to shrinking device
dimensions. This is a serious concern for ULV designs, as further
reducing the supply voltage exacerbates the situation. Both
logic and memory suffer from higher failure rates due to device parameter
variations. Moreover, even when device parameter variations are modeled as Gaussian
(i.e., normal) distributions, the logic delay shows a non-Gaussian, asymmetric
distribution with a long tail towards the worst case, as a consequence of
the exponential dependence of the on current on the transistor threshold voltage.
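The long-tailed delay distribution can be reproduced with a small Monte Carlo
sketch. It assumes a purely exponential subthreshold current model, a single gate,
and hypothetical values for the supply voltage and the threshold voltage mean and
sigma; it is meant only to show how a symmetric Gaussian threshold voltage maps to
a skewed, approximately lognormal delay.

```python
import math
import random
import statistics

# Illustrative assumptions, not values taken from the thesis.
N_SLOPE = 1.5      # subthreshold slope factor
V_T = 0.026        # thermal voltage (V)
V_DD = 0.3         # sub-threshold supply voltage (V)
VTH_MEAN = 0.4     # nominal threshold voltage (V)
VTH_SIGMA = 0.03   # local threshold voltage variation (V)


def gate_delay(v_th):
    """Normalized delay ~ V_DD / I_on with I_on ~ exp((V_DD - v_th)/(n*vT))."""
    return V_DD * math.exp((v_th - V_DD) / (N_SLOPE * V_T))


def sample_delays(n_samples, seed=1):
    """Draw Gaussian threshold voltages and map each one to a gate delay."""
    rng = random.Random(seed)
    return [gate_delay(rng.gauss(VTH_MEAN, VTH_SIGMA)) for _ in range(n_samples)]


def skewness(values):
    """Sample skewness: zero for a symmetric distribution, positive for a right tail."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


if __name__ == "__main__":
    delays = sample_delays(20000)
    print(f"mean/median delay ratio: {statistics.mean(delays) / statistics.median(delays):.2f}")
    print(f"delay skewness: {skewness(delays):.2f}")
```

Because the delay is an exponential function of a Gaussian variable, its mean lies
above its median and its skewness is positive, which is why a corner-based margin
derived from symmetric assumptions is a poor fit for the worst-case tail.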
Third, the corresponding design methodologies and automated design flows,
such as ultra-low voltage IPs and CAD support, are not available at the moment.
As a result, effort has to be made at all design levels, such as low-voltage
logic/memory design, low-voltage Liberty IPs for synthesis and timing closure, and
physical design IPs for back-end design. In particular, excessive
design margins are inevitable in the ULV domain if the conventional corner-based
design flow is adopted, so variation-aware design flows are of particular interest.
During the past decade, the low-power community has paid considerable
attention to sub-/near-Vth designs. Sub-/near-Vth circuits have been
widely investigated, and silicon prototypes have been reported by both academia and
industry. With energy minimization as the fundamental target, optimal
sub-/near-Vth designs are often a trade-off among robustness, energy, area and
performance. For instance, ULV logic designs generally require transistor upsizing
to mitigate logic failures caused by process variations, and ULV SRAM introduces
more transistors into the basic bit-cell to achieve robust read/write operations. In
addition, large design margins are inevitable in the corner-based flow, as a higher
gate sizing effort is needed during technology mapping to meet the worst-case corner
timing requirement. Eventually, all the above facts lead to the increased energy and




[Figure 1.2: block diagram. Recoverable labels: "ASIC Design Flow for Energy/Area/Performance Optimization"; "ULV Hybrid Domain Temperature Sensor (Chapter 7)"; "Digital for Analog"; "Digital for Digital".]
Fig. 1.2: Scope of the thesis.
This thesis presents design methodologies for ultra-energy-efficient sub-/near-Vth
computing, with special emphasis on further improving the energy efficiency
beyond state-of-the-art sub-/near-Vth designs. We aim at simultaneously
achieving robust ULV operation and energy/area/performance optimization, including
a ULV ASIC design flow for near-Vth timing closure and performance boosting
through novel forward body biasing, customized circuit blocks (level shifter,
logic design and memory design), and a novel ULV design application (a nearly
all-digital temperature sensor). The thesis scope is illustrated in Fig. 1.2 and the
contributions of the thesis are elaborated as follows:
• Application-Specific Integrated Circuit (ASIC) Design Flow
First, this thesis covers an ASIC design flow with near-Vth statistical timing
analysis and design-time forward body biasing, which reduces the excessive
design margins and improves the performance compared with the
conventional design flow. A novel, computationally efficient standard cell
characterization method is proposed, and an overhead-free forward body biasing
scheme is introduced for performance boosting. We fabricated an
Advanced Encryption Standard (AES) engine as a proof of concept, demonstrating
improved performance and energy efficiency with minimum
design overhead.
• Custom Design of ULV Circuit Building Blocks
Second, this thesis focuses on the custom design and optimization of several
key building blocks for ULV circuits, including level shifter, logic, and memory
designs. The level shifter is an important interfacing circuit block for multiple-voltage
systems. In this thesis, a novel NMOS-diode current-limiter-based
topology is proposed with improved propagation delay and energy efficiency.
For logic design, we propose an energy-efficient intra-cell mixed-Vth methodology
to enhance logic cell robustness with minimum device upsizing and
energy overhead. For memory design, we investigate logic-compatible
hidden-refresh embedded DRAM as an alternative to the widely adopted
SRAM, with improved energy and area efficiency.
• ULV All-Digital Temperature Sensor
Third, this thesis introduces a temperature sensor capable of ULV operation.
Generally, ADC- and TDC-based temperature sensors are power hungry. To
address this, we propose a nearly all-digital hybrid-domain temperature
sensor. A ratioed-current PTAT sensor core and a hybrid temperature
sensing scheme are introduced. A main feature of this sensor topology is
that it is independent of any timing reference, eliminating the considerable
energy required by a frequency reference.
1.3 Organization of the Thesis
The thesis is organized as follows. Chapter 2 reviews the state-of-the-art
work related to sub-/near-Vth designs. Chapter 3 describes the ASIC
design flow with statistical timing analysis and design-time performance boosting
through forward body biasing. Chapter 4 presents an energy-efficient subthreshold
level shifter design. Chapter 5 addresses the energy overheads in logic design
through a robustness-driven mixed-Vth standard cell design methodology. Chapter
6 demonstrates the design of a logic-compatible hidden-refresh eDRAM as a viable
memory alternative to SRAM. Chapter 7 covers the design of a nearly all-digital
hybrid-domain temperature sensor. Chapter 8 concludes the thesis and discusses




State-of-the-art work related to the scope of this thesis is reviewed in this chapter,
including analytical modeling frameworks and technology implications, circuit
building blocks, circuit/architecture techniques, EDA design methodologies, and
complex ULV SoCs.
2.1 Modeling and Technology Implications
As confined by the classic alpha-power law [23], the conventional Dynamic
Voltage Scaling (DVS) [24] framework has practical limitations because it
ignores the contribution of leakage current, which continuously increases with
technology scaling into the nanometer regime. Consequently, the DVS technique has to be
revisited for its fundamental benefits when aggressive voltage scaling is applied.
The energy consumption versus voltage scaling relationship is theoretically
modeled in [6, 8] by taking the leakage energy into account. In
contrast to the quadratic reduction in dynamic energy, the leakage energy actually
CHAPTER 2. Literature Review
experiences an exponential increase due to the increased logic delay under reduced
supply. As a result, the total energy is found to be minimal when the supply
scales to around the transistor threshold, while further supply voltage reduction
is less energy-optimal. In addition, another important factor that has to be taken into
consideration is the device parameter variability in the CMOS fabrication process
[25]. When process variations are considered for sub-Vth circuits, robust operation
becomes challenging due to the exponential Id-Vgs dependence. Threshold
voltage variations are the dominant variation source, including both Vth
corner shift (inter-die or global variation) and Vth mismatch due to random dopant
fluctuations (RDF, intra-die or local variations).
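The energy trade-off described above can be illustrated with a minimal numerical sketch. It is not the model of [6, 8]: it assumes a smooth normalized current expression that is exponential below Vth and quadratic above it, a supply-independent off current (no DIBL), and hypothetical values for the threshold (0.35 V), slope factor, activity factor and logic depth.

```python
import math

V_THERMAL = 0.026   # thermal voltage at room temperature, in volts
N_SLOPE = 1.5       # assumed sub-threshold slope factor
VTH = 0.35          # assumed threshold voltage, in volts

def drive_current(vgs):
    """Smooth normalized current: exponential below Vth, quadratic above."""
    x = (vgs - VTH) / (2.0 * N_SLOPE * V_THERMAL)
    return math.log1p(math.exp(x)) ** 2

def energy_per_cycle(vdd, activity=0.05, logic_depth=40):
    """Normalized dynamic plus leakage energy of one gate over one clock cycle."""
    gate_delay = vdd / drive_current(vdd)      # ~ C * VDD / I_on with C = 1
    e_dynamic = activity * vdd ** 2            # quadratic in VDD
    # Leakage flows for the whole cycle, taken here as logic_depth gate delays.
    e_leakage = drive_current(0.0) * vdd * logic_depth * gate_delay
    return e_dynamic + e_leakage

supplies = [0.10 + 0.01 * k for k in range(91)]   # 0.10 V to 1.00 V
v_min = min(supplies, key=energy_per_cycle)
print(f"minimum-energy supply ~ {v_min:.2f} V")
```

With these assumed parameters the minimum falls strictly inside the swept range: below it, the exponentially growing delay inflates the leakage energy, and above it, the quadratic dynamic term dominates. The exact location shifts with the activity factor and logic depth.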
In view of these, whether technology scaling favors energy minimization
is debatable [26, 27]. The leakage current has grown by nearly an order of magnitude
with each technology generation for several years [28]. The leakage increase
degrades the on/off current ratio, and such degradation of device characteristics
is detrimental to the static noise margins, propagation delay and
energy dissipation. In addition, scaled technologies suffer from even higher
Vth mismatch than older technologies.
Motivated by the above, optimal technology selection is investigated
in [29] for achieving minimum energy consumption and variability in ULV
applications. Because duty cycle and maximum frequency differ vastly among
ULV applications, the optimal technology tends to be application-specific as
well. The basic rule of thumb is that advanced technologies favor
applications with a high duty cycle and a high maximum operating frequency, and vice
versa.
2.2 Circuit Building Blocks
In order to achieve minimum energy operation, CMOS circuits should be
completely functional and reliable in the sub-Vth region. Therefore, building blocks
such as logic cells, memory and level shifters are extensively studied in previous works.
2.2.1 Logic
A preliminary study on sub-Vth logic design in [7] suggests that the static CMOS
logic family with transmission-gate-based cells is preferred for minimum-voltage
operation. Theoretically, standard cell libraries with minimum-sized devices (i.e.,
small-strength logic cells) are energy-optimal in the nominal corner [6], but this design
choice may cause functional failure in other process corners.
Considering the performance loss in sub-Vth logic, several logic optimization
methods have been introduced to improve propagation delay as well as variability.
Sub-Vth logical effort [30] is proposed with a closed-form derivation of the optimal
sizing strategy for stacked transistors to achieve performance optimization.
Second-order geometrical effects are also explored for sub-Vth logic optimization, including
both the Reverse Short Channel Effect (RSCE) [31] and the Inverse Narrow Width Effect
(INWE) [32]. A statistical sizing methodology based on RSCE is also introduced in
[33].
Yield-enhanced and variation-aware sub-Vth logic designs are also explored in
[34, 35]. By introducing practical functional yield evaluation criteria, such as
the butterfly plot and output voltage level, functional yield can be quantified during
the logic design/optimization phase. The major contributors to logic failure are
Vth mismatch and the unbalanced topologies of practical CMOS logic gates (e.g.,
NAND/NOR gates). The situation can be mitigated through device upsizing, but
this incurs energy and area overheads.
Apart from the above-mentioned work, the Schmitt-Trigger logic family [36] is
especially beneficial for applications whose optimal supply voltage lies below the
minimum-energy-point supply and whose active duty cycle is extremely low, such as
always-on circuitry. However, area, energy and performance penalties are inevitable
for the Schmitt-Trigger logic family.
2.2.2 SRAM
Static Random Access Memory (SRAM) is the most widely adopted embedded
memory type, and on-chip SRAMs often dominate the area and energy
consumption of today's CMOS SoCs. Consequently, maintaining area and energy
efficiency is essential for on-chip memories. Supply voltage scaling is
also valid for SRAM to reduce both the energy consumption and the
leakage power. However, similar to sub-Vth logic design, sub-Vth SRAM design is
also challenging due to the critical design trade-off between macro density and
cell robustness. The situation is even worse when process variations (especially Vth
mismatch) are taken into account, as device mismatch is paramount in extremely
scaled SRAM bitcells [37, 38].
Sub-Vth SRAM has been an active research topic for the past decade. To
mitigate the above-mentioned challenges, revising the bitcell topology and
exploring read/write assist circuitry are viable choices to ensure low-voltage
functionality.
A revised 6T sub-Vth SRAM is demonstrated in [39] through a single-ended
transmission-gate access topology, with device upsizing and a virtual VDD/VSS
write-assist technique. Although this design adds no transistors to the bitcell,
the upsized bitcell area is 2× larger than that of a traditional 6T cell.
Based on the conventional 6T cell design, novel read-decoupled topologies
(7T [40], 8T [41-43], 9T [44], 10T [45-47]) with additional transistors or inserted
feedback cutoff transistors are proposed to enable correct read/write operation.
Sense amplifier redundancy is also explored to achieve robust operation in dense
SRAM designs [41]. In addition, a 10T bitcell with a differential sensing
scheme is proposed with bit-interleaving support [47].
In summary, the above-mentioned techniques can significantly improve
low-voltage SRAM robustness, but at the cost of increased area and energy
overheads. One possible way forward is to seek alternative memory circuits to replace
SRAM. For example, gain-cell-based embedded DRAM (eDRAM) [48] can
be a good candidate due to its small area and energy consumption. Nevertheless,
the dynamic nature of eDRAM requires solutions for efficient refresh management.
2.2.3 Level Shifter
Level shifters are key building blocks of multiple supply voltage (multi-VDD)
designs. As the prospect of sub-/near-Vth blocks is promising, the level shifter
is also desired to be operational over a wide dynamic range, from sub-threshold to
above-threshold.
Unfortunately, the conventional level shifter based on the Differential Cascode
Voltage Switch (DCVS) topology is incapable of robust conversion from subthreshold
to nominal voltage due to the significant contention caused by the limited strength of
the pull-down devices operated in the sub-Vth region.
A multiple-stage level shifter design [49] is a feasible solution, as the contention
is minimized between consecutive conversion levels using additional intermediate
power rails (300mV, 400mV, 600mV and 1.2V). However, this introduces the overhead
of generating the supply voltages via voltage regulators, which is costly in
most cases. Increasing the channel width of the pull-down devices is also helpful in
achieving robust functionality, but the introduced area and power overheads are
overwhelming, as both density and power consumption are important design
metrics for subthreshold level shifters [50]. Therefore, area-/energy-efficient level
shifters with small propagation delay are highly desired, and a few attempts have been
demonstrated [50-58].
2.3 Circuit/Architecture Techniques and Design
Automation Methodologies
2.3.1 Circuit Techniques
As the performance of sub-Vth circuits is strongly affected by the device
threshold, threshold selection and body biasing are effective circuit techniques
for achieving optimal energy and performance. For instance, exploring the multiple-Vth
devices available in nanometer technologies can improve energy efficiency, performance
and variability [111]. Body biasing is also adopted in [59-61] to mitigate
performance degradation and variability. Similarly, bootstrapping techniques exploit
the exponential dependence of the drain current on the gate overdrive voltage
[62, 63]. Through simple insertion of both positive and negative charge pump
circuits into the inverter, the gate overdrive voltage swing (0 to VDD) is expanded
to -VDD to 2VDD, showing both improved delay and reduced leakage (due
to GIDL, Gate-Induced Drain Leakage, effects).
Architecture-level techniques such as parallelism and pipelining are investigated
for sub-/near-Vth designs. Efficient parallelism with near-Vth operation is very
popular for medium-throughput multimedia applications [61, 64, 65].
The optimal number of pipeline stages in the sub-/near-Vth region is significantly
affected by Vth mismatch. Reducing the number of pipeline stages, and thus lengthening
the critical path, permits more averaging of device mismatch, but at the cost of degraded
throughput [25]. Latch-based pipelining [66] is therefore revisited in the sub-Vth
region for variation-tolerant pipeline design with significantly improved performance
and energy efficiency.
2.3.2 Design Automation Methodologies
The dominance of process variations is also detrimental to low-voltage
circuit timing analysis. Both major timing constraints (i.e., setup and
hold time) are affected by process variations. Conventional Deterministic
Static Timing Analysis (DSTA) is pessimistic because the corner models are extracted
from extremely rare situations, so the DSTA approach comes with unrealistic design
margins (e.g., over-estimated setup/hold violations) [67]. To reduce such
design margins, variation-aware approaches, such as Monte Carlo (MC) simulation
[25] or Statistical Static Timing Analysis (SSTA) [101], are preferred. SSTA
is more computationally efficient than MC simulation.
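The pessimism of a corner-style bound relative to a statistical one can be sketched for a simple path. The sketch assumes independent Gaussian gate delays, which is a deliberate simplification (near-Vth delays are skewed), and the path depth, mean and sigma are hypothetical normalized values. It compares a bound where every gate sits at its own +3 sigma delay, a path-level +3 sigma estimate, and an MC quantile.

```python
import math
import random

def corner_delay(depth, mu, sigma, k=3.0):
    """Corner-style bound: every gate sits at its own +k-sigma delay."""
    return depth * (mu + k * sigma)

def statistical_delay(depth, mu, sigma, k=3.0):
    """Path-level +k-sigma delay for independent Gaussian gate delays."""
    return depth * mu + k * sigma * math.sqrt(depth)

def monte_carlo_quantile(depth, mu, sigma, q=0.99865, runs=20000, seed=7):
    """Empirical q-quantile of the path delay (q ~ one-sided 3 sigma)."""
    rng = random.Random(seed)
    totals = sorted(sum(rng.gauss(mu, sigma) for _ in range(depth))
                    for _ in range(runs))
    return totals[int(q * (runs - 1))]

depth, mu, sigma = 15, 1.0, 0.2   # hypothetical 15-gate path, normalized delays
print("corner bound     :", corner_delay(depth, mu, sigma))
print("statistical bound:", statistical_delay(depth, mu, sigma))
print("MC 3-sigma point :", monte_carlo_quantile(depth, mu, sigma))
```

In this toy setting the statistical estimate tracks the MC quantile at the cost of two closed-form evaluations, while the corner-style bound is noticeably larger because local variations average out along the path.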
Also, hold time violations are independent of the clock frequency but strongly
correlated with the clock skew determined by the clock tree topology. Unlike
in super-Vth multi-stage clock tree designs, ULV clock skews are dominated by
local variations. As a result, a shallow clock tree with large clock tree buffers is
proposed for efficient clock tree design [68].
Previous body biasing designs ignore the impact of body biasing at design time;
therefore, over-dimensioning is inevitable during technology mapping. A sub-Vth
body-biasing-driven synthesis flow is introduced in [69]. However, its major drawback is
that this flow is still corner-based, and thus still has
overly conservative design margins.
2.4 SoC Designs
Due to their compelling energy benefits, sub-/near-Vth circuits and systems
have drawn considerable attention from both academia and industry. During the past
few years, several ULV SoC prototypes have been demonstrated for various applications.
A brief review of the ULV design progress is given below.
Embedded processors (e.g., micro-controller units, MCUs) with general-purpose
instruction sets offer decent flexibility; therefore, they are the most popular candidates
for wireless sensor node applications [70-76]. These MCU-based platforms span
a wide performance spectrum, ranging from hundreds of kHz for
data sensing and logging applications [72] to tens of MHz for future IoT
applications demanding more computing power [76].
However, this architecture is less energy-efficient for biomedical applications
with demanding digital signal processing requirements. Accelerator-based platforms
[77-81] are therefore preferred to achieve minimum energy for these specific
applications, with dedicated hardware accelerators such as FFT, FIR and CORDIC.
A novel computing architecture with a coarse-grained reconfigurable processor
is also demonstrated by IMEC and Samsung for emerging ambulatory biomedical
signal processing applications [81]. The industrial prospects of sub-/near-Vth circuits
are very promising, as Intel has also developed several wide-dynamic-range,
high-performance and energy-efficient hardware accelerators (motion estimation
[82], SIMD vector processing [83], encryption [84], floating-point multiply-add unit
[85], etc.). Also, an IA-32 Pentium processor [86] is demonstrated with
encouraging energy efficiency benefits.
Finally, self-contained sensor node SoCs with fully integrated energy sources
with power management units, sensors with AFEs, clocking, RF and ULV digital




Near-Vth ASIC Design: Statistical
Timing Analysis and Performance
Boosting
Near-threshold operation enables high energy efficiency at the cost of significant
performance loss and increased sensitivity to process variations. In this chapter, we
address both issues with two synergistic approaches. First, we introduce a novel
statistical design methodology to efficiently and accurately evaluate the guardband,
thereby keeping the near-threshold energy/performance cost of variations to a
minimum. Second, we introduce a novel body biasing technique to mitigate
the performance loss at near-threshold voltages without requiring any additional
circuitry for body biasing. Based on these ideas, a 65nm highly energy-efficient
AES testchip is presented as a case study. Experimental results show a maximum
throughput of up to 12.2Mbps with an energy of 1.65pJ/bit at 0.5V, i.e., a 22× and 7.8×
CHAPTER 3. Near-Vth ASIC Design: Statistical Timing Analysis and Performance Boosting
improvement over previous designs implemented in the same technology
node. The proposed techniques also reduce area by 28% compared with a previous design in
the same technology, and enable reliable operation over a wide voltage range (0.5V
to 1.2V) without any additional post-silicon tuning or body bias feedback control,
as opposed to traditional body biasing schemes.
The remainder of this chapter is organized as follows. Section 3.1 describes
the background and motivation of this work. Section 3.2 introduces the Surrogate
Model Adjustment-based Statistical Static Timing Analysis (SMA-SSTA) methodology.
Section 3.3 details Area-Overhead-Free Body-Biasing (AOF-BB) with the
novel SelF-Body-Biasing (SFBB) scheme for ULV design. Design considerations
for area-/energy-efficient AES architectures are given in Section 3.4. Section 3.5
covers the testchip implementation and the detailed automated flow under the
SMA-SSTA/SFBB framework. Section 3.6 reports the measurement results.
Conclusions are drawn in Section 3.7.
3.1 Background and Motivation
Ultra-low voltage (ULV) operation is a popular approach to achieve high energy
efficiency [91]. Minimum energy per operation is typically obtained when the
supply voltage is close to the transistor threshold voltage, although this comes
at the cost of significantly degraded performance and greater sensitivity to process
variations [12]. The performance issue stems from the low transistor speed;
hence, it can be addressed by making transistors faster (e.g., through forward body
biasing). The variability issue is typically addressed by adding a design margin
that permits correct operation even in the worst case. Such a margin needs to be
accurately evaluated to avoid excessive pessimism (i.e., performance/energy/area
degradation) or optimism [92] (i.e., yield degradation). Consequently, practical
design challenges arise for ULV timing analysis and for performance boosting
techniques such as forward body biasing.
3.1.1 ULV Timing Analysis Challenges
ULV timing analysis is significantly affected by local (within-die) and global
(die-to-die) process variations [92]. Traditional Deterministic Static Timing
Analysis (DSTA) is well known to be pessimistic due to its large design margin.
Accurate evaluation of the required design margin can be achieved by Monte Carlo
analysis [9, 25, 74], but it entails an extremely high computational effort. Thanks
to its better computational efficiency, Statistical Static Timing Analysis (SSTA) is
more practical [92]. SSTA requires a preliminary timing characterization of logic
gates under global/local variations. Unfortunately, standard cell characterization
for local variations is computationally intensive in ULV designs, and asymmetric
delay Probability Distribution Functions (PDFs) further complicate the
characterization [25].
Previous solutions [101, 102] entail a significant computational effort for standard
cell characterization due to the increased number of characterization points
(e.g., using 0.5σ as the interval between ±3σ) needed to deal with the nonlinearities
across the variation space. Also, the number of required circuit simulations is
proportional to the number of transistors in a standard cell and to the number of parameters
under process variation. This approach also increases the computational effort of
statistical timing analysis due to the non-linear dependence between delay
and slew.
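To give a feel for how this characterization cost grows, the following back-of-the-envelope sketch counts simulations under the assumption that each varied parameter of each transistor is swept one at a time over the ±3σ grid. That sweep structure, the function names and the example cell size are illustrative assumptions and are not taken from [101, 102].

```python
def sweep_points(step_sigma=0.5, span_sigma=3.0):
    """Number of grid points between -span and +span, both ends included."""
    return int(round(2 * span_sigma / step_sigma)) + 1

def simulations_per_arc(n_transistors, n_parameters,
                        step_sigma=0.5, span_sigma=3.0):
    """One-at-a-time sweeps: transistors x parameters x grid points."""
    return n_transistors * n_parameters * sweep_points(step_sigma, span_sigma)

print(sweep_points())              # 13 points for a 0.5-sigma step over +/-3 sigma
print(simulations_per_arc(4, 3))   # 156 for a hypothetical 4-transistor cell, 3 parameters
```

The count applies per timing arc and per slew/load condition, so it multiplies again across a full library, which is why reducing the number of parameters and characterization points matters.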
3.1.2 ULV Body Biasing Challenges
Body biasing can directly modulate the threshold voltage of MOSFETs, making it
an effective knob to boost performance at near-threshold voltages. However, this
technique is typically introduced as a post-silicon tuning technique or adaptive
tuning solution [59-61]. Since these works do not include body biasing information
during technology mapping, the gate strength and hence area are largely oversized.
To partially limit this large area penalty, the works in [69, 93] include some limited
body biasing information based on corner analysis. However, these frameworks
are still largely pessimistic since they are based on corner and DSTA analysis.
Another issue of body-biasing design is the overhead caused by the body bias
voltage generation circuitry and the related feedback control for post-silicon
tuning. These additional circuits make SoC integration and verification more
complicated, and are responsible for significant area/energy overhead. For instance,
the body biasing circuitry takes up 18% of the total energy in [60].
In the remaining part of the chapter, we address the above issues by introducing
two synergistic techniques that respectively estimate the statistical
design margin efficiently and improve the transistor speed. As the first technique,
excessive design margin is suppressed through the novel Surrogate Model Adjustment
Statistical Static Timing Analysis (SMA-SSTA). SMA-SSTA accurately evaluates the strictly
required design margin with considerably reduced computational effort compared
to previous statistical analysis. The second technique is a novel SelF-Body-Biasing
(SFBB) scheme to boost performance with zero circuit overhead (i.e., it does not
need any external component such as a body bias generator/controller).












































[Figure 3.1: flow diagrams. The only recoverable label is "Step 4: Surrogate Model Adjustment".]
Fig. 3.1: Conceptual illustration of (a) feed-forward SSTA, (b) feed-back based SMA-SSTA and
(c) detailed flow chart of SMA-SSTA.
3.2 Proposed Surrogate Model Adjustment based
SSTA (SMA-SSTA)
The traditional SSTA flow involves two basic steps, as shown in Fig. 3.1(a). The
first step is the accurate characterization of standard cell libraries, and the second
step is the statistical timing analysis based on the characterized libraries. The first
step is well known to be very computationally expensive [9].
To drastically reduce the computational effort of the traditional SSTA flow,
we propose a Surrogate Model Adjustment-based SSTA (SMA-SSTA) framework,
as shown in Fig. 3.1(b). It uses surrogate modeling to reduce the computational
effort of both standard cell characterization (step one) and timing analysis (step
two) in the traditional SSTA flow. Our contribution is on step one, whereas for
step two we rely on the work in [103], which assumed gate-level timing variations
to be known up front, without providing a solution for step one.
A detailed flow chart of the proposed SMA-SSTA framework is illustrated in
Fig. 3.1(c). The flow starts from the statistical HSpice model and consists of the
following key steps:
1. To reduce the number of parameters under variation, we perform Design of
Experiments (DoE) based sensitivity analysis to identify the most impactful
variation parameters.
2. The standard cell timing is characterized under both global and local variations,
considering only the impactful parameters identified in the previous step.
Global variations are treated normally. For simplicity, local variations are
regarded as fully correlated within each standard cell (but the characterization
results will be treated as uncorrelated during step 4).
3. A reference datapath is statistically characterized through Monte Carlo
simulations (reference datapath selection will be discussed later on).
4. The pre-characterized surrogate timing models are fed to a linear SSTA tool
to estimate the timing of the reference datapath from step 3 in a set of cases of
interest (e.g., μ + 3σ for setup time check). Then, the SSTA input parameters
related to local variations (i.e., σ) are iteratively adjusted to match its timing
predictions with the MC simulation results obtained in step 3. This
improves the surrogate model accuracy, despite the inaccuracy introduced
by the approximation made in step 2.
Since the above framework is based on the adjustment of a surrogate model, it
preserves the intrinsic correlations and statistical properties of timing, as discussed
in [103]. Once the surrogate model is calibrated at step 4, it can be used for timing
analysis of arbitrary datapaths. In the following, we provide the details of the
above steps.
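The adjustment loop of step 4 can be sketched as a one-dimensional search. The sketch below is not the thesis tool flow: it stands in for the linear SSTA tool with a closed-form path model (global variation fully correlated along the path, local variation uncorrelated between cells), uses bisection as an assumed search strategy, and takes a hypothetical MC target value. All names and numbers are illustrative.

```python
import math

def ssta_mu_plus_3sigma(depth, mu_gate, sigma_global, sigma_local, scale):
    """Stand-in for a linear SSTA estimate of the path mu + 3*sigma.

    Global variation adds linearly along the path; local variation adds in
    quadrature and is multiplied by the adjustable scale factor.
    """
    var_global = (depth * sigma_global) ** 2
    var_local = depth * (scale * sigma_local) ** 2
    return depth * mu_gate + 3.0 * math.sqrt(var_global + var_local)

def calibrate_scale(target, depth, mu_gate, sigma_global, sigma_local,
                    lo=0.0, hi=10.0, tol=1e-6):
    """Bisection on the local-sigma scale until the estimate matches the target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        estimate = ssta_mu_plus_3sigma(depth, mu_gate, sigma_global,
                                       sigma_local, mid)
        if estimate < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mc_target = 19.0   # hypothetical MC mu + 3*sigma of the reference datapath
scale = calibrate_scale(mc_target, depth=15, mu_gate=1.0,
                        sigma_global=0.05, sigma_local=0.2)
print("calibrated local-sigma scale:", round(scale, 4))
```

Bisection works here because the estimate grows monotonically with the scale factor. Once calibrated on the reference datapath, the same scale would be reused for other paths, which mirrors the reuse of the calibrated surrogate model described above.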
In step 1, the most impactful variation parameters are identified through a
delay sensitivity analysis of a reference logic gate. In detail, we evaluated the delay
variation of a 1× inverter at different points of interest (i.e., -3σ to +3σ with a 1σ
step) and selected only the variation parameters that led to the most significant
delay variation.
In step 2, for simplicity we assumed that all transistors have the σ value pertaining
to a single-finger device, regardless of their size. The inaccuracy introduced
by this simplification is then mitigated through the calibration at step 4.
In step 3, based on [103], the reference datapath is selected as an appropriate
portion of the critical path identified from a preliminary synthesis based on DSTA.
The logic depth of the portion of the critical path that is adopted as the reference
datapath is chosen differently for setup and hold time checks, as discussed in the
following.
Fig. 3.2 shows the MC simulation results of the delay variability versus logic
depth for two 40-stage inverter chains (1× and 2× strength) at 0.5V. As the
logic depth increases, σ/μ of global variations is basically constant regardless of
cell strength and logic depth, while σ/μ of local variations derates with
logic depth. As a consequence, when increasing the logic depth, σ/μ under both
global and local variations also derates and converges to a lower bound set by global
variations (see Fig. 3.2). From this plot, for setup time check, the logic depth
of the reference datapath has to be small enough to capture a significant amount
of local variations, but not so small as to be unrepresentative of practical
near-threshold designs. Accordingly, we set the logic depth to 15, which is certainly
on the slow side of practically adopted logic depths [66].
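The derating trend described above is consistent with a simple first-order model, given here as a sketch rather than as the fit used in this work. It assumes a path of N identical stages with mean stage delay \mu_gate, a global standard deviation \sigma_g that is fully correlated across stages, and a local standard deviation \sigma_l that is independent from stage to stage; these symbols are introduced only for this illustration.

```latex
\left(\frac{\sigma}{\mu}\right)_{\text{path}}
  = \frac{\sqrt{(N\sigma_g)^2 + N\sigma_l^2}}{N\mu_{\text{gate}}}
  = \sqrt{\left(\frac{\sigma_g}{\mu_{\text{gate}}}\right)^2
        + \frac{1}{N}\left(\frac{\sigma_l}{\mu_{\text{gate}}}\right)^2}
  \;\xrightarrow{\,N\to\infty\,}\; \frac{\sigma_g}{\mu_{\text{gate}}}
```

The global term is independent of N, the local term falls as 1/\sqrt{N}, and the combined ratio converges to the global bound, which matches the qualitative behavior reported for Fig. 3.2.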
[Figure 3.2: plot of delay variability (σ/μ) versus logic depth. Legend: σ/μ global + local; σ/μ global; σ/μ local.]
Fig. 3.2: Timing derating feature of logic datapath under local variations.
For hold time check, in principle it would be possible to perform model adjustment
as is done for setup time check, but we need to consider that hold time
violations are typically a major issue in ULV designs [9]. Accordingly, we introduced
some slight pessimism by choosing a reference datapath with minimum-sized
cells and large clock buffers, as actually desirable in ULV designs [102]. The resulting
design margin is considerably smaller than that of DSTA methods. Details on
the accuracy and runtime of the above discussed SMA-SSTA flow are reported in
Section 3.4.1.
3.3 Area-Overhead-Free Body-Biasing Techniques
In this section, area-overhead-free body-biasing (AOF-BB) schemes are discussed.
In such schemes, no additional circuitry (e.g., voltage generation/control)
or routing resources are needed for body biasing. After highlighting the limitations
of existing AOF-BB schemes, a novel biasing scheme to circumvent those
limitations is introduced.
Fig. 3.3: (a) Cross-section of the deep N-well technology and (b) parasitic diode connections of
a CMOS inverter under three AOF-BB schemes (ZBB, LVSB, proposed SFBB).
3.3.1 Conventional AOF-BB Schemes and Limitations
Fig. 3.3(a) shows the cross-section of an inverter cell in a triple-well process,
where the P-well (PW) and N-well (NW) are located in the Deep N-well (DNW). Among
the possible P- and N-well biasing strategies, the conventional Zero Body Biasing
(ZBB) scheme ties NW to VDD and PW to VSS. Since all parasitic diodes are
always reverse biased, ZBB circuits can reliably operate from super-threshold to
sub-threshold.
When VDD is constrained to be on the order of 0.5V, Low-Voltage Swapped
Body Biasing (LVSB) can also be adopted [104], where the P- and N-well voltages
are swapped compared to ZBB. This is an extreme forward body biasing case and
can achieve significant performance boosting. The supply voltage limitation of
LVSB is imposed by diodes D1, D2 and D5 in Fig. 3.3(b). Approximately, operation
above 0.5V causes a large current flow through the forward-biased diodes. The
conduction of D3 and D4 also degrades the output voltage swing and remarkably
increases the transistor junction capacitance, thereby further degrading energy
efficiency.
3.3.2 Proposed SelF-Body-Biasing Scheme
To maintain the flexibility and reliability of ZBB while boosting transistor
performance similarly to LVSB, we propose a novel SelF-Body-Biasing (SFBB)
scheme, as shown in Fig. 3.3(b). For SFBB, the body terminals of all NMOS and
PMOS devices are directly connected to each other without being tied to external
supplies, resulting in a self-biasing node that is applied to both NW and PW. The SFBB
scheme has two advantages:
1. The P-to-N well diode (D1) is shorted by the connection between the VBN
and VBP nodes under the SFBB scheme, thereby eliminating its leakage
current altogether.
2. D2-D5 form a diode network between VDD and ground and lead to a
deterministic self-bias voltage. This diode network acts as a voltage divider and
therefore ties the body to an intermediate voltage between ground and VDD;
hence, the resulting body biasing scheme can be used over a much wider range
of supply voltages than LVSB.
To understand how the body voltage is set in the SFBB scheme, in the following
we introduce the Equivalent Diode Network Model (EDNM) to evaluate the
self-biased body voltage. We first consider the case of a single gate, in order to gain
an insight into the impact of the cell topology. Then, we generalize to the case
of multiple gates sharing the same substrate, to understand the impact of the
gate-level topology.
Analysis of a single logic cell
Fig. 3.4 illustrates the SFBB scheme in detail for an inverter cell and
a 2-input NAND cell. As shown in Fig. 3.4(a), the D2 and D5 diodes are connected
in series between VDD and ground, while the effect of D3 and D4 on the
self-biasing node voltage is determined by the corresponding output voltage level.
Assuming OUT is high, the equivalent diode network is the parallel combination of
D2 and D3 in series with D5. D4 is reverse-biased and thus can be regarded as an open circuit.
Assuming for simplicity that all diodes are identical, the self-bias node voltage is
2/3 VDD. Similarly, the self-bias body voltage when OUT is low is 1/3 VDD.
For a 2-input NAND cell, the resulting EDNM is shown in Fig. 3.4(b).
Analysis shows that the A high/B low input vector leads to the self-bias node
voltage that is farthest from VDD/2. In this case, the EDNM consists of four parallel
diodes connected in series with a single diode, thereby leading to a bias
voltage of 4/5 VDD. Similarly, a 2-input NOR cell has a body voltage of 1/5 VDD.
This simplified model was extensively validated through HSPICE simulations, and
was found to be accurate within 15%.
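The divider ratios quoted above can be reproduced with a short sketch. It assumes, as the simplified EDNM does, that every conducting diode behaves as an identical linear conductance and that reverse-biased diodes are open circuits. The function name `self_bias_fraction` and its arguments are illustrative and do not come from the original analysis.

```python
from fractions import Fraction


def self_bias_fraction(n_to_vdd: int, n_to_gnd: int) -> Fraction:
    """Body voltage as a fraction of VDD for a simplified EDNM divider.

    Assumption: each conducting diode is an identical linear conductance,
    so n parallel diodes contribute n units of conductance.
    """
    if n_to_vdd < 0 or n_to_gnd < 0 or n_to_vdd + n_to_gnd == 0:
        raise ValueError("need at least one conducting diode")
    return Fraction(n_to_vdd, n_to_vdd + n_to_gnd)


# Inverter, OUT high: two parallel diodes (D2, D3) in series with D5.
inv_out_high = self_bias_fraction(2, 1)
# Inverter, OUT low: the mirrored network.
inv_out_low = self_bias_fraction(1, 2)
# 2-input NAND, worst-case vector: four parallel diodes in series with one.
nand2_worst = self_bias_fraction(4, 1)
# 2-input NOR, worst-case vector: the mirrored network.
nor2_worst = self_bias_fraction(1, 4)

print(inv_out_high, inv_out_low, nand2_worst, nor2_worst)
```

Under these assumptions the sketch returns 2/3, 1/3, 4/5 and 1/5 of VDD; a real diode network is nonlinear, which is one plausible source of the residual error reported against circuit simulation.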




































Fig. 3.4: Illustration of the EDNM model of (a) an inverter cell and, (b) a 2-input NAND cell for
SFBB scheme.


































































Fig. 3.5: Illustration of the EDNM model for SFBB scheme of (a) an inverter cell, (b) a NAND
cell, (c) a 3-stage ring oscillator, (d) simulation waveform of the self-bias node voltage fluctuation
of the 3-stage RO SFBB and, (e) timing potentials of the LVSB and SFBB scheme over the ZBB
scheme.
Analysis of multiple cells
We use two inverters to illustrate the impact of the gate-level topology on the
self-bias node voltage. First, we consider the parallel connection in Fig. 3.5(a). Since
the output nodes of the two inverters are identical, the diode network of the two
inverters is the same as in the single-inverter case, and the voltage divider ratio is
therefore unchanged. In the case of two cascaded inverters in Fig. 3.5(b), the two
inverters have opposite output voltage levels, which results in a balanced diode network.
As a result, the self-biased body voltage is VDD/2.
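The two-inverter cases can be checked with the same simplified divider, by pooling the diodes of all cells that share the body node. This is a minimal sketch assuming identical linear diode conductances; the `INVERTER` table and the function `shared_body_fraction` are hypothetical names introduced only for illustration.

```python
from fractions import Fraction

# (diodes towards VDD, diodes towards ground) contributed by one inverter,
# following the single-cell analysis: OUT high -> (2, 1), OUT low -> (1, 2).
INVERTER = {"high": (2, 1), "low": (1, 2)}


def shared_body_fraction(output_levels):
    """Self-bias voltage (fraction of VDD) for inverters sharing one body node."""
    up = sum(INVERTER[level][0] for level in output_levels)
    down = sum(INVERTER[level][1] for level in output_levels)
    return Fraction(up, up + down)


parallel = shared_body_fraction(["high", "high"])  # identical output levels
cascaded = shared_body_fraction(["high", "low"])   # opposite output levels

print(parallel, cascaded)
```

With identical outputs the ratio stays at the single-inverter value, while opposite outputs balance the network at VDD/2, in line with the discussion above.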
The above observations indicate that both the cell topology and the gate-level
topology impact the self-bias node voltage. In actual designs, it is necessary to
evaluate the practical range of the body voltage, so that cells are correctly
characterized. For cells in low-voltage timing libraries, we found as expected that the
worst-case individual logic cells are the NAND/NOR gates. The corresponding
voltage levels for isolated gates are around 80% VDD and 20% VDD, respectively.
However, as suggested by intuition, averaging takes place in practical circuits
consisting of a large number of different logic gates. As a consequence, the
practical range of the body voltage is narrower, and the body voltage fluctuates very
little while the network is switching. For example, the self-bias body voltage waveform
of a 3-stage ring oscillator in Fig. 3.5(c-d) fluctuates by only 40mV around
VDD/2. Extensive analysis of a wide range of circuits showed that the self-bias node
voltage is always well within the 30% to 70% VDD range. To characterize the cells
accordingly, we adopt a strategy that is similar to FDSOI standard cell libraries
[105]. In particular, bias bounds of 30% VDD and 70% VDD are applied to the VBN and
VBP nodes for characterization, respectively.
As LVSB and SFBB are forward body biasing, they can inherently offer a












Fig. 3.6: Illustration of the AOFBB well tap designs and body-biasing floorplan under a tap-less
standard cell library technology.
considerable performance improvement over ZBB. Also, as opposed to LVSB, the
proposed SFBB can operate at supply voltages that are above 0.5V and up to the
nominal voltage, since the forward body bias voltage is a fraction of VDD rather
than the entire VDD. This is clearly shown in Fig. 3.5(e), which plots the simulated
3-stage ring oscillator frequency at 25°C for the three AOF-BB schemes above.
From this figure, LVSB stops working beyond 0.5V due to the abrupt leakage
increase and the logic swing degradation discussed above, while the proposed SFBB
is able to operate correctly up to 1.2V and achieves up to 40% higher operating
frequency than ZBB over the entire operation range.
In addition to the above advantages, SFBB maintains the desirable feature of
ZBB and LVSB that body biasing does not require any area overhead, as the
body is biased through proper body tap cells. Such cells short the n-well and the
p-well, and their layout is shown in Fig. 3.6. Similarly to LVSB and ZBB, body
tap cells are regularly placed in the layout rows by the Place&Route tool.
3.4 Case Study: Advanced Encryption Standard
In this work, the above techniques are used to design a near-threshold
Advanced Encryption Standard (AES) testchip in 65nm CMOS. Fig. 3.7 summarizes the
throughput-energy features of previously published AES prototypes, which focus
on either high throughput (Gbps or more) [94-96] or very low power and low
throughput [97-99]. Compared to low-throughput designs, the energy/bit of some
high-throughput designs can be significantly smaller (down to pJ/bit or even
slightly lower). On the other hand, the area of high-throughput designs is significantly
larger than that of low-throughput designs (by 4-17×). Hence, a gap exists around
medium data rate applications (tens of Mbps) with intermediate energy and compact
area, compared to the existing designs.
The adoption of the proposed techniques fills this gap, as our
near-threshold¹ AES core in 65nm CMOS achieves 12.2Mbps (38.4Mbps) throughput
with 1.65pJ/bit (2.34pJ/bit) energy at 0.5V (0.6V). At the same time, our AES
engine occupies only 0.013 mm² of silicon area. Hence, our techniques achieve
the lowest energy (by 5-10× or more) compared to previous designs for
sensor nodes, while increasing the achievable throughput to tens of Mbps and
keeping area comparable or better.
¹ In the adopted technology, the NMOS threshold voltage is 0.6V.
























Fig. 3.7: State-of-the-art AES designs: energy vs. throughput and area (scaled to 65 nm node
for the different adopted technologies).
3.4.1 Low-Cost AES Architectures and S-Box Implementation
In the following, the selection of the architecture organization and the
S-BOX implementation are discussed with the goal of enhancing the efficiency of
the energy-area-performance tradeoff.
Area-efficient AES core implementations typically adopt a folded datapath
microarchitecture to reduce the gate count [97-100], as opposed to very high-performance
targets, whose area can be higher by an order of magnitude (see, e.g., [96]). In
area-efficient designs, area tends to be dominated by memory, as shown by the
area breakdown in Table 3.1 of the area-efficient designs in Fig. 3.7. Hence, the
memory organization is a key lever to optimize the area-performance tradeoff.
In unified-RAM architectures (e.g., [97, 98]), data and key are fetched,
processed and stored in a single RAM block. These architectures suffer from low
Table 3.1: Comparison of state-of-the-art area-efficient AES architectures
                        Unified-RAM   Split-RAM   Shift Register
                        [97],[98]     [99]        [100]
Memory                  60%           55%         59%
S-Box                   12%           7%          16%
MixColumn               9%            8%          9%
Control & Others        19%           30%         16%
Cycles/Block            1134          356         160
Gate Equivalent (GE)    3.4K          5.5K        4K
Silicon Prototype       Yes           Yes         No
performance due to the large number of clock cycles required for round
computation and key expansion. Instead, split-RAM architectures (see, e.g., [99]) have two
separate RAMs for data and key. This architecture enables key expansion to be
interleaved with round computation while using a single S-BOX, thereby reducing
the number of cycles (i.e., increasing throughput) by 2.9×. This advantage comes
at the price of more complex and larger-area control logic, as confirmed by Table
3.1. As an alternative, the architecture in [100] uses a shift-register based memory,
which efficiently performs byte permutations in both data (round computation)
and key processing (key expansion). The presence of an additional S-BOX allows
round computation and key expansion to be performed simultaneously [100], thereby
improving the performance by 7.1× at only a 14.7% area penalty compared to
the unified-RAM architecture [98].
Another key choice to optimize the area-energy-performance tradeoff is the
appropriate selection of the S-BOX implementation. Indeed, although the S-BOX
occupies a moderate fraction of the AES area (see Table 3.1), it lies on the critical path
and hence it strongly impacts the overall performance-energy tradeoff (especially
at near threshold, as the substantial contribution of leakage energy is lowered when






[106] 263.5 75.3 19842
[107] 335 62.9 21072
[108] 276 66 18216
[96] 324.5 45.3 14700
reducing the cycle time [66]). In our AES design, we adopt the native Composite Field
(CF) S-BOX design as in [96], which removes the overhead of mapping and inverse
mapping from GF(2^8) to CF. In the adopted 65nm technology, this reduces the
S-BOX delay and area by 28% and 20%, respectively, as shown in Table 3.2. However,
the native CF AES design causes additional delay overhead in the MixColumn unit
(3 XOR delays). When considering the 5 XOR delay reduction obtained in the native
S-BOX, the overall critical path delay reduction is 2 XOR delays, which yields around
10% delay reduction. The detailed AES architecture with both native round and
key S-BOX is depicted in Fig. 3.8.
3.4.2 Automated and Detailed SMA-SSTA Design Flow
A commercial standard cell library was pruned by discarding cells with
fan-in larger than two [59]. Also, transmission gate structures connected at the
input/output pins of cells were buffered to avoid sneak leakage paths, and
minimum-strength cells were eliminated to limit random variations. The resulting library
comprises 25 different types of cells with different strengths, leading to 116 cells in
total.
Regarding step 1 of the SMA-SSTA flow in Section 3.2, the identification
Fig. 3.8: The AES engine architecture with native round and key expansion S-Box.
of the main variation sources is performed through Design of Experiments (DoE)
based on a delay sensitivity analysis on 1 inverter. The Pareto set plot pertaining
to the variation parameters, with delay normalized to its mean value, is shown in Fig.
3.9. From this figure, in the considered 65nm technology, the dominant sources of
variations are the threshold voltage and the oxide thickness. The oxide thickness
variations are global, while threshold voltage variations are both global and local.
Regarding step 2 of the SMA-SSTA flow, the cell characterization was
performed at 0.5V for all three AOF-BB schemes and at 0.6V for ZBB and
SFBB only, due to the 0.5V voltage limitation of LVSB. As the outcome of step 2,
the resulting surrogate cell delay model of local variations is stored in the form of a
Composite Current Source (CCS) model, along with the global variations [110].
The assumptions and approximations introduced in the related steps 3 and 4 in
Section 3.2 were validated extensively. The model adjustment was performed on a
15-stage critical sub-circuit, and the adjusted model was then validated on several
other randomly selected datapaths with logic depths ranging from 5 to 25.
Fig. 3.10 depicts some





























Fig. 3.9: Pareto set plot of DoE for delay sensitivity analysis toward different variation parameters.































Fig. 3.10: +3 setup time accuracy analysis of SMA-SSTA after model adjustment.
of the results, and shows that the error is moderate at low depths, and is below
7% for logic depths of ten or more, compared to Monte Carlo simulations. This
confirms the correctness of the design considerations on the logic depth of the
reference datapath in Section 3.2. Despite the drastic simplification offered by
the proposed models, this error is as small as that encountered in SSTA predictions
[101]. Regarding the hold fix, the area and power cost of buffer insertion
for hold fixing under SMA-SSTA was found to be 4% and 6%, respectively, at 0.5V
under the ZBB scheme. Under the DSTA approach (using the FF corner as the worst
case as usual), the hold time fix overhead is 23% and 18% for area and power,
respectively. This confirms that SMA-SSTA avoids over-design, as opposed to
DSTA.
3.4.3 Runtime for Local Variation Characterization and
Comparison with SSTA
The SMA-SSTA flow enables a significant reduction in the runtime needed for
cell characterization under local variations. In our SMA-SSTA flow, only (2Nrv+1)
SPICE simulations are needed for each delay/slew point,
where Nrv is the number of random variables [110]. For NLOPALV in [101], the
number of SPICE simulations required to characterize each cell is on the order of
O(Nrv·Ntr·Niter), where Ntr is the number of transistors within a logic cell and Niter
is the number of iterations required for operating point convergence. In addition, a total of
13 characterization points (Nchar) is adopted from -3σ to +3σ (spacing = 0.5σ).
Accordingly, the runtime of SMA-SSTA relative to that of NLOPALV can be
Fig. 3.11: Statistical distribution of the delay (normalized to the μ−3σ delay under SFBB at 0.6V) for
different body biasing schemes and resulting clock frequency improvement over DSTA (top-right).
expressed as
\[
\frac{2N_{rv} + 1}{\left(0.28\,N_{rv}N_{tr} + 0.23\,N_{char} + 1\right)N_{iter}} \tag{3.1}
\]
where the coecient in the denominator is the empirical value adopted from [101].
For example, for a 2-input NAND gate with Nrv=2, Ntr=4 and Niter=3, the run-
time of the proposed framework is only 26.8% of NLOPALV. The runtime advantage
of SMA-SSTA is further improved for cell topologies with higher complexity.
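The NAND example can be reproduced numerically. The following sketch evaluates expression (3.1) with the coefficients quoted above and derives the number of characterization points from the -3σ to +3σ range with 0.5σ spacing; the function names are illustrative and are not part of the SMA-SSTA flow itself.

```python
def sma_ssta_simulations(n_rv: int) -> int:
    """SPICE runs per delay/slew point in SMA-SSTA: 2*Nrv + 1."""
    return 2 * n_rv + 1


def nlopalv_simulations(n_rv: int, n_tr: int, n_char: int, n_iter: int) -> float:
    """Denominator of (3.1), using the empirical coefficients 0.28 and 0.23."""
    return (0.28 * n_rv * n_tr + 0.23 * n_char + 1) * n_iter


def runtime_ratio(n_rv: int, n_tr: int, n_iter: int, n_char: int) -> float:
    """Runtime of SMA-SSTA relative to NLOPALV, i.e. expression (3.1)."""
    return sma_ssta_simulations(n_rv) / nlopalv_simulations(n_rv, n_tr, n_char, n_iter)


# Characterization points from -3 sigma to +3 sigma with 0.5 sigma spacing.
n_char = int(round((3.0 - (-3.0)) / 0.5)) + 1

# 2-input NAND gate: Nrv = 2, Ntr = 4, Niter = 3.
ratio = runtime_ratio(n_rv=2, n_tr=4, n_iter=3, n_char=n_char)
print(n_char, f"{ratio:.1%}")
```

For these inputs the ratio evaluates to roughly 0.27, consistent with the 26.8% figure. Since the denominator grows with Ntr while the numerator does not, the ratio shrinks for cells with more transistors.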
3.4.4 Performance and Design Margin Recovery
The statistical distribution of the critical path delay obtained with SMA-SSTA
is plotted in Fig. 3.11 for ZBB and SFBB at 0.5V and 0.6V, as well as for LVSB
at 0.5V. For simplicity, the delay is normalized to the value achieved at μ−3σ by
the SFBB scheme at 0.6V. From this figure, DSTA introduces a very large design
margin, which translates into a significant clock frequency penalty. In particular,
at 0.5V the clock frequency at μ+3σ for DSTA is only 2.05MHz under the ZBB
scheme, whereas it increases to 4.2MHz when the SMA-SSTA framework is adopted.
Accordingly, the SMA-SSTA methodology enables a 2× design margin recovery
(i.e., performance), as shown in the upper-right portion of Fig. 3.11, which shows
the clock frequency normalized to the DSTA approach. As expected, the adoption
of LVSB and SFBB leads to a significant performance improvement compared to
ZBB. Indeed, at 0.5V the clock frequency at μ+3σ under LVSB and SFBB is
13.6MHz and 6.2MHz, respectively, which represents a 6.6× and 3× improvement
relative to the 2.05MHz DSTA baseline, as
shown in the upper-right part of Fig. 3.11. At 0.6V, SFBB has a higher clock
frequency (17.4MHz), whereas LVSB cannot operate correctly.
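The normalized clock frequencies in the upper-right part of Fig. 3.11 follow from the frequencies quoted above. The sketch below recomputes them relative to the DSTA result under ZBB at 0.5V; the dictionary and variable names are illustrative only.

```python
# Clock frequencies at 0.5V in MHz, as quoted in the text.
DSTA_ZBB_MHZ = 2.05
SMA_SSTA_MHZ = {"ZBB": 4.2, "SFBB": 6.2, "LVSB": 13.6}

# Frequency of each scheme normalized to the DSTA baseline.
gain_over_dsta = {
    scheme: freq / DSTA_ZBB_MHZ for scheme, freq in SMA_SSTA_MHZ.items()
}

for scheme, gain in gain_over_dsta.items():
    print(f"{scheme}: {gain:.2f}x")
```

The ZBB entry isolates the margin recovered by the timing methodology alone, while the SFBB and LVSB entries combine that recovery with the speed-up from forward body biasing.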
3.4.5 Physical Implementation Considerations
The final design was placed and routed in SoC Encounter. In this design, a
tap-less standard cell library is used and therefore the body tap cells are placed
during the floorplan phase. As usual, the spacing rules dictated by the technology are
followed when placing tap cells throughout columns. PMOS and NMOS body taps
are connected externally so that the same core can easily be reconfigured to the ZBB,
LVSB or SFBB biasing scheme. This permits a fair comparison of the three techniques in the
presence of the same local variations.
3.5 Testchip Measurement Results
The AES encryption engine with AOF-BB schemes was implemented with the
proposed design flow in a 65nm Low Leakage CMOS process. Fig. 3.12 shows the
die photograph with annotated details and the test chip summary. Plain text and














Process              65 nm Low Leakage
Supply               0.5 - 1.2 V
Area                 0.008 mm² (w/o power rings)
                     0.013 mm² (w/ power rings)
Maximum Throughput   12.2 Mbps (0.5V, SFBB)
                     38.4 Mbps (0.6V, SFBB)
Energy per bit       1.65 pJ/bit (0.5V, SFBB)





Fig. 3.12: Die micrograph with annotated core and on-chip testing buffer and summary of the
AES encryption engine at room temperature.
key are input serially, and input (SPI, serial-to-parallel interface) and output
buffers (PSI, parallel-to-serial interface) are provided for on-chip testing purposes.
The serial scan of input/output data minimizes the number of testing pads (the
chip is pad limited), but constrains the frequency of at-speed testing to about
100MHz. This sets a 0.7V upper bound on the supply voltage for at-speed testing, which fully
covers the near-threshold region of interest.
3.5.1 Performance Measurement and Energy Comparison
Performance and energy were measured for the ZBB, LVSB and the proposed
SFBB body biasing schemes. The supply voltage was varied from the minimum
value that enables correct operation, which is in deep subthreshold, up to 0.7V. Unless
stated otherwise, the temperature is set to 25°C in the following. Fig. 3.13 shows
the operating frequency and energy per bit of the three body biasing schemes. For
a chip that is close to the typical corner (see Fig. 3.13), the minimum operating voltage
of ZBB is 230mV, and the minimum energy point occurs at 300mV. Around this










Fig. 3.13: Measurement results of operating frequency and energy per bit of the testchip.
point, the measured energy is 0.89pJ/bit and the throughput is 152Kbps. The LVSB
scheme has the same minimum operating voltage and the same minimum energy point
voltage of 300mV. At this voltage, LVSB increases the throughput to 640Kbps
with only a 2% energy overhead over ZBB, which represents a 4.1× performance
improvement at nearly the same energy. In the 0.23-0.5V range, LVSB boosts the
performance by 3.7-5.4× compared to ZBB, with an energy that is 3.4%-13% higher
than the latter. As a severe limitation of LVSB, operation above 0.5V is not
allowed, as explained in Section 3.3.
As expected, the proposed SFBB scheme does not have such a voltage limitation
and can operate up to the nominal voltage (1.2V in this technology). Correct
functionality of SFBB was verified experimentally in the 0.35-1.2V range. This
is a considerable advantage of SFBB over LVSB in practical designs. Indeed, it
strongly simplifies SoC integration since VDD can be freely set during the
optimization/aggregation of voltage domains and the related voltage selection. In
addition, the performance of SFBB is much more scalable than that of LVSB, thanks to
the wider voltage range. Because of the above-mentioned limitation on the testing
frequency, Fig. 3.13 reports the measured frequency up to 0.7V. From this figure,
SFBB consistently offers 1.65× more performance than ZBB across the 0.35-0.7V
range. Also, the frequency of SFBB can be scaled up to 90MHz when VDD is set to
0.7V, which is 2.6× higher than LVSB at 0.5V. Clearly, an even wider performance
increase is achieved when pushing the voltage beyond the near-threshold region
depicted in Fig. 3.13.
Table 3.3: Summary and comparison of state-of-the-art area-efficient AES designs
                        [98]*          [99]        This work
                                                   ZBB        LVSB    SFBB
AES Mode                ECB, enc/dec   ECB, enc    ECB, enc
Technology              65nm LP        0.13μm      65nm LL
Supply Voltage (V)      0.5            0.8         0.5/0.6    0.5     0.5/0.6
Area (mm²)              0.018          0.06#       0.013
Frequency (MHz)         4.9            12          9.3/32     34.3    15.3/48
Throughput (Mbps)       0.55           4.3         7.4/25.6   27.4    12.2/38.4
Energy (pJ/bit)         12.9           23          1.61/2.3   1.74    1.65/2.34
Cycles/Block            1134           356         160
Latency (μs)            231.4          29.7        17.2/5     4.66    10.5/3.3
Near-threshold          No             No          Yes
statistical design
*0.5V data is extrapolated from [98].  #Area is estimated from [99] including power rings.
Table 3.3 shows the comparison of our AES core and state-of-the-art
area-efficient AES implementations. From this table, the proposed SFBB body biasing
scheme, the proposed SMA-SSTA methodology and the more efficient
microarchitecture deliver a 22× throughput improvement and a 7.8× reduction in energy
per bit, when compared to [98]. Observe that [98] is implemented in the same
technology node, and the performance improvement of our testchip is due to the
adoption of SFBB (1.65), SMA-SSTA (2), and microarchitecture (7.1). The
latter reduces the number of clock cycles per encryption from 1134 to 160.
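The throughput and latency entries of Table 3.3 are tied to the clock frequency and the cycle count per 128-bit AES block. A minimal sketch of that relation is given below, using the frequencies, cycle counts and energies quoted in the table; the helper names are hypothetical.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_mbps(freq_mhz: float, cycles_per_block: int) -> float:
    """Throughput in Mbps: one 128-bit block every `cycles_per_block` cycles."""
    return freq_mhz * BLOCK_BITS / cycles_per_block


def latency_us(freq_mhz: float, cycles_per_block: int) -> float:
    """Latency of one block in microseconds."""
    return cycles_per_block / freq_mhz


sfbb_05v = throughput_mbps(15.3, 160)   # this work, SFBB at 0.5V
sfbb_06v = throughput_mbps(48.0, 160)   # this work, SFBB at 0.6V
ref_98 = throughput_mbps(4.9, 1134)     # unified-RAM design [98] at 0.5V

speedup = sfbb_05v / ref_98             # throughput gain over [98]
energy_gain = 12.9 / 1.65               # energy-per-bit reduction over [98]
cycle_gain = 1134 / 160                 # contribution of the microarchitecture

print(round(sfbb_05v, 1), round(sfbb_06v, 1), round(ref_98, 2))
print(round(latency_us(15.3, 160), 1), round(latency_us(4.9, 1134), 1))
print(round(speedup, 1), round(energy_gain, 1), round(cycle_gain, 1))
```

The product of the three quoted contributions (1.65 × 2 × 7.1) is about 23, close to the overall gain computed from the table entries; the small difference reflects rounding in the individual factors.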
Regarding the silicon area, the proposed SFBB and SMA-SSTA reduce the
gate size (i.e., total area) for a given performance target, thereby improving area
efficiency [109]. In this design, a 28% layout footprint reduction is achieved
compared to [98], despite the 14.7% area penalty associated with the selected
high-performance architecture. This area reduction is also partially responsible for the
improved efficiency of the energy-performance tradeoff. Indeed, a smaller area leads to
shorter wires and reduced parasitics, thereby benefitting both energy and
performance. In summary, the proposed SFBB scheme and SMA-SSTA methodology
boost performance by 3.3× at no area overhead and with an insignificant energy penalty
compared to a standard ZBB design. Compared to [98], our AES core achieves a
7.8× energy reduction and exhibits an additional 7.1× speed improvement thanks
to a more efficient architecture.
3.5.2 Static and Dynamic Robustness of the Body Voltage
Bias Point in SFBB
The stability of the body voltage bias point in the SFBB scheme was
extensively assessed in a wide range of static and dynamic conditions. Regarding the
static stability, leakage is an effective indicator of the stability of the body bias
voltage, since it depends exponentially on the threshold voltage, which is in turn
set by the body bias. Accordingly, the
chip leakage current was measured with VDD ranging from 0.5V to 1.2V,
and for temperatures ranging from 25°C to 75°C. Due to the very high leakage
current of LVSB for VDD>0.5V, a 1-kOhm current-limiting resistor was inserted
between the supply and the testchip.
Fig. 3.14 summarizes the leakage measurements versus VDD at temperatures
of 25°C, 50°C and 75°C (leakage is normalized to the measurement under the ZBB
scheme at 0.5V, 25°C). As expected, the LVSB scheme suffers from a very large leakage
current at 0.5V, which according to Fig. 3.14 is more than one order of magnitude
higher than that of SFBB and ZBB. Furthermore, LVSB experiences a very rapid (exponential)
increase in leakage current for VDD over 0.5V at 25°C, which makes its adoption
impractical for VDD above 0.5V. At higher temperatures leakage further increases,
and maintaining a targeted leakage current requires a 50mV reduction in VDD
for every 25°C temperature increase. In other words, the tight voltage constraint
imposed by LVSB becomes even tighter at higher temperatures. On the other hand, as
expected the proposed SFBB scheme and ZBB exhibit a moderate leakage increase
at higher voltages, which confirms that they can operate at voltages up to the
nominal VDD.
For the sake of completeness, the leakage measurements were repeated on 15
other dice under the same environmental conditions, eliminating the current-limiting
resistor altogether. Fig. 3.15 summarizes the measurement results across the dice
at VDD equal to 0.5, 0.8 and 1.2V, and at a temperature of 25°C (leakage is normalized
to the measurement under the ZBB scheme at 0.5V, 25°C). As shown in this figure,
the results are consistent with the observations made in Fig. 3.14. In particular,
the leakage of the SFBB scheme is consistently 5-7× the leakage of ZBB over the entire
considered voltage range. This means that the SFBB body voltage is highly stable
across voltages, temperatures and dice, thereby confirming the solid stability of its
bias point.
The body voltage in the SFBB scheme was also measured dynamically through a
Fig. 3.14: Leakage measurements of ZBB, LVSB and SFBB versus VDD under different
temperatures in a single die.





















































Fig. 3.15: Leakage measurements across 15 dice.
unity-gain CMOS analog buer. As shown in Fig. 3.16, which refers to the case of
supply voltage of 0.5V, the self-bias node voltage is close to 1/2VDD and well within
the sign-o range with VBN and VBP of 0.3VDD and 0.7VDD. Small glitches with
an amplitude of about 60mV are seen on the self-bias node due to the internal
switching, as expected from Section 3.3.
The dynamic stability of the body voltage bias point was further analyzed
by perturbing the steady-state value with a large-signal voltage pulse and
observing the transient response. An aggressive external voltage pulse with a width
of 5 μs and an amplitude of 0.25V was applied directly to the SFBB body voltage to
mimic a dramatic voltage change. The pulse was forced through a large coupling
capacitance (0.1 μF). The measured SFBB body voltage response under a VDD of
0.5V is depicted in Fig. 3.17, which clearly shows that the body voltage rapidly
returns to its steady-state waveform after the transient. Extensive measurements in
various conditions consistently confirmed that the body voltage bias point is highly
stable even under extreme perturbations.
Fig. 3.17: Dynamic stability test of the self-bias node.
3.6 Conclusion and Summary
In this chapter, we have introduced two synergistic techniques that counteract
two fundamental issues of near-threshold VLSI circuits: performance loss and large
guardbands due to process variations. A novel SelF-Body-Biasing scheme boosts the
transistor speed while entailing zero area overhead. As opposed to existing
area-overhead-free body-biasing schemes such as LVSB, whose supply voltage is
severely limited to 0.5V and below, our technique boosts the performance from
near threshold to nominal voltage and ensures reliable operation. This dramatically
simplifies SoC integration, since SFBB does not set any constraint on the
(static or transient) voltages that need to be distributed on chip.
A novel Surrogate Model Adjustment based Statistical Static Timing Analysis
methodology has also been introduced for efficient cell library characterization and
timing analysis at near threshold. The adoption of surrogate gate delay models for
local variations drastically reduces the computational effort compared to
traditional SSTA, while maintaining a comparable accuracy. The joint adoption of
these techniques enables, for the first time, variation-aware body-biased design,
as opposed to previous body biasing design strategies that either completely ignore
variations [59-61], or suffer from a large guardband due to simplistic timing
analysis based on corners [69, 93].
Experimental results of a near-threshold AES core in 65nm CMOS have shown
that the above techniques can be synergistically adopted to drive the throughput
up to 12.2Mbps (38.4Mbps) at 0.5V (0.6V). This represents a 22.2× throughput
improvement over [98], which is implemented in the same technology node. Also,
the proposed techniques achieve an energy of 1.65pJ/bit (2.34pJ/bit) at
0.5V (0.6V), i.e. a 7.8× (5.5×) energy reduction compared to [98]. As an additional
benefit, thanks to the gate size reduction enabled by the improved transistor speed and the
elimination of over-design, our AES engine occupies only 0.013 mm², i.e. 28% less
than [98]. In general, our AES core enables a 5-10× energy reduction over previous
designs for sensor nodes [98, 99], and increases throughput to the range of tens of
Mbps, while exhibiting comparable or better area.
In summary, the proposed techniques are very well suited for the design of
near-threshold IPs with very high energy efficiency, while considerably boosting
performance at reduced design effort and area. The proposed techniques enable
seamless SoC integration since they do not require any post-silicon tuning or ex-
ternal block to control the body bias voltage, and ensure reliable operation from
near-threshold to nominal voltage.
Chapter 4
A 65nm 30.7fJ/bit Subthreshold
Level Shifter Design
In this chapter, a novel static level shifter is proposed and measured in 65nm
CMOS technology for robust and efficient level conversion over a wide input voltage
range. Several circuit techniques are proposed to improve the energy efficiency,
delay and area metrics. A novel level shifter topology with an NMOS-diode based
current limiter is proposed, which reduces current contention by weakening the
pull-up network strength and thereby enables efficient level conversion. The second-order
geometrical effect is also explored to increase the drivability of the devices in the pull-down
network in the subthreshold region. Combined with the popular MTCMOS technique in
today's CMOS technology, the measured level shifter achieves robust conversion
from deep subthreshold (sub-100mV) to nominal supply voltage (1.2V). For the
target conversion from 0.3V to 1.2V, the proposed level shifter shows on average a
25.1ns propagation delay with 30.7fJ/bit energy efficiency, and the average leakage
CHAPTER 4. A 65nm 30.7fJ/bit Subthreshold Level Shifter Design
power is 2.2nW across 25 test chips.
4.1 Introduction
The growing demand for low-power systems is driven by battery-powered
or energy-scavenging applications, where power consumption becomes the primary
design concern in order to prolong the operational lifetime [87]. As a result, voltage
scaling is extensively studied to reduce the total power consumption, and the multi-VDD
technique is popular in today's low-power System-on-Chip (SoC) designs. For
energy-constrained and performance-relaxed applications, aggressive voltage scaling
into the subthreshold region shows even more promising energy benefits [7], as the
minimum energy point often lies in the subthreshold region [6].
For multiple supply voltage designs, level shifters are indispensable and are
ubiquitously inserted between different voltage domains, or directly used to drive
the highly capacitive I/O devices or external loads. In view of subthreshold
operation, level shifters are preferred to operate over a wide dynamic range,
including subthreshold input scenarios. Unfortunately, the conventional level
shifter design based on the Differential Cascode Voltage Switch (DCVS) topology
struggles to achieve robust up-conversion from subthreshold to superthreshold supply
voltages because of the significant current contention caused by the limited
strength of the pull-down devices operated in subthreshold. Generally, when the
low-voltage input signal scales below 500mV, the current contention between the pull-up and
pull-down networks leads to conversion failure of the DCVS level shifter.
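The exponential loss of pull-down strength behind this contention can be illustrated with a minimal sketch. It assumes the textbook subthreshold model, in which the drain current scales as exp(VGS / (n·vT)); the slope factor n = 1.5 and the temperature of 300 K are illustrative assumptions, not values taken from the 65nm process used in this work.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage vT = kT/q in volts."""
    return BOLTZMANN * temp_k / ELEMENTARY_CHARGE


def subthreshold_swing(n=1.5, temp_k=300.0):
    """Gate voltage change (V) per decade of subthreshold current.

    n is an assumed slope factor, not a measured process parameter.
    """
    return n * thermal_voltage(temp_k) * math.log(10.0)


def drive_loss(v_high, v_low, n=1.5, temp_k=300.0):
    """Factor by which subthreshold current drops when the gate drive
    falls from v_high to v_low (both assumed to be below Vth)."""
    return math.exp((v_high - v_low) / (n * thermal_voltage(temp_k)))


if __name__ == "__main__":
    print(f"swing: {subthreshold_swing() * 1e3:.1f} mV/decade")
    print(f"loss for a 100 mV lower input: {drive_loss(0.3, 0.2):.1f}x")
```

Under these assumptions every reduction of the input level by one swing costs a decade of pull-down current, while the pull-up devices driven from the high supply are unaffected, which is why width upsizing alone becomes impractical.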
A multi-stage level shifter design [49] is one feasible solution, as the contention in
each stage is mitigated by using additional intermediate power rails with reduced
voltage differences (300mV, 400mV, 600mV and 1.2V). However, this introduces the
overhead of generating the supply voltages via voltage regulators, which is costly
in most cases. Increasing the channel width of the pull-down devices also
helps to mitigate the contention, but the resulting area and power overheads
are overwhelming, as both area and power consumption are important design
metrics for subthreshold level shifters. Consequently, proper design techniques
[50-58] for reliable and efficient level shifter operation are highly desired.
In this chapter, we demonstrate the design and measurement of a novel
energy-efficient static level shifter in 65nm CMOS. Through the adoption of a novel topology,
MTCMOS and subthreshold device sizing, the proposed level shifter achieves
robust and efficient conversion from deep subthreshold voltage up to the nominal
supply voltage of 1.2V. Measurement results show that the proposed level shifter
can successfully up-convert a minimum supply voltage (on average 100mV) to the
nominal voltage of 1.2V. For a 0.3V, 1MHz input signal, the 25 measured level shifters
achieve on average 25.1ns propagation delay and 30.7fJ/bit energy, which outperforms
the state-of-the-art level shifter designs.
The rest of this chapter is organized as follows. Section 4.2 briefly reviews the
state-of-the-art subthreshold level shifter implementations. Section 4.3 describes
the proposed level shifter design. Section 4.4 presents the measurement results of
the test chips and a comparison to previous level shifter implementations. Section
4.5 concludes this chapter.
Fig. 4.1: Conventional DCVS level shifter topology.
4.2 State-of-the-Art Implementations
In order to achieve robust level conversion with subthreshold input, novel level
shifter topologies and proper device choice and sizing are necessary to mitigate the
prominent contention issue.
The conventional DCVS topology, as shown in Fig. 4.1, is suboptimal because
unreasonably large pull-down devices are required. In the chosen 65nm technology,
assuming a full-RVT (regular threshold voltage device) implementation (Fig.
4.1(a)) is adopted for 300mV to 1.2V conversion, the NMOS size should be 830×
that of the PMOS. With the availability of multiple-threshold devices, an MTCMOS
implementation (Fig. 4.1(b)) can reduce the NMOS-to-PMOS ratio to around 60×,
which is still a significant overhead. To address this issue, several designs
have been proposed in the past few years. The key to ensuring correct
functionality is to balance the pull-up and pull-down strengths while keeping
design overheads (e.g., area and energy) minimal.
A PMOS current limiter based level shifter is proposed in [50]. By introducing
a reference path to bias the current limiter devices, the pull-up network strength
is weakened. However, the reference path contributes an always-on static current,
resulting in high leakage power overheads.
In [51], a PMOS-diode current limiter based level shifter is proposed. In this
work, the LS topology uses the drain node of the upper PMOS device as the
output, where the swing is limited between VDD and VTp. In order to fix this issue,
additional pull-down devices are inserted to pull this signal to ground. As a
result, static current is introduced by these pull-down devices, leading to
increased propagation delay and higher energy consumption.
A two-stage level shifter with a virtual supply topology is proposed in [52].
Through the insertion of an always-on NMOS diode gating transistor, the pull-up
devices are weakened due to the virtual supply to the main conversion stage. Then
the output is converted to full swing through a latch comparator. Although this
topology achieves proper functionality without an additional power line, the design
suffers from increased propagation delay.
A dynamic logic style subthreshold level shifter is proposed in [53]. This structure
introduces additional clock synchronization circuitry between the subthreshold
and superthreshold voltage domains, which leads to significantly increased area and
energy.
The level shifter designs in [54, 55] allow voltage conversion from 0.3V to 2.5V
while still delivering decent delay and energy consumption. Thick oxide devices are
used for pull-up strength reduction. However, thick oxide devices do not scale well
in advanced technologies, which might lead to area overhead and integration issues
with other system blocks built with thin-oxide devices.
Recent works based on the Wilson current mirror [57] and the interrupted DCVS
topology [58] with improved performance metrics are also demonstrated to be
feasible for subthreshold operation. However, they have not been fully validated in
silicon.
4.3 Proposed Level Shifter Design
This section describes the proposed level shifter design and the comparative
analysis against previous designs. We first demonstrate the benefits of the proposed
novel level shifter topology. Then, MTCMOS and subthreshold device sizing are
considered to achieve the optimal design. Finally, we present the comparative analysis
against previous implementations through simulation results.
4.3.1 NMOS-Diode Current Limiter based Level Shifter
The proposed level shifter topology in this work is shown in Fig. 4.2, including
the input inverter, the main voltage conversion stage and the output buffer. The
key part of the main voltage conversion stage is the NMOS-diode based current
limiter in the pull-up network, which is introduced on purpose to drastically reduce
the current contention. Unlike the implementation of [51], we choose to use the
drain of the pull-down NMOS as the output node, thereby avoiding the additional
pull-down devices in [51]. A full RVT implementation of the proposed level shifter
is studied and we find that the size of the pull-down devices in the proposed
design can be identical to that of the pull-up devices owing to the introduction of
the NMOS-diode current limiter. As a result, the NMOS-to-PMOS ratio can be
reduced to 2, which is a 30× reduction in pull-down device size when compared to
the MTCMOS DCVS implementation, as shown in the right part of Fig. 4.2.
Fig. 4.2: Proposed NMOS-diode current limiter based level shifter topology with reduced
pull-down device size.
Fig. 4.3: Simulated transient waveform of the MTCMOS DCVS level shifter and the proposed
level shifter.
The operation principle of the proposed level shifter is briefly described using
its transient simulation. As shown in Fig. 4.3, when the input
A makes a low-to-high transition, the internal node N1 can be easily pulled down due
to the reduced pull-up strength. Meanwhile, the voltage drop across the
NMOS diode is significantly larger, resulting in a fast transition across the threshold.
The node N1 is finally inverted to the full-swing output node (OUT). The
case of a high-to-low transition on input A is similar. Note that because the internal
node N1 of the proposed design is not full swing, the output buffer may suffer
from increased short-circuit current. As a result, an inverting output buffer with
stacked transistors should be adopted.
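The sizing figures quoted in Sections 4.2 and 4.3.1 can be cross-checked with a few lines of arithmetic. The sketch below only restates the NMOS-to-PMOS ratios given in the text (830, about 60, and 2); the dictionary keys and the helper name are illustrative.

```python
# NMOS-to-PMOS size ratios quoted in the text for 300mV-to-1.2V conversion.
NMOS_TO_PMOS_RATIO = {
    "DCVS, full RVT": 830.0,
    "DCVS, MTCMOS": 60.0,      # "around 60" in the text
    "proposed, full RVT": 2.0,
}


def reduction(before, after):
    """Reduction factor in pull-down device size when moving between designs."""
    return before / after


mtcmos_gain = reduction(NMOS_TO_PMOS_RATIO["DCVS, full RVT"],
                        NMOS_TO_PMOS_RATIO["DCVS, MTCMOS"])
proposed_gain = reduction(NMOS_TO_PMOS_RATIO["DCVS, MTCMOS"],
                          NMOS_TO_PMOS_RATIO["proposed, full RVT"])

if __name__ == "__main__":
    print(f"MTCMOS over full-RVT DCVS: {mtcmos_gain:.1f}x smaller pull-down")
    print(f"proposed over MTCMOS DCVS: {proposed_gain:.0f}x smaller pull-down")
```

The second figure reproduces the 30× reduction stated above.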
4.3.2 Level Shifter Optimization with MTCMOS and Subthreshold Sizing
Based on the proposed level shifter topology, the design can be further optimized
through MTCMOS and subthreshold device sizing. MTCMOS can
further reduce the pull-up strength in the proposed topology. As shown in Fig.
4.4, with the MTCMOS technique, the pull-up network uses HVT transistors while the
pull-down network keeps RVT transistors.
Fig. 4.4: Schematic of the optimization to the proposed LS with MTCMOS and INWE-aware
sizing.
In addition, the inverse narrow width effect (INWE) [32] is exploited to improve the
drivability of the pull-down devices. Fig. 4.5 shows the simulated NMOS threshold
voltage and drain current versus the channel width in the chosen technology. As
can be observed, the threshold voltage experiences a deep roll-off when the channel
width reduces, and the corresponding drain current shows an increase for narrow-width
transistors. In fact, the current density of the NMOS transistor (nA/nm)
peaks at the minimum transistor width, and this can be utilized to significantly
improve the level shifter performance. As a result, the pull-down network design
can achieve both improved drivability and reduced device size. This also leads to
reduced energy consumption due to minimal parasitic capacitance.
Fig. 4.5: Transient simulation of the INWE effects on NMOS.
[Fig. 4.6 plot: Normalized Performance @ 1 MHz, 0.3 V; logarithmic axis 1, 10, 100; annotation 130X]
Fig. 4.6: Normalized comparison of the delay, energy and energy-delay product of the proposed
level shifter with adopted optimization techniques.
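Why INWE-aware sizing favors several minimum-width fingers over one wide device can be sketched with the same exponential subthreshold model. The threshold reduction `delta_vth` of a narrow finger, the slope factor n = 1.5 and the temperature are hypothetical illustration values, not data extracted from Fig. 4.5.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def finger_gain(delta_vth, n=1.5, temp_k=300.0):
    """Subthreshold current ratio of a fingered device over a single wide
    device of the same total width, when each narrow finger has a threshold
    voltage lower by delta_vth (V) because of the inverse narrow width effect.

    Assumes I ~ W * exp((VGS - Vth) / (n * vT)); n and delta_vth are
    illustrative assumptions.
    """
    v_t = BOLTZMANN * temp_k / ELEMENTARY_CHARGE
    return math.exp(delta_vth / (n * v_t))


if __name__ == "__main__":
    for dv in (0.0, 0.01, 0.03, 0.05):
        print(f"delta Vth = {dv * 1e3:4.0f} mV -> {finger_gain(dv):.2f}x current")
```

Because the gain is exponential in the threshold shift, even a few tens of millivolts of roll-off noticeably raise the drive current at unchanged total width, consistent with the current density peaking at minimum width.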
The proposed level shifter design and optimization strategies can be further
elaborated with the following comparison. Fig. 4.6 shows the delay, energy and
energy-delay product of the MTCMOS DCVS LS topology and the proposed level
shifter with different optimization techniques. We apply a 300mV, 1MHz, 50%
duty cycle input signal with 5ns rise and fall times at 25°C, and the load of all
LSs is a 1× strength inverter in the chosen technology. The remainder of this chapter
also uses this default simulation setup unless stated otherwise. All performance
metrics are normalized to the final LS design with all optimization techniques
adopted. As shown in Fig. 4.6, the MTCMOS DCVS LS is efficient in neither
energy nor delay, while the baseline NMOS-diode based LS (built with
minimum-sized RVT transistors) shows significant energy reduction due to the
minimized pull-down device size. However, this brings only limited delay improvement.
By further incorporating MTCMOS in the pull-up network and INWE-aware sizing
in the pull-down network, the total reductions compared to the DCVS LS in delay,
energy and energy-delay product are 2.47×, 52.5× and 130×, respectively. Although
the energy/bit of the minimum-sized N-Diode LS is optimal, the smaller delay and
energy-delay product of the final design (with all optimization techniques applied)
are preferred. The transistor sizing of the final design is summarized in
Table 4.1.
Table 4.1: Summary of the transistor sizing
Transistor  W/L (µm)        Transistor  W/L (µm)
M1          0.36/0.06       M2          0.6/0.06
M3          (0.12/0.06)*5   M4          (0.12/0.06)*5
M5          0.18/0.06       M6          0.18/0.06
M7          0.27/0.1        M8          0.27/0.1
M9          0.18/0.1        M10         0.18/0.1
M11         0.27/0.06       M12         0.27/0.1
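The three improvement factors quoted above are mutually consistent, since the energy-delay product gain is the product of the delay gain and the energy gain. A short check, using only the numbers from the text (variable names are illustrative):

```python
# Improvement factors of the final design over the MTCMOS DCVS LS (from the text).
delay_gain = 2.47
energy_gain = 52.5

# The energy-delay product improves by the product of both factors.
edp_gain = delay_gain * energy_gain

if __name__ == "__main__":
    print(f"EDP gain = {edp_gain:.1f}x (quoted as 130x)")
```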
4.3.3 Comparative Analysis to Previous Implementations
In order to demonstrate the benefits of the proposed design, we perform a
comparative analysis against the previous LS designs in [57, 58]. For the sake of fair
comparison, we reconstruct and optimize the two implementations in the chosen
technology. Due to the optimized topology discussed above, the proposed level shifter
shows smaller propagation delay than the previous designs. As shown
in Fig. 4.7, the propagation delays of [58] and [57] at 0.3V (0.25V) are 1.63× (2.1×) and
1.34× (1.69×) that of the proposed level shifter, respectively. At the same time,
the power consumption of the proposed LS is reduced by 1.52× (2.16×) compared to [58]
and by 1.31× (5×) compared to [57]. Compared to the design in [58] (which has
a similar topology in the main shifting stage), our proposed level shifter achieves
smaller propagation delay because the pull-up network in [58] is overly
weakened by the inverting input stage. As a result, the input stage of [58] has to
use LVT devices. The proposed design uses only RVT devices; the delay
could be further improved if LVT devices were used, but at the cost of
increased leakage current.
Fig. 4.7: Transient simulation of the proposed LS and the previous designs in [57, 58].
A 1K-point Monte Carlo simulation at 0.3V is also performed to investigate the
statistical performance of the proposed LS circuits. As shown in Fig. 4.8, the
proposed level shifter shows, on average, delay and energy benefits over the previous
designs. The mean delay of the proposed level shifter is reduced by 1.47× ([58]) and 1.28×
([57]), and the mean energy is reduced by 1.28× ([58])
and 1.42× ([57]).
[Fig. 4.8 panel annotations: µ = 31.8 nW; µ = 28.7 nW; µ = 22.4 nW, σ = 8.18 nW; µ = 13.7 ns; µ = 15.7 ns; µ = 10.7 ns]
Fig. 4.8: Monte Carlo simulation of the proposed LS and the previous designs in [57, 58].
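The reduction factors quoted for the Monte Carlo study can be reproduced from the mean values annotated in the panels of Fig. 4.8. The assignment of each annotated mean to a design below is an assumption, chosen because it is the only assignment that matches the ratios stated in the text; at a fixed 1MHz input the power ratio equals the energy ratio.

```python
# Mean values annotated in Fig. 4.8; the mapping to designs is assumed.
MC_MEANS = {
    "proposed": {"delay_ns": 10.7, "power_nw": 22.4},
    "[57]": {"delay_ns": 13.7, "power_nw": 31.8},
    "[58]": {"delay_ns": 15.7, "power_nw": 28.7},
}


def gain(metric, reference):
    """Ratio of a reference design's mean to the proposed design's mean."""
    return MC_MEANS[reference][metric] / MC_MEANS["proposed"][metric]


if __name__ == "__main__":
    for ref in ("[58]", "[57]"):
        print(f"{ref}: delay {gain('delay_ns', ref):.2f}x, "
              f"energy {gain('power_nw', ref):.2f}x")
```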
4.4 Measurement Results and Discussions
The proposed NMOS-diode current limiter based level shifter was implemented
in a 65nm Low Leakage CMOS process. Fig. 4.9 shows the die photo and the layout
view of the level shifter, which occupies 16.3µm² (5.5µm × 3.2µm) of silicon area. The
output of the level shifter is buffered through an inverter chain designed for driving
the large capacitive load of the external testing equipment (up to 20pF), and the
power consumption of the LS circuit was measured excluding the power of the
output buffer. Both the dynamic and leakage currents are on the nano-ampere scale, so
the Agilent HP34401A was used for current measurement.
Fig. 4.9: Die photo and layout view of the proposed level shifter.
Fig. 4.10 and Fig. 4.11 show the measured waveforms of the proposed level
shifter. With a relaxed input frequency (10kHz), the LS circuit can successfully
convert a 60mV signal into a 1.2V signal. For a 300mV input signal, the proposed
level shifter can be operated at 1MHz without obvious delay. Fig. 4.12 shows the
measured propagation delay of a typical chip with different VDDL. As can be
observed, the level shifter experiences an exponential delay increase when VDDL
scales into the subthreshold region (below 0.5V). On the other hand, when VDDL
exceeds 0.5V, the LS delay gradually saturates to a few nanoseconds.
Fig. 4.11: Measured waveform of the proposed LS with a 60mV to 1.2V conversion.
Fig. 4.13 shows the measured statistics of the proposed LS at 0.3V, 1MHz
input. The mean measured delay is 25.1ns, with a standard deviation of 8ns. The
minimum input voltage of the proposed LS can be as low as 60mV, while the
worst case is 140mV among the 25 measured chips. The average dynamic and
leakage power are 30.7nW and 2.5nW, respectively. These measurement results
demonstrate that the proposed LS shows improved energy efficiency and delay, and
competitive area efficiency and static power, when compared to the previous designs,
as summarized in Table 4.2. To the author's knowledge, the measured sub-100mV
minimum input is the lowest input signal level reported to date.
Table 4.2: Comparison to state-of-the-art LS designs
Design    Tech. (µm)  Range (V)   Delay (ns)  Energy/bit          Leakage (nW)  Area (µm²)
Proposed  0.065       0.1 - 1.2   25.1@0.3V   30.7fJ@0.3V, 1MHz   2.5           16.3
[50]      0.13        0.1 - 1.2   50@0.2V     25pJ@0.2V, 50kHz    8             NA
[51]      0.18        0.13 - 1.8  600@0.3V    20pJ@0.3V, NA       NA            NA
[52]      0.13        0.19 - 1.2  57.9@0.2V   NA                  NA            NA
[53]      0.13        0.3 - 2.5   125@0.3V    1.7pJ@0.3V, 8MHz    NA            111800
[54]      0.13        0.3 - 2.5   41.5@0.3V   229fJ@0.3V, 5kHz    0.475         102.26
[55]      0.13        0.3 - 2.5   58.8@0.3V   191fJ@0.3V, 5kHz    0.724         71.9
[56]      0.35        0.23 - 3    104@0.4V    5.8pJ@0.4V, 10kHz   0.23          1880
[57]*     0.09        0.1 - 1.2   18.4@0.2V   94fJ@0.2V, 1MHz     6.6           NA
[58]*     0.09        0.18 - 1.2  21.8@0.2V   74fJ@0.2V, 1MHz     6.4           36.5
* Simulation results only.
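The headline energy figure follows from the measured dynamic power and the input frequency. The sketch below assumes that one input period corresponds to one converted bit, which is an interpretation rather than a definition stated in the text; it also recomputes the delay variability annotated in Fig. 4.13.

```python
def energy_per_bit_fj(dynamic_power_nw, frequency_hz):
    """Energy per converted bit in fJ, assuming one bit per input period."""
    energy_joule = dynamic_power_nw * 1e-9 / frequency_hz
    return energy_joule * 1e15


def coefficient_of_variation(sigma, mean):
    """Relative spread sigma/mu of a measured quantity."""
    return sigma / mean


# Measured averages reported for 0.3V, 1MHz input across 25 chips.
energy_fj = energy_per_bit_fj(30.7, 1e6)
delay_cv = coefficient_of_variation(8.0, 25.1)

if __name__ == "__main__":
    print(f"energy/bit = {energy_fj:.1f} fJ")
    print(f"delay sigma/mu = {delay_cv:.3f}")
```

Both results agree with the reported 30.7fJ/bit and the σ/µ = 0.319 annotation of the delay histogram.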
4.5 Conclusion and Summary
Level shifter circuits capable of converting subthreshold inputs are becoming
increasingly important. In this chapter, we proposed a novel NMOS-diode current
limiter based level shifter topology and optimization techniques through MTCMOS
and INWE-aware sizing. The proposed design was fabricated in a 65nm low leakage
process and the measurement results across 25 dies validate the proposed design,
which can convert an input signal as low as 100mV (on average) to the 1.2V supply.
In addition, the proposed level shifter shows on average 25.1ns propagation delay,
30.7fJ/bit energy and 2.5nW leakage power when converting a 300mV input to
1.2V, occupying only 16.3µm² of silicon area. In summary, the proposed LS circuit
Fig. 4.12: Measured LS delay from a typical die.
[Fig. 4.13 panel annotations: delay µ = 25.1 ns, σ = 8 ns, σ/µ = 0.319; minimum supply voltage VDDmin µ = 101.2 mV; dynamic power µ = 30.7 nW; leakage power µ = 2.5 nW]
Fig. 4.13: Measured statistics of the proposed LS: delay, VDDmin, dynamic power and leakage
power.
Chapter 5
Robust and Energy-Efficient
Ultra-Low Voltage Standard Cell
Design with Intra-Cell Mixed-Vth
Methodology
High functional yield is one of the key challenges for subthreshold standard
cell designs. Device upsizing is a commonly used but sub-optimal method due to
its overheads in energy and area. In this chapter, we propose a robustness-driven
intra-cell mixed-Vth design methodology (MVT-ULV) for robust ultra-low voltage
operation. It uses low threshold voltage transistors in the weak pulling network
of logic gates to enhance the robustness, and guarantees high functional yield
with minimum energy/area overheads. We demonstrate on a commercial 65nm
CMOS process that our proposed design methodology shows up to 60mV and
110mV robustness improvement at a 300mV power supply voltage over the commercial
library cells and the cells built with previous Leakage-Minimization mixed-Vth
methods (MVT-LM), respectively, under the same cell area constraints. In addition,
the proposed MVT-ULV library enables ITC'99 benchmark circuits to show
on average 30.1% and 78.1% energy-efficiency improvement when compared to the
libraries built with the device-upsizing methods and the previous MVT-LM methods,
respectively, under the same yield constraints.
5.1 Introduction
The compelling energy benefit of aggressive voltage scaling into the subthreshold
region has been demonstrated on digital VLSI circuits in recent years for
performance-relaxed applications. However, the on/off current ratio degradation
and the sensitivity to process variations bring big challenges, such as functional
yield loss or logic failure, to subthreshold logic circuits.
Prior research on subthreshold logic design has been reported in [6, 7, 30-35].
Several works focus on timing and energy optimization [6, 7, 30-33]. However,
these works do not take functional yield into consideration. Therefore, a few works
have been proposed for yield enhancement in subthreshold logic design. Variation-aware
logic designs are proposed for improving the logic cell robustness [34, 35], with
evaluation criteria such as the butterfly plot or the logic gate output voltage swing
level. Under a target functional yield constraint, device upsizing is an effective
solution to reduce functional failures in subthreshold logic. However, this design
strategy pays for robustness with both standard cell layout area and power
consumption overheads because of the increased device dimensions.
Recently, a gate-level multi-Vth design technique has been explored and demonstrated
to be beneficial for subthreshold operation [111]. However, robustness is
still an open issue for multi-Vth logic library designs. The multi-Vth logic cells
are built with monolithic threshold voltage devices. Therefore, the conventional
device upsizing technique is still required and both energy and area overheads are
inevitable.
In this chapter, we propose a novel and orthogonal subthreshold intra-cell
mixed-Vth (MVT-ULV) logic design methodology and demonstrate its applicability
on a commercial 65nm MTCMOS process for robustness enhancement with
minimum energy/area overheads. Previous intra-cell mixed-Vth design techniques
[112, 113] have been proposed for leakage optimization in super-threshold designs.
However, they cannot be directly applied to the subthreshold region.
Our proposed MVT-ULV method aims at improving the cell robustness with
the minimum layout area overhead. In other words, the proposed method can
achieve total energy reduction and area minimization under a target yield
constraint. The basic idea of building MVT-ULV cells is to replace selected
regular-threshold-voltage (RVT) transistors with low-threshold-voltage (LVT)
ones in the weak pulling networks (either pull-up or pull-down), for instance, the
stack transistors in NAND/NOR gates in the C2MOS logic family. Modern CMOS
processes generally support multi-Vth devices (ΔVth ≈ 100-150mV), and this
replacement is feasible by only manipulating the CAD layers and applying proper device
sizing. We demonstrate that logic cells built with a monolithic device choice, either RVT or
LVT, require a similar upsizing strategy to improve functional yield. However,
the proposed logic design can enhance robustness with minimum upsizing
overheads, which eventually reduces the standard cell footprint as well as energy
consumption.
The remainder of this chapter is organized as follows. In Section 5.2, we
cover the related works on subthreshold logic robustness and transistor-level
mixed-Vth design. In Section 5.3, we introduce our proposed subthreshold intra-cell
mixed-Vth design methodology for combinational logic gates and flip-flops. In
Section 5.4 and Section 5.5, we validate the effectiveness of the proposed MVT-ULV
library under a same-area constraint and a same-yield constraint, respectively. We
conclude the chapter in Section 5.6.
5.2 Related Work
5.2.1 Subthreshold Logic Robustness
In [6], the authors claim that the minimum-sized commercial cell libraries are
energy-optimal for subthreshold designs. However, modern commercial cell
libraries overlook the logical effort sizing strategy in order to maintain a compact layout
footprint, as illustrated in Fig. 5.1(a). As a result, the robustness of the commercial
logic cells can be problematic when operated in the subthreshold region.
Due to the degradation of the ION/IOFF current ratio and the increased sensitivity
to process variations, subthreshold logic cells suffer from reduced output swings,
which are the root cause of subthreshold logic failure. The worst case of this
failure mechanism can be seen in logic gates (e.g., NAND/NOR) that are
composed of several stacked transistors together with complementary parallel
transistors. The leakage current from the parallel transistors degrades the output swing
and causes functional failure, as shown in Fig. 5.1(a).
Fig. 5.1: (a) schematics of commercial standard cells and subthreshold logic failure mechanism,
(b) cross-coupled NAND/NOR pair and, (c) example of butterfly plot.
The butterfly plot together with the corresponding output voltage swing [34]
can be used to indicate the functional failure rate. The butterfly plot is derived
via Monte-Carlo analysis of the voltage transfer characteristic (VTC) curves
of cross-coupled inverters (e.g., a NAND/NOR pair), as shown in Fig. 5.1(b).
The worst-case VTC curves are used to form the butterfly plot and to evaluate
whether the two bi-stable points are observed (Fig. 5.1(c)). Both global and
local variations shift the VTC curves and degrade the static noise margin
(SNM) of the butterfly plot. A practical solution to enhance the functional yield is to
upsize the devices to mitigate the process variations. However, the device upsizing
solution contradicts the energy minimization goal due to the increased device size.
An area increase is also inevitable.
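The SNM extraction from a butterfly plot described in Section 5.2.1 can be sketched numerically as the side of the largest square that fits inside each lobe. The sketch below is a generic illustration using NumPy, not the extraction flow used in this work: the tanh-shaped VTC, its gain values and the helper names are assumptions, and both VTCs are assumed to be monotonically decreasing.

```python
import numpy as np


def butterfly_snm(vtc_a, vtc_b, vdd, n=4001):
    """Static noise margin of two cross-coupled inverting gates.

    Curve A is plotted as (v, vtc_a(v)), curve B mirrored as (vtc_b(v), v).
    A square inscribed in a lobe has its diagonal on a line y = x + d, so for
    every offset d the side equals the x-distance between the two curves.
    """
    v = np.linspace(0.0, vdd, n)
    d_a = vtc_a(v) - v          # offset d along curve A, decreasing in v
    d_b = v - vtc_b(v)          # offset d along curve B, increasing in v
    d = np.linspace(-vdd, vdd, n)
    x_a = np.interp(d, d_a[::-1], v[::-1])   # x where curve A meets y = x + d
    y_b = np.interp(d, d_b, v)               # y where curve B meets y = x + d
    x_b = y_b - d
    side = x_a - x_b
    upper = side[d > 0].max()                # upper-left lobe
    lower = (-side[d < 0]).max()             # lower-right lobe
    return max(0.0, float(min(upper, lower)))


def tanh_vtc(vdd, gain):
    """Hypothetical smooth inverter VTC with its switching point at vdd/2."""
    def vtc(v):
        return 0.5 * vdd * (1.0 - np.tanh(gain * (v - 0.5 * vdd)))
    return vtc


if __name__ == "__main__":
    vdd = 0.3
    strong = tanh_vtc(vdd, gain=100.0)   # |slope| = 15 at the switching point
    weak = tanh_vtc(vdd, gain=5.0)       # |slope| = 0.75, no bistability
    print(f"SNM, high gain: {butterfly_snm(strong, strong, vdd) * 1e3:.1f} mV")
    print(f"SNM, low gain:  {butterfly_snm(weak, weak, vdd) * 1e3:.1f} mV")
```

When the gain at the switching point drops below one, the lobes collapse and the function returns zero, which corresponds to the loss of the two bi-stable points mentioned above.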
5.2.2 MVT-LM Design Technique
A gate-level multi-Vth design technique [111] has been proposed for ULV circuits to
achieve enhanced performance and energy efficiency. It still follows the
traditional design strategy with a monolithic device choice at the gate level.
Going beyond the gate-level multi-Vth design techniques, transistor-level (intra-cell)
mixed-Vth designs have been proposed for Leakage Minimization (MVT-LM) with
the minimum overhead in critical path delay. However, these designs have not been
validated for ULV applications. Therefore, we elaborate on several related works
[112, 113] and discuss the applicability of MVT-LM to subthreshold logic designs.
Combinational Cell Design
The previously proposed MVT-LM cell design in [112] achieves leakage reduction
through RVT device assignment while preserving the worst-case timing
arc delay. A transistor is assigned an RVT device when it is not on a
critical timing arc. Examples of 2-input MVT-LM NAND/NOR gates are illustrated
in Fig. 5.2(c). The design strategy in [112] reduces leakage with
the minimum delay overhead. However, the cell robustness is further deteriorated,
as the stacking network of the MVT-LM cell is even weaker when compared to
the RVT/LVT cell with monolithic devices. Butterfly plots of three NAND/NOR
pairs are shown in Fig. 5.2(d) for a supply voltage of 300-mV in a 5k-point
Monte-Carlo (3σ) simulation. As illustrated, the MVT-LM pair yields a diminishing SNM
(approaching 0-mV) at 300-mV, which is even worse than the RVT/LVT pair.
Fig. 5.2: NAND/NOR pairs of (a) RVT, (b) LVT, (c) previous MVT-LM technique, and (d)
Fig. 5.3: Previous MVT-LM flip-flop design, (a) Mixed-I, and (b) Mixed-II, and (c) flip-flop with
reset function.
Sequential Cell Design
MVT-LM flip-flops are also investigated in the superthreshold region for leakage
reduction due to the significant leakage contribution from sequential cells. The
idea in [113] is to assign RVT devices in either the master or the slave stage of the
master-slave flip-flops. Thus, the delay is increased in either the setup time or the
clock-to-q delay. This technique avoids an abrupt timing change while still achieving
good leakage reduction. Fig. 5.3(a-b) show the original designs of [113].
Nevertheless, these two sets of designs do not contribute to better robustness
of ULV flip-flops. Moreover, flip-flops with reset/set functions are extensively
used in digital VLSI designs. Therefore, it is more practical to take a flip-flop
with the asynchronous reset function (Fig. 5.3(c)) for demonstration. As annotated
in Fig. 5.3(c), the logic failure of the reset flip-flop is due to:
• the skewed Feed-Forward (FF) inverter in the slave stage latch, and
• the skewed and minimum-sized weak Feed-Back (FB) keeper in the master stage
latch.
Similar to the previously discussed method in Section 5.2.1, the assignment based
on [113] cannot recover the above-mentioned failures because every single inverter
in the reset flip-flop is still built with monolithic RVT/LVT devices. Thus, the
robustness of the reset flip-flop cannot be optimized.
5.3 MVT-ULV: Robustness-Driven Mixed-Vth for
ULV Operation
As discussed in previous sections, logic design based on a monolithic
device, either RVT or LVT, is more prone to functional failure due to the weak
stack devices and the process variations in the subthreshold region. Besides, the
previous intra-cell MVT-LM techniques are optimized for leakage reduction and
are demonstrated to be less beneficial to logic robustness in subthreshold
operation. In this section, we introduce our proposed robustness-driven MVT-ULV
logic design methodology.
Fig. 5.4(a) shows the diagram of the proposed MVT-ULV 2-input NAND cell
design for robustness enhancement. The idea is to assign LVT devices to the weak
pull-down network in NAND gates. A similar concept is also applicable to
NOR gates. Through this assignment, the VTC curves under process
variations can be tightened, which eventually leads to enhanced logic robustness. As
can be observed in Fig. 5.4(b), the VTC curves (NAND-NOR pair) of MVT-ULV
cells under 3σ Monte-Carlo simulation (5k-point) are significantly tightened when
compared to the RVT cells.
It is worth noting that there are other design variants under the proposed design
concept. Although several other intra-cell mixed-Vth variants exist (2^4 variants
for 2-input NAND/NOR gates), MVT-ULV restricts the use of LVT devices in the parallel
structures for robustness reasons. Thus, there are two other variants
that can also improve the robustness of the NAND/NOR logic design, as shown
in Fig. 5.4(a). The choice can be made by selecting the design variant
with the maximum SNM improvement, because it is mostly influenced by the threshold
voltage difference between the RVT and LVT devices in the chosen process.
[Fig. 5.4 panel labels: (a) MVT-ULV NAND, NAND Variant1, NAND Variant2; (b)]
Fig. 5.4: (a) MVT-ULV NAND cell design with other possible variant cells, and (b) Monte-Carlo
simulation of the NAND-NOR Pair VTC curves.
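The variant count for a 2-input NAND can be enumerated directly. The sketch below assumes that each of the four transistors is independently RVT or LVT, and that a robustness-oriented candidate keeps the parallel pull-up devices RVT while using at least one LVT device in the series stack; this filter is an interpretation of the text, and the device names are illustrative.

```python
from itertools import product

STACK = ("NMOS_A", "NMOS_B")      # series pull-down devices of a 2-input NAND
PARALLEL = ("PMOS_A", "PMOS_B")   # parallel pull-up devices


def enumerate_assignments():
    """All per-transistor Vth assignments of a 2-input NAND."""
    devices = STACK + PARALLEL
    return [dict(zip(devices, flavors))
            for flavors in product(("RVT", "LVT"), repeat=len(devices))]


def is_robustness_candidate(assignment):
    """Assumed rule: no LVT in the parallel network, some LVT in the stack."""
    parallel_rvt = all(assignment[d] == "RVT" for d in PARALLEL)
    stack_has_lvt = any(assignment[d] == "LVT" for d in STACK)
    return parallel_rvt and stack_has_lvt


all_variants = enumerate_assignments()
candidates = [a for a in all_variants if is_robustness_candidate(a)]

if __name__ == "__main__":
    print(f"{len(all_variants)} assignments in total, "
          f"{len(candidates)} robustness-oriented candidates")
    for cand in candidates:
        print({d: cand[d] for d in STACK})
```

Under this reading, the three candidates are the MVT-ULV cell with both stack devices LVT plus the two single-LVT variants, matching the three cells drawn in Fig. 5.4(a).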
The design of complex logic gates, such as AOI or OAI cells, does not completely
follow the previous design strategy. In such cells, stack and parallel structures
co-exist within each of the pull-up and pull-down networks, making the relative
strength less different than in cells like NAND/NOR. The previous strategy for
NAND-NOR pairs may therefore have negative effects on AOI/OAI cell robustness,
and the LVT devices should be assigned only to the bottleneck transistors. For
example, in AOI21/OAI21 cells, the pulling network is dominated by the single
stack transistors annotated in Fig. 5.5.





















Fig. 5.6: MVT-ULV flip-flop cell with asynchronous reset.
With the previous knowledge, the MVT-ULV method can also be applied to
flip-flop cells with reset functions, as shown in Fig. 5.6. The slave stage
latch with the FF inverter is treated similarly to the NAND-NOR logic cells,
while for the master stage latch with the FB inverter, the AOI/OAI design
strategy can be adopted.
5.4 Experimental Results: Iso-Area Constraint
In this section, we provide more simulation results to demonstrate the effectiveness
of the proposed robustness-driven MVT-ULV methodology. The assumption is that all
cells with the same logic function are limited to the same area (iso-area) for a
fair comparison of robustness. We select several logic cells from a 65nm commercial
RVT standard cell library. To keep the layout footprint unchanged, we only replace
the RVT stack devices with LVT devices. This can be achieved through an additional
device layer, which means no area overhead. Compared to the weak current in the
stack transistors of monolithic (RVT/LVT) logic cells, the degraded on/off current
ratio is recovered by using LVT stack devices. In addition, we also built LVT and
MVT-LM cells for comparison.
5k-sample Monte-Carlo (MC) simulations are run for each pair of complementary logic
cells to obtain the worst-case VTC curves and the corresponding butterfly plots.
Fig. 5.7 shows the four sets of butterfly plots of NAND-NOR pairs at a supply
voltage of 300mV and room temperature (25°C). As can be observed, the proposed
MVT-ULV design shows approximately 2X SNM improvement over the RVT/LVT
designs, and up to 11X SNM improvement over the MVT-LM cells. More SNM and
VDDmin data for other logic cells are summarized in Table 5.1. The proposed
MVT-ULV cells are consistently better in terms of SNM, which indicates a lower
minimum supply voltage.
Fig. 5.7: Butterfly plots of four sets of 2-input NAND-NOR pairs.
We also investigate the variant designs in this process. As shown in Fig. 5.8,
both variant designs show improved SNM when compared to the RVT designs.
However, the improvement is less significant than that of the chosen MVT-ULV
design in the chosen process. It is possible that, for another technology with a
large difference in threshold voltage between RVT and LVT devices, the variants
may be better alternatives.
Table 5.1: Comparison of SNM (VDD=300mV, 25°C) and VDDmin of several logic cells under

                                                                                   Async reset DFF (a)
SNM      RVT/LVT   58mV/69mV       95mV/100mV      30mV/18mV       55mV/75mV       FB (b): 52mV/48mV; FF (c): 58mV/53mV
         MVT-LM    11mV            68mV            -2mV (failed)   28mV
         MVT-ULV   120mV           165mV           75mV            90mV            FB: 62mV; FF: 82mV
VDDmin   RVT/LVT   240mV/230mV     203mV/194mV     270mV/280mV     245mV/225mV     FB: 248mV/252mV; FF: 242mV/247mV
         MVT-LM    290mV           230mV           305mV           270mV
         MVT-ULV   180mV           136mV           225mV           210mV           FB: 240mV; FF: 218mV
(a) The MVT-LM FF is built based on [14]; therefore, the latch designs are either RVT or LVT.
(b) FB: master stage latch containing the feed-back keeper.
(c) FF: slave stage latch containing the feed-forward inverter.

Fig. 5.8: Butterfly plots of variant design of 2-input NAND-NOR pairs.
In addition, temperature is another dominant factor for subthreshold logic
robustness. Commonly, increased temperature causes an exponential increase in
leakage current, and the ION/IOFF ratio is further degraded. This makes the logic
cell functionality most vulnerable at high temperature corners. The output swing
voltage levels, VOL (NAND) and VOH (NOR), across the -20 to 125°C range are
plotted in Fig. 5.9. As shown, the robustness of the commercial cells is severely
degraded at high temperature corners, while the proposed MVT-ULV cells still show
better robustness across a wide operating temperature range.
Fig. 5.9: Temperature effects on logic output swing.
5.5 Experimental Results: Iso-Yield Constraint
The previous discussions indicate that the proposed MVT-ULV logic cells show
a significant robustness improvement over the commercial library cells under the
same area constraint. This indicates potential energy benefits when a target
functional yield constraint is applied. In this section, we compare the proposed
MVT-ULV design with the conventional device upsizing solution.
5.5.1 Cell Level Evaluation
To achieve the target functional yield constraint, the conventional logic cells are
upsized to mitigate process variations and process skews. The output swing failure
rate is used as a quantitative indicator of functional yield [34, 35]. Under this
framework, the proposed MVT-ULV requires the minimum upsizing to achieve the same
functional yield.
Fig. 5.10 shows the output failure rate versus the normalized device width of
2-input NAND cells in three sets of designs (RVT, MVT-LM, MVT-ULV) at both
room temperature and high temperature. As shown, with target failure rates of 0.1%
at 27°C and 0.5% at 125°C, no extra device upsizing is required for the proposed
MVT-ULV cells. However, significant device upsizing of up to 4X and 6X is required
for the RVT and MVT-LM cells, respectively, to achieve the target failure rate.
In this case, the proposed MVT-ULV NAND gate gains around 4X and 2X reduction in
active device area and layout area, respectively, over the upsized RVT cells.
Similar situations are found in the design of other logic cells. Under the same
yield constraint, for 1X strength AOI21/OAI21 cells, up to 1.6X/2X active device
area reduction can be observed, respectively. Compared to the upsized flip-flops
with approximately 2X area and higher clock loading, the proposed MVT-ULV flip-flop
achieves the same robustness with no overhead in area or clock loading.
Due to the increased device capacitance, the circuit delay and power consumption
increase under the conventional upsizing scheme. The FO4 delay and power consumption
of a ten-stage NAND-NOR-inverter chain are examined at 0.3V with a 5k-point
Monte-Carlo simulation. As shown in Fig. 5.11, the commercial cell shows the worst
balance between the rise and fall times, indicating the worst robustness for
subthreshold operation. The MVT-ULV design shows improved mean delay over the other
two designs and statistically balances the rise and fall times. Under the same
functional yield constraint, the MVT-ULV design shows up to 40% switching power
reduction when compared to the upsized design.
Fig. 5.10: Output swing failure rate of three sets of standard cells, RVT, MVT-LM, MVT-ULV
Fig. 5.11: Delay and power distribution of three 10-stage NAND-NOR chains (commercial,
upsized and MVT-ULV).
5.5.2 Library Level Evaluation
In order to demonstrate the effectiveness of the proposed MVT-ULV methodology
for subthreshold operation, we designed four sets of standard cell libraries
(RVT, LVT, MVT-LM, MVT-ULV) with the same functional yield constraint (0.1%
@ 27°C and 0.5% @ 125°C) in a 65nm commercial CMOS process at 0.3V. Both the
RVT and LVT libraries are designed with the conventional upsizing (weak pulling
network upsizing) technique according to [34]. The MVT-LM library is designed
based on [112, 113] with an upsizing strategy similar to that of the RVT/LVT
libraries, and the MVT-ULV library is designed according to our proposed
methodology. Each library includes 12 logic functions (inverters, nand2, nand3,
nor2, nor3, mux2, xor2, DFF with reset function, etc.) with 3 driving strengths.
The standard cells are designed with 9-track compact layout templates and are
characterized at 0.3V at the TT corner and 27°C.
Six ITC'99 benchmark circuits, ranging from small to large, are selected.
Synthesis results are listed in Table 5.2. As shown in Table 5.2, the maximum
frequencies of the MVT-LM and the proposed MVT-ULV libraries are similar, and are
on average 51.6% higher than that of the RVT library. The LVT library still
achieves the best operating frequency.
Table 5.2: Synthesis results of the ITC'99 benchmark circuits.

        Maximum Frequency (MHz)         Area (µm²)                      Total Power Consumption (nW)    Energy Efficiency (nW/MHz)
Circuit RVT     LVT     MVT-ULV MVT-LM  RVT     LVT     MVT-ULV MVT-LM  RVT     LVT     MVT-ULV MVT-LM  RVT     LVT     MVT-ULV MVT-LM
B01     0.71    3.3     0.9     0.9     140.8   127.8   126.7   166     2.5     10.9    2.82    4       3.5     3.3     3.1     4.4
B03     0.52    2.5     0.77    0.79    582.8   550.8   503.6   594     5.8     27.7    7.1     9.3     11.1    11.1    9.2     11.8
B04     0.43    1.85    0.68    0.6     1828    1758    1488    3343    12.1    53.6    14.5    33.8    28.1    29      21.3    56.3
B12     0.35    1.8     0.6     0.62    2998    2999    2682    3661    12.6    62      14.9    22.7    36      34.4    24.8    36.6
B14     0.15    0.8     0.26    0.25    18494   18954   12454   23102   60.1    303     64.6    143     401     379     248     550
B15     0.21    0.96    0.28    0.26    20142   19146   15303   20330   58.5    265     61.2    93.9    279     276     219     361

For the energy efficiency metric, the proposed MVT-ULV library outperforms
the other three library sets by 33.3%, 30.1% and 78.1% over the RVT, LVT and
MVT-LM libraries, respectively. This is expected from the previous discussions:
the other libraries rely on device upsizing to achieve the target functional
yield, whereas MVT-ULV requires the minimum upsizing and therefore achieves
better energy efficiency. In addition, the area benefits of the MVT-ULV library
are 23.6%, 19.5% and 54.7% over the RVT, LVT and MVT-LM libraries, respectively.
These results indicate that our proposed MVT-ULV logic design methodology is
promising for robust and energy-efficient designs in subthreshold operation.
5.6 Conclusion
In this chapter, we propose a novel intra-cell mixed-Vth design methodology
for robust subthreshold standard cell library based designs. The proposed
MVT-ULV design methodology shows up to 60mV and 110mV SNM improvement
at 300mV supply voltage over the commercial library cells and the cells with
the previous Leakage-Minimization mixed-Vth method (MVT-LM), respectively, under
the same cell area constraints. In addition, the proposed MVT-ULV library
enables on average 30.1% (over the RVT/LVT libraries) and 78.1% (over the MVT-LM
library) energy-efficiency improvement under the same yield constraints. These
promising results demonstrate the robustness and energy efficiency benefits of
the proposed intra-cell MVT-ULV standard cell library. Future work will extend
this design methodology to processes (45nm and below) with significant
second-order geometrical effects.
Chapter 6
Exploring Energy Efficiency in
Embedded DRAM
In this chapter, we explore the feasibility of using logic-compatible gain-cell
based embedded DRAM (eDRAM) as a potential memory alternative for ULV/ULP
systems. The higher density provided by the eDRAM allows higher memory capacity
to be integrated into volume-limited ULV/ULP systems. However, the dynamic nature
of eDRAM requires the memory cells to be periodically refreshed for data
integrity, and this refresh operation reduces the memory access availability and
adds energy overhead. To address these issues, we first focus on a hidden-refresh
scheme that allows the eDRAM to operate with an SRAM-like interface, without
dedicated refresh cycles for data retention. This ensures 100% memory access
availability, thereby reducing the design effort for system integration. Also,
low-voltage eDRAM behavior has not been fully explored so far. Therefore, we
explore the voltage scaling behavior of the eDRAM under the proposed
hidden-refresh scheme.
6.1 Background
SRAM is one of the indispensable circuit building blocks in today's System-on-Chip
(SoC) designs, and it generally occupies a large share of silicon area and power.
6T-SRAM is the most prevalent memory type, providing fast operation and high
density. However, stringent density requirements push SRAM bitcells to be extremely
small in each technology generation; therefore, the 6T-SRAM is very sensitive to
local variations and prone to functional failure due to significantly reduced
read/write margins. Voltage scaling further worsens the situation, and the 6T-SRAM
generally fails to be fully functional below 0.6V.
Emerging SRAM designs capable of ULV operation have also been proposed for
aggressive power reduction. The contradictory read/write optimization strategies
complicate ULV SRAM design, and ULV SRAM bitcells generally adopt more transistors
(7T to 10T) [37, 38, 40-47] to maximize the read/write margins. This reduces
density, which is a disadvantage when large memory capacity is needed for enhanced
computational power.
Recent circuit [48, 114-119] and architecture [120] research in memory/cache
design explores the gain-cell based eDRAM as a potential alternative to SRAM.
Compared with SRAM, the gain-cell based eDRAM provides improved memory density.
In addition, since no direct leakage path exists in the gain-cell eDRAM, its
leakage current is smaller than that of SRAM.
However, eDRAM suffers from a data retention issue due to the electrical
charge leaking through the access transistors. As a result, a periodic refresh,
which consists of a dummy read and a write-back operation, is indispensable for
data integrity. In conventional 1T1C (1-transistor and 1-capacitor) eDRAM, a
trench capacitor is introduced through additional mask layers to maintain
high density and to increase the retention time. Consequently, the 1T1C eDRAM
is generally incompatible with the standard digital process.
Gain-cell based eDRAMs rely on MOSFET gate capacitance (on the order of fF) as
the main storage node. Therefore, gain-cell based eDRAM suffers from a shorter
retention time than 1T1C eDRAM. Despite this, the gain-cell eDRAM is fully
compatible with the standard digital process, which is an advantage over the
1T1C eDRAM when targeting low-cost ULV systems.
When considering the replacement of SRAM by eDRAM in ULV/ULP systems,
it is crucial to provide an efficient refresh scheme that allows easy integration.
The previous eDRAM designs [48, 114-119] all need a dedicated period for data
refresh, which reduces memory availability. This is not a significant issue for
high-performance applications, but it is a serious overhead in practical ULV/ULP
designs. For example, a low-cost ULP MCU has to be redesigned to include a refresh
controller for data refresh and pipeline stall operation.
In this chapter, we propose a hidden-refresh scheme for a gain-cell based eDRAM
with 100% availability. This scheme provides the gain-cell eDRAM with an SRAM-like
interface without dedicated refresh cycles for data retention, and therefore
reduces the design effort for system integration. Several circuit techniques
and design considerations are applied to enable the hidden-refresh scheme
and to ensure robust read/write/refresh operation.
Also, we explore the voltage scaling effects of the eDRAM based on the
above-mentioned scheme. The intuition is that voltage scaling reduces the
access/refresh power, but the retention time is reduced at the same time,
which may increase the refresh rate and energy. As a result, the power/energy
of the voltage-scaled eDRAM becomes an interesting trade-off, which is another
focus of this chapter.
The rest of this chapter is organized as follows. Section 6.2 describes the
proposed hidden-refresh scheme. Section 6.3 presents circuit design considerations
for the self-refresh eDRAM design. Section 6.4 evaluates the voltage scaling
effects on the hidden-refresh eDRAM. Section 6.5 concludes this chapter.
6.2 Hidden-Refresh Scheme
Conventional eDRAM designs have to be refreshed periodically to retain the stored
data, and a dedicated period of time has to be reserved for refresh. During
the refresh period, the eDRAM is not available for any access (read/write)
operation, which reduces memory availability. Note that conventional eDRAM
designs target high-performance operation and are optimized for reduced cycle
time at hundreds of MHz to near-GHz operation. This fast clock results in a short
period for refresh, which leads to inevitable system overhead for refresh control.
When eDRAM is considered for low-power systems with MHz-range operation,
the conventional refresh scheme is not applicable, as the eDRAM availability
deteriorates further with reduced operating frequency while the data retention
time is unchanged. In addition, system-level refresh control is generally expensive
in most low-cost low-power systems.
Fortunately, given the reduced clock frequency in low-power systems, we can
exploit the long clock cycle for smart refresh control at the memory circuit
design level. In order to reduce the system design overhead, we propose a
hidden-refresh scheme for eDRAM. This allows the eDRAM to have an SRAM-like
interface with 100% memory availability, while refresh is handled by inserting a
refresh operation between two consecutive access operations.
Fig. 6.1: Conceptual illustration of the hidden-refresh scheme for eDRAM.
Fig. 6.1 shows the timing diagram of the proposed hidden-refresh scheme for
eDRAM. Since a long clock cycle is generally available in low-power systems (e.g.,
below tens of MHz), it is possible to utilize this long clock cycle by inserting a
refresh operation between two consecutive access operations.
One straightforward way of implementing this scheme is to use the two clock
levels as a natural schedule. When the CLK signal is high, the normal access
operation (i.e., read or write operation) is scheduled to either write in (read
out) the data to (from) the memory macro. When the CLK signal is low, the refresh
operation (i.e., dummy read followed by write-back) is scheduled.
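The clock-level schedule described above can be illustrated with a small behavioural model. This is a minimal sketch under stated assumptions, not the circuit itself: it ignores timing, leakage and retention, models one access in the CLK-high phase and one row refresh in the CLK-low phase, and uses hypothetical names (`HiddenRefreshModel`, `clk_high`, `clk_low`, `ref_en`) that do not come from the design.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenRefreshModel:
    """Behavioural sketch only: no timing, leakage, or retention modelling."""
    rows: int = 64
    cells: list = field(default_factory=list)
    refresh_row: int = 0                      # hypothetical refresh counter
    log: list = field(default_factory=list)   # records the order of operations

    def __post_init__(self):
        if not self.cells:
            self.cells = [0] * self.rows

    def clk_high(self, addr, write=False, value=0):
        """CLK-high phase: one normal access (read or write)."""
        if write:
            self.cells[addr] = value
            self.log.append(("write", addr))
            return None
        self.log.append(("read", addr))
        return self.cells[addr]

    def clk_low(self, ref_en=True):
        """CLK-low phase: dummy read followed by write-back of one row."""
        if not ref_en:
            return
        row = self.refresh_row
        word = self.cells[row]        # dummy read
        self.cells[row] = word        # write-back restores the stored level
        self.log.append(("refresh", row))
        self.refresh_row = (row + 1) % self.rows


# Two clock cycles: a write then a read, each followed by a hidden refresh.
model = HiddenRefreshModel(rows=4)
model.clk_high(2, write=True, value=0xBEEF)
model.clk_low()
read_back = model.clk_high(2)
model.clk_low()
print(model.log, hex(read_back))
```

From the point of view of the host, only the `clk_high` calls are visible, which is what gives the macro its SRAM-like interface; the refresh rows advance in the background on every enabled CLK-low phase.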
Here, we need to clarify several power/energy metrics of the proposed
hidden-refresh eDRAM operation. Although the refresh operation is "hidden", the
refresh power still exists. As a result, refresh power contributes to both the
dynamic and static power of the hidden-refresh eDRAM. In the normal access mode,
the dynamic power (read/write power) has to take the refresh power into
consideration. When the eDRAM is in the data retention mode without being
accessed, the total static power includes both the leakage power and the refresh
power, as the refresh operation still takes place during the retention mode.
It is possible to enable the refresh operation all the time. However,
considering the power/energy overhead of the refresh operation, this is less
energy efficient. Assuming a refresh duty cycle profile is known up-front for an
eDRAM macro, as shown in Fig. 6.1, the refresh operation can be performed only
according to this duty cycle, to maintain the data integrity of the whole memory
while minimizing the refresh power/energy overhead.
6.3 Circuit Design for Self-Refresh eDRAM
In this section, we cover the circuit design aspects of the self-refresh
eDRAM design. This design is also used in simulation to further investigate
the voltage scaling effects on the hidden-refresh eDRAM.
The top-level block diagram of the hidden-refresh eDRAM is shown
in Fig. 6.2. The external interface of the eDRAM is identical to that of a
synchronous SRAM macro.
The major architectural modification to enable the hidden-refresh scheme is
realized with additional building blocks, which are highlighted using dashed lines.
As shown in Fig. 6.2, in addition to the main building blocks similar to those of a
normal SRAM, an address multiplexor and a refresh counter are added. The address
multiplexor is used for access and refresh address selection, while the refresh
counter is used to generate the refresh address when the refresh enable signal is
applied. A more detailed timing diagram based on Fig. 6.1 is shown in Fig. 6.3. As
shown, the access operation is further classified into write (Write OP) and read
(Read OP) access, which are used as the control signals for wordline and bitline
control. In addition, the refresh operation is regarded as valid only when it is
enabled (REF EN = 1), in which case the corresponding dummy read and write-back
(REF OP) operation is performed when CLK is low.
Fig. 6.3: Detailed timing diagram of the hidden-refresh eDRAM.
Note that a clock signal is required for the refresh operation, so that the
counter advances periodically. The pulse generation for both the Access and
Refresh signals in Fig. 6.3 can be achieved through the classical Address
Transition Detection (ATD) circuit [28]. The access address is applied from
external signals, while the refresh address is generated by the refresh counter.
6.3.1 Bitcell Choice and Operation Principle
Previous gain-cell eDRAM bitcells consist of either two (2T) or three
transistors (3T), with optional MOSFET capacitors or diodes for different design
purposes. In this chapter, we use the basic 2T bitcell implementation [116],
which includes a PMOS access transistor and an NMOS storage transistor, as shown
in Fig. 6.4.
For the considered commercial 65nm technology, the write transistor (MW) is
an HVT (high threshold voltage) PMOS device with the lowest available leakage in
the process to increase the retention time, while an RVT (regular threshold
voltage) NMOS device is chosen for the storage transistor to obtain a moderate
read sensing speed.
The operating principles and timing diagram of the 2T eDRAM are shown in
Fig. 6.4 and Fig. 6.5, and are described below.

























Fig. 6.5: Timing diagram of the eDRAM write, read and refresh operation.
For write operation, due to the voltage droop caused by the write transistor
MW, a negative pulse is applied to the write wordline (WWL) of the selected row
to enhance the drivability of the write transistor during the write operation.
Also, proper timing control is needed to ensure that the input data fed to the
write bitline (WBL) is kept stable until the WWL signal returns to VDD, because
the write mechanism is level-sensitive. In the meantime, the internal storage
node Q is coupled to a higher voltage level due to WWL switching.
Furthermore, it is worth noting that the WBL voltage level (during un-accessed
time) has significant effects on the retention time of the storage node in the
gain-cell based eDRAM. For the chosen 2T eDRAM design with a PMOS write device,
the WBL should always be discharged to VSS when it is un-accessed.
For read operation, the read bitline (RBL) needs to be precharged first.
During the sensing period, the precharge is disabled and the selected read
wordline (RWL) is pulled from VDD to GND to sense the storage node value. When the
storage node value is "0", the read transistor MR is off and the RBL is kept at
VDD. When the storage node value is "1", the read transistor MR is on and ideally
the RBL is discharged from VDD to a lower voltage level.
Refresh operation is a combination of the read and write-back operations, as
shown in Fig. 6.5, and occupies the half clock cycle when CLK is low. As a result,
the proposed hidden-refresh scheme requires a clock cycle at least two times
longer than a refresh operation; equivalently, the hidden-refresh scheme halves
the maximum operating frequency when compared to the previous eDRAM designs.
However, this is still tolerable for ULP systems, which require a maximum
frequency of only up to tens of MHz. The benefit of the hidden-refresh scheme is
that the eDRAM has an SRAM-like interface for ease of system integration.
6.3.2 Write/Read Bitline Circuit Design
Fig. 6.6 shows the write (WBL) and read (RBL) bitline circuit design of the
proposed hidden-refresh scheme. As mentioned earlier, the write bitline should be
precharged to GND (through WBL PC) for maximizing the retention time when
it is not accessed. Considering both the write/write-back requirement of the same
WBL port, two enable signals (WR and WB) are used to distinguish the write oper-
ation and write-back operation for correct data/refresh data input. The WBL PC
should be disabled during both write and write-back operation, which can be real-














Fig. 6.6: Write/read bitline design of the hidden-refresh scheme.
For the read bitline circuit, the RBL is first precharged to VDD; then the RBL is
inverted and sampled by two flip-flops for read-out and refresh operation,
respectively. This read-out scheme is significantly different from the SRAM
design, as the SRAM address/control/data signals remain consistent within one
clock cycle during operation. In contrast, the address/control/data of the
hidden-refresh eDRAM alternate every half clock cycle during the refresh period.
As a result, two flip-flops (one for read and one for refresh operation) per
column are necessary for correct hidden-refresh operation. The sampling clock
signals for both flip-flops can be generated through proper delay control to
ensure that the correct data is stored during the read-out operation.
Fig. 6.7: Bootstrapped WWL driver and simulated waveform.
6.3.3 Wordline Driver Circuit Design
Wordline drivers are very important for the correct eDRAM operation. Sev-
eral design issues in wordline drivers are encountered in previous eDRAM designs.
As mentioned in Section 6.3.1, the write wordline requires a negative voltage to en-
hance the drivability, and additional supply is inevitable for providing this negative
voltage.
In order to minimize the additional power supply overhead, we propose a
charge-pump WWL driver, as shown in Fig. 6.7. A negative-VDD charge pump
driver based on the bootstrapped inverter [62] is adopted. The conversion
efficiency, which is also equivalent to the bootstrapped negative voltage level,
can be satisfied through a reasonable ratio between the charge pump capacitor
(CCP) and the load capacitance of all write access transistors in one row (CL).
The charge pump capacitor can be implemented with a MOSCAP to minimize the
design overhead.
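To give a feel for how the CCP/CL ratio sets the bootstrapped level, the sketch below uses an idealized first-order capacitive-divider model. This model is an assumption introduced here for illustration, not taken from the design: it assumes the floating WWL starts at 0V, a full -VDD step is coupled through CCP, and parasitics, leakage and switch non-idealities are ignored. The capacitance values and the function names `bootstrapped_level` and `min_pump_cap` are hypothetical.

```python
def bootstrapped_level(vdd, c_cp, c_l):
    """Ideal WWL voltage after a -VDD step is coupled through C_CP onto C_L.

    ASSUMPTION: first-order charge sharing only, WWL floating and initially 0 V.
    """
    return -vdd * c_cp / (c_cp + c_l)


def min_pump_cap(vdd, c_l, v_target):
    """Smallest ideal C_CP that reaches v_target (a negative voltage above -VDD)."""
    k = -v_target / vdd
    if not 0.0 < k < 1.0:
        raise ValueError("target must lie strictly between -VDD and 0 V")
    return c_l * k / (1.0 - k)


# Hypothetical example values (not from the thesis).
VDD = 0.6          # supply voltage in volts
C_L = 20e-15       # WWL load of one row in farads
C_CP = 60e-15      # charge pump capacitor in farads

level = bootstrapped_level(VDD, C_CP, C_L)
needed = min_pump_cap(VDD, C_L, v_target=-0.45)
print(f"ideal WWL level: {level:.3f} V, C_CP for -0.45 V: {needed * 1e15:.1f} fF")
```

In this idealized picture a CCP three times larger than CL recovers three quarters of VDD as negative drive; a real driver would reach a smaller magnitude because of the effects the model leaves out.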
Another practical design concern is the read disturbance issue in the 2T
eDRAM design with multiple cells sharing one read bitline, as shown in the left
figure in Fig. 6.8. During read operation, the selected wordline driver is pulled
to GND and the RBL voltage is determined by the data on the storage node. When
data "1" is stored, the selected wordline driver should be able to pull down the
RBL through the read device. However, the worst case for the read "1" disturbance
arises when all cells store data "1". The simplified equivalent read-out circuit
is shown in the right figure in Fig. 6.8. For multiple cells sharing one RBL, the
un-selected read wordlines (at VDD) deliver contention current to the RBL through
the storage transistors holding data "1". As a result, this significant read
disturbance might cause a read "1" failure. Existing solutions suggest using
additional power rails [117] to lower the contention current (setting UNSEL RWL to
a voltage smaller than VDD) or a compromised design choice with fewer cells per
RBL [119].
Fig. 6.9 shows the proposed tri-state RWL buffer, which mitigates the read
disturbance issue by reducing the contention current from the un-selected rows.
As shown in Fig. 6.9, the tri-state enable signal allows the selected RWL to
function as usual, while the un-selected RWLs are floating because the
un-selected tri-state buffers are disabled. Consequently, the RBL can be easily
pulled down to a low voltage level.




















































Fig. 6.9: Proposed tri-state RWL driver.
6.4 Power Metrics of the Hidden-Refresh eDRAM
under Voltage Scaling
In this section, the energy metrics of the hidden-refresh eDRAM are investigated
under supply voltage scaling from the nominal voltage of 1.2V down to 0.6V. Based
on the techniques described earlier in this chapter, we design a 1-Kbit (64-row
by 16-bit) 2T gain-cell based hidden-refresh eDRAM in a commercial 65nm

























































Fig. 6.10: Schematic of the 1K-bit hidden-refresh eDRAM.
Simulation results show that the worst-case read and write operations at 0.6V
can be finished within 10ns. This implies that a target operating frequency of
10MHz can be achieved for the 2T eDRAM, where a clock cycle of 100ns is
available. The simulated power consumption under different supply voltages at
room temperature (25°C) is plotted in Fig. 6.11. As can be observed, the refresh
operation shows the largest power consumption, as both read and write-back
operations are performed. Read power is significantly larger than write power,
which is due to the switching power of the flip-flops in the RBL sensing circuits.
The power reduction trends are obvious: when the supply voltage scales from 1.2V
to 0.6V, approximately 5× power reduction can be achieved for all three operation
types.
Fig. 6.11: Power consumption of the hidden-refresh eDRAM.
Fig. 6.12 shows the retention time of the 2T eDRAM and the static power
consumption, which confirms the retention time reduction due to supply voltage
scaling [117]. When the supply voltage scales from 1.2V to 0.6V, the retention
time reduces from 120µs to 40µs, which is a 3× reduction. As a result, the power
benefit from voltage scaling (5×) dominates the retention time loss (3×). For the
1-Kbit memory with a 10MHz clock, the 64-address memory can be refreshed within
6.4µs. Therefore, the eDRAM is fully functional with a duty-cycled refresh
operation. This results in a static power consumption of 640nW at 0.6V, as shown
in Fig. 6.12.
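The refresh timing quoted above follows from simple arithmetic, sketched below. The sketch reads the retention times as 120µs and 40µs and the full-array refresh time as 6.4µs (64 rows, one row refreshed per 100ns clock cycle). The "refresh duty" computed here, the fraction of clock cycles that must carry a refresh, is a definition introduced for illustration, and the variable names are hypothetical.

```python
ROWS = 64                      # 1-Kbit macro: 64 rows x 16 bits
F_CLK_HZ = 10e6                # target clock frequency
RETENTION_S = {1.2: 120e-6,    # retention time at VDD = 1.2 V
               0.6: 40e-6}     # retention time at VDD = 0.6 V

t_clk = 1.0 / F_CLK_HZ                 # 100 ns clock period
t_full_refresh = ROWS * t_clk          # one row per clock cycle

# ASSUMPTION: "duty" = time needed to refresh every row / retention time.
refresh_duty = {vdd: t_full_refresh / t_ret
                for vdd, t_ret in RETENTION_S.items()}

# Retention margin: how many full-array refresh passes fit in one retention time.
margin = {vdd: t_ret / t_full_refresh for vdd, t_ret in RETENTION_S.items()}

for vdd in sorted(RETENTION_S, reverse=True):
    print(f"VDD={vdd} V: refresh duty {100 * refresh_duty[vdd]:.1f}%, "
          f"margin {margin[vdd]:.2f}x")
```

With these numbers the array still has several refresh passes of margin at 0.6V, which is consistent with the statement that the eDRAM remains fully functional with a duty-cycled refresh.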
Fig. 6.12: Retention time and static power of the hidden-refresh eDRAM.
Fig. 6.13: Read/write power with duty cycled refresh power of the hidden-refresh eDRAM.
Since the hidden-refresh eDRAM is designed with an SRAM-like interface,
the duty-cycled refresh power should also be included when considering the actual
read/write power, as depicted in Fig. 6.13. At 0.6V, the hidden-refresh eDRAM
shows 3.9µW read power (including the duty-cycled refresh power) at a 10MHz
operating frequency, and the write power (including the duty-cycled refresh power)
is 1.7µW.
Based on the above results, voltage scaling remains beneficial for power reduction
in the hidden-refresh eDRAM down to 0.6V. Although the refresh duty cycle is
reduced by 3× at the lower supply voltage, the 5× power reduction from voltage
scaling dominates. However, we found that further reducing the supply below
0.5V would raise serious concerns about the eDRAM functionality. This is because the
refresh duty cycle below 0.6V approaches, or even becomes smaller than, the required
refresh period (6.4μs). It is still possible to revise the hidden-refresh scheme
to refresh more than one address within one clock cycle; however, this
modification is beyond the scope of this chapter.
Table 6.1 compares the proposed hidden-refresh eDRAM with previous SRAM/eDRAM
designs. As can be observed, the 2T eDRAM shows higher density than the SRAM
designs. When the bitcell areas of the SRAM designs [39, 137] are normalized to
65nm, the 2T eDRAM still shows around 2.5× reduction in bitcell area. When
operating at 0.5V, the access energy of a 16-bit operation in the SRAM of [39] is
2.33pJ, whereas for the hidden-refresh eDRAM with a 0.6V supply, the 16-bit
access energy (read plus duty cycled refresh) is 0.4pJ, a 5.8× energy reduction.
These promising results show that the hidden-refresh eDRAM might be a viable
option for replacing SRAM. However, the static power of the 2T eDRAM is 3.2×
that of SRAM, which is mainly caused by the refresh power overhead. Further
enhancing the retention time will be helpful in reducing the static power. In
addition, the 65nm process has higher leakage than the 0.13μm process; therefore,
an eDRAM design in 0.13μm would have a longer retention time and thus lower
static power. Compared to the eDRAM without a hidden-refresh scheme [116], this
work achieves lower-voltage operation. In addition, the proposed design uses only
one power supply, which is an advantage over conventional eDRAM designs.

Table 6.1: Comparison among SRAM, eDRAM and Hidden-Refresh eDRAM

| | [39] | [137] | [116] | This work |
|---|---|---|---|---|
| Type | 6T SRAM | 8T SRAM | 2T-eDRAM | 2T-eDRAM |
| Access | No | No | External | Hidden |
| Technology | 0.13μm | 40nm | 65nm | 65nm |
| Power Supply | Single | Single | Multiple | Single |
| Capacity | 2-kb | 512-kb | 192-kb | 1-kb |
| Bitcell Area | 4.788μm² | 0.706μm² | 0.48μm² | 0.48μm² |
| Frequency | 5.62MHz, 0.5V | 6.25MHz, 0.5V | 500MHz, 1.2V | 10MHz, 0.6V |
| Access Energy (16b) | 2.33pJ | 12.9pJ | NA | 0.4pJ (incl. refresh) |
| Retention Time | NA | NA | 20μs, 0.8V | 40μs, 0.6V |
| Static Power | 200nW, 0.5V | 72.8μW, 0.5V | 109μW/Mb, 85°C | 640nW, 600mV |

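
The ratios quoted in this comparison can be reproduced from the Table 6.1 entries. The following is a minimal sketch that assumes ideal quadratic area scaling with feature size (an idealization; real bitcells do not scale exactly this way) and takes the access energy as power divided by clock frequency. The helper name is hypothetical.

```python
def scale_area(area_um2: float, from_node_nm: float, to_node_nm: float) -> float:
    """Normalize an area between technology nodes, assuming ideal quadratic scaling."""
    return area_um2 * (to_node_nm / from_node_nm) ** 2


# 6T SRAM bitcell of [39] (0.13 um) normalized to 65 nm, versus the 2T eDRAM bitcell
sram_6t_area_65nm = scale_area(4.788, 130.0, 65.0)
area_ratio = sram_6t_area_65nm / 0.48

# 16-bit access energy: SRAM [39] at 0.5 V versus hidden-refresh eDRAM at 0.6 V
energy_ratio = 2.33 / 0.4

# Static power: hidden-refresh eDRAM (640 nW) versus SRAM [39] (200 nW)
static_ratio = 640.0 / 200.0

# Read energy per access from the 3.9 uW read power at a 10 MHz clock, in pJ
read_energy_pj = 3.9e-6 / 10e6 * 1e12

print(f"bitcell area ratio : {area_ratio:.2f}x")
print(f"access energy ratio: {energy_ratio:.2f}x")
print(f"static power ratio : {static_ratio:.2f}x")
print(f"read energy/access : {read_energy_pj:.2f} pJ")
```

The read energy obtained from the 3.9μW read power agrees with the 0.4pJ access energy listed in the table after rounding.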
6.5 Conclusions
In this chapter, we explore the energy efficiency of eDRAM design as a memory
alternative to SRAM. The eDRAM provides higher density than conventional
SRAM; however, the refresh operation complicates the system-level design due to
the overhead of the control blocks. In order to remove the system-level integration
effort, we propose a hidden-refresh scheme that gives the eDRAM an SRAM-like
interface. In addition, existing eDRAM implementations require additional
supply voltages for correct functionality. Several circuit techniques, including
a bootstrapped WWL driver and a tri-state RWL driver, are proposed to minimize this
overhead and achieve true single-VDD operation. Finally, a 1-Kbit hidden-refresh
eDRAM is designed and simulated in a commercial 65nm technology. Compared
with an SRAM counterpart, the eDRAM shows promising benefits in terms of
memory density and access energy. This result is especially compelling for
memory-intensive ULP systems with stringent area and energy limits.
Chapter 7
A 0.4V 280nW Nearly All-Digital
Frequency Reference-less Hybrid
Domain Temperature Sensor
This chapter covers a case study of applying ULV digital-assist circuits to
emerging sensor designs. We present the design of a sub-μW nearly all-digital
hybrid domain temperature sensor for wireless temperature sensing applications.
A subthreshold-biased ratioed-current/delay based PTAT sensor core is proposed.
In addition, we propose a hybrid domain digital processing technique based on the
proposed sensor core to relax the requirement of an accurate external frequency
reference, which is generally power hungry for energy-constrained systems. The
proposed design was fabricated in a 65nm CMOS process and measured with a 0.4V
power supply. The eight measured chips show -1.6/+1°C error across the 0–100°C
range after 2-point calibration, and the power consumption is on average 280nW.

Table 7.1: Categories of the CMOS temperature sensor

| References | Type | Sensor Core |
|---|---|---|
| [121, 122] | Voltage Domain | BJT-based |
| [123–127] | Time Domain | Delay line based |
| [128–131] | Time Domain | Current to delay converter based |
| [132–135] | Frequency Domain | Frequency to digital converter based |

7.1 Introduction
Smart temperature sensors based on CMOS technologies are popular due to
their low cost and ease of on-chip integration. As a result, CMOS temperature
sensors are widely used for on-chip dynamic thermal management in
high-performance microprocessors, environment sensing/food monitoring
in wireless sensor nodes or RFID tags, and temperature compensation in MEMS
resonators. Because performance requirements differ across applications,
temperature sensor designs vary widely, and it is challenging to achieve high
accuracy/resolution, high data rate and low power consumption at the same time.
The existing temperature sensors can be categorized into three major types
based on their operating principles, as shown in Table 7.1.
The first type is the voltage-domain BJT-based temperature sensor [121, 122].
The temperature-dependent voltage is converted to digital codes through
integrated high-resolution ADCs; such sensors are preferred in applications with high
accuracy/resolution requirements, such as temperature compensation in MEMS
resonators. The second type is the time-domain temperature sensor, which exploits
a temperature-dependent delay [123–131]. The temperature-dependent delay is
converted to digital codes through time-to-digital converters (TDCs). In order to
generate a delay large enough to achieve a decent resolution, either chains of
hundreds of cascaded inverters [123–127] or current-to-delay converter
based sensor cores (diodes, MOSFETs, etc.) are used [128–131]. The third type
utilizes the temperature-dependent frequency of a ring oscillator as the sensing
element [132–135].
There is growing interest in integrating smart temperature sensors into
ultra-low-power wireless sensor platforms and RFID tags, where compact nW
temperature sensors with moderate sensing error are attractive due to the limited
system power budgets (i.e., a few μW) [136]. However, due to the use of ADCs,
long delay lines and ring oscillators with super-threshold supplies, the power
consumption of the above-mentioned temperature sensors normally far exceeds
the μW power budget, making them unsuitable for wireless sensing applications.
Efforts have been made to reduce the total power consumption of the different
temperature sensor types, and several prototypes have been demonstrated with
sub-μW power consumption [130, 131, 134, 135]. However, current-to-delay converter
based temperature sensors [130, 131] depend heavily on the availability of
a frequency reference for TDC operation. An additional clock reference is
therefore needed, which is generally power hungry for ultra-low-power platforms,
as shown in Fig. 7.1. In order to mitigate this, frequency-to-digital converter
(FDC) based sensors with integrated on-chip frequency references [134, 135] have
been demonstrated. However, these designs rely heavily on analog blocks and
techniques that require iterative design effort, which is a disadvantage for
technology migration.
In order to resolve the above-mentioned challenges, this chapter presents a
0.4V 280nW nearly all-digital hybrid domain temperature sensor for wireless
sensing applications. A ratioed-current/delay PTAT sensor core realized with two
subthreshold-biased MOSFETs is proposed. Based on the proposed sensor core,
we propose a hybrid domain processing technique. As a result, the requirement for
an accurate frequency reference is eliminated, and the nearly all-digital
implementation makes this design more technology-scaling friendly.

Fig. 7.1: Power consumption versus frequency of state-of-the-art ultra-low-power
frequency references, illustrating the power overhead due to the frequency reference.

The rest of this chapter is organized as follows. Section 7.2 describes the
proposed ratioed-current/delay PTAT sensor core. Section 7.3 presents the proposed
hybrid domain temperature sensing scheme and its benefits. Section 7.4 describes
the circuit implementation details. Section 7.5 shows the measurement results of
the hybrid domain temperature sensor. Section 7.6 concludes this chapter.
Fig. 7.2: Schematic of the proposed ratioed-current/delay PTAT sensor core.
7.2 Ratioed-Current/Delay PTAT Sensor Core
In this work, we explore the ratioed-current/delay as the PTAT sensor core.
Fig. 7.2 depicts the schematic of the ratioed-current/delay PTAT sensor core.
Two identical NMOS devices are biased in the subthreshold region with the gate
overdrive voltage of VGS1 and VGS2, respectively. The corresponding subthreshold
































where μ is the NMOS mobility, Cox is the oxide capacitance, VT is the thermal
voltage, Vth is the NMOS threshold voltage, n is the subthreshold slope, and VGS
and VDS are the gate overdrive voltage and drain-source voltage, respectively.
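
For reference, the symbols listed above are those of the textbook subthreshold drain-current model. The sketch below assumes that model; the aspect ratio W/L, the Boltzmann constant k and the electron charge q are symbols introduced here, and the exact prefactor used in this work may differ. It shows how the ratio of the two currents isolates the thermal voltage when the two devices are identical and see the same drain-source voltage dependence.

```latex
\begin{align*}
I_i &= \mu C_{ox}\frac{W}{L}\,(n-1)\,V_T^{2}
       \exp\!\left(\frac{V_{GS,i}-V_{th}}{nV_T}\right)
       \left(1-\exp\!\left(-\frac{V_{DS}}{V_T}\right)\right),
       \qquad i = 1, 2,\\
\frac{I_2}{I_1} &= \exp\!\left(\frac{\Delta V_{GS}}{nV_T}\right),
       \qquad \Delta V_{GS} = V_{GS2}-V_{GS1},\\
T &= \frac{q\,\Delta V_{GS}}{n\,k}\cdot\frac{1}{\ln\left(I_2/I_1\right)},
       \qquad V_T = \frac{kT}{q}.
\end{align*}
```

Under these assumptions the mobility, oxide capacitance and threshold voltage cancel in the ratio, and the temperature enters only through 1/ln(I2/I1).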
Note that the two NMOS devices are identical and can be sized large enough to
Fig. 7.3: Mathematical background of the operation principles of the proposed current-ratioed
PTAT.









As can be seen in Eq. 7.3, the current ratio is determined by the difference
of the overdrive voltages ΔVGS (= VGS2 − VGS1 < 0, assuming VGS1 > VGS2), the
subthreshold swing n and the thermal voltage VT. Equivalently, the absolute tem-








Fig. 7.4: Optimal ΔVGS vs. adjusted-R² coefficient.

Then we examine the basic function y(x) = 1/ln(x), as shown in Fig. 7.3.
Notice that the plot of this function has two lobes; we are interested
in the circled region of the lower-left lobe, and a zoomed view of this region is
also plotted on the right of Fig. 7.3. As can be observed, a portion of this curve
(x ∈ (0.05, 0.3)) shows decent linearity: the adjusted-R² coefficient of the linear
fit is 0.99943 for this specific region. It turns out that if a proper ΔVGS is
applied to the temperature sensor core to meet the desired current ratio, the
absolute temperature




A theoretical calculation of the optimal ΔVGS versus the adjusted-R² coefficient
across the 0–100°C temperature range is performed, as shown in Fig. 7.4. For
ΔVGS within the -100 to -60 mV range, the adjusted-R² coefficient is over 0.999,
and the coefficient peaks at a ΔVGS of -80mV, which is the value chosen in this
work. As a result, the current ratio can function as a PTAT quantity.
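
The linearity argument can be checked numerically. The sketch below is not the calculation used for Fig. 7.4; it assumes the ideal exponential current ratio exp(ΔVGS/(n·kT/q)) and a subthreshold slope factor of n = 1.4, which is a hypothetical value, so the location of the optimum it reports depends on that choice.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge, in V/K


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def adjusted_r_squared(xs, ys):
    """Adjusted R^2 for a straight-line fit with a single predictor."""
    n = len(xs)
    return 1.0 - (1.0 - r_squared(xs, ys)) * (n - 1) / (n - 2)


def current_ratio(delta_vgs_v, temp_k, n_slope):
    """Ideal I2/I1 of two identical subthreshold devices (assumed model)."""
    return math.exp(delta_vgs_v / (n_slope * K_OVER_Q * temp_k))


# 1) Linearity of y = 1/ln(x) on x in [0.05, 0.3], sampled on a uniform grid
xs = [0.05 + i * (0.25 / 200) for i in range(201)]
ys = [1.0 / math.log(x) for x in xs]
print(f"1/ln(x) on [0.05, 0.3]: adjusted R^2 = {adjusted_r_squared(xs, ys):.5f}")

# 2) Linearity of the current ratio versus temperature for several dVGS values
N_SLOPE = 1.4  # hypothetical subthreshold slope factor
temps_k = [273.15 + t for t in range(0, 101, 5)]  # 0..100 degC
for dv_mv in range(-120, -30, 10):
    ratios = [current_ratio(dv_mv * 1e-3, t, N_SLOPE) for t in temps_k]
    print(f"dVGS = {dv_mv:4d} mV: adjusted R^2 = {adjusted_r_squared(temps_k, ratios):.6f}")
```

Because 1/ln(I2/I1) is exactly proportional to T in this model, a current ratio that stays inside the near-linear region of y(x) = 1/ln(x) is itself close to linear in T, which is what the second loop measures.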
With the linear dependence between the absolute temperature and the current
ratio, we can further obtain a delay ratio through a simple current-to-delay
converter. The output node is first charged to VDD through a PMOS device; the PMOS
is then turned off and the output node slowly discharges through the
subthreshold-biased NMOS to the switching threshold. After several simple logic
gates, we obtain two Temperature-Sensitive-Delay (TSD) pulses whose delay ratio is
proportional to the absolute temperature, as described in Eq. 7.6.
$$t_{d1} \propto \frac{C_1}{I_1}$$







7.3 Hybrid Domain Temperature Sensing Scheme
Conventional time-domain temperature sensors use either a single
temperature-sensitive delay [123] or the difference of two temperature-sensitive
delays [130, 131]. As a result, the absolute delay information has to be digitized
through time-to-digital converters (TDCs), for which an accurate external frequency
reference is needed.
In contrast, with the proposed ratioed-current/delay sensor core, the requirement
for an external frequency reference is relaxed through the hybrid domain
processing technique, as shown in Fig. 7.5. Since only the ratio of the two TSDs
generated by the sensor core is of interest, we can use a free-running
temperature-sensitive ring oscillator (TSRO) as the main clock of the TDCs to
sample both TSDs simultaneously. The ratio can then be calculated with digital
arithmetic circuits. Since we use frequency-domain information (from the TSRO) to
quantize time-domain information (from the TSDs), the TDC values and the ratio
calculation are represented in a hybrid domain (time-frequency domain), as opposed
to the previous time-domain values sampled by a frequency reference. It is worth
noting that the hybrid-domain TDC values should be large enough to minimize the
quantization error in the delay ratio calculation.

Fig. 7.5: Timing diagram of the ratioed-current/delay temperature sensor.

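
The reason the sampling clock need not be accurate can be illustrated with a small behavioral model. This is a sketch, not the implemented circuit: the two delay values and the three clock frequencies are hypothetical, and the 14-bit counter width is taken from the implementation described in the next section.

```python
import math

TDC_BITS = 14  # counter width of the TDCs described in this chapter


def tdc_count(delay_s: float, f_tsro_hz: float) -> int:
    """Counter-based TDC: whole TSRO periods that fit inside one delay pulse."""
    count = math.floor(delay_s * f_tsro_hz)
    if count >= 2 ** TDC_BITS:
        raise OverflowError("TDC counter would overflow")
    return count


def hybrid_ratio(td1_s: float, td2_s: float, f_tsro_hz: float) -> float:
    """Delay ratio computed from two counts taken with the same TSRO clock."""
    return tdc_count(td1_s, f_tsro_hz) / tdc_count(td2_s, f_tsro_hz)


# Hypothetical temperature-sensitive delays at one operating point
TD1_S, TD2_S = 413.7e-6, 251.3e-6
TRUE_RATIO = TD1_S / TD2_S

# The same pair of delays sampled by three different (unknown) TSRO frequencies
for f_hz in (2.1e6, 5.3e6, 9.7e6):
    ratio = hybrid_ratio(TD1_S, TD2_S, f_hz)
    err_pct = 100.0 * (ratio - TRUE_RATIO) / TRUE_RATIO
    print(f"f_TSRO = {f_hz / 1e6:.1f} MHz: ratio = {ratio:.5f}, error = {err_pct:+.3f} %")
```

The clock frequency cancels in the ratio, so only the quantization error remains, and it shrinks as the counts grow. This is consistent with the requirement that the hybrid-domain TDC values be large enough.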
7.4 Circuit Implementation Details
The circuit block diagram of the hybrid domain temperature sensor is shown in
Fig. 7.6. A 5-tap PMOS resistor ladder is implemented to provide the
temperature-insensitive overdrive voltages. With a 0.4V supply, the difference of
the overdrive voltages is 80mV. The current-to-delay converters convert the NMOS
discharge currents into delay information, and capacitors are used to increase the
delay pulse width. A capacitor ratio of 27 is adopted to balance the two delays
across a wide temperature range, as they are sampled by the same
Temperature-Sensitive Ring Oscillator (TSRO), which has to satisfy the trade-off
between the TSRO frequency and overflow of the TDCs. Accordingly, a 31-stage TSRO
is implemented to digitize the Temperature-Sensitive Delays (TSDs) with
binary-counter based 14-bit TDCs. The 31-stage TSRO is selected to ensure enough
timing slack for the 14-bit TDCs, whose critical path is the long carry chain.
Fig. 7.5 also illustrates the detailed operation of the hybrid domain
temperature sensor. The PMOS devices are first turned on to charge the capacitors.
Once the PMOS devices are switched off, the current-to-delay converters start to
discharge the capacitors and convert the currents into the TSDs.
The two TSDs are then digitized by the TSRO and the delay ratio is calculated,
both in the hybrid domain. The two TSDs are also used as the clock-gating signals
for the two TDCs (GCLK1 and GCLK2) and as the enable signal of the TSRO to
reduce dynamic power. The hybrid domain ratio calculation is performed using a
15-bit single-precision floating point format.
The discharging NMOS devices are sized with a longer gate length to minimize
device mismatch, which also leads to smaller capacitors. In the chosen
technology, an NMOS device with 2μm/1μm geometry is used, and the total capacitance
is 1.1pF (the unit capacitor is around 39fF). The PMOS device is sized to have proper
Ion and Ioff to ensure correct charging and discharging of the capacitors in both
current-to-delay converters.

Fig. 7.6: Schematic of the hybrid-domain temperature sensor.

Fig. 7.7 shows the simulated delay ratio of the hybrid domain sensor described
above. Process variation effects are also considered to verify the sensor
performance. As can be observed, the proposed delay ratio shows good PTAT behavior
under all process corners, with the worst case at the SS corner, where the
adjusted-R² coefficient is 0.99838. However, the slope varies slightly, which
is due to the relatively imbalanced driving strength between the charging (PMOS)
and discharging (NMOS) devices at different process corners. Fig. 7.8 shows
the simulated error after two-point calibration at 10°C and 90°C. For the
temperature range of interest, the maximum error across process corners is -11.3°C.

Fig. 7.7: Simulated delay ratio of the ratioed-current/delay temperature sensor.

Fig. 7.8: Simulated temperature error of the ratioed-current/delay temperature sensor.

Fig. 7.9: Illustration of time domain (top left) and hybrid domain (top right)
processing, and the hybrid domain processing benefits on TDC bandwidth (bottom left)
and dynamic frequency scaling (bottom right, simulated).

In addition to removing the frequency reference, the hybrid domain processing
has extra benefits, as conceptually illustrated in Fig. 7.9. In the time domain
case, the reference frequency must be chosen for the worst case, i.e., fast enough
to sample the delay at the high temperature/FF corner, and this fast clock
significantly increases the TDC bandwidth and power consumption at the low
temperature/SS corner. In contrast, in the hybrid domain case, the TSRO provides
a frequency that tracks process and temperature. As long as the TDC values are
large enough to minimize the quantization error, the TSRO can dynamically reduce the
TDC bandwidth and dynamic power. In this way, the hybrid domain technique
provides 26.3% TDC bandwidth reduction, as indicated in Fig. 7.9. In addition,
a 3.3× worst-case dynamic power reduction (FF corner) is observed for the hybrid
domain technique at 0°C due to the dynamic frequency scaling of the TSRO.
The hybrid domain digital processing uses a 15-bit single-precision floating
point format revised from the IEEE 754 standard, as shown in Fig. 7.10.

Fig. 7.10: Hybrid domain digital processing data format.

7.5 Measurement Results
The temperature sensor is realized in a UMC 65nm 1P6M low-leakage process.
Metal-oxide-metal capacitors (MOMCAPs) are adopted in this prototype. Fig. 7.11
shows the die micrograph with annotated floorplan. The sensor occupies 0.022mm²
of silicon area in total, and the sensor core plus the two TDCs occupy only 0.0054mm².

Fig. 7.11: Die micrograph with annotated floorplan.

The sensor core, TDCs and the hybrid domain processing circuit are operated
from an off-chip regulated 0.4V supply. An ESPEC SU-240 temperature chamber
is used, and eight test chips in QFN packages are measured over the
temperature range from 0 to 100°C in 10°C steps.
The measured delay ratio, represented as the digital code output of the
hybrid domain processing unit, is shown in Fig. 7.12 (left), and the statistics of
the corresponding adjusted-R² coefficient are shown in Fig. 7.12 (right). The
measured delay ratio shows larger offset/slope variations than the simulated
results in Fig. 7.7, which might be caused by less accurate subthreshold models.
Despite this, the measured mean value of the adjusted-R² is 0.9993 and the standard
deviation is only 0.0004, showing good linearity. Fig. 7.13 shows the measured
sensor inaccuracy over the 8 test chips. After 2-point calibration
at 10°C and 90°C, the measured error of the hybrid domain temperature sensor is
-1.6°C/+1°C across 0–100°C.
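
The 2-point calibration used above amounts to fitting a straight line through the output codes at the two calibration temperatures. The sketch below is illustrative only: the output codes are synthetic (a linear term plus a small hypothetical curvature), not measured data, and the function name is an assumption.

```python
def two_point_calibration(code_lo, temp_lo_c, code_hi, temp_hi_c):
    """Return a code -> temperature mapping through two calibration points."""
    slope = (temp_hi_c - temp_lo_c) / (code_hi - code_lo)
    offset = temp_lo_c - slope * code_lo

    def to_temperature(code):
        return slope * code + offset

    return to_temperature


# Synthetic delay-ratio codes for 0..100 degC in 10 degC steps (hypothetical values)
temps_c = list(range(0, 101, 10))
codes = [1.20 + 0.004 * t + 2e-6 * (t - 50) ** 2 for t in temps_c]

# Calibrate at 10 degC and 90 degC, as done for the simulated and measured results
to_temp = two_point_calibration(codes[1], 10.0, codes[9], 90.0)

errors = [to_temp(c) - t for c, t in zip(codes, temps_c)]
for t, e in zip(temps_c, errors):
    print(f"{t:3d} degC: error = {e:+.2f} degC")
print(f"error span: {min(errors):+.2f} / {max(errors):+.2f} degC")
```

By construction the error is zero at the two calibration temperatures; the residual error elsewhere reflects only the nonlinearity of the code versus temperature, which is why a high adjusted-R² matters.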

Fig. 7.12: Measured delay ratio (left) and adjusted-R² coefficient (right) of 8 chips.

Fig. 7.13: Measured temperature error across 8 dies.

In the worst case (0°C), the hybrid domain temperature sensor can be sampled with
a clock of over 800Hz with 10% duty cycle (used for charging the capacitors, not
as a reference). We take 20 samples and average the output to reduce
the read-out error, which results in 40 conversions/sec. At this rate, the test
chips dissipate on average 280nW at room temperature.
For the 0–100°C sensing range, the 9-bit fractional digits lead to an average
resolution of 0.25°C across the eight chips.
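
The conversion rate follows directly from the sampling clock and the averaging depth. The short sketch below reproduces it and additionally derives an energy per conversion; that last figure is not reported in this chapter and is computed here only under the assumption that the 280nW average power applies at the 40 conversions/sec rate.

```python
F_SAMPLE_HZ = 800.0        # worst-case sampling clock quoted above
SAMPLES_PER_READING = 20   # samples averaged per output value
AVG_POWER_W = 280e-9       # average measured power at room temperature

conversions_per_s = F_SAMPLE_HZ / SAMPLES_PER_READING
energy_per_conversion_j = AVG_POWER_W / conversions_per_s  # derived, not reported

print(f"conversion rate      : {conversions_per_s:.0f} conversions/s")
print(f"energy per conversion: {energy_per_conversion_j * 1e9:.1f} nJ")
```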
Table 7.2 summarizes the state-of-the-art nano-watt temperature
sensors. Previous nW implementations generally require a 1V/1.2V supply, or even
dual supply voltages. Also, analog blocks, such as resistors and op-amps, are
extensively used. For TDC-based sensors, if the power of the external frequency
reference is considered, the total power will exceed the μW limit if a wide sensing
range and a moderate sample rate are needed. For FDC-based sensors with an
integrated frequency reference, the sensor accuracy and power consumption are
largely affected by the resistor types available to achieve the desired temperature
dependence. As a result, improving area and accuracy within an nW power budget is
challenging in scaled technologies.
The hybrid domain temperature sensor does not require an accurate external
frequency reference, and the only analog block used is the capacitor. It also
achieves 0.4V operation with a nearly all-digital implementation, which is preferred
for wireless sensing systems and is technology-scaling friendly. Simulation results
indicate that over 90% of the power consumption of the hybrid domain temperature
sensor is leakage power. Therefore, the power of our design in 0.18μm is expected
to decrease below 100nW. It is also worth noting that the proposed sensor can
further reduce power if a system building block such as an MCU is available for the
ratio calculation, so that only the sensor core and TDCs are needed.

Table 7.2: Summary of state-of-the-art nW temperature sensor

| | This Work | CICC'13 [135] | CICC'08 [134] | TCS II'09 [130] | JSSC'10 [131] |
|---|---|---|---|---|---|
| Technology | 65nm | 0.18μm | 0.18μm | 0.18μm | 0.18μm |
| Area (mm²) | 0.0054 (Core + TDC), 0.022 (Total) | 0.09 | 0.05 | 0.0324 | 0.042 |
| Supply Voltage | 0.4V | 1.2V | 1V | 1V | 0.5V, 1V |
| Frequency Reference Dependence | No | Integrated | Integrated | Required | Required |
| Power Consumption | 280nW | 65nW | 220nW | 405nW | 119nW |
| Range (°C) | 0~100 | 0~100 | 0~100 | 0~100 | -10~30 |
| Calibration Method | 2-point | 2-point | 2-point | 2-point | 2-point |
| Inaccuracy (°C) | -1.6~+1 | -1.4~+1.3 | -1.6~+3 | -0.8~+1 | -1~+0.8 |
| Resolution (°C) | 0.25 | 0.3 | 0.3 | 0.3 | 0.2 |
| Sample/second | 40 | 32 | 10 | 1000 | 33 |

The proposed temperature sensor design demonstrates that an ultra-low voltage (down
to 0.4V), nearly all-digital implementation with decent linearity and inaccuracy is
feasible in nanometer technologies. Its performance is comparable to that of
the state-of-the-art designs. In particular, the hybrid domain processing
concept removes the dependence on the frequency reference.
7.6 Conclusion
In this chapter, we present the design and implementation of a 65nm 0.4V
280nW hybrid domain temperature sensor for wireless sensing platforms. A
ratioed-current/delay sensor core and a hybrid domain temperature sensing scheme are
proposed to eliminate the generally required frequency reference. The measured
inaccuracy after 2-point calibration is -1.6°C/+1°C across 0–100°C at a rate of
40 samples/second. Due to the nearly all-digital implementation, the proposed design
requires reduced design effort and is suitable for scaled technologies.
Chapter 8
Conclusion and Future Work
8.1 Conclusion
Sub-/near-Vth operation is a compelling circuit design strategy for reducing
the power consumption of CMOS digital integrated circuits, and it provides
more than an order of magnitude energy reduction compared to operation at the
nominal supply voltage. However, the compromised device characteristics, together
with the increased sensitivity to process variations under sub-/near-Vth operation,
bring great challenges in achieving a proper design trade-off between robustness,
energy/area efficiency, and performance. This thesis presents our research on
designing robust sub-/near-Vth circuits with minimum energy/area overhead, and
with performance boosting as a target, at different design entry points.
First, a near-Vth ASIC design flow with statistical timing analysis, incorporating
design-time forward body-biasing to reduce the excessive design margin and
boost performance, is introduced in Chapter 3. We proposed the Surrogate
Model Adjustment based Statistical Static Timing Analysis (SMA-SSTA) for
reducing the runtime cost of standard cell characterization for local variations
and of statistical timing analysis. In addition, a novel SelF-Body-Biasing (SFBB)
scheme is proposed for overhead-free performance boosting. The two synergistic
approaches enable variation-aware body-biased near-Vth ASIC design
for the first time. The benefit of the proposed design flow is experimentally
verified on a near-threshold AES encryption engine fabricated in a commercial 65nm
low-leakage process. Through the co-design of architecture and design flow, the
measured test chip delivers 12.2Mbps with 1.65pJ/bit at 0.5V, which represents 22×
and 7.8× improvements over a state-of-the-art AES design, while still reducing
silicon area by 28%.
Second, customized designs for several key ULV building blocks are
demonstrated, as listed below:
• Energy-efficient level shifter design
In Chapter 4, we proposed the design of an NMOS-diode current limiter
based level shifter. Several circuit techniques, such as MTCMOS and
inverse-narrow-width-effect (INWE) aware sizing, are explored to further improve
the energy efficiency of the level shifter. Measurement results show that the
proposed level shifter achieves on average 25.1ns propagation delay
and 30.7fJ/bit operation when converting a 300mV input to 1.2V.
• Robust and energy-efficient intra-cell mixed-Vth standard cell design methodology
In Chapter 5, we proposed a novel intra-cell mixed-Vth standard cell design
methodology for robust ULV operation with improved energy efficiency. The
proposed solution replaces the bottleneck devices in ULV logic cells with LVT
devices, which maintains the cell area, and the energy efficiency is maintained
owing to the reduced parasitics. Library-level experiments validate that the
proposed methodology achieves on average 30.1% energy-efficiency improvement
compared with the previous device upsizing techniques.
• Energy/Area-Efficient Hidden-Refresh eDRAM
In Chapter 6, we explored the potential energy benefits of eDRAM as an
alternative to SRAM. The eDRAM provides higher density, but the refresh
operation complicates the system-level design due to the reduced availability
during refresh. We proposed a hidden-refresh eDRAM design with
an SRAM-like interface, and several circuit techniques are introduced to
realize true single-VDD operation. Simulation results show that the eDRAM has
promising density and energy benefits over the SRAM counterpart, making it
a promising design choice for memory-intensive ULP systems.
Finally, we propose a 0.4V 280nW nearly all-digital hybrid domain sub-Vth
temperature sensor capable of ULV operation. A ratioed-current/delay PTAT sensor
core and a hybrid domain temperature sensing scheme are proposed to eliminate
the dependence on a frequency reference. After 2-point calibration, the eight
test chips show a measured inaccuracy of -1.6°C/+1°C across the 0–100°C range at a
rate of 40 samples/second.
8.2 Future Work
We will further explore the effectiveness of the proposed intra-cell mixed-Vth
standard cell design methodology in real silicon designs. It will be applied to a
wide variety of designs, including both dedicated hardware accelerators and
general-purpose micro-controllers. The application of the hidden-refresh eDRAM will
be further investigated with novel bitcell designs for improving retention and
reducing the static power. In addition, it will be beneficial to incorporate
architectural exploration of both SRAM and eDRAM in ULP systems to find the optimal
combination. For the temperature sensor design, further extending this sensor type
to multiple application scenarios requires efforts to reduce the calibration workload.
In addition, low-power clocking circuits such as efficient kHz/MHz frequency
references are necessary for self-contained mm³-scale sensor platforms, as crystal
oscillators are too bulky. Also, power management circuits are essential,
as sensor node computing platforms will be largely dependent upon energy
harvesting devices. A future sensor platform with the above-mentioned
techniques/blocks is planned and will be implemented in the near future.
Bibliography
[1] G. Moore, "No exponential is forever: but 'forever' can be delayed!" in IEEE
Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2003, pp. 20 -
23.
[2] J. G. Koomey, S. Berard, M. Sanchez, and H. Wong, "Implications of historical
trends in the electrical efficiency of computing," IEEE Annals of the History
of Computing, vol. 33, no. 3, pp. 46 - 54, 2011.
[3] S. Borkar, "Design challenges of technology scaling," IEEE MICRO, vol. 19,
no. 4, pp. 23 - 29, 1999.
[4] R. M. Swanson and J. D. Meindl, "Ion-implanted complementary MOS tran-
sistors in low-voltage circuits," IEEE J. Solid-State Circuits, vol. 7, no. 4, pp.
146 - 153, Apr. 1972.
[5] J. D. Meindl and J. A. Davis, "The fundamental limit on binary switching
energy for terascale integration (TSI)," IEEE J. Solid-State Circuits, vol. 35,
no. 10, pp. 1515 - 1516, Oct. 2000.
[6] B. Calhoun, A. Wang, and A. P. Chandrakasan, "Modeling and sizing for min-
imum energy operation in subthreshold circuits," IEEE J. Solid-State Circuits,
vol. 40, no. 9, pp. 1778 - 1786, Sept. 2005.
[7] A. Wang, and A. P. Chandrakasan, "A 180-mV subthreshold FFT processor
using a minimum energy design methodology," IEEE J. Solid-State Circuits,
vol. 40, no. 1, pp. 310 - 319, Jan. 2005.
[8] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, "The limit of dynamic
voltage scaling and insomniac dynamic voltage scaling," IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 13, no. 11, pp. 1239 - 1252, Nov. 2005.
[9] M. Alioto, "Ultra-Low power VLSI circuit design demystified and explained: a




[10] H. Esmaeilzadeh, et al., "Dark silicon and the end of multicore scaling," IEEE
MICRO, vol. 32, no. 3, pp. 122 - 134, 2012.
[11] L. Wang, K. Skadron, and B. H. Calhoun, "Dark vs. dim silicon and near-
threshold computing," In Dark Silicon Workshop in conjunction with ISCA,
2012.
[12] R. Dreslinski, et al., "Near-threshold computing: reclaiming Moore's law
through energy efficient integrated circuits," Proceedings of the IEEE, vol. 98,
no. 2, pp. 253 - 266, Feb. 2010.
[13] E. Krimer, et al., "Synctium: a near-threshold stream processor for energy-
constrained parallel applications," IEEE Computer Architecture Letters, pp. 21
- 24, 2010.
[14] D. Fick, et al., "Centip3de: a 3930 DMIPS/W configurable near-threshold
3D stacked system with 64 ARM Cortex-M3 cores," in IEEE Int. Solid-State
Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2012, pp. 190 - 192.
[15] S. Borkar, et al., "The future of microprocessors," Communications of the
ACM, vol. 54, no. 5, pp. 67 - 77, 2011.
[16] B. Warneke, M. Last, B. Liebowitz, and K. S. J. Pister, "Smart dust: com-
municating with a cubic-millimeter computer," Computer, vol. 34, pp. 44 - 51,
2001.
[17] G.-Z. Yang, Ed., Body Sensor Networks. London, U.K.: Springer, 2006.
[18] R. Sarpeshkar, Ultra Low Power Bioelectronics: Fundamentals, Biomedi-
cal Applications, and Bio-Inspired Systems. Cambridge, U.K.: Cambridge Univ.
Press, 2010.
[19] L. Doherty, B. A. Warneke, B. E. Boser, and K. S. J. Pister, "Energy and
performance considerations for smart dust," Int. J. Parallel Distrib. Syst. Net-
works, vol. 4, no. 3, pp. 121 - 133, 2001.
[20] B. H. Calhoun, et al., "Design considerations for ultra-low energy wireless
microsensor nodes," IEEE Trans. Computers, vol. 54, no. 6, pp. 727 - 740,
Jun. 2005.
[21] A. P. Chandrakasan, N. Verma, and D. C. Daly, "Ultralow-power electronics




[22] G. Chen, S. Hanson, D. Blaauw, and D. Sylvester, "Circuit design advances
for wireless sensing applications," Proceedings of the IEEE, vol. 98, no. 11, pp.
1808 - 1827, Nov. 2010.
[23] T. Sakurai and A. Newton, "Alpha-power law MOSFET model and its ap-
plications to CMOS inverter delay and other formulas," IEEE J. Solid-State
Circuits, vol. 25, no. 2, pp. 584 - 594, Apr. 1990.
[24] K. Flautner, S. Reinhardt, and T. Mudge, "Automatic performance setting
for dynamic voltage scaling," in Proc. 7th Annu. Int. Conf. Mobile Computing
and Networking (MobiCom'01), May. 2001, pp. 260 - 271.
[25] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "Analysis and mitigation
of variability in subthreshold design," in proc. Int'l Symp. Low Power Electro.
and Design, Aug. 2005, pp. 20 - 25.
[26] S. Hanson, M. Seok, D. Sylvester, and D. Blaauw, "Nanometer device scaling
in subthreshold logic and SRAM," IEEE Trans. Electron Devices, vol. 55, no.
1, pp. 175 - 185, Jan. 2008.
[27] D. Bol, R. Ambroise, D. Flandre, and J.-D. Legat, "Interests and limitations
of technology scaling for subthreshold logic," IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 17, no. 10, pp. 1508 - 1519, Oct. 2009.
[28] J. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice Hall,
2003.
[29] M. Seok, D. Sylvester, and D. Blaauw, "Optimal technology selection for min-
imizing energy and variability in low voltage applications," in proc. Int'l Symp.
Low Power Electro. and Design, 2008, pp. 9 - 15.
[30] J. Keane, H. Eom, T. H. Kim, S. Sapatnekar, and C. Kim, "Subthreshold
logical eort: a systematic framework for optimal subthreshold device sizing, "
in proc. Design Automation Conference, 2006, pp. 425 - 428.
[31] T. H. Kim, H. Eom, J. Keane, and C. Kim, "Utilizing reverse short channel
eect for optimal subthreshold circuit design," in proc. Int'l Symp. Low Power
Electro. and Design, 2006, pp. 127 - 130.
[32] J. Zhou, et al., "A 40 nm inverse-narrow-width-effect-aware sub-threshold s-




[33] B. Liu, M. Ashouei, J. Huisken, and J. P. de Gyvez, "Standard cell sizing for
subthreshold operation," in proc. Design Automation Conference, 2012, pp. 962
- 967.
[34] J. Kwong and A. Chandrakasan, "Variation-Driven device sizing for minimum
energy sub-threshold circuits," in proc. Int'l Symp. Low Power Electro. and
Design, Aug 2006, pp. 8 - 13.
[35] Y. Pu, J. P. de Gyvez, H. Corporaal, and Y. Ha, "VT balancing and device
sizing towards high yield of sub-threshold static logic gates," in proc. Int. Symp.
on Low Power Electronics and Design, pp. 355 - 358, Aug 2007.
[36] N. Lotze and Y. Manoli, "A 62 mV 0.13 µm CMOS standard-cell-based design
technique using schmitt-trigger logic," IEEE J. Solid-State Circuits, vol. 47, no.
1, pp. 47 - 60, Jan. 2012.
[37] N. Verma, J. Kwong, and A. P. Chandrakasan, "Nanometer MOSFET varia-
tion in minimum energy subthreshold circuits," IEEE Trans. Electron Devices,
vol. 55, no. 1, pp. 163 - 174, Jan. 2008.
[38] B. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-
threshold SRAM in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 42, no.
3, pp. 680 - 688, Mar. 2007.
[39] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "A variation-tolerant sub-
200 mV 6-T subthreshold SRAM," IEEE J. Solid-State Circuits, vol. 43, no.
10, pp. 2338 - 2348, Oct. 2008.
[40] M.-F. Chang, et al., "A sub-0.3 V area-efficient L-shaped 7T SRAM with
bitline swing expansion schemes based on boosted read-bitline, asymmetric-
VTH read-port, and offset cell VDD biasing techniques," IEEE J. Solid-State
Circuits, vol. 48, no. 10, pp. 2558 - 2569, Oct. 2013.
[41] N. Verma, and A. P. Chandrakasan, "A 256 kb 65 nm 8T subthreshold SRAM
employing sense-amlier redundancy," IEEE J. Solid-State Circuits, vol. 43, no.
1, pp. 141 - 149, Jan. 2008.
[42] M. E. Sinangil and A. P. Chandrakasan, "A reconfigurable 8T ultra-dynamic
voltage scalable (U-DVS) SRAM in 65 nm CMOS," IEEE J. Solid-State Cir-
cuits, vol. 44, no. 11, pp. 3163 - 3173, Nov. 2009.
[43] T.-H. Kim, J. Liu, and C. H. Kim, "A voltage scalable 0.26 V, 64 kb 8T
SRAM with Vmin lowering techniques and deep sleep mode," IEEE J. Solid-
State Circuits, vol. 44, no. 6, pp. 1785 - 1795, 2009.
[44] M.-F. Chang, S.-W. Chang, P.-W. Chou, and W.-C. Wu, "A 130 mV SRAM
with expanded write and read margins for subthreshold applications," IEEE J.
Solid-State Circuits, vol. 46, no. 2, pp. 520 - 529, Feb. 2011.
[45] B. H. Calhoun and A. P. Chandrakasan, "A 256 kb sub-threshold SRAM in
65 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech.
Papers, Feb. 2006, pp. 2592 - 2601.
[46] T.-H. Kim, J. Liu, J. Keane, and C. H. Kim, "0.2 V, 480 kb subthreshold
SRAM with 1 k cells per bitline for ultra-low-voltage computing," IEEE J.
Solid-State Circuits, vol. 43, no. 2, pp. 518 - 529, 2008.
[47] I. J. Chang, J. J. Kim, S. P. Park, and K. Roy, "A 32 kb 10T subthreshold
SRAM array with bit-interleaving and differential read scheme in 90 nm CMOS,"
in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2008, pp.
388 - 622.
[48] W. K. Luk and R. H. Dennard, "A novel dynamic memory cell with internal
voltage gain," IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 884 - 894, April.
2005.
[49] B. Zhai, et al., "Energy-efficient subthreshold processor design," IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 8, pp. 1127 - 1137, Aug.
2009.
[50] T.-H. Chen, J. Chen, and L. T. Clark, "Subthreshold to above threshold level
shifter design," J. Low Power Electron., vol. 2, no. 2, pp. 251 - 258, Aug. 2006.
[51] H. Shao and C. Tsui, "A robust, input voltage adaptive and low energy con-
sumption level converter for sub-threshold logic," in proc. ESSCIRC, pp. 312 -
315, 2007.
[52] S. N. Wooters, B. H. Calhoun, and T. N. Blalock, "An energy-efficient sub-
threshold level converter in 130-nm CMOS," IEEE Trans. Circuits. Syst. II,
Exp. Briefs, vol. 57, no. 4, pp. 290 - 294, Apr. 2010.
[53] I. J. Chang, J. Kim, K. Kim, and K. Roy, "Robust level converter for sub-
threshold/superthreshold operation: 100 mV to 2.5 V," IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1429 - 1437, Aug. 2011.
[54] Y. Kim, D. Sylvester, and D. Blaauw, "LC2: limited-contention level converter
for robust wide-range voltage conversion," in Symp. VLSI Circuits Dig. Tech.
Papers, pp. 188 - 189, 2011.
[55] Y. Kim, Y. Lee, D. Sylvester, and D. Blaauw, "SLC: split-control level con-
verter for dense and stable wide-range voltage conversion," in proc. ESSCIRC,
pp. 478 - 481, 2012.
[56] Y. Osaki, T. Hirose, N. Kuroki, and M. Numa, "A low-power level shifter with
logic error correction for extremely low-voltage digital CMOS LSIs," IEEE J.
Solid-State Circuits, vol. 47, no. 7, pp. 1776 - 1783, Jul. 2012.
[57] S. Lutkemeier and U. Ruckert, "A subthreshold to above-threshold level shifter
comprising a Wilson current mirror," IEEE Trans. Circuits. Syst. II, Exp. Briefs,
vol. 57, no. 9, pp. 721 - 724, Sep. 2010.
[58] M. Lanuzza, P. Corsonello, and S. Perri, "Low-Power Level Shifter for
Multiple-Supply Voltage Designs," IEEE Trans. Circuits. Syst. II, Exp. Briefs,
vol. 59, no. 12, pp. 922 - 926, Dec. 2012.
[59] S. Hanson, et al., "Exploring variability and performance in a sub-200-mV
processor," IEEE J. of Solid-State Circuits, vol. 43, no. 4, pp. 881 - 891, April,
2008.
[60] M. Hwang and K. Roy, "ABRM: adaptive β-ratio modulation for process-
tolerant ultradynamic voltage scaling," IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 18, no. 2, pp. 281 - 290, Feb, 2010.
[61] Y. Pu, J. P. de Gyvez, H. Corporaal, and Y. Ha, "An ultra-low-energy multi-
standard JPEG co-processor in 65 nm CMOS with sub/near threshold supply
voltage," IEEE J. of Solid-State Circuits, vol. 45, no. 3, pp. 668 - 680, Mar,
2010.
[62] Y. Ho, and C. Su, "A 0.1-0.3 V 40-123 fJ/bit/ch on-chip data link with ISI-
suppressed bootstrapped repeaters," IEEE J. of Solid-State Circuits, vol. 47,
no. 5, pp. 1242 - 1251, May, 2012.
[63] Y. Ho, Y.-S. Yang, C. C. Chang, and C. Su, "A near-threshold 480 MHz 78
µW all-digital PLL with a bootstrapped DCO," IEEE J. of Solid-State Circuits,
vol. 48, no. 11, pp. 2805 - 2814, Nov. 2013.
[64] R. Rithe, C.-C. Cheng, and A. P. Chandrakasan, "Quad full-HD transform
engine for dual-standard low-power video coding," IEEE Journal of Solid-State
Circuits, vol. 47, no. 11, pp. 2724 - 2736, Nov. 2012.
[65] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, "A
249-Mpixel/s HEVC video-decoder chip for 4K Ultra-HD applications," IEEE
Journal of Solid-State Circuits, preprint.
[66] D. Jeon, M. Seok, C. Chakrabarti, D. Blaauw, and D. Sylvester, "A super-
pipelined energy efficient subthreshold 240MS/s FFT core in 65nm," IEEE J.
of Solid-State Circuits, vol. 47, no. 1, pp. 23 - 34, Jan. 2012.
[67] A.R. Sadeghi, D. Naccache (Eds.), Towards Hardware-Intrinsic Security,
Springer, 2010.
[68] M. Seok, D. Blaauw, and D. Sylvester, "Clock network design for ultra-low
power applications," in proc. Int'l Symp. Low Power Electro. and Design, 2010,
pp. 271 - 276.
[69] M. Meijer, J. P. de Gyvez, and A. Kapoor, "Ultra-low-power digital design
with body biasing for low area and performance-efficient operation," ASP J. of
Low Power Electro., vol. 6, no. 4, pp. 1 - 12, 2011.
[70] L. Nazhandali, et al., "Energy optimization of subthreshold-voltage sensor
network processors," in proc. Int. Symp. Comput. Archit., 2005, pp. 197 - 207.
[71] B. Zhai, et al., "2.60 pJ/Inst subthreshold sensor processor for optimal energy
eciency," in Symp. VLSI Circuits Dig. Tech. Papers, 2006, pp. 154 - 155.
[72] S. Hanson, et al., "A low-voltage processor for sensing applications with pi-
cowatt standby mode," IEEE J. of Solid-State Circuits, vol. 44, no. 4, pp. 1145
- 1155, April. 2009.
[73] S. C. Jocke, et al., "A 2.6-µW sub-threshold mixed-signal ECG SoC," in
Symp. VLSI Circuits Dig. Tech. Papers, 2009, pp. 60 - 61.
[74] J. Kwong, Y.K. Ramadass, N. Verma, and A. Chandrakasan, "A 65 nm Sub-Vt
microcontroller with integrated SRAM and switched capacitor DC-DC convert-
er," IEEE J. of Solid-State Circuits, vol. 44, no. 1, pp. 115 - 126, 2009.
[75] S. Lutkemeier, et al., "A 65 nm 32 b subthreshold processor with 9T multi-Vt
SRAM and adaptive supply voltage control," IEEE J. of Solid-State Circuits,
vol. 48, no. 1, pp. 8 - 19, Jan. 2013.
[76] D. Bol, et al., "Sleepwalker: A 25-MHz 0.4-V sub-mm² 7-µW/MHz microcon-
troller in 65-nm LP/GP CMOS for low-carbon wireless sensor nodes," IEEE J.
of Solid-State Circuits, vol. 48, no. 1, pp. 20 - 32, Jan. 2013.
[77] N. Ickes, D. Finchelstein, and A. Chandrakasan, "A 10-pJ/instruction, 4-MIPS




[78] M. Ashouei, et al., "A voltage-scalable biomedical signal processor running
ECG using 13 pJ/cycle at 1 MHz and 0.4 V," in IEEE Int. Solid-State Circuits
Conf. (ISSCC) Dig. Tech. Papers, 2011, pp. 332 - 334.
[79] J. Kwong and A. P. Chandrakasan, "An energy-efficient biomedical signal
processing platform," IEEE J. of Solid-State Circuits, vol. 46, no. 7, pp. 1742 -
1753, Jul. 2011.
[80] S. R. Sridhara, et al., "Microwatt embedded processor platform for medical
system-on-chip applications," IEEE J. of Solid-State Circuits, vol. 46, no. 4,
pp. 721 - 730, April. 2011.
[81] M. Konijnenburg, et al., "Reliable and Energy-Efficient 1MHz 0.4V Dynami-
cally Reconfigurable SoC for ExG Applications in 40nm LP CMOS," in IEEE
Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2013.
[82] H. Kaul, et al., "A 320 mV 56 µW 411 GOPS/Watt ultra-low voltage motion
estimation accelerator in 65 nm CMOS," in IEEE Int. Solid-State Circuits Conf.
(ISSCC) Dig. Tech. Papers, Feb. 2008, pp. 316 - 317.
[83] H. Kaul, et al., "A 300 mV 494 GOPS/W reconfigurable dual-supply 4-way
SIMD vector processing accelerator in 45 nm CMOS," in IEEE Int. Solid-State
Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2009, pp. 260 - 261.
[84] S. Mathew, et al., "53 Gbps native GF(2^4)^2 composite-field AES-
encrypt/decrypt accelerator for content-protection in 45 nm high-performance
microprocessors," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2010, pp.
169 - 170.
[85] H. Kaul, et al., "A 1.45GHz 52-to-162GFLOPS/W variable-precision floating-
point fused multiply-add unit with certainty tracking in 32nm CMOS," in IEEE
Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2012, pp. 182
- 184.
[86] G. Ruhl, et al., "IA-32 processor with a wide-voltage-operating range in 32-nm
CMOS," IEEE MICRO, vol. 33, no. 2, pp. 28 - 36, 2013.
[87] G. Chen, et al., "Millimeter-scale nearly perpetual sensor system with stacked
battery and solar cells," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, 2010, pp. 288 - 289.
[88] Y. Lee, et al., "A modular 1mm³ die-stacked sensing platform with optical
communication and multi-modal energy harvesting," in IEEE Int. Solid-State
Circuits Conf. (ISSCC) Dig. Tech. Papers, 2010, pp. 288 - 289.
[89] F. Zhang, et al., "A batteryless 19 µW MICS/ISM-band energy harvesting
body area sensor node SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC)
Dig. Tech. Papers, 2012, pp. 298 - 300.
[90] G. Chen, et al., "A cubic-millimeter energy-autonomous wireless intraocular
pressure monitor," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech.
Papers, 2011, pp. 310 - 311.
[91] M. Alioto, "Guest editorial for the special issue on ultra-low-voltage VLSI
circuits and systems for green computing," IEEE Trans. Circuits. Syst. II, Exp.
Briefs, vol. 59, no. 12, pp. 849 - 852, Dec. 2012.
[92] M. Orshansky, S. Nassif, D. Boning, Design for Manufacturability and Statis-
tical Design, Springer, 2008.
[93] M. Meijer and J. P. de Gyvez, "Body-bias-driven design strategy for area- and
performance-ecient CMOS circuits," IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 20, no. 1, pp. 42 - 51, Jan, 2012.
[94] S.-F. Hsiao, M.-C. Chen, and C.-S. Tu, "Memory-free low-cost designs of ad-
vanced encryption standard using common subexpression elimination for sub-
functions in transformations," IEEE Trans. Circuits Syst. I, Reg. Papers, vol.
53, no. 3, pp. 615 - 626, Mar. 2006.
[95] P.-C. Liu, J.-H. Hsiao, H.-C. Chang, and C.-Y. Lee, "A 2.97 Gb/s DPA-
resistant AES engine with self-generated random sequence," in proc. ESSCIRC,
Sept. 2011, pp. 71 - 74.
[96] S. K. Mathew, et al., "53 Gbps native GF(2^4)^2 composite-field AES-
encrypt/decrypt accelerator for content-protection in 45 nm high-performance
microprocessors," IEEE J. of Solid-State Circuits, vol. 46, no. 4, pp. 767 - 776,
April, 2011.
[97] M. Feldhofer, J. Wolkerstorfer and V. Rijmen, "AES implementation on a
grain of sand," IEE Proc. on Inf. Secur., vol. 152, no. 1, pp. 13 - 20, Oct, 2005.
[98] C. Hocquet, et al., "Harvesting the potential of nano-CMOS for lightweight
cryptography: an ultra-low-voltage 65 nm AES coprocessor for passive RFID
tags," Springer J. of Crypto. Eng., vol. 1, no. 1, pp. 79 - 89, 2011.
[99] T. Good and M. Benaissa, "692-nW advanced encryption standard (AES) on
a 0.13-m CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.
18, no. 12, pp. 1753 - 1757, Dec, 2010.
[100] P. Hamalainen, T. Alho, M. Hannikainen, and T. Hamalainen, "Design and
implementation of low-area and low-power AES encryption hardware core," in
proc. DSD, pp. 577 - 583, 2006.
[101] R. Rithe, et al., "The effect of random dopant fluctuations on logic timing at
low voltage," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no.
5, pp. 911 - 924, May, 2012.
[102] N. Ickes, et al., "A 28 nm 0.6 V low power DSP for mobile applications,"
IEEE J. of Solid-State Circuits, vol. 47, no. 1, pp. 35 - 46, 2012.
[103] L. Xie, A. Davoodi, J. Zhang, and T-H. Wu, "Adjustment-based modeling for
timing analysis under variability," IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 28, no. 7, pp. 1085 - 1095, July, 2009.
[104] S. Narendra, et al., "Ultra-low voltage circuits and processor in 180nm and
90nm technologies with a swapped-body biasing technique," in IEEE Int. Solid-
State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 156 - 158, Feb, 2004.
[105] S. K. Mathew, et al., "Sub-500ps 64b ALUs in 0.18µm SOI/bulk CMOS:
design and scaling trends," IEEE J. of Solid-State Circuits, vol. 36, no. 11, pp.
1636 - 1646, Nov, 2001.
[106] D. Canright, "A very compact Rijndael S-box," Naval Postgraduate School,
Monterey, CA, Tech. Rep. NPS-MA-04-011, 2005.
[107] M. M. Wong, M. L. D. Wong, A. K. Nandi, and I. Hijazin, "Construction
of optimum composite field architecture for compact high-throughput AES S-
boxes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 6, pp.
1151 - 1155, June, 2012.
[108] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES
algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 9,
pp. 957 - 967, Sep, 2004.
[109] D. Markovic, and R. Brodersen, DSP Architecture Design Essentials,
Springer, 2012.
[110] Synopsys. Liberty NCX User Guide Version F-2011.06, June 2011.
[111] M. Seok, "A fine-grained many VT design methodology for ultra low voltage




[112] C. S. Nagarajan, L. Yuan, G. Qu, and B. G. Stamps, "Leakage optimization
using transistor-level dual threshold voltage cell library," in Int'l Symp. on
Quality Electronic Design (ISQED), pp. 62 - 67, Mar 2009.
[113] J. Kim, and Y. Shin, "Minimizing leakage power in sequential circuits by
using mixed Vt ip-ops," in Int'l Conf. on Computer-Aided Design (ICCAD),
2007, pp. 797 - 802.
[114] D. Somasekhar, et al., "2 GHz 2 MB 2T gain cell memory macro with 128
GBytes/sec bandwidth in a 65 nm logic process technology," IEEE J. Solid-
State Circuits, vol. 44, no. 1, pp. 174 - 185, 2009.
[115] K. Chun, et al., "A sub-0.9V logic-compatible embedded DRAM with boosted
3T gain cell, regulated bit-line write scheme and PVT-tracking read reference
bias," in Symp. VLSI Circuits Dig. Tech. Papers, 2009, pp. 134 - 135.
[116] K. Chun, P. Jain, T. Kim, and C. H. Kim, "A 1.1 V, 667 MHz random cycle,
asymmetric 2T gain cell embedded DRAM with 99.9 percentile retention time
of 110 µsec," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2010, pp. 192 -
193.
[117] K. Chun, W. Zhang, P. Jain, and C. Kim, "A 700 MHz 2T1C embedded
DRAM macro in a generic logic process with no boosted supplies," in IEEE
Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2011, pp. 506 - 507.
[118] W. Zhang, K. Chun, and C. Kim, "A write-back-free 2T1D embedded DRAM
with local voltage sensing and a dual-row-access low power mode," in proc.
Custom Integr. Circuits Conf., 2012, pp. 1 - 4.
[119] Y. Lee, M.-T. Chen, J. Park, D. Sylvester, and D. Blaauw, "A 5.42nW/kB
retention power logic-compatible embedded DRAM with 2T dual-Vt gain cell
for low power sensing applications," in proc. ASSCC, 2010, pp. 1 - 4.
[120] X. Liang, R. Canal, G. Wei, and D. Brooks, "Process variation tolerant
3T1D-based cache architectures," in proc. MICRO, 2007, pp. 15 - 26.
[121] F. Sebastiano, et al., "A 1.2V 10µW NPN-based temperature sensor in 65nm
CMOS with an inaccuracy of ±0.2°C (3σ) from -70°C to 125°C," in IEEE Int.
Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2010, pp. 312 - 313.
[122] K. Souri, Y. Chae, and K. Makinwa, "A CMOS temperature sensor with a
voltage-calibrated inaccuracy of 0.15oC (3) from -55oC to 125oC," in IEEE




[123] P. Chen, C.-C. Chen, C.-C. Tsai, and W.-F. Lu, "A time-to-digital-converter-
based CMOS smart temperature sensor," IEEE J. Solid-State Circuits, vol. 40,
no. 8, pp. 1642 - 1648, Aug. 2005.
[124] P. Chen, et al., "A fully digital time-domain smart temperature sensor re-
alized with 140 FPGA logic elements," IEEE Trans. Circuit and Syst. I: Reg.
Papers, vol. 54, no. 12, pp. 2661 - 2668, Dec. 2007.
[125] P. Chen, et al., "A time-domain SAR smart temperature sensor with curva-
ture compensation and a 3σ inaccuracy of -0.4°C ~ +0.6°C over a 0°C to 90°C
range," IEEE J. Solid-State Circuits, vol. 45, no. 3, pp. 600 - 609, Mar. 2010.
[126] P. Chen, et al., "All-digital time-domain smart temperature sensor with an
inter-batch inaccuracy of -0.7°C ~ +0.6°C after one-point calibration," IEEE
Trans. Circuit and Syst. I: Reg. Papers, vol. 58, no. 5, pp. 913 - 920, Jun. 2011.
[127] K. Woo, et al., "Dual-DLL-based CMOS all-digital temperature sensor for
microprocessor thermal monitoring," in IEEE Int. Solid-State Circuits Conf.
(ISSCC) Dig. Tech. Papers, Feb. 2009, pp. 68 - 69.
[128] E. Saneyoshi, K. Nose, M. Kajita, and M. Mizuno, "A 1.1V 35µm×35µm
thermal sensor with supply voltage sensitivity of 2°C/10%-supply for thermal
management on the SX-9 supercomputer," Symp. VLSI Circuits Dig. Tech.
Papers, pp. 152 - 153, 2008.
[129] G. R. Chowdhury and A. Hassibi, "An on-chip temperature sensor with a
self-discharging diode in 32-nm SOI CMOS," IEEE Trans. Circuits. Syst. II,
Exp. Briefs, vol. 59, no. 9, pp. 568 - 572, Dec. 2012.
[130] M. K. Law and A. Bermak, "A 405-nW CMOS temperature sensor based on
linear MOS operation," IEEE Trans. Circuits. Syst. II, Exp. Briefs, vol. 56,
no. 2, pp. 891 - 895, Dec. 2009.
[131] M. K. Law, A. Bermak, and H. C. Luong, "A sub-µW embedded CMOS
temperature sensor for RFID food monitoring application," IEEE J. Solid-State
Circuits, vol. 45, no. 6, pp. 1246 - 1255, Mar. 2010.
[132] K. Kim, H. Lee, S. Jung, and C. Kim, "A 366kS/s 400µW 0.0013mm²
frequency-to-digital converter based CMOS temperature sensor utilizing multi-
phase clock," in proc. CICC, pp. 203 - 206, Sept. 2009.
[133] S. Hwang, et al., "A 0.008 mm² 500 µW 469 kS/s frequency-to-digital con-
verter based CMOS temperature sensor with process variation compensation,"




[134] Y.-S. Lin, D. Sylvester and D. Blaauw, "An ultra low power 1V, 220nW
temperature sensor for passive wireless applications," in proc. CICC, pp. 507 -
510, Sept. 2008.
[135] S. Jeong, J. Sim, D. Blaauw, and D. Sylvester, "65nW CMOS temperature
sensor for ultra-low power microsystems," in proc. CICC, pp. 1 - 4, Sept. 2013.
[136] Y. W. Li, and H. Lakdawala, "Smart integrated temperature sensor - mixed-
signal circuits and systems in 32-nm and beyond," in proc. CICC, pp. 1 - 8,
Sept. 2011.
[137] S. Yoshimoto et al., "A 40-nm 0.5-V 12.9-pJ/access 8T SRAM using low-
power disturb mitigation technique," Symp. VLSI Circuits Dig. Tech. Papers,
pp. 77 - 78, 2013.
List of Abbreviations
1. W. Zhao, A. B. Alvarez, Y. Ha, "A 65-nm 30-fJ/bit subthreshold level
converter for robust and wide range voltage conversion," submitted under
review.
2. W. Zhao and Y. Ha, "A 65-nm 12.2-Mbps 1.65-pJ/bit near-threshold AES
engine based on novel self-body-biasing and statistical design
methodology", submitted to IEEE TVLSI, under revision.
3. W. Zhao, Y. Ha, C. H. Hoo and A. B. Alvarez, "Robustness driven
energy-ecient ultra-low voltage standard cell design with intra-cell
mixed-Vt methodology", in Proc. of ISLPED, pp. 323-328, 2013.
4. W. Loke, W. Zhao and Y. Ha, "Criticality-based routing for FPGAs with
reverse body bias switch box architectures", in Proc. of FPL, pp. 1-6, 2013.
5. W. Loke, Y. Ha and W. Zhao, "A Power and Cluster-Aware Technology
Mapping and Clustering Scheme for Dual-VT FPGAs," in Proc. IPDPSW,
pp. 221-226. 2012.
