Performance-Driven Energy-Efficient VLSI. by Ma, Wei-Hsiang
Performance-Driven Energy-Eﬃcient VLSI
by
Wei-Hsiang Ma
A dissertation submitted in partial fulﬁllment
of the requirements for the degree of
Doctor of Philosophy
(Electrical Engineering)
in The University of Michigan
2011
Doctoral Committee:
Professor Marios C. Papaefthymiou, Chair
Professor Dennis M. Sylvester
Associate Professor Jerome P. Lynch
Assistant Professor Zhengya Zhang
© Wei-Hsiang Ma 2011
All Rights Reserved
To my family and friends for their love and support
ACKNOWLEDGMENTS
First and foremost I oﬀer my sincerest gratitude to my advisor, Prof. Marios
Papaefthymiou, who has supported me throughout my thesis with his patience and
knowledge while allowing me the room to work in my own way. I would also like
to thank all the other committee members, Prof. Dennis Sylvester, Prof. Zhengya
Zhang, and Prof. Jerome Lynch, for providing valuable feedback and support.
I want to thank Jerry Kao, who, as a good friend and group mate of mine, was
always willing to help and give his best suggestions. It would have been a lonely
office without him. Many thanks to Visvesh, Carlos, Yu-Shiang, Juang-Ying, and
Tai-Chuan for sharing their time and expertise. Thanks to the program coordinator
Beth Stalnaker and the ACAL/DCO staff: Denise, Bert, Lauri, Steve, Ed, and Joel
for their immense help.
Finally, I would like to thank my parents. They have always supported and
encouraged me with their best wishes.
TABLE OF CONTENTS
DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . xiii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Principles of Charge-Recovery Design Techniques . . . . . . . 3
1.2 LC Oscillation and Power-Clock . . . . . . . . . . . . . . . . 7
1.3 Power-Clock Generator . . . . . . . . . . . . . . . . . . . . . 9
1.4 Charge-Recovery Systems . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Fine-grain Systems . . . . . . . . . . . . . . . . . . 12
1.4.2 Coarse-grain Systems . . . . . . . . . . . . . . . . . 16
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5.1 SBL and SBL-Based FIR Filter Design . . . . . . . 18
1.5.2 Flash ADC with Resonant Clock Distribution . . . 19
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Reversible Systems . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Reversible Computing . . . . . . . . . . . . . . . . . 21
2.1.2 Reversible Logic . . . . . . . . . . . . . . . . . . . . 22
2.2 Charge-Recovery Logic . . . . . . . . . . . . . . . . . . . . . 24
2.3 Resonant-Clocked Designs . . . . . . . . . . . . . . . . . . . . 31
3 Subthreshold Boost Logic . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 SBL Overview and Blip Clock Generator . . . . . . . . . . . 39
3.3 SBL Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 SBL Energetics . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 187MHz Charge-Recovery FIR Filter with Subthreshold Boost
Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 FIR Filter Architecture . . . . . . . . . . . . . . . . . . . . . 50
4.2 Power-Clock Generator and Clock Network Design . . . . . . 52
4.3 SBL FIR Filter Design Methodology . . . . . . . . . . . . . . 54
4.4 SBL FIR Filter Spice-Level Analysis . . . . . . . . . . . . . . 55
4.4.1 Spice Simulation Results . . . . . . . . . . . . . . . 55
4.4.2 Performance Comparison with CMOS FIR Filter . . 58
4.5 SBL FIR Filter Test-Chip Measurement . . . . . . . . . . . . 60
4.5.1 Single-Supply Conﬁguration . . . . . . . . . . . . . 61
4.5.2 Two-Supply Conﬁguration . . . . . . . . . . . . . . 62
4.5.3 Energy Trade-oﬀ and Robustness Analysis . . . . . 64
4.5.4 SBL FIR Filter Test-Chip Summary . . . . . . . . . 66
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Architecture and Design of Resonant-Clock Flash ADC . . . 70
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Resonant-Clock Flash ADC Architecture . . . . . . . . . . . . 71
5.3 Resonant-Clock Flash ADC Building Blocks . . . . . . . . . . 72
5.3.1 Track-and-Hold Ampliﬁer . . . . . . . . . . . . . . . 72
5.3.2 Comparators . . . . . . . . . . . . . . . . . . . . . . 73
5.3.3 Grey Code Encoder . . . . . . . . . . . . . . . . . . 75
5.3.4 Sense-Ampliﬁer Flip-Flop . . . . . . . . . . . . . . . 76
5.4 Clock Network Design . . . . . . . . . . . . . . . . . . . . . . 77
5.5 Inductor Design and Analysis . . . . . . . . . . . . . . . . . . 79
6 Evaluation and Testing of 7GS/s Resonant-Clock Flash ADC 83
6.1 Measurement Results . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7 Conclusions and Future Directions . . . . . . . . . . . . . . . . 93
7.1 Subthreshold Boost Logic . . . . . . . . . . . . . . . . . . . . 93
7.2 Resonant-Clock Flash ADC Design . . . . . . . . . . . . . . . 94
7.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 95
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
LIST OF FIGURES
Figure
1.1 Charging and discharging load capacitance using conventional and
charge-recovery techniques: (a) First-order RC network with DC
supply. (b) First-order RC network with n-step supply. . . . . . . . 4
1.2 Practical implementation of power-clock using an inductor: (a) Schematic.
(b) Waveform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 A single-phase power-clock generator with two supplies. . . . . . . . 9
1.4 A single-phase power-clock generator with single supply and a voltage
divider. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 H-bridge two-phase power-clock generator. . . . . . . . . . . . . . . 11
1.6 (a) Basic scheme for a four-phase power-clock generator. (b) Circuit
schematic related to C1− C3. . . . . . . . . . . . . . . . . . . . . . 11
1.7 (a) Schematic of 2N-2P inverter. (b) Four-phase power-clock wave-
forms. (c) Operating waveforms of 2N-2P inverter. . . . . . . . . . . 14
1.8 Resonant-clocked pipeline example. . . . . . . . . . . . . . . . . . . 16
2.1 Truth table of an irreversible gate. . . . . . . . . . . . . . . . . . . . 23
2.2 (a) Truth table and (b) symbol of the Feynman gate. . . . . . . . . 24
2.3 Schematic of (a) NMOS ADL inverter, and (b) PMOS ADL inverter. 25
2.4 Schematic of QSERL inverter. . . . . . . . . . . . . . . . . . . . . . 26
2.5 Schematic of 2N-2N2P inverter. . . . . . . . . . . . . . . . . . . . . 27
2.6 Schematic of PAL inverter. . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Schematic of PFAL inverter. . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Schematic of (a) NMOS TSEL inverter, and (b) PMOS TSEL inverter. 29
2.9 Schematic of (a) NMOS SCAL inverter, and (b) PMOS SCAL inverter. 29
2.10 Schematic of Boost Logic inverter. . . . . . . . . . . . . . . . . . . . 30
2.11 Schematic of Edge-Triggered (E-R) latch. . . . . . . . . . . . . . . . 32
2.12 Schematic of (a) pTERF ﬂip-ﬂop, and (b) nTERF ﬂip-ﬂop. . . . . . 32
2.13 Schematic of sense-ampliﬁer ﬂip-ﬂop used in [1]. . . . . . . . . . . . 33
2.14 Schematics of resonant-clocked latches used in RF1: (a) H-LAT, and
(b) L-LAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.15 Schematic of resonant-clocked latch used in RF1, B-LAT. . . . . . . 34
2.16 Schematic of DESL inverter. . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Schematic of an SBL gate. . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Simple "blip" clock generator: (a) Schematic. (b) Waveform. . . . . 40
3.3 SBL operation: (a) Evaluation Phase. (b) Boost Phase. . . . . . . . 41
3.4 Cascade of SBL gates. . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Clock waveform modeling: (a) Sine clock with equal peak-to-peak
swing. (b) Sine clock with 1.5X peak-to-peak swing. . . . . . . . . . 44
4.1 Block diagram of SBL FIR ﬁlter and BIST circuits. . . . . . . . . . 50
4.2 Schematic and layout of a 4-2 compressor. . . . . . . . . . . . . . . 51
4.3 Distributed "blip" clock generator and measured clock waveform. . 52
4.4 SBL design ﬂow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Simulated energy consumption of SBL FIR ﬁlter. . . . . . . . . . . 56
4.6 Histogram of simulated power-clock insertion delays at a resonant
frequency of 53.7MHz. . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.7 (a) Layout of conventional CMOS FIR. (b) Simulated operating fre-
quency and energy per cycle vs. supply voltage for conventional
CMOS FIR ﬁlter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8 Simulated energy consumption of conventional and SBL FIR ﬁlters. 59
4.9 Measured energy consumption vs. operating frequency for SBL FIR
ﬁlter (single supply and two supplies). . . . . . . . . . . . . . . . . . 63
4.10 Comparison of measured and simulated energy consumption for SBL
FIR ﬁlter (single supply). . . . . . . . . . . . . . . . . . . . . . . . . 64
4.11 Measured total energy consumption vs. VCC for the SBL FIR when
operating at 26.4MHz with VDC = 0.28V. . . . . . . . . . . . . . . . 65
4.12 Measured resonant frequency distribution at VDC = VCC = 0.36V. . 66
4.13 SBL FIR die microphotograph. . . . . . . . . . . . . . . . . . . . . 68
5.1 Architecture of ADC with single-phase resonant clock. . . . . . . . . 72
5.2 Track-and-hold ampliﬁer (THA) circuit. . . . . . . . . . . . . . . . . 73
5.3 Schematic of the comparator and waveform. . . . . . . . . . . . . . 74
5.4 Schematic of the 2-cycle resonant-clocked Gray code encoder. . . . . 76
5.5 Schematic of the sense-ampliﬁer ﬂip-ﬂop. . . . . . . . . . . . . . . . 77
5.6 Resonant clock generator with variable pulse controls. . . . . . . . . 78
5.7 Current density in the substrate underneath inductor: (a) without
M1 shield, and (b) with M1 shield. . . . . . . . . . . . . . . . . . . 80
5.8 (a) Inductance vs. resonant frequency with and without M1 shield.
(b) Quality factor vs. resonant frequency with and without M1 shield. 81
6.1 Measured DNL/INL. . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Measured SNDR/SNFR vs. input frequency. . . . . . . . . . . . . . 85
6.3 Measured energy per cycle vs. sampling frequency. . . . . . . . . . . 86
6.4 Measured power breakdown at 5.5GS/s. . . . . . . . . . . . . . . . . 87
6.5 (a) Measured SNDR vs. sampling frequency with input frequency of
400MHz. (b) Measured FoM vs. sampling frequency. . . . . . . . . 89
6.6 FoM vs. sampling frequency: comparison to prior work summarized
in [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.7 ADC die microphotograph. . . . . . . . . . . . . . . . . . . . . . . . 92
A.1 Bonding diagram for SBL FIR test-chip. . . . . . . . . . . . . . . . 100
A.2 Schematic of the printed circuit board for SBL FIR test-chip. . . . . 102
A.3 Printed circuit board for SBL FIR test-chip. . . . . . . . . . . . . . 104
A.4 Test setup for SBL FIR test-chip. . . . . . . . . . . . . . . . . . . . 105
B.1 Bonding diagram for resonant-clock ADC test-chip. . . . . . . . . . 107
B.2 Schematic of the printed circuit board for resonant-clock ADC test-chip. . . 108
B.3 Printed circuit board for resonant-clock ADC test-chip. . . . . . . . 109
B.4 Test setup for resonant-clock ADC test-chip. . . . . . . . . . . . . . 110
LIST OF TABLES
Table
4.1 SBL FIR ﬁlter statistics and performance measurements. . . . . . . 60
4.2 Performance table. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1 Performance summary. . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.1 I/O information for SBL FIR test-chip. . . . . . . . . . . . . . . . . 101
A.2 Parts list for SBL FIR test-chip. . . . . . . . . . . . . . . . . . . . . 103
B.1 I/O information for resonant-clock ADC test-chip. . . . . . . . . . . 108
B.2 Parts list for resonant-clock ADC test-chip. . . . . . . . . . . . . . . 109
LIST OF APPENDICES
Appendix
A. Testing Setup for SBL FIR Test-Chip . . . . . . . . . . . . . . . . . . 99
B. Testing Setup for Resonant-Clock Flash ADC Test-Chip . . . . . . . . 106
LIST OF ABBREVIATIONS
ADC Analog-to-Digital Converter
ADL Adiabatic Dynamic Logic
BIST Built-in Self-Test
CMOS Complementary Metal-Oxide Semiconductor
dB Decibel
DESL Dynamic Evaluation Static Latch
DNL Diﬀerential Non-Linearity
DSP Digital Signal Processor
ENOB Eﬀective Number of Bits
ERBW Eﬀective Resolution Bandwidth
FIR Finite Impulse Response
FOM Figure of Merit
I/O Input/Output
INL Integral Non-Linearity
LCC Leadless Chip Carrier
LSB Least Signiﬁcant Bit
MOS Metal-Oxide Semiconductor
nTERF NMOS Energy Recovery Flip-Flop
PAL Pass-transistor Adiabatic Logic
PC Power Clock
PCB Printed Circuit Board
PFAL Positive Feedback Adiabatic Logic
pTERF PMOS Energy Recovery Flip-Flop
QFN Quad Flat No leads
QSERL Quasi-Static Energy Recovery Logic
RCK Resonant Clock
RERL Reversible Energy Recovery Logic
SAFF Sense-Ampliﬁer Flip-Flop
SBL Subthreshold Boost Logic
SCAL Source-Coupled Adiabatic Logic
SCRL Split-level Charge Recovery Logic
SMA SubMiniature version A
SMD Surface Mount Devices
SNDR Signal to Noise and Distortion Ratio
SNFR Signal to Noise Floor Ratio
THA Track-and-Hold Ampliﬁer
TSEL True Single-Phase Energy Recovery Logic
VLSI Very Large Scale Integration
ABSTRACT
Performance-Driven Energy-Eﬃcient VLSI
by
Wei-Hsiang Ma
Chair: Marios C. Papaefthymiou
Today, there are two prevalent platforms in VLSI systems: high-performance and
ultra-low power. High-speed designs, usually operating at GHz level, provide the
required computation abilities to systems but also consume a large amount of power;
microprocessors and signal processing units are examples of this type of design. For
ultra-low power designs, voltage scaling methods are usually used to reduce power
consumption and extend battery life. However, circuit delay in ultra-low power de-
signs increases exponentially, as voltage is scaled below Vth, and subthreshold leakage
energy also increases in a near-exponential fashion.
Many methods have been proposed to address the key design challenges on these two
platforms: energy consumption in high-performance designs, and performance and
reliability in ultra-low power designs. In this thesis, charge-recovery design is
explored as a solution targeting both platforms to achieve increased energy efficiency
over conventional CMOS designs without compromising performance or reliability.
To improve performance while still achieving high energy efficiency for ultra-low
power designs, we propose Subthreshold Boost Logic (SBL), a new circuit family
that relies on charge-recovery design techniques to achieve order-of-magnitude
improvements in operating frequency and high energy efficiency compared
to conventional subthreshold designs. To demonstrate the performance and energy
efficiency of SBL, we present a 14-tap 8-bit finite-impulse response (FIR) filter
test-chip fabricated in a 0.13µm process. With a single 0.27V supply, the test-chip
achieves its most energy-efficient operating point at 20MHz, consuming 15.57pJ
per cycle with a recovery rate of 89% and a figure of merit (FoM) equal to 17.37
nW/Tap/MHz/InBit/CoeffBit. In comparison with a static CMOS-based implementation
derived by synthesis of the same FIR architecture and automatic place-and-route,
the SBL-based FIR consumes 40% to 50% less energy per cycle in the 17MHz
- 187MHz range.
To reduce energy consumption at multi-GHz level frequencies, we explore the
application of resonant-clocking to the design of a 5-bit non-interleaved resonant-
clock ﬂash ADC with a sampling rate of 7GS/s. The ADC has been designed in a
65nm bulk CMOS process. An integrated 0.77nH inductor is used to resonate the
entire clock distribution network to achieve energy-eﬃcient operation. Operating at
5.5GHz, the ADC consumes 28mW, yielding 396fJ per conversion step. The clock
network accounts for 10.7% of total power and consumes 54% less energy than $CV^2$.
Operating at its maximum sampling frequency of 7.0GS/s with a 0.1V increase to
each supply, the ADC dissipates 45mW. At this frequency, 11.1% of the total energy
consumption is clock-related, and the ADC yields a FoM of 683fJ per conversion step.
By comparison, in a typical ﬂash ADC design, 30% of total power is clock-related.
CHAPTER 1
Introduction
Energy consumption has become a major design constraint in today's VLSI designs.
For the last several decades, Moore's Law [3] has been the main driving force
to reduce the size and energy consumption of silicon devices. However, this scaling
does not reduce power consumption per unit area. As more devices fit into a
given area, the heat generated increases, and heat removal at the package level
limits further integration. Moreover, starting with 65nm, supply voltage no longer
scales with device sizes, remaining essentially constant over the past several years
and for the foreseeable future. To make matters worse, leakage current increases due
to the smaller device sizes. Therefore, one of the largest issues facing designers
today is energy and power dissipation. Their main challenge is to achieve
energy-efficient computing, extracting the maximum possible performance under a given
power constraint.
Voltage scaling is one of the most effective methods for reducing energy consumption
in digital circuits [4, 5, 6]. Energy consumption decreases quadratically as
the supply voltage VDD decreases. However, this energy-efficient operation comes at
the expense of performance. When voltage is scaled while remaining well above the
threshold voltage (VDD >> Vth), performance degrades approximately linearly with the
supply. When voltage is scaled into the subthreshold regime (VDD < Vth), circuit
delay increases exponentially as VDD decreases and becomes more sensitive to process
variation. Leakage energy also increases in a near-exponential fashion in the
subthreshold regime. This rise in leakage energy eventually dominates total energy
consumption and creates a minimum energy operating point [4]. Scaling beyond this
minimum energy point increases total energy consumption and diminishes energy
efficiency. Moreover, the benefit of voltage scaling depends on the application. For
example, in applications with constantly high workloads, voltage scaling can help
only to a limited extent.
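The minimum energy point can be illustrated with a small numerical sketch. The model
below is an assumption introduced here for illustration and is not taken from this
thesis: dynamic energy is modeled as alpha*C*VDD^2, and leakage energy per cycle as
logic_depth*C*VDD^2*exp(-VDD/(n*vT)), which follows from assuming an exponential
subthreshold on-current. All parameter values (alpha, logic_depth, n_slope,
v_thermal) are hypothetical.

```python
import math

def energy_per_cycle(v_dd, c_eff=1e-12, alpha=0.1, logic_depth=20.0,
                     n_slope=1.5, v_thermal=0.026):
    """Toy subthreshold energy model (assumed, not from the thesis).

    dynamic : alpha * C * VDD^2
    leakage : I_leak * VDD * t_cycle, with t_cycle = logic_depth * C * VDD / I_on
              and I_on / I_leak = exp(VDD / (n * vT)) in subthreshold.
    """
    dynamic = alpha * c_eff * v_dd ** 2
    leakage = (logic_depth * c_eff * v_dd ** 2
               * math.exp(-v_dd / (n_slope * v_thermal)))
    return dynamic + leakage

def minimum_energy_point(v_lo=0.10, v_hi=0.50, steps=400):
    """Grid search for the supply voltage that minimizes energy per cycle."""
    grid = (v_lo + k * (v_hi - v_lo) / steps for k in range(steps + 1))
    best_v = min(grid, key=energy_per_cycle)
    return best_v, energy_per_cycle(best_v)

if __name__ == "__main__":
    v_min, e_min = minimum_energy_point()
    print(f"minimum energy point near VDD = {v_min:.3f} V, E = {e_min:.3e} J")
```

Under these assumed parameters the minimum lies strictly inside the scanned range:
lowering VDD further makes the leakage term grow faster than the dynamic term shrinks.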
Charge recovery is an alternative design approach that can reduce energy consumption
by gradually charging and discharging capacitance and recycling the charge at
the end of each cycle [7, 8, 9, 10]. The energy dissipation of a traditional CMOS
circuit that goes through a charge or a discharge cycle is governed by the equation
$E_{conv} = \text{switching activity} \times CV^2$, while the corresponding energy
dissipation of a charge-recovery system is governed by
$E_{charge-recovery} = (K/T)\,CV^2$, where T is the duration of the transition and K
is a constant proportional to the RC constant of the system. Similar to voltage
scaling, charge recovery exhibits a trade-off between energy consumption and
computation delay, but this trade-off is linear. A large volume of previous work has
built on this trade-off, improving energy efficiency at the cost of system
performance and focusing on relatively low-performance designs. However, this
trade-off does not limit the design scope of charge-recovery techniques. In fact,
charge-recovery design techniques can enable different design points compared to
conventional static CMOS techniques. In principle, these design points could lead to
better energy/performance trade-offs than conventional CMOS design.
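As a brief worked step, the two first-order expressions above can be compared
directly. Writing $\alpha$ for the switching activity (a symbol introduced here only
for brevity), charge recovery dissipates less energy than conventional switching when
\[
\frac{K}{T}\, C V^2 < \alpha\, C V^2
\quad\Longleftrightarrow\quad
T > \frac{K}{\alpha}.
\]
Under these simplified models, the break-even transition time therefore depends only
on K and the switching activity, not on C or V.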
This thesis argues that charge-recovery techniques can be used to design VLSI systems
that achieve both high energy efficiency and high performance. To support this
proposition, two charge-recovery systems operating at different frequency points are
demonstrated. Both designs provide high performance at their corresponding supply
levels and achieve higher energy efficiency than their conventional CMOS
counterparts. For ultra-low energy consumption with high performance, we present a
novel fine-grain charge-recovery circuitry that uses a single subthreshold-level supply.
By amplifying the internal subthreshold-level signals with a two-phase power-clock,
this charge-recovery circuitry relies on gate overdrive to enable fast operation while
improving robustness to variations. In addition, it can share the same supply
with the clock generator, allowing operation with a single DC supply level. An 8×8
bit 14-tap finite impulse response (FIR) filter is used to demonstrate the high energy
efficiency of this logic across the 5MHz-187MHz frequency range with a subthreshold
supply ranging from 0.16V to 0.36V. This design achieves a better Figure of Merit
(FoM) than previous implementations of the same architecture in charge-recovery
logic and in static CMOS.
For high performance and high energy eﬃciency, we present a coarse-grain resonant-
clock ﬂash ADC structure. In this design, resonant-clocking is used to decrease the
energy consumption of the network that distributes the clock signal to the analog and
digital circuitry of the ﬂash ADC. A 5-bit non-interleaved ﬂash ADC that achieves a
sampling frequency range of 4.5GS/s-7GS/s is used to demonstrate the high energy
eﬃciency of resonant-clocking techniques. An integrated inductor is used to resonate
the capacitance of the entire clock distribution network at a target operating fre-
quency range. This work achieves a lower FoM than all previously published ADCs
operating above 2.2GHz.
1.1 Principles of Charge-Recovery Design Techniques
This section describes the principles of charge-recovery techniques as applied to
digital circuit designs. In digital circuits, a MOS transistor is usually used as a
switch with one node connected to supply (VDD) / ground (VSS) and the other node
connected to a capacitive load C. By observing the voltage level of the load, we
can determine the logic state of a gate. By turning the transistor on, we can charge
(discharge) the voltage level of the load and thus change the logic state of a gate.
Figure 1.1: Charging and discharging load capacitance using conventional and
charge-recovery techniques: (a) First-order RC network with DC supply. (b)
First-order RC network with n-step supply.
For simplicity, we can model digital circuits as first-order RC networks, as shown
in Figure 1.1(a). In such a network, a capacitive load CL is charged and discharged
through transistors, which are modeled as a resistor R. By turning on switch S1 and
turning off switch S2, supply VDD starts to charge the load CL from VSS to VDD, and
the voltage level at node Vout ramps up as a function of $V_{DD}(1 - e^{-t/RC_L})$,
where t is the time elapsed since the start of the transition. The total energy
dissipation in the circuit can be obtained by integrating the instantaneous power
dissipation on the resistive device, $I^2R$, over the charging time of CL:
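\[
\begin{aligned}
E &= \int_0^{\infty} i_C(t)\, v_{out}(t)\, dt \\
  &= \int_0^{\infty} C_L \frac{dv_{out}}{dt}\, v_{out}\, dt \\
  &= C_L \int_0^{V} v_{out}\, dv_{out} \\
  &= \frac{C_L V^2}{2}.
\end{aligned}
\tag{1.1}
\]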
The total energy drawn from the supply during charge is equal to the sum of the
energy stored in CL, $\frac{1}{2} C_L V^2$, and the energy loss on the resistor R,
$\frac{1}{2} C_L V^2$.
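The result in Equation (1.1) can be checked numerically. The following sketch, which
is not part of the original analysis, integrates $i(t)^2 R$ for the circuit of Figure
1.1(a) using the closed-form charging current $i(t) = (V_{DD}/R)\,e^{-t/RC_L}$. The
function name and the component values are hypothetical.

```python
import math

def resistor_loss_dc_charge(c_load, v_dd, r, n_samples=20_000, n_tau=30.0):
    """Integrate i(t)^2 * R while C charges from 0 V to v_dd through R.

    Uses the midpoint rule over n_tau time constants; the current is the
    closed-form first-order RC response i(t) = (v_dd / r) * exp(-t / (r * c_load)).
    """
    tau = r * c_load
    dt = n_tau * tau / n_samples
    energy = 0.0
    for k in range(n_samples):
        t = (k + 0.5) * dt                      # midpoint of the k-th interval
        i = (v_dd / r) * math.exp(-t / tau)
        energy += i * i * r * dt
    return energy

if __name__ == "__main__":
    c_load, v_dd = 1e-12, 1.0                   # hypothetical 1 pF load, 1 V supply
    for r in (100.0, 1e3, 1e4):
        loss = resistor_loss_dc_charge(c_load, v_dd, r)
        print(f"R = {r:8.0f} ohm  loss = {loss:.4e} J  "
              f"(0.5*C*V^2 = {0.5 * c_load * v_dd ** 2:.4e} J)")
```

The loss stays at $\frac{1}{2} C_L V^2$ for every R, which is why shrinking the
switch resistance alone does not reduce the dissipation of conventional charging.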
Figure 1.1(b) illustrates how to charge and discharge CL in a charge-recovery
manner. In this ﬁgure, an n-step voltage source is used as an example. In practice,
a supply could be a resonant source capable of reclaiming charge and re-using it
for subsequent charging. This n-step source has a voltage of VDD/n for each step
and time interval of T/2n. While charging the capacitor, we assume that the time
constant RCL is much smaller than the time interval T/2n. This means that the
output CL can be charged to the same level as the supply during each time interval
T/2n. To calculate the energy consumption in Figure 1.1(b) during charging, we
apply Equation (1.1), and total energy dissipation is found to be
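\[
\begin{aligned}
E &= \text{Energy Dissipation per Step} \times n \\
  &= \frac{C_L \left(\frac{V}{n}\right)^2}{2} \cdot n \\
  &= \frac{C_L V^2}{2} \cdot \frac{1}{n}.
\end{aligned}
\tag{1.2}
\]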
The total energy drawn from the supply during the charging phase is equal to the sum
of the energy stored in CL, $\frac{1}{2} C_L V^2$, and the energy loss on the
resistor R, $\frac{1}{2} C_L V^2 \cdot \frac{1}{n}$. Unlike in conventional CMOS
circuits, the energy loss for the circuit in Figure 1.1(b) is scaled by a factor of
$\frac{1}{n}$.
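This energy budget can be reproduced with a few lines of arithmetic. The sketch below
is an illustration under the same full-settling assumption used above: at step k the
source sits at k*V/n and delivers a charge of C*V/n. The function name and values are
hypothetical.

```python
def stepwise_energy_budget(c_load, v_dd, n_steps):
    """Energy bookkeeping for charging C with an ideal n-step source.

    Assumes the capacitor settles fully during every step, so each step
    moves a charge of C * (V / n) out of a source held at k * V / n.
    Returns (energy drawn from source, energy stored in C, energy lost in R).
    """
    v_step = v_dd / n_steps
    q_step = c_load * v_step                    # charge delivered per step
    drawn = sum(k * v_step * q_step for k in range(1, n_steps + 1))
    stored = 0.5 * c_load * v_dd ** 2
    return drawn, stored, drawn - stored

if __name__ == "__main__":
    c_load, v_dd = 1e-12, 1.0                   # hypothetical 1 pF load, 1 V swing
    for n in (1, 2, 4, 10):
        drawn, stored, loss = stepwise_energy_budget(c_load, v_dd, n)
        print(f"n = {n:2d}  drawn = {drawn:.3e} J  loss = {loss:.3e} J  "
              f"C*V^2/(2n) = {c_load * v_dd ** 2 / (2 * n):.3e} J")
```

For n = 1 the sketch reduces to the conventional case of Equation (1.1); for larger n
the difference between drawn and stored energy follows Equation (1.2).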
When discharging the capacitive load CL with the same n-step source, the charge
stored in the capacitive load CL ﬂows back to the source, and the amount of energy
loss on resistor R is the same as when charging CL. A typical voltage supply will shunt
any returned energy to ground, dissipating it across some resistance, and rendering
the charge-recovery discharging no more energy eﬃcient than the conventional case.
However, if the supply is, for example, a resonant source, it will be able to reclaim
the returned charge and use it for subsequent charging. This discharging method is
called charge recovery because the energy transferred to the capacitor is recovered
and reused by the supply.
As the number of steps n approaches infinity, the dissipation approaches zero; however,
a large n also means a short charge time T/2n per step. If T/2n becomes comparable to
or smaller than the time constant RCL of the circuit, the switching event may not
complete within a step, and the result in Equation (1.2) is no longer valid. Since n
determines the energy savings of the circuit, we conclude that there is a trade-off
between time (T) and energy dissipation (E). This energy-time trade-off forms the
basis of charge-recovery techniques.
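The limit on n can be explored with a short simulation. The sketch below is an
illustration rather than the thesis's own analysis: it keeps the total charging time
T/2 fixed, lets each step settle only as far as the exact first-order RC response
allows, and accumulates the resistor loss per step as (1/2)*C*(gap_start^2 -
gap_end^2). The function name and component values are hypothetical.

```python
import math

def stepwise_charge(c_load, v_dd, r, t_half, n_steps):
    """Charge C through R with an n-step staircase lasting t_half in total.

    Each step lasts t_half / n_steps, so settling may be incomplete.
    Returns (final capacitor voltage, energy dissipated in R).
    """
    tau = r * c_load
    decay = math.exp(-(t_half / n_steps) / tau)  # residual gap fraction per step
    v_c, loss = 0.0, 0.0
    for k in range(1, n_steps + 1):
        v_src = k * v_dd / n_steps
        gap_start = v_src - v_c                  # voltage across R at step start
        gap_end = gap_start * decay              # voltage across R at step end
        loss += 0.5 * c_load * (gap_start ** 2 - gap_end ** 2)
        v_c = v_src - gap_end
    return v_c, loss

if __name__ == "__main__":
    c_load, v_dd, r = 1e-12, 1.0, 1e3            # hypothetical: tau = 1 ns
    t_half = 1e-6                                # hypothetical: T/2 = 1000 * tau
    for n in (10, 100, 1000, 10000):
        v_end, loss = stepwise_charge(c_load, v_dd, r, t_half, n)
        ideal = c_load * v_dd ** 2 / (2 * n)
        print(f"n = {n:5d}  v_end = {v_end:.5f} V  loss = {loss:.3e} J  "
              f"Eq. (1.2) = {ideal:.3e} J")
```

With these assumed values, the loss tracks Equation (1.2) while T/2n is much larger
than RCL, but stops improving once the step time approaches RCL; the remaining loss
is then set by the ratio of RCL to the total transition time.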
Equation (1.2) implies that gradual transitioning is the key to achieving
energy-efficient charge-recovery operation. A gradual transition reduces the potential
difference across the resistive element, results in a low-level current flow, and thus
minimizes energy dissipation.
It should be noted that in most charge-recovery systems, the power supply performs
a dual role, providing charge to internal circuit nodes and synchronizing the
computation of the gate. For this reason, such a power supply is usually referred to
as a Power-Clock (PC).
1.2 LC Oscillation and Power-Clock
Figure 1.2: Practical implementation of power-clock using an inductor: (a)
Schematic. (b) Waveform.
One realization of a power-clock generator with energy recovery ability is shown
in Figure 1.2. Here, an inductor L is employed to store the energy of the charge
returned from capacitor C in the form of a magnetic ﬁeld. Periodic energy transfer
between the inductor L and the capacitor C results in a sinusoidal waveform. The
only energy losses in the system are due to the parasitic resistance of the circuit. To
compensate for these losses, a shunt switch driven by pulse p is used to inject current
into the inductor, replenishing energy in each cycle. The natural oscillation frequency
$f_n$ of the ideal LC system is
\[
f_n = \frac{1}{2\pi} \sqrt{\frac{1}{LC}}, \tag{1.3}
\]
where C is the total capacitance of the LC oscillation system.
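Equation (1.3) can be evaluated directly, or inverted to size an inductor for a given
load. The helper names and the numerical values below are hypothetical and serve only
as a sketch of the arithmetic.

```python
import math

def natural_frequency(l_ind, c_total):
    """Natural frequency of an ideal LC tank, per Equation (1.3), in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_ind * c_total))

def inductance_for_target(f_target, c_load):
    """Inductance that resonates c_load at f_target (Equation (1.3) solved for L)."""
    return 1.0 / ((2.0 * math.pi * f_target) ** 2 * c_load)

if __name__ == "__main__":
    c_load = 2e-12                               # hypothetical 2 pF clock load
    f_target = 1e9                               # hypothetical 1 GHz target
    l_ind = inductance_for_target(f_target, c_load)
    print(f"L = {l_ind * 1e9:.2f} nH")
    print(f"check: f_n = {natural_frequency(l_ind, c_load) / 1e9:.3f} GHz")
```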
During practical design, the inductance value for L is chosen to meet a target
frequency $f_n$ with a given extracted load C. For example, with a load $C_L$ and a
target frequency $f_n$, the inductance is chosen as
$L = 1/\left((2\pi f_n)^2 C_L\right) = T^2/(4\pi^2 C_L)$. The
waveform of the forced LC oscillation is a sinusoid-like waveform, and the energy
consumption can be derived as
\[
\begin{aligned}
E &= \frac{1}{2} \left| I^2 \right| R T \\
  &= \frac{1}{2} \left| \frac{V/2}{1/(j\omega C)} \right|^2 R T \\
  &= \frac{1}{2} \left( \frac{\omega V}{2} C \right)^2 R T \\
  &= \frac{1}{2} (\pi f V C)^2 R T \\
  &= \frac{1}{2} \left( \frac{\pi^2 C R}{T} \right) C V^2.
\end{aligned}
\tag{1.4}
\]
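Using Equation (1.3), Equation (1.4) can be further simplified to obtain an expression
of energy dissipation in terms of circuit parameters:
\[
\begin{aligned}
E &= \frac{1}{2} \cdot \frac{\pi^2 C R}{2\pi \sqrt{LC}}\, C V^2 \\
  &= \frac{\pi}{4} R \sqrt{\frac{C}{L}}\, C V^2 \\
  &= \frac{\pi}{4Q}\, C V^2.
\end{aligned}
\tag{1.5}
\]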
With the quality factor defined as $Q = \frac{\sqrt{L/C}}{R}$, a direct energy
dissipation comparison can be drawn between a charge-recovery system and a
conventional CMOS one with the same load. As a result, Q is an important metric and
is often used to evaluate the efficiency of a charge-recovery system.
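The relation between Equations (1.4) and (1.5) can be cross-checked numerically. The
sketch below assumes the series-RLC form of the quality factor, Q = sqrt(L/C)/R, which
is the form consistent with Equation (1.5); function names and component values are
hypothetical.

```python
import math

def quality_factor(l_ind, c_load, r_par):
    """Series-RLC quality factor, Q = sqrt(L / C) / R."""
    return math.sqrt(l_ind / c_load) / r_par

def energy_eq_1_4(c_load, r_par, v_swing, period):
    """Energy per cycle from Equation (1.4): (1/2) * (pi^2 * C * R / T) * C * V^2."""
    return 0.5 * (math.pi ** 2 * c_load * r_par / period) * c_load * v_swing ** 2

def energy_eq_1_5(l_ind, c_load, r_par, v_swing):
    """Energy per cycle from Equation (1.5): pi / (4 * Q) * C * V^2."""
    q = quality_factor(l_ind, c_load, r_par)
    return math.pi / (4.0 * q) * c_load * v_swing ** 2

if __name__ == "__main__":
    l_ind, c_load, r_par, v_swing = 1e-9, 1e-12, 1.0, 1.0   # hypothetical values
    period = 2.0 * math.pi * math.sqrt(l_ind * c_load)       # T = 1 / f_n, Eq. (1.3)
    e4 = energy_eq_1_4(c_load, r_par, v_swing, period)
    e5 = energy_eq_1_5(l_ind, c_load, r_par, v_swing)
    print(f"Q = {quality_factor(l_ind, c_load, r_par):.1f}")
    print(f"Eq. (1.4): {e4:.4e} J   Eq. (1.5): {e5:.4e} J")
    print(f"ratio to C*V^2: {e5 / (c_load * v_swing ** 2):.4f}")
```

Both expressions agree when T is the natural period of the tank, and the ratio
pi/(4Q) gives the dissipation relative to $CV^2$ for the same load.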
1.3 Power-Clock Generator
Figure 1.3 shows a single-phase power-clock generator, which uses an inductor L
and parasitic capacitance CL to form an LC oscillation. This clock generator provides
a single-phase sinusoidal clock waveform, and an inductor L is chosen to achieve a
target resonant frequency for a given load CL.
Figure 1.3: A single-phase power-clock generator with two supplies.
The clock drivers (M1 and M2), similar to those used in [9], periodically replenish
the energy losses in the resonant system through current injection in the
inductor. As the clock approaches its minimum, pulse a causes the pull-down switch
to conduct, discharging the output clock voltage to 0V and causing an RL current
build-up in the inductor. At the falling edge of pulse a, the system continues
oscillating freely with an initial condition V (PC) = 0V and I(L) = In, where In is the
current flowing in the inductor at that time. Similarly, when the clock reaches its
peak, pulse b causes the pull-up switch to conduct, resulting in a similar RL current
build-up in the inductor. At the rising edge of b, the system once again resumes a
free oscillation, with an initial condition V (PC) = VDD and I(L) = Ip, where Ip is
the current flowing in the inductor at that time.
The current build-up in the inductor at the crest and trough of V (PC) enables
the supply to provide energy to the system periodically, which is stored in the form
of a magnetic field in the inductor. The amount of replenished energy required to
maintain stable oscillations is thus governed by the equation:
\[
E_{replenished} = \frac{1}{2} L I_n^2 + \frac{1}{2} L I_p^2. \tag{1.6}
\]
Figure 1.4: A single-phase power-clock generator with single supply and a voltage
divider.
Notice that, in this driven power-clock generator, the natural frequency of the
oscillation is determined by Equation (1.3), and the actual frequency of the generated
sinusoidal waveform is determined by the frequency of the pulses. If this frequency is
too far away from the natural frequency of the oscillation, the generated waveforms
will be distorted.
Instead of the two-supply scheme in Figure 1.3, Figure 1.4 uses a single supply
with a capacitive voltage divider to achieve similar functionality. Large capacitors
are used in the divider to provide stable voltage sources at the cost of capacitor area.
Various circuit topologies for power-clock generators have been proposed for different
charge-recovery logic styles and for different applications. Figure 1.5 shows an
H-bridge clock generator that generates a two-phase power-clock with cross-coupled
pairs of NMOS and PMOS transistors. It generates complementary power-clock waveforms
using only one inductor.
Figure 1.5: H-bridge two-phase power-clock generator.
Figure 1.6: (a) Basic scheme for a four-phase power-clock generator. (b) Circuit
schematic related to C1−C3.
A four-phase power-clock generator has been presented in [11], and its schematic
is shown in Figure 1.6. This circuit uses only one inductor to generate all four
clock phases. The idea of using a single rotating inductor is shown in Figure 1.6(a).
Rotation of the inductor to transfer energy between the clock phases is achieved
by sequentially switching the transmission gates (TG), as shown in Figure 1.6(b) for
loads C1 and C3. One issue with this generator is that the power dissipation
overhead of the controller circuit restricts its use to driving relatively small systems.
1.4 Charge-Recovery Systems
Charge recovery can be deployed in various ways. The two broad classifications
are charge-recovery logic and resonant-clocked designs. Charge-recovery logic belongs
to the class of ﬁne-grain design, which employs charge-recovery techniques at the gate
level to recover energy from load capacitance driven by all gates in a design. Resonant-
clocked design is an example of coarse-grain design and recovers energy from the clock
network in the design. Depending on the speciﬁc implementation of resonant-clocked
designs, charge-recovery techniques may also be extended to the internal nodes of
pipeline registers.
1.4.1 Fine-grain Systems
A fine-grain charge-recovery system is inherently gate-level pipelined with
charge-recovery logic. This logic utilizes the idea of current-steering to conditionally charge or
discharge load capacitance based on the outputs of its preceding stage. To illustrate
the structure, operation, and design of logic in a fine-grain charge-recovery system,
we will use an early charge-recovery logic gate, 2N-2P [12], as an example. Even
though the detailed implementations of various charge-recovery logic families differ,
the underlying objectives, trade-offs, and basic circuit topologies are quite similar.
Figure 1.7(a) shows an inverter implemented in 2N-2P with idealized power-clock
waveforms, ϕ1, ϕ2, ϕ3, and ϕ4, shown in Figure 1.7(b). The gate utilizes
cross-coupled PMOS transistors to steer the current from the power-clock generator to
one of the (ideally) balanced output nodes, out or $\overline{\text{out}}$. The initial resolution at
the output nodes is determined by the complementary pull-down evaluation trees. The
losses in such a charge-recovery system are due to the steering devices and the parasitic
wiring resistance through which the load current flows. For simplicity, we consider
the operation of the gate with two non-overlapping idealized power-clock waveforms
shown in Figure 1.7(c), and ignore the effect of the threshold voltage Vth on the
operation of the gate.
The operation of a 2N-2P gate can be divided into four phases: evaluation, hold,
reset, and wait. For correct operation, all gates are cascaded in a way that gates driven
by ϕ1 connect to gates driven by ϕ2 and so on, until gates driven by ϕ4 connect to
gates driven by ϕ1. At time t = 0, the gate connected to ϕ1 is at the beginning of
its evaluation phase. At this point, the input signal in, which is in its hold phase, provides
a full-level input to transistor M1 and holds node out firmly at VSS. Since $\overline{\text{in}}$ is low,
transistor M2 is off, and node $\overline{\text{out}}$ is floating during this time. As ϕ1 begins to ramp up,
transistor M4 conducts more strongly. Since out is held at VSS, the increasing Vgs of M4
causes $\overline{\text{out}}$ to track ϕ1 closely. At the end of the evaluation phase, $\overline{\text{out}}$ reaches the full
rail, while out remains at 0V. In the hold phase of the gate, out and $\overline{\text{out}}$ provide full-level
driving voltages to the next logic gate. Subsequently, as ϕ1 ramps towards VSS,
the outputs of the gate reset. Note that when the gate is in its reset phase, the input nodes
in and $\overline{\text{in}}$ have already been reset, so M1 and M2 are both off. Consequently,
$\overline{\text{out}}$ tracks the power-clock through M4 and discharges to VSS gradually. Eventually,
both outputs reach 0V and remain stable until the next evaluation phase.
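The four-phase behavior described above can be sketched as a small behavioral model. This is a hedged illustration rather than the simulation setup used in this thesis: it assumes an ideal trapezoidal power-clock, ignores Vth (as the text does), and uses the hypothetical names `power_clock`, `inverter_outputs`, and `out_b` (the complement of `out`).

```python
VDD = 1.0
PHASES = ("evaluation", "hold", "reset", "wait")

def power_clock(phase, x):
    """Idealized power-clock voltage at fraction x (0..1) of the given phase."""
    if phase == "evaluation":
        return VDD * x            # ramps from VSS to VDD
    if phase == "hold":
        return VDD                # stays at the full rail
    if phase == "reset":
        return VDD * (1.0 - x)    # ramps back down to VSS
    return 0.0                    # wait: stays at VSS

def inverter_outputs(in_high, phase, x):
    """Return (out, out_b) of an idealized 2N-2P inverter, ignoring Vth.

    The node pulled down by its NMOS device stays at VSS; the complementary
    node follows the power-clock through the cross-coupled PMOS device.
    """
    pc = power_clock(phase, x)
    return (0.0, pc) if in_high else (pc, 0.0)

# Trace one full cycle for a logic-high input, sampled at the end of each phase.
for phase in PHASES:
    out, out_b = inverter_outputs(True, phase, 1.0)
    print(f"{phase:>10}: out = {out:.1f} V, out_b = {out_b:.1f} V")
```

In this model exactly one of the two output nodes swings in every cycle regardless of the input value, which is the property behind the near-constant clock load discussed next.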
In resonant systems, presenting a constant capacitive load C is important to pro-
vide a stable oscillation frequency, and the dual-rail property of 2N-2P logic can
provide a near-constant load capacitance when looking into the ϕ node of the gate.
As long as the eﬀective capacitors at the 2N-2P output nodes are equal, the value
of C in a resonant system is independent of the output state of gates, and stable
LC oscillation is maintained. This is one of the salient advantages of using dual-rail
logic in a fine-grain charge-recovery system. However, dual-rail logic has its
limitations. One such limitation is the constant switching activity of each gate, which is
50%, independent of the switching probability of the output state, since one of the
two output nodes switches in every cycle. As a result, a fine-grain dual-rail
charge-recovery system could cause gates with low switching activity to dissipate more energy than
their conventional CMOS counterparts. A significant portion of the energy savings is
thus given up when applying fine-grain charge-recovery techniques to designs with
low switching activity.
Figure 1.7: (a) Schematic of 2N-2P inverter. (b) Four-phase power-clock waveforms.
(c) Operating waveforms of 2N-2P inverter.
The need for multiple clock phases in 2N-2P logic also increases the complexity and
reduces the energy eﬃciency of the design. Use of multiple-clock phases requires more
clock network broadcasting in physical design and thus increases design complexity.
Gate cascading needs to follow a ﬁxed phase order in 2N-2P operation, and only gates
driven by speciﬁc phase-pairs can be connected. Furthermore, additional buﬀering
is often required to phase-delay noncritical paths, balancing the paths so that they
arrive at the same time as critical paths. This buﬀering results in additional power
dissipation and aﬀects the extent to which a design can be eﬃciently implemented
using ﬁne-grain charge-recovery techniques.
Fine-grain pipelining enables greater throughput but limits the range of designs
that can be eﬃciently implemented. Charge-recovery techniques involve a fundamen-
tal trade-oﬀ between energy dissipation and latency, as suggested by Equation (1.4);
however, they do not aﬀect system throughput, since ﬁne-grain systems are inherently
gate-level pipelined. Moreover, the maximum evaluation stack height of
each charge-recovery logic gate limits the function that can be implemented in that gate.
As a result, deeper pipelining is sometimes needed compared to a traditional CMOS
datapath. Fine-grain charge-recovery is therefore advantageous in systems that are
throughput intensive and can tolerate increased latencies.
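The energy/latency trade-off mentioned above can be illustrated numerically. The sketch below assumes that Equation (1.4) has the well-known first-order form for charging a capacitance C through a resistance R with a ramp of duration T, E ≈ (RC/T)·C·V², valid for T much larger than RC, and compares it with the ½·C·V² dissipated by an abrupt conventional transition. The device values and function names are hypothetical.

```python
def conventional_energy(c_load, vdd):
    """Energy dissipated when a capacitance is charged abruptly from a DC supply."""
    return 0.5 * c_load * vdd ** 2

def adiabatic_energy(r_on, c_load, vdd, t_ramp):
    """First-order dissipation for a slow ramp (assumed form of Equation (1.4)).

    Only meaningful when t_ramp is much larger than r_on * c_load.
    """
    return (r_on * c_load / t_ramp) * c_load * vdd ** 2

# Hypothetical device values: RC = 10 ps.
r_on, c_load, vdd = 1e3, 10e-15, 1.0

e_conv = conventional_energy(c_load, vdd)
for t_ramp in (100e-12, 1e-9, 10e-9):
    e_adia = adiabatic_energy(r_on, c_load, vdd, t_ramp)
    print(f"T = {t_ramp * 1e9:5.1f} ns: {e_adia * 1e15:.3f} fJ "
          f"(conventional: {e_conv * 1e15:.3f} fJ)")
```

Under this assumption, each tenfold increase in transition time reduces dissipation tenfold, which is why longer per-gate latency is traded for energy while gate-level pipelining preserves throughput.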
Implementing ﬁne-grain charge-recovery techniques in sequential circuits with
feedback is also a challenge, since both latency and throughput are adversely
affected. With a feedback loop in a design, no further computation can occur until the
previous one is completed, so throughput and latency become correlated.
Although the majority of work in charge-recovery design has focused on ﬁne-grain
systems, these techniques are not limited to such logic gates. Charge-recovery
techniques are particularly effective in applications that involve nets with high switching
activity. Regular datapath structures without feedback loops are also well-suited
to fine-grain charge-recovery implementations. In the next section, we will discuss
designs that implement charge-recovery techniques on specific nets, leading to
coarse-grain charge-recovery design.
1.4.2 Coarse-grain Systems
Unlike fine-grain designs, which employ charge-recovery techniques throughout
the design, coarse-grain designs employ charge recovery selectively in parts of the
design where the application of these techniques is more effective. In this section,
a resonant-clocked pipeline is discussed as an example of a coarse-grain system. In
particular, resonant-clocked pipeline designs apply charge-recovery techniques to the
parasitic capacitance of the clock distribution network and, in some cases, to
the internal capacitance of pipeline registers.
Figure 1.8: Resonant-clocked pipeline example.
Figure 1.8 shows an example of a resonant-clocked design where the timing
elements (pipeline registers) are specially designed and clocked by a single-phase resonant
clock, and the combinational logic is identical to conventional CMOS design. An inductor
L is used to resonate the capacitance C of the entire clock network in the system. Due
to the high switching activity of the clock distribution network, substantial dynamic
power reduction can be achieved with special pipeline register designs. In contrast
to a fine-grain gate-level pipelined system, such designs are pipelined at a coarser
level, similar to a conventional CMOS datapath with multi-stage pipelines.
Consequently, coarse-grain systems can be designed to exhibit system-level
timing properties, such as latency, throughput, and cycle time, that are identical to
those of traditional clocked designs.
Unlike ﬁne-grain charge-recovery systems, coarse-grain charge-recovery systems do
not require additional buﬀers to balance delay, greatly increasing the range of designs
that can be implemented in an energy-efficient manner. Furthermore, since logic gates in this
system can be implemented with conventional CMOS logic, the design of resonant-
clocked pipelines is more amenable to commercial tools, especially for synthesis and
place-and-route.
1.5 Contributions
This section outlines the contributions of this thesis. The main motivation of
this thesis is to apply charge-recovery techniques to designs and achieve better
energy/performance trade-offs than conventional CMOS design. We demonstrate
charge-recovery techniques on two systems operating at different frequency points. Both designs
provide high performance at their corresponding supply levels and achieve higher energy
efficiency than their conventional CMOS counterparts.
1.5.1 SBL and SBL-Based FIR Filter Design
In this work, we present Subthreshold Boost Logic (SBL), a new circuit family
that relies on charge-recovery design techniques to achieve order-of-magnitude im-
provements in operating frequencies while still achieving high energy eﬃciency using
subthreshold DC supply levels. Speciﬁcally, SBL uses an inductor and a two-phase
power-clock to boost subthreshold supply levels, overdriving devices and operating
them in linear mode. Charge-recovery switching is used to implement this boosting
in an energy-eﬃcient manner.
To demonstrate the performance and energy eﬃciency of SBL, we also present a
14-tap 8-bit ﬁnite-impulse response (FIR) ﬁlter test-chip fabricated in a 0.13µm tech-
nology with Vth,nmos = 400mV. The energy-eﬃcient operation of the SBL-based FIR
test-chip has been experimentally veriﬁed for clock frequencies in the 5MHz-187MHz
range. With a single 0.27V supply, the test-chip achieves its most energy eﬃcient op-
erating point at 20MHz, consuming 15.57pJ per cycle with a recovery rate of 89% and
a Figure of Merit (FoM) equal to 17.37nW/Tap/MHz/InBit/CoeﬀBit. With the intro-
duction of a second subthreshold supply at 0.18V, energy consumption at 20MHz de-
creases further by 17.1%, yielding 14.40 nW/Tap/MHz/InBit/CoeﬀBit. At its maxi-
mum operating frequency of 187MHz, the test-chip achieves 35.31nW/Tap/MHz/InBit
/CoeﬀBit and 34.47 nW/Tap/MHz/InBit/CoeﬀBit with one and two subthreshold
supplies, respectively. To our knowledge, these ﬁgures of merit are the lowest pub-
lished for FIR test-chips to date [13, 14]. In comparison with a static CMOS-based
implementation derived by synthesis of the same FIR architecture and automatic
place and route, the SBL-based FIR consumes 40% to 50% less energy per cycle in
the 17MHz-187MHz range, based on device-level simulations, while incurring a 15%
area overhead.
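The reported figures of merit can be cross-checked from the energy numbers quoted above. The sketch below is a sanity check under stated assumptions, not the author's evaluation script: it assumes the FoM is power divided by the number of taps, the clock frequency in MHz, the input width, and the coefficient width, and that both widths are 8 bits (the text only says "8-bit"). The function name `fir_fom_nw` is hypothetical.

```python
def fir_fom_nw(energy_per_cycle_pj, taps, in_bits, coeff_bits):
    """FoM in nW/Tap/MHz/InBit/CoeffBit.

    Power in nW divided by frequency in MHz equals energy per cycle in fJ,
    so the FoM reduces to energy per cycle (fJ) / (taps * in_bits * coeff_bits).
    """
    energy_fj = energy_per_cycle_pj * 1e3
    return energy_fj / (taps * in_bits * coeff_bits)

# Single 0.27V supply at 20MHz: 15.57pJ per cycle (from the text).
fom_single = fir_fom_nw(15.57, taps=14, in_bits=8, coeff_bits=8)

# Second supply at 0.18V: energy decreases by a further 17.1% (from the text).
fom_dual = fom_single * (1.0 - 0.171)

print(f"single supply: {fom_single:.2f} nW/Tap/MHz/InBit/CoeffBit")
print(f"two supplies:  {fom_dual:.2f} nW/Tap/MHz/InBit/CoeffBit")
```

Under these assumptions the results agree with the quoted 17.37 and 14.40 figures to within rounding.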
1.5.2 Flash ADC with Resonant Clock Distribution
Resonant clocking has been shown to be an effective approach to reducing
power consumption in GHz-speed clock distribution networks [14, 15, 16]. In the
ADC presented in this thesis, resonant clocking is deployed to decrease the power
consumption of the network that distributes the clock signal to the analog and digital
circuitry of the ADC. Speciﬁcally, a fully integrated inductor is used to resonate the
parasitic capacitance of the entire clock distribution network all the way to the clocked
timing elements. This technique is thus compatible and can be used in conjunction
with previous power optimization approaches at the circuit and architecture levels.
Operating in the vicinity of its resonant frequency with a sampling rate of 5.5GS/s,
the ADC dissipates 28mW, with only 10.7% of the total power spent on clock distribution, and
achieves a FoM equal to 396fJ per conversion step. This FoM is lower than that of all previously
published ADCs operating above 2.2GHz [2]. Correct operation has been validated
at clock rates up to 7GHz with 45mW of total power consumption and 11.1% of
the total power spent on the clock.
1.6 Thesis Outline
The remainder of this thesis is organized as follows: In Chapter 2, we give a brief
introduction to reversible computing, which provides the main ideas and motivations
behind charge-recovery design techniques. We also summarize previous work
on both charge-recovery logic and resonant-clocked designs.
In Chapter 3, we present our novel ﬁne-grain charge-recovery circuit family, called
SBL, that achieves high performance and high energy eﬃciency with a single subthreshold-
level supply. We discuss its structure and operation, including high performance
achievable through eﬃcient signal boosting. We also give a detailed analysis of the
energy consumption for SBL gates.
In Chapter 4, we describe an 8-bit 14-tap SBL FIR test-chip fabricated in 0.13µm
technology. Results from device-level simulation of the SBL FIR ﬁlter and its static
CMOS counterpart with identical architecture are given and compared. Measurement
results from our SBL test-chip with both single- and two-supply schemes are presented
and discussed. This work was published in [17, 18].
Chapter 5 presents the architecture of our resonant-clock flash ADC design and
its main building blocks. It also gives the detailed designs and discusses the operation
of the resonant-clocked dynamic comparators and sense-amplifier flip-flops used in our
resonant-clock flash ADC.
Chapter 6 presents the design, evaluation, and testing of a 7GSample/s resonant-
clock ﬂash ADC test-chip. The ADC has been designed and fabricated in a 65nm bulk
CMOS process. An on-chip inductor is used to resonate the entire clock distribution
network, and a detailed analysis of the inductor from a commercial 3D full-wave
electromagnetic ﬁeld solver is given. Measurement results from our resonant-clock
ADC test-chip with ADC performance characterization are also presented. This work
was published in [19].
Chapter 7 summarizes the contributions of this thesis and presents directions for
future research in this area.
CHAPTER 2
Background
In this chapter, we survey previous work on charge-recovery. In Section 2.1, we
describe reversible computing, which inspired later charge-recovery techniques. Sec-
tion 2.2 covers early research on charge-recovery logic. Section 2.3 discusses previous
work in the area of resonant-clocked designs. Building on this early work, we explore
the techniques and challenges in the charge-recovery area that motivate
the work in this thesis.
2.1 Reversible Systems
2.1.1 Reversible Computing
Long before energy dissipation emerged as a matter of interest in VLSI design,
physicists inquired into the fundamentals of energy dissipation and the loss of infor-
mation in a computing system [20]. In the early papers, researchers largely focused
on discussing the possibility of having physical machines which consume zero energy
while computing and tried to ﬁnd a lower bound on energy consumption. One of
the conclusions drawn by Landauer is that the minimum possible amount of energy
required to change one bit of information is equal to kT ln2 [21], where k is the Boltz-
mann constant, and T is the absolute temperature of the environment. This loss of
energy becomes heat expelled into the surroundings.
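To give a sense of scale for this bound, the short sketch below evaluates kT ln2 at room temperature. The Boltzmann constant is the exact SI value; the 300 K temperature is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact value in the 2019 SI definition

def landauer_limit_joules(temperature_k):
    """Minimum energy kT ln2 associated with one bit of information."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2.0)

e_min = landauer_limit_joules(300.0)  # assumed room temperature
print(f"kT ln2 at 300 K: {e_min:.3e} J")
```

At 300 K this is on the order of 10^-21 J, many orders of magnitude below the picojoule-per-cycle energies typical of the circuits discussed in this thesis.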
Another result of early work in the area is that to achieve zero energy computation,
the operation must be reversible or be implemented in reversible logic [22]. If the
devices in a computing system are designed to change state in a way that is logically
reversible, in which no known bits are erased, then in principle, arbitrarily little
free energy needs to be used. Of course, in practice, there are other sources of
energy dissipation, such as leakage or resistive loss. However, unlike the kT ln2 energy
dissipation for changing a bit, there are no fundamental lower bounds for
these sources of dissipation.
2.1.2 Reversible Logic
Reversible logic gates are digital circuits in which the number of inputs is equal
to the number of outputs and there is a one-to-one mapping between vectors of
inputs and outputs [23]. Therefore, in these gates, the input vectors can always be
reconstructed from the output vectors. A gate with k inputs and k outputs is called a
k × k gate, and all gates in a reversible circuit must be reversible. Reversible logic
realizes balanced functions on all outputs, i.e., half of all minterms are mapped to 1
and the other half to 0. Consequently, garbage outputs are necessary to realize
non-balanced functions (e.g., AND and OR).
Consider logic gates with two inputs. A conventional XOR gate takes
two single-bit inputs A and B, and yields one single-bit output X. If A ≠ B, then
X = 1; otherwise X = 0. However, the XOR gate is not reversible, since we cannot
uniquely determine the input vector (A, B) from the output. For instance,
the output X = 1 could have come from either of the two possible input vectors,
(0, 1) and (1, 0). Although a reversible gate must be a k × k gate, not all gates with
a k-bit input and a k-bit output are reversible. For example, consider a gate with input
vector (A, B) and output vector (X, Y), where X = A XOR B and Y = A OR B, as
shown in Figure 2.1. Clearly, this is not a reversible gate, since there exists a 2-to-1
mapping from the input to the output.
A  B    X  Y
0  0    0  0
1  0    1  1
0  1    1  1
1  1    0  1
Figure 2.1: Truth table of an irreversible gate.
For a two-bit input and two-bit output gate to be reversible, the mapping of its
logic function should be such that the set of output vectors is a permutation of the
set of input vectors, (00, 01, 10, 11). It then follows that there exist 4! = 24 possible
reversible two-bit input and two-bit output logic gates. The Feynman gate is one of
the best-known reversible two-bit input and two-bit output logic gates [24]. A
Feynman gate with input vector (A, B) and output vector (X, Y) implements the
logic functions Y = B and X = A XOR B. From the input-to-output mappings shown
in Figure 2.2(a), it is evident that the Feynman gate is a reversible gate. In this
example, input B serves as a control signal. If B = 0, then the output
X simply duplicates the input A; if B = 1, then the output X = $\overline{A}$ (the inverse of
the input A). For this reason, the Feynman gate is also called the controlled NOT
gate, or the quantum XOR gate (Figure 2.2(b)), due to its popularity in the field of
quantum computing.
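The reversibility criterion used in this section (the input-to-output mapping must be a permutation of the input vectors) can be checked mechanically. The sketch below is an illustrative check with hypothetical helper names; it tests the XOR/OR gate of Figure 2.1 and the Feynman gate, and counts the permutations of the four two-bit vectors.

```python
from itertools import permutations, product

def truth_table(gate, n_inputs):
    """Map every n-bit input vector to the gate's output vector."""
    return {bits: gate(*bits) for bits in product((0, 1), repeat=n_inputs)}

def is_reversible(gate, n_inputs):
    """A k x k gate is reversible iff its outputs are a permutation of its inputs."""
    outputs = list(truth_table(gate, n_inputs).values())
    same_width = all(len(out) == n_inputs for out in outputs)
    return same_width and len(set(outputs)) == 2 ** n_inputs

def xor_or_gate(a, b):
    """Gate of Figure 2.1: X = A XOR B, Y = A OR B."""
    return (a ^ b, a | b)

def feynman_gate(a, b):
    """Feynman (controlled NOT) gate: X = A XOR B, Y = B."""
    return (a ^ b, b)

print("XOR/OR gate reversible: ", is_reversible(xor_or_gate, 2))
print("Feynman gate reversible:", is_reversible(feynman_gate, 2))

# Every permutation of the four 2-bit vectors defines one reversible 2 x 2 gate.
n_reversible_2x2 = len(list(permutations(product((0, 1), repeat=2))))
print("reversible 2 x 2 gates: ", n_reversible_2x2)
```

Applying `feynman_gate` twice returns the original inputs, so the gate is its own inverse, which is one way to see that no information is lost.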
Implementing reversible logic in integrated circuits poses many challenges.
As suggested in [25], reversible computation requires all logic operations to be
carried out once in the forward direction and once in the backward direction,
yielding additional latency and circuit overhead. More importantly, it requires a large
amount of temporary storage to maintain intermediate results until computation in
the backward direction is ready. Since storing these temporary values results in
energy and circuitry overheads, implementing fully reversible logic in CMOS
is not particularly attractive.
A  B    X  Y
0  0    0  0
1  0    1  0
0  1    1  1
1  1    0  1
Figure 2.2: (a) Truth table and (b) symbol of the Feynman gate.
A practical alternative to fully reversible logic is to use the idea of reversible
computing in engineering systems with charge-recovery techniques, and try to approach
the theoretical possibility of zero dissipation as closely as possible. If state informa-
tion of a node in a circuit is available, and utilized when switching the state of that
node, no information is lost. As a result, most of the free energy is conserved in the
circuit and recycled for later reuse, rather than being dissipated.
2.2 Charge-Recovery Logic
Charge-recovery logic is a class of circuitry that recycles energy from the output
load capacitance of logic gates to achieve ultra-low power operation. It utilizes the
idea of current-steering to conditionally charge or discharge load capacitance based
on the outputs of its preceding stage. Each charge-recovery logic gate is inherently a
gate-level pipeline stage, performing the role of both functional logic and timing
elements in conventional CMOS designs.
Early charge-recovery logic, such as Split-level Charge-Recovery Logic (SCRL)
[26] and Reversible Energy Recovery Logic (RERL) [27], implemented fully logically
reversible gates in CMOS. As pointed out in [25], the large number of temporary
storage elements in a fully reversible circuit yields large circuit overheads. Later work
in charge-recovery logic deviated from fully reversible circuits. These charge-recovery
logic families retained the gradual transfer of charge between computation nodes
but avoided the design of reversible logic gates, keeping state information available to reduce
energy dissipation.
Figure 2.3 shows the structure of NMOS and PMOS inverters in Adiabatic
Dynamic Logic (ADL). Proposed by Dickinson and Denker in 1995, ADL is a single-rail
adiabatic logic family [28]. ADL uses a diode to precharge the output node out
high when the power-clock rises, and the evaluation stack conditionally discharges
the node out as the power-clock falls. The PMOS ADL inverter works in the opposite
fashion. In an ADL system, gates are cascaded by alternating NMOS ADL
and PMOS ADL gates. Both types are driven by a two-phase power-clock
with a 180-degree phase difference to ensure that the precharge and evaluation phases of
all gates are synchronized. A chain of 64 inverters was successfully verified with an
external 250MHz power-clock in a 0.9µm process [29].
Figure 2.3: Schematic of (a) NMOS ADL inverter, and (b) PMOS ADL inverter.
Quasi-Static Energy Recovery Logic (QSERL) was proposed by Ye and Roy
[30] and has a diode structure similar to that of ADL, as shown in Figure 2.4. Instead
of precharging the output nodes with diodes, QSERL uses diodes to conditionally
hold or discharge the voltage level at the output nodes. The single-rail structure of both
ADL and QSERL exhibits a data-dependent clock load, yielding high clock jitter and
degrading system performance. Moreover, the use of diodes in these logic families may cause
a substantial potential difference and generate large current flow, reducing energy
efficiency.
Figure 2.4: Schematic of QSERL inverter.
2N-2P is another early charge-recovery logic [12], and has been discussed in
detail in Chapter 1. Unlike ADL, the dual-rail topology of 2N-2P gates provides
a data-independent clock load for charge-recovery systems. However, its requirement of a
four-phase power-clock increases design complexity and limits its applicability.
Various charge-recovery logic families have been proposed since 2N-2P. 2N-2N2P
[12] is a variation of 2N-2P, as shown in Figure 2.5. By adding an additional pair
of cross-coupled NMOS devices at the bottom, 2N-2N2P eliminates ﬂoating outputs
during the hold phase.
Figure 2.5: Schematic of 2N-2N2P inverter.
Pass-transistor Adiabatic Logic (PAL), proposed by Oklobdzija et al. [31], is shown
in Figure 2.6. PAL retains the cross-coupled PMOS structure of 2N-2P and moves
its evaluation stacks in parallel with the PMOS devices. Instead of the four-phase
power-clock in 2N-2P, PAL can operate with a two-phase power-clock. A
1,600-stage PAL shift register has been fabricated in a 1.2µm technology, and correct
operation has been verified at 10MHz.
Figure 2.6: Schematic of PAL inverter.
Positive Feedback Adiabatic Logic (PFAL), shown in Figure 2.7, was proposed
by Vetuli et al. [32]. PFAL is also a dual-rail charge-recovery logic with a
two-phase power-clock. Similar to 2N-2N2P, it eliminates floating outputs by using a pair
of cross-coupled devices, and its evaluation stacks are in parallel with the PMOS devices.
Compared to PAL, the cross-coupled NMOS and PMOS structure provides higher
energy efficiency due to less leakage current. Compared to 2N-2P and 2N-2N2P,
PFAL has the potential to achieve higher operating frequency due to the full-rail
input during the evaluation phase.
Figure 2.7: Schematic of PFAL inverter.
Figure 2.8 shows an inverter gate in True Single-Phase Energy-Recovering Logic
(TSEL) [33]. TSEL cascades use alternating NMOS and PMOS stages and operate
with a single-phase power-clock. A pair of current control switches (M3 and M4)
and reference voltages (VPREF and VNREF) are used to improve its energy efficiency.
Source-Coupled Adiabatic Logic (SCAL) [7] is derived from TSEL gates by
replacing each DC reference voltage with a current source (M7), as shown in Figure 2.9.
Each current source can be individually tuned by transistor sizing and globally
adjusted with PMOS and NMOS biasing voltages to the optimal operating condition. An
8×8 multiplier test-chip has been fabricated in a 0.5µm technology, and correct
operation has been verified at operating frequencies up to 130MHz [8].
The charge-recovery logic families discussed so far all face a common challenge:
efficient operation at high frequency. To address this challenge, Sathe et al. proposed
Boost Logic [34], which utilizes gate overdrive, reduced output swing, and
charge-recovery techniques to achieve energy-efficient operation at high operating frequency.
Figure 2.10 shows the schematic of a buffer implemented in Boost Logic. Boost Logic
is a two-phase, dual-rail, partially charge-recovering logic. The structure of a Boost
gate can be divided into two parts: logical evaluation (Logic) and charge-recovery
amplification (Boost). Logic performs functional evaluation when the power-clock is low.
When the power-clock rises, Boost amplifies the potential difference at the output nodes to a
full-rail signal. The output nodes of Boost Logic are precharged to near $\frac{1}{2}V_{DD}$, which
reduces the output swing of the gate when the power-clock rises and thus reduces the
energy dissipated in the Boost stage.
Figure 2.8: Schematic of (a) NMOS TSEL inverter, and (b) PMOS TSEL inverter.
Figure 2.9: Schematic of (a) NMOS SCAL inverter, and (b) PMOS SCAL inverter.
Another feature of Boost Logic that enables its efficient high-frequency operation
is the fact that the logic stage provides the complementary output nodes with an
initial voltage difference. This voltage difference ($\frac{1}{3}V_{DD}$) at the outputs is pre-resolved at the
onset of boost conversion, which precludes any 'fight' between the cross-coupled inverters
and results in efficient boost conversion. To demonstrate the high performance and
energy efficiency of Boost Logic, a test-chip with eight chains of AND, OR, XOR, and
INV gates was fabricated in a 0.13µm technology, and correct operation was verified at
operating frequencies exceeding 1GHz.
Figure 2.10: Schematic of Boost Logic inverter.
Looking back at the evolution of prior work, one observes that many similar traits
are shared among the various charge-recovery logic families. The first common trait is the
pair of cross-coupled PMOS devices, which are used to steer the current of the
power-clock and bring the output to a full-rail signal. Another common feature of these gates is
the fact that multiple clock phases, supplies, or clock devices are used to reduce the
short-circuit current caused by the gradual transition of the power-clock. These common traits
provide us with good guidance in designing future charge-recovery logic.
2.3 Resonant-Clocked Designs
Resonant clocking is a charge-recovery design methodology that recovers energy
from the clock distribution network. Due to their high switching activity and large
capacitance, clock distribution networks are good candidates for the
application of charge-recovery techniques. Resonant-clock datapaths are similar to
conventional-clock datapaths, and combinational logic in resonant-clock designs can
be implemented with conventional CMOS-style logic.
Most of the previous work in resonant-clock designs focuses on developing new
timing elements that reduce leakage current and improve performance in the presence of
the slow transitions of the resonant clock. Athas et al. presented the E-R latch [35],
shown in Figure 2.11. The E-R latch uses a bootstrapping technique at node bn,
and the resonant clock recycles charge both at the clock nodes and from the internal node
n. Charge recycling at node n increases the total amount of recoverable capacitance
but at the expense of reduced recovery efficiency due to the series resistance of M3.
Correct operation of the E-R latch has been verified in AC-1, a 58.5MHz
resonant-clocked microprocessor in a 0.5µm technology [35].
Figure 2.11: Schematic of Edge-Triggered (E-R) latch.
Figure 2.12 shows the PMOS energy recovering flip-flop (pTERF) and NMOS
energy recovering flip-flop (nTERF) proposed by Ziesler et al. [10]. A
cross-coupled NOR/NAND gate is added to convert the internal resonant signals to static
ones, and an inverter is introduced at output node Q to increase drive strength.
pTERF/nTERF has an issue similar to that of the E-R latch: the effective resistance of the
cross-coupled devices dissipates energy during the charge and discharge cycles,
limiting overall energy recovery efficiency. A 115MHz wavelet-transform test-chip has
been fabricated in a 0.25µm technology with pTERF and achieved 25% to 30%
energy savings compared to $CV^2$ [9].
Figure 2.12: Schematic of (a) pTERF flip-flop, and (b) nTERF flip-flop.
Ishii et al. [1] deployed resonant clocking in conjunction with sense-amplifier
flip-flops, shown in Figure 2.13. They successfully demonstrated energy-efficient
operation on an ARM926EJ-S™ microprocessor with operating frequencies up to 200MHz.
Clock-related power is reduced by 85%, and total power savings range from 20% to
35%, depending on the application profile.
Figure 2.13: Schematic of sense-amplifier flip-flop used in [1].
Hansson et al. [36] have fabricated a 1.56GHz resonant-clock network in a 0.13 µm
technology with an integrated 1.2nH inductor. A single-phase resonant clock directly
drives 896 conventional master-slave flip-flops without any intermediate buffers. The
relatively slow edge rate of the sinusoidal clock is reported to increase the power
consumption in the ﬂip-ﬂops by 34%. Despite the higher power consumption in the
ﬂip-ﬂops, there is still a 57% clock power saving, resulting in a total power reduction
of 20% compared to the conventional clock network.
Chan et al. [15] later applied the global resonant clock distribution topology to a
commercial microprocessor. Their processor has 830 on-chip spiral inductors with a
natural resonant frequency of 3.2GHz. Unlike the work in [36], which drives ﬂip-ﬂops
directly, Chan et al. inserted local clock buffers between the resonant clock mesh and
the timing elements, resulting in limited energy recovery.
Figure 2.14: Schematics of resonant-clocked latches used in RF1: (a) H-LAT, and (b) L-LAT.
Figure 2.15: Schematic of resonant-clocked latch used in RF2, B-LAT.
Other work in this area has focused on evaluating and characterizing variability on
resonant clocks. Chueh et al. [37] implemented a two-phase resonant clock network
with programmable drivers and loading to evaluate the eﬀect of imbalanced clock load
on clock skew. The 2mm×2mm distribution network with on-chip inductors fabri-
cated in 0.13µm technology performs a forced oscillation in the 900MHz to 1.2GHz
range. When running oﬀ-resonance by 10%, power dissipation increases by 3% and
clock amplitude drops by 3%. Imbalanced loading impacts power and amplitude by
less than 2%. When shifting from balanced to imbalanced loading, worst-case skew
increases by 6% of cycle time.
More recent implementations of resonant-clocked designs focus on improving per-
formance of timing elements. Sathe et al. have proposed a resonant-clock latch-based
methodology and demonstrated its high performance and energy eﬃcient operation
with two FIR ﬁlter test-chips, RF1 and RF2, in a 0.13µm technology [38, 39]. Unlike
ﬂip-ﬂops, which rely on sharp clock edges for eﬀective operation, latch performance
is primarily determined by the voltage level of the clock waveform. Moreover, latch-
based designs have the potential to achieve higher performance than ﬂip-ﬂop-based
designs, because the level-sensitive latches allow data to ripple through latch
boundaries and enable time-borrowing across logic stages. The resonant-clocked latches
used in RF1, H-LAT and L-LAT, are shown in Figure 2.14, and B-LAT, used in RF2, is
shown in Figure 2.15. RF1 and RF2 achieved 76% and 84% energy efficiency when
operating at their natural resonant frequencies of 1.03GHz and 1.01GHz, respectively
[14].
Figure 2.16: Schematic of DESL inverter.
Kao et al. introduced Dynamic Evaluation Static Latch (DESL) logic [16], a dy-
namic gate with a level-sensitive latch at the output stage, as shown in Figure 2.16.
DESL mitigates performance degradation due to the resonant clock waveforms by
relying on the voltage level of the clock. Moreover, a static latch provides a mod-
est performance boost by relying on time borrowing and reduces dynamic power by
reducing switching activity on large capacitance nets.
An 8-cycle reduced-latency fused-multiply-add single-precision FPU test-chip has
been fabricated in a 90nm technology to demonstrate energy eﬃciency and per-
formance of DESL. The resonant-clocked FPU operates with clock frequencies up
to 2.07GHz, yielding 66.4% lower clock power and 31.5% lower total power over a
conventionally-clocked version of the same architecture.
CHAPTER 3
Subthreshold Boost Logic
In this chapter, we present Subthreshold Boost Logic (SBL), a new circuit family
that relies on charge-recovery design techniques to achieve order-of-magnitude im-
provements in operating frequencies while still achieving high energy eﬃciency using
subthreshold DC supply levels.
The remainder of this chapter is organized as follows: Section 3.1 describes previ-
ous work on subthreshold designs. Section 3.2 presents the structure of SBL and the
two-phase power-clock generator used for SBL. Section 3.3 describes the two-phase
operation of SBL gates and the overdrive property of cascaded SBL gates. Section 3.4
analyzes the energy consumption of SBL gates. Section 3.5 summarizes the chapter.
3.1 Introduction
Early subthreshold circuit designs appeared in electronic watches in the 1960s and
1970s, driven by form factor limitations on battery size [40]. The recent emergence of
untethered applications and energy scavenging devices has led to renewed interest in
this ﬁeld. A 1024-point FFT processor explored aggressive subthreshold designs for
minimum energy operation, achieving a clock speed of 10KHz with a 350mV supply
in a 0.18µm process with Vth = 450mV [40]. The Subliminal subthreshold processor
achieved 833KHz with a 360mV supply using a 0.13µm process with Vth = 400mV
[41]. The Phoenix processor deployed leakage reduction techniques to achieve pW-
level power consumption, targeting multi-year operation in sensor applications [5, 42].
Fabricated in a dual-threshold 0.18 µm process with Vth1 = 400mV and Vth2 = 700mV,
it achieved 2.8pJ/cycle at 106KHz with a 385mV supply.
A common issue underlying all subthreshold circuit designs is that the significant
energy advantages achieved through deep voltage scaling come at the cost of
subthreshold drive currents, which typically limit clock frequencies to sub-MHz
levels. Recent subthreshold
designs have deployed circuit and architecture techniques to improve circuit robust-
ness by improving gate overdrive. The 32-bit RISC core in [43] and the 8×8 FIR
ﬁlter in [44] both deployed body biasing techniques to enable increased operating fre-
quency, achieving 375KHz at 230mV, and 12KHz at 200mV respectively. A high-speed
variation-tolerant interconnect technique relied on capacitive boosting to elevate the
critical gate supply voltage and was demonstrated through a 6MHz clock distribution
network with 400mV voltage supply [45]. A super-pipelining approach was demon-
strated in [46], where the multipliers in a 1024-point FFT were heavily pipelined to
reduce stage delay, achieving 30MHz with a supply of 270mV.
Subthreshold Boost Logic, introduced in this chapter, is a circuit family capable
of operating at multi-MHz clock frequencies using subthreshold supplies. Unlike sub-
threshold circuitry, in which computations are performed using subthreshold currents
and clock frequencies are typically limited to sub-MHz levels, SBL gates are over-
driven to operate in the linear region, achieving order-of-magnitude improvements in
operating speed over subthreshold logic. Energy eﬃcient operation is ensured through
the use of aggressively-scaled DC supplies at sub-threshold levels and by deploying
charge recovery design techniques to boost these subthreshold supply levels by 3X to
4X.
Figure 3.1: Schematic of an SBL gate.
3.2 SBL Overview and Blip Clock Generator
The structure of an SBL gate is shown in Figure 3.1. Each SBL gate consists of
two stages: Logic and Boost. The Logic stage has diﬀerential outputs out and out_b.
Each output is driven by a pull-up network (PUN) and a pull-down network (PDN),
similar to static CMOS logic, except that an NMOS PUN is used instead of a PMOS
one for increased gate overdrive ability. The Boost stage comprises a pair of cross-
coupled inverters connected to ground (VSS) and a charge-recovery power-waveform
PC. From a functional standpoint, each SBL gate consists of a combinational logic
block driving a transparent latch. Cascades of SBL gates are formed by clocking the
gates on alternating power-clock phases ϕ and ϕ¯.
The power-clock waveforms required by SBL can be generated using a clock generator
circuit similar to the "blip" circuit in [47], as shown in Figure 3.2. This circuit
is formed by connecting two RLC oscillators back-to-back, using the output waveform ϕ
of one oscillator to drive the other, and vice versa. The two waveforms ϕ and ϕ¯ are
partially overlapping, since the NMOS devices are not fully on until their output
voltages exceed the threshold voltage Vth. One of the advantages of this clock
generator is that it can provide a clock amplitude larger than its DC supply VDC.
As a result, it can share the same supply with the SBL gates, allowing operation from
a single DC supply. In our FIR test-chip, we used a clock generator that is a
distributed injection-locked version of this circuit and is described in Section 4.2.
Figure 3.2: Simple "blip" clock generator: (a) Schematic. (b) Waveform.
SBL improves upon Boost Logic [34], its closest charge-recovery logic relative, in
a number of significant ways. Specifically, SBL can operate with a single DC supply,
whereas Boost Logic requires three DC supply levels. Boost Logic uses two DC supply
rails, VDD' and VSS', and develops a potential difference of nearly VDD/3 at its
output nodes. In contrast, SBL shifts this voltage difference to VCC and VSS and
requires only a single supply VCC in each gate. The third supply for Boost Logic is
the supply of the clock generator, VDC. In SBL, because the blip generator can
provide a clock amplitude larger than its DC supply, this supply can be shared with
VCC to achieve single-supply operation. Still, the energy efficiency of SBL improves
when using different DC supply levels for logic and clock generation, as
demonstrated by the experimental results in Section 4.5. Moreover, the Logic stage
in SBL is connected to ground, resulting in greater gate overdrive and thus higher
performance than Boost Logic.
3.3 SBL Operation
Figure 3.3: SBL operation: (a) Evaluation Phase. (b) Boost Phase.
Each SBL gate operates in two phases, Evaluation and Boost, which are active
during mutually exclusive intervals. During Evaluation, as shown in Figure 3.3(a),
the Boost stage is oﬀ, and there is no signiﬁcant current ﬂowing through any of the
devices in the Boost stage, since the power-clock remains close to 0V. As ϕ transitions
low, the drive strength of the Logic stage gradually weakens, since its inputs gradually
ramp down. When its inputs reach the subthreshold supply level VCC , the Logic stage
is eﬀectively oﬀ.
As the power-clock ϕ rises, as shown in Figure 3.3(b), the gate transitions into the
Boost phase of its operation. During this phase, the Boost stage acts as an ampliﬁer
of the subthreshold voltage Vout−Vout_b. The voltage Vout tracks ϕ, reaching approx-
imately 1V as ϕ rises. As ϕ falls, the charge at the output node out is recovered
by the power-clock ϕ, and the output voltage is brought back to approximately Vth
levels. When ϕ falls below Vth, all transistors in the Boost stage are in cut-oﬀ, and
the next logic evaluation phase begins. Throughout the Boost phase, the node out_b
stays essentially at 0V.
Figure 3.4: Cascade of SBL gates.
The graphs in Figure 3.4 show two cascaded SBL inverters, SBL1 and SBL2,
and the waveforms at the output nodes out2 and out2_b with respect to the two
power-clock waveforms ϕ and ϕ¯. During Evaluation of the first SBL gate, ϕ remains
effectively low, whereas ϕ¯ transitions from low to high and then back to low. With
their inputs boosted by the preceding SBL gate to be much higher than the supply
voltage VCC, PUN1 and PDN1 charge out1 to VCC and discharge out1_b to VSS in
super-linear mode. Notice that even though PUNs are implemented in NMOS, the output
node does not suffer a Vth drop when charged to VCC, since the PUN inputs are boosted
to be significantly higher than VCC.
Due to the signiﬁcant gate overdrive at the Logic stage, SBL can reach higher
operating speeds than static CMOS operating with the same subthreshold supply.
For example, when the Logic stage is evaluating, SBL can be designed so that the
inputs to the Logic stage exceed 0.9V even with VCC = 0.3V. Compared to static
CMOS with a 0.3V supply level, the Logic stage has 3X the gate overdrive, allowing
SBL implementations to operate at higher clock frequencies and drive larger output
loads.
3.4 SBL Energetics
The energy consumed during each cycle in the operation of an SBL gate is given
by the equation:
\[ E_{SBL} = E_{Logic} + E_{Boost} + E_{Crowbar}, \tag{3.1} \]
where ELogic and EBoost denote the energy consumed in the two stages of SBL, and
ECrowbar denotes the energy consumed by short-circuit currents during SBL operation.
The energy consumption of the Logic stage is given by the equation:
\[ E_{Logic} = \frac{1}{2} C_L V_{CC}^2, \tag{3.2} \]
where CL denotes the total switching capacitance at the SBL output. Compared to
conventional switching, this energy consumption is signiﬁcantly decreased due to the
aggressively-scaled subthreshold supply level VCC .
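As a rough numerical illustration of Equation (3.2), the following Python sketch evaluates the Logic-stage energy at a subthreshold supply and, for comparison, the same expression at a nominal supply. The 50fF load and the function name are assumptions chosen for illustration only; the 0.36V and 1.2V supply values are the ones quoted elsewhere in this dissertation.

```python
def logic_energy(c_load, v_cc):
    """Logic-stage energy per cycle from Equation (3.2): 0.5 * C_L * V_CC^2 (joules)."""
    return 0.5 * c_load * v_cc ** 2


# Hypothetical 50 fF output load (an assumption, not a measured value).
C_LOAD = 50e-15

e_sub = logic_energy(C_LOAD, 0.36)   # subthreshold supply
e_nom = logic_energy(C_LOAD, 1.2)    # nominal supply, same expression
ratio = e_sub / e_nom                # depends only on (0.36 / 1.2)^2

print(f"E_Logic at 0.36 V: {e_sub * 1e15:.2f} fJ")
print(f"E_Logic at 1.2 V:  {e_nom * 1e15:.2f} fJ")
print(f"ratio: {ratio:.3f}")
```

Because the expression is quadratic in the supply, the ratio is independent of the assumed load.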
Figure 3.5: Clock waveform modeling: (a) Sine clock with equal peak-to-peak swing. (b) Sine clock with 1.5X peak-to-peak swing.
To derive an expression for EBoost, we model the Boost stage as a simple RC series
system with a blip voltage source that is modeled by two regions, sinusoidal and
linear, as shown in Figure 3.5. Simulations suggest that in the sinusoidal region, a
sinusoidal waveform with 1.5 times the peak-to-peak amplitude Va of the clock wave-
form ϕ provides a good approximation. Moreover, in the linear region, they indicate
that the clock waveform rises almost linearly to approximately 0.1V, independent of
clock frequency and amplitude. Accordingly, the clock waveforms in the two regions
can be approximated as follows:
\[ \phi_{Sine} = 0.25 V_a (1 + 3\sin\omega t), \tag{3.3} \]
\[ \phi_{Linear} = \frac{0.1 \cdot t}{T - |t_{P1} - t_{P2}|}, \tag{3.4} \]
where ω = 2π/T, T is the period of the clock waveform ϕ, and tP1 and tP2 are the
endpoints of the two regions, as shown in Figure 3.5.
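Solving Equation (3.3) for 0.1V and 0V yields the following equations for the
endpoints tP1 and tP2, respectively, of the two regions:
\[ t_{P1} = \sin^{-1}\left[ \frac{1}{3} \cdot \left( \frac{0.1}{0.25 V_a} - 1 \right) \right] \cdot \frac{1}{\omega}, \tag{3.5} \]
\[ t_{P2} = \sin^{-1}\left[ -\frac{1}{3} \right] \cdot \frac{1}{\omega}. \tag{3.6} \]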
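The endpoint expressions can be checked numerically. The Python sketch below evaluates Equations (3.5) and (3.6) and substitutes the results back into the sinusoidal model of Equation (3.3). The function names, the 1.0V amplitude, and the 187MHz period are assumptions for illustration. Note that `math.asin` returns the principal value; the text does not state which branch of the inverse sine is intended, so the absolute times printed here should be read with that caveat.

```python
import math


def sine_region_endpoints(v_a, period, v_knee=0.1):
    """Return (t_P1, t_P2) in seconds from Equations (3.5) and (3.6).

    v_a    : clock amplitude V_a in volts
    period : clock period T in seconds
    v_knee : voltage at which the linear region ends (0.1 V in the text)
    """
    omega = 2.0 * math.pi / period
    # Equation (3.5): solve 0.25*V_a*(1 + 3*sin(omega*t)) = v_knee for t.
    t_p1 = math.asin((v_knee / (0.25 * v_a) - 1.0) / 3.0) / omega
    # Equation (3.6): solve 0.25*V_a*(1 + 3*sin(omega*t)) = 0 for t.
    t_p2 = math.asin(-1.0 / 3.0) / omega
    return t_p1, t_p2


def phi_sine(v_a, period, t):
    """Sinusoidal clock model of Equation (3.3)."""
    return 0.25 * v_a * (1.0 + 3.0 * math.sin(2.0 * math.pi * t / period))


# Assumed operating point: V_a = 1.0 V, 187 MHz clock.
T = 1.0 / 187e6
t_p1, t_p2 = sine_region_endpoints(1.0, T)
print(f"t_P1 = {t_p1 * 1e9:.3f} ns, t_P2 = {t_p2 * 1e9:.3f} ns")
print(f"phi(t_P1) = {phi_sine(1.0, T, t_p1):.3f} V")
```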
The energy ESine consumed in the Boost stage of an SBL gate during operation
in the sinusoidal region is given by integrating I²R over time from tP2 to tP1, where
I is the AC component of the current resulting when ϕSine drives the reactive load
1/jωCB, and R and CB are the effective resistance and effective capacitance,
respectively, when looking into the node PC of an SBL gate. (We assume that
R << |1/jωCB|, as confirmed by our test-chip.) From Equation (3.3), we have:
\[ I = \left| \frac{V}{1/j\omega C_B} \right| = \left| \frac{0.25 V_a \cdot 3\sin\omega t}{1/j\omega C_B} \right|, \tag{3.7} \]
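and, therefore,
\[
\begin{aligned}
E_{Sine} &= \int_{t_{P2}}^{t_{P1}} \left| \frac{0.75 V_a \sin\omega t}{1/j\omega C_B} \right|^2 R \, dt \\
&= \frac{9 V_a^2 \omega^2 C_B^2 R}{32} \left( t - \frac{1}{2\omega}\sin 2\omega t \right) \Big|_{t=t_{P2}}^{t=t_{P1}} \\
&= \frac{9 V_a^2 \pi^2 C_B^2 R}{8T^2} \left( t - \frac{1}{2\omega}\sin 2\omega t \right) \Big|_{t=t_{P2}}^{t=t_{P1}} \\
&= \frac{9 K V_a^2 \pi^2 C_B^2 R}{8T}.
\end{aligned}
\tag{3.8}
\]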
Equation (3.8) has been simplified by including a coefficient K, 0.5 < K < 0.6, which
depends on the clock amplitude. Replacing the clock amplitude Va by the effective
voltage swing in the Boost stage, Va − VCC, we obtain
\[ E_{Sine} = \frac{9K (V_a - V_{CC})^2 \pi^2 C_B^2 R}{16T}. \tag{3.9} \]
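The energy ELinear consumed in the Boost stage of an SBL gate during the linear
region of the clock waveform is given by integrating I²R over time, where I is derived
from Equation (3.4):
\[
\begin{aligned}
E_{Linear} &= \int_{0}^{T - |t_{P1} - t_{P2}|} I^2 R \, dt \\
&= \int_{0}^{T - |t_{P1} - t_{P2}|} \left| C_B \frac{dV}{dt} \right|^2 R \, dt \\
&= \int_{0}^{T - |t_{P1} - t_{P2}|} \left| C_B \frac{d}{dt}\left( \frac{0.1 \cdot t}{T - |t_{P1} - t_{P2}|} \right) \right|^2 R \, dt \\
&= \frac{0.01 \cdot C_B^2 R}{T - |t_{P1} - t_{P2}|}.
\end{aligned}
\tag{3.10}
\]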
From Equations (3.9) and (3.10), it follows that ELinear/ESine < 1%, and therefore
the total energy consumption in the Boost stage can be approximated by ESine. From
Equations (3.1), (3.2), and (3.9), it follows that the total energy consumption of an
SBL gate during a cycle is given by:
\[ E_{SBL} = \frac{1}{2} C_L V_{CC}^2 + \frac{9K (V_a - V_{CC})^2 \pi^2 C_L^2 R}{16T} + E_{Crowbar}. \tag{3.11} \]
Note that CB is replaced by CL, since the Logic stage and the Boost stage are driving
the same output loads. Based on Spice simulation results of the SBL test-chip presented
in Section 4.4, the effective resistance and capacitance seen from each clock phase are
about 0.6 Ω and 57pF, respectively, confirming that R << |1/jωCB|.
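The following Python sketch plugs the quoted 0.6 Ω and 57pF values into the reactance comparison and into Equation (3.9). It is an illustration under stated assumptions, not a reproduction of the simulation: the 187MHz operating point, the 1.0V clock amplitude, the 0.36V logic supply, and K = 0.55 are assumed values, and the function names are hypothetical.

```python
import math


def capacitive_reactance(freq_hz, cap_f):
    """Magnitude of 1/(j*omega*C) in ohms."""
    return 1.0 / (2.0 * math.pi * freq_hz * cap_f)


def boost_energy_sine(v_a, v_cc, cap_f, res_ohm, freq_hz, k=0.55):
    """E_Sine from Equation (3.9), in joules per cycle.

    k is the amplitude-dependent coefficient K (0.5 < K < 0.6 in the text);
    0.55 is an assumed mid-range value.
    """
    period = 1.0 / freq_hz
    return (9.0 * k * (v_a - v_cc) ** 2 * math.pi ** 2
            * cap_f ** 2 * res_ohm) / (16.0 * period)


R_EFF = 0.6        # ohms, from the Spice results quoted above
C_EFF = 57e-12     # farads, from the Spice results quoted above
F_CLK = 187e6      # Hz, assumed operating point
V_A = 1.0          # volts, assumed clock amplitude
V_CC = 0.36        # volts, assumed logic supply

x_c = capacitive_reactance(F_CLK, C_EFF)
e_sine = boost_energy_sine(V_A, V_CC, C_EFF, R_EFF, F_CLK)
print(f"|1/(j*omega*C)| = {x_c:.2f} ohm, about {x_c / R_EFF:.0f}x R")
print(f"E_Sine estimate = {e_sine * 1e12:.3f} pJ per cycle")
```

At lower clock frequencies the reactance grows, so the assumption R << |1/jωC| holds with an even larger margin.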
The crowbar energy ECrowbar in Equation (3.11) has three components: EVCC−VSS,
EVCC−PC, and EPC−VSS. The energy EVCC−VSS is associated with the Logic stage.
Specifically, due to the relatively slow rise time of the input waveform,
short-circuit current flows from VCC to VSS during the evaluation phase. This
component dominates ECrowbar. At very low operating frequencies, it also dominates
the total energy consumption ETotal, as discussed with the simulation results
presented in Section 4.4. The energy EVCC−PC
is consumed during the Evaluation phase. As VCC charges one of the output nodes,
current ﬂows from VCC to the PC pin through the PMOS device in the Boost stage.
Since VCC is always at a subthreshold voltage level, this component is relatively small
compared to EVCC−VSS . The energy EPC−VSS is consumed during the Boost phase. As
ϕ rises, although the Logic stage is turned oﬀ, current still ﬂows from the PC pin to
VSS through the evaluation NMOS. Similar to EVCC−PC , this component is signiﬁcant
only at very low operating frequencies.
Equation (3.11) provides guidance for device sizing and illustrates some of the
energy trade-offs between ELogic, EBoost, and ECrowbar. For example, in the Boost
stage, up-sizing the PMOS devices reduces the effective resistance R, but increases
the effective capacitance CL. In the Logic stage, up-sizing the evaluation pull-up
and pull-down networks yields a greater potential difference at the output nodes by
the end of the evaluation period, resulting in higher energy efficiency during the
Boost stage. At low operating frequencies, however, such up-sized networks result in
increased EVCC−VSS.
3.5 Summary
SBL improves upon its closest charge-recovery relative, Boost Logic, in several
significant ways. Specifically, SBL can operate with a single DC supply, whereas Boost
Logic requires three DC supply levels. Moreover, SBL achieves greater gate overdrive
and higher performance than Boost Logic through its supply configuration.
Compared to subthreshold logic, SBL accomplishes signiﬁcant performance im-
provements through device overdriving. The NMOS-only PUN and PDN in the Logic
stage are driven with inputs of approximately 1V, allowing SBL to operate at clock
frequencies in the hundreds of MHz or, alternatively, to realize functions of signiﬁcant
complexity within a single clock cycle. In addition to enhanced performance, gate
overdriving leads to improved variation tolerance. All transistors in the Logic stage
conduct in super-threshold linear mode, and delay does not vary signiﬁcantly with
variations in the subthreshold supply VCC or Vth.
CHAPTER 4
187MHz Charge-Recovery FIR Filter with
Subthreshold Boost Logic
To demonstrate the fast and energy-eﬃcient operation of SBL, we used it in the
implementation of a transpose FIR filter. The relatively state-intensive nature of
the transpose-type FIR filter, coupled with the relatively simple computation that
is performed between state elements, makes it a natural fit for SBL. The latency of
the SBL FIR is only 2 cycles longer than that of a similar-performance static CMOS
design. This SBL FIR is the ﬁrst design that demonstrates a near 200MHz operating
frequency with a single 0.36V supply level.
The remainder of this chapter is organized as follows: Section 4.1 presents the
architecture and SBL implementation of the FIR test-chip. Section 4.2 presents the
two-phase clock generator and the clock distribution network used in the SBL FIR.
Section 4.3 describes the semi-custom design methodology used for the SBL circuitry in
the FIR. Section 4.4 presents results from device-level simulations of the SBL FIR
and its static CMOS counterpart with the same architecture. In Section 4.5, we
present measurement results from our SBL FIR test-chip. Conclusions are given in
Section 4.6.
4.1 FIR Filter Architecture
Figure 4.1: Block diagram of SBL FIR filter and BIST circuits.
We used SBL to design an 8-bit 14-tap FIR ﬁlter. A block diagram of the FIR
chip is given in Figure 4.1. A static CMOS built-in self-test (BIST) circuit is used to
generate and process the FIR input and output. The pseudo-random input sequence
generated by BIST is broadcast to 14 modiﬁed 8×8 Booth multipliers. The prod-
ucts of these inputs with the 14 FIR coeﬃcients are accumulated through 14 4-to-2
compressors. The ﬁnal result is obtained from a hybrid adder, and then sent to a
signature analyzer, generating a signature vector. 8×8 multipliers take 1.5 cycles to
generate sum and carry vector pairs and each tap takes 1 cycle to merge the sum
and carry vector pairs from the previous tap and from its 8×8 multiplier. The vector
pairs are then merged in a 20-bit hybrid carry-look-ahead/carry-select adder with
2 cycles of latency. The longest path through the SBL-based FIR has a latency of
19 cycles, including the 0.5-cycle latency of the broadcast buffers. Compared to a
static CMOS design with the same architecture, the latency overhead of the SBL FIR
is 2 cycles: 1 cycle in the 8×8 multiplier, and 1 cycle in the 20-bit adder.
To enable SBL to communicate with the static CMOS BIST logic, two interface
blocks are inserted before and after the FIR. On the FIR input side, broadcast buffers
implemented in SBL are used to convert static CMOS signals from the BIST circuitry
into dual-rail sinusoidal-like signals for the SBL datapath. On the output side of the
FIR, sense-ampliﬁer ﬂip-ﬂops that operate oﬀ the same clock as SBL gates latch the
SBL signals from the FIR and make them available to the static CMOS signature
analyzer.
Figure 4.2: Schematic and layout of a 4-2 compressor.
Gate overdrive at the Logic stage of SBL gates allows the implementation of
functions with significant complexity within a single clock cycle. Figure 4.2 shows
schematics and layout of the SBL-based 4-to-2 compressor used in the FIR, designed
in a 0.13µm bulk CMOS process. Due to the dual-rail nature of the SBL gates, the SBL
4-to-2 compressor has 2.1X area overhead compared to a standard-cell implementation.
Each SBL gate has a maximum transistor stack height of six and, as discussed
in more detail in the experimental results presented in Section 4.5, can operate at
187MHz with VCC = 0.36V.
4.2 Power-Clock Generator and Clock Network Design
Figure 4.3: Distributed "blip" clock generator and measured clock waveform.
The SBL FIR uses two power-clock waveforms ϕ and ϕ¯ that are generated by the
clock circuit shown in Figure 4.3. The two-phase power-clocks are first distributed
through an H-tree structure made with M5-M8 and then connect to the SBL gates
with a local clock mesh made with M2-M4. This local clock mesh has a rectangular
distribution area of 792.8µm×346.8µm with 24 pairs of 0.6µm M2, 20 pairs of 0.6µm
M3, and 65 pairs of 1.2µm M4 strips. In this clock generator, the basic "blip"
generator circuit, introduced in Section 3.2, has been augmented to include a pair
of weak drivers at the root of the tree that allow for the power-clock waveforms to
be injection-locked to a target clock frequency. These drivers are pulsed by reference
signals A and B that are generated by an on-chip pulse generator. In our test-chip,
the drivers are 150µm wide and can tune the operating frequency by as much as ±3%
oﬀ resonance. The tuning range can be increased by sizing up the injection-locked
devices. To maintain the oscillation, fourteen pairs of cross-coupled NMOS switches
with 2400µm total active width are distributed throughout a hierarchical two-phase
distribution network, similar to [14]. Two oﬀ-chip inductors are used to resonate
the parasitic capacitance of the clock distribution network and the SBL gates. In
our test-chip, the load on each phase of the power-clock is approximately 57pF, as
derived from layout extraction.
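For a first-order feel for the inductor value implied by this load, the Python sketch below treats each power-clock phase as an ideal LC tank. This is a simplifying assumption: the blip generator is not an ideal tank, the inductor values used on the test-chip are not stated in this section, and the 187MHz target and function names are chosen only for illustration.

```python
import math


def resonant_inductance(freq_hz, cap_f):
    """Inductance (H) that resonates cap_f at freq_hz for an ideal LC tank."""
    return 1.0 / ((2.0 * math.pi * freq_hz) ** 2 * cap_f)


def resonant_frequency(ind_h, cap_f):
    """Natural frequency (Hz) of an ideal LC tank: 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(ind_h * cap_f))


C_PHASE = 57e-12   # farads per power-clock phase, from layout extraction
F_TARGET = 187e6   # Hz, assumed target frequency

l_needed = resonant_inductance(F_TARGET, C_PHASE)
print(f"L for {F_TARGET / 1e6:.0f} MHz: {l_needed * 1e9:.1f} nH")
```

Under this idealization, lowering the target frequency by a factor of two requires four times the inductance for the same load.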
The clock circuitry is powered by a DC supply VDC that can be controlled in-
dependently of the supply VCC for the SBL gates. The level of VDC determines the
amount of energy re-introduced into the clock network each cycle, thus aﬀecting the
amplitude of the power-clock waveforms and controlling the overdrive level at the
Logic stages. Although not required for correct operation, the independent control of
VDC and VCC allows for increased energy eﬃciency. Speciﬁcally, by decreasing VCC to
limit crowbar current through the Logic stage while keeping VDC suﬃciently high to
ensure the requisite overdrive, energy eﬃciency can be improved without sacriﬁcing
performance. As shown in Section 4.5, the FIR achieves energy-eﬃcient operation
with VDC = VCC , but its energy consumption per cycle decreases further by 17.1%
when VDC and VCC are set to diﬀerent subthreshold values.
4.3 SBL FIR Filter Design Methodology
The SBL FIR ﬁlter was designed using a semi-custom design ﬂow that enables
the use of industrial static timing tools for timing analysis, as well as the use of HDL
languages for system-level functionality veriﬁcation. The standard cell library used
for the SBL ﬁlter consists of 65 diﬀerent cells, most of which are simpliﬁed versions
of a 4-to-2 compressor or a 3-to-2 compressor that deal with special cases.
The basic design ﬂow is shown in Figure 4.4. Initially, a Spice-level characteriza-
tion is performed on every standard cell. For each cell, pin capacitance is extracted,
and a delay-to-capacitance matrix is generated by sweeping a range of output loads
from 25fF to 75fF. Every SBL gate is modeled as a timing element, and the D − Q
delay is defined as the time that the gate needs to develop a Vth-level voltage
difference at its output nodes. An HDL model for each cell is also established in
Verilog and is used for system-level functional verification once the system-level
design is completed.
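A delay-to-capacitance table of this kind is typically consumed by looking up the delay for a given extracted load. The Python sketch below shows one plausible way to do so with linear interpolation between characterized points. The table values and the function name are made up for illustration; only the 25fF to 75fF sweep range comes from the text, and the actual timing tool may use a different interpolation scheme.

```python
from bisect import bisect_right


def interpolate_delay(loads_ff, delays_ps, load_ff):
    """Linearly interpolate a cell delay (ps) for a given output load (fF).

    loads_ff must be sorted ascending; queries outside the characterized
    range are rejected rather than extrapolated.
    """
    if not loads_ff[0] <= load_ff <= loads_ff[-1]:
        raise ValueError("load outside characterized range")
    hi = min(bisect_right(loads_ff, load_ff), len(loads_ff) - 1)
    lo = hi - 1
    frac = (load_ff - loads_ff[lo]) / (loads_ff[hi] - loads_ff[lo])
    return delays_ps[lo] + frac * (delays_ps[hi] - delays_ps[lo])


# Hypothetical characterization data for one cell (values are made up).
LOADS_FF = [25.0, 50.0, 75.0]
DELAYS_PS = [400.0, 520.0, 640.0]

print(interpolate_delay(LOADS_FF, DELAYS_PS, 60.0))
```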
The next step in the design ﬂow is the manual pipelining, placement, and rout-
ing of the cells. During this step, the design is optimized for minimum area and
energy consumption. The power-clock pins of all cells are aligned, allowing for
significant reduction in the resistance of the clock distribution network and,
therefore, improvement in the efficiency of energy recovery.
After completion of layout, wire capacitance is extracted for all internal nets.
Based on the extracted parasitics and the previously built timing models of the cells,
a commercial static timing tool is used for post-layout timing analysis. From timing
analysis results, we ﬁx all timing violations by replacing the original cells with the
ones that have larger driving ability and then repeat this place-and-route and timing
check process to ensure timing closure. This process can also be used to separate
out the critical paths in a design and improve the maximum operating frequency of
a system.
Figure 4.4: SBL design flow.
4.4 SBL FIR Filter Spice-Level Analysis
In this Section, we present results from Spice-level simulations of our SBL test-
chip. For comparison purposes, we also present Spice-level simulation results of a
conventional static CMOS version of the FIR, which was obtained by performing
automatic synthesis, placement, and routing of the same FIR architecture that we
used to derive the SBL FIR test-chip. Measurement results from our SBL FIR test-
chip, along with a comparison of simulation and measurement results are given in
Section 4.5.
4.4.1 Spice Simulation Results
Figure 4.5 gives a graph of energy consumption per cycle versus operating
frequency for our SBL FIR design. This graph was obtained using Synopsys HSIM with
the BSIM model on a netlist of our SBL FIR that was obtained from layout
extraction. All data points were obtained with the minimum supply setting VCC = VDC
that yielded correct operation at the corresponding operating frequency. The
simulation result shows that the SBL FIR achieves a near 200MHz operating frequency
with a single subthreshold level supply of 0.36V.
Figure 4.5: Simulated energy consumption of SBL FIR filter.
Notice that energy consumption is dominated by the component related to the
power-clock generator, which corresponds to the power supply VDC . Moreover, notice
that at frequencies below 20MHz, the energy consumption of the Logic stage, which
corresponds to the power supply VCC , starts rising at an increasing rate, due to the
increasing crowbar current from VCC to VSS caused by the slowly transitioning inputs
of the Logic stage. Consequently, total energy consumption for the SBL FIR starts
increasing at operating frequencies below 17MHz.
Figure 4.6: Histogram of simulated power-clock insertion delays at a resonant frequency of 53.7MHz (minimum 13.7ps, maximum 53.3ps, average 33.4ps).
Clock skew is introduced due to load variation across the chip. Figure 4.6 shows
power-clock insertion delay data obtained from Spice-level simulations of the entire
chip with extracted resistance, capacitance, and coupling capacitance. At the
resonant frequency of 53.7MHz, the maximum and minimum insertion delays of the
power-clock are 53.3ps and 13.7ps, respectively. The difference between these
two numbers gives a maximum possible skew of 39.6ps.
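The skew bound is a simple difference of the extreme insertion delays. The short Python sketch below reproduces that arithmetic and also expresses the bound as a fraction of the clock period at 53.7MHz; the function and variable names are illustrative.

```python
def worst_case_skew_ps(insertion_delays_ps):
    """Upper bound on skew: largest minus smallest insertion delay."""
    return max(insertion_delays_ps) - min(insertion_delays_ps)


# Extreme insertion delays reported for the 53.7 MHz simulation.
delays_ps = [13.7, 53.3]
skew_ps = worst_case_skew_ps(delays_ps)

period_ps = 1e12 / 53.7e6           # clock period in picoseconds
skew_fraction = skew_ps / period_ps

print(f"skew bound: {skew_ps:.1f} ps ({skew_fraction * 100:.2f}% of the period)")
```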
Figure 4.7: (a) Layout of conventional CMOS FIR. (b) Simulated operating frequency and energy per cycle vs. supply voltage for conventional CMOS FIR filter.
4.4.2 Performance Comparison with CMOS FIR Filter
To compare our SBL FIR with conventional CMOS design, we synthesized a
standard-cell version of the same 19-cycle FIR architecture that we used to derive
the SBL design in the same 0.13µm technology. Synthesis was performed by Syn-
opsys Design Compiler, yielding a conventional FIR with the same latency as the
SBL FIR. Placement and routing were performed in a fully automatic manner using
Cadence SoC Encounter with 80% area utilization and a synthesized clock tree. The
layout of the resulting design is shown in Figure 4.7(a). With a 0.35mm × 0.7mm
footprint, the synthesized FIR occupies approximately 12.5% less area than its SBL
counterpart.
Figure 4.7(b) gives Spice-derived graphs for the operating frequency and the per-
cycle energy consumption of the static CMOS FIR as a function of the supply volt-
age. With 83% of its cells sized at X1 or X2 drive strength, this FIR achieves a clock
frequency close to 800MHz with a nominal 1.2V supply. As expected, energy con-
sumption per cycle varies quadratically with supply voltage. Furthermore, operating
frequency deteriorates exponentially fast, as supply voltage drops below 0.6V, barely
exceeding 250KHz when the supply is set at 0.3V.
[Figure 4.8 plot: energy per cycle vs. operating frequency for the conventional FIR
filter (Spice) and the SBL FIR filter (Spice); data points are annotated with supply
voltages (e.g., VDC = VCC = 0.36V, VDD = 0.65V); marked energy reductions: 43.7%,
52.9%, and 41.1%]
Figure 4.8: Simulated energy consumption of conventional and SBL FIR filters.
For both the SBL and the conventional FIR, simulated per-cycle energy consump-
tion versus operating frequency is given in Figure 4.8. For simplicity, the SBL FIR
in this graph uses a single DC supply level, connecting to both VDC and VCC . In the
frequency range from 17MHz to 187MHz, the SBL FIR achieves 40% to 50% lower en-
ergy consumption than its conventional counterpart. The SBL design yields minimum
energy consumption at 17MHz, achieving 43.7% reduction over its conventional coun-
terpart. The maximum relative energy reduction of 52.9% is achieved at 44MHz. At
187MHz, the maximum clock frequency at which the SBL design functions correctly,
relative energy savings over the conventional FIR are 41.1%.
4.5 SBL FIR Filter Test-Chip Measurement
This section gives measurement results from the experimental evaluation of the
SBL FIR test-chip, validating its energy-eﬃcient operation with subthreshold supplies
at clock frequencies up to 187MHz. It also presents a comparison of measurement
and simulation results, showing good agreement between the two, with relative dis-
crepancy between measurements and simulations staying within 12% for operating
frequencies ranging from 20MHz to 187MHz.
Parameter                                | Value
Technology                               | 0.13µm 8M CMOS (RVT)
Threshold Voltage                        | NMOS: 0.40V; PMOS: -0.42V
Taps, In / Coeff Bits / Out              | 14, 8 / 8 / 20
Total Transistor Count (including BIST)  | PMOS: ~8000; NMOS: ~33000
Total Area (including BIST)              | 0.38 mm²
Effective Cap. per Clock Phase           | ~57pF
BIST Supply Voltage                      | 1.2V
Switching Activity                       | 0.5
Measured Frequency Range                 | 5MHz to 187MHz

Measurement                              | Single Supply Setting | Two Supplies Setting
Supply Voltage @ 20MHz                   | VDC=VCC=0.27V         | VDC=0.27V, VCC=0.18V
Supply Voltage @ 187MHz                  | VDC=VCC=0.36V         | VDC=0.36V, VCC=0.28V
Energy per Cycle @ 20MHz                 | 15.57 pJ              | 12.90 pJ
Energy per Cycle @ 187MHz                | 31.64 pJ              | 30.88 pJ
Figure of Merit @ 20MHz                  | 17.37                 | 14.40
Figure of Merit @ 187MHz                 | 35.31                 | 34.47
(Figure of Merit in nW/MHz/Tap/In-Bit/Coeff-Bit)
Table 4.1: SBL FIR filter statistics and performance measurements.
Two sets of measurements were obtained. In the ﬁrst set, the supplies VCC and
VDC were set equal to each other. In the second set, the two supplies were controlled
independently. As shown in Table 4.1, for both sets of measurements, the
FIR test-chip achieves a maximum operating frequency of 187MHz with all supplies
set at levels below Vth,nmos = 400mV. With the two supply values tuned independently,
the test-chip achieves higher energy eﬃciency than with a single-supply setting.
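The figure-of-merit values in Table 4.1 are consistent with dividing the energy per cycle by the number of taps and by the input and coefficient bit widths. The following minimal sketch assumes that definition, which is inferred from the units rather than stated explicitly in the text; the function name `figure_of_merit` is hypothetical. It reproduces the tabulated values to within rounding.

```python
# Assumed definition (inferred from the units in Table 4.1):
#   FoM = energy per cycle / (taps * input bits * coefficient bits)
TAPS, IN_BITS, COEFF_BITS = 14, 8, 8  # filter parameters from Table 4.1


def figure_of_merit(energy_pj_per_cycle):
    """Return the FoM in nW/MHz/Tap/In-Bit/Coeff-Bit.

    1 pJ per cycle equals 1 uW/MHz, i.e. 1000 nW/MHz.
    """
    nw_per_mhz = energy_pj_per_cycle * 1000.0
    return nw_per_mhz / (TAPS * IN_BITS * COEFF_BITS)


# Measured energy per cycle (pJ) from Table 4.1.
measurements = {
    "single supply, 20MHz": 15.57,
    "single supply, 187MHz": 31.64,
    "two supplies, 20MHz": 12.90,
    "two supplies, 187MHz": 30.88,
}

for label, energy_pj in measurements.items():
    print(f"{label}: {figure_of_merit(energy_pj):.2f} nW/MHz/Tap/In-Bit/Coeff-Bit")

# Relative energy saving of the two-supply setting at 20MHz.
saving_20mhz = 1.0 - 12.90 / 15.57
print(f"two-supply saving at 20MHz: {100.0 * saving_20mhz:.1f}%")
```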
4.5.1 Single-Supply Conﬁguration
Figure 4.9 shows the per-cycle energy consumption of our test-chip for operating
frequencies ranging from 5MHz to 187MHz. Data points are given for both single-
supply and dual-supply settings. At each frequency point, the energy drawn from
each supply is given separately, along with the total energy consumed. The diﬀerent
operating points are obtained by selecting oﬀ-chip inductors that yield a resonant fre-
quency at that clock frequency. In all cases, the oﬀ-chip inductors were 0612 discrete
devices that were mounted on the printed circuit board in proximity to the test-chip.
The maximum operating frequency of 187MHz was obtained with no external in-
ductors, with only the bondwires and package traces related to the clock generator
providing all the parasitic inductance.
For each single-supply data point in Figure 4.9, the corresponding voltage and in-
ductor value are given above the data point. The data show that energy consumption
is dominated by the energy drawn from the clock generator, with VDC accounting
for more than 80% of total energy consumption. As operating frequency decreases
from the maximum operating point of 187MHz, energy consumption decreases ap-
proximately linearly. The minimum energy point of 15.57pJ per cycle is obtained
at 20MHz with VCC = VDC = 0.27V and two oﬀ-chip inductors of 680nH each. At
this frequency, the recovery rate of the energy supplied through VDC is approximately
89%, yielding a 17.37 nW/MHz/Tap/InBit/CoeﬀBit ﬁgure of merit. As operating
frequency decreases below 20MHz, total energy consumption increases at an acceler-
ating rate, due to increasing crowbar currents, with VCC to VSS crowbar currents in
the Logic stage quickly dominating, as evidenced by the cut-out that zooms on data
in the 5MHz to 30MHz range.
Beyond energy eﬃciency and performance, another question addressed by our
experimental evaluation is the accuracy of the Spice simulation results presented in
Section 4.4. Figure 4.10 gives simulation results under the conditions used to ob-
tain measurements with a single supply. For operating frequencies in the 20MHz to
187MHz range, the discrepancy between simulations and measurements stays within
12%. At operating frequencies below 20MHz, the energy consumption of the Boost
stage starts increasing. This increase is not reflected to the same extent in the
simulations. We conjecture that, with supply voltages below 0.27V, the growing
discrepancy between simulations and measurements stems from device-model inaccuracies
at such aggressively scaled supply voltages.
4.5.2 Two-Supply Conﬁguration
The two-supply data points in Figure 4.9 have been obtained by keeping the same
VDC and inductor values as in the single-supply case, and by decreasing VCC by as
much as possible while still achieving correct function. The overall trends observed
are similar to the single-supply case. With VCC reduced, the energy drawn from VDC
increases, since the power-clock draws more energy to boost the smaller potential dif-
ference at the output of the Logic stage. As expected, however, energy consumption
in the Logic stage is signiﬁcantly decreased. The impact of reducing VCC is particu-
larly pronounced as operating frequencies decrease below 30MHz. Speciﬁcally, unlike
the single-supply case where VCC-related consumption starts increasing rapidly due
to crowbar currents, with two separate supplies the energy consumption in the Logic
stage remains relatively ﬂat, even at frequencies as low as 5MHz. Notice that at 5MHz,
where the crowbar current dominates, by separating VDC and VCC , we can reduce the
energy consumption by 61.7%. The minimum energy point is obtained at 20MHz with
VCC = 0.18V, yielding a figure of merit equal to 14.4 nW/MHz/Tap/InBit/CoeffBit,
a 17.1% improvement over the single-supply case. At this frequency, the recovery rate
of the energy supplied through VDC is approximately 86%.
[Figure 4.9 plots: (a) energy per cycle vs. operating frequency over the measured
range; (b) detail of the low-frequency range. Curves: Total, VDC, and VCC energy for
the single-supply setting and for the two-supply setting (VDC and LSMD the same as
in the single-supply setting). Single-supply points are labeled with supply voltage
and inductor value: VDC=VCC=0.36V with bondwire inductance only (LSMD=0nH),
0.35V/3nH, 0.33V/9nH, 0.31V/18nH, 0.30V/33nH, 0.30V/82nH, 0.30V/100H, 0.29V/150nH,
0.29V/220nH, 0.28V/390nH, 0.27V/680nH, 0.27V/1100nH, 0.28V/2700nH, 0.29V/10000nH.
Two-supply points are labeled with VCC values between 0.15V and 0.28V. Panel (b)
marks a 61.7% reduction.]
Figure 4.9: Measured energy consumption vs. operating frequency for SBL FIR filter
(single supply and two supplies).
[Figure 4.10 plot: measured and Spice-simulated Total, VDC, and VCC energy vs.
operating frequency; per-point error annotations: 1.6%, 9.6%, 4.1%, 4.3%, 8.0%,
8.7%, 7.7%, 8.0%, 6.0%, 35.3%, 36.3%, 12.6%, 11.6%, 54.7%]
Figure 4.10: Comparison of measured and simulated energy consumption for SBL
FIR filter (single supply).
4.5.3 Energy Trade-oﬀ and Robustness Analysis
Figure 4.11 gives a more detailed view of the trade-oﬀ between VCC- and VDC-
related energy consumption. The rightmost data points inside the oval on the right-
hand side give the energy consumption when a single supply is applied. By decreasing
VCC , VCC energy decreases as expected, and VDC energy increases gradually. Min-
imum total energy is obtained at VCC = 0.19V. When VCC decreases below 0.19V,
total energy consumption increases due to larger VDC-related energy.
[Figure 4.11 plot: VDC energy (VDC=0.28V), VCC energy, and total energy vs. VCC
voltage, with the minimal energy point marked]
Figure 4.11: Measured total energy consumption vs. VCC for the SBL FIR when
operating at 26.4MHz with VDC = 0.28V.
Another focus of our experiments was to determine the variability of resonant
frequency across multiple test-chips. Figure 4.12 shows the resonant frequencies of
10 test-chips when running free with VDC = 0.36V and ﬁxed 3nH surface-mount in-
ductors. Correct function has been validated for all 10 chips, with average resonant
frequency µ = 160.33MHz and standard deviation σ = 0.83MHz. The resonant fre-
quency of these chips varies by ±1%. Even with 3σ variation of 1.4MHz, it is still
within the ±3% tuning range of the clock generator circuit.
[Figure 4.12 histogram annotations: mean f = 160.33 MHz, σ = 0.83 MHz]
Figure 4.12: Measured resonant frequency distribution at VDC = VCC = 0.36V.
4.5.4 SBL FIR Filter Test-Chip Summary
A die photo of the SBL-based FIR is shown in Figure 4.13. Implemented in
a 0.13µm bulk silicon regular-Vth process, the FIR test-chip comprises a total of
approximately 41,000 devices. The FIR filter occupies 0.80mm × 0.35mm = 0.28mm².
Including BIST, the entire test-chip occupies a total area of 0.38mm². To reduce the
parasitic resistance of I/O pads and bondwires, two pads are used in parallel to
connect each power-clock phase to one of the terminals of the corresponding oﬀ-chip
inductor. With the exception of the inductors, which were discrete devices mounted
oﬀ the die, all other test-chip circuitry was fully integrated on the die.
Table 4.2 summarizes the performance data for our FIR test-chip.
For comparison purposes, it also includes published results for other FIR chips.
Depending on operating frequency and number of supplies used, our SBL-based FIR
test-chip achieves figures of merit that improve upon previous designs [44, 13, 14] by
factors ranging from 3X to 20X.
The results presented in this section suggest that SBL is a promising approach for
the implementation of regular datapaths with low energy consumption. To assess the
suitability and robustness of SBL for volume production, further evaluation would be
required, including sensitivity to temperature and wafer-to-wafer process variation,
device mismatch, and supply voltage variation.
Table 4.2: Performance table.
[Figure 4.13 die photo labels: FIR Filter; Pads for L1; Pads for L2]
Figure 4.13: SBL FIR die microphotograph.
4.6 Summary
This chapter presents a FIR ﬁlter test-chip that relies on a charge-recovery logic,
SBL, to achieve multi-MHz clock frequencies with subthreshold DC supply levels.
The SBL FIR is the ﬁrst design that demonstrates a near 200MHz operating fre-
quency with a single 0.36V supply level. Fabricated in a 0.13µm CMOS process
with Vth,nmos=0.40V, the FIR operates with a two-phase power-clock in the 5MHz-
187MHz range and with DC supplies in the 0.16V-0.36V range. With a single 0.27V
supply, the test-chip achieves its most energy eﬃcient operating point at 20MHz,
consuming 15.57pJ per cycle with a recovery rate of 89% and a ﬁgure of merit
equal to 17.37 nW/Tap/MHz/InBit/CoeﬀBit. With the introduction of a second
subthreshold supply at 0.18 V, energy consumption at 20MHz decreases further by
17.1%, yielding 14.40 nW/Tap/MHz/InBit/CoeﬀBit. At its maximum operating fre-
quency of 187MHz, the test-chip achieves 35.31 nW/Tap/MHz/InBit/CoeﬀBit and
34.47 nW/Tap/MHz/InBit/CoeﬀBit with one and two subthreshold supplies, respec-
tively. In Spice simulations of extracted layouts, the SBL-based FIR consumes 40%
to 50% less energy per cycle in the 17MHz-187MHz range, compared with a static
CMOS-based implementation derived by synthesis of the same FIR architecture and
automatic place-and-route.
CHAPTER 5
Architecture and Design of Resonant-Clock Flash
ADC
This chapter presents the architecture and main building blocks of a resonant-clock
flash ADC that was designed in a 65nm bulk silicon process and achieves a sampling
rate of up to 7GS/s. The remainder of this chapter is organized as follows: In
Section 5.1, we describe the previous work and challenges in high speed ﬂash ADC
designs. Section 5.2 presents details of the ﬂash ADC architecture with single-phase
resonant clocking techniques. Section 5.3 describes key building blocks of the ADC,
including the track-and-hold amplifier, comparator, encoder, and sense-amplifier
flip-flop circuitry. Section 5.4 presents the single-phase clock generator and its
distribution network in the resonant-clock ADC test-chip. Section 5.5 presents results
from the analysis of the integrated inductor that were obtained using a 3D full-wave
electromagnetic
ﬁeld solver.
5.1 Introduction
High-speed low-resolution ADCs are essential building blocks for wireless communication
systems and data storage devices. Since the flash architecture offers the highest
sampling rate among ADC architectures, it is a natural choice for such applications.
However, power consumption in flash ADCs increases exponentially with resolution, and
designing low-power flash ADCs is a challenging task.
Traditionally, ADC power reduction techniques have focused on the architecture
and analog circuitry of the converter (e.g., folding [48, 49], interpolation [50], time-
interleaving [51]). Based on previously published papers, however, the energy con-
sumption on the clock network and digital circuitry in ﬂash ADCs can be more than
50% of their total energy consumption [48, 52]. Clock-related power could be as high
as 30% [48], depending on ADC architecture.
Inductor-based techniques have been used in ADC design in the past, but not in
the context of reducing power consumption through resonant clocking. In [52, 53],
inductors were used to improve sampling rate. In [54], an inductor was used to
generate a low-jitter clock using an integrated LC-based VCO. In that design, the
clock network was driven conventionally using clock buﬀers, yielding no power savings
in clock distribution. Resonant clocking has been shown to be an effective approach
to reducing power consumption in GHz-speed clock distribution networks
[14, 15, 16].
5.2 Resonant-Clock Flash ADC Architecture
The architecture of the resonant-clock ADC is shown in Figure 5.1. The diﬀerential
input In+/In− is captured by a track-and-hold ampliﬁer (THA) similar to the one
in [55]. 31 dynamic comparators compare the captured signal with reference voltages
generated by a resistive ladder and store the resulting 31-bit thermometer code in
high-speed SR latches. A 2-cycle Gray code encoder ﬁxes any bubble or sparkle
errors and converts the 31-bit thermometer code to a 5-bit Gray code. These blocks
are described in more detail in Section 5.3.
All ADC components are synchronized using a single-phase resonant clock except
for the THA, which requires sharp clock edges to capture data accurately. To provide
these sharp clock edges, multiple stages of CMOS buffers are used to convert
the resonant clock waveform into a square-shape clock, as shown in Figure 5.1. To
facilitate testing, the ADC output is captured by a decimator once every 64 cycles.
A square-shape clock is generated using the same approach as in the THA, and a
standard-cell based frequency divider is used to generate the divide-by-64 clock for
the decimator.
[Figure 5.1 block labels: In+, In-, T/H, VREF_P, VREF_N, comparators (X 31),
SR Latches, Gray Code Encoder, 5-bit output, Decimator, Single-Phase Resonant
Clock, 0.77nH]
Figure 5.1: Architecture of ADC with single-phase resonant clock.
5.3 Resonant-Clock Flash ADC Building Blocks
5.3.1 Track-and-Hold Ampliﬁer
Figure 5.2 shows the THA circuitry that consists of a passive PMOS sampling
switch and a sampling capacitor, followed by a source follower buﬀer. The passive
switches connect to a sampling capacitor through a dummy switch which lowers the
common-mode jump after the track-to-hold transition and improves the linearity of
the switch outputs [56].
[Figure 5.2 signal labels: VINP, VINN, CLK, CLK_b, VOUTP, VOUTN, VDDH]
Figure 5.2: Track-and-hold amplifier (THA) circuit.
5.3.2 Comparators
Figure 5.3 shows the dynamic regenerative comparator used in the ADC. The
resonant clock RCK deﬁnes two operating phases: reset phase and comparison phase.
During the reset phase, RCK is low, and the diﬀerential outputs Out+ and Out- and
nodes T1 and T2 are precharged high by devices P3, P4, P5, and P6. As the
clock rises, devices N4 and N5 discharge the cross-coupled inverters with a slew rate
dependent on the input voltage, creating a slight voltage diﬀerence across the outputs.
During the comparison phase, the cross-coupled inverters regeneratively amplify the
voltage diﬀerence across the diﬀerential outputs to full rail, which is captured in the
SR latches.
[Figure 5.3 content: comparator schematic with devices P1 to P6 and N1 to N5, nodes
T1 and T2, inputs In+ and In-, outputs Out+ and Out-, clock RCK, supplies VDD and
VSS, and two trimming caps; simulated waveforms of V(T1), V(T2), V(Out+), V(Out-),
and V(RCK) from 0.0V to 1.2V over t = 1.3ns to 1.6ns]
Figure 5.3: Schematic of the comparator and waveform.
Comparator offset is predominantly caused by transistor mismatches, especially
MOS transistor threshold-voltage and current-factor mismatches [57]. In each
comparator with two differential pairs, the standard deviation of the threshold voltage
can be approximated as follows:
$$
\sigma(V_{th}) = \frac{A_{Vth}}{\sqrt{W \times L}} \times 2, \tag{5.1}
$$
where AVth is a process-specific parameter, and W × L is the transistor gate area.
From Equation (5.1), for a 5-bit ADC to achieve an effective number of bits (ENOB)
greater than 4.7, the comparator gate area needs to be sized up sufficiently to keep
σ(Vth) below 0.1 LSB, where 1 LSB is equal to the full-scale input voltage divided
by 2^5 [58]. Rewriting Equation (5.1), the transistor gate area W × L can thus be
expressed as:
$$
W \times L = \left(\frac{2 \times A_{Vth}}{\sigma(V_{th})}\right)^2
= \left(\frac{2 \times A_{Vth}}{0.1 \times \mathrm{LSB}}\right)^2
= \left(\frac{2 \times A_{Vth}}{0.1 \times \frac{V_{FS}}{2^5}}\right)^2
= \left(\frac{2^6 \times A_{Vth}}{0.1 \times V_{FS}}\right)^2, \tag{5.2}
$$
where VFS is the full-scale input voltage of the ADC. Equation (5.2) yields transistor
sizes that are prohibitively large in practice. Furthermore, these large comparators
increase the output load capacitance of the THA and the load of the clock network,
resulting in increased energy consumption.
To keep power dissipation low and transistor sizes practical, the differential pairs
in the comparator are intentionally sized so that the standard deviation of the
threshold voltage variation, σ(Vth), is as large as 0.45 LSB. Threshold calibration is then used to
further compensate for threshold voltage mismatches in the diﬀerential pairs of the
comparators. Threshold calibration is performed by digitally including additional
PMOS capacitance. Using scannable controls to change the voltage applied to the
gate of the PMOS, the eﬀective capacitance presented at the source and drain junc-
tions connected to nodes T1 and T2 can be varied. This capacitance tuning ability
allows for the compensation of current diﬀerences caused by device mismatch. The
calibration step size is 1/3 LSB with a 4-bit control signal. The reference voltage
calibration range is about ±2.5 LSB, which is enough to cover a 3 sigma variation in
Vth mismatch.
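The sizing trade-off described above can be illustrated numerically. The sketch below evaluates Equation (5.2) for the 0.1 LSB and 0.45 LSB mismatch targets and checks that a 4-bit trim with 1/3 LSB steps is consistent with the quoted ±2.5 LSB range. The values of `A_VTH` and `V_FS` are hypothetical placeholders, not taken from the text; only the area ratio, which is independent of them, should be read as meaningful.

```python
def gate_area(a_vth, sigma_target):
    """Equation (5.2): W*L = (2 * A_Vth / sigma(Vth))^2."""
    return (2.0 * a_vth / sigma_target) ** 2


def lsb(v_fs, bits=5):
    """One LSB of a `bits`-bit converter with full-scale voltage v_fs."""
    return v_fs / 2 ** bits


# Hypothetical values for illustration only (NOT from the source).
A_VTH = 3.5e-3  # V*um, assumed mismatch coefficient
V_FS = 0.4      # V, assumed full-scale input voltage

one_lsb = lsb(V_FS)
area_tight = gate_area(A_VTH, 0.10 * one_lsb)    # um^2, sigma target of 0.1 LSB
area_relaxed = gate_area(A_VTH, 0.45 * one_lsb)  # um^2, sigma target of 0.45 LSB
area_ratio = area_tight / area_relaxed           # (0.45 / 0.10)^2, independent of A_VTH and V_FS

# Trim range of a 4-bit control with a 1/3 LSB step size.
cal_codes = 2 ** 4
cal_span_lsb = (cal_codes - 1) / 3.0  # total span in LSB, i.e. +/- half of this
three_sigma_lsb = 3 * 0.45            # 3-sigma mismatch at the relaxed sizing

print(f"gate-area ratio (0.1 LSB vs. 0.45 LSB target): {area_ratio:.2f}")
print(f"calibration span: +/-{cal_span_lsb / 2:.2f} LSB, 3-sigma mismatch: {three_sigma_lsb:.2f} LSB")
```

Relaxing the mismatch target from 0.1 LSB to 0.45 LSB shrinks the required gate area by roughly a factor of 20, and the trim range still covers the resulting 3σ mismatch.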
To avoid edge effects in the comparator layout, the transistor length in each
comparator is sized up to 1.1X the feature device length. The 31 comparators are
placed in two rows (15 comparators in one row and 16 in the other). Five dummy comparators
are added to the two ends of the comparator array. The total number of comparators,
including the 5 dummies, is thus 36.
5.3.3 Gray Code Encoder
The function of the Gray encoder is to convert the 31-bit wide thermometer code
generated by the comparator array into a 5-bit Gray code. In a thermometer code,
if the input is higher than the reference level of a particular comparator, then
the corresponding output of that comparator is high. Ideally, all comparator outputs
below the input level are 1s, and all comparator outputs above the input level are
0s. However, for fast input signals, small timing differences in the response times of
the comparators, combined with threshold voltage offsets, can cause a situation where
a 1 is found above a 0. This phenomenon is called a bubble or sparkle error in the
thermometer code. Metastability of the comparators and the presence of bubbles in
the thermometer code are two classes of errors that must be addressed.
[Figure 5.4 schematic blocks: 3-input majority functions ("Majority Func.") operating
on adjacent thermometer-code bits and their complements (e.g., In_b[1:3], In[5:7],
In[29:31]), ab+cd gates, and flip-flops (FF); outputs G5 to G1]
Figure 5.4: Schematic of the 2-cycle resonant-clocked Gray code encoder.
Figure 5.4 shows the schematic of the 2-stage digital encoder that is used in the
ADC. In the first stage, a majority function corrects the bubble/sparkle errors in the
31-bit thermometer code. Then, in the second stage, the function ab+cd is performed
to reduce the intermediate data and thus reduce the number of flip-flops. Only 14
latches are used in our encoder, resulting in decreased power consumption.
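The first-stage bubble correction can be illustrated with a small behavioral model. This is only a sketch of the general technique (3-input majority voting followed by a Gray conversion), not the gate-level circuit of Figure 5.4: the ones-counting step stands in for the ab+cd logic, and the boundary padding values are assumptions made here.

```python
def majority3(a, b, c):
    """3-input majority vote on single bits (0 or 1)."""
    return (a & b) | (b & c) | (a & c)


def correct_bubbles(thermo):
    """Vote each thermometer bit against its two neighbors.

    Index 0 corresponds to the lowest reference level. The padding values
    (1 below the lowest bit, 0 above the highest) are an assumption.
    """
    padded = [1] + list(thermo) + [0]
    return [majority3(padded[i], padded[i + 1], padded[i + 2])
            for i in range(len(thermo))]


def thermometer_to_gray(thermo):
    """Behavioral conversion: count the corrected ones, then binary-to-Gray."""
    level = sum(correct_bubbles(thermo))
    return level ^ (level >> 1)


# A clean 31-bit code at level 10, and the same code with a single bubble:
# bit 9 reads 0 while bit 10 reads 1.
clean = [1] * 10 + [0] * 21
bubbled = [1] * 9 + [0, 1] + [0] * 20

print(format(thermometer_to_gray(clean), "05b"))
print(format(thermometer_to_gray(bubbled), "05b"))
```

Both inputs map to the same 5-bit Gray code, which is the purpose of the majority stage: an isolated bubble near the transition point does not change the encoded level.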
5.3.4 Sense-Ampliﬁer Flip-Flop
Figure 5.5 shows the schematic of the flip-flop used in the ADC. This sense-amplifier
flip-flop topology [1] has been chosen to ensure a short C-to-Q delay when
clocked by 5.5GHz and 7GHz resonant clock waveforms with 10% to 90% transition
times of 54ps and 42ps, respectively. It also has the advantage of tracking the
effect of variation in the comparator since their structures are very similar.
[Figure 5.5 schematic labels: VDD, VSS, RCK, In, out+, out-, Q, Q_b]
Figure 5.5: Schematic of the sense-amplifier flip-flop.
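The transition times quoted for the resonant clock are close to what an ideal sinusoid would give. The sketch below assumes a pure full-swing sine wave, which is an idealization of the actual resonant clock waveform; the function name `sine_transition_time` is hypothetical.

```python
import math


def sine_transition_time(freq_hz, lo=0.10, hi=0.90):
    """Rise time between the lo and hi fractions of the swing of an ideal sinusoid.

    The waveform is modeled as v(t) = 0.5 * (1 + sin(2*pi*f*t)), normalized to a
    0..1 swing, so the level x is crossed at the phase asin(2*x - 1).
    """
    phase = math.asin(2.0 * hi - 1.0) - math.asin(2.0 * lo - 1.0)
    return phase / (2.0 * math.pi * freq_hz)


for f_ghz in (5.5, 7.0):
    t_ps = sine_transition_time(f_ghz * 1e9) * 1e12
    print(f"{f_ghz} GHz: 10%-90% transition time of about {t_ps:.1f} ps")
```

Under this assumption the 10% to 90% transition takes about 29.5% of the clock period, which is why the transition time shortens as the clock frequency rises.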
5.4 Clock Network Design
A single-phase resonant clock is generated by a single-ended clock generator, as
shown in Figure 5.6. An on-chip inductor L is used to set up an LC tank, where
C is the parasitic capacitance associated with the clock distribution network and
the clock pins of the clocked elements. In this chip, L is estimated at 0.77nH using
a commercial 3D electromagnetic ﬁeld solver, and C is estimated at 1.2pF using a
commercial RC extraction tool. Programmable switches (PMOS: 6µm to 60µm with
step size of 6µm; NMOS: 3µm to 30µm with step size of 3µm) powered by supply
VCK are placed near the clock terminal of the inductor to replenish the energy lost
due to the parasitic resistance of the clock network. A 12pF capacitive divider is
placed on the other terminal of the inductor to stabilize the oscillation. A reference
clock generated by a programmable ring oscillator drives the programmable switches,
enabling the operation of the LC tank at the frequency of that clock. The reference
clock drives the switches either directly or through an on-chip pulse generator that
varies its duty cycle D in the range 20% ≤ D ≤ 50%.
[Figure 5.6 labels: Ref. Clock, Pulse Gen., Mux, Non-Overlapping, P_sel, N_sel, VCK,
VSS, RCK, 0.77nH inductor, Parasitic Cap. (~1.2pF), 39µm × 69µm clock distribution
area, 31 Comparators Array, Gray Code Encoder, To Decimator, To THA]
Figure 5.6: Resonant clock generator with variable pulse controls.
To achieve high energy eﬃciency, various inductor conﬁgurations have been eval-
uated using a commercial electromagnetic ﬁeld solver. Based on this evaluation, a
1.75-turn 130µm × 130µm 2-metal-layer inductor has been chosen for the ADC. A
ground shield consisting of patterned M1 has been added directly underneath the
inductor to improve its quality factor (Q) and thus increase the eﬃciency of the LC
tank. Detailed inductor analysis results are given in Section 5.5. Figure 5.6 also
shows the clock distribution network topology of the ADC. The single-phase resonant
clock is distributed over a 39µm × 69µm area. The clock terminal of the inductor is
ﬁrst connected to a 39-µm long, 2-metal-layer clock trunk with an eﬀective width of
2.2µm. The clock is then distributed along six 69µm 4-metal-layer clock spines with
an eﬀective width of 2.6µm each. 45 clocked elements are placed directly under the
six clock spines to achieve low distribution resistance and low coupling capacitance.
The maximum resistance between the clock terminal of the inductor and any clocked
element is 1.15Ω, and the load for the corresponding path is 380fF, including wire
and gate capacitance. The resulting maximal RC product for any clock element is
quite small, yielding a worst-case clock skew below 1ps.
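The worst-case RC product quoted above is easy to evaluate. The sketch below treats the clock path as a single lumped RC segment, which is a first-order assumption rather than the extraction-based analysis behind the reported skew.

```python
# Values reported in the text for the worst-case clock path.
R_MAX_OHM = 1.15    # maximum resistance from the inductor's clock terminal
C_LOAD_F = 380e-15  # load on that path, including wire and gate capacitance

# Lumped RC time constant, expressed in picoseconds.
rc_ps = R_MAX_OHM * C_LOAD_F * 1e12

print(f"worst-case RC product: {rc_ps:.3f} ps")
```

The resulting time constant is well under 1ps, in line with the sub-picosecond skew reported for the distribution network.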
5.5 Inductor Design and Analysis
Inductor design and analysis were performed after extracting the clock network
capacitance of the ADC design. Based on the parasitic capacitance of the entire
clock network, the inductor value was selected to yield a resonant frequency near the
target operating frequency. The relation between the inductance L, capacitance C,
and resonant frequency f is as follows:
$$
f = \frac{1}{2\pi\sqrt{LC}}. \tag{5.3}
$$
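As a rough numerical check of Equation (5.3), the sketch below plugs in the values reported in Section 5.4 (L = 0.77nH, C = 1.2pF). Treating the 12pF capacitance at the other inductor terminal as being in series with the clock load is an assumption made here for illustration; the text does not spell out the equivalent tank capacitance.

```python
import math


def resonant_frequency(l_henry, c_farad):
    """Equation (5.3): f = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


def series_capacitance(c1, c2):
    """Equivalent capacitance of two capacitors in series."""
    return c1 * c2 / (c1 + c2)


L_TANK = 0.77e-9  # H, inductance estimated with the field solver
C_CLOCK = 1.2e-12  # F, extracted clock network capacitance
C_DIVIDER = 12e-12  # F, capacitance on the other inductor terminal

f_plain = resonant_frequency(L_TANK, C_CLOCK)
# Assumption: the 12pF capacitance acts in series with the clock load.
f_series = resonant_frequency(L_TANK, series_capacitance(C_CLOCK, C_DIVIDER))

print(f"clock load only:        {f_plain / 1e9:.2f} GHz")
print(f"with series 12pF (assumed): {f_series / 1e9:.2f} GHz")
```

Either estimate lands within a few percent of the 5.5GHz operating point, which suggests the extracted L and C values are mutually consistent.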
An inductor database supplied by the foundry gives inductor parameters (e.g.,
number of turns, diameter, metal width, area, and quality factor) to achieve selected
inductance values. Since this database gives only a limited number of metal options
and inductance values, to obtain an inductor that meets our exact speciﬁcations, we
have used these database parameters as a starting point for designing the inductor
used in our ADC.
The inductor layout was drawn in Cadence Virtuoso. A pattern of M1 strips was
added directly below the inductor to reduce the eddy current loss in the substrate
and improve the inductor quality factor (Q) [59]. The layout was exported in stream
format and imported to HFSS, a 3D full-wave electromagnetic simulator, for electro-
magnetic ﬁeld analysis.
The final inductor design was a 1.75-turn coil with outer dimensions 130µm ×
130µm and inner diameter of 85µm. It consists of 2 metal layers (M8 and M9) with
11.5µm-wide wires each and an effective thickness of 1.8µm. The simulation results
show that with M1 shield, the inductor provides 0.77nH of inductance and a quality
factor of 17.1 at 5.5GHz, and 0.80nH of inductance and a quality factor of 17.2 at
7.0GHz.
Figure 5.7: Current density in the substrate underneath inductor: (a) without M1
shield, and (b) with M1 shield.
[Figure 5.8 plots: (a) inductance (nH, 0.0 to 1.6) vs. frequency (0 to 12 GHz), with
and without M1, annotated 0.77nH @ 5.5 GHz, 0.65nH @ 5.5 GHz, 0.80nH @ 7.0 GHz,
and 0.65nH @ 7.0 GHz; (b) quality factor (0 to 20) vs. frequency (0 to 12 GHz), with
and without M1, annotated Q=17.1 @ 5.5 GHz, Q=13.5 @ 5.5 GHz, Q=17.2 @ 7.0 GHz,
and Q=14.4 @ 5.5 GHz]
Figure 5.8: (a) Inductance vs. resonant frequency with and without M1 shield. (b)
Quality factor vs. resonant frequency with and without M1 shield.
Figure 5.7 shows the inductor layout with and without the M1 pattern shield and
current density in the substrate, as obtained from HFSS. For the setup in Figure 5.7(a)
without M1 shield, the image current induced by the magnetic ﬁeld in the conductive
substrate layer ﬂows in a direction opposite to that of the current in the spiral. The
negative self-inductance then leads to a signiﬁcant drop in the total inductance and
hence Q. In the simulation results shown in Figure 5.7(b), the induced eddy current
in the substrate layer is greatly reduced by the M1 strips shield.
Figure 5.8(a) shows inductance versus frequency. With M1 strips added, eddy
current loss in the substrate is reduced, and inductance increases by 15.4% at 5.5GHz
and 23.1% at 7GHz. Figure 5.8(b) shows the quality factor Q versus frequency, where
Q is deﬁned as:
$$
Q = -\frac{\mathrm{Im}(Y(1,1))}{\mathrm{Re}(Y(1,1))}. \tag{5.4}
$$
The graph shows that the inductor with M1 shield achieves a higher peak Q, with
the peak occurring at a lower frequency. For frequencies above the peak Q frequency,
the quality factor for the inductor with M1 shield drops faster than the one without
M1. This drop is due to the increased coupling capacitance between the inductor and
the M1 shield. Adding the M1 shield improves Q by 26% at 5.5GHz and by 19%
at 7GHz.
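To make the definition in Equation (5.4) concrete, the sketch below evaluates it for a one-port series R-L model, for which Q reduces to ωL/R. This model is a simplifying assumption that ignores the capacitive parasitics captured by the full-wave simulation; the series resistance is back-computed from the reported L and Q at 5.5GHz and is illustrative only.

```python
import math


def quality_factor(y11):
    """Equation (5.4): Q = -Im(Y11) / Re(Y11)."""
    return -y11.imag / y11.real


def series_rl_admittance(r_ohm, l_henry, freq_hz):
    """Y11 of an ideal series R-L one-port (assumed model, no parasitic capacitance)."""
    return 1.0 / complex(r_ohm, 2.0 * math.pi * freq_hz * l_henry)


FREQ = 5.5e9          # Hz
L_SHIELDED = 0.77e-9  # H, simulated inductance with M1 shield at 5.5GHz
Q_REPORTED = 17.1     # simulated quality factor with M1 shield at 5.5GHz

# Series resistance implied by the reported L and Q under this model.
r_series = 2.0 * math.pi * FREQ * L_SHIELDED / Q_REPORTED
q_model = quality_factor(series_rl_admittance(r_series, L_SHIELDED, FREQ))

print(f"implied series resistance: {r_series:.2f} ohm")
print(f"Q from Y11 of the series R-L model: {q_model:.1f}")
```

The negative sign in Equation (5.4) reflects that an inductive one-port has a negative imaginary admittance, so Q comes out positive.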
CHAPTER 6
Evaluation and Testing of 7GS/s Resonant-Clock
Flash ADC
In this chapter, we present experimental results from the evaluation of the
resonant-clock ADC test-chip, demonstrating its high performance and high energy efficiency.
In this ADC, resonant clocking is deployed to decrease the power consumption of the
network that distributes the clock signal to the analog and digital circuitry of the
ADC. A fully integrated inductor is used to resonate the parasitic capacitance of the
entire clock distribution network all the way to the clocked timing elements.
In Section 6.1, we present measurement results and ADC performance characterization
from the resonant ADC test-chip. Conclusions are given in Section 6.2.
6.1 Measurement Results
Multiple independent supplies are used in this ADC test-chip. VDDA is the supply
for the comparators, VDD is the supply for digital components, VDDH is the supply
for THA, and VCK is the supply for the clock generator. This setup allows for the in-
dependent monitoring of power consumption in various ADC components at diﬀerent
sampling frequencies. The calibration of the test-chip is performed by externally
applying DC voltages to the ADC inputs, at a sampling rate of 5.5GS/s. By setting
threshold voltages at the ADC inputs, we adjust the offset of the corresponding
comparators to a desired output code. The calibration process is carried out in
LabVIEW, a commercial software package, and the optimal offset compensation value
is chosen automatically for each comparator.
[Figure 6.1 plots: DNL and INL (LSBs) vs. output code (0 to 30), not calibrated and
calibrated]
Figure 6.1: Measured DNL/INL.
Figure 6.1 shows measured diﬀerential non-linearity (DNL) and integral non-
linearity (INL) at 5.5GS/s. Measurements taken before and after calibration show
that DNL improves from 0.65/-0.70 LSB to 0.21/-0.30 LSB, and that INL improves
from 0.74/-0.77 LSB to 0.33/-0.31 LSB.
Figure 6.2: Measured SNDR/SNFR vs. input frequency. (Calibrated SNDR and SNFR in dB versus input frequency in GHz, with the 3dB SNDR drop marked.)
Figure 6.2 shows the signal-to-noise-and-distortion ratio (SNDR) and the signal-to-noise-floor ratio (SNFR) versus input frequency (fin). Operating the chip at 5.5GS/s and applying a full-scale sine input yields an SNDR of 29.0dB and an SNFR of 39.1dB with fin at 440MHz (4.56 ENOB) and an SNDR of 26.5dB with fin at 2.04GHz (4.11 ENOB). The 3dB difference between the two frequency points indicates that the effective resolution bandwidth (ERBW) is 2.04GHz.
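The relation between SNDR and ENOB used in such characterizations can be sketched as follows. The sketch assumes the standard ideal-quantizer relation ENOB = (SNDR - 1.76dB) / 6.02dB, which this chapter does not state explicitly; the function names are illustrative.

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR in dB (ideal-quantizer relation)."""
    return (sndr_db - 1.76) / 6.02


def sndr_from_enob(enob_bits):
    """Inverse relation: SNDR in dB corresponding to a given ENOB."""
    return 6.02 * enob_bits + 1.76


# SNDR measured with fin at 2.04GHz, as reported in the text.
enob_high_fin = enob_from_sndr(26.5)
```

With the 26.5dB SNDR measured at 2.04GHz, this relation reproduces the quoted 4.11 ENOB. For the 29.0dB SNDR at 440MHz it gives about 4.52 bits, slightly below the quoted 4.56, so the reported values may have been derived from unrounded measurements.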
Figure 6.3 shows the breakdown of measured energy consumption versus sampling frequency. The ADC test-chip operates from 4.5GS/s to 7.0GS/s. The clock energy of the resonant-clock ADC is obtained from measurements using the separate power supply VCK. As shown in the figure, in the frequency range from 4.5GHz to 7.0GHz, VDDA ranges from 0.96V to 1.10V, VDD ranges from 0.99V to 1.10V, and VCK ranges from 0.96V to 1.10V.
Figure 6.3: Measured energy per cycle vs. sampling frequency. (Total energy and clock-related energy per cycle in pJ versus sampling frequency in GHz. Annotated supply settings, in order of increasing voltage: VCK = 0.96V, 0.97V, 1.00V, 1.03V, 1.06V, 1.10V; (VDD, VDDA, VDDH) = (0.99V, 0.96V, 1.26V), (0.99V, 0.97V, 1.27V), (1.00V, 1.00V, 1.30V), (1.03V, 1.03V, 1.33V), (1.06V, 1.06V, 1.36V), (1.10V, 1.10V, 1.40V).)
In the graph of Figure 6.3, the minimum clock energy consumption of 0.54pJ
is reached at 5.5GHz, indicating that 5.5GHz is the closest frequency point to the
natural frequency of the ADC. At this frequency, the resonant ADC consumes 5.1pJ
per cycle with only 10.7% of the total energy consumption being clock related. At
the same frequency, the ADC reaches the lowest FoM of 396fJ per conversion step.
The clock system in the resonant ADC consumes 54% less energy than CV².
Since the energy used to drive the clock generator switches is included in the total clock-related energy consumption, the actual resonant LC recovery rate is higher than the resonant system recovery rate of 54%. Spice-level simulations show that clock distribution accounts for about 69% of the total clock-related energy, with the remaining 31% consumed by the pulse generator and the drivers of the clock generator switches. Applying this ratio to the measurement result and including only the energy consumed for clock distribution, the LC oscillation has an energy recovery rate of 77%.
The minimum clock-energy frequency of 5.5GHz is close to the estimated resonant
frequency obtained from post-layout extraction. Inductance L has been extracted
with a commercial 3D electromagnetic ﬁeld solver from the layout of the inductor
in the test-chip. The capacitance of the clock network has been extracted from the
complete metal-ﬁlled design using a commercial RC extraction tool. These extracted
inductance and capacitance values yield a natural resonant frequency of 5.2GHz,
which is 6% away from the minimum clock-power frequency of 5.5GHz.
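The resonance estimate above can be illustrated with a short sketch. It assumes a lumped LC model with natural frequency f0 = 1/(2π√(LC)) and takes L to be the 0.77nH on-chip inductor value quoted in the chapter summary. The extracted clock capacitance is not quoted here, so the capacitance computed below is only the value implied by the 5.2GHz estimate under these assumptions, not a measured or extracted number.

```python
import math


def resonant_frequency(l_henry, c_farad):
    """Natural frequency of a lumped LC tank in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


def capacitance_for_frequency(l_henry, f_hz):
    """Capacitance that resonates with l_henry at f_hz (lumped model)."""
    return 1.0 / ((2.0 * math.pi * f_hz) ** 2 * l_henry)


L_CLK = 0.77e-9      # inductor value from the chapter summary (assumed to be the extracted L)
F_EXTRACTED = 5.2e9  # natural frequency from post-layout extraction
F_MEASURED = 5.5e9   # minimum clock-energy frequency from measurement

# Capacitance implied by the lumped model (about 1.22pF); an inference only.
c_implied = capacitance_for_frequency(L_CLK, F_EXTRACTED)
# Relative gap between measured and extracted frequencies (about 6%).
deviation = (F_MEASURED - F_EXTRACTED) / F_EXTRACTED
```

Under these assumptions, the implied clock-network capacitance is roughly 1.22pF, and the measured minimum-energy frequency lies about 6% above the extracted estimate, in line with the figure quoted in the text.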
Figure 6.4: Measured power breakdown at 5.5GS/s. (THA: 7.9mW, 28.1%; comparators: 8.1mW, 28.8%; latches/encoder: 7.7mW, 27.4%; clock-related: 3mW, 10.7%; others: 1.4mW, 5.0%.)
Figure 6.4 shows the power breakdown at a sampling frequency of 5.5GHz, ob-
tained by measuring current from the independent supplies powering each ADC com-
ponent. The total power consumption at 5.5GS/s is 28mW, with 28.8% of this total
used in the comparators, 28.1% in the track and hold circuits, and 27.4% in the dig-
ital components. Total clock power, which includes power consumption in the clock
distribution network, programmable switches, and the pulse generator, is 3mW and
accounts for 10.7% of total power.
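As a consistency check on these figures, the following sketch recomputes the percentage shares and the energy per cycle from the component powers reported in Figure 6.4. The dictionary keys and function names are illustrative; the numerical inputs are the values reported above.

```python
# Component powers in mW at 5.5GS/s, as reported in Figure 6.4.
POWER_MW = {
    "THA": 7.9,
    "Comparators": 8.1,
    "Latches/Encoder": 7.7,
    "Clock-Related": 3.0,
    "Others": 1.4,
}
F_S_HZ = 5.5e9  # sampling frequency


def shares_percent(power_mw):
    """Share of total power for each component, in percent."""
    total = sum(power_mw.values())
    return {name: 100.0 * p / total for name, p in power_mw.items()}


def energy_per_cycle_pj(power_mw, f_hz):
    """Energy per clock cycle in pJ for a power in mW at frequency f_hz."""
    return power_mw * 1e-3 / f_hz * 1e12


total_mw = sum(POWER_MW.values())
shares = shares_percent(POWER_MW)
total_energy_pj = energy_per_cycle_pj(total_mw, F_S_HZ)
clock_energy_pj = energy_per_cycle_pj(POWER_MW["Clock-Related"], F_S_HZ)
```

The component powers sum to 28.1mW, which corresponds to about 5.1pJ per cycle at 5.5GHz, and the 3mW clock power corresponds to roughly 0.54pJ to 0.55pJ per cycle, consistent with the values read from Figure 6.3.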
Figure 6.5(a) shows the measured SNDR versus sampling clock frequency (fs)
with a 400MHz full-power input. The measured ENOB is 4.18 and 4.62 at 4.5GS/s
and 5.5GS/s, respectively. At 7.0GS/s, the ADC achieves more than 4.40 eﬀective
bits after calibration. Figure 6.5(b) shows FoM versus sampling frequency, where
FoM is deﬁned as follows:
\[
\mathrm{FoM} = \frac{\mathrm{power}}{2^{\mathrm{ENOB}} \times \min(2 \times \mathrm{ERBW},\, f_s)}. \tag{6.1}
\]
The lowest FoM of 396fJ per conversion step is achieved at the sampling frequency
of 5.5GS/s. At sampling frequencies greater than 5.5GS/s, FoM increases due to the
increasing supply voltages and thus higher energy consumption. At lower sampling
frequencies, even though the ADC has lower energy consumption, ENOB decreases,
and thus FoM increases.
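Equation (6.1) can be evaluated directly, as in the minimal sketch below. The function name `adc_fom` is illustrative. The choice of inputs for the 5.5GS/s operating point (28mW, the 4.11 ENOB measured with fin at 2.04GHz, and ERBW = 2.04GHz) is an assumption about which measured values enter the reported FoM.

```python
def adc_fom(power_w, enob_bits, erbw_hz, fs_hz):
    """FoM of Equation (6.1) in joules per conversion step."""
    return power_w / (2.0 ** enob_bits * min(2.0 * erbw_hz, fs_hz))


# Assumed inputs for the 5.5GS/s operating point (see lead-in).
fom_5p5 = adc_fom(power_w=28e-3, enob_bits=4.11, erbw_hz=2.04e9, fs_hz=5.5e9)
fom_5p5_fj = fom_5p5 * 1e15  # convert J to fJ per conversion step
```

With these inputs, 2 × ERBW = 4.08GHz is smaller than the sampling frequency and therefore sets the denominator. The result is approximately 397fJ per conversion step, which agrees with the reported 396fJ to within the rounding of the inputs.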
Table 6.1 summarizes the performance of the ADC. Operating in the vicinity of its resonant frequency at a sampling rate of 5.5GS/s, with the THA supply at 1.3V and the other supplies at 1.0V, the ADC dissipates 28mW with fin at 2.04GHz, and its FoM is 396fJ per conversion step. Operating at its maximum sampling frequency of 7.0GS/s, the ADC dissipates 45mW with a 0.1V increase to each supply. At that frequency, 11.1% of the total energy consumption is clock-related and the ADC yields a FoM of 683fJ per conversion step.
Figure 6.5: (a) Measured SNDR vs. sampling frequency with input frequency of 400MHz. (b) Measured FoM vs. sampling frequency.
Table 6.1: Performance summary.
Figure 6.6 compares the ADC in this work with the previously reported ADCs surveyed in [2]. Our resonant-clock ADC achieves a lower FoM than all previously published ADCs operating above 2.2GHz. Specifically, it achieves 58% lower FoM than [60], which used interpolation and reduced output swing in the front-end circuits. It also achieves 20% and 14% lower FoM than [61] and [51], which used clock duty cycle control and 8X time-interleaving, respectively, to improve sampling rate. The techniques used in [51, 60, 61] are compatible with resonant clocking and can be used to further improve the FoM of resonant-clock ADCs, such as the one presented in this work.
Figure 6.7 shows the die microphotograph of the ADC chip. The device has been fabricated using a 65nm CMOS technology and packaged in a 32-pin QFN package. The ADC core occupies 277.6µm × 64µm = 0.018mm². Including the 130µm × 130µm inductor, the ADC occupies 0.035mm².
Figure 6.6: FoM vs. sampling frequency: comparison to prior work summarized in [2]. (FoM in fJ per conversion step versus sampling frequency in GHz for ADCs published at ISSCC, VLSI, and ESSCIRC, together with this work. The annotated reference designs are VLSI'07 (90nm), VLSI'09 (65nm), and VLSI'10 (65nm, 8X time-interleaved), with FoM reductions of 13%, 20%, and 58% marked for this work.)
6.2 Summary
This chapter presented measurement results from the evaluation of a 5-bit non-interleaved resonant-clock flash ADC that achieves a sampling rate of 7GS/s. An on-chip 0.77nH inductor resonates the entire clock distribution network to achieve energy-efficient operation. The ADC has been designed in a 65nm bulk CMOS process and occupies 0.035mm², including the integrated inductor. Operating at 5.5GHz, it consumes 28mW, yielding 396fJ per conversion step, with its clock network accounting for 10.7% of total power and consuming 54% less energy than CV². Operating at its maximum sampling frequency of 7.0GS/s, the ADC dissipates 45mW with a 0.1V increase to each supply. At that frequency, 11.1% of the total energy consumption is clock-related, and the ADC yields a FoM of 683fJ per conversion step. By comparison, in a typical flash ADC design, 30% of total power is clock-related.
Figure 6.7: ADC die microphotograph.
CHAPTER 7
Conclusions and Future Directions
This chapter summarizes the contributions of this dissertation. Charge-recovery techniques are explored as a solution targeting both high-performance and ultra-low-power platforms to achieve increased energy efficiency over conventional CMOS designs without compromising performance or reliability. Two charge-recovery systems operating at different frequency points are tested and evaluated. Both designs provide high performance at their corresponding supply levels and achieve higher energy efficiency than their conventional CMOS counterparts.
7.1 Subthreshold Boost Logic
For ultra-low energy consumption with high performance, we present Subthreshold
Boost Logic (SBL), a circuit family that is capable of operating at multi-MHz clock
frequencies using subthreshold supplies. Unlike subthreshold circuitry, which uses
subthreshold currents for computation and typically operates in sub-MHz range, SBL
gates are overdriven to operate in the linear region, achieving order-of-magnitude
improvements in operating speed over subthreshold logic. Energy eﬃcient operation is
ensured through the use of aggressively-scaled DC supplies at subthreshold levels and
by deploying charge recovery design techniques to boost these subthreshold supply
levels by 3X to 4X.
To demonstrate the performance and energy eﬃciency of SBL, we have designed
a 14-tap 8-bit ﬁnite-impulse response (FIR) ﬁlter test-chip implemented using SBL.
Fabricated in a 0.13µm bulk silicon process with regular thresholds, the test-chip
functions correctly for clock frequencies ranging from 5MHz to 187MHz, relying on
two discrete oﬀ-chip inductors to boost the subthreshold supply in an energy-eﬃcient
manner. Clock drivers are fully integrated and distributed across the entire clock
network. With a single subthreshold supply set to 0.27V, it achieves its most en-
ergy eﬃcient operating point at 20MHz, yielding a ﬁgure of merit equal to 17.37
nW/Tap/MHz/InBit/CoeﬀBit. With the introduction of a second subthreshold sup-
ply set to 0.18V, energy consumption due to crowbar currents at clock frequencies
below 30MHz is signiﬁcantly reduced. Maximum energy eﬃciency is improved by
17.1% and is achieved at 20MHz, yielding 14.40 nW/Tap/MHz/InBit/CoeﬀBit. At
maximum energy eﬃciency, energy recovery rates range from 86% to 89%, depend-
ing on the number of supplies. Based on Spice simulations of the SBL FIR and a
fully-automatic static CMOS implementation of the same FIR architecture, the SBL
design consumes 40% to 50% less energy per cycle in the 17MHz-187MHz range while
incurring a 15% area overhead.
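The FIR figure of merit and the quoted efficiency improvement can be sketched as follows. The sketch assumes that the figure of merit is power normalized by the number of taps, the clock frequency in MHz, the input bit width, and the coefficient bit width, as its unit (nW/Tap/MHz/InBit/CoeffBit) suggests. The function names are illustrative, and the power and coefficient width used in the round-trip example are hypothetical, not measured values.

```python
def fir_fom(power_nw, taps, f_mhz, in_bits, coeff_bits):
    """Normalized FIR power in nW/Tap/MHz/InBit/CoeffBit."""
    return power_nw / (taps * f_mhz * in_bits * coeff_bits)


def improvement_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Reported figures of merit at 20MHz with one and with two supplies.
gain = improvement_percent(17.37, 14.40)

# Hypothetical round trip: a 14-tap filter at 20MHz with 8-bit inputs and
# (assumed) 8-bit coefficients dissipating 17920nW would score exactly 1.0.
example_fom = fir_fom(17920.0, taps=14, f_mhz=20.0, in_bits=8, coeff_bits=8)
```

The reduction from 17.37 to 14.40 nW/Tap/MHz/InBit/CoeffBit evaluates to 17.1%, matching the improvement quoted above.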
7.2 Resonant-Clock Flash ADC Design
For high performance and high energy eﬃciency, we present a 5-bit 7GS/s non-
interleaved ﬂash ADC with resonant clocking techniques. The chip has been fabri-
cated with a 65nm CMOS process. A 0.77nH on-chip inductor is used to generate
a single-phase resonant clock and reduce power consumption of the ADC clock dis-
tribution network. From measurement results, the test chip achieves a DNL below
0.30 LSB and an INL below 0.33 LSB. With a 400MHz full-power input, ENOB is
4.62 bits and 4.40 bits at 5.5GS/s and 7.0GS/s, respectively. When operating at
5.5GHz, the resonant-clock ADC consumes 28mW, achieving a FoM equal to 396fJ
per conversion step. This FoM is lower than all previously published ADCs operating
above 2.2GHz. The clock network dissipates only 10.7% of total power; in contrast,
clock-related power in typical ﬂash ADCs is roughly 30% of total power. Operating at
its maximum sampling frequency of 7.0GS/s, the ADC dissipates 45mW with 11.1%
of the total energy consumption attributed to the clock, yielding a FoM of 683fJ per
conversion step.
7.3 Future Directions
In this section, we discuss some future possibilities in the implementation of
charge-recovery logic and resonant-clocked designs.
Design Automation for Charge-Recovery Logic Designs
An important challenge for charge-recovery logic is the lack of automation tools and automated design flows. Unlike resonant-clocked designs, which are more amenable to the conventional design flow, charge-recovery logic designs usually require a longer design process. Logic synthesis is done manually, and the result may not be fully optimized for timing. For designs with regular structure, place-and-route may be relatively straightforward; however, for general non-regular designs, it can be particularly challenging.
Establishing a standard cell library is the first step toward automated charge-recovery design. An algorithm is needed to decide the size of the standard cell library and the complexity of the functions implemented by its logic gates. Since each charge-recovery logic gate is a pipeline stage, the functional complexity of each gate is strongly related to the number of pipeline stages and the system latency.
When performing logic synthesis, another algorithm is needed to automatically trade off the number of pipeline stages against the number of buffers for clock phase alignment. More stages may reduce the transistor sizes in each gate and the length of routing wires, reducing the output loading of each cell. However, increasing the number of stages may require more buffers for aligning signals with the correct phase, resulting in increased power consumption and area overhead.
For place-and-route, a tool that balances the capacitive load for a dual-rail output
pair would be highly desirable. Speciﬁcally, such a place-and-route tool should be
capable of adjusting the wire length with the corresponding fan-out capacitance,
achieving near-constant loading during switching and reducing clock jitter in the
system.
A large inductor database with more configurable options would also be desirable to provide the target inductance and quality factor, achieving high energy efficiency and reducing design effort.
AC-Powered Circuitry
Another promising area for research in charge recovery is the design of AC-powered circuitry. AC-powered circuitry is a circuit family that relies solely on an AC source for its operation; no DC supply voltage or ground is available. Since charge-recovery logic makes good use of the potential difference in a power-clock, it is potentially suitable for implementing AC-powered circuitry. AC-powered circuitry can improve energy efficiency in digital integrated systems, especially in wirelessly powered designs. In addition, the inherent use of inductors in wireless powering systems reduces the area overhead of charge-recovery designs over conventional ones.
Wireless powering techniques based on RF electromagnetic wave propagation have captured the interest of many researchers in applications such as RFID and medical electronics. One of the major goals of these harvesting systems is to convert RF energy into a usable DC source. AC-DC/DC-DC converters are usually used for this conversion, but their energy efficiency is not high. Many methods have been proposed to improve the energy efficiency of embedded AC-DC/DC-DC converters and thus the harvesting efficiency. By designing charge-recovery AC-powered circuitry, these converters can be omitted, and the circuits can be directly powered by the received RF signal. As a result, overall design area decreases and energy efficiency improves, compared to conventional designs.
Time-Interleaved ADC with Resonant-Clocking Techniques
A time-interleaved ADC (TIADC) utilizes multiple ADCs in parallel to increase the system sampling rate in proportion to the number of ADCs used. One of the primary issues with time-interleaved structures is timing skew in the clock signals, which results in non-uniform sampling of the input. Clock buffers are one of the main sources of clock skew. Resonant clocking, however, differs from conventional clocking methods in that it requires no buffers for clock distribution. With resonant clocking, clock signals are distributed with metal wires, leading to minimal clock skew.
Precise timing control in resonant clock systems is signiﬁcantly less challenging
than in conventional clock systems. In resonant clock systems, pulses with high
timing accuracy are needed to drive a resonant clock generator, and the area of a
resonant clock generator is substantially smaller than that of an ADC core. Therefore,
ensuring precise timing for clock generator pulses is a much more manageable task
than distributing a high-accuracy/low-skew clock signal across the entire core with a
conventional buﬀered clock network.
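To indicate why timing accuracy matters at these input frequencies, the following sketch evaluates the generic sampling-time-error approximation SNR ≈ -20 log10(2π fin σt), where σt is the rms sampling-time error. This is a textbook approximation rather than a result of this dissertation; it treats skew as a random timing error and ignores any dependence on the number of interleaved channels. The function names and the 1ps example are hypothetical.

```python
import math


def timing_limited_snr_db(f_in_hz, sigma_t_s):
    """Upper bound on SNR in dB set by an rms sampling-time error sigma_t_s."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * sigma_t_s)


def timing_limited_enob(f_in_hz, sigma_t_s):
    """Corresponding effective number of bits (ideal-quantizer relation)."""
    return (timing_limited_snr_db(f_in_hz, sigma_t_s) - 1.76) / 6.02


# Hypothetical example: 1ps rms timing error with a 2.04GHz input.
snr_1ps = timing_limited_snr_db(2.04e9, 1e-12)
snr_2ps = timing_limited_snr_db(2.04e9, 2e-12)
```

Under this approximation, a 1ps rms timing error with a 2.04GHz input limits SNR to roughly 38dB, and every doubling of the timing error costs about 6dB, which suggests how tightly skew would need to be controlled in a multi-GHz interleaved design.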
APPENDICES
APPENDIX A
Testing Setup for SBL FIR Test-Chip
This section presents the testing setup for the SBL FIR test-chip. Figure A.1 shows the bonding diagram of the SBL FIR test-chip for an LCC84 package. The SBL FIR filter shares its die with other cores. However, independent supplies are used for each core, and only the ground nets are connected together.
Table A.1 describes the function of each pad. As shown in the table, two pads
are used in parallel to connect each power-clock phase and thus reduce the parasitic
resistance when SBL FIR is in resonance.
Figure A.2 shows the schematic of the printed circuit board (PCB) used for testing the SBL FIR chip, and Table A.2 lists the parts on this PCB. 0.1µF ceramic capacitors are used for near-package decoupling, and 10µF ceramic capacitors are used for decoupling near the connections between the PCB and the supplies. A 48-channel DIO PCI card provides control and configuration signals to the test-chip. A 915 Ohm ferrite bead connects the DIO ground to the supply ground and filters out high-frequency noise generated by the DIO card. A Schmitt trigger is used as a level converter for the scan signals from the DIO card and sharpens their edges before they enter the chip.
Figure A.1: Bonding diagram for SBL FIR test-chip.
Figure A.3 shows the layout and photographs of the front and back sides of the PCB. A four-layer PCB is used, in which the top and bottom layers are signal layers. The second layer is used as a ground plane, and the third layer is used as a power plane with multiple voltage domains. The power/ground planes reduce coupling effects between signals on the top and bottom layers. Figure A.4 shows the setup used to test the SBL FIR chip. Three voltage outputs are used: Dirty VDD (left output of the top supply) supplies the I/O pads, static 1.2V (right output of the top supply) supplies the BIST circuit, and the subthreshold supply (left output of the second supply) is shared by VCC and VDC for the SBL FIR filter.
Table A.1: I/O information for SBL FIR test-chip.
Figure A.2: Schematic of the printed circuit board for SBL FIR test-chip.
Table A.2: Parts list for SBL FIR test-chip.
Figure A.3: Printed circuit board for SBL FIR test-chip.
Figure A.4: Test setup for SBL FIR test-chip.
APPENDIX B
Testing Setup for Resonant-Clock Flash ADC
Test-Chip
This section shows the test setup for the resonant-clock flash ADC test-chip. Figure B.1 shows the bonding diagram of the resonant-clock ADC chip. A QFN32 package with dimensions 5mm × 5mm is used to reduce the parasitic capacitance and inductance of the package, and thus reduce signal distortion, especially for analog signals. Table B.1 describes the function of each pad. As shown in the table, analog power and ground are isolated from the other supplies to reduce coupling noise from other power domains.
Figure B.2 shows the schematic of the PCB for the resonant-clock ADC test-chip, and Table B.2 lists the parts on this PCB. The QFN package is soldered directly onto the PCB to avoid the parasitic resistance and inductance of a socket. A wide-band balun transformer converts the single-ended input signal to a differential signal, and two 50 Ohm resistors provide the resistance needed for impedance matching.
Figure B.3 shows the layout and photographs of the front and back sides of the two-layer PCB used in this testing. The top layer is used for signals and the different power domains, and the bottom layer is shared by multiple ground domains. These ground domains are connected with 915 Ohm ferrite beads to reduce coupling noise. The transformer and the matching resistors are placed close to the QFN package to reduce PCB trace length and other parasitic effects.
Figure B.4 shows the setup used to test the resonant-clock ADC chip. Two signal generators (top right) generate a sampling clock and an analog input signal. A logic analyzer (middle) captures the decimated outputs from the resonant-clock ADC test-chip. The testing program is written in LabVIEW, and parameter sweeps are configurable through the program's GUI. Frequency and voltage scaling are performed automatically through LabVIEW.
Figure B.1: Bonding diagram for resonant-clock ADC test-chip.
Table B.1: I/O information for resonant-clock ADC test-chip.
Figure B.2: Schematic of the printed circuit board for resonant-clock ADC test-chip.
Table B.2: Parts list for resonant-clock ADC test-chip.
Figure B.3: Printed circuit board for resonant-clock ADC test-chip.
Figure B.4: Test setup for resonant-clock ADC test-chip.
BIBLIOGRAPHY
[1] A. Ishii, J. Kao, V. Sathe, and M. Papaefthymiou, "A Resonant-Clock 200MHz ARM926EJ-S™ Microcontroller," in European Solid-State Circuits Conference, ESSCIRC, pp. 356-359, September 2009.
[2] http://www.stanford.edu/~murmann/adcsurvey.html.
[3] G. Moore, "Progress in Digital Integrated Electronics," in International Electron Devices Meeting, vol. 21, pp. 11-13, 1975.
[4] B. Zhai, S. Pant, L. Nazhandali, S. Hanson, J. Olson, A. Reeves, M. Minuth, R. Helfand, T. Austin, D. Sylvester, and D. Blaauw, "Energy-Efficient Subthreshold Processor Design," IEEE Transactions on Very Large Scale Integration Systems, TVLSI, vol. 17, pp. 1127-1137, August 2009.
[5] S. Hanson, M. Seok, Y.-S. Lin, Z. Y. Foo, D. Kim, Y. Lee, N. Liu, D. Sylvester, and D. Blaauw, "A Low-Voltage Processor for Sensing Applications with Picowatt Standby Mode," Journal of Solid-State Circuits, JSSC, vol. 44, pp. 1145-1155, April 2009.
[6] S. Hanson, Z. Y. Foo, D. Blaauw, and D. Sylvester, "A 0.5V Sub-Microwatt CMOS Image Sensor with Pulse-Width Modulation Read-Out," Journal of Solid-State Circuits, JSSC, vol. 45, pp. 759-767, April 2010.
[7] S. Kim and M. Papaefthymiou, "Single-Phase Source-Coupled Adiabatic Logic," in International Symposium on Low Power Electronics and Design, ISLPED, pp. 97-99, August 1999.
[8] S. Kim and M. Papaefthymiou, "True Single-Phase Adiabatic Circuitry," IEEE Transactions on Very Large Scale Integration Systems, TVLSI, vol. 9, pp. 52-63, February 2001.
[9] C. Ziesler, J. Kim, V. Sathe, and M. Papaefthymiou, "A 225 MHz Resonant Clocked ASIC Chip," in International Symposium on Low Power Electronics and Design, ISLPED, pp. 48-53, August 2003.
[10] C. Ziesler, J. Kim, and M. Papaefthymiou, "Energy Recovering ASIC Design," in IEEE Computer Society Annual Symposium on VLSI, ISVLSI, pp. 133-138, February 2003.
[11] A. Bargagli-Stoffi, G. Iannaccone, S. Di Pascoli, E. Amirante, and D. Schmitt-Landsiedel, "Four-Phase Power Clock Generator for Adiabatic Logic Circuits," Electronics Letters, vol. 38, pp. 689-690, July 2002.
[12] A. Kramer, J. S. Denker, B. Flower, and J. Moroney, "2nd Order Adiabatic Computation with 2N-2P and 2N-2N2P Logic Circuits," in International Symposium on Low Power Electronics and Design, ISLPED, pp. 191-196, August 1995.
[13] R. Staszewski, K. Muhammad, and P. Balsara, "A 550-MSample/s 8-Tap FIR Digital Filter for Magnetic Recording Read Channels," Journal of Solid-State Circuits, JSSC, vol. 35, pp. 1205-1210, August 2000.
[14] V. Sathe, J. Kao, and M. Papaefthymiou, "Resonant-Clock Latch-Based Design," Journal of Solid-State Circuits, JSSC, vol. 43, pp. 864-873, April 2008.
[15] S. Chan, P. Restle, T. Bucelot, J. Liberty, S. Weitzel, J. Keaty, B. Flachs, R. Volant, P. Kapusta, and J. Zimmerman, "A Resonant Global Clock Distribution for the Cell Broadband Engine Processor," Journal of Solid-State Circuits, JSSC, vol. 44, pp. 64-72, January 2009.
[16] J. Kao, W.-H. Ma, S. Kim, and M. Papaefthymiou, "2.07 GHz Floating-Point Unit with Resonant-Clock Precharge Logic," in Asian Solid State Circuits Conference, A-SSCC, pp. 213-216, November 2010.
[17] W.-H. Ma, J. C. Kao, V. S. Sathe, and M. Papaefthymiou, "A 187MHz Subthreshold-Supply Robust FIR Filter with Charge-Recovery Logic," in Symposium on VLSI Circuits, pp. 202-203, June 2009.
[18] W.-H. Ma, J. Kao, V. Sathe, and M. Papaefthymiou, "187 MHz Subthreshold-Supply Charge-Recovery FIR," Journal of Solid-State Circuits, JSSC, vol. 45, pp. 793-803, April 2010.
[19] W.-H. Ma, J. Kao, and M. Papaefthymiou, "A 5.5GS/s 28mW 5-bit Flash ADC with Resonant Clock Distribution," in European Solid-State Circuits Conference, ESSCIRC, September 2011.
[20] R. P. Feynman, Feynman Lectures on Computation. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1998.
[21] R. Landauer, "Irreversibility and Heat Generation in the Computing Process," IBM Journal of Research and Development, vol. 5, pp. 183-191, July 1961.
[22] C. H. Bennett, "Logical Reversibility of Computation," IBM Journal of Research and Development, vol. 17, pp. 525-532, November 1973.
[23] W. Pan and M. Nalasani, "Reversible Logic," IEEE Potentials, vol. 24, pp. 38-41, February 2005.
[24] R. P. Feynman, "Quantum Mechanical Computers," in Foundations of Physics, vol. 16, pp. 507-531, June 1986.
[25] W. Athas and L. Svensson, "Reversible Logic Issues in Adiabatic CMOS," in Workshop on Physics and Computation, pp. 111-118, November 1994.
[26] S. G. Younis, "Asymptotically Zero Energy Computing using Split-Level Charge Recovery Logic," Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1994.
[27] J. Lim, D.-G. Kim, and S.-I. Chae, "A 16-bit Carry-Lookahead Adder using Reversible Energy Recovery Logic for Ultra-Low-Energy Systems," Journal of Solid-State Circuits, JSSC, vol. 34, pp. 898-903, June 1999.
[28] A. Dickinson and J. Denker, "Adiabatic Dynamic Logic," in IEEE Custom Integrated Circuits Conference, CICC, pp. 282-285, May 1994.
[29] A. Dickinson and J. Denker, "Adiabatic Dynamic Logic," Journal of Solid-State Circuits, JSSC, vol. 30, pp. 311-315, March 1995.
[30] Y. Ye and K. Roy, "QSERL: Quasi-Static Energy Recovery Logic," Journal of Solid-State Circuits, JSSC, vol. 36, pp. 239-248, February 2001.
[31] V. Oklobdzija, D. Maksimovic, and F. Lin, "Pass-Transistor Adiabatic Logic using Single Power-Clock Supply," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 44, pp. 842-846, October 1997.
[32] A. Vetuli, S. Pascoli, and L. Reyneri, "Positive Feedback in Adiabatic Logic," Electronics Letters, vol. 32, pp. 1867-1869, September 1996.
[33] S. Kim and M. Papaefthymiou, "True Single-Phase Energy-Recovering Logic for Low-Power, High-Speed VLSI," in International Symposium on Low Power Electronics and Design, ISLPED, pp. 167-172, August 1998.
[34] V. S. Sathe, J.-Y. Chueh, and M. C. Papaefthymiou, "Energy-Efficient Ghz-Class Charge-Recovery Logic," Journal of Solid-State Circuits, JSSC, vol. 42, pp. 38-47, January 2007.
[35] W. Athas, N. Tzartzanis, L. Svensson, and L. Peterson, "A Low-Power Microprocessor Based on Resonant Energy," Journal of Solid-State Circuits, JSSC, vol. 32, pp. 1693-1701, November 1997.
[36] M. Hansson, B. Mesgarzadeh, and A. Alvandpour, "1.56 GHz On-Chip Resonant Clocking in 130nm CMOS," in IEEE Custom Integrated Circuits Conference, CICC, pp. 241-244, September 2006.
[37] J.-Y. Chueh, V. Sathe, and M. Papaefthymiou, "900MHz to 1.2GHz Two-Phase Resonant Clock Network with Programmable Driver and Loading," in IEEE Custom Integrated Circuits Conference, CICC, pp. 777-780, September 2006.
[38] V. Sathe, J. Kao, and M. Papaefthymiou, "RF2: A 1GHz FIR Filter with Distributed Resonant Clock Generator," in Symposium on VLSI Circuits, pp. 44-45, June 2007.
[39] V. Sathe, J. Kao, and M. Papaefthymiou, "A 0.8-1.2GHz Single-Phase Resonant-Clocked FIR Filter with Level-Sensitive Latches," in IEEE Custom Integrated Circuits Conference, CICC, pp. 583-586, September 2007.
[40] A. Wang and A. Chandrakasan, "A 180mV Subthreshold FFT Processor using a Minimum Energy Design Methodology," Journal of Solid-State Circuits, JSSC, vol. 40, pp. 310-319, January 2005.
[41] B. Zhai, L. Nazhandali, J. Olson, A. Reeves, M. Minuth, R. Helfand, S. Pant, D. Blaauw, and T. Austin, "A 2.60pJ/Inst Subthreshold Sensor Processor for Optimal Energy Efficiency," in Symposium on VLSI Circuits, pp. 154-155, June 2006.
[42] M. Seok, S. Hanson, Y.-S. Lin, Z. Foo, D. Kim, Y. Lee, N. Liu, D. Sylvester, and D. Blaauw, "The Phoenix Processor: A 30pW Platform for Sensor Applications," in Symposium on VLSI Circuits, pp. 188-189, June 2008.
[43] J.-S. Wang, J.-S. Cehn, Y.-M. Wang, and C. Yeh, "A 230mV-to-500mV 375KHz-to-16MHz 32b RISC Core in 0.18µm CMOS," in International Solid-State Circuits Conference, ISSCC, pp. 294-604, February 2007.
[44] M.-E. Hwang, A. Raychowdhury, K. Kim, and K. Roy, "A 85mV 40nW Process-Tolerant Subthreshold 8×8 FIR Filter in 130nm Technology," in Symposium on VLSI Circuits, pp. 154-155, June 2007.
[45] J. Kil, J. Gu, and C. Kim, "A High-speed Variation-Tolerant Interconnect Technique for Sub-Threshold Circuits Using Capacitive Boosting," in International Symposium on Low Power Electronics and Design, ISLPED, pp. 67-72, August 2006.
[46] M. Seok, D. Jeon, C. Chakrabarti, D. Blaauw, and D. Sylvester, "A 0.27V 30MHz 17.7nJ/Transform 1024-pt Complex FFT Core with Super-Pipelining," in International Solid-State Circuits Conference, ISSCC, pp. 342-344, February 2011.
[47] W. Athas, L. Svensson, and N. Tzartzanis, "A Resonant Signal Driver for Two-Phase, Almost-non-Overlapping Clocks," in International Symposium on Circuits and Systems, ISCAS, vol. 4, pp. 129-132, May 1996.
[48] B. Verbruggen, J. Craninckx, M. Kuijk, P. Wambacq, and G. Van der Plas, "A 2.2 mW 1.75 GS/s 5 Bit Folding Flash ADC in 90 nm Digital CMOS," Journal of Solid-State Circuits, JSSC, vol. 44, pp. 874-882, March 2009.
[49] G. Geelen and E. Paulus, "An 8b 600MS/s 200mW CMOS Folding A/D Converter using an Amplifier Preset Technique," in International Solid-State Circuits Conference, ISSCC, pp. 254-526, February 2004.
[50] R. Taft, C. Menkus, M. Tursi, O. Hidri, and V. Pons, "A 1.8-V 1.6-GSample/s 8-b Self-Calibrating Folding ADC with 7.26 ENOB at Nyquist Frequency," Journal of Solid-State Circuits, JSSC, vol. 39, pp. 2107-2115, December 2004.
[51] M. El-Chammas and B. Murmann, "A 12-GS/s 81-mW 5-bit Time-Interleaved Flash ADC with Background Timing Skew Calibration," Journal of Solid-State Circuits, JSSC, vol. 46, pp. 838-847, April 2011.
[52] S. Park, Y. Palaskas, and M. Flynn, "A 4GS/s 4-bit Flash ADC in 0.18µm CMOS," Journal of Solid-State Circuits, JSSC, vol. 42, pp. 1865-1872, September 2007.
[53] M. Choi, J. Lee, J. Lee, and H. Son, "A 6-bit 5-GSample/s Nyquist A/D Converter in 65nm CMOS," in Symposium on VLSI Circuits, pp. 16-17, June 2008.
[54] C. Paulus, H.-M. Bluthgen, M. Low, E. Sicheneder, N. Bruls, A. Courtois, M. Tiebout, and R. Thewes, "A 4GS/s 6b Flash ADC in 0.13µm CMOS," in Symposium on VLSI Circuits, pp. 420-423, June 2004.
[55] X. Jiang, Z. Wang, and M. Chang, "A 2 GS/s 6 b ADC in 0.18µm CMOS," in International Solid-State Circuits Conference, ISSCC, pp. 322-497, February 2003.
[56] D. Wei, D. Sun, and A. Abidi, "A 300MHz Mixed-Signal FDTS/DFE Disk Read Channel in 0.6µm CMOS," in International Solid-State Circuits Conference, ISSCC, pp. 186-187, February 2001.
[57] M. Pelgrom, A. Duinmaijer, and A. Welbers, "Matching Properties of MOS Transistors," Journal of Solid-State Circuits, JSSC, vol. 24, pp. 1433-1439, October 1989.
[58] K. Uyttenhove and M. Steyaert, "A 1.8-V 6-bit 1.3-GHz Flash ADC in 0.25-µm CMOS," Journal of Solid-State Circuits, JSSC, vol. 38, pp. 1115-1122, July 2003.
[59] S.-M. Yim, T. Chen, and K. O, "The Effects of a Ground Shield on the Characteristics and Performance of Spiral Inductors," Journal of Solid-State Circuits, JSSC, vol. 37, pp. 237-244, February 2002.
[60] K. Deguchi, N. Suwa, M. Ito, T. Kumamoto, and T. Miki, "A 6-bit 3.5-GS/s 0.9-V 98-mW Flash ADC in 90-nm CMOS," Journal of Solid-State Circuits, JSSC, vol. 43, pp. 2303-2310, October 2008.
[61] H. Chung, A. Rylyakov, Z. T. Deniz, J. Bulzacchelli, G.-Y. Wei, and D. Friedman, "A 7.5-GS/s 3.8-ENOB 52-mW Flash ADC with Clock Duty Cycle Control in 65nm CMOS," in Symposium on VLSI Circuits, pp. 268-269, June 2009.
