Ultra-Low Leakage, Energy-Efficient Digital Integrated Circuit and System Design by da Silva Cerqueira, Joao Pedro
Ultra-Low Leakage, Energy-Efficient
Digital Integrated Circuit and System
Design
João Pedro da Silva Cerqueira
Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy








The advances of the complementary metal-oxide-semiconductor (CMOS) technology
manufacturing and design over the years have enabled a diverse range of applications
across the power consumption, performance, and area (PPA) spectra. Many of the
recent and prospective applications rely on the availability of energy-autonomous,
miniaturized systems, i.e., ultra-low power (ULP) VLSI systems, which are gener-
ally characterized by extreme resource limitations. Some examples of applications
are wireless sensing platforms, body-area sensor networks (BASN), biomedical and
implantable devices, wearables, hearables, and monitors. Within the context of
such applications, the key requirements are long lifetime and miniaturized size
(sub- and millimeter-scale). In order to meet both requirements, energy-efficiency is the key
metric. It allows for extended battery lifetime and operation with the energy that
can be harvested from the environment, and it limits the size (volume) of the energy
sources utilized to power these systems.
Ultra-low voltage (ULV) operation is a key technique in which the VDD of circuits
is reduced from nominal to near or below the threshold voltage of the transistor. It is a
powerful knob that has been widely exploited by designers in order to achieve
ultra-low power consumption and high energy-efficiency in CMOS. Existing ULP VLSI
systems typically operate at a reduced supply voltage, thereby lowering their energy
consumption by one to two orders of magnitude in order to enable the aforementioned
applications.
While supply voltage scaling is a promising measure for achieving low power and
reducing energy consumption, it brings up several challenges. One critical issue is
the leakage energy dissipated by the devices, whose share of the total energy
consumption is magnified at ULV. The reason is that, as VDD scales from nominal
to near-threshold and sub-threshold, transistors become increasingly slower and
dissipate leakage (i.e., static) power over longer cycle times, accumulating more
leakage energy per cycle. This energy waste
accounts for a significant portion of the system’s total energy consumption, offsets
the gains provided by voltage scaling, defines the minimum energy per operation, and
poses a practical limit for the system’s energy-efficiency.
This thesis presents selected research works on ultra-low leakage, energy-efficient
digital integrated circuit design. More specifically, it describes novel and key tech-
niques for minimizing the energy waste of idle/underutilized and always-on hard-
ware. The main goal of such techniques is to push the envelope of energy-efficiency
in energy-autonomous, miniaturized VLSI systems. Such techniques are applied to
key building blocks of emerging mobile and embedded computing devices resulting in
state-of-the-art energy-efficiencies.
Table of Contents
List of Figures
Acknowledgments
Chapter 1: Introduction
1.1 Overview
1.2 Energy-Autonomous, Miniaturized VLSI Systems for the ULP Embedded and Mobile Domains
1.2.1 The Key Enabler: Energy-Efficiency
1.3 Ultra-Low Voltage, Energy-Efficient Operation
1.4 Leakage Mechanisms in CMOS Circuits
1.4.1 Sub-Threshold Leakage Current
1.5 The Impact of Leakage in ULP VLSI Systems
1.5.1 The Ineffectiveness of Traditional Leakage Reduction Techniques at ULV
1.6 Challenges and Contributions of this Thesis
1.6.1 Spatiotemporally Fine-Grained Power-Gating
1.6.2 An Ultra-Low Leakage Logic Family for Always-On Circuits
1.6.3 A Near-Vt Programmable Spatial Array Accelerator for the Ultra-Low Power Embedded and Mobile Domains
1.7 Organization of this Thesis
Chapter 2: Temporally Fine-Grained Sleep Technique for Near- and Sub-Threshold Parallel Architectures
2.1 Motivation
2.2 Parallelism and Pipelining in Near- and Sub-Threshold Circuits
2.2.1 The Effectiveness of the 2-Way Parallel Architecture
2.2.2 The Impact of Circuit Utilization
2.2.3 Scalability of Parallel Architectures
2.3 Power-Gating Overhead Analysis
2.3.1 PGS mode-transition Energy and Delay Overheads
2.3.2 The Impact of Voltage Scaling on PGS Overheads
2.4 Temporally Fine-Grained Sleep Technique
2.5 Parallel Architecture with the temporally Fine-Grain Sleep Technique
2.6 Conclusions
Chapter 3: A 0.17 mm² 3.19 nJ/Transform 256-Point Fast Fourier Transform Core based on Spatiotemporally Fine-Grained Active Leakage Suppression
3.1 Motivation
3.2 Memory-Based, Compact FFT Architecture
3.3 Active Leakage Suppression Techniques for Near- and Sub-Vt Computing
3.3.1 Ultra-Low Leakage SRAM Design
3.3.2 Ultra-Low Leakage Combinational Logic Design
3.4 Chip Prototype and Measured Results
3.5 Conclusions
Chapter 4: A fW- and kHz-Class Feedforward Leakage Self-Suppression Logic Requiring No External Sleep Signal to Enter the Leakage Suppression Mode
4.1 Motivation
4.2 Feedforward Leakage Self-suppression Logic
4.2.1 The FLSL Inverter
4.2.2 The FLSL Performance
4.2.3 No Sleep/Mode Signal Requirement
4.2.4 The Impact of Technology Scaling on FLSL
4.3 Chip Prototype
4.4 Conclusion
Chapter 5: Catena: A 0.5 V Sub-0.4 mW Programmable Spatial Array Accelerator for Mobile and Embedded Computing
5.1 Motivation
5.2 Catena’s Circuit and Architecture Design
5.2.1 Spatiotemporally Fine-Grained Clock-Gating
5.2.2 Spatiotemporally Fine-Grained Power-Gating
5.2.3 Ultra-Low Power L1 Cache and Scratchpad Design
5.2.4 Architecture-Level Techniques for Efficient Computation and Communication
5.3 Chip Prototype and Testing Setup
5.4 Measurements and Comparison with Prior Art




Figure 1.1: A commercial long-term continuous blood glucose monitoring system.
Figure 1.2: Energy/cycle vs. supply voltage of a 50-stage FO4-inverter chain
benchmark circuit. The total energy/cycle is split between two major components:
the switching and leakage energies. At ULV, the leakage energy sets the minimum
energy point, the practical limit for energy-efficiency.
Figure 1.3: FO4-inverter delay vs. supply voltage of a 50-stage FO4-inverter
chain benchmark circuit. At ULV, delay increases almost exponentially as
predicted by Equation 1.3.
Figure 1.4: Sub-threshold curve of an NMOS device in a 65 nm general purpose
CMOS process technology. The sub-threshold region is within the linear part of
the curve in this semilog-Y plot.
Figure 1.5: Energy/cycle vs. supply voltage of a 50-stage FO4-inverter chain
benchmark circuit after applying leakage reduction techniques. The leakage
energy curve (1) moves down to a lower leakage point (2). Consequently, the
total energy is also reduced, enabling extended energy-efficiency.
Figure 2.1: Three test architectures based on a 16-bit multiplier. (a) Baseline,
(b) 2-stage pipelined, and (c) 2-way parallel designs. The dashed lines represent
boundaries of the equivalent sequencing stage across the three designs.
Figure 2.2: Comparisons of three architectures (baseline, 2-stage pipelined, and
2-way parallel) in energy consumption per cycle and computational throughput.
VDD is swept from 0.8 V to 0.2 V, with a step of 50 mV. (a) The comparisons over
the full range of the VDD sweep; (b) a zoomed-in view in the super-threshold
regime; (c) a zoomed-in view in the near-threshold regime; (d) a zoomed-in view
in the sub-threshold regime.
Figure 2.3: Performance and energy-efficiency gains of parallel and pipeline
architectures at super-, near-, and sub-threshold regimes.
Figure 2.4: Impact of hardware utilization. (a) The comparisons of the three
architectures at TCYCLE = 20 ns. (b) The crossover utilization point between the
baseline and the 2-way parallel architectures over TCYCLEs.
Figure 2.5: Scalability of parallel architectures in near-threshold regime.
(a) Computation throughput improvement at the same energy per operation.
(b) Energy per operation improvement at the same throughput.
Figure 2.6: Scalability of parallel architectures in sub-threshold regime.
(a) Computation throughput improvement at the same energy per operation.
(b) Energy per operation improvement at the same throughput.
Figure 2.7: (a) Main circuits (two inverters) with an NMOS PGS showing the
critical discharging path during the wake-up process. (b) Timing and energy
overheads during the mode-transitions.
Figure 2.8: (a) VVG and VINT waveforms during a wake-up process. T2WKU is
divided into two phases: (b) the effective circuits during T2WKU1, and (c) the
effective circuits during T2WKU2.
Figure 2.9: Relative energy and timing overheads over VDDs.
Figure 2.10: (a) 50-stage inverter chain circuits with the ZSCCMOS scheme.
(b) The breakdown of leakage power dissipation during sleep modes across VDDs.
Figure 2.11: Four PGS designs: (a) a footer PGS, (b) a footer PGS with gate
overdrive voltage, (c) a ZSCCMOS, and (d) a ZSCCMOS with gate overdrive voltage.
Figure 2.12: Comparisons among the four different PGS schemes.
Figure 2.13: (a) Proposed design: A 16-bit 2-way parallel multiplier with our
temporally fine-grained PGS technique. (b) Functional waveforms. (c) Comparison
of the energy dissipation per cycle of the four architectures for a target cycle
time of 20 ns. (d) Comparison of the energy dissipation per cycle of the four
architectures for a target cycle time of 3.3 ns.
Figure 3.1: FFT processor architecture. Fixed-point, memory-based, one 16-bit
input lane, one 16-bit output lane, radix-2-based butterfly, 256-point resolution.
Figure 3.2: Energy breakdown of the baseline and proposed FFT processor.
Figure 3.3: Energy breakdown (butterfly and memory). (a) The energy breakdown
across different numbers of points for a radix-2, MB FFT. (b) The energy
breakdown across temperatures.
Figure 3.4: (a) Schematic and (b) layout of the proposed 10T ultra-low leakage
bitcell.
Figure 3.5: The effectiveness of the circuit-level techniques employed in the
SRAM.
Figure 3.6: The memory bank architecture (only 2 out of 4 blocks are depicted)
and a column of 16 bitcells featuring the spatiotemporal voltage-boosting
circuitry.
Figure 3.7: The ZSCCMOS drivers are implemented in combinational logic circuits
such as the memory peripherals to minimize the active leakage energy waste.
(a) The CTRL shuts down a WWL driver, the output of which becomes LOW. (b) The
CTRL wakes up the driver on the fly to access a particular word.
Figure 3.8: Power consumption of the FFT processor when the PE is always-on or
shut down for 192 and 255 cycles. A 13.4 % power reduction is observed when the
PE is shut down for 255 cycles. In this case, the very last cycle is reserved
for waking the circuits up from sleep mode.
Figure 3.9: Detailed power domain and Vt assignment of the FFT processor. The PE
includes a PGS that is sized at 3 % of the total NMOS width of this block. The
SLPB signal that drives the power gate is overdriven to VDDH by a level
converter.
Figure 3.10: The FFT processor chip microphotograph fabricated in a 65 nm
general-purpose CMOS.
Figure 3.11: The minimum energy operating point of the baseline design is scaled
by 50 to 75 mV as compared to the proposed FFT processor. The scaling of the MEP
due to the circuit-level active leakage suppression techniques employed in
memory and PE results in greater energy-efficiency.
Figure 3.12: The measured energy dissipation per transform and clock frequency
of the FFT core.
Figure 3.13: The energy and clock frequency as a function of ΔVDD.
Figure 3.14: The energy/FFT and clock frequency measured across temperatures.
Figure 3.15: The energy and performance distributions of the FFT across 18 dies.
Figure 3.16: (a) The comparison table with prior art. (b) Normalized area vs.
normalized energy per transform as a figure-of-merit (FoM).
Figure 4.1: (a) Stacked inverter schematics. (b) Simulation results of leakage
power, (c) delay and area overheads across the number of stacks implemented.
Figure 4.2: (a) Stacked inverter schematics. (b) Simulation results of leakage
power, (c) delay and area overheads across the number of stacks implemented.
Figure 4.3: Parasitic-annotated simulations of inverters designed in FLSL, Lim
et al., and static CMOS. (a) Leakage power vs. FO4-inverter delay.
(b) FO4-inverter delay vs. supply voltage. (c) Leakage-Delay Product vs. supply
voltage.
Figure 4.4: Monte Carlo results of the (a) FLSL and (b) static CMOS inverters
with mismatch across process corners.
Figure 4.5: Inverter layout in three different logic families. (a) Static CMOS.
(b) Lim et al. (c) FLSL. The overhead of the FLSL inverter as compared to the
static CMOS counterpart is relative to a minimum-size inverter of a commercial
standard-cell library.
Figure 4.6: FO4-inverter delay vs. per-inverter leakage power at different
supply voltages and across technology nodes.
Figure 4.7: (a) LDP comparison of an FLSL, a static CMOS, and the prior art
low-leakage NAND-2 (Lim et al.). (b) FLSL NAND-2. (c) FLSL transparent-high
latch.
Figure 4.8: (a) Chip microphotograph showing an active area of 1.32 mm² in a
0.18 µm process. (b) The FIR architecture and the steady-state detector circuits.
Figure 4.9: (a) The measured leakage power vs. supply voltage of the core
circuits. (b) The clock frequency vs. supply voltage measurement for three
different VBIAS voltages, showing the filter operates at 1.03 kHz for
VDD = VLEAK = 0.85 V and VBIAS = 0.5 · VLEAK. (c) Measurement results of the
total power of the filter running at the leakage-optimal operating point (power
is proportional to input activity). (d) The temperature effect on the clock and
the simulated effect on leakage power.
Figure 4.10: Table of comparison of the prototyped filter with the prior art.
Figure 5.1: Catena’s high-level architecture featuring the programmable spatial
array of 16 cores, an 8 kB ULP L1 cache, and three on-chip networks utilized for
communication between the cores, between the cores and the cache, and for
configuration and debug.
Figure 5.2: The core microarchitecture including the processing element and the
router.
Figure 5.3: Proposed statically-configured and dynamically-configured
clock-gating scheme implemented in Catena.
Figure 5.4: Proposed spatiotemporally fine-grained in-cell zigzag PG. Sleep mode
can be exercised for short periods of idleness so as to maximize the time
circuits remain in the low leakage mode to reduce the leakage energy dissipation.
Figure 5.5: Ultra-low leakage SRAM macro design. Fine-grained VB is proposed to
speed up read/write accesses and reduce energy waste.
Figure 5.6: Energy savings of our proposed cell-embedded zigzag PG applied to
the SRAM macro over the number of cycles in the sleep mode.
Figure 5.7: Normalized runtime (cycle/cycle) improvement due to
architecture-level techniques deployed in Catena across multiple workloads.
Figure 5.8: Microphotograph of the silicon prototype in a 65 nm LP CMOS.
Figure 5.9: Energy/cycle vs. clock frequency when performing GEMM. Two different
conditions are shown – (a) 16 cores running at 25 °C and 50 °C; and (b) 8 cores
running at 25 °C and 50 °C.
Figure 5.10: Energy/cycle vs. utilization rate. The hardware usage is modulated
by mapping multiple workloads onto Catena’s spatial array. The circuit-level
techniques reduce the energy waste of underutilized hardware, which improves
energy-efficiency by 2.42X at 20 %. At higher utilization rates of 70 % to 90 %
the proposed circuit techniques still enable significant savings of 35 % and
25 %, respectively.
Figure 5.11: Two workloads mapped onto Catena’s spatial array. (a) MNIST and
(b) FFT.
Figure 5.12: Table of comparison of Catena with the prior art.
Acknowledgments
I am very grateful to my academic advisor, Prof. Mingoo Seok, for his assistance,
support, and strong encouragement throughout my Ph.D. career at Columbia Uni-
versity. Mingoo has been a reliable source of technical guidance and a professional
inspiration to me. I would like to thank Prof. Martha Kim, with whom I have collaborated
and who has also mentored me over the years. I am also thankful to Dr. Yu Pu,
who has mentored me and provided me with insights into the semiconductor industry.
Many thanks to the other committee members of my thesis proposal & defense, Prof.
Luca Carloni, Prof. Ioannis (John) Kymissis, Prof. Xiaofan (Fred) Jiang. They have
graciously shared their valuable time with me and I am much obliged.
I have been fortunate to have worked and collaborated with a number of outstand-
ing people throughout my graduate career. Many thanks to Seongjong (Josh) Kim,
Jiangyi Li, Zhewei Jiang, Thomas Repetti, Andrea Lottarini, and Minhao Yang. I
am also grateful to my colleagues in the VLSI research group, from whom I have learned
and whose company I have enjoyed: Teng Yang, Doyun Kim, Wei Jin, Tianchan Guan, Pavan
Chundi, Sung Justin Kim, Dongkwun Kim, Weiwei Shan, Bo Zhang, Ashish Shukla,
and Peiye Liu.
Beyond the VLSI lab, I am grateful for the friendship and support of Ketson
Roberto Maximiano dos Santos, Daniel de Godoy Peixoto, Negar Reiskarimian, and
Vinícius Oliveira. I would also like to thank the administrative staff of the
department of electrical engineering, in particular, Mrs. Elsa Sanchez, who has assisted me
multiple times with different requests. Many thanks to Dr. Mohamed H. Abu-Rahma
and Dr. Mayur Joshi from Apple Incorporated, from whom I have learned so much and
who mentored me during my internship.
I am very grateful to Katyeny Manuela da Silva, love of my life, soulmate, and
best adventure partner for all the love and support.
All of this was only made possible by the encouragement of my parents, Marcelo
Cerqueira and Lene Cerqueira. I am unspeakably thankful for their never-ending
streams of love and support.
Finally, I express my gratitude to God for all the wisdom, perseverance, and bless-








This chapter provides an introduction to energy-autonomous, miniaturized
very-large-scale integration (VLSI) systems¹ for the ultra-low power (ULP) embedded
and mobile domains. It shows examples of recent and prospective applications and
use-cases. It discusses the main requirements for the systems that support such
applications, namely size and lifetime. It also presents energy-efficiency as the key
enabler for achieving small size and long lifetime in such systems.
¹ In this thesis, the term “energy-autonomous, miniaturized VLSI system(s)” is used
interchangeably with the term “ultra-low power VLSI system(s)” for simplicity.
This chapter discusses the power consumption of digital CMOS circuits, ultra-low
voltage, and energy-efficient operations. It also identifies leakage as a major roadblock
for improved energy-efficiency in ULP VLSI systems.
This chapter presents leakage mechanisms in CMOS circuits. Focus is given to
sub-threshold leakage, which is the dominant leakage component in today’s CMOS
process technology. Additionally, this chapter provides a discussion on the impact
of leakage in ULP VLSI systems and the importance of reducing it in order to
extend energy-efficiency.
Lastly, this chapter discusses the main contributions of this thesis, which converge on
the main goal of extending the energy-efficiency of ULP VLSI systems by reducing the
energy waste of idle/underutilized and always-on hardware: novel power-gating switch
techniques that are applied in spatially and temporally fine-grained manners, an ultra-
low leakage logic family for always-on circuits, and a near-threshold programmable
spatial array accelerator for the ULP embedded and mobile domains.
1.2 Energy-Autonomous, Miniaturized VLSI Sys-
tems for the ULP Embedded and Mobile Do-
mains
Several decades of improvements in integrated circuit (IC) manufacturing and de-
sign have enabled a diverse range of applications across the power consumption,
performance, and area spectra. Many of the recent and prospective applications
rely on the availability of emerging VLSI systems that are energy-autonomous and
miniaturized, generally characterized by extreme resource limitations (i.e., ULP VLSI
systems). A few examples of such applications are wireless sensing platforms [25;
18], body-area sensor nodes/networks (BASN) [12; 64], biomedical and implantable
devices [15], wearables [13], hearables [13; 14], and monitors [15; 16; 17]. With the in-
creased attention of academia and industry communities, ever-growing opportunities
emerge for such electronics to impact our everyday life.
Figure 1.1 A commercial long-term continuous blood glucose monitoring system.
In the context of such emerging embedded and mobile technologies, the main
drivers (i.e., requirements, features, or intrinsic characteristics) are defined by the
physical constraints of the system, such as the size (i.e., volume, form factor) and the
power (aspects such as power delivery, budget, and lifetime). The size is
typically set by the energy sources of the system, such as the battery and the energy
harvester/scavenger. Sub- and millimeter-scale is a common target for many of the
example applications [19; 18; 15]. For instance, implantable devices require a small
size for less invasive deployments. In terms of power, as such systems are commonly
untethered, they must also be self-powered, which strongly limits their power and
energy budgets. For instance, a purely energy-harvested system is typically limited
to a few µW for perpetual operation, whereas a battery-powered system is constrained
to 10s of nW to 10s of µW for a lifetime of 10 years to 1 week, respectively [20].
Generally, long lifetime is highly desirable so as to avoid maintenance (e.g., battery
replacement in an implantable device). Figure 1.1 shows a related commercial appli-
cation – an under-the-skin system that provides long-term continuous blood glucose
monitoring [16]. The implantable system requires a small form factor and long life-
time for continuous operation and avoiding medical procedures to replace the energy
source.
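As a rough sanity check on the battery-powered budgets quoted above, the sketch below multiplies average power by lifetime to estimate the stored energy that each operating point implies. The two endpoints (10 nW for 10 years, 10 µW for 1 week) are illustrative picks from the quoted "10s of" ranges, the helper name is hypothetical, and battery self-discharge and conversion losses are ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_energy_joules(avg_power_w: float, lifetime_days: float) -> float:
    """Energy (J) drawn by a load of avg_power_w watts over lifetime_days days."""
    return avg_power_w * lifetime_days * SECONDS_PER_DAY

# Illustrative endpoints (assumed values picked from the "10s of nW/uW" ranges).
ten_year_budget = required_energy_joules(10e-9, 10 * 365)  # 10 nW for 10 years
one_week_budget = required_energy_joules(10e-6, 7)         # 10 uW for 1 week

if __name__ == "__main__":
    print(f"10 nW for 10 years: {ten_year_budget:.2f} J")
    print(f"10 uW for 1 week:   {one_week_budget:.2f} J")
```

Under these assumptions both cases come out to a few joules (about 3.2 J and 6.0 J), so the two ends of the quoted range correspond to a similar amount of stored energy spent at very different rates.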
1.2.1 The Key Enabler: Energy-Efficiency
While the continuous scaling of the CMOS device feature size, dictated by Moore’s
law, has enabled substantially more compact, complex, and energy-efficient hardware,
advances in power source technology (i.e., batteries and energy harvesters/scavengers)
have come at a significantly slower pace. Existing compact batteries [21] allow for
millimeter-scale power sources; however, the amount of energy that can be stored is
limited, roughly proportional to their volume. Other sources such as solar cells can
provide only extremely limited energy from the environment to the system [68].
High energy-efficiency is simultaneously the key challenge and enabler of ULP
VLSI systems for emerging embedded and mobile applications. It addresses both
lifetime and size, the two major physical constraints of such systems. Low energy
dissipation allows for extended battery lifetime and operation with the energy that
can be harvested from the environment. Low energy consumption also allows for
reducing the size of the energy sources (e.g., battery and energy harvester/scavenger)
utilized to power these systems.
In the context of such a power-, energy-, and area-constrained design space,
this thesis emphasizes novel circuit- and architecture-level techniques for improving
the energy-efficiency of integrated circuits and systems by minimizing the energy
waste of idle/underutilized and always-on hardware. Although applicable to most of
the integrated circuit design spectrum, the techniques presented here target
energy-autonomous, miniaturized VLSI systems for the ULP embedded and mobile domains.
1.3 Ultra-Low Voltage, Energy-Efficient Operation
Ultra-low voltage (ULV) operation is a technique in which the supply voltage of
circuits is reduced from nominal to near or below the threshold voltage (Vt) of the
CMOS transistor. This technique is the most powerful knob in today’s integrated
circuit design methodology for achieving lower power consumption and higher energy-
efficiency. For this reason, ULP design typically translates to ULV design [3].
Supply voltage scaling is particularly beneficial for ULP VLSI systems in mobile
and embedded applications such as the examples presented in Section 1.2. It enables
extended battery lifetime and operation with the energy that can be harvested from
the environment; it also allows for limiting the size (volume) of the energy sources
utilized to power such systems.
Figure 1.2 shows the simulated total energy consumption per cycle against the
supply voltage of operation of a 50-stage fanout-of-4 (FO4) inverter chain in a 65 nm
general purpose CMOS. It also shows two main components that contribute to the
total energy consumption: the switching and the leakage energies. Figure 1.3, on the
other hand, shows the FO4-inverter delay vs. supply voltage obtained from the same
benchmark circuit.
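The definitions that follow refer to the switching energy component (Equation 1.1). A form consistent with those definitions and with the quadratic VDD dependence noted later in this section can be sketched as below. It assumes that the effective capacitance is the activity-weighted total capacitance and that no additional prefactor is applied, so it should be read as a reconstruction under those assumptions and not as the original typeset equation.

```latex
\[
E_{SW} = C_{eff} \times V_{DD}^{2} = \alpha_{sw} \times C_{TOT} \times V_{DD}^{2}
\]
```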






where Ceff is the effective capacitance, CTOT is the total physical capacitance, and
αsw is the activity factor. The leakage energy component is expressed as Equation 1.2:
\[
E_{LKG} = P_{LKG} \times T_{CYCLE} = (V_{DD} \times I_{DS,OFF}) \times T_{CYCLE} \tag{1.2}
\]
where IDS,OFF is the sub-threshold leakage current when VGS = 0 and VDS ≥ 3 · VT
for an NMOS device as described by Equation 1.3; and TCYCLE is the cycle time,
typically dominated by the total logic gate delay of the critical path of the target
system.
Figure 1.2 Energy/cycle vs. supply voltage of a 50-stage FO4-inverter chain bench-
mark circuit. The total energy/cycle is split between two major components: the
switching and leakage energies. At ULV, the leakage energy sets the minimum energy
point, the practical limit for energy-efficiency.
Figure 1.3 FO4-inverter delay vs. supply voltage of a 50-stage FO4-inverter chain
benchmark circuit. At ULV, delay increases almost exponentially as predicted by
Equation 1.3.
Equation 1.1 dictates that the supply voltage has a strong impact on the switching
energy – i.e., ESW scales quadratically with VDD. Additionally, Equation 1.2 shows
that supply voltage scaling leads to a linear reduction in PLKG, whereas Equation 1.3
and Figure 1.3 show that it leads to an almost exponential increase in the gate delay.
Therefore, the leakage energy per cycle (Equation 1.2) increases almost exponentially
when reducing VDD – in Figure 1.2, the leakage energy per cycle has the opposite trend
to the switching energy. The leakage energy component overtakes the
switching energy component and defines the minimum energy point (MEP) – the
practical limit for energy-efficiency. The supply voltage at which the
MEP occurs is typically in the ULV region.
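The trade-off described above can be reproduced qualitatively with a small numerical sweep. The sketch below is not the simulation behind Figure 1.2: it assumes an EKV-style smooth current model in place of Equation 1.3, ignores the VDS dependence of the off-current, and uses hypothetical device and circuit parameters (threshold voltage, slope factor, capacitance, activity factor) chosen only to make the trend visible.

```python
import math

# All parameters below are hypothetical, chosen only to illustrate the trend.
VT_THERMAL = 0.026   # thermal voltage kT/q near room temperature (V)
V_TH = 0.40          # transistor threshold voltage (V)
N_SLOPE = 1.5        # sub-threshold slope factor
I_SPEC = 1e-6        # current scale of the device model (A)
C_GATE = 1e-15       # switched capacitance per stage (F)
N_STAGES = 50        # stages in the inverter chain
ALPHA_SW = 0.1       # activity factor

def drain_current(v_gs: float) -> float:
    """EKV-style smooth interpolation between weak and strong inversion."""
    x = (v_gs - V_TH) / (2.0 * N_SLOPE * VT_THERMAL)
    return I_SPEC * math.log1p(math.exp(x)) ** 2

def cycle_time(vdd: float) -> float:
    """Cycle time taken as the chain delay, with stage delay ~ C * VDD / I_on."""
    return N_STAGES * C_GATE * vdd / drain_current(vdd)

def switching_energy(vdd: float) -> float:
    """E_SW = alpha_sw * C_TOT * VDD^2 (assumed form, no extra prefactor)."""
    return ALPHA_SW * (N_STAGES * C_GATE) * vdd ** 2

def leakage_energy(vdd: float) -> float:
    """E_LKG = VDD * I_OFF * T_CYCLE, with I_OFF evaluated at V_GS = 0."""
    i_off_total = N_STAGES * drain_current(0.0)
    return vdd * i_off_total * cycle_time(vdd)

def minimum_energy_point(v_lo: float = 0.15, v_hi: float = 1.0, step: float = 0.005):
    """Sweep VDD on a grid and return (vdd, energy) with the lowest total energy."""
    n = int(round((v_hi - v_lo) / step))
    grid = [v_lo + i * step for i in range(n + 1)]
    return min(((v, switching_energy(v) + leakage_energy(v)) for v in grid),
               key=lambda pair: pair[1])

if __name__ == "__main__":
    v_mep, e_mep = minimum_energy_point()
    print(f"MEP at VDD = {v_mep:.3f} V, energy/cycle = {e_mep:.3e} J")
```

With these assumed values the minimum falls below the threshold voltage, and lowering the off-current (for example by raising `V_TH` while keeping the speed model unchanged) would move it to a lower energy, which is the effect leakage reduction techniques aim for.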
From the above considerations, it becomes clear that operating circuits at lower
VDD is highly desirable for ULP VLSI systems that can tolerate lower performance
such as the examples cited in Section 1.2. However, in such a case, the leakage energy
consumption plays an important role: it is responsible for the increase in total energy
and for setting the minimum energy per operation at ULV. Therefore, this thesis focuses
on techniques for reducing the impact of the leakage energy in order to improve
the energy-efficiency of ULP VLSI systems. The subsequent section discusses leakage
mechanisms in deep-submicrometer and nanometer CMOS with emphasis on sub-
threshold leakage.
1.4 Leakage Mechanisms in CMOS Circuits
Leakage current in deep-submicrometer and nanometer process technology nodes is
one of the two main contributors to the power dissipation of today’s CMOS circuits.
This is a direct consequence of the technology scaling that has been continuing for
more than 40 years now, driven by the goals of achieving higher density and
performance, and lower power consumption. The authors of [1; 2] identify, describe, and model
the different leakage mechanisms that exist in contemporary CMOS circuits, namely
sub-threshold leakage current, reverse-biased junction leakage current, gate leakage
(tunneling into and through the gate oxide), and gate-induced drain leakage. In current
CMOS technologies, however, the sub-threshold leakage current is much larger than
the other concurrent leakage current components and dominates the total leakage [2].
Hence, understanding, modeling, and managing the sub-threshold leakage has been a
key goal from the point of view of technology/device and circuits.
From the point of view of circuit design, the authors of [2] further define active and
standby leakage in order to differentiate the leakage power of circuits in two scenarios:
active and standby. The former is defined as the intrinsic leakage power consumption
of circuits that are powered on (i.e., active, and neither shut down nor in a low-leakage
mode). The latter is defined as the remaining leakage current of circuits after exercising
non-passive leakage suppression mechanisms such as power-gating. For instance,
standby leakage is the term usually employed for the remaining leakage current of a
power-gated block.
Due to its great importance in today’s CMOS circuits, the works and techniques
presented in this thesis are focused on the sub-threshold leakage current. Throughout
this thesis, the terms active leakage and leakage refer to sub-threshold leakage current,
unless otherwise specified.
1.4.1 Sub-Threshold Leakage Current
Sub-threshold leakage (also known as weak inversion conduction) is the current be-
tween the source and drain in a MOSFET device when it is in the cutoff region. In
an NMOS, the cutoff occurs when its gate-to-source voltage is less than or equal to
its threshold voltage, i.e., VGS ≤ Vt. In a PMOS, the cutoff occurs when its source-
to-gate voltage is less than or equal to its absolute threshold voltage, i.e., VSG ≤ |Vt|.
The sub-threshold curve of an NMOS device in a 65 nm general-purpose CMOS is
shown in Figure 1.4. The weak inversion region is identified as the linear part of the
curve in this semilog-Y plot.
Figure 1.4 Sub-threshold curve of an NMOS device in a 65 nm general purpose CMOS
process technology. The sub-threshold region is within the linear part of the curve in
this semilog-Y plot.
The sub-threshold current is then expressed as in Equation 1.3 [1]:
$$I_{SUB} = \mu_0 C_{ox} \frac{W}{L} (m-1) (V_T)^2 \times e^{(V_{GS}-V_t)/(m V_T)} \times \left(1 - e^{-V_{DS}/V_T}\right) \qquad (1.3)$$
where VT = kT/q is the thermal voltage, Cox is the gate oxide capacitance, W/L is the
device width-to-length ratio, and µ0 is the zero-bias mobility. The sub-threshold swing
coefficient (also known as body effect coefficient), m, is expressed by:










where SS is the sub-threshold slope (sub-threshold swing), which indicates how effec-
tively the transistor can be turned off as VGS is down-scaled below Vt. SS is expressed



















Ideally, when tox → 0 in Equation 1.4 and at room temperature, SS ≈ 60
mV/decade. Nevertheless, typical values for bulk CMOS range from 70 to 120
mV/decade. For instance, in Figure 1.4, the NMOS device has a slope of approxi-
mately 95 mV/decade.
1.5 The Impact of Leakage in ULP VLSI Systems
Leakage plays a fundamental role in ULP VLSI systems. It is responsible for the
increase of total energy per cycle at ULV and it determines the MEP of ULP VLSI
systems.
As shown in Figure 1.2, the leakage energy component is responsible for the total
energy per cycle increase at ULV operation. The almost exponential increase of
the leakage energy limits the energy reductions provided by supply voltage scaling,
as opposed to the quadratic reduction that would have been achievable if the total
energy was dominated by the switching energy component. In fact, if the leakage
contribution were negligible at ULV, then ESW ≫ ELKG, and from Equation 1.1, the
switching energy would monotonically decrease with supply voltage. In such a case,
the MEP would be determined by the minimum supply voltage (VMIN) of operation
of the circuits. However, the leakage contribution is non-negligible and becomes
increasingly significant with operation at scaled-VDD.
The leakage energy contribution at ULV also determines the MEP of a ULP VLSI
circuit. In Figure 1.2, when the leakage energy curve moves up (i.e., the ELKG portion
increases), the supply voltage where the MEP occurs moves towards a higher VDD.
Consequently, the minimum energy per cycle is also higher, which is undesirable for the
target applications mentioned in Section 1.2.
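The behaviour described above can be reproduced with a first-order model. The sketch below is only an illustration under simplifying assumptions: the switching energy scales with VDD², the cycle time follows a purely exponential sub-threshold delay model, and DIBL is ignored. The constants (m = 1.5, an activity factor of 0.1, and the lumped `LEAK_RATIO`) are hypothetical values chosen so that the minimum falls in the ULV region; they are not fitted to Figure 1.2.

```python
import math

M_VT = 1.5 * 0.02585   # m * kT/q in volts (assumed m = 1.5, T = 300 K)
ACTIVITY = 0.1         # assumed switching activity factor
LEAK_RATIO = 80.0      # assumed lumped leakage-to-switching constant

def energy_per_cycle(vdd, leak_scale=1.0):
    """Return (E_SW, E_LKG) per cycle, normalized to the switched capacitance."""
    e_sw = ACTIVITY * vdd ** 2
    # E_LKG = I_off * VDD * T_cycle, with T_cycle ~ VDD / I_on and
    # I_on / I_off = exp(VDD / (m * VT)) in the sub-threshold model.
    e_lkg = leak_scale * LEAK_RATIO * vdd ** 2 * math.exp(-vdd / M_VT)
    return e_sw, e_lkg

def find_mep(leak_scale=1.0):
    """Sweep VDD from 0.15 V to 1.0 V in 1 mV steps; return (VDD, E_total) at the minimum."""
    best = None
    for mv in range(150, 1001):
        vdd = mv / 1000.0
        total = sum(energy_per_cycle(vdd, leak_scale))
        if best is None or total < best[1]:
            best = (vdd, total)
    return best

for scale in (1.0, 10.0):
    vdd_mep, e_mep = find_mep(scale)
    print(f"leakage x{scale}: MEP at VDD = {vdd_mep:.3f} V, E = {e_mep:.4f} (normalized)")
```

In this toy model, scaling the leakage term up by 10X moves the minimum to a higher VDD and raises the minimum energy, which is the trend the text attributes to an upward shift of the leakage energy curve.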
From the above considerations, it becomes clear that leakage energy management
is a crucial measure in order to achieve high energy-efficiency in ULP VLSI systems.
In this context, this thesis focuses on extending the energy-efficiency of ULP VLSI
systems by suppressing the leakage energy, thus reducing the energy waste of idle,
underutilized, and always-on hardware. As will be shown in Subsection 1.5.1, tra-
ditional leakage suppression techniques are rather ineffective at ULV. Hence, novel
techniques for reducing leakage of circuits are required to extend the energy-efficiency
of circuits operating at ultra-low VDD.
1.5.1 The Ineffectiveness of Traditional Leakage Reduction
Techniques at ULV
While a large body of work has been published over the years on
leakage energy reduction in CMOS circuits, most of the proposed techniques target
super-threshold circuits and are marginally effective for mitigating the leakage energy
in near- and sub-threshold regimes. A few examples include transistor stacking [22;
2] and multiple threshold design [22; 2; 4].
1.5.1.1 Transistor Stacking
Transistor stacking [22; 2] is a technique that has been historically utilized by design-
ers to reduce the leakage of circuits at the super-threshold regime. It is a compelling
measure at nominal VDD because the factor by which the off-current (i.e., IOFF ) is
reduced due to the connection of multiple devices in series is much larger than the
performance degradation incurred by the stack. The leakage current reduction is
larger at nominal supply voltage because of the stronger super-cutoff effect (i.e.,
negative-|VGS|). However, at ULV operation, as will be shown in Section 1 (Motivation)
of Chapter 4, transistor stacking results in a much smaller leakage reduction
and a much larger performance penalty. At ULV, the super-cutoff effect is weaker because
the amount of negative-|VGS| scales down with supply voltage, which makes
such a technique marginally effective.
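A rough way to see this trend is to solve for the intermediate node voltage of two series-connected off NMOS devices, using a sub-threshold current model of the same form as Equation 1.3. The sketch below is a simplified illustration, not the analysis of Chapter 4: it adds a linear DIBL term (which is what makes the intermediate node voltage depend on VDD in this model), ignores the body effect, and uses hypothetical device parameters.

```python
import math

VT_THERMAL = 0.02585  # thermal voltage at about 300 K, in volts
M = 1.5               # assumed sub-threshold swing coefficient
V_TH = 0.4            # assumed threshold voltage, in volts
DIBL = 0.1            # assumed DIBL coefficient (V/V); body effect is ignored

def i_sub(vgs, vds):
    """Normalized sub-threshold current with a DIBL-lowered threshold."""
    return (math.exp((vgs - V_TH + DIBL * vds) / (M * VT_THERMAL))
            * (1.0 - math.exp(-vds / VT_THERMAL)))

def stack_node_voltage(vdd):
    """Bisection for the node voltage Vx at which both off devices carry equal current."""
    lo, hi = 0.0, vdd
    for _ in range(60):
        vx = 0.5 * (lo + hi)
        i_top = i_sub(-vx, vdd - vx)   # gate at 0 V, source at Vx: VGS = -Vx
        i_bot = i_sub(0.0, vx)         # gate and source at 0 V, drain at Vx
        if i_top > i_bot:
            lo = vx                    # node still charging up
        else:
            hi = vx
    return 0.5 * (lo + hi)

def stack_reduction(vdd):
    """Return (single-device leakage / 2-stack leakage, Vx)."""
    vx = stack_node_voltage(vdd)
    return i_sub(0.0, vdd) / i_sub(0.0, vx), vx

for vdd in (1.0, 0.3):
    factor, vx = stack_reduction(vdd)
    print(f"VDD = {vdd:.1f} V: Vx = {vx * 1e3:.0f} mV, leakage reduction = {factor:.1f}X")
```

With these assumed parameters, both the intermediate node voltage (the amount of negative VGS on the top device) and the leakage reduction factor shrink as VDD is scaled from 1.0 V to 0.3 V.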
1.5.1.2 Multiple Threshold (Multi-Vt) Design
The adoption of devices with different threshold voltages in designing CMOS circuits is
another traditional technique to reduce leakage at nominal supply voltage. In multi-Vt
designs, standard- and low-threshold cells are assigned to critical paths to meet
the cycle time, whereas high-threshold cells are assigned to non-critical paths
to reduce leakage. At ULV operation, however, devices have exponential sensitivity
to the threshold voltage in accordance with Equation 1.3. Consequently, the non-
critical paths with high-threshold voltage devices become much slower and start to
turn into critical paths, which impacts delay and timing closure. Furthermore, in
multi-Vt design, the leakage savings are generally small and limited by the standard-
and low-threshold voltage cells.
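The difference in sensitivity can be illustrated with two textbook delay approximations. The sketch below assumes that gate delay is inversely proportional to the on-current, that the on-current follows the exponential of Equation 1.3 in the sub-threshold regime, and that it follows an alpha-power law in the super-threshold regime. The values of m, the alpha exponent, and the 100 mV threshold offset are hypothetical.

```python
import math

VT_THERMAL = 0.02585  # thermal voltage at about 300 K, in volts
M = 1.5               # assumed sub-threshold swing coefficient
ALPHA = 1.3           # assumed alpha-power-law exponent

def slowdown_sub_threshold(delta_vt):
    """Delay ratio (high-Vt / standard-Vt) when I_on ~ exp((VDD - Vt) / (m * VT))."""
    return math.exp(delta_vt / (M * VT_THERMAL))

def slowdown_super_threshold(vdd, vt, delta_vt):
    """Delay ratio (high-Vt / standard-Vt) when I_on ~ (VDD - Vt) ** alpha."""
    return ((vdd - vt) / (vdd - vt - delta_vt)) ** ALPHA

sub = slowdown_sub_threshold(0.1)
sup = slowdown_super_threshold(vdd=1.0, vt=0.4, delta_vt=0.1)
print(f"100 mV higher Vt: {sub:.1f}X slower in sub-threshold, {sup:.2f}X slower at 1.0 V")
```

Under these assumptions, a 100 mV higher threshold slows a gate by more than an order of magnitude in the sub-threshold regime but by well under 2X at a nominal 1.0 V supply, which is why high-threshold paths that were non-critical at nominal VDD can become critical at ULV.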
1.6 Challenges and Contributions of this Thesis
This thesis focuses on leakage suppression mechanisms with the goal of minimizing the
leakage energy waste of idle, underutilized, and always-on hardware in ULP VLSI
systems for embedded and mobile applications. As shown in Figure 1.2, the leakage
energy increases the total energy per cycle at ULV. It also defines the practical limit
for energy-efficiency, which is not necessarily sufficient for the applications in
Section 1.2 or for enabling unforeseen use cases. Therefore, this thesis proposes novel
leakage reduction techniques to effectively address such energy waste and extend the
boundaries of energy-efficiency. As
illustrated in Figure 1.5, the main goal of the proposed techniques presented here is
to move down (reduce) the leakage energy curve from the original point (1) to the
lower leakage energy point (2) so as to enable higher energy-efficiency in ULP VLSI
systems.
The circuit- and architecture-level techniques for leakage mitigation proposed in
this thesis are demonstrated in real, fairly large ULP VLSI systems, from
application-specific integrated circuit (ASIC) accelerators to general purpose processors and
to spatial architectures (i.e., coarse-grain reconfigurable accelerators/architectures).
Such ULP VLSI systems were taped out using commercial process technologies, and their
functionality was verified under different process, voltage, and temperature conditions.
Measured results of the prototypes attest to the effectiveness of the proposed techniques
and mark state-of-the-art energy-efficiencies.
Figure 1.5 Energy/cycle vs. supply voltage of a 50-stage FO4-inverter chain benchmark
circuit after applying leakage reduction techniques. The leakage energy curve
(1) moves down to a lower leakage point (2). Consequently, the total energy is also
reduced, enabling extended energy-efficiency.
1.6.1 Spatiotemporally Fine-Grained Power-Gating
1.6.1.1 Temporally Fine-Grained Sleep Technique for Parallel Architec-
tures at ULV
Chapter 2 presents a novel sleep technique with fine temporal granularity for paral-
lel architectures operating at ULV. This technique allows for extending the energy-
efficiency of parallel circuits operating in the near- and sub-threshold regimes by op-
portunistically shutting down idle and underutilized replicas in a parallel architecture
to reduce leakage.
An analysis of the energy/cycle vs. cycle time for the baseline (i.e., no pipelining
or parallelism implemented), the 2-stage pipelined, and the 2-way parallel multiplier
shows that trading off performance and energy-efficiency by increasing parallelism
and reducing supply voltage becomes unattainable at ULV operation – unlike
what classic studies have shown for the super-threshold regime [28]. The reason
is that the leakage energy waste offsets the total energy reduction provided by VDD
down-scaling.
We then investigate power-gating switches to reduce the impact of leakage in parallel
architectures operating at low supply voltages. We present a detailed study
of power-gating switch overheads and identify Zigzag Super Cut-Off CMOS (ZSCCMOS)
[36; 37] as a key power-gating technique for obtaining the full benefit of the
performance/energy-efficiency trade-off at ULV.
Therefore, we propose a novel sleep technique that utilizes ZSCCMOS circuits
coupled with the parallel architecture in order to shut down idle and underutilized
replicas/branches in a temporally fine-grained manner. The proposed technique is
applied to the 2-way parallel multiplier. Simulation results show that, at low utilization
rates, the parallel design with the proposed technique achieves 2.4X
better energy-efficiency than the 2-way parallel design with no sleep mode.
1.6.1.2 A Compact, Energy-Efficient FFT Processor for ULP Embedded
Applications
Chapter 3 describes the design and implementation of a compact, energy-efficient fast
Fourier transform (FFT) core that improves the MEP (i.e., energy-efficiency) by 50
mV to 75 mV as compared to its baseline counterpart implementation by suppressing
the leakage energy consumption of its functional units such as the SRAM memories
and the processing element.
In the proposed FFT core, we adopt a memory-based FFT architecture for area-
efficiency and exercise sub-threshold circuits to minimize the switching energy con-
sumption. The main roadblock in this direction is the leakage energy consumption
that limits the scaling of the minimum energy per FFT. This, consequently, results
in a marginal energy-efficiency, similarly to previously published work [24].
To address the leakage energy and extend the energy-efficiency of the FFT core, we
propose spatiotemporally fine-grained leakage reduction techniques that are applied to
memory and combinational logic circuits. Engineered for ultra-low leakage consump-
tion, the fabricated FFT silicon prototype consumes 3.19 nJ per 256-point FFT trans-
form and achieves approximately 5X better energy-area-product than the prior art [50;
24].
1.6.2 An Ultra-Low Leakage Logic Family for Always-On
Circuits
Chapter 4 presents a novel ultra-low leakage logic family for always-on circuits referred
to as Feedforward Leakage Self Suppression Logic (FLSL). It addresses the switching
speed problem of existing ultra-low leakage logic families while maintaining
femtowatt leakage power consumption per gate.
ULP VLSI systems typically execute a short task imposed by an application and
put most of their building blocks in the sleep mode to reduce power consumption. Nevertheless,
the energy-efficiency of such systems can still be dominated by always-on circuits –
low switching activity, leakage-dominated – such as data-retentive SRAMs, power
management, signal acquisition front-ends, and control. Minimizing the leakage energy of
such always-on modules becomes crucial for improving energy-efficiency.
We then propose a novel ultra-low leakage logic family based on the prior art [68].
The proposed work removes the slow feedback path of [68] and replaces it with fast
feedforward signals that are available in a complementary (i.e., dual-rail) logic family.
We develop a complete standard-cell library and fabricate a finite impulse response
(FIR) core to demonstrate the proposed logic family. Results show an improvement
of 150X in speed and 200X in leakage-delay product over the prior art [68].
1.6.3 A Near-Vt Programmable Spatial Array Accelerator for
the Ultra-Low Power Embedded and Mobile Domains
Chapter 5 presents Catena, a near-threshold sub-0.4-mW 16-core programmable spatial
array accelerator for the ULP embedded and mobile domains. Catena’s circuits
and architecture are designed using the techniques presented in Chapters 2,
3, and 4, in order to minimize the energy waste of idle and underutilized hardware
and always-on circuits. Thanks to the proposed techniques, the prototype achieves
state-of-the-art energy-efficiencies across multiple workloads and utilization rates.
Programmable spatial array accelerators are massively parallel devices that have
been historically utilized in high-performance computing systems. As a fully pro-
grammable device, such an architecture also becomes a compelling solution for emerg-
ing embedded and mobile applications that can tolerate lower frequency of operation
as it can efficiently map workloads commonly used in this domain such as cryp-
tography, deep neural networks, and digital signal processing. The large degree of
parallelism of a spatially-programmable architecture also offers ample opportunity for
trading off performance and energy-efficiency via voltage scaling.
However, as will be shown in Chapter 2, this trade-off becomes unattainable
at ULV due to the energy waste of idle and underutilized hardware and always-on
circuits. Therefore, we engineer Catena’s circuit and architecture to address such
issues. We employ spatially and temporally fine-grained clock- and power-gating as
presented in Chapters 2 and 3 to minimize the energy waste of idle and underutilized
hardware. For the always-on circuits, we employ techniques such as shown in Chapter
4 as well as increase computation-efficiency by utilizing architectural knobs such as
sub-word vectorization.
Catena’s prototype is the first known programmable spatial array accelerator designed
to operate at near- and sub-threshold voltages with ultra-low power consumption. Thanks
to the proposed techniques, it achieves state-of-the-art energy-efficiencies across
multiple workloads and hardware utilization rates. It achieves 2.7X better
energy-efficiency than prior art [71] when performing an FFT workload.
1.7 Organization of this Thesis
This thesis is organized as follows. Chapter 1 provides a brief introduction to the
ULP VLSI systems targeted by the research presented in this thesis.
It also discusses ULV operation, sub-threshold leakage current in deep-submicrometer
and nanometer CMOS circuits, and the importance of reducing the leakage energy
in the target ULP VLSI systems. Chapter 2 presents a novel temporally fine-grained
sleep technique for achieving the maximum benefit of the performance and
energy-efficiency trade-off in parallel architectures operating at ULV by suppressing
leakage energy. Chapter 3 presents circuit-level techniques for spatiotemporally
fine-grained leakage reduction applied to the combinational logic and memory circuits of a
compact fast Fourier transform (FFT) processor in order to improve its energy-efficiency.
Chapter 4 presents a novel logic family for ultra-low leakage always-on circuits in ULP
VLSI systems, which achieves fW/gate leakage while addressing the switching speed
problem of the prior art. Chapter 5 presents a sub-0.4-mW, near-threshold 16-core
programmable spatial array accelerator for the ultra-low power emerging embedded
and mobile Internet of Things, which employs multiple circuit- and architecture-level
techniques for improving energy-efficiency by minimizing the energy waste of idle and
underutilized hardware. Lastly, Chapter 6 concludes this thesis.
CHAPTER 2. TEMPORALLY FINE-GRAINED SLEEP TECHNIQUE FOR
NEAR- AND SUB-THRESHOLD PARALLEL ARCHITECTURES 20
Chapter 2
Temporally Fine-Grained Sleep
Technique for Near- and
Sub-Threshold Parallel
Architectures
This chapter presents a design approach for improving energy-efficiency and through-
put of parallel architectures in near- and sub-threshold voltage circuits. The focus
is to suppress leakage energy dissipation of idle portions of circuits during active
modes, which can allow us to wholly transform the throughput improvement from
parallel architectures into energy savings via deep voltage scaling. We begin by inves-
tigating the efficacy of parallel and pipeline architectures in near- and sub-threshold
circuits. The investigation reveals that active leakage energy dissipation largely un-
dermines the ability of deep voltage scaling to transform excessive throughput into
energy savings. Techniques such as power-gating switches (PGS) can mitigate the
active leakage power dissipation; however, the overhead for entering and exiting sleep
modes can offset the energy savings provided by sleep mode, particularly if sleep time
is fine-grained for suppressing active leakage. Therefore, in this chapter, we propose
a PGS design technique, inspired by Zigzag Super Cut-Off CMOS (ZSCCMOS) [36;
37], in order to optimize the overheads of mode-transitions of PGS in near- and sub-
threshold circuits. The proposed technique allows for having circuits in the sleep
mode for as short as a single clock cycle with a negligible amount of energy and
delay overheads. We apply our proposed design to parallel multiplier-based test cir-
cuits operating at near- and sub-threshold voltages. Simulations show a significant
improvement in energy-efficiency over baselines at the same throughput.
2.1 Motivation
The advance of complementary metal-oxide-semiconductor (CMOS) technology has
enabled incredible progress in integrated circuits (IC) over the years. As transistors
become smaller, faster, and consume less power, new and unforeseen applications
emerge across the size, performance, and power spectra.
In recent years, however, with emerging systems of increasingly higher complexity
and embedded applications, power and energy consumption have become of greater
importance and designers have had to conscientiously devise techniques to improve
power consumption while avoiding or mitigating performance degradation.
Supply voltage (VDD) scaling from nominal down to near or below the level of
transistor threshold voltage (Vt) [24; 25; 26; 27] is one of the most powerful knobs we
have in order to reduce the power consumption of circuits. Such a technique, often
referred to as near- and sub-threshold operation, can provide one to two orders of
magnitude savings in energy dissipation.
Coupled with supply voltage scaling, parallelism and pipelining have been histor-
ically utilized by designers in order to improve performance and energy-efficiency of
digital circuits – as shown in classic studies [28; 29; 30; 31], by taking advantage of
the performance enhancement of parallelism and pipelining in throughput and clock
frequency, respectively, one can trade off performance and energy-efficiency via VDD
scaling.
The existing works on parallel and pipelined architectures, however, place their emphasis
on nominal VDD (super-threshold regime) designs, paying minimal attention to
a crucial issue that has a great impact in near- and sub-threshold circuits: the active
leakage dissipation. As VDD is scaled down from nominal to near- and sub-threshold
levels, increasingly slowed-down circuits accumulate exponentially more leakage power
per clock cycle. Such accumulation, in the form of active leakage energy, starts to
offset the energy reduction gains and sets a practical limit for energy-efficiency. Ac-
tive leakage energy dissipation, consequently, becomes critical for runtime comput-
ing energy-efficiency. The point at which the total energy consumption starts to
increase is defined as the minimum energy point (MEP). The energy consumption
at the MEP is denoted the optimal energy (EOPT) or minimum energy (EMIN) [32; 33;
48].
In order to improve computing energy-efficiency beyond the conventional limit,
i.e., MEP, it is key to reduce leakage energy dissipation during active modes. One
potential solution is to place idle parts of circuits in a low-leakage (sleep)
mode. For example, although not targeting ultra-low voltage (ULV) circuits, Refs.
[34; 35] have proposed PGS for each function block of an execution stage of a pipelined
microprocessor. By opportunistically having the blocks that perform no useful work
in the sleep mode, we can reduce active leakage waste.
While shutting idle circuits down is a valid approach, it can cause non-negligible
energy and delay overheads to frequently enter and exit sleep mode (i.e., mode-
transitions) for suppressing the active leakage. PGS can consume a significant amount
of dynamic energy to charge and discharge internal nodes’ capacitances during mode-
transitions. Furthermore, the delay required for transitioning from sleep to active
mode can degrade throughput and complicate sleep control.
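The energy side of this overhead can be framed as a break-even condition: a sleep interval pays off only if the leakage energy saved while asleep exceeds the energy spent on entering and exiting sleep. The sketch below is a minimal illustration of that condition; the function name and all numeric values are hypothetical and are not taken from the simulations in this chapter.

```python
import math

def break_even_cycles(e_transition_j, p_leak_active_w, p_leak_sleep_w, t_cycle_s):
    """Minimum number of consecutive sleep cycles needed to recover the
    mode-transition energy (entering plus exiting sleep)."""
    saved_per_cycle = (p_leak_active_w - p_leak_sleep_w) * t_cycle_s
    if saved_per_cycle <= 0.0:
        raise ValueError("sleep mode must leak less than active mode")
    return math.ceil(e_transition_j / saved_per_cycle)

# Hypothetical block: 1 uW active leakage, 0.05 uW sleep leakage, 20 ns cycle time.
costly_switch = break_even_cycles(50e-15, 1e-6, 0.05e-6, 20e-9)  # 50 fJ per transition
cheap_switch = break_even_cycles(10e-15, 1e-6, 0.05e-6, 20e-9)   # 10 fJ per transition
print(f"break-even: {costly_switch} cycles vs. {cheap_switch} cycle(s)")
```

With these illustrative numbers, the higher-overhead switch needs three idle cycles to break even, whereas the lower-overhead switch already pays off after a single cycle. This is the property that temporally fine-grained power-gating depends on.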
In this chapter, we investigate a novel technique on temporally fine-grained PGS
that allows for minimal mode-transition energy and delay overheads. We analyze various
PGS strategies at ULV, and we revive and optimize a previously published technique
known as Zigzag Super Cut-Off CMOS (ZSCCMOS) [36; 37]. ZSCCMOS has been
proposed as an alternative to conventional PGS, but it has not been widely adopted
in super-threshold circuits due to its poor ability to suppress gate leakage. Never-
theless, gate leakage dissipation at near- and sub-threshold regimes is rather small
as compared to sub-threshold leakage. Therefore, ZSCCMOS becomes a compelling
PGS technique to largely reduce the overhead associated with mode-transitions in
ULV circuits.
We apply the proposed PGS technique in a parallel multiplier benchmark circuit
that operates in the near- and sub-threshold regime. The technique allows for having
idle branches of the parallel multiplier in the sleep mode for as short as a single
clock cycle with minimal delay and energy overheads. The simulation results show
that the parallel architecture employing the proposed temporally fine-grained PGS
design technique can improve energy-efficiency by 1.9X to 2.6X over the baseline,
2-stage pipelined, and 2-way parallel designs running at the same throughput. The
contributions of this work are summarized below:
• We analyze the challenge of active leakage in parallel architectures in near- and
sub-threshold voltage circuits.
• We argue that the active leakage can be significantly mitigated by employing
temporally fine-grained PGS.
• Among the multiple PGS techniques studied, we find ZSCCMOS to be a good
fit for temporal fine granularity.
• We experimentally confirm that the combination of parallel architectures and
ZSCCMOS can successfully trade off architectural throughput improvement for
energy savings in near- and sub-threshold voltage operation.
This chapter is organized as follows: In Section 2.2, we analyze parallel and
pipeline architectures in near- and sub-threshold operation. Additionally, we study
the impact of circuit utilization and the scalability of parallel architectures. In Section
2.3, we analyze the overhead of exercising sleep mode utilizing PGS, and in Section
2.4 we revive and optimize the ZSCCMOS technique for achieving minimal mode-
transition overheads in near- and sub-threshold circuits. In Section 2.5, we evaluate
the proposed PGS technique in parallel architectures on its efficacy to suppress ac-
tive leakage overhead in near- and sub-threshold circuits. Section 2.6 concludes this
chapter.
2.2 Parallelism and Pipelining in Near- and Sub-
Threshold Circuits
Increasing hardware parallelism and pipelining are notable architectural strategies
for enhancing effective computing throughput. These approaches can also provide
significant energy-efficiency gains since the improved throughput can be traded off
for energy savings via voltage scaling.
In this section, we will investigate the efficacy of parallelism and pipelining in near-
and sub-threshold circuits. At nominal supply voltage (super-threshold regime), the
study [28] has shown that parallelism and pipelining enable lower supply voltage of
operation for the same throughput, thus improving energy-efficiency. The results show
that 2-way parallel and 2-stage pipelined architectures can reduce VDD from 5 V to
2.9 V and achieve energy savings of 2.5X and 2.8X, respectively, over the baseline design.
As we show shortly, however, those architectures become significantly less effective
at improving energy-efficiency in near- and sub-threshold circuits due to the impact
of the active leakage.
2.2.1 The Effectiveness of the 2-Way Parallel Architecture
In order to investigate parallelism and pipelining in near- and sub-threshold circuits,
we use three benchmark circuits based on 16-bit Baugh-Wooley (BW) multipliers
[23] in a 65 nm general-purpose CMOS. The baseline version, shown in Figure 2.1(a),
consists of 32 input D flip-flops and a BW multiplier, which operates at the maximum
clock frequency (FCLK,BASE) at each VDD. Figure 2.1(b) shows a 2-stage pipelined
architecture, which consists of 32 input and pipeline flip-flops. The 2-stage pipeline
can halve the critical path delay, which allows us to use the same clock frequency
(FCLK,PIPE = FCLK,BASE) at a lower VDD for improved energy-efficiency. Lastly,
Figure 2.1(c) shows the 2-way parallel architecture. This design includes a 32-bit
2-to-1 multiplexer to recombine the outputs of the two multipliers (i.e., Multipliers 1
and 2). In the parallel architecture, while a new input comes at FCLK,BASE, computation
is interleaved by clocking the input flip-flops at FCLK,PARA, which is half of
FCLK,BASE. Although clock frequency is reduced, throughput is still maintained. This
slack, provided by parallelism, enables us to reduce VDD to increase energy savings.
The energy dissipation of output flip-flops is not included.
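The interleaving described for the 2-way parallel design can be sketched behaviourally. The model below is an illustrative simplification and not the benchmark netlist: each branch latches a new operand pair on every other base-clock cycle, has two base cycles to compute, and the output multiplexer alternates between branches. The function name is hypothetical.

```python
def simulate_two_way(operands):
    """Cycle-by-cycle model of the 2-way interleaved multiplier.

    `operands` holds one (a, b) pair per base-clock cycle. Returns the
    multiplexer output stream, one product per base-clock cycle after an
    initial latency of two base cycles.
    """
    regs = [None, None]          # input flip-flops of Multiplier 1 and Multiplier 2
    outputs = []
    n = len(operands)
    for cycle in range(n + 2):   # two extra cycles to drain both branches
        branch = cycle % 2       # branch clocked (and selected by the mux) this cycle
        if regs[branch] is not None:
            # This branch latched its operands two base cycles ago, so it has had
            # a full half-rate clock period to finish the multiplication.
            a, b = regs[branch]
            outputs.append(a * b)
            regs[branch] = None
        if cycle < n:
            regs[branch] = operands[cycle]   # latch the newly arrived input
    return outputs

products = simulate_two_way([(3, 4), (5, 6), (7, 8)])
print(products)
```

Each branch is clocked at half the base rate, yet the multiplexer still delivers one result per base-clock cycle and in the original order. This is the timing slack that the text proposes to trade for a lower VDD.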
As shown in Figure 2.2(a), we perform SPICE simulations for the three test archi-
tectures to find FCLK and the corresponding energy consumption per cycle across a
range of VDDs that cover super-, near-, and sub-threshold regimes. Results in Figure
2.2(b) show that the 2-way parallel design, operating at nominal VDD, can achieve
2X improvement in throughput for the same energy/cycle and 2X improvement in
energy-efficiency for the same throughput, as compared to the baseline design. How-
ever, at near-threshold VDDs, as shown in Figure 2.2(c), the active leakage starts to
be more significant, reducing the energy-efficiency gains to 1.4X.
The active leakage contribution becomes even greater when VDD is deeply scaled
down to the sub-threshold level. As a result, as shown in Figure 2.2(d), the 2-way
parallel architecture becomes even less energy-efficient than the baseline one.
Similarly, one can observe that the energy-efficiency gains of the pipeline archi-
tecture over the baseline reduce as VDD scales down from super- to near- and sub-
threshold levels (Figures 2.2(b)–2.2(d)) although the trend is less aggressive than
the parallel architecture case. This is because the 2-stage pipeline architecture incurs
less active leakage overhead than the 2-way parallel one. Figure 2.3 summarizes
throughput and efficiency gains of the two architectures across super-, near-, and
sub-threshold voltages.
Figure 2.1 Three test architectures based on a 16-bit multiplier. (a) Baseline, (b) 2-stage
pipelined, and (c) 2-way parallel designs. The dashed lines represent boundaries
of the equivalent sequencing stage across three designs.
Figure 2.2 Comparisons of three architectures (baseline, 2-stage pipelined, and 2-way
parallel), in energy consumption per cycle and computational throughput. VDD is
swept from 0.8 V to 0.2 V, with a step of 50 mV. (a) The comparisons over the
full range of VDD sweep; (b) A zoomed in view in the super-threshold regime; (c) A
zoomed in version in near-threshold regime; (d) A zoomed in version in sub-threshold
regime.
Figure 2.3 Performance and energy-efficiency gains of parallel and pipeline architec-
tures at super-, near-, and sub-threshold regimes.
2.2.2 The Impact of Circuit Utilization
The results shown in the previous section are based on 100 % hardware utilization,
i.e., a new input comes in every clock cycle. The 100 % hardware utilization scenario,
however, may not reflect the practical case as utilization is usually lower considering
the multiplier is a functional unit of a larger system. In order to understand the
impact of hardware utilization, we perform energy/cycle simulations for the three
test circuits at the supply voltage levels that make the architectures meet the
same throughput of 50 MHz, i.e., clock cycle time (TCYCLE) of 20 ns. We then sweep
hardware utilization via input control.
As shown in Figure 2.4(a), at low utilization rates, the active leakage dissipation
starts to dominate the total energy consumption and offsets the quadratic reduction of
dynamic energy provided by voltage scaling. The energy/cycle curves of the baseline
and the 2-way parallel architecture intersect at around 30 % utilization rate, making
the parallel architecture worse than the baseline in energy-efficiency. The 2-stage
pipelined design still achieves better or equal energy-efficiency as compared to the
baseline over utilization rates. We further analyze the crossing point of the baseline
and the parallel architecture over several cycle times, ranging from 10 to 40 ns. The
results show that longer TCYCLE increases (shifts up) the crossing point since both
designs operate at low VDDs, shown in Figure 2.4(b).
Figure 2.4 Impact of hardware utilization. (a) The comparisons of the three archi-
tectures at TCYCLE = 20 ns. (b) The crossover utilization point between the baseline
and the 2-way parallel architectures over TCYCLEs.
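The crossover between the baseline and the parallel design can be reasoned about with a first-order model in which leakage energy is paid in every cycle and dynamic energy only in cycles that carry a valid input. The sketch below is a minimal illustration of that reasoning; the function names and all numeric values are hypothetical and normalized, not taken from the simulations.

```python
def energy_per_cycle(utilization, e_dyn, e_leak):
    """First-order model: leakage is paid every cycle, dynamic energy only
    on the fraction of cycles that carry a valid input."""
    return utilization * e_dyn + e_leak


def crossover_utilization(e_dyn_a, e_leak_a, e_dyn_b, e_leak_b):
    """Utilization at which designs A and B consume equal energy per cycle.
    Returns None when the two lines do not cross within [0, 1]."""
    if e_dyn_a == e_dyn_b:
        return None
    u = (e_leak_b - e_leak_a) / (e_dyn_a - e_dyn_b)
    return u if 0.0 <= u <= 1.0 else None


# Hypothetical, normalized numbers (assumptions for illustration only):
# the parallel design runs at a lower VDD, so it spends less dynamic energy,
# but it has more hardware leaking during every cycle.
baseline = {"e_dyn": 1.00, "e_leak": 0.20}
parallel = {"e_dyn": 0.60, "e_leak": 0.32}

u_star = crossover_utilization(baseline["e_dyn"], baseline["e_leak"],
                               parallel["e_dyn"], parallel["e_leak"])
print(f"crossover utilization: {u_star:.2f}")
```

Below the crossover utilization the design with the larger leakage term loses, which is the qualitative behavior described for the 2-way parallel multiplier.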
Hardware utilization has a strong impact on how effectively parallel architectures
can trade off throughput for energy savings via supply voltage scaling. At
sub-threshold VDD (which for this technology occurs around 0.3 V), utilization rates
below 60 % can make the 2-way parallel architecture less efficient than the baseline.
The root cause is the active leakage overhead, magnified at low hardware usage.
This is a significant concern, as parallelism through parallel replicas and interleaved
input sampling (as in the 2-way parallel multiplier) is often considered an important
architectural knob to improve energy-efficiency and throughput.
2.2.3 Scalability of Parallel Architectures
In this section, we further explore the scalability of parallel architectures (the
scalability of pipeline architectures is discussed in [38; 39]). We sweep the number
of parallel replicas (amount of parallelism) from one to eight ways and simulate
the throughput and energy dissipation per cycle. Figure 2.5 presents the results
at the near-threshold regime. Figure 2.5(a) shows that more parallelism can in-
crease throughput at the same energy per operation roughly in a linear manner. The
marginal improvement of throughput decreases as parallelism is extended to more
ways due to the circuit overhead associated with it, for example, input flip-flops and
multiplexers. Figure 2.5(b) shows the improvement in energy per operation at the
same throughput. Similarly, we find that as the number of ways increases, paral-
lel architectures incur more active leakage overhead, reducing the energy-efficiency
improvements.
We perform the same scalability analysis for parallel architecture in the sub-
threshold regime. The results are essentially an exacerbated version of the trends
observed in Figure 2.5. As seen in Figure 2.6(a), the throughput increases more
slowly with an increasing amount of parallelism. Furthermore, as shown in Figure
2.6(b), the energy per operation can improve until the number of ways increases to
4. The 8-way parallel architecture, in fact, consumes more energy per operation than
the 2-way parallel architecture at the same throughput.
Figure 2.5 Scalability of parallel architectures in near-threshold regime. (a) Computation
throughput improvement at the same energy per operation. (b) Energy per
operation improvement at the same throughput.
2.3 Power-Gating Overhead Analysis
In Section 2.2, we find that it is critical to mitigate the active leakage of large parallel
designs at ULV. PGS is a valid technique to reduce leakage power by having idle
circuits in sleep mode. However, in order to use PGS for mitigating active leakage
dissipation, we need to support sleep time as short as one clock cycle. With such short
sleep time, delay and energy overheads of mode-transitions must be minimized. In
this section, we systematically analyze the overheads of PGS across VDDs. In Section
2.4, we will explore a PGS technique with low mode-transition overheads.
Figure 2.6 Scalability of parallel architectures in sub-threshold regime. (a) Compu-
tation throughput improvement at the same energy per operation. (b) Energy per
operation improvement at the same throughput.
Figure 2.7 (a) Main circuits (two inverters) with an NMOS PGS showing the critical
discharging path during the wake-up process. (b) Timing and energy overheads during
the mode-transitions.
2.3.1 PGS Mode-Transition Energy and Delay Overheads
Figure 2.7(a) depicts the conventional NMOS-based PGS design. Figure 2.7(b) il-
lustrates the transient behavior of the virtual ground potential (VVG) and the power
dissipation of the main circuits when entering, exercising, and exiting sleep modes.
Here, when the SLPB signal transitions to logic level LOW (i.e., entering sleep mode),
the potential of VG starts to rise to a level close to VDD. The elapsed time associated
with this transition is defined as time to sleep (T2SLP). After that, the circuits reach
deep sleep, in which they consume minimal leakage power, referred to as PSLEEP. In
order to exit a sleep mode, the SLPB signal is set HIGH and each node of the main
circuits, including VG, returns to its stable state. The transition time from sleep to
active mode is referred to as wake-up time (T2WKU). The total sleep time, TSLP,
is defined as the sum of T2SLP, TSLEEP, and T2WKU. The energy dissipated during
a mode-transition is defined as ETRAN. Among the metrics discussed, the critical
bottlenecks for enabling temporally fine-grained sleep modes are T2WKU and ETRAN.
T2WKU has a direct impact on computing throughput since a longer T2WKU prevents
the hardware from computing immediately after a new input arrives. Additionally,
ETRAN can dictate the efficacy of temporally fine-grained sleep modes since a larger
ETRAN can offset the leakage reduction obtained by exercising temporally fine-grained
sleep modes. T2WKU is related to the charging and discharging of various capacitors
through the power switch as it transitions from the OFF to the ON state. In order to
understand these dynamics, we consider a simple circuit that consists of two inverters
and a footer PGS (Figure 2.7(a)) and derive an analytical equation for T2WKU.
When we wake up the circuits, i.e., SLPB changes from LOW to HIGH, T2WKU is
determined by a critical discharging path from INT to VG and GND (highlighted in
red in Figure 2.7(a)). This process involves discharging mainly two large capacitors:
the virtual ground rail parasitic capacitor (CVG) and an interconnect capacitor of the
main circuits (CINT). Thus, as shown in Figure 2.8(a), we can define T2WKU as the
sum of two discharging time components:
T2WKU = T2WKU1 + T2WKU2 (2.1)
where T2WKU1 is the time to discharge VVG to 4 · VT (VT is the thermal voltage, 25.8 mV
at room temperature) and T2WKU2 is the time to discharge VINT to 10 % of VDD.
In the course of T2WKU1, the PGS has low resistance (RPGS,on) since its gate-
source (VGS,PGS) and drain-source (VDS,PGS) voltages are large. RPGS,on is also
roughly constant, as VGS,PGS stays the same and, according to the well-known
sub-threshold current equation, the impact of VDS,PGS on RPGS,on is small. On the other hand,
the gate-source voltage of the pull-down transistor N1 (VGS,N1) is approximately zero
since VVG is close to VDD during this phase. This leads N1 to have a resistance (RN1)
that is very large relative to RPGS,on. Therefore, T2WKU1 can be modeled as the
discharging time of CVG through RPGS,on (Figure 2.8(b)). It can be derived as:






As VVG discharges to 4 · VT, the condition described above no longer holds
since VGS,N1 becomes large enough for N1 to start discharging CINT. In order to
model T2WKU2, accordingly, we consider the circuit shown in Figure 2.8(c), where VGS,N1 =
VDD − VVG and VDS,N1 = VDD − VVG. Then, based on the capacitor current equation,
i.e., iN1 = CINT · dVINT/dt, where iN1 is the drain current of N1, we can derive the
Figure 2.8 (a) VVG and VINT waveforms during a wake-up process; T2WKU is divided into
two phases. (b) The effective circuits during T2WKU1. (c) The effective circuits during
T2WKU2.

















where VVG is assumed to linearly scale down over time with a slope k (i.e., VVG =
−kt + 4VT), and n is a sub-threshold slope factor, determined by the intrinsic device
topology and structure. Since VDS,N1 is much larger than VT, the VDS exponential













Equation 2.4 states that k dictates T2WKU2. k is small relative to the slope in
T2WKU1. Consequently, T2WKU2 becomes large since it is the result of negative
feedback: while N1 dumps charge from INT to VG, it slows down the increase of
VGS,N1. This, however, implies that if discharging CINT through N1 does not dump
charge from INT to VG, T2WKU2 can be shortened.
Finally, we verify the T2WKU equations via simulation. As shown in Figure 2.8(a),
the model matches well with SPICE results. We also verify the improvement of
T2WKU2 in the absence of charge dumping from INT to VG. The green dashed line
clearly shows that a much shorter T2WKU2 can be achieved.
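The first wake-up phase described above, in which CVG discharges through a roughly constant RPGS,on until VVG reaches 4 · VT, behaves like a plain RC discharge. As a rough illustration, the sketch below evaluates the textbook RC decay time for that phase. This is an assumed first-order form and not necessarily the exact expression derived in this section; the function name and the resistance, capacitance, and voltage values are hypothetical.

```python
import math

THERMAL_VOLTAGE = 0.0258  # V at room temperature, as quoted in the text


def t2wku1_rc(r_pgs_on, c_vg, v_start, v_end=4 * THERMAL_VOLTAGE):
    """Time for an RC node to decay from v_start to v_end through a
    constant resistance: t = R * C * ln(v_start / v_end)."""
    if not 0 < v_end < v_start:
        raise ValueError("need 0 < v_end < v_start")
    return r_pgs_on * c_vg * math.log(v_start / v_end)


# Hypothetical values, chosen only to exercise the formula.
r_pgs_on = 100e3   # ohm, assumed on-resistance of the power switch
c_vg = 1e-12       # F, assumed virtual-ground rail capacitance
vdd = 0.35         # V, VVG is assumed to start close to VDD

t1 = t2wku1_rc(r_pgs_on, c_vg, v_start=vdd)
print(f"first-phase wake-up time: {t1 * 1e9:.1f} ns")
```

In this simplified view, the first phase scales linearly with both RPGS,on and CVG, which is consistent with a gate overdrive shortening only this phase of the wake-up.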
In addition to T2WKU, ETRAN is another overhead to be minimized for enabling
temporally fine-grained sleep modes. It is associated with the switching energy dissi-
pation in various parasitic capacitors. Since ETRAN is dominated by switching energy
dissipation, it can be small at near- and sub-threshold VDDs, which will be confirmed
in Subsection 2.3.2. ETRAN can be modeled as:
\[ E_{TRAN} = (C_{VG} + C_{INT}) \cdot \Delta V^{2} + C_{PGS} \cdot V_{DD}^{2} \qquad (2.5) \]
where ∆V is the change of VVG during mode-transitions (typically close to VDD) and
CPGS is the PGS gate capacitance.
Based on the derived equations in this section, the main knobs to minimize
the mode-transition overheads associated with T2WKU and ETRAN are CVG, CINT,
RPGS,on, and ∆V. Among those, CINT, in particular, plays a critical role in both
T2WKU and ETRAN []. Conceiving a technique that avoids charging and discharging
CINT during a wake-up process would potentially enable shorter T2WKU and lower
ETRAN.
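The energy model of Equation 2.5 is simple enough to evaluate directly. The sketch below computes ETRAN from its three capacitances and, as an additional assumption that is not part of the derivation above, estimates the break-even sleep duration at which the leakage saved equals the transition energy. All capacitance and power values are hypothetical.

```python
def e_tran(c_vg, c_int, c_pgs, delta_v, vdd):
    """Mode-transition energy in the form of Equation 2.5."""
    return (c_vg + c_int) * delta_v ** 2 + c_pgs * vdd ** 2


def break_even_sleep_time(e_transition, p_leak_active, p_leak_sleep):
    """Sleep duration at which the leakage energy saved equals the
    transition energy (an illustrative figure of merit, assumed here)."""
    saved_power = p_leak_active - p_leak_sleep
    if saved_power <= 0:
        raise ValueError("sleep mode must leak less than active mode")
    return e_transition / saved_power


# Hypothetical capacitances in farads; delta_v is taken equal to VDD.
caps = {"c_vg": 1e-12, "c_int": 2e-12, "c_pgs": 0.1e-12}
e_035 = e_tran(**caps, delta_v=0.35, vdd=0.35)
e_070 = e_tran(**caps, delta_v=0.70, vdd=0.70)
print(f"ETRAN ratio for a 2x higher VDD: {e_070 / e_035:.2f}")

# Hypothetical leakage powers in watts.
t_be = break_even_sleep_time(e_035, p_leak_active=1e-6, p_leak_sleep=0.1e-6)
print(f"break-even sleep time: {t_be * 1e9:.0f} ns")
```

Because every term is proportional to a squared voltage, the model shows directly why the absolute transition energy shrinks quadratically as VDD is scaled down, and why removing CINT from the switched capacitance lowers ETRAN.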
2.3.2 The Impact of Voltage Scaling on PGS Overheads
We investigate the impact of VDD on T2WKU and ETRAN using 16-bit BW multi-
plier test circuits. We use post-layout parasitic RC extracted netlists in a 65 nm
general-purpose CMOS. The PGS is designed with standard-threshold (SVT) NMOS
transistors and sized to approximately 1 % of the total NMOS width of the multiplier
circuits, as suggested in [40; 41]. In addition, based on the suggestion in Ref. [42],
we include a 10 pF decoupling capacitor that is approximately 10 % of the footprint
of the multiplier. The substrate of the main circuit is isolated and tied to virtual
ground to avoid implicit body biasing.
We investigate relative costs such as T2WKU in fanout-of-4 (FO4)-inverter delay
units and ETRAN normalized to leakage energy consumption per cycle (ELEAK), both
across VDDs. We set TCYCLE to the minimum value at each VDD. As shown in
Figure 2.9, the relative ETRAN quadratically decreases with VDD. This is because
ETRAN consists mainly of switching energy dissipation. On the other hand, the rel-
ative T2WKU remains roughly insensitive to VDD down scaling since the parameters
determining T2WKU (e.g., RPGS,on and k) scale similarly with the FO4-inverter delay
across VDDs.
The smaller relative ETRAN at lower VDDs can be an important opportunity to use
temporally fine-grained sleep modes for mitigating active leakage overhead. Still, it
remains critical to devise a technique that reduces T2WKU. By doing so, we can completely
remove the bottlenecks to enabling temporally fine-grained sleep modes.
Figure 2.9 Relative energy and timing overheads over VDDs.
2.4 Temporally Fine-Grained Sleep Technique
In order to reduce T2WKU, we first investigate the use of an overdrive VGS to drive the
PGS, e.g., using VSLPB = 1 V and VDD = 0.35 V in Figure 2.8(a). This, however,
shortens only T2WKU1 and has little impact on T2WKU2, since the latter is
dominated by the feedback process of non-PGS transistors (e.g., N1 in Figure 2.8(a)).
As shown in Figure 2.12, T2WKU is still 35 FO4-inverter delays (only a 3 FO4 reduction
from the T2WKU of the non-overdrive PGS case), largely compromising throughput if a
temporally fine-grained sleep mode is exercised.
Equations 2.3 and 2.4 show that it is critical to avoid the discharging process of
CINT . For this, we revive and optimize a technique known as Zigzag Super Cut-Off
CMOS (ZSCCMOS) [36; 37]. This technique assigns either footer or header PGS to
each gate depending on the output voltage level at a predefined set of inputs. This
arrangement prevents internal nets’ capacitors from charging and discharging during
mode-transitions.
Figure 2.10(a) shows such an arrangement for a 50-stage inverter chain. In this
scheme, when the circuits enter the sleep mode (the signal SLPB becomes LOW),
VIN is also set to a predefined value (LOW in this example). This forces the outputs
of all the inverters (i.e., the internal nodes) to predefined values, either
LOW or HIGH. Those gates with HIGH outputs are tied to a footer PGS. On the
other hand, those gates with LOW outputs are connected to a header PGS. This
way, we can limit the switching of CINT during the sleep mode and mode-transitions,
remarkably reducing T2WKU and ETRAN.
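The assignment rule described above for the inverter chain can be sketched in a few lines: propagate the predefined sleep input through the chain, and attach a footer to every gate whose sleep-state output is HIGH and a header to every gate whose output is LOW. The function name and return format below are hypothetical and cover only the inverter-chain case.

```python
def zigzag_assignment(num_stages, sleep_input_low=True):
    """For an inverter chain, return a list of (stage, sleep_output, switch)
    tuples, where switch is 'footer' for HIGH outputs and 'header' for LOW."""
    assignment = []
    level = 0 if sleep_input_low else 1    # logic level at the chain input in sleep
    for stage in range(num_stages):
        level ^= 1                         # each inverter flips the level
        switch = "footer" if level == 1 else "header"
        assignment.append((stage, level, switch))
    return assignment


chain = zigzag_assignment(50)              # 50 stages, VIN held LOW in sleep
footers = sum(1 for _, _, sw in chain if sw == "footer")
headers = sum(1 for _, _, sw in chain if sw == "header")
print(chain[:4])
print(f"{footers} footer-gated stages, {headers} header-gated stages")
```

With the input held LOW, the stages alternate between footer and header switches, which is where the zigzag name comes from. For general logic, the same rule would be applied after evaluating each gate at the predefined input vector.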
The wake-up sequence of the circuits starts when the flip-flops capture the INPUT
and VALID signals at the rising clock edge. However, we use different flip-flops
for the INPUT and VALID signals: the flip-flops for INPUT have a reset input, and therefore
their D-Q delay is longer than that of the flip-flops used for the
VALID signals. The small difference in D-Q delay (1–2 FO4-inverter delays) is found
to be sufficient to wake up the circuits thanks to the fast wake-up time of the ZSCCMOS.
Even if the time difference is not fully sufficient and thus the INPUT signals propagate
through gates, the impact on the wake-up time is found to be small since only the
first few gates switch while the whole circuit is waking up.
Figure 2.10 (a) 50-stage inverter chain circuits with the ZSCCMOS scheme. (b) The
breakdown of leakage power dissipation during sleep modes across VDDs.
Nevertheless, ZSCCMOS has a notable issue, which has limited its use in super-
threshold voltage circuits: the gate leakage path created by the use of both
footer and header PGS (Figure 2.10 illustrates such paths as red and blue arrows).
As shown in Figure 2.10(b), gate leakage through these paths can con-
tribute 83 % of the total leakage dissipation during a sleep mode at 1 V. Other leakage
components (e.g., sub-threshold leakage, junction leakage) account for 17 %. The
sub-threshold leakage, in this case, is small due to the stacking effect [1]. In con-
trast, header-only or footer-only PGS designs have no such gate leakage paths [43].
Note that gate leakage is expected to remain one of the significant sources of leakage power
dissipation, although its significance varies across processes and device flavors. The
reason is that the gate oxide thickness poses a fundamental trade-off between channel
control and gate leakage, which continues to motivate studies on device optimization
[46].
The gate leakage problem, however, diminishes at near- and sub-threshold voltages be-
cause gate leakage becomes small at lower VDDs, making the ZSCCMOS technique
more compelling. As shown in Figure 2.10(b), the contribution of gate leakage to
total leakage dissipation during a sleep mode becomes 23.8 % at 0.5 V and only 6.5
% at 0.3 V.
We compare four PGS design techniques: a footer PGS, a footer PGS with gate
overdrive voltage, a ZSCCMOS, and a ZSCCMOS with gate overdrive voltage (Figure
2.11). In the two footer-only PGS designs, the power switches are sized to 10 % of
the total NMOS width of the main circuits (a 16-bit BW multiplier). In the two
ZSCCMOS designs, the footer is sized to 10 % of the total NMOS width and the
header PGS is sized to 10 % of the total PMOS width. We include the overhead of
level converters [47] in our simulations for the designs using overdrive voltage (i.e.,
Figures 2.11(b, d)).
Figure 2.11 Four PGS designs: (a) a footer PGS, (b) a footer PGS with gate overdrive
voltage, (c) a ZSCCMOS, and (d) a ZSCCMOS with gate overdrive voltage.
As shown in Figure 2.12, the ZSCCMOS can reduce T2WKU over the footer PGS
design by more than one order of magnitude. In contrast, the improvements provided
by gate overdrive voltage are marginal for both footer PGS and ZSCCMOS designs.
These results again confirm the importance of limiting the switching of
CINT to scale T2WKU down. We also find similar results for T2SLP, although it is less
crucial for implementing temporally fine-grained sleep modes.
Figure 2.12 also shows PSLEEP comparisons across the four structures. PSLEEP is de-
fined as the average power dissipation when we put the circuits to sleep for ≥ 1000 FO4-inverter
delays and then wake them up. The ZSCCMOS design consumes less PSLEEP than
the footer PGS design by 1.74X at 0.55 V and by 1.37X at 0.35 V thanks to smaller
ETRAN. The ZSCCMOS design with gate overdrive voltage consumes slightly more
than the ZSCCMOS due to slightly larger ETRAN and the power dissipation of the level
converter circuits.
Figure 2.12 Comparisons among the four different PGS schemes.
Finally, leakage power is a strong function of process and temperature. Therefore,
it is important for sleep techniques to adapt to process and temperature variations
and determine when to enter and exit sleep modes. While the scope of this work is
mostly on the nominal process and temperature conditions, several recently-developed
techniques [44; 45] can be used together with the technique proposed in this work.
2.5 Parallel Architecture with the Temporally Fine-Grained Sleep Technique
In this section, based on the investigations presented in Sections 2.2, 2.3, and 2.4, we
experiment with the design of parallel architectures in near- and sub-threshold circuits,
in which we employ the proposed temporally fine-grained sleep technique. The goal is
to fully transform the throughput improvements of parallel architectures into energy
savings by avoiding the penalty of the active leakage overhead.
Figure 2.13(a) shows the test circuits that we use in the experiment. They consist of
two 16-bit BW multipliers (M1 and M2), a 32-bit 2-to-1 multiplexer, input flip-flops
(IFF1 and IFF2), and sleep flip-flops (SFF1 and SFF2). M1 and M2 employ the
ZSCCMOS technique with no gate overdrive voltage. The multiplexer employs no
PGS. CLK is a synchronous clock signal for the non-parallel portions of the circuit, while
CLK1 is a clock 2X slower than CLK. CLK2 is the complement of CLK1. The data
input (INPUT) to the test circuits is accompanied by a VALID signal that indicates
whether INPUT is valid or not. Both INPUT and VALID signals are synchronized
with CLK. In addition, by AND-ing VALID and CLK1 (or CLK2) we generate
LCLK1 (or LCLK2), which is used in the input flip-flops of M1 (or M2) for clock
gating.
We present the detailed operating waveforms of the test circuits in Figure 2.13(b).
In the waveforms, at clock cycle 1, the first input is not valid (VALID transitions
to LOW before the rising clock edge). SFF1, clocked by CLK1, captures VALID and
generates SLPB1 = LOW and SLP1 = HIGH signals. This puts M1 in the sleep
mode and minimizes its active leakage dissipation. The IFF1 of M1 are reset, forcing
the inputs of M1 to LOW so that every internal node in M1 settles to a pre-defined
state once logic evaluates. The IFF1 are also clock-gated such that their dynamic
energy consumption is minimized. Similarly, just before the next rising edge of CLK,
another invalid input comes in (VALID is still LOW at the rising clock edge). The
VALID signal, this time, is captured by SFF2 at the rising edge of CLK2. This sets
SLPB2 = LOW and SLP2 = HIGH to turn off M2 during that cycle. The IFF2 of
M2 are reset, forcing internal nodes in M2 to pre-defined states. In addition, IFF2 are
clock-gated by LCLK2 to reduce dynamic energy consumption.
Figure 2.13 (a) Proposed design: A 16-bit 2-way parallel multiplier with our tem-
porally fine-grained PGS technique. (b) Functional waveforms. (c) Comparison of
the energy dissipation per cycle of the four architectures for a target cycle time of 20
ns. (d) Comparison of the energy dissipation per cycle of the four architectures for a
target cycle time of 3.3 ns.
At cycle 3, VALID becomes HIGH. SFF1 captures the VALID signal at the
rising edge of CLK1, which modulates SLPB1 and SLP1 to wake up M1. The wake-up
process is finished before the IFFs of M1 sample INPUT at the rising edge of LCLK1
thanks to (a) a very short T2WKU of less than 1 FO4-inverter delay, and (b) the time
difference between the rising edges of CLK1 and LCLK1.
The time difference is provided by the AND gate delay for generating LCLK1
from CLK1 and VALID. In addition, the difference between the clock-to-q delays of the
resettable IFFs and the regular SFFs (3.3 vs. 2.3 FO4-inverter delays) can also
contribute to the time difference. Although we use those intrinsic time differences
to accommodate the short T2WKU, it is possible to intentionally add delay to ensure
that the wake-up process finishes before the IFFs sample a new input. Alternatively, it is
also possible to latch the sleep signals with a transparent-low latch so that the sleep
signals drive the PGSs before the rising edge of the clock (when a new input comes
in). After the wake-up process, M1 can operate on the sampled inputs under usual
conditions while M2 is kept in the sleep mode to maximize energy-efficiency.
At the beginning of cycle 4, the SFF of M2 captures the VALID signal, which is
still HIGH at the rising edge of CLK2. This sets SLPB2 and SLP2 to HIGH and
LOW, respectively, waking up M2, which then performs computation on the new
inputs, sampled at the rising edge of LCLK2.
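The control sequence walked through above can be summarized with a small cycle-level model. In the sketch below, each CLK cycle belongs alternately to M1 (a CLK1 edge) or M2 (a CLK2 edge); the sleep flip-flop of that lane captures VALID, and the input flip-flops sample only when VALID is HIGH. The function name, the trace format, and the assumption that both lanes start asleep are illustrative and not taken from the design. Analog effects such as T2WKU are not modeled.

```python
def simulate_sleep_control(valid):
    """Cycle-level sketch of the VALID-driven sleep control for two lanes.
    valid: iterable of 0/1 values, one per CLK cycle."""
    slpb = [0, 0]                      # SLPB1, SLPB2: 1 = awake, 0 = asleep (assumed start)
    trace = []
    for cycle, v in enumerate(valid):
        lane = cycle % 2               # lane 0 sees a CLK1 edge, lane 1 a CLK2 edge
        slpb[lane] = 1 if v else 0     # the lane's SFF captures VALID
        sampled = bool(v)              # LCLKx = VALID AND CLKx gates the IFFs
        trace.append({"cycle": cycle + 1, "lane": f"M{lane + 1}",
                      "SLPB1": slpb[0], "SLPB2": slpb[1], "sampled": sampled})
    return trace


# Two invalid inputs followed by two valid ones, as in the waveform walk-through.
trace = simulate_sleep_control([0, 0, 1, 1])
for row in trace:
    print(row)
```

In this model M1 wakes at cycle 3 and M2 at cycle 4, while neither lane samples an input during the two invalid cycles.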
Figure 2.13(c) shows energy consumption per cycle for the same target TCYCLE of
20 ns. The VDDs to meet the target TCYCLE are 345 mV (proposed), 400 mV (baseline),
345 mV (2-stage pipelined), and 335 mV (2-way parallel). The proposed design
exhibits less energy per cycle than the baseline and the 2-way parallel (conventional
implementation) designs across utilization rates. At hardware utilization rates less
than 50 %, the proposed design outperforms the 2-stage pipelined design as well. At
utilization of 10 %, the proposed design consumes around 2.4X less energy than the
2-way parallel design.
We also investigate the gains in the near-threshold voltage regime. The target TCYCLE
is set to 3.3 ns, which raises VDDs to 590 mV for the baseline, 505 mV for the 2-
stage pipelined, 470 mV for the 2-way parallel, and 480 mV for the proposed design.
As shown in Figure 2.13(d), the proposed design can achieve comparable or better
energy-efficiency across hardware utilization rates than all the other architectures.
Finally, while we focus on studying small-scale test circuits, i.e., multipliers, based
on the results of the study we can speculate on the applicability of the proposed technique
to larger-scale systems. We believe that the proposed technique can benefit
hardware whose power dissipation is dominated by active leakage, particularly due to
the use of a deeply-scaled supply voltage and the use of massive parallelism.
2.6 Conclusions
In this chapter, we propose coupling parallel architectures with temporally fine-
grained sleep modes for improving the energy-efficiency of near- and sub-threshold
digital circuits. We revisit two classic techniques for improving throughput and
energy-efficiency, i.e., pipelining and parallelism, in the context of near- and sub-
threshold circuits. By investigating them, we find that active leakage overhead makes
it significantly inefficient to trade off throughput for energy savings in such a context.
In order to mitigate active leakage overhead, we propose a
temporally fine-grained PGS technique by analyzing and optimizing mode-transition
overhead. Simulations with multiplier-based test circuits show that the proposed
technique can significantly reduce the energy consumption per cycle, especially for
the cases in which hardware utilization rates are low. The result also confirms the
importance of suppressing active leakage overhead for scaling the practical limit of
energy-efficiency in near- and sub-threshold digital circuits.
CHAPTER 3. A 0.17 MM2 3.19 NJ/TRANSFORM 256-POINT FAST
FOURIER TRANSFORM CORE BASED ON SPATIOTEMPORALLY
FINE-GRAINED ACTIVE LEAKAGE SUPPRESSION 50
Chapter 3
A 0.17 mm2 3.19 nJ/Transform
256-Point Fast Fourier Transform
Core based on Spatiotemporally
Fine-Grained Active Leakage
Suppression
This chapter presents an ultra-low energy and compact Fast Fourier Transform (FFT)
processor suitable for pervasive sensing systems. To achieve high energy-efficiency and
small area, we design an area-efficient memory-based architecture and equip it with two
devised techniques for ultra-low voltage circuits: (1) spatiotemporally fine-grained
voltage boosting for sub-threshold, ultra-low leakage memory; and (2) spatiotem-
porally fine-grained active leakage suppression for combinational logic. The proto-
typed chip, implemented in a 65 nm general-purpose CMOS, consumes 3.19 nJ per
FFT transform and requires 0.17 mm2 silicon area while achieving 1.2 MSample/s.
As compared to the prior state-of-the-art, our FFT processor achieves 5.04X better
energy-area-product (EAP) normalized to FFT type, resolution, word-length, and
process technology.
3.1 Motivation
Energy-efficient Fast Fourier Transform (FFT) is one of the key computation kernels
for emerging embedded systems such as edge devices for the Internet of Things (IoT).
To minimize energy consumed per transform, conventional designs explored deep
voltage scaling down to near- and sub-threshold voltage regimes, often combined with
highly pipelined and parallel architectures to maximize nodal switching activities,
which can reduce the impact of the active leakage dissipation [50; 24; 51; 52; 53;
54]. While such techniques can successfully improve energy-efficiency, the large degree
of pipelining and parallelism significantly increases silicon area, which cannot be
simply compromised in a target resource-constrained, miniaturized system.
In this chapter, we explore techniques for designing an ultra-low energy and compact
FFT processor. Our direction is to adopt an in-place, memory-based FFT architecture
for area efficiency and then exercise near- and sub-threshold circuits to minimize the
dynamic energy dissipation. The key roadblock of this direction is the active leakage
energy consumption, i.e., static power consumed over an exponentially longer cycle
time while circuits are in the active mode, which limits the scaling of the minimum-
energy point (MEP) and dictates the practical limit of energy dissipation in ultra-low
voltage digital VLSI circuits.
In order to address the active leakage problem, we propose active leakage suppres-
sion techniques that can effectively improve energy-efficiency without compromising
implementation area. By suppressing the active leakage, we scale the minimum en-
ergy point (MEP) of the FFT core and further improve energy-efficiency. We im-
plement two techniques for near- and sub-threshold computing: (1) spatiotemporally
fine-grained voltage boosting for sub-threshold, ultra-low leakage memory; and (2)
spatiotemporally fine-grained active leakage suppression for combinational logic.
To demonstrate the effectiveness of these two techniques, we design and fabricate
an FFT processor in a 65-nm general-purpose CMOS. The prototype requires 0.17
mm2 silicon area and consumes 3.19 nJ per FFT transform with a throughput of
1.21 MSample/s. As compared to the prior art [24; 51; 50], our FFT processor
achieves 5.04X and 4.57X better energy-area-product (EAP) normalized to FFT type,
resolution, word length, and process technology. The area overhead of the proposed
techniques accounts for approximately 15 %.
The remainder of this chapter is structured as follows. In Section 3.2, we discuss
our compact MB FFT architecture. In Section 3.3, we present the FFT processor
implementation details, focusing on the circuit-level techniques for improving energy-
efficiency by suppressing the active leakage. In Section 3.4, we show the measurement
results of the prototype and provide comparisons with the prior art. Lastly, in Section
3.5, we conclude this chapter.
3.2 Memory-Based, Compact FFT Architecture
Our FFT architecture is shown in Figure 3.1. It consists of an in-place, memory-
based (MB) implementation of the FFT algorithm based on [57; 58]. The proposed
architecture has five functional units (FUs): data memory (DM), processing element
(PE), twiddle factors read-only memory (ROM), shuffling multiplexers (M3 – M1),
and controller (CTRL). The dataflow splits real and imaginary parts of a sample.
The data path is completely fixed-point: input samples are represented in Q1.15
format. To avoid introducing bias and to improve accuracy after arithmetic
operations, we perform rounding up and truncation to 16-bit format.
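As an illustration of rounding before truncation in a Q1.15 data path, the sketch below multiplies two Q1.15 values, adds half an LSB to the Q2.30 product, and then drops the 15 extra fractional bits. This round-half-up interpretation, the saturation step, and the function name are assumptions made for the example; the exact rounding circuit of the processor is not specified here.

```python
Q15_MIN, Q15_MAX = -32768, 32767


def q15_mul(a, b):
    """Multiply two Q1.15 integers with round-half-up, then saturate.
    a, b: integers in [-32768, 32767] representing values in [-1, 1)."""
    product = a * b                          # Q2.30 intermediate result
    rounded = (product + (1 << 14)) >> 15    # add half an LSB, drop 15 fractional bits
    return max(Q15_MIN, min(Q15_MAX, rounded))  # saturation is an added assumption


half = 1 << 14                               # 0.5 in Q1.15
print(q15_mul(half, half))                   # 0.5 * 0.5
print(q15_mul(1, half), (1 * half) >> 15)    # rounded result vs. plain truncation
```

Plain truncation always rounds toward negative infinity, which accumulates a negative bias over repeated butterfly stages; adding half an LSB first centers the rounding error around zero.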
Figure 3.1 FFT processor architecture. Fixed-point, memory-based, one 16-bit input
lane, one 16-bit output lane, radix-2-based butterfly, 256-point resolution.
The DM consists of a 4 kb decoupled 1-Read 1-Write (1R1W) fully custom sub-
threshold ultra-low leakage SRAM, composed of 4 banks (B3 – B0). Each one of
the memory banks stores 64 16-bit words, totaling 1 kb storage per bank. The
PE uses the radix-2-based butterfly architecture and is composed of four 16-bit
adders/subtractors, two 16-bit multiplexers, and one complex multiplier (CM). The
CM is efficiently implemented with five 16-bit adders and one real multiplier. The
ROM of twiddle factors leverages the redundancies of the real and imaginary parts
of the twiddles to store only 33 coefficients and reconstruct the remaining 95 fac-
tors/coefficients by performing simple logic and shift operations.
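The 33-plus-95 split is consistent with the octant symmetry of the 128 twiddle factors of a 256-point FFT. The sketch below assumes that each stored coefficient is a (cosine, sine) pair for the first octant and reconstructs the remaining factors by swapping and negating the two parts. It uses floating-point values for clarity, whereas the hardware works on fixed-point words, and the table layout and function name are hypothetical.

```python
import cmath
import math

N = 256
# 33 stored (cos, sin) pairs covering angles 0 .. pi/4 (assumed ROM contents).
ROM = [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
       for m in range(33)]


def twiddle(k):
    """Return W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N) for 0 <= k < N/2,
    using only the 33 stored pairs plus swaps and sign changes."""
    if not 0 <= k < N // 2:
        raise ValueError("k must be in [0, N/2)")
    if k <= 32:                       # first octant: stored directly
        c, s = ROM[k]
        return complex(c, -s)
    if k <= 64:                       # angle = pi/2 - stored angle
        c, s = ROM[64 - k]
        return complex(s, -c)
    if k <= 96:                       # angle = pi/2 + stored angle
        c, s = ROM[k - 64]
        return complex(-s, -c)
    c, s = ROM[128 - k]               # angle = pi - stored angle
    return complex(-c, -s)


worst = max(abs(twiddle(k) - cmath.exp(-2j * cmath.pi * k / N))
            for k in range(N // 2))
print(f"largest reconstruction error: {worst:.2e}")
```

Only the factors with k from 0 to 32 are read directly; the other 95 are derived, which matches the coefficient counts quoted above.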
The MB FFT operation is composed of two phases: loading and computation.
In the loading phase, 256 input samples are loaded into the SRAM banks. Once
memory loading is complete, the multiplexer M1 switches from the IO interface to
the data path and the computation phase starts. In the computation phase, the PE
fetches data from the memory and the ROM of twiddle factors in order to perform
the butterfly operations. The butterfly results are written back into the memory so
that they can be used in the next stages of the computation. It takes 449 iterations to
perform one FFT transform. The PE input multiplexers (M2) select which memory
bank to read from. Multiplexers M3 shuffle partial results to be written back into
memory banks. The CTRL generates the read and write signals and addressing
patterns for memory banks. It also generates select signals for multiplexers M0 – M3,
addresses for the ROM of twiddle factors, clock-gate signals for multiple clock sinks,
and sleep signals for the PGS.
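The load-then-compute flow with results written back over their operands can be sketched with a generic textbook in-place radix-2 decimation-in-time FFT. This is a floating-point illustration only: it does not model the four-bank addressing, the multiplexers, or the fixed-point data path, and it is not claimed to be the schedule of [57; 58].

```python
import cmath


def fft_in_place(x):
    """In-place radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    # Bit-reversal permutation so that butterflies can overwrite their inputs.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # log2(n) stages of butterflies, each reading and writing the same buffer.
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))  # twiddle factor
                a = x[start + k]
                b = x[start + k + span] * w
                x[start + k] = a + b             # written back in place
                x[start + k + span] = a - b
        span *= 2
    return x
```

Because every butterfly overwrites the two words it read, the buffer never needs to be larger than the transform itself, which is the property that keeps the MB architecture compact.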
3.3 Active Leakage Suppression Techniques for Near-
and Sub-Vt Computing
3.3.1 Ultra-Low Leakage SRAM Design
The MB architecture has the advantage of small area as compared to pipelined
architectures; however, it suffers from low nodal switching activity. The main reason
is the memory, which accounts for the greatest portion of the total circuits in the
MB architecture. It implements a 256 × 16-bit word memory, but only one word is
accessed per bank per cycle. All the other words (i.e., 252 × 16-bit words, equivalent
to 98.4 %) remain idle. These idle words contribute significantly to energy waste in
the form of leakage. Figure 3.2 illustrates this by showing the energy breakdown of the
baseline and the proposed FFT cores. The baseline processor is implemented with
high-threshold (HVT) standard-cells and runs at the same supply voltage, tempera-
ture, and frequency as the proposed FFT processor counterpart.
Figure 3.2 Energy breakdown of the baseline and proposed FFT processor.
We further investigate the memory energy contribution in the MB FFT architec-
ture for a different number of points and temperatures by using an analytical model.
Figure 3.3 shows energy breakdowns between the two main modules: memory and
butterfly. In all cases, the memory energy contribution is significantly greater than
that of the butterfly. The memory energy is mainly due to the active leakage con-
sumption because of the low nodal switching activity of this module.
As shown in Figure 3.3(a), the memory energy dramatically increases with the
FFT resolution as it takes more cycles to perform one FFT transform, which, in
turn, leads to a proportionally greater leakage energy waste. Figure 3.3(b) shows the
same estimated energy breakdown at different temperatures. Dominated by the active
leakage, the memory contribution grows strongly with temperature. Therefore,
it becomes clear that devising techniques to minimize the leakage energy consumption
in memory is a crucial step to improve the FFT processor energy-efficiency.
Figure 3.3 Energy breakdown (butterfly and memory). (a) The energy breakdown
across different number of points for a radix-2, MB FFT. (b) The energy breakdown
across temperatures.
In that context, our direction to reduce the leakage energy consumption in memory
is to design ultra-low leakage bitcell and peripherals and then utilize supply voltage
boosting in specific parts of the memory to increase performance and meet the target
read and write access times. This way, the leakage power is reduced while the access
delay is shortened, allowing for improved energy-efficiency.
3.3.1.1 Bitcell and Peripherals Design
The proposed 10T bitcell schematic, inspired by [59], is shown in Figure 3.4(a). As
our memories require state-retention and cannot be power-gated, we design the bitcell
using HVT and long-channel devices for lower leakage. We length-bias the transistors
at 3 times the minimum length of the technology, i.e., 3 · LMin = 180 nm. Simulation
results in Figure 3.5 show that utilizing high-threshold, long-channel devices provides
approximately 9.5X active leakage current reduction over the baseline bitcell, which has
the same structure but uses regular-threshold, minimum-length devices.
Our 10T bitcell layout is shown in Figure 3.4(b). It requires 4.56 µm² of area under
logic (not SRAM) design rules and exhibits about 33.3 % area overhead as compared to
its baseline counterpart. The memory cell was designed to be mirrored and overlapped
to share VDD and VSS lines between adjacent cells along the boundary. Different
from the commonly used thin 6T, we lay out our 10T bitcell in a way that VDD rails
run horizontally along with the read- and write-wordline signals. Our 10T bitcell
decouples the read and write ports; thus, both operations are independent and can be
performed simultaneously, within one cycle. It also provides sufficient noise margins and
read and write stability at deeply scaled supply voltages, as required in our
FFT design.
3.3.1.2 Spatiotemporally Fine-Grained Voltage-Boosting
Figure 3.6 shows our scheme to enhance read and write performance and avoid com-
promising access times in the proposed ultra-low leakage SRAM. For a read operation,
the VDD rail of the row of bitcells that is being accessed is boosted from VDDL to VDDH
(where VDDH = VDDL + ∆VDD and ∆VDD is approximately 0.15 V). It
Figure 3.4 (a) Schematic and (b) layout of the proposed 10T ultra-low leakage bitcell.
Figure 3.5 The effectiveness of the circuit-level techniques employed in the SRAM.
ensures fast charge of RBL as the VSG of the pull-up network of the tri-state inverter
becomes VDDL + ∆VDD. Moreover, we boost RWL by ∆VDD via a level converter [47]
to rapidly discharge RBL via the pull-down network (both PTN and MN transistors
also have VGS = VDDL + ∆VDD). We can initiate and finish the boosting operations
in a single cycle. This can be done with no considerable power-grid integrity penalty
due to the low power dissipation of near- and sub-Vt digital circuits, and relatively
small metal resistance.
Similarly, for a write operation, more conventionally, the write-wordline (WWL)
of the accessed row is boosted by the same ∆VDD, from VDDL to VDDH, for a single
cycle. This boosting ensures a fast and reliable write into the cell.
3.3.2 Ultra-Low Leakage Combinational Logic Design
In our design methodology targeting energy-efficiency, we want to aggressively sup-
press the active leakage energy in order to scale the MEP. To that end, we extensively
employ power-gating switch (PGS) techniques and exercise sleep mode to shut down
idle blocks at runtime that largely contribute to the active leakage energy waste.
Memory drivers were given special attention as they employ bigger transistors in or-
der to efficiently drive the large gate and wire interconnect capacitance associated
with a particular row of bitcells. Furthermore, we optimize the PE when it is idle,
during the memory loading phase.
3.3.2.1 Zigzag-based Wordline Drivers
Read- and write-wordline drivers are upsized buffers used to drive the large gate and
wire interconnect capacitances of a particular word or row of bitcells. Our memory,
Figure 3.6 The memory bank architecture (only 2 out of 4 blocks are depicted) and
a column of 16 bitcells featuring the spatiotemporal voltage-boosting circuitry.
specifically, implements 256 words, each of which requires 3 drivers for the read- and
write-wordline signals (i.e., RWL, RWLB, and WWL) accounting for 768 drivers.
Nevertheless, only one word is accessed per bank per cycle while the drivers of all the
other words, i.e., 252 × 3 = 756 drivers, remain idle and contribute to the active leakage
energy waste.
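The driver counts follow directly from the memory organization. The short sketch below reproduces the arithmetic using the bank and word counts stated in this chapter; the function and key names are hypothetical.

```python
BANKS = 4              # memory banks B3..B0
WORDS_PER_BANK = 64    # 16-bit words per bank
DRIVERS_PER_WORD = 3   # RWL, RWLB, and WWL


def idle_driver_stats(banks=BANKS, words_per_bank=WORDS_PER_BANK,
                      drivers_per_word=DRIVERS_PER_WORD):
    """Count wordline drivers and how many idle when one word per bank is accessed."""
    total_words = banks * words_per_bank
    active_words = banks                       # one word accessed per bank per cycle
    idle_words = total_words - active_words
    return {
        "total_drivers": total_words * drivers_per_word,
        "idle_words": idle_words,
        "idle_drivers": idle_words * drivers_per_word,
        "idle_fraction": idle_words / total_words,
    }
```

For the 4 × 64-word organization this gives 768 drivers in total, of which 756 (about 98.4 %) are idle in any given cycle.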
Inspired by works [60; 61; 62], we utilize the PGS technique Zigzag Super Cut-Off
CMOS (ZSCCMOS) to automatically shut down all the memory drivers of wordlines
that are not accessed in a particular cycle. Additionally, ZSCCMOS allows for waking
up the driver of a wordline that is being accessed on the fly (in the same cycle of a
read or write request) to assert the wordline of the relevant word.
Figure 3.7 illustrates a write-wordline driver with four inverter stages. As shown
in Figure 3.7(a), when a particular address (i.e., word/row) is not being accessed
in the memory, the one-hot row decoder forces such a word WWL (input to the
driver) to logic level low. With this condition, the outputs of the first and third
inverters evaluate to high whereas the outputs of the second and fourth inverters
evaluate to low. We then connect the first and third inverters to the NMOS footer
switch (MNPGS) and the second and fourth inverters to the PMOS header switch
(MPPGS). This way, when a driver enters the sleep mode (WSLP = high, WSLPB
= low), all its inverters become power-gated through the PMOS header or NMOS
footer, however, their internal nodes (outputs) remain unchanged (they do not float
or switch), because they are driven by the real power or ground rails. For instance,
the first inverter in Figure 3.7(a) has its output at logic level high, directly driven by
the VDD rail through the pull-up PMOS device (ON, with its gate at 0 V), whereas it is
shut down by the NMOS footer switch that is OFF (with WSLPB = 0). The second,
third, and fourth inverters follow a similar scheme. This sets the output of the WWL
driver to the desired logic level low, automatically disabling the write-wordline of that
particular word (i.e., in Figure 3.7(a), WWL is the output of the fourth inverter
and it becomes low).
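The zigzag assignment rule described above can be sketched as a small helper: each stage is cut off on the side opposite to the rail that holds its idle output, so the output stays driven by a real rail. The function name and the string labels are hypothetical; only the rule itself is taken from the text.

```python
def zigzag_assignment(idle_input: int, stages: int):
    """Return (stage, idle_output, sleep_switch) for an inverter chain.

    idle_input is the logic level (0 or 1) forced at the chain input in sleep mode.
    A stage whose idle output is high is held by the real VDD rail, so it is
    gated with the NMOS footer; a stage whose idle output is low is held by the
    real ground rail, so it is gated with the PMOS header.
    """
    plan = []
    level = idle_input
    for stage in range(1, stages + 1):
        level = 1 - level                       # each inverter flips the level
        switch = "NMOS footer" if level == 1 else "PMOS header"
        plan.append((stage, level, switch))
    return plan
```

For the four-stage WWL driver with its input forced low, stages 1 and 3 go to the footer and stages 2 and 4 go to the header, and the final output idles low.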
On the other hand, Figure 3.7(b) shows the case when a particular word is to be
accessed. The driver of the relevant word is initially shut down. Then, it receives the
sleep-to-active signal from the CTRL block at the beginning of the accessing cycle.
As the ZSCCMOS technique prevents the internal nodes of the target block from
floating or switching during the sleep mode, only the capacitances of its virtual rails
need to charge or discharge upon receiving the sleep-to-active transition. This allows
for a dramatic improvement in the wake-up time – it takes only 2.4 fanout-of-4 (FO4)-
inverter delays for a driver to wake up from the low leakage (sleep) to active mode.
After the mode-transition, upon receiving the output of the one-hot row decoder,
the write-wordline driver asserts the relevant WWL to access the word of interest.
Simulation results in Figure 3.5 show that we can reduce the active leakage of each
driver by about 9X. Along with the circuit techniques used in the bitcell, we can scale
the active leakage of the memory block by 9.45X (Figure 3.5).
3.3.2.2 Leakage Suppression in the Processing Element
We further suppress the active leakage energy contribution of the PE. This functional
unit remains idle for 256 cycles during the memory loading phase. Therefore, we
implement an NMOS PGS to place the PE into low leakage (sleep) mode for 255
cycles, when it is not performing any computation. The final cycle is reserved
for waking up from sleep to active mode. We utilize an NMOS PGS sized approximately
Figure 3.7 The ZSCCMOS drivers are implemented in combinational logic circuits
such as the memory peripherals to minimize the active leakage energy waste. (a) The
CTRL shuts down a WWL driver of which the output becomes LOW. (b) The CTRL
wakes up the driver on the fly to access a particular word.
3 % of the total NMOS width of the PE. Such an undersized PGS enables an increased
OFF resistance and a larger voltage drop across the power-gate, causing the virtual
ground rail to collapse toward the VDD rail after a number of cycles in the sleep mode.
This results in a dramatic leakage power reduction. Measurement results of the
prototype in Figure 3.8 show that by shutting down the PE for 255 cycles, we can
reduce the total power dissipation of the FFT processor by approximately 13.4 %.
To reduce the ON resistance of the NMOS PGS in the active mode and enable a
single cycle sleep-to-active transition of the PE, we overdrive the SLPB signal (i.e., the
VGS of the PGS) by around 150 mV using a level converter. This increases the driving
strength of the power switch in the active mode allowing for no speed degradation
during the computation phase. It also enables faster charge and discharge of internal
nodes and virtual rails of the PE during a sleep-to-active mode transition. Figure
3.9 shows the NMOS PGS along with the PE and the other functional units that
compose the FFT processor. In this functional unit, in particular, we decide not to
employ ZSCCMOS as it would only allow the PE to remain one more cycle in the
sleep mode whereas it significantly increases physical implementation complexity.
In contrast to the memory loading phase, the PE is highly active in the computa-
tion phase. The reason is that every cycle, the PE fetches new operands from memory
and ROM and operates over them performing radix-2 butterflies. With such high uti-
lization, the PE energy dissipation during the computation phase becomes dominated
by dynamic rather than leakage energy. Therefore, we implement the PE with regular-
threshold instead of high-threshold devices, which can provide approximately 2.7X
speed up in computation delay at 0.4 V. Furthermore, speeding up the computation
delay in the PE improves the timing balance between the memory and PE pipeline
Figure 3.8 Power consumption of the FFT processor when the PE is always-on or
shut down for 192 and 255 cycles. A 13.4 % power reduction is observed when the PE
is shut down for 255 cycles. In this case, the final cycle is reserved for waking
the circuits up from sleep mode.
stages. This results in less energy waste in both modules. Finally, to minimize the
dynamic energy, we operate the PE at sub-threshold with VDD,PE = VDDL.
3.4 Chip Prototype and Measured Results
We implement the FFT processor using the MB architecture and the circuit-level tech-
niques detailed in the previous sections. The SRAM banks and the PGS for the PE
are fully custom designed and laid out. The remaining functional units are synthesized
and automatically placed and routed using commercial CAD tools. The prototype
is designed and fabricated in a 65 nm general-purpose CMOS process. The FFT
processor consumes 0.17 mm² of silicon area. Figure 3.10 shows a microphotograph
of the chip.
The proposed techniques applied to our FFT processor successfully reduce the
overall active leakage by approximately 10X, as compared with the baseline design.
Figure 3.9 Detailed power domain and Vt assignment of the FFT processor. The PE
includes a PGS that is sized at 3 % of the total NMOS width of this block. The
SLPB signal that drives the power gate is overdriven to VDDH by a level converter.
Figure 3.10 The FFT processor chip microphotograph fabricated in a 65 nm general-
purpose CMOS.
This allows us to reduce the MEP by 50 to 75 mV, as shown in Figure 3.11.
Figure 3.11 The minimum energy operating point of the proposed FFT processor is scaled
down by 50 to 75 mV as compared to the baseline design. The scaling of the MEP
due to the circuit level active leakage suppression techniques employed in memory
and PE results in greater energy-efficiency.
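Why lower leakage moves the MEP to a lower voltage can be illustrated with a common first-order sub-threshold model: dynamic energy scales as V², while leakage energy is leakage power multiplied by a cycle time that grows exponentially as V drops. The sketch below is not the analytical model used in this chapter; the functional form, the slope constant, and both coefficients are illustrative assumptions and are not fitted to the chip.

```python
import math

N_VT = 1.5 * 0.026   # assumed slope factor times thermal voltage, in volts


def energy_per_op(vdd, c_eff, k_leak):
    """First-order model: E = V^2 * (C_eff + K_leak * exp(-V / (n * vT)))."""
    return vdd ** 2 * (c_eff + k_leak * math.exp(-vdd / N_VT))


def mep_voltage(c_eff, k_leak, v_min=0.15, v_max=0.80, step=0.001):
    """Grid search for the supply voltage that minimizes energy per operation."""
    count = int(round((v_max - v_min) / step))
    grid = [v_min + i * step for i in range(count + 1)]
    return min(grid, key=lambda v: energy_per_op(v, c_eff, k_leak))


# Hypothetical coefficients: the second design leaks 10X less than the first.
baseline_mep = mep_voltage(c_eff=1.0, k_leak=6900.0)
low_leak_mep = mep_voltage(c_eff=1.0, k_leak=690.0)
```

In this toy model the low-leakage design reaches its minimum at a lower supply voltage and at a lower energy, which is the qualitative trend of Figure 3.11; the size of the shift depends entirely on the assumed constants.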
As shown in Figure 3.12, our FFT core consumes 3.19 nJ per transform at
VDD,OPT = 0.4 V and ∆VDD = 0.15 V. It also achieves a 3.33 MHz clock frequency.
This translates to 4.73 K FFT/sec (= 1.21 MS/s). Such performance is sufficient to
handle various types of physical signals (e.g., audio).
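The reported throughput figures can be cross-checked with a few lines of arithmetic. The inputs below are the measured values quoted above; the cycles per transform and the average power are derived here and are not numbers reported in the text.

```python
F_CLK_HZ = 3.33e6          # measured clock frequency
FFT_PER_S = 4.73e3         # reported transforms per second
N_POINTS = 256             # samples per transform
ENERGY_PER_FFT_J = 3.19e-9 # measured energy per transform


def derived_metrics(f_clk=F_CLK_HZ, fft_rate=FFT_PER_S,
                    points=N_POINTS, energy=ENERGY_PER_FFT_J):
    """Derive sample rate, cycles per transform, and average power."""
    return {
        "samples_per_s": fft_rate * points,   # throughput in samples per second
        "cycles_per_fft": f_clk / fft_rate,   # implied clock cycles per transform
        "avg_power_w": energy * fft_rate,     # energy per transform times rate
    }
```

The sample rate comes out at about 1.21 MS/s, matching the text. The implied budget is roughly 704 cycles per transform, which is close to the 256-cycle loading phase plus the 449 computation iterations if one assumes one iteration per cycle, and the implied average power is about 15 µW.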
We also experiment on the energy-optimal ∆VDD. As shown in Figure 3.13, it is
found to be around 0.15 V. For larger ∆VDD voltages, the energy overhead due to VDDH
prevails over the energy gains provided by the speed improvement from the spatiotem-
poral voltage boosting techniques. On the other hand, when ∆VDD is smaller than
0.15 V, performance enhancement becomes limited and it constrains the energy gains.
We also investigate the variability of clock frequency and energy-efficiency over
temperature and process variations. As shown in Figure 3.14, the clock frequency
Figure 3.12 The measured energy dissipation per transform and clock frequency of
the FFT core.
Figure 3.13 The energy and clock frequency as a function of ∆VDD.
exhibits 5X variability across −20 to 60 °C. The measurement of 18 dies shows a considerable
spread with σ/µ of 13.4 % (Figure 3.15). Both reconfirm the importance
of variation-adaptive techniques in near- and sub-threshold circuits. The energy con-
sumption exhibits smaller variability. Over the same temperature range, it varies by
1.28X. The energy measurement of 18 dies shows a distribution with σ/µ of 6.47 %.
Figure 3.14 The energy/FFT and clock frequency measured across temperatures.
Furthermore, we compare our design with other state-of-the-art designs and present the
results in Figure 3.16. As compared with prior works [24; 50], the proposed FFT
achieves 5.04X and 4.57X better energy-area-product (EAP) normalized to technology
nodes, FFT type, number of points, and word-length.
3.5 Conclusions
In this chapter, we propose an approach to increasing the energy-efficiency of ULV circuits
that is demonstrated in a memory-based FFT processor. In our proposed approach,
we dramatically reduce the leakage energy of low nodal switching activity blocks such
Figure 3.15 The energy and performance distributions of the FFT across 18 dies.
as SRAM by utilizing aggressive leakage suppression techniques. We also implement
spatiotemporal voltage-boosting techniques to compensate for the delay introduced
by the ultra-low leakage design in such blocks. Furthermore, we optimize the static
energy consumption of the processing element by exercising sleep mode during the idle
periods in which the memory is being loaded with sampled data. The proposed FFT
core is fabricated in a 65-nm general-purpose CMOS. As compared to prior art [24;
50], the proposed design achieves 5.04X and 4.57X better normalized EAP.
Figure 3.16 (a) Comparison table with prior art. (b) Normalized area vs. normalized
energy per transform as a figure-of-merit (FoM).
CHAPTER 4. A FW- AND KHZ-CLASS FEEDFORWARD LEAKAGE
SELF-SUPPRESSION LOGIC REQUIRING NO EXTERNAL SLEEP SIGNAL
TO ENTER THE LEAKAGE SUPPRESSION MODE 72
Chapter 4
A fW- and kHz-Class Feedforward
Leakage Self-Suppression Logic
Requiring No External Sleep
Signal to Enter the Leakage
Suppression Mode
In this chapter, we present a novel logic family for nanowatt and sub-nanowatt always-
on circuits. This logic family achieves the ultra-low leakage of 5 fW per gate without
requiring external sleep control signals to put the circuits in the leakage-suppression
mode while addressing the switching speed problem of such ultra-low leakage logic
families. It achieves the fanout-of-4 (FO4)-inverter delay of 10.2 µs, marking an
improvement of 148X over the state of the art. With the proposed logic family, we
prototyped a finite impulse response filter (FIR) for physical signal sensing systems, in
which sampled signals are often sparse (predominantly constant or slowly changing)
and their frequency varies with time in a burst fashion. During the sparse period,
the filter consumes only 109 pW by autonomously suppressing leakage. It can still
process signals whose bandwidth can be as high as half of the filter’s maximum
operating frequency, 1.03 kHz. The proposed logic improves leakage-delay-product
by 200X over the prior art.
4.1 Motivation
A nanowatt power system is one of the key building blocks for emerging technologies
such as the Internet of Things. The power budget required by such systems is ex-
tremely limited by miniaturized energy sources to nanowatt or even less. To achieve
these power levels, existing systems employ duty-cycling. They typically perform
a short task (triggered by an event or imposed by an application), and then place
part of their building blocks in the sleep mode [25]. After this, the total power con-
sumption becomes dominated by the leakage of blocks that cannot be shut down, i.e.
always-on circuits such as signal acquisition front-ends, data retentive memories, and
power management [63]. For the always-on circuits, power consumption (especially
leakage) rather than energy per operation needs to be minimized as they cannot be
shut down and ideally need to consume as little as the duty-cycled blocks in the sleep
mode. Therefore, scaling the leakage of always-on circuits to a fraction of a nanowatt
is a crucial step towards sub-nanowatt and nanowatt systems.
Leakage consumption in CMOS digital logic circuits is significantly larger than
the nanowatt level. In the 0.18 µm process used in this work, which is known for low leakage, a
single minimum-sized inverter consumes around 10 pW of leakage at 0.5 V. Thus, a
module with 10,000 gates would consume approximately 100 nW. To achieve 100 pW
leakage for such a module (i.e., fW/gate), we need a 1,000X reduction in leakage.
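The leakage budget above is simple arithmetic, reproduced here as a short sketch. The inputs are the figures quoted in the text; the function and key names are hypothetical.

```python
def leakage_budget(per_gate_w=10e-12, gates=10_000, target_module_w=100e-12):
    """Compare module leakage in conventional CMOS with a target leakage budget."""
    module_w = per_gate_w * gates               # total leakage of the module
    target_per_gate_w = target_module_w / gates # allowed leakage per gate
    return {
        "module_nw": module_w * 1e9,
        "target_per_gate_fw": target_per_gate_w * 1e15,
        "required_reduction": module_w / target_module_w,
    }
```

With 10 pW per gate and 10,000 gates the module leaks about 100 nW; a 100 pW budget allows about 10 fW per gate, which is the 1,000X reduction stated above.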
Existing techniques such as transistor stacking and power-gating switches (PGS)
[64; 69] can reduce leakage power, but neither of them is suitable for always-on cir-
cuits. For example, PGS needs to make frequent sleep-in and -out transitions for
suppressing the leakage during the quiescent periods between active operations. This
incurs substantial power and delay overheads. Additionally, it requires sleep signals
(i.e., control signals necessary to put the target circuits in the leakage-suppression
mode) that should be generated by additional always-on circuits. It also requires
state-retention elements that add design complexity and pose a trade-off between
data retrieval delay and leakage savings. On the other hand, the stacking technique
becomes less effective at near- and sub-threshold as the amount of negative VGS scales
with supply voltage. For this reason, reducing an inverter's leakage by 1,000X at 0.5 V,
for example, requires stacking approximately 500 devices, which results in unrealistic
delay and area increases (Figure 4.1).
Several works recently proposed non-conventional logic circuits to further reduce
per-gate leakage. Some techniques rely on deep VDD scaling to the sub-100 mV
regime while improving robustness by adopting Schmitt-Trigger style circuits [65] and
adaptive β-ratio modulation [66]. However, the resulting per-gate leakage still remains
in the pW level. On the other hand, Lim et al. demonstrated a logic family based on
source biasing that can achieve fW/gate leakage [68]. However, it exhibits an extremely low
switching speed, with a reported maximum clock frequency of 6.6 Hz for a 32-bit embedded
processor at its leakage-optimal supply voltage, 0.55 V. This corresponds to one FO4-inverter
delay of 1.51 ms (estimated based on the assumption that the microprocessor has a
Figure 4.1 (a) Stacked inverter schematics. (b) Simulation results of leakage power,
(c) delay and area overheads across the number of stacks implemented.
critical path of 100 FO4-inverter delays). To improve speed, Lin et al. proposed a
logic that can transform between the aforementioned logic [68] and the conventional
CMOS but requires external sleep signals to enter the leakage-suppression mode [67].
In this chapter, we present a logic family referred to as Feedforward Leakage Self-
suppression Logic (FLSL). It achieves one FO4-inverter delay of 10.2 µs, 148X faster
than [68], at the same per-gate leakage level of 5.39 fW, without requiring explicit
sleep signals. The key idea of the FLSL is to employ a dual-rail style and thereby replace the
slow feedback paths that activate the double source biasing with fast feedforward ones
by using the signals available in the dual-rail logic.
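The FO4-delay estimate for [68] and the resulting speed-up can be reproduced as follows. The 100-FO4 critical path is the assumption stated in the text; the function name is hypothetical.

```python
def fo4_delay_from_clock(f_max_hz, fo4_per_cycle=100):
    """Estimate one FO4-inverter delay from a maximum clock frequency.

    Assumes the critical path equals fo4_per_cycle FO4-inverter delays.
    """
    return 1.0 / (f_max_hz * fo4_per_cycle)


prior_fo4_s = fo4_delay_from_clock(6.6)   # logic of [68] at 6.6 Hz
flsl_fo4_s = 10.2e-6                      # FLSL FO4-inverter delay
speedup = prior_fo4_s / flsl_fo4_s
```

This gives a delay of roughly 1.5 ms for [68] and a speed-up of about 148X, in line with the figures quoted above.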
With FLSL, we prototyped an always-on 16-bit 8-tap finite impulse response (FIR)
filter in a 0.18 µm CMOS process. The filter is intended for physical signal sensing systems. In
such systems, the sampled signals are typically sparse (i.e., predominantly constant
or slowly changing) and their frequency can vary with time in a burst fashion [70].
The filter leverages the dramatic leakage reduction capability of the FLSL for most of
the time, during long constant or slow changing periods, and thus consumes only 109
pW. It can still process a wide range of input signals whose bandwidth is as high as
half of the filter’s maximum clock frequency of 1.03 kHz. In contrast, the prior
ultra-low leakage logic [68] cannot be used for such purposes due to its prohibitively
low switching speed. The proposed FLSL improves Leakage-Delay Product (LDP =
Leakage · FO4-inverter delay), an important metric for always-on circuits, by about
200X, and Leakage-Delay-Area Product (LDAP = LDP · Area) by approximately
150X over [68].
4.2 Feedforward Leakage Self-suppression Logic
4.2.1 The FLSL Inverter
The existing leakage-suppression gates (Lim et al., Figure 4.2(a)) have two leakage-
suppression transistors MNT and MPB, controlled via the feedback path from input
A to output Y, and to the gates of MNT and MPB. When input A changes from 0 to
VDD, the gate tries to discharge output node Y through MNB and MPB. However,
this discharging is rather slow because the discharging current is limited by MPB and
MNB, both of which operate mostly in super-cutoff with negative-VSG or in the cutoff
with zero VSG. Specifically, MPB is initially in super-cutoff (N2 voltage < Y voltage)
and slowly reaches cutoff as N2 voltage becomes close to Y. MNB is also initially in
super-cutoff as A voltage is less than N2, then it reaches cutoff when both voltages
become the same and it turns on only after A voltage is larger than N2.
Figure 4.2(b) shows the proposed FLSL inverter, which avoids the aforementioned
problem. Our FLSL gate uses its own inputs to control all transistors, including
the leakage-suppression ones, i.e. NTL, NTR, PBL, and PBR. As input A (AN)
transitions from 0 (VDD) to VDD (0), PBL immediately moves from super-cutoff to
cutoff with VSG = 0 (AN voltage ≈ NBOTL voltage). The internal node NBOTL is
discharged through PBL and allows NBL to fully turn on. This makes the output Y
sharply switch. Furthermore, the feedforward path helps in suppressing leakage faster.
As A and AN directly turn off PTL and NTL, the node NTOPL quickly settles
at around VDD/2. This places both PTL and NTL in the super-cutoff region with
a large negative VGS, strongly suppressing leakage. Similarly, node NBOTR settles
at around VDD/2 and suppresses the leakage of the circuits on the right-hand side.
4.2.2 The FLSL Performance
Figure 4.3 summarizes the parasitic-annotated simulations of inverters in three digital
logic families. The FLSL inverter achieves the leakage power of about 1 fW and one
FO4-inverter delay of 3.10 µs at VLEAK = 0.85 V. It consumes more leakage power at
VDD higher or lower than VLEAK , the former because the PN junction diodes of the
bottom PMOSs start to turn on; and the latter because the magnitude of negative-
VGS for the double source biasing becomes small. The FLSL inverter achieves a record
LDP of 3.53 fW·µs, which is 1,982X smaller than the inverter in [68]. As compared
to the static CMOS inverter at 0.9 V, an FLSL inverter exhibits 1.72X better LDP.
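The LDP figures can be related to each other with a short calculation. The FLSL delay, the FLSL LDP, and the two ratios are the values quoted above; the implied leakage and the LDP values for the other two logic families are derived here and are not numbers reported in the text.

```python
FLSL_LDP_FW_US = 3.53     # reported FLSL leakage-delay product, in fW*us
FLSL_FO4_US = 3.10        # reported FLSL FO4-inverter delay at VLEAK, in us
RATIO_VS_PRIOR = 1982.0   # reported LDP advantage over the inverter in [68]
RATIO_VS_CMOS = 1.72      # reported LDP advantage over static CMOS at 0.9 V


def ldp(leakage_fw, fo4_delay_us):
    """Leakage-delay product: leakage power times FO4-inverter delay."""
    return leakage_fw * fo4_delay_us


implied_flsl_leak_fw = FLSL_LDP_FW_US / FLSL_FO4_US     # leakage implied by the LDP
implied_prior_ldp = FLSL_LDP_FW_US * RATIO_VS_PRIOR     # derived LDP of [68]
implied_cmos_ldp = FLSL_LDP_FW_US * RATIO_VS_CMOS       # derived LDP of static CMOS
```

The implied FLSL leakage is about 1.14 fW, consistent with the "about 1 fW" quoted above, and the derived LDP of the inverter in [68] is close to 7 fW·ms.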
Since the FLSL operates with a minimal amount of current, we design each gate
to ensure robustness. To achieve full-swing, we use low-Vt transistors and carefully
size the leakage-suppression devices (NTR, NTL, PBR, and PBL in Figure 4.2(b)) at
2–3 · WMIN and minimum length. Transistors that compose the logic function are
minimum-sized for minimal capacitances. Because PMOS is much weaker
Figure 4.2 (a) The existing leakage-suppression inverter (Lim et al.). (b) The proposed
FLSL inverter.
Figure 4.3 Parasitic-annotated simulations of inverters designed in FLSL, Lim et al.,
and static CMOS. (a) Leakage power vs. FO4-inverter delay. (b) FO4-inverter delay
vs. supply voltage. (c) Leakage-Delay Product vs. supply voltage.
than NMOS in this technology at near- and sub-threshold voltages, we balance PMOS
and NMOS strengths by biasing the bodies of PBL and PBR to VBIAS = VDD/2. The
ability to body-bias helps to achieve full swing output and approximately 2X switching
speed improvement. It can also be used as a powerful knob to mitigate the leakage
increase over temperature variation, e.g., in a closed-loop system with a temperature
sensor. Additionally, the FLSL operates at VLEAK = 0.85 V, which is much higher
than existing works relying on deep voltage scaling [65; 66]. This helps to improve
robustness. For each gate, we performed 100,000 Monte Carlo simulations accounting
for mismatch at each of the five process corners. Results that are shown in Figure
4.4 confirm that FLSL is fully functional, covering the 4.1-σ worst case and showing
similar or better variations than its static CMOS counterpart at 0.3 V (the operating
point at which the static CMOS is the lowest leakage and yet functional).
Figure 4.4 Monte Carlo results of the (a) FLSL and (b) static CMOS inverters
with mismatch across process corners.
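As a rough sanity check on what a 4.1-σ worst case means for 100,000 samples, the sketch below computes the one-sided Gaussian tail probability and the expected number of samples that fall beyond it. It assumes the monitored quantity is normally distributed, which the text does not state, and the function names are illustrative only.

```python
from statistics import NormalDist


def one_sided_tail(sigma: float) -> float:
    """Probability that a standard-normal sample exceeds +sigma."""
    return 1.0 - NormalDist().cdf(sigma)


def expected_outliers(n_samples: int, sigma: float) -> float:
    """Expected number of samples beyond +sigma in n_samples independent draws."""
    return n_samples * one_sided_tail(sigma)


if __name__ == "__main__":
    print(f"P(x > 4.1 sigma)            = {one_sided_tail(4.1):.3e}")
    print(f"expected outliers in 100,000 = {expected_outliers(100_000, 4.1):.2f}")
```

Under this Gaussian assumption, only a handful of the 100,000 samples per corner are expected to land beyond 4.1 σ, which is consistent with quoting a worst case of that order.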
Although each gate in the FLSL requires an average of 2.67X more transistors as
compared to [68], it incurs a moderate area penalty of 33 %, as illustrated in Figure
4.5. This is because the topological advantage in speed allows FLSL to avoid excessive
transistor up-sizing that otherwise would have been needed for compensating speed.
For instance, [68] must substantially upsize MNT and MPB, and also needs a gate-wise
isolated N-well for MPT in order to forward body-bias it (Figure 4.5(b)). On the
other hand, the area overhead of the FLSL as compared to the static CMOS logic is
4.63X, mainly caused by the dual rail design (Figure 4.5(c)).
Figure 4.5 Inverter layout in three different logic families. (a) Static CMOS. (b)
Lim et al. (c) FLSL. The overhead of the FLSL inverter as compared to the static
CMOS counterpart is relative to a minimum-size inverter of a commercial standard-
cell library.
4.2.3 No Sleep/Mode Signal Requirement
One of the key challenges to suppress the leakage of always-on circuits is to do so
without external signals necessary to put the target block in the leakage-suppression
mode (e.g., sleep signals). The reason is that it requires additional always-on circuits
to generate and distribute these sleep signals.
Recently published work [67] aims to improve the speed of Lim et al. and proposes
a dual-mode logic in which gates can switch between the conventional CMOS in the
normal mode and the same structure as [68] in the low leakage mode. While such a
scheme can enable faster switching speeds, the main challenge in applying this logic
to sub-/nanowatt always-on circuits is that each gate requires external sleep signals
to exercise the low leakage mode. The overhead of generating and distributing these
sleep signals to all the dual-mode gates can be quite significant. It requires additional
always-on circuits to generate the sleep signals, which cannot be implemented in
the existing ultra-low leakage logic such as [68] due to its slow switching speed. It
also demands large always-on buffers to drive the high fan-out sleep signals and
interconnect, which are inputs for each gate in the target block. Both the always-on
extra circuits and buffers have significant leakage and contribute to offsetting the
leakage reduction in the low leakage mode.
On the other hand, the proposed FLSL gates do not require any external sleep signals
to enter the leakage-suppression mode. As a result, they need no sleep circuits
or large buffers as in [67]. FLSL offers a compelling speed advantage over [68] at
the cost of a moderate area overhead, allowing for a wider range of applications and
emerging use cases.
4.2.4 The Impact of Technology Scaling on FLSL
We also investigated the impact of technology scaling on the FLSL logic family. We
designed 50-stage FO4-inverter chains as benchmark circuits and performed SPICE
simulations in three processes, namely 0.18 µm, 65 nm low power (LP), and 28 nm
high performance (HP). In all benchmarks, we used the minimum strength inverters
(INVD0). For simplicity, we did not adopt techniques such as reverse/forward body
biasing for both the FLSL inverters and the static CMOS counterparts.
Figure 4.6 shows the leakage power per gate against the FO4-inverter delay of the
static CMOS and FLSL inverters. We swept the supply voltage from 0.9 V to 0.2
V and simulated the delay and leakage. The lowest leakage power per gate occurs
for the FLSL inverter designed in a 0.18 µm CMOS. Here we used the low-threshold
voltage devices. As compared to its static CMOS counterpart in the same technology,
the leakage suppression amounts to 3 to 4 orders of magnitude.
In the 65 nm LP with low-threshold voltage devices, we found that the leakage
scaling provided by the proposed FLSL becomes limited due to the gate-tunneling
leakage. The gate leakage does not scale as the sub-threshold leakage with the double
source biasing utilized by the FLSL inverter and ends up limiting the overall leakage
savings of the FLSL. Despite the tunneling leakage, the FLSL in a 65 nm LP can
still achieve the per-gate leakage power of 62.7 fW, which is about 1 to 2 orders of
magnitude lower than the static CMOS counterpart in the same technology and the
same threshold voltage devices.
In a 28 nm HP, the FLSL inverter again presents significant leakage reduction
as compared to the static CMOS inverter design in the same technology node. This
is mainly because the high-K gate stack, along with the lower supply voltage, can
substantially reduce the gate leakage. However, the absolute leakage per gate is in the pW
range (9.15 pW at VLEAK), which is two to three orders of magnitude larger than
that of the 0.18 µm FLSL inverter. In return, it offers more than two orders of magnitude
faster switching speed. Specifically, it achieves an FO4-inverter delay of 70 ns at its
VLEAK. With this switching speed, a 16-bit FIR filter in the 28 nm process could
achieve 500 kHz clock frequency, which is sufficient for processing a wide range of
signals.
Figure 4.6 FO4-inverter delay vs. per-inverter leakage power at different supply
voltages and across technology nodes.
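The 28 nm figures quoted above can be combined in two simple ways: the leakage-delay product (leakage power times FO4 delay) and the number of FO4 delays that fit in one clock period. The sketch below does only this arithmetic; the helper names are illustrative, and the implied logic depth is a derived estimate rather than a number reported in the text.

```python
def leakage_delay_product_fw_us(leakage_w: float, delay_s: float) -> float:
    """Leakage-delay product expressed in fW*us (1 fW*us = 1e-21 J)."""
    return (leakage_w * delay_s) / 1e-21


def fo4_delays_per_cycle(clock_hz: float, fo4_delay_s: float) -> float:
    """How many FO4-inverter delays fit in one clock period."""
    return 1.0 / (clock_hz * fo4_delay_s)


# Values quoted in the text for the 28 nm HP FLSL inverter at VLEAK.
ldp_28nm = leakage_delay_product_fw_us(9.15e-12, 70e-9)
depth_28nm = fo4_delays_per_cycle(500e3, 70e-9)

if __name__ == "__main__":
    print(f"28 nm LDP          : {ldp_28nm:.1f} fW*us")
    print(f"FO4 delays / cycle : {depth_28nm:.1f}")
```

The same helpers can be applied to the 0.18 µm operating point to compare nodes on equal terms.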
4.3 Chip Prototype
We developed an FLSL standard-cell library comprised of 13 logic gates. Three of
them are shown in Figure 4.2 and Figure 4.7. The NAND-2 gate (Figure 4.7(b))
consists of the kernel logic and the leakage suppression devices. The complementary
circuit is devised by using a NOR-2 and the complement of the NAND-2 inputs.
All transistors are low-Vt. PBBL, PBTL, PBBR, and PBTR are sized at 3 · WMIN and
LMIN, and forward body-biased to VBIAS < VDD to compensate for the weak PMOS strength
at low voltage. The remaining devices are minimum-size to reduce
internal capacitances. As compared to [68], the FLSL NAND-2 achieves 500X shorter
delay and the same fW-level leakage at the expense of 2.67X more transistors. The
improvement in LDP is about 1,180X (Figure 4.7(a)).
Figure 4.7(c) shows the schematic of a transparent-high FLSL latch. It consists
of five FLSL inverters (the same design discussed in Section 4.2.1) and four low-Vt
transmission gates. The internal write buffer requires only one FLSL inverter because
its outputs provide the write enable signal and its complement (WE, WE N). The
FLSL latch consumes 7.5 fW of leakage at VDD = 0.8 V and VBIAS = 0.4 V. All the
other FLSL gates (e.g., OR, NOR, XOR, and XNOR) are implemented in a similar
fashion and follow the same considerations as the FLSL inverter, NAND-2, and
latch. The D flip-flop cascades two FLSL latches.
Using the proposed library, we prototyped an always-on 16-bit 8-tap FIR filter in a
0.18 µm CMOS process. The filter consists of 20,150 logic gates and occupies 1.32 mm²;
a chip microphotograph is shown in Figure 4.8(a). We used conventional logic synthesis and
automatic placement and routing CAD tools with custom modifications to support
dual-rail logic. The architecture of the filter, shown in Figure 4.8(b), follows the
direct form I (DF-I) with a two-stage pipeline. To save clock power, it includes a steady-state
detector that gates the clock to the filter if eight consecutive inputs to the filter
are identical. This signal can come from a level-crossing analog-to-digital converter
(ADC) [70].
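The behavior of the steady-state detector described above can be illustrated with a small behavioral model. This is a sketch, not the fabricated circuit: the class name, the sample-by-sample interface, and the choice to gate starting on the eighth identical sample are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List


class SteadyStateDetector:
    """Behavioral model: gate the clock once `window` consecutive inputs match."""

    def __init__(self, window: int = 8) -> None:
        self.window = window
        self.history: Deque[int] = deque(maxlen=window)

    def clock_enable(self, sample: int) -> bool:
        """Record one input sample; return False when the clock may be gated."""
        self.history.append(sample)
        window_full = len(self.history) == self.window
        all_equal = len(set(self.history)) == 1
        return not (window_full and all_equal)


def clock_enables(samples: Iterable[int]) -> List[bool]:
    """Clock-enable trace for a stream of input samples."""
    detector = SteadyStateDetector()
    return [detector.clock_enable(s) for s in samples]


if __name__ == "__main__":
    # Ten identical samples followed by a change: the clock is gated while
    # the last eight samples are identical and re-enabled on the change.
    print(clock_enables([5] * 10 + [6, 6]))
```

With an 8-tap filter, eight identical inputs mean every tap holds the same value, so the output cannot change and the clock can be withheld until a new value arrives.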
In functional tests and measurements, we use a temperature chamber for environmental
temperature control and a pico-ammeter for accurate measurements of very
low and sensitive currents. As shown in Figure 4.9(a), the filter achieves a leakage
dissipation of 109 pW (5.39 fW/gate) at VLEAK = 0.85 V and VBIAS = 0.5 · VLEAK.
With the same settings, it can operate at a maximum clock frequency of 1.03 kHz
Figure 4.7 (a) LDP comparison of an FLSL, a static CMOS and the prior art low
leakage NAND-2 (Lim et al.) (b) FLSL NAND-2. (c) FLSL transparent-high latch.
Figure 4.8 (a) Chip microphotograph showing an active area of 1.32 mm² in a 0.18
µm CMOS process. (b) The FIR architecture and the steady-state detector circuits.
Figure 4.9 (a) The measured leakage power vs. supply voltage of the core circuits.
(b) The clock frequency vs. supply voltage measurement for three different VBIAS
voltages, showing the filter operates at 1.03 kHz for VDD = VLEAK = 0.85 V and
VBIAS = 0.5 · VLEAK. (c) Measurement results of the total power of the filter running
at the leakage-optimal operating point (power is proportional to input activity). (d)
The temperature effect on the clock and the simulated effect on leakage power.
(Figure 4.9(b)). The power consumed by the filter is proportional to its input activity
and is dominated by leakage at low activity rates, as shown in Figure 4.9(c). Figure 4.9(d)
summarizes the leakage power and the maximum clock frequency of the filter across
temperatures.
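The per-gate figures follow from dividing the measured totals by the gate count. The sketch below repeats that arithmetic with the rounded totals quoted in the text; the helper name is illustrative, and the area per gate is a derived average (it lumps in routing and any non-gate area), not a value reported by the measurements.

```python
GATES = 20_150            # logic gates in the FIR prototype
LEAKAGE_W = 109e-12       # measured leakage at VLEAK = 0.85 V
AREA_UM2 = 1.32e6         # 1.32 mm^2 expressed in um^2


def per_gate(total: float, gates: int = GATES) -> float:
    """Average of a chip-level quantity over all logic gates."""
    return total / gates


leak_fw_per_gate = per_gate(LEAKAGE_W) / 1e-15
area_um2_per_gate = per_gate(AREA_UM2)

if __name__ == "__main__":
    print(f"leakage per gate: {leak_fw_per_gate:.2f} fW")
    print(f"area per gate   : {area_um2_per_gate:.1f} um^2")
```

Using the rounded 109 pW total gives about 5.4 fW per gate, in line with the reported 5.39 fW/gate; the small difference is presumably due to rounding of the total.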
Lastly, Figure 4.10 compares our prototype with the prior art.
The proposed FLSL improves LDP by 200X and LDAP by 150X over the state-of-the-art
logic family for always-on circuits [68] at a 33 % area overhead. For the sake of
comparison, the projected area of a 32-bit Cortex-M0 fully implemented with FLSL
is 2.56 mm².
4.4 Conclusion
In this chapter, we proposed a novel family of logic gates intended for ultra-low leakage
always-on circuits. Requiring no sleep/mode signals to enter the low-leakage mode,
the proposed cells can achieve 5 fW per-gate leakage power and yet present an
FO4-inverter delay of 10.2 µs. These numbers represent a speed improvement of 148X
(while maintaining the same leakage power level) over the state-of-the-art ultra-low
leakage family in [68].
To demonstrate the effectiveness of the proposed logic family, we prototyped an
FIR filter in a 0.18 µm CMOS process, intended for systems that sense physical signals, in which
signals are sparse (mostly constant or slowly changing over time). Measurement
results of the prototype showed the entire filter consumes only 109 pW leakage power
and can operate at a clock frequency of 1.03 kHz at its leakage-optimal supply voltage.
These results indicate FLSL improves the leakage-delay product and leakage-delay-
area product metrics by 200X and 150X, respectively, over the prior art [68].
CHAPTER 5. CATENA: A 0.5 V SUB-0.4 MW PROGRAMMABLE SPATIAL
ARRAY ACCELERATOR FOR MOBILE AND EMBEDDED COMPUTING 92
Chapter 5
Catena: A 0.5 V Sub-0.4 mW
Programmable Spatial Array
Accelerator for Mobile and
Embedded Computing
In this chapter, we present Catena, a near-threshold, sub-0.4-mW, programmable 16-
core spatial array accelerator supporting workloads for ULP mobile and embedded
domains.
Deeply scaling the supply voltage of such a massively parallel device can save energy;
on its own, however, it yields limited savings, as it magnifies the energy waste of idle
and underutilized hardware in proportion to the total energy consumption. Therefore,
we design Catena with novel circuit and architecture techniques to minimize such
energy waste. Thanks to the proposed techniques, the 65 nm low-power (LP) CMOS
prototype achieves state-of-the-art energy-efficiencies across multiple workloads and
utilization rates.
5.1 Motivation
A spatial array accelerator is a massively parallel architecture that has been histor-
ically utilized in high-performance computing (HPC) systems [71; 72]. As a fully
programmable device, such an architecture also becomes very compelling for emerg-
ing mobile and embedded applications that can tolerate lower frequency of operation
as it can efficiently perform various workloads – for example, deep neural networks
(DNN), cryptography, and digital signal processing (DSP) workloads. Additionally,
one can trade the excess throughput of massive parallelism for energy-efficiency
via voltage scaling. This has been explored by prior works that attempt to
utilize near-threshold voltage (NTV) computing [71; 75].
NTV operation alone, however, is only marginally effective for energy-efficient
operation in massively parallel systems because it magnifies the energy waste of idle and
underutilized hardware in proportion to the total energy consumption. Essentially, any
idle hardware contributes to this overhead: an unused core, a temporarily idle core
that is waiting on incoming data from the network, an idle unit within the core, and
un-accessed bitcells and idle peripherals in the SRAMs. An additional challenge is
the energy consumption and waste of always-on modules such as on-chip networks
(NoCs), control circuits, and shared cache, which increases with runtime (i.e., cycle
count). Therefore, it becomes crucial for improved energy-efficiency to reduce the
energy waste of idle and underutilized hardware, increase instructions per cycle (IPC),
and minimize the cycle counts per workload.
We design Catena’s circuit and architecture to address both of these objectives.
To minimize the energy waste of idle and underutilized hardware, the circuits employ
both spatially and temporally fine-grained clock- and power-gating (CG, PG) – this
technique enables cycle-by-cycle transitions between regular and energy-saving modes
for improved energy-efficiency. Catena’s SRAMs employ cycle-by-cycle and word-by-
word voltage-boosting (VB) to minimize the leakage power without compromising
access times. Architecturally, we employ sub-word vectorization using a byte vector
unit (BVU) and further reduce dynamic instruction counts by embedding branches
within datapath instructions in our fully custom instruction set architecture (ISA),
which markedly improves IPC and reduces cycle count across various workloads and
hardware utilization rates to minimize the energy waste of always-on circuits.
5.2 Catena’s Circuit and Architecture Design
Catena’s high-level architecture is shown in Figure 5.1. It includes a spatial array
architecture with 16 autonomous, homogeneous, programmable, single-issue compute
cores; an 8 kB storage, 8-port, 8-way set associative, word-interleaved, pseudo-LRU
ULP L1 cache; and three on-chip networks – the spatial network (shown in red in
Figure 5.1) for data transport between the sixteen compute cores, the cache network
(shown in blue in Figure 5.1) for communication between the cores and the cache
through the crossbar switch, and the MMIO network for configuration and debug.
Catena’s cores are arranged in a 4-by-4 matrix. The cache is implemented as a single
module and physically located next to the core ensemble. We design the whole system,
including the SRAMs and the NoCs, to operate in the ultra-low voltage (ULV) domain
at the same core voltage as the spatial array.
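The text specifies a pseudo-LRU replacement policy for the 8-way cache but does not say which variant is used. The sketch below shows tree-based pseudo-LRU for one 8-way set, a common choice, purely as an assumption; the class and method names are illustrative.

```python
class TreePLRU8:
    """Tree pseudo-LRU state for one 8-way set (7 bits, one per tree node).

    Node i has children 2*i + 1 (left) and 2*i + 2 (right). A bit value of 0
    means the next victim lies in the left subtree, 1 means the right subtree.
    """

    def __init__(self) -> None:
        self.bits = [0] * 7

    def touch(self, way: int) -> None:
        """Update the tree on an access so that it points away from `way`."""
        node = 0
        for level in (2, 1, 0):
            went_right = (way >> level) & 1
            self.bits[node] = 1 - went_right
            node = 2 * node + 1 + went_right

    def victim(self) -> int:
        """Follow the tree bits from the root to pick the way to replace."""
        node, way = 0, 0
        for _ in range(3):
            go_right = self.bits[node]
            way = (way << 1) | go_right
            node = 2 * node + 1 + go_right
        return way


if __name__ == "__main__":
    plru = TreePLRU8()
    for w in range(8):
        plru.touch(w)
    print("victim after touching ways 0..7:", plru.victim())
```

Tree pseudo-LRU needs only 7 bits per set instead of a full ordering of 8 ways, which is the usual reason for preferring it in small, low-power caches.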
Catena was designed to be embedded in a larger system-on-chip (SoC), which
in our system prototype is emulated by a 128 kB on-chip SRAM and an off-chip
Zynq-SoC FPGA, also shown in Figure 5.1.
Figure 5.1 Catena’s high-level architecture featuring the programmable spatial array
of 16 cores, an 8 kB ULP L1 cache, and three on-chip networks utilized for communi-
cation between the cores, between the cores and the cache, and for configuration and
debug.
Figure 5.2 shows the microarchitecture of Catena’s cores, which comprises two major modules:
the processing element (PE) and the router (R). Each core implements a three-stage
pipeline with a datapath that includes five functional units, i.e., the one-stage arithmetic
logic unit (OSALU), the two-stage arithmetic logic unit (TSALU), the integer
multiplication unit (IMU), the byte vector unit (BVU), and the 0.5 kB private scratchpad
(SM). A core also employs a 32-instruction memory, a thread control module,
multiple program counters (PC), a scheduler, an instruction decoder, a register file
(RF), and a pair of four first-in first-out (FIFO) interconnect networks for data
input and output.
The cores implement a 32-bit integer ISA of our own, which is designed to be
inherently general-purpose. Catena’s cores also employ a novel control mechanism in
which individual PEs can host many hardware contexts with support for zero-overhead
context switches enabled by the thread control, PCs, and shared RF modules. The
eligibility of individual threads within a core is determined by the scheduler based on
the architectural state in a manner reminiscent of the triggered instructions control
paradigm [76; 77; 78]. The proposed multithreaded scheme allows for avoiding data
hazards and keeping the pipeline full by interleaving multiple threads.
Figure 5.2 The core microarchitecture including the processing element and the router.
Catena consumes 356.8 µW at 1.56 MHz performing General Matrix Multiply
(GEMM) with all its 16 cores. It requires 7.7k cycles (2.31 µJ) per 256-point Fast
Fourier Transform (FFT) and 14.4k cycles (6.95 µJ) per 8-bit deep neural network
inference. For FFT, Catena achieves at least 2.7X better energy efficiency than the
prior state-of-the-art [71].
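The power, frequency, cycle-count, and energy figures above are linked by simple relations: energy per cycle is power divided by clock frequency, and the average energy per cycle of a task is its total energy divided by its cycle count. The sketch below applies these relations to the quoted numbers; the helper names are illustrative, and the per-task averages are derived values that assume the quoted energy covers exactly the quoted cycles.

```python
def energy_per_cycle_pj(power_w: float, clock_hz: float) -> float:
    """Energy per clock cycle in pJ, from average power and clock frequency."""
    return power_w / clock_hz / 1e-12


def average_pj_per_cycle(task_energy_j: float, cycles: int) -> float:
    """Average energy per cycle in pJ over a whole task."""
    return task_energy_j / cycles / 1e-12


if __name__ == "__main__":
    print(f"GEMM, 16 cores : {energy_per_cycle_pj(356.8e-6, 1.56e6):.1f} pJ/cycle")
    print(f"256-point FFT  : {average_pj_per_cycle(2.31e-6, 7_700):.1f} pJ/cycle")
    print(f"8-bit DNN      : {average_pj_per_cycle(6.95e-6, 14_400):.1f} pJ/cycle")
```

The GEMM result lands close to the 227 pJ/cycle reported in the measurement section, with the small difference attributable to rounding of the quoted frequency.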
5.2.1 Spatiotemporally Fine-Grained Clock-Gating
In Catena, clock-gating is an essential technique to improve energy-efficiency. The dif-
ferent applications mapped onto the spatial array can require cores (PE and routers),
functional units within cores, cache, and NoCs to operate at low switching activity
rates or occasionally remain idle. Such hardware contributes to the energy overhead
and impacts energy-efficiency. Therefore, it becomes crucial to adaptively manage
the switching energy waste associated with the clock and its sinks across multiple
workloads.
Figure 5.3 Proposed statically-configured and dynamically-configured clock-gating
scheme implemented in Catena.
In our prototype, each core employs statically (program-)configured CG for the
PE and the router, as shown in Figure 5.3. Within a PE, dynamically-configured
fine-grained CG is designed for individual modules such as the register file (RF), each of
the functional units in the datapath, and output NoC-FIFOs. All the control signals
for CG and PG are determined and generated by the scheduler in the first stage of
the core microarchitecture, the schedule & decode stage (Figure 5.2). The cache uses
a similar dynamically-configured fine-grained CG scheme, which can gate the clock of
any of the eight banks and all metadata storage (Figure 5.3).
5.2.2 Spatiotemporally Fine-Grained Power-Gating
In Catena, we also implement spatiotemporally fine-grained power-gating (PG) in
order to reduce the leakage energy of underutilized hardware and improve
energy-efficiency. Because the dynamic energy is minimized by our proposed clock-gating scheme
(explained in the previous subsection) and by NTV operation, the leakage
energy accounts for a significant portion of the system’s total energy/cycle due to
increased cycle times. The leakage contribution grows rapidly with higher
temperature (an increase in temperature lowers the transistor Vt,
which increases leakage exponentially) and with increased parallelism.
Unlike CG, PG is difficult to use in a temporally fine-grained manner since gating
and un-gating the supply voltage requires charging and discharging the target block’s
internal signal nets and virtual power rails, resulting in large delay and energy
penalties. Thus, the use of PG has been limited to long periods of inactivity, 1-100 million
cycles [35], so that the energy balance at least breaks even. In Catena, however, we seek
to suppress leakage even for short intervals of idleness.
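The break-even argument can be made concrete: sleeping pays off once the leakage energy saved over the idle cycles exceeds the energy spent entering and leaving sleep. The sketch below computes the minimum number of idle cycles under that simple model. All numeric values in it are hypothetical placeholders chosen for illustration, not measurements from this work.

```python
import math


def break_even_cycles(transition_energy_j: float,
                      leakage_saved_w: float,
                      clock_hz: float) -> int:
    """Smallest number of idle cycles for which power-gating saves net energy.

    Net saving = n * (leakage_saved_w / clock_hz) - transition_energy_j > 0.
    """
    saved_per_cycle_j = leakage_saved_w / clock_hz
    return math.floor(transition_energy_j / saved_per_cycle_j) + 1


if __name__ == "__main__":
    # Hypothetical numbers: 10 uW of leakage saved while asleep at 1.56 MHz.
    coarse = break_even_cycles(10e-6, 10e-6, 1.56e6)   # large shared virtual rails
    fine = break_even_cycles(10e-12, 10e-6, 1.56e6)    # small per-cell rails
    print("coarse-grained PG break-even cycles:", coarse)
    print("fine-grained PG break-even cycles  :", fine)
```

The model shows why shrinking the transition energy, rather than the leakage itself, is what moves the break-even point from millions of cycles down to a few.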
To this end, we develop an in-cell zig-zag PG technique based on prior art [36;
37]. Our technique can power-off circuits even during brief idleness and still achieve
net energy savings. Figure 5.4 shows a simple combinational logic module with the
proposed PG. We create two versions of each gate in a standard-cell library, one with
Figure 5.4 Proposed spatiotemporally fine-grained in-cell zigzag PG. Sleep mode can
be exercised for short periods of idleness so as to maximize the time circuits remain
in the low leakage mode to reduce the leakage energy dissipation.
an NMOS footer and the other with a PMOS header (multi-output gates such as
full adders and flip-flops have more than two versions). Then, we define the input
forcing vector (here, LOW), which is applied at runtime prior to entering sleep
mode. Given this vector, at design time, the gates whose outputs evaluate to HIGH
are mapped to the NMOS footer versions, and the gates whose outputs evaluate to LOW
are mapped to the PMOS header versions. With this mapping, all the signal nets keep their
states during the mode transition, reducing its delay and energy penalty. After this, dis-/charging the
virtual rails remains the main source of the mode transition penalty. To minimize it,
we integrate a header/footer in each cell to eliminate the large capacitance of shared
virtual rails. We apply this dynamic PG technique to all five independent functional
units within the PE core.
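The design-time mapping step described above can be sketched as a single pass over a combinational netlist: propagate the input forcing vector, then pick the footer or header version of each gate from its output value. The netlist format, gate set, and cell-variant names below are assumptions made for illustration and do not reflect the actual CAD flow.

```python
from typing import Callable, Dict, List, Tuple

# Boolean functions of a few example gate types (hypothetical library subset).
GATE_FUNCS: Dict[str, Callable[[List[bool]], bool]] = {
    "INV": lambda a: not a[0],
    "NAND2": lambda a: not (a[0] and a[1]),
    "NOR2": lambda a: not (a[0] or a[1]),
}

# Each entry: (instance name, gate type, input nets, output net),
# listed in topological order.
Netlist = List[Tuple[str, str, List[str], str]]


def assign_sleep_cells(netlist: Netlist,
                       forcing_vector: Dict[str, bool]) -> Dict[str, str]:
    """Map each gate to a footer or header variant under the forcing vector."""
    nets = dict(forcing_vector)
    mapping: Dict[str, str] = {}
    for name, kind, inputs, output in netlist:
        value = GATE_FUNCS[kind]([nets[i] for i in inputs])
        nets[output] = value
        # Output HIGH: the pull-up holds the state, so cut the ground path.
        # Output LOW: the pull-down holds the state, so cut the supply path.
        mapping[name] = "NMOS_FOOTER" if value else "PMOS_HEADER"
    return mapping


if __name__ == "__main__":
    example: Netlist = [
        ("g1", "NAND2", ["a", "b"], "n1"),
        ("g2", "INV", ["n1"], "n2"),
        ("g3", "NOR2", ["n2", "a"], "n3"),
    ]
    print(assign_sleep_cells(example, {"a": False, "b": False}))
```

Because every gate keeps the network that drives its sleep-state output connected, no internal net has to change value when the block enters or leaves sleep mode.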
5.2.3 Ultra-Low Power L1 Cache and Scratchpad Design
The 16 kB of SRAM, split between the cache and the core-distributed scratchpads, accounts
for a significant portion of the energy waste that remains after our clock- and power-gating
schemes, and it must be minimized. To that end, we design Catena’s cache and
scratchpad memories utilizing a fully custom SRAM macro engineered for ultra-low
leakage consumption.
In our SRAM macro, shown in Figure 5.5, the proposed PG is applied to peripher-
als. To reduce the energy waste of bitcells, we propose fine-grained voltage boosting
(VB). In our 2 kb macro, a retention supply voltage of 0.25 V (VDDL) powers the
bitcells by default. On a read request, it boosts the supply only of the bitcells in
the accessed row from 0.25 V to 0.7 V (VDDH) for a single cycle only, to meet the
target access time. On a write operation, it boosts the write-wordline (WWL) of the
accessed row to VDDH.
Figure 5.5 Ultra-low leakage SRAM macro design. Fine-grained VB is proposed to
speed up read/write accesses and reduce energy waste.
Thus, nearly all bitcells remain at VDDL and most peripherals remain power-gated,
making the leakage of our 2 kb macro 22 nW (max. power per read/write access per
cycle is 171 nW/MHz). One SRAM macro can achieve net energy savings over clock
gating if it is shut down for only 3 cycles (NSLP,min = 3), as shown in Figure 5.6.
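The macro-level figures can be scaled to the full 16 kB of SRAM with a few lines of arithmetic. The sketch below assumes that "2 kb" means 2,048 bits, that all SRAM is built from identical macros, and that the 171 nW/MHz figure is an access energy of 171 fJ that adds to the leakage floor; none of these assumptions is confirmed by the text, and the helper names are illustrative.

```python
MACRO_BITS = 2 * 1024             # assumed: 2 kb = 2,048 bits
MACRO_LEAK_W = 22e-9              # leakage of one macro
ACCESS_ENERGY_J = 171e-9 / 1e6    # 171 nW/MHz is equivalent to 171 fJ per access


def macros_needed(capacity_bytes: int) -> int:
    """Number of 2 kb macros required for a given capacity."""
    return (capacity_bytes * 8) // MACRO_BITS


def macro_power_w(clock_hz: float, access_rate: float) -> float:
    """Leakage plus access power of one macro.

    access_rate is the fraction of cycles with a read or write (0.0 to 1.0).
    """
    return MACRO_LEAK_W + access_rate * clock_hz * ACCESS_ENERGY_J


n_macros = macros_needed(16 * 1024)
total_leak_w = n_macros * MACRO_LEAK_W

if __name__ == "__main__":
    print("macros for 16 kB      :", n_macros)
    print(f"total SRAM leakage    : {total_leak_w * 1e6:.3f} uW")
    print(f"one macro at 1.56 MHz : {macro_power_w(1.56e6, 1.0) * 1e9:.1f} nW")
```

Under these assumptions the SRAM leakage floor stays in the low-microwatt range, small relative to the sub-0.4 mW system budget.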
5.2.4 Architecture-Level Techniques for Efficient Computation and Communication
In terms of architecture, we aim to reduce cycle count per workload to minimize the
energy waste of the always-on hardware. To accomplish this, we equipped each core with
a BVU, which can reduce cycle counts for low-precision workloads. For instance, in a
Figure 5.6 Energy savings of our proposed cell-embedded zigzag PG applied to the
SRAM macro over the number of cycles in the sleep mode.
neural network inference task, the BVU efficiently reduces the cycle count from 103 k
to 14.4 k, as shown in Figure 5.7. In addition, we embed branches within arithmetic
and logic instructions so that the PE executes them in the same cycle. As shown in
Figure 5.7, this reduces the cycle counts of workloads with many branches such as
AES-128 and MNIST by 25 % and 28 %, respectively. Furthermore, to make data
exchange between cores more efficient, we designed the core to perform decoupled
loads to conventional registers or directly out onto the spatial NoC and to manage
inter-core load-store ordering via explicit memory barriers. These techniques, coupled
with our multithreaded scheme, contribute to achieving per-core IPCs of 0.49-0.88,
which is 2-4X greater than the prior art [71; 74].
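Sub-word vectorization can be illustrated by packing four 8-bit operands into one 32-bit word and processing all four lanes in a single operation. The sketch below models a byte-wise multiply-accumulate in this style. The operation, its name, and the four-lane unsigned format are assumptions made for illustration; the actual BVU instructions are not specified in this section.

```python
from typing import List


def pack_bytes(values: List[int]) -> int:
    """Pack four unsigned 8-bit values into one 32-bit word (lane 0 = low byte)."""
    if len(values) != 4 or any(not 0 <= v <= 0xFF for v in values):
        raise ValueError("expected four values in the range 0..255")
    word = 0
    for lane, value in enumerate(values):
        word |= value << (8 * lane)
    return word


def lanes(word: int) -> List[int]:
    """Unpack a 32-bit word into its four unsigned byte lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def byte_dot(word_a: int, word_b: int, acc: int = 0) -> int:
    """Hypothetical vector op: multiply four byte lanes pairwise and accumulate."""
    return acc + sum(a * b for a, b in zip(lanes(word_a), lanes(word_b)))


def vector_ops(n_elements: int) -> int:
    """Vector operations needed for an n-element dot product (4 lanes per op)."""
    return -(-n_elements // 4)


if __name__ == "__main__":
    a = pack_bytes([1, 2, 3, 4])
    b = pack_bytes([5, 6, 7, 8])
    print("dot product of one packed pair:", byte_dot(a, b))
    print("vector ops for 120 elements   :", vector_ops(120))
```

A scalar datapath would issue one multiply-accumulate per element, so a four-lane operation of this kind cuts the arithmetic instruction count for 8-bit workloads by up to a factor of four before any further ISA-level savings.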
Figure 5.7 Normalized runtime (cycle/cycle) improvement due to architecture-level
techniques deployed in Catena across multiple workloads.
5.3 Chip Prototype and Testing Setup
We design and fabricate Catena in a 65 nm LP CMOS, a process technology that we
choose for its compelling balance between dynamic and static power dissipation.
Figure 5.8 shows a microphotograph of the silicon prototype. Catena occupies
6.50 mm² of active area. A single core measures 500 µm × 500 µm, which
is similar to the compute core of the prior art [71] after technology node normalization. There
are 105 on-chip pads distributed on the left, right, and bottom edges of the prototype.
For packaging, we utilize a low-profile quad flat package (LQFP) with 144 leads. The
supply voltage range goes from 0.375 V to 1.15 V. The clock frequency ranges from
100 kHz to 60 MHz. The chip includes 16 kB ULP SRAM split between cache and
the core-distributed scratchpads.
Figure 5.8 Microphotograph of the silicon prototype in a 65 nm LP CMOS.
5.4 Measurements and Comparison with Prior Art
Figure 5.9 shows Catena performing General Matrix Multiply (GEMM) with 8 and
16 cores at 25 °C and 50 °C. It consumes 227 pJ/cycle (356.8 µW power consumption,
sub-0.4-mW) at around 1.6 MHz and 0.54 V (target design point) when utilizing all
of its 16 cores at 25 °C (Figure 5.9(a)). Catena consumes only 129.0 pJ at 0.42 V and
200 kHz when utilizing half of its cores at a temperature of 25 °C (Figure 5.9(b)). In
both cases, we see that the increase in temperature raises the energy at the MEP
as a consequence of the greater leakage power consumption. Nevertheless, in Catena,
this energy overhead is mitigated by the circuit-level techniques, which enable energy
savings of 31.3 % to 47.5 % at the MEP as compared to the baseline
implementation with the circuit-level techniques disabled.
In Figure 5.10, we show results of measurements that corroborate the effectiveness
Figure 5.9 Energy/cycle vs. clock frequency when performing GEMM. Two different
conditions are shown – (a) 16 cores running at 25 °C and 50 °C; and (b) 8 cores
running at 25 °C and 50 °C.
of the circuit-level techniques employed in Catena. The proposed CG, PG, and VB
reduce energy/cycle by 2.42X at a 20 % hardware utilization rate and by 25 % at 90 %.
Figure 5.10 Energy/cycle vs. utilization rate. The hardware usage is modulated by
mapping multiple workloads onto Catena’s spatial array. The circuit-level techniques
reduce the energy waste of underutilized hardware, which improves energy-efficiency
by 2.42X at 20 %. At higher utilization rates of 70 % to 90 % the proposed circuit
techniques still enable significant savings of 35 % and 25 %, respectively.
We map multiple workloads on Catena’s spatial array such as FFT, AES-128, FIR,
and MNIST – Figure 5.11 shows the core and network mappings for (a) MNIST and
(b) FFT. For MNIST, we use a multi-layer perceptron with three 120-neuron hidden
layers and 8-bit weights, which results in 6.95 µJ per prediction. This energy-efficiency is
in the same ballpark as that of a recent DNN ASIC accelerator [73], which consumes 1.32
µJ per prediction (after normalization to a 65 nm process).
Figure 5.12 compares our work with the prior art [71; 74;
75]. Authors of [71] report the energy consumption of a reproducible workload (FFT),
Figure 5.11 Two workloads mapped onto Catena’s spatial array. (a) MNIST and (b)
FFT.
but consider a high-performance algorithm to perform 12 parallel 4096-point FFTs.
After normalization accounting for the FFT type, resolution, technology node, and
wordlength, and assuming the un-utilized/idle cores in [71] have zero dynamic and
static power dissipation, Catena would still consume 2.7X less energy per FFT.
CHAPTER 6. CONCLUSIONS 110
Chapter 6
Conclusions
This work has focused on ultra-low leakage, energy-efficient integrated circuit and
system design for the ULP embedded and mobile domains. It presents novel leakage
suppression techniques that are demonstrated in real, fairly large silicon prototypes
fabricated in modern sub-micrometer and nanometer CMOS process technology nodes. Such
prototypes show state-of-the-art energy-efficiencies.
Energy-efficiency is identified as a key enabler for small area and long lifetime, two
crucial requirements for ULP VLSI systems in embedded and mobile applications. In
order to achieve low power consumption and high energy-efficiency, designers have
been exploiting ULV operation down to the near- and sub-threshold regimes. Indeed,
supply voltage scaling is the most powerful knob to reduce power and energy in
today’s CMOS circuits. Nevertheless, operating circuits at ULV poses several
challenges. One key challenge is leakage energy consumption, especially from
sub-threshold leakage, which causes the total energy per cycle to increase at low
voltage and thus defines the practical limit of energy-efficiency (i.e., the MEP).
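The way leakage sets the MEP can be illustrated with a small Python sketch. It uses a textbook-style normalized model, not the measured data of this thesis: dynamic energy scales with VDD squared, while leakage energy per cycle is the off-current times VDD times the cycle time, and the cycle time is assumed to grow exponentially as VDD drops into sub-threshold. The names and values (c_sw, k_leak, n, phi_t) and the 0.15 V lower bound of the sweep are assumptions chosen only for illustration.

```python
import math


def energy_per_cycle(vdd, c_sw=1.0, k_leak=1000.0, n=1.5, phi_t=0.026):
    """Return (dynamic, leakage) energy per cycle in normalized units.

    Assumed model (illustrative only):
      E_dyn  = c_sw * VDD^2
      E_leak = I_off * VDD * T_cycle, with T_cycle ~ VDD / I_on and
               I_off / I_on = exp(-VDD / (n * phi_t)) in sub-threshold.
    k_leak lumps leaking width, logic depth, and gate capacitance.
    """
    e_dyn = c_sw * vdd ** 2
    e_leak = k_leak * vdd ** 2 * math.exp(-vdd / (n * phi_t))
    return e_dyn, e_leak


def find_mep(v_min=0.15, v_max=1.0, step_mv=1, **params):
    """Sweep VDD on a millivolt grid and return (vdd, energy) at the minimum."""
    best_v, best_e = None, float("inf")
    n_steps = int(round((v_max - v_min) * 1000 / step_mv))
    for i in range(n_steps + 1):
        v = v_min + i * step_mv / 1000.0
        e = sum(energy_per_cycle(v, **params))
        if e < best_e:
            best_v, best_e = v, e
    return best_v, best_e


if __name__ == "__main__":
    v_mep, e_mep = find_mep()
    e_dyn, e_leak = energy_per_cycle(v_mep)
    print(f"MEP at {v_mep:.3f} V, E = {e_mep:.4f} (dyn {e_dyn:.4f}, leak {e_leak:.4f})")
```

With these placeholder parameters the minimum lands near 0.3 V. Raising k_leak (more leaking hardware per switched capacitance) moves the minimum to a higher voltage and a higher energy, which is the effect that leakage suppression aims to counteract.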
This thesis proposes a range of circuit- and architecture-level techniques for
addressing the leakage energy waste in ULP VLSI systems. To demonstrate the
proposed techniques, numerous silicon prototypes are designed, fabricated,
functionally verified, and evaluated with rigorous test and measurement
methodologies. Measured results of these prototypes are compared against recently
published state-of-the-art systems, showing significant improvements in power,
performance, area, and other relevant metrics for ULP VLSI systems.
In this thesis, Chapter 2 presents a novel sleep technique with fine temporal
granularity for parallel architectures operating at ULV. This technique extends the
energy-efficiency of parallel circuits operating in the near- and sub-threshold
regimes by opportunistically shutting down idle and underutilized replicas in a
parallel architecture to reduce leakage. The project discussed in Chapter 3 advances
the technique presented in Chapter 2 in order to extend the energy-efficiency of a
compact FFT ASIC accelerator for ULP applications. Chapter 4 focuses on reducing the
leakage energy of always-on circuits in ULP VLSI systems with a novel ultra-low
leakage logic family that addresses the switching speed issue of the prior art while
still maintaining fW-level leakage per gate. Chapter 5 combines the techniques
presented in Chapters 2, 3, and 4 in a near-Vt sub-0.4-mW 16-core programmable
spatial array accelerator for ULP embedded and mobile applications, which shows
state-of-the-art energy-efficiencies across multiple workloads and core-utilization
rates.
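The trade-off behind shutting down idle replicas can be sketched with generic power-gating accounting: sleeping pays off only when the leakage saved over the idle interval exceeds the energy spent entering and leaving sleep. The Python sketch below is a minimal illustration under that assumption, not the controller implemented in this thesis. All function and parameter names are hypothetical, and energies are in arbitrary units.

```python
def break_even_cycles(e_overhead, e_leak_awake, e_leak_asleep):
    """Idle cycles at which the saved leakage equals the sleep/wake overhead.

    e_overhead    : energy spent entering and leaving sleep (assumed lumped)
    e_leak_awake  : leakage energy per cycle of an idle, powered replica
    e_leak_asleep : residual leakage energy per cycle while asleep
    """
    saved_per_cycle = e_leak_awake - e_leak_asleep
    if saved_per_cycle <= 0:
        raise ValueError("sleep mode must leak less than the awake state")
    return e_overhead / saved_per_cycle


def net_saving(idle_cycles, e_overhead, e_leak_awake, e_leak_asleep):
    """Net energy saved by sleeping for idle_cycles (negative means a loss)."""
    return idle_cycles * (e_leak_awake - e_leak_asleep) - e_overhead


def should_sleep(idle_cycles, e_overhead, e_leak_awake, e_leak_asleep):
    """Sleep only if the idle interval is strictly past the break-even point."""
    return net_saving(idle_cycles, e_overhead, e_leak_awake, e_leak_asleep) > 0


if __name__ == "__main__":
    params = dict(e_overhead=5.0, e_leak_awake=1.0, e_leak_asleep=0.5)
    print("break-even cycles:", break_even_cycles(**params))
    for idle in (3, 10, 11):
        print(idle, "idle cycles -> sleep?", should_sleep(idle, **params))
```

A lower sleep/wake overhead shortens the break-even interval, which is what makes sleep at fine temporal granularity worthwhile for short idle periods.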
This thesis presented circuit- and architecture-level techniques for improving the
energy-efficiency of digital integrated circuits and systems at ULV by reducing the
energy waste of idle and underutilized hardware and always-on circuits. Thereby, it
brings a wide range of embedded and mobile applications closer to our everyday life.
However, there is yet more to be studied, as significant challenges still remain in
the massive adoption and manufacturability of ULP VLSI circuits. A few examples,
along with leakage energy, are performance scalability, adaptability to reduce the
impact of process, voltage, and temperature variations, flexibility vs. performance
and energy-efficiency, and power delivery for these systems.
BIBLIOGRAPHY 113
Bibliography
[1] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage Current
Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS
Circuits,” Proceedings of the IEEE, vol. 91, no. 2, pp. 305-327, 2003.
[2] F. Fallah and M. Pedram, “Standby and Active Leakage Current Control and
Minimization in CMOS VLSI Circuits,” IEICE Transactions on Electronics, vol.
E88-C, no. 4, pp. 509-519, 2005.
[3] M. Alioto, “Ultra-Low Power VLSI Circuit Design Demystified and Explained: A
Tutorial,” IEEE Transactions on Circuits and Systems–I: Regular Papers (TCAS–
I), vol. 59, no. 1, Jan. 2012.
[4] N. Weste and D. Harris, CMOS VLSI Design. Reading, MA: Addison-Wesley,
2004.
[5] K. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshida, G. Sano, M. Nor-
ishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai,
“A 0.9V 150MHz 10mW 4mm2 2-D Discrete Cosine Transform Core Processor
with Variable-Threshold-Voltage Scheme,” IEEE International Solid-State Cir-
cuits Conference (ISSCC), pp. 166-167, Feb. 1996.
[6] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “1-V
Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage
CMOS,” IEEE Journal of Solid-State Circuits (JSSC), vol. 30, no. 8, pp. 847-854,
Aug. 1995.
[7] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, “SRAM leakage
suppression by minimizing standby supply voltage,” International Symposium on
Signals, Circuits and Systems (SCS), pp. 55-60, 2004.
[8] K. Zhang et al., “SRAM Design on 65-nm CMOS Technology with Dynamic Sleep
Transistor for Leakage Reduction,” IEEE Journal of Solid-State Circuits (JSSC),
vol. 40, no. 4, pp. 895-900, 2005.
[9] K. Zhang et al., “Low-Power SRAMs in Nanoscale CMOS Technologies,” IEEE
Transactions on Electron Devices, vol. 55, no. 1, pp. 145-151, 2008.
[10] F. Hamzaoglu et al., “A 3.8 GHz 153 Mb SRAM Design with Dynamic Stability
Enhancement and Leakage Reduction in 45 nm High-K Metal Gate CMOS Tech-
nology,” IEEE Journal of Solid-State Circuits (JSSC), vol. 44, no. 1, pp. 148-154,
2009.
[11] A. Raychowdhury, J. Kim, D. Peroulis, and K. Roy, “Integrated MEMS Switches
for Leakage Control of Battery Operated Systems,” Custom Integrated Circuits
Conference (CICC), pp. 457-460, 2006.
[12] Y. Zhang et al., “A Batteryless 19 µW MICS/ISM-band Energy Harvesting
Body Sensor Node SoC for ExG applications,” IEEE Journal of Solid-State
Circuits (JSSC), vol. 48, no. 1, pp. 199-213, 2013.
[13] Y. Pu et al., “A 9-mm2 Ultra-Low-Power Highly Integrated 28-nm CMOS SoC
for Internet of Things,” IEEE Journal of Solid-State Circuits (JSSC), vol. 53,
no. 3, pp. 936-948, 2018.
[14] M. Yip, R. Jin, H. H. Nakajima, K. M. Stankovic, and A. P. Chandrakasan,
“A Fully-Implantable Cochlear Implant SoC with Piezoelectric Middle-Ear Sen-
sor and Arbitrary Waveform Neural Stimulation,” IEEE Journal of Solid-State
Circuits (JSSC), vol. 50, no. 1, pp. 214-229, 2015.
[15] G. Chen et al., “A Cubic-Millimeter Energy-Autonomous Wireless Intraocular
Pressure Monitor,” IEEE International Solid-State Circuits Conference (ISSCC),
pp. 310-311, 2011.
[16] The Eversense Continuous Glucose Monitoring System. Available online:
https://www.eversensediabetes.com/eversense-cgm-syste. 2019.
[17] M. P. Christiansen et al., “A Prospective Multicenter Evaluation of the Accu-
racy of a Novel Implanted Continuous Glucose Sensor: PRECISE II,” Diabetes
Technology & Therapeutics, vol. 20, no. 3, pp. 197-206, 2018.
[18] Cubeworks – Millimeter-Scale Computing. Real-life smart dust. Available online:
http://cubeworks.us. 2019.
[19] B. Warneke, M. Last, B. Liebowitz, and K. Pister, “Smart Dust: Communicating
with a Cubic-Millimeter Computer,” Computer, vol. 34, no. 1, pp. 44-51, Jan.
2001.
[20] M. Alioto, “Energy Harvesters for IoT,” Short course at the IEEE Symposium
on VLSI Circuits (VLSI), 2015.
[21] J. B. Bates, N. J. Dudney, B. Neudecker, A. Ueda, and C. D. Evans, “Thin-
Film Lithium and Lithium-Ion Batteries,” Solid-State Ionics, vol. 135, no. 1-4,
pp. 33-45, 2000.
[22] S. Narendra and A. Chandrakasan, Leakage in Nanometer CMOS Technologies.
Berlin: Springer, 2006.
[23] C. R. Baugh and B. A. Wooley, “A Two’s Complement Parallel Array
Multiplication Algorithm,” IEEE Transactions on Computers, vol. C-22, no. 12, pp.
1045-1047, December 1973.
[24] A. Wang and A. Chandrakasan, “A 180-mV Subthreshold FFT Processor Using
a Minimum Energy Design Methodology,” IEEE Journal of Solid-State Circuits
(JSSC), vol. 40, no. 1, pp. 310-319, Jan. 2005.
[25] M. Seok, S. Hanson, Y.-S. Lin, Z. Foo, D. Kim, Y. Lee, N. Lee, D. Sylvester, and
D. Blaauw, “The Phoenix Processor: A 30pW Platform for Sensor Applications,”
IEEE Symposium on VLSI Circuits (VLSI), pp. 188-189, Jun. 2008.
[26] S. Kim and M. Seok, “Variation-Tolerant Near-threshold Microprocessor Design
with Low-Overhead, Within-a-Cycle In-situ Error Detection and Correction Tech-
nique,” IEEE Journal of Solid-State Circuits (JSSC), vol. 50, no. 6, pp. 1478-1490,
May 2015.
[27] M. Fojtik, D. Kim, G. Chen, Y.-S. Lin, D. Fick, J. Park, M. Seok, M.-T. Chen,
Z. Foo, D. Blaauw, and D. Sylvester, “A Millimeter-Scale Energy-Autonomous
Sensor System with Stacked Battery and Solar Cells,” IEEE Journal of Solid-State
Circuits (JSSC), vol. 48, no. 3, pp. 801-813, Mar. 2013.
[28] A. P. Chandrakasan and R. W. Brodersen, “Minimizing Power Consumption in
Digital CMOS Circuits,” Proceedings of the IEEE, vol. 83, no. 4, pp. 498-523,
Apr. 1995.
[29] M. S. Hrishikesh, N. P. Jouppi, K. I. Farkas, D. Burger, S. W. Keckler, and P.
Shivakumar, “The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter
delays,” ACM/IEEE International Symposium on Computer Architecture (ISCA),
pp. 14-24, May 2002.
[30] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P. N. Strenski,
and P. G. Emma, “Optimizing pipelines for power and performance,” ACM/IEEE
International Symposium on Microarchitecture (MICRO), pp. 333-344, Nov. 2002.
[31] A. Hartstein and T. R. Puzak, “Optimum Power/Performance Pipeline Depth,”
IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 117-
125, Dec. 2003.
[32] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, “Theoretical and Practical
Limits of Dynamic Voltage Scaling,” ACM/IEEE Design Automation Conference
(DAC), pp. 868-873, 2004.
[33] B. H. Calhoun and A. P. Chandrakasan, “Characterizing and Modeling Min-
imum Energy Operation for Subthreshold Circuits,” ACM/IEEE International
Symposium on Low Power Electronics and Design (ISLPED), pp. 90-95, 2004.
[34] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P.
Bose, “Microarchitectural Techniques for Power Gating of Execution Units,”
ACM/IEEE International Symposium on Low Power Electronics and Design
(ISLPED), pp. 32-37, Aug. 2004.
[35] J. W. Tschanz, S. G. Narendra, Y. Ye, B. A. Bloechel, S. Borkar, and V. De,
“Dynamic Sleep Transistor and Body Bias for Active Leakage Power Control of
Microprocessors,” IEEE Journal of Solid-State Circuits (JSSC), vol. 38, no. 11,
pp. 1838-1845, Oct. 2003.
[36] K.-S. Min, H. Kawaguchi, and T. Sakurai, “Zigzag Super Cut-off CMOS
(ZSCCMOS) Block Activation with Self-Adaptive Voltage Level Controller: An Alter-
native to Clock-Gating Scheme in Leakage Dominant Era,” IEEE International
Solid-State Circuits Conference (ISSCC), 2003.
[37] T. Miyazaki, T. Q. Canh, H. Kawaguchi, and T. Sakurai, “Observation of One-
Fifth-of-a-Clock Wake-up Time of Power-Gated Circuit,” IEEE Custom Integrated
Circuits Conference (CICC), pp. 87-90, Oct. 2004.
[38] D. Jeon, M. Seok, C. Chakrabarti, D. Blaauw, and D. Sylvester, “A
Super-Pipelined Energy Efficient Subthreshold 240 MS/s FFT Core in 65nm CMOS,”
IEEE Journal of Solid-State Circuits (JSSC), vol. 47, no. 1, pp. 23-34, Nov. 2011.
[39] M. Seok, D. Jeon, C. Chakrabarti, D. Blaauw, and D. Sylvester, “Pipeline
Strategy for Improving Optimal Energy Efficiency in Ultra-Low Voltage Design,”
ACM/IEEE Design Automation Conference (DAC), 2011.
[40] M. Seok, S. Hanson, D. Blaauw, and D. Sylvester, “Sleep Mode Analysis and
Optimization with Minimal-Sized Power Gating Switch for Ultra-Low VDD
Operation,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI),
vol. 20, no. 4, pp. 605-615, Feb. 2011.
[41] R. K. Krishnamurthy, A. Alvandpour, S. Mathew, M. Anders, V. De, and S.
Borkar, “High-performance, Low-power, and Leakage-tolerant Challenges for Sub-
70nm Microprocessor Circuits,” IEEE European Solid-State Circuits Conference
(ESSCIRC), pp. 315-321, 2002.
[42] P. Larsson, “Parasitic Resistance in an MOS Transistor Used as On-Chip De-
coupling Capacitance,” IEEE Journal of Solid-State Circuits (JSSC), vol. 32, no.
4, pp. 574-576, Apr. 1997.
[43] B. H. Calhoun, F. A. Honore, and A. P. Chandrakasan, “A Leakage Reduction
Methodology for Distributed MTCMOS,” IEEE Journal of Solid-State Circuits
(JSSC), vol. 39, no. 5, pp. 818-826, May 2004.
[44] C. H. Kim, J. J. Kim, I. J. Chang, and K. Roy, “PVT-Aware Leakage Reduction
for On-Die Caches With Improved Read Stability,” IEEE Journal of Solid-State
Circuits (JSSC), vol. 41, no. 1, pp. 170-178, Jan. 2006.
[45] A. R. Trivedi, W. Yueh, and S. Mukhopadhyay, “In Situ Power Gating Efficiency
Learner for Fine-Grained Self-Adaptive Power Gating,” IEEE Transactions on
Circuits and Systems–II: Express Briefs (TCAS–II), vol. 61, no. 5, pp. 344-348,
May 2014.
[46] P. Mishra, A. N. Bhoj, and N. K. Jha, “Die-Level Leakage Power Analysis of
FinFET Circuits Considering Process Variations,” IEEE Symposium on Quality
Electronic Design (ISQED), pp. 347-355, Mar. 2010.
[47] Y. Kim, Y. Lee, D. Sylvester, and D. Blaauw, “SLC: Split-control Level Con-
verter for dense and stable wide-range voltage conversion,” IEEE European Solid-
State Circuits Conference (ESSCIRC), pp. 478-481, Sep. 2012.
[48] D. Kim, J. Lee, and M. Seok, “Energy-Optimal Voltage Model Supporting a
Wide Range of Nodal Switching Rates for Early Design-Space Exploration,” IEEE
International Conference on Computer Design (ICCD), 2015.
[49] S. R. Sridhara, M. DiRenzo, S. Lingam, S.-J. Lee, R. Blázquez, J. Maxey, S.
Ghanem, Y.-H. Lee, R. Abdallah, P. Singh, and M. Goel, “Microwatt Embedded
Processor Platform for Medical System-on-Chip Applications,” IEEE Journal of
Solid-State Circuits (JSSC), vol. 46, no. 4, April 2011.
[50] M. Seok, D. Jeon, C. Chakrabarti, D. Blaauw, D. Sylvester, “A 0.27V 30MHz
17.7nJ/transform 1024-pt Complex FFT Core with Super-Pipelining,” IEEE In-
ternational Solid-State Circuits Conference (ISSCC), Feb. 2011.
[51] Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, “A 2.4-Gsample/s DVFS FFT
Processor for MIMO OFDM Communication Systems,” IEEE Journal of Solid-State
Circuits (JSSC), vol. 43, no. 5, pp. 1260-1273, May 2008.
[52] C.-H. Yang, T.-H. Yu, and D. Markovic, “Power and Area Minimization of
Reconfigurable FFT Processors: A 3GPP-LTE Example,” IEEE Journal of Solid-State
Circuits (JSSC), vol. 47, no. 3, pp. 757-768, Mar. 2012.
[53] B. Baas, “A Low-Power, High-Performance, 1024-Point FFT Processor,” IEEE
Journal of Solid-State Circuits (JSSC), vol. 34, no. 3, March 1999.
[54] S.-J. Huang, S.-G. Chen, “A green FFT processor with 2.5-GS/s for IEEE
802.15.3c (WPANs),” International Conference on Green Circuits and Systems
(ICGCS), 2010.
[55] O. Abari, E. Hamed, H. Hassanieh, A. Agarwal, D. Katabi, A. Chandrakasan,
and V. Stojanovic, “A 0.75-Million-Point Fourier-Transform Chip for Frequency-
Sparse Signals,” IEEE International Solid-State Circuits Conference (ISSCC),
Feb. 2014.
[56] D. N. Truong, W. H. Cheng, T. Mohsenin, Z. Yu, A. T. Jacobson, G. Landge,
M. J. Meeuwsen, C. Watnik, A. T. Tran, Z. Xiao, E. W. Work, J. W. Webb, P. V.
Mejia, and B. Baas, “A 167-Processor Computational Platform in 65 nm CMOS,”
IEEE Journal of Solid-State Circuits (JSSC), vol. 44, no. 4, pp. 1130-1144, April
2009.
[57] M. Ayinala, Y. Lao, and K. K. Parhi, “An In-Place FFT Architecture for Real-
Valued Signals,” IEEE Transactions on Circuits and Systems–II: Express Briefs
(TCAS–II), vol. 60, no. 10, Oct. 2013.
[58] Z.-G. Ma, X.-B. Yin, and F. Yu, “A Novel Memory-Based FFT Architecture for
Real-Valued Signals Based on a Radix-2 Decimation-In-Frequency Algorithm,”
IEEE Transactions on Circuits and Systems–II: Express Briefs (TCAS–II), vol.
62, no. 9, Sep. 2015.
[59] D. Kim, G. Chen, M. Fojtik, M. Seok, D. Blaauw, and D. Sylvester, “A
1.85fW/bit Ultra Low Leakage 10T SRAM with Speed Compensation Scheme,”
IEEE International Symposium of Circuits and Systems (ISCAS), July 2011.
[60] C. Q. Tran, H. Kawaguchi, and T. Sakurai, “More Than Two Orders of Magni-
tude Leakage Current Reduction in Look-Up Table for FPGA’s,” IEEE Interna-
tional Symposium on Circuits and Systems (ISCAS), 2005.
[61] H. Homayoun, M. Makhzan, A. Veidenbaum, “Multiple Sleep Mode Leakage
Control for Cache Peripheral Circuits in Embedded Processors,” International
Conference on Compilers, Architectures and Synthesis for Embedded Systems
(CASES), 2008.
[62] Y. Shin, S. Paik, and H.-O. Kim, “Semicustom Design of Zigzag Power-Gated
Circuits in Standard Cell Elements,” IEEE Transactions on Computer-Aided De-
sign of Integrated Circuits and Systems (TCAD), vol. 28, no. 3, March 2009.
[63] D. El-Damak and A. P. Chandrakasan, “A 10 nW–1 µW Power Management
IC With Integrated Battery Management and Self-Startup for Energy Harvesting
Applications,” IEEE Journal of Solid-State Circuits (JSSC), vol. 51, no. 4, pp.
943-954, 2016.
[64] A. M. Klinefelter et al., “A Programmable 34 nW/Channel Sub-Threshold Sig-
nal Band Power Extractor on a Body Sensor Node SoC,” IEEE Transactions on
Circuits and Systems–II: Express Briefs (TCAS–II), vol. 59, no. 12, pp. 937-941,
2012.
[65] N. Lotze and Y. Manoli, “A 62mV 0.13µm CMOS standard-cell-based design
technique using schmitt-trigger logic,” IEEE International Solid-State Circuits
Conference (ISSCC), 2011.
[66] M.-E. Hwang et al., “A 85mV 40nW Process-Tolerant Subthreshold 8×8 FIR
Filter in 130nm Technology,” IEEE Symposium on VLSI Circuits (VLSI), 2007.
[67] L. Lin, S. Jain, and M. Alioto, “A 595pW 14pJ/Cycle Microcontroller with
Dual-Mode Standard Cells and Self-Startup for Battery-Indifferent Distributed
Sensing,” IEEE International Solid-State Circuits Conference (ISSCC), 2018.
[68] W. Lim, I. Lee, D. Sylvester, and D. Blaauw, “Batteryless Sub-nW Cortex-M0+
processor with dynamic leakage suppression logic,” IEEE International Solid-State
Circuits Conference (ISSCC), 2015.
[69] J. P. Cerqueira and M. Seok, “Temporarily Fine-Grained Sleep Technique for
Near- and Subthreshold Parallel Architectures,” IEEE Transactions on Very Large
Scale Integration Systems (TVLSI), vol. 25, no. 1, 2016.
[70] C. W. Wu and Y. Tsividis, “An Event Driven, Alias Free ADC with Signal-
Dependent Resolution,” IEEE Symposium on VLSI Circuits (VLSI), 2012.
[71] B. Bohnenstiehl et al., “KiloCore: A 32-nm 1000-Processor Computational Ar-
ray,” IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 4, pp. 891-902,
2017.
[72] S. Saito et al., “MuCCRA-Cube: A 3D dynamically reconfigurable processor
with inductive-coupling link,” International Conference on Field Programmable
Logic Applications (FPL), pp. 6–11, 2009.
[73] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, G.-Y. Wei, “A 28nm
SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1
timing error rate tolerance for IoT applications,” IEEE International Solid-State
Circuits Conference (ISSCC), 2017.
[74] D. H. Kim et al., “3D-MAPS: 3D Massively parallel processor with stacked mem-
ory,” IEEE International Solid-State Circuits Conference (ISSCC), vol. 55, pp.
188-189, 2012.
[75] D. Fick et al., “Centip3De: A cluster-based NTC architecture with 64 ARM
cortex-M3 cores in 3D stacked 130 nm CMOS,” IEEE Journal of Solid-State Cir-
cuits (JSSC), vol. 48, no. 1, pp. 104-117, 2013.
[76] A. Parashar et al., “Triggered Instructions: A Control Paradigm for Spatially-
Programmed Architectures,” International Symposium on Computer Architecture
(ISCA), pp. 142-153, 2013.
[77] A. Parashar et al., “Efficient Spatial Processing Element Control via Triggered
Instructions,” IEEE MICRO, vol. 34, no. 3, pp. 120-137, May-June, 2014.
[78] T. J. Repetti, J. P. Cerqueira, M. A. Kim, and M. Seok, “Pipelining a Triggered
Processing Element,” IEEE/ACM International Symposium on Microarchitecture
(MICRO), pp. 96-108, 2017.
