



Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy










Ultra-low-voltage (ULV) operation, where the supply voltage of the digital computing hardware
is scaled down to a level near or below the transistor threshold voltage (e.g., 300-500mV),
is a key technique for achieving high computing energy efficiency. It has enabled many exciting
new applications in Internet of Things (IoT) devices and energy-constrained
applications such as medical implants, environment sensors, and micro-robots. ULV
operation is also commonly used with emerging architectures, which are
often non-von-Neumann in style, to enable energy-efficient cognitive computing.
One of the biggest challenges in realizing ULV design is the large circuit delay variability. To
guarantee functionality under the worst-case process, voltage, and temperature (PVT) condition,
the traditional safety-margin approach requires operating at a slower clock frequency or a
higher supply voltage, which significantly limits the achievable energy efficiency of the hardware.
To fully realize the energy efficiency of ULV, the large circuit delay variation needs to be
handled adaptively. However, the existing adaptive techniques, which are optimized for nominal
supply voltage operation and traditional von Neumann architectures, become inefficient
for ULV designs and emerging architectures.
This thesis presents adaptive techniques based on timing error detection and correction
(EDAC) that are more suitable for energy-constrained ULV designs and emerging architectures.
The proposed techniques are demonstrated in three test chips: (1) R-Processor:
a 0.4V resilient processor with a voltage-scalable and low-overhead in-situ EDAC technique.
It achieves a 38% energy efficiency improvement or a 2.3× throughput improvement compared
to the traditional safety-margin approach. (2) A 450mV timing-margin-free waveform sorter
for a brain-computer interface (BCI) microsystem. It achieves 49.3% higher energy efficiency
and 35.6% higher throughput than the traditional safety-margin approach. (3) An ultra-low-power
and robust power-management system that consists of a microprocessor employing
ULV EDAC, a 63-ratio integrated switched-capacitor DC-DC converter, and a fully digital
error-based regulation controller.
In this thesis, we also explore circuits for emerging techniques. The first is temperature
sensors for dynamic-thermal-management (DTM). Modern high-performance microprocessors
suffer from ever-increasing power densities, which have led to reliability concerns and
increased cooling costs from excessive heat. To monitor and manage the thermal
behavior, DTM techniques embed multiple temperature sensors and use their readings.
The size, accuracy, and voltage scalability of the sensors are critical to the performance of
DTM. Therefore, we propose a temperature sensor that directly senses the transistor threshold
voltage; the test chip demonstrates 9× smaller area and 3× higher accuracy than the
previous state of the art.
Another area of exploration is interconnect design for ultra-dynamic-voltage-scaling (UDVS)
systems. UDVS has been proposed for applications that require both high performance and
high energy efficiency. UDVS systems can provide peak performance at nominal supply voltage
when the workload is high. When the workload is moderate or low, they can switch
to ULV operation for higher energy efficiency. One of the critical challenges in developing
UDVS systems is the inflexibility of various circuit fabrics, which are often optimized for a single
supply voltage. One critical example is conventional repeater-based long interconnects,
which suffer from non-optimal performance and energy efficiency in UDVS systems. Therefore,
in this thesis, we propose a reconfigurable interconnect design based on regenerators
and demonstrate near-optimal performance and energy efficiency across the supply voltage
range of 0.3V to 1V.
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Ultra-Low-Voltage (ULV) Operation
1.2 Variation in ULV operation
1.2.1 Variation Sources
1.3 Adaptive Techniques
1.4 Emerging Techniques
1.4.1 Ultra-Dynamic-Voltage-Scaling (UDVS)
1.4.2 Dynamic-Thermal-Management (DTM)
1.4.3 Emerging Architectures for Cognitive Computing
1.5 Challenges and Contribution of this Thesis
1.5.1 EDAC techniques for ULV design and Emerging Architectures
1.5.2 Circuits for Emerging Techniques
2 Variation-Tolerant, Ultra-Low-Voltage Microprocessor with a Low-Overhead,
Within-a-Cycle In-Situ Timing-Error Detection and Correction Technique
2.1 Motivation
2.2 The Challenges of Conventional Error Detection Techniques in ULV Design
and Proposed Error Detection Techniques
2.2.1 Conventional Flop based Error Detection
2.2.2 Conventional Two-Phase Latch based Error Detection
2.2.3 Proposed Sparse Error-Detecting Register Insertion
2.2.4 Case Study with 3-Stage Pipeline
2.3 Challenges of Conventional Error Detecting Registers in ULV Regime and
Proposed Error Detecting Latch
2.3.1 Conventional Error-Detecting Flip-Flop and Latch
2.3.2 Proposed Voltage Scalable Error Detecting Latch Circuits
2.4 Challenges of Error Correction Techniques in ULV Design and Proposed Error
Correction Technique
2.4.1 Conventional Error Correction Scheme
2.4.2 Proposed Non-Stall Error Correction
2.5 R-Processor Design and Implementation
2.6 Measurement Results
2.7 Summary
3 A 450mV Timing-Margin-Free Waveform Sorter based on Body Swapping
Error Correction
3.1 Motivation
3.2 Proposed Techniques and Sorter Implementation
3.3 Measurement Results
4 Ultra-Low-Power and Robust Power-Management/Microprocessor System
based on Error Regulation
4.1 Motivation
4.2 PM/µP Implementation Details
4.3 Measurement Results
5 A 30.1µm², <±1.1°C 3σ-Error, 0.4-1.0V Digital Standard-Cell Compatible
Temperature Sensor for On-Chip Dense Thermal Monitoring
5.1 Motivation
5.2 Proposed Temperature Sensor Design
5.2.1 Operating Principle
5.2.2 Optimal tsample
5.2.3 Supply Voltage Noise
5.2.4 Sensor Device Type and Body Connection
5.3 Test Chip Details
5.3.1 Shared P2 and Csample
5.3.2 Operating Principle
5.3.3 On-Chip DSADC
5.4 Measurement Results
5.4.1 Sensor Accuracy Measurement
5.4.2 Supply Voltage Scalability Measurement
5.4.3 On-chip DSADC Measurement
5.4.4 Comparison
5.5 Digital Standard-Cell-Compatible Sensor Experiment
5.6 Summary
6 Reconfigurable Regenerator-based Interconnect Design for
Ultra-Dynamic-Voltage-Scaling Systems
6.1 Motivation
6.2 Challenges of Repeater-Based Interconnect Design for UDVS Systems
6.2.1 Optimal Interval of Repeater Insertion
6.2.2 Repeater-based Interconnect Design
6.3 Optimized Regenerator Circuit Design
6.3.1 Self-Timed Regenerator (STR)
6.3.2 Robustness Challenges in the STR design
6.3.3 Robustness and Reconfiguration
6.4 Reconfigurable Regenerator-Based Interconnect Design for UDVS Systems
6.4.1 Design Process of the Proposed Interconnects
6.4.2 Comparisons
6.4.3 Non-Minimum Width Wire





List of Figures
1.1 Simulated energy and delay across VDDs using 50-FO4 long inverter chains.
1.2 Simulation using 50-FO4 long inverter chains in a 65nm CMOS shows that a
12× frequency margin or a 160mV supply voltage margin is needed for the worst-case
condition at 0.4V.
2.1 (a) The conventional flop-based error detection. (b) The conventional latch-based
error detection. (c) The proposed sparse error detection. (ED: error
detection pipeline registers)
2.2 At lower VDDs, (a) more short paths need to be delay-padded,
and (b) more flops need to be replaced with error-detecting
registers.
2.3 The proposed sparse error detection technique inserts EDLs every N stages
instead of every stage. The delay increase (timing error) can propagate to the
next stage via cycle borrowing and may disappear as it passes through a non-critical
path or be cycle-borrowed again. We insert error-detecting latches
before the delay accumulation can exceed the cycle-borrowing window.
2.4 When error-detecting registers are inserted at every latch stage, only 19%
of the error-detection window is utilized even at 0.3V, motivating the proposed
sparse insertion of error-detecting registers.
2.5 The required error-detection window increases as the pipeline granularity
becomes finer, but it is still less than 25% of TCLK even for 10-FO4 long latch
stages.
2.6 A significant number of latch stages can be skipped before error-detecting
registers are inserted. At VDD>0.4V, the optimal sparseness is estimated to be
larger than 20 latch stages.
2.7 The optimal sparseness is estimated to range from 4 to 14 for latch-stage
lengths of 10-FO4 to 100-FO4 delays.
2.8 Diagrams of (a) flop-based and (b) latch-based 3-stage pipeline circuits using
three 16b multipliers. (c) Multiple latch-based pipelines are implemented for
different sparseness values.
2.9 The delay distribution of receiving latches in the two-phase latch-based pipeline
circuits.
2.10 Pseudo algorithm for error-detecting register insertion.
2.11 Comparison of the conventional techniques and the proposed sparse insertion
technique: (a) combinational area, (b) sequential area, (c) total number of
error detectors. (Abbreviations: I. baseline without error-detection capability;
II. conventional flop EDAC [1]; III. conventional two-phase latch-based EDAC
with N=1 [2]; IV. the proposed EDAC with sparsely inserted error-detecting
registers [N=6])
2.12 Timing violation rate comparison for different N.
2.13 (a) The conventional double-sampling method suffers from false error detection
due to clk-to-q delay mismatch between main and shadow elements at low
voltage. (b) The 3σ clk-to-q delay mismatch over 100k Monte Carlo simulations
with random process variation at 0.35V is 1.8-FO4 delay, which causes a
false error rate of 28%.
2.14 (a) The schematic and (b) the operational waveform of the proposed EDL. It
uses the side-channel error detection method to avoid the clk-to-q mismatch
problem. It is also optimized to extend voltage scalability down to 0.3V.
2.15 We propose a non-stall error correction scheme that utilizes local and temporal
VDD boosting. When a timing error is flagged in the detection stage
and while the late-arriving signals still propagate via cycle borrowing, the
VDD,local-control block changes the supply voltage (VDD,local) of the next stage
(correction stage) to a higher VDD (VDDH) to accelerate signal propagation. The
latches in the correction stage are not boosted to avoid accidental state loss.
Level converters are bypassed in the absence of errors.
2.16 (a) Operational waveform of the non-stall correction technique. The VDD,local
is boosted during the negative clock phase to isolate the correction stage
from the next stage. The supply voltage headers are sized to meet the 1-FO4
boosting slew. (b) Boosting the voltage at ULV can provide sufficient speed-up
to correct errors without stalling pipelines. For example, boosting from 0.4V
to 0.55V can give a 4× speed-up, which is sufficient to produce error-free results
in less than half a cycle.
2.17 R-Processor: a 16-bit 5-stage microprocessor employing the proposed EDAC
techniques. The memory blocks (DMEM, IMEM, and RF) are also pipelined
with 2-phase latches to continue cycle borrowing across the entire pipeline.
2.18 Die photograph of (a) the R-Processor and (b) the baseline processor.
2.19 The FCLK,max of the baseline processor is measured based on the worst-case
PVT condition. First, the FCLK,max at the worst-case voltage and temperature
condition (10% VDD drop and -20°C) is measured over 10 chips. Then, to
account for process variation, we find the 6σ worst-case FCLK,max from the
10 chip measurements, which is used as the margined FCLK,max of the baseline
processor. If variation across wafers and lots is considered, the worst-case
FCLK,max can be even worse than our estimation.
2.20 The R-Processor achieves energy efficiency and performance improvement over
the baseline design. (1) The R-Processor can scale VOPT by 140mV compared
to the baseline, where the R-Processor consumes 42% less energy
per cycle at FCLK=60MHz; (2) At the same performance (80MHz, which the
baseline achieves at its VOPT), the R-Processor exhibits 38% lower energy consumption;
(3) At the same energy consumption, the R-Processor is estimated
to be 2.3× faster than the baseline.
2.21 The energy savings and error rates of the R-Processor. The R-Processor can use
110mV lower VDD and consume 38% less energy when it reaches the point of
first failure (PoFF), i.e., detecting and correcting the first error.
2.22 Experiment results of the R-Processor running at 10MHz with an off-chip DVS
system while the ambient temperature varies from -20°C to 70°C. The R-Processor
can operate well down to the deep sub-threshold regime of 0.26V.
3.1 Sorter architecture with the proposed EDAC technique.
3.2 Sorting results.
3.3 Previous VDD boosting correction.
3.4 Proposed body swapping correction.
3.5 Waveforms of body swapping correction.
3.6 Body controller schematics with a test circuitry.
3.7 Measured delay of body swapping control.
3.8 Circuit delay reduction via body swapping.
3.9 Correction stage layout.
3.10 Schematics of the proposed fully-static transparent high ED latch.
3.11 Die photo.
3.12 Measured energy and throughput improvement.
3.13 Test circuitry for error statistic measurement.
3.14 Error rate reduction via independent error handling.
4.1 Conventional voltage based regulation.
4.2 The conventional EDAC-DVS technique requires a variable VREF generator,
which consumes a non-negligible amount of energy (e.g., 1µW). With this
estimation we project the PCE to degrade by 4%.
4.3 The conventional EDAC-DVS control loop has considerable latency in
translating error information to VREF. This latency forces the EDAC to
correct errors for a longer period before VDD is adjusted. For example, a 40µs
latency is estimated to cause 8% energy loss.
4.4 Proposed timing-error regulation.
4.5 The proposed EDAC-SCDC controller has a fast loop which responds to a
single error event and starts a new SCDC phase at the following rising CLK
edge. This loop quickly raises VDD to VDD,max and minimizes the time during
which the EDAC needs to handle errors.
4.6 When errors continue to occur, the slow loop of the proposed EDAC-SCDC
controller modulates the target VDD levels (VDD,max and VDD,min) with one-CLK-cycle
latency to regulate the average error rate to TER (bottom).
4.7 Schematic and operating modes of the 6-stage 63-ratio SCDC based on the
recursive topology [3]. To support low VIN, transmission gates in intermediate
switches [3,4] need to be upsized, which causes leakage-incurred PCE degradation.
Thus, we avoid using transmission gates. We also employ a technique
to recycle bottom-plate charges using the switches Rp and Rn, improving PCE
by 2-3%.
4.8 Test-chip die photo. The SCDC is sized to supply up to 1mA for other experiments.
The area is also estimated for an SCDC sized for the µP maximum.
4.9 Measured PCE of SCDC across different load currents.
4.10 Measured PCE of SCDC across ratios.
4.11 Measured VOUT of SCDC across ratios.
4.12 Measured transient behavior while executing programs having different power
consumptions. With our EDAC-SCDC controller disabled, we observe an
84mV VDD drop and program failure. With the controller enabled, the fast
loop can reduce the VDD drop to 15mV and the slow loop raises the target
VDD levels to meet the TER. We observe no program failure.
4.13 Energy efficiency comparisons.
4.14 Energy breakdown and energy savings.
4.15 As compared to Baseline-1, the proposed PM/µP system achieves 37-45%
savings as it needs little margin for PVT variations and SCDC output ripple.
5.1 Area, error, and VDD,min comparisons of recent compact thermal sensors.
5.2 Schematic and operation of the proposed sensor front-end that directly samples
VTH.
5.3 VTH over temperature across process variations.
5.4 (a) Linearity of the sampled VSENSOR value across tsamples. (b) Discharging
rate of the VSENSOR node voltage across tsample.
5.5 Impact of pre-charge level variations on accuracy.
5.6 Three possible body connections of the sensing device P1.
5.7 Die photo.
5.8 Test chip block diagram and its operational waveform.
5.9 Accuracy and area across sensor sizes.
5.10 (a) Measured VOUTs of an SS16 after OPC at 50°C. (b) Errors across temperatures.
5.11 Measured error after TPC at 20°C and 80°C.
5.12 The worst-case error of SS16s across tsamples.
5.13 The worst-case error across VDDs.
5.14 The worst-case error using on-chip DSADC.
5.15 Layout of 32-bit multiplier and embedded SS16 in the digital standard-cell
format.
5.16 (a) Worst-case coupling noise error across the VSENSOR wire length exposed.
(b) Worst-case coupling noise error across the sampling capacitor size.
5.17 Coupling noise induced error and its reduction via averaging.
6.1 Interconnect designs using (a) the conventional repeaters and (b) the proposed
reconfigurable regenerators.
6.2 Simulation shows a 6× variation in Loptimal over VDD=1-0.35V. (R/C: the
on-resistance and gate capacitance of unit-size inverters; Rw/Cw: the resistance
and capacitance of unit-length wires; pinv: the ratio of diffusion and gate
capacitance of unit-size inverters.)
6.3 No single repeater-based interconnect design can simultaneously achieve
optimal delay, slew, and energy consumption across a wide range of VDDs. (a)
At 0.35V, the Design III outperforms the Design I and II. At 1V, however, the
Design I exhibits 2.8× shorter delay than the Design III. (b) All the designs
achieve acceptable slew rates at the VDDs that they are optimized for. The
Design III exhibits large slew at 1V. (c) The three designs consume similar
amounts of energy since the total widths of inserted repeaters are similar.
Only the Design III shows a large energy consumption at VDD=0.6-1V due to
the short circuit current induced by large slew.
6.4 (a) The STR with original sizing [5] and (b) the optimized regenerator design.
6.5 The required size of the writing devices (NN5 and PP5) rapidly increases
under the worst-case process and temperature corner.
6.6 Leakage through PP4 and strongly-skewed devices, NN1 and NN2 (Fig. 6.4[a]),
can induce false transition detections at low VDDs. An example operation at
0.35V is shown.
6.7 Layout of the proposed regenerator design. The height is set as multiples of
the height of standard cells in this technology.
6.8 The optimal number of regenerators is found to be 35 at 1V with 1mm,
minimum-width wires.
6.9 At lower VDDs, some of the regenerators can be disabled while still meeting the
target performance. At 0.35V, for example, only 11 out of 35 regenerators are
enabled, achieving a 21% reduction in energy consumption compared to when
all are enabled.
6.10 The optimal numbers of enabled regenerators to achieve the target performance
across VDDs are found. The proposed reconfigurable interconnect design
reduces energy consumption by up to 28% by disabling a subset of regenerators.
6.11 The simulation results of (a) delay, (b) energy consumption, (c) slew, and
(d) area of the proposed reconfigurable interconnect design and the three
conventional repeater-based interconnect designs.
6.12 The proposed design demonstrates similar amounts of delay and energy
improvement across wires of different widths. The proposed design is compared
to (a) the Design I(1V) at 0.35V, and (b) the Design III(0.35V) at
1V.
List of Tables
2.1 Pseudo algorithm for error-detecting register insertion.
2.2 Comparisons of the conventional latch and the proposed EDL circuits at 0.35V.
2.3 The summary of the R-Processor and the baseline processor chips in the typical
PVT corner. Utilization is defined as total area divided by gate area.
2.4 Summary of baseline processor and R-Processor at the slow, typical, and fast
corner.
2.5 Comparison of R-Processor and previous EDAC works.
3.1 List of registers that require roll-back for replay correction.
3.2 Measured improvement summary.
3.3 Comparison chart.
4.1 Energy savings as compared to Baseline-1 across slow, typical, and fast corners.
4.2 Comparisons to the recent designs.
5.1 Comparison of proposed sensor with different device types.
5.2 Comparison of proposed sensor with different body connections.
5.3 Comparison table with previous designs.
6.1 Implementation details of the Design I, II, and III.
Acknowledgments
My journey through graduate study was full of excitement and gratitude. The journey
has made me a better person, and this would not have been possible without my advisor, Professor
Mingoo Seok. His devotion to and passion for research were a big motivation for me, and his
research and circuit skills were something I tried to adopt as much as I could. His support and
shared excitement when discussing new ideas were always encouraging on the long journey. Also, as
the first student in the group, I was lucky to learn by watching how to successfully manage
a research group and to share in all the excitement. I would also like to thank him
for his time and effort in discussing my future career path with sincere care. Thank you
again, Mingoo, for all you have done for my journey over the past few years.
I would also like to express great gratitude to the thesis committee: Professor Ken Shepard,
Professor Luca Carloni, Professor Martha Kim, and Dr. Ram Krishnamurthy. I also thank
Professor Peter Kinget for serving on the thesis proposal committee. Thank you all for your
commitment and for providing valuable feedback toward completing the thesis.
I am thankful to Dr. Ram Krishnamurthy for serving on the thesis committee all the
way from Oregon, but even more I would like to thank him for making my internship experience at
Intel Corporation great and for having me on board the team to start a new, exciting journey.
I would also like to thank my mentor, Dr. Gregory Chen, for guiding the project during
the internship, and Dr. Mark Anders, Dr. Sanu Mathew, and Dr.
Himanshu Kaul for providing critical discussion in the meetings.
My colleagues in the group kept my daily life in graduate school from feeling lonely.
Thank you for all the discussions on technical topics and chats on various other topics, Teng
Yang, Doyun Kim, Jiangyi Li, Joao Cerqueira, Zhewei Jiang, Wei Jin, Tianchan Guan, and
Minhao Yang. I will forever remember the fun moments we shared in the lab and at the
annual gatherings. Although we haven’t had much chance to talk, welcome to the group,
Chen-Yu Yen and Pavan Chundi.
Outside the group, I would like to thank Youngwan Kim and Doyun Kim for always being
supportive friends, Ning Guo for being a fishing buddy and sharing career discussions,
Ritesh Bhat for exploring Oregon and Jamba Juice together and for the occasional swimming
chats, Dr. Anandaroop Chakrabarti for the life lessons in the Mudd hallway, and Hyungsik
Kim for sharing many interesting corporate experiences with me. I would also like to thank
Kevin Tien and Paolo Mantovani for all their help during the collaboration project. I will
also miss the occasional chats I had in the Mudd hallway with Dr. Jianxun Zhu, Dr. Yang
Xu, Dr. Shavil Patil, Yu Chen, Jahnavi Sharma, Jeffrey Chuang, Linxiao Zhang, Tolga Dinc,
Jin Zhou, and Negar Reiskarimian.
Special thanks to all my friends back in Korea. The list is too long, and I know you won’t be
reading this anyway. Thank you all for the mental support and for being there, no matter
how busy you were, whenever I had the chance to visit.
Finally and most importantly, I would like to thank my family: my parents,
for their lifelong unconditional love and support, and my parents-in-law, brother,
sister-in-law, and brother-in-law, for adding even more love and support. Lastly, my wife, Gihee Hong,
thank you for accompanying me throughout this long journey. Thanks for sharing the
excitement and joy, and thanks for all your love and efforts to encourage me and cheer me
up when I was depressed. I know you don’t like our marriage motto as much as I do, but
“two hearts and one dream!” Thank you, and I love you.





1.1 Ultra-Low-Voltage (ULV) Operation
Ultra-low-voltage (ULV) operation is a key technique for achieving extremely energy-efficient
digital computing hardware by scaling the supply voltage (VDD) to a level near or below the
transistor threshold voltage (VTH) [6,7]. ULV operation can improve computing energy efficiency,
which is particularly beneficial for prolonging the battery lifetime of a range of Internet-of-Things
(IoT) devices and energy-constrained applications including biomedical implants, environment sensors,
and micro-robots [8, 9].
Fig. 1.1 shows the simulated energy consumption and delay of a 50-FO4 (Fan-Out of 4)
inverter chain as a function of VDD. Initially, as VDD is scaled down, a near-quadratic energy
saving is achieved from the reduction in active energy. At lower VDDs, the delay increases
exponentially, which in turn substantially increases the leakage energy. This creates an optimal
energy point (Emin) where the increase in leakage energy offsets the quadratic saving in
Figure 1.1: Simulated energy and delay across VDDs using 50-FO4 long inverter chains.
active energy [11,21]. This optimal energy point normally occurs in the ULV regime (e.g.,
300-500mV) in modern CMOS technologies, which makes ULV operation desirable for high
energy efficiency.
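The trade-off between active and leakage energy can be illustrated with a minimal numerical sketch. It assumes a normalized, EKV-style smooth current model and hypothetical parameter values (VTH, N_VT, ACTIVITY, DEPTH); none of these are fitted to the 65nm process or to Fig. 1.1, and leakage is simply taken as the current at zero gate voltage.

```python
import math

# All values are illustrative assumptions, not fitted to any process.
VTH = 0.4        # threshold voltage (V)
N_VT = 0.039     # subthreshold slope factor times thermal voltage, n*kT/q (V)
ACTIVITY = 0.1   # fraction of gates switching per cycle
DEPTH = 50       # gates that leak for the duration of one 50-gate-deep cycle


def drive_current(vgs):
    """Smooth current model: exponential below VTH, roughly quadratic above."""
    return math.log1p(math.exp((vgs - VTH) / (2 * N_VT))) ** 2


def energy_per_cycle(vdd):
    """Normalized energy per cycle: active term plus leakage over the cycle time."""
    active = ACTIVITY * vdd ** 2
    # Cycle time scales as C*VDD/I_on, so leakage energy scales as
    # VDD^2 * I_off / I_on, with I_off taken as the current at VGS = 0.
    leakage = DEPTH * vdd ** 2 * drive_current(0.0) / drive_current(vdd)
    return active + leakage


def find_emin(v_lo=0.15, v_hi=1.0, steps=171):
    """Grid search for the supply voltage that minimizes energy per cycle."""
    grid = [v_lo + i * (v_hi - v_lo) / (steps - 1) for i in range(steps)]
    v_opt = min(grid, key=energy_per_cycle)
    return v_opt, energy_per_cycle(v_opt)
```

With these assumed parameters the minimum falls below VTH, at a supply where the growing leakage term starts to outweigh the shrinking active term. The exact location depends strongly on the assumed activity factor and logic depth.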
1.2 Variation in ULV Operation
One of the most critical challenges in designing ULV computing hardware is large delay
variability. In the ULV regime, the transistor on-current becomes exponentially dependent
on VDD, VTH, and temperature. Circuit delay can therefore change radically between
typical and worst-case process, voltage, and temperature (PVT) conditions.
As shown in Fig. 1.2, at 1V operation, the worst-case clock frequency of the inverter chain
is only 1.7× slower than in the typical case. However, at 0.4V operation, the worst-case clock
Figure 1.2: Simulation using 50-FO4 long inverter chains in a 65nm CMOS shows that 12×
frequency margin or 160mV supply voltage margin are needed for worst-case condition at
0.4V.
frequency becomes 12× slower than in the typical case.
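The growth of the worst-case slowdown at low VDD follows from the exponential current dependence. The sketch below assumes a normalized EKV-style current model and a hypothetical slow corner (a 50mV VTH increase); it is not intended to reproduce the 1.7× and 12× values of Fig. 1.2, which come from SPICE simulation across full PVT corners.

```python
import math

VTH_TYP = 0.40   # typical threshold voltage (V), assumed
N_VT = 0.039     # n*kT/q (V), assumed


def gate_delay(vdd, vth):
    """Normalized gate delay, proportional to C*VDD / I_on."""
    i_on = math.log1p(math.exp((vdd - vth) / (2 * N_VT))) ** 2
    return vdd / i_on


def slowdown(vdd, vth_shift=0.05, vdd_droop=0.0):
    """Ratio of slow-corner delay to typical delay at a given nominal VDD.

    The slow corner is modeled as a higher VTH plus an optional supply droop;
    both shifts are hypothetical inputs.
    """
    worst = gate_delay(vdd - vdd_droop, VTH_TYP + vth_shift)
    typical = gate_delay(vdd, VTH_TYP)
    return worst / typical
```

Comparing `slowdown(1.0)` with `slowdown(0.4)` shows the same VTH shift costing considerably more delay near threshold than at nominal voltage.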
1.2.1 Variation Sources
Transistor variability can be categorized based on its spatial reach (i.e., global or local)
and temporal rate of change (i.e., static/slow or dynamic) [10–13].
• Global variation affects all transistors on a die. Examples include inter-die process
variations, ambient temperature fluctuations, and package/die VDD fluctuations.
• Local variation affects transistors that are in the immediate vicinity of one another.
Examples include intra-die process variation, resistive (IR) VDD drops in the power-
grid, coupling noise, and local temperature hot-spots.
• Static/slow variation sources are fixed after fabrication or change extremely slowly
over the lifetime. Examples include intra- and inter-die process variations and transistor
aging effects (e.g., NBTI, TDDB, and electromigration).
• Dynamic variation develops during runtime. The source of dynamic variation can
be further categorized into fast-changing or slow-changing. Fast-changing dynamic
variation sources include inductive (Ldi/dt) VDD overshoots, resistive (IR) VDD drops,
and coupling noise. Slow-changing dynamic variation sources include temperature
hot-spots and board-parasitic-induced VDD variation.
1.3 Adaptive Techniques
The conventional practice to ensure correct operation at the worst-case PVT condition is to
operate the chip with a safety margin. However, the worst-case PVT condition rarely occurs,
and such safety margins lead to degraded throughput or wasted power in most cases. For
example, at 0.4V operation, the chip always needs to operate at a 12× slower clock frequency
(FCLK) or at a 160mV higher supply voltage (VDD) than required in the typical case, which
limits the achievable throughput and energy efficiency (Fig. 1.2).
Several adaptive techniques [1,2,11–21] have been proposed to reduce the frequency and
voltage safety margins. One class of techniques uses circuits that replicate the critical
paths of a target design. The delay of the replicated circuits can predict timing errors and, if
needed, inform a dynamic voltage frequency scaling (DVFS) controller to modulate VDD and
FCLK [18–21]. This so-called canary technique can track global and slow-changing variations
such as inter-die process variation and ambient temperature fluctuation. However, the canary
technique cannot remove the margins for fast dynamic variations because of the limited
response time of DVFS systems. It also cannot remove the margins for local (random)
variations since the replicated circuits do not experience the same variations as the
actual circuits. Those fast-dynamic and local variations include intra-die process variation,
local dynamic VDD drop, capacitive coupling, and local hot spots. The margins for these
variations can be large, especially in the ULV regime, and limit the achievable throughput and
energy efficiency of the system.
In order to remove the margins for fast-dynamic and local variation, in-situ error detection
and correction (EDAC) techniques have been proposed [1, 2, 11–17]. In these techniques,
error detecting registers are inserted in critical paths to in-situ detect and correct timing
violations via hardware. This can eliminate the margins for fast-dynamic and local variation.
In addition, the error statistics from those registers can also inform a DVFS controller to
modulate VDD and FCLK, which can eliminate the margins for global and slowly changing





Ultra-dynamic-voltage-scaling (UDVS) has been proposed for applications that require both
high performance and high energy efficiency [22]. UDVS can provide peak performance by
operating at nominal VDD, while it can also achieve extremely high energy efficiency by
scaling VDD down to the ULV regime under average and low workloads. UDVS is applicable
to a wide range of computing applications, including data centers, personal computing,
mobile electronics, and embedded computing systems, to further push their performance and
energy-efficiency limits.
1.4.2 Dynamic-Thermal-Management (DTM)
Dynamic-thermal-management (DTM) techniques have been proposed for monitoring and
controlling the thermal behavior of a system for high-performance yet reliable operation
[23–27]. Continuing miniaturization and higher levels of integration have led to impressive
performance achievements in modern high-performance microprocessors. However, they
have also substantially increased power densities, which has raised reliability concerns (e.g.,
electromigration, TDDB, and NBTI) and increased cooling costs due to excessive heat
[23, 24, 28]. A DTM technique can embed multiple temperature sensors on a chip and use
the temperature information they provide to monitor the heat level and control it (e.g.,
through performance throttling).
1.4.3 Emerging Architectures for Cognitive Computing
In the ongoing quest to enable energy-efficient cognitive computing, parallel,
non-instruction-based architectures have emerged as promising candidates [29–32]. Unlike
traditional Von Neumann architectures, these emerging architectures have no notion of an
instruction or a program counter. They often use a dataflow approach in which execution is
driven by input data. Also, in contrast to the Von Neumann architecture with its separate
processing unit and memory, these emerging architectures often have distributed memory
mixed with logic.
1.5 Challenges and Contribution of this Thesis
1.5.1 EDAC Techniques for ULV Design and Emerging Architectures
The large delay variability in the ULV regime remains one of the largest challenges. As
discussed in Section 1.3, EDAC coupled with DVFS can remove virtually all the timing
margin requirements and can achieve near-optimal energy efficiency. However, as will be
discussed in the following chapters, the conventional EDAC techniques suffer from several
challenges when applied to ULV designs and emerging non-Von-Neumann architectures.
Also, conventional EDAC-based power management systems can suffer degraded energy
efficiency when they use conventional voltage-based regulation, which becomes prominent in
energy-constrained applications. Therefore, in the following chapters, we propose new EDAC
techniques and power management systems that are more suitable for energy-constrained
ULV designs and emerging architectures.
Chapter 2 analyzes the challenges of conventional EDAC techniques when applied to ULV
designs and presents a design approach for improving the resiliency of a ULV microprocessor
through a voltage-scalable and low-overhead in-situ EDAC technique. Particular efforts
are made to overcome the poor voltage scalability and area/energy/throughput overhead
of the existing EDAC techniques when applied to ULV designs. A 16-bit microprocessor
employing the proposed EDAC and dynamic voltage scaling schemes is demonstrated in
65nm CMOS.
Chapter 3 studies the challenges of conventional EDAC techniques when applied to the
emerging architectures discussed in Section 1.4.3. The conventional EDAC techniques
optimized for Von-Neumann architectures operating at super-VTH pose several challenges when
applied to the emerging architectures operating in the ULV regime. One of the major challenges
is that these architectures do not have a program counter and have distributed memory
mixed with logic. This makes the commonly used instruction-replay-based correction
scheme inefficient. Therefore, in this chapter, a new EDAC technique based on local body
swapping is proposed that can correct errors without replaying or stalling
the pipeline. Using these techniques, an unsupervised waveform sorter based on a spiking
neural network (SNN) for brain computer interface (BCI) microsystems is demonstrated in
65nm CMOS.
Chapter 4 presents an EDAC-based power management (PM) system and microprocessor
design. Conventional EDAC-based PM systems using voltage-based regulation schemes
require a variable reference voltage (VREF) generator, which degrades the overall energy
efficiency of the system, and incur non-negligible energy loss due to the large latency in
translating error information from EDAC into the optimal VREF. Therefore, in this work, the
PM system and the microprocessor are co-designed in 65nm CMOS. The system consists of
(1) a microprocessor employing near/sub-VTH EDAC; (2) a 63-ratio integrated
switched-capacitor DC-DC converter (SCDC); and (3) a fully-digital EDAC-SCDC controller.
The system directly regulates the timing error of EDAC: the controller receives error events
from EDAC and adaptively produces the settings (ratio and clock) of the SCDC.
1.5.2 Circuits for Emerging Techniques
In this thesis, we also explore circuits for the emerging techniques discussed in Section 1.4.
In the following chapters, we discuss a new temperature sensor design for DTM and an
interconnect design for UDVS systems.
Chapter 5 presents an on-chip temperature sensor circuit for dense thermal monitoring.
The design of the on-chip temperature sensor is critical for DTM techniques, as the number
of sensors deployed and their accuracy directly relate to the performance of DTM. With the
emerging technology trends toward multicore architectures, 3D-IC, and UDVS, sensors
need to be even smaller, more accurate, and more voltage-scalable. The proposed
sensor, prototyped in 65nm CMOS, has a footprint of 30.1µm², a 3σ-error of 1.1°C across 0
to 100°C after one-temperature-point calibration (OPC), and voltage scalability down to
0.4V, marking a significant improvement over the prior art for accurate and dense thermal
monitoring in VLSI systems.
Chapter 6 proposes a reconfigurable interconnect design based on a regenerator to improve
the performance and energy efficiency of UDVS systems. In developing UDVS systems,
one of the critical challenges is to mitigate the inflexibility of various circuit fabrics
that are often optimized for a single VDD. One example is the design of repeater-based long
interconnects. In this chapter, a study with a 10mm interconnect shows that the conventional
repeater-based interconnect design has non-optimal performance and energy efficiency in
UDVS systems, while the proposed reconfigurable regenerator-based approach can achieve near




Microprocessor with a Low-Overhead,
Within-a-Cycle In-Situ Timing-Error
Detection and Correction Technique
2.1 Motivation
Ultra-low-voltage (ULV) operation has gained a significant amount of attention for highly
energy-efficient digital integrated circuits (ICs). Supply voltage (VDD) of ICs can be scaled
down to near or below transistor threshold voltage (VTH) for increasing energy efficiency,
prolonging battery lifetime, and miniaturizing systems. Those benefits can enable a range
of exciting applications such as medical implants, environment sensors, micro robots, and
other so-called cyber-physical systems.
One of the most critical challenges in designing ULV ICs is mitigating delay variability.
In the ULV regime, device current becomes exponentially sensitive to process, voltage, and
temperature (PVT) variations. The large variability forces designers to add an excessive
amount of margin to ensure correct operation under the worst-case PVT conditions. Such
margin, however, can severely limit the performance and energy efficiency of the ICs when
they operate under nominal or best-case conditions. In [33], it is shown that the worst-case
margin can force a chip to operate at only 10% of its potential performance, although the
chance of experiencing the worst-case condition is slim.
Error detection and correction (EDAC) techniques [1,2,11,13,15,17,34,35] have been pro-
posed to eliminate such margins while still ensuring correct operation across PVT variations.
The conventional EDAC techniques use special pipeline registers having error detection capa-
bility. Those error-detecting registers, which are employed as the receiving pipeline registers
for critical and near-critical paths, capture incoming data at two times, i.e., (i) at a clock
edge and (ii) during a detection window which is often the high phase of the clock. If those
two captured data are different, it is interpreted as a timing error. In those detection tech-
niques, the signals which propagate through short paths may be captured in the detection
window, causing false-error detection. To avoid false-error detection, designers need to insert
delay buffers into the short paths such that the delays of all the paths become longer than
the error-detection window. Once an error is detected, the correction scheme (e.g., instruction
replay) corrects it by replaying the erroneous instruction at a slower clock frequency
(FCLK).
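The detection rule and the short-path hazard described above can be summarized in a small behavioral sketch. The function names and the time convention (arrival times measured from the launching clock edge, with the detection window starting at the capturing edge) are assumptions made for illustration, not a model of any specific register design.

```python
def classify_arrival(arrival, t_clk, window):
    """Classify a long-path data transition at an error-detecting register.

    arrival: transition time measured from the launching clock edge.
    t_clk:   clock period; the capturing edge occurs at t_clk.
    window:  length of the detection window that follows the capturing edge.
    """
    if arrival <= t_clk:
        return "on_time"                # both captures see the same value
    if arrival <= t_clk + window:
        return "timing_error_detected"  # the two captured values differ
    return "undetected_error"           # arrives after the window has closed


def short_path_false_error(short_delay, window):
    """Return True if a short path triggers a false error.

    Data launched at the capturing edge reaches the register after short_delay.
    If that is shorter than the window, the value changes inside the window and
    is flagged even though no real timing violation occurred.
    """
    return short_delay < window
```

Padding a short path until `short_path_false_error` returns False is the delay-buffer insertion discussed in the text.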
Dynamic voltage scaling (DVS) or dynamic voltage frequency scaling (DVFS) is often
employed along with EDAC techniques [1, 2, 11, 13, 34]. The controller for DVS/DVFS can
take the error rate from the error-detecting registers and modulate the operating conditions,
i.e., VDD or cycle time (TCLK), so that the circuits operate on the edge of failure. Such
closed-loop systems allow us to remove the margin for static variations (e.g., process
variations) and slow-varying variations (e.g., ambient temperature changes) without any
post-silicon calibration. Fast dynamic variations, such as voltage droops, local hot/cold spots,
and coupling noise, can be detected and corrected by EDAC techniques. In this way, the
margin for almost all variations can be removed.
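A minimal sketch of such a closed loop is given below. It assumes a simple bang-bang policy with a hypothetical error-rate target band and voltage step; the controllers in the cited works may use different policies, and none of the numeric defaults are taken from them.

```python
def next_vdd(vdd, error_rate, target_low=0.001, target_high=0.01,
             step=0.01, vdd_min=0.3, vdd_max=1.0):
    """One iteration of an error-rate-driven voltage controller (illustrative).

    error_rate: fraction of cycles flagged by the error-detecting registers
                over the last observation interval.
    The target band, step size, and voltage limits are assumed values.
    """
    if error_rate > target_high:
        vdd += step      # too many errors: back away from the edge of failure
    elif error_rate < target_low:
        vdd -= step      # hardly any errors: margin is left, so scale down
    # Inside the band the supply is held constant.
    return min(vdd_max, max(vdd_min, round(vdd, 6)))
```

Calling `next_vdd` once per observation interval walks VDD toward the point where errors just begin to appear, which is how the loop tracks static and slowly varying conditions without calibration.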
The conventional EDAC techniques [1, 2, 11–17], however, cannot be directly applied to
ULV designs because of the following critical problems. (1) A significantly larger number of
registers need to be replaced with bulky error-detecting ones. Note that an error-detecting
register typically has 8 to 44 more transistors than a conventional register [1,2,11–17]. (2)
A large amount of short-path padding is needed, incurring large area and energy overhead.
(3) The conventional error-detecting register circuits become unreliable. (4) Timing error
rates can increase, degrading energy efficiency and throughput.
In this work, in order to mitigate those problems, we propose a voltage-scalable,
low-overhead, and within-a-cycle EDAC technique, which consists of the following three
sub-techniques. (1) We devise a sparse error detection strategy where errors are detected
every several pipeline stages rather than at every stage, without compromising detection
coverage. The benefits are a substantially smaller number of inserted error-detecting registers
and the elimination of the short-path padding requirement. (2) We design an error-detecting
latch circuit that can operate reliably at very low voltage. (3) We develop an error-correction
scheme where errors are detected and corrected within a cycle, without stalling pipelines.
This eliminates the control overhead of the existing multi-cycle detection and correction
process.
Based on the proposed EDAC technique, a variation-tolerant Resilient-Processor
(R-Processor) is designed and fabricated in 65nm CMOS. At a typical PVT corner, R-Processor
can reduce the minimum energy consumption (Emin) by 42% at a 140mV lower VDD, as
compared to the baseline processor operating with the worst-case voltage margins. At the
same FCLK=80MHz where the baseline processor achieves its Emin, R-Processor consumes
38% less energy. Finally, R-Processor can achieve 2.3× higher throughput at the same energy
consumption as the baseline operating at its energy-optimum supply voltage (VOPT). The
area overhead of the proposed EDAC technique in R-Processor is only 8.3%.
Figure 2.1: (a) The conventional flop-based error detection. (b) The conventional latch-based
error detection. (c) The proposed sparse error-detection. (ED: error detection pipeline
registers)
2.2 The Challenges of Conventional Error Detection
Techniques in ULV Design and Proposed Error Detection
Techniques
2.2.1 Conventional Flop based Error Detection
Short-path padding: When the existing flop-based EDAC techniques operating at super-VTH
VDDs [1, 11–17] are directly applied to ULV designs, they suffer severely from the area
overhead caused by short-path padding. The conceptual schematic is shown in Fig. 2.1(a).
In this technique, any data arriving in the error detection window is regarded as a timing
error. However, the signals that propagate through short paths can also arrive in this window,
causing false error detection. False error detection does not affect the functionality of pipelines
but can waste significant energy and throughput due to the unnecessary exercise of error
correction processes (e.g., instruction replay in [11–13,15,16]). To distinguish those correct
signals arriving through short paths from actual timing errors, delay elements (e.g., buffers
and inverters) are typically inserted to ensure that the delays of short paths are longer than
the error detection window.
Figure 2.2: At lower VDDs, (a) more short paths need to be delay-padded, and
(b) more flops need to be replaced with error-detecting registers.
Short paths must be longer than the detection window even under the worst-case PVT
condition. As shown in Fig. 2.2(a), this makes the overhead of short-path padding
increase at lower VDDs, where delay variability becomes larger. In our experiment using a
single-stage 16b multiplier synthesized at 40 FO4 delays in 65nm CMOS, short paths should
be longer than 61% of the clock period (TCLK) when considering the 3σ delay variation
incurred by local process variations at 0.35V. As a result, a large number of delay buffers are
inserted, causing a 2.2× increase in combinational-logic area, as compared to the baseline
design without error detection capability.
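The padding requirement can be expressed as a simple per-path calculation. In the sketch below, `fast_derate` is a hypothetical factor (at most 1) describing how much faster a short path may become under variation; the function and its inputs are illustrative and do not reproduce the synthesis flow used in the experiment.

```python
def padding_delay(short_path_delays, window, fast_derate=1.0):
    """Extra nominal delay each short path needs to clear the detection window.

    short_path_delays: nominal minimum delays of the paths (any time unit).
    window:            detection window length in the same unit.
    fast_derate:       assumed worst-case speed-up factor (<= 1) of a short
                       path; the nominal delay must be at least window / fast_derate.
    """
    if not 0.0 < fast_derate <= 1.0:
        raise ValueError("fast_derate must be in (0, 1]")
    required = window / fast_derate
    return [max(0.0, required - d) for d in short_path_delays]
```

A smaller `fast_derate`, which corresponds to larger variability at lower VDD, raises the required delay for every short path and therefore the number of buffers inserted.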
To avoid the overhead incurred by short-path padding, some of the previous works have
proposed to reduce the duty cycle of the clock below 50% [13, 15, 34] or to generate an
internal detection window in each error detector [11]. While those approaches can relax the
requirement for short-path padding, they also reduce the size of the detection window for
timing errors. This creates a design trade-off between the detection window, which dictates
the degree of tolerance to dynamic and local variation, and the short-path constraint, which
dictates the overhead of added delay buffers. This is undesirable in ULV designs due to the
larger delay variability. Additionally, the large delay variability makes it difficult to generate
and distribute such a clock signal with a skewed duty cycle and/or to locally generate a pulse
(i.e., a fixed detection window created via delay elements), whose quality also suffers from
large variability.
Error-detector insertion rate: Another major source of area overhead in the conventional
flop-based technique is the number of error-detecting registers to insert. Typically, the delays
of critical and near-critical paths are estimated under the worst-case dynamic variations. Those
flops that receive data from the paths which can potentially violate TCLK are replaced with
error-detecting registers. In the ULV regime, more paths are likely to violate TCLK due to the
higher sensitivity to a range of dynamic variations. A few notable examples of dynamic
variations include IR drops, particularly in designs employing distributed power gating
switches [36,37], and coupling noise [38], particularly in designs using multiple VDDs and
VTHs where the strength difference between aggressors and victims is large.
We investigate the number of critical and near-critical paths requiring error detection
using the single-stage 16b multiplier across VDDs from nominal down to 0.3V. As shown in
Fig. 2.2(b), every path whose delay is longer than 76% of TCLK should be monitored at
0.35V, whereas only the paths longer than 92% of TCLK need to be monitored at 1V. The
increased number of critical and near-critical paths requires 44% of the total flops to be
replaced with error-detecting registers at 0.35V, as compared to only 19% of the total flops
at 1V. Note that some of the conventional works targeting nominal VDD operation [1,11,13]
report lower replacement rates of 7 to 17%, partly due to the imbalanced delays among stages.
It is, however, common to find designs having similar delays among stages, and therefore
such opportunistic savings in the replacement rate can be limited.
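The selection rule behind the replacement rate can be sketched as a threshold on the worst incoming path delay of each register. The register names and delay values in the usage note are hypothetical; only the 92% and 76% monitoring thresholds come from the text.

```python
def endpoints_to_replace(endpoint_delays, t_clk, monitor_fraction):
    """Registers whose worst incoming path exceeds monitor_fraction * t_clk.

    endpoint_delays: dict mapping a register name to the nominal delay of its
                     longest incoming path (same unit as t_clk).
    """
    threshold = monitor_fraction * t_clk
    return sorted(name for name, d in endpoint_delays.items() if d > threshold)


def replacement_rate(endpoint_delays, t_clk, monitor_fraction):
    """Fraction of registers that must become error-detecting registers."""
    selected = endpoints_to_replace(endpoint_delays, t_clk, monitor_fraction)
    return len(selected) / len(endpoint_delays)
```

For a made-up set of delays such as `{"r0": 0.95, "r1": 0.80, "r2": 0.60, "r3": 0.93}` with `t_clk = 1.0`, lowering the threshold from 0.92 to 0.76 raises the replacement rate from 0.5 to 0.75, which mirrors the trend reported for lower VDD.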
2.2.2 Conventional Two-Phase Latch based Error Detection
An EDAC technique based on two-phase latch sequencing has recently been proposed,
primarily focusing on reducing architectural invasiveness [2]. The conceptual schematic is
shown in Fig. 2.1(b). An additional benefit of using latch-based sequencing for EDAC
techniques is the elimination of the false error detection induced by short paths. Since each
consecutive latch stage becomes transparent at the opposite phase of the clock signal, no new
data from the previous latch stage is launched during the transparent phase of the current
latch stage, inherently eliminating false error detection. The technique uses the
cycle-borrowing window of latch stages as the error detection window. If time borrowing
occurs, it is interpreted as a timing error.
The EDAC techniques for two-phase latch based pipelines, however, have their own
challenges:
Sequential overhead: Transforming a flop-based design to a two-phase-latch-based
design can increase sequential overhead. In our experiment, such a transformation performed
on a 16b multiplier can increase the sequential area by 2.6× and the total area by 18%. This
is because (i) a pair of latches has a larger area than a single flop and (ii) the total number
of latches is more than twice the number of flops, e.g., 16 flops (roughly 32 latches) in the
original design are transformed into 39 latches. While the overhead is considerable, it is
still worthwhile to note that the inherent overhead of latch-based pipelines is significantly
smaller than the overhead of short-path padding in flip-flop-based pipelines when we apply
the EDAC techniques at 0.35V.
Error-detector replacement rate: Applying an EDAC technique to latch-pipelined
designs can significantly increase the number of error-detecting registers. This is because a
latch-pipelined design has more sequential elements than a flop-based design. In addition,
the delay of one latch stage is shorter (close to half of that of one flop stage), which can
amplify the impact of local variations. In our multiplier test circuits, a latch stage has
1.7× higher variability than a flop stage. As a result, 23 out of 39 latches need to be
replaced with error-detecting ones, while only 7 out of 16 flops are replaced in the flop-based
design.
Figure 2.3: The proposed sparse error detection technique inserts EDLs every N stages,
instead of every stage. The delay increase (timing error) can propagate to the next stage via
cycle borrowing and may disappear as it passes through a non-critical path or be cycle-
borrowed again. We insert error-detecting latches before the delay accumulation can exceed
the cycle borrowing window.
2.2.3 Proposed Sparse Error-Detecting Register Insertion
2.2.3.1 Concept
In order to minimize the replacement rate of error-detecting registers, we propose a sparse
error detection technique where error-detecting registers are sparsely inserted across multiple
pipeline stages (Fig. 2.1(c)). The technique is based on two-phase latch based pipelines for
eliminating the short-path padding requirement. In the proposed technique, we do not detect
the delay increase (potential timing error) generated in every latch stage. Instead we let it
be cycle-borrowed to the subsequent stages. As shown in Fig. 2.3, the cycle-borrowed delay
increase may disappear in the next stage while propagating through non-critical paths, or is
cycle-borrowed again to the following stage. The error-detecting latch is inserted just before
the accumulation of the delay increases across several stages is expected to exceed the size
of the cycle borrowing window. This sparse detection technique can significantly reduce the
overhead incurred by the high error-detecting latch replacement rate, even in pipelines
whose stages are balanced. In addition, it can reduce the number of actual errors and the
cost involved to correct them.
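The accumulation and self-healing of borrowed time can be traced with a short behavioral sketch. It assumes an idealized two-phase latch pipeline in which each latch stage has a nominal budget of half a clock period (`t_half`) and any excess is carried to the next latch; latch delays, clock skew, and the correction mechanism are ignored, and the function names are hypothetical.

```python
def propagate_borrowing(stage_delays, t_half):
    """Time borrowed at each receiving latch of a two-phase latch pipeline.

    stage_delays: logic delay of each latch stage, in order.
    t_half:       nominal time budget per latch stage (half a clock period).
    A stage faster than its budget absorbs previously borrowed time.
    """
    borrowed = 0.0
    trace = []
    for delay in stage_delays:
        borrowed = max(0.0, borrowed + delay - t_half)
        trace.append(borrowed)
    return trace


def check_sparse_detection(stage_delays, t_half, window, n):
    """Place an error-detecting latch (EDL) at every n-th latch.

    Returns (flagged, overflow): the 1-based latch indices where an EDL sees
    borrowed time, and the indices where the borrowed time exceeds the
    cycle-borrowing window, which means the data would be missed.
    """
    trace = propagate_borrowing(stage_delays, t_half)
    flagged = [i for i, b in enumerate(trace, start=1) if i % n == 0 and b > 0.0]
    overflow = [i for i, b in enumerate(trace, start=1) if b > window]
    return flagged, overflow
```

With delays of 24, 16, 23, and 22 FO4 against a 20-FO4 budget, the time borrowed at the first latch disappears at the second, so an EDL at latch 2 stays silent and only the EDL at latch 4 flags an error. Choosing `n` too large for the delay spread shows up as a non-empty `overflow` list.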
Sparse insertion reduces the maximum detection capability per stage; however, the technique
can maximize the utilization of the detection window if not every stage requires
the largest-possible detection window of 50% of TCLK. This can be the case when EDAC is
combined with DVFS. In this combination, the global and static/slow-dynamic variations can
be tracked as the DVFS makes the pipeline operate at the point of first failure (PoFF). Only
the remaining local (random) and fast-dynamic variations need to be detected by EDAC. When
we use the 6σ worst-case delay variation from Monte-Carlo simulation with local process
variation as a proxy for these fast-dynamic and local variations, we find that the detection
window required in each stage is significantly less than 50% of TCLK. Therefore, we can share
the detection window across multiple stages. For this reason, in the proposed scheme, most
of the cycle-borrowing window is reserved for the delay variability induced by dynamic
variations, by operating the pipeline at the PoFF.
Another significant benefit of sparse insertion of error-detectors is the reduced error rate.
A critical path in one stage may not directly feed another critical path in the next stage [39].
Therefore, some of the delay surplus produced in a stage can disappear in the next stage via
cycle-borrowing without explicitly flagging timing errors. This self-healing effect can reduce
the overhead associated with detecting and correcting errors, saving energy and improving
throughput even under the operating condition with large variability.
The benefit of purely latch-based pipelines in the ULV regime has been demonstrated in some
of the previous works [40–42]. However, the use of cycle-borrowing alone cannot completely
remove the worst-case margin across a range of PVT variations because the variability is severe.
2.2.3.2 Inverter Chain Study - Simulation Setup
In order to evaluate the robustness and effectiveness of the proposed sparse error detection
technique, we perform SPICE-level simulations using circuits of 20 latch stages where each
stage has a 25-FO4 long inverter chain. First, we determine the minimum TCLK by measuring
the delay of two latch stages, which include the delays of an inverter chain and a pair of
latches. A minimal margin of 1 FO4 delay is added to TCLK in order to account for input and
clock uncertainties. Second, Monte-Carlo simulations with local process variations are
performed. The TCLK is set to the value found above. Across the simulations, the data arrival
time in each latch stage is observed to determine whether the data is properly captured.
In this work, the 6σ worst-case delay variability from local process variations is used to
account for all the dynamic variations. The sources of dynamic variation include IR-drop
(particularly across local power gating switches) [36, 37], clock jitter, capacitive coupling
[38], and temperature cold spots. In ULV operation, a smaller amount of driving current and
a relatively slow clock frequency can reduce the concern about inductive noise. In addition,
device current has a positive temperature coefficient, i.e., current increases with higher
temperature, in the near- and sub-threshold regimes, which can relieve the concern about
temperature hot spots.
Precisely estimating the amount of dynamic variation is a design-specific task and is beyond
the scope of this work.
Figure 2.4: When error-detecting registers are inserted at every latch stage, only 19% of
error-detection window is utilized even at 0.3V, motivating the proposed sparse insertion of
error-detecting registers.
2.2.3.3 Inverter Chain Study - Conventional Case (N=1)
First, we analyze the conventional case where error-detecting registers are inserted in every
stage, i.e., insertion sparseness or N is 1. The required window is defined as the minimum
amount of window needed to capture 6σ worst-case delay from the Monte-Carlo simulation.
As shown in Fig. 2.4, simulations show that the required amount of error-detection window
increases as VDD is scaled down since the variability grows. The results, however, also show
that even at 0.3V, the small error-detection window of 19% of TCLK is sufficient for the
worst-case dynamic variations, making the remaining error detection window of 31% of TCLK
Figure 2.5: The required amount of error-detection window increases as the granularity of
pipeline reduces, but still less than 25% of TCLK even for 10-FO4 long latch stages.
redundant. In addition, we experiment with different lengths of latch stages, particularly
because the delay variability can become worse in fine-grained pipeline stages, demanding
a wider error-detection window. At 0.35V, as shown in Fig. 2.5, we observe that the required
size of the window increases at finer-grained latch stages due to the diminishing amount of
averaging effects. However, again, even for a 10-FO4 long latch stage (a 20-FO4 long stage
in a flop-based design), only 25% of TCLK is sufficient for the error detection window when
error-detecting registers are inserted in every stage (i.e., N=1). The under-utilized detection
window is a motivating point for the proposed sparse insertion of error-detecting registers.
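The averaging effect mentioned above can be made explicit with a short derivation. It assumes, as an idealization, that a latch stage consists of $n$ gates whose delays are independent and identically distributed with mean $\mu_g$ and standard deviation $\sigma_g$ (symbols introduced here for illustration):

$$\mu_{\mathrm{stage}} = n\,\mu_g, \qquad \sigma_{\mathrm{stage}} = \sqrt{n}\,\sigma_g, \qquad \frac{\sigma_{\mathrm{stage}}}{\mu_{\mathrm{stage}}} = \frac{1}{\sqrt{n}}\cdot\frac{\sigma_g}{\mu_g}$$

Under this assumption the relative variability of a stage grows as $1/\sqrt{n}$ when the stage is shortened, so the required window, expressed as a fraction of TCLK, increases for finer-grained stages. Real gate delays in the ULV regime are neither perfectly independent nor Gaussian, so simulated ratios can deviate from this simple scaling.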
Figure 2.6: A significant number of latch stages can be skipped before error-detecting
registers are inserted. At VDD>0.4V, the optimal sparseness is estimated to be larger than 20
latch stages.
2.2.3.4 Inverter Chain Study - Sparseness Optimization
The under-utilization of the error-detection window motivates us to investigate a way to
sparsely insert error-detecting registers coupled with latch-based sequencing. This way we
can accumulate delay surplus across stages without explicitly causing timing error, and the
sparsely placed detection stage can utilize the entire error-detection window of 50% of TCLK.
First, we use the test circuits of 20 latch stages, each of which is 25-FO4 long at VDDs
ranging from 1V to 0.3V. As shown in Fig. 2.6, at 0.35V, up to 7 latch stages can be
skipped without placing error-detecting registers under the 6σ worst-case dynamic variations.
A notable observation is that the required error-detection window is considerably low at
VDD>0.4V. This is because the large cycle-borrowing window (50% of TCLK) and the added
1-FO4 (2.5% of TCLK in this case) margin, coupled with a smaller amount of delay variability,
Figure 2.7: The optimal sparseness is estimated to range from 4 to 14 for latch-stage
lengths of 10-FO4 to 100-FO4 delays.
are sufficient to absorb all the dynamic variations at those relatively high VDDs. The optimal
N, therefore, is larger than 20 latch stages (10 stages in a flop-based pipeline) and is not
found in our simulation.
We also investigate the optimal sparseness across different lengths of latch stages from 10
to 100 FO4 delays at 0.35V. As shown in Fig. 2.7, the optimal sparseness increases as latch
stages become longer due to the larger amount of averaging effects. For the very aggressive
latch stage of 10-FO4 delays, the optimal sparseness is still 4 (i.e., 2 stages in flop based
pipeline). The optimal sparseness grows to 14 when latch stage is 100-FO4 long.
2.2.3.5 Optimal Sparseness N for General Pipelines
In large-scale pipeline designs, it is not trivial to do the brute-force search for N as we did
for inverter chains in Section 2.2.3.4. A suboptimal yet effort-saving approach is
to find N for the several longest paths among all the stages. Critical paths can be
found using commercial tools for static timing analysis (STA) and automatic test pattern
generation (ATPG). For the critical paths found, we can run Monte-Carlo simulation with
local process variations to estimate the mean (µ) and the standard deviation (σ) of the delay.
This µ and σ can then be used conservatively for all the other stages. Finally, as shown in
Eq. 2.1, because the variances of independent random variables add, we can estimate the
optimal N that fully utilizes the detection window while still covering the worst-case
dynamic variation just found (i.e., the 6σ value).
6\sqrt{N\sigma^{2}} < \mathrm{DetectionWindow} \qquad (2.1)
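As a rough illustration of how the bound in Eq. 2.1 might be evaluated, the following Python sketch solves the inequality for the largest integer N. The function name `max_sparseness` and the example numbers are assumptions made for illustration and are not taken from the test circuits; the sketch also assumes that every stage shares the same σ.

```python
import math

def max_sparseness(sigma_fo4: float, window_fo4: float, k_sigma: float = 6.0) -> int:
    """Largest integer N with k_sigma * sqrt(N) * sigma < window (0 if none)."""
    if sigma_fo4 <= 0 or window_fo4 <= 0:
        raise ValueError("sigma and window must be positive")
    n = math.floor((window_fo4 / (k_sigma * sigma_fo4)) ** 2)
    # Enforce the strict inequality when the bound is met with equality.
    while n > 0 and k_sigma * math.sqrt(n) * sigma_fo4 >= window_fo4:
        n -= 1
    return n

# Illustrative numbers only: sigma = 1 FO4 per latch stage, window = 20 FO4.
print(max_sparseness(1.0, 20.0))
```

In practice the σ obtained from Monte-Carlo simulation of the critical paths would be passed in, and the result would still need to be checked against the actual circuit structure.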
2.2.4 Case Study with 3-Stage Pipeline
In Section 2.2.3, we proposed and investigated the sparse insertion technique using FO4 inverter
chains. In this section, we apply the proposed technique to a more realistic benchmark
circuit, a 3-stage pipeline design based on three 16b multipliers, targeting VDD=0.35V
and TCLK=40-FO4 delays (Fig. 2.8).
Figure 2.8: Diagrams of (a) flop based, and (b) latch based 3-stage pipeline circuits using
three 16b multipliers. (c) Multiple latch based pipelines are implemented for the different
sparseness values.
2.2.4.1 Two-Phase Latch based Sequencing
We design test circuits that are pipelined using two-phase latches. Using industrial
CAD tools and custom scripts, the 3-stage flop-based pipeline circuits are retimed into
6-latch-stage ones (Fig. 2.8). Retiming is performed using half the TCLK (20 FO4 delays) of the
original flop-based pipeline. We reserve the cycle-borrowing window for later use by
the proposed error detection. The latches are therefore treated as flip-flops during retiming,
and the timing closure step becomes the same as that of the conventional flop-based design.
Finally, custom automatic scripts are created to replace the flops in the odd stages with
transparent-low latches and those in the even stages with transparent-high latches. Fig. 2.9
shows the distribution of the longest path delay at the receiving registers in the latch-based
pipeline circuits. The nominal delay of the latch stage including sequential overhead is 20
Figure 2.9: The delay distribution of receiving latches in the two-phase latch-based pipeline
circuits.
FO4 delays.
Before being equipped with error-detection capability, the latch-based design exhibits
18% larger area and 2.4× larger clock load than the flop-based one, which translates to roughly
24% energy overhead in the test circuits. This is because a pair of latches has a larger
clock load than a single flop, and the total number of latches is more than
twice the number of flops for the same pipeline circuits. As shown in Fig. 2.8, the flop-based
and latch-based pipelines have 48 flops and 117 latches (48 transparent-high and 69
transparent-low latches), respectively. This overhead is inherent to latch-based pipelines.
The gains from (i) cycle-borrowing and (ii) the elimination of short-path padding
when using the proposed technique, however, largely outweigh this intrinsic overhead, as we will
see in Sections 2.2.4.2 and 2.2.4.3.
2.2.4.2 Sparse Error Detection
Figure 2.10: Pseudo algorithm for error-detecting register insertion.
Now we replace some of the latches and flops with error-detecting ones based on
the algorithm, with several user-defined constraints, shown in Fig. 2.10. The replacement
process starts by finding the µ and σ of the critical path as discussed in Section 2.2.3.5. In
the test circuits, the longest path appears in the first half stage (the stage having
transparent-low latches at the end), which has µ=20 FO4 and σ=1.4 FO4 delays (Table 2.1). Next, we
find the optimal sparseness (Noptimal) based on the µ and σ and Eq. 2.1, which is found to be 6 at
0.35V. We still explore several N values from 1 to 6 to examine the non-linearity caused by
the circuit structures. For a given insertion sparseness N, the algorithm finds the required
window (RW), i.e., the size of the error-detection window required for that
sparseness. The process then finds the pipeline latches that receive data from
paths with slack smaller than the RW. For example, for the latch-based design with
N=1, the RW is found to be 9.2 FO4 delays (23% of TCLK). The latches which receive data
Table 2.1: Pseudo algorithm for error-detecting register insertion.
from paths longer than 10.8 FO4 (i.e., 20 - 9.2) need to be replaced with error-detecting
latches.
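The selection step described above can be sketched in Python as follows. This is only a minimal illustration of the "slack smaller than RW" rule, not the algorithm of Fig. 2.10 itself; the function name, the latch names, and the per-latch delays are hypothetical, while the 20-FO4 stage length and the 9.2-FO4 RW follow the N=1 example in the text.

```python
def select_error_detecting(path_delay_fo4, stage_len_fo4, required_window_fo4):
    """Return the registers whose longest incoming path has slack < RW."""
    selected = []
    for name, delay in path_delay_fo4.items():
        slack = stage_len_fo4 - delay
        if slack < required_window_fo4:
            selected.append(name)
    return sorted(selected)

# Hypothetical longest-path delays (in FO4) at four receiving latches.
delays = {"lat_a": 8.0, "lat_b": 12.5, "lat_c": 19.0, "lat_d": 10.0}

# Stage length 20 FO4 and RW = 9.2 FO4, so paths longer than 10.8 FO4 qualify.
print(select_error_detecting(delays, 20.0, 9.2))
```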
We perform the same insertion process (Fig. 2.10) for the flop-based design. In the flop-based
design, however, N is set to 1 since cycle-borrowing is not supported. As with
the latch-based design, we extract the critical path of the flop-based design to determine the
µ and σ (which are 40 FO4 and 1.6 FO4 delays) at 0.35V. Using the µ and σ, the RW is
calculated to be 24% of TCLK (9.6 FO4). We find the flops that receive data from
paths with delay longer than 30.4 FO4 delays (i.e., 40 - 9.6), then replace those with
error-detecting ones.
The final and intermediate results of the insertion process are summarized in Table
2.1. The RW values in the insertion algorithm are found for N = 1, 2, 3, and 6. The
optimal N is found to be 6, which is in fact similar to the estimation using inverter chains in
Section 2.2.3.4. At N=6, the total number of error-detecting registers is only 16, whereas at
N=1, as in the conventional two-phase-latch-based EDAC techniques [2], the algorithm
replaces 69 out of 117 latches with error-detecting registers. A notable observation
is that the number of error-detecting registers for N=3 is larger than for N=2. This is because
the stage width in the middle of logic circuits is typically wider than that in the input
or output parts of logic circuits. Also, the paths ending in the transparent-low latch stage are
more critical (more of them are located on the right side of the histogram in Fig. 2.9) than the
paths ending in the transparent-high latch stage. This observation implies that there is an
additional overhead-reducing opportunity: placing the error-detecting stage at a location in
the pipeline where the bit width is small.
2.2.4.3 Comparisons of Error Detection Techniques
Finally, we compare four design approaches by applying them to the same benchmark pipeline
circuits: I. no error detection technique, II. the conventional error detection technique based
on flop-based sequencing [1], III. the conventional error detection technique based on two-phase
latch-based sequencing (N=1) [2], and IV. the proposed sparse insertion technique (N=6).
The area, error-detecting register count, and timing violation rate are investigated.
For technique II, we use the well-known error-detecting register circuit having a main
flip-flop, a shadow latch, an XOR gate, and a meta-stability detector [1]. For techniques
III and IV, we use the error-detecting register which has a main latch, a shadow latch with
the opposite phase, and an XOR gate [2].
Fig. 2.11(a) shows that the technique II can incur more than 2× area overhead in
combinational logic due to the excessive amount of short-path padding requirement. The
technique III uses two-phase latch based sequencing and incurs little increase in logic area
since no short-path padding is necessary. The sequential area of technique II is increased
Figure 2.11: Comparison of the conventional techniques and the proposed sparse insertion
technique: (a) combinational area, (b) sequential area, (c) total number of error-detectors.
(Abbreviation: I. baseline without error-detection capability II. conventional flop EDAC [1]
III. conventional two-phase latch based EDAC with N=1 [2] IV. the proposed EDAC with
sparsely inserted error-detecting registers [N=6])
by 1.8× (Fig. 2.11(b)) as 21 out of 48 flip-flops (44%) are replaced with error-detecting
registers. The area for sequential circuits and the total area in technique III are increased
by 4.1× and 50%, respectively as compared to the design without error detection capability,
i.e., the technique I.
The proposed technique, i.e., the technique IV, significantly reduces the count of error-
detecting registers by 1.3× and 4.3×, as compared to the techniques II and III, respectively
(Fig. 2.11(c)). The total area is also reduced by 40% and 15% over the conventional tech-
niques II and III. As compared to the baseline design having no error-detection capability,
the area overhead is only 27%. Note that the error-detection technique can substantially
improve performance and energy efficiency over the baseline design which is plagued by an
excessive amount of margin across ranges of variations.
In addition to reducing the area overhead, the proposed technique can significantly reduce the
error rate since many of the potential timing violations (i.e., delay surpluses induced by
variations) can disappear as signals propagate through non-critical paths across multiple stages. Fig.
Figure 2.12: Timing violation rate comparison for different N.
2.12 shows the timing violation rates of the conventional error detection technique III and
the proposed technique IV. The conventional flop-based design II is excluded because late-arriving
data cannot be propagated correctly to the next stage without considering correction schemes.
The timing violation rates are simulated by running the pipeline circuits with 300 random
vectors at a fixed TCLK under 10°C of temperature variation. In the proposed design, a large
fraction of the delay increases is absorbed via cycle-borrowing before they cause timing errors in
the detection stage. In contrast, the conventional design exhibits a large
timing violation rate of up to 37%, since any delay increase in a stage contributes to timing errors.
A smaller timing violation rate is critical for reducing the energy and throughput penalty
associated with correction processes.
2.3 Challenges of Conventional Error Detecting Registers
in ULV Regime and Proposed Error Detecting Latch
2.3.1 Conventional Error-Detecting Flip-Flop and Latch
Figure 2.13: (a) The conventional double-sampling method suffers from false error detection
due to clk-to-q delay mismatch between main and shadow elements at low voltage. (b) The
3σ clk-to-q delay mismatch over 100k Monte Carlo simulations with random process variation
at 0.35V is 1.8 FO4 delays, which causes a false error rate of 28%.
One of the common methods to detect timing errors in error-detecting flip-flop (EDFF)
and error-detecting latch (EDL) circuits is double-sampling [1, 2, 12–14, 17]. Fig. 2.13(a)
shows the schematic of a conventional double-sampling based EDFF. The data input (D) is
sampled by the main positive edge-triggered flip-flop and also by the transparent-high shadow
latch. During the clock high phase (i.e. detection window), the output of the shadow latch
(Qshadow) directly shows the data input while the output of the main flip-flop (Qmain) is the
input captured at the rising clock edge. A discrepancy between Qshadow and Qmain indicates
that the input arrived after the rising clock edge, i.e., a timing violation.
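The double-sampling principle can be captured by a small behavioral model. The Python sketch below is an idealized abstraction under stated assumptions: it ignores setup time, metastability, and any clk-to-q mismatch between the two storage elements, and the function and parameter names are hypothetical.

```python
def double_sampling_error(arrival_fo4: float, window_fo4: float,
                          d_old: int, d_new: int) -> bool:
    """ERROR = Qmain XOR Qshadow at the end of the detection window.

    arrival_fo4 is the time at which D switches from d_old to d_new,
    measured from the rising clock edge (negative means before the edge).
    """
    # Main flip-flop: captures whatever D is at the rising edge (t = 0).
    q_main = d_new if arrival_fo4 <= 0 else d_old
    # Shadow latch: transparent during the clock-high phase, so it follows
    # D until the detection window closes.
    q_shadow = d_new if arrival_fo4 <= window_fo4 else d_old
    return q_main != q_shadow

print(double_sampling_error(-2.0, 20.0, 0, 1))  # data arrives before the edge
print(double_sampling_error(5.0, 20.0, 0, 1))   # data arrives inside the window
```

Note that in this idealized model a transition arriving after the window closes is missed entirely, which is why the detection window must cover the worst-case delay variation.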
The double-sampling technique becomes unreliable at ULV operation. Particularly, the
clk-to-q delay mismatch between the main flip-flop and the shadow latch can become large.
As shown in Fig. 2.13(b), although the data arrives well before the clock rising edge (i.e.
error-free operation), the delay difference between Qmain and Qshadow can cause a glitch in the
ERROR signal, leading to a false error detection. We perform 100k Monte Carlo simulations
with process variation at 0.35V and find that the 3σ clk-to-q delay difference is 1.8 FO4
delays, which causes a false error detection rate of 28%.
This large false error rate can be masked in the conventional EDAC techniques having
multi-cycle correction schemes, as the error signals are sampled by other registers at the
next clock edge. In the proposed within-a-cycle correction scheme (Section 2.4.2), however,
the error signals cannot be sampled by other registers due to the stringent timing constraints.
A false error can therefore unnecessarily trigger a correction process, wasting energy
and throughput. While other EDFF circuits without double-sampling have been proposed
[11–13,15], those circuits rely on non-static gates and largely-skewed transistor sizes, which
can become unreliable in the ULV regime.
Figure 2.14: (a) The schematic and (b) the operational waveform of the proposed EDL. It
uses the side-channel error detection method to avoid the clk-to-q mismatch problem. It is
also optimized to extend voltage scalability down to 0.3V.
2.3.2 Proposed Voltage Scalable Error Detecting Latch Circuits
In order to apply EDAC techniques in ULV designs, EDFF and EDL circuits need to
be voltage-scalable and, in particular, free of false error detection. For this purpose, as shown in
Fig. 2.14(a), we propose voltage-scalable EDL circuits which use only the shadow latch for
error detection. The main and shadow latches are transparent high and low, respectively.
The shadow latch uses the side-channel timing error detection method adapted from [16]
and is designed such that no glitches can be generated at the error detection node, through a
domino-circuit-like mechanism having precharge and evaluation phases.
More specifically, Fig. 2.14(b) shows the operation of the EDL. During the clock-low
phase, the virtual nodes, VVDD and VVSS, in the shadow latch are pre-charged high and
pre-discharged low via P1 and N2, respectively. During this phase, the tri-state inverter
at the input of the shadow latch is active and the node DN is the inversion of the data
input (D). When the clock becomes high, the devices P1 and N2 are turned off and the
state of the node DN is maintained through the back-to-back inverters in the shadow latch.
In the absence of errors (i.e., the input D does not change during the clock-high phase), the
potentials of VVDD and VVSS remain high and low, respectively. When D=1 and
DN=0, for example, the transistor N1 and the feedback inverter keep VVSS low.
The node VVDD is floating, but the VGS of P2 becomes negative, significantly cutting
leakage and helping maintain the potential of VVDD. However, when an error happens,
i.e., the input D changes during the clock-high phase, either VVDD becomes low or
VVSS becomes high, which sets the ERROR signal high through simple detection
logic. With this error-detecting mechanism, the states of VVDD/VVSS have no glitch during
error-free operation regardless of the state of the new data, which is also confirmed via 100k
Monte Carlo simulations with process variations at 0.3V.
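The precharge/evaluate behavior of the virtual rails can be summarized with a phase-level abstraction. The Python sketch below is not a circuit model: the class name is hypothetical, and the mapping of a falling D to VVDD and a rising D to VVSS is an assumption inferred from the description of P2 and N1 rather than something stated explicitly in the text.

```python
class EdlShadowModel:
    """Phase-level abstraction of the shadow latch's virtual rails."""

    def __init__(self):
        self.vvdd = 1    # pre-charged high
        self.vvss = 0    # pre-discharged low
        self.held_d = 0  # value of D tracked during the clock-low phase

    def clock_low(self, d: int) -> None:
        # Precharge phase: both rails are refreshed and the latch tracks D.
        self.vvdd, self.vvss = 1, 0
        self.held_d = d

    def clock_high(self, d: int) -> None:
        # Evaluation phase: a change of D relative to the held value disturbs
        # one rail, and the rail stays disturbed until the next precharge.
        if d != self.held_d:
            if d == 0:
                self.vvdd = 0  # assumed: a 1 -> 0 transition discharges VVDD
            else:
                self.vvss = 1  # assumed: a 0 -> 1 transition charges VVSS

    @property
    def error(self) -> bool:
        return self.vvdd == 0 or self.vvss == 1
```

Because a disturbed rail is only restored by the next precharge, a late transition is flagged even if D later returns to its original value, which mirrors the domino-like behavior described above.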
Unlike the prior design for super-VTH operation [16], we augment the devices
N1 and P2 with the devices N3 and P3. In the prior design, the node VVDD is discharged
through the PMOS (P2) and the node VVSS is charged through the NMOS (N1). In ULV
operation, however, the threshold-voltage drop and the resultant noise-margin and delay
penalties are intolerable. A downside is that the added devices N3 and P3 can contribute
leakage current, which can affect the floating virtual rails (i.e., VVDD or VVSS) and cause
false error detection. In particular, the devices N3 and P3 do not experience the negative
VGS that the devices P2 and N1 do (N3 and P3 have VGS=0). In order to minimize the
leakage, we selectively use higher-VTH transistors for N3 and P3, at the cost of about 30% longer
detection delay (i.e., delay from D to DETECT). Note that the VVDD and VVSS nodes
are refreshed every cycle, which relaxes the leakage reduction requirement. Again, we perform
100k Monte-Carlo simulations for the proposed EDL circuits and find that the circuits cause
no leakage-induced false error detection down to VDD=0.3V for a worst-case FCLK above
10 kHz.
Table 2.2: Comparisons of the conventional latch and the proposed EDL circuits at 0.35V.
Table 2.2 compares the proposed EDL with the regular latch circuits
at 0.35V. The D-Q delay is increased by 14% due to the extra capacitance from driving the
shadow latch. The CLK-Q delay is increased by 25% because the internal clock buffers of the main
latch are shared with the shadow latch, which reduces the energy overhead from extra clock loading
at the cost of increased internal clock delay. Although the overhead of an individual EDL is
considerable, the proposed sparse error detection strategy amortizes the overall overhead
by confining EDL insertion to a single pipeline stage.
2.4 Challenges of Error Correction Techniques in ULV
Design and Proposed Error Correction Technique
2.4.1 Conventional Error Correction Scheme
Upon detection, EDAC techniques can correct the errors. Instruction replay is one of the
common methods [11–13,15–17]. It flushes the pipeline and replays the instructions that just
caused errors at a safer clock frequency (e.g., half the original FCLK). The instruction-replay-based
correction scheme, however, requires architecture modification and consumes up to
28 cycles per correction, which can severely degrade throughput and energy efficiency [13].
The authors in [13] have proposed a correction method based on multiple-issue instruction
replay. In this method, upon detecting errors, the instruction that just caused errors is
issued multiple times at the original FCLK. As it requires no change in FCLK, it incurs a
15-cycle penalty per correction. In the existing latch-based EDAC technique [2], a correction
scheme based on local stalling has been proposed. When an error is detected, the pipeline takes
one extra cycle during which it sends clock gating signals to the subsequent stages.
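To give a feel for how the per-correction penalty affects throughput, the Python sketch below uses a simple first-order model in which every error costs a fixed number of extra cycles. The model, the function name, and the 1% error rate are assumptions made for illustration; only the penalty values (28, 15, and 1 cycles) come from the discussion above.

```python
def relative_throughput(error_rate: float, penalty_cycles: float) -> float:
    """Throughput relative to error-free operation at the same FCLK."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    # Each cycle costs, on average, 1 + error_rate * penalty cycles.
    return 1.0 / (1.0 + error_rate * penalty_cycles)

# Penalties quoted in the text; the 1% error rate is illustrative only.
schemes = [("replay at lower FCLK", 28),
           ("multiple-issue replay", 15),
           ("local stalling", 1)]
for name, penalty in schemes:
    print(f"{name}: {relative_throughput(0.01, penalty):.3f}")
```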
2.4.2 Proposed Non-Stall Error Correction
In the conventional EDAC techniques, the process to detect and correct an error can take
multiple clock cycles, undermining throughput and energy efficiency [11–13,15–17]. Furthermore,
the detection and correction process needs to be controlled across modules connected
together for multiple clock cycles. This requires architectural modifications and the consequent
hardware overhead [11–13,15–17].
Figure 2.15: We propose a non-stall error correction scheme which utilizes local and temporal
VDD boosting. When a timing error is flagged in the detection stage and while the late-arriving
signals still propagate via cycle borrowing, the VDD,local-control block changes the supply
voltage (VDD,local) of the next stage (correction stage) to a higher VDD (VDDH) to accelerate
signal propagation. The latches in the correction stage are not boosted to avoid accidental
state loss. Level converters are bypassed in the absence of errors.
In order to reduce the overhead associated with the detection and correction process, we
propose a highly localized technique, called the non-stall error correction method. It can
detect and correct an error within a single cycle, obviating the need to control detection and
correction processes across pipelines and cycles. Fig. 2.15 shows the block diagram of the
proposed correction technique and Fig. 2.16(a) shows the waveforms during correction. As
shown in Fig. 2.15, the ERROR signals from the EDLs in the detection stage are collected
Figure 2.16: (a) Operational waveform of the non-stall correction technique. The VDD,local is
boosted during the negative clock phase for isolating the correction stage from the next stage.
The supply voltage headers are sized to meet the 1-FO4 boosting slew. (b) Boosting the
voltage at ULV can allow sufficient speed-up to correct error without stalling pipelines. For
example, boosting from 0.4V to 0.55V can give 4× speed-up which is sufficient to produce
error-free results in less than a half cycle.
via an OR tree by the VDD,local Control Block. This block then controls the supply voltage
of the following correction stage (VDD,local). While the timing error is being detected, the
late-arriving correct data still propagate to the correction stage via cycle borrowing. VDD,local
is then boosted from the nominal voltage (VDD) to the boosting voltage (VDDH) by switching the
header devices. The headers are sized to enable sufficient switching speed (i.e., a switching slew of
1-FO4 delay). The boosting is only performed during the negative phase of the clock to
isolate the correction stage from other adjacent stages. Also, the latches in the correction
stage are not boosted to avoid accidental state loss. Level converters are instead inserted
after the latches, and they can be bypassed in the absence of errors to avoid the delay
overhead.
In this project we set VDDH to 0.55V to allow a speed-up of 4× in the correction stage
(i.e., the parts denoted as 3-1 and 3-2 in Fig. 2.15) relative to the pipeline operating at the
nominal VDD of 0.4V (Fig. 2.16(b)). The required speed-up is estimated as follows. In the
targeted pipeline having a stage length of 50 FO4 delays, the correction scheme needs to
detect the worst-case timing violation (i.e., signals arriving at the EDL input 25 FO4 delays
after the rising clock edge) and produce the error-free result before the following rising clock
edge. Under this worst-case scenario, the time budget for boosting VDD and re-computing
the results is 25 FO4 delays. As shown in Fig. 2.16(a), it takes 7.5 FO4 delays (=3.5 +
2.7 + 1.3) to boost VDD. If we can achieve 4× shorter circuit delay via the boosting,
re-computation takes 12.5 FO4 delays (=50/4). The sum of those two delays is 20 FO4 delays,
which is smaller than the aforementioned time budget of 25 FO4 delays. For pipelines
having shorter TCLKs, we can increase VDDH for a larger speed-up to a certain extent,
but have to pay a higher energy cost per correction.
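The budget arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, and the minimum speed-up computed at the end is a derived illustration under the same assumptions (re-computation time equals the stage length divided by the speed-up), not a figure reported in this work.

```python
def correction_budget(tclk_fo4, speedup, boost_delay_fo4, worst_arrival_fo4):
    """Return (time_needed, time_available, fits), all in FO4 delays."""
    available = tclk_fo4 - worst_arrival_fo4  # time left until the next rising edge
    recompute = tclk_fo4 / speedup            # as in the text: 50/4 = 12.5
    needed = boost_delay_fo4 + recompute
    return needed, available, needed <= available

def min_speedup(tclk_fo4, boost_delay_fo4, worst_arrival_fo4):
    """Smallest speed-up that still meets the time budget."""
    return tclk_fo4 / (tclk_fo4 - worst_arrival_fo4 - boost_delay_fo4)

boost = 3.5 + 2.7 + 1.3  # boosting delay components from Fig. 2.16(a)
needed, available, fits = correction_budget(50.0, 4.0, boost, 25.0)
print(needed, available, fits)
print(round(min_speedup(50.0, boost, 25.0), 2))
```

Under these assumptions, any speed-up above roughly 2.9× would meet the 25-FO4 budget, so the 4× target leaves some slack.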
2.5 R-Processor Design and Implementation
As shown in Fig. 2.17, the proposed EDAC techniques are applied to the design of a 5-stage,
16-bit microprocessor, which we call R-Processor. We first replace flip-flops with two-phase
latches using industrial retiming tools (see Section 2.2.4.1 for details). R-Processor is retimed
with a TCLK of 50 FO4 delays, where each latch stage is 25 FO4 long. As shown in Fig. 2.6,
for VDD=0.4V, NOPT is found to be 15 (equivalent to about 7 flip-flop stages) for the
per-latch-stage length of 25 FO4 delays. Therefore, we replace the positive-phase latches only in
Figure 2.17: R-Processor: a 16-bit 5-stage microprocessor employing the proposed EDAC
techniques. The memory blocks (DMEM, IMEM, and RF) are also pipelined with 2-phase
latches to continue cycle-borrowing across the entire pipeline.
the ID stage with the proposed EDLs, resulting in a very low EDL replacement rate of 13%
(57 out of 445). The proposed non-stall correction technique is applied in the EX stage. The EX
stage is chosen to avoid boosting memory circuits (DMEM, IMEM, and RF). Timing
errors in other stages (e.g., IF, EX, MEM, and WB) can disappear when cycle-borrowed into
subsequent non-critical paths, or be carried on to the end of the ID stage for detection and be
corrected in the EX stage. Therefore, the seemingly localized detection can actually cover
the entire pipeline of the R-Processor.
In order to allow cycle borrowing to continue across memory structures, all the memory
blocks (DMEM, IMEM, and RF) are also pipelined with 2-phase latches. For example, in
DMEM, the negative latches are inserted between address decoders and arrays of bitcells.
For read operation, the first half clock cycle (positive clock phase) is used to decode the
address and the next half clock cycle (negative clock phase) is used for accessing bitcells. In
order to guarantee sufficient writing time even under the worst-case delay increase, writing to
arrays is delayed by half a clock cycle. Note that this delay does not interfere with the next
instruction that might access memory since the next instruction still decodes the address
during the first half cycle.
The use of 2-phase latches in memory blocks allows timing errors occurring in them,
as in other pipeline stages of the R-Processor, to either disappear when propagating through
the subsequent non-critical paths or be eventually detected at the end of the ID stage and
corrected in the EX stage. To reliably explore the proposed EDAC techniques at ULV
operation, we use a memory cell based on a regular 12-transistor (12T) latch with a
single-ended, static readout path. A single VDD as low as 0.26V is used for both pipeline
and memories. The use of more compact bit-cells is under investigation.
Fig. 2.18 shows the die photograph of the R-Processor and the baseline processor with
flip-flop sequencing and no EDAC technique, both of which are fabricated in a 65nm CMOS
technology. The absolute layout area overhead of the R-Processor is 1.8% compared to the baseline
chip. When considering the utilization statistics from the automatic placement and routing
(APR) tool, the area overhead is 8.3%.
Figure 2.18: Die photograph of (a) the R-Processor, (b) the baseline processor.
2.6 Measurement Results
We measure and compare the baseline processor having no adaptive techniques and the
R-Processor having the proposed EDAC and DVFS combination. The operating condition
of the baseline is set to guarantee correct computation across all PVT conditions. As
the baseline is forced to operate with a single VDD/FCLK pair, the chosen pair carries a large
amount of margin. On the other hand, the R-Processor can dynamically choose the optimal
VDD/FCLK pair for the given PVT condition. In this process, the EDAC technique informs
the DVFS controller to ensure that the R-Processor operates at its point of first failure (PoFF).
Note that the existing EDAC works have demonstrated aggressive scaling of VDD/FCLK beyond
the PoFF for additional improvement [1, 2, 11, 14–16], whereas the R-Processor limits such
scaling to maintain a sufficient error detection window. The savings by operating beyond
the PoFF are also small, as the R-Processor already operates at a very low VDD (Fig. 2.21).
Figure 2.19: The FCLK,max of the baseline processor is measured based on the worst-case
PVT condition. First, the FCLK,max at the worst-case voltage and temperature condition
(10% VDD drop and -20°C) is measured over 10 chips. Then, in order to account for process
variation we find the 6σ worst-case FCLK,max out of the 10 chip measurements, which is used
for the margined FCLK,max of the baseline processor. If considering the variation across wafers
and lots, the worst-case FCLK,max can be even worse than our estimation.
Based on the conventional worst-case design practice we determine the VDD and FCLK of
the baseline processor. As shown in Fig. 2.19, at each VDD, the maximum FCLK for correct
operation is measured for 10 dies under the worst-case voltage and temperature condition.
We assume 10% VDD drop and the temperature of -20°C as the worst-case condition. To
account for the worst-case process variation, we use the -6σ value of the measured maximum
FCLK for the margined FCLK. We did not use the worst-case measured performance since
the number of the samples (i.e., 10 dies) is too small to represent the worst-case process
variation.
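The margining step can be sketched in Python as below. The die measurements are hypothetical placeholders rather than measured data, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the estimator is not specified here.

```python
import statistics

def margined_fclk(fclk_max_mhz, k_sigma=6.0):
    """Mean minus k_sigma standard deviations of the per-die FCLK,max."""
    mu = statistics.mean(fclk_max_mhz)
    sigma = statistics.stdev(fclk_max_mhz)  # sample std; an assumption
    return mu - k_sigma * sigma

# Hypothetical FCLK,max readings (MHz) for 10 dies at one VDD, taken at the
# worst-case voltage and temperature condition.
dies = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96]
print(round(margined_fclk(dies), 1))
```

With only 10 samples, the resulting value lies well below the slowest measured die, which is the intent of using the -6σ point rather than the worst measured sample.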
Figure 2.20: The R-Processor achieves energy efficiency and performance improvement over
the baseline design. (1) The R-Processor can scale VOPT by 140mV as compared to the
baseline, where the R-Processor consumes 42% smaller energy per cycle at FCLK=60MHz;
(2) At the same performance (80MHz that the baseline achieves at its VOPT), the R-Processor
exhibits 38% lower energy consumption; (3) At the same energy consumption, the R-Processor
is estimated to be 2.3× faster than the baseline.
Fig. 2.20 shows the measured energy per cycle of a typical chip of the baseline processor
operating at the margined FCLK across VDDs at 25°C. The Emin is found to be 5.29pJ when
operating at FCLK=80MHz and VDD=0.57V. Fig. 2.20 also shows the energy consumption
per cycle of a typical chip of the R-Processor. Across VDDs, FCLK is selected at the point of
the first failure (PoFF). The Emin for the R-Processor is measured to be 3.13pJ at FCLK=60MHz
and VDD=0.43V [(1) in Fig. 2.20]. The Emin of the R-Processor is 42% lower than that of the
baseline chip. At the same FCLK of 80MHz, the R-Processor can operate at 110mV lower VDD,
achieving 38% energy reduction [(2) in Fig. 2.20]. Also, the R-Processor can have 2.3×
higher throughput (FCLK=180MHz) while consuming the same energy as the baseline
[(3) in Fig. 2.20].
Figure 2.21: The energy savings and error rates of R-Processor. The R-Processor can use
110mV lower VDD and consume 38% less energy when it reaches the point of the first failure
(PoFF), i.e., detecting and correcting the first error.
Fig. 2.21 shows the measured energy consumption and error rate of a typical chip of the
R-Processor across VDDs at FCLK=80MHz and 25°C. As the VDD is lowered, it reaches the
PoFF at 0.46V, and the error rate increases sharply beyond the PoFF. We can further scale
the VDD down to 0.45V to achieve more energy savings (41% reduction over the baseline).
Table 2.3: The summary of the R-Processor and the baseline processor chips in the typical
PVT corner. Utilization is defined as total area divided by gate area.
However, doing so reduces the size of the detection window. Also, the energy savings from
operating the R-Processor beyond the PoFF are small since the R-Processor is already in the
very-low-voltage regime.
Table 2.3 summarizes the measurement results of the R-Processor and the baseline processor.
Table 2.4: Summary of baseline processor and R-Processor at the slow, typical, and fast
corner.
Table 2.4 summarizes the measured energy consumption of the baseline and the R-
Processor running at the PoFFs at three different process and temperature corners. Energy
savings increase to 51% in the fast corner (fast process, 70°C) since the amount of wasted
margins of the baseline processor becomes larger. In the slow corner (slow process, -20°C),
on the other hand, the energy savings decrease to 33% as compared to 38% in the typical
corner (typical process, 25°C).
The R-Processor can automatically tune the VDD to the PoFF using the error statistics
from EDLs. The R-Processor has an on-chip counter that counts the number of timing
errors. The number of errors (particularly their LSBs) is sent to an off-chip DVS system
consisting of a programmable DC power supply and an NI LabView system. When the DVS
system observes new errors, it marginally increases VDD. Conversely, if it observes no errors
for a predefined period, it reduces VDD to save energy.
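The control policy described above can be sketched as a small Python class. This is an illustrative abstraction only: the class name, the step size, the quiet period, and the voltage limits are hypothetical values, and the actual system is implemented with a programmable supply and LabView rather than this code.

```python
class DvsController:
    """Raise VDD when new errors appear; lower it after a quiet period."""

    def __init__(self, vdd_mv, step_mv=5, quiet_period=100,
                 vdd_min_mv=260, vdd_max_mv=600):
        self.vdd_mv = vdd_mv
        self.step_mv = step_mv
        self.quiet_period = quiet_period
        self.vdd_min_mv = vdd_min_mv
        self.vdd_max_mv = vdd_max_mv
        self.last_count = 0
        self.quiet = 0

    def update(self, error_count: int) -> int:
        """Process one reading of the on-chip error counter; return VDD in mV."""
        # Comparing for inequality also works when only the LSBs are sent
        # and the counter value wraps around.
        new_errors = error_count != self.last_count
        self.last_count = error_count
        if new_errors:
            self.vdd_mv = min(self.vdd_mv + self.step_mv, self.vdd_max_mv)
            self.quiet = 0
        else:
            self.quiet += 1
            if self.quiet >= self.quiet_period:
                self.vdd_mv = max(self.vdd_mv - self.step_mv, self.vdd_min_mv)
                self.quiet = 0
        return self.vdd_mv
```

Voltages are kept as integer millivolts to avoid floating-point drift over many steps; the asymmetry between an immediate increase and a delayed decrease keeps the operating point near the PoFF.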
Figure 2.22: Experimental results of the R-Processor running at 10MHz with an off-chip DVS
system while the ambient temperature varies from -20°C to 70°C. The R-Processor can operate
well down to the deep sub-threshold regime of 0.26V.
The R-Processor and the proposed EDAC technique can also reliably operate at VDDs
down to deep sub-threshold regime. In order to experiment the functionality, as shown in
Fig. 2.22, we modulate the ambient temperature as rapidly as possible (0.25oC/s) between
-20oC and 70oC. In this experiment, we use an FCLK of 10MHz and employ the off-chip
DVS system. The R-Processor can reliably detect and correct timing errors while the DVS
automatically tunes, based on the error statistics from the R-Processor, VDD from 0.26V to
0.35V such that the R-Processor can operate at its PoFFs over the temperature changes.
Finally, Table 2.5 compares this work with previous works.
Table 2.5: Comparison of R-Processor and previous EDAC works.
2.7 Summary
In this work, in order to remove the worst-case margins and extend the voltage scalability
under the extreme variability of ULV design, we propose an EDAC technique that consists of
sparse error detection, voltage-scalable error detection latch circuits, and local non-stall error
correction. We design and prototype the R-Processor using the proposed technique in 65nm
CMOS. At a typical PVT corner, R-Processor improves the minimum energy consumption
by 42% via 140mV additional voltage scaling over the baseline processor margined for the
worst-case condition. At the same throughput of the baseline processor which operates at
its minimum energy point, the R-Processor achieves 38% energy efficiency improvement. At
the same energy consumption, the R-Processor achieves 2.3× throughput improvement. The




Waveform Sorter based on Body
Swapping Error Correction
3.1 Motivation
In the ongoing quest to enable energy-efficient cognitive computing, parallel and
non-instruction architectures implemented in near/sub-VTH circuits emerge as promising
candidates [30–32]. However, the complex and parallel nature of such architectures, combined
with the large delay variability of near/sub-VTH circuits, imposes a prohibitive timing margin
on the cycle time (TCLK), limiting the achievable energy efficiency and throughput.
In-situ error detection and correction (EDAC), combined with dynamic voltage and
frequency scaling (DVFS), can operate the chip at the point of first failure (PoFF). This can
eliminate the margins for static and slow variations (e.g. process and temperature) and fast
variations (e.g. VDD drop) [2, 11, 17, 43–45].
Figure 3.1: Sorter architecture with the proposed EDAC technique.
Table 3.1: List of registers that require roll-back for replay correction.
However, the existing approaches [2, 11, 17, 43, 44], as they often target super-VTH
in-order microprocessors, may not be well-suited for parallel non-instruction architectures in
near/sub-VTH circuits. The existing approaches often use the program counter to replay
instructions for correction [11]. The targeted architectures, however, do not have a
program counter and have distributed memory mixed with logic. Thus, in order to use
replay correction, such architectures must have additional memory to store past architectural
states for roll-back. The proposed sorter (Fig. 3.1), for example, needs to duplicate 80.2%
of the distributed registers to store a single past architectural state. We estimate that this can
cause >28.8% area overhead (Table 3.1). Refs. [17, 43, 44] propose EDAC techniques for a SIMD
architecture, a NoC router, and a register file. However, all of them rely on replay correction
that needs either program counters or additional roll-back memory. Ref. [2] proposes non-replay
correction based on local clock-gating. However, the area overhead of the technique is
non-negligible (up to 87% in [2]). Also, an error and its correction process can spread across
the entire architecture, which can hurt throughput and energy efficiency, particularly in
parallel architectures.
Here, we propose a new EDAC design that can correct errors without replay and is thus
more suitable for the targeted non-instruction architectures in near/sub-VTH circuits. We
propose three techniques: (1) body swapping correction that eliminates the need for replay
correction, (2) a fully-static error-detecting (ED) latch, and (3) area-efficient 2-phase latch
ED pipelines. Using these techniques, we design an unsupervised waveform sorter based on a
spiking neural network (SNN) for brain computer interface (BCI) microsystems (Fig. 3.1).
At VDD=0.45V, the hardware can detect and correct timing errors without stopping any of the
parallel pipelines, eliminating timing margins for process, voltage, and temperature (PVT)
variations. This enables 49.3% higher energy efficiency and 35.6% higher throughput than
the baseline margined for the worst-case variation. It requires no additional VDD and incurs
an area overhead of only 4.1%.
3.2 Proposed Techniques and Sorter Implementation
Figure 3.2: Sorting results.
We propose three techniques and apply them to an unsupervised waveform sorter. The
sorter architecture is based on [32]. It takes spike waveform inputs, trains itself based on
spike-timing dependent plasticity rules, and performs clustering (Fig. 3.2). High-VTH devices
are used for low leakage.
Figure 3.3: Previous VDD boosting correction.
Figure 3.4: Proposed body swapping correction.
The first proposed technique is body swapping correction, which requires no replay and
incurs very low overhead. Our work in Section 2.4.2 [45] proposed local VDD boosting
for correcting errors without replay. However, as shown in Fig. 3.3, it needs bulky level
conversion and bypass circuits (LC/bypass) and an additional supply voltage (VDDH). The
newly proposed technique requires only a small circuit called a body controller (BC) (Figs.
3.4 and 3.6).
Figure 3.5: Waveforms of body swapping correction.
In this technique, if the data arrives late at the ED latches (i.e., a timing error occurs), it
still enters the correction stage via cycle borrowing, and the BC swaps the bodies of the NMOSs
and PMOSs (NB and PB) of that stage. This induces forward body bias and accelerates the
computation to prepare the error-free results before the next rising clock edge (Fig. 3.5).
The BC (Fig. 3.6) is sized so that the delay from error detection to NB/PB swapping
is sufficiently short; it is measured to be <3% of TCLK (Fig. 3.7). We used the
self-oscillating test mode (Fig. 3.6) for this measurement. This body swapping can provide more
speed-up than required (2.2×) for correcting the worst-case timing violation (Figs. 3.5 and
3.8). The area overhead for isolating the bodies of the correction stage is minimal (Fig. 3.9)
since the deep-nwell boundary is within the power ring. We insert well taps every 15µm.
The total area overhead of the BC and the well isolation is only 1.2% in the sorter design. It
requires no additional VDD.
Figure 3.6: Body controller schematics with a test circuitry.
Figure 3.7: Measured delay of body swapping control.
Figure 3.8: Circuit delay reduction via body swapping.
Figure 3.9: Correction stage layout.
Figure 3.10: Schematics of the proposed fully-static transparent high ED latch.
Next, we propose fully-static ED latch circuits that are more robust than the existing
semi-static design with floating detection channels presented in Section 2.3.2 [45] (Fig. 3.10).
The proposed latch can also avoid the clk-to-q delay mismatch problem discussed in Section
2.3.1 [45] since it compares the data (S) stored in the shadow latch (which operates in the
opposite phase to the main latch) with the incoming data (D) instead of the data stored in the
main latch (Q). The impact of the extra loading on D is small as the ED latches are inserted
only in the output neurons based on the sparse detection scheme [45]. The proposed ED latch
passes 100k Monte-Carlo simulations with process variations and also reliably operates at
VDD as low as 0.3V.
Finally, we optimize the 2-phase ED latch based pipelines for low overhead. We apply the
sparse error detection scheme proposed in Section 2.2.3 to minimize the number of inserted
ED latches [45]. The sparseness (NOPT) is found to be 8 latch stages at VDD=0.35V and
TCLK=110 FO4 delays. While the architecture has various data flow paths across the training,
synapse-updating, and clustering phases, we find that implementing the output neurons as
our detection and correction stage allows all the data flow paths to reach ED latches while
traveling <NOPT latch stages. Errors are handled independently in each output neuron. In
order to reduce the overhead of 2-phase latch sequencing itself, which can cause up to 13-21%
area overhead over flip-flop (FF) sequencing as discussed in Section 2.2.2 [2, 45], we remove
the local clock buffers in the latches and distribute the clock with a merged (centralized)
buffer via a 1-level clock tree [46]. The area overhead of the 2-phase latch pipelines is 2.6%
and that of the 126 inserted ED latches is 0.3%. Together with the 1.2% for the BC and the
well isolation, this accounts for the total area overhead of 4.1%.
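The sparse-detection condition (every data flow path reaches an ED latch within fewer than NOPT latch stages) can be checked mechanically on a latch-level graph. The sketch below is only an illustration: the graph, the latch names, and the helper function are hypothetical and do not describe the actual sorter netlist. It assumes that every cycle in the graph passes through an ED latch.

```python
from functools import lru_cache


def max_stages_to_ed(graph, ed_latches):
    """Largest number of latch stages any path travels before reaching an ED latch.

    graph maps each latch to the latches it drives. Every loop is assumed to
    pass through an ED latch; otherwise the recursion would not terminate.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        if node in ed_latches:
            return 0
        successors = graph.get(node, ())
        if not successors:
            raise ValueError(f"path ending at {node!r} never reaches an ED latch")
        return 1 + max(depth(s) for s in successors)

    return max(depth(node) for node in graph)


# Hypothetical toy pipeline: inputs feed synapse latches, an accumulator,
# and finally an output-neuron stage that holds the ED latches.
toy_graph = {
    "in": ["syn0", "syn1"],
    "syn0": ["acc"],
    "syn1": ["acc"],
    "acc": ["out_neuron"],
    "out_neuron": ["syn0"],  # feedback path, e.g. for synapse updates
}
N_OPT = 8  # sparseness reported in the text
assert max_stages_to_ed(toy_graph, {"out_neuron"}) < N_OPT
```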
3.3 Measurement Results
Table 3.2: Measured improvement summary.
Test chips are fabricated in 65nm (Fig. 3.11). The baseline has no adaptive techniques
and thus needs margin for the worst-case PVT variation (defined as the slowest among 10
dies, -20°C, and -10% VDD drop) even when operating at the typical condition (typical die,
25°C, and no VDD drop). The proposed design, on the other hand, can operate without
margins at the PoFF. The baseline minimum energy dissipation (EOPT) is 132nJ/clustering
at the energy-optimal VDD (VOPT) of 0.525V and FCLK of 2.36MHz (Fig. 3.12). The proposed
design achieves EOPT of 69.1nJ/clustering (49.3% smaller) and FCLK=3MHz (35.6% better)
at VOPT=0.450V (75mV lower). At the same FCLK at which the baseline works at its EOPT,
the proposed design achieves 47.6% energy savings at 100mV lower VDD. At the VOPT of
the baseline, the proposed design achieves 2.6× higher throughput with 42.1% less energy
dissipation. We summarize the energy savings at the slow, typical, and fast corners at the same
FCLK of the baseline at its EOPT (Table 3.2). The error statistics measurement (Fig. 3.14)
using the test circuitry (Fig. 3.13) shows that handling errors independently in each output
neuron requires 4.6× less error handling than the conventional replay case (i.e.
counter Comb.), where a single error requires replaying all output neurons. As compared
to the work in Chapter 2 [45], the proposed EDAC technique is well-suited to parallel
and non-instruction architectures, with a minimal area overhead of 4.1% and without any
additional VDD (Table 3.3).
Figure 3.11: Die photo.
Figure 3.12: Measured energy and throughput improvement.
Figure 3.13: Test circuitry for error statistic measurement.
Figure 3.14: Error rate reduction via independent error handling.





System based on Error Regulation
4.1 Motivation
To create Internet-of-Things devices, near/sub-threshold circuits and adaptive techniques
such as in-situ error detection and correction (EDAC) and dynamic voltage scaling (DVS) [11,
47] can enable highly energy-efficient computing while ensuring robustness against process,
voltage, and temperature (PVT) variations. This approach requires energy-efficient voltage
conversion for producing a variable near/sub-threshold load supply voltage (VDD). Existing
works typically use a voltage regulation scheme which regulates VDD to VREF (Fig. 4.1).
Figure 4.1: Conventional voltage based regulation.
Figure 4.2: The conventional EDAC-DVS technique requires a variable VREF generator,
which consumes a non-negligible amount of energy (e.g., 1µW). With this estimation we
project the PCE to degrade by 4%.
Figure 4.3: The conventional EDAC-DVS control loop has a considerable amount of latency
in translating error information to VREF. This latency forces EDAC to correct errors for a
longer period before VDD is adjusted. For example, a 40µs latency is estimated to cause an 8%
energy loss.
However, this causes several challenges in energy efficiency. First, it requires a variable VREF
generator (e.g. the DAC in [11]), which consumes a significant amount of power (e.g. the
ultra-low-power 8-bit DAC MAX5510 consumes 6µA at 25kHz with 1.8V). Even with an optimistic
1µW consumption, we project the power conversion efficiency (PCE) to drop by 4% (Fig.
4.2). Second, we need to translate the error rate (from EDAC) to the optimal VREF, which
can cause latency in variation tracking (e.g. 55µs in [11]), forcing EDAC to handle more
errors and thus degrading energy efficiency. A latency of 40µs is estimated to cause an 8% energy
loss if a fast variation occurs every 1ms (Fig. 4.3). Finally, the analog/mixed-signal circuits
used in the conventional control loop can limit the input voltage (VIN). However, a PM
system with low-VIN support can substantially improve the system-level PCE since it can
take advantage of energy sources with sub-1V outputs such as harvesters and capacitors
(e.g. ∼0.6V for PV cells at MPPT).
Figure 4.4: Proposed timing-error regulation.
In this work, we demonstrate a power management and microprocessor (PM/µP) sys-
tem which consists of (1) µP employing near/sub-threshold EDAC; (2) 63-ratio integrated
switched-capacitor DC-DC converter (SCDC); and (3) fully-digital EDAC-SCDC controller.
The system directly regulates the timing error of EDAC (Fig. 4.4): the controller receives
error events from EDAC and adaptively produces the settings (ratio and clock) of the SCDC.
We compare the proposed system to (1) Baseline-1 (SCDC, VIN=1V, fixed VDD with mar-
gins), (2) Baseline-2 (SCDC, VIN=1V, VDD regulated by the voltage regulation and EDAC),
and (3) Baseline-ideal (no SCDC, optimal VDD across PVT variations). Our proposed sys-
tem, for VIN=0.6-1V, achieves 37-45% and 10-20% higher energy efficiency than Baseline-1
and Baseline-2, respectively. As compared to Baseline-ideal, our system exhibits 16-32%
worse energy-efficiency. The area overhead to embed EDAC in the µP is 3.2%. The size of
the controller is 2.3% of the µP area.
4.2 PM/µP Implementation Details
Figure 4.5: The proposed EDAC-SCDC controller has a fast loop which responds to a single
error event and starts a new SCDC phase in the following rising CLK edge. This loop
quickly raises VDD to VDD,max and minimizes the time during which the EDAC needs to
handle errors.
Figure 4.6: When errors continue to occur, the slow loop of the proposed EDAC-SCDC controller
modulates the target VDD levels (VDD,max and VDD,min) with a latency of one CLK cycle to
regulate the average error rate to TER (bottom).
The EDAC-SCDC controller employs two loops for direct error regulation: a fast loop that
responds to a single error event and a slow loop that regulates the average error rate to a
target error rate (TER). When no error occurs, the controller inverts the SCDC clock (CLKSC)
every NCLK cycles of the µP clock (CLK) (Fig. 4.5). At an error event, it inverts CLKSC
immediately at the following rising edge of CLK, making the SCDC start a new phase. This
replenishes charges on the VDD node and quickly raises VDD to VDD,max. This minimizes the
time during which EDAC needs to handle errors. If errors continue to occur, the slow loop
increases the target VDD levels (VDD,max and VDD,min) to meet TER (Fig. 4.6). Specifically, as
soon as the error count (counterror) reaches the threshold (targeterror), the SCDC starts a new
phase with the increased ratio at the following rising edge of CLK. This shortens the latency of
changing the target VDD levels to < 1 CLK cycle. If the error rate is < TER for a pre-defined
amount of time, the controller reduces the ratio. Using these two loops, the controller can
ensure correct operation at high energy efficiency.
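A behavioral sketch of the two loops may help to see how they interact. It is not the fabricated controller: the class and attribute names, the initial ratio, and the value of NCLK are hypothetical, and the windowing used for the ratio decrease (an error-free-enough window of targetdown CLK cycles) is our assumption about how "a pre-defined amount of time" is counted. The values targeterror=30 and targetdown=30000 are the ones reported in Section 4.3.

```python
from dataclasses import dataclass


@dataclass
class EdacScdcController:
    ratio: int = 30           # SCDC ratio setting, 1..63 (initial value is hypothetical)
    n_clk: int = 8            # CLK cycles per CLKSC toggle when error-free (hypothetical)
    target_error: int = 30    # error-count threshold for raising the ratio
    target_down: int = 30000  # window length (CLK cycles) for lowering the ratio
    clk_sc: int = 0
    _since_toggle: int = 0
    _count_error: int = 0
    _window: int = 0

    def on_clk_rising(self, error: bool) -> None:
        """Update the controller at a rising CLK edge."""
        self._since_toggle += 1
        self._window += 1
        toggle = self._since_toggle >= self.n_clk  # regular SCDC switching
        if error:
            toggle = True                          # fast loop: start a new phase now
            self._count_error += 1
            if self._count_error >= self.target_error:
                # Slow loop: too many errors, raise the target VDD levels.
                self.ratio = min(self.ratio + 1, 63)
                self._count_error = 0
                self._window = 0
        if self._window >= self.target_down:
            # Slow loop: error rate stayed below TER for the window, lower VDD.
            self.ratio = max(self.ratio - 1, 1)
            self._count_error = 0
            self._window = 0
        if toggle:
            self.clk_sc ^= 1
            self._since_toggle = 0
```

With these thresholds the regulated error rate is targeterror/targetdown = 30/30000 = 0.1%, which matches the TER setting used in the measurements.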
Figure 4.7: Schematic and operating modes of the 6-stage 63-ratio SCDC based on the
recursive topology [3]. To support low VIN, transmission gates in intermediate switches [3,4]
need to be upsized, which causes leakage-incurred PCE degradation. Thus, we avoid using
transmission gates. We also employ the technique to recycle bottom-plate charges using the
switches Rp and Rn, improving PCE by 2-3%.
We design the 6-stage 63-ratio SCDC based on the reconfigurable multi-ratio topology
for supporting a wide range of output voltages at high PCE [3, 4]. As compared to LDOs
[48] and buck converters [49], this topology provides a good trade-off between on-chip
integrability and PCE. Fig. 4.7 shows the schematics of our converter using the recursive
topology [3]. Unlike Refs. [3,4], we avoid using transmission gates in intermediate switches.
These transmission gates can have low VGS (especially with sub-1V VIN) and require
significant upsizing, which causes leakage-incurred PCE degradation. Rp and Rn recycle the
bottom-plate charge during phase change to improve PCE by 2-3% at little area overhead.
We employ the near/sub-threshold EDAC technique presented in Chapter 3 [47] in the
µP. It requires circuit-level modification only in the execution stage, which performs both
detection and body swapping based correction without stalling pipelines or replaying
instructions. As compared to the error warning [48] and the replica-based [49] techniques, the
EDAC techniques [11,47] can have the smallest margin.
The proposed PM/µP system is fabricated in 65nm (Fig. 4.8). A total capacitance of 0.33nF
is needed to support the maximum µP load current (130µA at 9.6MHz in a fast corner),
while the test chip has a total of 2.5nF for other experiments (Fig. 4.9). Vertically stacked MIM
and MOS capacitors are used. For VIN=1V and VOUT=0.4-0.5V, the SCDC achieves a PCE
of 73-80% (Fig. 4.10). For VIN=0.6V, the PCE is 81-87%. It produces linear VOUT with a
resolution of 16mV at VIN=1V (9mV at VIN=0.6V) (Fig. 4.11).
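The reported VOUT resolution can be related to the ratio count with a small calculation. The sketch assumes that the 6-stage recursive topology provides the ideal no-load ratios k/2^6 for k = 1..63; this ratio set is our assumption based on the recursive topology of [3], and loading effects are ignored. The function name is hypothetical.

```python
def ideal_vout_levels(vin, stages=6):
    """Ideal no-load output levels, assuming ratios k / 2**stages for k = 1..2**stages - 1."""
    steps = 2 ** stages
    return [vin * k / steps for k in range(1, steps)]


for vin in (1.0, 0.6):
    levels = ideal_vout_levels(vin)
    step_mv = (levels[1] - levels[0]) * 1e3
    print(f"VIN={vin}V: {len(levels)} levels, step = {step_mv:.2f} mV")
```

Under this assumption the step is VIN/64, i.e. about 15.6mV at VIN=1V and about 9.4mV at VIN=0.6V, which is in line with the measured 16mV and 9mV.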
Figure 4.8: Test-chip die photo. The SCDC is sized to supply up to 1mA for other
experiments. An area estimate is shown for the case where the SCDC is sized for the maximum µP load.
Figure 4.9: Measured PCE of SCDC across different load currents.
Figure 4.10: Measured PCE of SCDC across ratios.
Figure 4.11: Measured VOUT of SCDC across ratios.
4.3 Measurement Results
We measure the transient behavior when executing programs having different power con-
sumption (Fig. 4.12). With the controller disabled (SCDC has fixed ratio and frequency),
Figure 4.12: Measured transient behavior while executing programs having different power
consumptions. With our EDAC-SCDC controller disabled, we observe an 84mV VDD drop
and program failure. With the controller enabled, the fast loop reduces the VDD drop
to 15mV and the slow loop raises the target VDD levels to meet the TER. We observe no
program failure.
we observe 84mV (18.3%) VDD drop, which causes the programs to fail. With the controller
enabled, the fast loop reduces the VDD drop to 15mV (3.3%) and the slow loop raises the
target VDD levels to meet the TER (set to 0.1% with targeterror=30 and targetdown=30000).
We measure the system energy efficiency of a typical die at TER=0.1%, VIN=0.6-1V, and
25°C and compare it to Baseline-1 (the worst-case margin for 20 dies, -20°C, and -10% VDD
drop), Baseline-2, and Baseline-ideal (Fig. 4.13). Compared to Baseline-1, the proposed
system achieves overall 37-45% energy savings (Fig. 4.14) as we can remove the worst-case
margin for PVT variations and SCDC ripples (Fig. 4.15). Our proposed design can operate
Figure 4.13: Energy efficiency comparisons.
Table 4.1: Energy savings as compared to Baseline-1 across slow, typical, and fast corners.
Figure 4.14: Energy breakdown and energy savings.
Figure 4.15: As compared to Baseline-1, the proposed PM/µP system achieves 37-45%
savings as it needs little margin for PVT variations and SCDC output ripple.
at VDD,min=0.444V, 96mV less than the VDD of Baseline-1 (85mV by removing the worst-
case margin and 11mV by allowing the error rate of 0.1%). Compared to Baseline-2, the
Table 4.2: Comparisons to the recent designs.
proposed system achieves overall 10-20% energy savings by using the direct error regulation
scheme, which does not require VREF generation and has a control latency of < one CLK cycle.
The proposed system with VIN=0.6V achieves an additional 8-10% higher energy efficiency
than the baselines since the SCDC becomes more efficient in conversion. The energy savings
against Baseline-1 in the slow, typical, and fast corner chips are summarized in Table 4.1. The
proposed system is compared with previous works in Table 4.2.
Chapter 5
A 30.1µm², < ±1.1°C 3σ-Error,
0.4-1.0V Digital Standard-Cell
Compatible Temperature Sensor for
On-Chip Dense Thermal Monitoring
5.1 Motivation
The design of on-chip temperature sensors is critical for dynamic thermal management (DTM)
in high-performance microprocessors and Systems-on-Chips (SoC). A DTM technique [23,24]
typically embeds multiple temperature sensors on a chip and uses the provided temperature
information to monitor and control the thermal behavior of the system for high-performance
yet reliable (i.e. reliable against electromigration, time dependent dielectric breakdown, and
negative bias temperature instability) operation. Small and accurate temperature sensor
design is desired since the distance between deployed sensors and hot spots, together with
the sensors' circuit-level accuracy, directly relates to the performance of DTM [23, 24]. Existing
sensors achieve impressive area or accuracy [50–60]. However, emerging technology trends
toward multicore architectures, 3D-IC, and ultra-dynamic-voltage-scaling (UDVS) make
sensor design even more demanding, with the following requirements.
First, ultra-compact sensors are required to monitor the increasing number of thermal
hot spots and to improve flexibility in placing them in the optimal locations. The number
of thermal hot spots has increased with the higher level of transistor integration. This has led
modern high-performance microprocessors to embed tens of temperature sensors (e.g., 48
in [25–27]). The emerging technology trends toward multicore architectures and 3D-IC add
even more hot spots due to the thermal coupling between cores and 3D layers
[23]. To be able to monitor this increasing number of hot spots with low hardware overhead,
the sensor footprint needs to be extremely small [23, 24]. Furthermore, the hot spots are often
only identified in the later stages of the design phase; thus it is highly desirable to make
sensors small so that they can be easily inserted or moved around.
Second, while minimizing the sensor size, the sensors need to maintain high accuracy
to maximize the performance of DTM techniques. Overestimating the temperature of the
system can cause unnecessary throttling and thus degrade performance; on the other
hand, underestimating it can cause reliability concerns. To achieve high performance under
the reliability constraint, a high-accuracy temperature sensor is needed. In addition, such
high accuracy should be achieved with low calibration cost (e.g. one temperature point
calibration [OPC]).
Lastly, better voltage scalability is required for the compatibility with UDVS systems
[61,62]. In applications that require both high performance and low power operation, UDVS
systems are desirable. UDVS systems can provide peak performance when the workload
is heavy by operating at nominal supply voltage (VDD). They can achieve low power by
scaling the VDD down to near threshold voltage when the workload is moderate or low. For
the sensors to be employed without extra voltage distribution or local regulation in such
systems, they need to operate across a wide range of VDD.
Figure 5.1: Area, error, and VDD,min comparisons of recent compact thermal sensors.
Existing sensors achieve small area or high accuracy [50–60]; however, their areas and
accuracies typically pose a trade-off (Fig. 5.1). Also, previous sensor designs have limited
voltage scalability, making it difficult to use them at sub-1V supply voltages. BJT based
sensors achieve the highest accuracy (e.g. 0.15°C 3σ-error); however, they consume a
non-negligible silicon area (e.g. 20,000µm² per front-end) and have limited voltage scalability
(e.g. minimum VDD > 1V) [50–52]. On the other hand, CMOS based sensors achieve a
smaller footprint, but their accuracy is typically lower [53–60, 63]. In [59], the CMOS
based sensor achieves a 279µm² footprint (among the smallest) and voltage scalability
down to 0.6V with an acceptable¹ 3σ-error of +3.4°C/-3.2°C after OPC. However, to meet
the emerging demands, a better sensor is still desired that can achieve smaller area yet more
accurate temperature sensing with better voltage scalability.
In this work, we propose a temperature sensor that meets the aforementioned
requirements. The sensor directly samples the threshold voltage (VTH) of a single
sensing PMOS device and uses its temperature dependency for temperature sensing. Since
the sensor uses only one transistor for sensing, the sensor area is greatly reduced. Also, the
single transistor sensing mitigates the complexities from transistor mismatches (note that
previous designs often require multiple transistors, and matching their strengths is critical
for accuracy). We design and prototype sensor front-ends together with a readout circuitry
in 65nm CMOS. The sensor front-ends are designed to allow us to reconfigure the size of
sensors. The measured SS16 (sensor-size-16) configuration has a 30.1µm² footprint
and achieves 1.1°C 3σ-error after OPC. The proposed sensor also achieves near-constant
accuracy across VDD from 0.4V to 1V with voltage-specific temperature coefficients (TC).
The proposed sensor is 9× smaller than the previous smallest sensor [59] while achieving 3×
higher accuracy (Fig. 5.1). The sensor also demonstrates voltage scalability down
to 0.4V, which is 0.2V lower than the previous lowest-voltage design [59].
¹ < 8°C error, according to the typical requirement outlined in [54]
Additionally, we examine the robustness of our sensor operation while it is embedded
in digital circuits. Embedding sensors inside digital blocks raises the concern of coupling noise
from nearby gates that are actively switching. We lay out our proposed sensor in a digital
standard-cell format and place and route it together with a multiplier. Then, we simulate
the parasitic-extracted netlists of the sensor and multiplier. The results show that it is
feasible to mitigate the impact of coupling noise from digital gates with design efforts such
as shielding, larger sampling capacitors, and post-measurement data processing (averaging).
This chapter is organized as follows. In Section 5.2, we discuss the operating principle
of the proposed sensor and the design methodology to optimize accuracy. In Section 5.3,
we discuss the test chip design including the on-chip readout circuitry using the dual-slope
analog-to-digital converter (DSADC) topology. We then discuss the measurement results
of the test chip in Section 5.4. In Section 5.5, the experiment with the proposed sensor in
digital standard-cell format embedded in multipliers is described and techniques to mitigate
the effect of coupling noise are presented. Finally, we conclude the chapter in Section 5.6.
Figure 5.2: Schematic and operation of the proposed sensor front-end that directly samples
VTH.
Figure 5.3: VTH over temperature across process variations.
5.2 Proposed Temperature Sensor Design
5.2.1 Operating Principle
The proposed sensor directly samples the VTH of sensing PMOS device P1 (Fig. 5.2). VTH
is well-known to have a strong and well-defined linear relationship with temperature and can
be formulated as:
\[
V_{TH}(T) = V_{TH}(T_{room}) + K_{VTH} \cdot (T - T_{room}) \tag{5.1}
\]
where T is temperature, Troom is 300K, and KVTH is the first-order TC of VTH [64]. This
is also confirmed with our SPICE simulation results showing a high linearity of R²>0.9999
and a strong temperature coefficient (KVTH) of -1.12mV/°C across process variation (Fig.
5.3). The manufacturing process variation mostly modulates the offset of the VTH curve
with near-constant KVTH and therefore is well-suited for low-cost OPC.
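To illustrate why a near-constant KVTH makes one-point calibration sufficient, the sketch below converts a sampled voltage into temperature using a single calibration point. It is a simplified model that assumes the readout follows Eq. 5.1 exactly with the simulated KVTH; the calibration voltage (0.450V) and the function names are hypothetical, and the actual sensor uses voltage-specific TCs.

```python
K_VTH = -1.12e-3  # V/°C, simulated first-order TC of VTH quoted in the text


def calibrate_one_point(v_cal, t_cal_c, k=K_VTH):
    """Return a voltage-to-temperature converter anchored at one measured point.

    Process variation is assumed to shift only the offset of the line,
    so a single (v_cal, t_cal_c) pair fixes the whole characteristic.
    """
    def to_temperature(v):
        return t_cal_c + (v - v_cal) / k

    return to_temperature


# Hypothetical die: the sensor reads 0.450 V at the 25°C calibration point.
to_temp = calibrate_one_point(v_cal=0.450, t_cal_c=25.0)
print(to_temp(0.450 - 0.0112))  # a reading 11.2 mV lower corresponds to a warmer die
```

Because the slope is shared across dies, only the offset needs to be measured per die, which is what keeps the calibration cost low.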
In order to capture the VTH of P1, we propose to use the discharging behavior of PMOS
devices, also known as the VTH drop. This can be done simply by pre-charging the source voltage
(VSENSOR) of P1 and then discharging it. Specifically, as shown in the waveform
of Fig. 5.2, we first use the shared pre-charging device P2 to pre-charge the shared sampling
capacitor (VSENSOR node) up to VDD. Once the node is fully charged, we turn off P2 and
turn on our sensing device P1 (at time=0 in Fig. 5.2). The P1 device starts to rapidly
discharge the VSENSOR node as it is initially in the strong-inversion region. At time=tweak, P1
enters the weak-inversion region, and the discharging rate of the VSENSOR node is largely reduced,
which is known as the VTH drop phenomenon. Finally, we sample the voltage of the VSENSOR
node at the optimal sampling time (tsample).
5.2.2 Optimal tsample
In the proposed sensor design, it is important to sample the VSENSOR node at the optimal
sampling time (tsample). This provides a range of benefits including:
• Good linearity of sampled VSENSOR value over temperature.
• Robustness against leakage current of P1.
• Robustness of the TC of sampled VSENSOR value against process variation.
• Robustness against pre-charged level (i.e. VDD) variation.
The optimal sampling time can be determined based on two constraints that set
its upper bound and lower bound. The upper bound is set by the leakage current of P1
perturbing the desired sampled VSENSOR value. Intuitively, if we sample too late, the leakage
current of P1 will move the VSENSOR value away from the desired VTH value of P1. In
such a case, the sampled VSENSOR value will be determined not only by the VTH of P1 but also,
with stronger weight, by the leakage current of P1. Since the leakage current
has an exponential relationship with the VTH of P1 (or temperature), the linearity of the sampled
VSENSOR value with temperature will deteriorate. The lower bound is set by sampling
time variation concerns. Ideally, we would want to sample as soon as the VTH drop happens
(as soon as P1 enters weak inversion). However, during that time frame, the discharging
rate of the VSENSOR node is relatively high and sampling time variation can largely degrade the
accuracy of the sensor.
Figure 5.4: (a) Linearity of the sampled VSENSOR value across tsamples. (b) Discharging rate
of the VSENSOR node voltage across tsample.
First, we use simulation results to find the optimal range of the sampling time. As expected,
the linearity of the sampled VSENSOR value rapidly degrades when sampled too late (Fig. 5.4(a)).
To maintain the linearity of R²>0.9999, we set the upper bound of tsample to 700µs. On the
other hand, the discharging rate exponentially increases with smaller tsample (Fig. 5.4(b)).
Simulation results show that a tsample larger than 1µs significantly reduces the discharging
rate to below 30µV/ns as P1 is in a weaker inversion region. These two bounds
set the optimal sampling time window to 1µs to 700µs after the P1 device is turned on. In
modern IC designs, this optimal tsample window can be easily met since the clock resolution
is much finer.
Next, we use an analytical approach to confirm the validity of our intuition and simulation
results. In order to understand the dependency of the sampled VSENSOR value on temperature
after P1 just enters weak inversion, we derive its equation:
\[
V_{SENSOR}(t_{sample}) = V_{TH} - \frac{I_{weak} \cdot (t_{sample} - t_{weak})}{C_{sample}} \tag{5.2}
\]
In Eq. 5.2, tsample which is the moment to sample VSENSOR node is more than 10×
larger than tweak which is the time when P1 enters weak inversion region (e.g. tweak=100ns,
tsample=1µs to 700µs in the optimal sampling time window). Therefore, tweak can be ignored.
Iweak, which is the sub-threshold leakage current of P1 when it just enters weak inversion
region can be formulated as:
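\[
\begin{aligned}
I_{weak} &\approx \mu_0 \left(\frac{T}{T_{room}}\right)^{-K_u} C_{OX}\,\frac{W}{L}\,(n-1)\left(\frac{KT}{q}\right)^{2} \exp\!\left(\frac{V_{GS}-V_{TH}(T)}{nV_T}\right) && \text{(a)}\\
&\approx \mu_0\, C_{OX}\,\frac{W}{L}\,(n-1)\left(\frac{K}{q}\right)^{2} T_{room}^{K_u}\, T^{K_0}\\
&\approx \mu_0\, C_{OX}\,\frac{W}{L}\,(n-1)\left(\frac{K}{q}\right)^{2} T_{room}^{2}\left(1+\frac{T-T_{room}}{T_{room}}\right)^{K_0} && \text{(b)}\\
&\approx \mu_0\, C_{OX}\,\frac{W}{L}\,(n-1)\left(\frac{K}{q}\right)^{2} T_{room}^{2}\left(1+K_0\,\frac{T-T_{room}}{T_{room}}\right)\\
&\approx \mu_0\, C_{OX}\,\frac{W}{L}\,(n-1)\left(\frac{K}{q}\right)^{2} T_{room}^{2}\left[(1-K_0)+\frac{K_0}{T_{room}}\,T\right]
\end{aligned}
\tag{5.3}
\]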
where Ku is the TC of the mobility (µ) and K0=-Ku+2. A key point in the derivation
is that VGS is close to VTH(T) and thus the exponential term in Eq. 5.3(a) becomes 1.
In addition, another high-order temperature-dependent term, (1+(T-Troom)/Troom) in Eq.
5.3(b), can be approximated by a linear function via the Taylor series since (T-Troom)/Troom
is much smaller than 1 for the temperature range of interest. For example, for the temperature
range of 0°C to 100°C, (T-Troom)/Troom ranges from -0.09 to 0.24. Therefore, as shown in
91
Eq. 5.3, Iweak also becomes a linear function of temperature. After plugging Eq. 5.3 and Eq.
5.1 to Eq. 5.2, the value of VSENSOR node sampled at tsample can be formulated as:








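\[
V_{SENSOR}(t_{sample}) \approx V_{TH}(T) - (A_{weak} + K_{weak} \cdot T) \cdot \frac{t_{sample}}{C_{sample}} \tag{5.4}
\]
where VTH(T) is given by Eq. 5.1, Aweak = C · (1−K0), Kweak = C · K0/Troom, and C = µ0 · COX · (W/L) · (n−1) · (K/q)² · Troom².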
The sampled VSENSOR value is a linear combination of two parameters, VTH and Iweak, both of which are linear in temperature, and thus is also linear in temperature. If the VSENSOR node is sampled after the optimal window, the assumption that VGS is close to VTH(T) used in deriving Eq. 5.3(a) becomes invalid and thus the exponential term cannot be eliminated. This makes the sampled VSENSOR value exhibit poor linearity, which matches our simulation results shown in Fig. 5.4(a).
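The size of the Taylor-series error behind this linearity argument can be checked numerically. The sketch below is only illustrative: it assumes Troom = 300K (which reproduces the -0.09 to 0.24 range quoted above) and a hypothetical mobility exponent Ku = 1.5, so that K0 = 0.5. Neither value is taken from the test chip.

```python
# Hypothetical values: T_ROOM and K_U are assumptions for illustration only.
T_ROOM = 300.0          # K, assumed room temperature
K_U = 1.5               # assumed mobility temperature exponent
K_0 = -K_U + 2          # K0 = -Ku + 2, as defined in the text

def ratio(temp_c):
    """Return x = (T - Troom) / Troom for a temperature given in Celsius."""
    return (temp_c + 273.15 - T_ROOM) / T_ROOM

def exact_term(temp_c):
    """(1 + x)^K0, the high-order term before linearization."""
    return (1.0 + ratio(temp_c)) ** K_0

def linear_term(temp_c):
    """First-order Taylor approximation 1 + K0 * x."""
    return 1.0 + K_0 * ratio(temp_c)

x_min, x_max = ratio(0.0), ratio(100.0)
worst_rel_err = max(
    abs(linear_term(t) - exact_term(t)) / exact_term(t) for t in range(0, 101, 10)
)
print(f"x ranges from {x_min:.4f} to {x_max:.4f}")
print(f"worst-case relative linearization error: {worst_rel_err:.2%}")
```

Under these assumptions the linearized term stays within about one percent of the exact term over 0°C to 100°C; a different Ku would change the error but not the method.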
Another important consideration in choosing the proper tsample value follows from the above analytical study. As shown in Eq. 5.4, the TC of the sampled VSENSOR value is formulated as KVTH-Kweaktsample/Csample. In simulation, we saw that KVTH is well maintained across process variation (Fig. 5.3). However, the capacitance of the sampling capacitor (Csample) can vary widely across process (e.g. MIM capacitors have a 3σ/µ variation of 15%). The Kweak value can also vary across process depending on P1 sizing (i.e. W, L). Therefore, it is critical to minimize the impact of Csample and Kweak variation, which can be achieved by using the smallest allowable tsample value. We use tsample=10µs, so that KVTH (-1.12mV/°C) is more than 50× larger than the Kweaktsample/Csample term.
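To see why a small tsample limits the effect of capacitor variation, the sketch below works through the numbers quoted above. It treats the 50× ratio at tsample=10µs as an upper bound on the Kweak·tsample/Csample term and assumes that this term scales linearly with tsample, as the TC expression implies. The function name is illustrative.

```python
K_VTH = -1.12           # mV/°C, TC of VTH from the text
RATIO_AT_10US = 50.0    # text: K_VTH is more than 50x the leakage term at 10 µs
CAP_VARIATION = 0.15    # 3-sigma/mean variation of the sampling capacitor

def tc_shift_from_cap(t_sample_us):
    """Upper-bound shift (mV/°C) of the sensor TC caused by a +15% Csample change.

    The leakage term Kweak*tsample/Csample is taken to be |K_VTH|/50 at 10 µs
    and to scale linearly with tsample (an assumption of this sketch).
    """
    leak_term = abs(K_VTH) / RATIO_AT_10US * (t_sample_us / 10.0)
    # A capacitor that is 15% larger shrinks the term by a factor 1/1.15.
    return leak_term * (1.0 - 1.0 / (1.0 + CAP_VARIATION))

for t_us in (10.0, 100.0):
    shift = tc_shift_from_cap(t_us)
    print(f"tsample={t_us:5.0f} µs: TC shift <= {shift:.4f} mV/°C "
          f"({shift / abs(K_VTH):.2%} of K_VTH)")
```

With these numbers the TC shift at 10µs is a fraction of a percent of KVTH, and it grows tenfold if tsample is extended to 100µs.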
5.2.3 Supply Voltage Noise
Figure 5.5: Impact of pre-charge level variations on accuracy.
The optimal sample time also makes the proposed sensor robust against VDD noise. VDD noise can modulate the pre-charge level. As the sensing device P1 turns on, the modulated pre-charge level changes the time it takes to generate a VTH drop, which is the time to enter the weak inversion region (i.e. tweak). However, as discussed for Eq. 5.2, the chosen tsample (10µs) is two orders of magnitude larger than tweak (100ns), so the impact of tweak variation on the accuracy is minimal. As shown in Fig. 5.5, the simulation results show that a pre-charge level variation of 100mV causes a negligible error increase of <0.02°C. For the same reason, VTH offset variation (i.e. VTH(Troom) in Eq. 5.1) also has a negligible impact on accuracy. The VTH(Troom) variation only affects the offset of the sampled VSENSOR value in Eq. 5.4 and can be calibrated out via OPC.
5.2.4 Sensor Device Type and Body Connection
Table 5.1: Comparison of proposed sensor with different device type.
The proposed sensor circuitry is explored with the different device types provided in a 65nm CMOS process. We simulate the accuracy by running 100 Monte-Carlo simulations with process variation and performing OPC. The simulation is done using 2.5V thick-oxide devices and 1V thin-oxide devices with different VTH characteristics (i.e. high-VTH, standard-VTH, and low-VTH). We choose the optimal sensor size and tsample value for each device type while sweeping the length over 1-10× of minimum, the width over 1-30× of minimum, and the tsample value from 1µs to 100µs. For all the device types, the sample capacitor (Csample) value is fixed to 1pF. The results are summarized in Table 5.1. All the device types achieve a 3σ-error of < 2.72°C, while the 2.5V thick-oxide device achieves the best 3σ-error of 0.93°C.
The sensor using the 2.5V thick-oxide device is simulated with different body connections (i.e. connected to VDD, ground, or VSENSOR) (Fig. 5.6). As shown in Table 5.2, the sensor with the body connected to VDD achieves the best accuracy. However, if VDD is susceptible to large noise depending on the user scenario, the body can be connected to VSENSOR, ground, or a separate clean bias voltage with < 0.22°C accuracy degradation.
Figure 5.6: Three possible body connections of the sensing device P1.
Table 5.2: Comparison of proposed sensor with different body connection.
5.3 Test Chip Details
The test chip is designed and fabricated in a 65nm general-purpose CMOS process. Fig. 5.7 shows the die photo of the test chip. The test chip consists of an 8×8 reconfigurable sensor front-end network array using 64 unit-size sensors (S1-S64), sample and hold circuits (S&H), and on-chip read-out circuitry using the DSADC topology (Fig. 5.8). Each unit-size sensor is a 3× minimum-sized 2.5V thick-oxide PMOS device with its body tied to VDD. We used 2.5V thick-oxide devices with the body tied to VDD since this configuration achieves the best accuracy, as discussed in Section 5.2.4. The reconfigurable sensor network can combine multiple unit-size sensors to form a larger sensor to experiment with varying sensor sizes. A capacitance of 1pF is used for Csample.
Figure 5.7: Die photo.
Figure 5.8: Test chip block diagram and its operational waveform.
5.3.1 Shared P2 and Csample
The pre-charge PMOS device (P2) is shared across sensors, and the sampling capacitor (Csample) is shared across sensors and the S&H, providing three benefits.
• Each sensor sees an identical load capacitance, which is the sum of Csample and the capacitance of all wires connecting Csample and the sensors. This makes the TC of the sampled VSENSOR value (i.e. KVTH-Kweaktsample/Csample) the same for all the sensors on the chip.
• The manufacturing variation of Csample has little impact on accuracy since the capacitor is shared across all sensors on the chip and its variation is calibrated out by OPC.
• Area is saved through sharing.
Since all sensors are tied together, the inactive sensors may affect a single sensor reading. However, the inactive sensors with a gate voltage of VDD (i.e. turned off) experience negative VGS and have negligible impact on the VSENSOR node and accuracy.
5.3.2 Operating Principle
The operational waveform of the test chip is shown in Fig. 5.8. During period t1, the VSENSOR node is pre-charged to VDD by P2. Then, during period t2 (which is our tsample), P2 is turned off and the selected sensor turns on and discharges the VSENSOR node. During this t1+t2 period, the S&H is in the sampling mode. Finally, during period t3, the S&H captures the VSENSOR value on VOUT and enters hold mode. The VOUT value, which is VCM (0.8V) + VSENSOR(tsample), is digitized by an off-chip ADC (16bit, ±5V) or by the on-chip DSADC.
5.3.3 On-Chip DSADC
The on-chip DSADC digitizes the VOUT value 32 times and stores them in the digital memory
(FIFO) (Fig. 5.8). The average of the 32 values is used for the temperature measurement.
The DSADC digitization process is as follows. First, ADCOUT resets to VCM for 1µs. The
DSADC counter also resets to zero. Second, ADCOUT is discharged for a fixed period of 1µs
at the rate of VSENSOR(tsample)/R1C2. Third, the DSADC counter starts and ADCOUT is
charged with a fixed rate of VCM/R1C2. In the course of charging, the comparator finds the
moment when the ADCOUT becomes larger than VCM and stops the counter. The digital
counter output (count), which is formulated as VSENSOR(tsample) × 1µs/VCM, represents the




5.4.1 Sensor Accuracy Measurement
The test chip is placed in a temperature chamber and measured while sweeping the temperature from 0°C to 100°C in 10°C steps. We measured the sensors across 10 dies using both the off-chip ADC and the on-chip DSADC. The sensor reading is calibrated with OPC at 50°C and the error is calculated. In all the measurements, t1 and t2 in Fig. 5.8 are set to 1µs and 10µs, respectively. Therefore, the sensor produces new samples at a rate of 91kS/s.
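The quoted rate follows from one conversion taking the 1µs pre-charge period plus the 10µs sampling period. The short sketch below assumes that the hold period adds no extra time per sample, which is what the 91kS/s figure implies; the variable names are illustrative.

```python
T_PRECHARGE = 1e-6   # s, pre-charge period
T_SAMPLE = 10e-6     # s, sampling period (tsample)

# One new sample is produced every pre-charge + sampling interval.
sample_rate = 1.0 / (T_PRECHARGE + T_SAMPLE)
print(f"{sample_rate / 1e3:.0f} kS/s")
```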
Figure 5.9: Accuracy and area across sensor sizes.
In order to study the impact of sensor area on accuracy, multiple unit-size sensors are combined and measured with the off-chip ADC at a VDD of 1V. As more unit-size sensors are combined to form a larger sensor, the accuracy improves (Fig. 5.9). When 16 unit-size sensors are combined, called Sensor-Size-16 or SS16, the sensor achieves a 3σ-error of 1.1°C after OPC. The footprint of SS16 is 30.1µm². The VOUTs of the 40 SS16 sensors after OPC are shown in Fig. 5.10(a), where the average TC is measured to be -1.27mV/°C. The VOUTs translated into error are shown in Fig. 5.10(b). We also perform two-temperature-point calibration (TPC) at 20°C and 80°C (Fig. 5.11). The TPC can further reduce the error to -0.4°C/+0.6°C at a higher calibration cost.
Figure 5.10: (a) Measured VOUTs of an SS16 after OPC at 50°C. (b) Errors across temperatures.
Figure 5.11: Measured error after TPC at 20°C and 80°C.
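The difference between the two calibration schemes can be sketched as follows. This is a minimal illustration, assuming that OPC fixes the offset at 50°C and uses the average TC for the slope, while TPC derives both slope and offset from the sensor itself. The sensor model (0.500V at 50°C, -1.30mV/°C) and all function names are hypothetical.

```python
AVG_TC = -1.27e-3  # V/°C, average TC measured after OPC (from the text)

def opc_temperature(v_out, v_cal, t_cal=50.0, tc=AVG_TC):
    """One-point calibration: offset measured at t_cal, slope fixed to tc."""
    return t_cal + (v_out - v_cal) / tc

def tpc_temperature(v_out, v_lo, v_hi, t_lo=20.0, t_hi=80.0):
    """Two-point calibration: both slope and offset come from this sensor."""
    slope = (v_hi - v_lo) / (t_hi - t_lo)
    return t_lo + (v_out - v_lo) / slope

def v_model(temp_c):
    """Hypothetical sensor whose own TC (-1.30 mV/°C) differs from the average."""
    return 0.500 + (-1.30e-3) * (temp_c - 50.0)

t_true = 100.0
t_opc = opc_temperature(v_model(t_true), v_model(50.0))
t_tpc = tpc_temperature(v_model(t_true), v_model(20.0), v_model(80.0))
print(f"OPC estimate: {t_opc:.2f} °C, TPC estimate: {t_tpc:.2f} °C")
```

For a perfectly linear sensor, TPC removes the slope mismatch entirely, whereas OPC leaves an error that grows with distance from the calibration point. The extra chamber temperature point is the calibration cost mentioned above.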
Figure 5.12: The worst-case error of SS16s across tsamples.
We also investigate the impact of tsample on accuracy (Fig. 5.12). As expected from the discussion in Section 5.2.2, the worst-case error (i.e. max.(+)error - max.(-)error) exhibits a bathtub curve, with an optimal tsample range of 1µs to 100µs achieving a worst-case error of <2°C. This range is smaller than the simulated optimal window but still sufficiently large.
5.4.2 Supply Voltage Scalability Measurement
We also measure the supply voltage scalability of the sensors (Fig. 5.13). The same measurement methodology described in Section 5.4.1 is applied across a VDD range of 0.4V to 1V in 0.1V steps for SS16. At each VDD, the voltage-specific TC is found and used for error calculation. The measurements of 20 instances across 5 chips show that the worst-case errors are nearly constant at around 1.8°C across VDDs.
Figure 5.13: The worst-case error across VDDs.
5.4.3 On-chip DSADC Measurement
We repeat the measurement in Section 5.4.1 using the on-chip DSADC (Fig. 5.14). The measurement across 5 chips shows that the worst-case error increases by 1.1°C compared to the measurement using the off-chip ADC. The increased error is mainly due to the resolution limitation (0.5°C) of the DSADC.
Figure 5.14: The worst-case error using on-chip DSADC.
Table 5.3: Comparison table with previous designs.
5.4.4 Comparison
Finally, as summarized in Table 5.3, the proposed sensor is compared to previous temperature sensor works. The proposed sensor front-end circuit has a footprint of 30.1µm² and a 3σ-error of <1.1°C across 40 instances in 10 dies. As can be seen in Fig. 5.1, the proposed sensor breaks the traditional area and accuracy trade-off. The proposed sensor achieves 9× smaller area and 3× higher accuracy than the previous smallest design [59]. The proposed sensor front-end also achieves voltage scalability down to 0.4V, which is 0.2V lower than the previous lowest-voltage design [59].
5.5 Digital Standard-Cell-Compatible Sensor Experiment
We need highly non-invasive sensors so that they can be placed very close to target hot spots in digital circuits. This requires a sensor that is in the scale and the format of digital standard cells. We first lay out the proposed SS16 design in digital standard-cell format, which consumes an area of 33.12µm² (Fig. 5.15). Then, a commercial place-and-route tool is used to place one sensor in the center of each multiplier circuit. We use four multiplier designs with input data widths of 8, 16, 32, and 64 bits, respectively, each of which is synthesized with standard cells using 1V thin-oxide standard-VTH devices.
We study the impact of coupling noise on sensor outputs. In the SPICE simulation with the parasitic-extracted netlists and VDD=1V, we monitor the sensor outputs (VSENSOR) while the multipliers are actively switching. To extract the inaccuracy incurred only by the digital circuits, we run two simulations, with and without switching activity, and take the difference between the two. We take 1000 samples across varying input vectors for 100 multiplier-clock (CLK) cycles.
Figure 5.15: Layout of 32-bit multiplier and embedded SS16 in the digital standard-cell format.
The worst-case coupling noise error increases with larger multiplier designs since the wire of the VSENSOR node becomes longer and is thus exposed to more digital circuits (Fig. 5.16(a)). One way to reduce the coupling noise is the well-known routing technique of shielding the sensitive signal with wires held at a fixed voltage (e.g. VDD or VSS). For example, in the 64-bit multiplier, shielding the VSENSOR node with wires that are tied to VSS reduces the worst-case error by about 2×.
Another technique is to use a larger sampling capacitor. This increases the capacitance of the victim wire relative to the capacitance of the aggressor wire and reduces the impact of coupling. As shown in Fig. 5.16(b), increasing the sampling capacitor size proportionally reduces the worst-case error. For example, in the 64-bit multiplier with the VSENSOR node shielded, using a 10× larger sampling capacitor (i.e. 10pF) reduces the worst-case error by roughly 9×, from 4.04°C with 1pF to 0.44°C.
Figure 5.16: (a) Worst-case coupling noise error across the VSENSOR wire length exposed. (b) Worst-case coupling noise error across the sampling capacitor size.
Figure 5.17: Coupling noise induced error and its reduction via averaging.
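The near-proportional benefit of a larger sampling capacitor is what a first-order capacitive-divider model predicts. The sketch below assumes the VSENSOR node behaves as an effectively floating victim during sampling and uses a hypothetical 5fF coupling capacitance; it is not derived from the extracted netlists.

```python
def coupled_noise(v_aggressor_swing, c_couple, c_victim):
    """Voltage disturbance on a floating victim node via a capacitive divider."""
    return v_aggressor_swing * c_couple / (c_couple + c_victim)

C_COUPLE = 5e-15  # F, hypothetical coupling capacitance to switching wires
V_SWING = 1.0     # V, aggressor swing at VDD = 1 V

noise_1pf = coupled_noise(V_SWING, C_COUPLE, 1e-12)
noise_10pf = coupled_noise(V_SWING, C_COUPLE, 10e-12)
reduction = noise_1pf / noise_10pf
print(f"reduction factor: {reduction:.2f}x")
```

As long as the coupling capacitance is much smaller than Csample, the model gives a reduction just under 10× for a 10× larger capacitor, in line with the simulated trend.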
Finally, the last technique we study is averaging multiple samples of the VSENSOR node. Fig. 5.17 shows the VSENSOR node voltage for 100 CLK cycles after tsample while the multiplier is computing random input vectors that change every CLK cycle. Multiple samples can be taken and stored in the local FIFO for averaging using the on-chip DSADC discussed in Section 5.3.3. For example, by averaging 10 samples, we can reduce the error by 2.6× compared to the worst case.
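As an idealized reference, averaging N samples of uncorrelated zero-mean noise reduces the standard deviation by √N (about 3.2× for N=10). The 2.6× reported above is measured against the worst case, and the real coupling noise need not be Gaussian or uncorrelated, so the Monte-Carlo sketch below only illustrates the ideal limit under those assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_AVG = 10        # samples per average, as in the text
N_TRIALS = 20000  # number of Monte-Carlo trials (arbitrary choice)

# Assumption: coupling noise is zero-mean Gaussian and uncorrelated between samples.
single = [random.gauss(0.0, 1.0) for _ in range(N_TRIALS)]
averaged = [
    statistics.fmean(random.gauss(0.0, 1.0) for _ in range(N_AVG))
    for _ in range(N_TRIALS)
]

reduction = statistics.pstdev(single) / statistics.pstdev(averaged)
print(f"std-dev reduction from averaging {N_AVG} samples: {reduction:.2f}x")
```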
5.6 Summary
In this work, we propose a temperature sensor that directly senses transistor VTH. The sensor achieves a compact footprint of 30.1µm², a 3σ-error of 1.1°C across 0 to 100°C after OPC, and voltage scalability down to 0.4V without losing much accuracy. This is 9× smaller area and 3× higher accuracy than the previous smallest design [59]. It also operates at 0.2V lower than the previous lowest-voltage design [59]. The compact footprint enables the proposed sensor front-end to be in the scale and the format of digital standard cells, which enables
aggressive sensor placement that is non-invasive and in proximity to the target hotspots. The








In VLSI systems, ultra-dynamic-voltage-scaling (UDVS) has been proposed to further extend the range of conventional dynamic voltage scaling [22]. UDVS can provide peak performance by operating at the nominal supply voltage (VDD), while it can also achieve extremely high energy efficiency by scaling VDD down to near or below the device threshold voltage (VTH) under average and low workloads. UDVS is applicable to a wide range of computing applications including data centers, personal computing, mobile electronics, and embedded computing systems, further improving performance and energy-efficiency limits.
In developing UDVS systems, one of the critical challenges is to mitigate the inflexibility of various circuit fabrics. Circuit fabrics such as pipeline structures, clock networks, and on-chip memory bitcells are often optimized for only a single VDD [40, 65]. Those circuit fabrics, however, can exhibit highly sub-optimal performance, energy efficiency, variability, and robustness when operating at different VDDs. Conventionally, designers have made compromises that favor operation at a specific VDD [7, 65–69].
One critical example of such inflexibility is the design of long (>mm) interconnects on a chip. In the conventional techniques, repeaters are inserted along wires at a certain interval, called the optimal interval of repeater insertion or Loptimal, to optimize total delay [70–73]. This Loptimal is, however, a strong function of VDD. In the high-VDD regime, Loptimal becomes smaller since the delay improvements from shorter wire segments are larger than the penalties incurred by inserting more repeaters. In contrast, in the near- and sub-threshold regime, Loptimal tends to be longer since the intrinsic delay of repeaters grows exponentially. The delay overhead of an additional repeater can therefore outweigh the delay improvement enabled by the shorter interconnect segments [65]. Our simulations show that Loptimal can vary by 6× across the range of VDDs from 1.0V to 0.35V. This widely varying Loptimal makes an interconnect design optimized for a specific VDD exhibit significantly lower performance and energy efficiency when operating at VDDs that it is not optimized for.
Figure 6.1: Interconnect designs using (a) the conventional repeaters and (b) the proposed
reconfigurable regenerators.
Reconfigurable circuits and architectures are a promising direction to mitigate the inflexibility of circuit fabrics in UDVS systems. Unfortunately, for repeater-based interconnect designs, it is not trivial to dynamically reconfigure the number of repeaters with minimal invasiveness, since repeaters are inserted in series with wires and the wire segments are physically disconnected by the repeaters (Fig. 6.1(a)). One naive solution for reconfigurability is to implement multiple interconnect lanes with different insertion intervals and have the UDVS system dynamically select the optimal lane based on the VDD currently used. This approach, however, can cause large area overhead as the number of lanes quickly increases with the number of VDD options in UDVS systems.
In this work, we instead focus on an alternative interconnect design technique based on regenerators for use in UDVS systems. Regenerators have been proposed to enable bi-directional signaling, often with better performance and energy efficiency than repeater-based interconnect designs, primarily for nominal super-threshold VDD operation [5, 74]. They sense signal transitions appearing on wires and, upon sensing one, rapidly source current to quickly complete the transition.
A regenerator differs from a repeater in that it has a single signal port serving as both input and output, which is connected to a wire in parallel without physically dividing the wire (see Fig. 6.1(b)). This parallel connection makes it possible to dynamically reconfigure the number of regenerators that contribute to signal transitions for different VDD options in UDVS systems. If a regenerator is disabled, it simply becomes a dangling capacitance with minimal energy and delay impact. Such reconfigurability is hard to achieve in repeater-based interconnect design.
In this paper, we therefore investigate a reconfigurable, regenerator-based interconnect design technique. We first optimize the existing regenerator circuits to enable dynamic reconfiguration and to improve functional robustness at VDDs from nominal down to the near- and sub-threshold regime. We then design and analyze interconnects based on these regenerators, which can dynamically reconfigure the number of active regenerators for different VDD operations. We compare the proposed design to three conventional repeater-based interconnect designs, optimized for operation at VDD = 0.35V, 0.7V, and 1.0V, respectively. In the case study of driving 10-mm long and 0.1µm wide wires in an industrial 65nm CMOS technology, SPICE simulations show that the proposed design achieves 2.1-3× improvement in delay and 1.4-6.3× improvement in energy efficiency across VDD=0.35-1V, as compared to the three conventional repeater-based interconnect designs. Similar gains are observed for non-minimum-width wires.
6.2 Challenges of Repeater-Based Interconnect Design for UDVS Systems
In this section, we analyze the challenges of the conventional repeater-based interconnect
design in the context of UDVS systems. Inverters are used as a repeater element throughout
this paper since they are considered to provide the best performance and energy-efficiency
[72].
6.2.1 Optimal Interval of Repeater Insertion
Figure 6.2: Simulation shows a 6× variation in Loptimal over VDD=1-0.35V. (R/C: the on-
resistance and gate capacitance of unit-size inverters; Rw/Cw: the resistance and capacitance
of unit-length wires; pinv: the ratio of diffusion and gate capacitance of unit-size inverters.)
One of the critical challenges in repeater-based interconnect design for UDVS systems is that Loptimal is a strong function of VDD. Fig. 6.2 shows (i) the analytical solution of Loptimal [72] and (ii) the simulated Loptimal across VDDs in an industrial 65nm CMOS. When VDD is scaled down to the near- and sub-threshold regimes, Loptimal rapidly increases since the on-resistance of the repeaters (R) increases exponentially while the capacitances of the repeaters (C and pinv) and the resistance and capacitance of the wires (Rw and Cw) remain relatively constant. Loptimal varies by up to 6× from nominal VDD (1.0V in this technology) to sub-threshold VDD (0.35V). At nominal VDD, Loptimal is smaller since the delay improvements from shorter wire segments are greater than the delay added by extra repeaters. In the ULV regime, however, Loptimal becomes larger since the delay of repeaters grows exponentially, favoring fewer repeater insertions.
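The VDD dependence can be illustrated with the textbook closed form for repeater spacing, Loptimal = sqrt(2RC(1+pinv)/(RwCw)). This expression is assumed here to correspond to the analytical solution plotted in Fig. 6.2, since it uses the same five parameters named in the caption. The numerical values below are placeholders, not values extracted from the 65nm process.

```python
import math

def l_optimal(r_inv, c_inv, p_inv, r_wire, c_wire):
    """Repeater spacing minimizing delay per unit length (assumed closed form).

    r_inv, c_inv:   on-resistance (ohm) and gate capacitance (F) of a unit inverter
    p_inv:          ratio of diffusion to gate capacitance of the unit inverter
    r_wire, c_wire: wire resistance (ohm/m) and capacitance (F/m) per unit length
    """
    return math.sqrt(2.0 * r_inv * c_inv * (1.0 + p_inv) / (r_wire * c_wire))

# Placeholder device and wire parameters (hypothetical).
params = dict(c_inv=1e-15, p_inv=1.0, r_wire=1e6, c_wire=2e-10)
l_nominal = l_optimal(r_inv=10e3, **params)
l_low_vdd = l_optimal(r_inv=36 * 10e3, **params)  # 36x larger on-resistance

ratio = l_low_vdd / l_nominal
print(f"Loptimal ratio: {ratio:.1f}x")
```

Because Loptimal scales with the square root of R in this model, a 6× change in Loptimal corresponds to roughly a 36× change in repeater on-resistance if the other parameters stay fixed.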
6.2.2 Repeater-based Interconnect Design
Table 6.1: Implementation details of Designs I, II, and III.
The large difference in Loptimal across VDDs can make repeater-based interconnect designs highly sub-optimal if the operating VDD deviates from the VDD used for optimization. To confirm this, we design three interconnects optimized at three VDDs: 1V (Design I), 0.7V (Design II), and 0.35V (Design III). In each design, Loptimal is first found by sweeping the number of inserted repeaters. At this point we do not need fully optimized repeater sizing since Loptimal at a given VDD is not a strong function of repeater sizing (see the equation in Fig. 6.2). Next, using the Loptimal just found, we search for the repeater size that achieves the best delay performance. The optimized designs are summarized in Table 6.1. In this experiment, we consider a third-layer wire whose length is 10mm and whose width is minimum-sized (i.e., 0.1µm). The wire is modeled with a distributed RC model with 1000 segments (i.e., each segment is 10µm long).
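The reason for using many segments can be seen from the Elmore delay of an RC ladder. The sketch below is a generic illustration, not the SPICE setup used in this work; the total wire resistance and capacitance are hypothetical.

```python
def elmore_delay(r_total, c_total, n_segments):
    """Elmore delay at the far end of a wire split into n equal RC segments."""
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    # Capacitor i (1-based) is charged through i series resistors.
    return sum(i * r_seg * c_seg for i in range(1, n_segments + 1))

# Hypothetical totals for a 10 mm wire; not extracted from the 65nm process.
R_TOTAL = 10e3   # ohm
C_TOTAL = 2e-12  # F

for n in (1, 10, 1000):
    normalized = elmore_delay(R_TOTAL, C_TOTAL, n) / (R_TOTAL * C_TOTAL)
    print(f"{n:5d} segments: delay = {normalized:.4f} x R*C")
```

A single lumped segment overestimates the delay by 2× relative to the distributed limit of RC/2, while 1000 segments bring the result to within 0.1% of that limit.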
Figure 6.3: No single repeater-based interconnect design can simultaneously achieve optimal delay, slew, and energy consumption across a wide range of VDDs. (a) At 0.35V, Design III outperforms Designs I and II. At 1V, however, Design I exhibits 2.8× shorter delay than Design III. (b) All the designs achieve acceptable slew rates at the VDDs that they are optimized for. Design III exhibits large slew at 1V. (c) The three designs consume similar amounts of energy since the total widths of the inserted repeaters are similar. Only Design III shows a large energy consumption at VDD=0.6-1V due to the short-circuit current induced by large slew.
The delay, slew, and energy consumption of Designs I, II, and III are simulated across VDD=0.35-1.0V. As shown in Fig. 6.3, Design I, which is optimized at 1.0V, achieves the best performance at VDD=1.0V. At VDD=0.35V, however, it exhibits 3× worse delay than Design III since the excessive number of repeaters in Design I significantly increases delay. Conversely, Design III, optimized at 0.35V, achieves the shortest delay among the three designs at VDD=0.35V while exhibiting 2.8× longer delay at VDD=1.0V than Design I. Design II achieves a balanced delay performance across VDDs, yet still exhibits 2.1× longer delay at 0.35V than Design III and 1.1× longer delay at 1.0V than Design I.
The slew rates of all three designs are less than two fan-out-of-4 (FO4) delays at the VDDs that those designs are optimized for. When the operating VDD deviates from the optimization VDD, however, some of the designs, particularly Design III, exhibit significantly degraded slew and energy consumption. As shown in Fig. 6.3(b), Design III exhibits a slew of more than 5 FO4 delays at VDD > 0.5V due to the less-than-ideal number of inserted repeaters. This large slew also degrades energy efficiency due to the increased short-circuit current. As shown in Fig. 6.3(c), Design III consumes 6.5× more energy than Designs I and II at 1V.
6.3 Optimized Regenerator Circuit Design
In this section, we introduce several conventional regenerator circuits and their challenges in
the context of UDVS operation. We then propose our optimized regenerator circuits based
on the self-timed regenerators (STR, [5]).
6.3.1 Self-Timed Regenerator (STR)
Several regenerator designs have been proposed, primarily targeting nominal VDD operation [5, 74]. In [74], the Booster was proposed as an alternative solution for driving long on-chip wires. The Booster has several advantages over repeaters. It can achieve shorter delays and can also allow bi-directional signaling with a single wire. The delay is less sensitive to variations in regenerator placement. Using more regenerators than the optimal number can still achieve near-optimal delay since the propagation delays of the extra Boosters are not added to the overall interconnect delay.
Figure 6.4: (a) The STR with original sizing [5] and (b) the optimized regenerator design.
In [5], another regenerator design called the STR was proposed to further improve performance and energy efficiency over the Booster (see Fig. 6.4(a)). The STR operates as follows. When it detects a transition on the wire (i.e., the node INTERCONNECT), it turns on PP4 or NN4, supplying current to accelerate the transition. The transition-high detection (NN1, NN2, PP1) and transition-low detection (PP2, PP3, NN3) circuits are highly skewed using both transistor sizing and multi-VTH transistors for fast signal transition detection. Specifically in [5], the three devices PP2, PP3, and NN3 are sized to have an effective PMOS-to-NMOS ratio of ∼46×. Also, the devices NN1, NN2, and PP1 are set to have a ratio of ∼16.3×. The main current-supplying devices (PP4 and NN4) are switched off after a certain amount of time, which is defined by the inverter chains. This self-timed operation avoids the situation in which the node INTERCONNECT is actively held at the previous state, thereby improving delay. The cross-coupled inverters (INV1 and INV2 in Fig. 6.4(a)) are added to hold the state at the node BB, which is critical for maintaining the correct inputs for the devices NN2 and PP2.
6.3.2 Robustness Challenges in the STR design
Since the original STR design targets only nominal and super-threshold VDD operation, its robustness can be compromised in UDVS systems when near- and sub-threshold VDD is used.
Figure 6.5: The required size of the writing devices (NN5 and PP5) rapidly increases under
the worst-case process and temperature corner.
The first robustness challenge of the conventional STR design comes from the cross-coupled inverters (INV1 and INV2) and the inverter-chain-based feedback path (NN5 and PP5 in Fig. 6.4(a)). The writing devices, NN5 and PP5, need to be sized up so that they can overwrite the state (i.e., the node BB) of the cross-coupled inverters even under the worst-case process, voltage, and temperature (PVT) variation. In UDVS systems, this demands very large NN5 and PP5 since the variations in the near- and sub-threshold regime grow significantly. This increases the area and energy overhead of designing the STR to operate reliably across a wide range of VDDs. As shown in Fig. 6.5, the process and temperature corner simulations show that 2.6× larger writing devices are needed at 0.35V than at 1V. Random process variations and other dynamic variations can demand even larger devices, significantly increasing area and energy consumption.
Figure 6.6: Leakage through PP4 and the strongly skewed devices NN1 and NN2 (Fig. 6.4(a)) can induce false transition detections at low VDDs. An example operation at 0.35V is shown.
Another robustness challenge is false transition detection induced by (i) the leakage of the current-supplying devices (PP4 and NN4) and (ii) the use of highly skewed circuits in the transition detection circuitry (i.e., devices PP1, NN1, NN2, PP2, PP3, and NN3). In the steady state, when the node INTERCONNECT is low, PP4 and NN4 are turned off, and therefore the node INTERCONNECT is floating. If process, voltage, and temperature (PVT) variations make PP4 leak more than NN4, the potential of the node INTERCONNECT can start to increase (Fig. 6.6(1)). Since the transition detection circuits are highly skewed, the increase can easily be interpreted as a signal transition, causing a false transition detection (Fig. 6.6(2)). This flips the node INTERCONNECT to the wrong high state (Fig. 6.6(3)). The STR can still return to the correct state (Fig. 6.6(4)) since the initial buffer (see Fig. 6.1) drives the node INTERCONNECT to the correct state. The glitch induced by a false transition detection, however, can increase delay, consume more energy, and even propagate wrong states to the receiving registers.
The use of highly skewed and multi-VTH circuits can significantly increase the probability of such false transitions in UDVS systems. While the optimal skew can be set based on noise margin at VDD=1V, the skew can increase greatly in the near- and sub-threshold regime since the on- and off-currents of low-VTH devices are orders of magnitude larger than those of mid-VTH devices in those VDD domains. This can create excessive skew in the detection circuits, resulting in a much higher rate of false transition detection.
6.3.3 Robustness and Reconfiguration
In order to improve robustness in the context of UDVS systems, we optimize the STR regenerator. First of all, we avoid the use of multi-VTH devices in the detection circuitry. In addition, a smaller amount of skew is introduced. The effective ratio of PMOS to NMOS sizes is ∼9.5× in P2, P3, and N3. The effective ratio of NMOS to PMOS sizes is ∼3.75× in N1, N2, and P1. This significantly improves the robustness. While the original STR design exhibits a false detection rate of 9%, the proposed design exhibits 0% when 1k Monte-Carlo simulations with all the process variations are run with a 1µs leaking period. The delay penalty from the reduced skew is only about 3%. In addition, in order to avoid over-sizing the writing devices NN5 and PP5, the proposed regenerator employs an SR-latch (SR1, SR2 in Fig. 6.4(b)). Since the SR-latch is free from the contention problem, it can be designed with nearly minimum-sized devices. The use of the SR-latch reduces area by about 12% at the same delay.
Dynamically enabling and disabling regenerators is critical to avoid unnecessary switching activity when Loptimal is large in the low-VDD regime. Therefore, in addition to the above optimizations for higher robustness, we also add two gates, NAND1 and NOR1, which enable and disable the regenerator under the control of the external signals EN and ENB. When EN is high (ENB is low), the regenerator is enabled. When EN is low, the regenerator is disabled by forcing the node OUT_b_1 high (and OUT_b_2 low). This disables the transition detectors. Only some of the gate capacitances in the transition detection circuitry (i.e., N1, P1, N3, P3) and the diffusion capacitance of the main driving devices (i.e., P4 and N4) are exposed to the wire. The additional gates for dynamic reconfiguration cause 5% area overhead.
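The enable behavior described above can be captured with a small truth-table model. In this sketch the two gated nodes are written OUT_b_1 and OUT_b_2, and detect_1 and detect_2 are hypothetical names for the internal signals feeding NAND1 and NOR1, since the text does not name those inputs.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return int(not (a and b))

def nor(a, b):
    """Two-input NOR on 0/1 values."""
    return int(not (a or b))

def gated_outputs(en, detect_1, detect_2):
    """Return (OUT_b_1, OUT_b_2) for enable EN and its complement ENB.

    detect_1 and detect_2 are hypothetical names for the second inputs of
    NAND1 and NOR1; the source does not name these signals.
    """
    enb = int(not en)
    out_b_1 = nand(en, detect_1)   # EN low   -> forced high
    out_b_2 = nor(enb, detect_2)   # ENB high -> forced low
    return out_b_1, out_b_2

for d1 in (0, 1):
    for d2 in (0, 1):
        print(d1, d2, gated_outputs(0, d1, d2), gated_outputs(1, d1, d2))
```

With EN low, both outputs are independent of the internal signals, which is what keeps the transition detectors off; with EN high, the gates reduce to inverters of their other input.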
Figure 6.7: Layout of the proposed regenerator design. The height is set as multiples of the
height of standard cells in this technology.
The proposed regenerator is sized and drawn for our experiment with a 10mm interconnect (Fig. 6.7). The total area is 61.56µm² in a 65nm CMOS. Note that the main driving devices (N4 and P4) use medium-VTH transistors for a fair comparison with the repeater-based interconnect designs, which also use medium-VTH devices.
6.4 Reconfigurable Regenerator-Based Interconnect Design for UDVS Systems
In this section, we propose a reconfigurable regenerator-based interconnect design technique
for UDVS systems (Fig. 6.1). When driving 10mm-long, minimum-width wires, the
proposed design significantly outperforms the conventional repeater-based design across a
wide range of VDDs.
6.4.1 Design Process of the Proposed Interconnects
To design the reconfigurable regenerator-based interconnect for UDVS systems,
we introduce a two-step design process.
Figure 6.8: The optimal number of regenerators is found to be 35 at 1V with 1mm, minimum-
width wires.
Step I: We sweep the size of the initial buffer, the size of the regenerator, and the number
of regenerators to find the combination that achieves the same delay as the repeater-based
design at 1.0V (66 FO4 delays). As shown in Fig. 6.8, the optimal number of regenerators
is found to be 35. It is possible to further improve delay at the cost of energy
efficiency, since the performance benefit of adding regenerators outweighs the capacitance
penalty beyond the optimal insertion count. The interconnect design with 90 regenerators,
for example, achieves 10% shorter delay but consumes 72% more energy per switching event.
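To make the trade-off in the 90-regenerator example concrete, the short calculation below combines the two quoted percentages into a single figure. The energy-delay product is not a metric used in the text; it is an assumption introduced here purely for illustration, and the function name relative_edp is hypothetical. The baseline is presumed to be the 35-regenerator design.

```python
def relative_edp(delay_change_pct: float, energy_change_pct: float) -> float:
    """Energy-delay product relative to a baseline with EDP = 1.0.

    Negative percentages mean a reduction, positive ones an increase.
    """
    delay_ratio = 1.0 + delay_change_pct / 100.0
    energy_ratio = 1.0 + energy_change_pct / 100.0
    return delay_ratio * energy_ratio


# Figures quoted in the text for the 90-regenerator design:
# 10% shorter delay, 72% more energy per switching event.
edp_90 = relative_edp(delay_change_pct=-10.0, energy_change_pct=72.0)
print(f"relative energy-delay product: {edp_90:.2f}x")
```

Under this assumed metric the 90-regenerator design is roughly 1.55 times worse than the baseline, which is consistent with choosing the smaller insertion count when energy efficiency matters.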
Step II: As with the conventional repeater-based interconnect design, the proposed
design has an optimal number of regenerators to enable at each VDD. Enabling
all regenerators achieves the shortest delay, but it can also incur a considerable
energy-efficiency penalty.
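The selection in Step II can be sketched as a search for the smallest enabled count that still meets the target delay at a given VDD. This is a minimal sketch, not the procedure used in the thesis: it assumes that energy grows with the number of enabled regenerators, so the smallest count that meets the target is the preferred one. The function names and the toy delay model are hypothetical and carry no simulation data.

```python
from typing import Callable, Optional


def min_enabled_regenerators(
    delay_of: Callable[[int], float],
    target_delay: float,
    total: int,
) -> Optional[int]:
    """Smallest enabled count whose delay meets the target, or None."""
    for n_enabled in range(total + 1):
        if delay_of(n_enabled) <= target_delay:
            return n_enabled
    return None  # target cannot be met even with all regenerators enabled


# Hypothetical toy delay model in arbitrary units (NOT simulation data):
# delay falls as more regenerators are enabled.
def toy_delay(n_enabled: int) -> float:
    return 100.0 / (1 + n_enabled)


chosen = min_enabled_regenerators(toy_delay, target_delay=10.0, total=35)
unreachable = min_enabled_regenerators(toy_delay, target_delay=1.0, total=35)
print(chosen, unreachable)
```

In practice delay_of would be backed by circuit simulation at the VDD of interest, and the search would be repeated once per supported VDD to build the runtime enable table.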
Figure 6.9: At lower VDDs, some of the regenerators can be disabled while still meeting
the target performance. At 0.35V, for example, only 11 out of 35 regenerators are enabled,
achieving 21% reduction in energy consumption compared to when all are enabled.
Figure 6.10: The optimal numbers of enabled regenerators to achieve the target performance
across VDDs are found. The proposed reconfigurable interconnect design reduces energy
consumption by up to 28% by disabling a subset of regenerators.
We find the optimal number of regenerators to enable for each VDD, which can be used
to dynamically enable and disable regenerators during runtime. The target performance is
the best performance among the three repeater-based designs (Design I-III) at each VDD. As
shown in Fig. 6.9, at 0.35V, only 11 out of 35 regenerators need to be enabled to achieve
the same performance as Design III (12.2 FO4 delays). The remaining 24 regenerators
can be disabled, reducing energy consumption of regenerators by 21% compared to when all
regenerators are enabled. At VDD=0.35-0.5V, it is sufficient to enable a subset of regenerators
(11 to 17 out of 35) to achieve the target performance (Fig. 6.10). At VDD=0.6-1.0V,
all of the 35 regenerators need to be enabled. As shown in Fig. 6.10, the reconfiguration can




Figure 6.11: The simulation results of (a) delay, (b) energy consumption, (c) slew, and (d)
area of the proposed reconfigurable interconnect design and the three conventional repeater-
based interconnect designs.
The proposed interconnect design is compared to three conventional repeater-based
designs, Design I, II, and III, which are optimized for the best performance at 1, 0.7,
and 0.35V, respectively. The total active areas of the proposed and the conventional designs
are shown in Fig. 6.11(d). Although all four designs have similar total device width,
the proposed regenerator-based interconnect design has 29-45% active area overhead due to
the more complex topology of the regenerator. The performance, slew, and energy efficiency
of the four interconnect designs are compared across VDD ranging from 0.35 to 1.0V.
As shown in Figs. 6.11(a) and (b), the proposed interconnect design achieves a 3× improvement
in performance and a 28% improvement in energy efficiency at VDD=0.35V compared
to Design I. Design I exhibits a large performance degradation at low
VDDs due to its excessive number of repeaters. At VDD=1V, for which Design I is optimized,
the proposed interconnect design still achieves comparable performance, with less than
3% degradation, and comparable energy efficiency. As Design II is optimized at the intermediate VDD
of 0.7V, it exhibits more balanced performance and energy efficiency across VDDs than
Designs I and III. As shown in Fig. 6.11(a), Design II, however, has 2.1× longer delay and
49% more energy consumption than the proposed interconnect design at 0.35V. In addition,
Design II is 5% slower than the proposed design at 1V, as its number of
repeaters is not optimal at that voltage. As shown in Fig. 6.11(b), compared to Design III,
the proposed interconnect design has comparable delay (< 3% degradation) and 28% lower
energy consumption at VDD=0.35V. The proposed design also achieves 2.8× shorter delay
and 6.3× higher energy efficiency than Design III at VDD=1V. Note that Design
III exhibits largely compromised performance at higher VDDs since its number of inserted
repeaters is far smaller than the optimal value.
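The comparison above mixes percentage energy figures with multiplicative efficiency gains, so a short conversion helps when reading the numbers side by side. The sketch below assumes that energy efficiency is the reciprocal of energy per switching event; that definition and the function names are assumptions made for illustration, not statements from the text.

```python
def gain_from_energy_saving(saving_pct: float) -> float:
    """Efficiency gain of a design that uses saving_pct less energy."""
    return 1.0 / (1.0 - saving_pct / 100.0)


def gain_from_extra_energy(extra_pct: float) -> float:
    """Efficiency gain over a competitor that uses extra_pct more energy."""
    return 1.0 + extra_pct / 100.0


# Percentages quoted in the comparison at VDD = 0.35V.
gain_28_lower = gain_from_energy_saving(28.0)    # 28% lower energy
gain_49_more = gain_from_extra_energy(49.0)      # competitor uses 49% more
print(f"{gain_28_lower:.2f}x, {gain_49_more:.2f}x")
```

Under this assumption, 28% lower energy corresponds to roughly a 1.4× efficiency gain and 49% more energy in the competing design to roughly 1.5×.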
As regenerators rely on the detection of interconnect transitions, the slew of the proposed
interconnect design is found to be worse than that of the conventional repeater-based designs
(see Fig. 6.11(c)). However, the slew of the proposed interconnect is still less than 6 FO4
delays, which can be considered acceptable [75,76]. Another overhead of the proposed
interconnect design is higher static power consumption. Designs I, II, and III have
similar static power consumption due to their similar total device width. The regenerators
have 3× higher static power consumption due to more leakage paths and the use of low-VTH
devices. Static power reduction techniques (e.g., power gating switches) can be used to
mitigate this overhead.
6.4.3 Non-Minimum Width Wire
Figure 6.12: The proposed design demonstrates similar amounts of delay and energy
improvement over wires of different widths. The proposed design is compared to (a)
Design I (1V) at 0.35V, and (b) Design III (0.35V) at 1V.
So far, minimum-width wires have been used throughout this chapter. In this section,
we repeat the experiments to confirm the effectiveness of the proposed interconnect
design technique on non-minimum-width wires. We use five different wire widths from
0.1 (minimum) to 0.5 µm. The wires are 10 mm long. The optimal size and the optimal
number of repeaters and regenerators are searched again for each width. As shown in Fig. 6.12,
the simulation results show that the proposed interconnect design technique achieves a similar
amount of improvement in both delay and energy consumption across different wire widths,
confirming that the proposed technique is effective for wider wires.
6.5 Summary
In this work, we propose a reconfigurable interconnect design technique based on regenerators
for UDVS systems. The proposed interconnect design outperforms all three repeater-
based interconnect designs in performance by 2.1×-3× and in energy efficiency by 1.4×-6.3×.
Even compared to the best case among the three repeater-based designs across VDDs, the





Ultra-low-voltage operation and emerging architectures are key techniques in enabling new
applications such as energy-constrained Internet of Things devices and cognitive computing.
The large delay variability across PVT variation has been shown to be a limiting factor for the
achievable energy efficiency in these systems. To fully claim the energy-efficiency benefits, it
is important to adaptively handle the PVT variations without imposing the worst-case safety
margin. However, it is shown that the conventional adaptive techniques that are optimized
for nominal supply voltage and traditional Von-Neumann architectures become unreliable
and cause large area, throughput, and energy overheads.
Chapter 2 analyzed the challenges of conventional EDAC techniques when applied to the
ultra-low-voltage regime and proposed voltage-scalable and low-overhead EDAC techniques,
which were demonstrated by the 0.4V R-Processor. Chapter 3 discussed the challenges of
conventional EDAC techniques in emerging architectures, proposed an architecture-independent
EDAC technique, and demonstrated a 450mV timing-margin-free waveform sorter.
Chapter 4 introduced the challenges of conventional EDAC-based power management systems,
which use voltage-based regulation, and demonstrated a load and power management
co-design strategy based on direct error regulation.
This thesis also explored two circuit techniques for dynamic-thermal-management and
ultra-dynamic-voltage-scaling. Chapter 5 presented a temperature sensor circuit for dynamic-
thermal-management. Chapter 6 introduced a regenerator based reconfigurable interconnect
design strategy for ultra-dynamic-voltage-scaling systems.
Although this thesis has presented techniques to efficiently handle variation in ultra-
low-voltage designs and emerging architectures as well as circuits for dynamic-thermal-
management and ultra-dynamic-voltage-scaling, there is yet more to be studied. In the
context of adaptive design, detailed variation analysis of ultra-low-voltage computing
hardware needs further effort. Also, the challenges of implementing the adaptive techniques
in commercial systems with real-life applications need further investigation. In the
context of dynamic-thermal-management, the deployment strategy for the temperature sensor
needs further investigation. Also, the challenges that arise when embedding the sensors in
commercial high-performance microprocessors are yet to be studied. Lastly, for circuit
design for ultra-dynamic-voltage-scaling systems, there exist various areas to be explored.
This thesis only explored the interconnect design, but other areas such as the clock network,
memory, and pipeline structure need to be studied.
Bibliography
[1] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge,
“A self-tuning dvs processor using delay-error detection and correction,” IEEE Journal
of Solid-State Circuits, vol. 41, no. 4, pp. 792–804, April 2006.
[2] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. M. Harris, D. Blaauw, and D. Sylvester,
“Bubble razor: Eliminating timing margins in an arm cortex-m3 processor in 45 nm
cmos using architecturally independent error detection and correction,” IEEE Journal
of Solid-State Circuits, vol. 48, no. 1, pp. 66–81, Jan 2013.
[3] L. G. Salem and P. P. Mercier, “An 85%-efficiency fully integrated 15-ratio recursive
switched-capacitor dc-dc converter with 0.1-to-2.2v output voltage range,” in 2014 IEEE
International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb
2014, pp. 88–89.
[4] S. Bang, A. Wang, B. Giridhar, D. Blaauw, and D. Sylvester, “A fully integrated
successive-approximation switched-capacitor dc-dc converter with 31mv output volt-
age resolution,” in 2013 IEEE International Solid-State Circuits Conference Digest of
Technical Papers, Feb 2013, pp. 370–371.
[5] J. s. Seo, P. Singh, D. Sylvester, and D. Blaauw, “Self-timed regenerators for high-speed
and low-power interconnect,” in 8th International Symposium on Quality Electronic
Design (ISQED’07), March 2007, pp. 621–626.
[6] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, “Theoretical and practical limits
of dynamic voltage scaling,” in Proceedings of the 41st Annual Design Automation
Conference, ser. DAC ’04. New York, NY, USA: ACM, 2004, pp. 868–873. [Online].
Available: http://doi.acm.org/10.1145/996566.996798
[7] B. H. Calhoun, A. Wang, and A. Chandrakasan, “Modeling and sizing for minimum
energy operation in subthreshold circuits,” IEEE Journal of Solid-State Circuits, vol. 40,
no. 9, pp. 1778–1786, Sept 2005.
[8] M. Seok, S. Hanson, Y.-S. Lin, Z. Foo, D. Kim, Y. Lee, N. Liu, D. Sylvester, and
D. Blaauw, “The phoenix processor: A 30pw platform for sensor applications,” in 2008
IEEE Symposium on VLSI Circuits, June 2008, pp. 188–189.
[9] Y. Lee, S. Bang, I. Lee, Y. Kim, G. Kim, M. H. Ghaed, P. Pannuto, P. Dutta,
D. Sylvester, and D. Blaauw, “A modular 1 mm3 die-stacked sensing platform with low
power i2c inter-die communication and multi-modal energy harvesting,” IEEE Journal
of Solid-State Circuits, vol. 48, no. 1, pp. 229–243, Jan 2013.
[10] K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J.
Nowak, D. J. Pearson, and N. J. Rohrer, “High-performance cmos variability in the
65-nm regime and beyond,” IBM Journal of Research and Development, vol. 50, no.
4.5, pp. 433–449, July 2006.
[11] D. Bull, S. Das, K. Shivashankar, G. S. Dasika, K. Flautner, and D. Blaauw, “A power-
efficient 32 bit arm processor using timing-error detection and correction for transient-
error tolerance and adaptation to pvt variation,” IEEE Journal of Solid-State Circuits,
vol. 46, no. 1, pp. 18–31, Jan 2011.
[12] K. A. Bowman, J. W. Tschanz, N. S. Kim, J. C. Lee, C. B. Wilkerson, S. L. L. Lu,
T. Karnik, and V. K. De, “Energy-efficient and metastability-immune resilient circuits
for dynamic variation tolerance,” IEEE Journal of Solid-State Circuits, vol. 44, no. 1,
pp. 49–63, Jan 2009.
[13] K. A. Bowman, J. W. Tschanz, S. L. L. Lu, P. A. Aseron, M. M. Khellah, A. Ray-
chowdhury, B. M. Geuskens, C. Tokunaga, C. B. Wilkerson, T. Karnik, and V. K. De,
“A 45 nm resilient microprocessor core for dynamic variation tolerance,” IEEE Journal
of Solid-State Circuits, vol. 46, no. 1, pp. 194–208, Jan 2011.
[14] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw,
T. Austin, K. Flautner, and T. Mudge, “Razor: a low-power pipeline based on circuit-
level timing speculation,” in Microarchitecture, 2003. MICRO-36. Proceedings. 36th
Annual IEEE/ACM International Symposium on, Dec 2003, pp. 7–18.
[15] S. Das, C. Tokunaga, S. Pant, W. H. Ma, S. Kalaiselvan, K. Lai, D. M. Bull, and D. T.
Blaauw, “Razorii: In situ error detection and correction for pvt and ser tolerance,”
IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32–48, Jan 2009.
[16] I. Kwon, S. Kim, D. Fick, M. Kim, Y. P. Chen, and D. Sylvester, “Razor-lite: A light-
weight register for error detection by observing virtual supply rails,” IEEE Journal of
Solid-State Circuits, vol. 49, no. 9, pp. 2054–2066, Sept 2014.
[17] R. Pawlowski, E. Krimer, J. Crop, J. Postman, N. Moezzi-Madani, M. Erez, and P. Chi-
ang, “A 530mv 10-lane simd processor with variation resiliency in 45nm soi,” in 2012
IEEE International Solid-State Circuits Conference, Feb 2012, pp. 492–494.
[18] A. Drake, R. Senger, H. Deogun, G. Carpenter, S. Ghiasi, T. Nguyen, N. James,
M. Floyd, and V. Pokala, “A distributed critical-path timing monitor for a 65nm high-
performance microprocessor,” in 2007 IEEE International Solid-State Circuits Confer-
ence. Digest of Technical Papers, Feb 2007, pp. 398–399.
[19] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, “A dynamic voltage
scaled microprocessor system,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11,
pp. 1571–1580, Nov 2000.
[20] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi, H. Kawahara,
K. Kumano, and M. Shimura, “Dynamic voltage and frequency management for a low-
power embedded microprocessor,” IEEE Journal of Solid-State Circuits, vol. 40, no. 1,
pp. 28–35, Jan 2005.
[21] R. Wilson, E. Beigne, P. Flatresse, A. Valentian, F. Abouzeid, T. Benoist, C. Bernard,
S. Bernard, O. Billoint, S. Clerc, B. Giraud, A. Grover, J. L. Coz, I. M. Panades, J. P.
Noel, B. Pelloux-Prayer, P. Roche, O. Thomas, Y. Thonnart, D. Turgis, F. Clermidy,
and P. Magarshack, “A 460mhz at 397mv, 2.6ghz at 1.3v, 32b vliw dsp, embedding
fmax tracking,” in 2014 IEEE International Solid-State Circuits Conference Digest of
Technical Papers (ISSCC), Feb 2014, pp. 452–453.
[22] B. H. Calhoun and A. P. Chandrakasan, “Ultra-dynamic voltage scaling (udvs) using
sub-threshold operation and local voltage dithering,” IEEE Journal of Solid-State Cir-
cuits, vol. 41, no. 1, pp. 238–245, Jan 2006.
[23] J. Long, S. O. Memik, G. Memik, and R. Mukherjee, “Thermal monitoring mechanisms
for chip multiprocessors,” ACM Trans. Archit. Code Optim., vol. 5, no. 2, pp. 9:1–9:33,
Sep. 2008. [Online]. Available: http://doi.acm.org/10.1145/1400112.1400114
[24] A. N. Nowroz, R. Cochran, and S. Reda, “Thermal monitoring of real processors:
Techniques for sensor allocation and full characterization,” in Proceedings of the 47th
Design Automation Conference, ser. DAC ’10. New York, NY, USA: ACM, 2010, pp.
56–61. [Online]. Available: http://doi.acm.org/10.1145/1837274.1837291
[25] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza, S. Mey-
ers, E. Fang, and R. Kumar, “An integrated quad-core opteron processor,” in 2007 IEEE
International Solid-State Circuits Conference. Digest of Technical Papers, Feb 2007, pp.
102–103.
[26] M. Floyd, M. Allen-Ware, K. Rajamani, B. Brock, C. Lefurgy, A. J. Drake, L. Pesantez,
T. Gloekler, J. A. Tierno, P. Bose, and A. Buyuktosunoglu, “Introducing the adaptive
energy management features of the power7 chip,” IEEE Micro, vol. 31, no. 2, pp. 60–75,
March 2011.
[27] E. J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogen-
miller, F. Malgioglio, R. Nett, J. Paredes, J. Pille, D. Plass, R. Puri, P. Restle, D. Shan,
K. Stawiasz, Z. T. Deniz, D. Wendel, and M. Ziegler, “Power8™: A 12-core server-
class processor in 22nm soi with 7.6tb/s off-chip bandwidth,” in 2014 IEEE Interna-
tional Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb 2014,
pp. 96–97.
[28] S. Borkar, “Design challenges of technology scaling,” IEEE Micro, vol. 19, no. 4, pp.
23–29, Jul 1999.
[29] G. Indiveri and S. C. Liu, “Memory and information processing in neuromorphic sys-
tems,” Proceedings of the IEEE, vol. 103, no. 8, pp. 1379–1397, Aug 2015.
[30] D. Jeon, M. B. Henry, Y. Kim, I. Lee, Z. Zhang, D. Blaauw, and D. Sylvester, “An
energy efficient full-frame feature extraction accelerator with shift-latch fifo in 28 nm
cmos,” IEEE Journal of Solid-State Circuits, vol. 49, no. 5, pp. 1271–1284, May 2014.
[31] V. Karkare, S. Gibson, and D. Marković, “A 75-µw, 16-channel neural spike-sorting
processor with unsupervised clustering,” IEEE Journal of Solid-State Circuits, vol. 48,
no. 9, pp. 2230–2238, Sept 2013.
[32] B. Zhang, Z. Jiang, Q. Wang, J. S. Seo, and M. Seok, “A neuromorphic neural spike
clustering processor for deep-brain sensing and stimulation systems,” in Low Power
Electronics and Design (ISLPED), 2015 IEEE/ACM International Symposium on, July
2015, pp. 91–97.
[33] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-
threshold computing: Reclaiming moore’s law through energy efficient integrated cir-
cuits,” Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb 2010.
[34] S. Kim, I. Kwon, D. Fick, M. Kim, Y. P. Chen, and D. Sylvester, “Razor-lite: A
side-channel error-detection register for timing-margin recovery in 45nm soi cmos,” in
2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers,
Feb 2013, pp. 264–265.
[35] I. Shin, J. J. Kim, Y. S. Lin, and Y. Shin, “A pipeline architecture with 1-cycle tim-
ing error correction for low voltage operations,” in Low Power Electronics and Design
(ISLPED), 2013 IEEE International Symposium on, Sept 2013, pp. 199–204.
[36] B. H. Calhoun, F. A. Honore, and A. Chandrakasan, “Design methodology for fine-
grained leakage control in mtcmos,” in Proceedings of the 2003 International Symposium
on Low Power Electronics and Design, ser. ISLPED ’03. New York, NY, USA: ACM,
2003, pp. 104–109. [Online]. Available: http://doi.acm.org/10.1145/871506.871535
[37] M. Seok, “Decoupling capacitor design strategy for minimizing supply noise of ultra
low voltage circuits,” in Proceedings of the 49th Annual Design Automation Conference,
ser. DAC ’12. New York, NY, USA: ACM, 2012, pp. 968–973. [Online]. Available:
http://doi.acm.org/10.1145/2228360.2228534
[38] H. Fuketa, R. Takahashi, M. Takamiya, M. Nomura, H. Shinohara, and T. Sakurai,
“Increase of crosstalk noise due to imbalanced threshold voltage between nmos and
pmos in subthreshold logic circuits,” IEEE Journal of Solid-State Circuits, vol. 48,
no. 8, pp. 1986–1994, Aug 2013.
[39] M. R. Choudhury and K. Mohanram, “Masking timing errors on speed-paths in logic
circuits,” in 2009 Design, Automation Test in Europe Conference Exhibition, April 2009,
pp. 87–92.
[40] M. Seok, D. Jeon, C. Chakrabarti, D. Blaauw, and D. Sylvester, “Pipeline strategy for
improving optimal energy efficiency in ultra-low voltage design,” in Design Automation
Conference (DAC), 2011 48th ACM/EDAC/IEEE, June 2011, pp. 990–995.
[41] M. Seok, “A 0.27v 30mhz 17.7nj/transform 1024-pt complex fft core with super-
pipelining,” in 2011 IEEE International Solid-State Circuits Conference, Feb 2011, pp.
342–344.
[42] D. Jeon, M. Seok, C. Chakrabarti, D. Blaauw, and D. Sylvester, “A super-pipelined
energy efficient subthreshold 240 ms/s fft core in 65 nm cmos,” IEEE Journal of Solid-
State Circuits, vol. 47, no. 1, pp. 23–34, Jan 2012.
[43] S. Paul, M. Abbott, E. Kishinevsky, P. Aseron, S. Vangal, V. De, and G. Taylor, “A
3.6gb/s 1.3mw 400mv 0.051mm2 near-threshold voltage resilient router in 22nm tri-gate
cmos,” in VLSI Technology (VLSIT), 2013 Symposium on, June 2013, pp. C30–C31.
[44] J. P. Kulkarni, C. Tokunaga, P. Aseron, T. Nguyen, C. Augustine, J. Tschanz, and
V. De, “A 409gops/w adaptive and resilient domino register file in 22nm tri-gate cmos
featuring in-situ timing margin and error detection for tolerance to within-die variation,
voltage droop, temperature and aging,” in 2015 IEEE International Solid-State Circuits
Conference - (ISSCC) Digest of Technical Papers, Feb 2015, pp. 1–3.
[45] S. Kim and M. Seok, “Variation-tolerant, ultra-low-voltage microprocessor with a low-
overhead, within-a-cycle in-situ timing-error detection and correction technique,” IEEE
Journal of Solid-State Circuits, vol. 50, no. 6, pp. 1478–1490, June 2015.
[46] M. Seok, D. Blaauw, and D. Sylvester, “Clock network design for ultra-low power
applications,” in Proceedings of the 16th ACM/IEEE International Symposium on Low
Power Electronics and Design, ser. ISLPED ’10. New York, NY, USA: ACM, 2010,
pp. 271–276. [Online]. Available: http://doi.acm.org/10.1145/1840845.1840901
[47] S. Kim, J. P. Cerqueira, and M. Seok, “A 450mv timing-margin-free waveform sorter
based on body swapping error correction,” in 2016 IEEE Symposium on VLSI Circuits
(VLSI-Circuits), June 2016, pp. 1–2.
[48] K. Hirairi, Y. Okuma, H. Fuketa, T. Yasufuku, M. Takamiya, M. Nomura, H. Shi-
nohara, and T. Sakurai, “13% power reduction in 16b integer unit in 40nm cmos by
adaptive power supply voltage control with parity-based error prediction and detection
(pepd) and fully integrated digital ldo,” in 2012 IEEE International Solid-State Circuits
Conference, Feb 2012, pp. 486–488.
[49] S. R. Sridhara, M. DiRenzo, S. Lingam, S. J. Lee, R. Blazquez, J. Maxey, S. Ghanem,
Y. H. Lee, R. Abdallah, P. Singh, and M. Goel, “Microwatt embedded processor plat-
form for medical system-on-chip applications,” IEEE Journal of Solid-State Circuits,
vol. 46, no. 4, pp. 721–730, April 2011.
[50] K. Souri and K. A. A. Makinwa, “A 0.12 mm2 7.4µw micropower temperature sensor
with an inaccuracy of ±0.2°C (3σ) from -30°C to 125°C,” IEEE Journal of Solid-State
Circuits, vol. 46, no. 7, pp. 1693–1700, July 2011.
[51] K. Souri, Y. Chae, and K. A. A. Makinwa, “A cmos temperature sensor with a voltage-
calibrated inaccuracy of ±0.15°C (3σ) from -55°C to 125°C,” IEEE Journal of Solid-State
Circuits, vol. 48, no. 1, pp. 292–301, Jan 2013.
[52] J. S. Shor and K. Luria, “Miniaturized bjt-based thermal sensor for microprocessors in
32- and 22-nm technologies,” IEEE Journal of Solid-State Circuits, vol. 48, no. 11, pp.
2860–2867, Nov 2013.
[53] E. Saneyoshi, K. Nose, M. Kajita, and M. Mizuno, “A 1.1v 35µm × 35µm thermal
sensor with supply voltage sensitivity of 2°C/10%-supply for thermal management on
the sx-9 supercomputer,” in 2008 IEEE Symposium on VLSI Circuits, June 2008, pp.
152–153.
[54] Y. W. Li, H. Lakdawala, A. Raychowdhury, G. Taylor, and K. Soumyanath, “A 1.05v
1.6mw 0.45°C 3σ-resolution ΔΣ-based temperature sensor with parasitic-resistance
compensation in 32nm cmos,” in 2009 IEEE International Solid-State Circuits Con-
ference - Digest of Technical Papers, Feb 2009, pp. 340–341,341a.
[55] K. Kim, H. Lee, S. Jung, and C. Kim, “A 366ks/s 400uw 0.0013mm2 frequency-to-
digital converter based cmos temperature sensor utilizing multiphase clock,” in 2009
IEEE Custom Integrated Circuits Conference, Sept 2009, pp. 203–206.
[56] K. Souri, Y. Chae, F. Thus, and K. Makinwa, “A 0.85v 600nw all-cmos temperature
sensor with an inaccuracy of ±0.4°C (3σ) from -40 to 125°C,” in 2014 IEEE International
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb 2014, pp. 222–
223.
[57] S. Hwang, J. Koo, K. Kim, H. Lee, and C. Kim, “A 0.008 mm2 500µw 469 ks/s
frequency-to-digital converter based cmos temperature sensor with process variation
compensation,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60,
no. 9, pp. 2241–2248, Sept 2013.
[58] D. Shim, H. Jeong, H. Lee, C. Rhee, D. K. Jeong, and S. Kim, “A process-variation-
tolerant on-chip cmos thermometer for auto temperature compensated self-refresh of
low-power mobile dram,” IEEE Journal of Solid-State Circuits, vol. 48, no. 10, pp.
2550–2557, Oct 2013.
[59] T. Yang, S. Kim, P. R. Kinget, and M. Seok, “Compact and supply-voltage-scalable
temperature sensors for dense on-chip thermal monitoring,” IEEE Journal of Solid-
State Circuits, vol. 50, no. 11, pp. 2773–2785, Nov 2015.
[60] R. Quan, U. Sonmez, F. Sebastiano, and K. A. A. Makinwa, “A 4600µm2 1.5°C (3σ)
0.9ks/s thermal-diffusivity temperature sensor with vco-based readout,” in 2015 IEEE
International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, Feb
2015, pp. 1–3.
[61] K. K. Rangan, G.-Y. Wei, and D. Brooks, “Thread motion: Fine-grained power
management for multi-core systems,” in Proceedings of the 36th Annual International
Symposium on Computer Architecture, ser. ISCA ’09. New York, NY, USA: ACM,
2009, pp. 302–313. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555793
[62] D. N. Truong, W. H. Cheng, T. Mohsenin, Z. Yu, A. T. Jacobson, G. Landge, M. J.
Meeuwsen, C. Watnik, A. T. Tran, Z. Xiao, E. W. Work, J. W. Webb, P. V. Mejia, and
B. M. Baas, “A 167-processor computational platform in 65 nm cmos,” IEEE Journal
of Solid-State Circuits, vol. 44, no. 4, pp. 1130–1144, April 2009.
[63] P. Chen, C.-C. Chen, C.-C. Tsai, and W.-F. Lu, “A time-to-digital-converter-based
cmos smart temperature sensor,” IEEE Journal of Solid-State Circuits, vol. 40, no. 8,
pp. 1642–1648, Aug 2005.
[64] Y. Tsividis and C. McAndrew, Operation and Modeling of the MOS Transistors, 3rd ed.
Oxford, U.K: Oxford Univ. Press, 2011.
[65] M. Seok, D. Blaauw, and D. Sylvester, “Robust clock network design methodology
for ultra-low voltage operations,” IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, vol. 1, no. 2, pp. 120–130, June 2011.
[66] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, “A variation-tolerant sub-200 mv 6-t
subthreshold sram,” IEEE Journal of Solid-State Circuits, vol. 43, no. 10, pp. 2338–
2348, Oct 2008.
[67] R. H. Krambeck, C. M. Lee, and H. F. S. Law, “High-speed compact circuits with
cmos,” IEEE Journal of Solid-State Circuits, vol. 17, no. 3, pp. 614–619, Jun 1982.
[68] F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta, R. Heald, and
G. Yee, “A new family of semidynamic and dynamic flip-flops with embedded logic for
high-performance processors,” IEEE Journal of Solid-State Circuits, vol. 34, no. 5, pp.
712–716, May 1999.
[69] S. Hanson, B. Zhai, M. Seok, B. Cline, K. Zhou, M. Singhal, M. Minuth, J. Olson,
L. Nazhandali, T. Austin, D. Sylvester, and D. Blaauw, “Performance and variability
optimization strategies in a sub-200mv, 3.5pj/inst, 11nw subthreshold processor,” in
2007 IEEE Symposium on VLSI Circuits, June 2007, pp. 152–153.
[70] H. B. Bakoglu and J. D. Meindl, “Optimal interconnection circuits for vlsi,” IEEE
Transactions on Electron Devices, vol. 32, no. 5, pp. 903–909, May 1985.
[71] V. Adler and E. G. Friedman, “Repeater design to reduce delay and power in resistive
interconnect,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal
Processing, vol. 45, no. 5, pp. 607–616, May 1998.
[72] N. H. E. Weste and D. Harris, CMOS VLSI Design A Circuits and Systems Perspective.
MA: Addison-Wesley, 2005.
[73] R. Ho, K. W. Mai, and M. A. Horowitz, “The future of wires,” Proceedings of the IEEE,
vol. 89, no. 4, pp. 490–504, Apr 2001.
[74] A. Nalamalpu, S. Srinivasan, and W. P. Burleson, “Boosters for driving long onchip in-
terconnects - design issues, interconnect synthesis, and comparison with repeaters,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 21, no. 1, pp. 50–62, Jan 2002.
[75] Y. Peng and X. Liu, “Low-power repeater insertion with both delay and slew rate
constraints,” in Proceedings of the 43rd Annual Design Automation Conference, ser.
DAC ’06. New York, NY, USA: ACM, 2006, pp. 302–307. [Online]. Available:
http://doi.acm.org/10.1145/1146909.1146989
[76] A. B. Kahng, S. Muddu, E. Sarto, and R. Sharma, “Interconnect tuning strategies for
high-performance ics,” in Proceedings of the Conference on Design, Automation and
Test in Europe, ser. DATE ’98. Washington, DC, USA: IEEE Computer Society, 1998,
pp. 471–478. [Online]. Available: http://dl.acm.org/citation.cfm?id=368058.368271
