Modeling and Implementation of Threshold Logic Circuits and Architectures by Leshner, Samuel (Author) et al.
Modeling and Implementation of Threshold Logic Circuits and Architectures
by
Samuel Leshner
A Dissertation Presented in Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
Approved November 2010 by the
Graduate Supervisory Committee:
Sarma Vrudhula, Chair
Lawrence Clark
Aviral Shrivastava
Karamvir Chatha
ARIZONA STATE UNIVERSITY
December 2010
ABSTRACT
Threshold logic has long been studied as a means of achieving higher performance
and lower power dissipation, providing improvements by condensing simple
logic gates into more complex primitives, effectively reducing gate count, pipeline
depth, and number of interconnects. This work proposes a new physical implementation
of threshold logic, the threshold logic latch (TLL), which overcomes the difficulties
observed in previous work, particularly with respect to gate reliability in the
presence of noise and process variations. Simple but effective models were created to
assess the delay, power, and noise margin of TLL gates for the purpose of determining
the physical parameters and assignment of input signals that achieve the lowest
delay subject to constraints on power and reliability. From these models, an optimized
library of standard TLL cells was developed to supplement a commercial library of
static CMOS gates. The new cells were then demonstrated on a number of automatically
synthesized, placed, and routed designs. A two-stage 2’s complement integer
multiplier designed with CMOS and TLL gates utilized 19.5% less area, 28.0% less
active power, and 61.5% less leakage power than an equivalent design with the same
performance using only static CMOS gates. Additionally, a two-stage 32-instruction
4-way issue queue designed with CMOS and TLL gates utilized 30.6% less area, 31.0%
less active power, and 58.9% less leakage power than an equivalent design with the
same performance using only static CMOS gates.
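As a brief illustration of how a single threshold gate can absorb a multi-gate Boolean network, the following Python sketch checks the example named in the caption of Figure 1.1: the function a(b(c+d+e) + c(d+e) + de) + bcde against a gate with weights {2, 1, 1, 1, 1} and threshold 4. The helper names (threshold_gate, boolean_form) are hypothetical, and the convention that the gate outputs 1 when the weighted sum is greater than or equal to the threshold is an assumption made for this sketch rather than a definition quoted from this work.

```python
from itertools import product

def threshold_gate(weights, threshold, inputs):
    """Return 1 if the weighted sum of the inputs reaches the threshold.

    Assumption for this sketch: the comparison is 'greater than or equal'.
    """
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def boolean_form(a, b, c, d, e):
    """The multi-gate form a(b(c+d+e) + c(d+e) + de) + bcde from Figure 1.1."""
    at_least_two = (b and (c or d or e)) or (c and (d or e)) or (d and e)
    return int((a and at_least_two) or (b and c and d and e))

# Exhaustively compare both forms over all 2^5 input combinations.
weights, threshold = (2, 1, 1, 1, 1), 4
mismatches = [bits for bits in product((0, 1), repeat=5)
              if threshold_gate(weights, threshold, bits) != boolean_form(*bits)]
print("mismatching input combinations:", mismatches)
```

An empty mismatch list indicates that, under the stated convention, the single weighted-sum comparison reproduces the multi-gate network for every input combination, which is the sense in which the logic is condensed into one primitive.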
To my parents, my brother Harry, and all of my friends.
ACKNOWLEDGEMENTS
I would first like to express my deep appreciation to my committee members
Dr. Lawrence Clark, Dr. Aviral Shrivastava, Dr. Karamvir Chatha, Dr. Georgios
Fainekos, and in particular my committee chair Dr. Sarma Vrudhula for their guidance
and support.
I would like to thank all of my research colleagues throughout my academic
career, including Sarvesh Bhardwaj, Praveen Ghanta, Ravishankar Rao, Tejaswi Gowda,
Vinay Hanumaiah, Saurabh Patel, Gayathri Chalivendra, Indira Negi, Manoj
Venkatasubbu, Siddhesh Mhambrey, and Xiaoyin Yao for their friendship and collaborative
input, and extend a very special thanks to Dr. Krzysztof Berezowski, who provided
invaluable support and insight without which this work would not have been possible.
Additionally, I would like to extend my thanks to Dr. David Blaauw at the University of
Michigan and his students, particularly Carlos Tokunaga and Zhiyoong Foo, for their
assistance with the CAD flow used in the fabrication of the multiplier test chip.
I would also like to express my deepest thanks for the funding received from the
National Science Foundation under award CCF-070283, the Science Foundation
Arizona SFAZ-SBC and the Stardust Foundation, the Consortium for Embedded Systems,
and the Department of Computer Science and Engineering through principal
investigator Dr. Sarma Vrudhula, which provided my salary as a full-time research assistant,
tuition, travel expenses, and additional benefits from 2005 to 2010.
Finally, I would like to thank my family for their love and support
throughout this long and arduous process.
TABLE OF CONTENTS
Page
TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 INTRODUCTION TO THRESHOLD LOGIC . . . . . . . . . . . . . . . . 1
1.1 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Power dissipation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Threshold logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Important properties of threshold logic functions . . . . . . . . . . . . . 7
2 PHYSICAL IMPLEMENTATIONS OF THRESHOLD LOGIC . . . . . . . 11
2.1 Conventional implementations . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Capacitive implementations . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Conductance-based implementations . . . . . . . . . . . . . . . . . . . 19
2.4 Non-CMOS implementations . . . . . . . . . . . . . . . . . . . . . . . 22
3 DIFFERENTIAL THRESHOLD LOGIC . . . . . . . . . . . . . . . . . . . 26
3.1 Common design and principles of operation . . . . . . . . . . . . . . . 26
3.2 Cross-coupled inverters with asymmetrical loads (CIAL) . . . . . . . . 28
3.3 Latch-type CMOS threshold logic (LCTL) . . . . . . . . . . . . . . . . 31
3.4 Single-input current-sensing differential logic (SCSDL) . . . . . . . . . 34
3.5 Differential current-switch threshold logic (DCSTL) . . . . . . . . . . . 37
3.6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7 Power dissipation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.8 Failure simulations and reliability . . . . . . . . . . . . . . . . . . . . . 44
3.8.1 Voltage scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8.2 Silicon on insulator . . . . . . . . . . . . . . . . . . . . . . . . 49
4 THE THRESHOLD LOGIC LATCH (TLL) . . . . . . . . . . . . . . . . . . 51
4.1 Design of the TLL element . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Comparison with existing differential threshold logic . . . . . . . . . . 55
4.2.1 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Performance and power dissipation . . . . . . . . . . . . . . . . 58
4.3 Masking the reset phase . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Comparison with static CMOS . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Comparison with domino CMOS . . . . . . . . . . . . . . . . . . . . . 67
4.6 Error detection and correction . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Adding scan capability to the TLL element . . . . . . . . . . . . . . . . 72
5 MODELING AND OPTIMIZATION OF THE TLL GATE . . . . . . . . . . 74
5.1 Signal assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Physical parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Delay modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.1.1 Evaluation delay . . . . . . . . . . . . . . . . . . . . . 86
5.2.1.2 Reset delay . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.2 Delay optimization . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.2.1 Model evaluation . . . . . . . . . . . . . . . . . . . . 93
5.2.3 Power modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.4 Power optimization . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2.4.1 Model evaluation . . . . . . . . . . . . . . . . . . . . 108
5.2.5 Reliability modeling . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.6 Reliability optimization . . . . . . . . . . . . . . . . . . . . . . 115
5.2.6.1 Model evaluation . . . . . . . . . . . . . . . . . . . . 116
6 TLL BASED DESIGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1 CMOS/TLL hybridization . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2 Design of a 32-bit integer 2’s complement multiplier using TLL . . . . 125
6.2.1 Partial product generation . . . . . . . . . . . . . . . . . . . . . 126
6.2.2 Partial product reduction . . . . . . . . . . . . . . . . . . . . . 126
6.2.3 Partial product addition . . . . . . . . . . . . . . . . . . . . . . 131
6.3 Multiplier architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4 Design of TLL standard cell library . . . . . . . . . . . . . . . . . . . . 135
6.5 Synthesis, place, and route . . . . . . . . . . . . . . . . . . . . . . . . 138
6.6 Mixed-signal simulation . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.7 Design and fabrication of a 65 nm LP bulk CMOS multiplier test
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.7.1 Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.7.2 Data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.7.3 Output selection and verification . . . . . . . . . . . . . . . . . 147
6.7.4 Clock generation . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.7.5 Top level design and manufacturing . . . . . . . . . . . . . . . . 149
6.7.6 Measurement and analysis of the 65 nm LP bulk CMOS
multiplier test chip . . . . . . . . . . . . . . . . . . . . . . . 151
6.7.6.1 Functional testing . . . . . . . . . . . . . . . . . . . . 153
6.7.6.2 Performance testing . . . . . . . . . . . . . . . . . . . 154
6.7.6.3 Power testing . . . . . . . . . . . . . . . . . . . . . . 154
7 ISSUE LOGIC WITH TLL . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.1 Issue logic design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.1.1 Instruction scoreboard . . . . . . . . . . . . . . . . . . . . . . . 157
7.1.2 Request logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.1.3 Arbiter logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.1.4 Update logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.2 Design of the CMOS/TLL issue queue . . . . . . . . . . . . . . . . . . 165
7.3 Synthesis, placement, and routing . . . . . . . . . . . . . . . . . . . . . 167
8 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
LIST OF TABLES
Table Page
1.1 Circuits Multi-Projets manufacturing costs of ST Microelectronics 65 nm
LP process for 2010. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Ratio of threshold functions to all Boolean functions for fan-in from 1-5. . . 6
1.3 Truth table for the threshold function {2.5, 1.32, 1.47; 2.1}. . . . . . . . . . 9
1.4 Truth table for the threshold function {2.5, 1.32, w2; 2.1}, with 0.78 ≤
w2 ≤ 2.09. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Truth table for the threshold function {25, 13.2, 14.7; 21}. . . . . . . . . . 10
2.1 Worst case device count vs. number of inputs for multi-level CMOS
network, single complex CMOS gate, single complex domino gate, and
transmission gate steering logic. . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Output voltage and static power dissipation of the output wired inverter
implementation of the threshold function {21111;4} . . . . . . . . . . . . . 20
3.1 Evaluation delay of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process. . . . . 41
3.2 Evaluation delay of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, and
DCSTL elements by αL/αR combination for a 65 nm LP bulk CMOS process. 42
3.3 Power dissipation of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (1
GHz clock). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Power dissipation of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, and
DCSTL elements by αL/αR combination for a 65 nm LP bulk CMOS
process (1 GHz clock). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Branch discharge delay of an LCTL element assuming Ci = Zi = 1. . . . . . 45
3.6 Branch discharge delay of an LCTL element assuming Ci = Zi = 1, τN1 +
10%, and τN2 - 10%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 Noise margin of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd =
1.2V). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8 Noise margin of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, and
DCSTL elements by αL/αR combination for a 65 nm LP bulk CMOS process
(Vdd = 1.2V). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.9 Evaluation delay of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd
= 1.0 V). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.10 Power dissipation of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (1
GHz clock, Vdd = 1.0 V). . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.11 Noise margin of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd =
0.8V). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Noise margin of minimum-sized CIAL, LCTL, SCSDL, DCSTL, and TLL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd
= 1.2V). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Noise margin of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, DCSTL,
and TLL elements by αL/αR combination for a 65 nm LP bulk CMOS
process (Vdd = 1.2V). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Number of failures out of 1000 Monte Carlo simulations exhibited by sized
CIAL, LCTL, SCSDL, DCSTL, and TLL elements by αL/αR combination
for a 65 nm LP bulk CMOS process (Vdd = 1.2V). . . . . . . . . . . . . . . 58
4.4 Evaluation delay of minimum-sized CIAL, LCTL, SCSDL, DCSTL, and
TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process. . 59
4.5 Evaluation delay of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL,
DCSTL, and TLL elements by αL/αR combination for a 65 nm LP bulk CMOS
process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Power dissipation of minimum-sized CIAL, LCTL, SCSDL, DCSTL, and
TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process
(1 GHz clock). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Power dissipation of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL,
DCSTL, and TLL elements by αL/αR combination for a 65 nm LP bulk CMOS
process (1 GHz clock). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 Truth table of the SR latch with inputs S and R and outputs Q and Q̄. . . . . 63
4.9 Evaluation delay of sized (M5−8 = 0.96 µm) TLL elements with different
output slave latches by αL/αR combination for a 65 nm LP bulk CMOS
process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.10 Power dissipation of sized (M5−8 = 0.96 µm) TLL elements with different
output slave latches by αL/αR combination for a 65 nm LP bulk CMOS
process (1 GHz clock). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Area, delay, and power dissipation comparison between TLL and static
CMOS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.12 Area, delay, and power dissipation comparison between TLL and domino
CMOS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 Input configuration vs. noise margin for a minimum-sized 16-input TLL
gate implemented in 65 nm LP bulk CMOS . . . . . . . . . . . . . . . . . 75
5.2 Input configuration vs. typical evaluation delay for a minimum-sized
16-input TLL gate implemented in 65 nm LP bulk CMOS . . . . . . . . . . 76
5.3 Input configuration vs. typical evaluation power for a minimum-sized
16-input TLL gate implemented in 65 nm LP bulk CMOS . . . . . . . . . . 77
5.4 Truth table of the threshold function a[b(c+d+e)+c(d+e)+de]+bcde
(21111;4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Simple signal assignment 4a+2b+2c+2d+2e > 7 of the threshold
function {42222;7} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 Example signal assignment #1 of the threshold function {42222;7} . . . . . 82
5.7 Example signal assignment #2 of the threshold function {42222;7} . . . . . 82
5.8 Required simulations of evaluation delay for construction of the delay model. 93
5.9 Simulated parameter values for comparison with evaluation delay model. . . 96
5.10 Model coefficients and fitting parameters for the 65 nm LP and GP bulk
CMOS processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.11 Simulated evaluation delay of a minimum-sized 16-input TLL gate in 65
nm LP bulk CMOS with respect to WDN across a range of αL/αR
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.12 Simulated evaluation delay of a minimum-sized 16-input TLL gate in 65
nm GP bulk CMOS with respect to WDN across a range of αL/αR
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.13 Modeled evaluation delay of a minimum-sized 16-input TLL gate in 65 nm
LP bulk CMOS with respect to WDN across a range of αL/αR combinations. 99
5.14 Modeled evaluation delay of a minimum-sized 16-input TLL gate in 65 nm
GP bulk CMOS with respect to WDN across a range of αL/αR combinations. 99
5.15 Required simulations of reset delay for construction of the delay model. . . 101
5.16 Required simulations of power dissipation for construction of the power
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.17 Simulated power dissipation of a minimum-sized 16-input TLL gate in 65
nm LP bulk CMOS with respect to WDN across a range of αL/αR
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.18 Simulated power dissipation of a minimum-sized 16-input TLL gate in 65
nm GP bulk CMOS with respect to WDN across a range of αL/αR
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.19 Modeled power dissipation of a minimum-sized 16-input TLL gate in 65
nm LP bulk CMOS with respect to WDN across a range of αL/αR
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.20 Modeled power dissipation of a minimum-sized 16-input TLL gate in 65
nm GP bulk CMOS with respect to WDN across a range of αL/αR
combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.21 Required simulations of noise margin for construction of the reliability model. . 117
5.22 Simulated noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm LP bulk CMOS with respect to WDN across a range of αL/αR
combinations. Shaded cells are high reliability (∆τinput/τdiff > 1) cases where
large model inaccuracies occur. . . . . . . . . . . . . . . . . . . . . . . . . 118
5.23 Simulated noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm GP bulk CMOS with respect to WDN across a range of αL/αR
combinations. Shaded cells are high reliability (∆τinput/τdiff > 1) cases where
large model inaccuracies occur. . . . . . . . . . . . . . . . . . . . . . . . . 118
5.24 Modeled ∆τinput/τdiff for a minimum-sized 16-input TLL gate in 65 nm LP bulk
CMOS with respect to WDN across a range of αL/αR combinations. Shaded
cells are high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies
occur. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.25 Modeled ∆τinput/τdiff for a minimum-sized 16-input TLL gate in 65 nm GP bulk
CMOS with respect to WDN across a range of αL/αR combinations. Shaded
cells are high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies
occur. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.26 Modeled noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm LP bulk CMOS with respect to WDN across a range of αL/αR
combinations. Shaded cells are high reliability (∆τinput/τdiff > 1) cases where
large model inaccuracies occur. . . . . . . . . . . . . . . . . . . . . . . . . 120
5.27 Modeled noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm GP bulk CMOS with respect to WDN across a range of αL/αR
combinations. Shaded cells are high reliability (∆τinput/τdiff > 1) cases where
large model inaccuracies occur. . . . . . . . . . . . . . . . . . . . . . . . . 120
6.1 TLL functions employed in CMOS/TLL multiplier designs. . . . . . . . . . 136
6.2 Threshold functions realizable by the TLL cell in Figure 6.11. . . . . . . . 137
6.3 Place and route results of CMOS and CMOS/TLL multiplier designs. . . . 140
6.4 Nanosim simulation results of multiplier designs at worst case delay corner
(SS, 1.1V, 105C). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.5 Nanosim simulation results of multiplier designs at typical delay corner
(TT, 1.2V, 25C). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.6 Average measured multiplier power dissipation across 23 functional dies
(Vdd = 1.2V) at four different internal clock frequencies. . . . . . . . . . . . 155
6.7 Measured multiplier leakage across 23 functional dies (Vdd = 1.2V). . . . . 155
7.1 Comparison between CMOS and CMOS/TLL (design B) issue logic
post-place and route assuming worst case delay conditions (slow NMOS, slow
PMOS, 1.1V supply, and 105C). . . . . . . . . . . . . . . . . . . . . . . . 171
7.2 Comparison between CMOS and CMOS/TLL (design B) issue logic
post-place and route assuming typical delay conditions (typical NMOS, typical
PMOS, 1.2V supply, and 25C). . . . . . . . . . . . . . . . . . . . . . . . . 171
8.1 Credits for multiplier test chip architecture. . . . . . . . . . . . . . . . . . 182
LIST OF FIGURES
Figure Page
1.1 Logic absorption of the function a(b(c+d+e) + c(d+e) + de) + bcde into a
single threshold gate with inputs {a, b, c, d, e}, weights {2, 1, 1, 1, 1}, and
threshold 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Resistor-vacuum tube implementation of the threshold function {21111;4}. 11
2.2 Multi-level CMOS implementation of the threshold function {21111;4}. . . 13
2.3 Single complex static CMOS implementation of the threshold function
{21111;4}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Single complex domino CMOS implementation of the threshold function
{21111;4}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Steering logic implementation of the threshold function {21111;4}. . . . . 16
2.6 Static neuMOS inverter implementation of the threshold function {21111;4}. 18
2.7 Clocked capacitive implementation (STTL) of the threshold function {21111;4}. 19
2.8 Output wired inverter implementation of the threshold function {21111;4}. 20
2.9 Clocked conductive implementation of the threshold function {21111;4}. . 21
2.10 Annotated current-voltage characteristics of the resonant tunneling diode
(RTD). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.11 MOBILE implementation of the threshold function {21111;4}. . . . . . . . . . . . . 23
2.12 Operation of a static CMOS inverter. . . . . . . . . . . . . . . . . . . . . . 24
2.13 Operation of the MOBILE during evaluation (clk = logic 1). . . . . . . . . 24
3.1 Generic differential sense amplifier employed in SRAM column design. . . 27
3.2 Generic differential sense amplifier employed in a differential threshold logic element. . . 27
3.3 Device-level schematic of the CIAL gate. . . . . . . . . . . . . . . . . . . 29
3.4 Evaluation waveforms of the CIAL gate, assuming αL/αR toggling between
5/4 and 4/5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Device-level schematic of the LCTL gate. . . . . . . . . . . . . . . . . . . 32
3.6 Evaluation waveforms of the LCTL gate, assuming αL/αR toggling
between 5/4 and 4/5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 Device-level schematic of the SCSDL gate. . . . . . . . . . . . . . . . . . 34
3.8 Evaluation waveforms of the SCSDL gate, assuming αL/αR toggling
between 5/4 and 4/5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9 Device-level schematic of the DCSTL gate. . . . . . . . . . . . . . . . . . 37
3.10 Evaluation waveforms of the DCSTL gate, assuming αL/αR toggling
between 5/4 and 4/5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.11 Evaluation delay of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, and DCSTL elements assuming max(αL, αR) - min(αL,
αR) = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.12 Power dissipation of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, and DCSTL elements assuming max(αL, αR) - min(αL,
αR) = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.13 Noise margin of minimum-sized and sized (M5−8 = 0.96 µm) CIAL, LCTL,
SCSDL, and DCSTL elements assuming max(αL, αR) - min(αL, αR) = 1. . 47
4.1 Device-level schematic of the threshold logic latch (TLL). . . . . . . . . . 51
4.2 Evaluation waveforms of the TLL gate, assuming αL/αR toggling between
5/4 and 4/5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Noise margin of minimum-sized and sized (M5−8 = 0.96 µm) CIAL, LCTL,
SCSDL, DCSTL, and TLL elements assuming max(αL, αR) - min(αL, αR)
= 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Frequency histogram of threshold voltage variations in several
minimum-sized PMOS devices in the input networks M9 and M10 of a TLL gate. . . . 57
4.5 Evaluation delay of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, DCSTL, and TLL elements assuming max(αL, αR) - min(αL,
αR) = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Power dissipation of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, DCSTL, and TLL elements assuming max(αL, αR) - min(αL,
αR) = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 Evaluation waveforms of the TLL gate augmented with slave latch at
output, assuming αL/αR toggling between 5/4 and 4/5. . . . . . . . . . . . . 62
4.8 Device-level schematic of the threshold logic latch (TLL) augmented with
an SR slave latch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.9 Device-level schematic of the threshold logic latch (TLL) augmented with
D slave latches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.10 Domino CMOS implementation of a 7-input OR function (y = a + b + c +
d + e + f + g). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.11 Domino CMOS implementation of a 7-input OR function (y = a + b + c +
d + e + f + g) with keeper circuit. . . . . . . . . . . . . . . . . . . . . . . . 69
4.12 Augmentation of a single input in in the input network to provide test mode
capability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.13 TLL input network with configurable dummy devices. . . . . . . . . . . . . 71
4.14 Device level schematic of the TLL gate with scan functionality. . . . . . . . 72
5.1 Evaluation delay vs. number of inputs per input network by αL/αR
combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Average evaluation power vs. number of inputs per input network by αL/αR
combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 TLL element annotated with physical parameter sizing groups: DP, DN, IP,
IN, and X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Evaluation delay vs. parameter sizing for minimum-sized 16-input TLL
gate assuming an αL/αR combination of 1/0. . . . . . . . . . . . . . . . . . 86
5.5 RC network representation of input network propagation delay. . . . . . . . 87
5.6 RC network representation of differential discharge delay. . . . . . . . . . . 88
5.7 Reset delay vs. parameter sizing for minimum-sized 16-input TLL gate. . . 90
5.8 Evaluation delay vs. sizing of the parameter WDN for minimum-sized
16-input TLL gate across a range of αL/αR combinations. . . . . . . . . . . 91
5.9 Contour plot of RMS error (ps) vs. β and γ for minimum-sized 16-input
TLL gate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.10 Modeled vs. simulated evaluation delay vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm LP bulk CMOS process
operating under typical conditions. . . . . . . . . . . . . . . . . . . . . . . 100
5.11 Modeled vs. simulated evaluation delay vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm GP bulk CMOS process
operating under typical conditions. . . . . . . . . . . . . . . . . . . . . . . 100
5.12 Average power dissipation vs. parameter sizing for minimum-sized
16-input TLL gate assuming an αL/αR combination of 1/0. . . . . . . . . . . 104
5.13 Average power dissipation vs. parameter sizing for minimum-sized
16-input TLL gate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.14 Average power dissipation vs. αL/αR combination for minimum-sized
16-input TLL gate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.15 Modeled vs. simulated power dissipation vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm LP bulk CMOS process
operating under typical conditions. . . . . . . . . . . . . . . . . . . . . . . 112
5.16 Modeled vs. simulated power dissipation vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm GP bulk CMOS process
operating under typical conditions. . . . . . . . . . . . . . . . . . . . . . . 113
5.17 Modeled vs. simulated noise margin vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm LP bulk CMOS process
operating under typical conditions. . . . . . . . . . . . . . . . . . . . . . . 121
5.18 Modeled vs. simulated noise margin vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm GP bulk CMOS process
operating under typical conditions. . . . . . . . . . . . . . . . . . . . . . . 122
6.1 Flip-flops and a portion of the preceding combinational logic are absorbed
into TLL gates through hybridization. . . . . . . . . . . . . . . . . . . . . 124
6.2 Poor absorption through hybridization due to non-threshold structures. . . . 124
6.3 3:2 counter block diagram and operation. . . . . . . . . . . . . . . . . . . . 127
6.4 Parallel 3:2 counter reducing three n-bit vectors into two n-bit vectors. . . . 127
6.5 32-bit partial product reduction tree implemented using 3:2 counters. . . . 128
6.6 7:3 counter block diagram and operation. . . . . . . . . . . . . . . . . . . . 128
6.7 CMOS implementation of a 7:3 counter using a network of 3:2 counters . . 130
6.8 Gate-level schematic of a clocked hybrid CMOS/TLL 7:3 counter. . . . . . 132
6.9 Block level schematic of the two-stage CMOS multiplier. . . . . . . . . . . 134
6.10 Block level schematic of the two-stage CMOS/TLL multiplier. . . . . . . . 135
6.11 Standard cell implementation of a TLL gate with 5 inputs in each input
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.12 Area vs. frequency comparison between placed and routed CMOS and
CMOS/TLL multiplier designs at typical design corner. . . . . . . . . . . . 139
6.13 Standard cell count vs. frequency comparison between placed and routed
CMOS and CMOS/TLL multiplier designs at typical design corner. . . . . . 140
6.14 Placed and routed layout of a CMOS/TLL multiplier design with high-
lighted clock tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.15 Total power vs. frequency comparison between placed and routed CMOS
and CMOS/TLL multiplier designs at typical design corner. . . . . . . . . . 142
6.16 Leakage power vs. frequency comparison between placed and routed CMOS
and CMOS/TLL multiplier designs at typical design corner. . . . . . . . . . 143
6.17 Block level architecture of complete test chip. . . . . . . . . . . . . . . . . 145
6.18 Block diagram of multiplier test architecture data sources. . . . . . . . . . . 147
6.19 Block diagram of multiplier test architecture output sources. . . . . . . . . 148
6.20 Block diagram of multiplier test architecture clock generator. . . . . . . . . 149
6.21 Layout of the complete test architecture with I/O ring. . . . . . . . . . . . . 150
6.22 Magnified and annotated photograph of a manufactured test die. . . . . . . 150
6.23 Bonding diagram of the multiplier test chip. . . . . . . . . . . . . . . . . . 151
6.24 Layout of the test PCB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.25 Photograph of the test PCB mounted with a packaged die. . . . . . . . . . . 152
6.26 Photograph of the complete test setup including PC, PCB, power supplies,
oscilloscope, and DMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.1 Instructions per cycle vs. number of instructions in the issue queue across
various floating point and integer benchmark processes (from [21]). . . . . 157
7.2 Operand storage of a single instruction in the instruction scoreboard. . . . . 159
7.3 Request logic generating ready signal for a single instruction. . . . . . . . . 160
7.4 8-bit CMOS sorting logic. . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.5 Full 32 instruction arbiter using 1 to 8-bit CMOS sorters. . . . . . . . . . . 163
7.6 TLL arbiter for a single instruction. . . . . . . . . . . . . . . . . . . . . . . 164
7.7 Update logic generating grant and one-hot encoded shift signals for a single
instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8 Block level schematic of two-stage CMOS issue queue. . . . . . . . . . . . 166
7.9 Block level schematic of two-stage CMOS/TLL issue queue design A, with
full hybridization of the arbiter logic. . . . . . . . . . . . . . . . . . . . . . 167
7.10 Block level schematic of two-stage CMOS/TLL issue queue design B, with
partial hybridization of the arbiter logic. . . . . . . . . . . . . . . . . . . . 167
7.11 Standard cell layout for extremely wide fan-in TLL gate. . . . . . . . . . . 168
7.12 Automatically placed and routed 689 MHz CMOS issue logic with high-
lighted clock tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.13 Automatically placed and routed 689 MHz CMOS/TLL issue logic with
highlighted clock tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Chapter 1
INTRODUCTION TO THRESHOLD LOGIC
An increasing demand for greater functionality, faster response time, and longer
battery life in embedded mobile applications has driven research in increased perfor-
mance and lower power dissipation. Meanwhile, the competitive marketplace forces
greater flexibility, reduced costs, and faster time to market, necessitating reduced area,
increased yield, and automated design [59]. Automated design entails the use of computer-aided
design tools to automatically perform low level synthesis and layout of a design
from a high level description. Through automated design, circuits can be produced
much more rapidly than via a custom approach; it is thus quite popular amongst com-
mercial producers of ICs, enabling new products to be developed at much lower costs
using a greatly reduced number of man hours.
Automated design operates by mapping the high-level description of a circuit
onto a network of standard cells. A standard cell implements a specific Boolean func-
tion; there are generally a number of cells implementing a single function in a standard
cell library, varying in size according to the drive strength provided. Automated designs
provide high performance at the cost of area and power dissipation, as fast designs
require high drive strength cells along the critical paths of the design. Assuming the
available real estate in all library cells is well utilized, the size and power dissipation
of a standard cell grow as drive strength increases. Assuming performance to be a con-
stant constraint across the design space, minimization of area and power dissipation are
both important design objectives, as these parameters correlate directly with production
costs and battery life.
1.1 Area
Area refers to the amount of silicon upon which a design is manufactured. Logic
gates in a design occupy area, as do the interconnect wires between them, the power
grid, well taps, heat sinks, I/O pads, and additional scribing and alignment structures.
Generally, it is beneficial to reduce area as much as possible. Reducing area has two
potential effects: it can reduce cost and/or improve total yield [15].
In manufacturing, there is a direct relationship between area and cost. In prac-
tice, terms vary from foundry to foundry, and the relationship between area and cost
is not always linear. It is, however, monotonic. The manufacturing costs for the ST
Microelectronics 65 nm LP process provided by Circuits Multi-Projets in Grenoble,
France, are given in Table 1.1.
Table 1.1: Circuits Multi-Projets manufacturing costs of ST Microelectronics 65 nm
LP process for 2010.
Area         Cost
1-5 mm²      9500 €/mm²
5+ mm²       47500 € + [Area (mm²) - 5] × 6500 €/mm²
The costs given in the table are for 25 dies. For example, if 25 dies of a design
requiring an area of 12 mm² are needed, the total cost would be 93,000 €. Reducing the area of the
design by 25% to 9 mm² would reduce the cost by 19,500 €, a savings of 21.0%.
Additionally, savings provided by the reduction in area can also be invested
towards a greater number of dies, improving total yield. Assuming the yield of the
design is 80%, roughly 1 in every 5 dies manufactured will fail. Thus out of the 25
dies produced, on average 20 can be expected to function properly. If the cost savings
from the previous example are applied towards the manufacture of 5 additional dies,
however, the expected number of working dies rises to 24, a 20% increase.
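The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes that the two price tiers of Table 1.1 apply to a single run of 25 dies and that yield is a fixed percentage of the dies produced.

```python
def cmp_cost_eur(area_mm2):
    """Cost in euros of one 25-die run, following the two tiers of Table 1.1."""
    if area_mm2 < 1:
        raise ValueError("Table 1.1 starts at 1 mm^2")
    if area_mm2 <= 5:
        return 9500 * area_mm2
    # Above 5 mm^2 the first 5 mm^2 cost 47500 and the remainder is billed at 6500 per mm^2.
    return 47500 + (area_mm2 - 5) * 6500


def expected_working_dies(dies, yield_percent):
    """Average number of functional dies for a given yield in percent."""
    return dies * yield_percent / 100


saving = cmp_cost_eur(12) - cmp_cost_eur(9)
print(saving)                         # 19500
print(expected_working_dies(25, 80))  # 20.0
print(expected_working_dies(30, 80))  # 24.0
```

Because the upper tier is linear in area, the saving from a given area reduction does not depend on the starting area as long as both designs exceed 5 mm².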
1.2 Power dissipation
Dynamic power dissipation refers to the power consumed as a capacitive node
switches from one logic state to another, from logic 0 to logic 1, for instance. If a
statically maintained capacitive node has a value of logic 0, there is a conducting path
from the node to ground, and no conducting path from the node to the supply voltage.
In order to transition from logic 0 to logic 1, the conducting path from the node to
ground must be disabled, and a conducting path from the node to the supply voltage
enabled. During this interval, a conductive path temporarily exists between the supply
voltage and ground, and dynamic power P is dissipated according to Equation 1.1.
\[ P = C V_{dd}^{2} \tag{1.1} \]
In the equation, C is the capacitance of the node, and Vdd is the supply volt-
age. Dynamic power is dissipated every time the node switches values, thus average
power over time increases linearly as the frequency of activity on the node is increased.
Switching events are not the only source of power dissipation in a circuit, however.
Static power dissipation refers to power consumed constantly over time in the
absence of switching events. If a capacitive node possesses a constant conducting path
to both the supply voltage and ground, static power P is dissipated according to Equa-
tion 1.2.
\[ P = I V_{dd} \tag{1.2} \]
In the equation, I is the current across the capacitance of the node, and Vdd is
the supply voltage. Leakage is a type of static power dissipation. Even if a transistor
is turned off, a small amount of current still flows across the terminals of the device,
thus even statically maintained capacitive nodes dissipate some static power [2]. Every
transistor possesses an “off” current Ioff which is drawn when the device is inactive; the
magnitude of Ioff is determined by the process. While relatively small for an individual
node, leakage power dissipation can become quite substantial over a large set of nodes.
While many circuit architectures exist which do not draw large static currents
across individual nodes, leakage power dissipation is an unavoidable byproduct of any
design. A great deal of research has been invested into reducing leakage [9, 43, 49]; this
can be done at the device level, the architectural level, and/or the gate level. Techniques
applied at each level are generally independent of one another and can all be applied, if
desired [35].
At the device level, leakage can be reduced by raising the threshold voltage Vt
of the transistors, thereby decreasing Ioff [69]. This is generally performed by increasing
the oxide thickness or level of doping in the device. Increasing Vt also reduces
the speed of the device, thus high performance processes tend to exhibit larger leakage
currents than processes geared towards lower performance, low power embedded
systems. Other device level techniques for reducing leakage include the use of different
materials for substrates and gate dielectrics which increase the Ion/Ioff ratio of
the transistor.
Architectural level techniques for reducing power include voltage scaling and
power gating, amongst others. Voltage scaling reduces leakage power dissipation by
reducing both I and Vdd [9]. As a consequence, the ratio of Vdd to Vt is reduced as well,
resulting in diminished performance. Voltage scaling also reduces the noise margin of
gates, thus there are limits to the amount of scaling that can be performed. Power gating
entails completely powering down (i.e., putting to sleep) parts of the design that are not
performing useful work [43]. Such components do not consume any static power since
Vdd is effectively at the same potential as ground; however, sleeping gates
cannot perform computation, either. Recovering from sleep mode can be an expensive
operation in terms of both the number of clock cycles required and power dissipated,
and thus only provides an advantage if the circuit can remain asleep for a significant
period of time.
At the gate level, relatively few techniques have been realized to reduce power
dissipation. Gates may be reduced in size to lower capacitance, and transistors can be
added in series to the pull-up and pull-down of a gate to reduce I [49]. Alternatively, if
the functions implemented by a network of gates could be implemented by a smaller,
more efficient primitive structure, the number of capacitive nodes (and leakage paths)
in the design would decrease, reducing both area and power dissipation. Threshold logic
has been proposed as a possible means of achieving this goal.
1.3 Threshold logic
Threshold functions are a proper subset of Boolean functions [48]. A function
y = f (x0,x1, ...,xn−1) is threshold if there exists a set of weights w0,w1, ...,wn−1 and a
threshold T such that the output y is that provided in Equation 1.3.
\[ y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n-1} w_i x_i \ge T, \\ 0 & \text{otherwise.} \end{cases} \tag{1.3} \]
The advantage of threshold logic rests in its ability to compute complex Boolean
functions very efficiently. In many cases, it is possible to replace a large multi-level
CMOS network with a single gate. For instance, the function f = a(b(c+d+e) + c(d+e)
+ de) + bcde can be implemented as a single threshold gate, as seen in Figure 1.1, whereas
an optimally synthesized version using traditional gate libraries requires three to five
levels of logic.
Figure 1.1: Logic absorption of the function a(b(c+d+e) + c(d+e) + de) + bcde into a
single threshold gate with inputs {a, b, c, d, e}, weights {2, 1, 1, 1, 1}, and threshold 4.
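The equivalence stated in the caption can be checked exhaustively. The following is a minimal sketch with hypothetical function names: it evaluates Equation 1.3 with the weights and threshold from Figure 1.1 and compares the result against the Boolean expression for all 32 input combinations.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Equation 1.3: output 1 when the weighted sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def boolean_form(a, b, c, d, e):
    """f = a(b(c+d+e) + c(d+e) + de) + bcde, written with and/or."""
    pairs_of_bcde = (b and (c or d or e)) or (c and (d or e)) or (d and e)
    return int((a and pairs_of_bcde) or (b and c and d and e))


WEIGHTS = (2, 1, 1, 1, 1)   # inputs a, b, c, d, e
THRESHOLD = 4

# Collect every input combination on which the two descriptions disagree.
mismatches = [
    bits
    for bits in product((0, 1), repeat=5)
    if threshold_gate(bits, WEIGHTS, THRESHOLD) != boolean_form(*bits)
]
print(mismatches)  # []
```

The empty list reflects the structure of the function: with a = 1 any two of b, c, d, e reach the threshold, and with a = 0 all four are needed.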
This “absorption” of static logic has two effects. The first is a reduction in gate
count, which may result in reductions in power consumption and/or gate area. The
second is a potential reduction in critical path length, which translates to improvements
in performance and/or reductions in power and area due to the relaxation of timing
constraints. Robust and efficient implementations of threshold logic promise faster,
smaller, and lower power designs than those currently achievable. For a given fan-in n,
only a fraction of Boolean functions with n inputs will be threshold functions. This fraction
is large for low values of n, decreasing quickly as n is increased, as shown in Table 1.2.
Table 1.2: Ratio of threshold functions to all Boolean functions for fan-in from 1-5.
Fan-in # Threshold functions # Boolean functions Ratio
1 4 4 1.00
2 12 16 0.75
3 42 256 0.164
4 306 65,536 0.00466
5 2,594 4,294,967,296 0.000000603
The number of threshold functions as a function of fan-in was computed from
the enumeration of all threshold functions with fan-in up to 5 provided in [48]. As
the ratio of threshold functions versus Boolean functions decreases, so too does the
probability that an arbitrary function of n inputs can be replaced by a single threshold
function. Boolean functions implemented using static CMOS standard cells are gener-
ally limited to a maximum fan-in of 4-6. Functions commonly implemented in a static
CMOS standard cell library include AND and OR operations, which are themselves
threshold functions. While the probability that an arbitrary function can be fully real-
ized using a threshold logic gate decreases as fan-in increases, threshold logic functions
can be used to reduce the gate count of a multi-level static CMOS logic network.
Threshold logic has a long and storied history, having been the subject of re-
search since its inception in the early 1960s [6]. Many advances have been made
in identification of threshold logic functions, as well as synthesis and verification of
threshold logic networks [28, 72, 25, 63]. In addition, many applications have been
identified as being particularly well-suited to threshold logic computation, such as neu-
ral networks and modeling of gene regulation [58, 27]. The widespread adoption of
threshold logic in practice has been slow, however, largely due to a lack of efficient
physical implementations; compression of multi-level primitive Boolean gate networks
into threshold gates is not possible if a primitive threshold logic gate that provides
superior area, performance, and/or power dissipation does not exist.
Standard cells are selected, placed, and routed automatically, and thus do not receive
as much careful attention as would be expected in a manually placed custom design. As
a result, gates selected and placed must exhibit a higher degree of robustness in the
presence of noise and process variations. Standard cells are typically implemented as
static CMOS logic gates, which unlike many other logic families are extremely reliable
across widely varying conditions in supply voltage and temperature, as well as the
dopant levels and physical dimensions of the transistors composing the gate. In addition
to area and power efficiency, a standard cell implementation of a threshold logic gate
must be reliable, as well.
1.4 Important properties of threshold logic functions
Threshold logic functions exhibit a number of mathematical properties which
become important in the development of physical implementations. The first prop-
erty is that any threshold function can be expressed such that all weights w0, w1, ... ,
wn−1 and the threshold value T are positive values. This is important to many physical
implementations of threshold logic, as weights and threshold values in a physical im-
plementation may be represented by physical parameters such as transistor gate widths,
which cannot assume negative values.
Consider as an example a threshold function defined by the inequality given in
Equation 1.4.
\[ w_0 x_0 + \dots - w_i x_i + \dots + w_{n-1} x_{n-1} \ge T \tag{1.4} \]
In the inequality, the input xi is multiplied by negative weight −wi. If the input
xi is replaced by its complement x̄i and the quantity wi is added to the threshold value T, the weight of x̄i
becomes positive and the inequality is preserved, as shown by Equations 1.5-1.8.
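\[ w_0 x_0 + \dots + w_i - w_i x_i + \dots + w_{n-1} x_{n-1} \ge T + w_i \tag{1.5} \]
\[ w_0 x_0 + \dots + w_i (1 - x_i) + \dots + w_{n-1} x_{n-1} \ge T + w_i \tag{1.6} \]
\[ 1 - x_i = \bar{x}_i \tag{1.7} \]
\[ w_0 x_0 + \dots + w_i \bar{x}_i + \dots + w_{n-1} x_{n-1} \ge T + w_i \tag{1.8} \]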
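The substitution in Equations 1.5-1.8 can be sketched in a few lines. The helper names and the example weight vector {2, -1, 1; 1} are hypothetical and chosen only for illustration; the sketch flips the sign of each negative weight, adds its magnitude to the threshold, and records which inputs must be complemented.

```python
from itertools import product


def evaluate(weights, threshold, inputs):
    """Equation 1.3 for one input combination."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def make_weights_positive(weights, threshold):
    """Return (positive weights, adjusted threshold, flags of inputs to complement)."""
    inverted = [w < 0 for w in weights]
    # Each negative weight -w_i contributes +w_i to the threshold (Equation 1.5).
    new_threshold = threshold + sum(-w for w in weights if w < 0)
    return [abs(w) for w in weights], new_threshold, inverted


weights, threshold = [2, -1, 1], 1          # hypothetical example function
pos_w, pos_t, inverted = make_weights_positive(weights, threshold)

for x in product((0, 1), repeat=len(weights)):
    # Feed the complement of every input whose weight changed sign (Equation 1.7).
    x_sub = [1 - xi if inv else xi for xi, inv in zip(x, inverted)]
    assert evaluate(weights, threshold, x) == evaluate(pos_w, pos_t, x_sub)

print(pos_w, pos_t)  # [2, 1, 1] 2
```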
The second property is that any threshold function can be expressed such that
all weights w0, w1, ... , wn−1 and the threshold value T are integer values. Physical
parameters such as transistor gate widths used to represent weights and thresholds in a
physical implementation of a threshold logic gate in practice assume a real value, but
are designed using a discrete range of values. In a commercial 65 nm LP bulk CMOS
process, for instance, the gate width of a transistor has a minimum value of 120 nm and
is incremented by discrete intervals of 5 nm.
In any threshold function, it is generally possible for the weights and the thresh-
old value of the function to vary by some amount without altering the function. Con-
sider as an example the threshold function {2.5, 1.32, 1.47; 2.1} implementing the
Boolean function y = x0+ x1x2, the truth table for which is featured in Table 1.3.
Assuming that all weights and the threshold value are fixed save for wi (for some
0 ≤ i ≤ n− 1), there is a range of values wi may assume without altering the onset or
offset of the function. The weight w2 of input x2, for instance, can vary between 0.78
and 2.09 while still implementing the same Boolean function, as shown in Table 1.4.
Note that the range of values across which w2 can vary is substantial, but this as-
sumes that all other weights and the threshold value remain constant. Multiple weights
and thresholds may vary without disrupting the operation of the function, however
there is often a correlation between the amount of allowable variance in a weight and
Table 1.3: Truth table for the threshold function {2.5, 1.32, 1.47; 2.1}.
x0 x1 x2 Σwixi y
0 0 0 0 0
0 0 1 1.47 0
0 1 0 1.32 0
0 1 1 2.79 1
1 0 0 2.5 1
1 0 1 3.97 1
1 1 0 3.82 1
1 1 1 5.29 1
Table 1.4: Truth table for the threshold function {2.5, 1.32, w2; 2.1}, with 0.78≤ w2 ≤
2.09.
x0 x1 x2 Σwixi y
0 0 0 0 0
0 0 1 0.78-2.09 0
0 1 0 1.32 0
0 1 1 2.1-3.41 1
1 0 0 2.5 1
1 0 1 3.28-4.59 1
1 1 0 3.82 1
1 1 1 4.6-5.91 1
the variance in all other weights and the threshold value. Additionally, it is also possi-
ble to multiply all weights and the threshold value by the same value without altering
the function, as shown in Table 1.5, where all weights and the threshold value of the
function have been multiplied by a factor of 10.
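The permissible range of w2 and the effect of uniform scaling can be confirmed by brute force. The sketch below uses hypothetical helper names and represents all values in integer hundredths, an assumption made here only to avoid floating-point rounding at the boundaries of the range.

```python
from itertools import product


def target(x0, x1, x2):
    """The Boolean function y = x0 + x1x2."""
    return int(x0 or (x1 and x2))


def implements_target(weights, threshold):
    """True if the weight-threshold vector matches the target on all 8 inputs."""
    return all(
        int(sum(w * x for w, x in zip(weights, bits)) >= threshold) == target(*bits)
        for bits in product((0, 1), repeat=3)
    )


# {2.5, 1.32, w2; 2.1} in hundredths, sweeping w2 from 0.00 to 4.00.
valid = [w2 for w2 in range(0, 401) if implements_target((250, 132, w2), 210)]
print(valid[0] / 100, valid[-1] / 100)  # 0.78 2.09

# Multiplying every weight and the threshold by 10 leaves the function unchanged.
print(implements_target((2500, 1320, 1470), 2100))  # True
```

The lower bound comes from the row x1 = x2 = 1, which must still reach the threshold, and the upper bound from the row where only x2 is set, which must stay below it.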
Multiplying a weight or threshold value by a factor of 10 decreases the ratio
of the fractional component of the value to the integer component of the value. If
the ratio is small enough, it is possible to adjust all weights and the threshold value
simultaneously to a degree such that all values are integers only. The function {2.5,
1.32, 1.47; 2.1}, for instance, can be represented by the all-integer weight-threshold
vector {2, 1, 1; 2}. Algebraic techniques exist for determining the minimum weight
Table 1.5: Truth table for the threshold function {25, 13.2, 14.7; 21}.
x0 x1 x2 Σwixi y
0 0 0 0 0
0 0 1 14.7 0
0 1 0 13.2 0
0 1 1 27.9 1
1 0 0 25 1
1 0 1 39.7 1
1 1 0 38.2 1
1 1 1 52.9 1
representation of a threshold logic function [10], however these are not required to
produce a physical implementation, and in practice non-minimum weight realizations
of threshold functions may be preferable in some cases.
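For a function as small as y = x0 + x1x2, an integer realization can also be found by exhaustive search. The sketch below is not the algebraic technique of [10]; it is a brute-force stand-in with hypothetical names that tries every positive integer weight vector up to an assumed bound and keeps the one with the smallest sum of weights and threshold.

```python
from itertools import product


def realizes(weights, threshold, target):
    """True if the weight-threshold vector matches the target on every input."""
    return all(
        int(sum(w * x for w, x in zip(weights, bits)) >= threshold) == target(*bits)
        for bits in product((0, 1), repeat=len(weights))
    )


def target(x0, x1, x2):
    return int(x0 or (x1 and x2))


def smallest_integer_realization(target, n_inputs, max_weight):
    """Search positive integer weights up to max_weight (an assumed bound)."""
    candidates = (
        (weights, threshold)
        for weights in product(range(1, max_weight + 1), repeat=n_inputs)
        for threshold in range(1, sum(weights) + 1)
        if realizes(weights, threshold, target)
    )
    # Rank by total weight plus threshold; ties are broken by tuple order.
    return min(candidates, key=lambda c: (sum(c[0]) + c[1], c), default=None)


print(smallest_integer_realization(target, 3, 3))  # ((2, 1, 1), 2)
```

The search space grows exponentially with the number of inputs, so this approach is only practical for illustrating small functions.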
Chapter 2
PHYSICAL IMPLEMENTATIONS OF THRESHOLD LOGIC
Threshold logic has been researched for more than half a century as a more
compact and efficient means of performing Boolean computation. In order to achieve
any reduction from compaction of complex Boolean logic into threshold logic gates,
however, there must exist a physical implementation of a threshold logic element that
is more efficient than what is being replaced. The earliest threshold logic gates predate
the integrated circuit [48]; an implementation using resistors and a vacuum tube is
shown in Figure 2.1. The binary inputs are provided via the voltages a, b, c, d, and
e, which assume binary values; weights are provided via the adjacent resistors and are
proportional to device conductance. The threshold value is determined by the voltage
Vbias.
Figure 2.1: Resistor-vacuum tube implementation of the threshold function {21111;4}.
Other early implementations of threshold gates include those constructed using
toroidal magnetic cores and paired tunnel-diodes [57]. While interesting as academic
curiosities, implementations that use such archaic components are of no use in modern
circuit design. To produce a VLSI circuit that takes advantage of threshold logic, one
must rely solely on nanoscale components, such as MOSFETs, to implement threshold
functions. Throughout the years, a myriad of such implementations have been pro-
posed, so many that exhaustive enumeration and analysis of every solution would be
an extremely daunting task. However, nearly every implementation can be classified as
one of four main types of solutions. These include:
• Conventional solutions: implementations using conventional Boolean pull-up
and pull-down networks and/or transmission gates.
• Capacitive solutions: implementations that perform linear comparisons between
quantities of charge.
• Conductance-based solutions: implementations that perform linear comparisons
between quantities of conductance.
• Non-CMOS solutions: implementations that compute threshold logic using the
unique current-voltage characteristics of emerging nano-electric devices.
The following sections address each type of solution in detail, demonstrating the
advantages and disadvantages of each with a small number of representative samples.
2.1 Conventional implementations
Given the overwhelming popularity and ease of use of static CMOS, it is only
natural that many implementations of threshold logic exist using familiar gate design
techniques [6]. All threshold logic functions are Boolean functions, and as such may be
implemented as a multi-level network of simple Boolean gates. Often, a single thresh-
old logic function can be implemented using one of many different networks, varying
in both gate count and depth. An example of a multi-level network implementing the
threshold function {21111;4} is shown in Figure 2.2.
The example in the figure uses 11 gates arranged in 4 levels, although innu-
merable variations are possible. The more gates a network utilizes, the more power
it will consume, whereas the depth of the network corresponds directly to the latency
Figure 2.2: Multi-level CMOS implementation of the threshold function {21111;4}.
of the network. Often gate count and depth are at odds, and it is up to the designer to
optimize the power versus latency of the function. Logical effort is also an important
consideration in the determination of the structure of the network; a high input pin
capacitance will not directly affect the delay or power consumption of the network itself,
but will be more difficult for preceding logic to drive, increasing the delay and/or power
dissipation of those elements. While potentially large, slow, and high in power dissi-
pation, multi-level CMOS networks are highly reliable, and perfectly compatible with
any automated, standard cell-based design flow.
In addition to multi-level networks, threshold functions may also be imple-
mented using a single complex static CMOS element [29]. Rather than decompose
the function into smaller Boolean components, the entire function is represented by a
single pull-up and pull-down network, as shown in Figure 2.3.
Although technically a single gate, the number of devices required may be very
large. In addition to the obvious impact on gate area, gate delay and power dissipation
also suffer due to the extremely large number of capacitances present in the gate. While
such a gate is unlikely to be a part of most commercial standard cell libraries, the
Figure 2.3: Single complex static CMOS implementation of the threshold function
{21111;4}.
single gate is still highly reliable, assuming that prohibitively tall pull-up or pull-down
transistor stacks are not used.
In lieu of purely static CMOS, dynamic logic styles such as domino logic can be
used to eliminate the pull-up network, as shown in Figure 2.4. The elimination results
in a much lower input capacitance compared with the single static CMOS gate.
For some threshold logic functions, particularly those with low values of T
(such as an OR function of arbitrary width), this can be a very efficient implemen-
tation. However, for complex functions with higher values of T , the benefits of this
approach diminish quickly. Tall pull-down stacks are still a potential issue, more so
if footed domino gates are used. Additionally, the dynamic operation of the gate re-
quires additional scrutiny to ensure proper operation in the presence of both static and
dynamic noise. When the clock signal is high and the input configuration is such that
no path exists between the node feeding to the final inverter and ground, the node is
left floating. In this state, the node is supposed to retain a value of logic high until
Figure 2.4: Single complex domino CMOS implementation of the threshold function
{21111;4}.
the next pre-charge event (clock low), and is extremely vulnerable to noise. While the
keeper device provides a weak feedback loop to help the node maintain its value while
floating, excessive noise or process variations can still cause the gate to fail despite this
additional safety measure.
In addition to the pull-up/pull-down networks typically employed in static and
dynamic CMOS gates, transmission gate networks can also be used to implement arbi-
trary Boolean functions [55]. These networks implement threshold functions by parti-
tioning nodes that represent each possible weighted sum, as shown in Figure 2.5. While
much more efficient in terms of device count than the previously mentioned implemen-
tations, the latency increases linearly with the number of function inputs.
A noteworthy feature of all of the previously mentioned implementations is that
they do not perform a direct linear computation of the threshold function. That is, they
do not determine the output of the function by comparing a weighted sum and threshold
value. Rather, the onset of the function is replicated by its Boolean equivalent, and then
computed using conventional techniques. As a result, the size of the gate will typically
Figure 2.5: Steering logic implementation of the threshold function {21111;4}.
increase non-linearly as threshold function complexity increases, as shown in Table 2.1.
Table 2.1: Worst case device count vs. number of inputs for multi-level CMOS net-
work, single complex CMOS gate, single complex domino gate, and transmission gate
steering logic.
Inputs Multi-level Single CMOS Domino Steering logic
3 18 12 10 16
5 92 40 24 40
7 374 140 74 72
9 2100 504 256 112
While perfectly acceptable for small functions, conventional implementations
are extremely impractical for large or complex threshold functions. An efficient im-
plementation of such functions necessitates a circuit architecture with a linear scaling
factor.
2.2 Capacitive implementations
Many implementations of threshold logic attempt to overcome the complexity
scaling issue posed by conventional implementations by performing an actual linear
comparison rather than simulating one. To accomplish such a feat, the operands of the
weighted sum must be represented by some physical quantity, such as charge.
Capacitive implementations refer to a large family of threshold logic implemen-
tations that compute logic values by representing weights and the threshold as quanti-
ties of charge [60]. One example of such a gate is the static neuMOS inverter, shown
in Figure 2.6.
The two transistors of the neuMOS inverter employ a floating gate coupled with
a network of input controlled capacitances. The floating gate assumes a voltage equiv-
alent to the sum of the charge on the input capacitances. In the example in Figure 2.6,
the voltage Vin on the floating gate is given by Equation 2.1.
\[ V_{in} = V_{dd} \, \frac{2V_a + V_b + V_c + V_d + V_e}{7} \tag{2.1} \]
Figure 2.6: Static neuMOS inverter implementation of the threshold function
{21111;4}.
If this voltage exceeds the threshold required to toggle the inverter (Vdd/2 as-
suming the inverter is perfectly balanced), the output will register a logic 1. Otherwise,
the output will register a logic 0. While purely combinational, many input combinations
will result in a voltage that draws static current across the floating gate inverter. If sup-
ported by the process being used, this static current can be mitigated using transistors
with high threshold voltages, although this will diminish the gate’s speed of operation.
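The comparison performed by the neuMOS inverter can be illustrated numerically. The sketch below rests on several assumptions that are not stated explicitly in the text: Va through Ve are treated as logic values (0 or 1), the denominator of 7 in Equation 2.1 is taken to be the six weighted input units plus one unit capacitance tied to logic 0, the supply is set to 1.2 V, and the inverter is assumed to switch at exactly Vdd/2. All names are hypothetical.

```python
from itertools import product

VDD = 1.2                      # assumed supply voltage in volts
WEIGHTS = (2, 1, 1, 1, 1)      # capacitance units for inputs a, b, c, d, e
GROUNDED_UNITS = 1             # assumed unit capacitance tied to logic 0
TOTAL_UNITS = sum(WEIGHTS) + GROUNDED_UNITS   # 7, the denominator of Equation 2.1


def floating_gate_voltage(bits, vdd=VDD):
    """Equation 2.1 with logic-valued inputs."""
    weighted_sum = sum(w * x for w, x in zip(WEIGHTS, bits))
    return vdd * weighted_sum / TOTAL_UNITS


def gate_output(bits, vdd=VDD):
    """Logic 1 when the floating gate exceeds the assumed Vdd/2 switching point."""
    return int(floating_gate_voltage(bits, vdd) > vdd / 2)


# The voltage comparison agrees with the threshold function {21111;4} everywhere.
for bits in product((0, 1), repeat=5):
    weighted_sum = sum(w * x for w, x in zip(WEIGHTS, bits))
    assert gate_output(bits) == int(weighted_sum >= 4)

for s in range(7):
    print(s, round(VDD * s / TOTAL_UNITS, 3))
```

Under these assumptions the switching point corresponds to a weighted sum of 3.5, so sums of 3 and 4 sit roughly Vdd/14 below and above it. This spacing gives a sense of how little voltage margin separates the two output states.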
Another technique for reducing the static power dissipated by the gate is to
use a clock signal to periodically deactivate the DC current paths [33, 13, 12]. An
implementation demonstrating this technique known as STTL (or self-timed threshold
logic) is shown in Figure 2.7.
As observed in the figure, all paths between the supply voltage and ground are
completely cut off when the signal clk is a logic 1. In this state, both nodes N1 and N2
are discharged to logic 0. As the signal clk falls to a logic 0, one of either N1 or N2 will
charge to a logic 1, depending on which of the two current mirrors in the gate provides
the greatest magnitude of current. The current provided by the leftmost current mirror
is a function of the voltage on the floating gate of the neuMOS transistor, which is itself
a function of the sum of charges on the input capacitances; the current provided by the
rightmost current mirror is a function of the reference voltage ref. Note that while clk
remains at logic 0, both current mirrors will dissipate static power, thus the static power
dissipation of the gate is reduced but not eliminated entirely.
Figure 2.7: Clocked capacitive implementation (STTL) of the threshold function
{21111;4}.
2.3 Conductance-based implementations
Charge is not the only physical quantity that can be used to represent the operands
of a weighted sum. Current or conductance may be used in a similar fashion, and many
threshold logic implementations exist that use these quantities to perform linear com-
parisons [40, 65, 24].
The simplest example of such an approach is the gate shown in Figure 2.8
using output wired inverters, essentially a pseudo-NMOS circuit. Each inverter pulls
up if its input is low, and down if its input is high. The strength with which it pulls
is proportional to the widths of its devices. Inverters may pull in both directions, but
assuming the pull is stronger in one direction than the other, the shared node will adopt
a stable operating point that is closer to either the supply voltage or ground, and the
output will settle to either output high or output low.
While area efficient and extremely fast, there are a number of drawbacks to
such an approach. If some inverters pull up while others pull down, a connection exists
between the supply voltage and ground, resulting in a large DC current similar to that
Figure 2.8: Output wired inverter implementation of the threshold function {21111;4}.
observed in capacitive implementations of threshold logic. Table 2.2 shows the output
voltage and static power dissipation of an output wired inverter gate implementing the
threshold function {21111;4} for all possible sums of the weighted inputs simulated
using a 65 nm LP bulk CMOS process under typical operating conditions (TT, 1.2V,
25C).
Table 2.2: Output voltage and static power dissipation of the output wired inverter
implementation of the threshold function {21111;4}
Σwixi    Output (V)    Static power (µW)
0        0             0
1        0             99
2        0             192
3        0.01          271
4        1.19          284
5        1.2           226
6        1.2           154
As the table demonstrates, the gate properly implements the threshold func-
tion, but exhibits static power dissipation of up to 284 µW, approximately 5 orders of
magnitude greater than the power dissipation due to leakage under the same conditions.
Considering the importance of lowering static power dissipation in purely static CMOS
ICs, clearly this is not a viable alternative to static CMOS, no matter how much latency
is improved.
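The logical behavior confirmed by Table 2.2 can be restated as a minimal Python sketch. It assumes that the notation {21111;4} denotes input weights (2, 1, 1, 1, 1) for inputs a through e and a threshold of 4, with the output high once the weighted sum reaches the threshold; this reading is consistent with the output switching between sums 3 and 4 in the table. The function and variable names are illustrative only.

```python
from itertools import product

WEIGHTS = (2, 1, 1, 1, 1)  # assumed reading of {21111;4}: weights of inputs a..e
THRESHOLD = 4              # assumed reading of {21111;4}: threshold value


def threshold_gate(inputs, weights=WEIGHTS, threshold=THRESHOLD):
    """Return 1 when the weighted sum of the binary inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


# Group all 32 input combinations by weighted sum and record the outputs seen.
outputs_by_sum = {}
for inputs in product((0, 1), repeat=len(WEIGHTS)):
    s = sum(w * x for w, x in zip(WEIGHTS, inputs))
    outputs_by_sum.setdefault(s, set()).add(threshold_gate(inputs))

for s in sorted(outputs_by_sum):
    print(s, sorted(outputs_by_sum[s]))
```

Each weighted sum maps to a single logic value, which is why the table can be indexed by the sum alone rather than by all 32 input combinations.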
Numerous modifications to the output wired inverter threshold logic gate have
been proposed which have attempted to reduce static power dissipation [19]. Many
solutions use a clock signal to periodically shut off static current paths. While such
gates no longer operate as purely combinational logic elements, they do significantly
reduce power dissipation. One example of such a gate is shown in Figure 2.9.
Figure 2.9: Clocked conductive implementation of the threshold function {21111;4}.
When the clock signal is low, nodes N1 and N2 of the gate are both pre-charged
high. As the clock signal rises, the paths from nodes N1 and N2 to the supply voltage
are cut off and paths to ground are established. Both nodes N1 and N2 thus begin to
discharge at different rates; the node that discharges fastest is determined by the in-
put configuration and sizes of the relevant transistors. The first node to discharge will
also pull up on the opposite node, preventing the opposite node from discharging com-
pletely. While much of the static power dissipation observed in purely combinational
conductive threshold logic gates is eliminated by this solution, a static current path
still exists between the supply voltage and ground when clock is high and the sum of
weighted inputs is greater than the threshold.
Differential threshold logic is a subset of current and conductance-based threshold
logic which relies upon a clocked differential comparison between two banks of
configurable conductance values [44, 4, 62, 50]. Capable of high performance and free
of the static power dissipation issues that plague other conductance-based implemen-
tations, differential implementations are the most promising CMOS-based realization
of threshold functions. These implementations are the basis of the threshold logic gate
architecture proposed in this work, and will be discussed in much greater detail in the
following chapter.
2.4 Non-CMOS implementations
In addition to the numerous implementations of threshold logic constructed
from the MOSFET devices common in modern ICs, implementations have also been
proposed that take advantage of the unique properties of newer nano-electronic de-
vices [5]. The most prominent example of such a device is the resonant tunneling
diode, or RTD [64]. The RTD is a two terminal device that exhibits a negative differen-
tial resistance, or NDR, as the potential applied across it is increased. In a conventional
MOSFET, the drain-source current increases monotonically as a function of the drain-
source voltage. In an RTD, however, as the potential across the device is increased the
current rises to a “peak” and falls to a “valley” before rising again. The current-voltage
characteristics of the RTD are displayed in Figure 2.10.
Figure 2.10: Annotated current-voltage characteristics of the resonant tunneling diode
(RTD).
The monostable-bistable logic element, or MOBILE, is a clocked logic fam-
ily that uses a combination of RTDs and n-type FET devices to compute a threshold
function [46, 1]. The device level schematic of the MOBILE is shown in Figure 2.11.
Figure 2.11: MOBILE implementation of the threshold function {21111;4}.
In a typical FET-based inverter, the monotonic I-V curves of the driver and load
devices intersect at a single point, ensuring a single possible output, as demonstrated
in Figure 2.12. The non-monotonic I-V characteristics of resonant tunneling diodes,
however, establish multiple stable operating points, as Figure 2.13 demonstrates. When
the clock signal controlling the MOBILE is low, there is no potential drop across the
driver and load, resulting in an output of logic 0. As the clock signal rises, the output
settles to one of two stable states, dependent upon which RTD has the highest peak
current. If the load device has the highest peak current, the output settles to logic 0; if
the driver device has the highest peak current, the output settles to logic 1. While the
I-V curves of the two devices intersect at three points on the plot, the middle point is
meta-stable, and the output will not stabilize to that value unless forced by means apart
from the ordinary operation of the gate. The peak current of the driver RTD in
the MOBILE is augmented by the number of active RTDs in parallel with it. While an
RTD is a two terminal device, a FET placed in series provides the means to turn an
RTD on and off by toggling the state of the transistor. The saturation current of the
FET must exceed the peak current of the RTD to ensure that the I-V curve of the RTD
being “activated” is preserved.
Figure 2.12: Operation of a static CMOS inverter.
Figure 2.13: Operation of the MOBILE during evaluation (clk = logic 1). Panels:
Ipeak,driver > Ipeak,load and Ipeak,driver < Ipeak,load.
MOBILEs constructed from RTDs are extremely fast, and threshold logic net-
works constructed from such elements have been reported operating at frequencies up
to 40 GHz [37]. However, at present the maturity of resonant tunneling diodes is such
that the disadvantages accompanying their use are prohibitive. The RTD’s property of
negative differential resistance, critical for MOBILE operation, is extremely sensitive
to operating temperature. As temperature increases, the ratio between the peak current
and valley current of the current-voltage response of the device diminishes until NDR
disappears completely. While RTDs have been demonstrated in Silicon and Silicon-
Germanium, such devices have demonstrated NDR only at extremely low temperatures
(exhibiting a peak-to-valley ratio of 1.5 at a temperature of 77 K); at room temperature a
circuit constructed from such devices becomes inoperable. To create a MOBILE that is
operable at room temperature, the RTDs can only be fabricated in a Gallium Arsenide-
based substrate. Gallium Arsenide, however, is much more expensive to produce than
traditional Silicon, making it impractical for the vast majority of IC applications.
Chapter 3
DIFFERENTIAL THRESHOLD LOGIC
Many conductive threshold logic implementations, while fast and easy to im-
plement using standard CMOS processes, are plagued by the issue of static power dis-
sipation. While clocking mechanisms have been employed to eliminate static power
during one phase of the clock or for certain combinations of inputs, one cannot expect
widespread application of threshold logic until all sources of static power dissipation
are eliminated from the gate. One subfamily of conductive logic gates that achieves
this goal is differential threshold logic.
3.1 Common design and principles of operation
Differential threshold logic refers to a large family of implementations sharing
the same essential architectural components, as well as common principles of opera-
tion. As its name suggests, differential threshold logic elements each contain a differ-
ential amplifier.
Differential amplifiers are widely used in SRAM designs to quickly distinguish
between and amplify the potentials of two input signals [71, 39, 45], as shown in Figure 3.1.
In the design shown in the figure, the inputs bit_line and its complement are initially
pre-charged to a logic 1. During a memory read, one of the two inputs will slowly
discharge while the other remains at logic 1. The differential amplifier senses the small
difference in impedance between the devices M1 and M2 produced by the discrepancy
in potential between bit_line and its complement and provides full signal swing amplification
of this difference at the output.
Integrated within each branch of the differential amplifier in a differential thresh-
old logic gate is a bank of parallel transistors referred to henceforth as an input network,
shown in Figure 3.2. This input network fulfills the same role as the input transistors
in a SRAM sense amplifier, except that the inputs to each device in the network are
full signal values and the difference in impedance between the two input networks is
Figure 3.1: Generic differential sense amplifier employed in SRAM column design.
provided by the total parallel impedance of each as determined by the number of active
inputs in each network.
Figure 3.2: Generic differential sense amplifier employed in a differential threshold logic element.
Differential threshold logic gates operate in two phases, referred to henceforth
as reset and evaluation. In the reset phase of operation, both branches of the differential
amplifier are pre-charged to prepare the circuit for evaluation. After reset completes,
both outputs Vout and its complement register a logic 0. In the evaluation phase of operation, both
branches of the differential amplifier begin to discharge at different rates determined
by the configuration of input signals in the input networks. Depending on the exact combination
of the inputs, the outputs of the gate will settle to one of two bistable states
(Vout = 0 with its complement at 1, or Vout = 1 with its complement at 0). After evaluation
is complete, the gate is effectively latched; no change in the input signals will produce
any change in the output(s) of the gate until the next edge of the clock signal.
Many implementations of differential threshold logic have been proposed, each
with specific modifications included to improve area, delay, and/or power dissipation.
Of these, four specific implementations were chosen for analysis as representatives of
the logic family. These include:
• Cross-coupled inverters with asymmetrical loads (CIAL)
• Latch-type CMOS threshold logic (LCTL)
• Single-input current sensing differential logic (SCSDL)
• Differential current-switch threshold logic (DCSTL)
The unique characteristics of each of these implementations are described in the
following sections of this chapter.
3.2 Cross-coupled inverters with asymmetrical loads (CIAL)
Proposed in 1995, cross-coupled inverters with asymmetrical loads (CIAL) is
one of the earliest differential implementations of threshold logic [44]. An annotated
device-level schematic of the CIAL gate is shown in Figure 3.3.
Like all differential threshold logic implementations, the CIAL element consists
of a differential amplifier (M1−8) integrated with two networks of parallel input
transistors (M9−M10). When the signal clk is a logic 0, the gate is in the reset phase of
operation. Transistor M7 is inactive, thus none of the internal nodes of the gate possess
a conducting path to ground. Transistors M1 and M4 are active, and, since any signal
assignment provided to the CIAL element assumes that at least one of the input transistors
comprising M9 and/or M10 is active at all times, nodes N1−4 are pre-charged to
logic 1.
Figure 3.3: Device-level schematic of the CIAL gate.
When the signal clk rises to a logic 1, the gate begins the evaluation phase
of operation, shown in Figure 3.4. Transistors M1 and M4 become inactive, thus the
accumulated charge on nodes N3 and N4 becomes floating charge. At the same time,
M7 becomes active, causing nodes N1 and N2 to begin discharging to ground through
devices M5−7. Before nodes N1 and N2 can completely discharge, however, they must
also each drain off the floating charges on N3 and N4, respectively. The discharge
rates τN3 and τN4 of nodes N3 and N4 according to the Elmore delay model are given in
Equations 3.1 and 3.2.
τN3 =C3(Z9+Z5+Z7)+C1(Z5+Z7)+C5(Z7) (3.1)
τN4 =C4(Z10+Z6+Z7)+C2(Z6+Z7)+C5(Z7) (3.2)
Figure 3.4: Evaluation waveforms of the CIAL gate, assuming αL/αR toggling between
5/4 and 4/5.
In the equations, Ci corresponds to the capacitance of node Ni, and Zi corre-
sponds to the impedance provided by active transistor Mi. It should be noted that
capacitances C1 and C2 are equivalent by design, as are C3 and C4. The physical di-
mensions, and thus impedances, of devices M5 and M6 are also equivalent. As a result,
a large portion of both discharge delays are equivalent, as Equation 3.3 demonstrates.
τEQ =C3(Z5+Z7)+C1(Z5+Z7)+C5(Z7)
=C4(Z6+Z7)+C2(Z6+Z7)+C5(Z7)
(3.3)
The difference ∆(τN3,τN4) between the rates at which N3 and N4 are able to
discharge is thus exclusively determined by the number of active inputs in the input
networks M9 and M10, respectively, as shown in Equations 3.4, 3.5, and 3.6.
τN3 =C3(Z9)+ τEQ (3.4)
τN4 =C4(Z10)+ τEQ (3.5)
∆(τN3,τN4) =C4(Z10)−C3(Z9) (3.6)
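As a numerical illustration of Equations 3.1 through 3.6, the following Python sketch evaluates both Elmore delays and checks that their difference reduces to C4(Z10) − C3(Z9). The unit capacitances and impedances, and the choice to model each input network as a unit impedance divided by its number of active inputs, are assumptions made only for illustration; the function name is hypothetical.

```python
import math


def cial_elmore_delays(alpha_l, alpha_r):
    """Evaluate Equations 3.1, 3.2, and 3.6 under illustrative unit values.

    Assumed for this sketch only: C1..C5 = 1, Z5 = Z6 = Z7 = 1, and each
    input network presents 1 / (number of active inputs).
    """
    c1 = c2 = c3 = c4 = c5 = 1.0
    z5 = z6 = z7 = 1.0
    z9, z10 = 1.0 / alpha_l, 1.0 / alpha_r

    tau_n3 = c3 * (z9 + z5 + z7) + c1 * (z5 + z7) + c5 * z7    # Eq. 3.1
    tau_n4 = c4 * (z10 + z6 + z7) + c2 * (z6 + z7) + c5 * z7   # Eq. 3.2
    delta = c4 * z10 - c3 * z9                                 # Eq. 3.6
    return tau_n3, tau_n4, delta


tau_n3, tau_n4, delta = cial_elmore_delays(alpha_l=5, alpha_r=4)
# The common term tau_EQ cancels, so the two expressions for the difference agree.
assert math.isclose(tau_n4 - tau_n3, delta)
print(tau_n3, tau_n4, delta)
```

Under these assumed values the designed difference is small compared with either total delay, which is the property the later reliability discussion depends on.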
This difference is the designed difference of the function, and under ideal con-
ditions, the output will assume one of two bistable output states depending on whether
the quantity ∆(τN3,τN4) is positive or negative.
While effectively implementing a threshold logic function via a linear compari-
son of conductances without drawing static current, the CIAL element is quite slow and
high in power dissipation. Each branch of the differential amplifier must discharge two
relatively large capacitances. The delay can be improved substantially by increasing
the gate widths of devices M5-M7, thus reducing the impedances through which each
capacitance must discharge. However, this will also increase the parasitic capacitance
of the logic element, resulting in even higher power dissipation.
3.3 Latch-type CMOS threshold logic (LCTL)
Also proposed in 1995, latch-type CMOS threshold logic (LCTL) operates sim-
ilarly to CIAL [4]; an annotated device-level schematic of the LCTL gate is shown in
Figure 3.5.
Similarly to the CIAL gate, the LCTL element consists of a differential am-
plifier (M1−8) integrated with two networks of parallel input transistors (M9−M10).
When the signal clk assumes a logic value of 0, the gate is in the reset phase of op-
eration. Transistors M5 and M6 are inactive, thus neither of the internal nodes of the
Figure 3.5: Device-level schematic of the LCTL gate.
gate N1−2 possess a conducting path to ground. Transistors M1 and M4 are active, thus
nodes N1−2 are pre-charged to logic 1. Any signal assignment provided to the LCTL
element assumes that there is at least one active device in both M9 and M10 at all times,
thus nodes N3−6 are discharged to ground during reset. When the signal clk rises to a
logic 1, the gate begins the evaluation phase of operation, shown in Figure 3.6.
When the signal clk rises to a logic 1, transistors M1 and M4 become inactive,
while transistors M7 and M8 become active, causing nodes N1 and N2 to begin discharg-
ing to ground through devices M5−10. The discharge rates τN1 and τN2 of nodes N1 and
N2 are given in Equations 3.7 and 3.8.
τN1 =C1(Z5+Z7+Z9)+C3(Z7+Z9)+C5(Z7) (3.7)
τN2 =C2(Z6+Z8+Z10)+C4(Z8+Z10)+C6(Z8) (3.8)
In the equations, Ci corresponds to the capacitance of node Ni, and Zi corre-
sponds to the impedance provided by active transistor Mi. As in the CIAL element, the
LCTL gate utilizes matching pairs of transistors and node capacitances throughout the
differential amplifier. Capacitances C1 and C2 are equivalent, as are the pair C3 and C4
Figure 3.6: Evaluation waveforms of the LCTL gate, assuming αL/αR toggling between
5/4 and 4/5.
and the pair C5 and C6. The physical dimensions, and thus impedances, of devices M5
and M6 are also equivalent, as are the pair M7 and M8. As a result, a large portion of
both discharge delays are equivalent, as Equation 3.9 demonstrates.
τEQ =C1(Z5+Z7)+C3(Z7)+C5(Z7) =C2(Z6+Z8)+C4(Z8)+C6(Z8) (3.9)
The difference ∆(τN1,τN2) between the rates at which N1 and N2 are able to
discharge is thus exclusively determined by the number of active inputs in the input
networks M9 and M10, respectively, as shown in Equations 3.10, 3.11, and 3.12.
τN1 =C1(Z9)+C3(Z9)+ τEQ (3.10)
τN2 =C2(Z10)+C4(Z10)+ τEQ (3.11)
∆(τN1,τN2) = (C2+C4)(Z10)− (C1+C3)(Z9) (3.12)
This difference is the designed difference of the function, and under ideal con-
ditions, the output will assume one of two bistable output states depending on whether
the quantity ∆(τN1,τN2) is positive or negative. It is important to note that this quantity
is a larger portion of the overall discharge delay than that observed in the CIAL gate,
which provides a higher degree of robustness than that exhibited by CIAL.
The LCTL gate is a faster, lower power solution than CIAL as well due to the
placement of the input networks in the differential amplifier. In the CIAL gate, the input
network, which contributes a large amount of capacitance, is located at the top of the
discharge path, and thus observes the maximum possible impedance. In the LCTL gate,
however, the input network is located in the middle of the discharge path, encountering
less impedance as it discharges.
3.4 Single-input current-sensing differential logic (SCSDL)
Proposed in 2000, single-input current-sensing differential logic (SCSDL) is
structurally very similar to LCTL, essentially a reordering of the devices in the pull-
down of the two differential branches of the logic element [62]. The device-level
schematic of the SCSDL element is shown in Figure 3.7.
Figure 3.7: Device-level schematic of the SCSDL gate.
Transistors M1−8 comprise the differential amplifier portion of the gate, while
transistor banks M9−10 provide the networks of parallel input transistors. When the
signal clk assumes a logic 0, the gate is in the reset phase of operation. Transistors M5 and
M6 are inactive, thus neither of the internal nodes N1−2 of the gate possess a conducting
path to ground. Transistors M1 and M4 are active, thus nodes N1−2 are pre-charged to
logic 1. Any signal assignment provided to the SCSDL element assumes that there is at
least one active device in both M9 and M10 at all times, thus nodes N3−6 are discharged
to ground during reset. When the signal clk rises to a logic 1, the gate begins the
evaluation phase of operation, shown in Figure 3.8.
Figure 3.8: Evaluation waveforms of the SCSDL gate, assuming αL/αR toggling be-
tween 5/4 and 4/5.
In the evaluation phase, transistors M1 and M4 become inactive, while transis-
tors M5 and M6 become active, causing nodes N1 and N2 to begin discharging to ground
through devices M5−10. The discharge rates τN1 and τN2 of nodes N1 and N2 are given
in Equations 3.13 and 3.14.
τN1 =C1(Z5+Z7+Z9)+C3(Z7+Z9)+C5(Z9) (3.13)
τN2 =C2(Z6+Z8+Z10)+C4(Z8+Z10)+C6(Z10) (3.14)
In the equations, Ci corresponds to the capacitance of node Ni, and Zi corre-
sponds to the impedance provided by active transistor Mi. As in the LCTL element, the
SCSDL gate utilizes matching pairs of transistors and node capacitances throughout
the differential amplifier. Capacitances C1 and C2 are equivalent, as are the pair C3 and
C4 and the pair C5 and C6. The physical dimensions, and thus impedances, of devices
M5 and M6 are also equivalent, as are the pair M7 and M8. As a result, a large portion
of both discharge delays are equivalent, as Equation 3.15 demonstrates.
τEQ =C1(Z5+Z7)+C3(Z7) =C2(Z6+Z8)+C4(Z8) (3.15)
The difference ∆(τN1,τN2) between the rates at which N1 and N2 are able to
discharge is thus exclusively determined by the number of active inputs in the input
networks M9 and M10, respectively, as shown in Equations 3.16, 3.17, and 3.18.
τN1 =C1(Z9)+C3(Z9)+C5(Z9)+ τEQ (3.16)
τN2 =C2(Z10)+C4(Z10)+C6(Z10)+ τEQ (3.17)
∆(τN1 ,τN2) = (C2+C4+C6)(Z10)− (C1+C3+C5)(Z9) (3.18)
This difference is the designed difference of the function, and under ideal con-
ditions, the output will assume one of two bistable output states depending on whether
the quantity ∆(τN1,τN2) is positive or negative. The SCSDL gate further improves upon
the design of LCTL by implementing the input networks at the bottom of the pull-down
network, minimizing the amount of internal capacitance in the logic element that must
be discharged during an evaluation.
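To compare how the placement of the input network affects the designed difference, the sketch below evaluates the Elmore expressions for CIAL (Equations 3.1 and 3.2), LCTL (Equations 3.7 and 3.8), and SCSDL (Equations 3.13 and 3.14) with a common set of values. All capacitances and matched impedances are set to 1 and each input network is modeled as 1 divided by its number of active inputs; these values are assumptions for illustration and do not correspond to any simulated device sizes.

```python
def branch_delay(family, z_in, c=1.0, z=1.0):
    """Elmore delay of one branch for a given input-network impedance z_in.

    Assumed for this sketch only: every node capacitance equals c and every
    matched amplifier transistor presents impedance z.
    """
    if family == "CIAL":     # Equations 3.1 and 3.2
        return c * (z_in + z + z) + c * (z + z) + c * z
    if family == "LCTL":     # Equations 3.7 and 3.8
        return c * (z + z + z_in) + c * (z + z_in) + c * z
    if family == "SCSDL":    # Equations 3.13 and 3.14
        return c * (z + z + z_in) + c * (z + z_in) + c * z_in
    raise ValueError(f"unknown family: {family}")


def relative_difference(family, alpha_l, alpha_r):
    """Designed difference as a fraction of the slower branch's delay."""
    left = branch_delay(family, 1.0 / alpha_l)
    right = branch_delay(family, 1.0 / alpha_r)
    return abs(right - left) / max(left, right)


for family in ("CIAL", "LCTL", "SCSDL"):
    print(family, round(relative_difference(family, 5, 4), 4))
```

With these assumed values, the designed difference forms a progressively larger fraction of the total delay from CIAL to LCTL to SCSDL, because moving the input network lower in the discharge path places more node capacitance above it.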
3.5 Differential current-switch threshold logic (DCSTL)
Proposed in 2001, differential current-switch threshold logic incorporates the
improvements provided by SCSDL while adding additional devices to reduce the prop-
agation delay of the logic element [50, 54]. The device level schematic of the DCSTL
gate is shown in Figure 3.9.
Figure 3.9: Device-level schematic of the DCSTL gate.
Transistors M1−8 comprise the differential amplifier portion of the gate, while
transistor banks M9−10 provide the networks of parallel input transistors. As with pre-
vious implementations of differential threshold logic, when the signal clk assumes a
logic 0, the gate is in the reset phase of operation. Transistors M7 and M8 are inac-
tive, thus none of the internal nodes N1−4 of the gate possess a conducting path to
ground. Transistors M1 and M4 are active, thus nodes N1−4 are pre-charged to logic
1. Assuming there is at least one active device in both M9 and M10, nodes N5 and N6
are discharged to ground during reset. When the signal clk rises to a logic 1, the gate
begins the evaluation phase of operation, shown in Figure 3.10.
Figure 3.10: Evaluation waveforms of the DCSTL gate, assuming αL/αR toggling be-
tween 5/4 and 4/5.
DCSTL uses the same placement of the input networks proposed by SCSDL,
at the bottom of the discharge path (rather than the middle as in LCTL or the top as in
CIAL). Additionally, the gate introduces transistors M11−13, which provide parallel dis-
charge paths to nodes N1 and N2. These parallel discharge paths, while not required for
proper operation of the gate, improve the performance of the gate by reducing the total
discharge impedance observed by nodes N1 and N2, as demonstrated by Equations 3.19
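and 3.20.
\[ \tau_{N1} = C_1 \frac{(Z_5+Z_7+Z_9)(Z_{11}+Z_{13})}{Z_5+Z_7+Z_9+Z_{11}+Z_{13}} + C_3(Z_7+Z_9) + C_5 Z_9 + C_7 Z_{13} \tag{3.19} \]
\[ \tau_{N2} = C_2 \frac{(Z_6+Z_8+Z_{10})(Z_{12}+Z_{13})}{Z_6+Z_8+Z_{10}+Z_{12}+Z_{13}} + C_4(Z_8+Z_{10}) + C_6 Z_{10} + C_7 Z_{13} \tag{3.20} \]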
In the equations, Ci corresponds to the capacitance of node Ni, and Zi corre-
sponds to the impedance provided by active transistor Mi. As in the SCSDL element,
the DCSTL gate utilizes matching pairs of transistors and node capacitances through-
out the differential amplifier. Capacitances C1 and C2 are equivalent, as are the pair
C3 and C4 and the pair C5 and C6. The physical dimensions, and thus impedances, of
devices M5 and M6 are also equivalent, as are the pair M7 and M8 and the pair M11 and
M12. As a result, a portion of both discharge delays are equivalent, as Equation 3.21
demonstrates.
τEQ =C3(Z7)+C7(Z13) (3.21)
The difference ∆(τN1,τN2) between the rates at which N1 and N2 are able to
discharge is thus exclusively determined by the number of active inputs in the input
networks M9 and M10, respectively, as shown in Equations 3.22, 3.23, and 3.24.
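\[ \tau_{N1} = C_1 \frac{(Z_5+Z_7+Z_9)(Z_{11}+Z_{13})}{Z_5+Z_7+Z_9+Z_{11}+Z_{13}} + C_3 Z_9 + C_5 Z_9 + \tau_{EQ} \tag{3.22} \]
\[ \tau_{N2} = C_2 \frac{(Z_6+Z_8+Z_{10})(Z_{12}+Z_{13})}{Z_6+Z_8+Z_{10}+Z_{12}+Z_{13}} + C_4 Z_{10} + C_6 Z_{10} + \tau_{EQ} \tag{3.23} \]
\[ \begin{aligned} \Delta(\tau_{N1},\tau_{N2}) = {} & C_2 \frac{(Z_6+Z_8+Z_{10})(Z_{12}+Z_{13})}{Z_6+Z_8+Z_{10}+Z_{12}+Z_{13}} + C_4 Z_{10} + C_6 Z_{10} \\ & - C_1 \frac{(Z_5+Z_7+Z_9)(Z_{11}+Z_{13})}{Z_5+Z_7+Z_9+Z_{11}+Z_{13}} - C_3 Z_9 - C_5 Z_9 \end{aligned} \tag{3.24} \]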
It should be noted that while the parallel discharge paths present in the DCSTL
element reduce the gate’s discharge delay, they also reduce the difference in impedance
between the left and right branches of the differential amplifier, negatively impacting
the robustness of the gate. If transistors M11−13 are removed from the gate, the discharge
delays τ′N1 and τ′N2 of the logic element are simplified as given in Equations 3.25
and 3.26.
τ ′N1 =C1(Z5+Z7+Z9)+C3(Z7+Z9)+C5(Z9) (3.25)
τ ′N2 =C2(Z6+Z8+Z10)+C4(Z8+Z10)+C6(Z10) (3.26)
The effects of this augmentation on the logic element are apparent, as demon-
strated by the difference ∆′(τN1,τN2) between the rates at which N1 and N2 are able to
discharge given in Equation 3.27.
∆′(τN1,τN2) = (C2+C4+C6)(Z10)− (C1+C3+C5)(Z9) (3.27)
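The difference between Equations 3.27 and 3.24 is given by Equation 3.28.
\[ \Delta' - \Delta = C_2\left(Z_{10} - \frac{(Z_6+Z_8+Z_{10})(Z_{12}+Z_{13})}{Z_6+Z_8+Z_{10}+Z_{12}+Z_{13}}\right) - C_1\left(Z_9 - \frac{(Z_5+Z_7+Z_9)(Z_{11}+Z_{13})}{Z_5+Z_7+Z_9+Z_{11}+Z_{13}}\right) \tag{3.28} \]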
The value of ∆′ is greater than the value of ∆, indicating a higher degree of
robustness in the gate. Note that since C1 =C2, Z5 = Z6, Z7 = Z8, and Z11 = Z12, ∆′−∆
reduces as shown in Equation 3.29.
\[ \Delta' - \Delta = C_2(Z_{10}-Z_9) - \frac{C_2(Z_{10}-Z_9)(Z_{11}+Z_{13})^2}{(Z_6+Z_8)^2+(Z_{11}+Z_{13})^2+2(Z_6+Z_8)(Z_{11}+Z_{13})+(Z_6+Z_8+Z_{11}+Z_{13})(Z_9+Z_{10})+Z_9 Z_{10}} \tag{3.29} \]
The second term in the equation is equivalent to the first term in the equation
multiplied by a factor of (Z11+Z13)^2/[(Z6+Z8)^2 + 2(Z6+Z8)(Z11+Z13) + (Z11+Z13)^2
+ (Z6+Z8+Z11+Z13)(Z9+Z10) + Z9Z10]. This factor is always greater than
zero, as all of the impedances composing the factor are positive values. Additionally,
this factor is always less than 1, as the denominator is greater than the numerator. As a
consequence, ∆′−∆ always has the same sign as Z10−Z9, and is therefore positive whenever
the designed difference ∆′ is positive.
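The reduction from Equation 3.28 to Equation 3.29 can be checked numerically. The sketch below reads Equation 3.28 as C2 times (Z10 minus the parallel impedance of the right branch) less C1 times (Z9 minus the parallel impedance of the left branch), applies the matching conditions stated in the text, and compares the result with the closed form. The impedance values are arbitrary positive numbers chosen for illustration, and the helper names are hypothetical.

```python
import math


def parallel(z_a, z_b):
    """Impedance of two paths in parallel."""
    return z_a * z_b / (z_a + z_b)


def gap_from_eq_3_28(c, z_amp, z_par, z9, z10):
    """Delta' - Delta as the difference of two bracketed terms (Equation 3.28).

    Uses the matching stated in the text: C1 = C2 = c, Z5 + Z7 = Z6 + Z8 = z_amp,
    and Z11 + Z13 = Z12 + Z13 = z_par.
    """
    return (c * (z10 - parallel(z_amp + z10, z_par))
            - c * (z9 - parallel(z_amp + z9, z_par)))


def gap_from_eq_3_29(c, z_amp, z_par, z9, z10):
    """Closed form of Equation 3.29; also returns the scaling factor."""
    denominator = (z_amp ** 2 + 2 * z_amp * z_par + z_par ** 2
                   + (z_amp + z_par) * (z9 + z10) + z9 * z10)
    factor = z_par ** 2 / denominator
    return c * (z10 - z9) * (1 - factor), factor


# Arbitrary positive values chosen for illustration (not taken from the source).
direct = gap_from_eq_3_28(c=1.0, z_amp=2.0, z_par=2.0, z9=0.2, z10=0.25)
closed, factor = gap_from_eq_3_29(c=1.0, z_amp=2.0, z_par=2.0, z9=0.2, z10=0.25)
assert math.isclose(direct, closed)
assert 0 < factor < 1
print(direct, closed, factor)
```

Swapping the values of Z9 and Z10 flips the sign of both expressions, so the magnitude of the designed difference without the parallel paths exceeds the magnitude with them in either direction.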
3.6 Performance
While essentially similar in structure and operation, the key differences between
each implementation of differential threshold logic can create a substantial variance in
the delay values exhibited by each cell. Table 3.1 compares the delay from the rising
edge of clk to the rising edge of either Vout or its complement between 12-input, minimum-sized
CIAL, LCTL, SCSDL, and DCSTL gates. Simulations were performed using Synopsys
HSpice version 2009.09 on a commercial 65 nm LP bulk CMOS process assuming
typical operating conditions, a supply voltage of 1.2V, and a temperature of 25C.
Table 3.1: Evaluation delay of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process.
αL/αR (onset)    αL/αR (offset)    CIAL        LCTL        SCSDL       DCSTL
2/1              1/2               277.0 ps    83.1 ps     80.0 ps     70.6 ps
3/2              2/3               254.7 ps    95.0 ps     88.8 ps     78.9 ps
4/3              3/4               254.1 ps    100.3 ps    95.9 ps     85.5 ps
5/4              4/5               255.9 ps    101.6 ps    102.0 ps    90.7 ps
In the table, each αL/αR value represents an input configuration provided to each gate.
The value of αL represents the number of active devices comprising the left input net-
work M9 of each logic element, while the value of αR represents the number of active
devices comprising the right input network M10 of each logic element. As the table
demonstrates, DCSTL is the fastest of the four logic elements. This was predicted
from the Elmore delay equations above due to the parallel discharge paths present in
the DCSTL architecture. CIAL is the slowest of the four logic elements, also as predicted,
due to its combination of large capacitances and impedances. In each design, it
was shown that reducing the impedances along the discharge path (M5-M7 in CIAL and
M5-M8 in LCTL, SCSDL, and DCSTL) will significantly reduce the total evaluation
delay. This is supported by the data in Table 3.2. The information provided in both
Tables 3.1 and 3.2 is summarized in Figure 3.11.
Table 3.2: Evaluation delay of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, and
DCSTL elements by αL/αR combination for a 65 nm LP bulk CMOS process.
αL/αR (onset)    αL/αR (offset)    CIAL       LCTL       SCSDL      DCSTL
2/1              1/2               87.9 ps    59.3 ps    73.9 ps    62.1 ps
3/2              2/3               79.2 ps    67.0 ps    77.0 ps    59.7 ps
4/3              3/4               77.0 ps    72.3 ps    81.0 ps    60.9 ps
5/4              4/5               76.3 ps    75.0 ps    84.8 ps    63.0 ps
Figure 3.11: Evaluation delay of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, and DCSTL elements assuming max(αL, αR) - min(αL, αR) = 1.
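The effect of sizing can be quantified directly from the tabulated delays. The following Python sketch copies the values of Tables 3.1 and 3.2 and computes the percentage reduction in evaluation delay for each gate and input configuration; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
CONFIGS = ("2/1", "3/2", "4/3", "5/4")  # onset alpha_L/alpha_R combinations

# Evaluation delays in ps, copied from Table 3.1 (minimum-sized gates).
MIN_SIZED_PS = {
    "CIAL": (277.0, 254.7, 254.1, 255.9),
    "LCTL": (83.1, 95.0, 100.3, 101.6),
    "SCSDL": (80.0, 88.8, 95.9, 102.0),
    "DCSTL": (70.6, 78.9, 85.5, 90.7),
}
# Evaluation delays in ps, copied from Table 3.2 (M5-8 = 0.96 um).
SIZED_PS = {
    "CIAL": (87.9, 79.2, 77.0, 76.3),
    "LCTL": (59.3, 67.0, 72.3, 75.0),
    "SCSDL": (73.9, 77.0, 81.0, 84.8),
    "DCSTL": (62.1, 59.7, 60.9, 63.0),
}


def delay_reduction_percent(gate):
    """Percentage reduction in delay from minimum-sized to sized, per configuration."""
    return tuple(
        100.0 * (before - after) / before
        for before, after in zip(MIN_SIZED_PS[gate], SIZED_PS[gate])
    )


for gate in MIN_SIZED_PS:
    reductions = delay_reduction_percent(gate)
    print(gate, [f"{cfg}: {r:.1f}%" for cfg, r in zip(CONFIGS, reductions)])
```

In this data, every gate benefits from sizing in every configuration, and CIAL shows the largest relative improvement.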
While increasing the sizes of particular devices in the logic element can im-
prove the performance of the gate, such modifications are not without costs. Increasing
the gate width of a transistor increases area as well as the drain, gate, and source ca-
pacitances of the device, resulting in higher power dissipation.
3.7 Power dissipation
Power dissipation of differential threshold logic elements can vary significantly
due to differences in the structure of the gate. Specifically, the more capacitance a gate
possesses, the greater power dissipation it will exhibit as the capacitance is charged and
discharged. Table 3.3 demonstrates how power dissipation compares between 12-input
CIAL, LCTL, SCSDL, and DCSTL gates of minimum size.
Table 3.3: Power dissipation of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (1 GHz clock).
αL/αR (onset)    αL/αR (offset)    CIAL        LCTL        SCSDL      DCSTL
2/1              1/2               17.49 µW    10.96 µW    7.73 µW    8.14 µW
3/2              2/3               17.61 µW    12.11 µW    8.13 µW    8.48 µW
4/3              3/4               17.65 µW    12.81 µW    8.41 µW    8.76 µW
5/4              4/5               17.69 µW    13.35 µW    8.63 µW    8.98 µW
As demonstrated previously, additional sizing of each gate can be performed to
reduce the evaluation delay of the logic element. However, increased sizing increases
the capacitance of the logic element, resulting in increased power dissipation. Table 3.4
demonstrates how power dissipation increases in response to gate sizing for each of the
different threshold logic elements. The information provided in both Tables 3.3 and 3.4
is summarized in Figure 3.12.
Table 3.4: Power dissipation of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, and
DCSTL elements by αL/αR combination for a 65 nm LP bulk CMOS process (1 GHz
clock).
αL/αR (onset)    αL/αR (offset)    CIAL        LCTL        SCSDL       DCSTL
2/1              1/2               29.28 µW    14.50 µW    13.45 µW    13.69 µW
3/2              2/3               29.54 µW    16.07 µW    14.02 µW    13.96 µW
4/3              3/4               29.69 µW    17.10 µW    14.50 µW    14.25 µW
5/4              4/5               29.81 µW    17.84 µW    14.88 µW    14.44 µW
Figure 3.12: Power dissipation of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, and DCSTL elements assuming max(αL, αR) - min(αL, αR) = 1.
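Because the reported power is measured with a 1 GHz clock, each value can be converted to an energy per clock cycle, and from there to a rough effective switched capacitance. The sketch below does this for the 5/4 rows of Tables 3.3 and 3.4. It assumes that the dissipation is purely dynamic and follows P ≈ C·V²·f with V = 1.2 V, which ignores leakage and short-circuit current; the resulting capacitances are therefore only an order-of-magnitude illustration of the link between capacitance and power described above, and the names used are hypothetical.

```python
CLOCK_HZ = 1e9  # 1 GHz clock, as stated in the captions of Tables 3.3 and 3.4
VDD = 1.2       # supply voltage in volts, from the stated simulation conditions

# Power in microwatts for the 5/4 configuration: (minimum-sized, sized),
# copied from Tables 3.3 and 3.4.
POWER_UW = {
    "CIAL": (17.69, 29.81),
    "LCTL": (13.35, 17.84),
    "SCSDL": (8.63, 14.88),
    "DCSTL": (8.98, 14.44),
}


def energy_per_cycle_fj(power_uw, clock_hz=CLOCK_HZ):
    """Average energy drawn in one clock period, in femtojoules."""
    return power_uw * 1e-6 / clock_hz * 1e15


def effective_capacitance_ff(power_uw, vdd=VDD, clock_hz=CLOCK_HZ):
    """Capacitance in femtofarads under the assumed model P = C * V^2 * f."""
    return energy_per_cycle_fj(power_uw, clock_hz) / vdd ** 2


for gate, (p_min, p_sized) in POWER_UW.items():
    print(gate,
          round(effective_capacitance_ff(p_min), 2),
          round(effective_capacitance_ff(p_sized), 2))
```

Under this simplified model the increase in power caused by sizing maps one-to-one onto an increase in effective capacitance, since the supply voltage and clock frequency are held fixed.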
3.8 Failure simulations and reliability
The designed difference in impedance between each branch of the differential
amplifier for each of the logic elements described is a function of the impedances presented
by the parallel input networks M9 and M10. The impedances of each network,
however, are only part of the total discharge impedance observed by each branch. While
this additional impedance is designed to be equivalent between the two branches of the
amplifier, in reality a gate will deviate from this ideal. In a manufactured circuit, noise
and process variations can potentially cause the impedances of the matched transistors
in the differential amplifier to vary.
A failure in a differential threshold logic element implementing a specific func-
tion is defined as the event in which any input configuration results in an incorrect
output. In any differential threshold logic element, outputs are decided based on a
comparison between the impedances of the left and right input networks. Noise or process
variations in the gate can raise or lower the effective impedance of a
network; if the variation is larger than the designed difference between the two input
networks, it may be sufficient to alter the designed outcome of the race, resulting in a
failure. Consider, as an example, the evaluation phase of an LCTL gate, assuming the
following capacitances and impedances:
• C1 = C2 = C3 = C4 = C5 = C6 = 1
• Z5 = Z6 = Z7 = Z8 = 1
• Z9 = 1/αL, Z10 = 1/αR
Following Equations 3.7 and 3.8, the discharge delays τN1 and τN2 using these
values are thus given across a range of αL/αR combinations in Table 3.5.
Table 3.5: Branch discharge delay of an LCTL element assuming Ci = Zi = 1.
αL/αR τN1 τN2 ∆(τN1,τN2)
2/1 5 6 1
3/2 4.667 5 0.333
4/3 4.5 4.667 0.167
5/4 4.4 4.5 0.1
Assuming all impedances vary plus or minus 10% due to process variations,
it is possible for ∆(τN1,τN2) to become a negative value, indicating that the gate has
failed. The worst possible variation for the gate across a range of αL/αR combinations
is given in Table 3.6.
Table 3.6: Branch discharge delay of an LCTL element assuming Ci = Zi = 1, τN1 +
10%, and τN2 - 10%.
αL/αR τN1 τN2 ∆(τN1,τN2)
2/1 5.5 5.4 -0.1
3/2 5.133 4.5 -0.633
4/3 4.95 4.2 -0.75
5/4 4.84 4.05 -0.79
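The two tables above can be reproduced with a short calculation. The sketch below is illustrative only: the closed form 4 + 2/α is an assumption inferred from the values in Table 3.5 under Ci = Zi = 1, not a restatement of Equations 3.7 and 3.8, and the function names are hypothetical.

```python
def branch_delay(alpha):
    """Unit-value branch delay; the form 4 + 2/alpha is inferred from Table 3.5."""
    return 4 + 2 / alpha


def delay_margin(alpha_l, alpha_r, skew=0.0):
    """Return (tau_N1, tau_N2, delta) with tau_N1 slowed and tau_N2 sped up by `skew`."""
    tau_n1 = branch_delay(alpha_l) * (1 + skew)
    tau_n2 = branch_delay(alpha_r) * (1 - skew)
    return tau_n1, tau_n2, tau_n2 - tau_n1


for alpha_l, alpha_r in [(2, 1), (3, 2), (4, 3), (5, 4)]:
    nominal = delay_margin(alpha_l, alpha_r)
    worst = delay_margin(alpha_l, alpha_r, skew=0.10)
    # A negative delta means the wrong branch wins the race (a failure).
    print(f"{alpha_l}/{alpha_r}: nominal delta = {nominal[2]:.3f}, "
          f"worst-case delta = {worst[2]:.3f}")
```

Under this assumed model, every combination listed has a negative worst-case difference, which is the behavior Table 3.6 reports.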
Naturally, the probability of failure increases as the difference in impedance
between the two networks decreases. To determine the robustness of an individual
function, it is therefore sufficient to inspect the input configurations that provide the
smallest difference in impedance between the two input networks in both the onset and
offset. If a failure does occur in the threshold gate, the set of input configurations that
yield a failure must include one of these combinations [17]. These input combinations
are henceforth defined as the “critical configurations” of the gate.
The noise margin of a differential threshold logic gate is the amount of DC
bias that can be applied to either differential node N1 or N2 before a failure occurs
in the gate. Assuming the input configuration is such that node N1 is supposed to
discharge before node N2, a positive bias on node N1 or negative bias on node N2 may
result in a failure. Similarly, assuming the input configuration is such that node N2 is
supposed to discharge before node N1, a positive bias on node N2 or a negative bias on
node N1 may result in a failure. The noise margin of each minimum-sized differential
implementation is demonstrated in Table 3.7.
Table 3.7: Noise margin of minimum-sized CIAL, LCTL, SCSDL, and DCSTL ele-
ments by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd = 1.2V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL
2/1 1/2 5 mV 28 mV 51 mV 18 mV
3/2 2/3 8 mV 12 mV 22 mV 9 mV
4/3 3/4 9 mV 7 mV 12 mV 5 mV
5/4 4/5 9 mV 5 mV 7 mV 3 mV
In a static CMOS gate, 10% of the supply voltage is generally regarded as a
minimum acceptable noise margin [41, 66]. As the table demonstrates, none of the
differential threshold logic elements simulated satisfy this requirement; the maximum
noise margin demonstrated is 4.25%. Increasing the widths of the pull-down transistors
of the differential amplifier (M5-M7 in CIAL, M5-M8 in LCTL, SCSDL, and DCSTL)
increases the difference in designed impedance between the two branches of the dif-
ferential amplifier and reduces contention, thus increasing the noise margin. Table 3.8
displays how the noise margin of each gate responds to increasing the widths of the
pull-down transistors of the differential amplifier by 8x. The information provided in
both Tables 3.7 and 3.8 is summarized in Figure 3.13.
Table 3.8: Noise margin of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd = 1.2V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL
2/1 1/2 13 mV 123 mV 61 mV 75 mV
3/2 2/3 16 mV 64 mV 35 mV 53 mV
4/3 3/4 18 mV 43 mV 23 mV 38 mV
5/4 4/5 19 mV 33 mV 17 mV 29 mV
Figure 3.13: Noise margin of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, and DCSTL elements assuming max(αL, αR) - min(αL, αR) = 1.
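The comparison against the 10% guideline can be checked directly from Tables 3.7 and 3.8. The following sketch simply expresses each tabulated noise margin as a fraction of the 1.2 V supply; the data are copied from the tables, while the variable and function names are illustrative.

```python
VDD_MV = 1200.0            # nominal supply, 1.2 V
REQUIRED_FRACTION = 0.10   # 10% of Vdd, the static CMOS guideline cited in the text
RATIOS = ["2/1", "3/2", "4/3", "5/4"]

MINIMUM_SIZED_MV = {   # Table 3.7
    "CIAL": [5, 8, 9, 9], "LCTL": [28, 12, 7, 5],
    "SCSDL": [51, 22, 12, 7], "DCSTL": [18, 9, 5, 3],
}
SIZED_MV = {           # Table 3.8
    "CIAL": [13, 16, 18, 19], "LCTL": [123, 64, 43, 33],
    "SCSDL": [61, 35, 23, 17], "DCSTL": [75, 53, 38, 29],
}


def passing_entries(table):
    """List (family, alpha ratio) pairs whose margin is at least 10% of Vdd."""
    return [(family, RATIOS[i])
            for family, margins in table.items()
            for i, margin in enumerate(margins)
            if margin / VDD_MV >= REQUIRED_FRACTION]


best_minimum = max(max(m) for m in MINIMUM_SIZED_MV.values())
print(f"best minimum-sized margin: {100 * best_minimum / VDD_MV:.2f}% of Vdd")
print("minimum-sized entries meeting 10%:", passing_entries(MINIMUM_SIZED_MV))
print("sized entries meeting 10%:", passing_entries(SIZED_MV))
```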
As observed in the table and plot, the noise margin of each gate is improved
significantly through sizing. Although the LCTL element with an αL/αR combination
of 2/1 exhibits a minimally acceptable noise margin, it degrades quickly as the differential
impedance in the gate decreases. The remaining gates improve as well, but their
robustness remains very poor. It is possible to improve noise margin through
further sizing of the gates, though the associated performance and power costs quickly
become prohibitive. DCSTL gates demonstrated in a fabricated 0.25 µm test chip re-
quired transistor gate width sizing of up to 40x the minimum width in order to attain
a reasonable degree of robustness, the power and performance costs of which greatly
outweighed the supposed advantages of using threshold logic [52].
3.8.1 Voltage scaling
Voltage scaling is often employed in static CMOS logic gates as a technique for
reducing static and dynamic power dissipation at the expense of performance [23,
22]. It is widely used in embedded applications where extending battery life is critical,
and high performance is typically only in demand for brief intervals of time. Assuming
a fixed operating frequency, dynamic power decreases quadratically as the supply voltage
is reduced. Leakage power also falls non-linearly, since voltage scaling narrows the
difference between the supply voltage and the threshold voltage, which decreases the
static current. The effects of voltage scaling on the performance and
power dissipation of minimum-sized CIAL, LCTL, SCSDL, and DCSTL are demon-
strated in Tables 3.9 and 3.10, respectively.
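As a rough point of reference for the quadratic dependence mentioned above, the sketch below evaluates the ideal ratio of dynamic power at two supply voltages. It assumes the standard first-order switching-power model (activity, switched capacitance, and frequency held constant) and ignores short-circuit and leakage currents, so it is not expected to match the simulated totals in the tables.

```python
def dynamic_power_ratio(vdd_new, vdd_old):
    """Ideal dynamic power ratio when only the supply voltage changes (P ~ V^2)."""
    return (vdd_new / vdd_old) ** 2


# Scaling from the nominal 1.2 V supply down to 1.0 V.
print(f"ideal dynamic power ratio: {dynamic_power_ratio(1.0, 1.2):.3f}")
```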
Table 3.9: Evaluation delay of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd = 1.0 V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL
2/1 1/2 303.1 ps 133.4 ps 112.2 ps 112.0 ps
3/2 2/3 296.7 ps 157.8 ps 125.7 ps 128.9 ps
4/3 3/4 299.8 ps 171.0 ps 137.1 ps 142.0 ps
5/4 4/5 303.5 ps 177.3 ps 146.6 ps 152.1 ps
Table 3.10: Power dissipation of minimum-sized CIAL, LCTL, SCSDL, and DCSTL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (1 GHz clock, Vdd
= 1.0 V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL
2/1 1/2 8.29 µW 6.92 µW 5.05 µW 5.34 µW
3/2 2/3 8.43 µW 7.55 µW 5.23 µW 5.52 µW
4/3 3/4 8.49 µW 7.96 µW 5.36 µW 5.65 µW
5/4 4/5 8.54 µW 8.25 µW 5.46 µW 5.75 µW
As in all logic gates, both static and dynamic, the noise margin exhibited by a
gate is extremely sensitive to the difference in potential between the supply and ground.
As the supply voltage is reduced, the noise margin decreases as well [42]. For any
logic gate, some minimum voltage exists below which the circuit cannot be expected
to operate as intended with a reasonable probability. Since noise margin is positively
correlated with supply voltage, a reduction in voltage can only decrease
the reliability of the gate; naturally, the less reliable a gate is to begin with, the greater the
problem posed by voltage scaling will be. This effect is demonstrated in Table 3.11.
Table 3.11: Noise margin of minimum-sized CIAL, LCTL, SCSDL, and DCSTL ele-
ments by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd = 0.8V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL
2/1 1/2 13 mV 19 mV 35 mV 15 mV
3/2 2/3 16 mV 7 mV 15 mV 8 mV
4/3 3/4 17 mV 4 mV 8 mV 5 mV
5/4 4/5 17 mV 2 mV 5 mV 3 mV
Since none of the differential threshold gates presented provide a sufficient
noise margin at the nominal supply voltage of 1.2V for the process in which they are
implemented, voltage scaling is not advised for these logic elements in DSM processes.
3.8.2 Silicon on insulator
Silicon on insulator (SOI) transistors utilize a floating body rather than a con-
nected bulk node. Such transistors exhibit reduced power dissipation due to isolation
from the capacitance of the bulk silicon and are more resistant to latchup issues that
can occur in body connected processes [18]. As a side effect of the floating body node,
however, the threshold voltage of the device will vary based on its previous states of
operation [8].
This is especially troublesome in differential threshold logic, as each evaluation
is designed as an independent event; the output should be determined only by the state
of the parallel transistors in each input network at the instant the clock signal clk rises
from logic 0 to logic 1. The introduction of hysteresis into the transistors composing
the gate provides a handicap to one of the two branches of the differential amplifier
depending on the activity of the transistors over previous clock cycles. Effectively, this
hysteresis can be considered as an additional source of noise. Differential threshold
logic gates manufactured in SOI processes will thus generally possess a lesser degree
of robustness than gates constructed in a similar bulk CMOS process. However, the
magnitude of this impact will vary from process to process and corner to corner.
Chapter 4
THE THRESHOLD LOGIC LATCH (TLL)
Previous implementations of differential threshold logic all suffer from the same
problem. While fast and relatively low in power dissipation and area requirements,
each gate is extremely unreliable in the presence of noise and process variations, making
it risky to use in custom design and completely unsuitable for automated design.
While the reliability of a differential threshold logic element can be improved by care-
ful width-sizing of particular transistors within the gate, the area, delay, and power
dissipation costs of such improvements can often be prohibitive.
4.1 Design of the TLL element
Apart from increasing transistor widths and lengths, these issues can also be
mitigated by isolating the parallel input networks from the rest of the differential amplifier.
The proposed logic element, referred to henceforth as the threshold logic latch, or
TLL, does just this by moving the two parallel input networks out of the differential
amplifier and changing their role to that of triggering the amplifier, as seen in Figure 4.1.
[Schematic labels: transistors M1-M12; nodes N1-N6; inputs in0 ... in2n-1; clk; outputs Vout and V̄out]
Figure 4.1: Device-level schematic of the threshold logic latch (TLL).
Like all other differential threshold logic implementations, the TLL element
operates in two phases: reset and evaluation. When the signal clk is low, the gate is
reset. Nodes N5 and N6 are both discharged to logic 0 via active transistors M11 and M12.
Some or all of the transistors composing M9 and M10 may be active as well, propagating
the low clk signal to nodes N5 and N6, hastening the discharge. The discharge of nodes
N5 and N6 deactivates transistors M7 and M8, cutting off the path from nodes N1-N4 to
ground while activating transistors M1 and M4, causing nodes N1 and N2 to pre-charge
to logic 1. As the signal clk rises, the gate begins evaluation of the threshold function,
as shown in Figure 4.2.
[Waveform plot: voltage V versus time t (s) for clk, N5, N6, N1, N2, Vout, and V̄out]
Figure 4.2: Evaluation waveforms of the TLL gate, assuming αL/αR toggling between
5/4 and 4/5.
Transistors M11 and M12 become inactive, and clk propagates to N5 and N6
through the input networks M9 and M10. As N5 rises to a logic 1, transistor M1 becomes
inactive and M7 becomes active, causing N1 to begin to discharge. Similarly, as N6
rises to a logic 1, transistor M4 becomes inactive and M8 becomes active, causing N2
to begin to discharge. The output is determined by which of the two nodes N1 and N2
is able to completely discharge first. The charge delays τN5 and τN6 of nodes N5 and N6
are given in Equations 4.1 and 4.2, respectively, while the discharge delays τN1 and τN2
of nodes N1 and N2 are given in Equations 4.3 and 4.4, respectively.
\[ \tau_{N5} = C_5 Z_9 \tag{4.1} \]
\[ \tau_{N6} = C_6 Z_{10} \tag{4.2} \]
\[ \tau_{N1} = C_1 (Z_5 + Z_7) + C_3 Z_7 \tag{4.3} \]
\[ \tau_{N2} = C_2 (Z_6 + Z_8) + C_4 Z_8 \tag{4.4} \]
In the equations, Ci corresponds to the capacitance of node Ni, and Zi corre-
sponds to the impedance provided by active transistor Mi. The total delay of the TLL
element is the summation of the delay between the rising edge of the clock signal and
the charging of either node N5 or N6, plus the discharge delay of either node N1 or N2,
depending on which of the two nodes N5 or N6 completed charging first. The complete
delays from the rising edge of the clock to the discharge of nodes N1 and N2 are given
in Equations 4.5 and 4.6.
\[ \tau_{N1+N5} = C_1 (Z_5 + Z_7) + C_3 Z_7 + C_5 Z_9 \tag{4.5} \]
\[ \tau_{N2+N6} = C_2 (Z_6 + Z_8) + C_4 Z_8 + C_6 Z_{10} \tag{4.6} \]
As with all differential threshold logic elements, the TLL gate utilizes matching
pairs of transistors and node capacitances throughout the differential amplifier. Capac-
itances C1 and C2 are approximately equal, as are the pair C3 and C4 and the pair C5
and C6. The physical dimensions, and thus impedances, of devices M5 and M6 are also
equal, as are those of the pair M7 and M8. As a result, a large portion of each discharge
delay is common to both branches, as Equation 4.7 demonstrates.
\[ \tau_{EQ} = C_1 (Z_5 + Z_7) + C_3 Z_7 = C_2 (Z_6 + Z_8) + C_4 Z_8 \tag{4.7} \]
The difference ∆(τN1+N5,τN2+N6) between the total delays with which N1 and N2 are able
to discharge is thus almost exclusively determined by the number of active inputs in
the input networks M9 and M10, respectively, as shown in Equations 4.8, 4.9, and 4.10.
\[ \tau_{N1+N5} = \tau_{EQ} + C_5 Z_9 \tag{4.8} \]
\[ \tau_{N2+N6} = \tau_{EQ} + C_6 Z_{10} \tag{4.9} \]
\[ \Delta(\tau_{N1+N5}, \tau_{N2+N6}) = C_6 Z_{10} - C_5 Z_9 \tag{4.10} \]
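Equations 4.7 through 4.10 can be evaluated numerically to see how the sign of the delay difference follows the input configuration. The sketch below is a first-order illustration only: it borrows the unit-value assumption of the Section 3.8 example (every Ci and every single-device Zi equal to 1, with Z9 = 1/αL and Z10 = 1/αR), which is an assumption made here and not a characterization of a real TLL gate.

```python
def tll_delays(alpha_l, alpha_r, c=1.0, z=1.0):
    """Evaluate Equations 4.8-4.10 under assumed unit capacitances and impedances."""
    z9 = z / alpha_l                  # alpha_l active devices in parallel in M9
    z10 = z / alpha_r                 # alpha_r active devices in parallel in M10
    tau_eq = c * (z + z) + c * z      # Equation 4.7: C1(Z5 + Z7) + C3(Z7)
    tau_left = tau_eq + c * z9        # Equation 4.8
    tau_right = tau_eq + c * z10      # Equation 4.9
    return tau_left, tau_right, tau_right - tau_left   # Equation 4.10


for alpha_l, alpha_r in [(2, 1), (5, 4), (4, 5)]:
    left, right, delta = tll_delays(alpha_l, alpha_r)
    first = "N1" if delta > 0 else "N2"
    print(f"{alpha_l}/{alpha_r}: delta = {delta:+.3f}, {first} discharges first")
```

In this simplified model the matched term τEQ cancels, so the winner of the race depends only on which input network has more active devices.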
The impact of this change is substantial. In previous implementations of differ-
ential threshold logic, both sides of the differential amplifier would begin to discharge
at the same time, but at different rates determined by the input configuration. With this
modification, one side of the amplifier actually begins to discharge before the other, de-
termined by which parallel input network the clock signal is able to propagate through
the fastest. The side that is triggered first is able to partially discharge unhindered by
the efforts of the opposite side, and in fact starts to reduce the initial rate of discharge
of the opposite side. This difference in initial discharge times is determined solely by
the impedances of the parallel input networks, whereas the difference in discharge rates
in other implementations is determined by the impedances of the parallel input networks
in series with the transistors of the discharge paths of the differential amplifier.
Thus, while variations in the differential amplifier in a TLL element may still lead to
some imbalance in impedance between the two discharge paths, the designed difference
between the two parallel input networks is not as easily masked.
4.2 Comparison with existing differential threshold logic
In previous implementations of differential threshold logic, the clock triggers
the differential amplifier directly. In TLL, on the other hand, the clock propagates through
the two networks of input transistors to trigger the amplifier. If the fan-in of the gate is
very large and/or the input transistors are large in size, the load observed by the clock
driver will increase accordingly, as will the power the driver dissipates.
4.2.1 Reliability
The impact of these structural improvements can be quantified. Simulations
were performed in HSpice for minimum-sized 65 nm LP bulk CMOS LCTL, CIAL,
SCSDL, DCSTL, and TLL gates, applying DC signal noise to the bistable nodes of
the differential amplifier. Assuming a supply voltage of 1.2 V and identical operat-
ing conditions, the noise margin of each gate can be observed in Tables 4.1 and 4.2,
where for each logic element the parameters αL and αR correspond to the number of
active devices in input networks M9 and M10, respectively. The same data is plotted in
Figure 4.3.
Table 4.1: Noise margin of minimum-sized CIAL, LCTL, SCSDL, DCSTL, and TLL
elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd = 1.2V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 5 mV 28 mV 51 mV 18 mV 288 mV
3/2 2/3 8 mV 12 mV 22 mV 9 mV 146 mV
4/3 3/4 9 mV 7 mV 12 mV 5 mV 87 mV
5/4 4/5 9 mV 5 mV 7 mV 3 mV 57 mV
Table 4.2: Noise margin of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, DCSTL,
and TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process (Vdd =
1.2V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 13 mV 123 mV 61 mV 75 mV 376 mV
3/2 2/3 16 mV 64 mV 35 mV 53 mV 249 mV
4/3 3/4 18 mV 43 mV 23 mV 38 mV 172 mV
5/4 4/5 19 mV 33 mV 17 mV 29 mV 123 mV
Figure 4.3: Noise margin of minimum-sized and sized (M5−8 = 0.96 µm) CIAL, LCTL,
SCSDL, DCSTL, and TLL elements assuming max(αL, αR) - min(αL, αR) = 1.
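The relationship between the minimum-sized TLL gate (Table 4.1) and the sized versions of the other families (Table 4.2) can be made explicit by taking ratios row by row. The data below are copied from the two tables; the helper name is illustrative.

```python
RATIOS = ["2/1", "3/2", "4/3", "5/4"]
TLL_MINIMUM_MV = [288, 146, 87, 57]          # Table 4.1, TLL column
SIZED_MV = {                                  # Table 4.2, non-TLL columns
    "CIAL": [13, 16, 18, 19], "LCTL": [123, 64, 43, 33],
    "SCSDL": [61, 35, 23, 17], "DCSTL": [75, 53, 38, 29],
}


def improvement(family):
    """Noise margin of minimum-sized TLL divided by that of the sized `family` gate."""
    return [tll / other for tll, other in zip(TLL_MINIMUM_MV, SIZED_MV[family])]


for family in SIZED_MV:
    row = ", ".join(f"{r}: {x:.2f}x" for r, x in zip(RATIOS, improvement(family)))
    print(f"{family:6s} {row}")
```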
As the tables and plot indicate, the noise margin provided by a minimum-sized
TLL gate is much greater than that provided by minimum-sized LCTL, CIAL, SCSDL,
or DCSTL gates. Additionally, the noise margin provided by a minimum-sized TLL
gate is still greater than that provided by the sized (M5−7 = 0.96 µm in CIAL, M5−8
= 0.96 µm in LCTL, SCSDL, and DCSTL) gates of the other families, generally by a
factor of 2 or more. To further emphasize the improvement in reliability between TLL and
previous implementations of differential threshold logic, Monte Carlo simulations were
performed on equivalently sized representatives of each logic family under typical oper-
ating conditions. In each of 1000 samples, the threshold voltage and transconductance
of each transistor in the gate were varied independently assuming a Gaussian distri-
bution; a histogram showing the threshold voltage variation of several minimum-sized
transistors from the input networks M9 and M10 of the TLL gate is shown in Figure 4.4.
Figure 4.4: Frequency histogram of threshold voltage variations in several minimum-
sized PMOS devices in the input networks M9 and M10 of a TLL gate.
Across all minimum-sized PMOS devices in the gate, the mean threshold volt-
age deviation was 0.2 mV and the mean standard deviation was 28.5 mV; the maximum
amount of deviation for the same devices was 119.3 mV. Across all minimum-sized
NMOS devices in the gate, the mean threshold voltage deviation was 0 mV and the
mean standard deviation was 44.9 mV; the maximum amount of deviation for the same
devices was 157.3 mV. Across the sized (0.96 µm) NMOS devices in the gate, the
mean threshold voltage deviation was -0.2 mV and the mean standard deviation was
16.1 mV; the maximum amount of deviation for the same devices was 62.7 mV. All de-
vices were considered to be independent rather than correlated to provide a pessimistic
bound on reliability. In practice, device pairs of differential threshold logic elements
are carefully matched. Variations between matched pairs of devices are expected to be
positively correlated rather than independent [70], providing a more reliable scenario
than that simulated in these experiments. The number of failures recorded for each gate
is shown in Table 4.3.
Table 4.3: Number of failures out of 1000 Monte Carlo simulations exhibited by sized
CIAL, LCTL, SCSDL, DCSTL, and TLL elements by αL/αR combination for a 65 nm
LP bulk CMOS process (Vdd = 1.2V).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 301 0 29 32 0
3/2 2/3 233 5 208 96 0
4/3 3/4 210 51 400 184 0
5/4 4/5 192 130 540 282 1
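One way to gauge how much 1000 samples can say about each failure probability is to attach a confidence interval to the counts in Table 4.3. The sketch below uses the Wilson score interval at an approximate 95% level; this statistical treatment is an addition for illustration and is not part of the original experiments.

```python
import math

FAILURES = {   # Table 4.3, failures out of 1000 samples
    "CIAL": [301, 233, 210, 192], "LCTL": [0, 5, 51, 130],
    "SCSDL": [29, 208, 400, 540], "DCSTL": [32, 96, 184, 282],
    "TLL": [0, 0, 0, 1],
}
SAMPLES = 1000


def wilson_interval(failures, samples, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    p = failures / samples
    denom = 1 + z * z / samples
    center = (p + z * z / (2 * samples)) / denom
    half = z * math.sqrt(p * (1 - p) / samples
                         + z * z / (4 * samples * samples)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for family, counts in FAILURES.items():
    low, high = wilson_interval(counts[-1], SAMPLES)   # the 5/4 row
    print(f"{family:6s} 5/4: {counts[-1]:4d} failures, "
          f"interval {100 * low:.2f}% to {100 * high:.2f}%")
```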
As the results indicate, 1000 Monte Carlo samples of each gate are sufficient
to highlight the difference in robustness between TLL and other differential threshold
logic elements. The improved noise margin of the TLL gate corresponds to far fewer
failures than occur in LCTL, CIAL, SCSDL, or DCSTL gates.
4.2.2 Performance and power dissipation
Recall that sizing, while providing an increase in a gate’s noise margin, can be
quite costly in terms of area, performance, and/or power dissipation. Tables 4.4 and 4.5
demonstrate how the delay of the TLL element compares to other differential threshold
logic implementations assuming the same amount of sizing is performed in all cases.
The same data is plotted in Figure 4.5.
Table 4.4: Evaluation delay of minimum-sized CIAL, LCTL, SCSDL, DCSTL, and
TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process.
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 277.0 ps 83.1 ps 80.0 ps 70.6 ps 74.8 ps
3/2 2/3 254.7 ps 95.0 ps 88.8 ps 78.9 ps 73.7 ps
4/3 3/4 254.1 ps 100.3 ps 95.9 ps 85.5 ps 76.6 ps
5/4 4/5 255.9 ps 101.6 ps 102.0 ps 90.7 ps 80.5 ps
Table 4.5: Evaluation delay of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, DCSTL,
and TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process.
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 87.9 ps 59.3 ps 73.9 ps 62.1 ps 65.0 ps
3/2 2/3 79.2 ps 67.0 ps 77.0 ps 59.7 ps 56.5 ps
4/3 3/4 77.0 ps 72.3 ps 81.0 ps 60.9 ps 53.5 ps
5/4 4/5 76.3 ps 75.0 ps 84.8 ps 63.0 ps 53.4 ps
As the tables and plot indicate, TLL exhibits a delay that is roughly comparable
to that exhibited by LCTL, SCSDL, and DCSTL, and far superior to CIAL. Tables 4.6
and 4.7 demonstrate how the power dissipation of the TLL element compares to other
differential threshold logic implementations assuming the same amount of sizing is
performed in all cases. The same data is plotted in Figure 4.6.
Table 4.6: Power dissipation of minimum-sized CIAL, LCTL, SCSDL, DCSTL, and
TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process (1 GHz
clock).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 17.49 µW 10.96 µW 7.73 µW 8.14 µW 12.47 µW
3/2 2/3 17.61 µW 12.11 µW 8.13 µW 8.48 µW 12.79 µW
4/3 3/4 17.65 µW 12.81 µW 8.41 µW 8.76 µW 13.05 µW
5/4 4/5 17.69 µW 13.35 µW 8.63 µW 8.98 µW 13.28 µW
Figure 4.5: Evaluation delay of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, DCSTL, and TLL elements assuming max(αL, αR) - min(αL, αR) = 1.
Table 4.7: Power dissipation of sized (M5−8 = 0.96 µm) CIAL, LCTL, SCSDL, DCSTL,
and TLL elements by αL/αR combination for a 65 nm LP bulk CMOS process (1
GHz clock).
αL/αR (onset) αL/αR (offset) CIAL LCTL SCSDL DCSTL TLL
2/1 1/2 29.28 µW 14.50 µW 13.45 µW 13.69 µW 19.03 µW
3/2 2/3 29.54 µW 16.07 µW 14.02 µW 13.96 µW 19.13 µW
4/3 3/4 29.69 µW 17.10 µW 14.50 µW 14.25 µW 19.21 µW
5/4 4/5 29.81 µW 17.84 µW 14.88 µW 14.44 µW 19.68 µW
As the tables and plot indicate, while superior to the CIAL element, TLL gates
dissipate noticeably more power than LCTL, SCSDL, or DCSTL gates. Since TLL gates
propagate the clock signal through the input networks, the clock pin in the TLL element
observes a much larger load. As a consequence, a larger clock buffer is required to drive
the gate, resulting in higher power dissipation. While TLL dissipates more power than
other differential threshold logic implementations, it is notable that the minimum-sized
TLL gate provides superior noise margin to the sized LCTL, SCSDL, and DCSTL
gates. In addition, the minimum-sized TLL gate dissipates less power than
the sized CIAL, LCTL, SCSDL, and DCSTL gates, on average by
a factor of 1.15.
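The power comparison between the minimum-sized TLL gate (Table 4.6) and the sized gates of the other families (Table 4.7) can be recomputed from the tabulated values. The sketch below averages the row-by-row ratios; how the quoted average was originally formed is not stated, so the two groupings shown here are assumptions made for illustration.

```python
TLL_MINIMUM_UW = [12.47, 12.79, 13.05, 13.28]   # Table 4.6, TLL column
SIZED_UW = {                                     # Table 4.7, non-TLL columns
    "CIAL": [29.28, 29.54, 29.69, 29.81],
    "LCTL": [14.50, 16.07, 17.10, 17.84],
    "SCSDL": [13.45, 14.02, 14.50, 14.88],
    "DCSTL": [13.69, 13.96, 14.25, 14.44],
}


def mean_reduction(families):
    """Mean of (sized gate power) / (minimum-sized TLL power) over the given families."""
    ratios = [other / tll
              for family in families
              for other, tll in zip(SIZED_UW[family], TLL_MINIMUM_UW)]
    return sum(ratios) / len(ratios)


without_cial = mean_reduction(["LCTL", "SCSDL", "DCSTL"])
with_cial = mean_reduction(["CIAL", "LCTL", "SCSDL", "DCSTL"])
print(f"mean reduction over LCTL, SCSDL, DCSTL: {without_cial:.3f}x")
print(f"mean reduction including CIAL:          {with_cial:.3f}x")
```

With these inputs, averaging over LCTL, SCSDL, and DCSTL gives a value close to the factor quoted in the text, while including CIAL gives a noticeably larger one.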
Figure 4.6: Power dissipation of minimum-sized and sized (M5−8 = 0.96 µm) CIAL,
LCTL, SCSDL, DCSTL, and TLL elements assuming max(αL, αR) - min(αL, αR) = 1.
4.3 Masking the reset phase
Depending on how the TLL gate is used, it is often useful to mask the reset
phase from the following logic gates. During the reset phase, the TLL gate outputs
no useful information; both outputs Vout and V̄out register a logic 0, regardless of the
current or previous states of the input signals. If one or both of these output signals
feed combinational logic gates, reset of the TLL gate will trigger power-dissipating
transitions in the downstream logic even if the output maintains a constant value during
evaluation. Such transitions are wasteful, as they dissipate dynamic power without
performing useful computations.
Since the TLL gate stores no useful information during the reset phase, it can be
beneficial to add a slave latch to the output of the TLL gate. This preserves the evaluated
outputs of the logic element beyond the falling edge of the clock while preventing the
power dissipating transitions resulting from the reset of the gate from propagating to
any downstream logic, as shown in Figure 4.7.
[Waveform plot: voltage V versus time t (s) for clk, N5, N6, N1, N2, Vout, and V̄out]
Figure 4.7: Evaluation waveforms of the TLL gate augmented with slave latch at output,
assuming αL/αR toggling between 5/4 and 4/5.
While such a transformation can potentially save power in the logic following
the TLL element, there is a cost incurred within the gate itself. The additional latch in
series with the output of the gate will increase area, delay, and power dissipation. The
amount of each is determined by the design of the latch itself. There are many possible
approaches; one possible design is the augmentation of the TLL gate with an SR latch,
shown in Figure 4.8.
[Schematic labels: transistors M1-M12; nodes N1-N6; inputs in0 ... in2n-1; clk; latched outputs Vout and V̄out]
Figure 4.8: Device-level schematic of the threshold logic latch (TLL) augmented with
an SR slave latch.
The SR latch operates asynchronously, latching its outputs after evaluation completes
and holding its values during reset. During reset, both inputs to the SR latch
register a logic 0, causing both NAND gates of the latch to maintain their current values.
Following evaluation, one and only one of the two inputs to the latch will rise to
a logic 1, forcing the outputs of the SR latch to one of two stable states. At no point
during the TLL gate’s operation will both inputs to the SR latch register a logic 1,
so the outputs of the latch for that input configuration are
inconsequential. The truth table for the SR latch is given in Table 4.8.
Table 4.8: Truth table of the SR latch with inputs S and R and outputs Q and Q̄.
S R Qnext Q̄next
0 0 Q Q̄
0 1 0 1
1 0 1 0
1 1 - -
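The behavior in Table 4.8 can be expressed as a small next-state function. This is a purely behavioral sketch of the truth table, not a model of the NAND-level implementation; the function name and the use of None for the unused S = R = 1 row are illustrative choices.

```python
def sr_latch_next(s, r, q):
    """Return (Q_next, Q_bar_next) per Table 4.8, or None for the unused S = R = 1 row."""
    if s and r:
        return None          # never occurs during TLL operation
    if s:
        return 1, 0          # set
    if r:
        return 0, 1          # reset
    return q, 1 - q          # hold the previous state


# Hold during reset of the TLL gate, then capture a new evaluation result.
state = sr_latch_next(0, 0, q=0)
print("hold:", state)
print("set: ", sr_latch_next(1, 0, q=state[0]))
```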
While simple and reasonably small, the SR latch can be quite slow, because
falling transitions on the SR latch require propagation through both
NAND gates comprising the latch. Assuming Vout and V̄out are latched at logic 0 and
logic 1, respectively, consider the case where a new evaluation causes N1 to discharge
faster than N2. In this case, the falling node N1 prompts the leftmost NAND gate in the
SR latch to charge. This, in turn, causes the rightmost NAND gate to discharge,
which toggles the final inverter, switching Vout from logic 0 to logic 1. The total
delay from the discharge event on the differential amplifier to the final transition on the
output node thus includes two NAND gate delays and one inverter delay. If greater performance
is required, a pair of positive D latches can be used to augment the TLL gate instead of
the SR latch, as shown in Figure 4.9.
[Schematic labels: transistors M1-M12; nodes N1-N6; inputs in0 ... in2n-1; clk; clocked D latches driving outputs Vout and V̄out]
Figure 4.9: Device-level schematic of the threshold logic latch (TLL) augmented with
D slave latches.
The D latch is transparent when the clock signal clk is high, and holds its value
whenever clk is low. Because the D latches are not coupled with each other,
the propagation delay of a D latch, approximately two inverter delays, is smaller than
that of an SR latch. However, D latches are more expensive in terms of area and power
dissipation. Augmenting a TLL gate with two D latches adds another 20 transistors,
versus 8 if an SR latch is used. Additionally, the D latches, unlike the SR latch, are
clocked, and thus increase clock load and power dissipation. The difference in evaluation
delay and power dissipation between TLL elements with different output slave
latches is demonstrated in Tables 4.9 and 4.10.
Table 4.9: Evaluation delay of sized (M5−8 = 0.96 µm) TLL elements with different
output slave latches by αL/αR combination for a 65 nm LP bulk CMOS process.
αL/αR (onset) αL/αR (offset) No latch SR latch D latch
2/1 1/2 65.0 ps 127.7 ps 105.5 ps
3/2 2/3 56.5 ps 119.3 ps 96.7 ps
4/3 3/4 53.5 ps 117.9 ps 94.0 ps
5/4 4/5 53.4 ps 119.9 ps 94.0 ps
Table 4.10: Power dissipation of sized (M5−8 = 0.96 µm) TLL elements with different
output slave latches by αL/αR combination for a 65 nm LP bulk CMOS process (1 GHz
clock).
αL/αR (onset) αL/αR (offset) No latch SR latch D latch
2/1 1/2 19.03 µW 22.06 µW 25.35 µW
3/2 2/3 19.13 µW 22.34 µW 25.47 µW
4/3 3/4 19.21 µW 22.45 µW 25.77 µW
5/4 4/5 19.68 µW 22.72 µW 25.90 µW
As the tables indicate, adding an SR latch to the output of a TLL gate increases
the total delay of the logic element by between 1.96x and 2.24x while increasing power
dissipation by 15-16%. Use of D latches instead of an SR latch improves delay by
1.21-1.27x, although power dissipation increases an additional 13-14%.
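The overheads quoted above can be recomputed from Tables 4.9 and 4.10. The sketch below forms row-by-row ratios between latch options; the data are copied from the tables, and small differences from the quoted ranges may arise from rounding or from the baseline chosen for the percentages.

```python
DELAY_PS = {   # Table 4.9
    "none": [65.0, 56.5, 53.5, 53.4],
    "sr": [127.7, 119.3, 117.9, 119.9],
    "d": [105.5, 96.7, 94.0, 94.0],
}
POWER_UW = {   # Table 4.10
    "none": [19.03, 19.13, 19.21, 19.68],
    "sr": [22.06, 22.34, 22.45, 22.72],
    "d": [25.35, 25.47, 25.77, 25.90],
}


def ratios(table, numerator, denominator):
    """Row-by-row ratio between two latch options of the same table."""
    return [n / d for n, d in zip(table[numerator], table[denominator])]


def span(values):
    return f"{min(values):.2f}x to {max(values):.2f}x"


print("SR latch delay vs. no latch:", span(ratios(DELAY_PS, "sr", "none")))
print("SR latch power vs. no latch:", span(ratios(POWER_UW, "sr", "none")))
print("SR latch delay vs. D latch: ", span(ratios(DELAY_PS, "sr", "d")))
print("D latch power vs. SR latch: ", span(ratios(POWER_UW, "d", "sr")))
```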
4.4 Comparison with static CMOS
A TLL gate augmented with a slave latch at the output is functionally equivalent
to a combinational threshold function feeding a positive edge-triggered flip-flop. The
total delay τtotal of combinational logic plus a flip-flop is given in Equation 4.11.
\[ \tau_{total} = \tau_{comb.} + \tau_{FF,setup} + \tau_{FF,clk2q} \tag{4.11} \]
In the equation, τcomb. is the propagation delay of the combinational logic
element(s), τFF,setup is the setup time of the flip-flop, and τFF,clk2q is the clock-to-output
propagation delay of the flip-flop. To demonstrate how the TLL gate augmented with an SR latch
compares to combinational logic preceding a flip-flop, gates implementing a variety
of threshold functions were designed and simulated, including 3, 5, and 7-input AND,
OR, and MAJORITY functions.
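For reference, the three function types named above can all be written as unit-weight threshold functions that differ only in the threshold T. The sketch below assumes the usual definitions (T = 1 for OR, T = n for AND, and T = (n + 1)/2 for an odd-input MAJORITY); the function names are illustrative.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Output 1 when the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def or_gate(inputs):
    return threshold_gate(inputs, [1] * len(inputs), 1)


def and_gate(inputs):
    return threshold_gate(inputs, [1] * len(inputs), len(inputs))


def majority_gate(inputs):
    # For an odd number of inputs n, the threshold is (n + 1) / 2.
    return threshold_gate(inputs, [1] * len(inputs), len(inputs) // 2 + 1)


# Exhaustively compare against the Boolean definitions for 3, 5, and 7 inputs.
for n in (3, 5, 7):
    for bits in product((0, 1), repeat=n):
        assert or_gate(bits) == int(any(bits))
        assert and_gate(bits) == int(all(bits))
        assert majority_gate(bits) == int(sum(bits) > n / 2)
print("all threshold definitions agree with their Boolean counterparts")
```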
These functions were chosen for comparison because they represent
the two extremes of the advantages and disadvantages of using TLL elements.
Gates implementing AND and OR with up to 4 or 5 inputs are common primitive
structures in static CMOS, and are implemented very efficiently. The MAJORITY
function, on the other hand, is extremely difficult to implement efficiently using static
CMOS, particularly if the fan-in is large. The difficulty of implementing
any threshold function using static CMOS will lie somewhere between that
of an AND or OR function and that of a MAJORITY
function. A comparison between the gate area, delay, and power dissipation of each is
provided in Table 4.11.
Table 4.11: Area, delay, and power dissipation comparison between TLL and static
CMOS.
Function Area (CMOS) Area (TLL) Delay (CMOS) Delay (TLL) Power (CMOS) Power (TLL)
3OR 0.2916 µm2 0.5076 µm2 168.8 ps 123.1 ps 20.27 µW 20.64 µW
5OR 0.3996 µm2 0.5652 µm2 199.9 ps 127.2 ps 22.98 µW 22.06 µW
7OR 0.5436 µm2 0.6228 µm2 210.7 ps 131.4 ps 28.14 µW 23.59 µW
3MAJ 0.4212 µm2 0.4788 µm2 198.3 ps 120.8 ps 25.82 µW 19.76 µW
5MAJ 1.6164 µm2 0.5076 µm2 237.9 ps 116.5 ps 89.48 µW 20.75 µW
7MAJ 7.5348 µm2 0.5364 µm2 298.5 ps 117.2 ps 356.17 µW 21.85 µW
3AND 0.2700 µm2 0.5076 µm2 167.9 ps 123.1 ps 18.55 µW 20.64 µW
5AND 0.3780 µm2 0.5652 µm2 197.8 ps 127.2 ps 22.23 µW 22.06 µW
7AND 0.4860 µm2 0.6228 µm2 205.7 ps 131.4 ps 26.17 µW 23.59 µW
The functions demonstrated using static CMOS in the table were designed such
that the logic depth of the function was minimized. For many of the functions pre-
sented, alternative solutions exist which are slower but also lower in area and power
dissipation. As the table indicates, static CMOS is preferable for simple functions, such
as the 3-input AND and OR gates. As fan-in and/or complexity increase, however, so
too do the advantages of TLL.
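The trend described above can be read off Table 4.11 by dividing each static CMOS figure by its TLL counterpart. The sketch below does this for every function in the table; the data are copied from the table and the helper names are illustrative.

```python
# (CMOS area, TLL area) in um^2, (CMOS delay, TLL delay) in ps,
# (CMOS power, TLL power) in uW, from Table 4.11.
TABLE_4_11 = {
    "3OR": (0.2916, 0.5076, 168.8, 123.1, 20.27, 20.64),
    "5OR": (0.3996, 0.5652, 199.9, 127.2, 22.98, 22.06),
    "7OR": (0.5436, 0.6228, 210.7, 131.4, 28.14, 23.59),
    "3MAJ": (0.4212, 0.4788, 198.3, 120.8, 25.82, 19.76),
    "5MAJ": (1.6164, 0.5076, 237.9, 116.5, 89.48, 20.75),
    "7MAJ": (7.5348, 0.5364, 298.5, 117.2, 356.17, 21.85),
    "3AND": (0.2700, 0.5076, 167.9, 123.1, 18.55, 20.64),
    "5AND": (0.3780, 0.5652, 197.8, 127.2, 22.23, 22.06),
    "7AND": (0.4860, 0.6228, 205.7, 131.4, 26.17, 23.59),
}


def cmos_over_tll(name):
    """Return (area, delay, power) ratios of static CMOS to TLL; above 1 favors TLL."""
    a_c, a_t, d_c, d_t, p_c, p_t = TABLE_4_11[name]
    return a_c / a_t, d_c / d_t, p_c / p_t


for name in TABLE_4_11:
    area, delay, power = cmos_over_tll(name)
    print(f"{name:5s} area {area:6.2f}x  delay {delay:5.2f}x  power {power:6.2f}x")

tll_smaller = [n for n in TABLE_4_11 if cmos_over_tll(n)[0] > 1]
print("functions where TLL is smaller in area:", tll_smaller)
```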
4.5 Comparison with domino CMOS
Like TLL gates, domino CMOS elements are often used to speed up complex
functions that would be relatively slow implemented as static CMOS. Domino gates
eliminate the large pull-up network present in many static CMOS implementations,
providing fast implementations of wide OR gates, one-hot encoded multiplexers, and
similar functions. A typical domino CMOS gate is shown in Figure 4.10.
[Schematic labels: clk; inputs a, b, c, d, e, f, g; internal node x; output y]
Figure 4.10: Domino CMOS implementation of a 7-input OR function (y = a + b + c +
d + e + f + g).
To demonstrate how the TLL gate augmented with an SR slave latch compares
to a domino CMOS gate augmented with a slave latch [7], gates implementing a variety
of threshold functions were designed and simulated, including 3, 5, and 7-input AND,
OR, and MAJORITY functions. A comparison in the gate area, delay, and power
dissipation of each is provided in Table 4.12.
As the table indicates, domino logic is actually superior to TLL for threshold
functions with small values of T, specifically the OR functions (where T = 1). TLL
gates provide a faster implementation of more complex threshold functions, however.
Table 4.12: Area, delay, and power dissipation comparison between TLL and domino
CMOS.
Function Area Delay Power
domino TLL domino TLL domino TLL
3OR 0.1764 µm2 0.5076 µm2 105.8 ps 123.1 ps 11.50 µW 20.64 µW
5OR 0.2052 µm2 0.5652 µm2 113.9 ps 127.2 ps 12.90 µW 22.06 µW
7OR 0.2340 µm2 0.6228 µm2 121.5 ps 131.4 ps 14.28 µW 23.59 µW
3MAJ 0.2484 µm2 0.4788 µm2 114.8 ps 120.8 ps 13.07 µW 19.76 µW
5MAJ 0.6948 µm2 0.5076 µm2 158.2 ps 116.5 ps 25.45 µW 20.75 µW
7MAJ 2.6388 µm2 0.5364 µm2 258.2 ps 117.2 ps 71.84 µW 21.85 µW
3AND 0.2340 µm2 0.5076 µm2 112.3 ps 123.1 ps 12.59 µW 20.64 µW
5AND 0.3780 µm2 0.5652 µm2 130.8 ps 127.2 ps 16.41 µW 22.06 µW
7AND 0.5796 µm2 0.6228 µm2 154.8 ps 131.4 ps 21.37 µW 23.59 µW
While fast and relatively area efficient, domino CMOS is not commonly uti-
lized in standard cell libraries due to reliability concerns. Unlike static CMOS and
TLL gates, some of the nodes within a domino element are floating during operation,
and thus particularly vulnerable to noise. Unlike logic elements where all nodes are
statically maintained, it is possible for leakage currents to charge or discharge a float-
ing node over time, resulting in a functional failure. Specifically, when the clock signal
clk rises to logic 1 after pre-charge, the internal node x of the domino gate remains
floating at logic 1 if no conducting path exists between x and ground. In Figure 4.10,
this corresponds to the input configuration in which all inputs a−g = 0. Leakage cur-
rents through the pull-down transistors, however, will eventually discharge x to logic 0
if it remains floating. When this occurs, the data dynamically stored on node x is lost,
and the gate fails. Domino gates are thus almost always implemented using keeper
circuits to counteract this effect, as shown in Figure 4.11.
The keeper provides a weak pull-up on node x; if the impedance observed be-
tween x and the supply voltage through the keeper is less than the impedance observed
between x and ground through the leaking inactive components of the pull-down net-
work, x will maintain a logic 1 even if it is left floating for an extended period of time.
This, of course, assumes that the keeper is large enough to overcome the leakage to
ground. If the keeper is too weak, the gate can fail, particularly at process corners where
the leakage currents are increased; leakage is extremely sensitive to process variations,
supply voltage, and temperature, often varying two or more orders of magnitude across
corners.
Figure 4.11: Domino CMOS implementation of a 7-input OR function (y = a + b + c + d + e + f + g) with keeper circuit.
To compensate for this effect, larger keepers can be used. The larger the keeper,
the greater the amount of leakage to ground that can be overcome. Large keepers
have their own drawbacks, however, primarily in terms of the power and delay of the
gate. The larger the keeper, the harder it is for the pull-down network to overcome
the keeper when it is supposed to discharge the gate. More troublesome still, if the
keeper is too strong, the pull-down may be unable to overcome the keeper when it is
supposed to discharge the gate, causing the gate to fail due to a stuck-at fault. While
TLL gates possess some vulnerability to local process variations, they possess none of
the vulnerabilities to global process variations present in domino gates nor a sensitivity
to leakage currents.
4.6 Error detection and correction
While TLL gates possess a far greater level of robustness than other differential
threshold logic implementations, it is still possible that noise or process variations will
result in a functional failure in the gate. A designer utilizing such gates therefore must
decide between one of two options. The first option is to design each gate such that the
probability of failure is acceptably low. The second option is the addition of structures
to the gate that permit the identification of errors as well as a mechanism for correcting
errors that are identified.
As stated in the previous chapter, every threshold logic gate implementing a
specific function possesses a small subset of identifiable input configurations defined as
the “critical configurations”. If the gate fails for any subset of input configurations, that
subset must include at least one critical configuration. An error identification mecha-
nism implemented in hardware must therefore test the gate for each critical configura-
tion. Testing for a specific input configuration necessitates the ability of the TLL gate
to assume a test mode of operation. Such a modification is provided in Figure 4.12.
Figure 4.12: Augmentation of a single input in in the input network to provide test mode capability.
In the figure, a multiplexer controlled by the signal test_mode selects between
the normal input in and the pre-selected test vector input test_in. The output of the TLL
gate fans out to additional logic that compares the gate output to a known expected
value; if the two are equivalent, the gate is operating properly. If not, an error is known
to have occurred.
If errors are to be corrected, an additional mechanism is required to actually
adjust the weights and/or threshold value of the function. An error essentially manifests
itself as a shift in the threshold value. Either a critical configuration in the onset fails,
indicating that the threshold value is slightly too high, or a critical configuration in
the offset fails, indicating that the threshold value is slightly too low. Adding active
transistors to the left input network of the gate serves to decrease the threshold value
of a function, whereas adding active transistors to the right input network of the gate
serves to increase the threshold value of the function. If the gate possesses a number
of configurable dummy devices in both input networks that can be turned on and off,
as shown in Figure 4.13, the threshold value of the function becomes adjustable in
real-time.
Figure 4.13: TLL input network with configurable dummy devices.
In the figure, the input network M9 is augmented with three additional parallel
devices that can provide a constant offset to the impedance of the input network as
determined by the signals off_1, off_2, and off_3. Following the detection of an error,
a configurable device in either the left or right input network is activated (depending on
the nature of the error) and the gate is retested. If the logic element passes, the error has
been corrected. If the failure still occurs, another device is activated and the process
repeats. This certainly involves a great deal of overhead in terms of area and power
dissipation, but allows for offline verification and correction of TLL gates.
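The detect-and-correct procedure described above can be sketched behaviorally as follows. This is an illustrative sketch only, not the circuit-level mechanism: the function names, the `offset` term standing in for a variation-induced threshold shift, and the assumption that each configurable device moves the effective threshold by half an input unit (`step=0.5`) are all hypothetical. The two test vectors are the worst case αL/αR pairs 8/7 (onset) and 6/7 (offset) of the simple assignment of {42222;7}.

```python
def gate_output(alpha_l, alpha_r, offset, left_on, right_on, step=0.5):
    # Toy model (assumption): the output is 1 when the left network is
    # effectively stronger. offset > 0 models a threshold that is too high.
    return int(alpha_l + step * left_on > alpha_r + offset + step * right_on)


def trim(critical_tests, offset, max_devices=3):
    """Activate dummy devices until every critical test passes.

    critical_tests holds (alpha_l, alpha_r, expected_output) tuples.
    Returns (left_on, right_on), or None if the error cannot be corrected.
    """
    left_on = right_on = 0
    for _ in range(2 * max_devices + 1):
        failing = [t for t in critical_tests
                   if gate_output(t[0], t[1], offset, left_on, right_on) != t[2]]
        if not failing:
            return left_on, right_on
        expected = failing[0][2]
        if expected == 1 and left_on < max_devices:
            left_on += 1    # onset failure: threshold too high, lower it
        elif expected == 0 and right_on < max_devices:
            right_on += 1   # offset failure: threshold too low, raise it
        else:
            return None     # no configurable devices left on the needed side
    return None


# Hypothetical critical tests: 8/7 must evaluate to 1, 6/7 must evaluate to 0.
tests = [(8, 7, 1), (6, 7, 0)]
print(trim(tests, offset=1.2))
print(trim(tests, offset=-1.3))
```

A gate whose shift exceeds what the three devices per network can absorb returns `None` in this sketch, corresponding to an uncorrectable gate.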
Note that such a mechanism can only correct errors that can be identified during
offline testing, such as those due to process variations or static noise. These techniques
will not protect against dynamic noise sources that only affect the gate some of the
time, as there is no guarantee that these sources will be present during testing of the
gate. Apart from identification, dynamic sources impede correction since corrections
that assist in the presence of one source of noise may create additional problems in the
presence of another.
4.7 Adding scan capability to the TLL element
A TLL element replaces some amount of combinational logic as well as a se-
quential element in a design, thus the TLL element must be able to provide all of the
same functionality a standard D flip-flop also provides. Many modern designs include
latches and flip-flops with embedded design-for-test (DFT) features, allowing the
information stored within to be scanned out and analyzed during offline testing [69].
While scan functionality may be provided to a D flip-flop by adding a multi-
plexer in series with the data input, adding such structures in series with the input(s)
of a TLL is less straightforward due to the potentially large fan-in of the gate. Modifi-
cation of the differential amplifier of the gate is highly undesirable due to the possible
ramifications upon reliability; augmentation of the input network alone is preferable.
One possible architecture integrating scan functionality with the TLL element is shown
in Figure 4.14.
Figure 4.14: Device level schematic of the TLL gate with scan functionality.
When the scan enable signal TE assumes a logic 0, both transistors M15 and
M16 are inactive (Vg = logic 1), regardless of the logic value of the scan input signal TI.
Transistors M13−14 are active (Vg = logic 0), and the gate provides its typical function.
When the scan enable signal TE assumes a logic 1, however, transistors M13−14 become
inactive and exactly one of transistors M15 and M16 becomes active, depending upon
the polarity of the scan input signal TI. As a result, exactly one of nodes N5 and N6
is charged, providing a non-contentious stimulus to the differential amplifier which
determines the appropriate output. Although no constraints are placed upon the states
of the devices comprising the input network M9 and M10 during test mode, transistors
M13−14 are inactive, ensuring that no conducting path exists between nodes N5 and N6.
Chapter 5
MODELING AND OPTIMIZATION OF THE TLL GATE
It has been shown that the TLL gate provides great benefits over other similarly
sized implementations of differential threshold logic such as SCSDL and DCSTL in
terms of noise immunity while providing superior area, performance, and power dis-
sipation to threshold logic gates scaled for the same amount of reliability. To extract
the maximum utility from this new logic family, however, there must exist models able
to accurately estimate the delay and power dissipation of a cell for the purposes of
optimization.
Most importantly, a model must be constructed for accurately estimating a cell’s
response to noise and process variations to ensure that the gate will operate as intended
during use. Optimization of a gate’s reliability through iterative simulation is an ex-
tremely computationally expensive process, and a simplified model for determining the
noise margin of a gate will cut down on the amount of simulation required to ensure
that a given TLL gate satisfies reliability constraints.
Two distinct controls are available to the designer of a TLL cell faced with the
task of optimization. First is the choice of signal assignment: the manner in which input
signals are assigned to devices in the two input networks of the gate. Different signal
assignments will provide varied responses to input configurations in terms of delay,
power, and reliability, and many determinations of optimality can be made indepen-
dently of any specific process or operating corner. The second is the definition of the
physical parameters of the gate: the gate widths and lengths of the devices composing
the cell. While the optimal physical sizing of a gate will vary from process to process,
some basic assumptions regarding the TLL element coupled with a small amount of
empirical data can be used to estimate an optimal delay sizing subject to constraints on
the power dissipation and noise margin of the gate.
5.1 Signal assignment
In any logic gate, the response of the gate, be it in terms of delay, power, or other
metric, is a function of the combination of input signals applied. The threshold logic
latch (TLL) is no different, providing a different response depending on the number
of active input signals in each of its input networks. Consider Table 5.1, which shows
the relationship between the input configuration and noise margin of a minimum-sized
16-input TLL gate. The gates were simulated using Synopsys HSpice version 2009.09
on a commercial 65 nm LP bulk CMOS process assuming a clock frequency of 1 GHz.
Delay was measured as the rising edge (Vdd/2) of clk to the falling edge (Vdd/2) of N1
or N2 (dependent upon which of the two nodes discharges first). Power was measured as
the total amount of power dissipated during the evaluation phase of the gate’s operation.
Typical operating conditions were assumed, as well as a supply voltage of 1.2 V and a
temperature of 25 °C. Input configuration is represented by the term αL/αR, where αL
represents the number of active devices in the left input network, and αR represents
the number of active devices in the right input network. Note that αL can never be
equivalent to αR, as this would result in a scenario where the differential amplifier
would be unable to resolve the output of the function.
Table 5.1: Input configuration vs. noise margin for a minimum-sized 16-input TLL
gate implemented in 65 nm LP bulk CMOS
αR
αL 0 1 2 3 4 5 6 7
0 - 508 mV 509 mV 509 mV 510 mV 510 mV 510 mV 510 mV
1 508 mV - 314 mV 371 mV 394 mV 407 mV 415 mV 420 mV
2 509 mV 314 mV - 163 mV 228 mV 260 mV 280 mV 293 mV
3 509 mV 371 mV 163 mV - 99 mV 152 mV 183 mV 204 mV
4 510 mV 394 mV 228 mV 99 mV - 65 mV 107 mV 135 mV
5 510 mV 407 mV 260 mV 152 mV 65 mV - 46 mV 80 mV
6 510 mV 415 mV 280 mV 183 mV 107 mV 46 mV - 35 mV
7 510 mV 420 mV 293 mV 204 mV 135 mV 80 mV 35 mV -
In determining the “optimality” of a signal assignment, the most important
factor to consider is the noise margin afforded by the signal assignment. Noise margin of a
TLL gate decreases quickly as the difference in impedance between the two input net-
works of the gate diminishes. While noise margin can be improved with sizing of phys-
ical parameters, such improvements are likely to increase delay and certain to increase
power dissipation. As the table demonstrates, if min[αL,αR] remains constant, the noise
margin increases monotonically as max[αL,αR] increases. Additionally, if max[αL,αR]
remains constant, the noise margin increases monotonically as min[αL,αR] decreases.
The worst case αL/αR combination will be that for which min[αL,αR] is maximized and
|αL−αR| is minimized.
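The two monotonicity observations can be checked directly against the data of Table 5.1. The short script below is only a sanity check written for illustration; the dictionary transcribes the upper triangle of the table (the table is symmetric), and the helper names are not part of the original work. Note that the check is non-strict, since the αL = 0 row contains repeated values.

```python
# Upper triangle of Table 5.1 in mV, keyed by (min[aL, aR], max[aL, aR]).
NOISE_MARGIN_MV = {
    (0, 1): 508, (0, 2): 509, (0, 3): 509, (0, 4): 510,
    (0, 5): 510, (0, 6): 510, (0, 7): 510,
    (1, 2): 314, (1, 3): 371, (1, 4): 394, (1, 5): 407, (1, 6): 415, (1, 7): 420,
    (2, 3): 163, (2, 4): 228, (2, 5): 260, (2, 6): 280, (2, 7): 293,
    (3, 4): 99, (3, 5): 152, (3, 6): 183, (3, 7): 204,
    (4, 5): 65, (4, 6): 107, (4, 7): 135,
    (5, 6): 46, (5, 7): 80,
    (6, 7): 35,
}


def margin(alpha_l, alpha_r):
    """Noise margin for an aL/aR combination, using the table's symmetry."""
    return NOISE_MARGIN_MV[(min(alpha_l, alpha_r), max(alpha_l, alpha_r))]


def non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


# Fixed min, increasing max: the margin should not decrease.
fixed_min_ok = all(
    non_decreasing([margin(lo, hi) for hi in range(lo + 1, 8)])
    for lo in range(7)
)
# Fixed max, decreasing min: the margin should not decrease.
fixed_max_ok = all(
    non_decreasing([margin(lo, hi) for lo in range(hi - 1, -1, -1)])
    for hi in range(1, 8)
)
# Entry with the smallest margin: largest min and smallest difference.
worst = min(NOISE_MARGIN_MV, key=NOISE_MARGIN_MV.get)
print(fixed_min_ok, fixed_max_ok, worst)
```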
The delay and power response of a TLL gate depend upon the αL/αR combina-
tion applied, as well; these relationships are demonstrated by Tables 5.2 and 5.3. As
the tables show, delay decreases rapidly as max[αL,αR] is increased. The delay
exhibited by the gate changes relatively little in response to changes in min[αL,αR], but
is generally higher when |αL−αR| is minimized. Power increases as a function of
min[αL,αR], sharply as min[αL,αR] increases from 0 to 1 (and contention is introduced
into the gate), then more gradually as min[αL,αR] continues to increase. Power is not
affected significantly by changes in max[αL,αR], though is generally slightly higher
when |αL−αR| is minimized.
Table 5.2: Input configuration vs. typical evaluation delay for a minimum-sized 16-
input TLL gate implemented in 65 nm LP bulk CMOS
αR
αL 0 1 2 3 4 5 6 7
0 - 70.1 ps 47.5 ps 39.5 ps 35.4 ps 32.9 ps 31.2 ps 30.0 ps
1 70.1 ps - 46.4 ps 39.1 ps 35.2 ps 32.8 ps 31.1 ps 29.9 ps
2 47.5 ps 46.4 ps - 39.3 ps 34.2 ps 31.5 ps 29.8 ps 28.7 ps
3 39.5 ps 39.1 ps 39.3 ps - 37.0 ps 32.6 ps 30.2 ps 28.7 ps
4 35.4 ps 35.2 ps 34.2 ps 37.0 ps - 36.2 ps 32.1 ps 30.0 ps
5 32.9 ps 32.8 ps 31.5 ps 32.6 ps 36.2 ps - 35.9 ps 32.2 ps
6 31.2 ps 31.1 ps 29.8 ps 30.2 ps 32.1 ps 35.9 ps - 36.0 ps
7 30.0 ps 29.9 ps 28.7 ps 28.7 ps 30.0 ps 32.2 ps 36.0 ps -
Table 5.3: Input configuration vs. typical evaluation power for a minimum-sized 16-
input TLL gate implemented in 65 nm LP bulk CMOS
αR
αL 0 1 2 3 4 5 6 7
0 - 17.04 µW 17.35 µW 17.76 µW 17.79 µW 17.96 µW 18.14 µW 18.29 µW
1 17.04 µW - 21.13 µW 21.36 µW 21.54 µW 21.71 µW 22.05 µW 22.20 µW
2 17.35 µW 21.13 µW - 21.87 µW 21.95 µW 22.30 µW 22.22 µW 22.36 µW
3 17.76 µW 21.36 µW 21.87 µW - 22.62 µW 22.53 µW 22.60 µW 22.69 µW
4 17.79 µW 21.54 µW 21.95 µW 22.62 µW - 22.94 µW 23.13 µW 23.04 µW
5 17.96 µW 21.71 µW 22.30 µW 22.53 µW 22.94 µW - 23.37 µW 23.58 µW
6 18.14 µW 22.05 µW 22.22 µW 22.60 µW 23.13 µW 23.37 µW - 23.79 µW
7 18.29 µW 22.20 µW 22.36 µW 22.69 µW 23.04 µW 23.58 µW 23.79 µW -
The worst case delay signal assignment is generally one for which max[αL,αR]
is minimized. Therefore, the “best” signal assignment overall, in terms of delay, power,
and reliability, is one for which max[αL,αR] is minimized when |αL−αR| is minimized,
and maximized otherwise.
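The statement about worst case delay can be illustrated with the data of Table 5.2. The sketch below transcribes the upper triangle of the table and groups the entries by max[αL,αR]; the variable names are illustrative only. The trend is "general" rather than strict: the worst entry for max = 7 (36.0 ps) is marginally above that for max = 6 (35.9 ps).

```python
# Upper triangle of Table 5.2 in ps, keyed by (min[aL, aR], max[aL, aR]).
DELAY_PS = {
    (0, 1): 70.1, (0, 2): 47.5, (0, 3): 39.5, (0, 4): 35.4,
    (0, 5): 32.9, (0, 6): 31.2, (0, 7): 30.0,
    (1, 2): 46.4, (1, 3): 39.1, (1, 4): 35.2, (1, 5): 32.8, (1, 6): 31.1, (1, 7): 29.9,
    (2, 3): 39.3, (2, 4): 34.2, (2, 5): 31.5, (2, 6): 29.8, (2, 7): 28.7,
    (3, 4): 37.0, (3, 5): 32.6, (3, 6): 30.2, (3, 7): 28.7,
    (4, 5): 36.2, (4, 6): 32.1, (4, 7): 30.0,
    (5, 6): 35.9, (5, 7): 32.2,
    (6, 7): 36.0,
}

# Slowest entry for each value of max[aL, aR].
worst_by_max = {
    hi: max(d for (lo, h), d in DELAY_PS.items() if h == hi)
    for hi in range(1, 8)
}
# Slowest entry overall.
slowest = max(DELAY_PS, key=DELAY_PS.get)
print(worst_by_max, slowest)
```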
For a given function the designer typically does not have any control over which
input configurations are provided to the gate. However, a single threshold function
implemented using a TLL gate can be realized using any number of signal assignments.
In previous works, the assignment of input signals has never been addressed as a means
of improving delay, power, or reliability.
Consider the threshold function {21111;4}, which can also be represented by
the Boolean expression a[b(c+d+e)+c(d+e)+de]+bcde. By definition, the output
of the function is a 1 if the sum of the weighted inputs equals or exceeds the threshold
4. Since the physical implementation of the gate contains no notion of exact equality
(there cannot be an equal number of active devices in both input networks), before any
signal assignment can be applied the weights and threshold values of the function must
be adjusted to remove the possibility of an exact equality.
Table 5.4 provides a truth table of the function {21111;4}, as well as the sum
of weighted inputs for each input configuration. As the table demonstrates, the sum of
weights can assume any integer value from 0 to 6. A sum from 0 to 3 results in an out-
put of 0, while a sum from 4 to 6 results in an output of 1. Therefore, the output of the
function is not changed for any input configuration if the threshold is reduced by 0.5.
Additionally, the inequality of the function becomes strict, as no sum of integer weights
can be exactly equivalent to 3.5. In a TLL gate, however, weights and thresholds are
implemented using discrete devices, thus it is necessary for all weights and the
threshold value to be represented as integers. After reduction of the integer threshold value by
0.5, if all weights and the threshold value are then multiplied by 2, the strict inequality
is preserved while ensuring that all weights and the threshold value are integer values.
The new representation of the function {21111;4} is thus {42222;7}.
Table 5.4: Truth table of the threshold function a[b(c+d+ e)+ c(d+ e)+de]+bcde
(21111;4)
a b c d e Σwixi y a b c d e Σwixi y
0 0 0 0 0 0 0 1 0 0 0 0 2 0
0 0 0 0 1 1 0 1 0 0 0 1 3 0
0 0 0 1 0 1 0 1 0 0 1 0 3 0
0 0 0 1 1 2 0 1 0 0 1 1 4 1
0 0 1 0 0 1 0 1 0 1 0 0 3 0
0 0 1 0 1 2 0 1 0 1 0 1 4 1
0 0 1 1 0 2 0 1 0 1 1 0 4 1
0 0 1 1 1 3 0 1 0 1 1 1 5 1
0 1 0 0 0 1 0 1 1 0 0 0 3 0
0 1 0 0 1 2 0 1 1 0 0 1 4 1
0 1 0 1 0 2 0 1 1 0 1 0 4 1
0 1 0 1 1 3 0 1 1 0 1 1 5 1
0 1 1 0 0 2 0 1 1 1 0 0 4 1
0 1 1 0 1 3 0 1 1 1 0 1 5 1
0 1 1 1 0 3 0 1 1 1 1 0 5 1
0 1 1 1 1 4 1 1 1 1 1 1 6 1
The generalization of this transformation is given in Equations 5.1 and 5.2.
Any threshold function of the form given in Equation 5.1 can be represented as given
in Equation 5.2 without altering the response of the function.
\[ f = \sum_{i=0}^{n-1} w_i x_i \ge T \tag{5.1} \]
\[ f = \sum_{i=0}^{n-1} 2 w_i x_i > 2T - 1 \tag{5.2} \]
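The equivalence of the two forms can be confirmed exhaustively for the example function {21111;4}. The following sketch, with illustrative function names that are not part of the original work, evaluates the non-strict form of Equation 5.1, the strict form of Equation 5.2, and the Boolean expression a[b(c+d+e)+c(d+e)+de]+bcde over all 32 input configurations.

```python
from itertools import product


def threshold_ge(weights, threshold, inputs):
    """Equation 5.1 form: 1 when sum(w_i * x_i) >= T."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def threshold_gt(weights, threshold, inputs):
    """Equation 5.2 form: 1 when sum(w_i * x_i) > T (strict)."""
    return int(sum(w * x for w, x in zip(weights, inputs)) > threshold)


def to_strict(weights, threshold):
    """Map {w; T} with >= to {2w; 2T - 1} with >."""
    return [2 * w for w in weights], 2 * threshold - 1


def boolean_form(a, b, c, d, e):
    """a[b(c+d+e) + c(d+e) + de] + bcde"""
    inner = (b and (c or d or e)) or (c and (d or e)) or (d and e)
    return int((a and inner) or (b and c and d and e))


weights, threshold = [2, 1, 1, 1, 1], 4
strict_weights, strict_threshold = to_strict(weights, threshold)

equivalent = all(
    threshold_ge(weights, threshold, x)
    == threshold_gt(strict_weights, strict_threshold, x)
    == boolean_form(*x)
    for x in product((0, 1), repeat=5)
)
print(strict_weights, strict_threshold, equivalent)
```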
A general signal assignment is represented by the inequality given in Equation 5.3,
where the left side of the inequality represents the assignment of the left input network
M9 and the right side of the inequality represents the assignment of the right input
network M10.
\[ k_0 + k_1 x_0 + k_2 x_1 + \dots + k_n x_{n-1} > k_{n+1} - k_{n+2} x_0 - k_{n+3} x_1 + \dots + k_{2n+1} x_{n-1} \tag{5.3} \]
In the inequality, the positive term k_i x_j represents k_i input devices driven by the
signal x_j, while the negative term −k_i x_j represents k_i input devices driven by the
complemented signal x̄_j, as Equation 5.4 demonstrates.
\[ -k_i x_j = -k_i + k_i(1 - x_j) = -k_i + k_i \bar{x}_j \tag{5.4} \]
A positive constant ki represents ki input devices driven by a constant logic 0,
while a negative constant ki represents ki input devices in the opposite input network
driven by a constant logic 0. For example, the function {42222;7} with the signal
assignment 4a+2b+2c+2d+2e > 7 is represented as given in Table 5.5.
Table 5.5: Simple signal assignment 4a+2b+2c+2d+2e > 7 of the threshold func-
tion {42222;7}
Left input network a a a a b b c c d d e e
Right input network 1 1 1 1 1 1 1 0 0 0 0 0
αL/αR 12/7, 10/7, 8/7, 6/7, 4/7, 2/7, 0/7
Min. worst cases 8/7, 6/7
Max. worst cases 8/7, 6/7
Using this signal assignment, the gate requires 12 inputs per input network. The
worst case input configurations in the onset of the function are those for which αL/αR
= 8/7, and the worst case input configurations in the offset of the function are those for
which αL/αR = 6/7.
The number of inputs per input network is a key determinant of the area, delay,
and power dissipation of the gate. A gate with a large number of inputs naturally
requires a larger cell area than a gate with a smaller number of inputs. Figures 5.1
and 5.2 show the relationship between number of inputs per input network, delay, and
power dissipation over a fixed set of αL/αR combinations.
Figure 5.1: Evaluation delay vs. number of inputs per input network by αL/αR combi-
nation.
As the figures demonstrate, a small number of inputs is favored both in terms of
delay and power dissipation. Regardless of whether inputs are active or inactive, they
contribute capacitance to nodes N5 and N6 in the TLL gate. The number of inputs is
not fixed for a given function, and can in fact be altered through re-assignment of the
input signals.
It is in fact possible to represent any function with a signal assignment that
uses no active (0) dummy devices; such an implementation will always have a smaller
Figure 5.2: Average evaluation power vs. number of inputs per input network by αL/αR
combination.
number of inputs per input network than an assignment that uses active dummy devices.
For a given function, there are one or more signal assignments that satisfy this property.
The worst case αL/αR combinations will vary within this subset, however, and there is
at least one assignment that provides optimal worst cases. Formally, the inequality
presented by an n-input threshold function can be represented as given in Equation 5.5.
\[ (w_0 - k_0)x_0 + (w_1 - k_1)x_1 + \dots + (w_{n-1} - k_{n-1})x_{n-1} > T - k_0 - k_1 - \dots - k_{n-1} + k_0 \bar{x}_0 + k_1 \bar{x}_1 + \dots + k_{n-1} \bar{x}_{n-1} \tag{5.5} \]
The parameters k0,k1, ...,kn−1 and T are positive integer values. The sum T −
k0−k1− ...−kn−1 indicates the number of active dummy devices required to realize the
signal assignment. If T = k0 + k1 + ...+ kn−1, no active dummy devices are required.
The total number of inputs required by an input network is given in Equation 5.6.
\[ \max\left[(w_0 + w_1 + \dots + w_{n-1} - k_0 - k_1 - \dots - k_{n-1}),\; T\right] \tag{5.6} \]
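For reference, the steps leading from the strict threshold inequality to Equation 5.5 can be written out as below. This is a reconstruction under the assumption that T denotes the strict (already transformed) threshold and that k_i copies of input x_i are moved from the left side to the right; the last step applies Equation 5.4 to each moved term.

```latex
\begin{align*}
\sum_{i=0}^{n-1} w_i x_i &> T \\
\sum_{i=0}^{n-1} (w_i - k_i)\,x_i &> T - \sum_{i=0}^{n-1} k_i x_i \\
 &= T - \sum_{i=0}^{n-1} k_i + \sum_{i=0}^{n-1} k_i \bar{x}_i
\end{align*}
```

Counting devices on each side gives the two arguments of Equation 5.6: the left network holds the sum of (w_i − k_i) devices, while the right network holds T − (k_0 + ... + k_{n−1}) active dummies plus k_0 + ... + k_{n−1} complemented input devices, T devices in total.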
In the signal assignment 4a+2b+2c+2d+2e> 7, T−k0−k1− ...−kn−1 = 7,
thus seven active dummy devices are required to realize the expression. The number
of devices required per input network n = max[12,7] = 12. Note that both n and the
number of active dummies are reduced through the transfer of input devices from the left
input network to the right input network. A number of assignments exist for which no
dummy devices are required; two examples are given in Tables 5.6 and 5.7.
Table 5.6: Example signal assignment #1 of the threshold function {42222;7}
Left input network a a a a b 1 1
Right input network b c c d d e e
αL/αR 5/6, 5/4, 5/2, 5/0, 4/7, 4/5, 4/3, 4/1,
1/6, 1/4, 1/2, 1/0, 0/7, 0/5, 0/3, 0/1
Min. worst cases 1/0, 0/1
Max. worst cases 5/4, 5/6
Table 5.7: Example signal assignment #2 of the threshold function {42222;7}
Left input network a a b c d 1 1
Right input network a a b c d e e
αL/αR 5/2, 5/0, 4/3, 4/1, 3/4, 3/2,
2/5, 2/3, 1/6, 1/4, 0/7, 0/5
Min. worst cases 3/2, 2/3
Max. worst cases 4/3, 3/4
Note that for both examples, R = 0 and n = max[5,7] = 7. Also note that both
gates exhibit different sets of αL/αR combinations, denoting different delay, power, and
reliability responses. The second signal assignment possesses a larger max[αL,αR]
in the min. worst case, denoting a smaller worst case delay, as well as a smaller
max[αL,αR] in the max. worst case, denoting a smaller power dissipation and higher
degree of robustness. The optimal signal assignment of a TLL gate is thus not only a
matter of how many inputs are transferred between input networks, but which inputs
are transferred. The key difference between the two signal assignments is the number
of shared units possessed by each.
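The αL/αR sets and worst cases listed in Tables 5.5 through 5.7 can be reproduced by enumerating Equation 5.5 over all input configurations. The sketch below is illustrative only: the function names are hypothetical, the vector k gives the number of devices of each input moved to the right network, and the reading of "min." and "max." worst cases as the minimum-difference combinations with the smallest and largest max[αL,αR] in the onset and offset is an interpretation inferred from the tables.

```python
from itertools import product


def alpha_combinations(weights, threshold, k):
    """Enumerate the reachable (alpha_L, alpha_R) pairs of Equation 5.5."""
    dummies = threshold - sum(k)  # active dummy devices in the right network
    combos = set()
    for x in product((0, 1), repeat=len(weights)):
        alpha_l = sum((w - ki) * xi for w, ki, xi in zip(weights, k, x))
        alpha_r = dummies + sum(ki * (1 - xi) for ki, xi in zip(k, x))
        combos.add((alpha_l, alpha_r))
    return combos


def inputs_per_network(weights, threshold, k):
    """Equation 5.6."""
    return max(sum(weights) - sum(k), threshold)


def worst_cases(combos):
    """Minimum-difference pairs with the smallest and largest max[aL, aR]."""
    gap = min(abs(l - r) for l, r in combos)
    tight = [c for c in combos if abs(c[0] - c[1]) == gap]
    onset = [c for c in tight if c[0] > c[1]]    # output 1
    offset = [c for c in tight if c[0] < c[1]]   # output 0
    return {
        "min": (min(onset, key=max), min(offset, key=max)),
        "max": (max(onset, key=max), max(offset, key=max)),
    }


W, T = [4, 2, 2, 2, 2], 7
K_SIMPLE = [0, 0, 0, 0, 0]   # Table 5.5
K_EX1 = [0, 1, 2, 2, 2]      # Table 5.6
K_EX2 = [2, 1, 1, 1, 2]      # Table 5.7

for k in (K_SIMPLE, K_EX1, K_EX2):
    combos = alpha_combinations(W, T, k)
    print(inputs_per_network(W, T, k), len(combos), worst_cases(combos))
```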
A shared unit is defined as an input transistor that is part of a shared pair. A
shared pair is defined as a pair of input transistors in opposite input networks such that
the input state of one device is dependent upon the state of the other; for instance,
if the input signal controlling one such transistor is the term a and the input signal
controlling the other is its complement ā. A shared pair implies one-to-one matching between
dependent devices. While a single device may have an input that shares a dependency
with several devices in the opposite network, no transistor in either input network is
defined as belonging to more than one shared pair.
An unshared unit is defined as an input transistor that is not part of a shared pair;
for instance, if the input signal controlling a transistor is the term b and no transistor
controlled by the complement b̄ exists in the opposite network that is not already a part of a
shared pair.
Since an unshared unit has no dependencies, there are no restrictions on whether
or not it can contribute to the number of active devices in a network. A shared unit may
only contribute to the number of active devices in a network if the other member of the
shared pair is not, and vice versa. As a result, the set of possible αL/αR combinations for
a gate is reduced by maximizing the number of shared units in the gate. The maximum
number of shared pairs a TLL gate may possess is given by Equation 5.7, assuming a
signal assignment in which no active dummy devices are present.
\[ \min\left[(w_0 + w_1 + \dots + w_{n-1} - T),\; T\right] \tag{5.7} \]
In the previous examples, the first signal assignment given possesses a single
shared pair (b and b̄). The second assignment possesses the maximum number of
possible shared pairs (a and ā, a and ā, b and b̄, c and c̄, and d and d̄). Such a signal
assignment is the most robust representation of the function possible, with delay opti-
mized as a secondary goal. Choice of signal assignment is only half of the optimization
problem, however. Once an assignment has been chosen, the physical parameters of the
gate must be optimized to minimize worst case delay across all possible combinations
of αL/αR.
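The shared pair counts quoted for the two example assignments can be checked with a short calculation. In the sketch below (illustrative names, with k_i again denoting the number of devices of input i placed in the right network), each input contributes min(w_i − k_i, k_i) shared pairs, since its left devices can be matched one-to-one with its complemented right devices; the total is compared with the bound of Equation 5.7.

```python
def shared_pairs(weights, k):
    """Number of shared pairs in an assignment with no active dummy devices.

    Input i drives (w_i - k_i) devices in the left network and k_i
    complemented devices in the right network.
    """
    return sum(min(w - ki, ki) for w, ki in zip(weights, k))


def max_shared_pairs(weights, threshold):
    """Equation 5.7."""
    return min(sum(weights) - threshold, threshold)


W, T = [4, 2, 2, 2, 2], 7
K_EX1 = [0, 1, 2, 2, 2]   # example assignment #1
K_EX2 = [2, 1, 1, 1, 2]   # example assignment #2

print(shared_pairs(W, K_EX1), shared_pairs(W, K_EX2), max_shared_pairs(W, T))
```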
5.2 Physical parameters
The TLL gate is constructed of two parallel networks of input transistors, each
triggering one branch of a differential amplifier. A single TLL gate can be quite large,
potentially constructed from dozens of individual transistors. To simplify the problem
of sizing the devices in the gate, they are separated into “sizing groups”. Each member
of a sizing group plays a role similar to the other members of the group, thus all are
sized similarly. Five such groups exist, shown in an annotated schematic of the TLL
element in Figure 5.3.
Figure 5.3: TLL element annotated with physical parameter sizing groups: DP, DN, IP, IN, and X
The sizing group DP refers to the differential PMOS devices M1−4. These de-
vices pre-charge nodes N1 and N2 in the amplifier during the reset phase of operation
and actively work against the discharge of nodes N1 and N2 during contentious eval-
uation. The sizing group DN refers to the differential NMOS devices M5−8. These
devices determine the rate of discharge of nodes N1 and N2 during evaluation. The siz-
ing group IP refers to the input network PMOS devices composing networks M9 and
M10. It is these devices through which the clock signal propagates during evaluation,
charging nodes N5 and N6. The sizing group IN refers to the input network NMOS
devices M11 and M12, which discharge nodes N5 and N6 during reset. Finally, sizing
group X refers to the load inverters separating nodes N1 and N2 from Vout and its complement.
Assuming a single Vt process, each sizing group has two controls available to
it, gate width and gate length. An increase in gate width increases the drain, gate,
and source capacitance of the transistor, increasing the power dissipated when the re-
spective nodes are charged or discharged. At the same time, an increase in gate width
reduces the drain-source impedance of the device while it is active, reducing the delay
of any charging or discharging activity through the device. An increase in gate length,
however, increases both capacitance and impedance, and thus always increases both
delay and power dissipation. Since optimization of delay constrained by power dis-
sipation is our goal, the gate widths of each sizing group are adopted as optimization
parameters, while the gate lengths are ignored. The control parameters are thus WDP,
WDN , WIP, and WIN . The parameter X is not used as a control parameter, as the choice
of X is generally given by the required load of the TLL gate rather than the choice of
the designer.
The following subsections detail separate models for the delay, power, and
reliability of a TLL element as they relate to these five parameters. Simulations conducted
for the following subsections were performed using Synopsys HSpice 2009.09 and a
commercial 65 nm LP bulk CMOS design kit operating under typical conditions with
a supply voltage of 1.2 V and a temperature of 25 °C. Gate width sizings were applied as
multiples of the minimum gate width permitted by the process (120 nm).
5.2.1 Delay modeling
According to the Elmore delay model, the charging or discharging delay of a
node τ increases linearly with respect to the impedance Z and capacitance C according
to the function τ = Z ∗C. According to the model, if multiple capacitive nodes ex-
ist along a single charge or discharge path, the delays are computed individually and
summed to determine the total delay. The TLL element possesses two phases of opera-
tion: reset and evaluation. Each phase utilizes different charging and discharging paths,
and thus must be modeled separately.
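As a concrete illustration of the summation described above, the following sketch computes the Elmore delay of a simple RC ladder, in which each node's capacitance is multiplied by the total impedance between that node and the driving source. The function name and the numeric values are illustrative assumptions and are not taken from the TLL models developed in this chapter.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay of an RC ladder.

    resistances[i] is the series impedance of stage i and capacitances[i]
    is the capacitance of the node following that stage. Each node
    contributes (total upstream impedance) * (node capacitance).
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r
        delay += upstream_r * c
    return delay


# Single node: tau = Z * C.
single = elmore_delay([2.0], [5.0])
# Two nodes along one path: 1*3 + (1+2)*4.
ladder = elmore_delay([1.0, 2.0], [3.0, 4.0])
print(single, ladder)
```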
5.2.1.1 Evaluation delay
The evaluation delay of the TLL element is the interval between the rising edge of
the clock signal and the falling edge of the appropriate output node of the differential
amplifier N1 or N2. The evaluation delay responds differently to increases in each
physical parameter, as shown in Figure 5.4.
Figure 5.4: Evaluation delay vs. parameter sizing for minimum-sized 16-input TLL
gate assuming an αL/αR combination of 1/0.
As the figure shows, evaluation delay increases roughly linearly with respect to
parameters WDP, WIN , and X . This is expected, as the devices of these groups contribute
only to the capacitance of the evaluation charge and discharge paths. An increase in
the parameter WDN decreases delay to a single global minimum, while delay decreases
asymptotically to some minimum value as WIP is increased.
While increasing the gate widths of the input devices WIP can improve the delay
of the gate by reducing the input network propagation delay, the cost of increasing WIP
can be prohibitive in terms of power dissipation. As indicated by Figure 5.2 in the
previous section, power dissipation increases linearly as the capacitance of nodes N5
and N6 increases due to additional inputs. An increase in WIP essentially multiplies
the number of input devices per network n; a 16-input network with WIP = 2 provides
approximately the same capacitance as a 32-input network with WIP = 1. Due to the
exorbitant costs and diminishing returns on sizing of WIP, it is recommended that the
gate widths of the input network devices remain minimum sized during optimization.
The delay of the evaluation phase can essentially be represented as two distinct
charging and discharging events. The first component of the delay is the input network
propagation delay. As the clock signal rises, the clock propagates through the net-
work(s) of parallel input transistors M9 and M10. This results in the charging of nodes
N5 and N6, respectively, which trigger the differential amplifier. Figure 5.5 shows an
RC representation of input network propagation delay.
[Figure 5.5 schematic labels: clk, ZIP/α, CIP, CDN, CDP, CIN, node N5 or N6]
Figure 5.5: RC network representation of input network propagation delay.
Each input network assumes a variable impedance based on the number of ac-
tive devices in the network; the relationship between the two is inversely proportional.
The capacitance charged by each input network is provided by the gate capacitances
of the differential pull-up and pull-down devices as well as the drain capacitances of
the input devices and discharge devices. A simple RC-based expression for the input
network propagation delay is given by Equation 5.8.
\[ \tau_{input} = \min\left(\frac{1}{\alpha_L}, \frac{1}{\alpha_R}\right)\left[d_0 + d_1 n + d_2 W_{DP} + d_3 W_{DN} + d_4 W_{IN}\right] \tag{5.8} \]
In the equation, the coefficients d0, d1, d2, d3, and d4 are process and corner
dependent, empirically derived from HSpice simulation data. The minimum of 1/αL and
1/αR is selected because the second component of the evaluation delay, the
differential discharge delay, is triggered by the faster of the two input networks. The
differential amplifier will begin to discharge as soon as either node N5 or N6 charges to
a sufficient potential. If node N5 charges first, node N1 will begin to discharge through
devices M5 and M7; if node N6 charges first, node N2 will begin to discharge through
devices M6 and M8. Figure 5.6 shows an RC representation of the differential discharge
delay.
[Figure 5.6 schematic labels: CDP, CDN, CX, ZDN, node N1 or N2]
Figure 5.6: RC network representation of differential discharge delay.
While the input configuration applied to the gate does not directly affect the
discharge delay of the differential amplifier, it does affect the amplifier’s performance
indirectly. Once the slower of the two input networks completes propagation, it will
begin to discharge its corresponding branch of the differential amplifier as well, creating
contention. As one branch of the amplifier begins to discharge, it attempts to hinder
the progress of the other branch via devices M2−3 and M5−6. The smaller the difference
in propagation delay between the two input networks, the closer the initial discharge
times of the two branches will be, and the greater the magnitude of the contention between
them. As contention increases, the amplifier requires a greater amount of time
to resolve the conflict and complete the discharge of one of its two branches [51]. An
RC-based expression for the differential discharge delay with contention is given by
Equations 5.9 and 5.10.
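\[ \tau_{diff} = \frac{d_5 + \dfrac{d_6 + d_7 W_{DP} + d_8 X}{W_{DN}}}{1 - \beta\,\Delta\tau_{input}^{-\gamma}} \tag{5.9} \]
\[ \Delta\tau_{input} = \frac{|\alpha_L - \alpha_R|}{\alpha_L \alpha_R}\left[d_0 + d_1 n + d_2 W_{DP} + d_3 W_{DN} + d_4 W_{IN}\right] \tag{5.10} \]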
In the equations, the coefficients d5, d6, d7, d8, β and γ are process and corner
dependent, empirically derived from HSpice simulation data. The coefficients d0, d1,
d2, d3, and d4 are the same coefficients derived for the input network propagation delay
τinput. Note that in the case where there is no contention in the gate (αL = 0 or αR = 0),
∆τinput approaches infinity and the denominator of τdiff approaches 1,
reducing τdiff to the simple RC expression in the numerator.
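The evaluation delay model of Equations 5.8-5.10 can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function name `evaluation_delay` and its argument names are assumptions, and the coefficient values are the 65 nm LP entries reported in Table 5.10 (delays in ps, widths in sizing units).

```python
# Coefficients for the 65 nm LP process as listed in Table 5.10
# (delays in ps, widths in sizing units).
LP = {"d0": 6.827, "d1": 2.133, "d2": 2.323, "d3": 1.481, "d4": 2.578,
      "d5": 10.731, "d6": 3.651, "d7": 4.377, "d8": 4.000,
      "beta": 0.24, "gamma": 0.56}


def evaluation_delay(a_l, a_r, n=16, wdp=1, wdn=1, win=1, x=1, c=LP):
    """Return tau_input + tau_diff in ps, following Eqs. 5.8-5.10."""
    a_max, a_min = max(a_l, a_r), min(a_l, a_r)
    if a_max <= 0 or a_l == a_r:
        # The model is undefined with no active inputs or with alpha_L == alpha_R.
        raise ValueError("need at least one active input and alpha_L != alpha_R")
    # Bracketed RC term shared by Eqs. 5.8 and 5.10.
    rc_input = c["d0"] + c["d1"] * n + c["d2"] * wdp + c["d3"] * wdn + c["d4"] * win
    # Numerator of Eq. 5.9.
    rc_diff = c["d5"] + (c["d6"] + c["d7"] * wdp + c["d8"] * x) / wdn
    # min(1/alpha_L, 1/alpha_R) equals 1/max(alpha_L, alpha_R).
    tau_input = rc_input / a_max
    if a_min == 0:
        # No contention: the denominator of Eq. 5.9 is taken as 1.
        return tau_input + rc_diff
    delta_tau = abs(a_l - a_r) / (a_l * a_r) * rc_input
    tau_diff = rc_diff / (1.0 - c["beta"] * delta_tau ** (-c["gamma"]))
    return tau_input + tau_diff
```

Under these assumptions, `evaluation_delay(2, 1)` evaluates to about 47.4 ps, which is close to the 47.3 ps modeled value listed for the 2/1 configuration at WDN = 1 in Table 5.13; the small difference is attributable to rounding of the tabulated coefficients.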
5.2.1.2 Reset delay
In addition to the evaluation delay, the reset delay of the TLL element must
be modeled as well. Unlike evaluation delay, where minimization is always a goal,
optimization of reset delay depends on the context in which the TLL element exists
in the design. If the gate feeds another TLL gate or combinational logic, reset delay
must be constrained to be equivalent to or less than the evaluation delay across all input
configurations. If the gate feeds a set-reset latch, the reset delay can potentially be
much greater than the evaluation delay of the gate, constrained only by the length of the
inverted phase of the clock signal.
Like evaluation delay, reset delay can be represented as two distinct charging
and discharging events. The first component of the reset delay is the input network
discharge delay. As the clock signal falls, any charge on nodes N5 and N6 is discharged
to logic low through devices M11 and M12, respectively. The second component of the
reset delay is the charging delay of the differential amplifier. When nodes N5 and N6
are discharged, nodes N1 and N2 begin to charge to logic high through devices M1 and
M4. Unlike the differential discharge delay during evaluation, there is no contention
between the two branches of the differential amplifier during reset. The effects of
sizing each physical parameter on the reset delay are shown in Figure 5.7.
Figure 5.7: Reset delay vs. parameter sizing for minimum-sized 16-input TLL gate.
As the figure demonstrates, with the exception of X the roles of the parameters
are reversed with respect to evaluation delay. Delay increases linearly as WDN and WIP
are increased, as neither group is on a charge or discharge path during reset. There is
a single optimum value of WDP, and delay decreases asymptotically to some minimum
value as WIN is increased. The reset delay is expressed similarly to the evaluation delay,
as given by Equation 5.11.
\[ \tau_{reset} = d_9 + \frac{d_{10} + d_{11} n + d_{12} W_{DP} + d_{13} W_{DN}}{W_{IN}} + \frac{d_{14} + d_{15} W_{DN} + d_{16} X}{W_{DP}} \tag{5.11} \]
In the equation, the coefficients d9, d10, d11, d12, d13, d14, d15, and d16 are pro-
cess and corner dependent, empirically derived from HSpice simulation data. Note that
this equation assumes that none of the devices in the input networks are active. During
actual reset operation, one or more devices in an input network may be active, passing
clock low to nodes N5 and/or N6 and hastening the reset of the gate. Since no constraints
are placed on the state of the input devices during reset operation, we must assume the
worst case conditions: a network in which discharge is the sole responsibility of the
discharge devices M11 and M12.
5.2.2 Delay optimization
As the previous results indicate, an increase in the physical parameters WDP
and WIN always results in a linear increase in evaluation delay. Thus there is only one
physical parameter of significance with regard to optimization of evaluation delay: the
width of the differential NMOS devices, WDN . Figure 5.8 demonstrates how evaluation
delay responds to an increase in WDN across a range of input configurations.
Figure 5.8: Evaluation delay vs. sizing of the parameter WDN for minimum-sized 16-
input TLL gate across a range of αL/αR combinations.
As the figure indicates, there is a single optimum value of WDN for each input
configuration, although this optimal value tends to increase as max[αL, αR] increases.
Since each curve is convex with a single global minimum, this minimum can be deter-
mined by taking the first order derivative of the total delay τinput + τdi f f with respect
to WDN . For a non-contentious input configuration, the derivative is given by Equa-
tion 5.12.
\[ \frac{d\tau}{dW_{DN}} = \min\left[\frac{1}{\alpha_L}, \frac{1}{\alpha_R}\right] d_3 - \frac{d_6 + d_7 W_{DP} + d_8 X}{W_{DN}^2} \tag{5.12} \]
The value of WDN that yields the minimum evaluation delay can be found by
solving dτ/dWDN = 0. The closed form expression for the optimal value
of WDN for a non-contentious input configuration is given in Equation 5.13.
\[ W_{DN,opt} = \sqrt{\max[\alpha_L, \alpha_R]\,\frac{d_6 + d_7 W_{DP} + d_8 X}{d_3}} \tag{5.13} \]
For contentious input configurations, the derivative dτ/dWDN becomes a more complex
expression, but the procedure remains the same. A single optimal solution for WDN
exists where the slope of the convex curve is equal to 0.
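As a worked illustration of Equation 5.13, the closed-form optimum can be evaluated directly. The sketch below assumes the 65 nm LP coefficients reported in Table 5.10; the function name `optimal_wdn` is hypothetical and the result is a continuous value that would still need to be rounded to a legal sizing unit.

```python
import math

# d3, d6, d7, d8 for the 65 nm LP process, taken from Table 5.10.
D3, D6, D7, D8 = 1.481, 3.651, 4.377, 4.000


def optimal_wdn(a_l, a_r, wdp=1, x=1):
    """Closed-form W_DN,opt of Eq. 5.13 (non-contentious configurations only)."""
    if min(a_l, a_r) != 0 or max(a_l, a_r) <= 0:
        raise ValueError("Eq. 5.13 applies to non-contentious configurations only")
    return math.sqrt(max(a_l, a_r) * (D6 + D7 * wdp + D8 * x) / D3)
```

For a 1/0 configuration with WDP = X = 1 this gives roughly 2.85 sizing units, which agrees with the modeled minimum at WDN = 3 in the 1/0 column of Table 5.13.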
While an increase in WDP or WIN has an adverse effect on the evaluation delay,
sizing of these parameters may still be necessary. If the physical
parameters chosen to optimize evaluation delay are such that reset delay is greater than
a specified maximum constraint, increases in WDP and/or WIN may become necessary
to reduce the delay such that the constraint is satisfied.
For any TLL gate with the physical parameter WDN chosen such that evalua-
tion delay is optimized, any increase in WIN or WDP or decrease in WDN will improve
reset delay at the cost of evaluation delay. Satisfaction of the reset constraint can be
performed iteratively, with one of the three relevant physical parameters increased or
decreased by a single sizing unit each step of the iteration until the constraint is sat-
isfied. The parameter chosen for sizing at each step should be the one such that the
difference between the decrease in reset delay and the increase in evaluation delay as a
result of the sizing adjustment is maximized.
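The iterative procedure described above can be sketched as a greedy loop. This is an illustration under stated assumptions, not the author's tool: `eval_delay` uses the 65 nm LP coefficients of Table 5.10 for a 1/0 configuration, while the coefficients inside `reset_delay` are hypothetical placeholders (no reset coefficients are reported in the text), and the function and key names are invented for the example.

```python
def eval_delay(s):
    """Eqs. 5.8-5.9 for a 1/0 configuration, n = 16, X = 1 (Table 5.10, LP)."""
    tau_input = 6.827 + 2.133 * 16 + 2.323 * s["wdp"] + 1.481 * s["wdn"] + 2.578 * s["win"]
    tau_diff = 10.731 + (3.651 + 4.377 * s["wdp"] + 4.000) / s["wdn"]
    return tau_input + tau_diff


def reset_delay(s):
    """Eq. 5.11 with HYPOTHETICAL coefficients d9-d16, n = 16, X = 1."""
    return (5.0 + (10.0 + 1.0 * 16 + 1.0 * s["wdp"] + 1.0 * s["wdn"]) / s["win"]
            + (10.0 + 2.0 * s["wdn"] + 3.0) / s["wdp"])


def satisfy_reset_constraint(sizes, max_reset, max_size=16):
    """Greedily resize WIN, WDP, or WDN by one unit until reset_delay <= max_reset."""
    sizes = dict(sizes)
    moves = (("win", +1), ("wdp", +1), ("wdn", -1))
    while reset_delay(sizes) > max_reset:
        best = None
        for key, step in moves:
            value = sizes[key] + step
            if not 1 <= value <= max_size:
                continue
            trial = {**sizes, key: value}
            reset_gain = reset_delay(sizes) - reset_delay(trial)
            if reset_gain <= 0:
                continue  # only accept moves that actually shorten reset
            eval_cost = eval_delay(trial) - eval_delay(sizes)
            score = reset_gain - eval_cost  # quantity maximized at each step
            if best is None or score > best[0]:
                best = (score, trial)
        if best is None:
            raise ValueError("reset constraint cannot be satisfied")
        sizes = best[1]
    return sizes
```

With the placeholder reset coefficients, starting from WDP = 1, WDN = 3, WIN = 1 and a 40 ps reset limit, the loop selects a single increase of WIN, since that step trades the largest reset improvement against the smallest evaluation penalty.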
5.2.2.1 Model evaluation
To determine the value of the proposed delay model, it must be evaluated across
a wide range of input configurations, physical parameter sizings, and process corners.
A valuable model is one which accomplishes two feats:
• The model must be able to accurately select the optimal set of physical parame-
ters for a given function and signal assignment.
• The model must be able to accurately estimate the delay for all potential input
configurations for sets of physical parameters close to the optimal configuration.
Fitting the evaluation delay model to a particular process requires the collection
of a small number of empirical HSpice simulation results. Specifically, the evaluation
delay of a TLL element must be simulated and measured under the conditions specified
in Table 5.8.
Table 5.8: Required simulations of evaluation delay for construction of the delay model.
Delay αL/αR n WDP WDN WIP WIN X
D1 1/0 16 1 1 1 1 1
D2 13/0 16 1 1 1 1 1
D3 2/1 16 1 1 1 1 1
D4 3/2 16 1 1 1 1 1
D5 4/3 16 1 1 1 1 1
D6 5/4 16 1 1 1 1 1
D7 6/5 16 1 1 1 1 1
D8 7/6 16 1 1 1 1 1
D9 1/0 4 1 1 1 1 1
D10 1/0 16 10 1 1 1 1
D11 13/0 16 10 1 1 1 1
D12 1/0 16 1 10 1 1 1
D13 13/0 16 1 10 1 1 1
D14 1/0 16 1 1 1 10 1
D15 1/0 16 1 1 1 1 10
After the empirical simulations have been completed, the coefficients are con-
structed from the results of each measurement based on a number of assumptions. The
first assumption is that the total delay τ is the sum of τinput and τdiff. The second assumption
is that τinput for a configuration with min[αL,αR] = 0 and max[αL,αR] = x is x times smaller
than τinput for a configuration with min[αL,αR] = 0 and max[αL,αR] = 1. From measurements D1 and
D2, τinput and τdiff for the case in which n = 16, WDP = 1, WDN = 1, WIP = 1, WIN = 1,
and X = 1 are derived according to Equations 5.14-5.17.
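\[ D_1 = \tau_{input} + \tau_{diff} \tag{5.14} \]
\[ D_2 = \frac{\tau_{input}}{13} + \tau_{diff} \tag{5.15} \]
\[ \tau_{input} = \frac{13}{12}(D_1 - D_2) \tag{5.16} \]
\[ \tau_{diff} = D_1 - \tau_{input} \tag{5.17} \]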
Equations 5.18-5.26 detail the operation by which each coefficient is derived.
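\[ d_0 = \frac{13}{12}(D_1 - D_2) - 16 d_1 - d_2 - d_3 - d_4 \tag{5.18} \]
\[ d_1 = \frac{1}{12}(D_1 - D_9) \tag{5.19} \]
\[ d_2 = \frac{13}{108}(D_{10} - D_{11} - D_1 + D_2) \tag{5.20} \]
\[ d_3 = \frac{13}{108}(D_{12} - D_{13} - D_1 + D_2) \tag{5.21} \]
\[ d_4 = \frac{1}{9}(D_{14} - D_1) \tag{5.22} \]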
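\[ d_5 = D_1 - \frac{13}{12}(D_1 - D_2) - d_6 - d_7 - d_8 \tag{5.23} \]
\[ d_6 = \frac{10}{9}\left[D_1 - D_{12} + \frac{13}{12}(D_{12} - D_{13} - D_1 + D_2)\right] - d_7 - d_8 \tag{5.24} \]
\[ d_7 = \frac{1}{9}\left[D_{10} - D_1 + \frac{13}{12}(D_1 - D_2 - D_{10} + D_{11})\right] \tag{5.25} \]
\[ d_8 = \frac{1}{9}(D_{15} - D_1) \tag{5.26} \]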
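Equations 5.18-5.26 amount to a direct solve for d0-d8 from nine of the measurements in Table 5.8. The following sketch shows that solve together with a helper that generates synthetic measurements from the non-contentious model, so the extraction can be checked by a round trip. The function names and the `CONDITIONS` dictionary are assumptions made for illustration; in practice the values of D would come from HSpice rather than from the model itself.

```python
def noncontentious_delay(d, alpha, n, wdp, wdn, win, x):
    """Eqs. 5.8 and 5.9 with no contention (one input network inactive)."""
    tau_input = (d[0] + d[1] * n + d[2] * wdp + d[3] * wdn + d[4] * win) / alpha
    tau_diff = d[5] + (d[6] + d[7] * wdp + d[8] * x) / wdn
    return tau_input + tau_diff


# Non-contentious rows of Table 5.8 as (max alpha, n, WDP, WDN, WIN, X); WIP = 1.
CONDITIONS = {
    1: (1, 16, 1, 1, 1, 1), 2: (13, 16, 1, 1, 1, 1), 9: (1, 4, 1, 1, 1, 1),
    10: (1, 16, 10, 1, 1, 1), 11: (13, 16, 10, 1, 1, 1),
    12: (1, 16, 1, 10, 1, 1), 13: (13, 16, 1, 10, 1, 1),
    14: (1, 16, 1, 1, 10, 1), 15: (1, 16, 1, 1, 1, 10),
}


def fit_delay_coefficients(D):
    """Solve Eqs. 5.18-5.26 for [d0, ..., d8]; D maps measurement index to delay."""
    d1 = (D[1] - D[9]) / 12
    d2 = 13 / 108 * (D[10] - D[11] - D[1] + D[2])
    d3 = 13 / 108 * (D[12] - D[13] - D[1] + D[2])
    d4 = (D[14] - D[1]) / 9
    d0 = 13 / 12 * (D[1] - D[2]) - 16 * d1 - d2 - d3 - d4
    d7 = (D[10] - D[1] + 13 / 12 * (D[1] - D[2] - D[10] + D[11])) / 9
    d8 = (D[15] - D[1]) / 9
    d6 = 10 / 9 * (D[1] - D[12] + 13 / 12 * (D[12] - D[13] - D[1] + D[2])) - d7 - d8
    d5 = D[1] - 13 / 12 * (D[1] - D[2]) - d6 - d7 - d8
    return [d0, d1, d2, d3, d4, d5, d6, d7, d8]
```

Feeding the function measurements synthesized from a known coefficient set (for example the 65 nm LP column of Table 5.10) returns that same set to within floating-point error, which is a quick consistency check on the algebra.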
The fitting parameters γ and β are determined from measurements D2-D8 and
the computed values of d0-d8. Error is a convex function of both parameters, and a
single optimum value of each can be determined quickly using gradient descent. This
is demonstrated in Figure 5.9, which displays a contour plot of the RMS error vs. the
fitting parameters β and γ for a 65 nm LP bulk CMOS process operating under typical
conditions.
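A fit of β and γ can be sketched as follows. Note the substitution: the text uses gradient descent, whereas this example uses an exhaustive grid search, which is simpler to verify and adequate for a convex error surface. The coefficients are the 65 nm LP values from Table 5.10, the six contentious configurations correspond to measurements D3-D8 in Table 5.8, and the function names and grid bounds are assumptions.

```python
import math

D_LP = [6.827, 2.133, 2.323, 1.481, 2.578, 10.731, 3.651, 4.377, 4.000]
# Contentious configurations of measurements D3-D8 (minimum-sized, n = 16).
CONFIGS = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6)]


def model_delay(a_l, a_r, beta, gamma, d=D_LP, n=16):
    """Eqs. 5.8-5.10 for a minimum-sized gate (all widths and X equal to 1)."""
    rc_input = d[0] + d[1] * n + d[2] + d[3] + d[4]
    rc_diff = d[5] + d[6] + d[7] + d[8]
    delta_tau = abs(a_l - a_r) / (a_l * a_r) * rc_input
    return rc_input / max(a_l, a_r) + rc_diff / (1.0 - beta * delta_tau ** (-gamma))


def fit_beta_gamma(measured):
    """Grid search minimizing RMS error; measured maps (aL, aR) to delay in ps."""
    best = None
    for i in range(100):        # beta in [0.00, 0.99] keeps the denominator positive
        for j in range(151):    # gamma in [0.00, 1.50]
            beta, gamma = i / 100, j / 100
            sq = sum((model_delay(a, b, beta, gamma) - measured[(a, b)]) ** 2
                     for a, b in CONFIGS)
            rms = math.sqrt(sq / len(CONFIGS))
            if best is None or rms < best[0]:
                best = (rms, beta, gamma)
    return best
```

When the `measured` values are synthesized from the model with β = 0.24 and γ = 0.56, the search returns those two values with zero RMS error; with real HSpice measurements the residual would be nonzero.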
As the plot indicates, for the gate measured the RMS error function demon-
strates a single global minimum at β = 0.24, γ = 0.56. Once the coefficients have been
determined and the optimal contention fitting parameters β and γ have been found, the
delay can be modeled for any combination of physical parameter values and input configurations.
To demonstrate the accuracy of the model, modeled data was compared
to a large set of HSpice simulated data for two different CMOS processes. The set of
parameter values simulated over is summarized in Table 5.9.
For both the 65 nm LP and GP bulk CMOS processes, the coefficients and fitting
parameters determined by the model are given in Table 5.10. To determine the
accuracy of the absolute delay values estimated from the model, the modeled evaluation
delay values from 19,628 distinct combinations of parameter values and input
configurations were compared with simulated measurements. While a number of other
input configurations are possible (4/1, 3/0, etc.), the set compared includes all possible
worst case delay combinations for each gate assuming the optimal signal assignment is
applied; it is these input configurations for which the gate will be optimized, thus accuracy
of the model is of the greatest importance across these combinations. Application
of the model to multiple CMOS processes ensures that the underlying assumptions of
the model are not “over-fit” for a particular process or corner condition.
Figure 5.9: Contour plot of RMS error (ps) vs. β and γ for minimum-sized 16-input
TLL gate.
Table 5.9: Simulated parameter values for comparison with evaluation delay model.
Parameter   Values                               Conditions
n           4, 6, 8, 10, 12, 14, 16              -
WDP         1-2, 4, 8                            WDP ≤ WDN
WDN         1-16                                 -
WIP         1                                    -
WIN         1-2, 4, 8                            WIN ≤ WDN
X           1-2, 4, 8                            WX ≤ WDN
αL/αR       1/0, 2/1, 3/2, 4/3, 5/4, 6/5, 7/6    αL+αR ≤ n ≤ 13
Table 5.10: Model coefficients and fitting parameters for the 65 nm LP and GP bulk
CMOS processes.
Parameter   65 nm LP   65 nm GP
d0          6.827      5.011
d1          2.133      1.292
d2          2.323      1.890
d3          1.481      0.566
d4          2.578      1.767
d5          10.731     6.731
d6          3.651      3.536
d7          4.377      2.677
d8          4.000      2.556
β           0.24       0.14
γ           0.56       0.72
The modeled and simulated evaluation delay of the TLL gate across a set of
parameter values for both the 65 nm LP bulk CMOS and 65 nm GP bulk CMOS processes
are given in Tables 5.11-5.14. The same data is summarized in Figures 5.10
and 5.11.
On the 65 nm LP bulk CMOS process operating under typical conditions, the
model yielded an average absolute error of 2.05 ps across all 19,628 points of comparison,
as well as an average percent error of 4.19%. The maximum error observed was
7.70 ps in absolute terms, or 11.81% of the simulated delay. On the 65 nm GP bulk
CMOS process operating under typical conditions, the model yielded an average absolute
error of 1.30 ps, equating to an average percent error of 4.46%, as well as a maximum
error of 7.16 ps, or 22.90%.
Table 5.11: Simulated evaluation delay of a minimum-sized 16-input TLL gate in 65
nm LP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 70.1 ps 46.3 ps 39.3 ps 37.0 ps 36.2 ps 35.9 ps 36.0 ps
2 66.8 ps 42.7 ps 34.0 ps 30.5 ps 28.9 ps 28.0 ps 27.5 ps
3 66.7 ps 42.2 ps 32.8 ps 28.8 ps 26.7 ps 25.6 ps 24.9 ps
4 66.5 ps 41.6 ps 32.0 ps 27.4 ps 25.1 ps 23.7 ps 22.9 ps
5 66.8 ps 41.5 ps 31.7 ps 26.9 ps 24.3 ps 22.8 ps 21.7 ps
6 67.7 ps 41.8 ps 31.8 ps 26.7 ps 24.0 ps 22.3 ps 21.2 ps
7 68.7 ps 42.3 ps 32.0 ps 26.8 ps 23.9 ps 22.1 ps 20.9 ps
8 69.9 ps 42.9 ps 32.4 ps 26.9 ps 23.9 ps 22.0 ps 20.7 ps
9 71.2 ps 43.5 ps 32.8 ps 27.2 ps 24.0 ps 22.1 ps 20.7 ps
10 72.6 ps 44.3 ps 33.3 ps 27.5 ps 24.2 ps 22.2 ps 20.7 ps
11 74.0 ps 45.0 ps 33.8 ps 27.9 ps 24.5 ps 22.3 ps 20.8 ps
12 75.4 ps 45.8 ps 34.3 ps 28.3 ps 24.7 ps 22.5 ps 21.0 ps
13 76.9 ps 46.6 ps 34.8 ps 28.7 ps 25.0 ps 22.7 ps 21.1 ps
14 78.3 ps 47.4 ps 35.4 ps 29.1 ps 25.4 ps 23.0 ps 21.3 ps
15 79.9 ps 48.2 ps 36.0 ps 29.5 ps 25.7 ps 23.2 ps 21.5 ps
16 81.4 ps 49.0 ps 36.5 ps 30.0 ps 26.0 ps 23.5 ps 21.7 ps
Table 5.12: Simulated evaluation delay of a minimum-sized 16-input TLL gate in 65
nm GP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 45.4 ps 30.0 ps 25.5 ps 23.9 ps 23.3 ps 23.0 ps 22.9 ps
2 41.4 ps 26.5 ps 21.2 ps 18.9 ps 17.7 ps 16.9 ps 16.5 ps
3 40.2 ps 25.4 ps 19.9 ps 17.3 ps 15.9 ps 15.0 ps 14.4 ps
4 40.0 ps 25.0 ps 19.4 ps 16.6 ps 15.1 ps 14.1 ps 13.4 ps
5 40.1 ps 25.0 ps 19.2 ps 16.4 ps 14.7 ps 13.7 ps 12.9 ps
6 40.4 ps 25.1 ps 19.2 ps 16.3 ps 14.5 ps 13.4 ps 12.6 ps
7 40.8 ps 25.3 ps 19.3 ps 16.3 ps 14.5 ps 13.3 ps 12.4 ps
8 41.4 ps 25.5 ps 19.4 ps 16.3 ps 14.5 ps 13.2 ps 12.4 ps
9 42.0 ps 25.8 ps 19.6 ps 16.4 ps 14.5 ps 13.3 ps 12.3 ps
10 42.6 ps 26.2 ps 19.9 ps 16.6 ps 14.6 ps 13.3 ps 12.4 ps
11 43.3 ps 26.5 ps 20.1 ps 16.8 ps 14.8 ps 13.4 ps 12.4 ps
12 44.0 ps 26.9 ps 20.4 ps 17.0 ps 14.9 ps 13.5 ps 12.5 ps
13 44.8 ps 27.4 ps 20.7 ps 17.2 ps 15.1 ps 13.7 ps 12.6 ps
14 45.5 ps 27.8 ps 21.0 ps 17.4 ps 15.2 ps 13.8 ps 12.7 ps
15 46.3 ps 28.2 ps 21.3 ps 17.7 ps 15.4 ps 13.9 ps 12.9 ps
16 47.1 ps 28.7 ps 21.6 ps 17.9 ps 15.6 ps 14.1 ps 13.0 ps
Table 5.13: Modeled evaluation delay of a minimum-sized 16-input TLL gate in 65 nm
LP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 70.0 ps 47.3 ps 40.3 ps 37.4 ps 36.1 ps 35.8 ps 36.1 ps
2 65.5 ps 41.8 ps 34.3 ps 31.0 ps 29.3 ps 28.6 ps 28.4 ps
3 65.0 ps 40.4 ps 32.6 ps 29.0 ps 27.2 ps 26.3 ps 26.0 ps
4 65.5 ps 40.1 ps 32.0 ps 28.3 ps 26.3 ps 25.3 ps 24.8 ps
5 66.3 ps 40.2 ps 31.8 ps 27.9 ps 25.9 ps 24.7 ps 24.2 ps
6 67.4 ps 40.6 ps 31.9 ps 27.8 ps 25.6 ps 24.4 ps 23.8 ps
7 68.6 ps 41.0 ps 32.1 ps 27.9 ps 25.6 ps 24.3 ps 23.6 ps
8 69.9 ps 41.5 ps 32.3 ps 28.0 ps 25.6 ps 24.2 ps 23.5 ps
9 71.2 ps 42.1 ps 32.6 ps 28.1 ps 25.7 ps 24.3 ps 23.5 ps
10 72.6 ps 42.7 ps 32.9 ps 28.3 ps 25.8 ps 24.3 ps 23.4 ps
11 73.9 ps 43.3 ps 33.3 ps 28.6 ps 25.9 ps 24.4 ps 23.5 ps
12 75.3 ps 43.9 ps 33.7 ps 28.8 ps 26.1 ps 24.5 ps 23.5 ps
13 76.7 ps 44.6 ps 34.1 ps 29.1 ps 26.3 ps 24.6 ps 23.6 ps
14 78.1 ps 45.2 ps 34.5 ps 29.4 ps 26.5 ps 24.7 ps 23.7 ps
15 79.6 ps 45.9 ps 34.9 ps 29.7 ps 26.7 ps 24.9 ps 23.8 ps
16 81.0 ps 46.6 ps 35.4 ps 30.0 ps 26.9 ps 25.0 ps 23.9 ps
Table 5.14: Modeled evaluation delay of a minimum-sized 16-input TLL gate in 65 nm
GP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 45.4 ps 30.7 ps 26.1 ps 24.1 ps 23.2 ps 23.0 ps 23.1 ps
2 41.5 ps 26.5 ps 21.7 ps 19.5 ps 18.4 ps 17.9 ps 17.8 ps
3 40.6 ps 25.3 ps 20.4 ps 18.1 ps 16.9 ps 16.3 ps 16.1 ps
4 40.5 ps 24.9 ps 19.8 ps 17.4 ps 16.2 ps 15.5 ps 15.2 ps
5 40.6 ps 24.7 ps 19.5 ps 17.1 ps 15.8 ps 15.1 ps 14.8 ps
6 40.9 ps 24.7 ps 19.4 ps 16.9 ps 15.6 ps 14.8 ps 14.5 ps
7 41.2 ps 24.7 ps 19.4 ps 16.8 ps 15.5 ps 14.7 ps 14.3 ps
8 41.6 ps 24.9 ps 19.4 ps 16.8 ps 15.4 ps 14.6 ps 14.1 ps
9 42.1 ps 25.0 ps 19.5 ps 16.8 ps 15.3 ps 14.5 ps 14.1 ps
10 42.6 ps 25.2 ps 19.5 ps 16.8 ps 15.3 ps 14.5 ps 14.0 ps
11 43.0 ps 25.4 ps 19.6 ps 16.9 ps 15.4 ps 14.5 ps 14.0 ps
12 43.5 ps 25.6 ps 19.8 ps 16.9 ps 15.4 ps 14.5 ps 14.0 ps
13 44.1 ps 25.8 ps 19.9 ps 17.0 ps 15.4 ps 14.5 ps 13.9 ps
14 44.6 ps 26.1 ps 20.0 ps 17.1 ps 15.5 ps 14.5 ps 14.0 ps
15 45.1 ps 26.3 ps 20.2 ps 17.2 ps 15.5 ps 14.6 ps 14.0 ps
16 45.6 ps 26.6 ps 20.3 ps 17.3 ps 15.6 ps 14.6 ps 14.0 ps
Figure 5.10: Modeled vs. simulated evaluation delay vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm LP bulk CMOS process operating
under typical conditions.
Figure 5.11: Modeled vs. simulated evaluation delay vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm GP bulk CMOS process operating
under typical conditions.
The ability of the model to accurately select the parameters that optimize the
delay is analyzed by comparing the optimal values computed via the model to the opti-
mal values produced through iterative HSpice simulations. As the plots in Figures 5.10
and 5.11 demonstrate, the modeled delay tracks reasonably well with the simulated
delay across all input configurations. For five of the seven input configurations (3/2,
4/3, 5/4, 6/5, and 7/6) in the 65 nm LP bulk CMOS process, the optimal WDN computed
by the model is identical to that determined through simulation. For the remaining two
configurations (1/0 and 2/1), the optimal WDN computed by the model differs from the
optimal WDN determined through simulation by one sizing unit (120 nm). However,
if the modeled optimal WDN is applied to the TLL gate for these input configurations
rather than the true optimal WDN , the cost of the sub-optimal parameter selection in
terms of increased evaluation delay is between 0.1 and 0.2 ps, a trivial amount in the
given process.
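The comparison for the two mismatched configurations can be reproduced from the tabulated data. The sketch below copies the 1/0 and 2/1 columns of Table 5.11 (simulated) and Table 5.13 (modeled) for the 65 nm LP process; the helper names are hypothetical.

```python
# Evaluation delay in ps for WDN = 1..16, 65 nm LP, minimum-sized 16-input gate.
SIMULATED = {  # Table 5.11
    "1/0": [70.1, 66.8, 66.7, 66.5, 66.8, 67.7, 68.7, 69.9,
            71.2, 72.6, 74.0, 75.4, 76.9, 78.3, 79.9, 81.4],
    "2/1": [46.3, 42.7, 42.2, 41.6, 41.5, 41.8, 42.3, 42.9,
            43.5, 44.3, 45.0, 45.8, 46.6, 47.4, 48.2, 49.0],
}
MODELED = {  # Table 5.13
    "1/0": [70.0, 65.5, 65.0, 65.5, 66.3, 67.4, 68.6, 69.9,
            71.2, 72.6, 73.9, 75.3, 76.7, 78.1, 79.6, 81.0],
    "2/1": [47.3, 41.8, 40.4, 40.1, 40.2, 40.6, 41.0, 41.5,
            42.1, 42.7, 43.3, 43.9, 44.6, 45.2, 45.9, 46.6],
}


def best_width(delays):
    """Return the WDN (1-based sizing units) with the smallest delay."""
    return min(range(len(delays)), key=delays.__getitem__) + 1


def width_penalty(config):
    """Model-selected WDN, simulation-selected WDN, and the simulated delay cost."""
    w_model = best_width(MODELED[config])
    w_sim = best_width(SIMULATED[config])
    cost = SIMULATED[config][w_model - 1] - SIMULATED[config][w_sim - 1]
    return w_model, w_sim, cost
```

For the 1/0 configuration the model selects WDN = 3 against a simulated optimum of 4, at a simulated cost of about 0.2 ps; for 2/1 the model selects 4 against 5, at a cost of about 0.1 ps.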
In addition to the evaluation delay model, the reset delay model must be evalu-
ated as well. Fitting the reset delay model to a particular process requires the collection
of a small number of empirical HSpice simulation results. Specifically, the reset delay
of a TLL element must be simulated and measured under the conditions specified in
Table 5.15.
Table 5.15: Required simulations of reset delay for construction of the delay model.
Delay n WDP WDN WIP WIN X
D16 16 1 1 1 1 1
D17 4 1 1 1 1 1
D18 16 10 1 1 1 1
D19 16 1 10 1 1 1
D20 16 1 1 1 10 1
D21 16 1 1 1 1 10
D22 16 10 1 1 10 1
D23 16 1 10 1 10 1
After the empirical simulations have been completed, the coefficients are constructed
from the results of each measurement following the same methodology used in
the evaluation delay model.
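\[ d_9 = D_{16} - d_{10} - 16 d_{11} - d_{12} - d_{13} - d_{14} - d_{15} - d_{16} \tag{5.27} \]
\[ d_{10} = \frac{10}{9}(D_{16} - D_{20}) - 16 d_{11} - d_{12} - d_{13} \tag{5.28} \]
\[ d_{11} = \frac{1}{12}(D_{16} - D_{17}) \tag{5.29} \]
\[ d_{12} = \frac{10}{81}(D_{18} - D_{22} - D_{16} + D_{20}) \tag{5.30} \]
\[ d_{13} = \frac{10}{81}(D_{19} - D_{16} - D_{23} + D_{20}) \tag{5.31} \]
\[ d_{14} = \frac{10}{9}(D_{16} - D_{18}) + 10 d_{12} - d_{15} - d_{16} \tag{5.32} \]
\[ d_{15} = \frac{1}{9}(D_{19} - D_{16}) - d_{13} \tag{5.33} \]
\[ d_{16} = \frac{1}{9}(D_{21} - D_{16}) \tag{5.34} \]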
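The reset coefficient extraction of Equations 5.27-5.34 can be checked with a round trip through Equation 5.11. The sketch below is illustrative only: the coefficient values in the usage note are hypothetical, the function names are invented, and measurement D20 is taken to be the case WIN = 10, which is what Equations 5.28 and 5.30 require.

```python
def reset_delay(d, n, wdp, wdn, win, x):
    """Eq. 5.11; d maps the indices 9..16 to the coefficients d9..d16."""
    return (d[9] + (d[10] + d[11] * n + d[12] * wdp + d[13] * wdn) / win
            + (d[14] + d[15] * wdn + d[16] * x) / wdp)


# Measurement conditions as (n, WDP, WDN, WIN, X); WIP = 1 throughout.
RESET_CONDITIONS = {
    16: (16, 1, 1, 1, 1), 17: (4, 1, 1, 1, 1), 18: (16, 10, 1, 1, 1),
    19: (16, 1, 10, 1, 1), 20: (16, 1, 1, 10, 1), 21: (16, 1, 1, 1, 10),
    22: (16, 10, 1, 10, 1), 23: (16, 1, 10, 10, 1),
}


def fit_reset_coefficients(D):
    """Solve Eqs. 5.27-5.34; D maps measurement index (16..23) to reset delay."""
    d = {}
    d[11] = (D[16] - D[17]) / 12
    d[12] = 10 / 81 * (D[18] - D[22] - D[16] + D[20])
    d[13] = 10 / 81 * (D[19] - D[16] - D[23] + D[20])
    d[10] = 10 / 9 * (D[16] - D[20]) - 16 * d[11] - d[12] - d[13]
    d[16] = (D[21] - D[16]) / 9
    d[15] = (D[19] - D[16]) / 9 - d[13]
    d[14] = 10 / 9 * (D[16] - D[18]) + 10 * d[12] - d[15] - d[16]
    d[9] = D[16] - d[10] - 16 * d[11] - d[12] - d[13] - d[14] - d[15] - d[16]
    return d
```

Generating the eight measurements from an arbitrary coefficient set, such as d9-d16 = 5.0, 10.0, 1.0, 1.5, 0.8, 9.0, 2.0, 3.0, and passing them to `fit_reset_coefficients` recovers that set to within floating-point error.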
To determine the accuracy of the absolute delay values estimated from the
model, the modeled reset delay values from 4,907 distinct combinations of parameter
values and input configurations were compared with simulated measurements. Appli-
cation of the model to multiple CMOS processes ensures that the underlying assump-
tions of the model are not “over-fit” for a particular process or corner condition. On the
65 nm LP bulk CMOS process operating under typical conditions, the model yielded
an average absolute error of 4.76 ps across all 4,907 points of comparison, as well as an
average percent error of 8.79%. The maximum error was 12.29 ps in absolute terms, or
19.70% of the simulated delay. On the 65 nm GP bulk CMOS process operating under
typical conditions, the model yielded an average absolute error of 2.44 ps, equating to
an average percent error of 6.56%, as well as a maximum error of 7.43 ps, or 15.88%.
5.2.3 Power modeling
Unlike a conventional static CMOS logic gate, the structure of a TLL gate does
not change significantly from function to function. Apart from a change in the num-
ber of parallel devices comprising the input networks M9 and M10, every element is
essentially identical. In addition to a regular structure, the TLL exhibits very regular
operation. With few exceptions (primarily between contentious and non-contentious
input configurations), the same nodes are charged and discharged on every clock cycle,
regardless of the combination of inputs applied. As a result, the power dissipated by
the gate changes very little from cycle to cycle.
Dynamic power dissipation is known to be proportional to C·Vdd²·f, where C is
the total capacitance being charged/discharged, Vdd is the supply voltage, and f is the
frequency of operation. The sizing of physical parameters in the TLL gate augments
the capacitance of the gate, thus for a fixed stimulus one would expect a roughly linear
increase in power dissipation as the physical parameters are increased in gate width or
length.
Like delay, power dissipation can be separated into the two phases of operation
of the gate: reset and evaluation. Unlike delay, however, both components are equally
important, as both will always be consumed on every cycle of the clock. Additionally,
since all physical parameters contribute more or less linearly to the overall power dissi-
pation of the gate, power is modeled as a single function combining both the evaluation
and reset phases.
For a non-contentious input configuration during evaluation, the rising edge of
the clock results in the charging of either N5 or N6, which results in a discharge on either
N1 or N2. When the clock edge falls, one of nodes N5 or N6 is discharged (since only
one was charged during evaluation), and one of nodes N1 or N2 is recharged. Since the
amount of capacitance charged/discharged is the same regardless of the number of the
active input transistors comprising either M9 or M10, power does not vary significantly
between different non-contentious input configurations. The power dissipation during
a non-contentious input configuration responds differently to increases in each physical
parameter, as shown in Figure 5.12.
Figure 5.12: Average power dissipation vs. parameter sizing for minimum-sized 16-
input TLL gate assuming an αL/αR combination of 1/0.
The first point demonstrated by this plot is the enormous cost of sizing WIP on
the overall power consumption. As mentioned previously, increasing WIP greatly in-
creases the clock load of the cell; this causes the clock power to rise to extreme heights,
particularly for functions with large fan-in or sum of weights. For primarily this rea-
son, WIP should remain at the minimum value in nearly all cases. The load X causes the
second greatest increase in power dissipation, although this is typically not a parame-
ter that is controllable by the designer of the cell. In terms of power optimization, the
parameters WDP, WDN , and WIN are the most interesting with regards to their impact on
the power dissipation of a gate. Figure 5.13 shows the same information as Figure 5.12
with the curves for WIP and X removed.
Figure 5.13: Average power dissipation vs. parameter sizing for minimum-sized 16-
input TLL gate.
As the figure shows, WDP and WDN contribute roughly equivalent increases in
power dissipation as they are increased in size. An increase in the parameter WIN
contributes to a slightly greater increase due to the fact that, as with WIP, an increase in
WIN consumes additional clock power by increasing the load on the clk pin.
For a contentious input configuration, the rising edge of the clock results in the
charging of both N5 and N6, which results in a discharge on either N1 or N2 as well
as additional power dissipation due to contention in the amplifier. Power dissipated
during a contentious input configuration will be noticeably larger than that of a non-
contentious input configuration, as the amount of capacitance charged by the input
networks is effectively doubled. Input configuration will also have a noticeable effect
on power dissipation, as the difference in impedance between the two input networks
will determine the amount of the contention in the amplifier; the greater the contention,
the longer the differential amplifier takes to resolve the output, and the longer a path
between the supply voltage and ground remains active. Figure 5.14 shows the power
response of a TLL gate as a function of the contention between the two input networks.
Figure 5.14: Average power dissipation vs. αL/αR combination for minimum-sized
16-input TLL gate.
The power model is constructed for both contentious and non-contentious input
configurations. The non-contentious power model is given by Equation 5.35.
\[ P_{NC} = p_0 + p_1 n + p_2 W_{DP} + p_3 W_{DN} + p_4 W_{IN} + p_5 X \tag{5.35} \]
The coefficients p0, p1, p2, p3, p4, and p5 are constants derived from empirical
simulation data. The contentious power model is given by Equation 5.36.
\[ P_C = \frac{p_6 + p_7 n + p_8 W_{DP} + p_9 W_{DN} + p_{10} W_{IN} + p_{11} X}{1 - \rho \left(\dfrac{\Delta\tau_{input}}{\tau_{diff}}\right)^{-\theta}} \tag{5.36} \]
The coefficients p6, p7, p8, p9, p10, p11, ρ, and θ are constants derived from
empirical simulation data. The values ∆τinput and τdiff are the same values derived
from the delay model of the gate.
5.2.4 Power optimization
From the model and the plots of simulated results, it is apparent that the minimum
power configuration of physical parameters for a TLL gate is the one in which all physical
parameters are minimum sized. Generally, this is not an interesting or useful solution.
Rather than make minimization of power dissipation a goal of cell optimization, it
is far more useful to optimize delay subject to a maximum power dissipation constraint.
A decrease in power dissipation is achieved by reducing any one of the physical
parameters of the gate. In general, WIP will not be larger than minimum-sized,
although if this is not the case, a decrease in WIP will provide the most substantial
reduction. After WIP, a decrease in WIN will provide the second greatest reduction. Optimization
of the delay of a TLL gate with respect to some maximum power constraint
first requires unconstrained delay optimization of the gate as detailed previously. Once
the optimal delay configuration of physical parameters has been determined, the power
model must be employed to estimate the power dissipation of the gate. If the power
dissipation is less than the maximum permitted, optimization is complete. If the power
dissipation is too large, however, additional sizing of the physical parameters becomes
necessary. Each iteration of the optimization procedure will entail one of the following
four possible steps:
• Decrease the parameter WDP by one sizing unit.
• Decrease the parameter WDN by one sizing unit.
• Decrease the parameter WIP by one sizing unit.
• Decrease the parameter WIN by one sizing unit.
The favored option is that which provides the maximum decrease in power dis-
sipation and minimum increase in delay. This procedure is repeated until the modeled
power dissipation has been reduced sufficiently to satisfy the maximum constraint. It
may be possible that the constraint is not satisfiable; this is the case if the maximum
power constraint is less than the power dissipated by a minimum-sized gate.
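The power-constrained sizing loop can be sketched as below. Several choices here are assumptions rather than details given in the text: the selection rule is interpreted as the largest power saving per unit of added delay, a step that adds no delay is treated as free, and the `power` and `delay` callables stand in for the fitted models. The dictionary keys are hypothetical names for WDP, WDN, WIP, and WIN.

```python
import math


def satisfy_power_constraint(sizes, power, delay, max_power):
    """Decrease one parameter per iteration until power(sizes) <= max_power."""
    sizes = dict(sizes)
    while power(sizes) > max_power:
        best = None
        for key in ("wdp", "wdn", "wip", "win"):
            if sizes[key] <= 1:
                continue  # already minimum sized
            trial = {**sizes, key: sizes[key] - 1}
            saved = power(sizes) - power(trial)
            if saved <= 0:
                continue
            added = delay(trial) - delay(sizes)
            # Assumed figure of merit: power saved per unit of delay added.
            score = math.inf if added <= 0 else saved / added
            if best is None or score > best[0]:
                best = (score, trial)
        if best is None:
            # Every parameter is minimum sized and the limit is still exceeded.
            raise ValueError("power constraint is not satisfiable")
        sizes = best[1]
    return sizes
```

The `ValueError` branch corresponds to the unsatisfiable case noted above, where the limit lies below the power dissipated by a minimum-sized gate.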
5.2.4.1 Model evaluation
Fitting the power model to a particular process requires the collection of a small
number of empirical HSpice simulation results. Specifically, the power dissipation
of a TLL element must be simulated and measured under the conditions specified in
Table 5.16.
Table 5.16: Required simulations of power dissipation for construction of the power
model.
Power αL/αR n WDP WDN WIP WIN X
P1 1/0 16 1 1 1 1 1
P2 1/0 4 1 1 1 1 1
P3 1/0 16 10 1 1 1 1
P4 1/0 16 1 10 1 1 1
P5 1/0 16 1 1 1 10 1
P6 1/0 16 1 1 1 1 10
P7 2/1 16 1 1 1 1 1
P8 2/1 4 1 1 1 1 1
P9 2/1 16 10 1 1 1 1
P10 2/1 16 1 10 1 1 1
P11 2/1 16 1 1 1 10 1
P12 2/1 16 1 1 1 1 10
P13 3/2 16 1 1 1 1 1
P14 4/3 16 1 1 1 1 1
P15 5/4 16 1 1 1 1 1
P16 6/5 16 1 1 1 1 1
P17 7/6 16 1 1 1 1 1
After the empirical simulations have been completed, the coefficients are con-
structed from the results of each measurement based on the assumptions presented by
Equations 5.35 and 5.36. Equations 5.37-5.48 detail the operation by which each coef-
ficient is derived.
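\[ p_0 = P_1 - 16 p_1 - p_2 - p_3 - p_4 - p_5 \tag{5.37} \]
\[ p_1 = \frac{1}{12}(P_1 - P_2) \tag{5.38} \]
\[ p_2 = \frac{1}{9}(P_3 - P_1) \tag{5.39} \]
\[ p_3 = \frac{1}{9}(P_4 - P_1) \tag{5.40} \]
\[ p_4 = \frac{1}{9}(P_5 - P_1) \tag{5.41} \]
\[ p_5 = \frac{1}{9}(P_6 - P_1) \tag{5.42} \]
\[ p_6 = P_7 - 16 p_7 - p_8 - p_9 - p_{10} - p_{11} \tag{5.43} \]
\[ p_7 = \frac{1}{12}(P_7 - P_8) \tag{5.44} \]
\[ p_8 = \frac{1}{9}(P_9 - P_7) \tag{5.45} \]
\[ p_9 = \frac{1}{9}(P_{10} - P_7) \tag{5.46} \]
\[ p_{10} = \frac{1}{9}(P_{11} - P_7) \tag{5.47} \]
\[ p_{11} = \frac{1}{9}(P_{12} - P_7) \tag{5.48} \]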
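As a worked example of Equation 5.40, the coefficient p3 can be computed from two entries of the 1/0 column of Table 5.17 (65 nm LP), on the assumption that the WDN = 1 and WDN = 10 rows of that table correspond to measurements P1 and P4. The linear prediction built from those two points is then compared against the remaining simulated entries. Variable and function names are illustrative.

```python
# Simulated power in uW for WDN = 1..16, 1/0 column of Table 5.17 (65 nm LP).
P_SIM = [17.33, 17.67, 18.22, 18.78, 19.37, 19.91, 20.48, 21.26,
         21.82, 22.18, 22.74, 23.31, 23.86, 24.42, 24.99, 25.56]

p1_meas = P_SIM[0]            # P1: every parameter minimum sized
p4_meas = P_SIM[9]            # P4: WDN = 10, everything else minimum sized
p3 = (p4_meas - p1_meas) / 9  # Eq. 5.40, uW per sizing unit of WDN


def predicted_power(wdn):
    """Non-contentious power vs. WDN implied by Eq. 5.35, other terms held fixed."""
    return p1_meas + p3 * (wdn - 1)


# Largest deviation between the linear prediction and the simulated column.
worst_error = max(abs(predicted_power(w) - p) for w, p in enumerate(P_SIM, start=1))
```

Here p3 comes out near 0.54 µW per sizing unit, and the two-point linear fit stays within about 0.2 µW of every simulated entry in the column, which supports the roughly linear dependence of power on WDN assumed by the model.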
The fitting parameters ρ and θ are determined from measurements P7 and P13-P17
and the computed values of p6-p11. As with the delay model, error is a convex
function of both parameters, and a single optimum value of each can be determined
quickly using gradient descent. To demonstrate the accuracy of the model, modeled
data was compared to the same large set of HSpice simulated data used to evaluate the
accuracy of the delay model.
Table 5.17: Simulated power dissipation of a minimum-sized 16-input TLL gate in 65
nm LP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 17.33 µW 21.15 µW 21.90 µW 22.45 µW 22.96 µW 23.40 µW 23.80 µW
2 17.67 µW 21.98 µW 22.65 µW 23.22 µW 23.75 µW 24.20 µW 24.61 µW
3 18.22 µW 22.82 µW 23.47 µW 24.06 µW 24.59 µW 25.05 µW 25.49 µW
4 18.78 µW 23.67 µW 24.27 µW 24.92 µW 25.44 µW 25.94 µW 26.36 µW
5 19.37 µW 24.51 µW 25.13 µW 25.72 µW 26.29 µW 26.81 µW 27.25 µW
6 19.91 µW 25.32 µW 25.93 µW 26.53 µW 27.13 µW 27.65 µW 28.13 µW
7 20.48 µW 26.14 µW 26.79 µW 27.42 µW 28.01 µW 28.54 µW 29.03 µW
8 21.26 µW 27.20 µW 27.79 µW 28.42 µW 29.04 µW 29.41 µW 29.91 µW
9 21.82 µW 28.20 µW 28.45 µW 29.31 µW 29.70 µW 30.30 µW 30.83 µW
10 22.18 µW 28.66 µW 29.30 µW 29.95 µW 30.58 µW 31.39 µW 31.91 µW
11 22.74 µW 29.51 µW 30.14 µW 30.80 µW 31.48 µW 32.08 µW 32.60 µW
12 23.31 µW 30.48 µW 31.00 µW 31.85 µW 32.32 µW 32.94 µW 33.49 µW
13 23.86 µW 31.13 µW 32.04 µW 32.50 µW 33.15 µW 34.01 µW 34.37 µW
14 24.42 µW 32.18 µW 32.67 µW 33.34 µW 34.02 µW 34.70 µW 35.26 µW
15 24.99 µW 32.78 µW 33.49 µW 34.19 µW 34.92 µW 35.57 µW 36.17 µW
16 25.56 µW 33.61 µW 34.36 µW 35.05 µW 35.76 µW 36.45 µW 37.06 µW
On the 65 nm LP bulk CMOS process operating under typical conditions, the
model yielded an average absolute error of 1.19 µW across all 19,628 points of
comparison, corresponding to an average percent error of 3.40%. The maximum error
was also relatively low: 4.16 µW in absolute terms, or 16.09% of the simulated power
dissipation. On the 65 nm GP bulk CMOS process operating under typical conditions,
the model yielded an average absolute error of 2.33 µW, equating to an average percent
error of 6.79%, as well as a maximum error of 9.84 µW, or 31.16%.
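As a minimal sketch of how such error figures can be computed, the following Python function compares a list of simulated values against the corresponding modeled values. The function name is an assumption, and the sample data is only the first four WDN rows of the 1/0 column of Tables 5.17 and 5.19, so it does not reproduce the statistics quoted above for the full data set.

```python
def error_stats(simulated, modeled):
    """Return (mean absolute error, mean percent error, maximum absolute error).

    Percent error is taken relative to the simulated value.
    """
    abs_err = [abs(m - s) for s, m in zip(simulated, modeled)]
    pct_err = [100.0 * e / s for e, s in zip(abs_err, simulated)]
    n = len(abs_err)
    return sum(abs_err) / n, sum(pct_err) / n, max(abs_err)


# WDN = 1..4, alpha_L/alpha_R = 1/0, 65 nm LP, values in microwatts.
sim = [17.33, 17.67, 18.22, 18.78]   # Table 5.17 (simulated)
mod = [17.01, 17.56, 18.11, 18.66]   # Table 5.19 (modeled)
mean_abs, mean_pct, max_abs = error_stats(sim, mod)
```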
Table 5.18: Simulated power dissipation of a minimum-sized 16-input TLL gate in 65
nm GP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 14.59 µW 18.19 µW 19.10 µW 19.87 µW 20.61 µW 21.25 µW 21.85 µW
2 15.10 µW 18.91 µW 19.76 µW 20.54 µW 21.22 µW 22.04 µW 22.42 µW
3 15.86 µW 19.71 µW 20.52 µW 21.32 µW 22.03 µW 22.66 µW 23.23 µW
4 16.21 µW 20.53 µW 21.32 µW 22.13 µW 22.85 µW 23.49 µW 24.09 µW
5 16.82 µW 21.34 µW 22.14 µW 22.96 µW 23.70 µW 24.34 µW 24.96 µW
6 17.37 µW 22.17 µW 22.92 µW 23.77 µW 24.54 µW 25.22 µW 25.84 µW
7 17.94 µW 23.00 µW 23.76 µW 24.62 µW 25.41 µW 26.11 µW 26.74 µW
8 18.70 µW 23.83 µW 24.57 µW 25.47 µW 26.28 µW 27.00 µW 27.66 µW
9 19.08 µW 24.65 µW 25.42 µW 26.32 µW 27.15 µW 27.91 µW 28.52 µW
10 19.65 µW 25.48 µW 26.28 µW 27.14 µW 28.03 µW 28.77 µW 29.46 µW
11 20.24 µW 26.29 µW 27.33 µW 27.98 µW 28.88 µW 29.68 µW 30.38 µW
12 20.81 µW 27.09 µW 27.95 µW 28.89 µW 29.77 µW 30.54 µW 31.29 µW
13 21.38 µW 27.95 µW 28.82 µW 29.73 µW 30.83 µW 31.46 µW 32.19 µW
14 21.94 µW 28.79 µW 29.80 µW 30.58 µW 31.72 µW 32.37 µW 33.12 µW
15 22.52 µW 29.58 µW 30.54 µW 31.59 µW 32.39 µW 33.25 µW 34.01 µW
16 23.09 µW 30.42 µW 31.52 µW 32.29 µW 33.48 µW 34.13 µW 34.94 µW
Table 5.19: Modeled power dissipation of a minimum-sized 16-input TLL gate in 65
nm LP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 17.01 µW 21.53 µW 21.92 µW 22.35 µW 22.81 µW 23.31 µW 23.80 µW
2 17.56 µW 22.32 µW 22.65 µW 23.00 µW 23.39 µW 23.80 µW 24.24 µW
3 18.11 µW 23.14 µW 23.45 µW 23.79 µW 24.15 µW 24.53 µW 24.94 µW
4 18.66 µW 23.98 µW 24.28 µW 24.61 µW 24.96 µW 25.33 µW 25.73 µW
5 19.21 µW 24.82 µW 25.12 µW 25.44 µW 25.79 µW 26.16 µW 26.55 µW
6 19.76 µW 25.66 µW 25.96 µW 26.29 µW 26.63 µW 27.00 µW 27.39 µW
7 20.31 µW 26.51 µW 26.81 µW 27.13 µW 27.48 µW 27.85 µW 28.24 µW
8 20.86 µW 27.35 µW 27.66 µW 27.98 µW 28.33 µW 28.70 µW 29.09 µW
9 21.41 µW 28.20 µW 28.51 µW 28.83 µW 29.18 µW 29.56 µW 29.95 µW
10 21.97 µW 29.05 µW 29.36 µW 29.69 µW 30.04 µW 30.41 µW 30.81 µW
11 22.52 µW 29.89 µW 30.21 µW 30.54 µW 30.90 µW 31.27 µW 31.67 µW
12 23.07 µW 30.74 µW 31.06 µW 31.39 µW 31.75 µW 32.13 µW 32.53 µW
13 23.62 µW 31.59 µW 31.91 µW 32.25 µW 32.61 µW 32.99 µW 33.39 µW
14 24.17 µW 32.44 µW 32.76 µW 33.10 µW 33.47 µW 33.85 µW 34.26 µW
15 24.72 µW 33.28 µW 33.61 µW 33.95 µW 34.32 µW 34.71 µW 35.12 µW
16 25.27 µW 34.13 µW 34.46 µW 34.81 µW 35.18 µW 35.57 µW 35.99 µW
Table 5.20: Modeled power dissipation of a minimum-sized 16-input TLL gate in 65
nm GP bulk CMOS with respect to WDN across a range of αL/αR combinations.
WDN 1/0 2/1 3/2 4/3 5/4 6/5 7/6
1 14.52 µW 18.70 µW 19.17 µW 19.76 µW 20.47 µW 21.31 µW 22.29 µW
2 15.07 µW 19.41 µW 19.79 µW 20.25 µW 20.80 µW 21.44 µW 22.17 µW
3 15.63 µW 20.18 µW 20.53 µW 20.95 µW 21.45 µW 22.03 µW 22.69 µW
4 16.19 µW 20.96 µW 21.29 µW 21.70 µW 22.18 µW 22.74 µW 23.36 µW
5 16.75 µW 21.74 µW 22.07 µW 22.48 µW 22.95 µW 23.49 µW 24.10 µW
6 17.31 µW 22.53 µW 22.86 µW 23.26 µW 23.73 µW 24.27 µW 24.87 µW
7 17.86 µW 23.32 µW 23.65 µW 24.05 µW 24.52 µW 25.06 µW 25.66 µW
8 18.42 µW 24.11 µW 24.44 µW 24.85 µW 25.32 µW 25.85 µW 26.46 µW
9 18.98 µW 24.90 µW 25.23 µW 25.64 µW 26.12 µW 26.66 µW 27.26 µW
10 19.54 µW 25.69 µW 26.03 µW 26.44 µW 26.92 µW 27.46 µW 28.07 µW
11 20.10 µW 26.48 µW 26.82 µW 27.24 µW 27.72 µW 28.27 µW 28.89 µW
12 20.65 µW 27.27 µW 27.62 µW 28.04 µW 28.53 µW 29.08 µW 29.70 µW
13 21.21 µW 28.06 µW 28.41 µW 28.84 µW 29.33 µW 29.89 µW 30.52 µW
14 21.77 µW 28.85 µW 29.21 µW 29.64 µW 30.14 µW 30.70 µW 31.34 µW
15 22.33 µW 29.65 µW 30.01 µW 30.44 µW 30.94 µW 31.51 µW 32.15 µW
16 22.89 µW 30.44 µW 30.80 µW 31.24 µW 31.75 µW 32.33 µW 32.97 µW
Figure 5.15: Modeled vs. simulated power dissipation vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm LP bulk CMOS process operating
under typical conditions.
Figure 5.16: Modeled vs. simulated power dissipation vs. WDN for a minimum-sized
16-input TLL gate over a range of αL/αR on a 65 nm GP bulk CMOS process operating
under typical conditions.
5.2.5 Reliability modeling
Optimization of gate delay and power dissipation is important, but even more
important is optimization of a TLL element with respect to reliability. As the physical
gate widths of the TLL element are modified, the noise margin of the gate varies as
well, and while it is desirable to design a gate that is both fast and low in power, it is
absolutely essential that such a gate operates as specified under all possible environ-
mental conditions. As a consequence of this, the noise margin of the TLL gate as a
function of its input configuration and physical parameters must be accurately modeled
and related to the failure probability of the gate when subjected to random variations.
Evaluation in a TLL gate can be likened to a write operation in an SRAM cell.
In an SRAM cell, a write operation is performed by establishing a path between the
storage nodes of the cell and the signal driven bit lines. During this operation, the node
that stores a logic 1 in the SRAM cell will resist an attempt to pull it low; if noise pulling
up on the node is too great, the cell will not write the new value, resulting in a failure.
In a TLL cell, both storage nodes are initially reset to logic 1. During evaluation, both
nodes will be pulled low, albeit at different times. The node that is pulled low first will
complete its discharge, while the node that is pulled low second will dip briefly before
being pulled back high. However, if noise pulling down on the second node is too great,
it may overpower the first, storing an incorrect value.
The reliability of a TLL gate is dependent upon its ability to determine
which of its two input networks has the lower impedance. If the difference in
impedance is large, the difference in initial discharge time between the two branches
of the differential amplifier will be large, as well, resulting in only a small degree of
contention. If the difference in impedance is small, the two branches of the differential
amplifier will both begin to discharge at about the same time, resulting in a large degree
of contention.
While the relative impedance between the input networks is important, the prop-
erties of the differential amplifier itself play an important role, as well. The input net-
work with the lowest propagation delay will trigger the differential amplifier, causing
the corresponding branch of the amplifier to begin discharging. If the discharge delay
of the amplifier is small, a significant portion of the discharge may take place before
the input network with higher propagation delay triggers the opposite branch of the am-
plifier, reducing the contention. If the differential amplifier discharges very slowly, the
effect of the difference in initial discharge times of the two branches will be diminished
and contention will increase.
The basic model for the noise margin of the gate is therefore related to both the
difference in input network propagation delays ∆τinput and the base differential
discharge delay τdiff. Since the noise margin grows as ∆τinput increases and shrinks
as τdiff increases, the model uses the ratio ∆τinput/τdiff as a metric for estimating the
noise margin. Such a metric is not useful for contention-free input configurations
(αL = 0 or αR = 0), as ∆τinput/τdiff approaches infinity in these cases. This is not a
concern, however, as only an extremely large amount of noise will disrupt the output
of a TLL gate during the evaluation of a non-contentious input configuration. The noise
margin of the gate as a percentage of the supply voltage is given in Equation 5.49.
\[ NM = \kappa \left( \frac{\Delta\tau_{input}}{\tau_{diff}} \right)^{\phi} \tag{5.49} \]
The parameters κ and φ are process and corner dependent and are empirically
derived from HSpice simulation data.
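Equation 5.49 is simple enough to evaluate directly. The sketch below assumes the 65 nm LP values κ = 0.26 and φ = 0.65 quoted in Section 5.2.6.1 and the WDN = 1 ratios listed in Table 5.24; the function and variable names are illustrative only.

```python
def noise_margin(ratio, kappa, phi):
    """Equation 5.49: noise margin as a fraction of Vdd.

    ratio is delta_tau_input / tau_diff, as supplied by the delay model.
    """
    return kappa * ratio ** phi


KAPPA_LP, PHI_LP = 0.26, 0.65  # 65 nm LP values quoted in Section 5.2.6.1

# WDN = 1 row of Table 5.24, alpha_L/alpha_R = 2/1 through 7/6.
ratios = [1.039, 0.346, 0.173, 0.103, 0.069, 0.049]
for r in ratios:
    print(f"ratio {r:.3f} -> {100.0 * noise_margin(r, KAPPA_LP, PHI_LP):.2f}% of Vdd")
```

Because κ and φ are quoted to two decimal places, the printed values should land close to, but not necessarily exactly on, the first row of Table 5.26.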
5.2.6 Reliability optimization
A TLL gate with the maximum noise margin is one in which ∆τinput is always
much larger than τdiff. Ensuring such a condition can be extremely expensive,
however, in terms of both delay and power dissipation, as previous simulations of the
differential threshold logic styles SCSDL and DCSTL have demonstrated. Additionally,
the amount of noise margin that is necessary for reliable operation of the cell will vary
from process to process and corner to corner, so maximizing the reliability of a cell
will rarely be necessary. Instead, a TLL cell must be designed to satisfy some minimum
constraint on reliability.
An increase in reliability is achieved either by increasing ∆τinput or by decreasing
τdiff. The physical parameter WDN is the most useful with regard to these tasks, as
increasing it accomplishes both. The second most useful parameter to size is generally
WIN, which increases ∆τinput only. The parameter WDP may increase or reduce
the noise margin of the gate, as sizing it increases both ∆τinput and τdiff. Finally, an
increase in WIP decreases ∆τinput without changing τdiff, and thus WIP should never be
adjusted if improvements in a gate’s noise margin are required.
Optimization of the delay of a TLL gate with respect to some minimum relia-
bility constraint first requires unconstrained delay optimization of the gate as detailed
previously. Once the optimal delay configuration of physical parameters has been de-
termined, the reliability model must be employed to estimate the noise margin of the
gate. If the noise margin is sufficient, optimization is complete. If the noise margin
is insufficient, however, additional sizing of the physical parameters becomes neces-
sary. Each iteration of the optimization procedure will entail one of the following four
possible steps:
• Increase the parameter WDP by one sizing unit.
• Decrease the parameter WDP by one sizing unit.
• Increase the parameter WDN by one sizing unit.
• Increase the parameter WIN by one sizing unit.
The favored option is that which provides the largest increase in reliability
for the smallest increase in delay. This procedure is repeated until the modeled noise
margin has been increased sufficiently to satisfy the minimum constraint. While any
reasonable reliability constraint is satisfiable, the cost may be prohibitively high in
terms of delay, power dissipation, or both. If power and reliability constraints are both
applied to the optimization of a gate, it may be impossible to satisfy both, particularly
for cells with input configurations that produce very high αL and αR. A designer has
little use for such a cell; thus, the ability to successfully optimize with respect to both
constraints will determine the feasibility of TLL gates in a given process.
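The iterative procedure described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the implementation used in this work: the callables `noise_margin` and `delay` stand in for the reliability and delay models, the score "noise margin gained per unit of delay added" is one possible reading of "favored option", and the lower bound of one sizing unit is assumed.

```python
# The four candidate steps listed above: (parameter, change in sizing units).
MOVES = (("WDP", +1), ("WDP", -1), ("WDN", +1), ("WIN", +1))


def size_for_reliability(sizes, noise_margin, delay, nm_min, max_iters=1000):
    """Greedily resize a gate until noise_margin(sizes) >= nm_min.

    sizes maps parameter names to integer widths; it is copied, not modified.
    noise_margin and delay are hypothetical model callables taking such a dict.
    """
    sizes = dict(sizes)
    for _ in range(max_iters):
        nm = noise_margin(sizes)
        if nm >= nm_min:
            return sizes
        base_delay = delay(sizes)
        best, best_score = None, 0.0
        for name, step in MOVES:
            trial = dict(sizes)
            trial[name] += step
            if trial[name] < 1:  # assumed minimum width of one sizing unit
                continue
            gain = noise_margin(trial) - nm
            if gain <= 0.0:      # step does not help reliability
                continue
            cost = max(delay(trial) - base_delay, 1e-12)
            score = gain / cost  # noise margin gained per unit of added delay
            if score > best_score:
                best, best_score = trial, score
        if best is None:
            raise ValueError("no sizing step improves the noise margin")
        sizes = best
    raise ValueError("reliability constraint not met within max_iters")
```

A power constraint could be added by rejecting any trial whose modeled power exceeds the limit, at which point the loop may legitimately fail, matching the infeasible cases discussed above.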
5.2.6.1 Model evaluation
Six HSpice measurements are required to determine the values of κ and φ .
Specifically, the noise margin must be measured for each of the cases specified in Ta-
ble 5.21.
Table 5.21: Required simulations of noise margin for construction of the reliability
model.
Power αL/αR n WDP WDN WIP WIN X
P1 2/1 16 1 1 1 1 1
P2 3/2 16 1 1 1 1 1
P3 4/3 16 1 1 1 1 1
P4 5/4 16 1 1 1 1 1
P5 6/5 16 1 1 1 1 1
P6 7/6 16 1 1 1 1 1
The fitting parameters are determined from these measurements and the values
of τinput and τdiff provided by the delay model. As with the fitting parameters β and
γ in the delay model, error in computing the noise margin is a convex function of both
parameters κ and φ; a single optimum value of each can be determined quickly using
gradient descent. For the 65 nm LP bulk CMOS process, the optimal parameters are
κ = 0.26 and φ = 0.65. For the 65 nm GP bulk CMOS process, the optimal parameters
are κ = 0.27 and φ = 0.69.
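For readers who want to experiment with the fit, the sketch below estimates κ and φ from (ratio, noise margin) pairs. It deliberately swaps in a different technique from the gradient descent described above: taking logarithms of Equation 5.49 gives ln NM = ln κ + φ ln(ratio), which is linear and can be solved in closed form by least squares. Because this minimizes error in log space rather than in noise margin itself, it need not reproduce the parameter values quoted above. All names are illustrative.

```python
import math


def fit_kappa_phi(ratios, margins):
    """Closed-form least-squares fit of ln(NM) = ln(kappa) + phi * ln(ratio)."""
    xs = [math.log(r) for r in ratios]
    ys = [math.log(m) for m in margins]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    phi = sxy / sxx
    kappa = math.exp(mean_y - phi * mean_x)
    return kappa, phi


# Synthetic check: margins generated from known parameters are recovered.
ratios = [1.0, 0.35, 0.17, 0.10, 0.07, 0.05]
margins = [0.26 * r ** 0.65 for r in ratios]
kappa, phi = fit_kappa_phi(ratios, margins)
```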
To demonstrate the accuracy of the model, modeled data was compared to a
set of HSpice simulated data for two different CMOS processes. Tables 5.22 and 5.23
display the simulated noise margin of a TLL gate as a percentage of the supply voltage
vs. physical parameter WDN across a range of input configurations for both processes.
Tables 5.24 and 5.25 display the ratio ∆τinput/τdiff, while Tables 5.26 and 5.27 display
the modeled noise margin across the same data points. The same data is summarized in
Figures 5.17 and 5.18.
Note that this model becomes very inaccurate when ∆τinput > τdiff, signifying
that the first branch of the differential amplifier is able to discharge before the second
branch begins its discharge attempt, effectively a contention-free scenario. In such a
case, the noise margin of the gate will generally be high enough to resist any reasonable
offset due to noise or process variations; accuracy of the model is therefore not critical
under such conditions. Across all of the points simulated in the 65 nm LP bulk CMOS
process, the average error is 2.72% of the supply voltage (32 mV); the maximum error
is 21.26% of the supply voltage (255 mV). For points such that the ratio ∆τinput/τdiff < 1,
the average error reduces to 0.69% of the supply voltage (8 mV) and the maximum error
Table 5.22: Simulated noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm LP bulk CMOS with respect to WDN across a range of αL/αR combinations.
Shaded cells are high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies
occur.
WDN 2/1 3/2 4/3 5/4 6/5 7/6
1 26.16% 13.58% 8.25% 5.41% 3.83% 2.91%
2 29.66% 17.16% 10.91% 7.50% 5.50% 4.16%
3 30.83% 18.66% 12.25% 8.50% 6.25% 4.83%
4 31.75% 20.00% 13.33% 9.41% 7.00% 5.41%
5 32.25% 20.83% 14.08% 10.08% 7.50% 5.83%
6 32.41% 21.25% 14.66% 10.50% 7.91% 6.16%
7 32.50% 21.66% 15.00% 10.83% 8.16% 6.41%
8 32.50% 21.91% 15.33% 11.16% 8.41% 6.58%
9 32.50% 22.08% 15.58% 11.41% 8.66% 6.75%
10 32.41% 22.25% 15.83% 11.58% 8.83% 6.91%
11 32.33% 22.50% 16.08% 11.83% 9.00% 7.08%
12 32.33% 22.58% 16.25% 12.00% 9.16% 7.25%
13 32.25% 22.66% 16.41% 12.16% 9.33% 7.33%
14 32.25% 22.83% 16.58% 12.33% 9.41% 7.50%
15 32.16% 23.00% 16.75% 12.50% 9.58% 7.66%
16 32.16% 23.16% 16.91% 12.66% 9.75% 7.75%
Table 5.23: Simulated noise margin (as a % of Vdd) of a minimum-sized 16-input
TLL gate in 65 nm GP bulk CMOS with respect to WDN across a range of αL/αR
combinations. Shaded cells are high reliability (∆τinput/τdiff > 1) cases where large model
inaccuracies occur.
WDN 2/1 3/2 4/3 5/4 6/5 7/6
1 26.60% 13.40% 7.80% 5.10% 3.60% 2.70%
2 31.90% 17.80% 10.90% 7.40% 5.30% 4.00%
3 34.60% 20.30% 12.80% 8.70% 6.40% 4.90%
4 36.10% 21.80% 14.00% 9.70% 7.10% 5.40%
5 37.10% 22.90% 14.90% 10.40% 7.60% 5.90%
6 37.80% 23.60% 15.50% 10.90% 8.10% 6.20%
7 38.40% 24.20% 16.10% 11.30% 8.40% 6.50%
8 38.80% 24.70% 16.50% 11.70% 8.70% 6.70%
9 39.10% 25.10% 16.90% 12.00% 8.90% 6.90%
10 39.30% 25.40% 17.20% 12.20% 9.20% 7.10%
11 39.50% 25.70% 17.40% 12.40% 9.30% 7.30%
12 39.70% 25.90% 17.70% 12.70% 9.50% 7.40%
13 39.80% 26.10% 17.90% 12.80% 9.60% 7.50%
14 39.90% 26.30% 18.00% 13.00% 9.80% 7.60%
15 40.00% 26.40% 18.20% 13.10% 9.90% 7.70%
16 40.10% 26.60% 18.30% 13.30% 10.00% 7.80%
Table 5.24: Modeled ∆τinput/τdiff for a minimum-sized 16-input TLL gate in 65 nm LP bulk
CMOS with respect to WDN across a range of αL/αR combinations. Shaded cells are
high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies occur.
WDN 2/1 3/2 4/3 5/4 6/5 7/6
1 1.039 0.346 0.173 0.103 0.069 0.049
2 1.457 0.485 0.242 0.145 0.097 0.069
3 1.706 0.568 0.284 0.170 0.113 0.081
4 1.884 0.628 0.314 0.188 0.125 0.089
5 2.027 0.675 0.337 0.202 0.135 0.096
6 2.149 0.716 0.358 0.214 0.143 0.102
7 2.258 0.752 0.376 0.225 0.150 0.107
8 2.358 0.786 0.393 0.235 0.157 0.112
9 2.452 0.817 0.408 0.245 0.163 0.116
10 2.541 0.847 0.423 0.254 0.169 0.121
11 2.627 0.875 0.437 0.262 0.175 0.125
12 2.711 0.903 0.451 0.271 0.180 0.129
13 2.792 0.930 0.465 0.279 0.186 0.132
14 2.872 0.957 0.478 0.287 0.191 0.136
15 2.951 0.983 0.491 0.295 0.196 0.140
16 3.028 1.009 0.504 0.302 0.201 0.144
Table 5.25: Modeled ∆τinput/τdiff for a minimum-sized 16-input TLL gate in 65 nm GP bulk
CMOS with respect to WDN across a range of αL/αR combinations. Shaded cells are
high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies occur.
WDN 2/1 3/2 4/3 5/4 6/5 7/6
1 0.964 0.321 0.160 0.096 0.064 0.045
2 1.370 0.456 0.228 0.137 0.091 0.065
3 1.607 0.535 0.267 0.160 0.107 0.076
4 1.770 0.590 0.295 0.177 0.118 0.084
5 1.895 0.631 0.315 0.189 0.126 0.090
6 1.997 0.665 0.332 0.199 0.133 0.095
7 2.085 0.695 0.347 0.208 0.139 0.099
8 2.163 0.721 0.360 0.216 0.144 0.103
9 2.234 0.744 0.372 0.223 0.148 0.106
10 2.300 0.766 0.383 0.230 0.153 0.109
11 2.362 0.787 0.393 0.236 0.157 0.112
12 2.421 0.807 0.403 0.242 0.161 0.115
13 2.477 0.825 0.412 0.247 0.165 0.117
14 2.532 0.844 0.422 0.253 0.168 0.120
15 2.585 0.861 0.430 0.258 0.172 0.123
16 2.637 0.879 0.439 0.263 0.175 0.125
Table 5.26: Modeled noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm LP bulk CMOS with respect to WDN across a range of αL/αR combinations.
Shaded cells are high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies
occur.
WDN 2/1 3/2 4/3 5/4 6/5 7/6
1 26.67% 13.05% 8.32% 5.97% 4.58% 3.68%
2 33.21% 16.26% 10.36% 7.43% 5.71% 4.59%
3 36.79% 18.01% 11.48% 8.23% 6.32% 5.08%
4 39.25% 19.21% 12.24% 8.78% 6.75% 5.42%
5 41.15% 20.15% 12.84% 9.21% 7.07% 5.68%
6 42.75% 20.93% 13.33% 9.57% 7.35% 5.90%
7 44.14% 21.61% 13.77% 9.88% 7.59% 6.10%
8 45.41% 22.23% 14.16% 10.16% 7.81% 6.27%
9 46.57% 22.80% 14.53% 10.42% 8.01% 6.43%
10 47.67% 23.34% 14.87% 10.67% 8.20% 6.58%
11 48.72% 23.85% 15.20% 10.90% 8.38% 6.73%
12 49.72% 24.34% 15.51% 11.13% 8.55% 6.87%
13 50.68% 24.81% 15.81% 11.34% 8.71% 7.00%
14 51.62% 25.27% 16.10% 11.55% 8.87% 7.13%
15 52.53% 25.72% 16.39% 11.76% 9.03% 7.26%
16 53.42% 26.16% 16.67% 11.96% 9.19% 7.38%
Table 5.27: Modeled noise margin (as a % of Vdd) of a minimum-sized 16-input TLL
gate in 65 nm GP bulk CMOS with respect to WDN across a range of αL/αR combinations.
Shaded cells are high reliability (∆τinput/τdiff > 1) cases where large model inaccuracies
occur.
WDN 2/1 3/2 4/3 5/4 6/5 7/6
1 26.33% 12.34% 7.65% 5.37% 4.06% 3.22%
2 33.56% 15.72% 9.74% 6.85% 5.18% 4.10%
3 37.46% 17.55% 10.88% 7.64% 5.78% 4.58%
4 40.05% 18.76% 11.63% 8.17% 6.18% 4.90%
5 41.97% 19.67% 12.19% 8.57% 6.47% 5.13%
6 43.52% 20.39% 12.64% 8.88% 6.71% 5.32%
7 44.83% 21.00% 13.02% 9.15% 6.92% 5.48%
8 45.98% 21.54% 13.35% 9.38% 7.09% 5.62%
9 47.02% 22.03% 13.65% 9.60% 7.25% 5.75%
10 47.97% 22.47% 13.93% 9.79% 7.40% 5.87%
11 48.85% 22.89% 14.19% 9.97% 7.54% 5.97%
12 49.69% 23.28% 14.43% 10.14% 7.67% 6.08%
13 50.49% 23.66% 14.66% 10.31% 7.79% 6.17%
14 51.26% 24.02% 14.88% 10.46% 7.91% 6.27%
15 52.00% 24.36% 15.10% 10.61% 8.02% 6.36%
16 52.72% 24.70% 15.31% 10.76% 8.13% 6.45%
Figure 5.17: Modeled vs. simulated noise margin vs. WDN for a minimum-sized 16-
input TLL gate over a range of αL/αR on a 65 nm LP bulk CMOS process operating
under typical conditions.
reduces to 2.72% (32 mV). For the least reliable points such that the ratio ∆τinput/τdiff <
0.5, the average and maximum error reduce to 0.60% (7 mV) and 1.32% (15 mV),
respectively. Across all the points simulated in the 65 nm GP bulk CMOS process, the
average error is 2.80% of the supply voltage (28 mV) and the maximum error is 12.62%
(126 mV). For points such that the ratio ∆τinput/τdiff < 1, the average error reduces to 1.90%
of the supply voltage (19 mV) and the maximum error reduces to 3.26% (32 mV). For
the least reliable points such that the ratio ∆τinput/τdiff < 0.5, the average and maximum
error reduce to 1.74% (17 mV) and 3.26% (32 mV), respectively.
While noise margin is generally a good estimator of reliability, a single noise
margin will not determine the failure rate of a TLL element during Monte Carlo simu-
lation, since increasing the size of a physical parameter will not only change the noise
margin of a gate, but also the amount of noise introduced due to process variations.
Figure 5.18: Modeled vs. simulated noise margin vs. WDN for a minimum-sized 16-
input TLL gate over a range of αL/αR on a 65 nm GP bulk CMOS process operating
under typical conditions.
The most important parameter in determining the reliability of a gate is WDN; the
magnitude of this parameter is a reasonably accurate indicator of the minimum amount
of noise margin required for reliable operation of a gate. The noise margin required
of a minimum-sized gate to achieve a specific yield decreases monotonically as WDN
increases. The minimum amount of noise margin required for a minimum-sized gate
will vary from process to process and corner to corner; for the 65 nm LP bulk CMOS
process, the minimum noise margin required of a minimum-sized gate is approximately
24% of the supply voltage, decreasing by a factor roughly proportional to WDN^0.25.
Chapter 6
TLL BASED DESIGN
The TLL gate is an efficient implementation of a (latching) threshold logic
function that overcomes the problem of reliability observed in many other differential
threshold logic gates. In addition, models have been created which estimate the
delay, power, and robustness of a gate with reasonable accuracy, enabling the
optimization of a TLL gate according to the function it is implementing and the context in
which it appears. However, in order to demonstrate the advantages of threshold logic,
TLL gates must be applied to a complete design and show significant improvements
over conventional design techniques. To complete this task, a design must be selected to
augment with TLL gates.
6.1 CMOS/TLL hybridization
TLL gates, while expressive, are sequential elements, and are thus difficult to
design with exclusively. To do so would require complex asynchronous handshaking
protocols, large numbers of buffering sequential elements, a high degree of wiring
congestion, and/or extremely large and power-hungry clock trees. These challenges have
been recognized by others experimenting with differential threshold logic [53], who
have in response proposed designs utilizing “hybridization”, or selective replacement
of traditional CMOS logic with threshold logic components, as shown in Figure 6.1.
Obviously, TLL gates can only absorb threshold functions, thus the amount of
combinational logic that can be absorbed through hybridization is highly dependent
upon the composition of the design and where hybridization points are chosen. Non-
threshold structures such as XOR and XNOR gates limit the extent to which hybridiza-
tion can occur, and may prevent it entirely, as Figure 6.2 demonstrates.
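The distinction between threshold and non-threshold functions can be illustrated with a small brute-force search. The sketch below assumes the usual definition of a threshold gate (output 1 when the weighted sum of the inputs meets a threshold) and searches only small integer weights, negative ones included, purely for illustration; the bound and all names are assumptions, and a failed search within the bound is not by itself a proof.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs):
    """Return 1 when the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def find_realization(truth_table, n_inputs, bound=3):
    """Search integer weights in [-bound, bound] for a matching threshold gate.

    Returns (weights, threshold) for the first match, or None if none exists
    within the searched range.
    """
    weight_range = range(-bound, bound + 1)
    threshold_range = range(-bound * n_inputs, bound * n_inputs + 1)
    for weights in product(weight_range, repeat=n_inputs):
        for threshold in threshold_range:
            if all(threshold_gate(weights, threshold, x) == y
                   for x, y in truth_table.items()):
                return weights, threshold
    return None


AND2 = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XNOR2 = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}

and_gate = find_realization(AND2, 2)    # some (weights, threshold) pair
xnor_gate = find_realization(XNOR2, 2)  # None: no single gate matches
```

For XNOR the failure is not an artifact of the search bound: the (0, 0) row forces the threshold to be at most zero, the two single-input rows force each weight below the threshold, and the (1, 1) row then cannot be satisfied.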
Figure 6.1: Flip-flops and a portion of the preceding combinational logic are absorbed
into TLL gates through hybridization.
Figure 6.2: Poor absorption through hybridization due to non-threshold structures.
As shown in the figure, despite the size of the threshold function implemented,
the XNOR gate at the hybridization point completely thwarts all attempts to reduce gate
count or logic depth. The only reduction is the replacement of the XNOR gate with a
simple NOR gate, which comes at the cost of replacing a simple CMOS flip-flop with
a very large and expensive threshold logic gate. Keeping this in mind, designs demon-
strating the advantages of CMOS/TLL hybridization must be selected and optimized
judiciously.
6.2 Design of a 32-bit integer 2’s complement multiplier using TLL
When selecting candidates for hybridization, it is important to consider the
types of functions that are especially amenable to absorption. In the computation of
a threshold function, the TLL gate never considers the unique permutation of inputs,
only the weighted sum they yield. There are many widely used arithmetic operations
for which this is true as well, such as multiplication. Understandably, studies showing
how threshold logic can reduce multiplication complexity predate efficient threshold
logic gate implementations [34]. The multiplier is a fairly ubiquitous component of
microprocessors, digital processors, and graphics engines [69], used in all but the most
rudimentary designs, and is thus an excellent demonstration design for showing the
advantages of CMOS/TLL hybridization.
Multiplication in computing has been studied for as long as computing has ex-
isted, and as a result there are many, many different ways in which to implement a
multiplier in an IC. Each technique comes equipped with its own advantages in terms
of complexity, area, performance, and power dissipation. All multipliers, however, can
essentially be divided into two classes: linear and logarithmic.
Linear multipliers, such as the array multiplier, add the partial products of the
multiplication in series, producing the final product after a linear number of additions.
Linear multiplication is very efficient in terms of power and area. However, when high
performance is essential, only logarithmic multiplication will do. Logarithmic multipli-
ers employ a tree structure to parallelize the addition of partial products. A logarithmic
multiplier can be decomposed into three separate operations: partial product genera-
tion, partial product reduction, and partial product addition.
6.2.1 Partial product generation
The partial product generator of the multiplier takes in two n-bit operands as
inputs and produces n n-bit partial products. When summed, these n partial products
will add up to the final product. For an unsigned multiplier, an AND operation on each
pair of bits a[i]b[j], where i and j range from 0 to n − 1, will produce the n partial
products of the multiplication. The partial products of a 2’s complement multiplier may
be computed using the same number of gates if modified Baugh-Wooley encoding [31]
is employed, where some of the AND functions are inverted.
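The unsigned case can be sketched in Python as follows. This covers only the plain AND-based generation for unsigned operands; the modified Baugh-Wooley encoding for 2's complement operands is not modeled. The function name is illustrative, and each row is stored already shifted to its significance so that the rows can be summed directly.

```python
def partial_products(a, b, n):
    """Return the n partial products of two unsigned n-bit operands.

    Row j holds a[i] AND b[j] for every i, placed at bit position i + j.
    """
    rows = []
    for j in range(n):
        b_j = (b >> j) & 1
        row = 0
        for i in range(n):
            a_i = (a >> i) & 1
            row |= (a_i & b_j) << (i + j)
        rows.append(row)
    return rows


rows = partial_products(13, 11, 4)
product_value = sum(rows)  # equals 13 * 11
```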
In terms of data width, the partial product generator yields the widest region
in the multiplier (n^2 bits) and is the only component of the multiplier that actually
increases the data width; the remaining components gradually collapse the n^2 bits
produced by the partial product generator into the 2n bits of the final product. Due to
the high data width and simplicity of the functions involved, there is little opportunity
for threshold logic absorption in this component of the multiplier; absorption forward
from the first layer of flip-flops in the design would increase the number of sequential
elements in the first layer from 64 to 1024, resulting in an enormous increase in both area
and power dissipation.
6.2.2 Partial product reduction
The partial product reducer is a component of the multiplier that is used to
convert the n partial products produced by the partial product generator into a smaller
number of equivalent partial products. Multiplication of two n-bit operands is equiva-
lent to the addition of n n-bit operands (of staggered significance). While this can be
done serially, as in an array multiplier, the delay of such an operation is linear with
respect to n. Partial product reduction trees, such as Wallace and Dadda trees [68, 16],
instead use parallel full adders as 3:2 counters to reduce the number of partial products
from n to two, at which point a single addition is required to compute the final product.
A full adder takes as input three bits of equal significance, producing a sin-
gle two-bit output reflecting the binary sum of the inputs. This two-bit vector can be
regarded as two single bit outputs of staggered significance, as seen in Figure 6.3.
Figure 6.3: 3:2 counter block diagram and operation.
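A 3:2 counter can be written as a two-line Boolean function. In the sketch below the output names follow the labels y and z used in Figure 6.3, with y assumed to be the more significant (carry) bit and z the less significant (sum) bit.

```python
def full_adder(a, b, c):
    """3:2 counter: return (y, z) such that a + b + c == 2*y + z."""
    z = a ^ b ^ c                    # sum bit, weight 1
    y = (a & b) | (a & c) | (b & c)  # carry bit, weight 2 (majority of inputs)
    return y, z
```

The carry output is the majority function, which is itself a threshold function (a + b + c ≥ 2), whereas the sum output is a three-input XOR and is not.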
Similarly, n parallel full adders can be used to transform three n-bit binary
vectors of equal significance into two n-bit binary vectors of staggered significance, as
seen in Figure 6.4.
Figure 6.4: Parallel 3:2 counter reducing three n-bit vectors into two n-bit vectors.
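The parallel arrangement maps naturally onto bitwise integer operations, where each bit position plays the role of one full adder. The sketch below is illustrative only; the function name is an assumption, and the carry vector is returned already shifted one position left to reflect its staggered significance.

```python
def carry_save_add(a, b, c, n):
    """Reduce three n-bit integers to (y, z) with a + b + c == y + z.

    z collects the per-bit sum outputs; y collects the per-bit carry outputs,
    shifted left by one place because each carry has twice the weight.
    """
    mask = (1 << n) - 1
    a, b, c = a & mask, b & mask, c & mask
    z = a ^ b ^ c
    y = ((a & b) | (a & c) | (b & c)) << 1
    return y, z
```

No carry propagates between bit positions, which is why the delay of one such stage is independent of n.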
While partial product reduction trees are expensive in terms of area and power
dissipation, the delay of the tree is logarithmic with respect to n, and thus preferred
for high performance applications, particularly in designs that require multiplication of
large bit vectors. Figure 6.5 displays a complete schematic for a 32-bit partial product
reduction tree implemented using 3:2 counters.
Figure 6.5: 32-bit partial product reduction tree implemented using 3:2 counters.
As the figure demonstrates, the 32 partial products are organized into groups
of three, and then compressed using 3:2 counters. The process is repeated until only
two partial products remain. A total of eight stages of counters are required, as the
number of partial products is reduced from 32 to 22, 22 to 15, 15 to 10, 10 to 7, 7 to 5,
5 to 4, 4 to 3, and 3 to 2. While this is certainly more efficient than a linear reduction
of partial products, it is not necessarily the most efficient technique. The number
of counter stages required in a partial product tree can actually be reduced further if
larger counters are employed. A 7:3 counter, for instance, takes as input seven bits of
equal significance, producing a single three-bit output that reflects the binary sum of
the inputs. This three-bit vector can be regarded as three single bit outputs of staggered
significance, as seen in Figure 6.6.
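The stage counts quoted for 3:2 counters can be reproduced with a short calculation. The sketch below assumes that rows are grouped greedily into full counters and that any leftover rows are compressed by the smallest counter able to hold their count (so two leftover rows pass through unchanged); that leftover rule is an assumption, not something stated in the text. It also assumes the counter width is of the form 2^k − 1, such as 3 or 7.

```python
def reduction_stages(rows, counter_inputs):
    """List the number of partial-product rows after each reduction stage.

    counter_inputs is the counter width (3 for a 3:2 counter, 7 for 7:3).
    The list starts with the initial row count and ends when 2 rows remain.
    """
    outputs = counter_inputs.bit_length()  # 3 -> 2 outputs, 7 -> 3 outputs
    history = [rows]
    while rows > 2:
        full, leftover = divmod(rows, counter_inputs)
        # Leftover rows need only as many output rows as bits in their count.
        rows = outputs * full + leftover.bit_length()
        history.append(rows)
    return history


stages_3_2 = reduction_stages(32, 3)
num_stages = len(stages_3_2) - 1
```

Calling the same function with `counter_inputs=7` models a tree built from 7:3 counters under the same leftover assumption.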
Figure 6.6: 7:3 counter block diagram and operation.
If 7:3 counters are utilized in a partial product reduction tree instead of 3:2
counters, the number of stages required reduces to four as the number of partial
products is reduced from 32 to 15, 15 to 7, 7 to 3, and 3 to 2. While an attractive option when
viewed abstractly, such a design is not typically implemented in practice. The reason
for this is that 7:3 counters are much more expensive to implement using conventional
technologies than 3:2 counters in terms of delay, power, and area. Consider the Boolean
functions x, y, and z implemented by the counter given in Equations 6.1-6.3.
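\[
\begin{aligned}
x ={} & \bar{a}(\bar{b}(\bar{c}(\bar{d}(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg))+d(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))) \\
& +c(\bar{d}(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))+d(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg)))) \\
& +b(\bar{c}(\bar{d}(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))+d(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg))) \\
& +c(\bar{d}(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg))+d(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))))) \\
& +a(\bar{b}(\bar{c}(\bar{d}(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))+d(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg))) \\
& +c(\bar{d}(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg))+d(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g})))) \\
& +b(\bar{c}(\bar{d}(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg))+d(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))) \\
& +c(\bar{d}(\bar{e}(\bar{f}\bar{g}+fg)+e(\bar{f}g+f\bar{g}))+d(\bar{e}(\bar{f}g+f\bar{g})+e(\bar{f}\bar{g}+fg)))))
\end{aligned}
\tag{6.1}
\]
\[
\begin{aligned}
y ={} & \bar{a}(\bar{b}(\bar{c}(\bar{d}(\bar{e}fg+e(f+g))+d(\bar{e}(f+g)+e(\bar{f}+\bar{g}))) \\
& +c(\bar{d}(\bar{e}(f+g)+e(\bar{f}+\bar{g}))+d(\bar{e}(\bar{f}+\bar{g})+e\bar{f}\bar{g}))) \\
& +b(\bar{c}(\bar{d}(\bar{e}(f+g)+e(\bar{f}+\bar{g}))+d(\bar{e}(\bar{f}+\bar{g})+e\bar{f}\bar{g})) \\
& +c(\bar{d}(\bar{e}(\bar{f}+\bar{g})+e\bar{f}\bar{g})+d(\bar{e}\bar{f}\bar{g}+efg)))) \\
& +a(\bar{b}(\bar{c}(\bar{d}(\bar{e}(f+g)+e(\bar{f}+\bar{g}))+d(\bar{e}(\bar{f}+\bar{g})+e\bar{f}\bar{g})) \\
& +c(\bar{d}(\bar{e}(\bar{f}+\bar{g})+e\bar{f}\bar{g})+d(\bar{e}\bar{f}\bar{g}+efg))) \\
& +b(\bar{c}(\bar{d}(\bar{e}(\bar{f}+\bar{g})+e\bar{f}\bar{g})+d(\bar{e}\bar{f}\bar{g}+efg)) \\
& +c(\bar{d}(\bar{e}\bar{f}\bar{g}+efg)+d(\bar{e}fg+e(f+g)))))
\end{aligned}
\tag{6.2}
\]
\[
\begin{aligned}
z ={} & a(b(c(d+e+f+g)+d(e+f+g)+e(f+g)+fg)+c(d(e+f+g)+e(f+g)+fg)+d(e(f+g)+fg)+efg) \\
& +b(c(d(e+f+g)+e(f+g)+fg)+d(e(f+g)+fg)+efg)+c(d(e(f+g)+fg)+efg)+defg
\end{aligned}
\tag{6.3}
\]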
Such functions are quite complex, and are often most efficiently implemented
in static CMOS using 3:2 counters, as shown in Figure 6.7.
Figure 6.7: CMOS implementation of a 7:3 counter using a network of 3:2 counters
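A 7:3 counter built from 3:2 counters can be modeled in a few lines. The sketch below uses one common arrangement of four full adders; the exact wiring of Figure 6.7 is not reproduced here, so the connection order is an assumption. It takes x as the weight-1 output, y as the weight-2 output, and z as the weight-4 output, which is how Equations 6.1-6.3 read.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 counter: returns (sum, carry) for three bits of equal weight."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def counter_7_3(a, b, c, d, e, f, g):
    """7:3 counter from four full adders; returns (x, y, z)."""
    s1, c1 = full_adder(a, b, c)
    s2, c2 = full_adder(d, e, f)
    x, c3 = full_adder(s1, s2, g)   # combines the weight-1 bits
    y, z = full_adder(c1, c2, c3)   # combines the weight-2 carries
    return x, y, z


def check_all_inputs() -> bool:
    """True when x + 2y + 4z equals the number of ones for all 128 inputs."""
    for bits in product((0, 1), repeat=7):
        x, y, z = counter_7_3(*bits)
        if x + 2 * y + 4 * z != sum(bits):
            return False
    return True


print(check_all_inputs())
```

The exhaustive check is cheap here because a 7:3 counter has only 128 input combinations.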
TLL gates, however, can be used to implement 7:3 counters in a much more efficient
manner by hybridizing the function. Because it is a symmetric function, the 7:3 counter
can be decomposed into sorting and counting operations, providing a simple and efficient
mapping to threshold logic gates [3]. Note that if the inputs a through g are sorted such
that a ≥ b ≥ c ≥ d ≥ e ≥ f ≥ g, the functions x, y, and z become much simpler, as
demonstrated in Equations 6.4-6.6.
\[ x = a\bar{b} + c\bar{d} + e\bar{f} + g \tag{6.4} \]
\[ y = b\bar{d} + f \tag{6.5} \]
\[ z = d \tag{6.6} \]
Sorting seven bits requires seven threshold functions, which can be implemented
using seven parallel TLL gates implementing the functions given in Equations
6.7-6.13.
\[ a_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 1 \tag{6.7} \]
\[ b_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 2 \tag{6.8} \]
\[ c_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 3 \tag{6.9} \]
\[ d_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 4 \tag{6.10} \]
\[ e_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 5 \tag{6.11} \]
\[ f_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 6 \tag{6.12} \]
\[ g_{sorted} = \Sigma(a+b+c+d+e+f+g) \geq 7 \tag{6.13} \]
A schematic of the complete 7:3 counter utilizing TLL gates can be seen in Figure
6.8. Hybrid CMOS/TLL 7:3 counters may be inserted at any point in a traditional
CMOS partial product reduction tree to replace sequential elements and a substantial
part of the tree. The data width of the tree should be carefully considered when placing
hybrid 7:3 counters, as a wider data path will correspond to a greater number of TLL
gates, which can result in increased wiring congestion and clock power dissipation.
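The sort-then-count decomposition can be checked exhaustively. The sketch below is a behavioral model only, with hypothetical function names; it reads Equations 6.4 and 6.5 with b, d, and f complemented in the product terms, which is the reading under which the three outputs equal the binary count of the inputs.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Outputs 1 when the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * v for w, v in zip(weights, inputs)) >= threshold)


def sort_bits(bits):
    """Equations 6.7-6.13: the k-th sorted bit is 1 when at least k inputs are 1."""
    unit_weights = [1] * len(bits)
    return [threshold_gate(bits, unit_weights, k) for k in range(1, len(bits) + 1)]


def hybrid_counter_7_3(bits):
    """Sort with seven threshold gates, then count as in Equations 6.4-6.6."""
    a, b, c, d, e, f, g = sort_bits(bits)
    x = (a & (1 - b)) | (c & (1 - d)) | (e & (1 - f)) | g
    y = (b & (1 - d)) | f
    z = d
    return x, y, z


def matches_binary_count() -> bool:
    """True when x + 2y + 4z equals the number of ones for all 128 inputs."""
    return all(
        x + 2 * y + 4 * z == sum(bits)
        for bits in product((0, 1), repeat=7)
        for x, y, z in [hybrid_counter_7_3(bits)]
    )


print(sort_bits((0, 1, 0, 1, 1, 0, 0)), matches_binary_count())
```

Each sorted bit depends only on how many inputs are high, which is why a single layer of unit-weight threshold gates suffices for the sorting step.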
6.2.3 Partial product addition
Once all of the partial products have been reduced to two vectors, the final two
partial products must be added together to compute the final product. The required
width of the adder will be between n and 2n bits, depending upon the structure of the
partial product reduction tree. As with partial product reduction, numerous architec-
tures exist for performing partial product addition. Serial implementations are slow,
exhibiting delays that are linear with respect to n, but economical in terms of area and
power consumption. Logarithmic tree-based implementations, such as Kogge-Stone,
Brent-Kung, and Han-Carlson adders [38, 11, 30], are more costly but preferred in
high performance applications due to their much lower computation times. In such
implementations, propagate-generate networks are used to compute all of the carry bits
used in the addition in parallel rather than one at a time.
Figure 6.8: Gate-level schematic of a clocked hybrid CMOS/TLL 7:3 counter.
In a single bit adder, a carry-out bit is generated if both inputs a and b assume
a logic 1. Similarly, the carry-in bit c is propagated to the carry-out bit if exactly one
of the two inputs a or b assumes a logic 1. The Boolean expressions for propagate (p)
and generate (g) are given in Equations 6.14 and 6.15.
p = a⊕b (6.14)
g = ab (6.15)
Once all of the bit-wise propagate and generate signals have been computed,
these signals are combined to compute the group propagate and generate signals as in
a small Manchester carry chain [36]. The final carry-out for bit i of the addition is
equivalent to the group generate signal for bits i through 0 (G[i : 0]). A radix 2 group
generate/propagate is computed according to Equations 6.16 and 6.17.
P[i : 0] = P[i : k]P[k−1 : 0] (6.16)
G[i : 0] = G[i : k]+P[i : k]G[k−1 : 0] (6.17)
In this case, G[i:0] is a threshold function with inputs G[i:k], P[i:k], and G[k-1:0],
weight vector {2, 1, 1}, and a threshold value of 2. P[i:0] is a threshold function with
inputs P[i:k] and P[k-1:0], weight vector {1, 1}, and a threshold value of 2. Although any group
generate/propagate function of any radix is threshold, there is a practical limit to how
much of the tree can be absorbed into a single layer of TLL gates, as the size of the
threshold gates required increases exponentially with the magnitude of the radix. In the
65 nm LP bulk CMOS process used, three is the maximum radix permissible; four or more
will require cells that are large and difficult to size for reliability without substantial
performance and/or power dissipation costs.
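The radix 2 weight assignments quoted above can be confirmed by enumerating every input combination. In this sketch the names g_hi, p_hi, g_lo, and p_lo are hypothetical shorthand for G[i:k], P[i:k], G[k-1:0], and P[k-1:0]; the code is a truth-table check, not a model of the TLL circuit.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Outputs 1 when the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * v for w, v in zip(weights, inputs)) >= threshold)


def group_generate(g_hi, p_hi, g_lo):
    """Equation 6.17 as a threshold function: weights {2, 1, 1}, threshold 2."""
    return threshold_gate((g_hi, p_hi, g_lo), (2, 1, 1), 2)


def group_propagate(p_hi, p_lo):
    """Equation 6.16 as a threshold function: weights {1, 1}, threshold 2."""
    return threshold_gate((p_hi, p_lo), (1, 1), 2)


def agrees_with_boolean_forms() -> bool:
    """Compare both threshold gates against the Boolean expressions."""
    generate_ok = all(
        group_generate(g_hi, p_hi, g_lo) == (g_hi | (p_hi & g_lo))
        for g_hi, p_hi, g_lo in product((0, 1), repeat=3)
    )
    propagate_ok = all(
        group_propagate(p_hi, p_lo) == (p_hi & p_lo)
        for p_hi, p_lo in product((0, 1), repeat=2)
    )
    return generate_ok and propagate_ok


print(agrees_with_boolean_forms())
```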
Once all of the group generate and propagate bits have been computed, all of the
carry bits can be easily computed for each bit of the addition. Once all of the carry bits
are known, the sum bits which comprise the final product can be computed in parallel,
as given by Equation 6.18.
\[ \mathrm{sum}[n] = \bar{a}[n]\bar{b}[n]c[n] + \bar{a}[n]b[n]\bar{c}[n] + a[n]\bar{b}[n]\bar{c}[n] + a[n]b[n]c[n] \tag{6.18} \]
In addition, there exists a relationship between each carry bit and the carry bit
preceding it, as shown in Equation 6.19.
c[n+1] = a[n]b[n]+a[n]c[n]+b[n]c[n] (6.19)
Substituting this relationship into Equation 6.18 yields an alternative computa-
tion of sum[n], as given in Equation 6.20.
\[ \mathrm{sum}[n] = \overline{c[n+1]}\,(a[n]+b[n]+c[n]) + a[n]b[n]c[n] \tag{6.20} \]
This alternative encoding of the sum, unlike the original, is a threshold function
[56]. It possesses inputs a[n], b[n], c[n], and the complement of c[n+1], a weight
vector of {1, 1, 1, 2}, and the threshold value 3.
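The threshold form of the sum bit can be verified against the ordinary XOR definition. The sketch below assumes that the weight-2 input is the complement of the carry-out c[n+1], since that is the reading under which the gate reproduces the sum for every input combination; the function names are hypothetical.

```python
from itertools import product


def carry_out(a, b, c):
    """Equation 6.19: majority of the two operand bits and the carry-in."""
    return (a & b) | (a & c) | (b & c)


def sum_threshold(a, b, c, c_next):
    """Sum bit as a threshold gate: weights {1, 1, 1, 2}, threshold 3.
    Assumption: the weight-2 input is the complement of c[n+1]."""
    return int(a + b + c + 2 * (1 - c_next) >= 3)


def matches_xor_sum() -> bool:
    """True when the threshold gate equals a XOR b XOR c for all inputs."""
    return all(
        sum_threshold(a, b, c, carry_out(a, b, c)) == a ^ b ^ c
        for a, b, c in product((0, 1), repeat=3)
    )


print(matches_xor_sum())
```

Because the carry-out is already available from the prefix network, the sum bit costs one additional threshold gate per bit position in this formulation.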
6.3 Multiplier architecture
To provide a quantitative measure of the advantages of CMOS/TLL hybridiza-
tion in an integer multiplier, a number of different multipliers were designed and simu-
lated. As a reference, a multiplier comprised solely of conventional CMOS components
was designed, as well. The reference design provides 2’s complement multiplication of
two 32-bit integer operands in a two-stage pipeline, shown in Figure 6.9.
Figure 6.9: Block level schematic of the two-stage CMOS multiplier. Blocks shown:
partial product generation, partial product reduction (stages 1-5), partial product
reduction (stages 6-8), and partial product addition (stages 1-8).
A second multiplier with the same functional specifications as the reference was
designed and implemented using a combination of CMOS and TLL components;
specifically, a layer of hybrid 7:3 counters replaces stages 5, 6, and 7 in the partial product
reduction tree. The sum stage of the partial product addition tree is replaced by
threshold logic elements, as well. The block diagram of the hybrid CMOS/TLL design is
shown in Figure 6.10.
Figure 6.10: Block level schematic of the two-stage CMOS/TLL multiplier. Blocks shown:
partial product generation, partial product reduction (stages 1-4), TLL hybrid 7:3
counters, partial product reduction (stage 8), partial product addition (stages 1-7),
and TLL partial product addition (stage 8).
Once the precise set of threshold functions required in the hybrid CMOS/TLL
multiplier has been determined, standard cells for each must be created so that the
design can be synthesized, placed, and routed. A total of 29 unique functions are
utilized by the two layers of TLL gates in the multiplier designs. These are enumerated
in Table 6.1, along with the number of inputs n required in each input network of each
gate and the worst case input combinations in terms of delay (min) and power/reliability
(max).
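The function labels used in Table 6.1 can be evaluated mechanically if each label is read as a string of single-digit input weights followed by a threshold. That reading is an assumption inferred from the functions derived in Section 6.2 (for example, 2111;3 matches the weights and threshold of the sum function); the helper names below are hypothetical.

```python
def parse_function(label: str) -> tuple[list[int], int]:
    """Split a label such as '2111;3' into input weights and a threshold.
    Assumes every weight is a single decimal digit."""
    weight_digits, threshold = label.split(";")
    return [int(digit) for digit in weight_digits], int(threshold)


def evaluate(label: str, inputs) -> int:
    """Evaluate the threshold function named by `label` on a tuple of bits."""
    weights, threshold = parse_function(label)
    if len(inputs) != len(weights):
        raise ValueError("one input value is needed per weight")
    return int(sum(w * v for w, v in zip(weights, inputs)) >= threshold)


print(parse_function("2111;3"))
print(evaluate("1111111;4", (1, 1, 0, 1, 0, 1, 0)))
```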
6.4 Design of TLL standard cell library
TLL cells have a very regular structure, as the layout in Figure 6.11 shows.
Apart from the sizing of the physical parameters and the number of inputs in each
input network, the components of the gate are unchanged from cell to cell, regardless
of function. A single cell can thus potentially be used to implement a wide range of
threshold functions.
Depending on the signal assignment applied to the gate, the cell shown in Fig-
ure 6.11 has the ability to implement a variety of functions; these functions are enumer-
ated in Table 6.2. This set of realizable functions is determined solely by the number
of inputs in each input network. Naturally, the regular structure and robustness of TLL
gates make them very well suited for standard cell-based design. If the threshold func-
tions required to augment the multiplier design are known, it is a simple matter to
design and optimize standard cells for each function.
Table 6.1: TLL functions employed in CMOS/TLL multiplier designs.
Function | n | Min. worst case αL/αR | Max. worst case αL/αR
1;1 | 4 | 1/0, 0/1 | 1/0, 0/1
11;1 | 6 | 1/0, 0/1 | 2/1, 0/1
11;2 | 6 | 1/0, 0/1 | 1/0, 1/2
111;1 | 8 | 1/0, 0/1 | 2/1, 0/1
111;2 | 6 | 2/1, 1/2 | 2/1, 1/2
111;3 | 8 | 1/0, 0/1 | 1/0, 1/2
1111;1 | 10 | 1/0, 0/1 | 2/1, 0/1
1111;2 | 8 | 2/1, 1/2 | 3/2, 2/3
1111;3 | 8 | 2/1, 1/2 | 3/2, 2/3
1111;4 | 10 | 1/0, 0/1 | 1/0, 1/2
11111;1 | 12 | 1/0, 0/1 | 2/1, 0/1
11111;2 | 10 | 2/1, 1/2 | 4/3, 2/3
11111;3 | 8 | 3/2, 2/3 | 3/2, 2/3
11111;4 | 10 | 2/1, 1/2 | 3/2, 3/4
11111;5 | 12 | 1/0, 0/1 | 1/0, 1/2
111111;1 | 14 | 1/0, 0/1 | 2/1, 0/1
111111;2 | 12 | 2/1, 1/2 | 4/3, 2/3
111111;3 | 10 | 3/2, 2/3 | 4/3, 2/3
111111;4 | 10 | 3/2, 2/3 | 3/2, 3/4
111111;5 | 12 | 2/1, 1/2 | 3/2, 3/4
111111;6 | 14 | 1/0, 0/1 | 1/0, 1/2
1111111;1 | 16 | 1/0, 0/1 | 2/1, 0/1
1111111;2 | 14 | 2/1, 1/2 | 4/3, 2/3
1111111;3 | 12 | 3/2, 2/3 | 5/4, 4/5
1111111;4 | 10 | 4/3, 3/4 | 4/3, 3/4
1111111;5 | 12 | 3/2, 2/3 | 5/4, 4/5
1111111;6 | 14 | 2/1, 1/2 | 3/2, 3/4
1111111;7 | 16 | 1/0, 0/1 | 1/0, 1/2
2111;3 | 8 | 3/2, 2/3 | 3/2, 2/3
Figure 6.11: Standard cell implementation of a TLL gate with 5 inputs in each input
network.
Table 6.2: Threshold functions realizable by the TLL cell in Figure 6.11.
Function
a
a + b
ab
a + b + c
a(b + c) + bc
abc
a + bc
a(b + c)
a(b + c + d) + b(c + d) + cd
a[b(c+d) + cd] + bcd
a(b + c + d) + bcd
a[b(c + d + e) + c(d + e) + de] + b[c(d + e) + de] + cde
Libraries of standard cells were designed using Cadence Virtuoso and the
STMicroelectronics 65 nm LP bulk CMOS process; cells were created to support all
threshold functions with a sum of weights of up to 13 and a threshold value of up to 7.
Variants of each cell were also designed to support up to five different drive strengths.
Cells were also developed for both standard and high threshold voltage. Each cell is
double standard height, with two Vdd supply rails and a single shared Vss supply rail.
All routing is performed in Metal 1 and Polysilicon layers. In addition, well taps are
included within each cell.
Cell netlists were extracted from layout using Synopsys StarXtract Version C-
2009.06 and characterized for delay, power, and leakage using Synopsys HSpice Ver-
sion B-2008.09. The characterized data was recorded and organized for compatibility
with the Synopsys Liberty format. LEF files were produced for each standard cell lay-
out using Cadence Abstract. Post-characterization, the cells are treated similarly to any
standard cell implemented using traditional CMOS components, and are fully com-
patible with commercial synthesis, place and route tools. The TLL standard cells can
be placed in a design side by side with traditional CMOS standard cells without any
changes to the automated design flow.
6.5 Synthesis, place, and route
Using Cadence Encounter version 7.1, the CMOS reference and CMOS/TLL
hybrid designs were each synthesized over a range of frequencies up to the maximum
attainable. Synthesis was performed assuming worst case PVT delay conditions (slow
NMOS, slow PMOS, a 1.1 V supply, and a temperature of 105 °C) using the 65 nm LP
bulk CMOS and TLL standard cell libraries developed by STMicroelectronics and
VEDA lab personnel, respectively.
Following synthesis, the synthesized designs were placed and routed using Ca-
dence Encounter. All designs were assumed to have a square aspect ratio and a final tar-
get density of 65-75%. Input ports (including the clock port) were positioned equidis-
tantly on one edge of the design, while output ports were positioned equidistantly on
the opposite edge of the design. Five metal layers (three horizontal, two vertical) were
made available for routing. All designs were optimized until the tool reported timing
closure with no negative slack along any timing path. A comparison between the de-
signs can be found in Table 6.3. Plots providing a visual representation of the same
data assuming typical operating conditions are given in Figures 6.12 and 6.13.
Figure 6.12: Area vs. frequency comparison between placed and routed CMOS and
CMOS/TLL multiplier designs at typical design corner.
Area results post-place and route include all standard cells and routed intercon-
nect (including the clock tree), as well as the power grid, well taps, and filler cells. As
the table indicates, there is significant area overhead associated with the hybridization
of the multiplier using TLL cells. At the lowest synthesized frequency (357 MHz), the
CMOS/TLL multiplier requires 37% more area than an equivalent design utilizing only
CMOS. At low target frequencies, timing constraints in both designs are easily
satisfied by a small number of modestly sized standard cells; as a consequence, the
increased number of sequential elements and the significant area overhead of TLL cells
cause the CMOS/TLL multiplier to occupy more area.
Figure 6.13: Standard cell count vs. frequency comparison between placed and routed
CMOS and CMOS/TLL multiplier designs at typical design corner.
Table 6.3: Place and route results of CMOS and CMOS/TLL multiplier designs.
Multiplier | Target freq. (WC) | PT freq. (WC) | PT freq. (typ.) | Block area | Cell count
CMOS | 357 MHz | 350 MHz | 558 MHz | 0.037 mm² | 3108
CMOS | 500 MHz | 460 MHz | 769 MHz | 0.057 mm² | 7575
CMOS | 555 MHz | 507 MHz | 839 MHz | 0.066 mm² | 9193
CMOS | 600 MHz | 562 MHz | 928 MHz | 0.082 mm² | 10460
CMOS/TLL | 357 MHz | 416 MHz | 644 MHz | 0.051 mm² | 3603
CMOS/TLL | 500 MHz | 485 MHz | 763 MHz | 0.056 mm² | 4043
CMOS/TLL | 555 MHz | 525 MHz | 853 MHz | 0.059 mm² | 4922
CMOS/TLL | 600 MHz | 562 MHz | 928 MHz | 0.066 mm² | 6041
CMOS/TLL | 666 MHz | 620 MHz | 1.029 GHz | 0.069 mm² | 7087
CMOS/TLL | 769 MHz | 680 MHz | 1.110 GHz | 0.092 mm² | 9198
CMOS/TLL | 833 MHz | 757 MHz | 1.233 GHz | 0.122 mm² | 10422
As the frequency is increased, however, this discrepancy in design area is re-
versed. The CMOS design possesses significantly less timing slack than the CMOS/TLL
design, and thus requires a greater amount of sizing as clock frequency increases than
its CMOS/TLL counterpart. At a synthesized frequency of 600 MHz, the highest at-
tainable by the CMOS design, the hybrid design achieves an area reduction of 1.24x.
This area advantage diminishes as the performance of the CMOS/TLL design is further
increased. A layout of a fully placed and routed CMOS/TLL multiplier design with
highlighted clock tree can be found in Figure 6.14.
Figure 6.14: Placed and routed layout of a CMOS/TLL multiplier design with high-
lighted clock tree.
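The area comparisons discussed above follow directly from the block areas in Table 6.3. The short calculation below copies those values and computes the hybrid-to-CMOS area ratio at each target frequency shared by both designs; the variable and function names are hypothetical. The percentages quoted in the text agree with these ratios to within rounding.

```python
# Block areas in mm^2, copied from Table 6.3 and keyed by target frequency in MHz.
cmos_area = {357: 0.037, 500: 0.057, 555: 0.066, 600: 0.082}
hybrid_area = {357: 0.051, 500: 0.056, 555: 0.059, 600: 0.066}


def area_ratio(target_mhz: int) -> float:
    """CMOS/TLL block area divided by CMOS block area at one target frequency."""
    return hybrid_area[target_mhz] / cmos_area[target_mhz]


for target in sorted(cmos_area):
    print(f"{target} MHz: hybrid/CMOS area = {area_ratio(target):.2f}")

# Area reduction of the hybrid design at the highest frequency the CMOS design reaches.
reduction_at_600 = 1 / area_ratio(600)
print(f"area reduction at 600 MHz: {reduction_at_600:.2f}x")
```

The ratio crosses below one between the 357 MHz and 500 MHz targets, which is the reversal described in the text.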
6.6 Mixed-signal simulation
Post-place and route, each design was simulated for dynamic and leakage power
dissipation using Synopsys Nanosim Version B-2008.09. Dynamic power measure-
ments were averaged over 1000 clock cycles, applying a stimulus of 1000 pseudo-
random input vectors. A clock was applied at the maximum frequency supported by
each design. Clock power measurements were averaged over 1000 clock cycles, apply-
ing a constant input stimulus to ensure no activity in any combinational logic elements
of the data path. Leakage power measurements were taken assuming no activity on the
input signals or clock, with all idle at logic 0. All simulations were conducted assum-
ing typical and worst case PVT delay conditions. The results of the simulations can be
found in Tables 6.4 and 6.5. Plots providing a visual representation of the same data
assuming typical operating conditions are given in Figures 6.15 and 6.16.
Figure 6.15: Total power vs. frequency comparison between placed and routed CMOS
and CMOS/TLL multiplier designs at typical design corner.
As the tables indicate, the CMOS/TLL design is capable of achieving
a much greater frequency than the maximum exhibited by the purely CMOS design,
approximately 33% higher under both typical and worst case delay conditions. This in-
crease in performance is not without significant cost, however. The fastest CMOS/TLL
design dissipates between 82 and 86% more power than the fastest CMOS design dur-
ing periods of high activity, and dissipates up to 53% more leakage power, as well.
Figure 6.16: Leakage power vs. frequency comparison between placed and routed
CMOS and CMOS/TLL multiplier designs at typical design corner.
Table 6.4: Nanosim simulation results of multiplier designs at worst case delay corner
(SS, 1.1V, 105C).
Multiplier | Simulated freq. | Total power | Leakage power
CMOS | 350 MHz | 6.7 mW | 2.3 µW
CMOS | 460 MHz | 15.9 mW | 6.7 µW
CMOS | 507 MHz | 22.7 mW | 9.5 µW
CMOS | 562 MHz | 29.6 mW | 11.7 µW
CMOS/TLL | 416 MHz | 12.6 mW | 1.9 µW
CMOS/TLL | 485 MHz | 16.4 mW | 2.7 µW
CMOS/TLL | 525 MHz | 18.8 mW | 4.0 µW
CMOS/TLL | 562 MHz | 21.3 mW | 4.5 µW
CMOS/TLL | 620 MHz | 27.2 mW | 6.8 µW
CMOS/TLL | 680 MHz | 37.9 mW | 10.5 µW
CMOS/TLL | 757 MHz | 55.0 mW | 18.0 µW
Table 6.5: Nanosim simulation results of multiplier designs at typical delay corner (TT,
1.2V, 25C).
Multiplier | Simulated freq. | Total power | Leakage power
CMOS | 558 MHz | 12.5 mW | 0.6 µW
CMOS | 769 MHz | 31.2 mW | 2.4 µW
CMOS | 839 MHz | 44.3 mW | 3.5 µW
CMOS | 928 MHz | 58.5 mW | 4.5 µW
CMOS/TLL | 644 MHz | 23.3 mW | 0.6 µW
CMOS/TLL | 763 MHz | 30.6 mW | 0.9 µW
CMOS/TLL | 853 MHz | 36.2 mW | 1.8 µW
CMOS/TLL | 928 MHz | 41.5 mW | 1.8 µW
CMOS/TLL | 1.029 GHz | 53.5 mW | 2.8 µW
CMOS/TLL | 1.110 GHz | 73.2 mW | 3.8 µW
CMOS/TLL | 1.233 GHz | 106.5 mW | 4.9 µW
Alternatively, rather than increase performance, it is possible for the CMOS/TLL
design to match the maximum frequency attainable by the CMOS design while providing
a reduction in both total power and leakage. The CMOS/TLL multiplier with the same
performance as the fastest CMOS design provides a reduction of between 1.38x and
1.40x in total power dissipation, as well as a 2.5x to 2.6x reduction in leakage power
dissipation.
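The power comparisons in this section can be recomputed from the rows of Tables 6.4 and 6.5. The sketch below copies the relevant rows (the fastest CMOS design, the CMOS/TLL design at the same frequency, and the fastest CMOS/TLL design) and forms the ratios; the dictionary keys and function name are hypothetical. The figures quoted in the text agree with these ratios to within rounding.

```python
# (total power in mW, leakage power in uW), copied from Tables 6.4 and 6.5.
worst_case = {
    "cmos_fastest": (29.6, 11.7),     # 562 MHz
    "hybrid_matched": (21.3, 4.5),    # 562 MHz
    "hybrid_fastest": (55.0, 18.0),   # 757 MHz
}
typical = {
    "cmos_fastest": (58.5, 4.5),      # 928 MHz
    "hybrid_matched": (41.5, 1.8),    # 928 MHz
    "hybrid_fastest": (106.5, 4.9),   # 1.233 GHz
}


def ratios(corner: dict, numerator: str, denominator: str) -> tuple[float, float]:
    """Return (total power ratio, leakage ratio) between two table rows."""
    (p_num, l_num), (p_den, l_den) = corner[numerator], corner[denominator]
    return p_num / p_den, l_num / l_den


for name, corner in (("worst case", worst_case), ("typical", typical)):
    saving = ratios(corner, "cmos_fastest", "hybrid_matched")
    cost = ratios(corner, "hybrid_fastest", "cmos_fastest")
    print(f"{name}: matched-speed reduction, power {saving[0]:.2f}x, leakage {saving[1]:.2f}x")
    print(f"{name}: fastest-design increase, power {cost[0]:.2f}x, leakage {cost[1]:.2f}x")
```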
6.7 Design and fabrication of a 65 nm LP bulk CMOS multiplier test architecture
To confirm the accuracy of the simulation data and the robustness of the hybrid
CMOS/TLL standard cell multiplier design, a test chip was designed and fabricated
using a combination of STMicroelectronics standard cells and the custom standard cell
libraries developed. In addition to the multipliers, the test architecture must also house
additional circuitry to provide data to and receive data from the multipliers. A block
diagram of the complete test architecture is shown in Figure 6.17.
Figure 6.17: Block level architecture of complete test chip.
Within the test architecture are a number of multipliers, data sources, output
sources, and clock sources. Most of the external signals provided to the chip are used
to select which sources are enabled and the mode in which the chip operates. Every
clock cycle, the selected multiplier will receive data from the selected data source and
produce new output, which is processed by the selected output source before returning
to the selected data source. Each of the sources available on the test chip is described
in detail in the following subsections.
6.7.1 Multipliers
Four multiplier designs utilizing the architecture described previously were in-
cluded in the test chip. These include one CMOS multiplier and three CMOS/TLL
multipliers. The CMOS multiplier (CMOS SVT) was designed using 65 nm LP SVT
standard cells and synthesized, placed, and routed for the maximum attainable fre-
quency. The first CMOS/TLL multiplier (TLL SVT slow) was designed using the 65
nm LP SVT standard cells and synthesized, placed, and routed for the same frequency
as the CMOS multiplier. The second CMOS/TLL multiplier (TLL SVT fast) was de-
signed using the 65 nm LP SVT standard cells and synthesized, placed, and routed for
the maximum attainable frequency. The final CMOS/TLL multiplier (TLL HVT) was
designed using the 65 nm LP HVT standard cells and synthesized, placed, and routed
for the maximum attainable frequency.
Each multiplier was designed with its own isolated power domain such that
the power dissipation (and leakage) of each multiplier can be measured separately.
The multipliers are enabled and reset via common I/O control signals mult_en and rst,
respectively. Only one multiplier can be selected at a time, determined via the 2-bit I/O
control signal mult_sel[1:0].
6.7.2 Data sources
Three 64-bit data sources were instantiated on the chip to provide data to the
multipliers, selected via the two-bit I/O control signal ds_sel[1:0]. These include a scan
register, a linear feedback shift register, and a 64-word FIFO. A block level diagram of
the data sources is shown in Figure 6.18.
The scan register is the primary data source through which all data in and out of
the test design must pass. All data in and out of the test chip is scanned in and scanned
out serially via the I/O pins scan_si and scan_so, respectively, although all of the bits of
the scan register can be loaded in parallel internally as determined by the I/O control
signal scan_mode. The scan register is enabled via the I/O control signal scan_en, and
reset via the signal rst.
The linear feedback shift register, or LFSR, provides pseudo-random, high ac-
tivity data. The LFSR always provides the same sequence of data given a particular
initial state. The initial state can be reset to 0, or set to any 64-bit vector via data input
from the scan register, as determined by the control signal lfsr_mode. The LFSR is
enabled via the I/O control signal lfsr_en, and reset via the signal rst.
Figure 6.18: Block diagram of multiplier test architecture data sources.
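The repeatability property of the LFSR (the same initial state always yields the same sequence) can be illustrated with a small software model. The feedback polynomial of the test chip is not stated, so the tap positions below are an assumption, as is the use of XNOR feedback, which is chosen here because it lets the register leave the all-zero reset state. All names are hypothetical.

```python
MASK64 = (1 << 64) - 1
TAPS = (64, 63, 61, 60)  # assumed tap positions, not taken from the test chip


def lfsr_step(state: int) -> int:
    """Advance a 64-bit Fibonacci LFSR by one clock using XNOR feedback,
    so that the all-zero reset state is not a lock-up state."""
    feedback = 1
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & MASK64


def lfsr_sequence(seed: int, length: int) -> list[int]:
    """Return `length` successive states, starting after the seed."""
    states, state = [], seed & MASK64
    for _ in range(length):
        state = lfsr_step(state)
        states.append(state)
    return states


print(lfsr_sequence(0, 4))
```

Because the sequence depends only on the seed, a simulation seeded identically can predict every vector the hardware applies, which is what makes signature comparison possible at speeds where results cannot be scanned out.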
The FIFO provides a sequence of 64 specified data vectors, which at high frequencies
may be cycled indefinitely, as determined by the I/O control signal fifo_mode.
The data vectors are specified via the scan registers. The FIFO is enabled via the I/O
control signal regf_en, and reset via the signal rst.
6.7.3 Output selection and verification
Three sources of output are available, selected via a 2-bit I/O control signal
result_sel[1:0]. These include direct output from the multipliers, an accumulated
signature, and a clock cycle counter. A block level diagram of the output sources is shown
in Figure 6.19.
The direct output provides direct access to the 64-bit product of the currently
selected multiplier design, and is used primarily for functional testing. At low
frequencies, the output of each multiplier can be directly scanned out and verified before the
next input vector is applied.
Figure 6.19: Block diagram of multiplier test architecture output sources.
The 128-bit signature generator, once enabled via the I/O control signal sign_en,
accumulates results from the selected multiplier over numerous clock cycles; its pur-
pose is to provide verification of proper multiplier operation over a large set of input
vectors when the results of the multiplier are unable to be scanned out and analyzed
after each clock cycle (during high frequency testing, for example).
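The idea of signature-based verification can be sketched in software. The compaction scheme used on the test chip is not described, so the rotate-and-XOR accumulator below is purely a hypothetical stand-in that shows the principle: a long stream of products is folded into one 128-bit value that can be compared against a simulated reference.

```python
MASK128 = (1 << 128) - 1
MASK64 = (1 << 64) - 1


def accumulate(signature: int, product: int) -> int:
    """Fold one 64-bit product into a 128-bit signature: rotate the
    signature left by one bit, then XOR the product into the low bits.
    Hypothetical scheme, not the one implemented on the test chip."""
    rotated = ((signature << 1) | (signature >> 127)) & MASK128
    return rotated ^ (product & MASK64)


def signature_of(products) -> int:
    """Signature of a whole sequence of products, starting from zero."""
    signature = 0
    for product in products:
        signature = accumulate(signature, product)
    return signature


good = [3 * 5, 7 * 11, 13 * 17]
bad = [3 * 5, (7 * 11) ^ 1, 13 * 17]   # one product with a single flipped bit
print(signature_of(good) == signature_of(bad))
```

A single corrupted product changes the final signature in this scheme, so a mismatch against the reference flags a failure without scanning out each result.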
The clock cycle counter provides as a 64-bit vector the number of clock cycles
elapsed since it was enabled via the I/O control signal tick_en. In addition, the clock
cycle counter outputs a heavily divided (by a factor of 2^18) clock signal directly to
output pin lfreq_op in the I/O ring for direct observation via oscilloscope. Both the
signature generator and clock cycle counter are reset via the common control signal
rst.
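The divided clock gives a simple way to infer the internal clock frequency from an oscilloscope reading. The sketch below assumes the division factor is 2^18, which is how the factor given in the text is read here; the function names are hypothetical.

```python
DIVIDE_FACTOR = 2 ** 18  # assumed reading of the division factor


def core_clock_hz(observed_hz: float) -> float:
    """Infer the core clock frequency from the divided clock seen on lfreq_op."""
    return observed_hz * DIVIDE_FACTOR


def observed_clock_hz(core_hz: float) -> float:
    """Frequency expected at lfreq_op for a given core clock frequency."""
    return core_hz / DIVIDE_FACTOR


print(observed_clock_hz(200e6))
```

Under this assumption a 200 MHz core clock appears at the pin at roughly 763 Hz, comfortably within the range of a bench oscilloscope.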
6.7.4 Clock generation
Two external and two internal clock sources are available for testing purposes,
selected via the two-bit I/O control signal clk_sel[1:0]. The architecture of the clock
generation circuitry is shown in Figure 6.20.
The two external clocks are provided via the I/O ring. The first provided is a
simple clock signal (clk). The second is a two phase clock signal (en1, en2) designed as
a contingency in case excessive ringing was observed on the first external clock signal
during testing.
Figure 6.20: Block diagram of multiplier test architecture clock generator.
For high frequency testing, eight ring oscillators consisting of inverter chains
of lengths 16, 20, 24, 28, 32, 36, 40, and 44 are provided in an isolated power domain
surrounded by a large decoupling capacitance. All of the ring oscillators share a single
enable signal rosc_en. Only one ring oscillator is enabled at a time, and is selected via
the 3-bit I/O control signal rosc_sel[2:0].
A configurable clock divider is also present on the chip, which divides the se-
lected ring oscillator by an integer value between 1 and 31. The divisor is loaded from
the five least significant bits of the scan register, and is enabled by the I/O control signal
div_en.
6.7.5 Top level design and manufacturing
A layout of the complete test chip is shown in Figure 6.21. The test chip was
manufactured by Circuits Multi Projets (CMP) in Grenoble, France. A magnified pho-
tograph of the die is shown in Figure 6.22.
The 25 bare dies received from CMP were packaged in 44 pin J-leaded chip
carriers with transparent lids by Advotech Company Inc. in Tempe, AZ, according
to the bonding diagram shown in Figure 6.23. Because they were improperly wire bonded
the first time, the 25 dies had to be unpackaged and re-bonded, again by Advotech
Company Inc. One of the 25 dies was lost in this process.
Figure 6.21: Layout of the complete test architecture with I/O ring.
Figure 6.22: Magnified and annotated photograph of a manufactured test die.
Figure 6.23: Bonding diagram of the multiplier test chip.
6.7.6 Measurement and analysis of the 65 nm LP bulk CMOS multiplier test chip
The PC used for testing was augmented with a Measurement Computing PCI-
DIO48H digital I/O board. Since the I/O board produces digital signals with a high
voltage of 5V and the I/O ring on the chip is designed for 2.5V signals, a voltage
divider was constructed from discrete components between the digital I/O board and
the test PCB to reduce the high voltage of the signals provided by the digital I/O board
to 2.5V.
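The level shift from 5V to 2.5V corresponds to a divider ratio of one half. The sketch below assumes a simple two-resistor divider with negligible loading; the resistor values are hypothetical, since the components actually used are not listed.

```python
def divider_output(v_in: float, r_top: float, r_bottom: float) -> float:
    """Unloaded output voltage of a two-resistor divider."""
    return v_in * r_bottom / (r_top + r_bottom)


# Hypothetical resistor values; any equal pair halves the input voltage.
v_out = divider_output(5.0, r_top=10_000.0, r_bottom=10_000.0)
print(v_out)
```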
For the purposes of testing the fabricated design, a custom PCB was designed
and manufactured to house the packaged dies. The I/O pins were designed such that
the 64-wire flat ribbon cable connected to the digital I/O board in the PC could plug
directly into the PCB. The layout of the test PCB is featured in Figure 6.24, and a
photograph of the test PCB mounted with a packaged die is shown in Figure 6.25.
Figure 6.24: Layout of the test PCB.
Figure 6.25: Photograph of the test PCB mounted with a packaged die.
Additional equipment available for testing included two Tektronix PS280 DC
Power supplies, a Tektronix TDS205B two channel color digital phosphor oscilloscope,
and a Web-tronics.com autoranging multimeter (model CSI 9903). The two power
supplies enable the supply of up to four unique DC supply voltages with a resolution
of 100 mV. The oscilloscope enables the periodic observation of the input and output
waveforms, and the multimeter enables the measurement of current with a minimum
resolution of 1 µA. A photograph of the complete test setup is shown in Figure 6.26.
Figure 6.26: Photograph of the complete test setup including PC, PCB, power supplies,
oscilloscope, and DMM.
Out of the 24 dies received post-packaging, 23 managed to successfully power
up. Each die was individually powered up on the test board using a DC power supply.
A voltage of 2.5V was provided to the I/O ring, and a voltage of 1.2V was provided to
all of the power domains in the core. The single die that failed exhibited a large spike
in current after being connected to the DC power supply, indicating a short between the
power supply and ground. No activity was recorded on the failing die after the initial
current spike.
6.7.6.1 Functional testing
Each of the 23 packaged dies that passed the power up test was tested for proper
functionality. This entailed individual testing of each data source and each multiplier
both with and without signature verification for a small number of data vectors at low
frequency applied via the external clock signal (clk). Output received from the I/O pin
scan_so was verified by comparing the measured output with simulated output for each
test. A divided clock signal was directly observable on the oscilloscope via the I/O pin
lfreq_op.
6.7.6.2 Performance testing
The role of performance testing was to determine the maximum operating fre-
quency of each multiplier using the internal ring oscillators as a clock source. Once the
initial state of the test chip was configured using the external clock and control signals,
the internal clock was selected and allowed to run for several thousand cycles. The
internal clock was then disabled, and the accumulated signature and clock cycle count
were scanned out using the external clock. If the signature and cycle count matched a
simulated reference, the multiplier was deemed to be functioning properly at the given
frequency.
Unfortunately, none of the four multipliers on any of the 23 working dies pro-
duced correct output for frequencies above 200 MHz when high activity input vectors
were provided via the LFSR. It is strongly suspected that this is due to an overly sparse
power grid in the design, resulting in an IR drop that prevents proper delivery of the
power necessary to operate the multipliers at high frequencies. This theory is sup-
ported by the fact that the current observed in the test chip actually decreases as the
clock frequency is increased at high frequencies, suggesting that the supply voltage has
decreased.
6.7.6.3 Power testing
While dynamic power measurements could not be taken at high frequencies due
to the problems with the power grid, it was possible to take measurements at frequen-
cies below 200 MHz. The dynamic power of each multiplier design was determined
by measuring the current of the multiplier’s power domain using the internal ring os-
cillators to supply the desired frequency, processing new input vectors from the LFSR
on every clock cycle. While the power measured is technically the total power of each
multiplier and not just dynamic power, the dynamic power is several hundred to several
thousand times larger than the leakage (milliwatts versus microwatts). The dynamic power
measurements taken for each multiplier are summarized in Table 6.6.
Table 6.6: Average measured multiplier power dissipation across 23 functional dies
(Vdd = 1.2V) at four different internal clock frequencies.
Multiplier       50 MHz     100 MHz    160 MHz     220 MHz
CMOS SVT         3.94 mW    6.98 mW    9.88 mW     11.88 mW
TLL SVT slow     2.56 mW    4.68 mW    6.85 mW     8.50 mW
TLL SVT fast     5.54 mW    8.78 mW    11.30 mW    12.47 mW
TLL HVT          4.76 mW    7.66 mW    9.58 mW     9.48 mW
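As a reading aid for Table 6.6, the following sketch computes the measured power reduction of one design relative to the CMOS SVT baseline and the energy per clock cycle at each frequency. The numbers are copied from the table; the helper names are hypothetical, and the derived quantities are simple arithmetic on the published values rather than additional measurements.

```python
# Measured average power in mW, copied from Table 6.6.
freqs_mhz = [50, 100, 160, 220]
power_mw = {
    "CMOS SVT":     [3.94, 6.98, 9.88, 11.88],
    "TLL SVT slow": [2.56, 4.68, 6.85, 8.50],
    "TLL SVT fast": [5.54, 8.78, 11.30, 12.47],
    "TLL HVT":      [4.76, 7.66, 9.58, 9.48],
}

def reduction_pct(baseline, design):
    """Percent power reduction of `design` relative to `baseline`, per frequency."""
    return [100.0 * (b - d) / b for b, d in zip(baseline, design)]

def energy_per_cycle_pj(powers, freqs):
    """mW / MHz gives nJ per cycle; multiply by 1000 for pJ per cycle."""
    return [1000.0 * p / f for p, f in zip(powers, freqs)]

slow_vs_cmos = reduction_pct(power_mw["CMOS SVT"], power_mw["TLL SVT slow"])
cmos_energy = energy_per_cycle_pj(power_mw["CMOS SVT"], freqs_mhz)

for f, r, e in zip(freqs_mhz, slow_vs_cmos, cmos_energy):
    print(f"{f} MHz: TLL SVT slow saves {r:.1f}%, CMOS SVT uses {e:.1f} pJ/cycle")
```

Under these numbers the TLL SVT slow design dissipates roughly 28% to 35% less than the CMOS SVT design over the measured range. The TLL HVT row is the one place where power falls as frequency rises (160 MHz to 220 MHz), which is in line with the supply droop explanation offered in the performance testing discussion.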
The measured dynamic power values are roughly equivalent to the power dis-
sipation predicted by simulation. The leakage power of each multiplier design was
determined by measuring the current of the multiplier’s power domain while the chip
was in an idle state, with input vectors and the clock signal fixed at a constant logic 0.
The leakage measurements taken for each multiplier are summarized in Table 6.7.
Table 6.7: Measured multiplier leakage across 23 functional dies (Vdd = 1.2V).
Multiplier       Min. leakage    Max. leakage    Average leakage
CMOS SVT         8.4 µW          13.2 µW         9.5 µW
TLL SVT slow     2.4 µW          3.6 µW          3.5 µW
TLL SVT fast     12.0 µW         21.6 µW         15.3 µW
TLL HVT          <1.2 µW         <1.2 µW         <1.2 µW
The leakage measured is approximately 2-3x higher than that predicted by sim-
ulation. This may be due to the fact that, compared to dynamic power dissipation,
leakage power is much more sensitive to process variations and operating conditions.
It should be noted that while leakage measurements differ substantially from the simu-
lation results, the relative difference in leakage between the designs remains relatively
stable.
Chapter 7
ISSUE LOGIC WITH TLL
With the multiplier, it was shown that hybridization of a static CMOS design
using TLL gates to replace flip-flops and portions of combinational logic gates can
significantly reduce the gate count and logic depth of a pipeline, resulting in a design
that is much faster, lower in power dissipation, or both. Through augmentation of the
multiplier design it was shown that TLL hybridization is very amenable to sorting networks,
which appear often in arithmetic units. Such structures are not limited to the execution
stage, however, and appear in other parts of high performance processors as well.
The issue queue is an essential component of high performance processors, al-
lowing a single core to execute multiple instructions simultaneously [61], and can be
found in a number of commercial processor designs [20, 67]. The amount of perfor-
mance improvement afforded by the issue logic is a function of a number of parameters.
These include:
• The number of instructions in the queue n.
• The number of available registers r.
• The number of available execution units k.
Numerous studies have been performed to determine the effects of each param-
eter [21]. Figure 7.1 shows how performance is affected (in terms of instructions per
cycle) as the number of instructions in the queue is increased.
As the plot demonstrates, performance increases quickly as the number of instructions
is increased from a minimum value of 1. However, benefits gradually diminish
with further increases, dropping off substantially for most benchmarks beyond 32. This
is due to the fact that, during the execution of a process, it is not always possible to issue
the maximum number of instructions on every clock cycle; in practice, the probability
that i instructions can be executed on a given cycle decreases as i increases.
Figure 7.1: Instructions per cycle vs. number of instructions in the issue queue across
various floating point and integer benchmark processes (from [21]).
While providing a significant improvement in overall processor performance,
the issue queue is a significant source of power dissipation in the processor. In the Al-
pha 21264 [20], the issue logic alone dissipated approximately one fifth of the total pro-
cessor power [26]. Excessive parameter values require larger, more complicated issue
logic, further increasing the power cost of providing multi-issue capability. Through
threshold logic hybridization, the large quantities of power dissipated by the issue logic
can potentially be reduced.
7.1 Issue logic design
Based on the considerations stated previously, the issue logic architecture de-
signed as a candidate for threshold logic hybridization assumed a queue size of 32
instructions, 32 available registers, and four available execution units. According to
previous studies, these parameters should provide a significant improvement in proces-
sor performance without wasting excessive amounts of power and/or area. The issue
queue itself is composed of four separate subcomponents: the instruction scoreboard,
request logic, arbiter logic, and update logic.
7.1.1 Instruction scoreboard
The instruction scoreboard is a large array of latches or flip-flops assigned to
store a portion of a small number of program instructions. Each instruction must
contain the addresses of its two source operands. Associated with each instruction in the
scoreboard is a number that determines how the priority of the instruction relates to
that of the others. High priority instructions that are ready to be issued will be exe-
cuted before lower priority instructions that are ready. The priority of an instruction is
determined by its age; the older an instruction is, the more likely it is that it must be
executed before newer instructions will be able to issue.
An instruction scoreboard can be implemented in one of many different ways.
The registers associated with each instruction, for instance, can be one-hot encoded
or stored as binary addresses. In a one-hot encoded scheme, each operand address is
represented by r sequential elements, where r is the number of available registers. A
binary-coded scoreboard provides much more efficient storage of the register addresses,
requiring log2(r) sequential elements per operand. The sequential elements in a one-hot encoded
scoreboard will exhibit very little switching activity on the outputs; only two elements
of each instruction may contain a logic 1 during each cycle, thus a maximum of four per
instruction will switch values on each positive clock edge. However, each sequential
element will still dissipate clock power and leakage power on every cycle. While a
binary-coded scoreboard will exhibit higher output switching activity than a one-hot
encoded scoreboard, the area, clock power, and leakage power requirements will be
much lower. In the design considered for this experiment, the difference in area, clock,
and leakage power is a factor of more than 10x; 12 sequential elements per instruction
using binary coding versus 128 in the one-hot encoded case. The gate level schematic
of a binary-coded scoreboard instruction is shown in Figure 7.2.
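The storage comparison above is easy to reproduce. The sketch below counts sequential elements per instruction for both encodings, assuming two source operands per instruction. Note that the quoted figures of 12 and 128 correspond to 64 addressable entries per operand, which matches the halt[0] through halt[63] lines and 6-bit operand fields visible in Figure 7.3; that reading is an inference, and the function name is hypothetical.

```python
def scoreboard_elements(num_entries, operands=2):
    """Sequential elements per instruction for binary and one-hot operand storage."""
    address_bits = (num_entries - 1).bit_length()   # ceil(log2(num_entries))
    binary = operands * address_bits
    one_hot = operands * num_entries
    return binary, one_hot

binary, one_hot = scoreboard_elements(64)   # 64 entries: inferred, see lead-in
print(binary, one_hot, round(one_hot / binary, 2))

# For comparison, 32 addressable entries would give 10 versus 64 elements.
print(scoreboard_elements(32))
```

With 64 entries the ratio is a little under 11, consistent with the "more than 10x" difference in area, clock power, and leakage power stated above.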
The input to each stored bit of the instruction scoreboard is determined by a
multiplexer controlled by the update logic. As instructions are granted issue, they are
removed from the scoreboard, creating a vacancy. The multiplexers redirect instruc-
tions into these holes to ensure that the relative priority between unissued instructions
is maintained and that every instruction in the scoreboard is valid. The size of the
multiplexer is determined by the number of available execution units, as this will determine
the maximum number of vacancies created during each cycle.
Figure 7.2: Operand storage of a single instruction in the instruction scoreboard.
[Schematic not reproduced. Labels: stored bits i[0] through i[11], candidate inputs from
entries i-1 through i-4 for each bit, select signals shift[4:0], and clk.]
Clearly, there is little opportunity for hybridization in the scoreboard. While
it contains a large number of sequential elements, the logic feeding each element is a
multiplexer, which is a non-threshold function and severely limits absorption of logic
into the previous stage. Additionally, each sequential element has a large fan-out, which
prevents absorption into the next stage without substantially increasing the number
of sequential elements required. Neither option is very attractive if the end goal is a
reduction in power dissipation.
7.1.2 Request logic
The function of the request logic is to examine each instruction in the score-
board and determine which are ready to be issued. An instruction is defined as ready
if both of its source operands are ready to be read. In an n instruction issue queue, any
number of instructions between 0 and n may be ready to issue in a given cycle.
An instruction is determined to be ready or not ready for issue by comparing
each of its source operands with the ready signals of each register. If one or both of the
source registers of the instruction are not ready, the instruction cannot be issued. Im-
plementation of this comparison can be somewhat expensive if the number of available
registers is high. A static CMOS implementation of the request signal generator for a
single instruction is shown in Figure 7.3.
Figure 7.3: Request logic generating ready signal for a single instruction.
[Schematic not reproduced. Labels: operand address bits i[5:0] and i[11:6], register
status lines halt[0] through halt[63], valid, and ready.]
As indicated by the figure, the delay of the request logic is relatively small,
but the power dissipated can be significant due to the number of comparisons being
performed in parallel for each instruction. While hybridization of the request logic
from the instruction scoreboard is not feasible due to the extremely high data width at
that point, it is possible to improve the performance of the request logic by computing
the wide OR function using a TLL or CMOS domino gate.
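A behavioral sketch of the request signal for one instruction follows. It assumes that halt[j] = 1 means register j is not yet ready to be read, that the two 6-bit source fields occupy i[5:0] and i[11:6] as labeled in Figure 7.3, and that ready is gated by valid; the signal polarities and the function name are assumptions, since the gate-level details are only available in the figure.

```python
def request_ready(instr, valid, halt):
    """Return True if the instruction may request issue.

    instr: 12-bit operand field (two 6-bit source register addresses)
    valid: True if the scoreboard entry holds a real instruction
    halt:  list of 64 flags, halt[j] True if register j is NOT ready (assumed polarity)
    """
    src_a = instr & 0x3F           # i[5:0]
    src_b = (instr >> 6) & 0x3F    # i[11:6]
    # In hardware, one comparison per register is evaluated in parallel and the
    # results feed a wide OR; any() plays the role of that wide OR here.
    blocked = any(h and (j == src_a or j == src_b) for j, h in enumerate(halt))
    return valid and not blocked

halt = [False] * 64
halt[5] = True                                    # register 5 still pending
print(request_ready((9 << 6) | 5, True, halt))    # source 5 is pending
print(request_ready((9 << 6) | 3, True, halt))    # both sources available
```

The 64-way comparison is what makes the static CMOS version power-hungry, and the final wide OR is the part the text proposes computing with a TLL or domino gate.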
7.1.3 Arbiter logic
The function of the arbiter logic is to prioritize all of the instructions that are
requesting to be issued in a given cycle and issue the highest priority instructions to the
execution units. While anywhere between 0 and n instructions may be ready to issue
on a given clock cycle, only a maximum of k instructions may actually issue, where k
is the number of execution units. An instruction may be issued to an execution unit if
it satisfies the following conditions:
• The instruction is ready to issue (request = logic 1)
• The number of higher priority instructions that are also ready to issue is less than
the number of available execution units.
Assuming instructions in the queue are pre-arranged from lowest to highest
priority, a sorting network is an efficient structure for computing the number of higher
priority instructions that are ready to issue for each instruction. In static CMOS, this
task is performed efficiently using a sorting network constructed from NAND and NOR
gates [47]. The network is simplified by the fact that the exact number of higher priority
instructions that are ready is not required, only the number of ready instructions up to
and including four. The gate-level schematic of an 8-bit sorting network is shown in
Figure 7.4. The gate-level schematic of the complete arbiter logic implemented using
only static CMOS gates is shown in Figure 7.5.
As demonstrated previously in the multiplier design, a network of parallel TLL
gates is able to perform sorting operations very quickly and efficiently. While sorting a
significantly larger number of bits than that in the multiplier design, only the four least
significant bits of the output are necessary. Arbitration for each of the 32 instructions
in the queue can be performed with only four TLL gates.
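The arbitration rule can be sketched behaviorally as a set of unit-weight threshold functions. The sketch assumes that index 31 holds the highest priority instruction, as suggested by the ready[31:i] input bus in Figure 7.6, and that an instruction is granted when it is ready and fewer than four higher priority instructions are ready. The function names are hypothetical, and the code models behavior only, not the TLL circuit.

```python
def threshold_gate(inputs, threshold):
    """Unit-weight threshold function: true iff at least `threshold` inputs are 1."""
    return sum(inputs) >= threshold

def geq_signals(ready, i, max_count=4):
    """geq1[i]..geq4[i]: at least t of ready[i], ready[i+1], ... are set."""
    return [threshold_gate(ready[i:], t) for t in range(1, max_count + 1)]

def arbitrate(ready, units=4):
    """Grant the highest priority ready instructions, at most `units` per cycle."""
    grant = []
    for i in range(len(ready)):
        higher = ready[i + 1:]                        # higher priority entries
        too_many = threshold_gate(higher, units)      # `units` or more already ready
        grant.append(bool(ready[i]) and not too_many)
    return grant

ready = [0] * 32
for position in (0, 5, 10, 15, 20, 25):
    ready[position] = 1                               # six requests, four units
grant = arbitrate(ready)
print([i for i, g in enumerate(grant) if g])
```

Each geq output saturates at four, which is why only four threshold gates per instruction are needed regardless of how many inputs are being counted.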
While trivial for high priority instructions, such a design will require very large
fan-in TLL elements for the lowest priority instructions. The instruction with the lowest
priority at the top of the queue must check all 31 of the higher priority instructions
for readiness, issuing only if fewer than four are ready. Despite the high fan-in, the
threshold is the same for each function and relatively low, thus the yield probabilities of
each function are approximately identical and can be reasonably accommodated through
transistor sizing. A gate-level schematic of the arbiter logic for a single instruction
implemented using TLL gates is shown in Figure 7.6.
Figure 7.4: 8-bit CMOS sorting logic.
[Schematic not reproduced. Labels: inputs a[0] through a[7], outputs y[0] through y[3],
blocks Half_sort_8 and Full_sort_8.]
Alternatively, the arbiter can be realized using a combination of smaller TLL
gates and static CMOS elements. Using the same library that produced the CMOS/TLL
hybrid multiplier, the first layer of 1 to 8-bit sorters in Figure 7.5 can be implemented
using TLL cells as in Figure 7.6, while the remaining sorters can be implemented using
CMOS gates.
Reliability is a critical concern when designing the arbiter using TLL cells; an
error in the sorting network can potentially change the number of instructions that are
allowed to issue, which can be either inconvenient or catastrophic, depending on the
nature of the error. If the number of instructions issued is decreased due to an error,
the execution units are under-utilized for one clock cycle, but the logic is not otherwise
disrupted. If the number is increased due to an error, however, one or more instructions
may be issued to a non-existent execution unit and therefore lost, disrupting the proper
operation of the code being executed. To ensure that such an event does not occur, TLL
cells must be properly designed and optimized for reliable operation.
Figure 7.5: Full 32 instruction arbiter using 1 to 8-bit CMOS sorters.
[Schematic not reproduced. Labels: inputs ready[7:0], ready[15:8], ready[23:16], and
ready[31:24]; blocks Full_sort_1 through Full_sort_8 and Half_sort_5 through Half_sort_8;
outputs geq1, geq2, geq3, and geq4 for each group of instructions.]
Figure 7.6: TLL arbiter for a single instruction.
[Schematic not reproduced. Labels: input bus ready[31:i] (32-i bits), clk, gate thresholds
1 through 4, outputs geq1[i] through geq4[i].]
7.1.4 Update logic
The function of the update logic is to determine how to properly reorganize
the instruction scoreboard after arbitration. As discussed previously, as instructions are
issued they are removed from the scoreboard, creating a vacancy. Since the instructions
are organized by age (priority), new instructions cannot be inserted directly into these
holes. New instructions can only be added to the scoreboard at the beginning of the
queue, thus to make room older instructions must be shifted forward to fill the holes
left by completed instructions.
The number of spaces each instruction must shift is equivalent to the number of
vacancies ahead of it, i.e. the number of instructions being issued. A single instruction
within the queue with four execution units available will be updated with data from
one of five locations on every clock cycle. If the instruction does not issue, and none
of the instructions with higher priority issue, the instruction will remain in its present
location. Otherwise, the instruction will adopt the data of one of the four most adjacent
lower priority instructions in the queue, depending upon the number of issues observed.
A gate level schematic demonstrating the final generation of the shift and grant signals
of the issue logic is featured in Figure 7.7.
Figure 7.7: Update logic generating grant and one-hot encoded shift signals for a single
instruction.
[Schematic not reproduced. Labels: inputs ready[i], valid[i], geq1[i], geq2[i-1], geq3[i-2],
geq4[i-3], and geq4[i+1]; outputs grant[i] and shift_0[i] through shift_4[i].]
The update logic is clearly a very small part of the overall design, and not a
significant component of delay. While CMOS/TLL hybridization of the update logic is
possible, it is unlikely that substantial benefits would be derived from doing so.
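The scoreboard reorganization described in this section can be modeled behaviorally as a compaction step. The sketch assumes that the highest index holds the highest priority (oldest) instruction, that each surviving instruction moves toward higher priority by the number of grants ahead of it, and that new instructions fill the vacated low priority slots in program order. It models the net effect only; the gate-level shift and grant generation of Figure 7.7 is not reproduced, and the function name is hypothetical.

```python
def update_queue(queue, grant, new_instrs):
    """Return the scoreboard contents after one issue cycle.

    queue:      list of entries, highest index = highest priority (assumed)
    grant:      list of booleans, True where the entry was issued this cycle
    new_instrs: incoming instructions in program order (oldest first)
    """
    n = len(queue)
    next_queue = [None] * n                  # None marks an invalid (empty) entry
    for i in range(n):
        if grant[i]:
            continue                         # issued: leaves the scoreboard
        shift = sum(grant[i + 1:])           # vacancies ahead of entry i
        next_queue[i + shift] = queue[i]     # relative order is preserved
    issued = sum(grant)
    for j, instr in enumerate(new_instrs[:issued]):
        next_queue[issued - 1 - j] = instr   # oldest new instruction sits highest
    return next_queue

queue = [f"I{i}" for i in range(8)]
grant = [False] * 8
grant[7] = grant[5] = True                   # two instructions issue
print(update_queue(queue, grant, ["N0", "N1"]))
```

With four execution units the shift amount never exceeds four, so each entry selects among five sources, which corresponds to the five one-hot shift signals shift_0 through shift_4.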
7.2 Design of the CMOS/TLL issue queue
Based on preliminary simulations and synthesis results, the issue queue was
designed as a two stage pipeline rather than a single stage. Pipelining of issue logic
results in a reduction in average instructions per cycle, since instructions must remain
in the queue for a minimum of k−1 clock cycles before they are issued, where k here
denotes the number of stages in the design. Previously published experimental results
evaluating the degradation of IPC (instructions per cycle) due to pipelining show, however,
that the reduction in performance is relatively small [32]. A single stage issue queue
provides, on average, roughly a 6% advantage in IPC over a two stage pipeline. Meanwhile,
a two-stage CMOS issue
queue provides a maximum attainable frequency that is roughly 25% higher than that
of a single stage design. Increasing the pipeline depth from two to three results in
a minimal increase in maximum attainable frequency while further reducing average
IPC, thus two was chosen as the optimal pipeline depth.
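The trade-off behind this choice can be made explicit with a short calculation. It assumes that instruction throughput is the product of IPC and clock frequency, and that the "6% advantage" means the single stage IPC is 1.06 times the two stage IPC; both are interpretations of the figures quoted above, not additional data.

```python
def relative_throughput(ipc_ratio, freq_ratio):
    """Throughput ratio, taking throughput as IPC multiplied by clock frequency."""
    return ipc_ratio * freq_ratio

# Two stage design relative to single stage design.
ipc_ratio = 1 / 1.06      # single stage has roughly a 6% IPC advantage
freq_ratio = 1.25         # two stage clock is roughly 25% faster
two_vs_one = relative_throughput(ipc_ratio, freq_ratio)
print(f"two stage / single stage throughput = {two_vs_one:.3f}")
```

Under these assumptions the two stage design delivers roughly 18% more instructions per second than the single stage design, whereas a third stage would give up further IPC for only a minimal frequency gain.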
Unlike the multiplier design, which is a simple feed-forward data path, the issue
logic contains cycles. As a result, pipelining of the design is more complex than sim-
ply adding a layer of registers, since the instructions in the scoreboard must be properly
synchronized with the shift signals generated by the update logic at all times. To accom-
plish this, multiplexers must be inserted into the design preceding the new sequential
elements. These additional multiplexers increase the overall delay of the design, and
must be considered when determining a suitable location for register insertion.
In the CMOS design, the optimal location for the registers was determined to
be within the arbiter logic, more specifically within the first layer of 1 to 8-bit sorters.
The additional multiplexers were inserted between the request logic and arbiter logic
(where the data width is minimized) so as to minimize the amount of additional logic
required as well as the fan-out from the update logic. A complete block level schematic
of the two stage CMOS issue queue is shown in Figure 7.8.
Figure 7.8: Block level schematic of two-stage CMOS issue queue.
[Schematic not reproduced. Blocks: instruction scoreboard, request logic, pipeline MUX,
arbiter logic (stages 1-3), arbiter logic (stages 4-14), and update logic. Signals:
new_instr, reg_clean, grant, and clk.]
The TLL design was constructed similarly to the CMOS design, with TLL gates
inserted in the arbiter logic instead of standard registers. Two different designs were
attempted; one in which the entire arbiter was hybridized using a single layer of high
fan-in TLL gates (referred to hereafter as CMOS/TLL design A), and one in which a
portion of the arbiter was hybridized using the same library of standard TLL cells used
in the hybrid multiplier design (referred to hereafter as CMOS/TLL design B). The
complete block level schematics of both two stage CMOS/TLL issue queue designs are
shown in Figures 7.9 and 7.10.
Figure 7.9: Block level schematic of two-stage CMOS/TLL issue queue design A, with
full hybridization of the arbiter logic.
[Schematic not reproduced. Blocks: instruction scoreboard, request logic, pipeline MUX,
TLL arbiter logic, and update logic. Signals: new_instr, reg_clean, grant, and clk.]
Figure 7.10: Block level schematic of two-stage CMOS/TLL issue queue design B,
with partial hybridization of the arbiter logic.
[Schematic not reproduced. Blocks: instruction scoreboard, request logic, pipeline MUX,
TLL arbiter logic (stages 1-6), arbiter logic (stages 7-14), and update logic. Signals:
new_instr, reg_clean, grant, and clk.]
The first stage of the pipeline is identical in both designs; stages 1-3 of the ar-
biter logic are absorbed by the TLL gates, providing a small improvement to the max-
imum attainable frequency. Design A absorbs arbiter logic stages 4-14 in the second
stage, while design B absorbs stages 4-6 only.
7.3 Synthesis, placement, and routing
To compare the quality of the different designs, all were synthesized, placed,
and routed using Cadence Encounter. Each design assumed a rectangular block area
with a height/width aspect ratio of 1.2 and targeted a final density of 65-70%.
The very large fan-in threshold functions required by many of the gates in
the CMOS/TLL issue queue design A cannot be implemented using the same TLL
standard cell library as in the multiplier, thus custom TLL macros were designed for
the issue logic. These cells employed very large (up to 64-input) input networks; the
layout of such a cell is shown in Figure 7.11.
Figure 7.11: Standard cell layout for extremely wide fan-in TLL gate.
While smaller than the combinational logic they were used to replace, the ex-
tremely large TLL cells of design A were found to be unwieldy during place and route,
as valid placements of the cells were highly restricted. While not an issue in less con-
strained designs, this created some difficulty at high frequencies.
The CMOS issue queue was synthesized, placed, and routed assuming a fre-
quency of 689 MHz at the worst case delay corner. Two versions of the CMOS/TLL
issue queue were synthesized, placed, and routed assuming the same frequency. Two
of the final design layouts are featured in Figures 7.12 and 7.13.
Unfortunately, while the CMOS/TLL issue queue design A implemented using
the large custom TLL macros demonstrated significant reductions in design area and
leakage power, the total power dissipated was actually higher than that of the CMOS
design. This is due to the fact that while the very large TLL gates are able to absorb the
entirety of the arbiter logic, they consume an exorbitant amount of clock power due to
the large capacitances present in the input networks of each gate.
Figure 7.12: Automatically placed and routed 689 MHz CMOS issue logic with high-
lighted clock tree.
CMOS/TLL issue queue design B provided superior results over design A in all
metrics as well as significantly lower total power dissipation compared to the CMOS
design. While it absorbs less of the second stage of the issue queue than design A,
both designs are constrained in time by the first stage. Thus, while the second stage of
CMOS/TLL design B contains more combinational logic than design A, it is loosely
constrained, and thus relatively inexpensive in terms of area and power dissipation.
At the same time, the smaller TLL gates utilized by CMOS/TLL design B consume
substantially less clock power than the gates of design A, thus the power advantages of
the hybridization are retained.
Figure 7.13: Automatically placed and routed 689 MHz CMOS/TLL issue logic with
highlighted clock tree.
The maximum achievable frequency of the CMOS/TLL issue queues was estimated
to be less than 10% higher than that of the CMOS design, a rather unimpressive
amount due to the limited absorption of the first stage of the design. Additionally,
the area, power, and leakage estimates of such solutions were prohibitive. Higher fre-
quency issue logic designs were thus not pursued on the grounds that the benefits could
not outweigh the incurred costs.
Post-place and route comparisons showing the total design area, gate count, and
power dissipation of the CMOS issue queue and CMOS/TLL issue queue design B at
the worst case and typical delay corners are given in Tables 7.1 and 7.2.
Table 7.1: Comparison between CMOS and CMOS/TLL (design B) issue logic post-place
and route assuming worst case delay conditions (slow NMOS, slow PMOS, 1.1V
supply, and 105°C).
Design           CMOS         CMOS/TLL     Reduction
Frequency        689 MHz      689 MHz      0.0%
Area             0.088 mm²    0.061 mm²    30.6%
Gate count       7800         5789         25.7%
Total power      37.72 mW     26.04 mW     31.0%
Leakage power    22.47 µW     9.22 µW      58.9%
Table 7.2: Comparison between CMOS and CMOS/TLL (design B) issue logic post-place
and route assuming typical delay conditions (typical NMOS, typical PMOS, 1.2V
supply, and 25°C).
Design           CMOS         CMOS/TLL     Reduction
Frequency        1.00 GHz     1.00 GHz     0.0%
Area             0.088 mm²    0.061 mm²    30.6%
Gate count       7800         5789         25.7%
Total power      63.49 mW     43.88 mW     30.9%
Leakage power    4.78 µW      4.27 µW      10.7%
As the tables demonstrate, hybridization of the 1 to 8-input sorters using TLL
eliminates 2011 gates from the final placed and routed design, reducing area by a factor
of 1.44. This extends significant benefits to the power dissipation of the design, reduc-
ing total and leakage power by factors of up to 1.45 and 2.43, respectively. While not
any faster than the CMOS design, the area and power advantages provide a noticeable
improvement in the footprint of the issue logic as well as the processor as a whole.
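The reduction percentages and improvement factors quoted above follow directly from the entries of Tables 7.1 and 7.2. The sketch below recomputes them from the tabulated values; the dictionary keys and helper names are hypothetical, and small differences in the last digit are to be expected because the tabulated inputs are themselves rounded.

```python
# (CMOS, CMOS/TLL design B) pairs copied from Tables 7.1 and 7.2.
worst_case = {
    "area_mm2":    (0.088, 0.061),
    "gate_count":  (7800, 5789),
    "total_mw":    (37.72, 26.04),
    "leakage_uw":  (22.47, 9.22),
}
typical = {
    "total_mw":    (63.49, 43.88),
    "leakage_uw":  (4.78, 4.27),
}

def reduction_pct(cmos, hybrid):
    """Percent reduction of the hybrid design relative to the CMOS design."""
    return 100.0 * (cmos - hybrid) / cmos

def improvement_factor(cmos, hybrid):
    """Ratio of the CMOS value to the hybrid value."""
    return cmos / hybrid

for corner_name, corner in (("worst case", worst_case), ("typical", typical)):
    for metric, (cmos, hybrid) in corner.items():
        print(f"{corner_name:10s} {metric:11s} "
              f"reduction {reduction_pct(cmos, hybrid):5.2f}%  "
              f"factor {improvement_factor(cmos, hybrid):.3f}")

gates_removed = worst_case["gate_count"][0] - worst_case["gate_count"][1]
print("gates removed:", gates_removed)
```

The recomputed values agree with the tables to within about a tenth of a percentage point, and the factors of 1.44 (area), 1.45 (total power), and 2.43 (leakage) correspond to the worst case corner.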
Chapter 8
CONCLUSION
For decades, the ability of threshold logic to compress complex multi-level gate
networks into singular elements has motivated a great deal of research into threshold
logic function identification and synthesis procedures. While these advancements have
been accompanied by many proposals for physical implementations, nearly all suf-
fer from some deficiency in terms of size, performance, power dissipation, reliability,
and/or monetary expense. In this work, a new implementation of threshold logic (TLL)
was proposed which provides significant advantages in these metrics over existing circuit
architectures. In addition, these cells are extremely regular and can be quickly and
accurately characterized for delay, power dissipation, and noise margin using relatively
simple models.
TLL cells possess a number of drawbacks limiting their use in processor de-
signs, including large cell area and high clock power due to the large clock load and
wasted transitions due to the differential pre-charge operation. Application of TLL
cells must be performed extremely judiciously, as many cases exist in which careless
hybridization will actually diminish the performance or increase the power dissipation
of the target design.
Much of the area and power reduction benefits of hybridization are derived from
the ability of TLL to increase the timing slack of a heavily constrained pipeline, reduc-
ing the amount of buffering and sizing effort required by the design to meet timing.
The opportunities for area and power reduction through TLL hybridization in a design
will generally diminish as the timing constraints of the design are relaxed, as demon-
strated by the reduced returns of the lower frequency multiplier designs. The multiplier
and issue logic examples provided in the previous chapters show that if hybridization
is performed properly, a relatively small augmentation can provide substantial benefits.
Without any decrease in performance, such designs yielded area reductions of between
19.5% and 30.6%, active power reductions of between 28.0% and 31.0%, and leakage
power reductions of between 58.9% and 61.5%.
While all of the designs demonstrated in this work were hybridized manually, all
were constructed for compatibility with existing standard cell design flows. If threshold
logic synthesis procedures specific to TLL hybridization are developed which are able
to find the optimal replacement using TLL cells in a design, the advantages of thresh-
old logic hybridization could be applied to designs through a completely automated
procedure.
REFERENCES
[1] T. Akeyoshi, K. Maezawa, and T. Mizutani, “Weighted sum threshold logic oper-
ation of MOBILE (monostable-bistable transition logic element) using resonant-
tunneling transistors”, IEEE Electron Device Letters, Vol. 14, pp. 475–477, 1993.
[2] M. Anis, “Subthreshold leakage current: challenges and solutions”, International
Conference on Microelectronics, pp. 77–80, 2003.
[3] K. Aoyama, “Design methods for symmetric function generators based on thresh-
old elements”, IEEE Transactions on Computer Aided Design of Integrated Cir-
cuits and Systems, Vol. 26, pp. 1934–1946, 2007.
[4] M. J. Avedillo, J. M. Quintana, A. Rueda, and E. Jimenez, “A low-power CMOS
threshold-gate”, Electronics Letters, Vol. 31, pp. 2157–2159, 1995.
[5] S. Bandyopadhyay, V. P. Roychowdhury, and X. Wang, “Computing with quan-
tum dots: novel architectures for nanoelectronics”, Physics of Low-Dimensional
Structures, Vol. 8/9, pp. 29–82, 1995.
[6] V. Beiu, J. M. Quintana, and M. J. Avedillo, “VLSI implementations of threshold
logic - a comprehensive survey”, IEEE Transactions on Neural Networks, Vol. 14,
No. 5, pp. 1217–1243, 2003.
[7] W. Belluomini, D. Jamsek, A. K. Martin, C. McDowell, R. K. Montoye, H. C. Ngo,
and J. Sawada, “Limited switch dynamic logic circuits for high-speed low-power
circuit design”, IBM Journal of Research and Development, Vol. 50, No. 2.3, pp.
277–286, 2006.
[8] K. Bernstein and N. Rohrer, SOI Circuit Design Concepts, Kluwer Academic Pub-
lishers, 2000.
[9] D. Blaauw and B. Zhai, “Energy efficient design for subthreshold supply voltage
operation”, International Symposium on Circuits and Systems, pp. 29–32, 2006.
[10] V. Bohossian and J. Bruck, “Algebraic techniques for constructing minimal
weight threshold functions”, SIAM Journal on Discrete Mathematics, pp. 114–126,
2003.
[11] R. Brent and H. Kung, “A regular layout for parallel adders”, IEEE Transactions
on Computers, Vol. C-31, No. 3, pp. 260–264, 1982.
[12] P. Celinski, J. F. Lopez, S. Al-Sarawi, and D. Abbott, “Compact parallel (m, n)
counters based on self-timed threshold logic”, Electronics Letters, Vol. 38, pp. 633–
635, 2002.
[13] P. Celinski, J. F. Lopez, S. Al-Sarawi, and D. Abbott, “Low power, high speed,
charge recycling CMOS threshold logic gate”, Electronics Letters, Vol. 37, pp.
1067–1069, 2001.
[14] S. Cotofana and S. Vassiliadis, “Signed digit addition and related operations with
threshold logic”, IEEE Transactions on Computers, Vol. 49, No. 3, pp. 193–207,
2000.
[15] J. A. Cunningham, “Use and evaluation of yield models in integrated circuit man-
ufacturing”, IEEE Transactions on Semiconductor Manufacturing, Vol. 3, No. 2,
pp. 60–71, 1990.
[16] L. Dadda, “Some schemes for parallel multipliers”, Alta Frequenza, Vol. 34, No.
5, pp. 349–356, 1965.
[17] S. Dechu, M. K. Goparaju, and S. Tragoudas, “A metric of tolerance for the man-
ufacturing defects of threshold logic gates”, International Symposium on Defect
and Fault-Tolerance in VLSI Systems, pp. 318–326, 2006.
[18] B. El-Kareh, B. Chen, and T. Stanley, “Silicon on insulator - an emerging high-
leverage technology”, IEEE Transactions on Components, Packaging, and Manu-
facturing Technology, Vol. 18, No. 1, pp. 187–194, 1995.
[19] E. Fang, “Low-power, compact digital logic topology that facilitates large fan-in
and high-speed circuit performance”, US Patent 5670898, 1997.
[20] J. A. Farrell and T. C. Fischer, “Issue logic for a 600-MHz out-of-order execution
microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 33, No. 5, pp. 707–712,
1998.
[21] D. Folegnani and A. Gonzalez, “Energy-effective issue logic”, International Sym-
posium on Computer Architecture, pp. 230–239, 2001.
[22] A. Forestier and M. R. Stan, “Limits to voltage scaling from the low power per-
spective”, Symposium on Integrated Circuits and Systems Design, pp. 365–370,
2000.
[23] R. Gonzalez, B. M. Gordon, and M. A. Horowitz, “Supply and threshold voltage
scaling for low power CMOS”, IEEE Journal of Solid-State Circuits, Vol. 32, No.
8, pp. 1210–1216, 1997.
[24] S. Goodwin-Johansson, “Circuit to perform variable threshold logic”, US Patent
4896059, 1990.
[25] M. K. Goparaju, A. K. Palaniswamy, and S. Tragoudas, “A fault tolerance aware
synthesis methodology for threshold logic gate networks”, International Sympo-
sium on Defect and Fault Tolerance of VLSI Systems, pp. 176–183, 2008.
[26] M. J. Gowan, L. L. Biro, and D. B. Jackson, “Power considerations in the design
of the Alpha 21264 microprocessor”, Design Automation Conference, pp. 726–
731, 1998.
[27] T. Gowda, S. Leshner, S. Vrudhula, and S. Kim, “Threshold logic gene regulatory
networks”, International Workshop on Genomic Signal Processing and Statistics,
pp. 1–4, 2007.
[28] T. Gowda, S. Leshner, S. Vrudhula, and G. Konjevod, “Synthesis of thresh-
old logic circuits using tree matching”, European Conference on Circuit Theory
and Design, pp. 850–853, 2007.
[29] D. Hampel, K. J. Prost, and N. R. Scheinberg, “Threshold logic using comple-
mentary MOS device”, US Patent 3900742, 1975.
[30] T. Han and D. Carlson, “Fast area-efficient VLSI adders”, Symposium on Com-
puter Arithmetic, pp. 49–56, 1987.
[31] M. Hatamian and G. Cash, “A 70-MHz 8-bit x 8-bit parallel pipelined multiplier
in 2.5-µm CMOS”, IEEE Journal of Solid-State Circuits, Vol. 21, No. 4, pp. 505–
513, 1986.
[32] D. S. Henry, B. C. Kuszmaul, G. H. Loh, and R. Sami, “Circuits for wide-window
superscalar processors”, International Symposium on Computer Architecture, pp.
236–247, 2000.
[33] H. Y. Huang and T. N. Wang, “CMOS capacitor coupling logic (C3L) logic cir-
cuits”, Asia Pacific Conference on ASIC, pp. 33–36, 2000.
[34] S. L. Hurst, “Realisation of cellular arithmetic cells by threshold-logic assem-
blies”, Electronics Letters, Vol. 6, pp. 501–503, 1970.
[35] J. Kao, S. Narendra, and A. Chandrakasan, “Subthreshold leakage modeling and
reduction techniques”, International Conference on Computer Aided Design, pp.
141–148, 2002.
[36] T. Kilburn, D. Edwards, and D. Aspinall, “Parallel addition in a digital computer
- a new fast carry”, IEE Proceedings, Vol. 106B, pp. 460–464, 1959.
[37] H. Kim, M. Park, and K. Seo, “40 Gbps operation of MOBILE and its application
to weighted-sum threshold logic gate using only RTDs”, International Conference
on Indium Phosphide and Related Materials, pp. 1–4, 2008.
[38] P. Kogge and H. Stone, “A parallel algorithm for the efficient solution of a general
class of recurrence equations”, IEEE Transactions on Computers, Vol. C-22, No.
8, pp. 786–793, 1973.
[39] N. Kushiyama, C. Tan, R. Clark, J. Lin, F. Perner, L. Martin, M. Leonard, G.
Coussens, and K. Cham, “An experimental 295 MHz CMOS 4Kx256 SRAM us-
ing bidirectional read/write shared sense amps and self-timed pulsed word-line
drivers”, IEEE Journal of Solid-State Circuits, Vol. 30, No. 11, pp. 1286–1290,
1995.
[40] J. B. Lerch, “Threshold gate circuits employing field-effect transistors”, US Patent
3715603, 1973.
[41] F. J. List, “The static noise margin of SRAM cells”, European Solid-State Circuits
Conference, pp. 16–18, 1986.
[42] M. Liu, M. Cai, and Y. Taur, “Scaling limit of CMOS supply voltage from noise
margin considerations”, International Conference on Simulation of Semiconductor
Processes and Devices, pp. 287–289.
[43] C. Long, J. Xiong, and Y. Liu, “Techniques of power-gating to kill sub-threshold
leakage”, Asia Pacific Conference on Circuits and Systems, pp. 952–955, 2006.
[44] J. A. H. Lopez, J. G. Tejero, J. F. Ramos, and A. G. Bohorquez, “New types of
digital comparators”, International Symposium on Circuits and Systems, Vol. 1,
pp. 29–32, 1995.
[45] S. J. Lovett, G. A. Gibbs, and A. Pancholy, “Yield and matching implications for
static RAM memory array sense-amplifier design”, IEEE Journal of Solid-State
Circuits, Vol. 35, No. 8, pp. 1200–1204, 2000.
[46] K. Maezawa and T. Mizutani, “A new resonant tunneling logic gate employing
monostable-bistable transition”, Japanese Journal of Applied Physics, Vol. 32, pp.
L42–L44, 1993.
[47] S. S. Mhambrey, L. T. Clark, S. K. Maurya, and K. S. Berezowski, “Out-of-order
issue logic using sorting networks”, Great Lakes Symposium on VLSI, pp. 385–
388, 2010.
[48] S. Muroga, Threshold Logic and Its Applications, Wiley-Interscience, 1971.
[49] S. Narendra, S. Borkar, V. De, D. Antoniadis, and A. Chandrakasan, “Scaling of
stack effect and its application for leakage reduction”, International Symposium on
Low Power Electronics and Design, pp. 195–200, 2001.
[50] M. Padure, S. Cotofana, C. Dan, S. Vassiliadis, and M. Bodea, “A new latch-based
threshold logic family”, International Semiconductor Conference, Vol. 2, pp. 531–
534, 2001.
[51] M. Padure, S. Cotofana, C. Dan, S. Vassiliadis, and M. Bodea, “Compact delay
modeling of latch-based threshold logic gates”, International Semiconductor Con-
ference, Vol. 2, pp. 317–320, 2002.
[52] M. Padure, S. Cotofana, and S. Vassiliadis, “Design and experimental results of a
CMOS flip-flop featuring embedded threshold logic”, International Symposium on
Circuits and Systems, Vol. 5, pp. 253–256, 2003.
[53] M. Padure, S. Cotofana, and S. Vassiliadis, “High-speed hybrid threshold-
Boolean logic counters and compressors”, International Midwest Symposium on
Circuits and Systems, Vol. 3, pp. 457–460, 2002.
[54] M. Padure, S. Cotofana, S. Vassiliadis, C. Dan, and M. Bodea, “A low-power
threshold logic family”, International Conference on Electronics, Circuits and Sys-
tems, Vol. 2, pp. 657–660, 2002.
[55] J. M. Quintana, M. J. Avedillo, R. Jimenez, and E. Rodriguez-Villegas, “Practical
low-cost CPL implementations of threshold logic functions”, Great Lakes Sympo-
sium on VLSI, pp. 139–144, 2001.
[56] J. F. Ramos and A. G. Bohorquez, “Two operand binary adders with threshold
logic”, IEEE Transactions on Computers, Vol. 48, No. 12, pp. 1324–1337, 1999.
[57] M. Rozenblat, “The use of multiaperture cores for the realization of threshold
functions of many variables”, IEEE Transactions on Magnetics, Vol. 5, No. 3, pp.
196–200, 1969.
[58] M. Sahami, “Generating neural networks through the induction of threshold logic
unit trees”, International Symposium on Intelligence in Neural and Biological Sys-
tems, pp. 108–115, 1995.
[59] N. Sherwani, Algorithms for VLSI Physical Design Automation, Kluwer Aca-
demic Publishers, 1994.
[60] T. Shibata and T. Ohmi, “An intelligent MOS transistor featuring gate-level
weighted sum and threshold operations”, IEDM Technical Digest, pp. 919–922,
1991.
[61] G. S. Sohi, “Instruction issue logic for high-performance, interruptible, multiple
functional unit, pipelined computers”, IEEE Transactions on Computers, Vol. 39,
No. 3, pp. 349–359, 1990.
[62] R. Strandberg and J. Yuan, “Single input current-sensing differential logic
(SCSDL)”, International Symposium on Circuits and Systems, Vol. 1, pp. 764–
767, 2000.
[63] J. L. Subirats, J. M. Jerez, and L. Franco, “A new decomposition algorithm for
threshold synthesis and generalization of Boolean functions”, IEEE Transactions
on Circuits and Systems, pp. 3188–3196, 2008.
[64] J. P. Sun, G. I. Haddad, P. Mazumder, and J. N. Schulman, “Resonant tunneling
diodes: models and properties”, Proceedings of the IEEE, Vol. 86, No. 4, pp. 641–
660, 1998.
[65] T. Takemoto, “MOS type semiconductor IC device”, US Patent 3911289, 1975.
[66] S. A. Tawfik and V. Kursun, “Low power and robust 7T Dual-Vt SRAM circuit”,
International Symposium on Circuits and Systems, pp. 1452–1455, 2008.
[67] J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy, “POWER4
system microarchitecture”, IBM Journal of Research and Development, Vol. 46,
No. 1, pp. 5–25, 2002.
[68] C. Wallace, “A suggestion for a fast multiplier”, IEEE Transactions on Electronic
Computers, pp. 14–17, 1964.
[69] N. Weste and D. Harris, CMOS VLSI Design: A Circuit and Systems Perspective,
Pearson Education, Inc., 2005.
[70] H. Yamauchi, T. Yabu, T. Yamada, and M. Inoue, “A circuit design to suppress
asymmetrical characteristics in high-density DRAM sense amplifiers”, IEEE Jour-
nal of Solid-State Circuits, Vol. 25, No. 1, pp. 36–41, 1990.
[71] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S.
Kayano, and T. Nakano, “A divided word-line structure in the static RAM and its
application to a 64K full CMOS RAM”, IEEE Journal of Solid-State Circuits, Vol.
18, No. 5, pp. 479–485, 1983.
[72] R. Zhang, P. Gupta, L. Zhong, and N. K. Jha, “Synthesis and optimization of
threshold logic networks with application to nanotechnologies”, European Confer-
ence and Exhibition on Design, Automation and Test, Vol. 2, pp. 904–909, 2004.
APPENDIX
The body of work presented in this document was completed with substantial
assistance from others, in particular the multiplier test chip architecture presented in
Chapter 6. This appendix credits those who performed work in various roles, as detailed
by Table 8.1.
Table 8.1: Credits for multiplier test chip architecture.
Task                                                   Contributor(s)
Layout of TLL standard cell library                    Samuel Leshner, Saurabh Patel
Characterization of TLL standard cell library          Indira Negi, Saurabh Patel, Samuel Leshner
Design of CMOS/TLL multipliers                         Samuel Leshner
Design of test architecture                            Krzysztof Berezowski
Synthesis and P&R of multipliers                       Krzysztof Berezowski
Verification of multipliers                            Krzysztof Berezowski
Synthesis and P&R of test architecture                 Krzysztof Berezowski
Verification of test architecture                      Krzysztof Berezowski
Synthesis and P&R of clock generation                  Samuel Leshner
Test architecture top level floorplanning and routing  Samuel Leshner
Test architecture top level I/O ring and power grid    Samuel Leshner
Test architecture signoff (DRC, LVS)                   Samuel Leshner
Nanosim simulations of multipliers                     Samuel Leshner
PrimeTime simulations of multipliers                   Gayathri Chalivendra
Nanosim simulations of test architecture               Samuel Leshner
Test benches for test architecture                     Krzysztof Berezowski, Samuel Leshner
PCB design and layout                                  Xiaoyin Yao
Physical test environment setup                        Xiaoyin Yao
Physical testing of test chip                          Samuel Leshner, Saurabh Patel, Gayathri Chalivendra
