Leakage aware digital design optimization for minimal total power consumption in nanometer CMOS technologies by Schuster, Christian & Farine, Pierre-André
Leakage Aware Digital Design Optimization for Minimal
Total Power Consumption in Nanometer CMOS
Technologies
by
Schuster Christian
A thesis submitted to the Faculty of Science of the University of
Neuchâtel, in conformity with the requirements for the
degree of Doctor of Science
Institute of Microtechnology
University of Neuchâtel
Switzerland
21 March 2007
Copyright © 2007 by Schuster Christian
This thesis was typeset with LaTeX2ε by the author
To Mario, Silvana
and Eliana
Abstract
Starting from deep submicron technologies (< 0.13µm), and even more so in
nanometer technologies, static power consumption, due to leaky “off” transistors,
is becoming a non-negligible contributor to the total power dissipation. Under this
condition, the total power optimization problem changes considerably. The high
parallelization approach commonly used today to increase performance will soon result
in power-inefficient designs. Indeed, the static power consumption of the large number
of rarely used transistors will heavily penalize the total power consumption.
The purpose of this thesis is to investigate the influence of static power on
low-power design methodologies. In particular, the effects of architectural as well
as technology modifications are explored. The use of technology as an optimization
parameter has become possible in recent technologies, which offer several
threshold voltages, each with a different trade-off between speed and leakage
current.
In this work, two different frameworks are considered. In the first one, both the
supply voltage and the transistor threshold voltage are freely tunable parameters.
This is the most general case and corresponds to the situation where the designer
has the largest freedom. In the second framework, we assume that the designer can
change neither the supply voltage nor the transistor threshold voltage, which are hence
considered constants. This is the most common case, where the designer
has a supply voltage and a technology type (and hence a threshold voltage) fixed by
the application and by the devices the circuit has to interface with. In both cases,
considerable effort has been put into developing a handy way to rapidly estimate the total
power consumption and consequently to easily compare different architectural and
technology variants at the early stages of development.
Examples, based on multipliers, are used extensively throughout the thesis and, at
the end, the presented theory is applied to a real circuit implemented in a 90nm
technology by ST Microelectronics. Measurements show a very large variability of
the static power over 16 dies manufactured on the same wafer. For instance, the
ratio of the highest to the lowest static power consumption at nominal conditions
(Vdd=1V, f=62.5MHz) is larger than 2.5. Measured data also show
multipliers able to work at 210mV at a frequency of 1MHz!
Keywords
Low power digital design, static power, leakage current, dynamic power, very low
supply voltage, multiplier, architecture, 90nm, CMOS nanometer technology.
Mots clés
Circuit numérique à faible consommation, puissance statique, courant de fuite,
puissance dynamique, tension d’alimentation très basse, multiplicateur, architecture, 90nm,
technologie CMOS nanométrique.
Acknowledgements
During the last four years, a large number of people have contributed to expanding my
knowledge and have helped me progress in my thesis. I will
try to acknowledge most of them, although it is difficult to mention everyone in a
few lines. I apologize in advance to those I have missed.
I thank Prof. P.-A. Farine, who welcomed me into his group and gave me the freedom
and all the tools that I needed to successfully finish my work. I also thank my thesis
co-director Prof. C. Piguet for his endless support and the large number of suggestions
he gave me during our weekly meetings. He also made a great effort in promoting
my work outside the IMT walls. Moreover, I would like to acknowledge Dr. J.-L.
Nagel for being my project leader for the first three years and for sharing his vast
knowledge with me (besides sharing the office too). During the last year, my new
project leader Dr. S. Tanner helped me to finalize my work and supported me in the
integration and chip testing part of the project. Many thanks to Dr. M. Belleville
too, who kindly agreed to be one of the jury experts.
I am also grateful to all my colleagues, who created a pleasant atmosphere in the group
and helped me to master some not-so-easy-to-use tools in this work. In particular, I
would like to thank P. Stadelmann and D. Manetti for all kinds of discussions, C.
Robert for the help provided with the PCB design, R. Merz for the help with the use of
GPIB based instruments with MATLAB, without forgetting J.-L. Nagel, P. Thoppay
and M. Moridi for sharing the office with me.
Furthermore, I want to express my complete gratitude to my parents, who made it
possible for me to complete my studies and always motivated me to progress: “You
can always fly higher, as long as you want it!”.
Finally, I would like to acknowledge my wonderful wife for always being beside
me, bearing with me, and accepting me as I am, with my merits and demerits as well
as my various moods: Thank you very much!
This work has been supported by CSEM (Neuchâtel, Switzerland) and the Swiss National Science
Foundation (SNSF, under grant 105619).
Contents
1 Introduction
1.1 Motivations
1.2 Thesis outline
1.3 Contributions
2 Sources of dissipation in CMOS transistors
2.1 Dynamic consumption
2.1.1 Switching energy
2.1.2 Shortcut energy
2.2 Static consumption
2.2.1 Sub-threshold current
2.2.2 Gate leakage current
2.2.3 Reverse bias p-n junction leakage and band to band tunneling
2.2.4 Gate-Induced Drain Leakage (GIDL)
2.2.5 Punchthrough
2.3 Summary
3 Delay and power models
3.1 Current models
3.2 Power models
3.2.1 Dynamic power
3.2.2 Static power
3.2.3 Total power
3.3 Delay models
3.4 Summary
4 Technology characterization
4.1 Parameters extraction methodology
4.1.1 The sub-threshold slope n
4.1.2 The DIBL effect factor η
4.1.3 The α factor and the reference threshold voltage Vth0
4.1.4 The body effect coefficient γ
4.1.5 Remark on Io
4.2 STM 90nm technology
4.2.1 Low Vth Transistors (lvt)
4.2.2 Standard Vth Transistors (svt)
4.2.3 High Vth Transistors (hvt)
4.3 Summary
5 Reference multiplier architectures
5.1 Ripple Carry Array
5.1.1 RCA parallel variations
5.1.2 RCA horizontal pipeline variations
5.1.3 RCA diagonal pipeline variations
5.2 Wallace
5.2.1 Wallace parallel versions
5.3 Sequential
5.3.1 Sequential-wallace
5.3.2 Sequential parallel
5.4 Summary
6 Total power comparison for free Vdd and free Vth
6.1 Existence of a total power consumption optimum
6.2 Pdyn over Pstat ratio
6.2.1 k1 derivation
6.3 Optimal Vdd and Vth formulas
6.3.1 Optimal threshold voltage derivation
6.3.2 Optimal supply voltage derivation
6.4 Optimal total power
6.4.1 Optimal power comparison with k1 constant
6.4.2 Absolute optimal total power
6.5 Summary
7 Architectural impact on total power
7.1 Summary
8 Technology impact on total power
8.1 Technology as a free parameter
8.2 Application to technology selection
8.3 Discussion on the modifiability of Vth
8.3.1 Body biasing
8.3.2 Transistor size modification
8.4 Summary
9 Total power comparison for fixed Vdd and fixed Vth
9.1 Total power comparison
9.2 Comparison of two architectures
9.3 Selection of the best architecture
9.4 Designing new circuits
9.5 Case study: 16bit multipliers
9.6 Summary
10 Physical implementation of four 32 bit multipliers
10.1 Circuit description
10.1.1 Pseudo-random code generator
10.1.2 Ring oscillators
10.2 Circuit design and implementation
10.2.1 Nominal values
10.3 Measurements setup
10.3.1 PCB design
10.3.2 FPGA based signal generation
10.3.3 MATLAB based measurements automation
10.4 Measurements
10.4.1 Nominal values
10.4.2 Lowest working supply voltage
10.4.3 Optimal total power
10.4.4 Power and delay variability
10.5 Summary
11 Conclusions
Bibliography
List of Publications
A VHDL source code
A.1 top.vhd
A.2 data gen.vhd
A.3 mult.vhd
A.4 mult par4.vhd
A.5 RCA generic arch.vhd
A.6 ring svt.vhd
A.7 top tb.vhd
B Synopsys compilation scripts
B.1 compile top.tcl
B.2 read vhdl.tcl
B.3 power sdf.do
C SoC Encounter P&R scripts
C.1 main.tcl
C.2 top.conf
C.3 IO Filler.tcl
C.4 do power domains.tcl
C.5 create global net.tcl
C.6 pwr.tcl
C.7 followPin.tcl
C.8 place output bufs.tcl
C.9 output nets.tcl
C.10 fix drc errors.tcl
C.11 top.ctstch
C.12 ioplace.io
D FPGA source code
D.1 main FPGA.vhd
E MATLAB based automated test functions
E.1 test mult.m
List of Figures
1.1 Ioff vs. Lg and total power vs. technology nodes
2.1 CMOS inverter
2.2 Sources of static power consumption in a NMOS transistor
2.3 Effect of Drain Induced Barrier Lowering (DIBL) on short channel transistors
4.1 Schematic of the NMOS and PMOS transistors used for the extraction of the sub-threshold slope n
4.2 Linear fitting of ln(Ids(Vgs)) for STM 90nm lvt
4.3 Linear fitting of ln(Ioff(Vdd)) for 1 inverter
4.4 Fitting of delay vs. Vdd for STM 90nm lvt
4.5 Linear fitting of ln(Ioff(Vbs)) for 1 inverter
5.1 Full adder symbol
5.2 8bit RCA multiplier
5.3 Critical path in a 8bit RCA multiplier
5.4 2 times parallelized multiplier
5.5 2 stages horizontally pipelined 8 bit RCA
5.6 2 stages diagonally pipelined 8bit RCA
5.7 Internal implementation of a Carry Save Adder (CSA)
5.8 Wallace 8bit structure
5.9 Sequential multiplier structure (16bit)
5.10 Sequential multiplier (16bit) with a 4x16 Wallace implementation
6.1 Relationship between Vdd and Vth for α = 1.65 and χ = 0.3
6.2 Total power consumption of a 16 bit Wallace multiplier
6.3 Vdd^(1/α) and its linear approximation
6.4 Linearization coefficients for Vdd in [0.3V;1V]
6.5 Linearization coefficients for Vdd in [0.3V;0.6V]
6.6 Optimal Vth vs. activity
6.7 Optimal Vth vs. frequency
6.8 Optimal Vth vs. logical depth
6.9 Optimal Vdd vs. activity
6.10 Optimal Vdd vs. frequency
6.11 Optimal Vdd vs. logical depth
7.1 Optimal Vdd calculated with numerical computation
7.2 Optimal Vth calculated with numerical computation
7.3 Optimal total power calculated with numerical computation
8.1 Technology parameters influence on a RCA 16 multiplier in a SVT STM 90nm technology
8.2 Optimal total power consumption of ten 16 bit multipliers in all STM 90nm technology flavors
8.3 Vth vs. W for a NMOS transistor
8.4 Vth vs. W for a PMOS transistor
8.5 Vth vs. L for a NMOS transistor
8.6 Vth vs. L for a PMOS transistor
9.1 Lines of equal-consumption with f = 62.5MHz in a STM SVT 90nm technology
9.2 Thirteen 16 bit multipliers plotted on the cells vs. transitions space
10.1 Block schematic of the test circuit
10.2 Schematic of the 64 bit linear feedback shift register
10.3 Probability distribution of the pseudo-random generated data for 500 and 10000 generated data
10.4 Final layout of the demonstrator circuit
10.5 Block view of the demonstrator circuit
10.6 Output pad level converter for different core supply voltages
10.7 Schematic of the PCB used to test the demonstrator circuit
10.8 Expected optimal supply voltage
10.9 Measured optimal supply voltage for chip No.2
10.10 Measured optimal supply voltage for chip No.3
10.11 Expected optimal total power consumption
10.12 Measured optimal total power consumption for chip No.2
10.13 Measured optimal total power consumption for chip No.3
10.14 Nominal static power distribution for 16 chips
10.15 Nominal dynamic power distribution for 16 chips at 62.5MHz
10.16 Delay distribution of the RCA SVT multiplier for 16 chips
List of Tables
1.1 The International Technology Roadmap for Semiconductors [1] (ITRS), update 2006 for low operating power, cost effective high volume MPU.
2.1 Manifestation of specific leakage mechanism in a NMOS transistor depending on polarization
2.2 Gate and sub-threshold leakage current for three different TSMC technologies
4.1 Results of the sub-threshold slope extraction for STM 90nm lvt
4.2 Results of the DIBL effect coefficient extraction for STM 90nm lvt
4.3 Results for the α factor and Vth0 for STM 90nm lvt
4.4 Results for the body effect coefficient for STM 90nm lvt
4.5 Io for a NAND2x2 gate from the STM 90nm lvt technology
4.6 Technology parameters summary for the STM 90nm lvt
4.7 Technology parameters summary for the STM 90nm svt
4.8 Technology parameters summary for the STM 90nm hvt
4.9 Technology parameters summary for the STM 90nm - Vdd = 1V
5.1 Number of CSA levels for some typical multiplier width
5.2 Summary of the multipliers delays and cell counts
6.1 Approximation of k1 for STM 90nm technology
6.2 SIA ITRS 2004 expected transistors Ion/Ioff
6.3 Values of A and B for the three types of STM090 transistors
6.4 Parameters of a 16 bit Wallace multiplier
6.5 Effect of parallelization on architectural parameters
6.6 Effect of pipelining on architectural parameters
7.1 Nominal values for thirteen 16 bit multipliers based on the STM 90nm technology and transistors of the SVT type.
7.2 Optimal Vdd, Vth and Ptot.
8.1 Optimal total power consumption of thirteen 16 bit multipliers in all STM 90nm technology flavors
9.1 Comparison table between two circuits having a difference of ∆N = (N1 − N2) cells and ∆Tr = (a1N1 − a2N2) transitions.
9.2 Consumption of the thirteen multipliers in µW for Vdd=1V, Vth=0.4V and f=62.5MHz.
9.3 Consumption of the thirteen multipliers in µW for Vdd=1V, Vth=0.12V and f=62.5MHz.
10.1 Nominal values of the 4 implemented multipliers. Nominal frequency is 62.5MHz
10.2 Pin assignments for the APEX EP20K600EFC672 FPGA
10.3 Measured nominal (1V@62.5MHz) power consumption and maximal working frequency
List of Symbols
Symbol                        Description                                  Unit
Vds                           Transistor drain-to-source voltage           [V]
Vgs                           Transistor gate-to-source voltage            [V]
Vbs                           Transistor bulk-to-source voltage            [V]
Vdd                           Power supply voltage                         [V]
Vth0                          Reference transistor threshold voltage       [V]
Vth = Vth0 − ηVds − γVbs      Effective threshold voltage                  [V]
η                             DIBL effect coefficient
γ                             Body bias effect coefficient
n                             Sub-threshold slope
Ut = kbT/q                    Thermal potential                            [V]
kb = 1.38E-23                 Boltzmann constant                           [J/K]
T                             Temperature                                  [K]
q = 1.6E-19                   Elementary charge                            [C]
Ion                           Transistor on current                        [A]
Ioff                          Transistor off current                       [A]
I0                            Reference current                            [A]
µ0                            Low field mobility                           [cm²/V/s]
µeff                          Effective carriers mobility                  [cm²/V/s]
α                             Alpha power law coefficient
kt                            Delay proportional constant
Ci                            Capacitance of node i on the critical path   [F]
C                             Average cell capacitance                     [F]
LD                            Logical depth
a                             Circuit activity
N                             Number of cells
f                             Circuit working frequency                    [Hz]
χ                             Intrinsic design delay relating Vdd to Vth
k1 = Pdyn/Pstat               Dynamic power over static power ratio
tcout                         Full adder carry out delay                   [s]
tsum                          Full adder sum delay                         [s]
tdff                          Register delay                               [s]
tdff setup                    Register setup time                          [s]
tFA                           Worst case full adder delay                  [s]
tbk adder                     Brent-kung adder delay                       [s]
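As a quick numerical illustration of two definitions from the list above, the following Python sketch evaluates the thermal potential Ut = kbT/q and the effective threshold voltage Vth = Vth0 − ηVds − γVbs. The function names and the numerical values of Vth0, η and γ used in the usage note are hypothetical and chosen only for illustration; they are not extracted technology parameters.

```python
# Physical constants with the values given in the list of symbols.
K_B = 1.38e-23  # Boltzmann constant [J/K]
Q = 1.6e-19     # elementary charge [C]


def thermal_potential(temperature_k):
    """Return Ut = kb*T/q in volts for a temperature in kelvin."""
    return K_B * temperature_k / Q


def effective_vth(vth0, eta, vds, gamma, vbs):
    """Return Vth = Vth0 - eta*Vds - gamma*Vbs in volts.

    vth0  : reference threshold voltage [V]
    eta   : DIBL effect coefficient (hypothetical value in the usage note)
    gamma : body bias effect coefficient (hypothetical value in the usage note)
    """
    return vth0 - eta * vds - gamma * vbs
```

For instance, `thermal_potential(300.0)` gives roughly 26mV, and `effective_vth(0.4, 0.1, 1.0, 0.2, 0.0)` lowers an assumed 0.4V reference threshold to about 0.3V through the DIBL term alone.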
Chapter 1
Introduction
1.1 Motivations
Digital integrated circuits are found everywhere in modern life and many of them are
embedded in mobile devices where only limited power resources are available (e.g. mobile
phones, watches, mobile computers, personal assistants, ...). To allow a usable
battery runtime, such devices must be designed to consume the lowest possible power.
Furthermore, low power is also very important for non-portable devices. Indeed,
a reduced power consumption can greatly decrease the packaging costs and greatly
increase the circuit reliability, which is tightly related to the circuit working
temperature. For these reasons, low power design is now mandatory for all types of digital
circuits.
Year                        2006  2007  2008  2009  2010  2011  2012  2013  2014
Technology node [nm]          90    65    65    65    45    45    45    32    32
Printed gate length [nm]      48    42    38    34    30    27    24    21    19
Number of transistors [M]    193   386   386   386   773   773   773  1546  1546
Chip size [mm²]               88   140   111    88   140   111    88   140   111
Voltage supply [V]           0.9   0.8   0.8   0.8   0.7   0.7   0.7   0.6   0.6
Internal frequency [GHz]     6.7   9.2  10.9  12.3    15    17    20    22    28
Total power [W]               98   104   111   116   119   119   125   137   137
Table 1.1: The International Technology Roadmap for Semiconductors [1] (ITRS),
update 2006 for low operating power, cost effective high volume MPU.
As shown in Table 1.1, the number of transistors per circuit will continue to
increase as predicted by Moore’s law [2], whereas the transistor sizes will continue to
shrink. Despite a decreased supply voltage, the total power will continue to increase.
The reduction of the supply voltage is dictated by the need to maintain the electric
field constant on the ever shrinking gate oxide. Unfortunately, to keep transistor speed
(proportional to the transistor “on” current) acceptable, the threshold voltage must be
reduced too, which results in an exponential increase of the “off” transistor current,
i.e. the current constantly flowing through the transistor even when it should be
“non-conducting”.
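The exponential dependence of the “off” current on the threshold voltage can be illustrated with a minimal Python sketch. It assumes the commonly used sub-threshold expression Ioff = I0·exp(−Vth/(n·Ut)) at Vgs = 0; this form is an assumption made here for illustration only (the models actually used in this work are defined in Chapter 3), and the values of n and Vth are hypothetical.

```python
import math

K_B = 1.38e-23  # Boltzmann constant [J/K]
Q = 1.6e-19     # elementary charge [C]


def off_current(i0, vth, n, temperature_k=300.0):
    """Assumed model: Ioff = I0 * exp(-Vth / (n * Ut)) at Vgs = 0."""
    ut = K_B * temperature_k / Q  # thermal potential [V]
    return i0 * math.exp(-vth / (n * ut))


def leakage_ratio(vth_high, vth_low, n, temperature_k=300.0):
    """Factor by which Ioff grows when Vth is lowered from vth_high to vth_low."""
    return (off_current(1.0, vth_low, n, temperature_k)
            / off_current(1.0, vth_high, n, temperature_k))


if __name__ == "__main__":
    # Hypothetical slope factor n = 1.5; Vth lowered from 0.4V to 0.3V.
    print(leakage_ratio(0.4, 0.3, 1.5))
```

Under this assumed model, every reduction of Vth by n·Ut·ln(10) multiplies the “off” current by ten, which is why a modest threshold reduction has such a large impact on static power.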
Figure 1.1: The left graph shows the transistor off-state current versus the gate length,
squares indicate pre-production transistors and diamonds indicate research devices.
The histogram on the right shows the total power as a function of technology node,
for a fixed (30m) total transistor width. Source: Intel [3].
The left part of Fig. 1.1 shows the exponential increase of static power for real
transistors of various sizes. By looking at the right part of Fig. 1.1, we can observe
that this exponential increase of the static power can reach a point (starting at the
90nm node on the histogram) where it completely cancels the benefit of a reduced
dynamic power (due to reduced capacitances and supply voltage).
Since static consumption is now an important contributor to the total power, the
design methodologies used in the past, based on dynamic power considerations only,
are no longer effective and need to be reconsidered.
Until recently, static power was only relevant when the circuit was idle.
This explains why many static power reduction techniques are only applicable
when blocks are unused. A typical example is the Gated-Vdd approach [4] [5]
[6], where a transistor is placed between the real supply voltage and a virtual supply
voltage, which allows unused blocks to be powered off. However, Fig. 1.1 clearly shows that
static power reduction should now be tackled in running mode, too.
Moreover, the large majority of the existing leakage reduction techniques apply at
circuit and transistor level. Examples are:
• Multi V th technology, with fast low V th transistors on critical paths and slow
high V th transistors outside critical paths (MTCMOS) [7] [8] [9] [10]
• Electrical regulation of V th (VTCMOS, SATS) [11] [12] [13]
• DTCMOS (Dynamic V th) with transistor bodies connected to MOS gates [4]
[14]
This thesis considers the reduction of the total power, i.e. dynamic plus static
contributions, at a high level and during runtime. Basically, low power consumption
is sought through architectural and technology modifications in modern nanometer
CMOS processes.
1.2 Thesis outline
In Chapter 2, the main sources of power consumption in CMOS technologies are
reviewed, with an emphasis on the static ones. This makes it possible to define the delay and
power models in Chapter 3. These models are used extensively throughout the thesis
and are hence considered the foundation of this work. In Chapter 4, the 90nm
CMOS technology from ST Microelectronics is described in detail and the required
model parameters are derived from SPICE-like simulations. Chapter 5 illustrates and
describes the different multiplier architectures used in the various examples and case
studies. In Chapter 6, the models for a total power consumption comparison in the
case where the supply voltage and the threshold voltage are freely modifiable are
derived. In particular, this chapter shows that, under such conditions, the total power
consumption (for a given delay) presents a minimum. Its application to architectural
modifications is reported in Chapter 7, followed by a similar analysis for technology
modifications in Chapter 8. A different situation is considered in Chapter 9, where
total power comparison models and charts are obtained for the case where the supply
and threshold voltages are fixed. Finally, Chapter 10 reports the power consumption
of a circuit manufactured in a 90nm technology. This circuit is composed of 4
multipliers presenting different combinations of architecture and technology modifications.
The thesis closes with the conclusions in Chapter 11.
1.3 Contributions
The main contributions provided by this thesis are:
• Chapters 2-3: Collection and description of existing models for static power,
dynamic power, total power and delay.
• Chapter 4: Complete characterization of the ST Microelectronics 90nm general
purpose technology for all three available transistor types (LVT, SVT, HVT).
• Chapter 5: Detailed description and classification of thirteen multiplier
architectures.
• Chapter 6: Development and analysis of closed-form equations for optimal total
power, optimal supply voltage and optimal threshold voltage in a scenario where
supply and threshold voltages are freely tunable.
• Chapter 7: Application of the theory presented in Chapter 6 to architecture
modifications.
• Chapter 8: Application of the theory presented in Chapter 6 to technology
modifications.
• Chapter 9: Development and application of easy-to-use equations and graphical
tools for comparing architectures under fixed supply and threshold voltage
conditions.
• Chapter 10: Implementation, testing and analysis of a physical realisation of 4
multipliers representing different combinations of technology flavors and
architectures.
Chapter 2
Sources of dissipation in CMOS
transistors
Circuits designed before 1980 were mainly implemented in NMOS technology. Such
devices had the major drawback of a large current constantly flowing through
the circuit even when no transitions occurred. To solve this issue, CMOS
(Complementary Metal Oxide Semiconductor) technology was introduced. This seemed to be
an ultimate solution for avoiding static power consumption. Thus, the only remaining
sources of dissipation were the switched capacitance power (due to the
charging/discharging of capacitance nodes) and the shortcut power (due to the current flowing
from the supply voltage (Vdd) to the ground (Vss) when switching), both only present
during node transitions.
Unfortunately, the constant dimension reduction driven by Moore’s law and the
corresponding reduction of the supply voltage (needed to maintain the electric field on
the transistor gates constant) yielded a huge increase of the static power consumption,
making it once again a non-negligible source of consumption. There are two main
reasons for this. The first is the reduction of the threshold voltage imposed
by the Vdd reduction in order to keep the speed acceptable, and the second is
the new electrical effects caused by the reduction of the transistors’ geometrical
dimensions, known as short channel effects.
Starting from 0.13µm technology node (i.e. a technology with a minimal transistor
size of 0.13µm), the static power consumption cannot be neglected anymore and must
be added to the dynamic power to correctly estimate the total power consumption.
In this chapter, the sources of dissipation in CMOS transistors are discussed in
detail, with a special focus on those contributing to the static consumption.
2.1 Dynamic consumption
Dynamic consumption is considered as the dissipation that occurs only when the
circuit is active (i.e. internal circuit nodes are switching).
Two distinct contributions exist. The first is the so called switching energy and
corresponds to the energy required to charge (and discharge) the node capacitances
during transitions. The second is the energy dissipated during transitions due to the
conductive path existing, for a short period of time, between the supply voltage and
the ground. This effect is known as shortcut or short-circuit.
2.1.1 Switching energy
The energy consumed to charge (and then discharge) a capacitance C to a voltage V
is given by [7] (see footnote I):
\[ E_{\text{switching}} = C V^2 \qquad (2.1) \]
This type of consumption can easily be reduced from one technology node to the next
by reducing the capacitance C and the supply voltage V. Both reductions are effectively
obtained in a new scaled technology; in fact, the supply voltage has to be reduced
in order to avoid high electric fields on the transistor gates, and the reduction of the
transistor physical dimensions automatically results in reduced capacitances. This
type of dissipation was the primary source of consumption in active mode for circuits
implemented in technologies larger than 0.13µm [15].
2.1.2 Shortcut energy
The second source of dynamic consumption arises from shortcut paths. Consider a
CMOS inverter (Fig. 2.1) with the input node at zero. In this condition the NMOS
transistor is off and the PMOS transistor is conducting. Now, if the input node potential
increases from 0 to Vdd, the NMOS will start to conduct for Vin > Vth,nmos
while the PMOS is still on, which results in a current flowing from Vdd to Vss. Then,
when Vin reaches the potential Vdd − Vth,pmos, the PMOS stops conducting and
the shortcut current vanishes too.
Clearly, this type of conduction only exists if the supply voltage Vdd is greater
than the sum of the NMOS and PMOS threshold voltages (Vth,nmos + Vth,pmos).
I. This equation refers to the energy required to charge and discharge the capacitance, each
process contributing (1/2)CV^2.
Figure 2.1: CMOS inverter (PMOS P1 and NMOS N1 connected between Vdd and Vss, input IN,
output OUT loaded by CL; the short-cut and switching current paths are indicated)
The energy dissipated during one transition can be expressed as [16]:
\[ E_{\text{shortcut}} \propto (V_{dd} - V_{th,nmos} - V_{th,pmos})^3 \cdot \tau \qquad (2.2) \]
With Vdd the supply voltage, Vth,nmos and Vth,pmos the threshold voltages of the
NMOS and PMOS, respectively, and τ the transition time, i.e. the period of time
needed to sweep the input voltage from 0 to Vdd. More accurate models can be found
in [17] [18] [19].
For well designed cells (i.e. with balanced rising and falling edges), the shortcut
energy is in general much smaller than the switching energy. Moreover, for very low
supply voltage designs, the value V dd − V th nmos − V th pmos can be very small.
Additionally, the case where V dd < V th nmos+ V th pmos will not present shortcut
dissipation at all. For these reasons, in modern designs, shortcut power is often
not considered or is simply included in the switching consumption by increasing the
switching capacitance to an equivalent capacitance which incorporates the shortcut
effect.
2.2 Static consumption
Contrary to the dynamic consumption, static power is defined as the consumption
originating from currents constantly flowing from Vdd to ground. This means that
even when the circuit is in idle mode (no transition occurs), power continues to be
dissipated. For long channel transistors with high threshold voltage, this type of
dissipation was completely negligible. Unfortunately, present and future technologies
will suffer from high static power, which could even exceed the dynamic contribution in
active mode. Hence, it is of the utmost importance to consider this type of dissipation
in present and future design methodologies.
To understand the main sources of static dissipation, let us look at the structure
of a transistor in CMOS technology. Fig. 2.2 shows 5 different leakage mechanisms
that can be observed in a CMOS transistor (only the NMOS transistor is illustrated,
as PMOS behaves exactly in the same way).
These mechanisms are:
(a) Sub-threshold current;
(b) Gate leakage current;
(c) Reverse-bias p-n junction current and band to band tunneling;
(d) Gate-Induced Drain Leakage (GIDL) current;
(e) Punchthrough current.
Figure 2.2: Sources of static power consumption in an NMOS transistor: sub-threshold (a),
gate leakage (b), p-n junction (c), GIDL leakage (d) and punchthrough (e)
2.2.1 Sub-threshold current
The most important leakage current is the sub-threshold current, which originates from the
diffusion of minority carriers in a non-conducting transistor (Vgate − Vsource < Vth).
Under this condition, the transistor is operating in weak inversion. The potential applied
between drain and source creates a flow of minority carriers at the surface
of the channel. The equation describing this mechanism is [20] [21]:
\[ I_{\text{sub-threshold}} = I_o \cdot e^{-\frac{V_{th}}{nU_t}} \left( 1 - e^{-\frac{V_{ds}}{U_t}} \right) \approx I_o \cdot e^{-\frac{V_{th}}{nU_t}} \qquad (2.3) \]
With Io the reference static current, V th the threshold voltage, n the sub-threshold
slope, Ut(≡ kbT/q) the thermal potential and V ds the Drain-Source voltage.
Eq. (2.3) shows an exponential dependency of the sub-threshold current on the
threshold voltage Vth. This is the reason why the low Vth characterizing recent
technologies leads to large sub-threshold currents. Moreover, in typical digital designs,
Vds is much larger than nUt, which leads to the approximation 1 − e^{−Vds/Ut} ≈ 1.
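The exponential dependency of Eq. (2.3) on Vth can be checked numerically. The following Python sketch is only an illustration: the values of Io, n and Vds are hypothetical and do not come from any technology characterized in this work; the rounded constants kb and q are those quoted in Chapter 4.

```python
import math

K_B = 1.38e-23  # Boltzmann constant [J/K], rounded value
Q = 1.6e-19     # elementary charge [C], rounded value

def thermal_potential(temp_kelvin):
    """Thermal potential Ut = kb*T/q, in volts."""
    return K_B * temp_kelvin / Q

def subthreshold_current(i_o, v_th, n, u_t, v_ds):
    """Eq. (2.3), including the drain-source factor (1 - exp(-Vds/Ut))."""
    return i_o * math.exp(-v_th / (n * u_t)) * (1.0 - math.exp(-v_ds / u_t))

def subthreshold_current_approx(i_o, v_th, n, u_t):
    """Right-hand side of Eq. (2.3), valid when Vds is much larger than Ut."""
    return i_o * math.exp(-v_th / (n * u_t))

u_t = thermal_potential(300.15)   # 27 degrees C
i_o, n, v_ds = 1e-6, 1.5, 1.0     # hypothetical values, for illustration only

for v_th in (0.2, 0.3, 0.4):
    exact = subthreshold_current(i_o, v_th, n, u_t, v_ds)
    approx = subthreshold_current_approx(i_o, v_th, n, u_t)
    print(f"Vth = {v_th:.1f} V: I = {exact:.3e} A, exact/approx = {exact / approx:.6f}")
```

With these assumed values, each 0.1 V reduction of Vth multiplies the current by exp(0.1/(nUt)), while the drain-source factor is indistinguishable from 1.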
The value of V th is not fixed for a given technology; in fact, it can be modulated
through different effects like:
• Drain Induced Barrier Lowering (DIBL) effect: In short channel transistors,
the potential on the drain contact modulates the threshold voltage by lowering
the energy barrier at the surface of the channel. A schematic representation of
this effect is illustrated in Fig. 2.3. For long channel transistors (L1), the potential
in the channel is independent of the drain voltage (Vd1 and Vd2 show
the same potential profile), whereas for short channels (L2), an increase of the
drain voltage also reduces the barrier energy level in the channel, which can
be modeled by a reduction of the threshold voltage. Ideally, the DIBL effect
does not change the sub-threshold slope n. DIBL can be reduced by using high
surface and channel doping and shallow source/drain junction depths.
• Body effect: The body effect appears when a potential difference is present between
body (bulk) and source. This happens because bulk and source operate
as a reverse biased p-n junction. By increasing the body potential in an NMOS
or by decreasing it in a PMOS (forward biasing), the junction depletion reduces
the channel potential and the sub-threshold leakage current increases. Similarly,
a reduction of the body potential (lower than Vss for NMOS and higher than
Vdd for PMOS, called reverse biasing) increases the channel potential, leading
to a reduced sub-threshold leakage. It should be noted that for body-source potentials
(Vbs) higher than 0.5 V the p-n junction starts to conduct as a forward
biased diode, drawing a very large current, which has to be avoided at all costs.
The body effect is more pronounced for high bulk doping levels and decreases as
the substrate reverse bias increases. At Vbs = 0, the body effect sensitivity is equal
to (n − 1), with n the sub-threshold slope. The body effect can be modeled as
a modification of the threshold voltage Vth.
Figure 2.3: Effect of Drain Induced Barrier Lowering (DIBL) on short channel transistors
(barrier potential energy versus channel position for a long channel L1 and a short
channel L2, at drain voltages Vd1 and Vd2 > Vd1)
By considering the effects of DIBL and body bias, the threshold voltage can be
expressed by [22] [23] [24]:
\[ V_{th} = V_{th0} - \eta V_{ds} - \gamma V_{bs} \qquad (2.4) \]
With Vth0 the reference threshold voltage for Vds = Vbs = 0, η (eta) the DIBL
effect coefficient and γ (gamma, equal to n − 1 for Vbs = 0) the linearized body effect
coefficient.
By considering the described effects, the sub-threshold current can be expressed
as:
\[ I_{\text{sub-threshold}} = I_o \cdot e^{-\frac{V_{th0} - \eta V_{ds} - \gamma V_{bs}}{nU_t}} \qquad (2.5) \]
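As an illustration of Eq. (2.4) and Eq. (2.5), the Python sketch below evaluates how the drain voltage (through DIBL) and a reverse body bias change the sub-threshold current. All parameter values are hypothetical and chosen only to make the trends visible; they are not extracted values.

```python
import math

def effective_vth(v_th0, eta, v_ds, gamma, v_bs):
    """Eq. (2.4): threshold voltage modulated by DIBL and body bias."""
    return v_th0 - eta * v_ds - gamma * v_bs

def subthreshold_current(i_o, v_th0, eta, v_ds, gamma, v_bs, n, u_t):
    """Eq. (2.5): sub-threshold current with the modulated threshold voltage."""
    v_th = effective_vth(v_th0, eta, v_ds, gamma, v_bs)
    return i_o * math.exp(-v_th / (n * u_t))

# Hypothetical parameter values, for illustration only
params = dict(i_o=1e-6, v_th0=0.35, eta=0.1, gamma=0.2, n=1.5, u_t=0.02588)

high_vds = subthreshold_current(v_ds=1.0, v_bs=0.0, **params)
low_vds = subthreshold_current(v_ds=0.5, v_bs=0.0, **params)
reverse_bias = subthreshold_current(v_ds=1.0, v_bs=-0.3, **params)

print(f"Vds from 0.5 V to 1.0 V multiplies the leakage by {high_vds / low_vds:.2f}")
print(f"A reverse body bias of 0.3 V divides the leakage by {high_vds / reverse_bias:.2f}")
```

Both factors follow directly from the exponent of Eq. (2.5): a change ΔVth of the effective threshold voltage scales the current by exp(−ΔVth/(nUt)).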
2.2.2 Gate leakage current
The transistor gate potential influences the charges in the channel by electrostatic
effect: an accumulation of holes in the gate produces an accumulation of electrons at
the surface of the channel, which is exactly the behavior of a capacitor with gate
and channel as plates and the silicon oxide as dielectric. Ideally, no current should
flow across the gate oxide, but in practice some electrons are able to pass through the
oxide, generating a gate current. The mechanisms behind this effect can be divided
into two categories: oxide tunneling and hot carrier injection.
Oxide tunneling current
Tunneling through the gate oxide is primarily due to direct tunneling across very thin
oxide layers (less than 3-4 nm). A model for this effect has been reported in [25] [26]:
\[ I_{gate} = K_g \cdot W \left( \frac{V}{t_{ox}} \right)^2 e^{-\alpha_g t_{ox}/V} \qquad (2.6) \]
With Kg and αg (alpha gate, see footnote II) experimentally derived constants, W the width of the
transistor, tox the gate oxide thickness and V the potential across the gate oxide. The
previous equation clearly shows how the reduction of the oxide thickness exponentially
increases the tunneling effect. An efficient way to reduce this source of leakage in
future technologies is to use other insulators with a higher dielectric constant, resulting
in a higher effective oxide thickness (i.e. the thickness of the silicon oxide that would
show the same behavior as this high dielectric insulator). In this way, it should be
possible to maintain the gate tunneling current to acceptable (i.e. negligible) levels.
The main candidates to substitute the silicon oxide (κ = 3.9) are the hafnium oxide
(HfO2, κ = 25) and Hafnium silicate (HfSiO4, κ = 11) [27].
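To give an order of magnitude for the thicker insulators that a higher dielectric constant allows, the sketch below uses the standard parallel-plate relation (capacitance per unit area proportional to κ/t). This relation and the starting thickness of 1.6 nm are assumptions introduced here for illustration; only the κ values are taken from the text above.

```python
KAPPA_SIO2 = 3.9  # relative dielectric constant of silicon oxide (value quoted above)

def equal_capacitance_thickness(t_sio2_nm, kappa):
    """Physical thickness (nm) of an insulator with dielectric constant kappa that
    gives the same gate capacitance per unit area as t_sio2_nm of silicon oxide,
    assuming a simple parallel-plate capacitor."""
    return t_sio2_nm * kappa / KAPPA_SIO2

t_sio2_nm = 1.6  # hypothetical silicon oxide thickness, in nm

for name, kappa in (("HfO2", 25.0), ("HfSiO4", 11.0)):
    thickness = equal_capacitance_thickness(t_sio2_nm, kappa)
    print(f"{name} (kappa = {kappa:g}): {thickness:.2f} nm")
```

Since the tunneling current of Eq. (2.6) falls off exponentially with the insulator thickness, a layer several times thicker at equal gate capacitance is what makes these materials attractive; the constants Kg and αg of Eq. (2.6) would of course differ for another insulator.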
Hot carrier injection
Due to the high electric field at the Si−SiO2 (channel-oxide) interface, electrons and
holes can gain sufficient energy to enter the gate oxide. Because the effective
mass of the electrons, as well as their barrier height, is lower than the corresponding
ones for holes, electron injection is much more probable [28]. A reduction of the
supply voltage will reduce the electric field on the gate, thereby also reducing the
hot carrier injection.
II. This αg parameter has nothing to do with the α parameter used in the alpha power law model
of the transistor on current, which is extensively used in this thesis.
2.2.3 Reverse bias p-n junction leakage and band to band
tunneling
In the normal transistor operation mode, the drain/source to well junctions are reverse
biased. Under this condition, a small current exists due to the drift of carriers
originating from thermal electron-hole generation. Nevertheless, in advanced short
channel MOS transistors (where heavily doped and shallow junctions are used), such effects are
masked by the dominating band-to-band tunneling.
Band-to-band tunneling happens in junctions with a high electric field (> 10^6 V/cm)
and is due to the direct tunneling of electrons from the valence band of the p region
to the conduction band of the n region. Closed form equations describing this type
of leakage exist [25] [29].
2.2.4 Gate-Induced Drain Leakage (GIDL)
In the overlapping zone between gate and drain, a high electric field can exist, leading
to the generation of currents from drain to substrate. Consider an NMOS transistor;
when a low gate potential is applied (Vg near zero volts or below), holes accumulate
at the surface and create a region which is more heavily p doped than the substrate.
If this happens while the drain is connected to a high potential (say Vdd), the
depletion layer near the drain becomes narrower. If this is important enough to invert
the polarity of the n+ drain region under the gate, high field effects like band-to-band
tunneling, avalanche multiplication and trap-assisted tunneling take place. As a
consequence, minority carriers are emitted in the drain region underneath the gate
and pushed to the substrate by the vertical electric field. All these effects are
increased by a reduction of the gate oxide thickness.
This type of leakage is especially important for "relatively high" supply voltage
circuits (Vdd > 1.1 V). Low power digital designs, with very low supply voltage (i.e.
Vdd around 0.5 V), are not heavily affected by this type of leakage. More details on
the GIDL effect can be found in [30] [28] [25].
The equivalent of the GIDL effect for a “high” source potential is called GISL
(Gate-Induced Source Leakage). This effect is generally not considered because, in
normal transistor operations, the source will show a low or zero potential compared
to the bulk.
2.2.5 Punchthrough
With the reduction of the physical dimensions, the depletion layers of source and drain
come nearer and nearer until they touch each other, giving rise to punchthrough currents.
In submicron MOS transistors, implants at the substrate surface aimed at Vth
adjustment are used, forcing the punchthrough to occur deeper in the substrate. The
size of the depletion layers directly depends on the Vds potential. Hence, low voltage design
can prevent the generation of punchthrough currents [31] [25].
2.3 Summary
In deep sub-micron and nanometer technologies, the dynamic power consumption is
no longer the only relevant source of power dissipation. In fact, present and future
technologies will be characterized by large static power consumption coming from
different leakage sources. In this chapter, the principal ones have been explained.
However, it is important to observe that, depending on the transistor polarization,
only a part of the described mechanisms occur. All realistic combinations of polar-
ization are shown in Table 2.1 for a NMOS transistor.
Vg  Vd  Vs  Sub-threshold  Gate leakage  p-n junction  GIDL/GISL  Punchthrough
0   0   0   NO             NO            NO            NO         NO
0   0   1   YES            NO            YES           GISL       YES
0   1   0   YES            NO            YES           GIDL       YES
0   1   1   NO             NO            YES           BOTH       NO
1   0   0   NO             YES           NO            NO         NO
1   1   1   NO             YES           YES           NO         NO
Table 2.1: Manifestation of specific leakage mechanisms in an NMOS transistor depending
on polarization
In a typical CMOS digital design, the NMOS transistor will have two modes of
operation: Vg/Vd/Vs = 0/1/0 for the off transistor and Vg/Vd/Vs = 1/0/0 for a conducting
transistor. When the transistor is on (conducting), the only mechanism that
occurs is the gate leakage, whereas for an off transistor, sub-threshold, p-n junction,
GIDL and punchthrough could be present. Nevertheless, the use of a very low supply
voltage (less than 1 V) keeps the p-n junction and the punchthrough effects much
lower than the sub-threshold one. Moreover, for gate potentials no lower than
Vss for NMOS and no higher than Vdd for PMOS, the GIDL mechanism can also
be neglected.
To summarize, the main sources of static power are the sub-threshold current for
off transistors and gate leakage for conducting transistors.
                                    CLN90G   CL013GHP   CL013LVHP
Transistor size [nm]                90       130        130
Vdd [V]                             1.0      1.2        1.0
INVD1
  Sub-threshold leakage power [nW]  5.14     0.56       6.93
  Gate leakage power [nW]           0.82     0.10       0.34
NAND2D1
  Sub-threshold leakage power [nW]  4.91     0.61       6.63
  Gate leakage power [nW]           1.40     0.15       0.56
Table 2.2: Gate and sub-threshold leakage power for three different TSMC technologies
Table 2.2 reports the sub-threshold and gate leakage power dissipation in 3 recent
technologies. Values are reported for an inverter (INVD1) and a 2-input NAND gate
(NAND2D1) [32]. We can observe that the sub-threshold current remains the principal
source of static power dissipation in deep sub-micron and nanometer technologies.
The next generations could see an exponential increase in the gate leakage if silicon
oxide is still used as insulator. Fortunately, according to [33], high dielectric constant
oxides should be used starting from 2007. Intel also announced at the end of January
2007 [34] that a high-k gate oxide will be used in its 45nm technology for the new
generation of the Intel Core 2 Duo, Intel Core 2 Quad and Xeon families of multi-core
processors.
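The relative weight of the two mechanisms in Table 2.2 can be made explicit with a few lines of Python. The numbers are copied from the table; the helper name gate_share is introduced here only for illustration.

```python
# Leakage values in nW copied from Table 2.2: (sub-threshold, gate)
leakage_nw = {
    ("CLN90G", "INVD1"): (5.14, 0.82),
    ("CL013GHP", "INVD1"): (0.56, 0.10),
    ("CL013LVHP", "INVD1"): (6.93, 0.34),
    ("CLN90G", "NAND2D1"): (4.91, 1.40),
    ("CL013GHP", "NAND2D1"): (0.61, 0.15),
    ("CL013LVHP", "NAND2D1"): (6.63, 0.56),
}

def gate_share(sub_threshold, gate):
    """Fraction of the (sub-threshold + gate) leakage that is due to gate leakage."""
    return gate / (sub_threshold + gate)

for (technology, cell), (sub, gate) in leakage_nw.items():
    print(f"{technology:10s} {cell:8s} gate leakage share: {100 * gate_share(sub, gate):.1f} %")
```

In every entry the sub-threshold contribution is the larger one, which is the observation made in the text.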
Chapter 3
Delay and power models
3.1 Current models
As stated in Chapter 2, the main contribution to static power comes from sub-threshold
currents flowing from drain to source in off transistors. In short channel
transistors (L < 1µm), the voltage applied between drain and source also influences
the channel conduction by a mechanism known as Drain Induced Barrier Lowering
(DIBL). The off current is therefore modeled as:
\[ I_{off} = I_o e^{-\frac{V_{th0} - \eta V_{dd} - \gamma V_{bs}}{nU_t}} = I_o e^{-\frac{V_{th}}{nU_t}} \qquad (3.1) \]
With Io the reference static current, V th0 the reference threshold voltage, V th the
modulated threshold voltage, η the DIBL effect coefficient, V dd the supply voltage,
γ the body effect coefficient, V bs the body-source voltage, n the sub-threshold slope
and Ut the thermal potential (≡ kT/q).
The "on" current, i.e. the current flowing in a conducting transistor, can be approximated
by the following formula [35] [36] [37] [38]:
\[ I_{on} = I_o \left( \frac{e}{\alpha n U_t} \right)^{\alpha} (V_{dd} - V_{th})^{\alpha} \qquad (3.2) \]
With Io the reference static current, e Euler's number, α the alpha power law
coefficient, n the sub-threshold slope, Ut the thermal potential, Vdd the supply voltage
and Vth (≡ Vth0 − ηVdd − γVbs) the effective threshold voltage.
This model is an empirical fitting equation that accounts for the carrier mobility
reduction. According to [39], the parameter α can be related to mobility by:
\[ \alpha = 1 + \frac{\mu_{eff}}{\mu_0} \qquad (3.3) \]
With µeff the effective carrier mobility and µ0 the low field mobility. Since 0 <
µeff ≤ µ0, the parameter α always lies in the range [1;2], with α = 2 for
long channel transistors.
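The two current models of Eq. (3.1) and Eq. (3.2) can be written down directly. The Python sketch below is a minimal illustration with hypothetical parameter values (they are not the extracted values of any technology flavor); it prints the on/off current ratio for a few supply voltages.

```python
import math

def effective_vth(v_th0, eta, v_dd, gamma, v_bs):
    """Effective threshold voltage Vth = Vth0 - eta*Vdd - gamma*Vbs."""
    return v_th0 - eta * v_dd - gamma * v_bs

def i_off(i_o, v_th, n, u_t):
    """Eq. (3.1): sub-threshold (off) current."""
    return i_o * math.exp(-v_th / (n * u_t))

def i_on(i_o, v_dd, v_th, n, u_t, alpha):
    """Eq. (3.2): alpha power law (on) current, valid for Vdd > Vth."""
    return i_o * (math.e / (alpha * n * u_t)) ** alpha * (v_dd - v_th) ** alpha

# Hypothetical parameter values, for illustration only
i_o, v_th0, eta, gamma, v_bs = 1e-6, 0.35, 0.1, 0.2, 0.0
n, u_t, alpha = 1.5, 0.02588, 1.5

for v_dd in (0.6, 0.8, 1.0):
    v_th = effective_vth(v_th0, eta, v_dd, gamma, v_bs)
    ratio = i_on(i_o, v_dd, v_th, n, u_t, alpha) / i_off(i_o, v_th, n, u_t)
    print(f"Vdd = {v_dd:.1f} V: Vth = {v_th:.3f} V, Ion/Ioff = {ratio:.3e}")
```

Note that with this formulation Ion equals Io·e^α when Vdd − Vth = αnUt, which is a convenient sanity check of an implementation.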
Based on these equations, it is now possible to define the dynamic and static power
consumption as well as delay models.
3.2 Power models
As illustrated in the previous chapter, the total power can be divided into two cate-
gories: dynamic and static power.
3.2.1 Dynamic power
Dynamic power is due to the dissipation during the capacitance charge/discharge
process. The well known equation describing it is:
\[ P_{dyn} = \left( \sum_{i}^{N} a_i C_i \right) f \cdot V_{dd}^2 = a C N f V_{dd}^2 \qquad (3.4) \]
With ai the switching probability per clock period of node i, Ci the capacitance
of node i plus the internal cell capacitance driven by node i, f the circuit frequency,
Vdd the supply voltage, N the number of cells, a the average activity per cell (better
understood as the average number of switching cells over the total number of cells during
a clock cycle) and C the equivalent capacitance defined as (Σ_i ai Ci)/(aN). Using
the proposed definition of activity, only the transitions from 0 to 1 are considered.
The expression aCN, which uses average parameters, must be treated carefully. First,
the average activity on the nets is considered the same as the average activity in
the cells. Moreover, the equivalent capacitance C is only equal to the average cell
capacitance (net + internal cell) when all cells present the same activity, which is
practically never the case. Therefore, C depends on the activity distribution. For this
reason, every time the parameters aCN are used together in equations, they must be
considered as Σ_i ai Ci, rather than average activity times average capacitance times
number of cells.
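The difference between Σ_i ai Ci and the product of the averages is easy to see on a small example. The following Python sketch uses a hypothetical 4-cell circuit (the activities and capacitances are invented for illustration) in which the most active cell also drives the largest load.

```python
def switched_capacitance(activities, capacitances):
    """Sum over all nodes of a_i * C_i, the quantity that enters Eq. (3.4)."""
    return sum(a * c for a, c in zip(activities, capacitances))

def equivalent_capacitance(activities, capacitances):
    """Equivalent capacitance C = (sum_i a_i*C_i) / (a*N), with a the mean activity."""
    n_cells = len(activities)
    mean_activity = sum(activities) / n_cells
    return switched_capacitance(activities, capacitances) / (mean_activity * n_cells)

# Hypothetical circuit: one busy cell driving a large load, three quiet cells
activities = [0.5, 0.1, 0.1, 0.1]
capacitances = [10e-15, 2e-15, 2e-15, 2e-15]  # farads

n_cells = len(activities)
mean_a = sum(activities) / n_cells
mean_c = sum(capacitances) / n_cells

exact = switched_capacitance(activities, capacitances)
naive = mean_a * mean_c * n_cells
c_eq = equivalent_capacitance(activities, capacitances)

print(f"sum(a_i*C_i)          = {exact:.2e} F")
print(f"mean(a)*mean(C)*N     = {naive:.2e} F")
print(f"equivalent C          = {c_eq:.2e} F (average C = {mean_c:.2e} F)")
```

In this example the product of the averages underestimates the switched capacitance, because activity and capacitance are correlated; the equivalent capacitance C absorbs that correlation so that aCN reproduces the exact sum.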
A second contribution to dynamic power comes from the shortcut dissipation due
to current flowing from Vdd to Vss during node transitions. As seen in Chapter 2,
this contribution is nonexistent for a supply voltage Vdd smaller than the sum of the
NMOS and PMOS threshold voltages, and is very small for Vdd near Vthn + Vthp. Moreover,
the short transition times typical of current technologies further reduce the shortcut
dissipation. Thus, this source of dynamic power can simply be accounted for by lumping
this effect into the cell capacitance, which will increase slightly.
3.2.2 Static power
This new source of dissipation coming from non-ideal transistor behavior is particularly
important in deep submicron technologies and can become the main contributor
even in running mode. Moreover, this type of consumption is always present as long
as the circuit is supplied. Hence, even when the circuit does nothing (idle mode),
static power continues to be dissipated. To keep the model simple, only the main
contributor (i.e. the sub-threshold current) is considered. For a detailed discussion of
the other existing sources of static power consumption, please refer to Chapter 2.
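The static power model is given by:
\[ P_{stat} = V_{dd} \cdot \sum_{i}^{N} I_{off}(i) = N \cdot V_{dd} \cdot I_o e^{-\frac{V_{th}}{nU_t}} = N \cdot V_{dd} \cdot I_o e^{-\frac{V_{th0} - \eta V_{dd} - \gamma V_{bs}}{nU_t}} \qquad (3.5) \]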
With N the number of cells, V dd the supply voltage, Io the cell reference current,
n the sub-threshold slope, Ut the thermal potential, V th the modulated threshold
voltage, V th0 the reference threshold voltage, η the DIBL coefficient and γ the body
bias coefficient.
It is important to note that Io in Eq. (3.5) is the average reference off-current per
cell. This factor is different from the single transistor reference off-current, because
complex cells present a modified Io due to stack effect, different transistor sizing,
etc. According to [40], the ratio kdesign = (Io,cell)/(Io,transistor)/(# of transistors)
is about 1.4 for flip-flops, 2.0 for latches, 1.2 for 6T RAM cells and 11 for static
logic. We carried out the same calculation for a few cells with a driving force of 2
in the STM 90nm SVT technology and our results show a kdesign spanning a
slightly narrower range; in fact, we obtain a kdesign of 7.3 for a NAND gate, 6.5 for an
AND gate, 2.5 for a flip-flop and 3.7 for a full adder. Nevertheless, this shows that
the static power consumption per cell can vary from cell to cell. For this reason,
power comparison using Eq. (3.5) requires that both circuits present the same type
of cells (i.e. static logic) or a similar distribution of different cell types. Otherwise, a
compensation factor should be used depending on the type of cells used.
3.2.3 Total power
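Total power is defined as the sum of the dynamic and static consumption. Referring to
the previous sections, the total power model is given by:
\[ \begin{aligned} P_{tot} &= P_{dyn} + P_{stat} \\ &= a C N f V_{dd}^2 + N \cdot V_{dd} \cdot I_o e^{-\frac{V_{th}}{nU_t}} \\ &= N \cdot V_{dd} \left( a C f V_{dd} + I_o e^{-\frac{V_{th}}{nU_t}} \right) \end{aligned} \qquad (3.6) \]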
With N the number of cells, V dd the supply voltage, a the circuit activity, C the
equivalent capacitance, f the frequency, Io the average off-current per cell, V th the
modulated threshold voltage, n the sub-threshold slope and Ut the thermal potential.
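As a minimal sketch of Eq. (3.4) to Eq. (3.6), the Python functions below compute the dynamic, static and total power and print the static share for several threshold voltages. The circuit and technology values are hypothetical and serve only to show how the balance between the two terms shifts with Vth.

```python
import math

def p_dyn(a, c, n_cells, f, v_dd):
    """Eq. (3.4): dynamic power, with a*c*n_cells standing for sum_i(a_i*C_i)."""
    return a * c * n_cells * f * v_dd ** 2

def p_stat(n_cells, v_dd, i_o, v_th, n, u_t):
    """Eq. (3.5): static power due to sub-threshold leakage."""
    return n_cells * v_dd * i_o * math.exp(-v_th / (n * u_t))

def p_tot(a, c, n_cells, f, v_dd, i_o, v_th, n, u_t):
    """Eq. (3.6): total power as the sum of dynamic and static power."""
    return p_dyn(a, c, n_cells, f, v_dd) + p_stat(n_cells, v_dd, i_o, v_th, n, u_t)

# Hypothetical circuit and technology values, for illustration only
a, c, n_cells, f, v_dd = 0.1, 5e-15, 100_000, 100e6, 1.0
i_o, n, u_t = 1e-6, 1.5, 0.02588

for v_th in (0.15, 0.25, 0.35):
    total = p_tot(a, c, n_cells, f, v_dd, i_o, v_th, n, u_t)
    static = p_stat(n_cells, v_dd, i_o, v_th, n, u_t)
    print(f"Vth = {v_th:.2f} V: Ptot = {total:.3e} W, static share = {100 * static / total:.1f} %")
```

The sweep only varies Vth at fixed frequency and supply voltage; it does not account for the delay constraint, which is introduced by the delay model of the next section.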
3.3 Delay models
All power related discussions are worthless if the circuit delay (related to performance)
is not considered. The model retained here is the very common one that considers
the delay of a cell as the time needed to charge the load capacitance with a driving
current. To charge a capacitance C to the potential V, the electric charge
needed is Q = CV. Considering that this charge is delivered at a rate
of Ion [A = C/s], it is easy to find that:
\[ t_{gate} = k_t \frac{C V}{I_{on}} \qquad (3.7) \]
With kt a constant accounting for the fact that the driving current is not constant
during the capacitance charge (the values of this constant for the technology flavors
used in this thesis are 15.1 for LVT, 24.7 for SVT and 30.1 for HVT. These values were
obtained by multiplying the delay of a NAND2x2 cell with Ion and then by dividing it
by the driven capacitance and by the supply voltage). Ion is the on transistor current
and its formulation is given by Eq. (3.2).
In a digital design, the maximal achievable frequency is the inverse of the sum of the
delays on the critical path. In mathematical form, this reads:
\[ (f_{max})^{-1} = t_{\text{critical path}} = k_t \sum_{i}^{LD} \frac{C_i \cdot V_{dd}}{I_{on}} = k_t C \frac{LD \cdot V_{dd}}{I_{on}} \qquad (3.8) \]
With Ci the i-th load capacitance on the critical path, LD the logical depth defined as
the number of cells forming the critical path, and C the average critical path capacitance
defined as Σ_i Ci/LD.
Combining Eq. (3.8) with Eq. (3.2) yields:
\[ f_{max} = \frac{I_o \cdot e^{\alpha}}{k_t \cdot C \cdot LD \cdot (\alpha n U_t)^{\alpha}} \cdot \frac{(V_{dd} - V_{th})^{\alpha}}{V_{dd}} \qquad (3.9) \]
In the previous equation, it is interesting to observe that a high Io, corresponding
to a highly leaky technology, also corresponds to a high maximal frequency, thus
underlining the tight relation between high performance and static dissipation.
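The relation between Eq. (3.8) and Eq. (3.9), and the proportionality between fmax and Io, can be verified numerically. In the Python sketch below, kt = 24.7 is the SVT value quoted above; all other values (Io, C, LD, n, α, Vth) are hypothetical and used only for illustration.

```python
import math

def i_on(i_o, v_dd, v_th, n, u_t, alpha):
    """Eq. (3.2): alpha power law on current."""
    return i_o * (math.e / (alpha * n * u_t)) ** alpha * (v_dd - v_th) ** alpha

def f_max_from_delay(k_t, c, logic_depth, v_dd, i_on_value):
    """Eq. (3.8): inverse of the critical path delay."""
    return i_on_value / (k_t * c * logic_depth * v_dd)

def f_max(i_o, k_t, c, logic_depth, v_dd, v_th, n, u_t, alpha):
    """Eq. (3.9): maximal frequency in closed form."""
    prefactor = i_o * math.e ** alpha / (k_t * c * logic_depth * (alpha * n * u_t) ** alpha)
    return prefactor * (v_dd - v_th) ** alpha / v_dd

# kt for SVT as quoted in the text; the other values are hypothetical
k_t, c, logic_depth = 24.7, 5e-15, 30
i_o, v_th, n, u_t, alpha = 1e-6, 0.3, 1.5, 0.02588, 1.5

for v_dd in (0.6, 0.8, 1.0):
    closed_form = f_max(i_o, k_t, c, logic_depth, v_dd, v_th, n, u_t, alpha)
    via_delay = f_max_from_delay(k_t, c, logic_depth, v_dd, i_on(i_o, v_dd, v_th, n, u_t, alpha))
    print(f"Vdd = {v_dd:.1f} V: fmax = {closed_form:.3e} Hz (via Eq. 3.8: {via_delay:.3e} Hz)")
```

For simplicity the sketch keeps Vth fixed while sweeping Vdd, i.e. it ignores the DIBL modulation of Vth; including it only requires replacing v_th by Vth0 − ηVdd − γVbs.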
3.4 Summary
In this chapter, equations for the dynamic and static power consumption as well
as the circuit delay (corresponding to the maximal frequency) have been obtained
starting from simple and well known expressions of the on and off currents of a CMOS
transistor. These equations are the foundation for the theory presented in this thesis.
The use of very simplified equations, as well as the exclusion of secondary effects
like gate leakage, is deliberate. This is necessary in order to be able to work with
analytical expressions or simple closed form approximations, which make it possible
to understand the influence of each single parameter on the lowest achievable total
power consumption.
Chapter 4
Technology characterization
The equations in Chapter 3 depend on a certain number of technology parameters
that must be characterized for a given technology before the equations can be
exploited. To be sure that they really match the models used in this work, every
parameter has been estimated by fitting SPICE simulation curves to our models
with the program Graphical Analysis v3.2. The obtained values can differ from
the original SPICE parameters, because the models used are different. Indeed, our
models (explained in previous chapters) are much simpler than the BSIM3.3 ones
used by the provided SPICE libraries. In this thesis, the technology of
ST Microelectronics with a minimal size of 90nm has been chosen as reference. The
advantage of this technology is that it is available for 3 different transistor types,
corresponding to 3 different threshold voltages.
4.1 Parameters extraction methodology
The technology parameters required in this work are:
• n : the sub-threshold slope;
• η : the DIBL effect coefficient;
• α : the alpha power law coefficient;
• V th0 : the reference transistor threshold;
• γ : the body effect coefficient.
Each one of these parameters will be discussed in detail in the following sections.
4.1.1 The sub-threshold slope n
The sub-threshold slope n is extracted from the simulation of Ids(V gs). The schematic
used to measure the Ids current is reported in Fig. 4.1.
Figure 4.1: Schematic of the NMOS (left) and PMOS (right) transistors used for
the extraction of the sub-threshold slope n. Transistor sizes are: Wnmos = 0.51µm,
Wpmos = 0.88µm and Lnmos = Lpmos = 0.1µm
The equation of the drain current in weak inversion is given by:
\[ I_{ds}(V_{gs}) = I_o e^{\frac{V_{gs} - V_{th}}{nU_t}} \qquad (4.1) \]
Consequently, by considering the natural logarithm of the previous equation, the
simulated curve should match the corresponding linear function:
\[ \ln(I_{ds}(V_{gs})) = \frac{1}{nU_t} \cdot V_{gs} + \left[ \ln(I_o) - \frac{V_{th}}{nU_t} \right] \equiv m \cdot V_{gs} + b \qquad (4.2) \]
Through a linear fitting, it is possible to extract the slope m of Eq. (4.2) to obtain
1/nUt. Knowing the temperature used during the simulation, Ut (≡ kbT/q) is also
known (kb = 1.38E-23 J/K, q = 1.6E-19 C).
As the values of n for the NMOS and the PMOS transistors can be different, the
retained value will be their average.
The sizes of both the NMOS and the PMOS used in the SPICE simulations are the same
as the corresponding ones in an inverter cell with a driving force of one.
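The extraction procedure of this section can be sketched in a few lines of Python. The fit in this work was done with a graphical analysis program; the least-squares routine below is only a stand-in for it, and the data are synthetic points generated from Eq. (4.1) with hypothetical parameters, so that the recovered n can be compared with the value used to generate them.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = m*x + b; returns (m, b)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def slope_to_n(m, u_t):
    """Eq. (4.2): the slope is m = 1/(n*Ut), hence n = 1/(m*Ut)."""
    return 1.0 / (m * u_t)

# Synthetic weak-inversion data from Eq. (4.1); all values are hypothetical
u_t, n_true, i_o, v_th = 0.02588, 1.7, 1e-6, 0.3
v_gs = [0.02 * k for k in range(11)]  # 0 V to 0.2 V
ln_ids = [math.log(i_o * math.exp((v - v_th) / (n_true * u_t))) for v in v_gs]

m, b = linear_fit(v_gs, ln_ids)
n_est = slope_to_n(m, u_t)
print(f"slope m = {m:.3f} 1/V, extracted n = {n_est:.4f}")

# Averaging of the NMOS and PMOS values, as described in the text
n_nmos, n_pmos = slope_to_n(23.052, u_t), slope_to_n(22.386, u_t)
print(f"average n = {(n_nmos + n_pmos) / 2:.3f}")
```

The two slopes used in the last lines are those reported for the lvt transistors in Fig. 4.2; they are included only to show the averaging step.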
4.1.2 The DIBL effect factor η
The extraction of the DIBL effect factor η is very similar to how n is obtained. The
difference comes from the swept variable during simulation, which is now V dd, while
V gs is set to 0V , thus resulting in an off transistor. The corresponding equations are:
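\[ I_{off}(V_{dd}) = I_o e^{-\frac{V_{th0} - \eta V_{dd}}{nU_t}} \qquad (4.3) \]
\[ \ln(I_{off}(V_{dd})) = \frac{\eta}{nU_t} \cdot V_{dd} + \left[ \ln(I_o) - \frac{V_{th0}}{nU_t} \right] \equiv m \cdot V_{dd} + b \qquad (4.4) \]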
Once the slope η/nUt has been extracted, η is easily obtained, since 1/nUt was
estimated in the previous section 4.1.1.
The static current Ioff is measured as the supply current of a closed chain composed
of an even number of inverters (10 in our case). In such a configuration, the
circuit is in a stable condition and no node transitions occur. All inverters have a
driving force of one.
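The same linear-fit approach gives η from Eq. (4.4). The Python sketch below is again only illustrative: the Ioff(Vdd) points are synthetic, generated from Eq. (4.3) with hypothetical parameters, and the least-squares routine stands in for the fitting program actually used.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = m*x + b; returns (m, b)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

# Synthetic off-current data from Eq. (4.3); all values are hypothetical
u_t, n, i_o, v_th0, eta_true = 0.02588, 1.7, 1e-6, 0.35, 0.1
v_dd = [0.5 + 0.05 * k for k in range(11)]  # 0.5 V to 1.0 V
ln_ioff = [math.log(i_o) - (v_th0 - eta_true * v) / (n * u_t) for v in v_dd]

m, b = linear_fit(v_dd, ln_ioff)
eta_est = m * n * u_t  # Eq. (4.4): slope m = eta/(n*Ut)
print(f"slope m = {m:.3f} 1/V, extracted eta = {eta_est:.4f}")
```

The value of n·Ut used in the last step is the one obtained from the sub-threshold slope extraction, which is why n has to be extracted first.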
4.1.3 The α factor and the reference threshold voltage V th0
The parameter α (discussed in Chapter 3) and the reference threshold voltage Vth0
can both be estimated by fitting the delay equation (from Eq. (3.9)):
\[ \mathrm{Delay}(V_{dd}) \propto \frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}} = \frac{V_{dd}}{(V_{dd}(1 + \eta) - V_{th0})^{\alpha}} \qquad (4.5) \]
As η is a known parameter, a non-linear curve fitting on a circuit delay plotted
as a function of Vdd makes it possible to determine the values of α and Vth0. Because both
parameters refer to the circuit delay (and this is the way the parameters will
be used later), their values can be quite different from the single NMOS or PMOS
ones defined by the manufacturer.
The delays are obtained by measuring the oscillation frequency of a ring oscillator
formed by 9 inverters with a driving force of one.
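One simple way to perform such a non-linear fit is a grid search over (α, Vth0), with the proportionality constant of Eq. (4.5) obtained by linear least squares at each grid point. The Python sketch below illustrates this idea on synthetic delay data; it is not the fitting procedure used in this work, and the grid bounds, step sizes and parameter values are assumptions made for the example.

```python
def delay_shape(v_dd, eta, v_th0, alpha):
    """Right-hand side of Eq. (4.5), up to a proportionality constant."""
    return v_dd / (v_dd * (1.0 + eta) - v_th0) ** alpha

def fit_alpha_vth0(v_dds, delays, eta):
    """Grid search for (alpha, Vth0); returns (alpha, Vth0, scale factor)."""
    best = None
    for i in range(21):          # alpha from 1.00 to 2.00 in steps of 0.05
        alpha = 1.0 + 0.05 * i
        for j in range(41):      # Vth0 from 0.10 V to 0.50 V in steps of 0.01 V
            v_th0 = 0.10 + 0.01 * j
            shape = [delay_shape(v, eta, v_th0, alpha) for v in v_dds]
            # Best scale factor for this (alpha, Vth0), by linear least squares
            scale = sum(d * g for d, g in zip(delays, shape)) / sum(g * g for g in shape)
            sse = sum((d - scale * g) ** 2 for d, g in zip(delays, shape))
            if best is None or sse < best[0]:
                best = (sse, alpha, v_th0, scale)
    return best[1], best[2], best[3]

# Synthetic delay data generated from Eq. (4.5); all values are hypothetical
eta, alpha_true, v_th0_true, scale_true = 0.1, 1.5, 0.30, 1e-11
v_dds = [0.6 + 0.1 * k for k in range(7)]  # 0.6 V to 1.2 V
delays = [scale_true * delay_shape(v, eta, v_th0_true, alpha_true) for v in v_dds]

alpha_est, v_th0_est, scale_est = fit_alpha_vth0(v_dds, delays, eta)
print(f"alpha = {alpha_est:.2f}, Vth0 = {v_th0_est:.2f} V")
```

The lowest supply voltage in the data must stay above the largest Vth0 on the grid divided by (1 + η), otherwise the base of the power in Eq. (4.5) becomes negative; the ranges above satisfy this condition.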
4.1.4 The body effect coefficient γ
The body effect coefficient γ models the first order influence of the body potential on
the reference threshold voltage Vth0:
\[ V_{th}(V_{bs}) = V_{th0} - \gamma V_{bs} \qquad (4.6) \]
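The extraction methodology for this parameter is the same as for the DIBL effect
coefficient η, but the measured quantity is Ioff(Vbs):
\[ I_{off}(V_{bs}) = I_o e^{-\frac{V_{th0} - \eta V_{dd} - \gamma V_{bs}}{nU_t}} \qquad (4.7) \]
\[ \ln(I_{off}(V_{bs})) = \frac{\gamma}{nU_t} \cdot V_{bs} + \left[ \ln(I_o) - \frac{V_{th0} - \eta V_{dd}}{nU_t} \right] \equiv m \cdot V_{bs} + b \qquad (4.8) \]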
A simple linear curve fitting on ln(Ioff(Vbs)) is enough to determine m = γ/nUt.
Multiplying this value by nUt then gives γ.
Here too, the static current Ioff is obtained by simulating a looped chain composed
of an even number (10) of inverters with a driving force of one.
It is important to note that the body bias potential must be kept below 0.5 V in
the forward bias condition (Vbs > 0). Otherwise, the p-n junction between the body
and the source will start to conduct as a forward-biased diode, creating an extremely
large leakage current.
4.1.5 Remark on Io
The parameter Io representing the reference static current is also a technology related
parameter, but its value cannot be extracted and used in a universal way as is
done for the other technology parameters. In fact, in this work, Io is considered as
the reference static current per cell. This means that its specific value depends
on the cells used (as discussed in Section 3.2.2) and cannot simply be represented
by a unique value. Unless stated otherwise, the average Io of a circuit is
estimated from the cell nominal values of the static power in the following way:
\[ I_o = \frac{\text{Total Nominal Static Power}}{V_{dd,nom} \cdot N} \, e^{\frac{V_{th}}{nU_t}} \qquad (4.9) \]
With Vdd,nom the nominal supply voltage and N the number of cells.
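Eq. (4.9) is simply Eq. (3.5) solved for Io, which the following Python sketch makes explicit with a round trip. The circuit values are hypothetical and used only to check that the two expressions are consistent.

```python
import math

def static_power(n_cells, v_dd, i_o, v_th, n, u_t):
    """Eq. (3.5): total static power of a circuit of n_cells cells."""
    return n_cells * v_dd * i_o * math.exp(-v_th / (n * u_t))

def average_io(total_static_power, v_dd_nom, n_cells, v_th, n, u_t):
    """Eq. (4.9): average reference current per cell from the nominal static power."""
    return total_static_power / (v_dd_nom * n_cells) * math.exp(v_th / (n * u_t))

# Hypothetical circuit, for illustration only
n_cells, v_dd_nom, v_th, n, u_t = 5000, 1.0, 0.3, 1.7, 0.02588
i_o_assumed = 2e-6

power = static_power(n_cells, v_dd_nom, i_o_assumed, v_th, n, u_t)
i_o_recovered = average_io(power, v_dd_nom, n_cells, v_th, n, u_t)
print(f"static power = {power:.3e} W, recovered Io = {i_o_recovered:.3e} A")
```

In practice the total nominal static power would be the sum of the per-cell static power values given by the cell library at nominal conditions, with Vth evaluated at the same conditions.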
4.2 STM 90nm technology
The STM 90nm is the most recent technology available at our laboratory and it
presents the following main features:
• Designed for 1.0V ± 10% applications, with 1.8V/2.5V/3.3V IO’s
• Shallow trench isolation, isolated P-Well (DNW) twin-tub, single poly CMOS
process using a type <100> P-substrate
• 16 Å gate oxide
• Cobalt silicide on junctions, polysilicon gates, lines, resistors on active and in-
terconnect poly (N+ or P+)
• Dual Vth transistors
• IOs using 2.8nm or 5.0nm or 6.5nm gate oxide for 1.8V or 2.5V or 3.3V respec-
tively
• 6 to 9 metal levels
• Damascene Copper for all metals
• Thick metal layer for power, clock, busses and major interconnect signal distri-
bution, as well as for inductors in Analog/RF applications
• Tight pitch levels for routing on thin copper for lower metal layers
• Low K (< 3.0) inter-metal dielectric for thin metal layers
To extract the required parameters for each one of the 3 transistor flavors, the
program ELDO version 6.1 1.1 from Mentor Graphics (SPICE-like simulator) has
been used.
4.2.1 Low Vth Transistors (lvt)
The “Low Vth” transistor type is the fastest available flavor in the STM 90nm general
purpose technology, and is used for applications where the speed is of primary impor-
tance. The disadvantage of this type of transistors is that, due to the low threshold
voltage (Vth), the static power is very high.
The sub-threshold slope n
It is important to note that the linear fitting on Eq. (4.2) must be estimated on a
region where the transistor is in the weak inversion mode (i.e. Vdd < Vth). Otherwise
Eq. (4.2) is no longer valid and the alpha power law should be used instead to describe
the transistor current. In our case the fitting range applies to Vdd ∈ [0 V; 0.2 V].
Moreover, the temperature was set to 27 °C, corresponding to Ut = 0.02588 V.
The following table summarizes the parameter extraction:
[Figure 4.2: plot of ln(Ids) versus Vgs [V] at VDD = 1 V with linear fits y = mX + b.
PMOS: m = 22.386, b = -17.527, correlation 0.99882.
NMOS: m = 23.052, b = -17.648, correlation 0.99936.]
Figure 4.2: Linear fitting of ln(Ids(Vgs)) for STM 90nm lvt
        m = 1/(nUt) [1/V]   unified 1/(nUt) [1/V]   Ut [V]    n      unified n
NMOS    23.05               22.72                   0.02588   1.68   1.70
PMOS    22.39               22.72                   0.02588   1.73   1.70
Table 4.1: Results of the sub-threshold slope extraction for STM 90nm lvt
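The conversion from the fitted slopes to n in Table 4.1 can be checked with a short Python sketch. The function name `slope_to_n` is hypothetical, and taking the unified slope as the plain average of the NMOS and PMOS slopes is an assumption that happens to reproduce the tabulated 22.72.

```python
UT = 0.02588  # thermal potential at 27 degC [V], as stated in the text

def slope_to_n(m, ut=UT):
    """Sub-threshold slope factor n from a fitted slope m = 1/(n*Ut)."""
    return 1.0 / (m * ut)

m_nmos, m_pmos = 23.05, 22.39
m_unified = (m_nmos + m_pmos) / 2   # assumption: unified value = plain average

for name, m in (("NMOS", m_nmos), ("PMOS", m_pmos), ("unified", m_unified)):
    print(f"{name}: m = {m:.2f} 1/V -> n = {slope_to_n(m):.2f}")
```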
The DIBL effect factor η
The DIBL effect factor η is extracted from the curve ln(Ioff (V dd)). The static power
is measured on a chain of 10 inverters, all with a driving force of one.
The results of the curve fitting in Fig. 4.3 are summarized in Table 4.2.
m = η/(nUt)   unified 1/(nUt) (from Table 4.1)   η
1.98          22.72                              0.087
Table 4.2: Results of the DIBL effect coefficient extraction for STM 90nm lvt
The α factor and the reference threshold voltage V th0
The extraction of the α factor and of the reference threshold voltage Vth0 is done
jointly by fitting the non-linear equation (4.5) with a known value for η. The
delays are obtained by measuring the oscillation frequency of a ring oscillator formed
by 9 inverters with a driving force of one.
[Figure 4.3: plot of ln(Ioff) for one inverter versus Vdd [V] (0.3 to 0.9) with linear fit y = mX + b: m = 1.9827, b = -19.608, correlation 0.99966.]
Figure 4.3: Linear fitting of ln(Ioff(Vdd)) for 1 inverter (averaged over 10 inverters)
[Figure 4.4: plot of delay versus Vdd [V] (0.4 to 1.0) with fit y = K·X/(X·1.087 − VTH)^alpha: K = 0.16933 ± 0.00089123, VTH = 0.34208 ± 0.0050786, alpha = 1.5670 ± 0.024541, RMSE = 0.0030049.]
Figure 4.4: Fitting of delay vs. Vdd for STM 90nm lvt
Results, based on the fitting in figure 4.4, are presented in Table 4.3.
α      Vth0 [V]
1.56   0.342
Table 4.3: Results for the α factor and Vth0 for STM 90nm lvt
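As an illustration, the fitted curve of Fig. 4.4 can be evaluated with the Python sketch below. The function name `delay_model` is hypothetical; the constants are the fit results reported with the figure, the factor 1.087 is read as 1 + η, and the delay unit is whatever unit the figure uses, which is not stated here.

```python
def delay_model(vdd, k=0.16933, vth0=0.34208, alpha=1.5670, eta=0.087):
    """Fitted delay curve of Fig. 4.4: K*Vdd / ((1 + eta)*Vdd - Vth0)**alpha."""
    overdrive = (1.0 + eta) * vdd - vth0
    if overdrive <= 0:
        raise ValueError("model only valid for (1 + eta)*Vdd > Vth0")
    return k * vdd / overdrive ** alpha

for vdd in (0.6, 0.8, 1.0):
    print(f"Vdd = {vdd:.1f} V -> delay = {delay_model(vdd):.3f} (unit of Fig. 4.4)")
```

The sketch only evaluates the model; the fit itself requires a non-linear least-squares routine, which is not reproduced here.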
The body effect coefficient γ
The extraction of the body effect factor is achieved with a linear fitting on the curve
ln(Ioff (V bs)). The static current is measured over 10 inverters connected in chain
and the result has been divided by 10 to average the static current to one inverter.
[Figure 4.5: plot of ln(Ioff) for one inverter versus Vbs [V] (-0.4 to 0.4) with linear fit y = mX + b: m = 2.7199 ± 0.046844, RMSE = 0.047310.]
Figure 4.5: Linear fitting of ln(Ioff(Vbs)) for 1 inverter (averaged over 10 inverters)
It should be noted that γ is only a first-order approximation of the body bias
effect because, as shown in Fig. 4.5, the curve looks more like a square root function
than a linear one.
Results are summarized by Table 4.4.
m = γ/(nUt)   unified 1/(nUt) (from Table 4.1)   γ
2.72          22.72                              0.12
Table 4.4: Results for the body effect coefficient for STM 90nm lvt
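Both η (Table 4.2) and γ (Table 4.4) follow from a fitted slope divided by the unified 1/(nUt). A minimal Python sketch of this conversion, with a hypothetical helper name, is:

```python
def slope_to_coefficient(m, inv_n_ut):
    """Recover a coefficient c from a fitted slope m = c/(n*Ut)."""
    return m / inv_n_ut

INV_N_UT_LVT = 22.72                                 # unified 1/(n*Ut) from Table 4.1 [1/V]
eta = slope_to_coefficient(1.98, INV_N_UT_LVT)       # DIBL coefficient
gamma = slope_to_coefficient(2.72, INV_N_UT_LVT)     # body effect coefficient
print(f"eta = {eta:.3f}, gamma = {gamma:.2f}")
```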
Io for a 2-input NAND gate
Even if it is not possible to give here a unique Io value for the technology, the value
of Io for a 2-input NAND gate with a driving force of 2 is given as a reference.
Io [µA] 30.9
Table 4.5: Io for a NAND2x2 gate from the STM 90nm lvt technology
Summary of the lvt technology parameters
All the technology parameters for the lvt flavor are summarized by Table 4.6.
Vth0 [V]   α      n      η       γ      Io(NAND2x2) [µA]
0.342      1.56   1.70   0.087   0.12   30.9
Table 4.6: Technology parameters summary for the STM 90nm lvt
4.2.2 Standard Vth Transistors (svt)
The “Standard Vth” transistor type is an all-purpose flavor where delay and static
power have been traded off to match typical design requirements. The procedure used
to characterize this technology variation is exactly the same as the one used for lvt.
For the sake of simplicity, only the summary table is reported.
Vth0 [V]   α      1/(nUt) [1/V]   n      η       γ      Io(NAND2x2) [µA]
0.353      1.65   26.30           1.47   0.060   0.14   26.0
Table 4.7: Technology parameters summary for the STM 90nm svt
4.2.3 High Vth Transistors (hvt)
The “High Vth” transistor type is a flavor especially optimized for extremely low static
power consumption. Typical applications for this technology variation are circuits that
are idle most of the time and/or where speed/performance is not of utmost importance. The
procedure used to characterize this technology variation is exactly the same as the
one used for lvt. For the sake of simplicity, only the summary table is reported.
Vth0 [V]   α      1/(nUt) [1/V]   n      η       γ      Io(NAND2x2) [µA]
0.425      1.84   26.16           1.48   0.062   0.19   17.7
Table 4.8: Technology parameters summary for the STM 90nm hvt
4.3 Summary
In this chapter, the methodology used to extract the technology parameters has
been presented. After an introduction of the general procedure, the parameters
V th0, α, n, η, γ and Io(NAND2x2) have been evaluated for all 3 transistor flavors
available in the STM 90nm general purpose technology. In order to have an easy
access to the extracted data, values are summarized in Table 4.9.
      Vth0 [V]   α      1/(nUt) [1/V]   n      η       γ      Io(NAND2x2) [µA]
lvt   0.342      1.56   22.72           1.70   0.087   0.12   30.9
svt   0.353      1.65   26.30           1.47   0.060   0.14   26.0
hvt   0.425      1.84   26.16           1.48   0.062   0.19   17.7
Table 4.9: Technology parameters summary for the STM 90nm - Vdd = 1V
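To illustrate how the values of Table 4.9 combine, the following Python sketch evaluates Eq. (4.7) at Vbs = 0 for the three flavors. The dictionary layout and the function name `ioff` are assumptions made for this example, and the resulting currents are model estimates for the NAND2x2 reference cell only, not measured values.

```python
import math

# Table 4.9 values: (Vth0 [V], 1/(n*Ut) [1/V], eta, Io(NAND2x2) [A])
FLAVORS = {
    "lvt": (0.342, 22.72, 0.087, 30.9e-6),
    "svt": (0.353, 26.30, 0.060, 26.0e-6),
    "hvt": (0.425, 26.16, 0.062, 17.7e-6),
}

def ioff(flavor, vdd=1.0):
    """Off current from Eq. (4.7) with Vbs = 0: Io*exp(-(Vth0 - eta*Vdd)/(n*Ut))."""
    vth0, inv_n_ut, eta, i_o = FLAVORS[flavor]
    return i_o * math.exp(-(vth0 - eta * vdd) * inv_n_ut)

for name in FLAVORS:
    print(f"{name}: Ioff ~ {ioff(name) * 1e9:.1f} nA at Vdd = 1 V")
```

The ordering lvt > svt > hvt reflects the speed versus leakage trade-off described for the three flavors.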
Chapter 5
Reference multiplier architectures
This chapter presents a set of 13 reference multipliers widely used in this thesis.
Multipliers were chosen as reference circuits because many possible
implementations exist, each one with very different characteristics. The architectures
proposed in this chapter can be divided into 3 families, each one containing several
variations of the basic implementation.
The 3 families are:
1. Ripple Carry Array (RCA): This structure is based on a regular matrix of
full adders; considered versions are:
• basic;
• 2 and 4 times parallel;
• 2 and 4 times horizontal pipeline;
• 2 and 4 times diagonal pipeline.
2. Wallace: This type of multiplier is based on a tree of full adders used as 3-to-2
compressors; considered versions are:
• basic;
• 2 and 4 times parallel.
3. Sequential: Here the multiplication is obtained by a sequential add and shift
implementation; considered versions are:
• basic;
• sequential-wallace;
• 2 times parallel.
5.1 Ripple Carry Array
The Ripple Carry Array multiplier (or RCA) is the most intuitive implementation
for a multiplier. Its structure derives from the way we usually do multiplications by
hand. That is, a sum of shifted partial products. A partial product (Pi) is the result
of the multiplication of the multiplicand (A, first number to multiply) with one bit
of the multiplier (B, second number to multiply). Practically, the multiplication by
a bit is obtained by AND gates. The number of partial products will be equal to the
size of B in bits. Mathematically, this can be written as (2^i represents the bit shift):
$$M = A \cdot B = \sum_{i=0}^{size(B)-1} P_i \cdot 2^i = \sum_{i=0}^{size(B)-1} (A \text{ and } B_i) \cdot 2^i \tag{5.1}$$
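Eq. (5.1) can be illustrated with a short behavioral Python sketch for unsigned operands. The function name is hypothetical and the code models only the arithmetic, not the gate-level structure.

```python
def multiply_by_partial_products(a, b, size_b):
    """Sum of shifted partial products, Eq. (5.1); a and b are unsigned integers."""
    result = 0
    for i in range(size_b):
        b_i = (b >> i) & 1          # bit i of the multiplier B
        partial = a if b_i else 0   # "A and Bi": every bit of A ANDed with Bi
        result += partial << i      # multiplication by 2^i is a left shift by i
    return result

print(multiply_by_partial_products(13, 11, size_b=4))
```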
In a physical implementation, the summation shown in Eq. (5.1) is obtained by
a series of full adders (FA), i.e. 1 bit adders defined as:
S = a xor b xor cin
Cout = (a and b) or (a and cin) or (b and cin)
A graphical representation of a FA is provided in Fig. 5.1.
[Figure 5.1: FA block with inputs a, b and cin and outputs S and cout.]
Figure 5.1: Full adder symbol
By implementing Eq. (5.1) directly, a multiplier known as the Ripple
Carry Array multiplier (or RCA) can be constructed. Fig. 5.2 represents such
an implementation for N=8.
The first line of full adders in a RCA doesn’t have to sum the partial products
with the result of the preceding line because no preceding result exists. Hence, only
partial products (AND gates) are generated by the synthesis tools. Moreover, the
rightmost cell of each line has a fixed carry in of zero. Those cells can be simplified
to a Half Adder (HA), i.e. an adder without carry in. The logical expressions of a
HA are:
S = a xor b
Cout = a and b
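The FA and HA expressions above can be checked exhaustively with a few lines of Python. This is a behavioral sketch with hypothetical function names; it verifies that each cell adds its input bits and that a FA with cin = 0 behaves as a HA.

```python
from itertools import product

def full_adder(a, b, cin):
    """1 bit full adder: returns (S, Cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def half_adder(a, b):
    """1 bit adder without carry in: returns (S, Cout)."""
    return a ^ b, a & b

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert a + b + cin == s + 2 * cout        # the cell really adds three bits
    if cin == 0:
        assert half_adder(a, b) == (s, cout)  # FA with cin = 0 reduces to a HA
```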
[Figure 5.2: array of FA reference cells (inputs a, b, cin; outputs S, cout) fed by A0-A7 and B0-B7, with zero inputs on the array edges, producing the result bits M0-M15.]
Figure 5.2: 8bit RCA multiplier
The FA has two characteristic delays. The first is the time that a signal needs to
propagate from the inputs (a, b and cin) to the sum port (S). The second is the
propagation delay for a signal going from the inputs (a, b and cin) to the carry
out port (Cout). Fig. 5.3 shows one of the possible critical paths that exist in such a
multiplier.
[Figure 5.3: same array as Fig. 5.2, with the critical path highlighted and its segments marked as carry delay and sum delay.]
Figure 5.3: Critical path in a 8bit RCA multiplier
It is not surprising that, in Fig. 5.3, the critical path doesn’t include the first line of
full adders; indeed, it corresponds to simple AND gates for the generation of partial
products (because a and cin are zero), and they are executed in parallel with the
partial products of the second line (corresponding to the bit B1).
The total delay for a RCA is given by:
$$t(\text{Basic RCA}) = (2 \cdot N - 2) \cdot t_{cout} + (N - 2) \cdot t_{sum} \tag{5.2}$$
With N the size of the multiplier, tcout the carry out delay and tsum the sum delay.
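Eq. (5.2) is easy to evaluate numerically. In the Python sketch below the function name and the cell delays are hypothetical values chosen only for illustration.

```python
def rca_delay(n, t_cout, t_sum):
    """Critical path delay of the basic RCA multiplier, Eq. (5.2)."""
    return (2 * n - 2) * t_cout + (n - 2) * t_sum

# Hypothetical cell delays in ns (not taken from any library).
T_COUT, T_SUM = 0.05, 0.08
for n in (8, 16, 32):
    print(f"N = {n:2d}: t = {rca_delay(n, T_COUT, T_SUM):.2f} ns")
```

For N = 8 the path crosses 14 carry stages and 6 sum stages, which matches the critical path sketched in Fig. 5.3.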
The structure presented in Fig. 5.2 and Fig. 5.3 is what we will call the “basic
RCA” implementation. Other RCA implementations are explained hereafter.
5.1.1 RCA parallel variations
The first transformation of the RCA multiplier is the parallelization: the RCA
multiplier is instantiated twice (or more in general) and the data is multiplexed to a
different multiplier at each clock period. The advantage of this architecture is that
each multiplier has two clock periods (or, in general, as many as the number of
instantiated blocks) to complete the computation. So, the throughput is the same as for
the non parallelized version, but the latency is larger (it grows with the number
of blocks). Fig. 5.4 shows the structure of a 2 times parallelized multiplier.
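[Figure 5.4: two multiplier blocks (MULTIPLIER_0 and MULTIPLIER_1), each with enabled input registers on A and B, sharing the inputs A and B and the clock clk; the signal sel drives the register enables and an output multiplexer followed by a register producing OUT.]
Figure 5.4: 2 times parallelized multiplier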
The sel signal is used to select which multiplier will calculate the multiplication
for the incoming data and it typically switches every clock cycle. The use of the input
registers is required in order to latch the data at the input of the multipliers. In
fact, each multiplier has now more than 1 clock cycle (corresponding to the degree
of parallelization) to compute one multiplication, and the incoming data need to be
stable over those clock cycles. Considering the throughput frequency as the reference
clock, the effective logical depth, defined as the real logical depth divided by the
number of clock cycles the signals have for propagating through it, is now reduced by
the number of parallelizations.
The major drawback of the parallelization process is that the hardware is more
than doubled (or more than N times larger for an N times parallel implementation). This
means that the static power is also more than doubled, while the dynamic power is only
slightly increased due to the added registers and multiplexer.
5.1.2 RCA horizontal pipeline variations
The goal of pipelining is to reduce the critical path (logical depth) by inserting
register banks in the design. This can be done in several ways with considerably different
results. The most intuitive and easiest way to realize it is to “cut” the RCA
horizontally in the middle of the structure. This can be imagined as two NxN/2 multipliers
separated by a register bank, as shown in Fig. 5.5.
[Figure 5.5: same array as Fig. 5.2, cut horizontally in the middle by a bank of registers; the critical path segments are marked as carry delay and sum delay.]
Figure 5.5: 2 stages horizontally pipelined 8 bit RCA
The number of registers needed to divide the multiplier in this way is easily
obtained from Fig. 5.5. Actually, all bits of A (N registers) plus all the result bits of
the previous stage (N+N/2 registers) must be latched. Moreover, in order to main-
tain data synchronization, the most significant bits of B must be latched too (N/2
registers). Hence, the total overhead corresponds to 3N registers.
The critical path after such an architectural transformation is:
$$t(\text{Horizontal Pipeline}) = \left(\tfrac{3}{2} N - 1\right) \cdot t_{cout} + \left(\tfrac{1}{2} N - 1\right) \cdot t_{sum} + t_{dff} \tag{5.3}$$
With N the size of the multiplier, tcout the carry out delay, tsum the sum delay and
tdff the register delay.
The “vertical delay” (corresponding to tsum) is effectively halved, but
the “horizontal delay” (related to tcout) is only reduced by a factor of about 4/3.
Additionally, the “clk to Q” delay of a register must be added. Hence, the global delay
reduction compared to the non pipelined version is far from the expected (or hoped) factor of 2.
A similar calculation can be done for a 4 stages pipeline; in this case the critical
path delay is (5/4·N − 1)·tcout + (1/4·N − 1)·tsum + tdff + tdff setup.
It is important to remark that pipelining remains interesting only for a small
number of stages (2 - 4); in fact, the number of registers needed grows rapidly with
the number of stages and the overhead quickly becomes non-negligible. In the case of a
RCA multiplier with width N and S stages of pipeline, we have a register overhead
of 3*N*(S-1). Just as an example, a 32 bit / 4 stages horizontal pipeline multiplier
needs 288 extra flip-flops!
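The register overhead and the 2 stages delay of Eq. (5.3) can be sketched in Python as follows. The function names are hypothetical, and the unit cell delays used for the comparison are an arbitrary assumption (tcout = tsum = 1, tdff = 0).

```python
def horizontal_pipeline_registers(n, stages):
    """Register overhead 3*N*(S-1) of an S stages horizontally pipelined RCA."""
    return 3 * n * (stages - 1)

def horizontal_pipeline2_delay(n, t_cout, t_sum, t_dff):
    """Critical path of the 2 stages horizontal pipeline, Eq. (5.3)."""
    return (1.5 * n - 1) * t_cout + (0.5 * n - 1) * t_sum + t_dff

def basic_rca_delay(n, t_cout, t_sum):
    """Critical path of the basic RCA, Eq. (5.2)."""
    return (2 * n - 2) * t_cout + (n - 2) * t_sum

print(horizontal_pipeline_registers(32, 4))   # example quoted in the text
# Assumed unit delays, ignoring the register delay:
speedup = basic_rca_delay(8, 1.0, 1.0) / horizontal_pipeline2_delay(8, 1.0, 1.0, 0.0)
print(f"speed-up for N = 8: {speedup:.2f}")
```

Even with the register delay ignored, the speed-up obtained under these assumptions stays well below 2.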
5.1.3 RCA diagonal pipeline variations
From a delay point of view, a better way to pipeline an RCA multiplier is to divide it
diagonally. This approach is less easy to code in a high level language than the
horizontal split. In fact, the split parts can no longer be considered as multipliers of
reduced size. An example of how to diagonally pipeline a RCA multiplier is illustrated
in Fig. 5.6.
The critical path for a 2 stages diagonal pipeline is obtained by:
$$t(\text{Diagonal Pipeline}) = \tfrac{3}{4} N \cdot t_{cout} + \left(\tfrac{3}{4} N - 1\right) \cdot t_{sum} + t_{dff} \tag{5.4}$$
With the diagonal pipeline implementation, the register overhead is slightly greater
than with the horizontal pipeline. In fact, for a two stages pipeline, we can count N latches for
the A bits, 3/4N latches for the B bits, 5/4N registers for the internal sum propagation
and 1/2N registers for the carry propagation. All these contributions account for 3.5N
registers. This value can be compared to the one for the horizontal pipeline case, where
3N registers were needed.
[Figure 5.6: same array as Fig. 5.2, cut diagonally by a bank of registers; the critical path segments are marked as carry delay and sum delay.]
Figure 5.6: 2 stages diagonally pipelined 8bit RCA
In a 4 stages diagonal pipeline version, the register overhead for each of the two
newly added banks is: 3/4N registers for the A bits, 3/8N registers for the B bits,
13/8N registers for internal sum propagation and 3/8N for the carry propagation.
The total number of registers per added bank is hence 25/8N. Summing all extra registers,
the total overhead for a 4 stages diagonal pipeline is 3.5N + 2·(25/8N) = 39/4N and
the corresponding delay is (3/4N − 1)·tsum + tdff + tdff setup. Just to compare
with the horizontal pipeline version, a 32bit / 4 stages diagonal pipeline multiplier
has 312 extra registers.
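The register bookkeeping for the diagonal pipeline can be reproduced with exact fractions in Python. The constant and function names are hypothetical; the per-bank contributions are the ones listed in the text, expressed in units of N.

```python
from fractions import Fraction as F

# Contributions in units of N: A bits, B bits, internal sums, carries.
TWO_STAGE_BANK = [F(1), F(3, 4), F(5, 4), F(1, 2)]
EXTRA_BANK = [F(3, 4), F(3, 8), F(13, 8), F(3, 8)]   # each of the 2 added banks

def diagonal_pipeline_registers(n, stages):
    """Register overhead of a 2 or 4 stages diagonally pipelined RCA."""
    if stages == 2:
        coeff = sum(TWO_STAGE_BANK)
    elif stages == 4:
        coeff = sum(TWO_STAGE_BANK) + 2 * sum(EXTRA_BANK)
    else:
        raise ValueError("only 2 and 4 stages are described in the text")
    return coeff * n

print(diagonal_pipeline_registers(32, 4))   # 39/4 * 32
```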
5.2 Wallace
The Wallace multiplier [41] [42] [43] is a very fast and well balanced architecture.
To achieve this efficiency, the partial products (i.e. A · Bi, called P0-P7 in Fig. 5.8)
are summed in parallel by using Carry Save Adders (CSA) [44]. A CSA (Fig. 5.7) is
nothing else than a row of full adders arranged as 3-to-2 compressors. In a CSA,
no carry propagates between the full adders; consequently the total delay
corresponds to the worst case delay of one FA. The main drawback of a CSA is that
it doesn’t return a unique sum but two vectors (S and C) such that S plus the shifted C
equals the sum of the three input vectors (x+y+z = S+2C).
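The 3-to-2 compression can be modelled on whole integers with bitwise operators. The Python sketch below is a behavioral illustration with a hypothetical function name; each bit position acts as an independent full adder.

```python
def carry_save_add(x, y, z):
    """Bitwise 3-to-2 compression: returns (S, C) with x + y + z == S + 2*C."""
    s = x ^ y ^ z                        # sum output of every full adder
    c = (x & y) | (x & z) | (y & z)      # carry output of every full adder
    return s, c

s, c = carry_save_add(0b1011, 0b0110, 0b1101)
print(bin(s), bin(c), s + 2 * c)
```

No carry ripples from one bit position to the next inside `carry_save_add`; the ripple is deferred to the final adder that computes S + 2C.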
[Figure 5.7: CSA block with input vectors x, y, z and output vectors C, S, built from n independent FA cells; cell i receives xi, yi, zi and produces Si and Ci.]
Figure 5.7: Internal implementation of a Carry Save Adder (CSA)
The structure of a Wallace multiplier is shown in Fig. 5.8 for an 8 bit version. The
partial products P0-P7 are added 3 by 3 with CSAs until only two bit vectors remain
(Sum and Carry). At this point, a fast final adder sums them to obtain the result
of the multiplication. The kind of final adder can vary from one implementation to
another. In the Wallace tree implementations presented in this thesis, a Brent-Kung [45]
adder is used. The advantage of the Brent-Kung (bk) implementation is that it is
very fast.
[Figure 5.8: partial products P0-P7 (each one shifted one bit further to the left and padded with zeros) reduced by a tree of six CSAs arranged in four levels, followed by a Brent-Kung adder producing M.]
Figure 5.8: Wallace 8bit structure
The worst case delay for the multiplier tree (without the final adder) is equal to
the number of levels times the worst case delay of a FA.
To calculate the total delay of the Wallace tree multiplier, the delay of the final
adder (Brent-Kung type in this case) needs to be added.
$$t(\text{Basic Wallace}) \approx \log_{1.5}(N) \cdot t_{FA} + t_{bk\,adder} \tag{5.5}$$
With N the bit width of the multiplier, tFA the worst delay for a full adder and
tbk adder the delay of the final bk adder, which is also dependent on the size of the
multiplier.
Data width   Number of levels
8            4
16           6
32           8
64           10
128          12
N            ≈ log1.5(N)
Table 5.1: Number of CSA levels for some typical multiplier widths
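The number of levels can also be counted directly. The Python sketch below assumes a simple greedy reduction in which, at every level, each group of three rows is replaced by two and the leftover rows are passed on unchanged; this grouping is an assumption of the example and not necessarily the one used for Table 5.1. It reproduces the tabulated values for 8 to 64 bits, while for 128 bits it gives 11 levels, one less than the tabulated 12, which is closer to the logarithmic estimate.

```python
import math

def wallace_levels(rows):
    """Number of CSA levels needed to reduce `rows` partial products to 2 vectors."""
    levels = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3   # each group of 3 rows becomes 2 rows
        levels += 1
    return levels

for n in (8, 16, 32, 64, 128):
    print(f"N = {n:3d}: {wallace_levels(n):2d} levels, log1.5(N) = {math.log(n, 1.5):.1f}")
```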
5.2.1 Wallace parallel versions
The parallelized versions of the Wallace multiplier are obtained exactly in the same
way as for the RCA (Fig. 5.4). The description in Section 5.1.1 remains valid for the
Wallace multiplier, too.
5.3 Sequential
The Sequential multiplier takes its name from the fact that this implementation uses
several clock cycles to compute one multiplication by sequentially “adding and
shifting” the previous partial result. The structure of such a multiplier is illustrated in
Fig. 5.9. The main advantage of this implementation is the compactness of the
circuit. In fact, to calculate a 16 bit multiplication, only a 17 bit adder with some
registers and a bit of control logic are required. On the other hand, the result is
not available until 16 clock cycles have elapsed.
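[Figure 5.9: registers Mult_reg(32), initialized with B in the low half and zeros in the high half, and A_reg(16) holding A; a 16+1 bit adder combines the 16 bit operands.]
Figure 5.9: Sequential multiplier structure (16bit)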
In the case of the present thesis, the adder used in the “add and shift” structure
is a Brent-Kung type (bk), which is known for being a very fast adder. Considering
the best case where the adder uses exactly the whole clock period, the total delay
of a Sequential multiplier is given by:
$$t(\text{Sequential}) = N \cdot (t_{bk\,adder} + t_{AND} + t_{dff} + t_{dff\,setup}) \tag{5.6}$$
With N the multiplier bit width, tbk adder the Brent-Kung adder delay, tAND the delay
of the AND gate used to generate the partial products, tdff the registers clock-to-Q
delay and tdff setup the registers setup time.
In the case where the clock frequency is smaller than the maximal allowed one,
the total delay will correspond to N · tclock.
5.3.1 Sequential-wallace
A special modification of the Sequential multiplier is what we call the Sequential-
wallace (Fig. 5.10). The idea is to reduce the number of clock cycles required to com-
pute one multiplication by adding partial multiplications rather than partial products.
In the case of a 16 bit implementation (as reported in Fig. 5.10), a 4x16 bit Wallace
multiplier is used to compute partial multiplications and then the results are summed
sequentially. In this way we obtain a version between the Wallace (large area, small
delay) and the Sequential (small area, large delay). Actually, for the proposed ex-
ample, only 4 clock cycles are required per multiplication compared to the 16 cycles
necessary for the basic Sequential implementation.
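[Figure 5.10: same structure as Fig. 5.9, with a 4x16 Wallace tree multiplier generating 20 bit partial multiplications from 16 bits of A and 4 bits of B in front of the adder.]
Figure 5.10: Sequential multiplier (16bit) with a 4x16 Wallace implementation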
The delay of a Sequential-wallace multiplier is obtained by:
$$t(\text{Sequential-wallace}) = M \cdot (t_{bk\,adder} + t_{N/M \times N\,Wallace} + t_{dff} + t_{dff\,setup}) \tag{5.7}$$
With M (< N) the number of required cycles, N the bit width of the multiplier,
tbk adder the Brent-Kung adder delay and tN/MxN Wallace the delay of the N/MxN
Wallace multiplier.
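The trade-off between the basic Sequential multiplier and the Sequential-wallace variant can be illustrated with a behavioral Python sketch. The function name and the parameter `bits_per_cycle` are hypothetical; the model counts clock cycles only and does not reproduce the register-level shifting of the real circuit.

```python
def sequential_multiply(a, b, n_bits, bits_per_cycle=1):
    """Add-and-shift multiplication; returns (product, number of clock cycles)."""
    if n_bits % bits_per_cycle:
        raise ValueError("n_bits must be a multiple of bits_per_cycle")
    mask = (1 << bits_per_cycle) - 1
    acc, cycles = 0, 0
    for shift in range(0, n_bits, bits_per_cycle):
        chunk = (b >> shift) & mask     # next group of multiplier bits
        acc += (a * chunk) << shift     # partial multiplication, then accumulation
        cycles += 1
    return acc, cycles

print(sequential_multiply(40000, 50000, 16))                    # basic: 16 cycles
print(sequential_multiply(40000, 50000, 16, bits_per_cycle=4))  # 4x16 blocks: 4 cycles
```

With one bit per cycle the partial multiplication degenerates to the AND gating of Eq. (5.6); with four bits per cycle it corresponds to the 4x16 block of Fig. 5.10.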
5.3.2 Sequential parallel
The parallelized version of the Sequential multiplier is obtained exactly the same way
as for the RCA and the Wallace (Fig. 5.4). The only difference is the sel pin that
only switches once every N clock cycles, where N is the size of the multiplier.
5.4 Summary
In this chapter, 13 multiplier architectures have been discussed. These circuits are
divided into 3 families (namely RCA, Wallace and Sequential) and they cover a wide
range of delay, area and complexity. For this reason, they are well suited as
reference circuits for the discussions presented further in this thesis. For convenience,
the periods of the maximal throughput frequency as well as the cell count for each
design are summarized in Table 5.2. The equations of the cell count for the Wallace
implementations are obtained from [46] [47].
Name                    Period of the maximal throughput frequency                   Number of combinatorial cells                     Regs (a)
RCA basic               (2·N − 2)·tcout + (N − 2)·tsum                               (N − 1)^2·FA + N·HA + N^2·AND                     4N
RCA parallel 2          (N − 1)·tcout + (N/2 − 1)·tsum + tmux                        2(N − 1)^2·FA + 2N·HA + 2N^2·AND + MUX2           6N
RCA parallel 4          (N/2 − 1/2)·tcout + (N/4 − 1/2)·tsum + tmux                  4(N − 1)^2·FA + 4N·HA + 4N^2·AND + MUX4           10N
RCA horiz. pipeline 2   (3/2N − 1)·tcout + (1/2N − 1)·tsum + tdff                    (N − 1)^2·FA + N·HA + N^2·AND                     7N
RCA horiz. pipeline 4   (5/4N − 1)·tcout + (1/4N − 1)·tsum + tdff + tdff setup       (N − 1)^2·FA + N·HA + N^2·AND                     13N
RCA diag. pipeline 2    3/4N·tcout + (3/4N − 1)·tsum + tdff                          (N − 1)^2·FA + N·HA + N^2·AND                     7N
RCA diag. pipeline 4    (3/4N − 1)·tsum + tdff + tdff setup                          (N − 1)^2·FA + N·HA + N^2·AND                     13N
Wallace basic           ≈ log1.5(N)·tFA + tbk adder                                  ≈ (N^2 − 2N)·FA + "few" HA + N^2·AND              4N
Wallace parallel 2      ≈ 1/2·(log1.5(N)·tFA + tbk adder) + tmux                     ≈ 2(N^2 − 2N)·FA + "few" HA + 2N^2·AND + MUX2     6N
Wallace parallel 4      ≈ 1/4·(log1.5(N)·tFA + tbk adder) + tmux                     ≈ 4(N^2 − 2N)·FA + "few" HA + 4N^2·AND + MUX4     10N
Sequential basic        N·(tbk adder + tAND + tdff + tdff setup)                     N·FA + HA + N·AND                                 5N
Sequential-wallace      M·(tbk adder + tN/MxN Wallace + tdff + tdff setup)           (N^2/M − N/M)·FA + "few" HA + N^2/M·AND           5N
Sequential parallel 2   N/2·(tbk adder + tAND + tdff + tdff setup) + tmux            2N·FA + 2HA + 2N·AND + MUX2                       8N
(a) Including the input and output latching registers
Table 5.2: Summary of the multipliers delays and cell counts
Chapter 6
Total power comparison for free Vdd and free Vth
A very effective way to reduce the total power consumption in digital circuits is the
reduction of the supply voltage Vdd. This approach is simple and easy to implement,
and it simultaneously reduces the dynamic power quadratically and the static power
linearly. Unfortunately, the performance (speed) then decreases rapidly. To
avoid this, it is possible to restore the original performance by reducing
the transistor threshold voltage Vth. The price for this is an exponential increase
of the static power. For this reason, balancing the reduction of the dynamic
power against the increase of the static power leads to a point in the (Vdd, Vth) space where,
for a given delay, the total power presents a minimum. This chapter discusses this
minimum of the total power consumption and derives an approximate formula
for the total power at the optimal (Vdd, Vth) point.
6.1 Existence of a total power consumption optimum
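To convince the reader of the existence of the minimum of the total power
consumption, it is important to recall the power and delay equations reported in Chapter 3:
$$P_{tot} = P_{dyn} + P_{stat} = a C N f V_{dd}^2 + N V_{dd} I_0 \, e^{-\frac{V_{th}}{n U_t}} \tag{6.1}$$
$$f_{max} = \frac{I_{on}}{k_t \cdot C \cdot LD \cdot V_{dd}} = \frac{I_0 \cdot e^{\alpha}}{k_t \cdot C \cdot LD \cdot (\alpha n U_t)^{\alpha}} \cdot \frac{(V_{dd} - V_{th})^{\alpha}}{V_{dd}} \tag{6.2}$$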
With a the activity factor, C the equivalent capacitance per cell, N the number of
cells, f the working frequency, Vdd the supply voltage, I0 the reference transistor
current, Vth the transistor threshold voltage, n the sub-threshold slope, Ut the thermal
potential, kt the delay proportionality constant, LD the logical depth and α the alpha
power law coefficient.
If we now consider that the frequency fmax (called f from now on) is fixed and
defined by the application, it is possible to rewrite Eq. (6.2) to obtain the formula
tying Vdd and Vth together:
$$V_{th} = V_{dd} - \chi \cdot V_{dd}^{1/\alpha} \quad \text{with:} \quad \chi^{\alpha} = \frac{k_t \cdot C \cdot f \cdot LD}{I_0} \left(\frac{\alpha n U_t}{e}\right)^{\alpha} \tag{6.3}$$
The parameter χ in Eq. (6.3) is a very important one. This parameter ties together
the supply voltage and the threshold voltage. Its value represents a kind of “global
rapidness” accounting for both technology and architectural impacts. Actually, a
large χ means a “slow” design, which can be due to a large logical depth or a slow
technology or a combination of architectural and technology parameters. The presence
of the working frequency in the equation of χ shows that the concept of slow or
quick design is dependent on the desired working frequency. For instance, a design
considered rapid for a working frequency of 1MHz, could be considered slow for a
working frequency of 100MHz.
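The constraint of Eq. (6.3) is simple to evaluate. The Python sketch below uses α = 1.65 and χ = 0.3, the values quoted for Fig. 6.1; the function name is hypothetical.

```python
def vth_for_constant_speed(vdd, chi, alpha):
    """Eq. (6.3): threshold voltage that keeps the target frequency at a given Vdd."""
    return vdd - chi * vdd ** (1.0 / alpha)

for vdd in (0.2, 0.4, 0.6, 0.8, 1.0):
    vth = vth_for_constant_speed(vdd, chi=0.3, alpha=1.65)
    print(f"Vdd = {vdd:.1f} V -> Vth = {vth:.3f} V")
```

A larger χ (a "slower" design) lowers the whole curve, i.e. the same Vdd then requires a lower Vth to reach the target frequency.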
A graphical representation of Eq. (6.3) is given in Fig. 6.1. There, we can see that
reducing the supply voltage also requires reducing the threshold voltage in
order to maintain speed. Even if there exists an infinite number of pairs (Vdd, Vth)
showing the same performance, they don’t present the same power consumption. In
fact, while the reduction of the supply voltage Vdd reduces the dynamic power
quadratically and the static power linearly, the reduction of the threshold
voltage Vth causes an exponential increase of the static power. Due to the exponential
nature of this last dependency, the static power increase can rapidly cancel the
benefit of the reduced supply voltage Vdd. Therefore, among all the combinations
of (Vdd, Vth) guaranteeing the desired speed, only one pair will result in the lowest
power consumption for a given architecture (Fig. 6.2). From now on, this working
condition will be called the optimal working point or ideal working point.
The location of this optimal working point and its associated total power
consumption are tightly related to architectural and technology parameters. For instance,
Fig. 6.2 illustrates the fact that reducing the activity factor allows a reduction of Ptot,
whereas it tends to increase the optimal Vdd and Vth. As architectural modifications
will change several factors simultaneously (not just the activity), it is necessary to
develop a methodology to evaluate the influence of such transformations on the total
power consumption (Ptot).
[Figure 6.1: plot of Vth [V] (0 to 0.7) versus Vdd [V] (0.1 to 1).]
Figure 6.1: Relationship between Vdd and Vth for α = 1.65 and χ = 0.3
In related contributions ([48], [49], [50], [51], [52], [53], [54]), the authors preferred
to seek the minimum of the energy rather than the minimum of the total power
as done in this work. From a mathematical point of view, looking for the minimum
of the energy is slightly easier and the results are different from what we derive here.
Indeed, they found that the minimum of total energy is most of the time located in
the weak-inversion transistor region (optimal Vdd < optimal Vth), which corresponds
to very low performance logic.
6.2 Pdyn over Pstat ratio
Looking at the ratio of Pdyn over Pstat at the optimal working point in Fig. 6.2, it is
possible to observe that the dynamic contribution still remains greater than the static one.
$$k_1 = \left.\frac{P_{dyn}}{P_{stat}}\right|_{optimum} \tag{6.4}$$
This ratio (k1) is a measurement of the circuit usefulness. In fact, rarely used
transistors will provide a low k1 due to their high static consumption compared to the
dynamic one. For this reason, it is better to have fewer transistors (less static power)
working more actively (more dynamic power) than to have lots of idle transistors that
just increase the static power. In related works [55] [56], the authors stated that this ratio
should be equal to 1, whereas our experience, based on many designs (such as multipliers,
FIR, shift registers, micro-processors, counters, ...) in deep sub-micron technologies
(0.18µm, 0.13µm and 90nm), suggests that typical values of k1 are between 3 and 7.
[Figure 6.2: plot of power [mW] (0 to 0.4) versus Vdd [V] (0.3 to 0.8), with the corresponding Vth [V] (0.18 to 0.59) on a second axis; curves Ptot, Pdyn and Pstat for three activities. Annotated ratios at the optimal working points: Pdyn1/Pstat = 3.14 @ a = 0.27, Pdyn2/Pstat = 3.55 @ a = 0.13, Pdyn3/Pstat = 3.95 @ a = 0.07 (Pstat independent of activity).]
Figure 6.2: Total power consumption of a 16 bit Wallace multiplier in a STM 90nm
technology (CMOS090-SVT, 100MHz) with freely modifiable Vdd and Vth. Three
different circuit activities (a) are reported. The optimal working points are marked
by a cross.
6.2.1 k1 derivation
A precise calculation of k1 is possible and easy to obtain. In fact, k1 can be derived
by searching the minimum of Ptot(Vdd) as:
∂Ptot(Vdd)
∂Vdd
=
∂Pdyn(Vdd)
∂Vdd
+
∂Pstat(Vdd)
∂Vdd
= 0 (6.5)
The combination of Eq. (6.5) with Eq. (6.1) and Eq. (6.3) leads to:
\[ k_1 = \frac{(\alpha - 1) V_{dd}^{opt} + V_{th}^{opt}}{2 n U_t \alpha} - \frac{1}{2} \tag{6.6} \]
where α is the alpha power law coefficient, n the sub-threshold slope and Ut the thermal
potential.
Table 6.1 shows the equivalent of Eq. (6.6) in the case of the STM 90nm technology
(used values are obtained from Chapter 4).
Type   Approximation of k1
LVT    k_1 \approx 4.0\,V_{dd}^{opt} + 7.3\,V_{th}^{opt} - 0.5
SVT    k_1 \approx 5.2\,V_{dd}^{opt} + 8.0\,V_{th}^{opt} - 0.5
HVT    k_1 \approx 5.9\,V_{dd}^{opt} + 7.1\,V_{th}^{opt} - 0.5
Table 6.1: Approximation of k1 for STM 90nm technology
From the equations in Table 6.1, it is possible to see that the case k1 = 1 is very
difficult to reach: it would correspond to extremely low optimal Vdd and Vth.
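As an illustration, Eq. (6.6) can be evaluated directly. The following is a minimal sketch: the function name is hypothetical, and the value of nUt is an assumption chosen so that the SVT row of Table 6.1 is approximately reproduced (it is not the value of Chapter 4).

```python
def k1_from_voltages(vdd_opt, vth_opt, alpha, n_ut):
    """Eq. (6.6): ratio Pdyn/Pstat at the optimum (all voltages in volts)."""
    return ((alpha - 1.0) * vdd_opt + vth_opt) / (2.0 * n_ut * alpha) - 0.5


ALPHA_SVT = 1.65   # alpha of the SVT transistors (Table 6.3)
N_UT = 0.0379      # ASSUMED n*Ut in volts, picked to match the SVT row of Table 6.1

# A moderate working point (illustrative values, not taken from the thesis).
k1_typical = k1_from_voltages(0.42, 0.28, ALPHA_SVT, N_UT)

# With Vdd = Vth, Eq. (6.6) reduces to k1 = Vdd / (2*n*Ut) - 1/2,
# so k1 = 1 requires Vdd = Vth = 3*n*Ut, i.e. roughly 0.11 V here.
k1_unity = k1_from_voltages(3 * N_UT, 3 * N_UT, ALPHA_SVT, N_UT)

print(f"k1 at (0.42 V, 0.28 V): {k1_typical:.2f}")
print(f"k1 at Vdd = Vth = {3 * N_UT:.3f} V: {k1_unity:.2f}")
```

Under these assumptions, k1 = 1 is only reached when both voltages drop to about three times nUt, which is consistent with the remark above.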
In Eq. (6.6), k1 was expressed in terms of the optimal Vdd and the optimal Vth, but it
can also be related to the on current (Ion) and the off current (Ioff), or better to the
ratio of these two. In fact, using Eq. (6.1) and Eq. (6.2):
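\[ P_{dyn} = k_1 \cdot P_{stat} \tag{6.7} \]
\[ a \cdot C \cdot N \cdot V_{dd}^{2} \, \frac{I_{on}^{opt}}{k_t \cdot C \cdot LD \cdot V_{dd}} = k_1 \cdot N \cdot V_{dd} \cdot I_{off}^{opt} \tag{6.8} \]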
\[ k_1 = \frac{a}{LD} \, \frac{1}{k_t} \left. \frac{I_{on}}{I_{off}} \right|_{opt} \tag{6.9} \]
If now we remember that kt is just a constant, k1 can easily be expressed by:
\[ k_1 \propto \frac{a}{LD} \left. \frac{I_{on}}{I_{off}} \right|_{opt} \tag{6.10} \]
It is important to note that in Eq. (6.10) the ratio Ion/Ioff also depends on the activity (a)
and the logical depth (LD).
Based on SIA International Technology Roadmap for Semiconductors 2004 [33],
the expected ratios of Ion over Ioff for present and future technologies are:
Year 2006 2009 2012 2015 2018
HP 23400 22714 17900 7000 4380
LOP 203333 154000 118571 90000 31667
LSTP 25.5E6 17.5E6 13.2E6 10.9E6 9.9E6
Table 6.2: SIA ITRS 2004 expected transistors Ion/Ioff for High Performance (HP),
Low Operating Power (LOP) and Low Standby Power (LSTP) circuits.
Looking at Table 6.2, we see that the ratio Ion over Ioff will decrease with time
due to the large increase of the static power consumption. On the other hand, we
have previously seen that k1 does not change much. Hence, we can
conclude that an architecture with activity a and logical depth LD working at its
optimal condition in a present technology will require a higher ratio a/LD in a future
technology. This can be achieved by reducing the logical depth LD, but also by
increasing the activity a, which corresponds to making better use of the implemented
hardware. This reasoning, for instance, tends to favor pipelining over parallelization.
Indeed, the ratio a/LD is increased in a pipelined design due to the reduction of
LD, whereas the same ratio remains almost unchanged during parallelization (cf.
Table 6.5 and Table 6.6).
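The scaling argument can be made concrete with the figures of Table 6.2. The sketch below is only a first-order estimate: it assumes that k1 and kt stay constant in Eq. (6.10) and that the technology ratios of Table 6.2 can stand in for the ratio at the optimum. The function name is hypothetical.

```python
# Ion/Ioff ratios copied from Table 6.2 (SIA ITRS 2004).
ION_IOFF = {
    "HP":   {2006: 23400, 2009: 22714, 2012: 17900, 2015: 7000, 2018: 4380},
    "LOP":  {2006: 203333, 2009: 154000, 2012: 118571, 2015: 90000, 2018: 31667},
    "LSTP": {2006: 25.5e6, 2009: 17.5e6, 2012: 13.2e6, 2015: 10.9e6, 2018: 9.9e6},
}


def required_a_over_ld_scaling(family, year_from, year_to):
    """Factor by which a/LD must grow to keep k1 constant in Eq. (6.10).

    ASSUMPTION: k1 and kt are unchanged between the two technology nodes.
    """
    return ION_IOFF[family][year_from] / ION_IOFF[family][year_to]


for family in ION_IOFF:
    factor = required_a_over_ld_scaling(family, 2006, 2018)
    print(f"{family}: a/LD must increase by about {factor:.1f}x from 2006 to 2018")
```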
By using Eq. (6.22) (derived later in this chapter), it is possible to express k1 in
a much simpler form:
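\[ k_1 = \left. \frac{P_{dyn}}{P_{stat}} \right|_{opt} = \frac{a C N f V_{dd}^{2}}{I_0 N V_{dd} e^{-V_{th}/nU_t}} \cong \frac{a C f V_{dd}}{2 n U_t a C f / (1 - \chi A)} = \frac{V_{dd}}{2 n U_t} (1 - \chi A) \tag{6.11} \]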
In a similar way, Eq. (6.11) can also be expressed in terms of the optimal Vth by applying
Eq. (6.15):
\[ k_1 = \frac{V_{th} + \chi B}{2 n U_t} \tag{6.12} \]
6.3 Optimal Vdd and Vth formulas
In this section the complete derivation of the optimal threshold voltage and supply
voltage is presented. The difficulty of the derivation is to express the optimal Vth without
using the optimal Vdd and vice versa, i.e. we need to decouple these two variables. To
achieve this, we need to linearize the expression Vdd^(1/α) (with α the alpha power law
coefficient, its value spanning from 1 to 2), which is the origin of the transcendental
nature of Eq. (6.3).
Fig. 6.3 shows the expression Vdd^(1/α) and its linear approximation for Vdd from
0.3V to 1V for α = 1.65.
[Figure 6.3 plot: Vdd^(1/α) versus Vdd [V] from 0.3 V to 1 V, for α = 1.65.]
Figure 6.3: Vdd^(1/α) [solid line] and its linear approximation [dashed line]
From this figure, we see how well Vdd^(1/α) can be linearized over a relatively large
interval, leading to the following approximation:
\[ V_{dd}^{1/\alpha} \approx A(\alpha) \cdot V_{dd} + B(\alpha) \tag{6.13} \]
where A and B depend on α but also on the interval of Vdd over which the approximation
is done. A and B can be determined numerically (easy) or analytically
(more complex, but feasible). For Vdd in the interval [0.3V;1V], the graph in Fig. 6.4
can be used to estimate A and B.
The lower graph in Fig. 6.4 shows the maximal error in percent obtained with
the proposed linear approximation. For the range of Vdd restricted to the interval
[0.3V;1V] the error always remains lower than 5%. It is important to note that newer
technologies will tend to have even smaller values of α, which results in an even better
[Figure 6.4 plots: upper panel, coefficients A and B versus α (1 to 2) for Vdd^(1/α) ≈ A·Vdd + B with Vdd ∈ [0.3 V; 1 V]; lower panel, maximal error [%] versus α.]
Figure 6.4: Linearization coefficients for Vdd in [0.3V;1V]
approximation of Eq. (6.13). Moreover, in the case where a better approximation is
needed, the error can be further reduced by limiting the range of Vdd.
In Fig. 6.5, the parameters A and B are calculated for Vdd between 0.3V and 0.6V,
which gives a maximal error lower than 1.4%. The values of A and B for the α
corresponding to the three different variations of the STM 90nm CMOS technology
are reported in Table 6.3.
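The numerical determination of A and B can be sketched as follows. The text does not state which fitting criterion was used; the example assumes an ordinary least-squares line over a uniform grid of Vdd values, which reproduces the SVT entries of Table 6.3 to within rounding. The function name and the number of samples are hypothetical.

```python
def linearize_vdd_root(alpha, v_min, v_max, samples=2001):
    """Least-squares line A*V + B approximating V**(1/alpha) on [v_min, v_max].

    ASSUMPTION: a least-squares fit on a uniform grid; the thesis does not
    specify the criterion actually used to obtain A and B.
    Returns (A, B, maximal relative error on the grid).
    """
    step = (v_max - v_min) / (samples - 1)
    xs = [v_min + i * step for i in range(samples)]
    ys = [x ** (1.0 / alpha) for x in xs]
    mean_x = sum(xs) / samples
    mean_y = sum(ys) / samples
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a_lin = cov / var
    b_lin = mean_y - a_lin * mean_x
    max_rel_err = max(abs(a_lin * x + b_lin - y) / y for x, y in zip(xs, ys))
    return a_lin, b_lin, max_rel_err


for v_max in (1.0, 0.6):
    a_lin, b_lin, err = linearize_vdd_root(1.65, 0.3, v_max)
    print(f"Vdd in [0.3 V; {v_max} V]: A = {a_lin:.3f}, B = {b_lin:.3f}, "
          f"max error = {100 * err:.2f} %")
```

Restricting the interval lowers the maximal error, as discussed above for Fig. 6.5.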
Using the approximation in Eq. (6.13), it is now possible to rewrite Eq. (6.3) in a
simpler way:
\[ V_{th}^{opt}(V_{dd}^{opt}) \cong V_{dd}^{opt} - \chi (A \cdot V_{dd}^{opt} + B) = V_{dd}^{opt} (1 - \chi \cdot A) - \chi \cdot B \tag{6.14} \]
 
[Figure 6.5 plots: upper panel, coefficients A and B versus α (1 to 2) for Vdd^(1/α) ≈ A·Vdd + B with Vdd ∈ [0.3 V; 0.6 V]; lower panel, maximal error [%] versus α.]
Figure 6.5: Linearization coefficients for Vdd in [0.3V;0.6V]
         Vdd ∈ [0.3V; 1V]        Vdd ∈ [0.3V; 0.6V]
         LVT    SVT    HVT       LVT    SVT    HVT
α        1.56   1.65   1.84      1.56   1.65   1.84
A(α)     0.760  0.731  0.676     0.859  0.835  0.788
B(α)     0.260  0.286  0.342     0.210  0.238  0.290
Table 6.3: Values of A and B for the three types of STM090 transistors
This approximation is invertible and makes it possible to estimate the optimal Vdd:
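\[ V_{dd}^{opt}(V_{th}^{opt}) \cong \frac{V_{th}^{opt} + \chi \cdot B}{1 - \chi \cdot A} \underbrace{=}_{V_{th} = V_{th0} - \eta V_{dd}} \frac{V_{th0}^{opt} + \chi \cdot B}{1 - \chi \cdot A + \eta} \tag{6.15} \]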
Another useful expression is the first derivative of V th with respect to V dd. This
expression will be used in the next sections, but for the sake of simplicity, it will be
presented here. From Eq. (6.14) the partial derivative becomes:
\[ \frac{\partial V_{th}^{opt}}{\partial V_{dd}^{opt}} \cong (1 - \chi \cdot A) \tag{6.16} \]
6.3.1 Optimal threshold voltage derivation
The expression of the optimal threshold voltage can be derived by searching for the
V th that would minimize the total power consumption. Hence:
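\[ \frac{\partial P_{tot}(V_{th})}{\partial V_{th}} = \frac{\partial P_{dyn}(V_{th})}{\partial V_{th}} + \frac{\partial P_{stat}(V_{th})}{\partial V_{th}} = 0 \tag{6.17} \]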
Or, better:
\[ \frac{\partial P_{dyn}(V_{th})}{\partial V_{th}} = - \frac{\partial P_{stat}(V_{th})}{\partial V_{th}} \tag{6.18} \]
It is now possible to substitute Eq. (6.1) in Eq. (6.18) to obtain:
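\[ 2 a C N f V_{dd} \frac{\partial V_{dd}}{\partial V_{th}} = - I_0 N e^{-V_{th}/nU_t} \left( \frac{\partial V_{dd}}{\partial V_{th}} - \frac{V_{dd}}{n U_t} \right) \tag{6.19} \]
\[ e^{V_{th}/nU_t} = \frac{I_0}{2 n U_t a C f} \left( \frac{\partial V_{th}}{\partial V_{dd}} - \frac{n U_t}{V_{dd}} \right) \tag{6.20} \]
\[ e^{V_{th}/nU_t} \underbrace{\cong}_{\text{Eq. (6.16)}} \frac{I_0}{2 n U_t a C f} \left( 1 - \chi A - \frac{n U_t}{V_{dd}} \right) \tag{6.21} \]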
At room temperature, nUt is about 0.04V (refer to Table 4.9 for the exact value in
the case of the STM090 technology). So, even if the optimal supply voltage
is as low as 0.4V, the ratio nUt/Vdd is only about 0.1, and it is even lower for higher
optimal Vdd. For this reason, we consider this term negligible compared to 1 − χA.
This approximation is mandatory in order to decouple Vth and Vdd.
The optimal Vth can finally be calculated:
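\[ e^{V_{th}/nU_t} \cong \frac{I_0}{2 n U_t a C f} (1 - \chi A) \tag{6.22} \]
\[ V_{th}^{opt} \cong n U_t \ln\left( \frac{I_0}{2 n U_t a C f} (1 - \chi A) \right) \quad \text{with: } \chi^{\alpha} = \frac{k_t \cdot C \cdot f \cdot LD}{I_0} \left( \frac{e}{\alpha n U_t} \right)^{\alpha} \tag{6.23} \]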
Eq. (6.23) shows the influence of architectural parameters (like a, LD [included
in χ], f) and technology parameters (like I0, n, C, α, kt) on the optimal threshold
voltage Vth.
Consider a 16 bit Wallace multiplier with the following properties:
Technology STM090 SVT
Nominal Dynamic Power 693.28 µW
Nominal Static Power 9.90 µW
Nominal Activity 0.267
Nominal Frequency 100 MHz
Nominal Max Delay 2.38 ns
Nominal Supply voltage 1 V
Nominal Threshold voltage 0.353 V
Table 6.4: Parameters of a 16 bit Wallace multiplier
Fig. 6.6 shows the optimal V th vs. activity for the multiplier described in Ta-
ble 6.4, while maintaining the other architectural parameters constant.
[Figure 6.6 plots: upper panel, optimal Vth [V] versus activity (0.05 to 0.5), analytical approximation and numerical computation; lower panel, error in percent versus activity.]
Figure 6.6: Optimal V th vs. activity
The optimal Vth has been calculated in two separate ways. The first, called
analytical approximation on the plot, is the direct use of Eq. (6.23) with Vdd^(1/α) linearized
over the interval [0.3V;1V], whereas the second, called numerical computation
on the plot, is obtained with a high resolution numerical computation based on the
non-approximated Eq. (6.1) and Eq. (6.3).
The first remark on Fig. 6.6 is that the error of the approximation remains lower
than 5% for the proposed range of activities.
Another interesting point is the shape of the curve of the optimal Vth vs. a. In fact, we
can observe that the optimal Vth increases for low activities, while it decreases for high
activities, as already noted in Fig. 6.2. Moreover, it is visible that for high activities the
optimal Vth becomes almost constant or varies only very slightly.
A similar graph is found in Fig. 6.7, but this time with the frequency as a variable
parameter, while the other parameters are kept constant.
[Figure 6.7 plots: upper panel, optimal Vth [V] versus frequency [MHz] (0 to 100), analytical approximation and numerical computation; lower panel, error in percent versus frequency [MHz].]
Figure 6.7: Optimal V th vs. frequency
As expected, the increase of the working frequency results in a reduction of the
optimal V th. In fact, in order to achieve the higher frequency, V th is reduced to
obtain a larger (V dd− V th).
The last optimal V th graph is Fig. 6.8 and it shows the optimal V th vs. the logical
depth (LD).
It is important to note that the optimal V th is almost insensitive to the logical
depth. This can be quite surprising, but it is explained by the important change in
[Figure 6.8 plots: upper panel, optimal Vth [V] versus logical depth (0 to 48), analytical approximation and numerical computation; lower panel, error in percent versus logical depth.]
Figure 6.8: Optimal V th vs. logical depth
the optimal V dd (refer to the next section), which “absorbs” almost completely the
changes in the logical depth.
In the case where the nominal technology values of V dd, V th, Pdyn and Pstat
are known, Eq. (6.23) can be also written as:
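\[ V_{th}^{opt} \cong n U_t \ln\left( \frac{P_{stat}^{nom}}{P_{dyn}^{nom}} \, \frac{V_{dd}^{nom} \, e^{(V_{th0}^{nom} - \eta V_{dd}^{nom})/nU_t}}{2 n U_t} \, (1 - \chi A) \right) \tag{6.24} \]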
6.3.2 Optimal supply voltage derivation
Once the optimal V th has been calculated, the derivation of the optimal V dd is very
simple thanks to Eq. (6.15). In fact, by simply replacing the expression of V thopt,
Eq. (6.25) and Eq. (6.26) can be obtained.
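\[ V_{dd}^{opt} \cong \frac{n U_t \ln\left( \frac{I_0}{2 n U_t a C f} (1 - \chi A) \right) + \chi B}{1 - \chi A} \quad \text{with: } \chi^{\alpha} = \frac{k_t \cdot C \cdot f \cdot LD}{I_0} \left( \frac{e}{\alpha n U_t} \right)^{\alpha} \tag{6.25} \]
\[ V_{dd}^{opt} \cong \frac{n U_t \ln\left( \frac{P_{stat}^{nom}}{P_{dyn}^{nom}} \, \frac{V_{dd}^{nom} \, e^{(V_{th0}^{nom} - \eta V_{dd}^{nom})/nU_t}}{2 n U_t} \, (1 - \chi A) \right) + \chi B}{1 - \chi A} \tag{6.26} \]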
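As a sanity check, Eq. (6.26) can be evaluated and the result inserted into Eq. (6.11) to see whether k1 falls in the expected range. This is a sketch under stated assumptions: nominal values from Table 6.4, A and B from Table 6.3 (SVT, [0.3V;1V]), and assumed values for nUt, η (taken as zero, so the nominal Vth equals Vth0) and χ. Function names are hypothetical.

```python
import math


def optimal_vdd(pstat_nom, pdyn_nom, vdd_nom, vth_nom, n_ut, chi, a_lin, b_lin):
    """Eq. (6.26): optimal Vdd [V]; vth_nom stands for Vth0_nom - eta*Vdd_nom."""
    one_minus = 1.0 - chi * a_lin
    leak = (pstat_nom / pdyn_nom) * vdd_nom * math.exp(vth_nom / n_ut)
    vth_opt = n_ut * math.log(leak * one_minus / (2.0 * n_ut))   # Eq. (6.24)
    return (vth_opt + chi * b_lin) / one_minus


def k1_from_vdd(vdd_opt, n_ut, chi, a_lin):
    """Eq. (6.11): k1 expressed with the optimal Vdd."""
    return vdd_opt * (1.0 - chi * a_lin) / (2.0 * n_ut)


# ASSUMED illustration values: n*Ut = 0.038 V, chi = 0.30, eta = 0.
N_UT, CHI, A_LIN, B_LIN = 0.038, 0.30, 0.731, 0.286

vdd_opt = optimal_vdd(9.90, 693.28, 1.0, 0.353, N_UT, CHI, A_LIN, B_LIN)
k1 = k1_from_vdd(vdd_opt, N_UT, CHI, A_LIN)

print(f"optimal Vdd: {vdd_opt:.3f} V, k1: {k1:.2f}")
```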
To discuss the validity of this approximation, we can reconsider the circuit de-
scribed in Table 6.4. Fig. 6.9 shows the optimal V dd for different activities. The
values of V ddopt are calculated in two ways. The analytical approximation is based
on Eq. (6.25), whereas the numerical computation is based on the non-approximated
equations (6.1) and (6.3).
[Figure 6.9 plots: upper panel, optimal Vdd [V] versus activity (0.05 to 0.5), analytical approximation and numerical computation; lower panel, error in percent versus activity.]
Figure 6.9: Optimal V dd vs. activity
From Fig. 6.9, we can see that, for the chosen range of activity, the error remains
smaller than 5%. Moreover, by looking at the shape of the optimal Vdd curve, we observe a
trend very similar to the one for Vth. Actually, the increase of activity reduces both
the optimal Vth and the optimal Vdd in a similar way. This can be explained by the fact that a change
in activity does not modify the timing constraints, and hence the difference Vdd − Vth
(cf. Eq. (6.2)) remains almost unchanged.
A similar graph can be plotted for the frequency as the free variable. This situation
is represented by Figure 6.10.
It is interesting to note the shape of the curve of the optimal Vdd vs. f. For high frequencies
the behavior corresponds to what we would expect: a reduction of the working
frequency allows a reduction of the optimal supply voltage (which corresponds to an
increase of the optimal threshold voltage). For low frequencies, however, the optimal Vdd
starts to increase again. This behavior comes from the large increase of the optimal
[Figure 6.10 plots: upper panel, optimal Vdd [V] versus frequency [MHz] (0 to 100), analytical approximation and numerical computation; lower panel, error in percent versus frequency [MHz].]
Figure 6.10: Optimal V dd vs. frequency
[Figure 6.11 plots: upper panel, optimal Vdd [V] versus logical depth (0 to 48), analytical approximation and numerical computation; lower panel, error in percent versus logical depth.]
Figure 6.11: Optimal V dd vs. logical depth
V th in this zone. In fact, to avoid a weak inversion regime (V dd < V th), V ddopt
needs to increase in order to maintain the difference V dd− V th positive.
The last graph of the optimal V dd is reported in Fig. 6.11. There, V ddopt is
plotted versus the logical depth. This curve shows an almost linear behavior. In fact,
as stated before, the change in the timing requirements resulting from the change in
the logical depth affects almost exclusively the optimal V dd whereas the optimal V th
remains quite constant (cf. Fig. 6.8).
Finally, we can say that frequency mainly affects the optimal V th, logical depth
mainly affects the optimal V dd, and activity affects both of them.
6.4 Optimal total power
From what has been developed in the previous pages, it is now possible to obtain
some approximations of the optimal total power consumption. Unfortunately, due
to the transcendental nature of the involved equations, no exact formula exists to
determine the optimal Ptot. Nevertheless, with the help of a few basic assumptions,
approximated equations can be found. In the next sections, two different approaches
are proposed. The former develops a rough way to compare architectures that present
similar values of k1 (≡ optimal Pdyn/Pstat), whereas the latter is a much more precise
approximation for an absolute optimal total power estimation.
6.4.1 Optimal power comparison with k1 constant
For this first derivation, we assume that k1 is constant or at least varies
very little. This rough approach can be used as a quick way to compare the optimal
total power consumption of two (or more) circuits having very similar characteristics
in the sense of a similar k1 (≡ optimal Pdyn/Pstat).
The optimal total power can be expressed with k1 as:
\[ P_{tot}^{opt} = P_{dyn}^{opt} \left( 1 + \frac{1}{k_1} \right) \tag{6.27} \]
From our experience, typical values of k1 span from 3 to 7 considering very different
architectural blocks like multipliers, adders, counters, shift registers, FIR, micropro-
cessors, etc. In the case of circuits with similar functions and working conditions,
k1 can be considered constant, at least for a first rough approximation. Just as an
example, ten different 16bit multipliers (7 RCA variations and 3 Wallace variations)
implemented in a STM 90nm technology and with a working frequency of 33 MHz
have a k1 included in the range between 4.22 and 4.69.
To fix ideas, the error introduced by a ∆k1 ≠ 0 can be calculated:
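\[ \Delta P_{tot} = \frac{\partial}{\partial k_1} \left[ P_{dyn} \left( 1 + \frac{1}{k_1} \right) \right] \Delta k_1 = - P_{dyn} \frac{\Delta k_1}{k_1^{2}} = - P_{tot} \frac{\Delta k_1}{k_1 (k_1 + 1)} \tag{6.28} \]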
Or:
\[ \frac{\Delta P_{tot}}{P_{tot}} = - \frac{\Delta k_1 / k_1}{k_1 + 1} \tag{6.29} \]
In practice, Eq. (6.29) means that the relative error (∆k1/k1) introduced by a
non-constant k1 has an effect divided by k1 + 1 on the optimal total power Ptot. Hence,
in our example of the ten 16 bit multipliers, the worst-case error ∆Ptot/Ptot is
about 2.1%.
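The 2.1% figure can be reproduced from Eq. (6.29) with the k1 range quoted for the ten multipliers (4.22 to 4.69). The sketch below also evaluates the exact change of the factor (1 + 1/k1) of Eq. (6.27) for comparison; the function names are hypothetical.

```python
def relative_power_error(k1, delta_k1):
    """Eq. (6.29): first-order relative change of Ptot for a change delta_k1."""
    return -(delta_k1 / k1) / (k1 + 1.0)


def exact_relative_change(k1_ref, k1_new):
    """Exact relative change of the factor (1 + 1/k1) of Eq. (6.27)."""
    return (1.0 + 1.0 / k1_new) / (1.0 + 1.0 / k1_ref) - 1.0


K1_MIN, K1_MAX = 4.22, 4.69   # range reported for the ten 16 bit multipliers

first_order = relative_power_error(K1_MIN, K1_MAX - K1_MIN)
exact = exact_relative_change(K1_MIN, K1_MAX)

print(f"first-order estimate: {100 * first_order:.2f} %")
print(f"exact change of (1 + 1/k1): {100 * exact:.2f} %")
```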
Thanks to the constant k1 hypothesis, the optimal total power consumption com-
parison is now reduced to the comparison of the optimal dynamic power (Pdyn).
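\[ P'_{tot} \overset{?}{<} P_{tot} \tag{6.30} \]
\[ P'_{dyn} \left( 1 + \frac{1}{k_1} \right) \overset{?}{<} P_{dyn} \left( 1 + \frac{1}{k_1} \right) \tag{6.31} \]
\[ P'_{dyn} \overset{?}{<} P_{dyn} \tag{6.32} \]
\[ a' C' N' f' V_{dd}'^{\,2} \overset{?}{<} a C N f V_{dd}^{2} \tag{6.33} \]
\[ V'_{dd} \overset{?}{<} V_{dd} \sqrt{\frac{a C N f}{a' C' N' f'}} \tag{6.34} \]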
The primed parameters (′) correspond to the new architecture, which
is compared to a reference design (unprimed).
Parallelization example
To better understand the usefulness of Eq. (6.34), let us apply it to the case of a circuit
parallelization. Table 6.5 reports the typical architectural parameter variations in the
case of a P times parallelization.
In a parallelization process, the number of cells is more than P times the original
one due to the overhead introduced mainly by the multiplexer and the additional
registers required to maintain valid data on both blocks. We can define the Dynamic
OverHead (DOH) as the relative increment of the dynamic power due to this overhead
at nominal conditions (i.e. Pdyn′_nom = (1 + DOH) Pdyn_nom).
Symbol   Name                      Effect of parallelization
a        activity                  ≈ a/P
N        number of cells           ≈ N·P + overhead
LDeff    effective logical depth   LDeff/P
f        frequency                 unchanged
Table 6.5: Effect of parallelization on architectural parameters
From Eq. (6.34) we now know that in order to reduce the optimal power consumption
through parallelization, the following expression must be satisfied:
\[ V'_{dd} \overset{!}{<} V_{dd} / \sqrt{1 + DOH} \tag{6.35} \]
With V ′dd the optimal supply voltage after the parallelization and Vdd the optimal
supply voltage before parallelization.
On the other hand, the optimal V th, which depends mainly on activity, can be
approximated as (from Eq. (6.23)):
\[ V'_{th} \cong V_{th} + n U_t \ln P \tag{6.36} \]
With V th′ the optimal threshold voltage after parallelization and V th the optimal
threshold voltage before parallelization.
Moreover, from Eq. (6.3) we can write:
\[ \frac{V'_{dd} - V'_{th}}{V_{dd}'^{\,1/\alpha}} = \chi' = \chi / P^{1/\alpha} = \frac{V_{dd} - V_{th}}{P^{1/\alpha} \, V_{dd}^{1/\alpha}} \tag{6.37} \]
The combination of Eq. (6.35), Eq. (6.36) and Eq. (6.37) yields:
\[ \chi \overset{!}{>} \left( \frac{P \sqrt{1 + DOH}}{V_{dd}} \right)^{1/\alpha} \left( \frac{V_{dd}}{\sqrt{1 + DOH}} - V_{th} - n U_t \ln P \right) \tag{6.38} \]
All parameters in Eq. (6.38) refer to the design before parallelization. Hence, to
know if a circuit can reach a lower optimal total power through parallelization, it is
sufficient to check that the previous inequality is satisfied.
In the same way, it is possible to determine the maximal value of DOH that still
allows power savings when parallelization is performed.
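The three conditions Eq. (6.35), Eq. (6.36) and Eq. (6.37) can also be checked numerically without combining them. The sketch below solves Eq. (6.37) for the new optimal Vdd by bisection and then tests Eq. (6.35). All numerical values (χ, Vth, nUt, DOH) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def solve_vdd(chi, vth, alpha, lo=1e-6, hi=5.0, iterations=200):
    """Solve Vdd - Vth = chi * Vdd**(1/alpha) (Eq. (6.3)) for Vdd by bisection."""
    def residual(v):
        return v - chi * v ** (1.0 / alpha) - vth
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def parallelization_saves_power(vdd, vth, chi, p, doh, alpha, n_ut):
    """Return (condition of Eq. (6.35) met?, optimal Vdd after parallelization)."""
    vth_new = vth + n_ut * math.log(p)        # Eq. (6.36)
    chi_new = chi / p ** (1.0 / alpha)        # Eq. (6.37)
    vdd_new = solve_vdd(chi_new, vth_new, alpha)
    return vdd_new < vdd / math.sqrt(1.0 + doh), vdd_new


# ASSUMED reference design: chi = 0.39, optimal Vth = 0.25 V, alpha = 1.65, n*Ut = 0.038 V.
ALPHA, N_UT, CHI, VTH = 1.65, 0.038, 0.39, 0.25
vdd_ref = solve_vdd(CHI, VTH, ALPHA)

for doh in (0.14, 1.0):
    saves, vdd_new = parallelization_saves_power(vdd_ref, VTH, CHI, 2, doh, ALPHA, N_UT)
    print(f"DOH = {doh}: Vdd {vdd_ref:.3f} V -> {vdd_new:.3f} V, saving: {saves}")
```

Sweeping DOH in such a loop gives the largest overhead that still allows a saving for a given design.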
Pipelining example
The same approach can be carried out in the case of a pipelining transformation. The
effect of a typical pipelining transformation to the architectural parameters is shown
in Table 6.6.
Symbol   Name                      Effect of pipelining
a        activity                  ≈ unchanged
N        number of cells           + registers overhead
LDeff    effective logical depth   LDeff/pf
f        frequency                 unchanged
Table 6.6: Effect of pipelining on architectural parameters
Ideally, the critical path would be divided by two (or by the number of pipelining
stages in general) through a register bank insertion. Unfortunately, this ideal factor
is practically never achieved because it is rare to be able to split the path exactly in
the middle. For the sake of generality, the factor pf (pipeline factor) is introduced.
Its value represents the achieved ratio between the logical depth before and after the
pipeline transformation.
Unlike parallelization, a pipeline transformation leaves the activity almost
unchanged, even if a small reduction could be observed due to fewer glitches. This
also means that the optimal threshold voltage after the transformation is practically
the same as before:
\[ V'_{th} \approx V_{th} \tag{6.39} \]
With V th and V th′ the optimal threshold voltage before and after the transfor-
mation respectively.
The overhead in a pipeline structure comes from the register banks inserted in
the data path to cut it into different segments. As before, this overhead is considered
as a dynamic power overhead and is represented by the variable DOH (defined
above). So, the condition on the optimal supply voltage remains the same as for
parallelization, i.e.:
\[ V'_{dd} \overset{!}{<} V_{dd} / \sqrt{1 + DOH} \tag{6.40} \]
Once more, a third condition can be obtained from Eq. (6.3):
\[ \frac{V'_{dd} - V'_{th}}{V_{dd}'^{\,1/\alpha}} = \chi' = \chi / p_f^{1/\alpha} = \frac{V_{dd} - V_{th}}{p_f^{1/\alpha} \, V_{dd}^{1/\alpha}} \tag{6.41} \]
The combination of Eq. (6.39), Eq. (6.40) and Eq. (6.41) gives:
\[ p_f \overset{!}{>} \frac{1}{\sqrt{1 + DOH}} \left( \frac{V_{dd} - V_{th}}{V_{dd} / \sqrt{1 + DOH} - V_{th}} \right)^{\alpha} \tag{6.42} \]
Or:
\[ \chi \overset{!}{>} \left( \frac{p_f \sqrt{1 + DOH}}{V_{dd}} \right)^{1/\alpha} \left( V_{dd} / \sqrt{1 + DOH} - V_{th} \right) \tag{6.43} \]
Or even:
\[ \chi \overset{!}{>} \left( \frac{p_f \sqrt{1 + DOH}}{V_{dd}} \right)^{1/\alpha} \frac{V_{dd} - V_{dd} / \sqrt{1 + DOH}}{(p_f \sqrt{1 + DOH})^{1/\alpha} - 1} \tag{6.44} \]
If one of the conditions in Eq. (6.42), Eq. (6.43) or Eq. (6.44) is satisfied,
pipelining the design is worthwhile from an optimal total power point of view.
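In practice, Eq. (6.42) gives the minimal pipeline factor that must be achieved for a given overhead. The sketch below evaluates it; the operating point and the DOH values are assumptions for illustration, and returning infinity when Vdd/√(1 + DOH) does not exceed Vth is a choice made for this example (within the model, no finite pf can then satisfy the condition).

```python
import math


def min_pipeline_factor(vdd, vth, doh, alpha):
    """Eq. (6.42): pipeline factor pf above which pipelining lowers the optimal power."""
    vdd_limit = vdd / math.sqrt(1.0 + doh)     # upper bound on the new Vdd, Eq. (6.40)
    if vdd_limit <= vth:
        # ASSUMPTION for this sketch: the bound would require Vdd' below Vth,
        # which the model excludes, so no finite pf is sufficient.
        return math.inf
    return ((vdd - vth) / (vdd_limit - vth)) ** alpha / math.sqrt(1.0 + doh)


# ASSUMED operating point: optimal Vdd = 0.509 V, optimal Vth = 0.25 V, alpha = 1.65.
for doh in (0.0, 0.1, 0.3):
    pf_min = min_pipeline_factor(0.509, 0.25, doh, 1.65)
    print(f"DOH = {doh}: pf must exceed {pf_min:.3f}")
```

Without overhead (DOH = 0) the bound is exactly 1, i.e. any reduction of the logical depth helps; the bound grows with the register overhead.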
Considering both the results for parallelization and pipelining, we can say that
these transformations are more effective for large logical depths or high frequencies.
Moreover, new technologies will tend to reduce the value of χ, making pipelining and
parallelization less interesting techniques.
If we want to compare parallelization against pipelining, we can use Eq. (6.38)
and Eq. (6.43). The two equations are very similar. If we consider that nUt ln P is
much smaller than Vdd/√(1 + DOH) − Vth, which is in general the case, and we also
assume that both transformations have the same DOH, we can compare parallelization
against pipelining by simply comparing the parameter P against pf. As we have
seen before, pf is always smaller than the ideal factor, which would correspond to the
number of stages. So, for the same degree of pipelining and parallelization, pf will
always be smaller than the factor P. For this reason we can conclude that the condition
in Eq. (6.43) will be easier to fulfill than the one in Eq. (6.38), making pipelining the
preferred transformation over parallelization.
6.4.2 Absolute optimal total power
The previous section illustrates a rough approximation to quickly compare architec-
tures with a similar k1. Even if this approach can be useful, we would sometimes
prefer to be able to estimate the absolute value of the optimal total power, rather
than by comparison with other architectures.
With Eq. (6.23) and Eq. (6.25), we are able to calculate the optimal total power,
but it could be useful to be able to express the optimal total power directly from the
architectural and technology parameters. This would avoid the need to pre-calculate
the optimal threshold and supply voltages and would give a better understanding of the
influence of the architectural and technology parameters on the optimal total power.
Let us start by including Eq. (6.23) in the total power equation:
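\[ P_{tot} = a C N f V_{dd}^{2} + N V_{dd} I_0 e^{-V_{th}/nU_t} \tag{6.45} \]
\[ \phantom{P_{tot}} = a C N f V_{dd}^{2} + 2 V_{dd} \frac{n U_t \, a C N f}{1 - \chi A} \tag{6.46} \]
\[ \phantom{P_{tot}} = a C N f \left( V_{dd}^{2} + 2 V_{dd} \frac{n U_t}{1 - \chi A} \right) \tag{6.47} \]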
Eq. (6.47) shows a term in Vdd² and a term in 2Vdd. This means that two of the three
terms of the square expansion (a + b)² = a² + 2ab + b² are present. Supposing
that the missing term (b²) is very small compared to the sum of the other two,
the expansion can be reversed.
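\[ P_{tot} = a C N f \left( V_{dd}^{2} + 2 V_{dd} \frac{n U_t}{1 - \chi A} \right) \tag{6.48} \]
\[ \phantom{P_{tot}} \approx a C N f \left( V_{dd}^{2} + 2 V_{dd} \frac{n U_t}{1 - \chi A} + \left( \frac{n U_t}{1 - \chi A} \right)^{2} \right) \tag{6.49} \]
\[ \phantom{P_{tot}} = a C N f \left( V_{dd} + \frac{n U_t}{1 - \chi A} \right)^{2} \tag{6.50} \]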
The approximation that has just been used is the same as the one used to obtain
Eq. (6.23), namely that nUt/Vdd ≪ (1 − χA). The validity of this approximation can
be verified in the practical cases reported in the next chapters.
Finally, the expression of the optimal supply voltage (Eq. (6.25)) can be inserted
in Eq. (6.50) to obtain the optimal total power formula.
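\[ P_{tot}^{opt} \cong \frac{a C N f}{(1 - \chi A)^{2}} \left[ n U_t \left( \ln\left( \frac{I_0}{2 n U_t a C f} (1 - \chi A) \right) + 1 \right) + \chi B \right]^{2} \tag{6.51} \]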
Eq. (6.51) is a fundamental equation: it allows an analytical estimate of
the optimal total power directly from architectural parameters like
activity (a), number of cells (N), frequency (f), logical depth (LD, included in χ) and
technology parameters like transistor reference current (I0), sub-threshold slope (n),
alpha power law coefficient (α, included in A and B), delay coefficient (kt, included in
χ) and average capacitance C. The detailed discussion of the influence of these two
families of parameters on the optimal total power consumption will be carried out in
the next two chapters.
An alternative expression for the optimal total power can also be obtained by com-
bining Eq. (6.50) with Eq. (6.11). The resulting formula illustrates the relationship
between the optimal total power and k1:
\[ P_{tot}^{opt} \cong a C N f \left( \frac{n U_t}{1 - \chi A} \right)^{2} (2 k_1 + 1)^{2} \tag{6.52} \]
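The closed forms can be compared numerically. In the sketch below, the product aCNf is taken from nominal data as Pdyn_nom/Vdd_nom² (which follows from Pdyn = aCNfVdd²), using the nominal dynamic power of Table 6.4; the optimal Vdd, nUt and χ are assumed illustration values, and the function names are hypothetical.

```python
def ptot_eq_647(acnf, vdd, n_ut, chi, a_lin):
    """Eq. (6.47): total power before completing the square."""
    return acnf * (vdd ** 2 + 2.0 * vdd * n_ut / (1.0 - chi * a_lin))


def ptot_eq_650(acnf, vdd, n_ut, chi, a_lin):
    """Eq. (6.50): total power after adding the small squared term."""
    return acnf * (vdd + n_ut / (1.0 - chi * a_lin)) ** 2


def ptot_eq_652(acnf, k1, n_ut, chi, a_lin):
    """Eq. (6.52): total power expressed with k1."""
    return acnf * (n_ut / (1.0 - chi * a_lin)) ** 2 * (2.0 * k1 + 1.0) ** 2


# a*C*N*f from nominal data (Table 6.4): Pdyn_nom / Vdd_nom**2, in uW/V^2.
ACNF = 693.28 / 1.0 ** 2
# ASSUMED illustration values: optimal Vdd = 0.47 V, n*Ut = 0.038 V, chi = 0.30.
VDD_OPT, N_UT, CHI, A_LIN = 0.47, 0.038, 0.30, 0.731

k1 = VDD_OPT * (1.0 - CHI * A_LIN) / (2.0 * N_UT)        # Eq. (6.11)
p647 = ptot_eq_647(ACNF, VDD_OPT, N_UT, CHI, A_LIN)
p650 = ptot_eq_650(ACNF, VDD_OPT, N_UT, CHI, A_LIN)
p652 = ptot_eq_652(ACNF, k1, N_UT, CHI, A_LIN)

print(f"Eq. (6.47): {p647:.1f} uW, Eq. (6.50): {p650:.1f} uW, Eq. (6.52): {p652:.1f} uW")
print(f"relative weight of the added squared term: {100 * (p650 - p647) / p650:.2f} %")
```

With these values the added squared term contributes less than one percent, which illustrates why completing the square is acceptable.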
6.5 Summary
In this chapter, we have discussed the existence of a total power consumption optimum
characterized by a trade-off between dynamic and static power contributions. We have
also seen that typical values of k1 (optimal Pdyn over optimal Pstat ratio) are between
3 and 7.
After that, we have developed models for the optimal supply voltage and optimal
threshold voltage, showing that frequency modifications mainly influence V th, logical
depth modifications mainly affect V dd, whereas activity modifications have impacts
on both of them. Then, a total power comparison based on the rough assumption of
a quasi-constant k1 revealed that pipelining and parallelization are more effective for
large logical depths and high frequencies and that new technologies (which will tend
to have lower χ) will make these two transformations less interesting. Finally, we
observed that the condition for achieving a power saving through pipelining is more
easily fulfilled than the one for parallelization.
In the case where an absolute estimation of the optimal total power is required,
the expression of an approximated closed-form equation has been given.
The most important equations provided in this chapter are summarized below to
permit a quick access.
Starting from:
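\[ P_{tot} = P_{dyn} + P_{stat} = a C N f V_{dd}^{2} + N V_{dd} I_0 e^{-V_{th}/nU_t} \]
\[ V_{th}^{opt} = V_{dd}^{opt} - \chi \cdot (V_{dd}^{opt})^{1/\alpha} \quad \text{with: } \chi^{\alpha} = \frac{k_t \cdot C \cdot f \cdot LD}{I_0} \left( \frac{e}{\alpha n U_t} \right)^{\alpha} \]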
we obtained:
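\[ V_{th}^{opt} \cong n U_t \ln\left( \frac{I_0}{2 n U_t a C f} (1 - \chi A) \right) \cong n U_t \ln\left( \frac{P_{stat}^{nom}}{P_{dyn}^{nom}} \, \frac{V_{dd}^{nom} \, e^{(V_{th0}^{nom} - \eta V_{dd}^{nom})/nU_t}}{2 n U_t} \, (1 - \chi A) \right) \]
\[ V_{dd}^{opt} \cong \frac{n U_t \ln\left( \frac{I_0}{2 n U_t a C f} (1 - \chi A) \right) + \chi B}{1 - \chi A} \cong \frac{n U_t \ln\left( \frac{P_{stat}^{nom}}{P_{dyn}^{nom}} \, \frac{V_{dd}^{nom} \, e^{(V_{th0}^{nom} - \eta V_{dd}^{nom})/nU_t}}{2 n U_t} \, (1 - \chi A) \right) + \chi B}{1 - \chi A} \]
\[ P_{tot}^{opt} \cong \frac{a C N f}{(1 - \chi A)^{2}} \left[ n U_t \left( \ln\left( \frac{I_0}{2 n U_t a C f} (1 - \chi A) \right) + 1 \right) + \chi B \right]^{2} \]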
Chapter 7
Architectural impact on total
power
Many architectural parameters, e.g. activity a, number of cells N, logical depth LD
(contained in χ), influence the optimal total power consumption (Eq. (6.51)). Knowing
the effect of an architecture transformation (e.g. pipelining or parallelization) on
such parameters makes it possible to determine directly whether a power saving can be
obtained, just by using Eq. (6.51).
To discuss the impact of architectural modifications on the optimal total power
consumption, a set of thirteen 16 bit multipliers (described in detail in Chapter 5)
was designed in VHDL and synthesized using Synopsys Design Compiler (V2004.06).
The library used for the synthesis was the 90nm CMOS090GPSVT from ST Microelectronics.
The data characterizing these thirteen multipliers at their nominal values (the ones
provided by Synopsys DC) are reported in Table 7.1. Every multiplier works with a
frequency able to generate one completed multiplication every 16ns. This means, for
instance, that the 16 bit sequential architecture requires a local clock period of 1ns,
whereas the 2 times parallelized implementation has 32ns of time per block.
Nominal values          Cells  Nets  Area     Activity  Delay  LDeff  χ      χ^α    Vdd  Vth0   Pdyn    Pstat  Ptot
                                     [µm²]              [ns]                        [V]  [V]    [µW]    [µW]   [µW]
RCA basic               640    897   6655.9   0.518     5.99   179    0.390  0.211  1.0  0.353  732.9   8.5    741.38
RCA parallel 2          1289   1771  13495.0  0.274     6.08   91     0.258  0.107  1.0  0.353  834.6   16.9   851.49
RCA parallel 4          2644   3574  26803.4  0.136     6.14   46     0.171  0.054  1.0  0.353  895.6   33.7   929.29
RCA horiz. pipeline 2   688    945   7329.8   0.413     4.13   123    0.311  0.146  1.0  0.353  678.8   9.4    688.18
RCA horiz. pipeline 4   816    1073  8745.7   0.338     2.95   88     0.254  0.104  1.0  0.353  735.1   11.3   746.31
RCA diag. pipeline 2    701    958   7427.5   0.432     3.67   110    0.290  0.129  1.0  0.353  722.1   9.5    731.60
RCA diag. pipeline 4    823    1082  8930.1   0.358     2.27   68     0.216  0.08   1.0  0.353  786.8   11.5   798.33
Wallace basic           789    1037  7119.0   0.347     2.45   73     0.227  0.086  1.0  0.353  542.5   9.3    551.80
Wallace parallel 2      1604   2068  14437.8  0.181     2.54   38     0.152  0.045  1.0  0.353  644.8   18.4   663.20
Wallace parallel 4      3252   4146  28627.6  0.091     2.61   19     0.102  0.023  1.0  0.353  717.5   36.7   754.15
Sequential basic        289    323   2565.1   2.888     0.86   411    0.645  0.485  1.0  0.353  2456.3  3.2    2459.49
Sequential-wallace      399    477   3590.3   1.080     3.09   369    0.605  0.436  1.0  0.353  1093.0  4.7    1097.71
Sequential parallel 2   594    628   4658.2   1.742     0.87   208    0.427  0.245  1.0  0.353  2230.8  5.7    2236.57
Table 7.1: Nominal values for thirteen 16 bit multipliers based on the STM 90nm
technology and transistors of the SVT type at a throughput frequency of 62.5MHz
The definitions of the parameters reported in Table 7.1 are:
• Cells: the number of design cells. One cell can be a very simple one (like an
inverter) or a complex one (like a full adder);
• Nets: the number of inter-cell nets in the design;
• Area: the area of the design core; pads and routing spaces are not included;
• Activity: the average number of switching nets over the total number of nets per
clock period. These values are obtained by an event-driven simulation under
ModelSIM (from MentorGraphics). The results are based on the multiplication
of uniformly distributed pseudo-random data during 2µs; standard library
delays are used so that glitches can be accounted for;
• Delay: the typical combinatorial delay from register output to register input on
the critical path;
• LD eff: the effective logical depth in equivalent NAND2 gates. The term “ef-
fective” is related to the fact that the length of the logical depth is considered
against the throughput frequency or one-complete-multiplication frequency. In
the case of a parallelization, for instance, LD eff corresponds to half of the real
LD because each block has two clock periods to compute one multiplication.
Similarly, in the case of the sequential implementation, the LD eff represents 16
times the real LD because to complete one multiplication, 16 1ns clock periods
are required. The delay of the reference NAND2 gate has been estimated by
building a 1000 NAND2 inverter chain. The inversion effect has been obtained
by tying the two inputs together. The resulting delay per gate is 33.5ps for the
SVT transistor type;
• χ and χα: these two parameters are obtained by using Eq. (6.3) from the nominal
V dd, V th and delay. These parameters are reported there to be easily accessible
during the following discussions;
• Nominal Vdd: the nominal technology supply voltage;
• Nominal Vth0: the nominal technology threshold voltage;
• Nominal Pdyn: the nominal dynamic power consumption as reported by Syn-
opsys DC;
• Nominal Pstat: the nominal static power consumption as reported by Synopsys
DC;
• Nominal Ptot: the nominal total power consumption obtained by summing the
nominal Pdyn and the nominal Pstat.
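The definition of LD eff given in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the `throughput_factor` argument are hypothetical, the real logical depth is taken as the critical-path delay divided by the 33.5 ps NAND2 delay quoted above, and the 2 ns critical path used in the usage lines is an illustrative value, not table data.

```python
NAND2_DELAY_PS = 33.5  # reference NAND2 delay for the SVT type, as quoted in the text


def effective_logical_depth(delay_ps, throughput_factor):
    """Return LD eff in equivalent NAND2 gates.

    delay_ps          -- critical-path delay of one block, in picoseconds
    throughput_factor -- hypothetical scaling argument: 1 for a plain
                         single-cycle design, 0.5 for a two-fold
                         parallelization (two throughput periods per result),
                         16 for the 16-cycle sequential implementation
    """
    real_ld = delay_ps / NAND2_DELAY_PS  # real logical depth in NAND2 gates
    return real_ld * throughput_factor


# Illustrative inputs (assumptions, not table data): a 2 ns critical path.
ld_basic = effective_logical_depth(2000.0, 1)
ld_parallel = effective_logical_depth(2000.0, 0.5)   # half of the real LD
ld_sequential = effective_logical_depth(2000.0, 16)  # 16 times the real LD
```

The single scaling argument keeps the three cases described in the text (basic, parallelized, sequential) in one formula.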
With the data reported in Table 7.1, the optimal supply voltage Vdd and the
optimal threshold voltage Vth can now be calculated. The values of Vdd and Vth
in Table 7.2 are obtained in two different ways. The first, called numerical
computation, is a high-resolution numerical search for the optimal supply and
threshold voltages. This approach is very time-consuming: it requires evaluating
the total power consumption for a large number of (Vdd, Vth) pairs (100,000 in
our case) using the non-approximated equations described in Chapter 3. Moreover,
this type of calculation gives no insight into the real effect of each parameter
on the final result. However, results calculated in this way are precise (up to
the precision of the models used), and for this reason they serve as the
reference for the second approach, called analytical approximation, which is
based on Eq. (6.23) and Eq. (6.25). In this latter case, the optimal Vdd and the
optimal Vth can easily be calculated from the values reported in Table 7.1. The
error between the reference data (numerical computation) and the analytical
approximation is also reported in Table 7.2. All the errors remain bounded to
a few percent.
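The numerical computation described above amounts to an exhaustive scan of (Vdd, Vth) pairs. The following Python sketch shows the idea only. The power model has the form of Eq. (9.1); the timing check `meets_timing` is a hypothetical alpha-power-law constraint introduced here for illustration and is not the Chapter 3 model; all parameter values are made up.

```python
import math

UT = 0.0259  # thermal voltage near room temperature [V]


def total_power(vdd, vth, a, n_cells, c, f, i0, n):
    """Dynamic plus static power, in the form of Eq. (9.1)."""
    p_dyn = a * n_cells * c * f * vdd ** 2
    p_stat = n_cells * vdd * i0 * math.exp(-vth / (n * UT))
    return p_dyn + p_stat


def meets_timing(vdd, vth, f, ld, kt, alpha):
    """Hypothetical alpha-power-law timing check (an assumption, not the
    thesis model): LD gate delays must fit in one throughput period."""
    if vdd <= vth:
        return False
    gate_delay = kt * vdd / (vdd - vth) ** alpha
    return ld * gate_delay <= 1.0 / f


def grid_search(power, feasible, vdd_range, vth_range, steps):
    """Scan steps x steps (Vdd, Vth) pairs and return the feasible pair with
    the lowest power as (power, vdd, vth), or None if no pair is feasible."""
    best = None
    for i in range(steps):
        vdd = vdd_range[0] + (vdd_range[1] - vdd_range[0]) * i / (steps - 1)
        for j in range(steps):
            vth = vth_range[0] + (vth_range[1] - vth_range[0]) * j / (steps - 1)
            if not feasible(vdd, vth):
                continue
            p = power(vdd, vth)
            if best is None or p < best[0]:
                best = (p, vdd, vth)
    return best


# Made-up parameters, for illustration only.
params = dict(a=0.5, n_cells=3000, c=2e-15, f=62.5e6, i0=1e-6, n=1.4)
best = grid_search(
    lambda vdd, vth: total_power(vdd, vth, **params),
    lambda vdd, vth: meets_timing(vdd, vth, f=62.5e6, ld=100, kt=3e-11, alpha=1.5),
    vdd_range=(0.2, 1.0),
    vth_range=(0.0, 0.5),
    steps=101,
)
```

A grid of about 316 by 316 points would give roughly the 100,000 pairs mentioned in the text. The cost grows with the square of the resolution, which is why the analytical approximation is attractive.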
In Fig. 7.1 and Fig. 7.2, the same results are reported graphically, which makes
them easier to read.
Figure 7.1: Optimal Vdd calculated with numerical computation and using Eq. (6.25)
(STM 90nm, 62.5MHz). Panels: optimal Vdd [V] and Vdd approximation error [%] for
each of the thirteen architectures; series: numerical computation and analytical
approximation.
Optimal values
                          Numerical      Analytical     Approx.      Numerical                     Analytical       Approx.
                          computation    approximation  error        computation                   approx.          error
                          Vdd    Vth     Vdd    Vth     Vdd   Vth    Pdyn     Pstat    Ptot        k1    Ptot       Ptot
                          [V]    [V]     [V]    [V]     [%]   [%]    [µW]     [µW]     [µW]              [µW]       [%]
RCA basic                 0.437  0.201   0.444  0.206   1.5   2.8    140.08    41.62    181.70     3.4    183.16    0.8
RCA parallel 2            0.368  0.227   0.376  0.233   2.2   2.9    113.10    35.17    148.27     3.2    150.40    1.4
RCA parallel 4            0.344  0.254   0.351  0.260   2.1   2.4    105.83    32.01    137.84     3.3    140.05    1.6
RCA horiz. pipeline 2     0.385  0.211   0.393  0.217   2.1   2.8    100.53    31.72    132.25     3.2    133.99    1.3
RCA horiz. pipeline 4     0.351  0.216   0.360  0.223   2.5   3.3     90.47    29.76    120.23     3.0    122.27    1.7
RCA diag. pipeline 2      0.367  0.209   0.375  0.216   2.4   3.2     97.11    31.70    128.81     3.1    130.79    1.5
RCA diag. pipeline 4      0.325  0.216   0.335  0.223   3.0   3.4     83.15    28.67    111.82     2.9    114.37    2.3
Wallace basic             0.339  0.222   0.348  0.228   2.7   3.0     62.44    20.67     83.11     3.0     84.64    1.8
Wallace parallel 2        0.321  0.244   0.329  0.251   2.4   2.8     66.23    21.33     87.56     3.1     89.30    2.0
Wallace parallel 4        0.320  0.269   0.326  0.275   2.0   2.1     73.29    22.24     95.53     3.3     97.17    1.7
Sequential basic          0.563  0.107   0.570  0.109   1.3   2.0    777.27   237.90   1015.17     3.3   1045.75    3.0
Sequential-wallace        0.600  0.157   0.608  0.157   1.3   0.0    393.74   102.14    495.88     3.9    512.09    3.3
Sequential parallel 2     0.374  0.139   0.387  0.147   3.5   6.4    312.19   122.73    434.92     2.5    443.82    2.0
Table 7.2: Optimal Vdd, Vth and Ptot. These values are calculated once with a
numerical computation and once using Eq. (6.25) for Vdd, Eq. (6.23) for Vth and
Eq. (6.51) for Ptot. Relative errors are shown in the corresponding columns. The A
and B factors used are those for Vdd ∈ [0.3V; 0.6V]
Figure 7.2: Optimal Vth calculated with numerical computation and using Eq. (6.23)
(STM 90nm, 62.5MHz). Panels: optimal Vth [V] and Vth approximation error [%] for
each of the thirteen architectures; series: numerical computation and analytical
approximation.
The values of the optimal Vdd and Vth show, for instance, the effect of
parallelization: with this transformation, Vdd is reduced and Vth is increased.
Both trends favor a lower total power by reducing dynamic and static power at
the same time. It is also interesting to note that the reduction of the supply voltage
is smaller for Wallace than for RCA. This is easily explained by the lower χ factor
of the Wallace implementation. Since the Wallace is already a fast architecture
compared to the required frequency (62.5 MHz), the gain from the reduction of the
effective logical depth (LD eff) is only marginal, whereas it is much more
significant for the RCA multiplier.
It is also possible to observe that Vth is almost constant under the pipeline
transformation, as was deduced in Chapter 6. Finally, the large delay involved in the
sequential architectures (corresponding to a high χ) clearly leads to a high Vdd and a
low Vth, both of which negatively impact the total power.
Nevertheless, the optimal Vdd and Vth are not required to compute the optimal
total power consumption, thanks to Eq. (6.51). In fact, all required parameters can be
obtained from Table 7.1 without intermediate steps. Once more, the results of
our analytical approximation are compared to the numerical computation, where no
approximations are applied. Results are reported in Table 7.2 with the corresponding
errors. The same results are also shown graphically in Fig. 7.3.
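The Ptot error column of Table 7.2 can be re-derived from the other columns. The short Python check below is a sketch: the values are copied from Table 7.2, while the dictionary layout and the function name are choices made here for illustration. Small differences with respect to the printed error column can arise because the table values are rounded.

```python
# (Pdyn, Pstat, analytical Ptot) in µW, copied from Table 7.2.
TABLE_7_2 = {
    "RCA basic": (140.08, 41.62, 183.16),
    "RCA parallel 2": (113.10, 35.17, 150.40),
    "RCA parallel 4": (105.83, 32.01, 140.05),
    "RCA horiz. pipeline 2": (100.53, 31.72, 133.99),
    "RCA horiz. pipeline 4": (90.47, 29.76, 122.27),
    "RCA diag. pipeline 2": (97.11, 31.70, 130.79),
    "RCA diag. pipeline 4": (83.15, 28.67, 114.37),
    "Wallace basic": (62.44, 20.67, 84.64),
    "Wallace parallel 2": (66.23, 21.33, 89.30),
    "Wallace parallel 4": (73.29, 22.24, 97.17),
    "Sequential basic": (777.27, 237.90, 1045.75),
    "Sequential-wallace": (393.74, 102.14, 512.09),
    "Sequential parallel 2": (312.19, 122.73, 443.82),
}


def relative_error_percent(reference, approximation):
    """Relative deviation of the approximation from the reference, in percent."""
    return 100.0 * abs(approximation - reference) / reference


errors = {}
for name, (p_dyn, p_stat, p_tot_analytical) in TABLE_7_2.items():
    p_tot_numerical = p_dyn + p_stat  # numerical optimum: Ptot = Pdyn + Pstat
    errors[name] = relative_error_percent(p_tot_numerical, p_tot_analytical)

worst_case = max(errors, key=errors.get)  # design with the largest Ptot error
```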
Figure 7.3: Optimal total power calculated with numerical computation and using
Eq. (6.51) (STM 90nm, 62.5MHz). Panels: optimal total power [µW] and Ptot
approximation error [%] for each of the thirteen architectures; series: numerical
computation and analytical approximation.
It is interesting to see that the error of Eq. (6.51) over such a diverse set of
implementations is always less than 3.5%. The second, quite evident observation is
the huge optimal total power consumption of the three sequential implementations
compared to the non-sequential ones. The explanation for this effect can be found by
looking at the χ factor (Eq. (6.3)). This parameter, which establishes the relationship
between the optimal Vdd and the optimal Vth, directly depends on the effective logical
depth, which is very large for these three architectures. A large logical depth (i.e. a
large χ) results in a high optimal Vdd (which increases the dynamic power
quadratically and the static power linearly), and in a low optimal Vth (which increases
the static power exponentially). Moreover, sequential structures also present large
activities. Because their activity is defined over a period of the throughput clock, it is
not uncommon to observe activities higher than 1. Unfortunately, this high activity (a)
is not counterbalanced by a small enough number of cells (N), which results in a
much higher number of transitions (a · N) compared to the other implementations.
As stated in Eq. (6.51), a large number of transitions also penalizes the optimal total
power consumption.
The RCA architecture is based on a very regular structure that permits many
variations to be implemented. Both parallelization and pipelining transformations
shorten the effective logical depth (which corresponds to a reduction of χ, although
not proportionally). In this case, the relaxed timing constraints make it possible
to further reduce Vdd and increase Vth, thereby reducing the optimal total power
consumption.
The diagonal pipelined versions present a lower χ and a lower activity compared
to the classical horizontal pipeline versions, and hence they feature a lower optimal
total power consumption. Nevertheless, the gain in power between the two ways
of pipelining is small, and the time spent by the designer to correctly implement a
diagonal pipeline may not be worth the resulting gain in power.
Finally, the Wallace family presents the fastest circuits of our set. By applying a
parallelization to the basic version, we observe that, as in the RCA family, the
logical depth is reduced and hence χ is also reduced. Once more, this results in a lower
Vdd and a higher Vth, which should be synonymous with power saving. However, if we
look at the resulting optimal power, we see that the Wallace basic version has a lower
optimal total power than the two parallelized versions. The explanation is that,
the Wallace architecture being already a fast circuit (compared
to the desired clock frequency), the reduction of χ obtained by parallelization is only
marginal and its benefit is canceled by the increase of the static power due to the
doubling of the hardware and the overhead introduced to multiplex data. This is not the
case for the RCA because its χ is higher. This example illustrates very well how the
same architectural transformation can yield completely different results. Fortunately,
all these cases are well modeled by Eq. (6.51).
7.1 Summary
In this chapter we have shown how architectural parameters such as the activity a,
the logical depth LD and the frequency f can modify the optimal supply voltage Vdd,
the optimal threshold voltage Vth and finally the optimal total power Ptot of a design.
In particular, we have pointed out how sequential circuits, characterized by very slow
architectures (large LD), present a huge power consumption compared to the
other designs. Hence, unless a circuit working at extremely low frequency is needed,
sequential implementations are not well suited for low power when working at the
optimal point.
On the other hand, fast circuits (showing a short LD) like Wallace are not good
candidates for parallelization because the large increase of static consumption, caused
by the hardware replication, easily cancels the small benefit obtained from the reduced
critical path.
For an architecture with an average logical depth like the RCA, we can observe
that a moderate power gain can be obtained through parallelization, but even in this
case, the pipeline transformation yields better results with a much smaller area, which
also corresponds to lower production costs.
This leads us to the conclusion that, in designs where the static power consumption
is not negligible, parallelization is rarely a good choice and most of the time pipelining
should be preferred.
Chapter 8
Technology impact on total power
As explained in Chapter 6, the optimal total power depends not only on architectural
parameters, but also on technology parameters. In the past, it was in
general not possible to change these parameters, because the designer had a given
technology to use and was not able to modify them. This may change in the future.
Until now, new technology nodes always offered better performance and better
power characteristics than the preceding ones, but nowadays, with the strong
increase in leakage current, a performance gain can correspond to a power loss. For
this reason, technologies now start to exist in different “flavors”, which are in
general characterized by their Vth. For instance, the technology used in this thesis
offers three different types of transistors, namely Low Vth (LVT), Standard Vth
(SVT) and High Vth (HVT). Moreover, two of these three kinds can be implemented
together on the same chip. Under such conditions, it is interesting to determine
which of the proposed flavors is best suited for a given task. Before that, we will
consider the hypothetical case where the technology parameters could be freely
modified independently of each other. This will allow us to understand the influence
of each parameter on the optimal total power.
8.1 Technology as a free parameter
In general, technology parameters (I0, n, α, kt, C) are not independent, and the
variation of one of them results in a variation of the others. Nevertheless, to understand
the importance and the effective influence of a specific parameter, it is useful to
observe how the total power is modified by single-parameter variations. This is shown
in Fig. 8.1 for an RCA 16 bit multiplier. The nominal case (no technology parameter
variation) corresponds to the RCA basic structure reported in Table 7.1.
Figure 8.1: Technology parameters influence on a RCA 16 multiplier in a SVT STM
90nm technology. Plot (RCA STM 090 SVT): optimal total power [µW] on a logarithmic
axis versus the ratio parameter'/parameter (0.5 to 2.5), with one curve each for
I0, n, kt, α and C.
The abscissa represents the ratio of the new (modified) parameter to the original
one, while the ordinate represents the optimal total power consumption.
The most sensitive parameter is α. This parameter comes from the alpha-power law
fitting formula and represents the velocity saturation of electrons/holes. Typically,
switching to a newer (finer) technology corresponds to a lower α. From Fig. 8.1,
we can see how this penalizes the optimal total power. Indeed, a low α
corresponds to a reduced Ion current, which also means a slower technology. In
practice, the speed reduction caused by α is largely counterbalanced by the reduced
capacitances and kt.
Moreover, it is interesting to observe that an increase of I0 results in a very
moderate power saving. The explanation is that a larger I0 not
only increases the static power, but also increases the on-current in the same proportion.
Hence the speed-related parameter χ is reduced, yielding a moderate
gain. Conversely, a reduction of I0 can strongly penalize the total power. Once again,
the delay increase easily explains this behavior.
The behavior of the capacitance C and of the delay parameter kt is not really surprising.
An increase of C means an increase of the delay (as for kt) and so a
worse optimal total power.
Finally, the curve of n shows a significant increase of the optimal total power for
an increase of the parameter, and vice versa. In fact, an increase of the factor n is
equivalent to a reduction of Vth, i.e. an increase of the leakage current.
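The equivalence between a larger n and a lower Vth can be made explicit on the leakage term alone. The following lines are a sketch that assumes the static power has the exponential form $I_0 e^{-V_{th}/(n U_t)}$ used in the total power model; the scaling factor $k$ is notation introduced here for illustration.

```latex
\[
  e^{-\frac{V_{th}}{(k\,n)\,U_t}} \;=\; e^{-\frac{V_{th}/k}{n\,U_t}}, \qquad k > 1
\]
```

Under this assumption, multiplying n by k gives the same leakage as keeping n and dividing Vth by k, which is why the curve of n rises so quickly.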
To summarize, the ideal technology would be characterized by a low C, kt and n,
whereas I0 and α should be as high as possible. This may not be the trend in coming
technologies, as the example of α shows.
8.2 Application to technology selection
The 90nm technology from ST Microelectronics is available with three different
transistor types (LVT, SVT and HVT). The optimal total power consumption of the 13
multipliers of Chapter 5 has been calculated for all existing flavors. Table 8.1 shows
the results. By looking at the bold values, which represent the best technology choice
for a given architecture, we can see that the best transistor type is not always the
same. In particular, the HVT is the best in 6 cases, the SVT in 5 cases and the LVT
in 2 cases.
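The count of best flavors can be reproduced directly from the values of Table 8.1. The Python sketch below copies those values; the dictionary layout and variable names are choices made here for illustration.

```python
from collections import Counter

FLAVORS = ("LVT", "SVT", "HVT")

# Optimal Ptot in µW for (LVT, SVT, HVT), copied from Table 8.1.
TABLE_8_1 = {
    "RCA basic": (197.43, 181.70, 182.11),
    "RCA parallel": (179.39, 148.27, 152.53),
    "RCA parallel 4": (176.46, 137.84, 135.16),
    "RCA horiz. Pipeline 2": (151.93, 132.25, 128.06),
    "RCA horiz. Pipeline 4": (142.77, 120.23, 113.34),
    "RCA diag. Pipeline 2": (143.44, 128.81, 129.81),
    "RCA diag. Pipeline 4": (136.82, 111.82, 112.03),
    "Wallace basic": (80.26, 83.11, 96.95),
    "Wallace parallel": (104.17, 87.56, 81.13),
    "Wallace parallel 4": (121.57, 95.53, 85.98),
    "Sequential basic": (1547.98, 1015.17, 1007.49),
    "Sequential-wallace": (358.37, 495.88, 483.10),
    "Sequential parallel 2": (620.49, 434.92, 486.46),
}

# Flavor with the lowest optimal total power for each architecture.
best_flavor = {
    name: FLAVORS[values.index(min(values))] for name, values in TABLE_8_1.items()
}

# How often each flavor wins, and the overall best architecture.
wins = Counter(best_flavor.values())
overall_best = min(TABLE_8_1, key=lambda name: min(TABLE_8_1[name]))
```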
To better illustrate these results, they have been plotted in Fig. 8.2. Data
corresponding to the sequential versions are omitted to permit a better reading of the
other cases.
                                 Optimal Ptot [µW]
Design Name                  LVT         SVT         HVT
RCA basic                  197.43      181.70*     182.11
RCA parallel               179.39      148.27*     152.53
RCA parallel 4             176.46      137.84      135.16*
RCA horiz. Pipeline 2      151.93      132.25      128.06*
RCA horiz. Pipeline 4      142.77      120.23      113.34*
RCA diag. Pipeline 2       143.44      128.81*     129.81
RCA diag. Pipeline 4       136.82      111.82*     112.03
Wallace basic               80.26*      83.11       96.95
Wallace parallel           104.17       87.56       81.13*
Wallace parallel 4         121.57       95.53       85.98*
Sequential basic          1547.98     1015.17     1007.49*
Sequential-wallace         358.37*     495.88      483.10
Sequential parallel 2      620.49      434.92*     486.46
Table 8.1: Optimal total power consumption of thirteen 16 bit multipliers in all STM
90nm technology flavors. The bold values (marked with an asterisk) represent the best
technology choice for the given architecture.
Looking at the data for the three Wallace implementations, we can observe the
effect of parallelization under different technology conditions. If we consider the HVT
type (high Vth, hence low static power), we see that the parallelization of the basic
implementation is interesting from a power point of view because doubling the
hardware (and so doubling the static power) costs less than what is gained from the
reduction of the supply voltage and the increase of the threshold voltage made possible
by the relaxed timing constraints. Nevertheless, if the transformation is iterated one
more time, leading to the Wallace parallel 4, the power figure starts to degrade, because
Vdd and Vth are now only slightly modified, whereas the static power is doubled compared
to Wallace parallel 2.
In the case of the SVT (standard Vth), the two-fold parallelization is already a bad
transformation for low power, and the four-fold parallelized version is even worse.
This can be explained by a greater static power compared to the HVT, which penalizes
all types of parallelization for the Wallace structure.
Finally, the results for LVT (low Vth, hence high static power) clearly show a
significant increase of the optimal total power for each parallelized version. Once more,
it is the doubling (or quadrupling) of the hardware that cannot be tolerated in a
flavor with so much leakage.
Figure 8.2: Optimal total power consumption of ten 16 bit multipliers in all STM
90nm technology flavors. Bar chart of the optimal total power [µW] (0 to 250) for the
LVT, SVT and HVT flavors of each non-sequential architecture.
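The Wallace discussion above can be checked with simple arithmetic on the Table 8.1 values. The sketch below computes the relative change in optimal total power for each parallelization step; the helper name and data layout are assumptions made for illustration.

```python
# Optimal Ptot in µW for Wallace (basic, parallel, parallel 4), from Table 8.1.
WALLACE = {
    "LVT": (80.26, 104.17, 121.57),
    "SVT": (83.11, 87.56, 95.53),
    "HVT": (96.95, 81.13, 85.98),
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative = saving)."""
    return 100.0 * (after - before) / before


# For each flavor: (basic -> parallel, parallel -> parallel 4).
changes = {
    flavor: (percent_change(basic, par2), percent_change(par2, par4))
    for flavor, (basic, par2, par4) in WALLACE.items()
}
```

Only the first HVT step is negative, i.e. a saving; every other step increases the optimal total power, in line with the discussion.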
On the other hand, the parallelization of the RCA family remains interesting for all
three transistor types. This can be explained by the fact that the RCA multiplier
has a longer logical depth and hence a higher χ compared to the Wallace. For this
reason, the parallelization has a much larger effect on the reduction of Vdd and
the increase of Vth, which can outweigh the increase of hardware and hence of static
power.
From Fig. 8.2, it is also possible to note that the pipeline transformations on the
RCA multipliers present a better power consumption than the parallelized
versions. This comes from the fact that pipelining can relax the timing constraints
without doubling the static power through hardware replication. It is hence
possible to conclude that for technologies characterized by significant leakage power,
a situation probably representative of all future technologies, pipelining should
be preferred over parallelization. This also needs to be understood by CAD
programmers, so that powerful automated pipelining tools can replace
the present algorithms, which rely massively on parallelization.
Considering all the architectures and transistor types, the best choice for a
frequency of 62.5MHz is the Wallace basic implemented with the LVT transistor flavor.
8.3 Discussion on the modifiability of Vth
All the theory developed in the last chapters considers Vth as a freely modifiable
parameter. This is not the way people normally think about the threshold voltage,
probably because modifying Vth is not an easy task. In the previous section,
we discussed the possibility of selecting the best technology flavor from a given
set. This does not allow a continuous modification of Vth, but still permits
modifying it in a discrete way. An important drawback of such an approach is that
Vth cannot be dynamically modified to follow the various runtime needs. In this
section, two other possible ways to act on the threshold voltage are presented.
8.3.1 Body biasing
In Chapter 2, we discussed the body effect, showing how a voltage between the body
and the source of a transistor (Vbs) can modify the threshold voltage. The body
biasing equation is repeated here:
\[ V_{th} = V_{th0} - \eta V_{ds} - \gamma V_{bs} \qquad (8.1) \]
with η the DIBL effect coefficient and γ the body bias coefficient.
This is clearly a simplification of the relationship between Vth and Vbs, but it
is useful to understand the principle. More precisely, the body bias can be
modeled by [57]:
\[ V_{th}(V_{bs}) = V_{th}(V_{bs} = 0) - \frac{\sqrt{2 q \varepsilon_S N_A}}{C_0} \left( \sqrt{2\psi_B + V_{bs}} - \sqrt{2\psi_B} \right) \qquad (8.2) \]
with q the elementary charge, $\varepsilon_S$ the silicon permittivity, $N_A$ the acceptor
impurity density in the channel, $C_0$ the gate oxide capacitance per unit area, $\psi_B$ the
Fermi potential and Vbs the voltage between body and source.
From Eq. (8.2), we observe that the ability to modify Vth is greatest for Vbs
near zero, whereas it decreases in a typical square-root way for larger values of Vbs.
Moreover, the pre-factor $\sqrt{2 q \varepsilon_S N_A}/C_0$ tends to be smaller with newer
technologies due to the reduction of the oxide thickness, and hence the range over which
Vth can be modified will tend to be reduced on all new technology nodes.
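The diminishing effect of Vbs described above can be illustrated numerically. The Python sketch below evaluates the bracketed term of Eq. (8.2) in 0.1 V steps. The pre-factor and the Fermi potential are hypothetical values chosen for illustration only; they are not extracted from the STM technology.

```python
import math


def vth_shift(vbs, prefactor, psi_b):
    """Threshold-voltage change relative to Vbs = 0, in the form of Eq. (8.2).

    prefactor stands for sqrt(2*q*eps_S*N_A)/C0 and psi_b for the Fermi
    potential; both are passed in directly rather than derived from device data.
    """
    return -prefactor * (math.sqrt(2.0 * psi_b + vbs) - math.sqrt(2.0 * psi_b))


PREFACTOR = 0.15  # hypothetical pre-factor [V^0.5]
PSI_B = 0.4       # hypothetical Fermi potential [V]

# Vth reduction (in mV) gained by each successive 0.1 V increase of Vbs.
steps_mv = []
for k in range(5):
    low, high = 0.1 * k, 0.1 * (k + 1)
    gain = vth_shift(low, PREFACTOR, PSI_B) - vth_shift(high, PREFACTOR, PSI_B)
    steps_mv.append(1000.0 * gain)
```

Each entry of `steps_mv` is smaller than the previous one, which is the square-root saturation mentioned in the text.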
Another important point is the sign of Vbs. In fact, the body can have a potential
higher or lower than the source. When the body potential is higher than the source for
the NMOS and lower than the source for the PMOS, the polarization is called forward
body biasing (FBB) and it corresponds to a reduction of the threshold voltage. The
opposite case, i.e. a body potential lower than the source for the NMOS and higher than
the source for the PMOS, is called reverse body biasing (RBB) and results in an increase
of Vth.
While RBB has no limit on the maximal Vbs other than the maximum reverse-bias
junction potential, this is not the case for FBB. In FBB, if the potential exceeds
0.5V, the p-n junction between body and source starts to conduct, creating a
very high current flow. For this reason, the FBB always needs to stay below 0.5V.
As an example, an FBB of 0.5V (the maximum applicable) on the 90nm STM
SVT technology yields a Vth reduction of only 40 mV, whereas the same FBB
corresponds to a Vth variation of 60mV for the 130nm STM technology.
H. Ananthan et al. showed in [58] [59] that FBB has the advantage of reducing
the sensitivity of Vth to variations in gate length, oxide thickness and channel doping,
and it is hence preferable to RBB.
The principle of body bias has been successfully applied in circuits like the
150MHz discrete cosine transform core processor of Kuroda et al. [60], the
200MHz processor of Mizuno et al. [13] and the 1GHz router of Narendra et al. [61].
8.3.2 Transistor size modification
Another way to modify the threshold voltage of a transistor is by modifying its physical
dimensions. The important dimensions of the transistor are the width (W) and the
length (L) of its channel.
Fig. 8.3 (NMOS) and Fig. 8.4 (PMOS) show the plots of Vth versus W for the
130nm STM (HCMOS9GP LL) technology. These graphs are part of the STM
documentation and the details on how they were generated are not known. Nevertheless,
these plots are very useful to understand the behavior of Vth under transistor resizing.
From these graphs, we can see that the influence of W on Vth presents a
huge asymmetry between the NMOS and the PMOS transistors. For instance,
the maximal change of Vth due to a modification of W from 0.3 µm to 10 µm (which
is a very large modification) is about 60 mV for the NMOS, whereas
it is only 6-7 mV for the PMOS. This means that any scaling of the device will
create completely unbalanced charging/discharging delays that will result in high
short-circuit currents, not to mention the capacitance increase due to the larger channel
area. Although the modification of the channel width is probably not the best technique
to modify Vth, it is reported here for completeness.
Figure 8.3: Vth vs. W for a NMOS transistor. Curves correspond to Slow-Slow(SSA),
Typical-Typical(TT) and Fast-Fast(FFA) corners
Figure 8.4: Vth vs. W for a PMOS transistor. Curves correspond to Slow-Slow(SSA),
Typical-Typical(TT) and Fast-Fast(FFA) corners
The other modifiable dimension of the transistor is the channel length. In Fig. 8.5
(NMOS) and Fig. 8.6 (PMOS), the curves of Vth versus L are plotted for the
HCMOS9GP LL 130 nm STM technology. There, we can see that for small increases
of the channel length, both NMOS and PMOS behave in a similar way, with a relatively
steep slope. This is exactly the idea exploited by Gupta et al. [62]. They
propose to slightly increase (by less than 10%) the channel length L of devices that
are not on the critical path, achieving a static power reduction of about 30% with a
delay penalty smaller than 10% in a 130nm technology.
Figure 8.5: Vth vs. L for a NMOS transistor. Curves correspond to Slow-Slow(SSA),
Typical-Typical(TT) and Fast-Fast(FFA) corners
Figure 8.6: Vth vs. L for a PMOS transistor. Curves correspond to Slow-Slow(SSA),
Typical-Typical(TT) and Fast-Fast(FFA) corners
It is also important to note that transistor size modifications influence more
parameters than just the threshold voltage Vth, and the Vth modifications obtained
are very moderate. For these reasons, technology flavor selection and body bias are
the preferable techniques for modifying the threshold voltage Vth.
8.4 Summary
In this chapter we have discussed the influence of the principal technology parameters
on the optimal total power. In particular, we have observed that an ideal technology
would be characterized by low C, kt and n, whereas I0 and α should be as high as
possible. Unfortunately, this will probably not be the trend of future technologies.
Then we have analyzed thirteen different 16 bit multipliers synthesized in the
three technology flavors offered by the STM 90nm technology. This
illustrates very well how the technology can be used as a design parameter to achieve
the lowest possible total power consumption. In the examples considered, the best
architecture/technology combination is the Wallace basic with the LVT transistor type.
Finally, two other methods for modifying the threshold voltage have been presented,
namely body bias and transistor resizing. For both techniques, advantages and
limitations have been discussed.
Chapter 9
Total power comparison for fixed
Vdd and fixed Vth
This chapter presents a new methodology for comparing several architectures
performing the same function and for selecting, among them, the one with the lowest
total power consumption under fixed supply voltage (Vdd), threshold voltage (Vth)
and frequency (f) constraints. This situation is much more common for designers
than the one proposed in Chapter 6, because most of the time they cannot choose the
technology to use. Moreover, this approach could be applied in parallel with the free
Vdd/Vth one: the best Vth and Vdd could be chosen for the main block of
the design, and all the other blocks would need to adapt. Thanks to the theory of this
chapter, secondary blocks can be optimized, too.
The lowest total power consumption, which is closely related to the architecture,
clearly results from a trade-off between static and dynamic power. Static power
reduction leads to the selection of architectures with a small number of cells rather
than a small number of transitions, as was the case when only dynamic power
reduction was targeted. As an example, this methodology is applied to the selection
of the lowest power consuming architecture among a set of thirteen 16 bit multipliers
(described in Chapter 5). Moreover, by understanding the mechanism behind this
selection, it is possible to propose and implement new architectures that consume
even less power, as reported in Section 9.4.
9.1 Total power comparison
To be able to compare the consumption of two architectures under the same supply
voltage Vdd, threshold voltage Vth and frequency f, we need a definition of the total
power. Once more, the equation used is the one described in Chapter 3:
\[ P_{tot} = P_{dyn} + P_{stat} = a C N f V_{dd}^{2} + N V_{dd} I_0 e^{-\frac{V_{th}}{n U_t}} \qquad (9.1) \]
The equivalent capacitance C is roughly related to the average cell capacitance and
can be obtained by dividing the dynamic power consumption by the number of
transitions (a · N), the squared supply voltage and the working frequency. Therefore
C is not exactly the same for two circuits implementing the same function because it
varies with their respective distribution of activity and capacitance products over the
nodes. The same observation holds for the leakage current I0, which represents an
average static consumption per cell over the entire circuit, although some cells clearly
leak more than others. Considering that the methodology presented here is
applied to the comparison of architectures performing the same task, we assume that
the equivalent capacitance C and the average leakage current I0 remain sufficiently
similar across the set of architectures.
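The extraction of the equivalent capacitance C and of the average leakage current I0 described above follows directly from inverting the two terms of Eq. (9.1). The Python sketch below shows this inversion; the function names, the default thermal voltage and the numerical values in the usage lines are assumptions made for illustration.

```python
import math

UT = 0.0259  # thermal voltage near room temperature [V] (assumed value)


def equivalent_capacitance(p_dyn, a, n_cells, vdd, f):
    """C = Pdyn / (a * N * Vdd^2 * f), from the dynamic term of Eq. (9.1)."""
    return p_dyn / (a * n_cells * vdd ** 2 * f)


def average_leakage_current(p_stat, n_cells, vdd, vth, n, ut=UT):
    """I0 = Pstat / (N * Vdd * exp(-Vth / (n * Ut))), from the static term."""
    return p_stat / (n_cells * vdd * math.exp(-vth / (n * ut)))


# Illustrative round trip with made-up values: build Pdyn and Pstat from known
# C and I0, then recover them from the reported powers.
a, n_cells, vdd, f, vth, n = 0.5, 3000, 1.0, 62.5e6, 0.35, 1.4
c_true, i0_true = 2e-15, 1e-6
p_dyn = a * c_true * n_cells * f * vdd ** 2
p_stat = n_cells * vdd * i0_true * math.exp(-vth / (n * UT))

c_est = equivalent_capacitance(p_dyn, a, n_cells, vdd, f)
i0_est = average_leakage_current(p_stat, n_cells, vdd, vth, n)
```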
All the architectures in the implementation set share the same Vdd, Vth and f,
but present different values for a (activity) and N (number of cells). Two architectures
are characterized by a1 and N1, and a2 and N2 respectively, and their total power
consumption can be compared as follows:
\[ a_1 N_1 C f V_{dd}^{2} + N_1 V_{dd} I_0 e^{-\frac{V_{th}}{n U_t}} \overset{?}{<} a_2 N_2 C f V_{dd}^{2} + N_2 V_{dd} I_0 e^{-\frac{V_{th}}{n U_t}} \qquad (9.2) \]
The inequality (9.2) is true if the first architecture consumes less power than the
second one. This inequality can be rewritten in the form:
\[ (N_1 - N_2) \overset{?}{<} -(a_1 N_1 - a_2 N_2) \, \frac{C V_{dd} f}{I_0 e^{-\frac{V_{th}}{n U_t}}} \qquad (9.3) \]
Then, by defining the difference between the number of cells as ∆N = (N1 − N2)
and the difference between the number of transitions as ∆Tr = (a1N1 − a2N2), we
can finally express this comparison as:
\[ \Delta N \overset{?}{<} -\Delta Tr \, \frac{C V_{dd} f}{I_0 e^{-\frac{V_{th}}{n U_t}}} \qquad (9.4) \]
\[ \Delta N \overset{?}{<} -\Delta Tr \cdot R(V_{dd}, V_{th}, f) \qquad (9.5) \]
The expression R(Vdd, Vth, f) in Eq. (9.5) depends on Vdd, Vth, f and some
technology parameters, which are imposed on the designer and are hence constant.
Moreover, the value of R is always positive.
Eq. (9.4) shows that the comparison of the total power consumption between two
architectures depends on the difference between the number of cells (∆N) and on
the difference between the number of transitions (∆Tr). This is quite different from
the conventional approach where only the number of transitions is relevant as only
dynamic power consumption is taken into account.
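As an illustration, the following minimal Python sketch evaluates Eq. (9.1) and the ratio R of Eqs. (9.4) and (9.5), and checks that the test of Eq. (9.5) agrees with a direct comparison of the total powers. The ratio C/I0 = 1.36E-9 s/V is the value quoted in this chapter for the STM 90nm SVT technology; the absolute value of I0, the product n·Ut and the two architectures are hypothetical values chosen only for the example.

```python
import math

C_OVER_I0 = 1.36e-9   # [s/V], average C/I0 quoted in this chapter
I0 = 1e-9             # [A], hypothetical average leakage per cell
C = C_OVER_I0 * I0    # [F], equivalent capacitance implied by the ratio
N_UT = 0.05           # [V], hypothetical value of n*Ut


def total_power(a, n_cells, vdd, vth, f, c=C, i0=I0, n_ut=N_UT):
    """Eq. (9.1): dynamic plus static power of one architecture."""
    p_dyn = a * c * n_cells * f * vdd ** 2
    p_stat = n_cells * vdd * i0 * math.exp(-vth / n_ut)
    return p_dyn + p_stat


def ratio_r(vdd, vth, f, c=C, i0=I0, n_ut=N_UT):
    """R(Vdd, Vth, f) as it appears in Eqs. (9.4) and (9.5)."""
    return c * vdd * f / (i0 * math.exp(-vth / n_ut))


def first_is_better(a1, n1, a2, n2, r):
    """Eq. (9.5): True if architecture 1 consumes less than architecture 2."""
    delta_n = n1 - n2
    delta_tr = a1 * n1 - a2 * n2
    return delta_n < -delta_tr * r


# Hypothetical pair: architecture 1 has fewer cells, architecture 2 fewer transitions.
for vth in (0.1, 0.4):
    r = ratio_r(1.0, vth, 62.5e6)
    by_ratio = first_is_better(0.5, 1000, 0.1, 3000, r)
    by_power = total_power(0.5, 1000, 1.0, vth, 62.5e6) < total_power(0.1, 3000, 1.0, vth, 62.5e6)
    print(f"Vth={vth} V: R={r:.3g}, architecture 1 better: {by_ratio} (direct check: {by_power})")
```

With these assumed values the preferred architecture changes between the two threshold voltages, which is the behavior the comparison is meant to capture.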
9.2 Comparison of two architectures
A logical function can be implemented in several ways, using different topologies,
for instance by parallelizing, pipelining or performing algorithmic improvements. All
these various structures can be categorized based on their characteristics: number of
cells, logical depth, number of transitions and activity (Table 7.1 is an example of
such a classification). Two architectures can lead to positive or negative ∆N and
∆Tr values, while the value of R (Eq. (9.5)) is always positive. If both designs present
the same number of cells and transitions (i.e. ∆N = 0 and ∆Tr = 0), the power
consumption will clearly be the same. An architecture with more cells and more
transitions will always consume more power, because inequality (9.5) becomes trivial,
i.e. independent of R. Conversely, if one design has more cells but fewer transitions
than the other (i.e. ∆N > 0 and ∆Tr < 0 or vice versa), the choice of the
architecture consuming less power is more complex and depends on R. This means
that the selection will depend on the working conditions too, i.e. on Vdd, Vth, f and
the technology parameters. All possible cases are summarized in Table 9.1.
|        | ∆Tr > 0              | ∆Tr = 0          | ∆Tr < 0              |
|--------|----------------------|------------------|----------------------|
| ∆N > 0 | Circuit 2            | Circuit 2        | Depends on Eq. (9.5) |
| ∆N = 0 | Circuit 2            | Same consumption | Circuit 1            |
| ∆N < 0 | Depends on Eq. (9.5) | Circuit 1        | Circuit 1            |

Table 9.1: Comparison table between two circuits having a difference of ∆N = (N1 − N2)
cells and ∆Tr = (a1N1 − a2N2) transitions. The circuit indicated is the one
presenting the lowest total power consumption
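The decision logic of Table 9.1 can be sketched as a small Python function. The function name and the return convention (1, 2, or 0 for equal consumption) are assumptions made for this example; R is supposed to be positive, as stated for Eq. (9.5).

```python
def lower_power_circuit(delta_n, delta_tr, r):
    """Return 1 or 2 for the circuit with the lowest total power, 0 if equal.

    delta_n  = N1 - N2        (difference in number of cells)
    delta_tr = a1*N1 - a2*N2  (difference in number of transitions)
    r        = R(Vdd, Vth, f) > 0
    """
    if delta_n == 0 and delta_tr == 0:
        return 0                      # same consumption
    if delta_n >= 0 and delta_tr >= 0:
        return 2                      # circuit 1 has more cells and/or transitions
    if delta_n <= 0 and delta_tr <= 0:
        return 1                      # circuit 2 has more cells and/or transitions
    # Opposite signs: the outcome depends on Eq. (9.5).
    if delta_n == -delta_tr * r:
        return 0                      # working point exactly on the equal-consumption line
    return 1 if delta_n < -delta_tr * r else 2


print(lower_power_circuit(-2000, 200, 0.6))    # fewer cells wins when R is small
print(lower_power_circuit(-2000, 200, 250.0))  # fewer transitions wins when R is large
```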
Plotting the lines of equal consumption (i.e. R(Vdd, Vth, f) = −∆N/∆Tr) in
the (Vdd, Vth) space allows a better understanding of the role of R in the architecture
selection (Fig. 9.1). These equal-consumption lines mark the points where two
designs having the corresponding ratio −∆N/∆Tr present the same power consumption,
although the absolute value varies with Vdd and Vth. For
instance, if two architectures operating at Vdd=1V and Vth=0.33V have −∆N/∆Tr
= 100, they will present the same total power consumption. Otherwise, when the
design constraints represented by Vdd and Vth correspond to a point that is above
the equal-consumption line (which would be the case for Vdd=1V and Vth=0.4V in
our example), the circuit with fewer transitions will dissipate less power. Conversely,
if the working point is located below the equal-consumption line (which would be
the case for Vdd=1V and Vth=0.2V), the design with fewer cells will consume less
power. Actually, increasing Vth results in a large decrease in static power, which in
turn leads to a consumption dominated by the dynamic contribution. The architecture
with fewer transitions is then naturally preferred. It is important to remember
that the plot of Fig. 9.1 depends on the technology used. Here, the STM 90nm SVT
technology was chosen, which corresponds to an average C/I0 of 1.36E-9 [s/V] and a
working frequency of 62.5MHz.

[Figure 9.1: contour plot of −(∆N)/(∆Transitions) with Vdd [V] on the horizontal axis
(0 to 2) and Vth0 [V] on the vertical axis (0 to 0.5). The equal-consumption lines are
labelled 0.1, 0.27, 0.7, 2, 5, 10, 20 and 100. The region above a line is marked
"Less transitions is better" and the region below it "Less cells is better".]
Figure 9.1: Lines of equal-consumption with f = 62.5MHz in a STM 90nm SVT
technology. The Vdd and Vth constraints can be represented with a point on this
plot. A pair of architectures to be compared corresponds to one −∆N/∆Tr line
in this space. If the working point is located above the −∆N/∆Tr line, then the
architecture with fewer transitions is better in terms of power consumption, otherwise
the design with fewer cells is preferred
9.3 Selection of the best architecture
The methodology illustrated in the previous section to compare two architectures can
be iterated over a large number of implementations of the same logical function. In this
way, by repeating the comparisons on couples of structures, it is possible to eliminate
the worst architectures and quickly converge to the best design for the specified Vdd,
Vth and f constraints. It is important to note that the selected architecture is not
always the same, but depends on the values of Vdd, Vth and f. This methodology
can be used to easily select the best architecture under new constraints without
re-synthesis. Generally speaking, the approach can be summarized as follows:
1. Delay constraints: Given Vdd, Vth and f, architectures that are too slow to meet
the timing constraints are eliminated. A slow architecture can be parallelized
or pipelined to meet the constraints, but this represents a new architecture to
be added to the set of structures to compare.
2. Compare a couple of architectures: The comparison of two architectures
is achieved using the parameter −∆N/∆Tr. If this value is negative, the
architecture with fewer cells and fewer transitions is chosen (circuit 1 or 2 in Table 9.1
when −∆N/∆Tr is negative). On the other hand, when −∆N/∆Tr > 0, the
choice depends on Eq. (9.5) and therefore on the position of the working point
with respect to the line of equal consumption.
3. Repeat step 2 for all remaining architectures: It can be a good idea to
start by eliminating trivial cases (−∆N/∆Tr < 0) in order to reduce the number
of non-trivial comparisons performed by using Fig. 9.1. Elimination of architectures
will rapidly converge to the design presenting the lowest overall total power
consumption for the given working conditions (Vdd, Vth, f).
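The three steps above can be sketched in Python as a single elimination pass. This is only an illustration under stated assumptions: the architecture names, their cell and transition counts, the values of R and the `meets_timing` predicate are all hypothetical, and R is assumed to have been evaluated beforehand for the imposed (Vdd, Vth, f).

```python
def select_best(architectures, r, meets_timing=lambda name: True):
    """Return the name of the architecture with the lowest total power.

    architectures: dict mapping a name to (N, Tr), i.e. number of cells
                   and number of transitions (Tr = a * N).
    r:             R(Vdd, Vth, f) > 0 for the imposed working point.
    meets_timing:  step 1, predicate telling whether a design is fast enough.
    """
    # Step 1: discard the designs that do not meet the delay constraints.
    candidates = [(name, n, tr) for name, (n, tr) in architectures.items()
                  if meets_timing(name)]
    # Steps 2 and 3: keep the winner of each pairwise comparison (Eq. (9.5)).
    best = candidates[0]
    for challenger in candidates[1:]:
        delta_n = challenger[1] - best[1]
        delta_tr = challenger[2] - best[2]
        if delta_n < -delta_tr * r:
            best = challenger
    return best[0]


# Hypothetical set of three implementations of the same function.
designs = {"A": (1000, 500), "B": (3000, 300), "C": (3500, 450)}
print(select_best(designs, r=250.0))  # large R: fewer transitions matter most
print(select_best(designs, r=0.6))    # small R: fewer cells matter most
```

Because the test of Eq. (9.5) is equivalent to comparing N + Tr·R for the two designs, a single pass that always keeps the current winner is sufficient; no design needs to be compared twice.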
9.4 Designing new circuits
In addition to the above considerations, the same graphical tool can be used to define
guidelines for the design of new architectures (i.e. not yet present in the set of
available architectures) presenting an even smaller total power consumption. First, the
−∆N/∆Tr line that crosses the (Vdd, Vth) constraint point can be determined from
Fig. 9.1. As a reminder, two architectures having this −∆N/∆Tr share the same
power consumption under these constraints, whereas the architecture with fewer cells
should be favored when this −∆N/∆Tr ratio is higher.
Starting from an existing design with N1 cells and Tr1 transitions, a new architecture
with fewer cells (N2 < N1) can be searched for, which will usually also present
more transitions (the trivial case where N2 < N1 and Tr2 < Tr1 would in
fact always be better, but is rarely realizable). This new version with N2 < N1 cells and
Tr2 > Tr1 transitions will consume less power if and only if the ratio −∆N/∆Tr
is higher than the one extracted from the line crossing the (Vdd, Vth) constraints.
Indeed, in this case this line will actually pass above the working point in Fig. 9.1
and the new design with fewer cells will consume less power. Conversely, an architecture
presenting a reduced number of transitions (which in general will present more
cells) can be searched for. In this case, the new structure should present a ratio
−∆N/∆Tr smaller than the one that can be read from the line crossing the (Vdd,
Vth) constraints in Fig. 9.1.
As an example, an existing circuit with 10’000 cells and 100 transitions is working
at Vdd=1V, Vth=0.24V and f=62.5MHz and a new architecture consuming less
power is sought. Fig. 9.1 specifies that, in order to consume less power, a new architecture
must have a −∆N/∆Tr greater than 10 when reducing the number of cells, or
smaller than 10 when reducing the number of transitions. Supposing that the designer
can achieve a reduction of 1000 cells (N2 = 9000) by an architectural transformation,
the number of transitions of this new design must stay below 200 (i.e. an increase of
less than 100 transitions), which is necessary in order to have −∆N/∆Tr greater than 10.
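The arithmetic of this example can be written out explicitly. The sketch below assumes the sign convention of Section 9.1, with index 1 for the existing design and index 2 for the new one; the ratio −∆N/∆Tr itself does not change if the two indices are swapped.

\[
\Delta N = N_1 - N_2 = 10000 - 9000 = 1000, \qquad \Delta Tr = Tr_1 - Tr_2 = 100 - Tr_2
\]
\[
-\frac{\Delta N}{\Delta Tr} = \frac{1000}{Tr_2 - 100} > 10
\;\Longleftrightarrow\; Tr_2 - 100 < 100
\;\Longleftrightarrow\; Tr_2 < 200 \qquad (Tr_2 > 100)
\]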
When performing a parallelization, the number of cells is more than doubled (due
to the multiplexer overhead) and the activity is reduced by a factor of slightly less than
two. In general, this results in a small increase in the number of transitions and in a large
increase in the number of cells. For this reason, parallelized versions will always present
more power consumption than the original design at the same working conditions.
However, when the original architecture does not meet the speed requirements, the
parallelization can relax the timing constraints to achieve the required performance.
This is the only case where a parallelized architecture may be useful when Vdd and
Vth are fixed.
The same situation arises with pipelining, where the overhead due to the extra
registers often largely cancels the activity reduction achieved by suppressing glitches.
At the same time, the number of cells increases due to the same overhead and, as a
result, pipelining a circuit at the same working conditions is in general not advantageous.
Nevertheless, the pipelining technique can be used to reduce the logical depth and
hence relax the timing constraints of circuits that do not meet the speed constraints
at the required Vdd and Vth.
9.5 Case study: 16bit multipliers
To show how to apply the ideas of this chapter to a practical case, we will, one more
time, refer to the thirteen 16 bit multipliers described in Chapter 5. The architectural
parameters of all the structures are available in Table 7.1.
Knowing that the key parameters for power discrimination are the number of cells
(N) and the number of transitions (Tr), all architectures can be represented as points
on a plot of N versus Tr (Fig. 9.2). The label on the arcs connecting points stands
for the value of −∆N/∆Tr for the corresponding couple of architectures. Fig. 9.2
allows a very easy detection of trivial cases characterized by −∆N/∆Tr < 0, as the
slope of their arc is positive. Conversely, non-trivial cases present a negative slope.
In Fig. 9.2, only non-trivial arcs are shown.
A. Example 1: Vdd = 1V, Vth = 0.4V, f = 62.5MHz
Applying the methodology described in section 9.3, we have:
1. Delay constraints: All designs can work at these conditions.
2. Compare a couple of architectures: Architectures connected by a positive
slope arc in Fig. 9.2, i.e. trivial cases such as RCA parallel 4 against Wallace
parallel 2, are first considered. As RCA parallel 4 presents more cells and more
transitions than Wallace parallel 2, it is eliminated.
3. Repeat step 2 for all remaining architectures:
• By comparing other trivial cases, we can easily eliminate RCA horizontal
pipeline 4, RCA diagonal pipeline 4, RCA parallel 2, Wallace parallel 2 and
Wallace parallel 4 in favor of Wallace. Moreover the RCA diagonal pipeline
2 is eliminated in favor of RCA horizontal pipeline 2 and Sequential parallel
in favor of the basic Sequential.
• The remaining cases are then considered. Looking at RCA and Sequential
in Fig. 9.2, it can be seen that the arc connecting the two structures is
characterized by −∆N/∆Tr = 0.7. On Fig. 9.1, the equal-consumption
line corresponding to this value splits the space in two regions, with the label
“less transitions is better” in the upper part and “less cells is better” in the
lower part, and passes through Vdd=1V and Vth=0.14V, meaning that at this
point the two designs will consume the same amount of power. However, in our
example the working point corresponding to Vth=0.4V lies in the upper part of
the plot, where the better structure is characterized by fewer transitions. Consequently,
the RCA design is selected. The same reasoning can be applied to the
Sequential-wallace 4_16 architecture, which is eliminated in favor of the
RCA. In fact, if the equal-consumption line is located in the lower part of
Fig. 9.1, i.e. at low Vth, a working point above this line is dominated by
dynamic consumption rather than static power. For this reason, designs
with fewer transitions will also present less total power dissipation. The
remaining architectures are RCA, RCA horizontal pipeline 2 and Wallace.
As the first two both have low values of −∆N/∆Tr with respect to Wallace (1.59 and
10.03 respectively), the working point again lies above the corresponding
equal-consumption lines and only the Wallace structure remains.

[Figure 9.2: scatter plot titled "16 bit multipliers −(∆N)/(∆Transitions)" with the axes
Transitions (a*N) and Cells [N]. The thirteen points are labelled Sequential, Sequential
parallel, Sequential wallace 4_16, RCA, RCA parallel 2, RCA parallel 4, RCA horiz. pipe 2,
RCA horiz. pipe 4, RCA diag. pipe 2, RCA diag. pipe 4, Wallace, Wallace parallel 2 and
Wallace parallel 4. The non-trivial arcs carry the −∆N/∆Tr labels 3.60, 4.96, 2.41, 2.07,
0.70, 0.27, 2.43, 1.97, 2.78, 1.35, 1.59, 10.03 and 1.02.]
Figure 9.2: Thirteen 16 bit multipliers plotted on the cells vs. transitions space
For Vdd=1V, Vth=0.4V and f=62.5MHz, the best architecture from a power
point of view is the Wallace. In order to validate the methodology, the total power
consumption of all designs was calculated for the given operating conditions and is
shown in Table 9.2.
| Architecture            | Total power [µW] |
|-------------------------|------------------|
| RCA                     | 735.4            |
| RCA par2                | 839.5            |
| RCA par4                | 905.4            |
| RCA horiz.pipe2         | 681.5            |
| RCA horiz.pipe4         | 738.3            |
| RCA diag.pipe2          | 724.9            |
| RCA diag.pipe4          | 790.1            |
| Wallace                 | 545.2            |
| Wallace par2            | 650.2            |
| Wallace par4            | 728.1            |
| Sequential              | 2457.2           |
| Sequential-wallace 4_16 | 1094.4           |
| Sequential parallel     | 2232.5           |

Table 9.2: Consumption of the thirteen multipliers in µW for Vdd=1V, Vth=0.4V
and f=62.5MHz.
These values are first obtained at the nominal conditions (Vdd=1V, Vth0=0.353V);
dynamic and static powers are then separately recalculated based on
Eq. (9.1) for the proposed working condition (i.e. Vdd=1V, Vth=0.4V).
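A minimal sketch of such a recalculation is given below. It only applies the proportionalities of Eq. (9.1) (dynamic power scaling with f·Vdd², static power with Vdd·exp(−Vth/(nUt))) and assumes that C, I0, the activity and the number of cells do not change with the working point. The nominal power values and the product n·Ut used in the example are hypothetical.

```python
import math


def rescale_power(p_dyn, p_stat, vdd0, vth0, f0, vdd, vth, f, n_ut):
    """Rescale nominal dynamic and static powers to a new working point.

    (vdd0, vth0, f0) are the nominal conditions, (vdd, vth, f) the new ones.
    n_ut is the product n*Ut in volts (an assumed technology parameter).
    """
    p_dyn_new = p_dyn * (f / f0) * (vdd / vdd0) ** 2
    p_stat_new = p_stat * (vdd / vdd0) * math.exp(-(vth - vth0) / n_ut)
    return p_dyn_new, p_stat_new


# Hypothetical nominal figures in µW, moved from Vth0 = 0.353 V to Vth = 0.4 V.
p_dyn, p_stat = rescale_power(500.0, 50.0, 1.0, 0.353, 62.5e6,
                              1.0, 0.4, 62.5e6, n_ut=0.05)
print(f"Pdyn = {p_dyn:.1f} µW, Pstat = {p_stat:.1f} µW, Ptot = {p_dyn + p_stat:.1f} µW")
```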
B. Example 2: Vdd = 1V, Vth = 0.12V, f = 62.5MHz
As a second example, we choose a working condition with a very low threshold voltage
(V th=0.12V) and the same supply voltage and frequency as in the previous example.
1. Delay constraints: In this case too, all designs meet the timing constraints.
2. Compare a couple of architectures: As in the previous example, trivial
cases are detected first. Hence, the RCA parallel 4 is eliminated in favor of the
Wallace parallel 2.
3. Repeat step 2 for all remaining architectures:
• By comparing other trivial cases, we can easily eliminate RCA horizontal
pipeline 4, RCA diagonal pipeline 4, RCA parallel 2, Wallace parallel 2
and Wallace parallel 4 in favor of Wallace. Moreover, the RCA diagonal
pipeline 2 is eliminated in favor of the RCA horizontal pipeline 2 and the
Sequential parallel in favor of the basic Sequential.
• The remaining architectures are: RCA, RCA horizontal pipeline 2, Sequential,
Sequential-wallace 4_16 and Wallace. As before, the couple RCA
and Sequential is characterized by −∆N/∆Tr = 0.7, which corresponds
to an equal-consumption line on Fig. 9.1. For Vdd=1V these architectures
will have the same power consumption if the threshold voltage is equal to
0.14V. As the imposed Vth is a little lower (0.12V), the working point is located
in the region where fewer cells are preferred. Hence, the Sequential architecture
will be selected. The comparison between the RCA horizontal
pipeline 2 and the Wallace is similar. With a −∆N/∆Tr of 10.03, we know that the
architecture with fewer cells is preferred (i.e. the RCA horizontal pipeline 2).
For the same reason, the Sequential-wallace 4_16 will be preferred over the
RCA horizontal pipeline 2. Finally, the comparison between the Sequential
and the Sequential-wallace 4_16 is characterized by −∆N/∆Tr = 0.27.
From Fig. 9.1 we can see that the equal-consumption line passes under the
working point (Vdd, Vth), meaning that the circuit with fewer
transitions will present the best power figure. Hence, the only remaining
architecture is the Sequential-wallace 4_16.
The results of the methodology indicate that the Sequential-wallace 4_16 is the
circuit presenting the lowest total power consumption for Vdd=1V, Vth=0.12V and
f=62.5MHz.
| Architecture            | Total power [µW] |
|-------------------------|------------------|
| RCA                     | 4618.5           |
| RCA par2                | 8571.8           |
| RCA par4                | 16342.5          |
| RCA horiz.pipe2         | 4987.5           |
| RCA horiz.pipe4         | 5898.4           |
| RCA diag.pipe2          | 5070.3           |
| RCA diag.pipe4          | 6072.1           |
| Wallace                 | 4788.5           |
| Wallace par2            | 9082.5           |
| Wallace par4            | 17537.0          |
| Sequential              | 3939.4           |
| Sequential-wallace 4_16 | 3245.1           |
| Sequential parallel     | 4857.9           |

Table 9.3: Consumption of the thirteen multipliers in µW for Vdd=1V, Vth=0.12V
and f=62.5MHz.
The actual power consumption in these conditions is shown (after calculation) in
Table 9.3, confirming that the Sequential-wallace 4_16 actually presents the lowest
total power consumption.
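The two validation tables can be cross-checked with a few lines of Python. The power figures below are copied from Tables 9.2 and 9.3; only the variable names are introduced for this example.

```python
# Total power in µW at Vdd = 1 V, f = 62.5 MHz (values from Tables 9.2 and 9.3).
power_vth_0v40 = {
    "RCA": 735.4, "RCA par2": 839.5, "RCA par4": 905.4,
    "RCA horiz.pipe2": 681.5, "RCA horiz.pipe4": 738.3,
    "RCA diag.pipe2": 724.9, "RCA diag.pipe4": 790.1,
    "Wallace": 545.2, "Wallace par2": 650.2, "Wallace par4": 728.1,
    "Sequential": 2457.2, "Sequential-wallace 4_16": 1094.4,
    "Sequential parallel": 2232.5,
}
power_vth_0v12 = {
    "RCA": 4618.5, "RCA par2": 8571.8, "RCA par4": 16342.5,
    "RCA horiz.pipe2": 4987.5, "RCA horiz.pipe4": 5898.4,
    "RCA diag.pipe2": 5070.3, "RCA diag.pipe4": 6072.1,
    "Wallace": 4788.5, "Wallace par2": 9082.5, "Wallace par4": 17537.0,
    "Sequential": 3939.4, "Sequential-wallace 4_16": 3245.1,
    "Sequential parallel": 4857.9,
}

# The architecture with the lowest tabulated power for each working point.
best_high_vth = min(power_vth_0v40, key=power_vth_0v40.get)
best_low_vth = min(power_vth_0v12, key=power_vth_0v12.get)
print(best_high_vth, best_low_vth)
```

The minimum of each table matches the architecture selected by the graphical methodology in the corresponding example.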
9.6 Summary
This chapter presented a new design methodology allowing the selection of the
architecture presenting the lowest total power consumption within a set of equivalent
designs working at the same (fixed) Vdd, Vth and f. This methodology considers
dynamic power consumption (proportional to the number of transitions), as well as
static power consumption (directly related to the number of cells). An example of
application was reported for thirteen 16 bit multipliers, showing that, depending on
the working conditions (i.e. Vdd, Vth and f), the architecture with the lowest
total power dissipation is not always the same. Moreover, this technique allows the
determination of the architecture presenting the lowest total power consumption for
conditions which are different from the ones used during synthesis, without the need
to re-synthesize all the circuits.
Chapter 10
Physical implementation of four 32 bit multipliers
In the previous chapters, models for the optimal total power consumption have
been proposed. In order to validate the reported equations and to reinforce the
conclusions drawn, a physical ASIC implementation has been carried out. The circuit was
designed to demonstrate both architectural and technology influences on the optimal
total power consumption in the case where the static power consumption also contributes
largely to the total power. This has been achieved with a state-of-the-art 90nm
technology from ST Microelectronics. The main advantage of this technology is the
possibility to integrate, on the same die, 2 different kinds of transistor out of the 3
available. In this way, it is possible to “emulate” the effects of a technology change
on the total power consumption with a single chip.
The implemented design is composed of two 32 bit multipliers (RCA basic and
RCA parallel 4, these structures being described in detail in Chapter 5), each implemented
once with the Standard Vth (SVT) transistors and once with the Low Vth (LVT)
transistors, giving a total of 4 multipliers.
After a detailed description of the ASIC structure and functionality, this chapter
will present the tools and resources used for the measurements. Then, measured
data will be reported and commented. Finally, a discussion on technology parameter
variations closes the chapter.
10.1 Circuit description
The test ASIC is mainly formed by 4 multipliers corresponding to all possible
combinations of two technology flavors (SVT/LVT) with two architectures (RCA basic/RCA
parallel 4). The combinations are:
• mult_0: RCA basic with SVT transistor type;
• mult_1: RCA parallel 4 with SVT transistor type;
• mult_2: RCA basic with LVT transistor type;
• mult_3: RCA parallel 4 with LVT transistor type.
The choice of the RCA as the block to be implemented comes from the need to
have an architecture that is “slow enough” (in fact, the RCA has a larger logical depth
than the Wallace) for the expected total power crossings (reported at the end of this
chapter) to occur at a relatively low frequency (under 20MHz in this case). This permitted
us to reduce the requirements for the testing tools. Fig. 10.1 illustrates the block diagram
of the test circuit. All multipliers have a data size of 32 bit, which corresponds to 64
output bits. Each multiplier also has a separate power supply in order to be able
to measure its power consumption without including the rest of the circuit. For the
same reason, the clock signal is multiplexed to each block. In this way, only
the clock tree corresponding to the desired multiplier is accounted for during the power
measurements. This clock multiplexing, as well as the multiplier register enables and
the output demultiplexer, are controlled by the external signal sel, which is the binary
representation of the number corresponding to the multiplier under test.
To be able to verify the correct functioning of the multipliers over many multiplications,
each result is added to the previous ones and only the final sum is verified.
Mathematically, the content of the shift register after n multiplications can be
expressed by:
\[ \mathrm{Sum} = \left[ \sum_{i=0}^{n} \mathrm{multiplication}(i) \right] \bmod 2^{64} \tag{10.1} \]
This sum is stored in a 64 bit shift register, which allows the result to be output
serially and checked externally after the test.

[Figure 10.1: block diagram showing the four multipliers (Mult_0 RCA SVT, Mult_1 RCA
Parallel 4 SVT, Mult_2 RCA LVT, Mult_3 RCA Parallel 4 LVT) with 32 bit inputs A and B
and 64 bit outputs, the pseudo random generator, the one-hot decoder driven by sel(1:0)
that generates clk0 to clk3 and the register enables, the output multiplexer, the 64 bit
adder and the 64 bit shift register (signals p_in, p_out, s_in, s_out, load_n, shift_n,
sel_reg, clk, rst_n), and the supplies Vdd_IO, Vdd_g, Vdd_core, Vss_g and Vdd_m0 to Vdd_m3.]
Figure 10.1: Block schematic of the test circuit
10.1.1 Pseudo-random code generator
The circuit being designed to work at a maximal frequency of 62.5MHz (corresponding
to 16ns of clock period) at nominal conditions (i.e. Vdd=1V), it was not possible to
externally generate the input data for the multipliers due to the high throughput
required. Hence, a pseudo-random data generator has been implemented internally.
This generator is based on a linear feedback shift register [63] [64] and is constructed
as a shift register with some bits logically “xnored” and fed back to the shift register
input. The schematic of the data generator is depicted in Fig. 10.2.
[Figure 10.2: chain of D flip-flops (D input, Q output) numbered 63 down to 0, all
clocked by clk.]
Figure 10.2: Schematic of the 64 bit linear feedback shift register
The data is 64 bit wide and provides the two 32 bit vectors used as the two inputs
of the multiplier under test.
The particularity of a linear feedback shift register (LFSR) is that all possible codes
are generated in an equally distributed way, without repetitions, until all codes have
passed. The only code never generated, and also the one to be avoided, is the “all-ones”
code, which is a stable code and always generates itself. Another advantage
of this implementation is the fact that the generated sequence is always the same
given the same starting code. In the case of our circuit, the shift register will be reset
prior to every multiplication so that knowing the number of executed multiplications
n permits us to pre-calculate the result of the cyclic adder expressed in Eq. (10.1) and
in this way to verify that all the multiplications were executed correctly.
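The following Python sketch illustrates how such a reference value could be pre-calculated in software. It is not the VHDL of the test chip: the feedback taps (64, 63, 61, 60, a tap set commonly tabulated for maximal-length 64 bit LFSRs), the shift direction, the all-zero reset state and the instant at which the operands are sampled are all assumptions, since the text only states that “some bits” are xnored.

```python
MASK64 = (1 << 64) - 1


def lfsr_step(state, width=64, taps=(64, 63, 61, 60)):
    """Advance an LFSR with XNOR feedback by one clock.

    taps are 1-based stage numbers (assumed values, not taken from the chip).
    """
    xor = 0
    for tap in taps:
        xor ^= (state >> (tap - 1)) & 1
    feedback = xor ^ 1                      # XNOR of the tapped bits
    return ((state << 1) | feedback) & ((1 << width) - 1)


def expected_sum(n_mult, seed=0):
    """Accumulate n_mult products modulo 2**64, in the spirit of Eq. (10.1)."""
    state, total = seed, 0
    for _ in range(n_mult):
        a = state >> 32                     # DATA(63:32)
        b = state & 0xFFFFFFFF              # DATA(31:0)
        total = (total + a * b) & MASK64    # cyclic 64 bit adder
        state = lfsr_step(state)
    return total


print(hex(expected_sum(500)))
# The all-ones code is a fixed point of the XNOR feedback:
print(lfsr_step(MASK64) == MASK64)
```

The same function with `width=4` and `taps=(4, 3)` cycles through 15 of the 16 possible codes and never reaches the all-ones code, which is the behavior described above on a scale small enough to enumerate.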
Fig. 10.3 shows the distribution of the generated numbers after 500 and after
10000 clock cycles. In the case of 500 generated numbers, it is possible to observe
a slightly non-uniform distribution due to the small number of generated data. As
the amount of generated numbers increases, the distribution of probabilities becomes
more uniform, as shown in Fig. 10.3. It is also interesting to note that, due to the
shift nature of the generated data, splitting the 64 bit code into two 32 bit vectors doesn’t
change the probability distribution: the newly derived vectors present
the same probability distribution as the original one. Moreover, the multiplication of
two uniform distributions results in a distribution proportional to ln(1/x), as shown by
the last two graphs of Fig. 10.3.
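The ln(1/x) shape can be justified with a short derivation, given here under the simplifying assumptions that the two 32 bit operands are independent and that each is normalized to a uniform variable on (0, 1). The two halves of a shift register are not strictly independent, so this is only an approximation of the situation in the circuit. For Z = XY with X, Y uniform on (0, 1) and 0 < z < 1:

\[
F_Z(z) = P(XY \le z) = \int_0^z dx + \int_z^1 \frac{z}{x}\, dx = z - z \ln z
\]
\[
f_Z(z) = \frac{dF_Z}{dz} = -\ln z = \ln\!\left(\frac{1}{z}\right)
\]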
[Figure 10.3: two groups of four histograms, titled "Linear Feedback Shift Register
(sequence of 10000 states)" and "Linear Feedback Shift Register (sequence of 500 states)".
In each group the panels show DATA(63:0), DATA(31:0), DATA(63:32) and
DATA(63:32) * DATA(31:0).]
Figure 10.3: Probability distribution of the pseudo-random generated data for 500
and 10000 generated data
10.1.2 Ring oscillators
Besides the design described in Fig. 10.1, two small ring oscillators have been
added to the implemented circuit. One is implemented with inverters based on SVT
transistors, whereas the other is implemented with inverters based on LVT transistors.
Both ring oscillators were designed to have an oscillation frequency of 62.5MHz
at nominal conditions, which corresponds to the expected working frequency of the
multipliers under the same conditions. This means:
• ring_lvt: 533 inverters (IVLVTX1)
• ring_svt: 437 inverters (IVSVTX1)
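As a rough cross-check, the per-inverter delay implied by these stage counts can be estimated with the usual ring oscillator relation f = 1/(2·N·td). This relation is not stated in the text and assumes N identical stages with equal rise and fall delays, so the figures below are only indicative.

```python
def stage_delay(n_inverters, f_osc):
    """Per-stage delay implied by f_osc = 1 / (2 * n_inverters * t_d)."""
    return 1.0 / (2.0 * n_inverters * f_osc)


F_NOMINAL = 62.5e6  # [Hz], target oscillation frequency at nominal conditions

t_lvt = stage_delay(533, F_NOMINAL)  # ring_lvt
t_svt = stage_delay(437, F_NOMINAL)  # ring_svt
print(f"LVT inverter: {t_lvt * 1e12:.1f} ps, SVT inverter: {t_svt * 1e12:.1f} ps")
```

Under this assumption the LVT inverter comes out faster than the SVT one, which is consistent with the larger number of stages needed to reach the same frequency.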
10.2 Circuit design and implementation
The design has been written in the VHDL language and the source code can be found
in Appendix A. The synthesis of this code has been done using Synopsys Design
Compiler V2004.06-SP1 and the activity annotation for accurate power estimation has
been obtained with ModelSIM from MentorGraphics version 5.6f. All the Synopsys
scripts can be found in Appendix B.
The technology used for the synthesis is the 90nm from ST Microelectronics. This
technology has been fully described in Chapter 4.
The results of the synthesis are stored in a Verilog netlist ready to be used by the
Place&Route (P&R) software. In our case, we used SoC Encounter version 4.10 from
Cadence. The scripts used for P&R are reported in Appendix C.
Finally, the design passed the DRC (Design Rule Check) done using Calibre DRC
from MentorGraphics. The final layout of the circuit is shown in Fig. 10.4.
[Figure 10.4: layout image of the chip with the 24 pads numbered around the die.]
Figure 10.4: Final layout of the demonstrator circuit
In Fig. 10.4 we can recognize the two RCA parallel 4 multipliers in the upper part
and the two RCA basic multipliers in the lower left part, whereas the control logic and
the data generator are located in the middle left part. The square block located in
the bottom right corner is a compensation circuit required to stabilize the IO cells. A
block view of the design is reported in Fig. 10.5.
[Figure 10.5: block view of the die with the 24 numbered pads (Vss_2, Shift_n, Z_svt,
Z_lvt, S_out, S_in, Rst_n, Sel_reg, Sel1, Sel0, Vss_g, Vdd_g, Vdd_0, clk, Vdd_2, load_n,
Vss_1, Vss_ref, Vdd_1, VSS_IO, Vdd_co, Vdd_IO, VSS_3, Vdd_3) and the internal blocks
Mult_0, Mult_1, Mult_2, Mult_3, IO_REF_COMPENSATION and "random generator + serial
interface".]
Figure 10.5: Block view of the demonstrator circuit
The pin names and their functions are:
1. Z_lvt: Output of the ring oscillator formed by 533 LVT type inverters;
2. S_out: Serial output of the shift registers. This output is used to read the
content of the shift registers. From the read value the correct multiplier behavior
can be verified;
3. S_in: Serial input of the shift registers. This pin can be used to enter a value
to be multiplied or to verify the correct functioning of the shift registers;
4. Vdd_0: Supply voltage for the multiplier 0 (RCA basic with SVT transistors);
5. Clk: Clock of the system;
6. Vdd_2: Supply voltage for the multiplier 2 (RCA basic with LVT transistors);
7. Load_n: When low, data is loaded in parallel from the p_in input into the
shift registers (see Fig. 10.1). This is the typical behavior during the sum and
accumulation process;
8. Vss_1: System ground;
9. Vss_ref: System ground;
10. Vdd_g: Supply voltage for the IO_REF_COMPENSATION block (1.0V);
11. Vss_g: System ground;
12. Sel0: Bit zero of the sel signal. This signal selects which multiplier is under test;
13. Sel1: Bit one of the sel signal. Sel coding is binary;
14. Sel_reg: Selector for routing data from the pseudo-random number generator
and to/from the shift registers;
15. Rst_n: System asynchronous reset signal, active low;
16. Vdd_3: Supply voltage for the multiplier 3 (RCA parallel 4 with LVT transistors);
17. Vss_3: System ground;
18. Vdd_IO: I/O supply voltage (3.3V);
19. Vdd_co: Supply voltage for the pseudo-random generator and serial interface
block;
20. Vss_IO: IO ground;
21. Vdd_1: Supply voltage for the multiplier 1 (RCA parallel 4 with SVT transistors);
22. Vss_2: System ground;
23. shift_n: When low (and load_n is high), data in the shift register shifts one bit
on each clock rising edge;
24. Z_svt: Output of the ring oscillator formed by 437 SVT type inverters.
Figure 10.6: Output pad level converter for different core supply voltages. The linear
ramp represents the core supply voltage, the line marked with triangles and constantly
bound to zero is the logical level from the core and the line marked by wide rectangles
is the corresponding IO output.
As this circuit is intended to work at very low supply voltage (<0.5V), the level
converter included in the standard output cells is not suited to ensuring a correct level
conversion under this condition, as shown in Fig. 10.6.
In Fig. 10.6 we can observe that, for a core powered with a voltage lower
than about 0.45V, the output value jumps to 3.3V whereas 0V should be reported
instead. For this reason the output ports (luckily only 3 ports of the design are
outputs) have been assigned as analog pads and the level conversion has been left to
an external circuit. This problem doesn’t exist for the input ports: signals
coming with a higher voltage than the core supply are never confused with the “0”
logic level.
10.2.1 Nominal values
The nominal synthesis values, as well as the architectural parameters, for the four
implemented multipliers are reported in Table 10.1.
| Nominal values | Cells | Nets  | Area [µm²] | Activity | Delay [ns] | LD_eff | χ      | χα     | Vdd [V] | Vth0 [V] | Pdyn [µW] | Pstat [µW] | Ptot [µW] |
|----------------|-------|-------|------------|----------|------------|--------|--------|--------|---------|----------|-----------|------------|-----------|
| Mult0          | 2633  | 3660  | 27250.1    | 0.686    | 13.28      | 396    | 0.6315 | 0.4684 | 1.0     | 0.353    | 3651      | 33         | 3684      |
| Mult1          | 10122 | 14029 | 104333.4   | 0.194    | 13.65      | 102    | 0.2772 | 0.1204 | 1.0     | 0.353    | 4092      | 128        | 4220      |
| Mult2          | 2568  | 3595  | 27105.2    | 0.696    | 10.73      | 392    | 0.5767 | 0.4237 | 1.0     | 0.342    | 4066      | 273        | 4339      |
| Mult3          | 9732  | 13639 | 103502.6   | 0.197    | 11.16      | 102    | 0.2432 | 0.1102 | 1.0     | 0.342    | 4492      | 1030       | 5522      |

Table 10.1: Nominal values of the 4 implemented multipliers. Nominal frequency is
62.5MHz

The definitions of the parameters reported in Table 10.1 are:
• Cells: the number of design cells. Note however that a cell can be a very simple
one (like an inverter) or a complex one (like a full adder);
• Nets: the number of inter-cell nets in the design;
• Area: the area of the design core; pads and routing spaces are not included;
• Activity: the average number of switching nets over the total number of nets
per clock period. These values are obtained by an event-driven simulation under
ModelSim (from Mentor Graphics). The results are based on 500 multiplications
of pseudo-random data. Standard library delays are used so that glitches are
accounted for;
• Delay: the typical combinatorial delay from register output to register input on
the critical path;
• LD eff: the effective logical depth in equivalent NAND2 gates. The term “effective”
is used to emphasize the fact that the logical depth is considered against the
throughput frequency, i.e. the one-complete-multiplication frequency. In the case
of a 4 times parallelization, for instance, LD eff corresponds to a fourth of the
real LD because each block has four clock periods to compute one multiplication.
The delay of the reference NAND2 gate has been estimated by connecting 1000 NAND2
gates as a chain of inverters. The inversion effect has been obtained by tying the
two inputs together. The resulting delay per gate is 33.5ps for the SVT transistor
type and 27.4ps for the LVT;
• χ and χ^α: these two parameters are obtained by using Eq. (6.3) from the nominal
V dd, V th and delay;
• Nominal Vdd: the nominal technology supply voltage;
• Nominal Vth0: the nominal technology threshold voltage;
• Nominal Pdyn: the nominal dynamic power consumption as reported by Synopsys DC;
• Nominal Pstat: the nominal static power consumption as reported by Synopsys DC;
• Nominal Ptot: the nominal total power consumption obtained by summing the
nominal Pdyn and the nominal Pstat.
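The definitions of LD eff and Ptot given above can be cross-checked against Table 10.1 with a few lines of Python. This is only a sketch: the relation LD eff = delay / (NAND2 delay × degree of parallelization) is inferred from the definitions in the list and is not quoted from the thesis, and the dictionary layout and function names are illustrative.

```python
# Cross-check of Table 10.1 (values copied from the table and the text).
# Assumption: LD_eff = delay / (t_NAND2 * parallelism), inferred from the
# parameter definitions; this is not an equation quoted from the thesis.

NAND2_DELAY_PS = {"SVT": 33.5, "LVT": 27.4}  # reference gate delays [ps]

# name: (transistor type, parallelism, delay [ns], Pdyn [uW], Pstat [uW])
MULTIPLIERS = {
    "Mult0": ("SVT", 1, 13.28, 3651, 33),
    "Mult1": ("SVT", 4, 13.65, 4092, 128),
    "Mult2": ("LVT", 1, 10.73, 4066, 273),
    "Mult3": ("LVT", 4, 11.16, 4492, 1030),
}


def effective_logical_depth(delay_ns, vt_type, parallelism):
    """Critical path length in NAND2 equivalents, divided by the parallelism."""
    gates_on_path = delay_ns * 1000.0 / NAND2_DELAY_PS[vt_type]
    return gates_on_path / parallelism


def total_power(pdyn_uw, pstat_uw):
    """Nominal total power as the sum of dynamic and static power."""
    return pdyn_uw + pstat_uw


for name, (vt, par, delay, pdyn, pstat) in MULTIPLIERS.items():
    ld_eff = effective_logical_depth(delay, vt, par)
    print(f"{name}: LD_eff ~ {ld_eff:.0f}, Ptot = {total_power(pdyn, pstat)} uW")
```

Under this assumption the computed values round to the LD eff column of the table, which suggests that the tabulated depths were derived in this way.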
10.3 Measurements setup
To measure the power consumption of each multiplier at its limit of functionality
(i.e. the lowest possible supply voltage guaranteeing correct results for a given
frequency), the following things are required:
• Generate the supply voltages: The circuit requires several different supply
voltages in order to work. The multiplier under test needs a separate supply
voltage. The core logic, containing the pseudo-random data generator and
the cyclic adder, requires a supply voltage at a similar potential in order to
interface with the multipliers internally without problems. Moreover, the IO
controller IO REF COMPENSATION should always be kept at 1.0V and finally
the IO pads must be powered with 3.3V.
• Generate the control signals: The circuit requires a clock and a reset signal.
In addition, other signals must be generated in order to select the multiplier
under test and to read/write the shift registers used to check the correct
functioning of the multiplier. All these signals are generated by an Altera FPGA
based board.
• Convert output pins to 3.3V logic level: As reported previously, the circuit
outputs (namely Z lvt, Z svt and S out) are implemented as analog signals and
hence need to be converted to a 3.3V logic level in order to be interfaced
with the FPGA. This is done by placing discrete comparators on the output signals.
• Measure the consumed multiplier current independently: Finally, once
the circuit is running, we must be able to measure the current consumed by the
specific multiplier under test. This is accomplished by routing the multiplier
power supply to the power pins of the selected multiplier through reed relays. The
advantage of reed relays is that, the contact being mechanical, virtually
no extra consumption is added to the measurement, which would not be the case if
a CMOS multiplexer circuit were used instead.
10.3.1 PCB design
Fig. 10.7 shows the schematic of the PCB (Printed Circuit Board, designed with
Altium Designer 2004 SP3, formerly Protel) used to interface the demonstrator circuit.
The three connectors J5, J7 and J9 are the “bridges” between the PCB and the FPGA
board. On the right of the schematic, we can see the 4 reed relays (K1-K4) used to
supply the multiplier under test and the corresponding LEDs, which provide visual
feedback on which multiplier is currently selected. In the bottom left corner there
is a small user interface with 4 buttons and 4 LEDs. Two of these LEDs (OK, KO)
show whether the content of the shift register was the expected one, i.e. whether the
multiplier worked correctly or not. The other buttons/LEDs are there to extend
the functionality if needed. The comparators (U2, U4, U5), required to convert
the output level of the three output pins Z lvt, Z svt and S out to 3.3V, are visible
on the left part of the schematic with extra connectors (P1-P3, on top) intended
for debugging purposes. The reference voltage defining the separation between
logic level 0 and logic level 1 is obtained with a potentiometer from
the VCORE pin. In this way, the reference voltage is always proportional to
the supply voltage used for the core. All the chip input signals are connected directly
to the FPGA through J9. The 3.3V supply is generated from the 5V available on the
card with a voltage regulator shown in the top right corner. A stabilized 1.0V source
was difficult to obtain from the 5V, as no voltage regulator was found that can
provide such a low voltage. For this reason, this supply voltage has been generated
using an operational amplifier configured as a voltage follower. In this configuration,
the voltage set at the input through a resistor divider is replicated at the output
(almost) independently of the drawn current. This block is shown in the bottom-center
part of Fig. 10.7. Finally, the multiplier power supply is provided externally through
the connector VDDM and the current drawn is measured by connecting an ammeter to the
AMP connector. The supply voltage for the core (which is all the design but the
multipliers) can be obtained from VDDM with the jumper JP1 set, or supplied separately
through the VCORE connector.
Figure 10.7: Schematic of the PCB used to test the demonstrator circuit (schematic
sheet “Top_with_io test board”, revision 1.1, dated 30.01.2007)
10.3.2 FPGA based signal generation
The FPGA development card used in this work was a Nova Constellation 20 KE
card [65], which is based on an Altera APEX EP20K600EFC672 FPGA. This card has
150 user programmable IOs working at 3.3V. It can be programmed through USB
and JTAG interfaces. A serial programmer is also present, which permits automated
FPGA reconfiguration at power-up. Moreover, this card supports the SignalTap II
technology from Altera, allowing register read-back through JTAG at runtime.
This feature is very practical for debugging. The card is powered by 5.0V and
provides an internal 40MHz clock. In our case, an external oscillator was
used in order to be able to measure the power consumption at different frequencies.
The FPGA code has been written in VHDL and compiled with Altera Quartus II
v6.0 SP1. The source code is reported in Appendix D.
The FPGA pin assignments are reported in Table 10.2.
Name           PIN        Name           PIN
OK led         PIN E13    CHIP rst n     PIN N19
KO led         PIN H15    CHIP sel[0]    PIN T22
Power mult0    PIN F12    CHIP sel[1]    PIN M17
Power mult1    PIN H13    CHIP sel reg   PIN L20
Power mult2    PIN J16    CHIP shift n   PIN T23
Power mult3    PIN K15    CHIP sin       PIN R23
Switch1        PIN E16    CHIP sout      PIN M21
Switch2        PIN G16    ext clock      PIN G15
Switch3        PIN H16    mult num[0]    PIN E14
Switch4        PIN E15    mult num[1]    PIN F15
CHIP clock     PIN N22    LED2           PIN G18
CHIP load n    PIN M18    LED3           PIN F18
Z svt          PIN U21    Z lvt          PIN U22
Table 10.2: Pin assignments for the APEX EP20K600EFC672 FPGA
The FPGA code performs the following steps:
• Select the desired multiplier;
• Reset the internal registers;
• Execute 10’000’000 multiplications and accumulate the results in the 64 bit
register;
• Read back the content of the accumulator register;
• Compare the read data with the expected value and output the decision on the
pass/fail pins;
• At the end of this sequence, stop the chip clock to allow static power
measurements.
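The accumulate-and-compare steps listed above can be illustrated by a small software reference model. This is a minimal sketch under stated assumptions: the on-chip pseudo-random generator is not described in this section, so `lfsr32` (including its tap constant) is a hypothetical stand-in, and all function names are illustrative. Only the 32 bit operands, the 64 bit wrapping accumulator and the OK/KO decision come from the text.

```python
# Software reference model of the accumulate-and-compare test.
# Assumptions (not from the thesis): the operand generator is modelled by a
# hypothetical 32-bit Galois LFSR; its taps and seeds are arbitrary.

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def lfsr32(seed):
    """Hypothetical 32-bit Galois LFSR standing in for the on-chip generator."""
    state = seed & MASK32
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0x80200003  # tap constant chosen for illustration only
        yield state


def operand_stream(seed_a, seed_b, count):
    """Yield `count` pairs of 32-bit operands from two independent generators."""
    gen_a, gen_b = lfsr32(seed_a), lfsr32(seed_b)
    for _ in range(count):
        yield next(gen_a), next(gen_b)


def accumulate_products(operand_pairs):
    """Sum the 64-bit products, wrapping around like a 64-bit register."""
    acc = 0
    for a, b in operand_pairs:
        acc = (acc + (a & MASK32) * (b & MASK32)) & MASK64
    return acc


def check(read_back, expected):
    """Mimic the pass/fail decision driven on the OK and KO LEDs."""
    return "OK" if read_back == expected else "KO"


expected = accumulate_products(operand_stream(1, 2, 1000))
print(check(expected, expected))      # a correct multiplier passes
print(check(expected ^ 1, expected))  # a single wrong bit fails
```

A single comparison at the end of a long run is sufficient here because any wrong product changes the accumulated value, except in the unlikely case where several errors cancel modulo 2^64.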
A particularity of this code is the use of two clock frequencies for the circuit under
test, depending on the executed task. While the chip clock runs at full speed
(the same as the FPGA clock) during the execution of the 10’000’000 multiplications,
a clock divided by 4 is used during the data read-back phase. This was required in
order to execute tests at frequencies higher than 35MHz (like the nominal circuit
frequency of 62.5MHz). The limiting factor was the propagation delay of the comparator
used to convert the low voltage level of the s out pin to the 3.3V level of the FPGA.
In fact, if the frequency was too high, the read value was latched before it was ready.
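The effect of the divided read-back clock can be checked with simple arithmetic. The sketch below uses only the figures quoted in the text (divider of 4, limit of about 35MHz, nominal frequency of 62.5MHz, generator limit of 80MHz); the function names are illustrative, and treating 35MHz as a hard threshold is a simplification.

```python
# Read-back clock check, using the numbers given in the text.
# Assumption: the ~35 MHz read-back limit is treated as a sharp threshold.

READBACK_DIVIDER = 4        # read-back clock = test clock / 4
READBACK_LIMIT_MHZ = 35.0   # above this, the value was latched too early


def readback_clock_mhz(test_clock_mhz):
    """Clock frequency seen by the shift register during read-back."""
    return test_clock_mhz / READBACK_DIVIDER


def readback_is_safe(test_clock_mhz):
    """True if the divided clock stays within the read-back limit."""
    return readback_clock_mhz(test_clock_mhz) <= READBACK_LIMIT_MHZ


for f_mhz in (20.0, 62.5, 80.0):
    print(f"{f_mhz} MHz test clock -> {readback_clock_mhz(f_mhz)} MHz read-back, "
          f"safe: {readback_is_safe(f_mhz)}")
```

With the divider in place, the read-back clock stays well below the limit over the whole range of the 80MHz frequency generator.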
10.3.3 MATLAB based measurements automation
To test the manufactured circuits, a large number of current measurements were
required at different frequencies and supply voltages, and this for every multiplier.
Moreover, the measurement of the power consumption during runtime needed to be
synchronized with the design under test. For these reasons, an automated way to set
the parameters (frequency, supply voltage) and to check the results was required.
To perform an automated measurement the following devices have been used:
• Agilent 33250A: Frequency generator, this device can generate a square wave
frequency up to 80MHz;
• Keithley 213: Power supply and control signal generator, this device is a Quad
Voltage Source (QVS) and includes 8 digital inputs and 8 digital outputs.
• Keithley Sourcemeter 2400: Power supply and ammeter with a precision
up to 10 pA.
All these devices support the GPIB (General Purpose Interface Bus) protocol, a
standard for controlling instruments remotely. The instruments were
connected with a cable to a computer equipped with a National Instruments acquisition
card and controlled by MATLAB. In order to be able to use the GPIB protocol, the
Instrument Control Toolbox for MATLAB was required. The MATLAB source code
used for the measurements is reported in Appendix E.
To determine whether a multiplier was able to work at a given frequency and supply
voltage, the test was performed 10 times in a row under the same conditions.
If at least one of these 10 tries was successful, the multiplier was considered
capable of working under this condition (even if not every time).
The frequency range for most of the tests spans from 1 to 20MHz, whereas the
supply voltage was set in steps of 10mV.
Finally, the core (i.e. all the design but the multipliers) was supplied with 100mV
more than the multiplier under test, in order to avoid as much as possible being
limited by the working supply voltage of the data generator block.
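The pass criterion and the voltage stepping described above can be sketched as a small search routine. This is not the MATLAB code of Appendix E: `run_test` is a hypothetical callable standing in for the instrument control and FPGA handshake, and the downward sweep direction, the start voltage and the stop voltage are assumptions. Only the 10 tries, the 10mV step and the 100mV core offset come from the text.

```python
# Sketch of the lowest-working-supply-voltage search (voltages in mV).
# Hypothetical: run_test(freq_mhz, vdd_mult_mv, vdd_core_mv) -> bool, the
# sweep direction, and the start/stop voltages. From the text: 10 tries,
# 10 mV step, core supplied 100 mV above the multiplier under test.

TRIES = 10
STEP_MV = 10
CORE_OFFSET_MV = 100


def works_at(run_test, freq_mhz, vdd_mv):
    """A condition counts as working if at least one of the tries succeeds."""
    return any(
        run_test(freq_mhz, vdd_mv, vdd_mv + CORE_OFFSET_MV) for _ in range(TRIES)
    )


def lowest_working_vdd_mv(run_test, freq_mhz, start_mv=1000, stop_mv=150):
    """Step the supply down until the test fails; return the last passing value."""
    lowest = None
    vdd_mv = start_mv
    while vdd_mv >= stop_mv and works_at(run_test, freq_mhz, vdd_mv):
        lowest = vdd_mv
        vdd_mv -= STEP_MV
    return lowest


def fake_chip(freq_mhz, vdd_mult_mv, vdd_core_mv):
    """Purely synthetic stand-in for the hardware, for demonstration only."""
    return vdd_mult_mv >= 200 + 10 * freq_mhz


for f_mhz in (1, 5, 20):
    print(f"{f_mhz} MHz -> lowest working Vdd: "
          f"{lowest_working_vdd_mv(fake_chip, f_mhz)} mV")
```

Because a single success out of 10 tries is accepted, the voltage found in this way is an optimistic limit of functionality rather than a guaranteed operating point.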
10.4 Measurements
Two chips (No.2 and No.3) have been chosen arbitrarily for a complete power
consumption analysis and discussion. First, the power measurements at nominal
conditions (V dd=1V and f=62.5MHz) are compared with the values reported by
Synopsys DC. Then, detailed power measurements of each multiplier of both chips are
presented for frequencies ranging from 1 to 20MHz. Finally, the power and delay
variability is discussed, based on data measured over 16 dies manufactured on the
same wafer.
10.4.1 Nominal values
The nominal power consumptions and the critical path delay for chip No.2 and No.3
are reported in Table 10.3.
              Chip No.2                         Chip No.3
              Mult 0  Mult 1  Mult 2  Mult 3    Mult 0  Mult 1  Mult 2  Mult 3
Pstat [µW]    132     501     1152    4515      169     631     1571    5931
Pdyn [µW]     3080    3312    2957    3395      3103    3350    2978    3385
Ptot [µW]     3212    3813    4109    7910      3272    3981    4549    9316
fmax [MHz]    74.5    -       -       -         75.4    -       -       -
Delay [ns]    13.42   -       -       -         13.26   -       -       -
Table 10.3: Measured nominal (1V@62.5MHz) power consumption and maximal working
frequency
These values can be compared with the ones provided by Synopsys and reported in
Table 10.1. The most remarkable difference comes from the static power consumption.
Indeed, the measured static power is around 4-5 times larger than expected. This
clearly points out a major problem of nanometer CMOS technologies: parameter
variability. As explained in more detail in the following subsections, this large
increase of the static power is most likely due to a threshold voltage much lower
than expected, probably resulting from poorly controlled effective transistor
dimensions and doping profiles. Nevertheless, the ratios of the static power between
the 4 multipliers of the same chip remain almost correct: the parallel 4 versions,
for instance, show about 4 times the static power of the basic versions.
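The ratios discussed in this paragraph follow directly from Table 10.1 and Table 10.3. The short sketch below simply recomputes them; the variable and function names are illustrative, and the numbers are copied from the two tables.

```python
# Static power: measured (Table 10.3) versus expected (Table 10.1), in uW.
# Lists are ordered Mult 0, Mult 1, Mult 2, Mult 3.

EXPECTED_PSTAT_UW = [33, 128, 273, 1030]
MEASURED_PSTAT_UW = {
    "chip2": [132, 501, 1152, 4515],
    "chip3": [169, 631, 1571, 5931],
}


def measured_over_expected(chip):
    """Ratio of measured to expected static power for each multiplier."""
    return [m / e for m, e in zip(MEASURED_PSTAT_UW[chip], EXPECTED_PSTAT_UW)]


def parallel_over_basic(pstat_uw):
    """Static power ratio parallel 4 / basic, for the SVT pair and the LVT pair."""
    return pstat_uw[1] / pstat_uw[0], pstat_uw[3] / pstat_uw[2]


for chip, values in MEASURED_PSTAT_UW.items():
    ratios = ", ".join(f"{r:.2f}" for r in measured_over_expected(chip))
    svt, lvt = parallel_over_basic(values)
    print(f"{chip}: measured/expected = [{ratios}], "
          f"par4/basic = {svt:.2f} (SVT), {lvt:.2f} (LVT)")
```

The parallel 4 to basic ratio stays close to 4 on both chips, even though the absolute static power is several times larger than the synthesis estimate.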
Regarding the dynamic power, the measurements are less surprising, but the results
still show a dynamic consumption lower than expected. The reasons
could be lower capacitances, due to variations of the effective transistor dimensions,
and/or a slightly different activity (as a reminder, the activity of all nodes,
including internal cell nodes, was estimated from the activity on the nets
connecting cells).
The delay of the critical path was measured by increasing the frequency at the
nominal supply voltage of 1V until the multiplier stopped working. The measurement
was only possible for Multiplier 0 (RCA basic SVT) because the frequency generator
available at our laboratory only reaches 80MHz, which was not enough
for measuring the other three multipliers. The measured delay is very close to the
expected one of 13.28ns.
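The delay values of Table 10.3 correspond to the period of the maximal working frequency. The following sketch reproduces this conversion and the deviation from the expected delay; it assumes that the critical path delay equals exactly one clock period at fmax, and the function names are illustrative.

```python
# Critical path delay derived from the maximal working frequency.
# Assumption: delay = one clock period at fmax (setup margins ignored).

EXPECTED_DELAY_NS = 13.28  # Mult0, Table 10.1
MEASURED_FMAX_MHZ = {"chip2": 74.5, "chip3": 75.4}  # Mult0, Table 10.3


def delay_ns_from_fmax(fmax_mhz):
    """Clock period in ns at the given frequency in MHz."""
    return 1000.0 / fmax_mhz


def deviation_percent(measured_ns, expected_ns):
    """Relative deviation of the measured delay from the expected one."""
    return (measured_ns - expected_ns) / expected_ns * 100.0


for chip, fmax in MEASURED_FMAX_MHZ.items():
    delay = delay_ns_from_fmax(fmax)
    print(f"{chip}: delay = {delay:.2f} ns, "
          f"deviation = {deviation_percent(delay, EXPECTED_DELAY_NS):+.2f} %")
```

Both chips land within about 1% of the synthesis estimate, in contrast with the large deviation observed for the static power.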
10.4.2 Lowest working supply voltage
The expected lowest working supply voltage for a given frequency is reported in
Fig. 10.8 and is based on Eq. (6.15).
Figure 10.8: Expected optimal supply voltage (x-axis: Frequency [MHz], 0 to 20;
y-axis: Optimal supply voltage V dd [V], 0.35 to 0.55; curves: rca32 svt,
rca32_par4 svt, rca32 lvt, rca32_par4 lvt)
As we can observe, the supply voltages decrease until they reach the corresponding
threshold voltage at 0MHz. The non-parallel versions have a slightly steeper slope
than the parallelized versions. This is due to the larger LD eff of the basic
version, which makes it “harder” to reduce the supply voltage until very low
frequencies are reached. Mathematically, the larger LD eff appears as a larger χ.
A similar plot has been obtained, by measurement, for chip No.2 and No.3. Results
are reported in Fig. 10.9 and Fig. 10.10 respectively.
Figure 10.9: Measured optimal supply voltage for chip No.2 (x-axis: Frequency [MHz],
0 to 20; y-axis: Optimal supply voltage V dd [V], 0.2 to 0.5; curves: rca32 svt,
rca32_par4 svt, rca32 lvt, rca32_par4 lvt)
At first glance, Fig. 10.9 and Fig. 10.10 show the same shape as Fig. 10.8, but in
reality they present lower values than the theoretical case. In particular, it is
interesting to note the converging values at very low frequencies (the missing values
are due to non-working conditions resulting from a too low supply voltage). As seen
before, these converging values correspond to the threshold voltage of the technology.
From these plots, we can infer that the threshold voltages of the measured
circuits should be around 0.2V or even lower. This is quite different from the value
of around 0.33V reported in Fig. 10.8 (remember that V th = V th0 − ηV dd). With
a lower V th, it is now understandable why the optimal V dd values are lower than the
theoretical ones, while the shape of the plot is maintained.
This much lower threshold voltage can also easily explain the large factor of 4-5
between the measured static power and the expected one at nominal conditions.
In fact, the static power depends exponentially on the threshold voltage, as reported
in Eq. (3.5).
It is also worth noting that all multipliers of chip No.2 and No.3 worked at 250mV
and that two multipliers (mult 1 and mult 2) of chip No.3 worked at a supply voltage
as low as 210mV with a frequency of 1MHz!
Figure 10.10: Measured optimal supply voltage for chip No.3 (x-axis: Frequency [MHz],
0 to 20; y-axis: Optimal supply voltage V dd [V], 0.2 to 0.45; curves: rca32 svt,
rca32_par4 svt, rca32 lvt, rca32_par4 lvt)
10.4.3 Optimal total power
The total power consumption can now be calculated for the lowest working supply
voltage (optimal V dd) thanks to Eq. (3.6). Fig. 10.11 illustrates it for the theoretical
case.
The measured optimal total powers for chips No.2 and No.3 are reported in
Fig. 10.12 and Fig. 10.13 respectively. The missing points correspond to values of the
optimal V dd too low to permit correct measurements.
As for the optimal supply voltage, we can observe that the shape of the measured
plots is very similar to the theoretical one, but the corresponding optimal power
is lower for the real circuits. This can, once more, be explained by the lower real
threshold voltage, which permits a lower optimal supply voltage and hence a lower
optimal total power.
The measured optimal supply voltages for mult 3 (RCA parallel 4 LVT) were very
low and, for this reason, the reported optimal total power should be taken with care.
In both chips, the measurements for the multipliers using the SVT
transistor type are very similar, whereas chip No.3 shows a slightly higher consumption
for the LVT type compared to chip No.2. This can be explained by the higher static
power consumption of chip No.3 reported in Table 10.3, which manifests itself mainly
in the LVT multipliers, where static power is predominant.
Figure 10.11: Expected optimal total power consumption (x-axis: Frequency [MHz],
0 to 20; y-axis: Optimal total power consumption [µW], 0 to 350; curves: rca32 svt,
rca32_par4 svt, rca32 lvt, rca32_par4 lvt)
Figure 10.12: Measured optimal total power consumption for chip No.2 (x-axis:
Frequency [MHz], 0 to 20; y-axis: Optimal total power consumption [µW], 0 to 250;
curves: rca32 svt, rca32_par4 svt, rca32 lvt, rca32_par4 lvt)
Figure 10.13: Measured optimal total power consumption for chip No.3 (x-axis:
Frequency [MHz], 0 to 20; y-axis: Optimal total power consumption [µW], 0 to 300;
curves: rca32 svt, rca32_par4 svt, rca32 lvt, rca32_par4 lvt)
The large variations in the technology parameters, discussed further in the next
section, make it very difficult to accurately predict the optimal total power over
such a large range of frequencies. Nevertheless, the main shapes of the plots are
maintained. In particular, let us consider the crossing points between the RCA basic
SVT curve and both the RCA parallel 4 SVT and RCA basic LVT curves. In the theoretical
plot these crossings occur at 7MHz and 17MHz respectively.
If we look at the same crossings in the measured data, we observe them at 5MHz and
13MHz for chip No.2 and at 6MHz and 17MHz for chip No.3. These results are very
similar to the expected ones, considering the large technology parameter variations
observed.
In practice, if a design is intended to work at 2MHz, the RCA basic SVT is the best
choice for low power. If it is designed for 10MHz, the RCA parallel 4 shows a better
power profile. At 20MHz, the RCA basic SVT consumes more than the RCA basic LVT,
which in turn consumes more than the RCA parallel 4 SVT.
10.4.4 Power and delay variability
In the preceding discussions, it was pointed out many times that technology
parameters vary considerably from die to die even when the dies come from the same
wafer, as is the case for all the chips investigated in this thesis.
To explore this aspect a little deeper, the static power, dynamic power and critical
path delay (obtained from the maximal working frequency) of multiplier 0 (RCA
basic SVT) at nominal conditions (1V/62.5MHz) have been measured for 16 different
dies.
Figure 10.14: Nominal static power distribution for 16 chips (x-axis: dies C2M0,
C3M0, C6M0, C7M0, C8M0, C9M0, C10M0, C12M0, C13M0, C14M0, C15M0, C16M0, C17M0,
C18M0, C19M0, C20M0; y-axis: Nominal Pstat [mW], 0.000 to 0.200)
Figure 10.15: Nominal dynamic power distribution for 16 chips at 62.5MHz (x-axis:
same 16 dies; y-axis: Nominal Pdyn [mW], 2.94 to 3.14)
The data corresponding to the nominal static power is reported in Fig. 10.14. Here,
we can see that the static power spans from a minimum of 75 µW to a maximum of
190 µW, which corresponds to a factor larger than 2.5! Moreover, the average value
of 117 µW is more than 3.5 times larger than the value estimated by Synopsys! This
variability makes power estimation very problematic for circuits dominated by
static power.
The nominal dynamic power consumption presents a much lower variability between
dies, as illustrated in Fig. 10.15. In fact, all measured values lie in a
range from 3012 µW to 3121 µW, which corresponds to a variation of ±2% around the
average value of 3062 µW. This is “only” 17% lower compared to the value provided
by Synopsys. Moreover, by comparing the static power distribution with the dynamic
one, we can observe a small correlation between the two. Indeed, most of the time
a die with a higher static power consumption also shows a relatively high dynamic
power. A possible explanation for this is the shortcut current (explained in
Chapter 2.1.2). In fact, a higher sub-threshold current (lower V th or higher I0 or
both) also means a higher “on” current, which increases the shortcut dissipation.
This could also explain why the variations of the dynamic power only amount to a
few percent.
Figure 10.16: Delay distribution of the RCA SVT multiplier for 16 chips (x-axis:
dies C2M0, C3M0, C6M0, C7M0, C8M0, C9M0, C10M0, C12M0, C13M0, C14M0, C15M0, C16M0,
C17M0, C18M0, C19M0, C20M0; y-axis: Delay [ns], 12.8 to 14.2)
Fig. 10.16 reports the measured critical path delay variability over 16 different dies.
As for the dynamic power, the variation is quite limited and corresponds to ±3%
around the average value of 13.55 ns. Moreover, this delay is only 2% larger than the
value reported by Synopsys. It is also worth noting that no correlation was observed
between the power consumption and delay distributions.
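The variability figures quoted in this subsection can be recomputed from the extreme and average values given in the text. The sketch below does so for the static power, the dynamic power and the delay; the helper names are illustrative, and only numbers stated in the text or in Table 10.1 are used.

```python
# Die-to-die variability of multiplier 0, from the values quoted in the text.

def spread_factor(lowest, highest):
    """Ratio between the highest and the lowest measured value."""
    return highest / lowest


def half_range_percent(lowest, highest, mean):
    """Largest deviation from the mean, in percent of the mean."""
    return max(highest - mean, mean - lowest) / mean * 100.0


def ratio_to_estimate(measured_mean, estimate):
    """Measured average divided by the Synopsys estimate."""
    return measured_mean / estimate


# Static power [uW]: min 75, max 190, mean 117; Synopsys estimate 33.
print(f"Pstat spread factor:    {spread_factor(75, 190):.2f}")
print(f"Pstat mean / estimate:  {ratio_to_estimate(117, 33):.2f}")

# Dynamic power [uW]: min 3012, max 3121, mean 3062.
print(f"Pdyn variation:         +/-{half_range_percent(3012, 3121, 3062):.1f} %")

# Delay [ns]: mean 13.55; Synopsys estimate 13.28.
print(f"Delay mean / estimate:  {ratio_to_estimate(13.55, 13.28):.3f}")
```

The contrast is clear from these numbers: the static power spreads by a factor of about 2.5 between dies, while the dynamic power stays within about 2% of its mean.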
10.5 Summary
This chapter discussed the demonstrator circuit used to investigate the influence of
technology and architectural modifications on the optimal total power. The technology
used was the 90nm process from ST Microelectronics, which permitted us to implement
two different transistor types on the same chip. Moreover, two different 32 bit
multipliers were implemented for each transistor type, yielding a total of 4
multipliers. The first part of the chapter was dedicated to the circuit design,
followed by the description of the measurement setup and, finally, by the presentation
and discussion of the measured data. In particular, we observed an average static power
3.5 times higher than the typical values estimated by Synopsys, whereas the dynamic
power was only 17% lower on average. The large difference between the simulation
and the real measurements can be explained by the threshold voltage, which, in reality,
appeared to be much lower than the theoretical one. Besides these important differences
observable at nominal conditions (V dd=1V and f=62.5MHz), the total power
of multipliers working at the lowest possible supply voltages was discussed. The
measured values showed a shape very similar to the expected one, but with different
absolute values. This can also be explained by the lower real V th. At the end of
the chapter, the variability of power and delay was reported for the same multiplier
in 16 different chips. The results showed a static power varying by as much as a factor
of 2.5 between the lowest and highest value for multipliers coming from the same wafer!
Without doubt, this large variability of static power will be a major issue in nanometer
CMOS technologies, especially for designs where static power is a large contributor.
Chapter 11
Conclusions
With the introduction of nanometer CMOS technologies, new sources of power
dissipation appeared. The continued shrinking of transistor sizes, dictated by Moore’s
law, has reached a point where new physical phenomena need to be faced. One of the
most important problems related to these new phenomena is the huge increase of the
static power consumption, which can become even larger than the dynamic power
of a running circuit. The static power consumption is the portion of the power
dissipation caused by the current that constantly flows from V dd to V ss, even when
the circuit is in idle state. In current technologies, the principal contributor to
static power is the sub-threshold current flowing through the transistors in the off
state. This type of current arises from the diffusion of the minority carriers in the
transistor channel. The reason why this current is increasing so much in recent
nanometer technologies is that it depends exponentially on the transistor threshold
voltage, which is constantly reduced with new technologies to keep the speed acceptable.
The goal of this thesis was to investigate low power methodologies in technologies
dominated by a large static power consumption. In particular, we were interested
in the influence of architecture as well as technology on the total power consumption.
The principal theoretical framework presented in this thesis considers a scenario in
which both the supply voltage and the threshold voltage can be freely modified. Under
this assumption, the total power consumption clearly shows a minimum located at
very low supply voltages (examples showed an optimal V dd lower than 0.4V even at a
frequency as high as 62.5MHz). The derivation of the ratio k1 (i.e. optimal dynamic
power over optimal static power) showed that, since this ratio is quite constant
compared to the variation of Ion/Ioff between technology nodes, nanometer technologies
will require a growing ratio a/LD (activity over logical depth) to reach this optimum.
This, for instance, will make pipelining preferable to parallelization. After that,
we have seen the influence of a, LD and f on the optimal V dd and V th, showing
that frequency mainly influences the optimal V th, logical depth mainly influences
the optimal V dd, while activity influences both of them. By comparing architectures
under the rough approximation of a quasi-constant k1, we realized that pipelining
and parallelization are more effective for low power when the architecture shows a
high logical depth and works at a high frequency. We also observed that new
technologies, characterized by a lower χ factor compared to older ones, will tend to
penalize pipelining and parallelization, whereas the condition for a power saving by
pipelining remains easier to fulfill than the one for parallelization.
Going beyond the quasi-constant k1 approach, analytical closed-form equations
have been derived for the calculation of the optimal V dd, optimal V th and optimal
total power directly from the architectural and technology parameters. Thanks to
these equations, we observed that the optimal V th is almost unchanged by pipelining,
while parallelization increases it by a precise amount, which only depends on the
degree of parallelization. Moreover, sequential multipliers were clearly shown to be
inadequate for low power at the optimal working condition due to the large effective
logical depth and the high number of transitions (a ·N).
From a low power point of view, the best characteristics for an ideal technology
would be a capacitance C, delay constant kt and sub-threshold slope n as low as
possible, whereas the reference current I0 and alpha power law coefficient α should
be as high as possible.
After the discussion of the technology influence, a few possibilities for modifying the
threshold voltage (like body bias, transistor resizing, technology choice) were also
presented.
Under all the investigated architectural and technology modifications, the simple
approximated analytical equations developed in Chapter 6 for the optimal V dd, V th
and Ptot showed very good results, with errors always lower than a few percent
compared to numerical computation based on non-approximated equations.
In a second framework, the opposite case was considered, in which the threshold
voltage as well as the supply voltage were assumed constant. This particular case
was explored because it corresponds to the most typical case for industrial designers.
In fact, they often have a fixed supply voltage and threshold voltage imposed by
the technology and/or the devices the circuit has to interface with. Under this condition,
graphical tools for total power comparison of different architectures were presented.
Examples of application of these tools to the same multipliers used in the previous
framework were reported. In particular, we showed that, depending on the constraints
used, the multiplier presenting the lowest total power is not always the same.
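The fixed-Vdd, fixed-Vth comparison can be sketched in a few lines. This is only an illustration under assumed models (alpha-power-law delay, sub-threshold static power), not the graphical tools of the thesis; the two architecture entries and every number in `TECH` and `ARCHS` are hypothetical.

```python
import math

UT = 0.0259  # thermal voltage near room temperature, in volts (assumed)

# Hypothetical technology parameters.
TECH = dict(c=1e-15, i0=1e-6, n=1.3, kt=1e-11, alpha=1.5)

# Hypothetical architectures: cell count, activity, effective logical depth.
ARCHS = {
    "rca_basic": dict(n_cells=3000, a=0.5, ld=60),
    "rca_parallel4": dict(n_cells=12500, a=0.13, ld=16),
}


def critical_path_delay(vdd, vth, ld):
    """Alpha-power-law delay of the critical path (assumed model)."""
    return ld * TECH["kt"] * vdd / (vdd - vth) ** TECH["alpha"]


def total_power(vdd, vth, f, n_cells, a):
    """Dynamic plus sub-threshold static power at a fixed Vdd and Vth."""
    p_dyn = a * n_cells * TECH["c"] * vdd ** 2 * f
    p_stat = n_cells * TECH["i0"] * math.exp(-vth / (TECH["n"] * UT)) * vdd
    return p_dyn + p_stat


def best_architecture(vdd, vth, f):
    """Return (name, power) of the lowest-power architecture meeting timing,
    or None when no architecture is fast enough at this Vdd and Vth."""
    best = None
    for name, arch in ARCHS.items():
        if critical_path_delay(vdd, vth, arch["ld"]) > 1.0 / f:
            continue  # timing constraint violated
        power = total_power(vdd, vth, f, arch["n_cells"], arch["a"])
        if best is None or power < best[1]:
            best = (name, power)
    return best


if __name__ == "__main__":
    for freq in (500e6, 2e9, 5e9):
        print(freq, best_architecture(1.0, 0.3, freq))
```

With these made-up numbers the smaller architecture wins while it still meets timing, and the parallel one takes over once the frequency constraint excludes it, which is the kind of constraint-dependent ranking described above.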
At the end of the thesis, a physical implementation of four different 32-bit
multipliers was presented. These four multipliers represent all possible combinations
of two transistor types (SVT and LVT) and two architectures (RCA basic and
RCA parallel 4). After an in-depth description of the circuit design flow and
measurement setup, the nominal power consumptions as well as the optimal ones (those
corresponding to the lowest working supply voltages) were compared to the theoretical
values. The measured data showed, on average, a static power 3.5 times larger than
expected. This was attributed to real threshold voltages much lower than
the simulated ones. Nevertheless, the shapes of the plots remained very similar to the
expected ones. This means that, even if the absolute values were not well estimated
by the models (due to the large variability of the technology parameters), the relation
between them was respected. This is essential for predicting which multiplier
presents the lowest total power at a given working frequency. It is also interesting
to note that a few multipliers were able to work at a supply voltage of 210 mV and a
frequency of 1 MHz. Finally, the variability of power and delay over 16 chips coming
from the same wafer was reported. In particular, the static power at nominal
conditions (Vdd = 1 V, f = 62.5 MHz) fluctuated strongly, with a factor
of more than 2.5 between the highest and the lowest measured values. On
the other hand, the variations of the dynamic power and delays were within ±3%.
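As a rough plausibility check of the threshold-voltage explanation, one can ask what Vth shift would produce a given static power ratio. The sketch below assumes that static power is dominated by sub-threshold leakage proportional to exp(−Vth/(n·Ut)); the slope factor n = 1.3 and the 300 K temperature are assumptions, not values taken from the measurements.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def vth_shift_for_leakage_ratio(ratio, n=1.3, temp_k=300.0):
    """Vth reduction (in volts) that multiplies sub-threshold leakage by `ratio`.

    Assumes I_leak ~ exp(-Vth / (n * Ut)) with Ut = k*T/q; n and temp_k are
    hypothetical defaults.
    """
    ut = K_B * temp_k / Q_E
    return n * ut * math.log(ratio)


if __name__ == "__main__":
    # 3.5x is the measured-to-expected static power ratio, 2.5x the chip-to-chip spread.
    for ratio in (3.5, 2.5):
        print(ratio, round(1e3 * vth_shift_for_leakage_ratio(ratio), 1), "mV")
```

Under these assumptions both ratios correspond to Vth differences of only a few tens of millivolts, which is why the static power is so much more sensitive to parameter variations than the dynamic power or the delay.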
From these observations, we can conclude that the major problem that
technologists will have to face in the future will be the difficulty of controlling the
variations of the technology parameters. The price to pay for failing to do so would be
many circuit instabilities and very low production yields, due to many dies being unable
to meet the specifications.
Bibliography
[1] SIA ITRS roadmap update 2006 - http://www.itrs.net/.
[2] http://en.wikipedia.org/wiki/Moore's_law.
[3] Transistor elements for 30nm physical gate length and beyond. Intel Technology
Journal, Vol. 06(No. 2):42–54, May 2002.
[4] H. Soeleman, K. Roy, and B. Paul. Robust ultra-low power sub-threshold DT-
MOS logic. International Symposium on Low Power Electronics and Design,
2000.
[5] T. Enomoto, Y. Oka, H. Shikano, and T. Harada. A self controllable voltage
level (SVL) circuit for low power high speed CMOS circuit. European Solid-State
Circuits Conference, pages 411–414, 2002.
[6] S. Cserveny, J.-M. Masgonty, and C. Piguet. Stand-by power reduction for storage
circuits. PATMOS Conference, September 2003.
[7] S.M. Kang and Y. Leblebici. CMOS Digital Integrated Circuits: Analysis and
Design, Third Edition. McGraw-Hill, 2003.
[8] M. Anis and M. Elmasry. Multi-Threshold CMOS Digital Circuits. Kluwer Aca-
demic Publisher, 2003.
[9] K. Usami, N. Kawabe, M. Koizumi, K. Seta, and T. Furusawa. Automated selective
Multi-Threshold design for ultra-low standby applications. International
Symposium on Low Power Electronics and Design, 2002.
[10] J. Kao and A. Chandrakasan. MTCMOS sequential circuits. European Solid-
State Circuits Conference, 2001.
[11] V.R. von Kaenel, M.D. Pardoen, E. Dijkstra, and E.A. Vittoz. Automatic adjustment
of threshold & supply voltage for minimum power consumption CMOS
digital circuits. IEEE Symposium on Low Power Electronics, pages 78–79,
October 1994.
[12] C. H. Kim and K. Roy. Dynamic Vt SRAM: A leakage tolerant cache memory for
low voltage microprocessor. International Symposium on Low Power Electronics
and Design, 2002.
[13] H. Mizuno, K. Ishibashi, T. Shimura, T. Hattori, S. Narita, K. Shiozawa, S. Ikeda,
and K. Uchiyama. An 18uA standby current 1.8V, 200MHz microprocessor with
self-substrate-biased data-retention mode. IEEE International Solid-State Cir-
cuits Conference, pages 280–281, 1999.
[14] F. Assaderaghi, D. Sinitsky, S. Parke, S. Bokor, P.K. Ko, and C. Hu. A dynamic
threshold voltage MOSFET (DTMOS) for ultra-low voltage operation. IEEE
International Electron Devices Meeting Technical Digest, pages 809–812, 1994.
[15] A. P. Chandrakasan and R.W. Brodersen. Low power CMOS digital design.
IEEE Journal of Solid-State Circuits, Vol. 27(No. 4):473–484, April 1992.
[16] H.J.M. Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact
on the design of buffer circuits. IEEE Journal of Solid-State Circuits, Vol.
19(No. 4):468–473, August 1984.
[17] D. Auvergne, P. Maurine, and N. Azémard. Low Power Electronics Design, Chapter 6:
Modeling for Designing in Deep Submicron Technologies. CRC Press, 2005.
[18] K. Nose and T. Sakurai. Closed-form expression for short-circuit power of short
channel CMOS gates and its scaling characteristics. International Technical Con-
ference on Circuits/Systems, Computers and Communications, pages 1741–1744,
1998.
[19] S. Turgis, N. Azemard, and D. Auvergne. Short-circuit power dissipation calcu-
lation on CMOS inverters using the equivalent short-circuit capacitance concept.
PATMOS Conference, 1995.
[20] T. Sakurai and A. R. Newton. Alpha-power law MOSFET model and its appli-
cations to CMOS inverter delay and other formulas. IEEE Journal of Solid-State
Circuits, Vol. 25(No. 2):584–594, April 1990.
[21] T. Sakurai and A.R. Newton. A simple MOSFET model for circuit analysis.
IEEE Transactions on Electron Devices, Vol. 38(No. 4):887–894, April 1991.
[22] J.L. Rosselló and J. Segura. Accurate modelling of leakage currents in nanometre
CMOS technologies. Electronics Letters, Vol. 41(No. 3):122–124, February 2005.
[23] Z. Chen, M. Johnson, L. Wei, and K. Roy. Estimation of standby leakage power in
CMOS circuits considering accurate modeling of transistor stacks. International
Symposium on Low Power Electronics and Design, pages 239–244, 1998.
[24] B.J. Sheu, D.L. Scharfetter, P.-K. Ko, and M.-C. Jeng. BSIM: Berkeley Short-Channel
IGFET model for MOS transistors. IEEE Journal of Solid-State Circuits,
Vol. 22(No. 4):558–566, August 1987.
[25] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage current mechanisms
and leakage reduction techniques in deep-submicrometer CMOS circuits.
Proceedings of the IEEE, Vol. 91(No. 2):305–327, February 2003.
[26] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin,
M. Kandemir, and V. Narayanan. Leakage current: Moore’s law meets static
power. IEEE Computer, pages 68–75, December 2003.
[27] John Robertson. High dielectric constant gate oxides for metal oxide Si transistors.
Reports on Progress in Physics, Vol. 69:327–396, 2006.
[28] A. Keshavarzi, K. Roy, and C. F. Hawkins. Intrinsic leakage in low power deep
submicron CMOS ICs. IEEE International Test Conference, pages 146–155, 1997.
[29] S. Mukhopadhyay and K. Roy. Modeling and estimation of total leakage current
in nano-scaled CMOS devices considering the effect of parameter variation. IEEE
International Symposium on Circuits and Systems, pages 172–175, 2003.
[30] A. Ferré and J. Figueras. Low Power Electronics Design, Chapter 3: Leakage in
CMOS nanometric technologies. CRC Press, 2005.
[31] D. Helms, E. Schmidt, and W. Nebel. Tutorial: Leakage in CMOS circuits - an
introduction. PATMOS Conference, pages 17–35, 2004.
[32] TSMC. ANTCBN90G 110A - TSMC technology manual.
[33] SIA ITRS roadmap 2004 - http://www.itrs.net/.
[34] http://www.intel.com/pressroom/archive/releases/20070128comp.htm.
[35] K. Nose and T. Sakurai. Optimization of Vdd and Vth for low-power and high-
speed applications. Asia South Pacific Design Automation Conference, pages
469–474, January 2000.
[36] M. H. Fino. A simple submicron MOSFET model and its application to the
analytical characterization of analog circuits. European Conference on Circuit
Theory and Design, August 2005.
[37] K.A. Bowman, B.L. Austin, J.C. Eble, Xinghai Tang, and J.D. Meindl. A physical
alpha-power law MOSFET model. IEEE Journal of Solid-State Circuits, Vol.
34(No. 10):1410–1414, October 1999.
[38] T. Sakurai. Alpha power-law MOS model. IEEE Solid-State Circuits Society
Newsletter, Vol. 9(No. 4):4–5, October 2004.
[39] J.L. Rosselló and J. Segura. Charge-based analytical model for evaluation of power
consumption in submicron CMOS buffers. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, Vol. 21(No. 4), April 2002.
[40] J.A. Butts and G.S. Sohi. A static power model for architects. ACM International
Symposium on Microarchitecture, pages 191–201, December 2000.
[41] C.S. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic
Computers, Vol. 13:14–17, February 1964.
[42] P. C. H. Meier. Analysis and Design of Low Power Digital Multipliers. PhD
thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1999.
[43] H. Reza. A new multiplier using Wallace structure and carry select adder with
pipelining. IEEE International Symposium on Circuits and Systems, 2002.
[44] R. Zimmermann. Binary Adder Architectures for Cell-Based VLSI and their
Synthesis. PhD thesis, Swiss Federal Institute of Technology, Zurich, 1997.
[45] R. P. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Trans-
actions on Computers, Vol. 31(No. 3):260–264, 1982.
[46] W. J. Townsend, E. E. Swartzlander, Jr., and J. A. Abraham. A comparison
of Dadda and Wallace multiplier delays. In F. T. Luk, editor, Advanced Signal
Processing Algorithms, Architectures, and Implementations XIII, pages 552–560,
December 2003.
[47] K.A.C. Bickerstaff, M. Schulte, and E.E. Swartzlander, Jr. Reduced area multipliers.
International Conference on Application-Specific Array Processors, pages
478–489, October 1993.
[48] A. Wang and A. Chandrakasan. A 180-mV subthreshold FFT processor using a
minimum energy design methodology. IEEE Journal of Solid-State Circuits, Vol.
40(No. 1):310–319, January 2005.
[49] H. Q. Dao, B. R. Zeydel, and V. G. Oklobdzija. Architectural considerations for
energy efficiency. International Conference on Computer Design, pages 13–16,
2005.
[50] B. Zhai et al. A 2.6pJ/Inst subthreshold sensor processor for optimal energy
efficiency. VLSI Circuits Symposium, 2006.
[51] B. H. Calhoun and A. Chandrakasan. Characterizing and modeling minimum
energy operation for subthreshold circuits. International Symposium on Low
Power Electronics and Design, pages 90–95, 2004.
[52] S. Hanson, B. Zhai, D. Blaauw, D. Sylvester, A. Bryant, and X. Wang. Energy
optimality and variability in subthreshold design. International Symposium on
Low Power Electronics and Design, pages 363–365, October 2006.
[53] D. Markovic, V. Stojanovic, B. Nikolic, M. A. Horowitz, and R. W. Brodersen.
Methods for true energy-performance optimization. IEEE Journal of Solid-State
Circuits, Vol. 39(No. 8):1282–1293, August 2004.
[54] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner. Theoretical and practical
limits of dynamic voltage scaling. Design Automation Conference, pages 868–873,
2004.
[55] J. Burr and A. Peterson. Ultra low power CMOS technology. NASA VLSI Design
Symposium, pages 4.2.1–4.2.13, 1991.
[56] C. Heer and J. Berthold. Designing low power circuits: an industrial point of
view. PATMOS Conference, September 2001.
[57] S.M. Sze. Semiconductor Devices - Physics and Technology. John Wiley & Sons,
1985.
[58] H. Ananthan, C. H. Kim, and K. Roy. Larger-than-Vdd forward body bias in
sub-0.5 V nanoscale CMOS. IEEE International Symposium on Low Power Electronics
and Design, pages 8–13, 2004.
[59] H. Ananthan. Evaluation of digital forward body bias for 70nm bulk CMOS.
Class Project, EE 695K, Fall 2003.
[60] T. Kuroda et al. A 0.9V, 150MHz, 10mW, 4 mm2, 2D discrete cosine transform
core processor with variable threshold voltage scheme. IEEE Journal of Solid-
State Circuits, Vol. 31(No. 11):1770–1779, 1996.
[61] S. Narendra et al. 1.1V 1GHz communications router with on-chip body bias
in 150nm CMOS. IEEE International Solid-State Circuits Conference, pages
270–271, 2002.
[62] P. Gupta, A.B. Kahng, P. Sharma, and D. Sylvester. Gate-length biasing for
runtime-leakage control. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, Vol. 25:1475–1485, 2006.
[63] http://en.wikipedia.org/wiki/Linear_feedback_shift_register.
[64] Xilinx. Xilinx LogiCORE: Linear Feedback Shift Register V3.0, 28 March 2003.
[65] http://www.nova-eng.com/inside.asp?n=products&p=constellation.
List of Publications
Conferences
• C. Piguet, C. Schuster, J.-L. Nagel. “Static and dynamic power reduction by
architecture selection”. Proc. Int’l Workshop on Power and Timing Modeling,
Optimization and Simulation, PATMOS’06, Montpellier, France, September 13-
15, 2006.
• C. Piguet, C. Schuster, J.-L. Nagel. “Leakage reduction at architecture level”.
Proc. Int’l Conference on Integrated Circuit Design & Technology, ICICDT’06,
Padova, Italy, May 24-26, 2006.
• C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. “Architectural and technology
influence on the optimal total power consumption”. Design, Automation and
Test in Europe Conference, DATE06, Munich, Germany, March 06-10, 2006.
• C. Piguet, C. Schuster, J.-L. Nagel. “Réduction des consommations statique
et dynamique par sélection des architectures”. 5ème journées d’études Faible
Tension Faible Consommation, FTFC05, Paris, May 18-19, 2005.
• C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. “Conception d’architectures
à Vdd et Vth imposés avec consommation totale minimale”. Journées Francophones
sur l’Adéquation Algorithme Architecture, JFAAA’05, Dijon, France,
January 18-21, 2005.
• C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. “Leakage reduction at the
architectural level and its application to 16 bit multiplier architectures”. Proc.
Int’l Workshop on Power and Timing Modeling, Optimization and Simulation,
PATMOS’04, Santorini Island, Greece, September 15-17, 2004.
• C. Piguet, C. Schuster, J.-L. Nagel. “Optimizing architecture activity and logic
depth for static and dynamic power reduction”. Proc. of the 2nd Northeast
Workshop on Circuits and Systems, NewCAS’04, Montréal, Canada, June 20-23,
2004.
Journals
• C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. “An architecture design
methodology for minimal total power consumption at fixed Vdd and Vth”.
Journal of Low Power Electronics, Vol.1(No.1):pp.3-10, April, 2005.
Appendix A
VHDL source code
A.1 top.vhd
-------------------------------------------------------------------------------
-- Title      : Circuit top (32 bit)
-- Project    :
-------------------------------------------------------------------------------
-- File       : top.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-17
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: This is the top of the design.
--              It includes the following blocks:
--              - data_gen
--              - 2 mult32
--              - 2 mult32_parallel4
--              - one-hot decoder and mux
--              - 2 ring oscillators
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-17  1.0      schuster  Created
-------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;

entity top is

  port (
    clk     : in  std_logic;            -- clock
    rst_n   : in  std_logic;            -- active low async reset
    ---------------------------------------------------------------------------
    s_in    : in  std_logic;            -- serial input
    s_out   : out std_logic;            -- serial output
    load_n  : in  std_logic;  -- when low registers are loaded in parallel
    shift_n : in  std_logic;            -- when low data is shifted
    ---------------------------------------------------------------------------
    -- select the source for data_out as well as for the data saved in registers
    sel_reg : in  std_logic;
    sel     : in  std_logic_vector(1 downto 0);  -- select the working multiplier
    Z_svt   : out std_logic;            -- out svt ring oscillator
    Z_lvt   : out std_logic);           -- out lvt ring oscillator

end top;

architecture arch of top is

  component data_gen
    port (
      clk        : in  std_logic;
      rst_n      : in  std_logic;
      data_in_v  : in  std_logic_vector(63 downto 0);
      data_out_v : out std_logic_vector(63 downto 0);
      s_in       : in  std_logic;
      s_out      : out std_logic;
      load_n     : in  std_logic;
      shift_n    : in  std_logic;
      sel_reg    : in  std_logic);
  end component;

  component mult
    port (
      clk   : in  std_logic;
      rst_n : in  std_logic;
      en    : in  std_logic;
      a_v   : in  std_logic_vector(31 downto 0);
      b_v   : in  std_logic_vector(31 downto 0);
      m_v   : out std_logic_vector(63 downto 0));
  end component;

  component mult_par4
    port (
      clk   : in  std_logic;
      rst_n : in  std_logic;
      en    : in  std_logic;
      a_v   : in  std_logic_vector(31 downto 0);
      b_v   : in  std_logic_vector(31 downto 0);
      m_v   : out std_logic_vector(63 downto 0));
  end component;

  component ring_svt
    generic (
      length : integer);
    port (
      Z : out std_logic);
  end component;

  component ring_lvt
    generic (
      length : integer);
    port (
      Z : out std_logic);
  end component;

  -- demultiplexed multipliers output
  signal general_m            : std_logic_vector(63 downto 0);
  -- multipliers input data containing both A and B
  signal general_a_b          : std_logic_vector(63 downto 0);
  -- multipliers input data separated as A and B
  signal general_a, general_b : std_logic_vector(31 downto 0);


  signal m0_v, m1_v, m2_v, m3_v : std_logic_vector(63 downto 0);  -- multipliers
                                                                  -- results
  signal en0, en1, en2, en3     : std_logic;  -- multipliers registers enable
  signal clk0, clk1, clk2, clk3 : std_logic;  -- multipliers registers clock

begin  -- arch
  -----------------------------------------------------------------------------
  -- component mapping
  -----------------------------------------------------------------------------
  data_gen_1 : data_gen
    port map (
      clk        => clk,
      rst_n      => rst_n,
      data_in_v  => general_m,
      data_out_v => general_a_b,
      s_in       => s_in,
      s_out      => s_out,
      load_n     => load_n,
      shift_n    => shift_n,
      sel_reg    => sel_reg);

  mult_0 : mult
    port map (
      clk   => clk0,
      rst_n => rst_n,
      en    => en0,
      a_v   => general_a,
      b_v   => general_b,
      m_v   => m0_v);
  mult_1 : mult_par4
    port map (
      clk   => clk1,
      rst_n => rst_n,
      en    => en1,
      a_v   => general_a,
      b_v   => general_b,
      m_v   => m1_v);
  mult_2 : mult
    port map (
      clk   => clk2,
      rst_n => rst_n,
      en    => en2,
      a_v   => general_a,
      b_v   => general_b,
      m_v   => m2_v);
  mult_3 : mult_par4
    port map (
      clk   => clk3,
      rst_n => rst_n,
      en    => en3,
      a_v   => general_a,
      b_v   => general_b,
      m_v   => m3_v);

  ring_svt_1 : ring_svt
    generic map (
      length => 437)
    port map (
      Z => Z_svt);
  ring_lvt_1 : ring_lvt
    generic map (
      length => 533)
    port map (
      Z => Z_lvt);

  -----------------------------------------------------------------------------
  -- combinatorial part
  -----------------------------------------------------------------------------
  general_a <= general_a_b(63 downto 32);
  general_b <= general_a_b(31 downto 0);


  -- one-hot decoder
  en0 <= '1' when sel = "00" else '0';
  en1 <= '1' when sel = "01" else '0';
  en2 <= '1' when sel = "10" else '0';
  en3 <= '1' when sel = "11" else '0';

  -- clock demux
  clk0 <= clk when sel = "00" else '0';
  clk1 <= clk when sel = "01" else '0';
  clk2 <= clk when sel = "10" else '0';
  clk3 <= clk when sel = "11" else '0';

  -- output mux
  with sel select
    general_m <=
    m0_v when "00",
    m1_v when "01",
    m2_v when "10",
    m3_v when "11",
    m0_v when others;

  -----------------------------------------------------------------------------
  -- sequential part
  -----------------------------------------------------------------------------

end arch;
A.2 data_gen.vhd
-------------------------------------------------------------------------------
-- Title      : Data generator (32 bit)
-- Project    :
-------------------------------------------------------------------------------
-- File       : data_gen.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-15
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: This block contains the pseudo-random data generator,
--              as well as the cyclic adder and corresponding registers.
--              Both generated random and input data can be output
--              serially by the shift_reg block.
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-15  1.0      schuster  Created
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;        -- used to add std_logic_vectors

-- designware implementation of the
-- shift register
library DWARE, DW03;
use DWARE.DWpackages.all;
use DW03.DW03_components.all;

entity data_gen is
  port (
    clk        : in  std_logic;         -- clock
    rst_n      : in  std_logic;         -- active low async reset
    ---------------------------------------------------------------------------
    -- data coming from the multipliers (result)
    data_in_v  : in  std_logic_vector(63 downto 0);
    -- generated data (external or pseudo-random)
    data_out_v : out std_logic_vector(63 downto 0);
    ---------------------------------------------------------------------------
    s_in       : in  std_logic;         -- serial input
    s_out      : out std_logic;         -- serial output
    load_n     : in  std_logic;  -- when low registers are loaded in parallel
    shift_n    : in  std_logic;         -- when low data is shifted
    ---------------------------------------------------------------------------
    -- select the source for data_out as well as for the data saved in registers
    sel_reg    : in  std_logic
    );

end data_gen;

architecture arch of data_gen is
  -- DesignWare shift register
  component DW03_shfreg
    generic (
      inst_length : integer);
    port (
      inst_clk     : in  std_logic;
      inst_s_in    : in  std_logic;
      inst_p_in    : in  std_logic_vector(inst_length-1 downto 0);
      inst_shift_n : in  std_logic;
      inst_load_n  : in  std_logic;
      p_out_inst   : out std_logic_vector(inst_length-1 downto 0));
  end component;

  -- local signals
  signal sum_v            : std_logic_vector(63 downto 0);  -- result of the adder
  signal p_out_v          : std_logic_vector(63 downto 0);  -- output of the register bank
  signal p_in_v           : std_logic_vector(63 downto 0);  -- input of the register bank
  signal rand_data_v      : std_logic_vector(63 downto 0);  -- output of the pseudo
                                                            -- random generator
  signal next_rand_data_v : std_logic_vector(63 downto 0);  -- next of rand_data

  -- local constants
  constant inst_length : natural := 64;  -- size of the shift register bank

begin  -- arch

  -- recursive cyclic adder
  adder :
    sum_v <= data_in_v + p_out_v;

  -- link the highest bit of p_out_v to s_out
  s_out <= p_out_v(63);


  -----------------------------------------------------------------------------
  -- instance of the shift register
  -- based on a DW model
  shift_register : DW03_shftreg
    generic map (length => inst_length)
    port map (clk     => clk, s_in => s_in, p_in => p_in_v,
              shift_n => shift_n, load_n => load_n,
              p_out   => p_out_v);


  -----------------------------------------------------------------------------
  -- purpose: instantiation of mux1 and mux2
  -- type   : combinational
  -- inputs : sel_reg, sum_v, rand_data_v, p_out_v
  -- outputs: p_in_v, data_out_v
  muxes : process (sel_reg, sum_v, rand_data_v, p_out_v)
  begin  -- process
    case sel_reg is
      when '0' =>
        p_in_v     <= sum_v;
        data_out_v <= rand_data_v;
      when '1' =>
        p_in_v     <= rand_data_v;
        data_out_v <= p_out_v;
      when others => null;
    end case;
  end process;


  -----------------------------------------------------------------------------
  -- pseudo-random code generator
  -- the next bit is based on the
  -- taps 63, 61, 60, 0
  -- the state to avoid is 1...1
  pseudo_rand_logic :
    next_rand_data_v <= ((rand_data_v(63) xnor rand_data_v(61)) xnor
                         (rand_data_v(60) xnor rand_data_v(0))) &
                        rand_data_v(63 downto 1);

  -----------------------------------------------------------------------------
  -- purpose: insert shift register bank used for the pseudo code generation
  -- type   : sequential
  -- inputs : clk, rst_n, next_rand_data_v
  -- outputs: rand_data_v
  pseudo_rand_regs : process (clk, rst_n)
  begin  -- process
    if rst_n = '0' then                 -- asynchronous reset (active low)
      rand_data_v <= (others => '0');
    elsif clk'event and clk = '1' then  -- rising clock edge
      rand_data_v <= next_rand_data_v;
    end if;
  end process;


end arch;
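For testbench purposes, the pseudo-random generator of data_gen.vhd can be mirrored in software. The sketch below is a bit-level Python model of the `pseudo_rand_logic` and `pseudo_rand_regs` statements as they appear in the listing (taps 63, 61, 60 and 0, feedback shifted in at bit 63, reset to all zeros). The function names are hypothetical, and nothing is claimed here about the sequence length.

```python
MASK64 = (1 << 64) - 1


def lfsr_next(state):
    """One clock of the 64-bit generator: the new bit 63 is
    (r63 xnor r61) xnor (r60 xnor r0); the rest is state(63 downto 1)."""
    def bit(i):
        return (state >> i) & 1

    # Nesting two XNORs of XNORs reduces to the inverted XOR of the four taps.
    feedback = 1 ^ (bit(63) ^ bit(61) ^ bit(60) ^ bit(0))
    return ((feedback << 63) | (state >> 1)) & MASK64


def lfsr_sequence(count, state=0):
    """Return `count` successive states, starting from the reset value 0."""
    states = []
    for _ in range(count):
        state = lfsr_next(state)
        states.append(state)
    return states


if __name__ == "__main__":
    for value in lfsr_sequence(4):
        print(f"{value:016x}")
```

The model also reproduces the remark in the listing that the all-ones state must be avoided: with every tap at 1 the feedback is 1, so the register never leaves that state.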
A.3 mult.vhd
-------------------------------------------------------------------------------
-- Title      : Simple multiplier (32 bit)
-- Project    :
-------------------------------------------------------------------------------
-- File       : mult.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-16
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: simple multiplier block with registers
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-16  1.0      schuster  Created
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;


entity mult is
  port (
    clk   : in  std_logic;              -- clock
    rst_n : in  std_logic;              -- active low async reset
    en    : in  std_logic;              -- registers enable
    ---------------------------------------------------------------------------
    a_v   : in  std_logic_vector(31 downto 0);  -- input A
    b_v   : in  std_logic_vector(31 downto 0);  -- input B
    m_v   : out std_logic_vector(63 downto 0)   -- result
    );
end mult;

architecture arch of mult is

  -- generic RCA multiplier declaration
  component RCA
    generic (
      A_width : integer;                -- size of A
      B_width : integer);               -- size of B
    port (
      S : out std_logic_vector(A_width+B_width-1 downto 0);
      A : in  std_logic_vector(A_width-1 downto 0);
      B : in  std_logic_vector(B_width-1 downto 0));
  end component;

  -- local signals
  signal a_int_v : std_logic_vector(31 downto 0);
  signal b_int_v : std_logic_vector(31 downto 0);
  signal m_int_v : std_logic_vector(63 downto 0);

begin  -- arch

  -----------------------------------------------------------------------------
  -- combinatorial part
  -----------------------------------------------------------------------------

  -- infer the 32 bit multiplier
  mult_1 : RCA
    generic map (
      A_width => 32,
      B_width => 32)
    port map (
      S => m_int_v,
      A => a_int_v,
      B => b_int_v);

  -----------------------------------------------------------------------------
  -- sequential part
  -----------------------------------------------------------------------------

  -- purpose: input and output registers
  -- type   : sequential
  -- inputs : clk, rst_n, a_v, b_v, m_int_v
  -- outputs: a_int_v, b_int_v, m_v
  mult1_regs : process (clk, rst_n)
  begin  -- process mult1_regs
    if rst_n = '0' then                 -- asynchronous reset (active low)
      a_int_v <= (others => '0');
      b_int_v <= (others => '0');
      m_v     <= (others => '0');
    elsif clk'event and clk = '1' then  -- rising clock edge
      if en = '1' then
        a_int_v <= a_v;
        b_int_v <= b_v;
        m_v     <= m_int_v;
      end if;
    end if;
  end process mult1_regs;

end arch;
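The register timing of mult.vhd is easy to misread, so a cycle-level reference model can help when checking simulation traces. The Python sketch below mirrors the `mult1_regs` process: on each enabled rising edge the inputs are sampled and the output register takes the product of the previously sampled operands. It assumes that the RCA component, whose source is not part of this listing, computes a plain unsigned product; the class name is hypothetical.

```python
class RegisteredMultiplier:
    """Cycle-level model of the input/output registers around the multiplier."""

    def __init__(self, width=32):
        self.in_mask = (1 << width) - 1
        self.out_mask = (1 << (2 * width)) - 1
        self.reset()

    def reset(self):
        """Asynchronous reset: all registers cleared."""
        self.a_int = 0
        self.b_int = 0
        self.m = 0

    def clock(self, a, b, en=1):
        """Apply one rising clock edge and return the new value of m_v."""
        if en:
            # m_int_v is the combinational product of the registers as they
            # were before this edge (assumed unsigned multiplication).
            product = (self.a_int * self.b_int) & self.out_mask
            self.a_int = a & self.in_mask
            self.b_int = b & self.in_mask
            self.m = product
        return self.m


if __name__ == "__main__":
    dut = RegisteredMultiplier()
    for operands in ((3, 5), (7, 2), (0, 0)):
        print(operands, dut.clock(*operands))
```

In this model a pair of operands applied before one clock edge shows up as a product on `m_v` after the following edge, and the registers hold their values while `en` is low.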
A.4 mult_par4.vhd
-------------------------------------------------------------------------------
-- Title      : 4 times parallel multiplier (32 bit)
-- Project    :
-------------------------------------------------------------------------------
-- File       : mult_par4.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-16
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: 4 times parallel multiplier block with registers
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-16  1.0      schuster  Created
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity mult_par4 is
  port (
    clk   : in  std_logic;                      -- clock
    rst_n : in  std_logic;                      -- active low async
    en    : in  std_logic;                      -- registers enable
    ---------------------------------------------------------------------------
    a_v   : in  std_logic_vector(31 downto 0);  -- input A
    b_v   : in  std_logic_vector(31 downto 0);  -- input B
    m_v   : out std_logic_vector(63 downto 0)   -- result
    );
end mult_par4;


architecture arch of mult_par4 is

  -- generic RCA multiplier declaration
  component RCA
    generic (
      A_width : integer;                -- size of A
      B_width : integer);               -- size of B
    port (
      S : out std_logic_vector(A_width+B_width-1 downto 0);
      A : in  std_logic_vector(A_width-1 downto 0);
      B : in  std_logic_vector(B_width-1 downto 0));
  end component;

  signal count, next_count              : std_logic_vector(1 downto 0);
  signal A0, A1, A2, A3, B0, B1, B2, B3 : std_logic_vector(31 downto 0);
  signal S, S0, S1, S2, S3              : std_logic_vector(63 downto 0);

begin

  -----------------------------------------------------------------------------
  --combinatorial part
  -----------------------------------------------------------------------------

  -- output multiplexer
  with count select
    S <= S0 when "00",
         S1 when "01",
         S2 when "11",
         S3 when "10",
         S3 when others;

  --multiplexers counter incrementer 00 -> 01 -> 11 -> 10 ->
  next_count <= "01" when count = "00" else
                "11" when count = "01" else
                "10" when count = "11" else
                "00";

  --implementation of the four 32 bit multipliers
  mult_par4_0 : RCA
    generic map (
      A_width => 32,
      B_width => 32)
    port map (
      S => S0,
      A => A0,
      B => B0);
  mult_par4_1 : RCA
    generic map (
      A_width => 32,
      B_width => 32)
    port map (
      S => S1,
      A => A1,
      B => B1);
  mult_par4_2 : RCA
    generic map (
      A_width => 32,
      B_width => 32)
    port map (
      S => S2,
      A => A2,
      B => B2);
  mult_par4_3 : RCA
    generic map (
      A_width => 32,
      B_width => 32)
    port map (
      S => S3,
      A => A3,
      B => B3);

  -----------------------------------------------------------------------------
  --sequential part
  -----------------------------------------------------------------------------
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      m_v   <= (others => '0');
      count <= "00";
      A0    <= (others => '0');
      A1    <= (others => '0');
      A2    <= (others => '0');
      A3    <= (others => '0');
      B0    <= (others => '0');
      B1    <= (others => '0');
      B2    <= (others => '0');
      B3    <= (others => '0');
    elsif clk = '1' and clk'event then
      if en = '1' then
        -- output registers
        m_v   <= S;
        -- increment state machine counter
        count <= next_count;
        -- input reg and demultiplexer for A
        if count = "00" then A0 <= a_v; end if;
        if count = "01" then A1 <= a_v; end if;
        if count = "11" then A2 <= a_v; end if;
        if count = "10" then A3 <= a_v; end if;
        -- input reg and demultiplexer for B
        if count = "00" then B0 <= b_v; end if;
        if count = "01" then B1 <= b_v; end if;
        if count = "11" then B2 <= b_v; end if;
        if count = "10" then B3 <= b_v; end if;
      end if;
    end if;
  end process;

end arch;
A.5 RCA_generic_arch.vhd
-------------------------------------------------------------------------------
-- Title      : Generic RCA Multiplier
-- Project    :
-------------------------------------------------------------------------------
-- File       : RCA_generic_arch.vhd
-- Author     : <mtschuster@WS-3439>
-- Company    :
-- Created    : 2006-04-27
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: Simple Ripple Carry Array multiplier implementation
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author      Description
-- 2006-04-27  1.0      mtschuster  Created
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RCA is
  generic (
    A_width : integer := 32;            --size of A
    B_width : integer := 32);           --size of B
  port (
    S : out std_logic_vector(A_width+B_width-1 downto 0);  --output
    A : in  std_logic_vector(A_width-1 downto 0);          --input A
    B : in  std_logic_vector(B_width-1 downto 0));         --input B
end RCA;

architecture arch of RCA is

  type std_logic_array is               -- array of internal nodes
    array (B_width-1 downto 1) of std_logic_vector(A_width-1 downto 0);
  type enlarged_std_logic_array is      -- extended array of internal nodes
    array (B_width-1 downto 0) of std_logic_vector(A_width downto 0);

  -- local signals
  signal AandB     : std_logic_array;                       --partial products
  signal S_partial : enlarged_std_logic_array;              --internal sums
  signal Init_val  : std_logic_vector(A_width-1 downto 0);  --first line values

begin

  -----------------------------------------------------------------------------
  --combinatorial part
  -----------------------------------------------------------------------------

  first_cell :                          --implement first line
  Init_val     <= A when B(0) = '1' else (others => '0');
  S_partial(0) <= '0' & Init_val;
  S(0)         <= S_partial(0)(0);

  int_cell :                            --implement internal lines
  for i in 1 to B_width-1 generate
    S(i)         <= S_partial(i)(0);
    AandB(i)     <= A when B(i) = '1' else (others => '0');
    S_partial(i) <= ('0' & S_partial(i-1)(A_width downto 1)) + ('0' & AandB(i));
  end generate;

  last_cell :                           --copy the result to the output
  S(A_width+B_width-1 downto B_width) <= S_partial(B_width-1)(A_width downto 1);

  -----------------------------------------------------------------------------
  --sequential part
  -----------------------------------------------------------------------------
end arch;
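The generic multiplier above can be checked in isolation before it is instantiated at 32 bits. The following is only a minimal sketch of such a check and not part of the original design files: the entity name RCA_tb, the constant W and the 10 ns settling time are hypothetical, and the RCA entity is assumed to be compiled into library work. It instantiates the RCA at a reduced width and compares every product against an integer reference.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking testbench for the generic RCA multiplier.
entity RCA_tb is
end RCA_tb;

architecture sim of RCA_tb is
  constant W : integer := 4;            -- assumed small width, exhaustive test
  signal A, B : std_logic_vector(W-1 downto 0) := (others => '0');
  signal S    : std_logic_vector(2*W-1 downto 0);
begin

  -- direct instantiation of the multiplier listed above (assumed in work)
  DUT : entity work.RCA
    generic map (
      A_width => W,
      B_width => W)
    port map (
      S => S,
      A => A,
      B => B);

  stim : process
  begin
    for i in 0 to 2**W-1 loop
      for j in 0 to 2**W-1 loop
        A <= std_logic_vector(to_unsigned(i, W));
        B <= std_logic_vector(to_unsigned(j, W));
        wait for 10 ns;                 -- let the combinatorial array settle
        assert unsigned(S) = to_unsigned(i*j, 2*W)
          report "RCA mismatch" severity error;
      end loop;
    end loop;
    report "RCA exhaustive check done";
    wait;                               -- stop the stimulus process
  end process stim;

end sim;
```

A small width keeps the exhaustive sweep at 256 operand pairs. Because the array structure is generated from A_width and B_width, the same loop construct is exercised as in the 32 bit instances, although this does not replace a check at full width.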
A.6 ring_svt.vhd
-------------------------------------------------------------------------------
-- Title      : Ring oscillator SVT
-- Project    :
-------------------------------------------------------------------------------
-- File       : ring_svt.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-04-27
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: A simple ring oscillator with direct instantiation of the
--              STM 090 SVT technology inverters. Inverter type is IVSVTX1
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-04-27  1.0      schuster  Created
-------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;

entity ring_svt is
  generic (
    length : integer := 100);           -- default inverter chain length
  port (
    Z : out std_logic                   -- ring oscillator output
    );
end ring_svt;

architecture arch of ring_svt is

  -- local signal
  signal internal_nets : std_logic_vector(length-1 downto 0);

  --declare the technology inverter
  component IVSVTX1
    port (
      Z : out STD_LOGIC;                --in
      A : in  STD_LOGIC                 --out
      );
  end component;

begin  -- arch

  --connect each inverter with the following one
  invs : for i in 0 to length-2 generate
    IVSVTX1_gen : IVSVTX1
      port map (
        Z => internal_nets(i),
        A => internal_nets(i+1));
  end generate invs;

  --connect last inverter with the first
  IVSVTX1_last : IVSVTX1
    port map (
      Z => internal_nets(length-1),
      A => internal_nets(0));

  --output the first node
  Z <= internal_nets(0);

end arch;
A.7 top_tb.vhd
-------------------------------------------------------------------------------
-- Title      : Testbench for design "top"
-- Project    :
-------------------------------------------------------------------------------
-- File       : top_tb.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-17
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: Testbench for design "top"
-------------------------------------------------------------------------------
-- Copyright (c) 2006
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-17  1.0      schuster  Created
-------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_textio.all;  -- write std_logic signal to line
use ieee.std_logic_unsigned.all;
library std;
use std.textio.all;             -- output data to std output or text file

-------------------------------------------------------------------------------

entity top_tb is

end top_tb;

-------------------------------------------------------------------------------

architecture top_func_test of top_tb is

  --component declaration
  component top
    port (
      clk     : in  std_logic;          --clock
      rst_n   : in  std_logic;          --active low async reset
      -------------------------------------------------------------------------
      s_in    : in  std_logic;          --serial data in
      s_out   : out std_logic;          --serial data out
      load_n  : in  std_logic;          --registers parallel load when low
      shift_n : in  std_logic;          --shift data when low
      -------------------------------------------------------------------------
      -- select the source for data_out as well as for data saved in registers
      sel_reg : in  std_logic;
      sel     : in  std_logic_vector(1 downto 0);  --select multiplier
      Z_svt   : out std_logic;          --SVT ring oscillator
      Z_lvt   : out std_logic);         --LVT ring oscillator
  end component;

  -- component ports
  signal rst_n   : std_logic;           --active low async reset
  signal s_in    : std_logic;           --serial data in
  signal s_out   : std_logic;           --serial data out
  signal load_n  : std_logic;           --registers parallel load when low
  signal shift_n : std_logic;           --shift data when low
  signal sel_reg : std_logic;           --registers select
  signal sel     : std_logic_vector(1 downto 0);  --select multiplier
  signal Z_svt   : std_logic;           --SVT ring oscillator
  signal Z_lvt   : std_logic;           --LVT ring oscillator
  -- clock
  signal clk     : std_logic := '1';    --clock

  -- constants
  constant HALFCLOCKPERIOD : time := 8 ns;

begin  -- top_func_test

  -- component instantiation
  DUT : top
    port map (
      clk     => clk,
      rst_n   => rst_n,
      s_in    => s_in,
      s_out   => s_out,
      load_n  => load_n,
      shift_n => shift_n,
      sel_reg => sel_reg,
      sel     => sel,
      Z_svt   => Z_svt,
      Z_lvt   => Z_lvt);

  -- clock generation
  clk <= not clk after HALFCLOCKPERIOD;

  --main testbench processor
  main : process
    ---------------------------------------------------------------------------
    variable match_v : boolean;         -- check output data
    variable pass_v  : boolean;         -- detect if an error occurred

    variable global_pass_v : boolean := true;  -- pass flag for all tests

    variable d_l : line;                -- output line for debugging purpose

    variable DEBUGMODE : boolean := false;  -- activate/deactivate verbose mode

    ---------------------------------------------------------------------------

    -- procedures
    ---------------------------------------------------------------------------
    --check if specified test pass
    ---------------------------------------------------------------------------
    procedure check_errors (test_name : in string) is
    begin
      if pass_v then
        report test_name & " ALL PASS ";
      else
        report test_name & " FAILED ";
      end if;
      global_pass_v := global_pass_v and pass_v;
    end procedure check_errors;
    ---------------------------------------------------------------------------
    --check if all tests pass
    ---------------------------------------------------------------------------
    procedure global_check is
    begin
      if global_pass_v then
        report " ALL TESTS PASS -> OK! ";
      else
        report " ONE OR MORE TEST FAILED ";
      end if;
    end procedure global_check;
    ---------------------------------------------------------------------------
    --write and read specific patterns to/from the shift register
    ---------------------------------------------------------------------------
    procedure serial_read_write is
      --four different pattern to test
      constant SERIAL_DATA_IN0 : std_logic_vector(63 downto 0) := (others => '0');
      constant SERIAL_DATA_IN1 : std_logic_vector(63 downto 0) := (others => '1');
      constant SERIAL_DATA_IN2 : std_logic_vector(63 downto 0) :=
        X"5555555555555555";  --"0101010101010...1010101010101010101";
      constant SERIAL_DATA_IN3 : std_logic_vector(63 downto 0) :=
        X"AAAAAAAAAAAAAAAA";  --"1010101010101...0101010101010101010";

      variable serial_data_out : std_logic_vector(63 downto 0);  -- read data
                                                                 -- from the registers
    begin
      -- init
      rst_n   <= '0';
      load_n  <= '1';
      shift_n <= '1';
      s_in    <= '0';
      sel_reg <= '1';
      sel     <= "00";

      pass_v := true;
      wait for 9*HALFCLOCKPERIOD;
      wait until clk = '0';

      -- write first pattern
      shift_n <= '0';
      for i in 63 downto 0 loop
        s_in <= SERIAL_DATA_IN0(i);
        wait until clk = '0';
      end loop;  -- i

      --write second pattern and read first pattern
      for i in 63 downto 0 loop
        s_in               <= SERIAL_DATA_IN1(i);
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      --check first pattern
      match_v := serial_data_out = SERIAL_DATA_IN0;
      pass_v  := pass_v and match_v;
      assert serial_data_out = SERIAL_DATA_IN0
        report "Error on SERIAL DATA IN0" severity error;
      --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
      if DEBUGMODE then
        report "SERIAL DATA IN0 read !" severity note;
      end if;

      --write third pattern and read second pattern
      for i in 63 downto 0 loop
        s_in               <= SERIAL_DATA_IN2(i);
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      --check second pattern
      match_v := serial_data_out = SERIAL_DATA_IN1;
      pass_v  := pass_v and match_v;
      assert serial_data_out = SERIAL_DATA_IN1
        report "Error on SERIAL DATA IN1" severity error;
      if DEBUGMODE then
        report "SERIAL DATA IN1 read !" severity note;
      end if;

      --write fourth pattern and read the third pattern
      for i in 63 downto 0 loop
        s_in               <= SERIAL_DATA_IN3(i);
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      --check the third pattern
      match_v := serial_data_out = SERIAL_DATA_IN2;
      pass_v  := pass_v and match_v;
      assert serial_data_out = SERIAL_DATA_IN2
        report "Error on SERIAL DATA IN2" severity error;
      if DEBUGMODE then
        report "SERIAL DATA IN2 read !" severity note;
      end if;

      --read the fourth pattern
      for i in 63 downto 0 loop
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      --check the fourth pattern
      match_v := serial_data_out = SERIAL_DATA_IN3;
      pass_v  := pass_v and match_v;
      assert serial_data_out = SERIAL_DATA_IN3
        report "Error on SERIAL DATA IN3" severity error;
      if DEBUGMODE then
        report "SERIAL DATA IN3 read !" severity note;
      end if;
    end serial_read_write;
    ---------------------------------------------------------------------------
    -- check that the shift register can be reset from the pseudo random generator
    ---------------------------------------------------------------------------
    procedure check_shift_reg_rst is
      variable serial_data_out : std_logic_vector(63 downto 0);
    begin  -- check_shift_reg_rst
      -- init
      rst_n   <= '0';
      load_n  <= '1';
      shift_n <= '1';
      s_in    <= '1';
      sel_reg <= '1';                   -- from pseudo random generator
      sel     <= "00";

      pass_v := true;
      wait for 9*HALFCLOCKPERIOD;
      wait until clk = '0';

      -- load parallel zeros to shift registers
      load_n <= '0';
      wait until clk = '1';
      wait until clk = '0';

      -- output shift registers data serially
      for i in 63 downto 0 loop
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      -- check that data is zeroed
      match_v := serial_data_out = X"0000000000000000";
      pass_v  := pass_v and match_v;
      assert match_v report "Error on shift register reset" severity error;
      --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);

    end check_shift_reg_rst;
    ---------------------------------------------------------------------------
    -- purpose: read the first 100 and 200 pseudo random generated data
    ---------------------------------------------------------------------------
    procedure read_rand is
      variable serial_data_out : std_logic_vector(63 downto 0);
    begin  -- read_rand

      -- init
      rst_n   <= '0';
      load_n  <= '1';
      shift_n <= '1';
      s_in    <= '1';
      sel_reg <= '1';                   -- from pseudo random generator
      sel     <= "00";

      pass_v := true;
      wait for 9*HALFCLOCKPERIOD;
      wait until clk = '0';


      sel_reg <= '1';                   -- pseudo random data to shift reg
      load_n  <= '0';                   -- ready to load the zeroed vector
      wait until clk = '0';

      rst_n <= '1';                     -- clear the reset
      wait until clk = '0';
      wait for 200*HALFCLOCKPERIOD;

      load_n  <= '1';                   -- switch to serial mode to extract data
      shift_n <= '0';
      -- output shift registers data serially
      for i in 63 downto 0 loop
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      -- check data
      match_v := serial_data_out = X"2F8D072F8D0BD0BD";
      pass_v  := pass_v and match_v;
      assert match_v report "Error on read random data" severity error;
      --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);

      load_n <= '0';                    -- reselect parallel input to shift regs

      wait for 72*HALFCLOCKPERIOD;

      load_n  <= '1';                   -- switch to serial mode to extract data
      shift_n <= '0';
      -- output shift registers data serially
      for i in 63 downto 0 loop
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      -- check data
      match_v := serial_data_out = X"7DF14972F14972F1";
      pass_v  := pass_v and match_v;
      assert match_v report "Error on read random data" severity error;
      --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
    end read_rand;

    ---------------------------------------------------------------------------
    -- purpose: reset, read external data, multiply, output result
    ---------------------------------------------------------------------------
    procedure mult_ext_data (
      -- the number corresponding to the tested multiplier
      constant mult_number : in integer) is

      variable serial_data_out : std_logic_vector(63 downto 0);
      -- input multiplier data
      constant DATA_IN : std_logic_vector(63 downto 0) := X"AC0E5F8EAC0E5F8E";
      -- expected result = DATA_IN+DATA_IN(63 downto 32)*DATA_IN(31 downto 0)
      constant EXPECTEDDATAOUT : std_logic_vector(63 downto 0) :=
        DATA_IN+(DATA_IN(63 downto 32)*DATA_IN(31 downto 0));

    begin  -- mult_ext_data

      -- init
      rst_n   <= '0';
      load_n  <= '1';
      shift_n <= '1';
      s_in    <= '1';
      sel_reg <= '1';                   -- from regs to mults
      case mult_number is
        when 0      => sel <= "00";
        when 1      => sel <= "01";
        when 2      => sel <= "10";
        when 3      => sel <= "11";
        when others => sel <= "XX";
      end case;

      pass_v := true;
      wait for 9*HALFCLOCKPERIOD;
      wait until clk = '0';

      sel_reg <= '1';                   -- pseudo random data to shift reg
      load_n  <= '0';                   -- ready to load the zeroed vector
      wait until clk = '0';

      rst_n <= '1';                     -- clear the reset
      wait until clk = '0';

      shift_n <= '0';                   -- enter serial data
      load_n  <= '1';
      for i in 63 downto 0 loop
        s_in <= DATA_IN(i);
        wait until clk = '0';
      end loop;  -- i
      shift_n <= '1';
      load_n  <= '1';

      wait until clk = '0';
      sel_reg <= '0';
      wait until clk = '0';

      -- delay if parallel 4 implementation is used
      if mult_number = 1 or mult_number = 3 then
        wait for 6*HALFCLOCKPERIOD;
      end if;

      load_n <= '0';
      wait until clk = '0';


      load_n  <= '1';                   -- switch to serial mode to extract data
      shift_n <= '0';
      -- output shift registers data serially
      for i in 63 downto 0 loop
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      -- check data
      match_v := serial_data_out = EXPECTEDDATAOUT;
      pass_v  := pass_v and match_v;
      assert match_v report "Error on multiply external data" severity error;
      --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);

    end mult_ext_data;
    ---------------------------------------------------------------------------
    -- purpose: multiply and add pseudo random generated data
    ---------------------------------------------------------------------------
    procedure random_mac (
      -- the number corresponding to the tested multiplier
      constant mult_number : in integer) is

      variable serial_data_out : std_logic_vector(63 downto 0);

    begin  -- random_mac

      -- init
      rst_n   <= '0';
      load_n  <= '0';                   -- store incoming data
      shift_n <= '1';
      s_in    <= '0';
      sel_reg <= '1';                   -- from rand to regs
      case mult_number is
        when 0      => sel <= "00";
        when 1      => sel <= "01";
        when 2      => sel <= "10";
        when 3      => sel <= "11";
        when others => sel <= "XX";
      end case;

      pass_v := true;
      wait for 9*HALFCLOCKPERIOD;
      wait until clk = '0';

      sel_reg <= '0';                   -- pseudo random data to mult
      load_n  <= '0';                   -- ready to load sum to regs
      rst_n   <= '1';                   -- clear the reset
      wait until clk = '0';

      wait for 4*HALFCLOCKPERIOD;
      -- delay if parallel 4 implementation is used
      if mult_number = 1 or mult_number = 3 then
        wait for 6*HALFCLOCKPERIOD;
      end if;
      wait for 1000*HALFCLOCKPERIOD;


      load_n  <= '1';                   -- switch to serial mode to extract data
      shift_n <= '0';
      -- output shift registers data serially
      for i in 63 downto 0 loop
        serial_data_out(i) := s_out;
        wait until clk = '0';
      end loop;  -- i

      -- check data
      match_v := serial_data_out = X"14C9836842DEF744";
      pass_v  := pass_v and match_v;
      assert match_v report "Error on multiply random data" severity error;
      --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);

    end random_mac;
    ---------------------------------------------------------------------------
    --test sequence
    ---------------------------------------------------------------------------
  begin
    check_shift_reg_rst;
    check_errors("Shift Registers Reset : ");
    serial_read_write;
    check_errors("Serial Read/Write : ");
    read_rand;
    check_errors("Read Rand : ");
    mult_ext_data(0);
    check_errors("Multiply external data on mult0 : ");
    mult_ext_data(1);
    check_errors("Multiply external data on mult1 : ");
    mult_ext_data(2);
    check_errors("Multiply external data on mult2 : ");
    mult_ext_data(3);
    check_errors("Multiply external data on mult3 : ");
    random_mac(0);
    check_errors("Add random multiplied data for mult0 : ");
    random_mac(1);
    check_errors("Add random multiplied data for mult1 : ");
    random_mac(2);
    check_errors("Add random multiplied data for mult2 : ");
    random_mac(3);
    check_errors("Add random multiplied data for mult3 : ");

    global_check;
    wait;
  end process;

end top_func_test;

-------------------------------------------------------------------------------

configuration top_tb_top_func_test_cfg of top_tb is
  for top_func_test
  end for;
end top_tb_top_func_test_cfg;

-------------------------------------------------------------------------------
Appendix B
Synopsys compilation scripts
B.1 compile_top.tcl
#######################
## Global variables ##
#######################
set BIN ./bin/
set DB ./db/
set PAR ./par/src/
set GATE ./gate/
set WORK ./work/
set SRC ./vhdl/
set reports_path ./reports/

set design_name top_stm090

#ignore case to avoid problem on activity annotation
set find_ignore_case true
set suppress_errors "VHDL-2285 OPT-150 TIM-111 TIM-112"
#remove the limit for high fanout nets
set high_fanout_net_threshold 0

#define the working path
define_design_lib work -path $WORK

########################
## Bus name variables ##
########################
set bus_naming_style %s(%d)

##############################
## Remove previous designs ##
##############################
#remove_constraint -all
#remove_design -all

#######################
## Read design ##
#######################
source read_vhdl.tcl

#link design with library (i.e. load required libraries)
link

#Uniquify the multiplier blocks
set uniquify_naming_style %s_%d

#rename the 4 main multipliers
uniquify -cell {mult_0 mult_1 mult_2 mult_3} -base_name multd

#rename the remaining design
uniquify

###############################
## Create clock MHz 62.5MHz ##
###############################

#main clock
create_clock clk -period 16

#generated clock
create_generated_clock -source clk -name dclk0 -divide_by 1 mult_0/clk
create_generated_clock -source clk -name dclk1 -divide_by 1 mult_1/clk
create_generated_clock -source clk -name dclk2 -divide_by 1 mult_2/clk
create_generated_clock -source clk -name dclk3 -divide_by 1 mult_3/clk

#allow maximum delay at input and output
set_input_delay 0 -clock clk [all_inputs]
set_output_delay 0 -clock clk [all_outputs]

#next lines are there to avoid TIM-111 warning
set_input_delay 0 -clock dclk0 mult_0/clk
set_input_delay 0 -clock dclk1 mult_1/clk
set_input_delay 0 -clock dclk2 mult_2/clk
set_input_delay 0 -clock dclk3 mult_3/clk

#set_propagated_clock automatically sets the correct
#set_clock_latency value for the generated clocks
set_propagated_clock [all_clocks]

#report_clock -skew

#######################
## Set Loads ##
#######################
set_drive [drive_of CORE90GPSVT_NomLeak.db:CORE90GPSVT/IVSVTX1/Z] [all_inputs]
#2.004893
set_load [load_of CORE90GPSVT_NomLeak.db:CORE90GPSVT/IVSVTX1/A] [all_outputs]
#0.002090

#put a high load on s_out because it
#will drive an analog pad
set_load 10 s_out

#######################
## Design Constraints##
#######################
set_max_area 0

#######################
## Compile Design    ##
#######################

# Write unmapped top
current_design top
write -hierarchy -output ${DB}unmapped_top.db

# characterize the 4 multipliers
set target_library CORE90GPSVT_NomLeak.db
characterize -constraint {mult_0 mult_1 mult_2 mult_3}

# compile mult 0 and mult 1 with SVT
set target_library CORE90GPSVT_NomLeak.db
current_design multd_0
compile
current_design multd_1
compile

# compile mult 2 and mult 3 with LVT
set target_library CORE90GPLVT_NomLeak.db
current_design multd_2
compile
current_design multd_3
compile

#set dont_touch on the compiled multipliers
current_design top
set_dont_touch {mult_0 mult_1 mult_2 mult_3 ring_svt_1 ring_lvt_1}

# the rest of the design will be compiled with
# the SVT technology
set target_library CORE90GPSVT_NomLeak.db
compile -map_effort high

#show which DW implementation has been selected
report_resources -hier > ${reports_path}${design_name}.syn_rprh

#remove unconnected ports in DW designs
set_dont_touch {mult_0 mult_1 mult_2 mult_3} false
remove_unconnected_ports [get_cells -hier *]
remove_unconnected_ports -blast_buses [get_cells -hier *]
set_dont_touch {mult_0 mult_1 mult_2 mult_3} true

#########################
## Fix hold violations ##
#########################
set_fix_hold [all_clocks]

#recompile top only
compile -inc

#######################
##Write Mapped design##
#######################
change_names -rules vhdl -hierarchy
write -hierarchy -format vhdl -output ${GATE}top.vhd
write_sdf ${GATE}top.sdf

#######################
## Annotate Activity ##
#######################
#sh cp msim/modelsim.ini ./modelsim.ini
sh cp ${SRC}top_tb.vhd ${GATE}top_tb.vhd
sh vsim -c -do ${BIN}power_sdf.do
read_saif -unit ns -scale 1 -instance top_tb/dut -input ${GATE}back.saif

###############################################
## Save reports in the defined directory     ##
###############################################
report_area > ${reports_path}${design_name}.syn_rpa
check_design > ${reports_path}${design_name}.syn_rpd
report_timing > ${reports_path}${design_name}.syn_rpt
report_hierarchy > ${reports_path}${design_name}.syn_rph
report_resources > ${reports_path}${design_name}.syn_rpr
report_cell > ${reports_path}${design_name}.syn_rpc
report_power -net -cell -flat -include_input_nets > ${reports_path}${design_name}.syn_rpp
report_saif -flat > ${reports_path}${design_name}.syn_rps
report_constraint > ${reports_path}${design_name}.syn_rpn
report_reference -nosplit > ${reports_path}${design_name}.syn_rpf
report_clock -skew > ${reports_path}${design_name}.syn_rpk
report_power -include_input_nets -hier -hier_level 1 > ${reports_path}${design_name}.syn_rpph

#######################
## Save design       ##
#######################
change_names -rules verilog -hierarchy
set bus_naming_style %s\[%d\]
write -hierarchy -output ${DB}top_gate.db
write -hierarchy -format verilog -output ${PAR}top_gate.v
write_sdf ${PAR}top_gate.sdf
write_sdc ${PAR}top_gate.sdc

quit
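
The clock section of the synthesis script repeats the same two constraints for each of the four multipliers. As a minimal sketch, assuming plain Tcl outside Design Compiler, the repeated lines could be generated in a loop. The code only builds and prints the command strings, so it runs in a bare `tclsh`; the variable names `period_ns`, `freq_mhz` and `cmds` are illustrative and do not appear in the original script. It also checks that a 16 ns period corresponds to the 62.5 MHz stated in the banner.

```tcl
# Sketch only: builds the constraint commands as strings and prints them.
# Nothing here is executed by Design Compiler.

# Main clock period in ns, as given to create_clock in the listing.
set period_ns 16

# Frequency in MHz = 1000 / period in ns.
set freq_mhz [expr {1000.0 / $period_ns}]
puts "# main clock frequency: $freq_mhz MHz"

# One generated clock and one input-delay constraint per multiplier.
set cmds {}
for {set i 0} {$i < 4} {incr i} {
    lappend cmds "create_generated_clock -source clk -name dclk$i -divide_by 1 mult_$i/clk"
    lappend cmds "set_input_delay 0 -clock dclk$i mult_$i/clk"
}

foreach c $cmds {
    puts $c
}
```

Inside a Design Compiler session the same loop could call the commands directly instead of printing them, at the cost of making the constraint list less explicit to read.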
B.2 read_vhdl.tcl
analyze -f vhdl vhdl/ring_svt.vhd
analyze -f vhdl vhdl/ring_lvt.vhd
analyze -f vhdl vhdl/RCA_generic_arch.vhd
analyze -f vhdl vhdl/mult.vhd
analyze -f vhdl vhdl/mult_par4.vhd
analyze -f vhdl vhdl/data_gen.vhd
analyze -f vhdl vhdl/top.vhd
elaborate top -update
B.3 power_sdf.do
############################################################
# Script to compute the switching activity with ModelSim   #
# Schuster Christian, June 2003, IMT Neuchatel             #
############################################################
#execute this script with vsim -c -do power_sdf.do
#needed files are: top.sdf, top.vhd, top_tb.vhd

# Testbench path and file names
set work_dir /scratch/schuster/stm090_svt_lvt
set testbench top_tb
set bench_path $testbench/dut
set dir gate
set sdffname $dir/top.sdf

# Time and simulation settings
set time_scale ps
set back_saif_base_time 1E-12

set init_time 19592000
#19592ns
set evaluation_time 36704000
#36704ns

#compile design+testbench
vcom -93 $dir/top.vhd -work $work_dir
vcom -93 $dir/top_tb.vhd -work $work_dir

# Use the same path separator as Synopsys SAIF file
set PathSeparator /
set DatasetSeparator :

vsim +notimingchecks -sdftyp $bench_path=$sdffname -foreign "dpfli_init /synopsys/v2004.06/auxx/syn/power/dpfli/lib-sparcOS5/dpfli.so" -lib $work_dir -t $time_scale $testbench
#+notimingchecks is used to avoid unreal problems on sdf annotation and verilog model


#initialize ring oscillators
force top_tb/dut/ring_svt_1/z_port 0 0 -c 16
force top_tb/dut/ring_lvt_1/z_port 0 0 -c 16

# Select toggle region
set_toggle_region $bench_path

# Init the circuit
run $init_time

# Start switching annotation
toggle_start

# Execute testbench
run $evaluation_time

# Stop switching annotation
toggle_stop

# Write back annotation SAIF
toggle_report $dir/back.saif $back_saif_base_time $bench_path

quit
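
The simulation times in this script are given in picoseconds, with the nanosecond values noted in comments. A small sketch in plain Tcl, under the assumption that the 16 ns clock period from the synthesis constraints also applies to the testbench, converts the two time values and expresses the evaluation window as a number of clock cycles. The helper `ps_to_ns` and the variable names are hypothetical and only serve this illustration.

```tcl
# Sketch only: unit conversion for the simulation windows of power_sdf.do.
# The 16 ns clock period is taken from the synthesis script (assumption
# that the testbench uses the same clock).
set clk_period_ns 16
set init_time_ps 19592000
set evaluation_time_ps 36704000

# Convert a time in picoseconds to nanoseconds.
proc ps_to_ns {ps} {
    return [expr {$ps / 1000.0}]
}

set init_ns [ps_to_ns $init_time_ps]
set eval_ns [ps_to_ns $evaluation_time_ps]

# Length of each window in clock cycles.
set init_cycles [expr {$init_ns / $clk_period_ns}]
set eval_cycles [expr {$eval_ns / $clk_period_ns}]

puts "initialisation window: $init_ns ns ($init_cycles cycles)"
puts "evaluation window:     $eval_ns ns ($eval_cycles cycles)"
```

Under these assumptions the switching activity is recorded over 2294 clock cycles, after an initialisation phase of 1224.5 cycles.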
Appendix C
SoC Encounter P&R scripts
C.1 main.tcl
########################################
## Main file for ENCOUNTER SOC4.1     ##
## CSch, July 2006, version 1.1       ##
########################################
# Required Files:
# Scripts:
# - top.conf
# - IO_Filler.tcl
# - create_global_net.tcl
# - pwr.tcl
# - do_power_domains
# - followPin.tcl
# - top.ctstch
# - output_nets.tcl
# - place_output_bufs.tcl
# - fix_drc_errors.tcl
# Data:
# - ioplace.io
# - LEF/IO90GPHVT_BASIC_50A_7M2T_PGC.lef (ALL external layers of COREVDD1V0 pin need the line "CLASS CORE ;" in order to be routed by sroute, diff file present)
# - LEF/IO90GPHVT_3V3_50A_7M2T_PGC.lef (diff file present)
# Src:
# - top_io.v (from cat src/top_gate.v data/io_wrapper.v > src/top_io.v)
# - top_gate.sdc (change get_pins -> get_pins -hierarchy)

Puts "###############################"
Puts "###"
Puts "### Load Design"
Puts "###"
Puts "###############################"

### uniquify the netlist (shell to execute before an encounter session)
### uniquifyNetlist -top
#In my case netlist was already unique!

set CMOS090GP_DIR /designkit/cmos090_50a
set scpts scripts/
set data data/


setRCFactor -cap 1.1

#set the size of the smallest displayed module -> display all
setPreference MinFPModuleSize 1

#load the design + io + corners
loadConfig ${scpts}top.conf

#load footprints used for timing driven analysis
loadfootprint -infile ${CMOS090GP_DIR}/SocEncounter_cmos090gp_2.2/cmos090gp_50a.cfp
setInvFootPrint IVSVTX1
setBufFootPrint BFSVTX1
#setDelayFootPrint DLY1SVTX2

Puts "#########################"
Puts "###"
Puts "### Create Floorplan"
Puts "###"
Puts "#########################"

### define floorplan
#floorPlan -r 1 0.7 40 40 40 40
#Fixed dimensions allow io corners to be aligned with the 0.56um grid
floorPlan -s 800.28 800.28 50.08 50.08 50.08 50.08

### Add IO filler
source ${scpts}IO_Filler.tcl

Puts "###############################"
Puts "###"
Puts "### Create PowerDomains and"
Puts "### Place Block(s)"
Puts "###"
Puts "###############################"

#create the 5 separated power domains
source ${scpts}do_power_domains.tcl

#place ioref_comp instance needed for the 3V3 IOs
placeInstance ioref_comp 732.88 204.04 R180
addHaloToBlock 31.5 20 20 30 -allBlock

#connects all global nets
source ${scpts}create_global_net.tcl

Puts "###############################"
Puts "###"
Puts "### Create power stripes"
Puts "###"
Puts "###############################"

### add power ring + stripes
source ${scpts}pwr.tcl

### std-cell follow pin
source ${scpts}followPin.tcl

# save floor-plan
saveFPlan ./fplan.fp

# check floor-plan
verifyGeometry

saveDesign ./top.fp.enc
#source ./top.fp.enc

Puts "####################"
Puts "###"
Puts "### Place Design..."
Puts "###"
Puts "####################"

#exec mkdir Timing
source ${scpts}/place_output_bufs.tcl

amoebaPlace -timingdriven \
    -doCongOpt \
    -highEffort \
    -ignoreScan \
    -ignoreSpare \
    -QA \
    -slack initvirtual.slk

saveDesign ./top.place.enc
#source ./top.place.enc

checkPlace

buildTimingGraph
timeDesign -preCTS -outDir ./Timing/PLACE.timing

Puts "###################"
Puts "###"
Puts "### Optimization..."
Puts "###"
Puts "###################"

setOptMode -highEffort \
    -fixFanoutLoad \
    -maxDensity 0.8 \
    -reclaimArea \
    -setupTargetSlack 0.0 \
    -holdTargetSlack 0.0

optDesign -preCTS -setup -drv

saveDesign ./top.IPO.enc
# source ./top.IPO.enc

timeDesign -preCTS -outDir ./Timing/IPO.timing

Puts "##############"
Puts "###"
Puts "### Run CTS..."
Puts "###"
Puts "##############"

#clock with different libraries (SVT/LVT) depending on power domains
setCTSMode -fence -MSMV
#specify clock tree file
specifyClockTree -clkfile ${scpts}top.ctstch
#create report directory
createSaveDir top_cts

#do clock tree synthesis
ckSynthesis -rguide top_cts/top_cts.guide -report top_cts/top_cts.ctsrpt
saveClockNets -output top_cts/top_cts.ctsntf
saveNetlist top_cts/top_cts.v
savePlace top_cts/top_cts.place

saveDesign ./top.POST_CTS.enc
#source ./top.POST_CTS.enc

setAnalysisMode -clockTree
buildTimingGraph
timeDesign -postCTS -outDir ./Timing/POST_CTS.timing

Puts "############################"
Puts "###"
Puts "### Optimization post CTS..."
Puts "###"
Puts "############################"

setOptMode -highEffort \
    -fixFanoutLoad \
    -maxDensity 0.8 \
    -reclaimArea \
    -setupTargetSlack 0.0 \
    -holdTargetSlack 0.0

optDesign -postCTS

saveDesign ./top.POST_CTS_IPO.enc
# source ./top.POST_CTS_IPO.enc

timeDesign -postCTS -outDir ./Timing/POST_CTS_IPO.timing

Puts "##################"
Puts "###"
Puts "### Nanoroute...."
Puts "###"
Puts "##################"

# Filler cells between std-cells
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PDCORE
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD0
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD1
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD2
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD3

# connect all new std-cell instances to vdd/gnd
source ${scpts}create_global_net.tcl


########################
## Route clocks first ##
########################

setAttribute -net @clock -weight 5 -avoid_detour true -bottom_preferred_routing_layer 4 -preferred_extra_space 1
selectNet -allDefClock
setNanoRouteMode -quiet routeWithTimingDriven false
setNanoRouteMode -quiet envNumberProcessor 1
setNanoRouteMode -quiet route_selected_net_only true

globalDetailRoute

saveDesign ./top.POST_CLK_ROUTE.enc
#source ./top.POST_CLK_ROUTE.enc

#allow wide routing for s_out Z_lvt Z_svt
convertNetToSNet -nets {s_out Z_lvt Z_svt}
source ${scpts}/output_nets.tcl

####################
## Route All Nets ##
####################

setNanoRouteMode -quiet routeFixPrewire true
setNanoRouteMode -quiet route_selected_net_only false
setNanoRouteMode -quiet routeWithTimingDriven false
setNanoRouteMode -quiet routeTdrEffort 1
setNanoRouteMode -quiet drouteFixAntenna true
setNanoRouteMode -quiet routeWithSiDriven true
setNanoRouteMode -quiet routeSiLengthLimit 200
setNanoRouteMode -quiet routeSiEffort normal

globalDetailRoute

#fix errors found with calibre DRC check
source ${scpts}/fix_drc_errors.tcl

saveDesign ./top.POST_ROUTE.enc
#source ./top.POST_ROUTE.enc

##########################
## Check for violations ##
##########################

clearDrc
verifygeometry -allowDiffCellViols
verifyConnectivity -type regular -error 1000 -warning 50
verifyProcessAntenna

reportLeakagePower

Puts "################################"
Puts "###"
Puts "### Create abstract views: verilog / LEF / DEF / GDS / SDF ..."
Puts "###"
Puts "################################"

#exec mkdir RESULTS

########
### verilog
########
saveNetlist ./RESULTS/top.v

########
### lef
########
lefOut ./RESULTS/top.lef -stripePin -PGpinLayers 6 7

#######
### def
#######
defOut -floorplan -routing ./RESULTS/top.def

#######
### gds
#######
streamOut ./RESULTS/top_with_io.gds \
    -mapFile ${CMOS090GP_DIR}/SocEncounter_cmos090gp_2.2/gds2_cmos90.map \
    -libName DesignLib \
    -structureName top_with_io \
    -stripes 1 \
    -units 2000 \
    -mode ALL

#######
### sdf
#######
setExtractRCMode -detail
extractRC
delayCal -sdf ./RESULTS/top.sdf
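
The filler insertion step in main.tcl issues the same addFiller command once per power domain. The following sketch, in plain Tcl, shows how that repetition could be expressed as a loop over the domain names. It only builds and prints the command strings, so it can be tried in `tclsh` without SoC Encounter; the variable names `filler_cells`, `power_domains` and `filler_cmds` are illustrative and not part of the original script.

```tcl
# Sketch only: generate one addFiller command per power domain as a string.
# Cell and domain names are those used in the main.tcl listing.
set filler_cells {FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1}
set power_domains {PDCORE PD0 PD1 PD2 PD3}

set filler_cmds {}
foreach pd $power_domains {
    lappend filler_cmds "addFiller -cell $filler_cells -prefix FILLER -powerDomain $pd"
}

foreach cmd $filler_cmds {
    puts $cmd
}
```

Keeping the domain list in one variable would also make it harder to forget a domain when the floorplan changes.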
C.2 top.conf
################################################
#                                              #
#          Input configuration file            #
#                                              #
################################################

#set designkit path
set CMOS090GP_DIR /designkit/cmos090_50a

global rda_Input

#set cwd ./work

set rda_Input(import_mode) {-treatUndefinedCellAsBbox 0 -verticalRow 0
-keepEmptyModule 1 }
set rda_Input(ui_netlist) "src/top_io.v"

set rda_Input(ui_netlisttype) {Verilog}
set rda_Input(ui_ilmlist) {}
set rda_Input(ui_settop) {1}
set rda_Input(ui_topcell) {top_with_io}
set rda_Input(ui_celllib) {}
set rda_Input(ui_iolib) {}
set rda_Input(ui_areaiolib) {}
set rda_Input(ui_blklib) {}
set rda_Input(ui_kboxlib) {}
set rda_Input(ui_gds_file) {}
set rda_Input(ui_timelib,min) "
${CMOS090GP_DIR}/CORE90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPLVT_Best.lib
${CMOS090GP_DIR}/CORE90GPSVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPSVT_Best.lib
${CMOS090GP_DIR}/CORX90GPLVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPLVT_Best.lib
${CMOS090GP_DIR}/CORX90GPSVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPSVT_Best.lib
${CMOS090GP_DIR}/CLOCK90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPLVT_Best.lib
${CMOS090GP_DIR}/CLOCK90GPSVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPSVT_Best.lib
${CMOS090GP_DIR}/PR90M7_SNPS-AVT_3.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/PR90M7_Best.lib
${CMOS090GP_DIR}/IO90GPHVT_3V3_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_3V3_50A_7M2T_Best.lib
${CMOS090GP_DIR}/IO90GPHVT_BASIC_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_BASIC_50A_7M2T_Best.lib
${CMOS090GP_DIR}/IO90GPHVT_REF_COMPENSATION_3V3_50A_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_REF_COMPENSATION_3V3_50A_Best.lib"

set rda_Input(ui_timelib,max) "
${CMOS090GP_DIR}/CORE90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPLVT_Worst.lib
${CMOS090GP_DIR}/CORE90GPSVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPSVT_Worst.lib
${CMOS090GP_DIR}/CORX90GPLVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPLVT_Worst.lib
${CMOS090GP_DIR}/CORX90GPSVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPSVT_Worst.lib
${CMOS090GP_DIR}/CLOCK90GPHVT_SNPS-AVT_2.1.a/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPHVT_Worst.lib
${CMOS090GP_DIR}/CLOCK90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPLVT_Worst.lib
${CMOS090GP_DIR}/PR90M7_SNPS-AVT_3.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/PR90M7_Worst.lib
${CMOS090GP_DIR}/IO90GPHVT_3V3_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_3V3_50A_7M2T_Worst.lib
${CMOS090GP_DIR}/IO90GPHVT_BASIC_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_BASIC_50A_7M2T_Worst.lib
${CMOS090GP_DIR}/IO90GPHVT_REF_COMPENSATION_3V3_50A_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_REF_COMPENSATION_3V3_50A_Worst.lib"

set rda_Input(ui_timelib) {}
set rda_Input(ui_smodDef) {}
set rda_Input(ui_smodData) {}
set rda_Input(ui_dpath) {}
set rda_Input(ui_tech_file) {}
set rda_Input(ui_io_file) {data/ioplace.io}
set rda_Input(ui_timingcon_file) {src/top_gate.sdc}
set rda_Input(ui_latency_file) {}
set rda_Input(ui_scheduling_file) {}
set rda_Input(ui_buf_footprint) {}
set rda_Input(ui_delay_footprint) {}
set rda_Input(ui_inv_footprint) {}
set rda_Input(ui_leffile) "
${CMOS090GP_DIR}/SocEncounter_cmos090gp_2.2/cmos090gp_soc.lef
${CMOS090GP_DIR}/CORE90GPLVT_SNPS-AVT_2.1/SIGNOFF/common/LEF/CORE90GPLVT_ANT.lef
${CMOS090GP_DIR}/CORE90GPSVT_SNPS-AVT_2.1/SIGNOFF/common/LEF/CORE90GPSVT_ANT.lef
${CMOS090GP_DIR}/CORX90GPLVT_SNPS-AVT_4.2/SIGNOFF/common/LEF/CORX90GPLVT_ANT.lef
${CMOS090GP_DIR}/CORX90GPSVT_SNPS-AVT_4.2/SIGNOFF/common/LEF/CORX90GPSVT_ANT.lef
${CMOS090GP_DIR}/CLOCK90GPLVT_SNPS-AVT_2.1/SIGNOFF/common/LEF/CLOCK90GPLVT_ANT.lef
${CMOS090GP_DIR}/CLOCK90GPSVT_SNPS-AVT_2.1/SIGNOFF/common/LEF/CLOCK90GPSVT_ANT.lef
${CMOS090GP_DIR}/PR90M7_SNPS-AVT_3.0/SIGNOFF/common/LEF/PR90M7_ANT.lef
data/LEF/IO90GPHVT_3V3_50A_7M2T_PGC.lef
data/LEF/IO90GPHVT_BASIC_50A_7M2T_PGC.lef
${CMOS090GP_DIR}/IO90GPHVT_REF_COMPENSATION_3V3_50A_SNPS-AVT_4.0/SIGNOFF/common/LEF/IO90GPHVT_REF_COMPENSATION_3V3_50A.lef"


set rda_Input(ui_core_cntl) {aspect}
set rda_Input(ui_aspect_ratio) {1.0}
set rda_Input(ui_core_util) {0.7}
set rda_Input(ui_core_height) {}
set rda_Input(ui_core_width) {}
set rda_Input(ui_core_to_left) {}
set rda_Input(ui_core_to_right) {}
set rda_Input(ui_core_to_top) {}
set rda_Input(ui_core_to_bottom) {}
set rda_Input(ui_max_io_height) {0}
set rda_Input(ui_row_height) {3.92}
set rda_Input(ui_isHorTrackHalfPitch) {0}
set rda_Input(ui_isVerTrackHalfPitch) {1}
set rda_Input(ui_ioOri) {R0}
set rda_Input(ui_isOrigCenter) {0}
set rda_Input(ui_exc_net) {}
set rda_Input(ui_delay_limit) {1000}
set rda_Input(ui_net_delay) {1000.0ps}
set rda_Input(ui_net_load) {0.5pf}
set rda_Input(ui_in_tran_delay) {120.0ps}
set rda_Input(ui_captbl_file) {}
set rda_Input(ui_defcap_scale) {1.0}
set rda_Input(ui_detcap_scale) {1.0}
set rda_Input(ui_xcap_scale) {1.0}
set rda_Input(ui_res_scale) {1.0}
set rda_Input(ui_shr_scale) {1.0}
set rda_Input(ui_time_unit) {none}
set rda_Input(ui_cap_unit) {}
set rda_Input(ui_oa_reflib) {}
set rda_Input(ui_oa_abstractname) {}
set rda_Input(ui_sigstormlib) {}
set rda_Input(ui_cdb_file) {}
set rda_Input(ui_echo_file) {}
set rda_Input(ui_xilm_file) {}
set rda_Input(ui_qxtech_file) {}
set rda_Input(ui_qxlib_file) {}
set rda_Input(ui_qxconf_file) {}
set rda_Input(ui_pwrnet) {vdd vdde vdd0 vdd1 vdd2 vdd3 vddcore}
set rda_Input(ui_gndnet) {gnd gnde \
CLKSLEEP TQ DIGA DIGB KOFF REFA REFB REFC REFD REFE REFF \
A13SRC A12SRC A11SRC A10SRC A9SRC A8SRC A7SRC A6SRC A5SRC A4SRC A3SRC A2SRC A1SRC A0SRC \
IO_CLKSLEEP IO_TQ IO_DIGA IO_DIGB IO_KOFF IO_REFA IO_REFB IO_REFC IO_REFD IO_REFE IO_REFF \
IO_A13SRC IO_A12SRC IO_A11SRC IO_A10SRC IO_A9SRC IO_A8SRC IO_A7SRC IO_A6SRC IO_A5SRC IO_A4SRC IO_A3SRC IO_A2SRC IO_A1SRC IO_A0SRC \
}
set rda_Input(flip_first) {1}
set rda_Input(double_back) {1}
set rda_Input(assign_buffer) {1}
set rda_Input(ui_gen_footprint) {0}
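
The timing library entries in top.conf all follow the same directory pattern of the design kit. As a hedged sketch in plain Tcl, the pattern can be captured in a small helper that builds one path from the library name, its version, the corner directory and the Best/Worst suffix. The procedure name `timing_lib_path` and its parameters are hypothetical; the design-kit path and corner name are the ones that appear in the listing, and only four of the libraries are shown.

```tcl
# Sketch only: rebuild some of the timing-library paths used in top.conf.
# timing_lib_path is a hypothetical helper, not part of the original flow.
proc timing_lib_path {kit_dir lib version corner kind} {
    return "$kit_dir/${lib}_SNPS-AVT_$version/SIGNOFF/$corner/PT_LIB/${lib}_$kind.lib"
}

set kit_dir /designkit/cmos090_50a
set corner bc_1.10V_m40C_wc_0.90V_105C

# Library name / version pairs for the core cell libraries.
set core_libs {CORE90GPLVT 2.1 CORE90GPSVT 2.1 CORX90GPLVT 4.2 CORX90GPSVT 4.2}

set max_libs {}
foreach {lib version} $core_libs {
    lappend max_libs [timing_lib_path $kit_dir $lib $version $corner Worst]
}

puts [join $max_libs "\n"]
```

Writing the list explicitly, as the original configuration file does, has the advantage that the file can be compared line by line with the design-kit installation.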
C.3 IO_Filler.tcl
#define user grid
setPreference ConstraintUserXGrid 0.56
setPreference ConstraintUserYGrid 0.56
snapFPlanIO -usergrid
redraw

#add IO filler from the bigger to the smaller
addIoFiller -cell IOFILLER64_LIN -prefix iofillperi
addIoFiller -cell IOFILLER32_LIN -prefix iofillperi
addIoFiller -cell IOFILLER16_LIN -prefix iofillperi
addIoFiller -cell IOFILLER8_LIN -prefix iofillperi
addIoFiller -cell IOFILLER4_LIN -prefix iofillperi
addIoFiller -cell IOFILLER2_LIN -prefix iofillperi
addIoFiller -cell IOFILLER1_LIN -prefix iofillperi
redraw
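
The IO fillers are added from the biggest to the smallest cell. A minimal sketch of why this ordering matters is given below in plain Tcl: a gap is filled greedily, always trying the largest filler first, so that few cells are needed and the smallest one closes whatever remains. The procedure `greedy_fill` is hypothetical, and treating the numeric suffix of each filler name as its width in arbitrary units is an assumption made only for this illustration.

```tcl
# Sketch only: greedy gap filling, largest filler first.
# Sizes are treated as widths in arbitrary units (assumption).
proc greedy_fill {gap sizes} {
    set used {}
    foreach s $sizes {
        # Place this filler as many times as it still fits.
        while {$gap >= $s} {
            lappend used $s
            set gap [expr {$gap - $s}]
        }
    }
    return $used
}

set sizes {64 32 16 8 4 2 1}
puts [greedy_fill 100 $sizes]
```

For a gap of 100 units the sketch selects the fillers 64, 32 and 4. If the list were processed from the smallest size upwards, the same gap would be filled with one hundred unit cells.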
C.4 do_power_domains.tcl
#create power domains
deletePowerDomain
createPowerDomain PD0 -timinglibs "CORE90GPSVT"
createPowerDomain PD1 -timinglibs "CORE90GPSVT"
createPowerDomain PD2 -timinglibs "CORE90GPLVT"
createPowerDomain PD3 -timinglibs "CORE90GPLVT"
createPowerDomain PDCORE -timinglibs "CORE90GPSVT"


#include instances
modifyPowerDomainMember PD0 -instance core/mult_0 -power (vdd0:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD0 -instance io_co_vdd_io_co_0 -power (vdd0:VDDCORE1V0) -move

modifyPowerDomainMember PD1 -instance core/mult_1 -power (vdd1:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD1 -instance io_co_vdd_io_co_1 -power (vdd1:VDDCORE1V0) -move

modifyPowerDomainMember PD2 -instance core/mult_2 -power (vdd2:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD2 -instance io_co_vdd_io_co_2 -power (vdd2:VDDCORE1V0) -move

modifyPowerDomainMember PD3 -instance core/mult_3 -power (vdd3:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD3 -instance io_co_vdd_io_co_3 -power (vdd3:VDDCORE1V0) -move

modifyPowerDomainMember PDCORE -instance io_co_vdd_io_co_core -power (vddcore:VDDCORE1V0) -move
modifyPowerDomainMember PDCORE -instance * -power (vddcore:vdd) -ground (gnd:gnd)

#resize it
modifyPowerDomainAttr PDCORE -box 194.04 381.76 691.76 587.76 -rsExts 10 10 40 10 -minGaps 10 10 10 10
createPowerDomainCut 640.88 469.28 691.76 597.76
modifyPowerDomainAttr PD0 -box 194.04 194.04 433.76 361.76 -rsExts 10 10 10 10 -minGaps 10 10 10 10
modifyPowerDomainAttr PD1 -box 194.04 606.56 640.88 994.32 -rsExts 10 10 10 10 -minGaps 10 10 10 10
modifyPowerDomainAttr PD2 -box 452.26 194.04 691.76 361.76 -rsExts 10 10 10 10 -minGaps 10 10 10 10
modifyPowerDomainAttr PD3 -box 659.48 490.44 994.32 994.32 -rsExts 10 10 10 10 -minGaps 10 10 10 10
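
The power-domain script assigns a standard-Vth (SVT) or low-Vth (LVT) timing library to each domain. As a sketch under the assumption that only the domain-to-library mapping matters, the five createPowerDomain lines can be derived from a single table in plain Tcl. The commands are built as strings and printed rather than executed, and the names `domain_libs` and `pd_cmds` are illustrative only.

```tcl
# Sketch only: derive the createPowerDomain commands from one mapping table.
# Domain and library names are those used in do_power_domains.tcl.
set domain_libs {
    PD0    CORE90GPSVT
    PD1    CORE90GPSVT
    PD2    CORE90GPLVT
    PD3    CORE90GPLVT
    PDCORE CORE90GPSVT
}

set pd_cmds {}
foreach {domain lib} $domain_libs {
    lappend pd_cmds "createPowerDomain $domain -timinglibs \"$lib\""
}

foreach cmd $pd_cmds {
    puts $cmd
}
```

The table makes the design intent visible at a glance: multipliers 0 and 1 and the core logic use the SVT library, while multipliers 2 and 3 use the LVT library.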
C.5 create_global_net.tcl
Puts "###############################"
Puts "###"
Puts "### Power declaration for std-cells and IO PADs"
Puts "###"
Puts "###############################"


###
### WARNING : All the global nets should be declared first in the ".conf" file
###


###
### first, declare vdd/gnd pin's for all std-cells
###

globalNetConnect vdd -type pgpin -pin {vdd } -inst * -module {}
globalNetConnect gnd -type pgpin -pin {gnd } -inst * -module {}

### declare 0/1 vhdl/verilog constants to be on vdd/gnd supplys
globalNetConnect vdd -type tiehi -module {}
globalNetConnect gnd -type tielo -module {}

###
### IO pads
### - All the instance names for the IO pads must have the "io" prefix
###


### IO's & core supply
globalNetConnect vdd -type pgpin -pin {vdd } -inst io* -module {} -override
globalNetConnect gnd -type pgpin -pin {gnd } -inst io* -module {} -override

### remaining IOs pins
globalNetConnect gnde -type pgpin -pin {gnde } -inst io* -module {} -override
globalNetConnect vdde -type pgpin -pin {vdde } -inst io* -module {} -override

globalNetConnect IO_CLKSLEEP -type pgpin -pin {CLKSLEEP } -inst io* -module {} -override
globalNetConnect IO_TQ -type pgpin -pin {TQ } -inst io* -module {} -override
globalNetConnect IO_DIGA -type pgpin -pin {DIGA } -inst io* -module {} -override

globalNetConnect IO_DIGB -type pgpin -pin {DIGB } -inst io* -module {} -override
globalNetConnect IO_KOFF -type pgpin -pin {KOFF } -inst io* -module {} -override

globalNetConnect IO_REFA -type pgpin -pin {REFA } -inst io* -module {} -override
globalNetConnect IO_REFB -type pgpin -pin {REFB } -inst io* -module {} -override
globalNetConnect IO_REFC -type pgpin -pin {REFC } -inst io* -module {} -override
globalNetConnect IO_REFD -type pgpin -pin {REFD } -inst io* -module {} -override
globalNetConnect IO_REFE -type pgpin -pin {REFE } -inst io* -module {} -override
globalNetConnect IO_REFF -type pgpin -pin {REFF } -inst io* -module {} -override

globalNetConnect IO_A0SRC -type pgpin -pin {A0SRC } -inst io* -module {} -override
globalNetConnect IO_A1SRC -type pgpin -pin {A1SRC } -inst io* -module {} -override
globalNetConnect IO_A2SRC -type pgpin -pin {A2SRC } -inst io* -module {} -override
globalNetConnect IO_A3SRC -type pgpin -pin {A3SRC } -inst io* -module {} -override
globalNetConnect IO_A4SRC -type pgpin -pin {A4SRC } -inst io* -module {} -override
globalNetConnect IO_A5SRC -type pgpin -pin {A5SRC } -inst io* -module {} -override
globalNetConnect IO_A6SRC -type pgpin -pin {A6SRC } -inst io* -module {} -override
globalNetConnect IO_A7SRC -type pgpin -pin {A7SRC } -inst io* -module {} -override
globalNetConnect IO_A8SRC -type pgpin -pin {A8SRC } -inst io* -module {} -override
globalNetConnect IO_A9SRC -type pgpin -pin {A9SRC } -inst io* -module {} -override
globalNetConnect IO_A10SRC -type pgpin -pin {A10SRC } -inst io* -module {} -override
globalNetConnect IO_A11SRC -type pgpin -pin {A11SRC } -inst io* -module {} -override
globalNetConnect IO_A12SRC -type pgpin -pin {A12SRC } -inst io* -module {} -override
globalNetConnect IO_A13SRC -type pgpin -pin {A13SRC } -inst io* -module {} -override


############ remaining 3V3 IOs pins ############

globalNetConnect vdde -type pgpin -pin {vdde3v3 } -inst io* -module {} -override

globalNetConnect IO_CLKSLEEP -type pgpin -pin {CLKSLEEP3V3 } -inst io* -module {} -override
globalNetConnect IO_TQ -type pgpin -pin {TQ3V3 } -inst io* -module {} -override
globalNetConnect IO_DIGA -type pgpin -pin {CHIPSLEEP3V3 } -inst io* -module {} -override

globalNetConnect IO_REFA -type pgpin -pin {REFAPBIAS3V3 } -inst io* -module {} -override
globalNetConnect IO_REFB -type pgpin -pin {REFBAMPL3V3 } -inst io* -module {} -override
globalNetConnect IO_REFC -type pgpin -pin {REFCAMPH3V3 } -inst io* -module {} -override
globalNetConnect IO_REFD -type pgpin -pin {REFDNBIAS3V3 } -inst io* -module {} -override
globalNetConnect IO_REFE -type pgpin -pin {REFEIO3V3 } -inst io* -module {} -override

globalNetConnect IO_A0SRC -type pgpin -pin {A0SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A1SRC -type pgpin -pin {A1SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A2SRC -type pgpin -pin {A2SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A3SRC -type pgpin -pin {A3SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A4SRC -type pgpin -pin {A4SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A5SRC -type pgpin -pin {A5SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A6SRC -type pgpin -pin {A6SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A7SRC -type pgpin -pin {A7SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A8SRC -type pgpin -pin {A8SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A9SRC -type pgpin -pin {A9SRC3V3 } -inst io* -module {} -override
C.5. create global net.tcl 179
92 globalNetConnect IO A10SRC −type pgpin −pin {A10SRC3V3 } − inst i o∗ −module {}
−overr ide
93 globalNetConnect IO A11SRC −type pgpin −pin {A11SRC3V3 } − inst i o∗ −module {}
−overr ide
94 globalNetConnect IO A12SRC −type pgpin −pin {A12SRC3V3 } − inst i o∗ −module {}
−overr ide
95 globalNetConnect IO A13SRC −type pgpin −pin {A13SRC3V3 } − inst i o∗ −module {}
−overr ide

###
globalNetConnect CLKSLEEP -type pgpin -pin {CLKSLEEP3V3} -inst io_ref* -module {} -override
globalNetConnect TQ -type pgpin -pin {TQ3V3} -inst io_ref* -module {} -override
globalNetConnect DIGA -type pgpin -pin {CHIPSLEEP3V3} -inst io_ref* -module {} -override

globalNetConnect REFA -type pgpin -pin {REFAPBIAS3V3} -inst io_ref* -module {} -override
globalNetConnect REFB -type pgpin -pin {REFBAMPL3V3} -inst io_ref* -module {} -override
globalNetConnect REFC -type pgpin -pin {REFCAMPH3V3} -inst io_ref* -module {} -override
globalNetConnect REFD -type pgpin -pin {REFDNBIAS3V3} -inst io_ref* -module {} -override
globalNetConnect REFE -type pgpin -pin {REFEIO3V3} -inst io_ref* -module {} -override

globalNetConnect A0SRC -type pgpin -pin {A0SRC3V3} -inst io_ref* -module {} -override
globalNetConnect A1SRC -type pgpin -pin {A1SRC3V3} -inst io_ref* -module {} -override
globalNetConnect A2SRC -type pgpin -pin {A2SRC3V3} -inst io_ref* -module {} -override
globalNetConnect A3SRC -type pgpin -pin {A3SRC3V3} -inst io_ref* -module {} -override
globalNetConnect A4SRC -type pgpin -pin {A4SRC3V3} -inst io_ref* -module {} -override
globalNetConnect A5SRC -type pgpin -pin {A5SRC3V3} -inst io_ref* -module {} -override
globalNetConnect A6SRC -type pgpin -pin {A6SRC3V3} -inst io_ref* -module {} -override

###
globalNetConnect CLKSLEEP -type pgpin -pin {CLKSLEEP3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect TQ -type pgpin -pin {TQ3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect DIGA -type pgpin -pin {CHIPSLEEP3V3} -inst io_co_vssio_ref_asrc -module {} -override

globalNetConnect REFA -type pgpin -pin {REFAPBIAS3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect REFB -type pgpin -pin {REFBAMPL3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect REFC -type pgpin -pin {REFCAMPH3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect REFD -type pgpin -pin {REFDNBIAS3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect REFE -type pgpin -pin {REFEIO3V3} -inst io_co_vssio_ref_asrc -module {} -override

globalNetConnect A0SRC -type pgpin -pin {A0SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect A1SRC -type pgpin -pin {A1SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect A2SRC -type pgpin -pin {A2SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect A3SRC -type pgpin -pin {A3SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect A4SRC -type pgpin -pin {A4SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect A5SRC -type pgpin -pin {A5SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override
globalNetConnect A6SRC -type pgpin -pin {A6SRC3V3} -inst io_co_vssio_ref_asrc -module {} -override

### Mult IO power pad
globalNetConnect vddcore -type pgpin -pin {VDDCORE*} -inst io_co_vdd_io_co_core -module {} -override
globalNetConnect vdd0 -type pgpin -pin {VDDCORE*} -inst io_co_vdd_io_co_0 -module {} -override
globalNetConnect vdd1 -type pgpin -pin {VDDCORE*} -inst io_co_vdd_io_co_1 -module {} -override
globalNetConnect vdd2 -type pgpin -pin {VDDCORE*} -inst io_co_vdd_io_co_2 -module {} -override
globalNetConnect vdd3 -type pgpin -pin {VDDCORE*} -inst io_co_vdd_io_co_3 -module {} -override

### connect cells to the correct io
globalNetConnect vddcore -type pgpin -pin {vdd} -inst * -module core -override
globalNetConnect vdd0 -type pgpin -pin {vdd} -inst * -module core/mult_0 -override
globalNetConnect vdd1 -type pgpin -pin {vdd} -inst * -module core/mult_1 -override
globalNetConnect vdd2 -type pgpin -pin {vdd} -inst * -module core/mult_2 -override
globalNetConnect vdd3 -type pgpin -pin {vdd} -inst * -module core/mult_3 -override


###
### execute command
###
applyGlobalNets


###
### check all design
### (a specific check can also be performed in menu: FloorPlan->Global Net Connection->check button)
###
#checkdesign -all
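The per-pin connections in this script all follow one pattern, so they could be generated rather than written by hand. The following is only a minimal sketch, in Python rather than the Tcl of the script. The helper names `global_net_connect` and `asrc_connections` are hypothetical, and the `IO_A<n>SRC` net names assume that the underscores lost in typesetting are restored as shown.

```python
def global_net_connect(net, pin, inst="io*", module="{}"):
    """Format one globalNetConnect line for a power/ground pin.

    Hypothetical helper: it only builds the command text, it does not
    talk to the place-and-route tool.
    """
    return (f"globalNetConnect {net} -type pgpin -pin {{{pin}}} "
            f"-inst {inst} -module {module} -override")


def asrc_connections(count=14, suffix=""):
    """Return the A0SRC..A<count-1>SRC connections, one command per pin."""
    return [global_net_connect(f"IO_A{n}SRC", f"A{n}SRC{suffix}")
            for n in range(count)]


if __name__ == "__main__":
    # Print the 3V3 variant of the fourteen source pins.
    for line in asrc_connections(suffix="3V3"):
        print(line)
```

The generated text can be pasted into the script or sourced from a file; the commands themselves are unchanged.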
C.6 pwr.tcl
#add rings (core + power domains)

#extern ring
addRing \
    -spacing_bottom 3.0 \
    -spacing_top 3.0 \
    -spacing_right 3.0 \
    -spacing_left 3.0 \
    -width_bottom 10 \
    -width_top 10 \
    -width_right 10 \
    -width_left 10 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.45 \
    -offset_top 0.45 \
    -offset_right 0.45 \
    -offset_left 0.45 \
    -center 1 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -around core \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -nets {gnd vddcore}

#PD0
deselectAll
selectGroup PD0
addRing \
    -type block_rings \
    -around power_domain \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.45 \
    -offset_top 0.45 \
    -offset_right 0.45 \
    -offset_left 0.45 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -nets {vdd0}
deselectGroup PD0

#PD1
selectGroup PD1
addRing \
    -type block_rings \
    -around power_domain \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.45 \
    -offset_top 0.45 \
    -offset_right 0.45 \
    -offset_left 0.45 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -nets {vdd1}
deselectGroup PD1

#PD2
selectGroup PD2
addRing \
    -type block_rings \
    -around power_domain \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.45 \
    -offset_top 0.45 \
    -offset_right 0.45 \
    -offset_left 0.45 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -nets {vdd2}
deselectGroup PD2

#PD3
selectGroup PD3
addRing \
    -type block_rings \
    -around power_domain \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.45 \
    -offset_top 0.45 \
    -offset_right 0.45 \
    -offset_left 0.45 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -nets {vdd3}
deselectGroup PD3

#PDCORE
selectGroup PDCORE
addRing \
    -type block_rings \
    -around power_domain \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.45 \
    -offset_top 0.45 \
    -offset_right 0.45 \
    -offset_left 0.45 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -left 0 \
    -tl 1 \
    -bl 1 \
    -nets {vddcore}
deselectGroup PDCORE

#IO REF COMPENSATION
addRing \
    -type block_rings \
    -around each_block \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 0.55 \
    -offset_top 0.55 \
    -offset_right 0.55 \
    -offset_left 0.55 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -jog_distance 0.45 \
    -threshold 0.45 \
    -nets {vdd vdde}

addRing \
    -type block_rings \
    -around each_block \
    -spacing_bottom 1.5 \
    -spacing_top 1.5 \
    -spacing_right 1.5 \
    -spacing_left 1.5 \
    -width_bottom 8 \
    -width_top 8 \
    -width_right 8 \
    -width_left 8 \
    -layer_bottom M7 \
    -layer_top M7 \
    -layer_right M6 \
    -layer_left M6 \
    -offset_bottom 20.5 \
    -offset_top 20.5 \
    -offset_right 20.5 \
    -offset_left 20.5 \
    -stacked_via_top_layer M7 \
    -stacked_via_bottom_layer M1 \
    -threshold 0.45 \
    -bottom 0 \
    -right 0 \
    -lb 1 \
    -tr 1 \
    -nets {gnd}
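The rings of the power domains PD0 to PD3 differ only in the group name and the net name, so the block could be produced from a small template. The sketch below is an illustration in Python, not part of the scripts above. The helper names `add_ring` and `domain_ring` are hypothetical, and it covers only the plain power-domain rings, not the extra `-left`, `-tl` and `-bl` options of the PDCORE ring.

```python
SIDES = ("bottom", "top", "right", "left")
# Horizontal sides on M7, vertical sides on M6, as in the listing.
LAYERS = {"bottom": "M7", "top": "M7", "right": "M6", "left": "M6"}


def add_ring(nets, around="power_domain", spacing=1.5, width=8, offset=0.45):
    """Build the text of one block-ring addRing command (hypothetical helper)."""
    opts = ["-type block_rings", f"-around {around}"]
    for name, value in (("spacing", spacing), ("width", width)):
        opts += [f"-{name}_{side} {value}" for side in SIDES]
    opts += [f"-layer_{side} {LAYERS[side]}" for side in SIDES]
    opts += [f"-offset_{side} {offset}" for side in SIDES]
    opts += [
        "-stacked_via_top_layer M7",
        "-stacked_via_bottom_layer M1",
        "-jog_distance 0.45",
        "-threshold 0.45",
        "-nets {" + " ".join(nets) + "}",
    ]
    # One option per line, joined with Tcl line continuations.
    return "addRing \\\n    " + " \\\n    ".join(opts)


def domain_ring(group, net):
    """Select a power-domain group, ring it with one net, then deselect it."""
    return "\n".join([f"selectGroup {group}", add_ring([net]),
                      f"deselectGroup {group}"])


if __name__ == "__main__":
    for index in range(4):
        print(f"#PD{index}")
        print(domain_ring(f"PD{index}", f"vdd{index}"))
        print()
```

Keeping the geometry in one place makes it harder for the four domains to drift apart when a width or an offset is changed.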
C.7 followPin.tcl
Puts "###############################"
Puts "###"
Puts "### Create std-cell follow pin"
Puts "###"
Puts "###############################"

deselectAll
cutCoreRow

#avoid via6_7 on VDDCO_HDRV_MT_1V0_LIN edge
createRouteBlk -box 233.4700 1042.6900 276.7550 1047.9950 -layer 6
createRouteBlk -box 911.768 1042.69 955.176 1049.237 -layer 6
createRouteBlk -box 504.69 139.071 548.532 146.015 -layer 6
createRouteBlk -box 232.895 140.142 277.333 146.272 -layer 6

# Use editPowerVia to generate stripes-followpins
#-noBlockPins firstAfterRowEnd
sroute -verbose -noPadRings -noStripes \
    -corePinMaxViaWidth 30 -corePinMaxViaHeight 70 \
    -targetViaTopLayer 7 -crossoverViaTopLayer 7 \
    -secondaryStopSCPin firstStripe \
    -viaConnectToShape {stripe ring} \
    -deleteExistingRoutes \
    -padPinWidth 7 \
    -nets {gnd vdd vdd0 vdd1 vdd2 vdd3 vddcore vdde}

#avoid routing too close to the pad
createRouteBlk -box 923.796 143.219 927.711 156.959 -layer all

#Route IO REF special nets
sroute -verbose -noPadRings -padPinToAlignedBlockPin \
    -stopStripeSCPin lastPadRing -deleteExistingRoutes -nets {\
    CLKSLEEP TQ DIGA DIGB KOFF REFA REFB REFC REFD REFE REFF \
    A6SRC A5SRC A4SRC A3SRC A2SRC A1SRC A0SRC }
# A13SRC A12SRC A11SRC A10SRC A9SRC A8SRC A7SRC

#Remove blockages
deleteAllRouteBlks

clearCutRow
deselectAll
C.8 place_output_bufs.tcl
placeInstance Z_svt_buf 194.04 570.385 MY
placeInstance Z_lvt_buf 194.04 535.085 R180
placeInstance s_out_buf 194.04 409.714 R180
C.9 output_nets.tcl
#outputs nets with width of 1 um
setEdit -force_special 1
setEdit -width_horizontal 1
setEdit -width_vertical 1

#s_out
setEdit -nets s_out
setEdit -layer_horizontal M1
setEdit -layer_vertical M2
uiSetTool addWire
editAddRoute 194.847 411.480
editAddRoute 145.233 411.459
editAddRoute 144.821 413.911
editAddRoute 145.115 413.911
editCommitRoute 145.115 413.911
setEdit -layer_horizontal M2
editAddRoute 142.614 413.933
editAddRoute 145.259 413.439
editAddRoute 145.043 413.933
editCommitRoute 145.043 413.933
uiSetTool select

#Z_lvt
setEdit -nets Z_lvt
setEdit -layer_horizontal M1
setEdit -layer_vertical M2
uiSetTool addWire
editAddRoute 195.073 536.971
editAddRoute 145.874 536.559
editAddRoute 145.028 549.990
setEdit -layer_horizontal M2
editAddRoute 142.497 549.872
editCommitRoute 142.497 549.872
uiSetTool select

#Z_svt
setEdit -nets Z_svt
uiSetTool addWire
setEdit -layer_horizontal M1
setEdit -layer_vertical M2
editAddRoute 195.030 572.574
editAddRoute 146.930 572.574
editAddRoute 146.764 685.499
setEdit -layer_horizontal M2
editAddRoute 142.467 685.146
editCommitRoute 142.467 685.146
uiSetTool select
C.10 fix_drc_errors.tcl
#################################################
#Fix the DRC errors discovered with Calibre DRC
#################################################

#lower net
#move the first part
selectWire 950.5100 203.6300 952.6100 203.7700 3 gnd
selectWire 952.4700 203.6300 954.2900 203.7700 3 gnd
selectWire 954.1500 203.6300 981.6100 203.7700 3 gnd
editMove y -1.399
deselectAll
#add via at the end
setEdit -layer_horizontal M3 -layer_vertical M4 -nets gnd
setEdit -width_horizontal 0.14 -width_vertical 0.14
editAddRoute 981.604 202.292
editAddRoute 983.905 202.306
editCommitRoute 983.905 202.306
uiSetTool select
deselectAll

##
selectWire 983.9900 203.6300 984.6900 203.7700 3 gnd
editDelete -objects Selected
deleteTiles -selected
deleteBumps -selected
selectWire 984.5500 203.6300 984.6900 244.8500 2 gnd
editStretch y -0.683 low
editAddRoute 984.610 203.019
editAddRoute 981.323 203.076
editCommitRoute 981.323 203.076
uiSetTool select
deselectAll

#################################

#upper net
#delete existing wire
selectWire 946.4700 450.8700 947.7300 451.0100 5 gnd
selectWire 947.5900 450.8700 947.7300 451.5700 4 gnd
selectWire 947.5900 451.4300 981.6100 451.5700 3 gnd
editDelete -objects Selected
deleteTiles -selected
deleteBumps -selected
#create the new one
editAddRoute 942.920 448.604
editAddRoute 944.305 455.932
editAddRoute 984.069 455.849
editCommitRoute 984.069 455.849
setEdit -layer_vertical M3
#add extra via
editAddRoute 942.905 455.921
editAddRoute 942.913 455.503
editAddRoute 942.961 455.527
editCommitRoute 942.961 455.527
setEdit -layer_vertical M4
uiSetTool select
deselectAll

#################################

#move offending net M1.S.3.1
#uiSetTool moveWire
selectWire 385.6300 383.6700 387.7300 383.8100 3 clk_p
selectWire 383.6700 383.6700 385.7700 383.8100 3 clk_p
editMove y 1.246
deselectAll

selectWire 382.8400 383.4000 385.4200 383.5200 1 core/data_gen_1/clk_p__Fence_N0
editDelete -objects Selected
deleteTiles -selected
deleteBumps -selected
selectWire 380.5900 383.3900 382.9700 383.5300 3 core/data_gen_1/clk_p__Fence_N0
editStretch x 3.292 high
uiSetTool select
deselectAll

##################

#move an offending net on M1
selectWire 911.2600 528.7200 911.6000 528.8400 1 core/mult_3/mult_par4_0/AandBx15xx17x
editMove y 0.558
uiSetTool select
deselectAll
C.11 top.ctstch
### CLK ###

# Sample Gated CTS Command
AutoCTSRootPin io_clk/ZI

NoGating NO
Buffer IVSVTX6 BFSVTX1 BFSVTX8 BFSVTX10 BFSVTX12 IVLVTX6 BFLVTX1 BFLVTX8 BFLVTX10 BFLVTX12

MaxDelay 10ps
MinDelay 0ps
MaxSkew 100ps

End


### Reset ###

# Sample Gated CTS Command
AutoCTSRootPin io_rst_n/ZI

NoGating NO
Buffer IVSVTX6 BFSVTX1 BFSVTX2 BFSVTX4 BFSVTX6 BFSVTX8 BFSVTX12 IVLVTX6 BFLVTX1 BFLVTX2 BFLVTX4 BFLVTX6 BFLVTX8 BFLVTX12


MaxSkew 1ns

End
C.12 ioplace.io
######################################################
#                                                    #
#  Silicon Perspective, A Cadence Company            #
#  FirstEncounter IO Assignment                      #
#                                                    #
######################################################

Version: 2

Pad: io_corner_4 SE CORNER_LIN
Pad: io_corner_3 NE CORNER_LIN
Pad: io_corner_2 NW CORNER_LIN
Pad: io_corner_1 SW CORNER_LIN

Pad: io_co_vdd_io_co_1 N VDDCO_HDRV_MT_1V0_LIN
Pad: io_vssio_2 N VSSIO_3V3_LIN
Pad: io_co_vdd_io_co_core N VDDCO_HDRV_MT_1V0_LIN
Pad: io_vdd_io_2 N VDDIO_3V3_LIN
Pad: io_co_vssio_co_3 N VSSIOCO_LIN
Pad: io_co_vdd_io_co_3 N VDDCO_HDRV_MT_1V0_LIN

Pad: io_sin W
Pad: io_sout W
Pad: io_Z_lvt W
Pad: io_Z_svt W
Pad: io_shift_n W
Pad: io_co_vssio_co_2 W VSSIOCO_LIN

Pad: io_co_vdd_io_co_0 S VDDCO_HDRV_MT_1V0_LIN
Pad: io_clk S
Pad: io_co_vdd_io_co_2 S VDDCO_HDRV_MT_1V0_LIN
Pad: io_load_n S
Pad: io_co_vssio_co_1 S VSSIOCO_LIN
Pad: io_co_vssio_ref_asrc S VSSIO_3V3_REF_ASRC_LIN

Pad: io_co_vdd_io_co_g E VDDIOCO_LIN
Pad: io_co_vssio_co_g E VSSIOCO_LIN
Pad: io_sel_0 E
Pad: io_sel_1 E
Pad: io_sel_reg E
Pad: io_rst_n E
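Each `Pad:` entry of the file gives an instance name, a side and, for power and corner pads, a cell name. As a hedged illustration in Python, such a file could be read back and grouped by side to check the pad ring before placement. The function name `parse_pads` is hypothetical, the sample lines assume the underscore placement used above, and only the three-field layout seen in this listing is handled.

```python
from collections import defaultdict

SAMPLE = """\
# FirstEncounter IO Assignment
Version: 2

Pad: io_corner_4 SE CORNER_LIN
Pad: io_clk S
Pad: io_sel_0 E
Pad: io_rst_n E
"""


def parse_pads(text):
    """Group the 'Pad:' entries of an IO assignment file by side.

    Returns {side: [(instance, cell_or_None), ...]} in file order.
    Comment lines, blank lines and the 'Version:' line are ignored.
    """
    pads = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("Pad:"):
            continue
        fields = line[len("Pad:"):].split()
        name, side = fields[0], fields[1]
        cell = fields[2] if len(fields) > 2 else None
        pads[side].append((name, cell))
    return dict(pads)


if __name__ == "__main__":
    for side, entries in parse_pads(SAMPLE).items():
        print(side, [name for name, _ in entries])
```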
Appendix D
FPGA source code
D.1 main_FPGA.vhd
-------------------------------------------------------------------------------
-- Title      : FPGA code for demonstrator test board
-- Project    :
-------------------------------------------------------------------------------
-- File       : main_FPGA.vhd
-- Author     : <mtschuster@WS-3439>
-- Company    :
-- Created    : 2007-02-03
-- Last update: 2007-02-03
-- Platform   :
-- Standard   : VHDL'93
-------------------------------------------------------------------------------
-- Description: This code generates the stimuli needed to:
--              1) Select the desired multiplier;
--              2) Reset internal registers;
--              3) Execute 10'000'000 Multiply and Accumulate operations on
--                 the 64 bit register;
--              4) Read back the content of the accumulator register with a
--                 frequency divided by 4;
--              5) Verify the read data against the expected value and output
--                 the decision on the pass/fail pins;
--              6) At the end of this sequence, chip clock is stopped to
--                 allow static power measurements.
-------------------------------------------------------------------------------
-- Copyright (c) 2007
-------------------------------------------------------------------------------
-- Revisions  :
-- Date        Version  Author      Description
-- 2007-02-03  1.0      mtschuster  Created
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity main is
  port (
    -- 4 user switches (normally ON)
    S1 : in std_logic;
    S2 : in std_logic;
    S3 : in std_logic;
    S4 : in std_logic;
    -- relay outputs for the 4 multipliers
    -- BEWARE: 0=on (VDDM); 1=off (GND)
    P1 : out std_logic;  -- mult0: RCA32 SVT
    P2 : out std_logic;  -- mult1: RCA32 PAR4 SVT
    P3 : out std_logic;  -- mult2: RCA32 LVT
    P4 : out std_logic;  -- mult3: RCA32 PAR4 LVT
    -- Test result leds
    OK_led : out std_logic;  -- test passed
    KO_led : out std_logic;  -- test failed
    -- Control FPGA pins
    mult_num : in std_logic_vector(1 downto 0);  -- multiplier selector
    -- Serial interface pins
    CHIP_sout    : in  std_logic;  -- serial interface output
    CHIP_sin     : out std_logic;  -- serial interface input
    CHIP_shift_n : out std_logic;  -- enable bit shifting, active low
    CHIP_load_n  : out std_logic;  -- enable parallel load, active low
    CHIP_sel     : out std_logic_vector(1 downto 0);  -- select the multiplier under test
    CHIP_sel_reg : out std_logic;  -- route to/from the shift register
    CHIP_clock   : out std_logic;  -- chip clock
    CHIP_rst_n   : out std_logic;  -- chip asynchronous reset, active low
    clock        : in  std_logic   -- FPGA clock
    );
end main;

architecture arch of main is
  -- state machine states
  type FSM_states is (INIT, RUN, READBACK, VERIFY);
  signal curr_state, next_state : FSM_states;
  signal rst_n : std_logic;  -- global reset
  signal count : integer range 0 to 16777215;  -- counter delaying the next state

  signal clock_slow        : std_logic;  -- clock divided by 4
  signal clock_slow_enable : std_logic;  -- enable clock_slow, active high
  signal clock_div_counter : std_logic_vector(1 downto 0);  -- clock divider counter

  signal read : std_logic;  -- readback trigger
  signal data : std_logic_vector(63 downto 0);  -- readback data
  signal fail : std_logic;  -- test failed
  signal pass : std_logic;  -- test passed

  signal mult_sel : std_logic_vector(1 downto 0);  -- multiplier selection
begin
  -------------------------------------------------------------------------------
  -- COMBINATORIAL LOGIC
  -------------------------------------------------------------------------------

  -- reset: through switch 4
  rst_n <= S4;
  -- multiplier selection and chip clock multiplexing
  mult_sel <= mult_num;
  CHIP_sel <= mult_sel;
  clock_slow <= clock_div_counter(1);
  CHIP_clock <= clock_slow when clock_slow_enable = '1' else clock;
  -- test result leds
  OK_led <= pass;
  KO_led <= fail;
  -- select multipliers power
  P1 <= '0' when mult_sel = "00" else '1';
  P2 <= '0' when mult_sel = "01" else '1';
  P3 <= '0' when mult_sel = "10" else '1';
  P4 <= '0' when mult_sel = "11" else '1';

  -- Finite state machine definition
  FSM : process (curr_state, count, mult_sel, data, S4)
    -- number of clocks of the init state
    constant INIT_LENGTH : integer := 4;
    -- number of clocks for the running state
    -- it corresponds to the number of multiplications + 2
    -- parallel multipliers require 3 extra clocks due to latency
    constant RUNNING_LENGTH     : integer := 10000002;
    constant RUNNING_LENGTH_PAR : integer := RUNNING_LENGTH + 3;
    -- number of clocks to execute the readback task, based on the full speed clock
    constant READBACKLENGTH : integer := 254;

    -- expected result after 10'000'000 multiplications and accumulations
    constant EXPECTED_RESULT : std_logic_vector(63 downto 0) := X"0E4DD39EA61421FC";
    -- on low supply voltages (<0.4V) one extra multiplication can occur
    constant EXPECTED_RESULT_LV : std_logic_vector(63 downto 0) := X"1628d37ce47c248c";

  begin
    -- chip default values
    CHIP_sin <= '0';
    CHIP_shift_n <= '1';
    CHIP_load_n <= '1';
    CHIP_sel_reg <= '0';
    clock_slow_enable <= '0';
    CHIP_rst_n <= '1';
    read <= '0';
    pass <= '0';
    fail <= '0';
    -- state machine
    case curr_state is
      -------------------------------------------------------------------------
      -- load zeros from random generator to serial interface registers
      -------------------------------------------------------------------------
      when INIT =>
        CHIP_load_n <= '0';   -- parallel load
        CHIP_shift_n <= '1';  -- no shift
        CHIP_sel_reg <= '1';  -- from rand to regs
        CHIP_rst_n <= '0';    -- maintain the random generator at zero
        -- after the init time has passed go to the next state
        if count = INIT_LENGTH then
          next_state <= RUN;
        else
          next_state <= INIT;
        end if;
      -------------------------------------------------------------------------
      -- run the multiplications and accumulations
      -------------------------------------------------------------------------
      when RUN =>
        CHIP_load_n <= '0';   -- parallel load
        CHIP_shift_n <= '1';  -- no shift
        CHIP_sel_reg <= '0';  -- from rand to multiplier
        CHIP_rst_n <= '1';    -- activate the random generator
        -- after the multiplication and accumulation, go to the next state
        -- due to the parallel nature of mult1 and mult3, a few extra clocks are required
        if (count = RUNNING_LENGTH and mult_sel(0) = '0')
          or (count = RUNNING_LENGTH_PAR and mult_sel(0) = '1') then
          next_state <= READBACK;
        else
          next_state <= RUN;
        end if;
      -------------------------------------------------------------------------
      -- read back values from the registers through the serial interface
      -------------------------------------------------------------------------
      when READBACK =>
        CHIP_load_n <= '1';        -- serial behaviour
        CHIP_shift_n <= '0';       -- activate data shifting
        read <= '1';               -- activate readback
        clock_slow_enable <= '1';  -- switch to slow clock
        -- trigger data reading and once finished go to the final state
        if count = READBACKLENGTH then
          next_state <= VERIFY;
        else
          next_state <= READBACK;
        end if;
      -------------------------------------------------------------------------
      -- verify read data against the expected data and output the result to leds
      -------------------------------------------------------------------------
      when VERIFY =>
        next_state <= VERIFY;      -- looped state until a reset is fired
        clock_slow_enable <= '1';  -- remain with slow clock
        if (data = EXPECTED_RESULT) or (data = EXPECTED_RESULT_LV) then
          pass <= '1';  -- green LED on
        else
          fail <= '1';  -- red LED on
        end if;
      -------------------------------------------------------------------------
      -- if something strange happens, go to the init state
      -------------------------------------------------------------------------
      when others =>
        next_state <= INIT;
    end case;
  end process FSM;

  ---------------------------------------------
  -- SEQUENTIAL LOGIC
  ---------------------------------------------
  -- asynchronous reset registers
  FSM_regs : process (clock, rst_n)
  begin
    if rst_n = '0' then
      curr_state <= INIT;
      data <= (others => '0');
    elsif (clock'event and clock = '1') then
      curr_state <= next_state;
      -- read data back when read = '1'
      if read = '1' and clock_div_counter = "01" then
        data <= data(62 downto 0) & CHIP_sout;
      end if;
    end if;
  end process FSM_regs;

  -- synchronous reset registers
  -- at each new FSM state the counter is reset
  counter : process (clock, rst_n)
  begin
    if clock'event and clock = '1' then
      if rst_n = '0' or (curr_state /= next_state) then
        count <= 0;
      else
        count <= count + 1;
      end if;
    end if;
  end process counter;

  -- generate lower frequency clock
  gen_CHIP_clk : process (clock, read)
  begin
    if clock'event and clock = '0' then
      if read = '0' then
        clock_div_counter <= "10";
      else
        clock_div_counter <= clock_div_counter + "01";
      end if;
    end if;
  end process gen_CHIP_clk;
end arch;
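The sequencing of the state machine can be checked outside a VHDL simulator with a small behavioural model. The Python sketch below mirrors only the next-state logic and the counter reset of the listing; it ignores the asynchronous reset, the clock divider and the data path. The function names are hypothetical, and `run_length` is a parameter so that the model can be exercised with short runs.

```python
INIT_LENGTH = 4
RUNNING_LENGTH = 10_000_002   # number of multiplications + 2
READBACK_LENGTH = 254         # READBACKLENGTH in the listing


def next_state(state, count, mult_sel, run_length=RUNNING_LENGTH):
    """Next-state function of the FSM process (behavioural sketch)."""
    if state == "INIT":
        return "RUN" if count == INIT_LENGTH else "INIT"
    if state == "RUN":
        # Parallel multipliers (odd selector) need 3 extra clocks.
        limit = run_length + 3 if mult_sel & 1 else run_length
        return "READBACK" if count == limit else "RUN"
    if state == "READBACK":
        return "VERIFY" if count == READBACK_LENGTH else "READBACK"
    if state == "VERIFY":
        return "VERIFY"
    return "INIT"


def cycles_until_verify(mult_sel, run_length=RUNNING_LENGTH):
    """Count FPGA clock edges from the first INIT cycle until VERIFY is entered."""
    state, count, cycles = "INIT", 0, 0
    while state != "VERIFY":
        nxt = next_state(state, count, mult_sel, run_length)
        # The counter is cleared whenever the state is about to change.
        count = 0 if nxt != state else count + 1
        state = nxt
        cycles += 1
    return cycles


def power_relays(mult_sel):
    """Active-low relay outputs (P1..P4): 0 powers the selected multiplier."""
    return tuple(0 if mult_sel == k else 1 for k in range(4))


if __name__ == "__main__":
    # A short run keeps the pure-Python loop fast.
    print(cycles_until_verify(0, run_length=10))
    print(power_relays(2))
```

Because the counter restarts at zero on each state change and the transition fires when it equals the length constant, a state with length N lasts N + 1 clock cycles in this model.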
Appendix E
MATLAB based automated test
functions
E.1 test mult.m
function data = test_mult(mult, freq, volt)
% mult in 0-3
% freq lower than 80MHz
% volt lower or equal to 1V

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Connect devices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Open devices
k2400 = visa('ni', 'GPIB0::13::0::INSTR');
k213 = visa('ni', 'GPIB0::11::0::INSTR');
agilent = visa('ni', 'GPIB0::10::0::INSTR');
fopen(k2400);
fopen(k213);
fopen(agilent);

% Get information about devices
fprintf(k2400, '*IDN?');
current_sense = fscanf(k2400)
voltage_source = fscanf(k213)
fprintf(agilent, '*IDN?');
frequency_generator = fscanf(agilent)


%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Initialize devices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Reset devices
fprintf(k2400, '*RST');
fprintf(agilent, '*RST');
% Prepare the k2400 for current measurements
fprintf(k2400, ':SOUR:FUNC VOLT'); % set source to voltage
fprintf(k2400, ':SOUR:VOLT:MODE FIXED'); % set source to DC
fprintf(k2400, ':SOUR:VOLT 0'); % reset source to 0
fprintf(k2400, ':SENS:FUNC "CURR"'); % select current measurement
fprintf(k2400, ':CURR:NPLC 0.1'); % set integration time: 1 = 1/50Hz, 0.1 = 1/500Hz
fprintf(k2400, ':CURR:PROT 0.02'); % set compliance to 20mA
fprintf(k2400, ':CURR:RANG 0.01'); % set range to 10mA
fprintf(k2400, ':FORM:ELEM CURR'); % set current data format
fprintf(k2400, ':TRIG:COUNT 5'); % number of readings per trigger
fprintf(k2400, ':ARM:SOUR PSTEST'); % enable trigger on positive edge of SOT
fprintf(k2400, ':SOUR:DEL 0.05'); % delay between measurements set to 50ms
fprintf(k2400, ':OUTP ON'); % enable output


%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Body of the code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

set_mult(mult, k213); % select multiplier

i = 0;
for f = freq % for each frequency do
    set_freq(f, agilent); % set the frequency
    pause(2); % allow frequency to stabilize
    i = i+1; j = 0;
    for V = volt % for each supply voltage do
        j = j+1;
        % check that supply voltage never exceeds 1V
        vdd_core = V+0.1; % set core voltage 100mV higher than multiplier
        if vdd_core > 1
            vdd_core = 1;
        end
        if V > 1
            V = 1;
        end

        set_voltage(vdd_core, k213); % set the core supply voltage
        fprintf(k2400, [':SOUR:VOLT ' num2str(V)]); % set multiplier supply voltage
        start_off(k213); % reset the FPGA
        fprintf(k2400, ':INIT'); % arm the current sensing
        start_on(k213); % activate the FPGA and trigger the sensing
        dyn = str2num(get_current(k2400)); % read the current values
        pass = pass_test(k213); % check if test passes or fails
        fprintf(k2400, ':ARM:SOUR IMM'); % take an immediate measure for static
        fprintf(k2400, ':INIT'); % arm the current sensing
        stat = str2num(get_current(k2400)); % read the current values
        data(i,j,:) = [f V max(dyn) min(stat) pass]; % store results in data
        fprintf(k2400, ':ARM:SOUR PSTEST'); % enable trigger on positive edge of SOT
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Disconnect devices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Disable outputs
fprintf(k2400, 'OUTP OFF');
fprintf(agilent, 'OUTP OFF');

% Close devices
fclose(k2400);
fclose(k213);
fclose(agilent);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Agilent 33250A frequency generator code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function set_freq(freq, dev) % set a square clock on agilent
fprintf(dev, ['APPL:SQU ' num2str(freq) ', 3.3, 1.65']);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Keithley 2400 current sensing code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function curr = get_current_one(dev) % take a one-shot measure
fprintf(dev, ':MEAS:CURR?');
curr = fscanf(dev);

function curr = get_current(dev) % take a current measure
fprintf(dev, ':FETCH?');
curr = fscanf(dev);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Keithley 213 quad voltage source code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function set_voltage(v, dev) % calibrated voltage on k213
fprintf(dev, ['P1C0A0R1H0J128,143V' num2str(v) 'X']);

function start_on(dev) % CHIP_rst_n high
fprintf(dev, 'P4V3.3X');

function start_off(dev) % CHIP_rst_n low
fprintf(dev, 'P4V0X');

function set_mult(num, dev); % select the multiplier under test
if num == 0
    str1 = 'P2V0X';
    str2 = 'P3V0X';
elseif num == 1
    str1 = 'P2V3.3X';
    str2 = 'P3V0X';
elseif num == 2
    str1 = 'P2V0X';
    str2 = 'P3V3.3X';
else
    str1 = 'P2V3.3X';
    str2 = 'P3V3.3X';
end
fprintf(dev, str1);
fprintf(dev, str2);

function pass = pass_test(dev) % wait for test results and check if test passed
pass = 0;
fail = 0;
while (pass == 0 && fail == 0)
    fprintf(dev, 'U5X');
    din = str2num(fscanf(dev));
    pass = bitget(din, 1);
    fail = bitget(din, 3);
end
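
The instrument-independent logic of the script above can be checked without any hardware. The following is a minimal sketch in Python, offered only as an illustration and not as part of the original test bench. It mirrors three steps of the listing: the clamping of both supplies to 1 V, the two-bit multiplier selection on ports P2 and P3, and the decoding of the pass and fail flags from the digital input word. The function names `clamp_supplies`, `mult_select_commands` and `decode_status` are hypothetical. The command strings are copied from the listing.

```python
def clamp_supplies(v, margin=0.1, v_max=1.0):
    """Return (vdd_core, v_mult): core is 100 mV above the multiplier, both capped at v_max."""
    vdd_core = min(v + margin, v_max)
    v_mult = min(v, v_max)
    return vdd_core, v_mult


def mult_select_commands(num):
    """Build the two port commands; P2 carries bit 0 and P3 carries bit 1 of the selection."""
    sel = num if num in (0, 1, 2) else 3  # any other value selects the fourth multiplier
    p2 = "3.3" if sel & 1 else "0"
    p3 = "3.3" if sel & 2 else "0"
    return [f"P2V{p2}X", f"P3V{p3}X"]


def decode_status(din):
    """Return (passed, failed); MATLAB bitget(din, 1) is the LSB and bitget(din, 3) is bit value 4."""
    passed = din & 1
    failed = (din >> 2) & 1
    return passed, failed
```

For example, `mult_select_commands(2)` yields the same pair of strings as the `num == 2` branch of `set_mult`, and `decode_status(4)` reports a failed test.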
