Technology-Accurate Variability-Aware Performance Macromodels for On-Chip Communication Synthesis by Bacinschi, Petru Bogdan
Dissertation approved by Department 18,
Electrical Engineering and Information Technology,
of Technische Universität Darmstadt
for the degree of
Doktor-Ingenieur (Dr.-Ing.)
by
Dipl.-Ing.
Petru Bogdan Bacinschi
born on 24 July 1981
in Luduș, Romania
Referee: Prof. Dr. Dr. h. c. mult. Manfred Glesner
Co-referee: Prof. Dr.-Ing. Anca Manolescu
Co-referee: Prof. Dr.-Ing. Norbert Wehn
Date of submission: 2 July 2010
Date of oral examination: 5 November 2010
D17
Darmstadt 2010
To my lovely wife,
Veronica
Acknowledgments
This dissertation is the outcome of my work as a teaching and research assistant at the
Institute of Microelectronic Systems, Technische Universität Darmstadt. Many people
have contributed in countless ways to making this work possible. I would like to sincerely
thank my Doktorvater, Prof. Manfred Glesner, for his kind advice and guidance during my
doctoral years and for involving me in various teaching activities and research projects
funded by several companies and scientific foundations.
I also express my gratitude towards Prof. Anca Manolescu and Prof. Norbert Wehn,
who kindly agreed to act as reviewers for this thesis. Their comments and observations
have been very valuable for improving the quality of the work. Furthermore, I would like
to thank Prof. Udo Schwalke, Prof. Volker Hinrichsen, and Prof. Gerd Balzer for acting
as members of the examination committee.
This work could not have been accomplished in a pleasant way without a good atmosphere
at the workplace. For this, I would like to express my thanks to all colleagues
at the institute with whom I had the pleasure of carrying out important research projects,
producing reports and papers, sharing various teaching activities, and solving several
pressing administrative issues. The friendly help and support of Hans-Peter Keil,
Leandro Möller, Sebastian Pankalla, François Philipp, Faizal Samman, Christopher Spies,
Pongyupinpanich Surapong, and Ping Zhao permitted me to concentrate on writing the
final manuscript and preparing the exam. I would also like to thank my older and former
colleagues Prof. Alberto García, Andre Guntoro, Heiko Hinkelmann, Prof. Klaus
Hofmann, Prof. Thomas Hollstein, Prof. Leandro Indrusiak, Octavian Mitrea, Massoud
Momeni, Tudor Murgan, Oana Mutihac, Oliver Soffke, and Prof. Peter Zipf, who shared,
on various occasions, their experience regarding a multitude of issues like writing papers
and project proposals, finalizing project reports, as well as scientific and less-scientific
practical advice. Further, I am also greatly indebted to my colleagues Roland Brand and
Andreas Schmidt for their continuous support with many technical issues. Many thanks
also to our friendly secretaries, Silvia Hermann and Iselona Klenk.
Last but not least, I wish to greatly thank my lovely wife Veronica for her great love
and support, and also my entire family for all their efforts and for the education
and opportunities I received.
Abstract
A major challenge in the design of multi-processor systems-on-chip (MPSoCs) is to provide
an adequate on-chip communication architecture. A series of parameters
must be considered, including communication data size, speed, power consumption, and
topology, to name only a few. Additionally, variable data flows, as well as increasing process
and environmental parameter variations, lead to undesired effects, such as reduced
yield or increased leakage power levels. The main objective of this thesis is to provide
a methodology for the parametrized joint optimization of delay and energy consump-
tion during the communication architecture synthesis, by performing a statistical analysis
and optimization of parametric yield under the influence of parameter variations. More-
over, in order to increase the accuracy of the proposed methodology, circuit-level models
for the communication activities and technology-accurate models for the interconnection
segments are developed.
In order to accurately specify statistical parameter distributions in the application profile
and process parameter variations, this thesis develops a complete methodology for
variability description and propagation across performance macromodel expressions. For
this purpose, a generalized random variable model is developed, capable of representing
non-standard estimated distributions using discretized pdfs with adjustable accuracy.
Another important contribution is the development of a propagation method for
statistical distributions across the modeling expressions using analytic implementations
of the most often used operators, as well as the introduction of a fast generalized method
for implementing statistical operators with a precision comparable to Monte Carlo at a
very small fraction of the execution time. Based upon this methodology, statistical performance
macromodels for delay and energy consumption are constructed.
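The idea of a discretized pdf and a statistical sum operator can be illustrated with a minimal Python sketch. This is not the implementation developed in the thesis: the names `discretize`, `stat_sum`, and `mean` are hypothetical, the two operands are assumed to be independent and to share the same bin width, and each bin is represented by its center, which is a simplifying approximation.

```python
import random

def discretize(samples, lo, hi, nb):
    """Histogram samples into nb equal-width bins on [lo, hi].

    Returns (lo, width, masses), where masses[k] is the probability of bin k.
    Samples outside [lo, hi] are clamped into the first or last bin.
    """
    width = (hi - lo) / nb
    masses = [0.0] * nb
    for s in samples:
        k = min(max(int((s - lo) / width), 0), nb - 1)
        masses[k] += 1.0 / len(samples)
    return lo, width, masses

def stat_sum(x, y):
    """Approximate the distribution of X + Y for independent X and Y.

    Assumption of this sketch: both operands use the same bin width, so the
    result is a discrete convolution of the bin masses.
    """
    lo_x, w, px = x
    lo_y, w_y, py = y
    assert abs(w - w_y) < 1e-12, "sketch assumes equal bin widths"
    pz = [0.0] * (len(px) + len(py) - 1)
    for i, a in enumerate(px):
        for j, b in enumerate(py):
            pz[i + j] += a * b
    # Bin k of the result is centered at the sum of the operand bin centers.
    return lo_x + lo_y + 0.5 * w, w, pz

def mean(rv):
    """Mean of a discretized pdf, evaluated at the bin centers."""
    lo, w, p = rv
    return sum(m * (lo + (k + 0.5) * w) for k, m in enumerate(p))

# Two hypothetical delay distributions (arbitrary units), estimated from samples.
rng = random.Random(0)
x = discretize([rng.gauss(10.0, 1.0) for _ in range(20000)], 5.0, 15.0, 50)
y = discretize([rng.gauss(20.0, 2.0) for _ in range(20000)], 10.0, 30.0, 100)
z = stat_sum(x, y)
```

In this sketch the accuracy is controlled only by the number of bins; once the operands are discretized, the sum requires no further sampling, which is the kind of saving over repeated Monte Carlo evaluation that the paragraph above refers to.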
Since the use of different signaling methods has a strong impact on communication
performance, a further important contribution is the inclusion of signaling techniques
in the communication synthesis in the form of circuit-level communication models. First,
a technology-dependent statistical transistor model is derived, which supports variability
descriptions for all process-dependent parameters and employs the previously-developed
statistical operators to propagate the parameter distributions throughout the model ex-
pressions. Furthermore, pulsed current-mode and voltage-mode signaling circuits are
analyzed and modeled using the statistical transistor model, equivalent circuit models,
and analytic expressions of the current and voltage signals. Within this context, the impact
of voltage scaling and body biasing on the circuit performance is also analyzed.
Afterwards, the circuit-level models are employed for modeling entire communication
segments and the segment models are included within the system-level performance
macromodels for the communication synthesis. The accuracy of communication segment
models is further enhanced through a wide-bandwidth characterization method for arbi-
trary interconnect segments. The method relies on an initial set of parameter extractions,
designed to reflect the particularities of a given manufacturing process, and applies a
sequence of incremental extrapolations to construct the model of a specified segment.
Accuracy evaluations show results close to those of industry-standard field simulators.
Finally, synthesis results in the context of delay-driven and energy-driven optimiza-
tions show the efficiency of pulsed current-mode signaling on long communication seg-
ments and the advantages of voltage-mode signaling on short links. In addition, it is
shown that voltage scaling and body biasing can be integrated effectively in the commu-
nication synthesis to reduce energy consumption.
Summary (Kurzfassung)
A major challenge in the design of multi-processor systems-on-chip (MPSoCs) is the
creation of a suitable on-chip communication architecture. A number of parameters must
be taken into account, such as communication data volume, speed, power consumption,
and topology, to name only a few. In addition, variable data flows as well as increasing
process and environmental parameter variations lead to undesired effects, such as a
reduced manufacturing yield or an increased power dissipation. The main objective
of this dissertation is to develop a method for the parametrized joint optimization
of delay and energy consumption within the communication synthesis,
characterized by a statistical analysis and optimization of the parametric yield
under the influence of parameter variations. Furthermore, circuit models for the
communication as well as technology-accurate models for the interconnection segments
are developed in order to increase the accuracy of the proposed method.
In order to accurately specify statistical parameter distributions in the application
profile as well as process parameter variations, this dissertation develops an
integrated method for describing variability and propagating it through model
equations. For this purpose, a general random variable model is developed that can
describe non-standard distributions by means of discretized density functions with
adjustable accuracy. Further important contributions are the development of a method
for propagating statistical distributions through model equations using analytic
implementations of the most frequently used operators, as well as the introduction of
a general method for implementing fast statistical operators with Monte Carlo-like
accuracy. Based on this method, statistical macromodels for delay and energy
consumption are created.
Since the use of different signaling methods has an important influence on
communication performance, a further important contribution is the integration of
signaling techniques into the communication synthesis as circuit-level communication
models. First, a technology-dependent statistical transistor model is derived, which
supports variability descriptions for all process parameters and uses the previously
developed statistical operators. Furthermore, signal driver circuits in pulsed current
mode and voltage mode are analyzed. These are modeled with the help of the designed
statistical transistor model as well as equivalent circuit models and analytic
expressions of the current and voltage signals. In this context, the effects of voltage
scaling and body biasing (substrate bias) on the circuit behavior are analyzed.
Subsequently, the circuit models are used for modeling entire communication segments,
and the segment models are used within the macromodels for the system-level
communication synthesis. The accuracy of the models for communication segments is
further improved by a wideband characterization method for arbitrary interconnect
segments. The method is based on a series of parameter extractions that capture the
particularities of the specific manufacturing process. Building on these and applying
incremental extrapolations, a model for a selected communication segment is then
created. Accuracy analyses show that the model accuracy achieved in this way is close
to results that can be obtained with industry-standard field simulators.
Finally, synthesis results optimized for delay or energy consumption show the
efficiency of pulsed current-mode signaling on long communication segments, as well as
the advantages of voltage-mode signaling for short links. In addition, it is shown that
voltage scaling and body biasing can be used effectively in the communication synthesis
to reduce energy consumption.
Table of Contents
1 Introduction and Overview
1.1 Motivation
1.2 Research Objectives
1.3 Thesis Outline
2 Fundamentals and Challenges of Accurate Communication Synthesis
2.1 Application Profile and Design Space Exploration
2.1.1 Behavioral Specification
2.1.2 Architectural Description and Design Constraints
2.1.3 Performance Model Creation
2.1.4 Estimation and Optimization
2.2 Performance Modeling
2.2.1 Performance Macromodel Concept
2.2.2 Delay Macromodels
2.2.3 Macromodels for Power Estimation
2.2.4 Statistical and Process-Accurate Modeling
2.3 Resource Scheduling
2.3.1 Preemptive Methods
2.3.2 Non-Preemptive Methods
2.4 Parameter Variations and Statistical Analysis
2.4.1 Sources of Parameter Variations
2.4.2 Statistical Analysis Methods
2.5 Technology Accuracy
2.5.1 Process Characterization
2.5.2 Yield Optimization
2.5.3 Transistor-Level Models
2.6 Optimization Resources at the Circuit Level
2.6.1 Choice of Signaling
2.6.2 Voltage Scaling
2.6.3 Body Biasing
2.7 Summary
3 Variability-Aware Performance Macromodels
3.1 Application and Architectural Profile
3.1.1 Extraction of the Application Profile
3.1.2 Architecture and Technology Specification
3.1.3 Variability Description
3.2 Random Variable Model
3.2.1 Employed Standard Distributions
3.2.2 Discretized pdf Model
3.2.3 Typical Usage and Accuracy Control
3.2.4 Sampling Technique for Discretized pdfs
3.3 Method for the Propagation of Distributions
3.3.1 Statistical Sum and Maximum Operators
3.3.2 Statistical Difference Operator
3.3.3 Statistical Product Operator
3.3.4 Numerical Implementation of Other Statistical Operators
3.3.5 Handling Correlations
3.3.6 Random Variable Algebra
3.4 Embedding Technique for Random Variables
3.4.1 Variability Sources and RV Leaf Nodes
3.4.2 Variability Propagation and Estimation of Results
3.4.3 Changes and Updates Propagated Downstream
3.4.4 Result Interpretation
3.5 Performance Macromodels for Delay Estimation
3.5.1 Structure and Properties
3.5.2 Application Examples
3.6 Performance Macromodels for Energy Consumption
3.6.1 Dynamic Energy Macromodels
3.6.2 Leakage Energy Macromodels
3.6.3 Application Examples
3.7 Partitioning, Assignment, and Scheduling Optimization
3.7.1 Methods for Solution Space Exploration
3.7.2 Cost Function Evaluation
3.7.3 Optimization Loop
3.7.4 Optimization Results
3.8 Summary
4 Technology-Accurate, Variability-Aware Circuit-Level Models
4.1 Variability-Aware Transistor Model
4.1.1 BSIM4.3-Based Current Source Model
4.1.2 Modeling Spatially-Correlated Process Parameter Variations
4.1.3 Inclusion of Random Variables and Results Estimation
4.2 Pulsed Current-Mode Signaling Model
4.2.1 Derivation of Current Switching Paths
4.2.2 Equivalent Current-Source Circuit Model
4.2.3 Analytic Model for Delay and Energy Consumption
4.2.4 Performance Evaluation under Voltage Scaling and Body Biasing
4.3 Voltage-Mode Signaling Model
4.3.1 Equivalent Current-Source Circuit Model
4.3.2 Analytic Model for Delay and Energy Consumption
4.3.3 Performance Evaluation under Voltage Scaling and Body Biasing
4.4 Modeling of Communication Segments
4.4.1 Transceiver and Interconnect Model
4.4.2 Floorplan Model using Clusters
4.4.3 Estimation of Communication Circuit Placement on Die
4.4.4 Quick Delay Solution
4.4.5 Implementation of Communication Nodes
4.4.6 Performance Results
4.5 Summary
5 Technology-Aware Characterization Method for On-Chip Segments
5.1 Wideband Characterization Method
5.1.1 Interconnect Modeling Challenges
5.1.2 Multistep Extrapolated S-Parameter Model
5.2 Parameter Extraction Framework
5.3 Multistep Extrapolation Method
5.3.1 Extraction of the Base Parameter Set
5.3.2 Incremental Extrapolation
5.3.3 Passivity Enforcement
5.4 Experimental Validation
5.5 Summary
6 Methodology Binding
6.1 Application Profile Example
6.1.1 Description of the SoC Resource Set
6.1.2 Floorplan Cluster Tree
6.1.3 Design Space Exploration Method
6.1.4 Cost Function Settings
6.2 Evaluation of Synthesis Results
6.2.1 Delay-Optimized Architecture
6.2.2 Energy-Optimized Architecture
6.2.3 Accuracy Evaluation
6.3 Summary
7 Conclusions
7.1 Contributions of the Work
7.2 Directions for Future Work
A Complex Expression of the Output Voltage for the Voltage-Mode Signaling Circuit
References
List of Abbreviations
ABB Adaptive Body Biasing
ASIC Application-Specific Integrated Circuit
ASIP Application-Specific Instruction-Set Processor
BSIM Berkeley Short-channel IGFET Model
CAD Computer-Aided Design
CD Critical Dimension
CDF Cumulative Distribution Function
CMOS Complementary Metal Oxide Semiconductor
CMP Chemical-Mechanical Polishing
CN Communication Node
CPU Central Processing Unit
CSF Communication Speed Flexibility
CSP Communicating Sequential Process
D2D Die-to-Die (Parameter Variations)
DAG Directed Acyclic Graph
DIBL Drain Induced Barrier Lowering
DOF Depth of Focus
DSM Deep Sub-Micron
DSP Digital Signal Processing
FBB Forward Body Biasing
FCT Floorplan Cluster Tree
FFT Fast Fourier Transform
FPGA Field-Programmable Gate Array
FSM Finite-State Machine
GPP General Purpose Processor
HDL Hardware Description Language
IGFET Insulated-Gate Field-Effect Transistor
ITRS International Technology Roadmap for Semiconductors
IP Intellectual Property
LDD Lightly Doped Drain
LER Line Edge Roughness
MC Monte Carlo
MOS Metal Oxide Semiconductor
MOSFET Metal Oxide Semiconductor Field-Effect Transistor
MPSoC Multiprocessor System-on-Chip
NDF Neighboring Density Factor
NMOS N-Type MOS
NoC Network-on-Chip
NRMSE Normalized RMSE
ODE Ordinary Differential Equation
OM Order of Magnitude
PE Processing Element
PCA Principal Component Analysis
PCM Pulsed Current Mode
pdf Probability Density Function
PM Performance Macromodel
PMN PM Node
PMOS P-Type MOS
PN Processing Node
PSK Phase-Shift Keying
PSM Program-State Machine
PTM Predictive Technology Model
PWL Piece-Wise Linear
RBB Reverse Body Biasing
RDF Random Dopant Fluctuations
RMS Root Mean Square
RMSE RMS Error
RSF Response Surface Function
RSM Response Surface Methodology
RT Resource Type
RTA Rapid Thermal Annealing
RTL Register Transfer Level
RV Random Variable
SA Simulated Annealing
SoC System-on-Chip
SPICE Simulation Program with Integrated Circuit Emphasis
STA Static Timing Analysis
TG Task Graph
UML Unified Modeling Language
VHDL Very-High-Speed Integrated Circuit Hardware Description Language
VM Voltage Mode
List of Tables
2.1 Scheduling table for the example in Fig. 2.14 [59].
2.2 Predicted three-sigma variations of device parameters across several technology nodes.
3.1 Parameters of the execution times of five PNs on different resources. Values given in nanoseconds.
3.2 Power parameters for five PNs and three different resources. Values given in milliwatts.
4.1 Example values for the simulations.
4.2 Input values for the communication synthesis of the three-task example.
5.1 Wire attributes for a three-wire M4-segment.
5.2 Maximum relative delay error across all considered metal layers and wires per segment.
6.1 Application profile parameters used for the communication synthesis.
6.2 Floorplan cluster parameters.
6.3 Scheduled start and end times for processing nodes, evaluated as 99% inferior quantile from the statistical distributions.
6.4 Parameters of the synthesized communication segments, evaluated as 99% inferior quantile from the statistical distributions.
6.5 Scheduled communication activities on the synthesized architecture from Fig. 6.4.
6.6 Parameters of the three synthesized communication segments shown in Fig. 6.5 (evaluated using the 99% inferior quantile from the statistical distributions).
6.7 Relative delay error of the communication circuit models with respect to circuit simulations.
List of Figures
2.1 Task dependencies represented as data flow graphs.
2.2 Task graph (a) and refined processing node representation at the operation level (b).
2.3 Extended task graph representation showing IP resources Ri and the inter-resource communication nodes CNi.
2.4 Section from a task graph with four processing nodes (a) and the resulting deterministic delay model for processing node 3 (b).
2.5 Task graph example (a) and the attached delay PM (b).
2.6 Modeling of data (a) and scheduling (b) dependencies (after [156]).
2.7 Control dependencies in the task graph (a) and in the delay macromodel (b) (after [56]).
2.8 Insertion of communication speed flexibility nodes in the task graph (a,b) and the corresponding delay macromodel structure (c) (after [155]).
2.9 Transformation of a multiple-pin net (a) into a two-pin net (b) (after [49]).
2.10 Open framework with embedded CAD tools for performance modeling and design exploration, as proposed in [14].
2.11 Power-optimized clustering of processing tasks (a) and the corresponding resource mappings and communication link (after [50]).
2.12 Task graph example (a) and the derived power PM (b) (after [56]).
2.13 Low power preemptive scheduling with fixed priority (after [146]).
2.14 Extended task graph example for static scheduling (after [59]).
2.15 Scheduling of four tasks (a) considering only critical-path information (b) and after including the resource mapping (c) (after [57]).
2.16 Classification of parameter variations.
2.17 Current/voltage mode repeater (after [17]).
2.18 Voltage and current sensing circuits: (a) hybrid-mode transmitter, (b) voltage-mode receiver, and (c) current-mode receiver (after [18]). Pull-down signaling path in current-mode (d).
2.19 Body biasing of NMOS and PMOS transistors in a triple-well process.
3.1 Description of timing and dynamic power values depending on e.g. resource mapping and parameter variations.
3.2 Application profiling steps.
3.3 Analytic and sampled pdfs for several standard distributions.
3.4 Discretized pdf over Nb bins.
3.5 Cumulative distribution function computed from a discrete pdf.
3.6 Sampling method using a standard uniform distribution and the CDF.
3.7 Limits of the overlap during the sum computation.
3.8 Limits of the overlap during the sum computation.
3.9 Estimated delay of three processing tasks computed using the sum operator and through Monte Carlo sampling.
3.10 Evaluation of the maximum between a random variable and a constant.
3.11 Subtrahend distribution mirrored across the ordinate.
3.12 Repartition of X and Y random variables across the four quadrants and discretized pdf of the product Z = XY.
3.13 Variable spans across multiple quadrants.
3.14 Relative positions of the {X,Y} partition corners.
3.15 Leakage energy distributions for three slacks, computed using the product operator and through direct sampling (Monte Carlo).
3.16 Fast numerical implementation with adjustable accuracy.
3.17 Accuracy of the implemented statistical operators for several values of Nb and Nsb.
3.18 Impact of increasing the number of bins Nb or individual samples Nsb on operator accuracy.
3.19 Pdfs obtained for Nb = 50 and Nsb = 50 compared with Monte Carlo for different statistical operators.
3.20 Influence of correlations on statistical result distributions (example for maximum operator).
3.21 Topological correlations at reconvergent nodes (a) tracked by testing inbound nodes for common parents (b).
3.22 Random variable representations embedded into the leaf nodes of performance models for variable parameters.
3.23 Pdf propagation at each operational node in a PM.
3.24 Evaluation of a PM propagated upstream from the output node.
3.25 Downstream propagation of a pdf update triggered by a change in system configuration.
3.26 Inferior quantile (a) and superior quantile (b) used as confidence points for design decisions.
3.27 Statistical performance macromodel for delay estimations.
3.28 Execution sequences and resource mappings for the four test scenarios.
3.29 Statistical delays evaluated using the delay PM.
3.30 Statistical performance macromodel for estimating the dynamic energy consumption.
3.31 Statistical performance macromodel for leakage energy estimation.
3.32 Delay-optimized resource mapping (a) and mapping with improved energy consumption (b).
3.33 Statistical dynamic energy (a) and leakage energy (b) consumptions evaluated using the energy PMs.
3.34 Initial random assignment and scheduling (a) and optimized configuration (b).
3.35 Delay, dynamic energy, and leakage energy results before and after the optimization.
4.1 Statistical current-source transistor model based on BSIM4.3 equations and
parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Die grid for modeling spatially-correlated process variations. . . . . . . . . . 109
4.3 Computed grid coordinates and correlation distance for the covariancema-
trix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4 Repartition of the correlation coefficient on a 9 × 6 grid, as reported to the
top-left cell, for a decay distance dd = 15mm and a residual correlation
ρr = 0.09. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 Spatially-correlated values of the threshold voltage parameter Vth0 from
grid cells 2 (a), 3 (b), and 4 (c), plotted with respect to the values from cell 1. 112
4.6 Subthreshold current plot for an NMOS transistor with W = 3µm, L =
80nm obtained with the commercial BSIM4 implementation in the Ca-
dence Spectre circuit simulator (a) and with the derived current-source
model (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.7 Variations in the output and transfer characteristics obtained for an NMOS
transistor withW = 3µm, L = 80nm obtained from the process parameter
variations described in Sec. 4.1.2. . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.8 Drain current distribution (a) and variation of the standard deviation over
the bias ranges (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.9 Pulsed current-mode signaling driver (a) and receiver circuit (b).
4.10 Transistor-level circuit implementation of the PCM driver.
4.11 Operation of the dynamic logic input control stage.
4.12 Switched current path flowing through transistors M1 and NM1.
4.13 Clock synchronization and output signals transmitting current pulses on the differential line.
4.14 Waveforms of the clock and data signals and the corresponding voltages at the near and far end of the differential line.
4.15 Current pulses on the interconnect lines and the corresponding drain and source voltages for transistor M1.
4.16 General line model with current-mode driver.
4.17 Line delay definition at 50% swing point (a) and the output voltage of the circuit model (b) used to compute the delay.
4.18 Current pulse shape (a) and the model approximation for computing the delay (b).
4.19 Delay (a) and energy (b) variation with interconnect line length.
4.20 Impact of voltage scaling and body bias on the static (a) and dynamic energy consumption (b) for a 15 mm interconnect line.
4.21 Voltage-mode buffer circuit and equivalent circuit model.
4.22 Region of interest in the drain current characteristic for computing the delay (a) and exponential approximation for the delay model (b).
4.23 Voltage-mode driver and line model.
4.24 Delay (a) and static energy (b) comparison between voltage-mode and PCM signaling.
4.25 Influence of voltage scaling and body biasing on the delay (a) and static energy consumption (b) of a 15 mm voltage-mode line.
4.26 Communication segment using the available circuit-level models.
4.27 Floorplan clusters enclosing the on-chip resources (a) and the corresponding floorplan cluster tree (FCT).
4.28 Estimation of communication circuit location for considering spatially-correlated parameter variations.
4.29 Fast approximation of the delay solution using the bisection method (example shown for the PCM signaling circuit).
4.30 Statistical model for signaling resources embedding the analytical formulations from Sec. 4.2 and 4.3.
4.31 Implementation of a communication node in the delay macromodel.
4.32 Inclusion of communication nodes in the dynamic energy macromodel.
4.33 Structural element for modeling the static energy of a communication node within the leakage energy macromodel (example shown for a PCM signaling circuit).
4.34 Task graph example with emphasis on communication nodes (a), the associated floorplan cluster tree (b), and the modeled communication segments (c).
4.35 Total delay of the synthesized structure from Fig. 4.34(c) with different signaling circuits on the two communication segments.
4.36 Influence of body biasing on the leakage energy of the configuration with voltage-mode signaling on segment 1 and PCM signaling on segment 2.
5.1 Complexity of mutually-coupled inductances in distributed RLCG models.
5.2 Overview of the extrapolated S-parameter modeling workflow.
5.3 Magnitude plot of the Z12, Y12, and S12 parameters for a single-wire segment.
5.4 Cross-section through the structural model of the CMOS process.
5.5 (a) Orthogonal routing directions in adjacent metal layers. (b) NDF values of 0, respectively 50%.
5.6 Structural model of an n-wire interconnect segment.
5.7 Associated n-port model for an n-wire segment.
5.8 Orthogonal sweeps of the wire attributes, illustrated here for length and spacing (NDF axis not shown).
5.9 Variable-width (a) and variable-spacing (b) sweeps during the initial parameter extraction.
5.10 Maximum NDF in the upper metal layer, with power grid and maximum-width signal line.
5.11 Passivation example for a single-wire interconnect segment (metal 1, l = 10 µm, w = 400 nm, s = 810 nm).
5.12 RMS error between extrapolated and extracted results for the entire range of tested interconnect segments.
5.13 Magnitude of extrapolated and extracted parameters for a single-wire interconnect segment.
5.14 Angle values for the extrapolated and extracted parameters of a single-wire segment.
5.15 Magnitude plot of six S-parameters for a three-wire M4-segment.
5.16 RMS errors between extrapolated and directly-extracted parameters (three-wire M4-segment).
5.17 Circuit employed for the transient simulations.
5.18 Signal propagation delays from three-wire interconnect segments placed on three metal layers.
5.19 Delay RMSE for the transient simulations of interconnect segments on metal 5.
6.1 Application task graph (a) and the considered SoC architecture (b).
6.2 Floorplan cluster tree for the processing resources (a) and one possible inter-resource connection in a hierarchical bus architecture (b).
6.3 Resource mapping configuration, scheduling sequences, and communication segments synthesized for minimum delay.
6.4 Delay-optimized communication architecture synthesized as a shared bus and three point-to-point links.
6.5 Architecture optimized for minimum energy consumption, requiring only four resources and three communication segments.
Chapter 1
Introduction and Overview
Contents
1.1 Motivation
1.2 Research Objectives
1.3 Thesis Outline
1.1 Motivation
The steady increase in performance requirements for embedded systems coupled with
the ability to integrate more transistors per unit area with every new technological node
have led to the concept of system-on-chip (SoC), an integrated version of the classical
embedded system architecture. Furthermore, the current trend to maximize the execu-
tion parallelism of general purpose processors by increasing the number of integrated
processing cores has been adopted also by the SoC architectures. Consequently, heteroge-
neous multi-processor systems-on-chip (MPSoCs) are currently the architecture of choice
for implementing complex consumer applications, such as high definition television re-
ceivers, mobile communication platforms, and video game consoles, and their usage is
thus increasing. While many-core architectures are advertising significant performance
boosts for parallel data-heavy applications, particularly in the case of heterogeneous im-
plementations the inter-core communication is likely to become a bottleneck and lead
to the saturation of performance increase with the number of processing elements. This
way, the design of an adequate communication architecture for many cores and for a vari-
able number of running applications is becoming one of the paramount design concerns
for MPSoCs. Hereby, a series of constraints must be considered, such as communication
needs, minimum performance level, power budget, area, required yield, to name only a
few.
Furthermore, the workload on each communication segment may have significant
time fluctuations, caused by multiple applications possibly sharing the same architecture,
resulting in variable data flows which must be transferred at the given parameters.
Apart from data flow variations, process as well as environmental parameter deviations
including temperature changes (hot-spots) and supply voltage variations may no longer
be neglected, since they increasingly lead to undesirable effects, such as reduced yield and
higher power dissipation. Particularly the increasing level of intra-die variations [20, 94]
exhibits a stronger impact on the total circuit delay and leakage variations [82] with every
new technology node.
The main objective of this thesis is to provide a methodology for the parametrized
joint optimization of delay and energy consumption at the system level during the com-
munication architecture synthesis, by performing a statistical analysis and optimization
of the parametric yield under the influence of parameter variations. The automated synthesis
of communication architectures has recently received significant attention from
the design community [156], with the focus on both performance and power optimization.
However, the inclusion of parameter variations, a rigorous statistical analysis using
arbitrary non-normal variability models together with algebraic operations on random
variables, the optimization of parametric yield, and technology accuracy through the use
of circuit-level models represent significant novel approaches.
1.2 Research Objectives
The goal of the present thesis is to provide an integrated methodology for the modeling
and optimization of on-chip communication synthesis, with an emphasis on parameter
variability and technology accuracy. For this purpose, a set of system-level statistical de-
lay and energy macromodels are developed, which employ accurate circuit-level models
for the communication structures. The developed macromodels are then employed to
explore the synthesis and optimization of on-chip communication architectures.
For the efficient characterization of application requirements, a profiling interface is
defined which allows the specification of application-relevant data, such as processing
tasks and communication loads, and of architecture and technology parameters, such
as MPSoC resources and technology parameters. Moreover, in order to efficiently char-
acterize non-Gaussian parameter variations, a random variable model with adjustable
accuracy is developed, which relies on the discrete representation of probability density
functions. Together with this model, a set of statistical operators are developed, including
an analytic implementation of a statistical product operator and fast numerical implemen-
tations with adjustable precision for any other statistical algebraic operation. Upon this
statistical method for the propagation of discrete pdf representations across algebraic ex-
pressions, a set of variability-aware macromodels for delay and energy consumption are
developed. The performance macromodel structures embed statistical operation nodes
which store locally a discrete pdf representation of the computed result. Next, the devel-
oped macromodels are employed to optimize the mapping and scheduling of processing
tasks on the MPSoC resources with respect to a desired parametric yield extracted from
the performance distributions using quantile functions.
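The link between a parametric yield target and a quantile of a performance distribution can be illustrated with a minimal Python sketch. It assumes a discretized pdf given as sorted support points with associated probability masses; the function name `quantile`, the tolerance, and the delay values are hypothetical and chosen only for illustration, not taken from the thesis implementation.

```python
from bisect import bisect_left
from itertools import accumulate


def quantile(values, probs, q):
    """Return the smallest support point v with P(X <= v) >= q.

    values: sorted support points of the discretized pdf (assumed layout)
    probs:  probability mass at each support point, summing to 1
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    cdf = list(accumulate(probs))
    # A small tolerance guards against floating-point round-off in the cdf.
    idx = min(bisect_left(cdf, q - 1e-12), len(values) - 1)
    return values[idx]


# Hypothetical discretized delay distribution (values in ns).
delays = [1.0, 1.2, 1.4, 1.6, 1.8]
masses = [0.05, 0.25, 0.40, 0.25, 0.05]

# Delay bound that is met by 95% of the modeled instances.
d95 = quantile(delays, masses, 0.95)
```

Under these assumptions, a delay constraint of `d95` corresponds to a parametric yield of at least 95% for the modeled distribution.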
An accurate modeling of the communication activities is achieved by developing circuit-level
models for the on-chip communication links. First, a technology-accurate statistical
transistor model is developed using BSIM4 equations, CMOS process parameters, and
the statistical methodology developed previously. This current-source transistor model
is then employed to develop circuit-level models for pulsed current-mode and voltage-
mode signaling circuits. Since the choice of different signaling methods as well as volt-
age scaling and body biasing have a significant influence on the on-chip communication
performance, these methods are applied to the developed circuit models and analyzed.
Further, the circuit-level models are employed in the modeling of on-chip communica-
tion segments and the corresponding communication activities. It is important to note
that this thesis does not focus on a particular communication architecture, such as hierar-
chical on-chip buses or networks-on-chip, but rather uses the concept of communication
segment to represent an on-chip communication link.
For accurate representations of on-chip interconnection segments and for validating
the synthesized architecture, a computationally-efficient wide-bandwidth characteriza-
tion method is developed. The method defines a set of initial parameter extractions for
characterizing the CMOS manufacturing process, followed by on-demand multistep ex-
trapolations for modeling a given interconnection segment with specified wire length,
wire widths, spacings, metal layer, and neighboring routing information.
1.3 Thesis Outline
This thesis is organized in three main parts. First, an introductory part presents the mo-
tivation, problem formulation, fundamentals, and current challenges in on-chip commu-
nication synthesis. After that, the core of the thesis contains the main contributions in the
areas of variability-aware performance macromodels, circuit-level modeling of communication
structures, and technology-accurate characterization of interconnection segments.
At the end, the thesis summarizes the proposed methodology in an application context
and presents several concluding remarks.
Part I Chapter 2 presents the most important aspects which must be considered in the
design of on-chip communication architectures. Within this context, the concepts
of delay and power macromodels are detailed and several modeling approaches
are discussed. In addition, the importance of statistical modeling combined with
process accuracy for performance estimations of state-of-the-art silicon implemen-
tations is emphasized. Several statistical methods to analyze parameter variations
are examined and their drawbacks are indicated. Moreover, the shortcomings of
several modeling approximations and of the underlying transistor-level models are
evidenced. Finally, additional resources at the circuit level which can be applied in
the optimization of communication architectures are illustrated.
Part II Chapters 3, 4, and 5 represent the main contributions of this work. Starting from
the need to accurately specify statistical parameter distributions in the application
profile and for process parameter variations, chapter 3 develops a complete methodology
for the variability propagation across performance macromodel expressions.
For this purpose, a generalized random variable model is developed, capable of
representing non-standard estimated distributions using discretized pdfs with ad-
justable accuracy. Another important contribution is the development of a prop-
agation method for statistical distributions across the modeling expressions using
analytic implementations of the most often used operators and introducing a fast
generalized method for implementing statistical operators with a precision compa-
rable to Monte Carlo at a very small fraction of the execution time. Based upon
this methodology, statistical performance macromodels for delay and energy con-
sumption are constructed. Since the use of different signaling methods has a strong
impact on communication performance, chapter 4 brings an important contribution
to the inclusion of signaling techniques in communication synthesis frameworks
in the form of circuit-level communication models. First, a technology-dependent
statistical transistor model is derived, which supports variability descriptions for
all process-dependent parameters and employs the statistical operators developed
in the previous chapter to propagate the parameter distributions throughout the
model expressions. Furthermore, pulsed current-mode and voltage-mode signaling
circuits are analyzed and modeled using the statistical transistor model, which is dependent
on process and environmental variations. Within this context, the impact
of voltage scaling and body biasing on the circuit performance is also analyzed.
Afterwards, the circuit-level models are employed for modeling entire communica-
tion segments and the segment models are included into the system-level perfor-
mance macromodels employed in the communication synthesis. The accuracy of
communication segment models is further enhanced in chapter 5, which introduces
a computationally-efficient wide-bandwidth characterization method for arbitrary
interconnect segments. The method relies on an initial set of parameter extractions,
designed to reflect the particularities of a given manufacturing process, and ap-
plies a sequence of incremental extrapolations to obtain the n-port model of a speci-
fied segment. Accuracy evaluations show a performance close to industry-standard
field simulators.
Part III The results of applying the developed methodology in the context of a practical ex-
ample are analyzed in chapter 6. The choice of communication segments, signaling
methods, supply voltage, and body bias are presented and discussed for optimiza-
tion scenarios oriented on delay or energy minimization. The accuracy achieved by
the modeling framework is again investigated for the synthesized communication
segments. Finally, chapter 7 summarizes the thesis and identifies possible directions
for future enhancements.
Chapter 2
Fundamentals and Challenges of
Accurate Communication Synthesis
Contents
2.1 Application Profile and Design Space Exploration
2.1.1 Behavioral Specification
2.1.2 Architectural Description and Design Constraints
2.1.3 Performance Model Creation
2.1.4 Estimation and Optimization
2.2 Performance Modeling
2.2.1 Performance Macromodel Concept
2.2.2 Delay Macromodels
2.2.3 Macromodels for Power Estimation
2.2.4 Statistical and Process-Accurate Modeling
2.3 Resource Scheduling
2.3.1 Preemptive Methods
2.3.2 Non-Preemptive Methods
2.4 Parameter Variations and Statistical Analysis
2.4.1 Sources of Parameter Variations
2.4.2 Statistical Analysis Methods
2.5 Technology Accuracy
2.5.1 Process Characterization
2.5.2 Yield Optimization
2.5.3 Transistor-Level Models
2.6 Optimization Resources at the Circuit Level
2.6.1 Choice of Signaling
2.6.2 Voltage Scaling
2.6.3 Body Biasing
2.7 Summary
Embedded systems have been an increasingly ubiquitous presence in our lives for
almost two decades. Recently, complex consumer applications, such as high definition
television sets, video games, and state-of-the-art video encoding/decoding systems, are
mostly integrated on large heterogeneous multiprocessor systems-on-chip (MPSoCs) [144].
Due to the inherent complexity of MPSoC architectures and of current manufacturing
processes, the design of such systems represents a particularly challenging task, from the
perspectives of chip-level optimization and on-chip communication design. In essence,
embedded system design involves an accurate functional and architectural specification,
followed by the optimized mapping of the application on the target architecture. Within
this context, it is important to note the lack of a complete de facto automated design
methodology or tool that assists the designers throughout the design of MPSoCs,
from the initial specifications to the tape-out submission of the chip layout, and which is
still accurate for the current manufacturing technologies. First observed by Gajski in [64]
for embedded system designs, the aforementioned remark is still valid today, mostly
because the architectural and manufacturing challenges scaled up with the advances in
the research and development of CAD tools. Nevertheless, a substantial number of ap-
proaches and methods have been developed to tackle different important aspects in the
design process. In this chapter we enumerate the most relevant methodologies for the
on-chip communication synthesis and point out their key challenges. From this perspec-
tive, the work described in this thesis fits into the global set of design methodologies and
addresses several of the paramount challenges, such as parameter variability and tech-
nology accuracy.
Sec. 2.1 discusses the abstraction of application behavior, architectural description,
and performance estimation, required for design space explorations. Further, Sec. 2.2
presents the concepts of delay and power macromodels, together with the importance of
statistical modeling combined with process accuracy. Sec. 2.3 enumerates several scheduling
techniques and points out the importance of scheduling decisions on the overall latency
and energy consumption. The important challenge posed by parameter variations
is discussed in Sec. 2.4, where several approaches to the modeling of variability are an-
alyzed. Moreover, the importance of technology accuracy in yield estimations and the
underlying transistor models is evidenced in Sec. 2.5. Finally, Sec. 2.6 discusses addi-
tional optimization resources for communication synthesis available at the circuit level.
Fig. 2.1: Task dependencies represented as data flow graphs. Panels: (a) data dependency, (b) resource dependency, (c) control dependency.
2.1 Application Profile and Design Space Exploration
The overall system design begins with a functional specification, followed by mapping
the individual tasks on the target physical architecture. This step includes a behavioral
specification of the running application and an architectural description of the available
IP resources. In addition, the design constraints and the parametric description of task-
resource mappings are extracted and specified. The following step consists of creating
the necessary performance models for the required performance metrics and at the de-
sired abstraction levels. Finally, the best communication synthesis is identified through
exploration of various implementation alternatives and estimation of the corresponding
performance values.
2.1.1 Behavioral Specification
Describing the desired system functionality usually means creating a behavioral model
of the system, using a high-level description language, such as Matlab/Simulink, UML,
Verilog, VHDL, or SystemC, to name only a few. This coarse functional description can be
already validated through simulation or formal verification methods. Note that
this initial system description is usually architecture-independent and can therefore be performed
before gathering any knowledge concerning the target resources. Several models
are available for describing system functionality [62], such as finite-state machine (FSM),
communicating sequential processes (CSP) [77], program-state machine (PSM) [63], Petri
nets, flowcharts, UML models etc. Throughout this thesis, the preferred data structure
for representing the system behavior is a data flow graph, or task graph. In particular, we employ
an extended version of the task graph to represent concurrencies of the allocated resources,
communication activity, and the different types of task dependencies. Fig. 2.1(a)
shows, for instance, a simple data flow dependency between the tasks. In Fig. 2.1(b), although
independent from a data transfer viewpoint, the two tasks are constrained to run
sequentially on the resource Ri. A control dependency can be seen as a particular case
of a data flow dependency with an additional control condition (i.e. a control variable
check), as shown in Fig. 2.1(c).
Fig. 2.2: Task graph (a) and refined processing node representation at the operation level (b).
Fig. 2.3: Extended task graph representation showing IP resources Ri and the inter-resource communication nodes CNi.
Depending on the selection of a task’s granularity level, the processing nodes in the
task graph can represent either entire jobs, processes, individual statements, or opera-
tions. In this work, a coarser representation for the processing tasks is chosen, which
simplifies the partitioning and enables a faster design space exploration. Since the main
focus of the thesis is on the communication synthesis, a more fine-grained representation
of the inter-resource channels is provided, including driving circuits and interconnect
segments. Given these considerations, the hierarchical representation of the system func-
tionality can be captured in a task dependency graph, as shown in Fig. 2.2(a). Different
granularities of the processing node representations can be adopted, such as more de-
tailed operation-level dependencies (illustrated in Fig. 2.2(b)) [156].
Computational tasks in the context of a task graph (TG) are denoted in this thesis as
processing nodes (PNs), while communication operations are denoted as communication
nodes (CNs). Here, a communication operation represents an inter-resource
communication, where tasks (PNs) assigned to different IP resources need to commu-
nicate and require therefore an inter-resource communication channel. Based on these
considerations, an extended task graph includes both the PNs assigned to the available
resources and the required inter-resource communication nodes. Such an extended TG
segment is shown in Fig. 2.3.
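The construction of an extended task graph can be sketched as follows. This is a minimal Python illustration under the assumption that the task graph is given as a list of directed edges and the task-resource mapping as a dictionary; the function name `extend_task_graph`, the edge list, and the mapping are hypothetical and do not reproduce the graph of Fig. 2.3.

```python
def extend_task_graph(edges, mapping):
    """Insert a communication node (CN) on every edge whose endpoints
    are mapped to different IP resources.

    edges:   list of (source_task, destination_task) pairs
    mapping: dict assigning each task (PN) to a resource name
    """
    extended = []
    cn_count = 0
    for src, dst in edges:
        if mapping[src] != mapping[dst]:
            # Tasks on different resources need an inter-resource channel.
            cn_count += 1
            cn = f"CN{cn_count}"
            extended.append((src, cn))
            extended.append((cn, dst))
        else:
            # Tasks on the same resource communicate locally.
            extended.append((src, dst))
    return extended


# Hypothetical task graph and task-resource mapping.
edges = [(1, 3), (1, 4), (2, 5)]
mapping = {1: "R1", 2: "R2", 3: "R1", 4: "R3", 5: "R2"}
extended = extend_task_graph(edges, mapping)
```

In this sketch only the edge between tasks 1 and 4 crosses a resource boundary, so a single communication node is inserted.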
2.1.2 Architectural Description and Design Constraints
The description of the target subsystem comprises the enumeration of available IP re-
sources and their characteristics. Such IP blocks would include e.g. application-specific
integrated circuits (ASICs), application-specific instruction-set processors (ASIPs), parallel
processors, digital signal processors (DSPs), microcontrollers, microprocessors, general-purpose
programmable microprocessors, larger mega-cores such as microprocessor+ASIC
combinations, and other specialized pre-designed logic blocks such as memories, arbiters,
multipliers, FFT units, interfaces etc. Next, a selected set of such physical resources is allocated
for the system implementation and represents the target architecture. This resource
set is then described in terms of the performance constraints of each resource (e.g. timing
information, dynamic and leakage power dissipation values etc.) and other implementa-
tion details, such as the block area in a given technology.
It is to be mentioned that this work focuses on the communication synthesis, therefore
IP-level implementations for task processing blocks, such as the synthesis of a behavioral-
level description into an ASIC hardware implementation, or the automatic software gen-
eration for a given microprocessor are beyond our scope. Hence, for the scope of commu-
nication synthesis, a limited description of the IP resources is employed. Particularly in-
teresting details include power, delay (execution time), and communication load values.
For this purpose, performance values related to the execution of tasks are described by
parametrized functions, either specified (by the IP provider) or extracted within a profil-
ing procedure.
In order to extract the parametric description for the target architecture, the behav-
ioral model of the application is simulated with language-specific tools and using dy-
namic profiling tools. This procedure is used to determine branch probabilities, possible
execution paths, number of calls to specific operations, and the execution time on the
target architecture (e.g. estimated in cycles). It has been pointed out that very accurate
estimations can be obtained in this profiling step if the instruction set and a correspond-
ing compiler for the target processor are available [70]. For hardware logic circuits, this
parametric description can be obtained from a coarse synthesis of the blocks, estimation
of the number of logic gates, and by employing the timing and power metrics for the
envisaged technology from logic cell libraries. It is to be noted that particularly the ex-
traction of communication loads between processing tasks is of utmost importance for
12 CHAPTER 2 FUNDAMENTALS AND CHALLENGES OF ACCURATE COMMUNICATION SYNTHESIS
the communication synthesis.
Typically, the result of the profiling analysis consists of minimum, maximum, and
average estimations of performance metrics. Alternatively, multiple samples of these es-
timations can be collected and used to build discretized probability density functions and
obtain a more insightful characterization. Another option is to approximate the profiling
results with given standard distribution types. The latter two description methods are
employed in this work for specifying application profiles.
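As an illustration of the second description method, the following minimal sketch builds a
discretized probability density function from a set of profiling samples, next to the usual
minimum, maximum, and average values. The function name `discretize_pdf`, the equal-width
binning, and the sample values are assumptions made for this example only and are not taken
from the profiling flow used in this work.

```python
from collections import Counter


def discretize_pdf(samples, num_bins):
    """Return (bin_centers, probabilities) for a list of profiling samples."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        # All samples identical: a single bin with probability 1.
        return [lo], [1.0]
    width = (hi - lo) / num_bins
    # Map each sample to a bin index; the maximum falls into the last bin.
    counts = Counter(min(int((s - lo) / width), num_bins - 1) for s in samples)
    centers = [lo + (b + 0.5) * width for b in range(num_bins)]
    probs = [counts.get(b, 0) / len(samples) for b in range(num_bins)]
    return centers, probs


# Hypothetical execution-time samples (in cycles) of one task on one resource.
cycles = [100, 102, 98, 110, 105, 101, 99, 120, 103, 100]
centers, probs = discretize_pdf(cycles, num_bins=4)
summary = (min(cycles), max(cycles), sum(cycles) / len(cycles))
```

The triple `summary` corresponds to the typical min/max/average result, while `centers` and
`probs` together form the discretized density.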
Once a thorough insight into the achievable performance metrics of the target architecture
has been obtained, the design constraints for the communication synthesis can be
specified. Factors such as overall die area and layout design rules are dictated by cost
restrictions and the chosen technology, whereas delay and power budgets are
derived from the application requirements. Chip area constraints are mainly used during
floorplanning and routing, but also to determine spatial correlations of process parameters
across the die. In contrast, the total delay and power budgets are relevant for almost
all of the subsequent design steps, including task-resource mapping, scheduling,
communication (signaling) resource allocation, and circuit-level voltage optimizations.
2.1.3 Performance Model Creation
Fast and accurate evaluations of various performance metrics for a given system con-
figuration are necessary to find a design solution which satisfies all the constraints and
which is optimized according to a given objective. To evaluate the performance metrics
we use performance models, which, in essence, are parametrized expressions built to es-
timate a given metric. A continuous trade-off between speed and accuracy dictates the
development and improvement of performance models. On the one hand, fast estima-
tions are critical to keep the design space exploration and optimization feasible. On the
other hand, technology accuracy is becoming extremely important with state-of-the-art
manufacturing processes where parameter variations exhibit a significant influence on
the yield [20]. In the particular case of communication synthesis, accurate models for the
communication segments are required. For efficiently estimating the overall system per-
formance only coarse estimations are necessary for the processing tasks while they run on
the allocated resources. Such estimations can be represented in the form of statistical de-
scriptions of the execution times and power dissipation levels, extracted from simulation
and profiling.
A performance model represents a set of data structures and expressions which sym-
bolically describe a given performance metric [156]. Depending on the particular met-
ric (delay, dynamic power, leakage etc.) and modeling complexity, several performance
models for system-level estimations have been proposed. A typical deterministic performance
model for delay estimations is shown in Fig. 2.4 in the form of a maximum
operation followed by a sum. Here, $T^s_i$ denotes the earliest starting time allowed
for processing node $i$, $T^e_i$ is its ending time, and $T^x_i$ is its execution time, always
specified in the context of the given resource on which the node is running. The maximum
operation expresses the condition for the earliest starting time, which depends on the
finishing of all previous PNs which have data or scheduling dependencies with the current node.
Finally, the sum operation adds the execution time to the starting time to determine the
earliest end time of the node. A more detailed overview on performance modeling approaches
is offered in Sec. 2.2. The main contributions of this thesis in the field of performance
modeling are described in detail in chapters 3 and 4.

Fig. 2.4: Section from a task graph with four processing nodes (a) and the resulting deterministic
delay model for processing node 3 (b).
2.1.4 Estimation and Optimization
Exploration in the implementation space implies the evaluation of possible design alter-
natives, given by various configurations of the set of interconnected resources, each of
them implementing sections of the behavioral specification. At every step in the design
space exploration, an estimation of the design quality is performed, using performance
models and applying a cost function on the estimated performance metrics. The accept-
ability of a particular design depends on the given constraints and on the particular figure
of merit chosen as objective function to be optimized. The exploration speed and
efficiency are directly determined by the granularity and accuracy of the employed performance
models. The involved computational effort and the resulting total time
for the entire design space exploration may be substantial with respect to the other design
steps. Note that the choice and implementation of the exploration and optimization
algorithm also have a strong influence on the overall effort.
At this point in the design process, the physical resources are allocated for the differ-
ent tasks. Usually, variables are stored into memory blocks, behaviors are implemented
by processors, and the communication tasks are attributed to inter-resource channel seg-
ments. This partitioning step is characterized by the granularity of the structural objects.
This way, we can apply the resource mapping at gate level, block level, core level etc.
Since the focus lies on the communication, we apply the mapping of processing tasks at
the coarser core and block levels. A fine-grained circuit-level representation is employed
for the communication structures, considering also the interconnect-related delay and
power dissipation in the equivalent circuits.
Further, the performance metrics to be considered during the optimization must be
defined. Examples include the overall execution time (application delay, or latency),
dynamic power dissipation, and, particularly important for recent technologies, leak-
age power. The next step is to combine the considered metrics into a cost (or objective)
function which best represents the figure of merit for optimization. Finally, an optimiza-
tion algorithm is applied during the exploration. Examples include mixed-integer linear
programming [132], greedy priority-driven clustering [51], list scheduling [57], iterative-
improvement methods, such as simulated annealing and tabu search [156, 128], genetic
algorithms [25], and custom heuristic methods [59]. Nonetheless, the largest challenge for
the design optimization step remains the characterization of the optimum solution and
the certainty of achieving a global optimum.
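The combination of several metrics into one objective can be sketched as follows. This is only
an illustrative example, assuming a weighted sum of constraint-normalized metrics with a linear
penalty for violated constraints; the function name `cost`, the weights, and the penalty factor
are hypothetical and do not reproduce the cost functions of the cited approaches.

```python
def cost(metrics, weights, constraints, penalty=1e6):
    """Weighted sum of normalized metrics plus a penalty for violated constraints."""
    total = 0.0
    for name, value in metrics.items():
        limit = constraints[name]
        # Normalizing by the constraint makes delay and power comparable.
        total += weights[name] * value / limit
        if value > limit:
            # Infeasible designs are pushed far away from feasible ones.
            total += penalty * (value - limit) / limit
    return total


weights = {"delay": 0.5, "power": 0.5}
constraints = {"delay": 10.0, "power": 1.0}        # hypothetical budgets
feasible = {"delay": 8.0, "power": 0.5}
violating = {"delay": 12.0, "power": 0.4}

cost_feasible = cost(feasible, weights, constraints)
cost_violating = cost(violating, weights, constraints)
```

An optimization algorithm such as simulated annealing would call a function of this kind once
per explored configuration, which is why its evaluation has to be fast.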
2.2 Performance Modeling
The communication synthesis paradigm states that a communication architecture must
be found which is optimal for all the applications running on the designed System-on-
Chip [156]. From this point of view, a series of difficulties must be considered. First, the
communication architecture is unique and must be adequate for all intended applications.
This implies that all the possible data flow variations must be taken into account during
the optimization. Second, important information about the routing path, segment length,
and interconnect parasitics is not available at early design stages. Hence, there is a sub-
stantial need for new algorithms and modeling approaches which are able to efficiently
predict the segment length and accurately estimate line parasitics. Further, to obtain an
architecture which satisfies all performance constraints for the given applications, accu-
rate performance models for the relevant metrics (e.g. latency, power) are required.
2.2.1 Performance Macromodel Concept
A performance macromodel (PM) represents a symbolic description of the system which
allows the estimation of performance attributes, such as delay, dynamic power, or leakage,
considering a particular system configuration during design space explorations. The
term macromodel indicates an overall system-wide model, resulting from the composition
of several smaller resource-level or task-level models. Typical representations of performance
macromodels include analytical model expressions, numerical equations, and
graphical representations implemented as linked data structures and operations.
Highly-flexible performance models for latency and power, which can be easily extended
and refined for new requirements, have been proposed in [156, 56, 155]. Here,
latency models are developed as symbolic representations of the timing values for the
start, execution, and end times of processing and communication nodes. Within this
representation, communication nodes consist of alternating sequences of data packets and
synchronization nodes for handshaking.

Fig. 2.5: Task graph example (a) and the attached delay PM (b).
A graph representation of a PM for delay, directly derived from a task graph as pro-
posed in [156] is shown in Fig. 2.5. The start node in the delay PM is set to the determin-
istic value of 0, which represents the initial timing value for the delay computation. Next,
the main part of the macromodel consists of nodes representing symbolic variables and
operations linked by directed arcs. Finally, the additional dashed links represent schedul-
ing dependencies (in the example from Fig. 2.5 it has been assumed that PNs 3, 4, and 5
are scheduled on resource R3 in this order).
The numeric estimation of the modeled performance attribute occurs by evaluating
all the operational nodes. Further model refinements and extensions could add to this
basic structure also other operations, such as multiplication, minimum, division, square
root etc. Typically, operation nodes represent either fundamental computations in the
macromodel, such as the sum of dissipated power by multiple resource units, or perfor-
mance constraints, such as wait conditions for the end times of all predecessors.
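The evaluation of the operational nodes can be sketched as a single pass over the tasks in
topological order, applying a maximum over all predecessor end times followed by a sum. The
sketch below is not the implementation from [156]; the data dependencies and the execution
times are hypothetical values chosen for illustration, and only the scheduling order of PNs 3,
4, and 5 on R3 follows the assumption stated for Fig. 2.5.

```python
def evaluate_delay_pm(exec_time, data_preds, sched_preds, order):
    """Evaluate start/end times: max over predecessors, then add execution time."""
    start, end = {}, {}
    for node in order:  # 'order' must be a topological order of the task graph
        deps = data_preds.get(node, []) + sched_preds.get(node, [])
        # The start node of the PM contributes the deterministic value 0.
        start[node] = max((end[d] for d in deps), default=0.0)
        end[node] = start[node] + exec_time[node]
    return start, end, max(end.values())


# Hypothetical execution times and data dependencies (not taken from Fig. 2.5).
exec_time = {1: 4.0, 2: 6.0, 3: 3.0, 4: 2.0, 5: 5.0}
data_preds = {3: [1], 4: [1, 2], 5: [2]}
# Scheduling dependencies: PNs 3, 4, 5 run on R3 in this order (dashed links).
sched_preds = {4: [3], 5: [4]}

start, end, delay = evaluate_delay_pm(exec_time, data_preds, sched_preds,
                                      order=[1, 2, 3, 4, 5])
```

With these values, PN 4 cannot start before PN 3 has finished on R3, although its data
dependencies are satisfied earlier.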
Additional structures may be added to the macromodel as a result of design decisions
during exploration, such as adding a new communication segment which inserts an additional
latency between two processing nodes. Changing task-resource mappings or
altering the scheduling sequence also has an impact on the PM structure. It is important
to note that the performance macromodel definition is very general and can be adapted
for many performance metrics. In addition, model flexibility is an important requirement,
to allow for extensions, refinements, and adding new relationships between the modeled
attribute and design changes.
Fig. 2.6: Modeling of data (a) and scheduling (b) dependencies (after [156]).

Fig. 2.7: Control dependencies in the task graph (a) and in the delay macromodel (b) (after [56]).
2.2.2 Delay Macromodels
A delay macromodel example has been shown in Fig. 2.5, whereas the general rules for
expressing data and scheduling dependencies are depicted in Fig. 2.6(a) and (b). As men-
tioned before, data dependencies add PM links between the maximum node of a given
PN and the end times of all its predecessors, whereas scheduling dependencies are rep-
resented by additional dashed links between otherwise independent tasks which are as-
signed to the same resource.
Control dependencies can also be included in the macromodel [56], where the flow of
processing nodes is influenced by conditional branches. Such an example is illustrated in
Fig. 2.7(a), where after the execution of PN 1 a Boolean condition C is tested. If condition
C is found true, then PNs 2...p are executed. Otherwise, the execution flow is directed
towards PNs 3...q. The end of the conditional branch, where the possible execution flows
rejoin, is symbolically marked by inserting the artificial node r. The corresponding
representation of the control dependency in the delay macromodel is shown in Fig. 2.7(b). If
condition C is true, the upper branch of the macromodel is evaluated in the usual way,
while in the lower branch, the link marked by the false C condition propagates the value
infinity. This way, at the rejoining node r, a minimum operation ensures that the false
condition branch is discarded. Conversely, if C is false, then the upper branch propagates
the infinite value and the lower (false-C) branch is executed normally.

Fig. 2.8: Insertion of communication speed flexibility nodes in the task graph (a,b) and the
corresponding delay macromodel structure (c) (after [155]).
Given the above formulations, the end time of a given processing node $i$ is evaluated
as:
\[
T^e_i = \max_{\substack{p = j \ldots k \\ s = l \ldots m}} \left( T^e_p, T^e_s \right) + T^x_i \tag{2.1}
\]
where $p = j \ldots k$ iterates over all the predecessors of PN $i$ in the task graph (data
dependencies), and $s = l \ldots m$ iterates over all the scheduling dependencies. If PN $i$ is
followed by a conditional branch, the output $T^e_i$ is also marked by the corresponding
condition, and the computation becomes:
\[
T^e_i =
\begin{cases}
\max\limits_{\substack{p = j \ldots k \\ s = l \ldots m}} \left( T^e_p, T^e_s \right) + T^x_i, & C_i = \text{true} \\
\infty, & C_i = \text{false}
\end{cases}
\tag{2.2}
\]
according to the value of the branch condition $C_i$.
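The interplay of (2.1), (2.2), and the minimum operation at the rejoining node can be sketched
as follows. The helper names `end_time`, `gate`, and `rejoin` and all timing values are
hypothetical; the sketch only assumes that an infinite end time propagates through the
maximum and sum operations of the discarded branch.

```python
import math


def end_time(pred_ends, exec_time):
    """Eq. (2.1): maximum over predecessor end times, then add the execution time."""
    return max(pred_ends, default=0.0) + exec_time


def gate(t_end, condition_holds):
    """Eq. (2.2): pass the end time on if the branch condition holds, else infinity."""
    return t_end if condition_holds else math.inf


def rejoin(branch_ends):
    """Rejoining node r: the minimum discards the branch that carries infinity."""
    return min(branch_ends)


c = True                                     # outcome of the condition tested after PN 1
t1_end = end_time([], 4.0)                   # PN 1
upper = end_time([gate(t1_end, c)], 3.0)     # branch taken if C is true
lower = end_time([gate(t1_end, not c)], 5.0) # branch taken if C is false
t_r = rejoin([upper, lower])
```

Since `math.inf + x` remains infinite, no special handling is needed inside the branch that is
not executed.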
A method for estimating the execution time of communication tasks has been proposed
in [155] in the form of communication speed flexibility (CSF). The concept is illustrated
in Fig. 2.8 for a given communication node CN1. In Fig. 2.8(b), the CN is replaced
with a minimum CN (CN1min) followed by a CSF node. First, the minimum CN represents
the shortest delay which can be achieved for the given communication load, in the target
technology, with a minimum-length communication segment. Hence, the minimum
CN introduces the absolute minimum achievable communication latency, independent
of the floorplanning and routing information, which is not available before synthesizing
the complete communication architecture. Next, the CSF node inserts a delay equal to
the maximum additional latency tolerable on the respective segment, which does not violate
the total system delay constraint. This value can be evaluated and updated during the
optimization, for each system configuration, by comparing the total delay with the delay
constraint and computing the allowed slacks on each communication segment. The added
minimum CN and CSF delays are inserted in the delay PM as shown in Fig. 2.8(c). Note
that the sum of the minimum CN delay and the CSF represents the maximum communication
delay tolerable on a given segment. As explained in [155], these maximum tolerable
delays become constraints for the subsequent communication synthesis step. A bus segment
which exceeds the maximum tolerable delay will also violate the overall system
latency constraint.

Fig. 2.9: Transformation of a multiple-pin net (a) into a two-pin net (b) (after [49]).
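The relation between the minimum CN delay, the CSF value, and the system delay constraint can
be illustrated for the simplest case of one communication node on a single path between two
processing nodes. This is a simplified sketch, not the slack computation of [155]; the function
names and all numeric values are assumptions for this example.

```python
def csf_slack(path_delay_with_min_cn, delay_constraint):
    """Slack left on the path when the CN runs at its minimum delay."""
    return max(0.0, delay_constraint - path_delay_with_min_cn)


def max_tolerable_cn_delay(t_cn_min, csf):
    """Maximum communication delay tolerable on the segment."""
    return t_cn_min + csf


# Hypothetical path PN 2 -> CN1 -> PN 5 with a system delay constraint of 15.
t2_x, t_cn_min, t5_x = 6.0, 1.0, 5.0
constraint = 15.0

csf = csf_slack(t2_x + t_cn_min + t5_x, constraint)
t_cn_max = max_tolerable_cn_delay(t_cn_min, csf)

# A synthesized bus segment slower than t_cn_max breaks the system constraint.
bus_delay = 5.0
violates = (t2_x + bus_delay + t5_x) > constraint
```

In a complete task graph, the slack of a segment would additionally depend on all paths
sharing that segment, which is not captured here.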
An alternative approach in the form of a delay estimation engine has been employed
in [130] for bus delay modeling. Within this framework, the bus wire lengths are com-
puted after a simulated annealing-based floorplanning using the half-perimeter of the
minimum bounding box which encloses the bus connections [32]. Given the estimated
length l, wire delays are computed according to [49] as:
\[
T = R_d C_0 + \left( \frac{\alpha_1 l}{W^2(\alpha_2 l)} + \frac{2 \alpha_1 l}{W(\alpha_2 l)}
+ R_d c_f + \sqrt{R_d\, r\, c_a\, c_f\, l} \right) \cdot l \tag{2.3}
\]
where $\alpha_1 = \frac{r c_a}{4}$, $\alpha_2 = \frac{1}{2} \sqrt{\frac{r c_a}{R_d C_L}}$,
$W(x)$ is the Lambert W function [29], defined as $W(x) = \{ w \mid w e^w = x \}$, $R_d$ is the
driver's on-resistance, $r$ is the sheet resistance (Ω/□), $c_a$ is the
unit area capacitance (fF/µm²), and $c_f$ is the unit effective-fringing capacitance (fF/µm),
whereas $C_0$ and $C_L$ are lumped capacitances computed for equivalence in terms of the
Elmore delay [60, 136, 49] as shown in Fig. 2.9 and given by:
\[
C_L = \sum_{j=1}^{k} \frac{\sum_{i=1}^{j} l_i}{l} \cdot C_j \tag{2.4}
\]
\[
C_0 = \sum_{j=1}^{k} C_j - C_L \tag{2.5}
\]
Note that the delay estimated by (2.3) assumes an optimal wire sizing as indicated
in [49], with an average wire width of:
\[
w = \sqrt{\frac{r \left( c_f l + 2 C_L \right)}{2 R_d c_a}} \cdot l \tag{2.6}
\]
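A numerical evaluation of (2.3) to (2.5) can be sketched as follows. The code transcribes the
expressions as printed above and should be read as an illustration only; the Lambert W function
is approximated by a plain Newton iteration on $w e^w = x$ for non-negative arguments (an
implementation choice of this sketch), and all parameter values are hypothetical rather than
taken from a real technology.

```python
import math


def lambert_w(x, tol=1e-12):
    """Principal branch W(x) for x >= 0, via Newton iteration on w*exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting guess, never below the root for x >= 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def lumped_caps(lengths, caps):
    """Eqs. (2.4), (2.5): reduce a multiple-pin net to (C0, CL) of a two-pin net."""
    total_len = sum(lengths)
    c_load, position = 0.0, 0.0
    for l_j, c_j in zip(lengths, caps):
        position += l_j                  # sum of l_i for i = 1..j
        c_load += position / total_len * c_j
    return sum(caps) - c_load, c_load


def wire_delay(l, r_d, c_0, c_l, r, c_a, c_f):
    """Eq. (2.3) as printed; units must be consistent (e.g. ohm, fF, um)."""
    a1 = r * c_a / 4.0
    a2 = 0.5 * math.sqrt(r * c_a / (r_d * c_l))
    w = lambert_w(a2 * l)
    return r_d * c_0 + (a1 * l / w ** 2 + 2.0 * a1 * l / w + r_d * c_f
                        + math.sqrt(r_d * r * c_a * c_f * l)) * l


# Hypothetical net: three segments (um) with pin capacitances (fF).
lengths, caps = [100.0, 200.0, 300.0], [2.0, 3.0, 5.0]
c_0, c_l = lumped_caps(lengths, caps)
delay = wire_delay(sum(lengths), r_d=100.0, c_0=c_0, c_l=c_l,
                   r=0.08, c_a=0.03, c_f=0.04)
```

With resistances in Ω and capacitances in fF, the returned delay is expressed in femtoseconds.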
An alternative approach to performance modeling has been recently proposed in [14],
where commercial and academic standalone CAD tools are embedded into an open framework
and employed for performance estimations. Such tools typically embed complete
design flows for processors, with high-level modeling platforms, instruction-set simulators,
C-compilers, assemblers, and linkers. Moreover, academic tools such as MPARM [102]
include collections of component models, like processors, interconnects, memories, and
dedicated interfaces, together with simulation engines. After selecting the required
components from a library, further configuration options are available for each element to
allow a better adaptation to the application needs. Timing and power models are also
available for the included components, hence the performance metrics can be directly
evaluated. Finally, a flexible open framework based on the SystemC language integrates
the two platforms and allows the MPSoC design using either plug-and-play library
components, or custom-designed processing and logic blocks, as summarized in Fig. 2.10.

Fig. 2.10: Open framework with embedded CAD tools for performance modeling and design
exploration, as proposed in [14]. (Blocks shown: a processor design platform with target
specification, model description, C-compiler, assembler, linker, simulator, VHDL description,
synthesis, gate-level model, evaluation, and profiling/performance estimation; a plug-and-play
component platform with a component model library of memories, processors, interconnects, and
interfaces, each with configuration parameters and timing and power models; both feeding the
heterogeneous MPSoC design space exploration.)
2.2.3 Macromodels for Power Estimation
Meeting the power constraint for the entire system is a challenging task, particularly since
it can be strongly influenced at every design step, including partitioning and mapping,
scheduling, and the communication synthesis. Accurate estimations of the power dissi-
pation are possible at the RTL level for logic blocks and through circuit-level simulations
including parasitics for the communication segments. Nevertheless, since early design
decisions, such as mapping a task to a particular resource, or sharing a narrower com-
munication segment at the cost of added latency, have a strong influence on the total
power consumption, we need to develop and integrate power estimation models also at
the higher levels of abstraction.
Early approaches to high-level power estimation rely on simple observations, which
are usually proven only after the optimization step, through subsequent simulations or
measurements. In [146] a high-level power estimation for processors is used to optimize
the scheduling of running tasks. The optimization relies on bringing the processor into a
power-down mode, where all parts except the clock and timer circuits are turned off, and
applying dynamic frequency and voltage scaling while the processor is running. The
power estimation relies on the proportionality of the dynamic power consumption with
the frequency and with the square of the supply voltage.
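The stated proportionality can be turned into a small scaling rule. The sketch below assumes
that the switched capacitance and the activity remain unchanged, so that only the frequency and
supply voltage ratios matter; the function name and the operating points are hypothetical.

```python
def scaled_dynamic_power(p_nominal, f_ratio, v_ratio):
    """Scale dynamic power with frequency and with the square of the supply voltage."""
    return p_nominal * f_ratio * v_ratio ** 2


# Hypothetical operating point: half the clock frequency at 80 % of the supply voltage.
p_scaled = scaled_dynamic_power(p_nominal=1.0, f_ratio=0.5, v_ratio=0.8)
```

In this example the dynamic power drops to roughly one third of its nominal value, whereas
frequency scaling alone would only halve it.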
The first approach to optimize power consumption during the hardware-software co-
synthesis of embedded systems has been published in [50] and includes a power esti-
mation step for CPUs, ASICs, FPGAs, as well as for communication activities. System
description includes an average-power vector ξ (ti) = {ξi1, ξi2, . . . , ξin} for each task ti,
where each element ξij represents the average dissipated power of task ti if assigned to
processing element j. Here, the average power dissipation is estimated at nominal supply
voltage and assuming an average data stream. A similar vector is defined for the peak
power, κ (ti) = {κi1, κi2, . . . , κin}, where each element κij designates the peak power dissi-
pation of ti if assigned to processing element j. Hereby, the operating conditions for peak
power dissipation are assuming the highest supply voltage level and the highest data
stream. Idle power levels are also considered, for processing elements (PEs), ASICs, FP-
GAs, and communication links, estimated for the case in which no task is being executed
on the respective resource. It is further assumed that the average and peak idle power
consumption levels are specified for each processor. For ASICs and other logic elements,
the gate-level power is specified, while for FPGAs, the average and peak idle power lev-
els are also given. Similarly, each communication link is described by the average and
peak idle power levels. During the automated system design, processing tasks are or-
ganized in clusters, which are then assigned to existing resources. To optimize power,
the energy levels of each task are considered during clustering, instead of the task priorities.
An example of such an energy-oriented clustering is shown in Fig. 2.11, where the
two clusters C1 and C2 are formed along the higher-energy paths. The values in brackets
represent energy levels, while the numbers in bold type indicate the task priorities. e1 to
e6 are the inter-task communication edges. Although the main optimization focus lies on
power minimization, energy consumption levels are employed to take into account both
active-mode and idle power consumptions for resources and communication links in a
unified manner.
Fig. 2.11: Power-optimized clustering of processing tasks (a) and the corresponding resource
mappings and communication link (b) (after [50]).

For a given task $t_i$ (or edge $e_i$), the average energy consumption is given by:
\[
E_i = T^x_{wc,ij} \cdot \xi_{ij} \tag{2.7}
\]
where $T^x_{wc,ij}$ is the worst-case execution time and $\xi_{ij}$ is the average power
dissipation, both in conjunction with a given resource $j$. Here, worst-case execution times
are used if the real-time constraint takes precedence over the total power constraint; otherwise
the normal operating conditions are employed. Furthermore, if task $t_i$ has no child
dependencies, its average energy is given by (2.7). Otherwise, the average energy level is
computed from every child edge $e(t_i, t_c)$ as:
\[
E_{t_i} = \max_{t_c} \left( E_i + E_{e(i,c)} + E_c \right) \tag{2.8}
\]
where $E_i$ is the average energy of task $t_i$, $E_c$ is the average energy of $t_c$, and
$E_{e(i,c)}$ is the average energy of edge $e(t_i, t_c)$, all estimated by (2.7).
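Expressions (2.7) and (2.8) can be evaluated as in the following sketch, which follows the text
literally: the energies of the task, the child, and the connecting edge are all obtained from
(2.7). The task names, execution times, and power values are hypothetical and unrelated to the
numbers shown in Fig. 2.11.

```python
def energy(t_wc, avg_power):
    """Eq. (2.7): worst-case execution time times average power dissipation."""
    return t_wc * avg_power


def energy_level(task, task_energy, edge_energy, children):
    """Eq. (2.8): maximum over all child edges of E_i + E_e(i,c) + E_c."""
    kids = children.get(task, [])
    if not kids:
        return task_energy[task]  # no child dependencies: plain Eq. (2.7)
    return max(task_energy[task] + edge_energy[(task, c)] + task_energy[c]
               for c in kids)


# Hypothetical tasks: (worst-case execution time, average power) on their resource.
task_energy = {
    "t1": energy(2.0, 5.0),    # 10.0
    "t2": energy(4.0, 2.5),    # 10.0
    "t3": energy(1.0, 3.0),    # 3.0
}
edge_energy = {("t1", "t2"): energy(1.0, 2.0), ("t1", "t3"): energy(2.0, 4.0)}
children = {"t1": ["t2", "t3"]}

level_t1 = energy_level("t1", task_energy, edge_energy, children)
level_t3 = energy_level("t3", task_energy, edge_energy, children)
```

Because the edge energies depend on whether the connected tasks share a resource, the dictionary
`edge_energy` would have to be recomputed after every clustering step.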
It is important to note that energy levels must be updated after each clustering. Par-
ticularly the communication edges have a different cost (execution time and power dis-
sipation) if they connect two different resources as compared to the case in which they
connect tasks assigned to the same processing unit. In this way, the clustering config-
uration is optimized until it minimizes the average energy level. During the allocation
of resources, the peak power dissipation is estimated and checked against the maximum
constraint. For processors and communication links, the average and peak power dissi-
pations are estimated according to the processing tasks and communication edges which
are running on them. If we denote with $\Re_\xi$ the average energy consumption of a
resource and with $\theta_\xi$ the average idle power dissipation [50], then for each processing
resource $P$ and communication link $L$:
\[
\Re_\xi(P) = \left[ \sum_{t_i \in T} \xi_{ip} \cdot T^x_{ip} \cdot n_i \right]
+ \left[ \theta_\xi(P) \cdot \Psi(P) \right] \tag{2.9}
\]
\[
\Re_\xi(L) = \left[ \sum_{e_j \in E} \xi_{jl} \cdot T^x_{jl} \cdot n_j \right]
+ \left[ \theta_\xi(L) \cdot \Psi(L) \right] \tag{2.10}
\]
where $t_i \in T$ represent the processing tasks running on $P$, $e_j \in E$ the communication
tasks assigned to $L$, $n_i$ ($n_j$) is the number of times that $t_i$ ($e_j$) is running across
the total system time period, and $\Psi$ represents the idle time of a given resource or
communication link in the system period.
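Since (2.9) and (2.10) share the same structure, a single helper is sufficient to evaluate both.
The following sketch uses hypothetical task parameters and an assumed function name
`resource_energy`; it is meant as an illustration of the expressions, not as the estimation
procedure of [50].

```python
def resource_energy(tasks, idle_power, idle_time):
    """Eqs. (2.9)/(2.10): active energy of all assigned tasks plus idle energy.

    tasks: iterable of (average power, execution time, number of runs per period).
    """
    active = sum(power * t_exec * runs for power, t_exec, runs in tasks)
    return active + idle_power * idle_time


# Hypothetical processor P: two tasks, the first one executed twice per system period.
tasks_on_p = [(2.0, 3.0, 2), (1.0, 4.0, 1)]
e_p = resource_energy(tasks_on_p, idle_power=0.5, idle_time=10.0)

# Hypothetical link L carrying one communication edge three times per period.
e_l = resource_energy([(0.2, 1.5, 3)], idle_power=0.05, idle_time=15.5)
```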
In contrast to processing units, tasks assigned to FPGAs or ASICs can also run in paral-
lel if designed appropriately. As a consequence, the peak power dissipation is computed
as the sum of the peak power levels of the tasks running in parallel, followed by a maxi-
mum of the tasks (or groups of parallel tasks) which run in series. Finally, the total system
power dissipation can be found by dividing the total estimated energy consumption for
all resources and communication links by the total system latency, which is estimated
using the delay macromodel.
A performance macromodel for power estimation during the hardware-software co-synthesis
of low-power embedded systems has been proposed in [56] and represents the
power dissipation at a task and resource level using a linked tree structure with operational
nodes. This approach considers in particular the resource mapping and task
scheduling decisions, whereas the inter-resource communication aspects are not captured.
The power consumption is computed as the sum of the power dissipated by each
processing element during task execution and during idle time. Special attention is
paid to the option of temporarily shutting down a resource when it is not used by any
processing task. An example for a small TG with five tasks is depicted in Fig. 2.12. $P^a_i$
represents the average active-mode power consumption for each resource $i$ and is multiplied
by the execution time of each task (computed as the difference $T^e_j - T^s_j$). The total
active-mode power consumption is estimated by adding the contributions of each resource and
of all assigned tasks. In addition, for each resource, the idle time is computed as the
difference between the starting time of a task $T^s_j$ and the end time of the previous task
in the scheduling list $T^e_k$. In the example from Fig. 2.12, the idle times of resource R3 are
between tasks 3 and 4, and between tasks 4 and 5, thus given by the differences $T^s_4 - T^e_3$
and $T^s_5 - T^e_4$, respectively. This idle time is then multiplied by the idle-mode power
consumption, which for resource R3 is indicated by $P^i_3$. If R3 is shut down between tasks 3
and 4, then this decision is indicated by the symbolic variable $S_{34}$, which is set to “1”;
otherwise it is set to “0”, indicating that the resource is kept on. Similarly, if R3 is shut
down and restarted between tasks 4 and 5, then $S_{45}$ is set to “1”, otherwise to
“0”. Note that shutting down and restarting a resource requires additional
power, which is represented by $P^{stop}_i$ and $P^{start}_i$, respectively. By multiplying the
sum of these power amounts with the value of the flags $S_{jk}$, their contribution is
automatically added to or omitted from the total power consumption. In a similar way, the
idle-mode power amounts are multiplied with the negated shutdown flags, $\bar{S}_{jk}$, to enable
their contribution when the respective resource is not turned off. Although not depicted in
Fig. 2.12(b), similar idle-mode estimations must also be implemented for the resources R1 and R2.
For instance, the idle time of R2 in this small example is given by the difference between the
system end time (as given by the output of the delay PM) and the end time of task 2, $T^e_2$.
In this time frame, R2 can be kept on, where the idle-mode contribution $P^i_2$ multiplied
with the idle time will be added, or turned off, where $P^{stop}_2$ must be added.

Finally, it is important to notice that, also in this approach, both active-mode and idle-mode
power contributions are multiplied by time durations (execution time and idle time sequences,
respectively). This observation leads to the remark that the estimated metric represents an
energy quantity, rather than power. It is also important to remember at this point that
for battery-powered embedded systems, the total energy consumption is of a more critical
concern than the heat generated by instant power dissipation. Therefore, estimating
energy consumption rather than power dissipation appears to be a more appropriate decision
in this case. Note also that the power dissipation can be easily obtained by dividing the
total energy consumption over a system period by the time obtained from the delay PM.

Fig. 2.12: Task graph example (a) and the derived power PM (b) (after [56]).
Given these observations, we can conclude that the total system energy considering
shutting down and restarting resources can be estimated as the sum:
\[
E_{total} = E_{active} + E_{idle} + E_{restart} \tag{2.11}
\]
The active-mode energy contribution is given by:
\[
E_{active} = \sum_{R_k \in \{R\}} P^a_k \sum_{PN_j \leftrightarrow R_k} \left( T^e_j - T^s_j \right) \tag{2.12}
\]
where $\{R\}$ is the set of resources and $PN_j \leftrightarrow R_k$ denotes the set of PNs
assigned to resource $R_k$. The idle-mode energy contribution is determined by:
\[
E_{idle} = \sum_{R_k \in \{R\}} P^i_k \sum_{PN_j \leftrightarrow R_k}
\left( T^s_j - T^e_{j-1} \right) \cdot \bar{S}_{j-1,j} \tag{2.13}
\]
and the energy consumption for shutting down and restarting the resources can be estimated as:
\[
E_{restart} = \sum_{R_k \in \{R\}} \left( P^{stop}_k \cdot T^{stop}_k + P^{start}_k \cdot T^{start}_k \right)
\sum_{PN_j \leftrightarrow R_k} S_{j-1,j} \tag{2.14}
\]
where $T^{stop}_k$ and $T^{start}_k$ are the time durations required for shutting down and
restarting resource $R_k$ and have been added to the model to provide a unified expression for
the total energy consumption. It is to be mentioned that the implications concerning the
additional latency caused by shutting down and restarting a resource must also be taken
into account in the delay PM.
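The three energy contributions can be accumulated in one pass over the schedule of each
resource, as in the sketch below. It assumes that each resource is described by a dictionary
with a time-ordered list of (start, end) pairs and one shutdown flag per gap between consecutive
tasks; this data layout, the key names, and all numeric values are hypothetical. The idle time
after the last task of a resource is not covered by this sketch.

```python
def total_energy(resources):
    """Eqs. (2.11)-(2.14): return (E_total, E_active, E_idle, E_restart)."""
    e_active = e_idle = e_restart = 0.0
    for res in resources:
        sched = res["schedule"]  # time-ordered list of (start, end) per assigned PN
        e_active += res["p_active"] * sum(end - start for start, end in sched)
        # Energy of one shut-down/restart cycle of this resource.
        e_cycle = res["p_stop"] * res["t_stop"] + res["p_start"] * res["t_start"]
        for (_, prev_end), (next_start, _), shut in zip(sched, sched[1:],
                                                        res["shutdown"]):
            # Idle energy counts only if the resource stays on (negated flag).
            e_idle += res["p_idle"] * (next_start - prev_end) * (1 - shut)
            e_restart += e_cycle * shut
    return e_active + e_idle + e_restart, e_active, e_idle, e_restart


# Hypothetical resource R3 running PNs 3, 4, 5; it is shut down only between 4 and 5.
r3 = {
    "schedule": [(4.0, 7.0), (8.0, 9.0), (12.0, 17.0)],
    "shutdown": [0, 1],
    "p_active": 2.0, "p_idle": 0.5,
    "p_stop": 1.0, "t_stop": 0.2,
    "p_start": 1.5, "t_start": 0.4,
}
e_total, e_active, e_idle, e_restart = total_energy([r3])
```

Dividing `e_total` by the system delay obtained from the delay PM would yield the average power
dissipation.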
The total idle time of a resource can also be estimated by computing the utilization
rate [75]:
\[
u_{R_k} = \frac{N^{R_k}_{active}}{N_{total}} \tag{2.15}
\]
where $N^{R_k}_{active}$ represents the number of clock cycles in which $R_k$ is active and
$N_{total}$ is the total number of cycles for the execution of the complete application. This
way, an alternative expression for estimating the idle energy can be written in terms of
utilization rates as:
\[
E_{idle} = \sum_{R_k \in \{R\}} \left( 1 - u_{R_k} \right) \cdot P^i_k \cdot T_{total} \tag{2.16}
\]
where $T_{total}$ is the total execution time of the application, as given by the delay PM.
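A direct evaluation of (2.15) and (2.16) is sketched below. The dictionary keys, cycle counts,
and power values are hypothetical, and the sketch assumes that no resource is shut down during
its idle time.

```python
def utilization(n_active, n_total):
    """Eq. (2.15): fraction of clock cycles in which the resource is active."""
    return n_active / n_total


def idle_energy(resources, n_total, t_total):
    """Eq. (2.16): idle energy from utilization rates and idle-mode power."""
    return sum((1.0 - utilization(res["n_active"], n_total)) * res["p_idle"] * t_total
               for res in resources)


# Hypothetical application: 1000 cycles in total, lasting 10 time units.
resources = [
    {"n_active": 250, "p_idle": 0.4},   # utilization 0.25
    {"n_active": 500, "p_idle": 0.2},   # utilization 0.50
]
e_idle_util = idle_energy(resources, n_total=1000, t_total=10.0)
```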
As can be seen, several timing values are required for estimating the power and the
energy consumption. Thus, a tight connection must exist between the delay
PM, which estimates all the start and end times of the tasks, and the power PM. This
connection can be implemented by linking the outputs of several operational nodes from
the delay PM, which compute e.g. $T^s_j$ and $T^e_j$, to the inputs of operational nodes (mostly
subtraction nodes, see Fig. 2.12(b)) from the power PM. Thus, there is a substantial overhead
to update the PMs and the inter-PM connection links when iterating through design
changes during the system optimization. This aspect is discussed in more detail in chapters
3 and 4. It is also important to add that the inter-resource communication
activity in the system architecture must also be included in the power estimation. The
inclusion of communication links in power PMs is discussed in chapter 4.
2.2.4 Statistical and Process-Accurate Modeling
Increasing variations in process and environmental parameters [20] have brought the
need for new approaches to performance modeling, which are both accurate, to include
the parasitic effects of very deep sub-micron processes, and statistical in nature, to model
the influence of variations. The first efforts towards statistical modeling have been fo-
cused on the evolution of static timing analysis (STA) towards statistical STA [24].
One of the first concepts for a statistical performance model for delay estimations has
been presented in [142] and derives a method for computing the delay of interconnect
lines with two drivers. This method employs an equivalent circuit model which includes
the driver output impedance and capacitance, load capacitance at the receiver, and an RC
model for the interconnect line. Considering process variations in the interconnect wire
width, both resistance and capacitance values are expressed in terms of their dependence
on this parameter, as:
$$R = R_{\mathrm{nom}} \, \frac{w}{w + \Delta w} \tag{2.17}$$
$$C = C_{\mathrm{nom}} \left( 1 + \frac{w}{w - \Delta w} \right) \tag{2.18}$$
where w is the line width and ∆w is the corresponding variation. For the interconnect
width parameter, a Gaussian distribution is assumed with zero mean and 10% standard
deviation. The Elmore delay model is employed to compute the delay at the end of the
line as:
$$d = R_s \left( 2 C_l + C_1 + C_2 \right) + R_1 \left( \frac{C_1}{2} + C_2 + C_l \right) + R_2 \left( \frac{C_2}{2} + C_l \right) \tag{2.19}$$
where $R_s$ is the output resistance of the driver, $C_l$ are the load capacitances at the input
and output of the line, and the interconnect runs over two metal layers, each segment being
characterized by $R_1$, $C_1$ and $R_2$, $C_2$, respectively. In this approach, a simple sensitivity
of the delay with respect to a given parameter variation is extracted, in the form of the
first derivative. For instance, if only interconnect wire width variations are considered
as in [142], the delay expression including the contributions of the parameter variations
can be written as:
$$d = d_{\mathrm{nom}} + \frac{\partial d}{\partial w_1} \Delta w_1 + \frac{\partial d}{\partial w_2} \Delta w_2 \tag{2.20}$$
where w1 and w2 are the wire widths of the line on the two metal layers. The sensitivities
can be further analytically extracted from (2.19), as shown here for the dependence on w1:
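$$\frac{\partial d}{\partial w_1} = \left( R_s + \frac{R_1}{2} \right) \frac{\partial C_1}{\partial w_1} + \left( \frac{C_1}{2} + C_2 + C_l \right) \frac{\partial R_1}{\partial w_1} \tag{2.21}$$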
Such a simplified sensitivity analysis keeps only the first derivatives of the delay with
respect to the variable process parameters. As a result, only the first derivatives are prop-
agated across the model, not the full distributions of the parameter variations. The mod-
eling capability is therefore limited to a first-order approximation, as shown in this ex-
pression of the arrival time expanded for the process parameters p1, p2, . . . , pN :
$$A\left(p_1, p_2, \ldots, p_N\right) \approx A\left(p^{\mathrm{nom}}_1, p^{\mathrm{nom}}_2, \ldots, p^{\mathrm{nom}}_N\right) + \sum_{i=1}^{N} a_i \, \Delta p_{i,a} \tag{2.22}$$
where ai are the first-order sensitivities (derivatives) of the delay with respect to pi, and
∆pi,a is the deviation of process parameter pi for the net A where the delay is estimated.
The model complexity is limited by the extraction of first-order derivatives of the perfor-
mance metric with respect to every parameter variation considered. In addition, analytic
expressions of the sensitivities can be extracted only if the base model includes all the
process and environment parameters which are to be taken into account.
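To make the limitation of the first-order form (2.22) concrete, it helps to write down
what it implies for the statistics of the arrival time. Under the additional assumption
(not stated above) that the deviations $\Delta p_{i,a}$ are zero-mean random variables with
standard deviations $\sigma_{p_i}$, the linear model yields:

```latex
\begin{align*}
E\{A\} &\approx A\left(p^{\mathrm{nom}}_1, \ldots, p^{\mathrm{nom}}_N\right) \\
\mathrm{Var}\{A\} &\approx \sum_{i=1}^{N} \sum_{j=1}^{N} a_i \, a_j \,
    \mathrm{Cov}\left\{\Delta p_{i,a}, \Delta p_{j,a}\right\} \\
&= \sum_{i=1}^{N} a_i^2 \, \sigma_{p_i}^2
    \quad \text{(if the deviations are mutually uncorrelated)}
\end{align*}
```

If the deviations are Gaussian, the arrival time is Gaussian as well, so any skew or other
non-Gaussian shape of the true delay distribution is lost in this representation.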
Another early approach towards statistical delay modeling has been published in [121],
proposing a simplified analytical model for estimating the influence of gate length ($L_{gate}$)
variations on the variance of the overall delay. Within this approach, the MOSFET gate
length is represented in terms of the distinct contributions to the variation:
$$L = L_{\mathrm{nom}} + L_{\mathrm{prox}} + L_{\mathrm{spat}} + \varepsilon \tag{2.23}$$
where Lprox represents the proximity-dependent variation, Lspat is the spatial variation,
and ε is the random residual variation. Typically, proximity-dependent sources of varia-
tions are treated as systematic, since the frequency and placement of particular gates in
the layout is known. As a result, $L_{\mathrm{prox}}$ is modeled by a discrete random variable, while
$L_{\mathrm{spat}}$ and $\varepsilon$ are represented by Gaussian distributions, such as
$L_{\mathrm{spat}} \sim N\left(0, \sigma^2_{\mathrm{spat}}\right)$ and $\varepsilon \sim N\left(0, \sigma^2\right)$. The delay of a gate is estimated using the
compact gate delay model [39] and is given by:
$$d = \frac{C_L V_{dd}}{n} \left( I^{-1}_{dn} + I^{-1}_{dp} \right) \tag{2.24}$$
where $C_L$ is the load capacitance, $n = 3.7$, and $I_{dn}$, $I_{dp}$ are the drain currents of the
NMOS and PMOS transistors. Further, the drain saturation current is approximated using the
following empirical relationship [39]:
$$I_{dsat} \sim L^{-0.5}_{eff} \, T^{-0.8}_{ox} \left( V_{dd} - V_{th} \right) \tag{2.25}$$
where $L_{eff}$ is the effective gate length, $T_{ox}$ is the oxide thickness, and $V_{th}$ is the
threshold voltage. Assuming that $L_{eff} \approx L$ and that $C_L \approx L \cdot W \cdot C_{ox}$ (where $W$ is the channel
width and $C_{ox}$ is the oxide capacitance), the dependence of the gate delay on the gate
length considering (2.24) is of the form $d \sim L^{1.5}$, or $d = k \cdot L^{1.5}$, where $k$ is a
process-dependent constant. The same analysis can be further extended for a path of $m$
gates, where the path delay becomes:
$$D = \sum_{i=1}^{m} d_i = k \sum_{i=1}^{m} L^{0.5}_{i} L_{i+1} \tag{2.26}$$
where Li and Li+1 are the gate length of the driver and of the load, respectively, for every
two successive gates in the path. The analysis then finds the variance of the path de-
lay, which can be computed by propagating the variances of the individual contributions
from (2.23) through the delay expression from (2.26). Finally, the standard deviation of
the path delay is expressed as:
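$$\sigma_D = \sqrt{\mathrm{Var}\{D\}} = \frac{1.5 \, D_{\mathrm{nom}}}{L_{\mathrm{nom}}} \left( \sigma^2_{L_{\mathrm{prox}}} + \frac{0.28 \, \sigma^2}{m} + \sigma^2_{L_{\mathrm{spat}}} \cdot \sum_{l=1}^{k} \gamma^2_l \right)^{1/2} \tag{2.27}$$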
where $k$ is the number of spatial partitions crossed by the given path (inside a spatial
partition the spatial correlation is considered to be equal to 1), and $\gamma_l = n_l / m$ is the
ratio of the number of gates $n_l$ in a given partition $l$ to the number of gates in the
path ($m$).
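A short numeric sketch may help to see how the three variance contributions in (2.27)
combine. The prefactor follows from $d = k \cdot L^{1.5}$, whose relative sensitivity to the gate
length is 1.5. The function below reads the middle term of (2.27) as $0.28\,\sigma^2 / m$; the
function name, argument names, and example values are hypothetical.

```python
import math


def path_delay_sigma(d_nom, l_nom, sigma_prox, sigma_eps, sigma_spat,
                     gates_per_partition):
    """Standard deviation of the path delay following Eq. (2.27).

    gates_per_partition lists n_l, the number of path gates in each spatial
    partition crossed by the path; their sum is the path length m.
    """
    m = sum(gates_per_partition)
    # Sum of squared ratios gamma_l = n_l / m over all crossed partitions.
    gamma_sq_sum = sum((n_l / m) ** 2 for n_l in gates_per_partition)
    variance_term = (sigma_prox ** 2
                     + 0.28 * sigma_eps ** 2 / m
                     + sigma_spat ** 2 * gamma_sq_sum)
    return 1.5 * d_nom / l_nom * math.sqrt(variance_term)


# Hypothetical path: same spatial sigma, one partition versus two partitions.
one_partition = path_delay_sigma(100.0, 50.0, 0.0, 0.0, 2.0, [8])
two_partitions = path_delay_sigma(100.0, 50.0, 0.0, 0.0, 2.0, [4, 4])
```

With these inputs the path confined to one partition shows the larger spread, because the
spatial component is fully correlated inside a partition and does not average out.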
This method shows the derivation of an analytic expression of the standard deviation
of the overall path delay, starting from a simple analytic model of the gate delay. It is to be
noticed that only gate length variations are captured in this approach. Additional param-
eter variations would require a new model derivation, starting from the gate delay, while
the complexity of propagating all the variances would eventually become prohibitive. It
is also important to add, that only the standard deviation is estimated, i.e. a single pa-
rameter is employed to characterize the distribution of the delay.
A more refined approach published in [34] includes the effects of process parameters
such as isolation oxide strain, transistor orientation, and etch loading. In addition to pro-
cess variations, also environmental parameter variations are considered, including sup-
ply voltage and temperature. The model derivation relies on the alpha-power law [139]
delay model for estimating the gate delay as [58]:
$$T_d = \frac{C_L \cdot V_{dd}}{I} = \frac{K \cdot V_{dd}}{\left( V_{dd} - V_{th} \right)^{\alpha}} \tag{2.28}$$
where $C_L$ is the load capacitance, $K = \frac{C_L}{\mu C_{ox} (W/L)}$, and $\alpha$ is a velocity saturation index.
Further, a unified analytic expression for the drive current is obtained, in the form of:
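$$I \propto \frac{\left\{ \ln \left[ 1 + \exp \left( \dfrac{V_{dd} - V_{th}}{2S} \right) \right] \right\}^{2}}{\left\{ 1 + \ln \left[ 1 + \exp \left( \dfrac{V_{dd} - V_{th}}{E_{sat} L} \right) \right] \right\}} \tag{2.29}$$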
where S is the subthreshold swing and Esat is the critical electrical field inducing the
carrier velocity saturation [167]. Based on this formulation for the drive current, an ex-
pression for the gate delay is derived as:
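$$T_d = \frac{K_L \cdot V_{dd} \cdot \left\{ 1 + \ln \left[ 1 + \exp \left( \dfrac{V_{dd} - V_{th}}{E_{sat} L} \right) \right] \right\}}{\left\{ \ln \left[ 1 + \exp \left( \dfrac{V_{dd} - V_{th}}{2S} \right) \right] \right\}^{2}} \tag{2.30}$$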
where $K_L$ represents a loading parameter and is modeled by a polynomial function:
$$K_L = \left( k_0 + k_1 \cdot L \cdot C_L + k_2 \cdot L^{a_k} \right) / W \tag{2.31}$$
The coefficients k0, k1, k2, and ak are extracted from simulations and are process-dependent.
Based on this delay formulation, a canonical model (linear around the nominal value) is
developed for the delay variability. For process parameter variations, statistical Gaussian
distributions are assumed, whereas other environment parameters (i.e. temperature, volt-
age) are described using corner models. The delay variability is extracted in the form of a
coefficient of variation, as the ratio of standard deviation to the mean (or nominal) value,
and is expressed as:
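$$\frac{\sigma_{T_d}}{T_d} = \sqrt{ \left( \frac{\partial \ln T_d}{\partial L} \right)^{2} \cdot \sigma^2_L + \left( \frac{\partial \ln T_d}{\partial V_{th}} \right)^{2} \cdot \sigma^2_{V_{th}} } \tag{2.32}$$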
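The coefficient of variation in (2.32) only needs the logarithmic derivatives of the delay
model. As a hedged illustration, the sketch below applies (2.32) to the simpler
alpha-power form (2.28) rather than to (2.30), treating $K$ as proportional to $L$ (which
holds if $C_L$, $W$, $\mu$, and $C_{ox}$ are kept fixed, an assumption made here). The derivatives are
obtained by central finite differences; all names and values are hypothetical.

```python
import math


def alpha_power_delay(length, vth, vdd=1.0, alpha=1.3, k_per_length=1.0):
    """Eq. (2.28) with K = k_per_length * L (assumes fixed C_L, W, mu, Cox)."""
    return k_per_length * length * vdd / (vdd - vth) ** alpha


def delay_cov(length, vth, sigma_l, sigma_vth, vdd=1.0, alpha=1.3,
              rel_step=1e-6):
    """Eq. (2.32): coefficient of variation from numeric log-derivatives."""
    def ln_td(l_val, vth_val):
        return math.log(alpha_power_delay(l_val, vth_val, vdd, alpha))

    h_l = rel_step * length
    h_v = rel_step * vth
    dln_dl = (ln_td(length + h_l, vth) - ln_td(length - h_l, vth)) / (2 * h_l)
    dln_dvth = (ln_td(length, vth + h_v) - ln_td(length, vth - h_v)) / (2 * h_v)
    return math.sqrt((dln_dl * sigma_l) ** 2 + (dln_dvth * sigma_vth) ** 2)


# Hypothetical operating point: L = 45 nm, Vth = 0.3 V, Vdd = 1.0 V.
cov = delay_cov(length=45.0, vth=0.3, sigma_l=1.5, sigma_vth=0.012)

# Closed form for this particular model, used only as a cross-check:
# d(ln Td)/dL = 1/L and d(ln Td)/dVth = alpha / (Vdd - Vth).
cov_closed_form = math.sqrt((1.5 / 45.0) ** 2 + (1.3 * 0.012 / 0.7) ** 2)
```

The same numeric procedure could be applied to any other differentiable delay expression
by replacing `alpha_power_delay`.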
Again, this modeling approach extracts a single parameter to characterize the delay
variability, namely the coefficient of variation, hence a more detailed characterization
of the delay distribution is not offered. In addition, the only process and environment
parameters which are captured are the ones present in the nominal gate delay expression
from (2.30). Thus, for the inclusion of other parameter variations, a new delay model
must be developed.
One of the most recently-published modeling methods in this field [149] derives ana-
lytic expressions for computing the delay and leakage of a digital gate considering inter-
die and intra-die process parameter variations and their spatial correlations. In this ap-
proach, only variations in the gate length and the zero-bias threshold voltage Vth0 are
considered. The analysis relies on the same sensitivity-based modeling of performance
metrics with respect to parameter variations. Considering e.g. $p$ different process
parameters, with their respective variations $\Delta P$, the sensitivity-based expressions for
delay and leakage are given by:
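$$\mathrm{Delay} = d_{\mathrm{nom}} + \sum_{i=1}^{p} \alpha_i \left( \Delta P_i \right) \tag{2.33}$$
$$\mathrm{Leakage} = \exp \left( V_{\mathrm{nom}} + \sum_{i=1}^{p} \beta_i \left( \Delta P_i \right) \right) \tag{2.34}$$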
where dnom is the delay nominal value, exp (Vnom) is the nominal value of the leakage
power, αi is the delay sensitivity with respect to process parameter Pi, and βi are the cor-
responding sensitivities of the logarithm of leakage. These sensitivities must be extracted
during circuit simulations while varying the corresponding process parameters. Spatial
correlations between the parameters are captured using a 2-D grid across the die area, and
a principal component analysis (PCA) method is applied to enable their representation
using de-correlated standard normal distributions.
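The two sensitivity forms (2.33) and (2.34) can be sketched in a few lines. The fragment
below evaluates both models for a given set of deviations and, under the assumption that
the deviations are independent zero-mean Gaussians (as they would be after a
de-correlation step), also gives the delay standard deviation and the mean leakage. The
latter uses the standard mean of a lognormal variable. Function names and numbers are
illustrative only.

```python
import math


def delay(d_nom, alphas, deltas):
    """Eq. (2.33): linear delay model."""
    return d_nom + sum(a * dp for a, dp in zip(alphas, deltas))


def leakage(v_nom, betas, deltas):
    """Eq. (2.34): leakage model, linear in the logarithm."""
    return math.exp(v_nom + sum(b * dp for b, dp in zip(betas, deltas)))


def delay_sigma(alphas, sigmas):
    """Std. deviation of (2.33), assuming independent zero-mean Gaussian deviations."""
    return math.sqrt(sum((a * s) ** 2 for a, s in zip(alphas, sigmas)))


def mean_leakage(v_nom, betas, sigmas):
    """Mean of (2.34) under the same assumption: exp(mu + variance / 2)."""
    variance = sum((b * s) ** 2 for b, s in zip(betas, sigmas))
    return math.exp(v_nom + variance / 2)


# Hypothetical sensitivities for two parameters (e.g. gate length and Vth0).
d = delay(10.0, [2.0, 3.0], [0.1, -0.2])
nominal_leak = leakage(0.0, [1.0, 0.5], [0.0, 0.0])
avg_leak = mean_leakage(0.0, [1.0, 0.5], [0.2, 0.2])
```

Because of the exponential in (2.34), the mean leakage exceeds the nominal value even
though the parameter deviations themselves have zero mean.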
This method can be extended with additional process parameters, however at the ex-
pense of running additional simulations to extract the sensitivities. A big challenge here
is the inter-parameter correlation. If only one parameter Pi is varied at a time and the
corresponding sensitivity αi (or βi) is extracted, this sensitivity might also embed a sig-
nificant dependence of the delay (or leakage) on another parameter Pj , assuming that Pj
is correlated with Pi. Later, when Pj is varied, another sensitivity αj (or βj) is obtained.
Since both αi and αj embed a common amount of the dependence on Pj , errors are intro-
duced when summing up αi∆Pi + αj∆Pj . Such linear sensitivity models give accurate
results only under the assumption of independence between the process parameters. Fur-
thermore, this method stores the extracted sensitivities as 2-D lookup tables, indexed by
the input rise time and the output capacitance. Nevertheless, besides the dependencies
of the sensitivities on the input slew and load capacitance, there are many other possible
dependencies, including dependencies on voltage levels, body bias, temperature-related
dependencies, which are not captured by default. For any additional dependency a new
extraction of the sensitivities and a corresponding storage form is required. Finally, at
given circuit conditions, a lookup must be performed for all these parameters, to obtain
the matching sensitivity, and if not found, possible interpolations must be considered,
which would add further questions regarding interpolation method, induced errors etc.
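To illustrate the lookup step discussed above, the following sketch performs a bilinear
interpolation in a 2-D table indexed by input rise time and output capacitance. The
source does not state which interpolation scheme [149] uses, so bilinear interpolation
with clamping at the table borders is an assumption made here, and the table contents are
invented for the example.

```python
from bisect import bisect_right


def bilinear_lookup(xs, ys, table, x, y):
    """Interpolate table[i][j], sampled at (xs[i], ys[j]), at the point (x, y).

    xs and ys must be sorted in ascending order and hold at least two entries.
    Query points outside the grid are clamped to the nearest border.
    """
    def bracket(axis, value):
        value = min(max(value, axis[0]), axis[-1])          # clamp to the grid
        hi = min(max(bisect_right(axis, value), 1), len(axis) - 1)
        lo = hi - 1
        weight = (value - axis[lo]) / (axis[hi] - axis[lo])
        return lo, hi, weight

    i0, i1, tx = bracket(xs, x)
    j0, j1, ty = bracket(ys, y)
    low_row = table[i0][j0] * (1 - ty) + table[i0][j1] * ty
    high_row = table[i1][j0] * (1 - ty) + table[i1][j1] * ty
    return low_row * (1 - tx) + high_row * tx


# Hypothetical sensitivity table: rows = rise time [ps], columns = load [fF].
rise_times = [10.0, 20.0]
loads = [1.0, 3.0]
sensitivities = [[1.0, 2.0],
                 [3.0, 4.0]]
alpha_mid = bilinear_lookup(rise_times, loads, sensitivities, 15.0, 2.0)
```

Every further dependency (supply voltage, body bias, temperature) would add one more
table dimension, which is the storage and lookup overhead pointed out in the text.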
To summarize, several research efforts have focused on performance modeling meth-
ods considering parameter variations, with significant results. One of the main limita-
tions to their applicability consists of the difficulty to extend them to include additional
process parameters. This is either not possible without reformulating the underlying de-
lay model [142,121,34], or, at best, implying a time-consuming extraction of new sensitiv-
ities [149]. Here it should be also mentioned that numeric extractions of sensitivities from
circuit simulations are always possible, nevertheless, as pointed out before, this approach
raises important questions concerning inter-parameter correlations.
Besides extension-related restrictions, the modeling capability of these methods is
mostly limited. On the one hand, methods like [121, 34] employ a characterization of
statistical distributions using a single parameter, such as the variance, standard devia-
tion, or the variation coefficient. On the other hand, sensitivity-based approaches such
as [142, 149] allow only a simple linear modeling of the dependence on parameter vari-
ations. The propagation of variation distributions from the parameter level up to the
performance metric is therefore limited by this simplified representation. Furthermore,
the extraction and storage of sensitivities has a limited ability to capture all the influence
of external varying factors, such as changes in voltages and temperature.
A practical method for including all state-of-the-art process parameters in the model
and for propagating their entire distributions across the analytic expressions has not been
published yet. The main challenges for such an approach are the model complexity (to
include all the parameters) and the development of an extensive set of practical
statistical operators which operate on distributions and propagate a complete
representation of the variation, such as the probability density function. This thesis
offers an important contribution in this area by providing an extensive modeling approach
relying on the BSIM transistor model and the required statistical operators for propagating
the entire distributions across the performance macromodels. The inter-parameter
correlation issue is solved by employing a detailed analytic model at the transistor level,
which embeds all process parameters, without resorting to sensitivity analyses. The
spatial correlation of process parameters and temperature is also considered by employing a
PCA-based de-correlation method.
Technology accuracy represents a key factor in the overall model precision, partic-
ularly for deep sub-micron processes. In the methods presented so far, the accuracy is
limited by the simplified current, delay, and leakage models. Relatively-recent modeling
approaches [34] employ simplified empirical transistor models developed two decades
ago [139]. This thesis presents two modeling methods which are focused on technology
accuracy at the circuit level (chapter 4) and at the interconnect level (chapter 5).
2.3 Resource Scheduling
Scheduling of processing and communication tasks on the assigned resources (processing
elements and communication segments) determines their relative execution order, hence
the start and finish time for each task. It has been shown in Sec. 2.1.3 and 2.2 that schedul-
ing dependencies play an important role in performance model creation.
While assigning a scheduling sequence, it is important to consider all data and control
dependencies between the tasks. In addition, the scheduling algorithm must ensure that
no task is left waiting in the execution queue, i.e. that no task is “starving”. Usually a
performance metric is also optimized during the scheduling method, such as minimum
execution time, or lowest power consumption. Since both scheduling and resource map-
ping are known to be NP-complete problems [66], heuristic approaches which do not
guarantee optimality are often employed.
In this section a few representative scheduling methods are presented in the context of
system-level optimization and communication architecture synthesis. Additionally, their
implications in the performance model structure and the overall design of an optimized
communication are shown.
2.3.1 Preemptive Methods
Within preemptive scheduling algorithms, the scheduler interrupts running tasks during
their execution and switches the context to other tasks, typically on a priority-driven
basis. Interrupted tasks eventually continue their execution, after restoring their state from
a saved context. Preemptive scheduling is not always preferred in embedded systems,
since it involves a substantial overhead: active scheduler, context switching, state saving
etc. Nonetheless, depending on the particular application and task dependencies, it can
improve the execution efficiency. For instance, if tasks are waiting frequently for exter-
nal events, during so-called “busy-wait” states, preemption would allow other tasks to
execute, thus leading to an overall shorter system latency.
A preemptive, static scheduling algorithm has been proposed in [50,51], which assigns
priority levels to tasks based on a dynamic estimation of the finishing time ($T^{e}_{i}$) for
delay minimization. Starting from the observation that a task can be reused by multiple
functions, an optimized resource mapping can be performed. Furthermore, depending on the
degree of reuse of a task, its preemption suitability is determined. To take into account
the scheduler overhead during task preemption, such as state saving and context switches,
a preemption overhead parameter is defined and employed for delay and power estimations.
The finish times of each task and of the overall system are finally compared with the
specified deadlines to validate scheduling decisions.
Processing and communication tasks are characterized by execution priority levels
and are first sorted in decreasing priority order. Tasks with equal priorities are sorted
in increasing order of their execution times. Since communication links can provide more
than one segment, both sequential and concurrent communication activities are considered.
The preemption decision for a task $t_i$ with priority $\phi_i$ which runs on a given resource $r$
in favor of another task $t_j$ with priority $\phi_j$ is given by:
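$$t_i \leftarrow t_j \quad \text{if} \quad \begin{cases} \phi_j > \phi_i \\ \text{or:} \\ T^{b}_{i} + \eta_r + T^{x}_{jr} \le \mu\left(t_i\right) \end{cases} \tag{2.35}$$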
where $T^{b}_{i}$ is the best-case finish time of task $t_i$, $T^{x}_{jr}$ is the execution time of $t_j$ on
resource $r$, and $\mu\left(t_i\right)$ is the required deadline for $t_i$. The parameter $\eta_r$ is the preemption
overhead of resource $r$, which includes context switching and state saving. The second
branch in (2.35) indicates that the preemption occurs only if $t_i$ is a sink task which will
not miss its deadline. In static preemptive scheduling approaches, after a task is scheduled
on a resource, the exact start and end times $T^{s}_{i}$ and $T^{e}_{i}$ can be precisely evaluated
considering the preemption slots. Thus, the accuracy of system delay estimations depends on
the modeling and integration of task scheduling.
[Figure: (a) dynamic frequency and voltage scaling; (b) power-down mode with stop and start
latencies; (c), (d) active task, run queue, and delay queue]
Fig. 2.13: Low power preemptive scheduling with fixed priority (after [146]).
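The preemption rule (2.35) translates directly into a small predicate. The sketch below
is only an illustration of that rule as written; the function and argument names are
hypothetical and the surrounding scheduler of [50,51] is not reproduced.

```python
def should_preempt(phi_i, phi_j, best_finish_i, overhead_r, exec_j_on_r,
                   deadline_i):
    """Eq. (2.35): decide whether running task t_i is preempted in favor of t_j.

    phi_i, phi_j   priorities of the running and the requesting task
    best_finish_i  T^b_i, best-case finish time of t_i
    overhead_r     eta_r, preemption overhead of resource r
    exec_j_on_r    T^x_jr, execution time of t_j on resource r
    deadline_i     mu(t_i), required deadline of t_i
    """
    higher_priority = phi_j > phi_i
    # Second branch: t_i still meets its deadline after t_j and the overhead.
    deadline_still_met = best_finish_i + overhead_r + exec_j_on_r <= deadline_i
    return higher_priority or deadline_still_met


# Hypothetical numbers: t_j has lower priority, but t_i has enough slack.
enough_slack = should_preempt(phi_i=5, phi_j=3, best_finish_i=20.0,
                              overhead_r=1.0, exec_j_on_r=4.0, deadline_i=30.0)
no_slack = should_preempt(phi_i=5, phi_j=3, best_finish_i=28.0,
                          overhead_r=1.0, exec_j_on_r=4.0, deadline_i=30.0)
```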
A fixed priority preemptive scheduling for optimizing power efficiency has been de-
veloped in [146], which achieves power reduction by utilizing the slack times between the
running tasks. These idle timing intervals are inherently caused by inter-task dependen-
cies, such as in the case of tasks waiting for the input from other resources, but are also
generated by variations in the actual execution times of tasks due to real-time conditions.
A run-time mechanism identifies and utilizes these slacks for power-reduction mecha-
nisms, such as switching the resource to a power-down mode, or applying a frequency
and voltage scaling.
In real-time systems, the execution times of tasks may vary due to several dynamic
conditions, such as different resource allocation, computational load of a processing
core, voltage drops etc. The worst-case execution time $T^{xwc}_{i}$ can be estimated through
profiling, or static analysis, as explained in [98]. Starting from the observation that the
ratio of the best-case to the worst-case execution time can vary up to as much as one order
of magnitude in typical applications [61], and that several idle time slots are present
in systems with fixed priority scheduling, a substantial power reduction can be achieved
by dynamically scaling the speed or shutting down the resource if the idle slot is
sufficiently large. Thus, during the scheduling process, if only one task is eligible and
its actual execution time $T^{x}_{i}$ is smaller than the available slot $S_i$, a frequency and
voltage scaling is applied according to the timing difference $S_i - T^{x}_{i}$, as shown in
Fig. 2.13(a). If no task is scheduled in the time slot, then the resource is switched into
a power-down mode, whereas the additional latencies required for stopping and restarting
the resource must also be considered, as illustrated in Fig. 2.13(b). For this purpose, a
conventional scheduler with fixed priority is modified to support these mechanisms.
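The slack-handling decision described above can be sketched as a small function. Two
points are assumptions made for this illustration and are not taken from [146]: the speed
is scaled linearly so that the task just fills its slot, and a slot counts as
"sufficiently large" for a shutdown when it exceeds the sum of the stop and restart
latencies. All names are hypothetical.

```python
def slot_action(slot, exec_time=None, t_stop=0.0, t_start=0.0):
    """Choose a power-saving action for one time slot of length `slot`.

    exec_time is the actual execution time of the single eligible task, or
    None if no task is scheduled in the slot. Returns (action, value).
    """
    if exec_time is not None:
        if exec_time < slot:
            # Slow down so that the task stretches over the whole slot
            # (assumes execution time scales inversely with clock speed).
            return ("scale", exec_time / slot)
        return ("full_speed", 1.0)
    if slot > t_stop + t_start:
        # Time actually spent in power-down after paying both latencies.
        return ("power_down", slot - (t_stop + t_start))
    return ("idle", 0.0)


scaled = slot_action(10, exec_time=4)
sleeping = slot_action(10, None, t_stop=2, t_start=3)
too_short = slot_action(4, None, t_stop=2, t_start=3)
```

A real implementation would additionally weigh the restart energy against the idle energy
saved, which is the trade-off captured by the energy terms of the power PM.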
Fixed-priority scheduling algorithms ensure that each task meets its deadline con-
straints and are simple to implement. Typically, higher priorities are assigned for tasks
with shorter execution times or higher execution frequencies. Lower-priority tasks are
preempted when higher-priority tasks request the execution. The scheduling mechanism
proposed in [146] maintains two task queues. A run queue contains tasks waiting for
execution, ordered by priority, while a delay queue maintains the tasks already finished,
which are waiting to start again in the next system period. The delay queue is periodi-
cally searched, and high-priority tasks are moved into the run queue. Fig. 2.13(c) shows
task t1 being executed, while the lower-priority tasks t2 and t3 wait for their turn in the
run queue. In Fig. 2.13(d) task t1 requests execution again and task t3, which was running
at the moment, is preempted and moved back into the run queue. If the run queue is
empty and no task is active, then the resource is powered down. Otherwise, if an active
task is running, then speed scaling is applied.
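The interplay of the run queue and the delay queue can be sketched with a priority heap.
The class below is a simplified illustration and not the implementation of [146]: larger
numbers mean higher priority, time is advanced by explicit `tick` calls, and speed scaling
is reported only when the active task is the sole eligible one, which is how the rule
above is interpreted here.

```python
import heapq


class FixedPriorityScheduler:
    def __init__(self):
        self.run_queue = []    # heap of (-priority, name): waiting tasks
        self.delay_queue = []  # (release_time, priority, name): finished tasks
        self.active = None     # (priority, name) of the running task, or None

    def finish_active(self, next_release):
        """The active task completes and waits for its next period."""
        priority, name = self.active
        self.delay_queue.append((next_release, priority, name))
        self.active = None

    def tick(self, now):
        """Release due tasks, preempt if needed, and report the resource mode."""
        waiting = []
        for release, priority, name in self.delay_queue:
            if release <= now:
                heapq.heappush(self.run_queue, (-priority, name))
            else:
                waiting.append((release, priority, name))
        self.delay_queue = waiting

        top_is_higher = (self.run_queue and
                         (self.active is None
                          or -self.run_queue[0][0] > self.active[0]))
        if top_is_higher:
            if self.active is not None:  # preempted task returns to the run queue
                heapq.heappush(self.run_queue,
                               (-self.active[0], self.active[1]))
            neg_priority, name = heapq.heappop(self.run_queue)
            self.active = (-neg_priority, name)

        if self.active is None:
            return "power_down"
        return "run" if self.run_queue else "speed_scaling"


# Scenario loosely following Fig. 2.13(c) and (d): t1 > t2 > t3 in priority.
sched = FixedPriorityScheduler()
sched.delay_queue = [(0, 3, "t1"), (0, 2, "t2"), (0, 1, "t3")]
mode_start = sched.tick(0)       # t1 runs, t2 and t3 wait
sched.finish_active(next_release=10)
sched.tick(1)                    # t2 runs
sched.finish_active(next_release=20)
mode_alone = sched.tick(2)       # t3 runs alone
mode_preempt = sched.tick(10)    # t1 is released again and preempts t3
```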
2.3.2 Non-Preemptive Methods
Non-preemptive scheduling methods are usually preferred in embedded system design,
because of their low overhead. Very simple methods, such as first-match first-serve,
which select the first compatible task from the execution queue, are often employed on
heterogeneous MPSoC platforms [144]. A basic extension to this method is to select the
task which benefits the most from the instruction set or the hardware architecture of the
respective core. This decision may be detrimental for very simple tasks, which will be
constantly pushed back in the queue. Adding a priority level to each task would prevent
starving conditions.
A static, non-preemptive deterministic scheduling mechanism has been presented
in [59], in combination with a communication optimization algorithm for minimizing
the worst-case system latency. Here, the particularities of the underlying architecture
and the parameters of the communication protocol are considered for the generation of
static scheduling tables for both data processing and communication tasks. Both data and
control dependencies are taken into consideration, therefore the actual execution path in
the task graph is not predictable at the design time. Fig. 2.14 shows an example where
processing nodes PN2, PN11, and PN12 have control dependencies with their succes-
sors. The values of the branch conditions C, D, and K are determined only at run-time.
Nevertheless, at a certain time moment in the execution, all the upstream conditions are
known. Thus, the scheduling algorithm must take the best decision possible consider-
ing the known conditions. For this reason, the optimized static scheduling is determined
through a heuristic algorithm, which minimizes the worst-case delay for all possible com-
binations of the branch conditions.
The scheduling table has one row for each processing and communication node ($PN_i$
and $CN_i$, respectively), and contains the start times $T^{s}_{i}$ for the different values of the
branch conditions. The resulting scheduling table for the example in Fig. 2.14 is displayed
in Tab. 2.1. The entry for a node in a given column corresponds to its scheduling time for
the case in which the condition at the top of the column is true. It can be noted that the
start times for PN1, PN2, PN11, and CN1 do not depend on any branch condition, hence
they have unconditional deterministic start times (given in the first column).
[Figure: task graph from Start to End with processing nodes 1 to 17, communication nodes
CN1 to CN14, and branch conditions C, D, and K]
Fig. 2.14: Extended task graph example for static scheduling (after [59]).
The worst-case delay is determined by examining the scheduling rows for the terminal
tasks. If the terminal tasks are $PN_{t_1}, PN_{t_2}, \ldots, PN_{t_n}$, then the worst-case delay is given by:
$$T_{wc} = \max_{i=1 \ldots n} \left\{ \max_{j=1 \ldots N_c} \left\{ T^{s}_{t_i,j} \right\} + T^{x}_{t_i} \right\} \tag{2.36}$$
where $N_c$ is the number of possible conditions in the schedule table, and $T^{s}_{t_i,j}$ is the
scheduled start time of $PN_{t_i}$ in the $j$-th column. Equation (2.36) also represents the cost
function to be optimized by the heuristic scheduling algorithm. In the example from
Fig. 2.14 the terminal tasks are PN10 and PN17, thus, according to Tab. 2.1, (2.36) becomes:
$$T_{wc} = \max \left\{ 35 + T^{x}_{10}, \; 30 + T^{x}_{17} \right\} \tag{2.37}$$
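The step from (2.36) to (2.37) can be reproduced with the start times listed for PN10 and
PN17 in Tab. 2.1. In the sketch below only those two rows are taken from the table; the
execution times of the two terminal tasks are not given in this section, so the values
used for them are hypothetical placeholders.

```python
def worst_case_delay(start_times, exec_times):
    """Eq. (2.36): latest scheduled start over all columns plus execution time,
    maximized over the terminal tasks."""
    return max(max(starts) + exec_times[task]
               for task, starts in start_times.items())


# Scheduled start times of the terminal tasks, as listed in Tab. 2.1.
terminal_starts = {
    "PN10": [35, 34, 27, 26, 34, 26],
    "PN17": [25, 24, 30, 26, 24, 24],
}

# Hypothetical execution times T^x_10 and T^x_17 (not given in the text).
terminal_exec = {"PN10": 2, "PN17": 10}

t_wc = worst_case_delay(terminal_starts, terminal_exec)  # max{35 + 2, 30 + 10}
```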
Another non-preemptive list-scheduling algorithm with fixed priority has been pub-
lished in [57] for tasks interacting through both data and control dependencies. In gen-
eral, list-scheduling approaches select the task with the highest priority in terms of the
optimization goal (e.g. shortest global execution time). To avoid execution conflicts, a
single priority is usually assigned to each task and an individual scheduling table is built
for every processing unit. Due to the fact that some nodes interact through control de-
pendencies, an execution trace is identified for each combination of the branch conditions.
Consequently, each conditional trace is scheduled independently as an alternative execu-
tion path. Within this process, the scheduling algorithm, which relies on heuristics to find
close-to-optimal schedules, selects the nodes according to the worst-case combination of
the branch conditions. This operation results in scheduling sequences determined by
the most time-constrained traces. To cope with the complexity given by considering mul-
tiple traces, a set of priorities is assigned to each node, with a priority for each execution
trace to which it belongs.
true D D ∧ C D ∧ C ∧ K D ∧ C ∧ K D ∧ C D ∧ C ∧ K D ∧ C ∧ K D D ∧ C D ∧ C
PN1 0
PN2 3
PN3 6 6
PN4 7 7
PN5 18 18 18
PN6 21 20 20
PN7 21 21 21
PN8 29 28 28
PN9 26 25 25
PN10 35 34 27 26 34 26
PN11 0
PN12 9 9
PN13 10 13
PN14 18 24
PN15 19 24
PN16 15 15 15 15
PN17 25 24 30 26 24 24
CN1 3
CN2 9 8
CN3 21 20 21 20 20 20
CN4 19 18 18
CN5 12 13
CN6 26 25 25
CN7 25 24 24
CN8 32 32 32
CN9 8 11
CN10 8 8
CN11 16 16
CN12 16 16
CN13 22 22
CN14 23 22 23 22
Tab. 2.1: Scheduling table for the example in Fig. 2.14 [59].
Typically, the priority function includes information regarding the relative position
of a node in the critical path. If a node has many successors along the critical path, its
execution must finish as early as possible, hence it receives a high priority. This approach
leads to critical-path-driven schedules, as shown for a four-task example in Fig. 2.15(b).
However, tasks running on software processors and communication segments must be
serialized, and including this information in the scheduling algorithm could lead to better
scheduling results, as shown in Fig. 2.15(c).
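A common way to obtain such a critical-path priority is to use, for every node, the
length of the longest path from that node to the end of the task graph. The sketch below
computes this value and derives a list-scheduling order from it. It is a generic
illustration, not the priority function of [57]; the task graph and its execution times
are hypothetical, and resource serialization is deliberately left out.

```python
from functools import lru_cache


def critical_path_priorities(exec_time, successors):
    """Priority of a node = its execution time plus the longest remaining path."""
    @lru_cache(maxsize=None)
    def remaining(node):
        tail = max((remaining(s) for s in successors.get(node, ())), default=0)
        return exec_time[node] + tail

    return {node: remaining(node) for node in exec_time}


def list_schedule_order(exec_time, successors):
    """Order nodes by decreasing priority.

    With strictly positive execution times every predecessor has a larger
    priority than its successors, so the order respects all dependencies.
    """
    priority = critical_path_priorities(exec_time, successors)
    return sorted(exec_time, key=lambda node: -priority[node])


# Hypothetical four-task graph: A precedes B and C, which both precede D.
times = {"A": 3, "B": 2, "C": 4, "D": 1}
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
priorities = critical_path_priorities(times, edges)
order = list_schedule_order(times, edges)
```

Here C is scheduled before B because it lies on the longer path to D.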
[Figure: (a) task graph with four tasks mapped to resources R1, R2, and R3;
(b) critical-path scheduling; (c) scheduling with resource mapping information]
Fig. 2.15: Scheduling of four tasks (a) considering only critical-path information (b) and after
including the resource mapping (c) (after [57]).
A general limitation of the existing scheduling approaches is the lack of support for
variability in the execution times. All the operations applied for scheduling decisions are
performed on constant numbers and the simple algebraic operators are not suitable for
statistical evaluations. Chapter 3 presents a statistical methodology which operates on
timing quantities described by statistical distributions.
Tech. Node       90 nm   65 nm   45 nm   32 nm   22 nm
Leff (3σ, nm)    4.7     2.6     1.9     1.3     0.9
VTn (3σ D2D)     12%     12%     12%     12%     12%
VTp (3σ D2D)     6%      6%      6%      6%      6%
Tox (3σ)         4%      4%      4%      4%      4%
Tab. 2.2: Predicted three-sigma variations of device parameters across several technology nodes.
2.4 Parameter Variations and Statistical Analysis
As CMOS technologies continue to scale in the deep-submicron regime, an increasingly
higher level of systematic and random variations in process, supply voltage, and
temperature continuously affects the performance of integrated circuits [27, 20, 33].
Process-induced fluctuations result in significant variations of various device parameters, such
as Leff , Tox, and VT [118]. The sources of such variations are either environmental or
physical factors. Tab. 2.2 shows predicted values from literature for variations in Leff [80,
133,34,173,95,33], VT [133,34,7,20], and Tox [133,20]. It is to be mentioned that die-to-die
(D2D) VT variations differ between NMOS and PMOS devices as shown in [83] and intra-
die variations (not shown in Tab. 2.2) are inversely-proportional to the square root of the
channel area [20]:
$$\sigma^{ID}_{V_T} = 3.19 \cdot 10^{-8} \left( \frac{T_{ox} N^{0.4}_{A}}{\sqrt{L_{eff} W_{eff}}} \; [\mathrm{V}] \right) \tag{2.38}$$
where $N_A$ is the average channel doping. The intra-die $V_T$ variations include also the
effect of uncorrelated channel doping fluctuations.
[Figure: lot-to-lot, wafer-to-wafer, die-to-die, and intra-die variations]
Fig. 2.16: Classification of parameter variations.
Given this trend in device scaling for achieving an enhanced performance and cost
reductions, the employed materials and manufacturing processes have come very close
to their reliability limits [108]. Thus, an important step towards improving the manufac-
turing yield is to understand the sources of parameter variations [141] and to develop a
statistical methodology for accurate performance modeling.
The inclusion of process parameter variations in modeling methodologies has started
with the use of process corners. Hereby, an analysis is performed multiple times, for every
process condition. If a sufficiently-high number of process conditions is considered, the
effects of process variations on a particular performance metric can be estimated. This
approach is however limited by several important aspects. First, this method has been
developed to model die-to-die parameter shifts, whereas intra-die variations cannot be
accurately captured [24, 7]. Second, the number of process parameters which show sig-
nificant variations has increased substantially with the new process generations [20], and
the corresponding number of process corner files required to capture the entire param-
eter set has reached a number which prohibits the direct analysis using standard Monte
Carlo approaches [143]. Effective modeling approaches for including various parameter
variations have been developed lately in the field of statistical timing analysis [24]. In the
remaining of this section the sources of parameter variations are discussed, followed by
a brief review of the developed methods for statistical analysis.
2.4.1 Sources of Parameter Variations
There are several classifications of parameter variations available in the literature, de-
pending on their type [24,20], source [173,7,141,108], and effects [118,27,107]. Generally,
parameter variations are investigated for:
• manufacturing process parameters – which typically describe devices and intercon-
nects;
• environmental parameters – including the temperature, operating voltages, as well
as wear-out influences across the circuit lifetime.
Variations in process parameters are further classified into:
• systematic, predictable variations – which depend on layout characteristics and can
be theoretically modeled using deterministic factors, such as the surrounding layout
topology [165];
• random variations – for parameters that are unknown at design time, and for which
only statistical descriptions are available.
Nonsystematic, random variations are caused by slight deviations in the manufacturing
steps, such as chemical-mechanical polishing (CMP) [96] and rapid thermal annealing
(RTA) [9], among others. These random variations affect the manufactured devices differently and
can be classified into:
• die-to-die (D2D) variations – which have the same value for all the devices on a
single die, resulting from the process changes from lot to lot, wafer to wafer, and
reticle to reticle;
• intra-die variations – which have a different impact on each device across the same
die.
Finally, intra-die variations can be differentiated according to their correlation as:
• spatially-correlated variations – which exhibit a gradual change between different
locations on the die;
• uncorrelated variations – which affect different devices independently, such as ran-
dom dopant fluctuations (RDF), or the line edge roughness (LER).
According to the source and type of variations, an adequate statistical model must
be employed. The next section shows a series of methods for capturing and analyzing
parameter variations.
2.4.2 Statistical Analysis Methods
Variability effects can be included in the analysis by modeling the corresponding param-
eters as random variables. As a result, the estimated performance metrics, which depend
on these parameters, will be also described by random variables. There are several ways
to characterize a statistical distribution, either by employing its probability density func-
tion (pdf) or the cumulative distribution function (CDF). If obtaining the entire pdf or
CDF characterization is not possible or intended, a coarser description is employed, by
specifying a few moments of the distribution, typically the mean and variance (or the
mean and standard deviation).
Early approaches to statistical modeling of process variations have assumed Gaussian
distributions for the process parameters and the resulting performance metrics [67, 148,
2, 10, 45]. Hereby, the dependence of a metric on the individual parameter variations is
expressed in terms of the individual sensitivities, e.g. the path delay is estimated as:
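\[
D_p = N\left( D_{p,nom},\ \left(\frac{\partial D_p}{\partial L}\right)^2 \sigma_{\Delta L}^2 + \left(\frac{\partial D_p}{\partial V_{th}}\right)^2 \sigma_{\Delta V_{th}}^2 + \left(\frac{\partial D_p}{\partial T_{ox}}\right)^2 \sigma_{\Delta T_{ox}}^2 + \cdots \right) \tag{2.39}
\]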
This method assumes a linear dependence on each process parameter variation ∆Pi given
by the first-order sensitivity:
\[
\Delta D_p(\Delta P_i) = \frac{\partial D_p}{\partial P_i}\,\Delta P_i \tag{2.40}
\]
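resulting in linear-dependence expressions for the delay and for the exponent of the leakage, respectively [45]:
\[
\mathrm{Delay} = D_{nom} + \sum_{i=1}^{p} \alpha_i\,\Delta P_i \tag{2.41}
\]
\[
\mathrm{Leakage} = \exp\left( V_{nom} + \sum_{i=1}^{p} \beta_i\,\Delta P_i \right) \tag{2.42}
\]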
where αi and βi are the first-order sensitivities to the respective parameter Pi.
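The Gaussian delay model of Eqs. (2.39) and (2.41) can be illustrated with a short sketch. It assumes independent, zero-mean Gaussian parameter variations; the sensitivities and standard deviations below are hypothetical placeholders, and the Monte Carlo run only serves as a sanity check of the analytical variance.

```python
import math
import random

def linear_delay_stats(d_nom, sensitivities, sigmas):
    """Mean and standard deviation of D = D_nom + sum(alpha_i * dP_i)
    for independent zero-mean Gaussian variations dP_i (Eqs. 2.39, 2.41)."""
    variance = sum((a * s) ** 2 for a, s in zip(sensitivities, sigmas))
    return d_nom, math.sqrt(variance)

# Hypothetical values: nominal delay, sensitivities, parameter sigmas.
d_nom = 100.0
alphas = [2.0, -1.5, 0.5]
sigmas = [1.0, 2.0, 4.0]

mean, std = linear_delay_stats(d_nom, alphas, sigmas)

# Monte Carlo cross-check of the analytical result.
rng = random.Random(0)
samples = [
    d_nom + sum(a * rng.gauss(0.0, s) for a, s in zip(alphas, sigmas))
    for _ in range(20000)
]
mc_mean = sum(samples) / len(samples)
mc_std = math.sqrt(sum((x - mc_mean) ** 2 for x in samples) / (len(samples) - 1))

print(f"analytical: mean={mean:.2f}, std={std:.3f}")
print(f"sampled:    mean={mc_mean:.2f}, std={mc_std:.3f}")
```

The agreement holds only because the model is linear and the inputs are Gaussian; the limitations of both assumptions are discussed next.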
However, variations in several physical device parameters can significantly differ
from a normal distribution [24], such as deviations in the critical dimension (CD, usu-
ally the gate length) caused by shifts in the depth of focus (DOF), which add a negative
skewness to the distribution. In addition, performance metrics exhibit several nonlin-
ear dependences on process parameters, which result in non-normal distributions even
if the process parameters are normally distributed. Moreover, as indicated in [24], the
maximum operation required for many delay estimations (see also Sec. 2.2.2) is strongly
nonlinear and adds a positive skewness to the result. These observations have led to
several approaches to represent non-normal random distributions [113, 3, 92, 36]. A sim-
ple triangular distribution model, discretized as an impulse train, has been used in [113]
to enable a fast delay computation from random variables. Discretized pdf represen-
tations have been employed in [99, 3, 100, 55] for representing the distributions, com-
bined with Monte Carlo sampling. Hereby, 3-point, 5-point, and 7-point piece-wise linear
(PWL) approximations of pdfs and CDFs are employed in [55] for representing the delays
and arrival times, respectively. Discretized pdf and CDF representations were used also
in [6, 5, 4], where they are propagated across the delay tree by means of statistical sum
and maximum operations. The propagation of PWL approximations for pdfs and CDFs
has been also discussed in [92] in the context of a runtime optimization method through
error trade-offs. A non-linear function of process parameters fA (∆XN) was introduced
in [36] to derive an extended canonical dependence for the delay assuming non-Gaussian
parameters and nonlinear dependences:
\[
A = A_{nom} + \sum_{i=1}^{n_{LG}} a_{LG,i} \cdot \Delta X_{LG,i} + f_A(\Delta X_N) + a_{n+1} \cdot \Delta R_a \tag{2.43}
\]
where XLG are linear-dependence Gaussian parameters with the respective sensitivities
aLG, and ∆XN = (∆XN,1, ∆XN,2, . . . ) contains the nonlinear and non-Gaussian parameter
variations. ∆Ra indicates a normalized Gaussian parameter representing uncorrelated
random variations, with the corresponding sensitivity an+1. An alternative method is
employed in [157] which uses connectivity graphs to estimate the results of process vari-
ations described by discretized pdfs on device performance and employs new operators
and domains based on sampling in the random variable space.
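The propagation of discretized distributions through statistical sum and maximum operations can be sketched as follows. This is a simplified illustration, not the method of any cited work: the two inputs are assumed to be independent (no topological or spatial correlations), and the 3-point delay distributions are hypothetical.

```python
from collections import defaultdict
from itertools import product

def combine(pmf_a, pmf_b, op):
    """Distribution of op(A, B) for independent discrete variables A and B."""
    out = defaultdict(float)
    for (xa, pa), (xb, pb) in product(pmf_a.items(), pmf_b.items()):
        out[op(xa, xb)] += pa * pb
    return dict(out)

def stat_sum(pmf_a, pmf_b):
    """Statistical sum: delay of two stages in series."""
    return combine(pmf_a, pmf_b, lambda x, y: x + y)

def stat_max(pmf_a, pmf_b):
    """Statistical maximum: arrival time at a node with two fan-in paths."""
    return combine(pmf_a, pmf_b, max)

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

# Two hypothetical, symmetric 3-point delay distributions.
d1 = {9: 0.25, 10: 0.5, 11: 0.25}
d2 = {9: 0.25, 10: 0.5, 11: 0.25}

path = stat_sum(d1, d2)      # symmetric around 20
arrival = stat_max(d1, d2)   # probability mass shifted towards larger values

print("sum:", sorted(path.items()))
print("max:", sorted(arrival.items()))
```

Although both inputs are symmetric, the result of the maximum is not: its mean lies above the input mean of 10, which illustrates the nonlinearity of the maximum operation noted above.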
Once an appropriate representation of the variable parameters has been found, the
variability must be propagated across the models, from the parameter level to the top-
level system performance metrics. Earlier approaches relied on the simple propaga-
tion of extracted sensitivities [140, 115], extracted through response surface methodol-
ogy (RSM) or Monte Carlo simulations and represented in the form of response surface
functions (RSF). A more recent approach [163, 68] proposes a method to compute the
stochastic response of performance metrics by means of orthogonal polynomial expan-
sions in an infinite dimensional Hilbert space. This approach has the disadvantage of
slow convergence of the infinite series representations, together with the high complexity
of finding the coefficient functions. A further statistical method for parametric yield
prediction [85] proposes the division of the parameter space into feasibility regions of
regular shape (i.e. parallelepipeds or ellipsoids) and performing a numerical integration
of the joint pdf of the sources of variation. This modeling approach relies again on linear
representations of the performance metrics using sensitivity matrices. A propagation
method assuming independent lognormal distributions of process parameters has been
presented in [135]. Here, the total leakage current expression is approximated using a
lognormal distribution:
\[
\mathrm{pdf}(I_{tot}) = \left[ \frac{1}{I_{tot}\sqrt{2\pi\sigma_{N,I_{tot}}^2}} \right] \exp\left[ -\left( \frac{\log(I_{tot}) - \mu_{N,I_{tot}}}{\sigma_{N,I_{tot}}\sqrt{2}} \right)^2 \right] \tag{2.44}
\]
where µN,Itot and σ²N,Itot are the mean and variance of the normal random variable
corresponding to the lognormal distribution. An analytical approach for computing the pdf
of arrival time has been proposed in [168] for buffer circuits, which uses a recursive algo-
rithm implying the computation of first-order derivatives for the delay expression. This
method is however suitable only for simple circuits, for which basic analytic models ex-
ist, therefore does not scale with circuit complexity. A quadratic dependence of the delay
on process parameter variations has been used in [171] to propagate distributions us-
ing second-order sensitivity sets. This approach extends the modeling capabilities a step
beyond first-order approximations, remaining nevertheless limited by the quadratic esti-
mations and still requiring sampling in the parameter space. A Taylor series expansion
up to the fourth order has been used as alternative in [176], which improves the model-
ing accuracy. Finally, improved Monte Carlo approaches have been used several times
recently for statistical analysis, in the form of importance sampling [153, 175, 172]. This
method translates the variability from the parameter space to the performance metrics in
the form of parametric yield estimations. The key idea behind importance sampling is to
reuse the results of Monte Carlo simulations and to obtain estimates of one random vari-
able by sampling in a different distribution. For instance, the expectation of a function
Ψ(X) of the random variable X described by pdf p (x) can be obtained as:
\[
E_p(\Psi(X)) = \int_{-\infty}^{\infty} \Psi(x)\,p(x)\,dx = \int_{-\infty}^{\infty} \Psi(x)\,W(x)\,q(x)\,dx \tag{2.45}
\]
where W(x) = p(x)/q(x). In this way, the samples of X can be obtained using a different pdf
q(x). By properly weighting each sample with the function W(x) the error introduced by
sampling with a different distribution is minimized [172].
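Eq. (2.45) can be illustrated with a small tail-probability estimate. The sketch below is not taken from the cited works: it assumes a standard normal p(x), a shifted normal q(x) centered at the failure threshold, and an indicator function Ψ(x); the threshold, sample count, and all names are hypothetical.

```python
import math
import random

SHIFT = 3.0  # hypothetical failure threshold, also the mean of q(x)

def psi(x):
    """Indicator of the rare event X > SHIFT."""
    return 1.0 if x > SHIFT else 0.0

def weight(x):
    """W(x) = p(x) / q(x) for p = N(0, 1) and q = N(SHIFT, 1)."""
    return math.exp(-SHIFT * x + 0.5 * SHIFT ** 2)

def importance_estimate(n, rng):
    """Estimate E_p[psi(X)] from n samples drawn from q instead of p."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(SHIFT, 1.0)
        total += psi(x) * weight(x)
    return total / n

rng = random.Random(1)
estimate = importance_estimate(20000, rng)
exact = 0.5 * math.erfc(SHIFT / math.sqrt(2.0))  # closed form for comparison

print(f"importance sampling: {estimate:.6f}")
print(f"closed form:         {exact:.6f}")
```

Sampling directly from p(x) would place only about one sample in a thousand beyond the threshold, whereas roughly half of the samples drawn from q(x) contribute to the estimate.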
Correlations can arise between the distributions during the analysis, either from re-
converging paths [6] (topological correlations) or from spatially correlated process pa-
rameters [24] (spatial correlations). Bayesian networks have been used in [22] to represent
the circuit and accurately track topological correlations. Spatial correlations are handled
in [23] using Karhunen-Loève expansions. While providing very accurate representations,
Karhunen-Loève computations are always sample-specific and involve large computational
overheads by relying on eigenvalue and eigenvector computations from the
sample matrices. Simplified correlation structures have been also employed for describ-
ing intra-die variations, such as grid [35] or quadtree models [1]. A distance-dependent
correlation model for intra-die variations has been introduced in [104] which divides the
chip die into several perfect correlation regions. Spatial correlations were also charac-
terized in [101] through an optimal spatial model which matches a set of observations
using generalized least square fitting. Finally, the principal component analysis (PCA)
is often employed for transforming spatially correlated variables into a set of standard
decorrelated principal components of variation [150,43].
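The decorrelation step of PCA can be sketched for the smallest possible case of two correlated variations, where the eigen-decomposition of the covariance matrix has a closed form. The example is illustrative only: the covariance values are hypothetical, the covariance matrix is assumed to be positive definite, and realistic applications involve many more variables and a numerical eigen-solver.

```python
import math

def pca_2x2(var1, var2, cov12):
    """Eigenvalues and unit eigenvectors of [[var1, cov12], [cov12, var2]],
    obtained from a single plane rotation."""
    theta = 0.5 * math.atan2(2.0 * cov12, var1 - var2)
    c, s = math.cos(theta), math.sin(theta)
    lam1 = var1 * c * c + 2.0 * cov12 * c * s + var2 * s * s
    lam2 = var1 * s * s - 2.0 * cov12 * c * s + var2 * c * c
    return (lam1, lam2), ((c, s), (-s, c))

def to_components(dx, eigvals, eigvecs):
    """Project a correlated deviation onto standardized principal components."""
    return tuple(
        (v[0] * dx[0] + v[1] * dx[1]) / math.sqrt(lam)
        for lam, v in zip(eigvals, eigvecs)
    )

def from_components(z, eigvals, eigvecs):
    """Rebuild the correlated deviation from the principal components."""
    x0 = sum(math.sqrt(lam) * zk * v[0] for lam, zk, v in zip(eigvals, z, eigvecs))
    x1 = sum(math.sqrt(lam) * zk * v[1] for lam, zk, v in zip(eigvals, z, eigvecs))
    return (x0, x1)

# Hypothetical case: two neighboring devices, unit variance, correlation 0.8.
eigvals, eigvecs = pca_2x2(1.0, 1.0, 0.8)
z = to_components((0.5, -0.2), eigvals, eigvecs)
back = from_components(z, eigvals, eigvecs)

print("eigenvalues:", eigvals)
print("components: ", z)
```

The larger eigenvalue captures the common-mode part of the variation, the smaller one the difference between the two devices.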
To conclude, most approaches to statistical modeling are limited by several factors,
such as Gaussian or lognormal assumptions, linear dependences, or the propagation of
distributions considering only first, second, or up to fourth order approximations. Rel-
atively recent methodologies still assume Gaussian distributions and linear functions of
variation [8, 71]. A complete description of the distributions propagated entirely across
technology-accurate complex models, which is also scalable for any number of parameter
variations, has not yet been developed. This thesis brings an important contribution to
this challenge in chapters 3 and 4.
2.5 Technology Accuracy
As discussed in Sec. 2.2, a major challenge in performance modeling is to develop accurate
models for the current technologies and particularly to embed accurately the existing
parameter variations. In this section we discuss three important aspects of technology
accuracy, namely the process characterization, yield prediction and optimization, and
transistor-level models for statistical analysis.
2.5.1 Process Characterization
The accurate use of process parameter variations assumes first a reliable method for the
process characterization and extraction of the parameter distributions. An early method-
ology presented in [120] points out the importance of considering the sources of param-
eter correlations. It relies on a relatively simple approach, namely a direct sampling
combined with clustering of the interdependent device parameter sets. While keeping
the measured process parameters organized in sets, this method preserves the inter-
parameter correlations and stores a direct link between each set and the corresponding
die location.
A direct measurement-based method for process characterization has been published
in [122], which proposes the statistical characterization of process parameters by means of
resistance measurements. In this approach, a set of test sites containing long and narrow
polysilicon resistors is implemented on each reticle. By measuring the resistance
of these polysilicon lines, the distribution of CD variations is captured. Statistical
methods to analyze measurement data and extract statistical distributions have been pre-
sented in [174], based on moment matching of quadratic models up to the third moment.
Hereby, any given parameter is estimated as a quadratic function of a Gaussian random
variable, which is fitted by matching the first three moments (i.e. mean, variance, and
skewness). The method presented in [164] assumes a variable degree of accuracy in the
description of parameter distributions, given the difficulty of extracting process informa-
tion at early design stages. As a consequence, it employs random variable representations
using bounds for CDFs to handle the partially-specified uncertainty descriptions. Math-
ematical theories from random fields and convex analysis were used in [169] to extract
a spatial correlation function and the corresponding matrix from measurement data. A
similar approach has been employed in [46] where variogram functions are used to ex-
tract spatial covariation models. An empirical variogram function is extracted from the
data and is used to eliminate the global component of variation, while emphasizing the
intra-die variability. The empirical model based on variograms is then tested against data
and refined through weighted least-squares regression. The models employed in this
thesis for modeling process parameter variations and spatial correlations are discussed in
chapter 4.
2.5.2 Yield Optimization
Mapping the variations in process parameter space to the performance metrics is one
way to estimate the parametric yield, and can be performed as discussed in Sec. 2.4.2. A
related aspect is the statistical yield optimization with respect to given criteria. Here,
an important premise for accuracy is to consider the particularities of the manufacturing
technology.
A method for statistical technology mapping has been proposed in [147] for optimiz-
ing the logic synthesis with respect to leakage power minimization. The proposed ap-
proach finds a circuit mapping considering the dependence of the overall cost function
on the variance in making the dynamic selection of the gates from a given technology
library. In [105], power minimization is achieved using a linear programming technique.
However, this approach accounts for only two sources of variability, namely the effective
channel length (Leff ) and the gate-length independent threshold voltage, which are both
assumed to be Gaussian random variables. In addition, the delay is approximated by
a first-order Taylor series expansion, while the leakage power is modeled using a first-
order sensitivity. A probabilistic description of power-performance tradeoffs has been
used in [93] for design space exploration. This method uses sets of Pareto-optimal points
to encode solutions in the power-delay space, which are not dominated by any other so-
lution in the feasibility set. The parametric yield has been modeled in [21] as a function of
the mean and variance of circuit leakage. Hereby, the leakage is minimized considering
gate sizes, gate lengths, and threshold voltage.
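The notion of non-dominated solutions in the power-delay space can be sketched with a simple filter. The example uses deterministic (delay, power) points with hypothetical values; the probabilistic description of [93] is not reproduced here.

```python
def pareto_front(points):
    """Return the (delay, power) points that are not dominated by any other
    point, where lower is better in both dimensions."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical design points: (delay, power) in arbitrary units.
candidates = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 2.5)]
front = pareto_front(candidates)
print(front)
```

In this set, (3.0, 6.0) is dominated by (2.0, 5.0) and (5.0, 2.5) by (4.0, 2.0), so three candidates remain on the front.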
The development of circuit-level statistical models with a high technology accuracy is
presented in chapter 4. The method employed in this thesis relies on the yield optimiza-
tion with respect to a custom cost function of the delay, dynamic power, and leakage.
2.5.3 Transistor-Level Models
A key factor to achieving good accuracy during the statistical analysis is to rely on accu-
rate models down to the transistor level. A statistical method for device characterization
using discrete probability propagation has been proposed in [157], which uses a sim-
plified connectivity graph-based transistor model. Only a few process parameters are
captured by this model and there is no systematic definition of the required statistical
operators to process the parameter distributions. A statistical model for the leakage of
double-gate MOSFETs has been introduced in [11], which depends only on gate length
and body thickness variations. Moreover, the leakage is expressed using a Taylor series
expansion, for which only the mean and variance are computed, and is approximated
with a Gaussian distribution, which strongly limits the accuracy. An extrapolated tran-
sistor model from a set of data points has been proposed in [40] for fast statistical circuit
simulations. This modeling approach is rather empirical and does not capture all pro-
cess parameters: only variations in Leff and Vth are captured. Recently, a very simple
transistor-level model has been presented in [47] and models the gate length as:
\[
L_{jk} = \mu_{L_{jk}} + a_{jk} X_1 + b_{jk} X_2 \tag{2.46}
\]
where Ljk is the gate length of transistor j from standard cell k, µLjk is the mean value, and
ajk, bjk represent the PCA coefficients of the two principal components X1 and X2. The
scalability and accuracy of this model are however very limited, as only the gate length
is statistically modeled and it relies on only two principal components for the variation
(which are described by standard normal random variables).
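The statistics implied by Eq. (2.46) follow directly from the coefficients, assuming that X1 and X2 are independent standard normal variables. The sketch below uses hypothetical coefficient values and function names.

```python
import math

def gate_length_stats(mu, a, b):
    """Mean and standard deviation of L = mu + a*X1 + b*X2 (Eq. 2.46)
    for independent standard normal X1 and X2."""
    return mu, math.sqrt(a * a + b * b)

def gate_length_cov(coef_1, coef_2):
    """Covariance of two gate lengths that share X1 and X2."""
    (a1, b1), (a2, b2) = coef_1, coef_2
    return a1 * a2 + b1 * b2

# Hypothetical coefficients for two transistors (all values in nm).
mu, sigma = gate_length_stats(45.0, 1.2, 0.5)
cov = gate_length_cov((1.2, 0.5), (1.2, -0.5))
corr = cov / (sigma * sigma)  # both transistors have the same sigma here

print(f"sigma = {sigma:.2f} nm, correlation = {corr:.3f}")
```

With only two components, every gate length on the chip is a combination of the same two random variables, which illustrates the limited expressiveness noted above.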
For good accuracy with the manufacturing process a more detailed, physically accu-
rate, transistor model is required, which can embed variations in possibly all existing pro-
cess parameters. As discussed in Sec. 2.4.2, many statistical analysis approaches rely on
simple or empirical transistor models, such as the relatively-old alpha-power law MOSFET
model [139,28]. In contrast, this thesis introduces a fully-statistical detailed transistor
model based on BSIM4 equations and embedding pdf descriptions for all process param-
eters employed by state-of-the-art technologies.
2.6 Optimization Resources at the Circuit Level
Until now only optimizations concerning the task partitioning, assignment, and schedul-
ing have been addressed, which achieve improvements in delay and power consumption
through architectural choices and system-level design decisions. Nevertheless, once an
architecture is chosen and the tasks are mapped and scheduled on the corresponding re-
sources, the communication activity between the cores still has a strong influence on the
overall latency and power dissipation. Thus, it becomes very attractive to investigate fur-
ther optimization options at the level of communication segments. Especially the choice
of a particular signaling circuit, followed by circuit-level optimization techniques, such
as supply voltage scaling and body biasing, would have a substantial impact on speed
and power. The effects of such optimization measures are discussed in this section.
2.6.1 Choice of Signaling
Communication performance is partly determined by the segment attributes, such as
width, length, and frequency, but it also depends strongly on the signaling method and
on the particular transceiver circuit which is connected to the bus. A strong call for al-
ternative designs in on-chip global interconnects has been recently issued by the ITRS
roadmap [81], to reduce delay and power consumption. The first suggested approach is
to use different signaling methods, which would include both signal design and signal
coding techniques.
Fig. 2.17: Current/voltage mode repeater (after [17]). Labels in the schematic: IN, OUT, VCTRL, duty cycle adjust, Hybrid Current/Voltage-Mode Repeater, Variable Output Resistance Driver, control bits A0, A1, A2, and transistor sizes between 1x and 4x.

A series of novel signaling methods have been proposed in the literature [12,17,37,52,
87,161] which target the reduction of delay and power dissipation in on-chip global
interconnects. In [17], a hybrid current/voltage-mode signaling is proposed, which minimizes
delay and static power consumption, being also compatible with repeater-insertion meth-
ods. The hybrid repeater circuit employed for this signaling method is drawn in Fig. 2.17
and consists of an amplifier with variable input resistance connected to a variable output
resistance driver. When the control voltage VCTRL is “0”, the feedback transmission gate is
opened and the driver operates in voltage mode. In this configuration, the input amplifier
acts as a self-biased inverter. When VCTRL switches to “1”, the amplifier acts as a resis-
tive termination for the communication line and the signaling occurs in current-mode.
The three bits A0, A1, and A2 control the output resistance of the driver by switching the
parallel transistors. The additional transistors connected to the internal node are used
to control the duty cycle of the communication and are typically biased at Vdd/2 for a
symmetrical duty cycle.
Voltage-mode and current-mode transmitters are similar in principle and typically
switch a low-impedance output connected to the interconnect segment, as shown in the
equivalent representation from Fig. 2.18(a). In current-mode signaling, one of the out-
put transistors switches a static current path from the receiver to Vdd or ground. On the
other hand, at the receiver side, a voltage-mode circuit exhibits a high input capacitive
impedance, as illustrated symbolically in Fig. 2.18(b), whereas a current-mode receiver is
characterized by a low-impedance input node, as shown by the circuit from Fig. 2.18(c).
In the case of a current-mode connection, assuming a pull-down switch at the transmit-
ter, a static current Icm is switched between Vdd at the receiver side and the ground at the
transmitter, as illustrated in Fig. 2.18(d).
Fig. 2.18: Voltage and current sensing circuits: (a) hybrid-mode transmitter, (b) voltage-mode
receiver, and (c) current-mode receiver (after [18]). Pull-down signaling path in current-mode (d).
Labels in the schematic: CL, ZL, Vout, Iout, Icm, communication segment.

Signaling in current mode improves the delay and hence the bandwidth of a communication
segment by switching a low-impedance (resistive) sensing circuit at the receiver
end and allowing a high static current to flow through the interconnect line. Current
sensing techniques have been proven to effectively reduce propagation times in long
interconnect lines [18,16] at the expense of a high static power dissipation. Dynamic power
levels are, nevertheless, relatively low, due to the reduced voltage swings involved in
current switching. As a result, hybrid-mode signaling schemes such as the one used in [17]
achieve a good trade-off by using fast current-mode signaling at high data activities, for
maximum bandwidth, while switching to voltage-mode signaling at lower traffic rates,
to reduce static power consumption.
Current-mode signaling methods have been employed recently also in differential
schemes, such as for sending three data channels over two pairs of transmission lines [44],
achieving a 37% increase in data rate over pure differential signaling. Two of the three
channels are switched as half-swing differential signals, each over a pair of lines, hence
reducing the occupied dynamic switching range in each channel and the total power
consumption. The third channel could be thus inserted as a half-swing complementary
common-mode signal over the two pairs of transmission lines used by the first two channels.
The use of current-mode drivers has the benefit of an increased data rate, but also
contributes to the overall increase of 33% in power consumption.
A pulsed current-mode signaling is presented in [87] which exploits the LC behavior
of interconnects at high frequencies to achieve a near speed-of-light propagation
through a repeaterless link. The transmission of high-frequency, reduced-swing current
pulses maximizes the effect of wire inductance, so that the on-chip interconnect is operated
as a transmission line with reduced dispersion. It is shown in [87] that full-rail optimally
repeated RC lines exhibit a high bit energy consumption and are at least three times
slower than the speed-of-light propagation. In contrast, the reduced-swing current-mode
operation allows for repeaterless global interconnects and the sharp current-pulse
transmission achieves a near speed-of-light latency at low bit energies. A similar approach
based on the same principle has been published in [37] and performs a PSK modulation
with a high-frequency carrier for minimizing the delay, albeit at the cost of higher
energy consumption.
Low-voltage swing signaling schemes have been widely investigated for saving power
in interconnects. An analytic model for computing the optimum swing for minimum
power consumption has been introduced in [152] and shows potential power reductions
by a factor of 3 to 8 for voltage swings between 60 and 120 mV. The use of low-swing
signaling has been demonstrated in a Pentium 4 [52], showing lower power dissipation
and permitting the use of thin interconnect wires for reduced coupling capacitance and
achieving compact layouts. Typically, low-swing voltage-mode signaling reduces power
at the cost of delay increase, whereas current-mode signaling works fast, but increases
static power dissipation. Nevertheless, in [161], a differential current-mode signaling has
been introduced for improving both delay and power consumption. Hereby, control sig-
nals are employed to keep the power consumption to a minimum and allow the static
current to flow only for a fraction of the cycle.
Bringing the benefits of the different signaling methods into the optimization of on-
chip communication requires the ability to model and integrate the different signaling
circuits into the system design framework. Chapter 4 presents the contributions of this
thesis to this area, including a complete methodology to accurately model the delay
and power of signaling circuits considering parameter variations. The optimization
framework is then able to select the optimum circuit from a library of models for each
communication segment.
2.6.2 Voltage Scaling
The aggressive downscaling of supply voltages is one of the most effective solutions to
reduce dynamic power consumption in digital circuits. In fact, Vdd scaling while keeping
a fixed threshold voltage brings a quadratic reduction in dynamic energy at the cost of
decreased circuit performance. The limits of voltage scaling and the behavior of circuits
at very low supply voltages, in the subthreshold regime, have been investigated in [73].
These results are useful for predicting the achievable power improvement. An algorithm
for computing the required voltage scaling for executing tasks with timing slacks has been
introduced in [13], employing an architectural model of the application. Here, the focus
lies on the distinction between static (off-line) and dynamic (runtime) voltage scaling and
on the minimization of the runtime scaling overhead. An adaptive voltage scaling at the
core level in multiprocessor chips has been proposed in [129] to compensate for asym-
metries in power and performance due to process variations. All the aforementioned
approaches investigate voltage scaling at the core level and do not specifically consider
the power loss due to inter-core communication.
A series of relatively-new methods propose the combination of voltage scaling with
bus architecture synthesis [126, 127, 125, 124]. In [126] and [127], a system-level model for
the delay and power consumption of the communication tasks is employed, expressed by:
\[
T_c = \kappa\,\frac{V_i}{(V_i - V_{th})^{\alpha}} \tag{2.47}
\]
\[
E_c = \alpha_\tau \cdot C_{eff} \cdot V_i^2 \cdot T_c \tag{2.48}
\]
where κ is a technology constant, Vi is the supply voltage, Vth is the threshold voltage,
α is the velocity saturation index, ατ represents the switching activity, and Ceff is the
effective switched capacitance during data communication. Further, [125] and [124]
extend this method with a statistical
analysis based on a first-order sensitivity dependence.
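The trade-off expressed by Eqs. (2.47) and (2.48) can be evaluated numerically. The sketch below transcribes both equations as printed; the technology constant, threshold voltage, exponent, activity, and capacitance are hypothetical placeholders, so only the relative trends are meaningful.

```python
def comm_delay(v_i, v_th, alpha, kappa):
    """Communication delay T_c following Eq. (2.47)."""
    if v_i <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return kappa * v_i / (v_i - v_th) ** alpha

def comm_energy(v_i, v_th, alpha, kappa, activity, c_eff):
    """Communication energy E_c following Eq. (2.48) as printed."""
    return activity * c_eff * v_i ** 2 * comm_delay(v_i, v_th, alpha, kappa)

# Hypothetical technology and workload parameters (normalized units).
params = dict(v_th=0.3, alpha=1.3, kappa=1.0)
workload = dict(activity=0.5, c_eff=1.0)

t_nom = comm_delay(1.0, **params)
t_low = comm_delay(0.7, **params)
e_nom = comm_energy(1.0, **params, **workload)
e_low = comm_energy(0.7, **params, **workload)

print(f"delay ratio  (0.7 V / 1.0 V): {t_low / t_nom:.2f}")
print(f"energy ratio (0.7 V / 1.0 V): {e_low / e_nom:.2f}")
```

For these placeholder values, lowering the supply voltage increases the delay while reducing the energy, which is the trade-off that the scaling methods above exploit.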
Two fundamental observations can be made. First, most of the published methods
focus on voltage scaling at the core level and do not consider the energy benefits of
applying the scaling also to the inter-core communication infrastructure. Second,
the voltage scaling approaches for the communication structure use architectural-level
models rather than accurate circuit-level estimations. It can be seen that voltage scaling
can bring important gains in power if applied to the communication circuits. Nevertheless,
for a successful application during the synthesis of the communication architecture,
accurate circuit-level models are required, which reflect the bus structure, length, type of
signaling, and the manufacturing process. Sec. 4.2.4 and 4.3.3 present the contributions
of this thesis to the integration of voltage scaling techniques with accurate circuit-level
models of the communication structures.
2.6.3 Body Biasing
Body biasing is employed to change the threshold voltage of CMOS transistors by apply-
ing a voltage between the substrate and the source. In the case of forward body biasing
(FBB), the body-to-source junction is forward biased, and the threshold voltage decreases.
FBB achieves an increase in speed, but also in leakage. The opposite holds for reverse
body biasing (RBB), where the bulk junction is reverse biased, achieving a higher Vth and
hence a lower leakage at the cost of reduced transistor performance.
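The direction of the threshold voltage shift can be illustrated with the textbook long-channel body-effect expression for an NMOS transistor, V_th = V_th0 + γ(√(2φ_F − V_bs) − √(2φ_F)). This expression is a common first-order approximation and not the model used in this thesis; the default parameter values below are hypothetical.

```python
import math

def threshold_voltage(v_bs, v_th0=0.4, gamma=0.3, two_phi_f=0.8):
    """First-order NMOS threshold voltage under a body-to-source bias v_bs.

    v_th0: zero-bias threshold voltage [V] (placeholder value)
    gamma: body-effect coefficient [V^0.5] (placeholder value)
    two_phi_f: twice the Fermi potential [V] (placeholder value)
    """
    if v_bs >= two_phi_f:
        raise ValueError("body bias outside the validity range of the expression")
    return v_th0 + gamma * (math.sqrt(two_phi_f - v_bs) - math.sqrt(two_phi_f))

v_fbb = threshold_voltage(+0.3)   # forward body bias lowers V_th
v_zero = threshold_voltage(0.0)   # no body bias
v_rbb = threshold_voltage(-0.3)   # reverse body bias raises V_th

print(f"FBB: {v_fbb:.3f} V, zero bias: {v_zero:.3f} V, RBB: {v_rbb:.3f} V")
```

The sketch only reproduces the sign of the shift; short-channel effects, which matter in scaled technologies, are not covered by this expression.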
Either a single chip-wide body bias, or many local body biases for different regions
can be applied. Note that a separate n-well is required for each PMOS body
biasing region. Similarly, for multiple NMOS body bias regions, a triple-well process
is needed for the separation of bulks (see Fig. 2.19). Further, the body bias value can be
either fixed or dynamically adjusted at run-time. A fixed body bias is typically chosen at
design or test time to optimize a selected performance metric and remains constant
over the lifetime of the chip. In contrast, the method which dynamically adjusts the
body bias, also called adaptive body biasing (ABB), modulates the substrate voltage at
run-time to achieve a tradeoff between performance and leakage power consumption.
Fig. 2.19: Body biasing of NMOS and PMOS transistors in a triple-well process. Labels in the cross-section: Vbs,p, Vbs,n, p+, n+, N-well, P-well, N-isolation.

Dynamic power consumption and leakage power consumption optimizations are obtained
in [170] with a combination of adaptive body biasing and dynamic voltage scaling.
Based on an analytic energy consumption expression derived from the alpha-power law
transistor model, the optimal body bias values are evaluated at every given frequency
and adjusted at run-time to achieve a tradeoff between optimal energy consumption and
clock speed. Energy-efficient architectural techniques including adaptive body biasing
are discussed in [73] to reveal the tradeoff between leakage and performance implied by
adjusting the threshold voltage. Within this context, it is pointed out that reverse body
biasing has a worsening effect with respect to the short channel effects, which becomes
more significant with technology scaling.
Another typical use of ABB is to reduce the influence of process parameter variations
[42, 41]. For instance, an ABB method for increasing frequency and reducing leakage
in digital chips affected by variations has been implemented in [159] using a circuit for
critical path monitoring and applying the optimal body bias for each digital cell. The
authors of [119] approach the reduction of dynamic power, leakage power, and increase in
the speed of digital circuits by means of process variations estimation circuits and
demonstrate a digital on-chip controller for selecting an a priori stored body bias value. A
forward body bias is used in [110] for reducing delay variations in digital circuits and an
automatic body biasing architecture for minimizing the active power consumption has
been published in [90]. The approach in [117] employs a fixed forward body bias after
performing external measurements to identify parameter variations. The method pro-
posed in [154] mitigates the effects of variability by testing several supply voltages with a
body bias controller which adjusts the body bias to meet the frequency requirement, fol-
lowed by the selection of the supply voltage value which achieved the minimum power
consumption. The use of adaptive body biasing in chip multiprocessors to improve sym-
metry by speeding up slower cores has been examined in [79]. Finally, the limitations of
ABB methods for addressing process variability are discussed in [26], which also points
out the importance of delaying the body biasing step until test time, where a leakage
measurement can evaluate precisely the influence of parameter variations.
The broad usage of body biasing techniques in digital circuits reveals a significant
potential of this method to reduce leakage power consumption where performance drops
are acceptable. Thus, a complex on-chip communication infrastructure, which dissipates
large amounts of leakage power and can locally afford lower speeds due to the inherent
slacks, represents a very promising candidate for body biasing. Since body biasing
methods have so far focused on optimizations at the core level, one of the main goals
of this thesis is to investigate the influence of body biasing on the communication circuits
and to integrate this method into the optimization framework. Sec. 4.2.4 and 4.3.3 present
the application of body biasing on the circuit-level communication models, while the
results are presented and discussed in Sec. 4.4.6.
2.7 Summary
Based on the observation that a significant share of the performance and serious challenges in
MPSoC designs are dictated by the inter-module communication, this chapter discussed
the most important aspects which must be considered in the design of on-chip
communication architectures. It has been shown that the design space exploration must rely on
efficient abstractions of the application flow combined with accurate performance macro-
models to achieve good estimations of the actual design performance. In this context,
the concepts of delay and power macromodels have been detailed and several model-
ing approaches have been discussed. Moreover, the importance of statistical modeling
combined with process accuracy of performance estimations has been emphasized in the
context of complex state-of-the-art manufacturing technologies. It has also been pointed
out that task scheduling has a significant influence on system latency and total energy
consumption, and several scheduling approaches have been discussed.
One of the paramount challenges in system and communication design is represented
by the increasing parameter variations. Several statistical methods to analyze and model
parameter variations have been examined and their drawbacks have been indicated. The
most important challenge in this context is to accurately represent the parameter distri-
butions and to model the dependence of performance models on parameter variations
by accurate propagation of the distributions. Moreover, technology accuracy plays a key
role in the context of both accurate performance models and statistical analysis and the
shortcomings of modeling approximations and of the underlying transistor-level models
have been highlighted. Finally, the additional optimization resources for communication
at the circuit level have been illustrated, including the choice of signaling method,
voltage scaling, and body biasing. In addition, the contributions of this thesis to each of
the important shortcomings and challenges in communication synthesis have been
mentioned.
Chapter 3
Variability-Aware Performance
Macromodels
Contents
3.1 Application and Architectural Profile
3.1.1 Extraction of the Application Profile
3.1.2 Architecture and Technology Specification
3.1.3 Variability Description
3.2 Random Variable Model
3.2.1 Employed Standard Distributions
3.2.2 Discretized pdf Model
3.2.3 Typical Usage and Accuracy Control
3.2.4 Sampling Technique for Discretized pdfs
3.3 Method for the Propagation of Distributions
3.3.1 Statistical Sum and Maximum Operators
3.3.2 Statistical Difference Operator
3.3.3 Statistical Product Operator
3.3.4 Numerical Implementation of other Statistical Operators
3.3.5 Handling Correlations
3.3.6 Random Variable Algebra
3.4 Embedding Technique for Random Variables
3.4.1 Variability Sources and RV Leaf Nodes
3.4.2 Variability Propagation and Estimation of Results
3.4.3 Changes and Updates Propagated Downstream
3.4.4 Result Interpretation
3.5 Performance Macromodels for Delay Estimation
3.5.1 Structure and Properties
3.5.2 Application Examples
3.6 Performance Macromodels for Energy Consumption
3.6.1 Dynamic Energy Macromodels
3.6.2 Leakage Energy Macromodels
3.6.3 Application Examples
3.7 Partitioning, Assignment, and Scheduling Optimization
3.7.1 Methods for Solution Space Exploration
3.7.2 Cost Function Evaluation
3.7.3 Optimization Loop
3.7.4 Optimization Results
3.8 Summary
An important objective of this thesis is to provide an integrated framework for the
parametrized joint optimization of delay and power at the system level during the com-
munication architecture synthesis. To achieve this goal, the developed methodology re-
quires an accurate description of the target MPSoC architecture and of the running appli-
cations, considering parameter variations in execution times, data flow, power consump-
tion, and lower-level process parameters. This chapter presents a unified methodology
for describing the synthesis-relevant parameters of the application profile, architectural
details, and manufacturing process.
Accurate estimations of the parametric yield rely on the performance of the underlying
statistical analysis. For this reason, the parameter variations must be modeled using
adequate random variable representations. In addition to commonly used Gaussian
variation models, customized parameter distributions resulting from measurements and
profiling analyses must be implemented. A generalized representation of both standard
and custom, discretized distributions is implemented and presented in this chapter.
Performance macromodels require the composition of several underlying variable
parameters in complex analytic or numerical expressions. The propagation of statistical
distributions from the parameter level up to the performance model therefore depends
on several algebraic compositions. The method developed in this thesis propagates
the entire representation of statistical distributions during the algebraic operations on
random variables. This chapter describes the propagation method for pdfs and the set of
statistical operators developed for the operations on random variables.
The random variable representations and their algebraic composition using statistical
operators are employed to develop statistical performance macromodels for delay and
power. Within this context, the interconnection between MPSoC architectural resources
and the abstract representation of the models is discussed. The structures of the
developed macromodels are presented and the variability representation inside each node is
illustrated. Design decisions, such as reassigning a processing
node to a different resource or changing the scheduling list of a resource, are reflected
by important changes in the macromodels. The required changes and a method to update
the affected distributions are presented.
[Fig. 3.1: Description of timing and dynamic power values depending on e.g. resource mapping and parameter variations. The figure relates the application (processing nodes PN1 to PN4, communication nodes CN1 to CN4) to the architecture (processing resources R1, R2; signaling resources S1, S2) and annotates the execution time and dynamic power Pd,i of a node, subject to parameter variations described by a pdf pX(x).]
Variability-aware representations in the form of random variable performance mod-
els have a direct impact on the system-level design space exploration. Design decisions
such as task mapping and scheduling have to be driven by the evaluation of performance
metrics. For this, a method to build a cost function from a set of statistical distributions
representing delay, dynamic power, and leakage power must be found. Such particulari-
ties are discussed here as well.
This chapter is organized as follows. Sec. 3.1 presents a method for the extraction and
specification of application and architectural profiles with respect to variable parameters.
In Sec. 3.2, a generalized random variable model is developed, based on a discretized rep-
resentation of statistical distributions with adjustable accuracy. Further, Sec. 3.3 presents
the development of a propagation technique for pdfs implemented using analytic and
numeric statistical operators. The embedding of random variable models in the synthesis
framework is discussed in Sec. 3.4. Afterwards, the performance macromodels for delay
and energy consumption are presented in Sec. 3.5 and 3.6, respectively. Finally, the global
optimization of resource allocation and scheduling is analyzed in Sec. 3.7.
3.1 Application and Architectural Profile
Application profiles are required for the extraction of processing and communication
tasks and are used as the input to the synthesis algorithms. Here, a detailed description
regarding variability is of key concern. As an example, the same task may exhibit
different execution times on different cores and, in addition, the execution times are
functions of several parameter variations. Further, dynamic and leakage power consumption
also depend on the processing resource and exhibit variations which must be
described in an appropriate way. Fig. 3.1 illustrates this concept: a particular mapping
of processing and communication nodes onto the architectural resources results in
different values of the execution times and dynamic power dissipations. In addition, the
influence of parameter variations results in the need for a statistical description of these
metrics. Furthermore, data dependencies and communication needs are to be extracted
and expressed in a variable form, depending on the workload changes of the running
applications.
Besides application-related details, an accurate description of the hardware platform
must be provided, in the form of a resource set. Within this context, processing el-
ements, memory units, interfaces, controllers, and other application-specific hardware
blocks must be included.
This thesis provides a unified application interface to describe task details, data de-
pendencies, timing, power, and data flow parameters, complete hardware resource sets,
task-resource compatibility associations, and parametric yield constraints in a consistent
form. Variable parameters are specified using the most appropriate distribution from a
set of given distribution types, as well as in the form of a non-standard discretized prob-
ability density function (pdf) description.
3.1.1 Extraction of the Application Profile
One of the critical tasks for the automated synthesis of communication structures is de-
veloping a system profile at a high level of abstraction and extracting the on-chip com-
munication events. This work considers embedded systems, realized as MPSoCs, which
are suitable for applications with variable data flows. Within this context, an application
profile consists of the architecture definition, a detailed data flow graph, execution costs,
and the design constraints. A schematic representation of the profiling steps is depicted
in Fig. 3.2.
The first application description is a behavioral model, written in a high-level lan-
guage (such as C++), which enables the flexible division into processing tasks, as well as
further refinements of the granularity (e.g. separate load/store operations with memory
units etc.). At this point, the inter-task communication loads can be estimated by using ar-
rays of equally-sized block structures which encapsulate the data transferred in the func-
tion calls between the processing tasks. The amount and frequency of the data transfers
can be tracked down automatically using a language-specific tool (e.g. the GNU profiler
gprof), followed by an estimation of the variability parameters. Here, the most appropriate
variability model is selected from a set of random variable descriptions adapted
to the profile, ranging from standard distributions to estimated distribution functions,
mostly in the form of discretized pdfs (see also Sec. 3.2).
[Fig. 3.2: Schematic workflow showing the application profiling steps. Behavioral profiling: application requirements, modeling language (e.g. C/C++), application model, granularity, processing tasks, data dependencies, profiling tool (e.g. GNU gprof), execution and communication profiling, data flow variability model, communication loads. Execution and power estimation: MPSoC resource set (number and type of IP cores), task-resource compatibility, target technology, process variations, HDL (SystemC) resource-dependent model simulation, task execution time, core size, leakage power, dynamic power, variability distribution set. Design constraints: chip area, max. latency, power budget, min. yield level.]
Granularity is an important factor in the modeling accuracy of processing tasks and
has a strong influence on the possibilities of partitioning the application onto the
available architectural resources. At the same time, the complexity of the design optimization is
strongly affected by the choice of task granularity. Due to the main focus on communication
synthesis, this thesis sets the granularity of processing nodes at the task level. Finer
granularities, such as instruction-level or operation-level nodes, would result in a finer
modeling of the processing nodes at the cost of a significantly higher complexity. Since the
instructions and operations of a task are usually executed on the same core, the increase in
modeling overhead does not improve the representation of communication nodes; a finer
granularity is therefore not justified in this case.
The information obtained from the behavioral modeling and profiling activities con-
tains the number of processing tasks, their data dependencies (links between the tasks)
and the inter-task communication loads, specified as statistical distributions. This infor-
mation serves as input to the communication synthesis framework and is specified in
an application description INI file. The section of the INI file describing the relationship
between the processing tasks (processing nodes) includes the following:
• $N_{PN}$, the number of PNs
• $PN_{i,\mathrm{name}}$ = {name}, the name of each PN
• $PN_{i,dd}$ : $PN_j, PN_k, \ldots$, the data dependencies of each PN (children links)
• $L_c^{i,j}$ = {distribution}, the communication load between $PN_i$ and $PN_j$, expressed as a
random variable distribution (predefined distribution type or discretized pdf)
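As an illustration only, a processing-node section of such a file could look like the fragment embedded in the sketch below, read here with Python's standard `configparser` module. The section name, the key names (`n_pn`, `pn1.name`, `pn1.dd`, `load.1.2`), the node names, and the load values are hypothetical; the literal INI syntax of the framework is not given in the text.

```python
import configparser

# Hypothetical INI fragment: section and key names are illustrative
# assumptions, not the syntax used by the thesis framework.
INI_TEXT = """
[processing_nodes]
n_pn = 3
pn1.name = fetch
pn1.dd = pn2, pn3
pn2.name = filter
pn2.dd = pn3
pn3.name = store
pn3.dd =
load.1.2 = U(12, 18)
load.1.3 = N(64, 8)
load.2.3 = U(10, 14)
"""


def read_profile(text):
    """Return (names, children, loads) parsed from the hypothetical INI text."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    sec = cfg["processing_nodes"]
    n_pn = sec.getint("n_pn")
    names, children, loads = {}, {}, {}
    for i in range(1, n_pn + 1):
        names[i] = sec[f"pn{i}.name"]
        # Data dependencies: comma-separated list of child nodes ("pn2, pn3").
        deps = [d.strip() for d in sec[f"pn{i}.dd"].split(",") if d.strip()]
        children[i] = [int(d[2:]) for d in deps]
    for key, value in sec.items():
        if key.startswith("load."):
            _, src, dst = key.split(".")
            # The distribution is kept as text; parsing it is a separate step.
            loads[(int(src), int(dst))] = value
    return names, children, loads


names, children, loads = read_profile(INI_TEXT)
```

Keeping the distribution as an unparsed string at this stage separates the graph structure (nodes and children links) from the statistical description of each load.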
3.1.2 Architecture and Technology Specification
The choice of a target architecture and technology depends strongly on the application
constraints, such as latency, area, power budget, and also the design and manufacturing
costs. Typically, the manufacturing costs and performance levels (latency, power budget)
determine the choice of a particular technology node, but also the availability of several
intellectual property (IP) modules for the given technology is to be considered. Next, the
set of available IP modules for the design is collected into a resource set, specified by the
number and type of resources. In combination with the selected technology, the leakage
power of each resource is estimated. If no leakage information is provided with the IP
library, a leakage simulation of each resource is required. Depending on the desired level
of accuracy, this step may involve either synthesizing the resources and performing gate-level
power simulations, or employing a higher-level model written in a hardware description language
(HDL) together with technology-dependent power estimations. The estimated
leakage power, including process and environmental variations, is then described
statistically, using the set of available random variable distribution types.
After specifying the resource set, a task-resource compatibility graph is defined to
specify allowed mappings during the synthesis. Given the compatibility information,
performance attributes such as execution time and dynamic power consumption are esti-
mated and specified in the form of random variable distributions. A resource-dependent
simulation using HDL models enables a first-order approximation of the execution time
and power consumption levels of each processing task on the compatible resources. Due
to process and environmental parameter variations, both execution time and power con-
sumption are expressed as statistical distributions.
The information describing the target architecture is also stored in the configuration
file and includes:
• $N_{RT}$, the number of available resource types (processors, ASIC, memory, interfaces
etc.)
• $RT_{i,\mathrm{name}}$ = {name}, the name of each resource type
• $N_{RT_i}$, the number of available resources of type $RT_i$
• $P_{l,RT_i}$ = {distribution}, the leakage power consumption of resource type $RT_i$
• $PN_{i,co}$ : $RT_j, RT_k, \ldots$, the compatible resource types for executing $PN_i$
• $T^{x}_{i,RT_j}$ = {distribution}, the execution time of $PN_i$ on the compatible resource type
$RT_j$
• $P^{RT_j}_{d,i}$ = {distribution}, the dynamic power consumption for executing $PN_i$ on the
compatible resource type $RT_j$
Communication transfers between the resources are strongly dependent on task allo-
cation and scheduling, as well as on the transceiver type and bus attributes, and are there-
fore estimated during the synthesis using accurate models. Further, process-dependent
parameters and parameter variations are specified in the input configuration files con-
sidering die area and spatial intra-die variability. First, the NMOS and PMOS transistor
parameters are specified at their nominal value, according to the BSIM4 description for
the given technology. Next, the die area is divided into a custom-sized grid and the posi-
tion of each resource in the grid is specified. Floorplanning of the IP resources is outside
the scope of this thesis, therefore the positions of resources in the grid are considered
given. After that, the process parameters which exhibit deviations are specified using the
distribution models and by specifying a spatial correlation behavior (see Sec. 4.1.2). Fi-
nally, design constraints such as area, latency, power budget, and the required yield level
can be directly specified in the profile.
3.1.3 Variability Description
As discussed in Sec. 3.1.1 and 3.1.2, the application and architecture description contains
several variable quantities, including communication loads, leakage and dynamic power
consumptions, execution times, and process parameter variations. These variable quan-
tities must be specified in the input configuration files, therefore a unified statistical de-
scription is required.
Sec. 3.2 will describe in detail the developed random variable models employed in this
work, which represent the set of available statistical descriptions for the variable quanti-
ties. The developed models include standard distributions, such as Gaussian, uniform,
lognormal, Cauchy, Pareto, Weibull, but also custom pdf representations of any estimated
distribution, specified in discretized form. For instance, the leakage power of resource
type RTi can be expressed as a Gaussian distribution in the INI file in the form of:
$P_{l,RT_i} = \mathcal{N}(\texttt{14.e-3}, \texttt{5.e-3})\ [\mathrm{W}]$
$P^{\sigma_{rel}}_{l,RT_i} = 3$
which specifies a normal distribution of mean µ = 14 mW and standard deviation σ = 5 mW.
The $P^{\sigma_{rel}}_{l,RT_i}$ factor indicates the relevant sigma domain considered for approximating the
spread of the distribution, which is set in this case to three sigma (the minimum and
maximum leakage can be approximated with µ − 3σ and µ + 3σ, respectively). Better accuracy
can be achieved by setting the σrel domain to higher values, e.g. 4σ or 6σ.
Another convenient way for expressing multiple values which are in the same domain
is to specify a common order of magnitude. Assuming several execution times in the
nanosecond domain, expressed e.g. as uniform distributions between their minimum and
maximum value, they can be specified as:
$T^{x}_{1,RT_j} = \mathcal{U}(12, 18)$
$T^{x}_{2,RT_j} = \mathcal{U}(10, 14)$
$T^{x}_{3,RT_j} = \mathcal{U}(7, 12)$
· · ·
$T^{OM}_{RT_j} = \texttt{1.e-9}\ [\mathrm{s}]$
The common order of magnitude (OM) of 1 ns indicates that all the above specified timing
values are expressed in nanoseconds.
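The two conventions above (a relevant sigma domain for Gaussian entries and a common order of magnitude for grouped values) can be sketched as follows. This is a minimal illustration under the assumption that entries are written as `N(mu, sigma)` or `U(a, b)`; the function names `parse_spec` and `support` are hypothetical and do not come from the framework.

```python
import re

# Hypothetical parser for the notation used in the examples above,
# e.g. "N (14.e-3, 5.e-3)" or "U(12, 18)". The real INI grammar is not given.
SPEC = re.compile(r"^\s*([A-Z])\s*\(([^)]*)\)\s*$")


def parse_spec(text, order_of_magnitude=1.0):
    """Parse 'N(mu, sigma)' or 'U(a, b)' and scale the parameters by the common OM."""
    match = SPEC.match(text)
    if match is None:
        raise ValueError(f"unrecognized distribution: {text!r}")
    kind = match.group(1)
    params = tuple(float(p) * order_of_magnitude for p in match.group(2).split(","))
    return kind, params


def support(kind, params, sigma_rel=3.0):
    """Approximate [min, max]: mu -/+ sigma_rel * sigma for N, [a, b] for U."""
    if kind == "N":
        mu, sigma = params
        return mu - sigma_rel * sigma, mu + sigma_rel * sigma
    if kind == "U":
        return params
    raise ValueError(f"unsupported distribution type: {kind}")


# Leakage power example: N(14 mW, 5 mW) with a three-sigma domain.
leak_kind, leak_params = parse_spec("N (14.e-3, 5.e-3)")
leak_range = support(leak_kind, leak_params, sigma_rel=3.0)

# Execution time example: U(12, 18) with a common order of magnitude of 1 ns.
exec_kind, exec_params = parse_spec("U(12, 18)", order_of_magnitude=1.0e-9)
```

Applying the order of magnitude at parse time keeps all internal values in SI units, so later operators do not need to track per-group scaling.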
3.2 Random Variable Model
As discussed in the previous sections, the large majority of design values considered in
the communication synthesis framework are inherently exposed to variations, either en-
vironmental (e.g. process, thermal, and voltage) or functional (data flow variations). Due
to this non-negligible aspect, all the key parameters in the design are represented as ran-
dom variables (RVs).
There are several descriptions available to completely characterize random variables,
such as the probability density function (pdf), the cumulative distribution function (CDF),
or the characteristic function ($\Phi_X(\nu) = E\{e^{j\nu X}\}$, for operations in the symbolic
frequency domain). From these available descriptions, the pdf has been selected for internal
representations of random variables in the framework, as it relates directly to the
experimental distribution of an estimated random variable and can therefore be seamlessly
approximated with the histogram of sampled RV values.
Standard distribution functions, such as Gaussian, lognormal, or uniform, are imple-
mented using the exact pdf analytic expression. Other distribution types, particularly
custom non-normal distributions, or the results of nonlinear operations on random vari-
ables, are implemented using a discretized model with variable accuracy.
[Fig. 3.3: Analytic and sampled pdfs for several standard distributions (Gaussian, exponential power, Cauchy, Rayleigh, lognormal). Horizontal axis: x from 0 to 5; vertical axis: pX(x) from 0 to 0.9.]
3.2.1 Employed Standard Distributions
The synthesis framework implements a limited subset of standard random distributions
for representing parameter variations which can be accurately modeled by well-known
distribution laws. Examples include:
• Gaussian distribution N (µ, σ)
• Exponential power distribution E (a, b)
• Cauchy (Lorentz) distribution C (a)
• Rayleigh distribution R (σ)
• Uniform distribution U (a, b)
• Lognormal distribution L (ζ, σ)
Internally, the corresponding random variables are characterized only by the parameters
of the known distribution law. After storing the distribution type and parameters,
samples from each distribution type are obtained with very good accuracy using the functions
from the GNU Scientific Library (GSL) [65]:
• gsl_ran_gaussian (Rg, σ) for Gaussian distributions
• gsl_ran_exppow (Rg, a, b) for exponential power distributions
• gsl_ran_cauchy (Rg, a) for Cauchy distributions
• gsl_ran_rayleigh (Rg, σ) for Rayleigh distributions
• gsl_ran_flat (Rg, a, b) for uniform distributions
• gsl_ran_lognormal (Rg, ζ, σ) for lognormal distributions
where Rg is the selected random generator type. Figure 3.3 illustrates the accuracy of
the sampled values in a comparison between the analytic expression of the pdf and the
distribution of samples generated for a Gaussian N (2, 0.5), exponential power E (1, 3),
Cauchy C (1.5), Rayleigh R (1.5), and lognormal L (0, 1) random variable. Each sampled
pdf has been extracted from 100 000 generated samples.
[Fig. 3.4: Discretized pdf over Nb bins. The support between xmin and xmax is divided into Nb bins; bin i has area Abi.]
The quality of the generated samples can be selected by choosing from a series of
different random number generators, including e.g. the following [65]:
• taus, Tausworthe generator (870k doubles/second)
• ranlxs0, second-generation RANLUX algorithm (571k doubles/second)
• ranlxd1, double precision output, second-generation RANLUX (254k doubles/second)
As explained in [65], these generators offer several “luxury” levels, extremely long peri-
ods, and satisfy most statistical tests. A taus generator has been employed in the exam-
ple from Fig. 3.3.
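The comparison between an analytic pdf and the histogram of generated samples can be reproduced in a small sketch. Note the substitution: the code below uses the Mersenne Twister generator from Python's standard `random` module rather than GSL and its taus generator, and the bin count of 50 and the interval [0, 4] are arbitrary choices made for this illustration.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Analytic Gaussian pdf."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sampled_pdf(samples, x_min, x_max, n_bins):
    """Normalized histogram: bin count divided by (number of samples * bin width)."""
    width = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if x_min <= s < x_max:
            # min() guards against a rounding artifact at the upper edge.
            counts[min(int((s - x_min) / width), n_bins - 1)] += 1
    return [c / (len(samples) * width) for c in counts], width


rng = random.Random(1)  # standard-library generator with a fixed seed
mu, sigma, n_samples = 2.0, 0.5, 100_000
x_min, x_max, n_bins = 0.0, 4.0, 50
samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
heights, width = sampled_pdf(samples, x_min, x_max, n_bins)

# Largest deviation between the histogram and the analytic pdf at the bin centers.
max_error = max(
    abs(h - gaussian_pdf(x_min + (i + 0.5) * width, mu, sigma))
    for i, h in enumerate(heights)
)
```

The deviation `max_error` shrinks roughly with the square root of the number of samples per bin, which is why the sample count and the bin count have to be chosen together.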
3.2.2 Discretized pdf Model
For parameter variations which cannot be described by standard distributions, a more
general estimated distribution model is developed in this work. Besides parameters with
non-standard distributions, nonlinear operations applied to standard distributions also
result in non-standard pdfs, which cannot be represented by the models discussed in the
previous section. In contrast, the model developed here is designed to fit any form of pdf
with a precision which can be adjusted through several parameters.
Considering a random variable X described by an arbitrary pdf $p_X(x)$, a discretized
function $\hat{p}_X(x)$ is defined and computed as:
$$\hat{p}_X(x) = \frac{A_{b_i}}{\Delta_i} = \frac{1}{\Delta} \int_{x_{min}+(i-1)\Delta}^{x_{min}+i\Delta} p_X(\tau)\, d\tau \qquad (3.1)$$
where $b_i$ is the discrete bin corresponding to the continuous value x, $A_{b_i}$ is the bin area,
and $\Delta_i$ is the bin width. Assuming that the support of $p_X(x)$ is distributed into
equally-spaced bins, then $\Delta_i = \Delta$, $\forall i$, and the bin area is equal to the integral from (3.1).
As shown in Fig. 3.4, the support of pX (x) is limited for practical reasons to the interval
[xmin, xmax], beyond which the values of random variable X are considered to occur
with no significant frequency for the given application. This interval is divided into Nb
bins with widths ∆i, i = 1 . . . Nb and heights equal to the integral of pX (x) over the bin
range divided by the bin width, as in (3.1). In the general case, the bin widths are not equal,
allowing for an optimization of the bin width with respect to the pdf curvature. Nevertheless,
using very dense and equally-spaced bins simplifies the analysis without a significant impact on accuracy.
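A minimal sketch of the discretization in (3.1) for equally-spaced bins is given below. It assumes that the CDF of the distribution is available, so that the integral over each bin is the difference of two CDF values; the Gaussian N(2, 0.5), the interval [0, 4], and the choice of 40 bins are illustrative assumptions.

```python
import math


def discretize(cdf, x_min, x_max, n_bins):
    """Heights per eq. (3.1) for equal-width bins: integral over the bin / bin width."""
    width = (x_max - x_min) / n_bins
    heights = []
    for i in range(1, n_bins + 1):
        lo = x_min + (i - 1) * width
        hi = x_min + i * width
        area = cdf(hi) - cdf(lo)  # integral of the pdf over bin i
        heights.append(area / width)
    return heights, width


def gaussian_cdf(x, mu=2.0, sigma=0.5):
    """CDF of N(mu, sigma), expressed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


heights, width = discretize(gaussian_cdf, 0.0, 4.0, 40)

# Probability mass captured inside [x_min, x_max]; slightly below 1 because
# the tails beyond four sigma are cut off by the limited support.
covered = sum(h * width for h in heights)
```

The value of `covered` indicates how much probability mass the chosen support discards, which is one way to judge whether the interval [xmin, xmax] is wide enough.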
Statistical moments can be estimated from the discretized pdf by replacing the con-
tinuous expressions with equivalent discrete definitions. For instance, the first-order mo-
ment (expected value) is defined starting from its continuous definition as:
$$E(X) = \int_{-\infty}^{\infty} x\, p_X(x)\, dx \qquad (3.2)$$
$$\hat{E}(X) = \sum_{i=1}^{N_b} \frac{r_{i-1} + r_i}{2}\, \Delta_i b_i \qquad (3.3)$$
Here, the value of x is approximated by the center of the bin, whereas the integral of
the pdf over the bin range is estimated by the bin area. In a similar way, the centered
second-order moment can be derived as:
$$Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2\, p_X(x)\, dx \qquad (3.4)$$
$$\widehat{Var}(X) = \sum_{i=1}^{N_b} \left( \frac{r_{i-1} + r_i}{2} - \hat{E}(X) \right)^2 \Delta_i b_i \qquad (3.5)$$
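The discrete estimators (3.3) and (3.5) translate directly into code. The sketch below assumes that the bin edges r_0 ... r_Nb and the normalized bin heights b_i are given as lists; the function name `discrete_moments` and the uniform test distribution U(12, 18) with 60 bins are illustrative choices.

```python
def discrete_moments(edges, heights):
    """Mean and variance per eqs. (3.3) and (3.5).

    edges   -- bin separation points r_0 ... r_Nb
    heights -- normalized bin heights b_1 ... b_Nb (sum of b_i * width_i equals 1)
    """
    n_bins = len(edges) - 1
    centers = [(edges[i - 1] + edges[i]) / 2.0 for i in range(1, n_bins + 1)]
    masses = [(edges[i] - edges[i - 1]) * heights[i - 1] for i in range(1, n_bins + 1)]
    mean = sum(c * m for c, m in zip(centers, masses))
    variance = sum((c - mean) ** 2 * m for c, m in zip(centers, masses))
    return mean, variance


# Uniform distribution U(12, 18) discretized into 60 equal bins of width 0.1.
n_bins = 60
edges = [12.0 + 6.0 * i / n_bins for i in range(n_bins + 1)]
heights = [1.0 / 6.0] * n_bins
mean, variance = discrete_moments(edges, heights)
```

For this uniform case the exact values are a mean of 15 and a variance of 3; the discrete variance comes out slightly smaller because every value inside a bin is replaced by the bin center, an error that decreases with the square of the bin width.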
3.2.3 Typical Usage and Accuracy Control
The customized discrete pdf model is employed to represent variable parameters in two
cases. First, process parameters exhibiting strong deviations from standard Gaussian or
other well-known distributions can be specified using this RV model. Also other per-
formance parameters estimated in the application profiling step can be well represented
using the discrete RV model. The complete specification of such parameters Pk in the
input configuration includes the following information:
• $DD(N_b)$, specifying a discrete distribution model with $N_b$ bins
• $r_i$, for $i = 0, \ldots, N_b$, the ranges representing the bin separation points, for bins with
distinct widths $\Delta_i$
• $P_{k,min}$, $P_{k,max}$, the limits of the approximation interval, for bins with equal widths
$\Delta = \frac{P_{k,max} - P_{k,min}}{N_b}$
• $b_i$, for $i = 1, \ldots, N_b$, the normalized height of each bin, such that $\sum_{i=1}^{N_b} b_i \Delta_i = 1$
The above specification replaces the usual Pk = {distribution} for parameters specified
using the discrete RV model. The number of bins and bin ranges are established according
to the accuracy requirements. Bin heights are computed from the experimental
distributions obtained during the profiling of performance parameters, such as execution times,
communication loads, or power consumption. In this case, the bin height will be given
by:
$$b_i = \frac{N_b}{N_s \left( P_{k,max} - P_{k,min} \right)} \sum_{j=1}^{N_s} \Big( H(s_j - r_{i-1}) - H(s_j - r_i) \Big) \qquad (3.6)$$
where $N_s$ is the number of samples obtained for the given parameter, $s_j$ is a particular
sample of $P_k$, and $H(x)$ is the Heaviside step function. For a process parameter described
by a non-standard distribution law $p_{P_k}(x)$, which is available in any form (analytic
closed-form expression, look-up table etc.), the bin heights can be computed using numerical
integration. An example with a good tradeoff between accuracy and speed is Simpson's
3/8 rule [38]:
$$ b_i = \frac{r_i - r_{i-1}}{8} \Big( p_{P_k}(r_{i-1}) + 3\,p_{P_k}(x_1) + 3\,p_{P_k}(x_2) + p_{P_k}(r_i) \Big) \tag{3.7} $$
where $x_1$ and $x_2$ are equally spaced points between the bin ranges $r_{i-1}$ and $r_i$
($p_{P_k}(x)$ is assumed to satisfy the normalization condition
$\int_{-\infty}^{\infty} p_{P_k}(x)\,dx = 1$).
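The two ways of obtaining bin heights can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, a sample equal to the upper limit is counted in the last bin, and the Simpson 3/8 estimate of the bin area is divided by the bin width so that the heights follow the convention that the sum of $b_i \Delta_i$ equals 1.

```python
def heights_from_samples(samples, p_min, p_max, n_bins):
    """Equal-width bin heights from profiled samples, in the spirit of Eq. (3.6)."""
    width = (p_max - p_min) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if p_min <= s <= p_max:
            # Assumption: a sample equal to p_max is counted in the last bin.
            i = min(int((s - p_min) / width), n_bins - 1)
            counts[i] += 1
    n_s = len(samples)
    # Height = relative frequency divided by the bin width.
    return [c / (n_s * width) for c in counts]


def heights_from_pdf(pdf, edges):
    """Bin heights from a known pdf using Simpson's 3/8 rule on each bin."""
    heights = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        h = (hi - lo) / 3.0
        area = (hi - lo) / 8.0 * (
            pdf(lo) + 3.0 * pdf(lo + h) + 3.0 * pdf(lo + 2.0 * h) + pdf(hi)
        )
        # Assumption: divide the estimated bin area by the width to get a height.
        heights.append(area / (hi - lo))
    return heights
```

With equal-width bins, both functions return heights whose products with the bin width sum to approximately one whenever the samples or the pdf lie inside the chosen interval.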
A second use of the discrete model is within the synthesis framework, for
estimating the results of algebraic operations on random variables. As shown later in this
chapter, the composition of RVs using various operators often results in non-standard
distributions. As discussed in chapter 2, maximum operations add a positive skewness
to distributions. In addition, the division of two random variables usually results in
heavy-tailed distributions. Moreover, for most algebraic operators there is no known
analytic expression for the distribution of their result. Given these considerations, the
discretized RV model is also employed for representing the result of algebraic operations
applied to random variables, for which the result does not obey standard distribution
laws. First, the support of the result distribution is estimated and divided into bins. After
that, the height of each bin is computed by applying the respective operator on the input
variables. Sec. 3.3 discusses the use of discretized RVs for operations on random variables
in more detail.

Fig. 3.5: Cumulative distribution function computed from a discrete pdf.
Any distribution type can be approximated with adjustable precision using the discretized
RV model. Hereby, the tradeoff between accuracy and computation speed can be
adjusted by modifying the following parameters:
• The support of the approximated pdf, given by the [xmin, xmax] interval;
• Nb, the number of bins inside the pdf support;
• Ns, the number of samples, if the distribution is approximated from a set of sampled
parameters (in the profiling step);
• The numerical integration method for computing the bin heights bi;
• The type of random generator and its “luxury” level, in the particular case in which
the distribution is approximated using Monte Carlo samples from other known dis-
tributions.
It can be observed that the accuracy of the discrete pdf model is not intrinsically lim-
ited. By increasing the number of bins and the support limits, any desired accuracy level
can be achieved, at the cost of increased computational overhead.
3.2.4 Sampling Technique for Discretized pdfs
Once a random variable has been characterized using the discretized pdf model, it is very
useful to develop a function which generates samples from this distribution. First, the
cumulative distribution function (CDF) is computed from the discretized pdf:
$$ \hat{F}_X(x) = \sum_{j<i} b_j \Delta_j + b_i \left( x - r_{i-1} \right) \tag{3.8} $$
where $i$ is the index of the bin containing the value $x$ (bounded by the ranges $r_{i-1}$, $r_i$)
and $b_i$ is the height of that bin. The sum accumulates the areas of the bins before bin $i$,
while the second term in equation (3.8) calculates the area enclosed between the value $x$ and
the left boundary of the current bin. Due to this partial contribution of bin $i$, the
computed CDF $\hat{F}_X(x)$ is continuous at every point. Fig. 3.5 illustrates an example of
a discretized pdf and the resulting CDF computed with (3.8).
Once the CDF has been computed, samples from the random variable $X$, described
by $\hat{p}_X(x)$, can be obtained in the following way. Considering another RV $Y$ uniformly
distributed in the interval [0, 1], samples $x$ from RV $X$ are obtained from samples $y$ of RV
$Y$ as:
$$ x = \hat{F}_X^{-1}(y) \tag{3.9} $$
or, alternatively, $x$ is the solution of the equation $\hat{F}_X(x) = y$, where $\hat{F}_X(x)$ is
strictly monotone. Due to the monotonicity of the CDF, this equation can be efficiently solved
in a few steps using, e.g., the bisection method [38]. Fig. 3.6 illustrates this concept.

Fig. 3.6: Sampling method using a standard uniform distribution and the CDF.
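The sampling procedure of (3.8) and (3.9) can be illustrated with the following minimal Python sketch. The function names, the list-based representation of bin ranges (`edges`) and heights, and the bisection tolerance are assumptions made for illustration only.

```python
import random


def cdf(x, edges, heights):
    """Piecewise-linear CDF of a discretized pdf, following Eq. (3.8)."""
    if x <= edges[0]:
        return 0.0
    total = 0.0
    for i, b in enumerate(heights):
        lo, hi = edges[i], edges[i + 1]
        if x < hi:
            # Partial contribution of the bin that contains x.
            return total + b * (x - lo)
        total += b * (hi - lo)  # full area of a bin left of x
    return total


def inverse_cdf(y, edges, heights, tol=1e-9):
    """Solve cdf(x) = y by bisection, relying on the monotonicity of the CDF."""
    lo, hi = edges[0], edges[-1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid, edges, heights) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample(edges, heights, rng=random):
    """Draw one sample of X from a standard uniform sample, as in Eq. (3.9)."""
    return inverse_cdf(rng.random(), edges, heights)
```

For bins with zero height the CDF is flat, so the inverse is not unique there; the bisection then returns a point at the edge of the flat segment, which is one possible convention.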
3.3 Method for the Propagation of Distributions
Since the data flows across performance macromodels (PMs) originate from multiple design
parameters which are exposed to variations, computing the performance metric at
the PM output inevitably requires applying arithmetic operations to random variables at
the PM nodes. In other words, the statistical distributions are propagated from the
parameter-level description across the PM, towards the output performance metric.
Several methods for the propagation of variability have been discussed in Sec. 2.4.2
and their disadvantages have been pointed out. Either only a few moments of the distri-
butions are propagated, or linear models are assumed for the dependencies on variable
parameters. In addition, only a few methods consider non-Gaussian parameter distribu-
tions.
In contrast, the method presented here brings the following novel improvements:
• propagates the complete pdf representation across each operation (not only a few
moments);
• is compatible with any nonlinear closed-form expression of performance models;
• its flexibility (operation-based propagation of complete pdf) allows for including
any number of parameter variations in the performance model expression;
• employs both standard distribution models and custom discrete pdf models for
non-standard distributions;
• has several adjustable parameters to allow the tradeoff between accuracy and eval-
uation speed.
The proposed method relies on the evaluation of the result pdf in each operation node
of a performance macromodel. The input random variables are represented using their
pdfs and statistical operators are developed to compute the pdf of the result. For the sum,
maximum, difference, and product operations, the result pdf is directly evaluated using
analytic operators. For other operations, the result pdf is obtained through statistical esti-
mation. In the following, the variability propagation method is discussed in the particu-
lar case of the commonly-used operators in performance macromodels (sum, maximum,
difference, and product), then the analysis is extended to other operation types. The pre-
sentation of statistical operators considers first the uncorrelated case. The implications
and methods for handling several correlation types between the random variables are
discussed in Sec. 3.3.5.
3.3.1 Statistical Sum and Maximum Operators
Statistical sum and maximum operations have been extensively studied in the field of
statistical timing analysis, as indicated in [24]. Several approaches [19, 160, 97, 89] rely on
analytic representations, albeit assuming normal distributions. Here a derived analytical
method is presented which has been adapted for the random variable models developed
in this work.
Fig. 3.7: Limits of the overlap during the sum computation.
Let $Z$ be an RV equal to the sum of two random variables $X$ and $Y$. The pdf of the sum
is defined by the convolution product:
$$ p_Z(z) = (p_X * p_Y)(z) = \int_{-\infty}^{\infty} p_X(x) \cdot p_Y(z - x)\,dx \tag{3.10} $$
The interval over which the sum pdf extends is bounded by $z_{\min} = x_{\min} + y_{\min}$ and
$z_{\max} = x_{\max} + y_{\max}$. After computing the bounds, the interval $[z_{\min}, z_{\max}]$ is
divided into $N_b$ uniform bins. Then, for each bin $i = 1, \dots, N_b$, bounded by the ranges
$r_{\inf} = z_{\min} + (i - 1)\Delta$ and $r_{\sup} = z_{\min} + i\Delta$, the overlapping range of the
two pdfs $p_X(x)$ and $p_Y(z - x)$ must be identified.
As illustrated in Fig. 3.7, the lower bound of the overlap is given by
$\max(x_{\min}, z - y_{\max})$, which ranges in the current bin between:
$$ z^{\inf}_{\mathrm{low}} = \max\left( x_{\min}, -y_{\max} + r_{\inf} \right) \tag{3.11} $$
$$ z^{\sup}_{\mathrm{low}} = \max\left( x_{\min}, -y_{\max} + r_{\sup} \right) \tag{3.12} $$
Similarly, the upper bound of the overlapping range, $\min(x_{\max}, z - y_{\min})$, varies between:
$$ z^{\inf}_{\mathrm{up}} = \min\left( x_{\max}, -y_{\min} + r_{\inf} \right) \tag{3.13} $$
$$ z^{\sup}_{\mathrm{up}} = \min\left( x_{\max}, -y_{\min} + r_{\sup} \right) \tag{3.14} $$
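First, the left value of the integral from (3.10) is computed for $z = r_{\inf}$, using Simpson's
3/8 rule:
$$ \mathrm{Int}_l = \frac{z^{\inf}_{\mathrm{up}} - z^{\inf}_{\mathrm{low}}}{8} \left[ f\!\left( z^{\inf}_{\mathrm{low}} \right) + 3 f(z_1) + 3 f(z_2) + f\!\left( z^{\inf}_{\mathrm{up}} \right) \right] \tag{3.15} $$
where $f(z) = p_X(z) \cdot p_Y(r_{\inf} - z)$ and the two points $z_1$ and $z_2$ separate the overlap
region into equal slots:
$$ z_1 = z^{\inf}_{\mathrm{low}} + \left( z^{\inf}_{\mathrm{up}} - z^{\inf}_{\mathrm{low}} \right) / 3 \tag{3.16} $$
$$ z_2 = z^{\inf}_{\mathrm{low}} + \left( z^{\inf}_{\mathrm{up}} - z^{\inf}_{\mathrm{low}} \right) \cdot 2/3 \tag{3.17} $$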
After that, the right value of the integral is evaluated, for $z = r_{\sup}$:
$$ \mathrm{Int}_r = \frac{z^{\sup}_{\mathrm{up}} - z^{\sup}_{\mathrm{low}}}{8} \left[ f\!\left( z^{\sup}_{\mathrm{low}} \right) + 3 f(z_1) + 3 f(z_2) + f\!\left( z^{\sup}_{\mathrm{up}} \right) \right] \tag{3.18} $$
where $f(z) = p_X(z) \cdot p_Y(r_{\sup} - z)$ and the two points $z_1$ and $z_2$ are given by:
$$ z_1 = z^{\sup}_{\mathrm{low}} + \left( z^{\sup}_{\mathrm{up}} - z^{\sup}_{\mathrm{low}} \right) / 3 \tag{3.19} $$
$$ z_2 = z^{\sup}_{\mathrm{low}} + \left( z^{\sup}_{\mathrm{up}} - z^{\sup}_{\mathrm{low}} \right) \cdot 2/3 \tag{3.20} $$

Fig. 3.8: Limits of the overlap during the sum computation. (Figure labels: current bin between
$r_{\inf}$ and $r_{\sup}$; integral function; left and right values of the integral; average value
(bin height).)
As illustrated in Fig. 3.8, the left and right values of the integral correspond to the
values of $p_Z(z)$ at the bin boundaries. Finally, the bin height $b_i$ is evaluated as the
average of the left and the right integral values. In addition, to ensure that
the resulting pdf satisfies the normalization condition $\sum_{i=1}^{N_b} b_i \Delta_i = 1$, every bin
height $b_i$ must be multiplied by the factor:
$$ f_{i,\mathrm{norm}} = \frac{1}{\left( r_i - r_{i-1} \right) \sum_{j=1}^{N_b} b_j} \tag{3.21} $$
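The steps of the sum operator (overlap limits, Simpson's 3/8 rule at the bin boundaries, averaging, and normalization) can be sketched as follows. This is an illustrative Python sketch, not the implementation of this work; it assumes that the input pdfs are available as callables that return zero outside their supports, and all function and parameter names are hypothetical.

```python
def simpson38(f, a, b):
    """Simpson's 3/8 rule on [a, b]; returns 0 for an empty interval."""
    if b <= a:
        return 0.0
    h = (b - a) / 3.0
    return (b - a) / 8.0 * (f(a) + 3.0 * f(a + h) + 3.0 * f(a + 2.0 * h) + f(b))


def sum_pdf_at(z, p_x, x_min, x_max, p_y, y_min, y_max):
    """Value of the convolution (3.10) at z, integrated over the overlap only."""
    lower = max(x_min, z - y_max)
    upper = min(x_max, z - y_min)
    return simpson38(lambda x: p_x(x) * p_y(z - x), lower, upper)


def sum_operator(p_x, x_min, x_max, p_y, y_min, y_max, n_bins):
    """Discretized pdf of Z = X + Y over n_bins uniform bins."""
    z_min, z_max = x_min + y_min, x_max + y_max
    width = (z_max - z_min) / n_bins
    edges = [z_min + i * width for i in range(n_bins + 1)]
    at_edges = [sum_pdf_at(z, p_x, x_min, x_max, p_y, y_min, y_max) for z in edges]
    # Bin height = average of the left and right boundary values.
    heights = [0.5 * (at_edges[i] + at_edges[i + 1]) for i in range(n_bins)]
    # Normalization as in (3.21): divide by width * sum of heights.
    area = width * sum(heights)
    return edges, [h / area for h in heights]
```

For two standard uniform inputs the result approximates the triangular pdf on [0, 2], which gives a quick plausibility check of the overlap limits.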
Fig. 3.9 shows the estimated delay of three processing tasks, evaluated as the sum be-
tween earliest starting time and task execution time. Both starting and execution times
are estimated as statistical distributions and the total delay has been evaluated using the
implemented sum operator. For illustration purposes, the evaluations have been approxi-
mated with 30 bins for each random variable. The results are compared with direct Monte
Carlo samplings from the distributions using 1 000 000 samples for each variable and are
found to be in very good agreement.
For the maximum operation, let $W$ be another random variable computed as
$W = \max(X, Y)$. Then, given the CDFs $F_X(x)$ and $F_Y(y)$ of RVs $X$ and $Y$, respectively,
the pdf of $W$ is given by:
$$ p_W(w) = F_X(w) \cdot p_Y(w) + F_Y(w) \cdot p_X(w) \tag{3.22} $$
After estimating the CDFs $\hat{F}_X(x)$ and $\hat{F}_Y(y)$ using (3.8), the distribution of $W$ can be
evaluated for every bin between $\max(x_{\min}, y_{\min})$ and $\max(x_{\max}, y_{\max})$.
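As a brief worked example of (3.22), chosen here only for illustration, consider two independent RVs $X$ and $Y$ that are both uniformly distributed on [0, 1], so that $F_X(w) = F_Y(w) = w$ and $p_X(w) = p_Y(w) = 1$ on this interval:

$$ p_W(w) = w \cdot 1 + w \cdot 1 = 2w, \quad 0 \le w \le 1, \qquad \int_0^1 2w\,dw = 1 $$

The result is normalized and shifted towards larger values, consistent with the positive skewness introduced by maximum operations.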
Fig. 3.9: Estimated delay of three processing tasks computed using the sum operator and through
Monte Carlo sampling. (Axes: delay [ns], normalized pdf.)

A rather interesting case is the particular maximum operation between a given random
variable $X$ and a constant $Y = c$. This case is often found in performance models
where only a subset of parameters is described statistically, the rest being considered
constant. In the trivial case where the constant is outside the support of $p_X(x)$, the
result is either the unchanged pdf $p_X(x)$ (if $c \le x_{\min}$), or the constant $c$ (if
$c \ge x_{\max}$). Otherwise, if the constant lies between $x_{\min}$ and $x_{\max}$, the pdf of $W$ is
computed as:
$$ p_W(w) = \begin{cases} 0, & w < c \\ F_X(c) + p_X(c), & w = c \\ p_X(w), & w > c \end{cases} \tag{3.23} $$
The first branch in (3.23) is empty, since the lower bound of $p_W(w)$ is given by
$\max(x_{\min}, c)$. Because the pdf of $Y$ is equal to a Dirac delta delayed by $c$, the first term
in (3.22) becomes $F_X(c)$, as illustrated in Fig. 3.10. Finally, knowing that the CDF of $Y$ is a
Heaviside step function delayed by $c$ (as shown in Fig. 3.10), the third branch is identical to
the pdf of $X$.
3.3.2 Statistical Difference Operator
Statistical subtractions are required for computing timing differences, such as timing
slacks between the scheduled tasks. Let Z = X − Y be the difference of two random
variables. The pdf of Z can be computed using the implementation of the sum operator
after transforming the pdf of Y as follows:
$$ p_Z(z) = \int_{-\infty}^{\infty} p_X(x) \cdot p'_Y(z - x)\,dx \tag{3.24} $$
$$ p'_Y(y) = p_Y(-y) \tag{3.25} $$
Hence, the sum operator can be applied after mirroring the pdf of the subtrahend with
respect to the ordinate axis, as shown in Fig. 3.11.
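For a discretized pdf, the mirroring step amounts to negating the bin ranges and reversing the order of the bins. The following Python sketch illustrates this under the assumption that a pdf is stored as a list of bin ranges (`edges`) and a list of bin heights; the function name is hypothetical.

```python
def mirror_pdf(edges, heights):
    """Mirror a discretized pdf across the ordinate, as in Eq. (3.25)."""
    # The bin [r_{i-1}, r_i] becomes [-r_i, -r_{i-1}] and keeps its height,
    # so both lists are reversed to keep the ranges in ascending order.
    mirrored_edges = [-e for e in reversed(edges)]
    mirrored_heights = list(reversed(heights))
    return mirrored_edges, mirrored_heights
```

The mirrored pdf can then be passed to the sum operator in place of the subtrahend.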
Fig. 3.10: Evaluation of the maximum between a random variable and a constant.

Fig. 3.11: Subtrahend distribution mirrored across the ordinate. (The mirrored pdf $p'_Y(y)$
extends from $y'_{\min} = -y_{\max}$ to $y'_{\max} = -y_{\min}$.)
3.3.3 Statistical Product Operator
Another important contribution of this work is the development of an efficient statistical
product operator. Statistical sum and maximum operations have received considerable
attention in statistical timing analysis, whereas the statistical product operation has
not yet been systematically analyzed and implemented for the general case of two arbitrary
random distributions. Even though a statistical formulation exists [69], a serious
challenge is posed by the integration limits, with many particular cases of repartition
across the four quadrants and the inclusion of zero in the distribution support.
First, a brief observation is required with respect to the repartition of variables across
the real axis over both positive and negative values. While process parameters usually
exhibit positive values, their variations are often modeled with respect to the nominal
value as [151]:
$$ P_k = P_{k,\mathrm{nom}} + \Delta P_{k,\mathrm{inter}} + \Delta P_{k,\mathrm{spatial}}(x_i, y_i) + \Delta P_{k,\mathrm{random},i} \tag{3.26} $$
In this representation, the nominal value of the parameter $P_{k,\mathrm{nom}}$ is positive, whereas
the inter-die variation $\Delta P_{k,\mathrm{inter}}$, the intra-die spatially-correlated
variation $\Delta P_{k,\mathrm{spatial}}(x_i, y_i)$, and the intra-die uncorrelated component
$\Delta P_{k,\mathrm{random},i}$ are modeled with random variables usually centered on zero and
extending over both positive and negative values.

Fig. 3.12: Repartition of X and Y random variables across the four quadrants and discretized pdf
of the product Z = XY.
First, the limits [zmin, zmax] for the support of pZ (z) are determined. As shown in
Fig. 3.12, zmax can be found as one of the following products: xmaxymax in the first quad-
rant (QI), xmaxymin in the second quadrant (QII), xminymin in the third quadrant (QIII), and
xminymax in the fourth quadrant (QIV). Likewise, zmin is given by xminymin in QI, xminymax
in QII, xmaxymax in QIII and xmaxymin in QIV.
More interesting are the cases where the distributions span across more than one
quadrant. For spans in QI and QII (Fig. 3.13(a)), zmin = xminymax (< 0) and zmax =
xmaxymax (> 0). In QII+QIII (Fig. 3.13(b)), zmin = xminymax (< 0) and zmax = xminymin (> 0).
Similarly, in spans across QIII+QIV the z range is given by [xmaxymin, xminymin] and in
QI+QIV by [xmaxymin, xmaxymax]. If the distributions span across all four quadrants (as in
Fig. 3.13(e)), then zmin = min {xminymax, xmaxymin} and zmax = max {xmaxymax, xminymin}.
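All of the cases listed above select the smallest and the largest of the four corner products of the rectangular {X, Y} domain. A compact Python sketch of this observation is given below; the function name is hypothetical and the code is meant only as a cross-check of the case distinctions, not as the implementation used in this work.

```python
def product_support(x_min, x_max, y_min, y_max):
    """Support [z_min, z_max] of Z = XY for rectangular supports of X and Y."""
    # The extremes of x*y over a rectangle are always attained at its corners.
    corners = [x_min * y_min, x_min * y_max, x_max * y_min, x_max * y_max]
    return min(corners), max(corners)
```

For example, for a domain spanning all four quadrants, the minimum is taken over the two negative corner products and the maximum over the two positive ones, in agreement with the expressions given for Fig. 3.13(e).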
After establishing the limits for the distribution of the product, the domains of $X$ and
$Y$ are partitioned into the four quadrants, resulting in four {X, Y} partitions. The product
pdf $p_Z(z)$ is then computed separately on each quadrant and the results are merged
afterwards. To do this, the domain between $z_{\min}$ and $z_{\max}$ is divided into $N_b$ bins. Then,
for each bin, the pdf of $Z$ is evaluated as explained in the following.
The general integration formula for the product distribution is given by:
$$ p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\, p_Y\!\left( \frac{z}{x} \right) \frac{1}{|x|}\,dx \tag{3.27} $$
Thereby, the largest challenge in computing the integral is finding the integration limits.
Fig. 3.13: Variable spans across multiple quadrants: (a) QI + QII, (b) QII + QIII, (c) QIII + QIV,
(d) QI + QIV, (e) QI + QII + QIII + QIV.
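The supports of the two pdfs are limited by the following conditions:
$$ \text{For } p_X: \quad x_{\min} \le x \le x_{\max} \tag{3.28} $$
$$ \text{For } p_Y: \quad y_{\min} \le \frac{z}{x} \le y_{\max} \tag{3.29} $$
The {X, Y} partition in the first quadrant is bounded by
$x^{\mathrm{QI}}_{\min} = \max\{0, x_{\min}\}$, $y^{\mathrm{QI}}_{\min} = \max\{0, y_{\min}\}$,
$x^{\mathrm{QI}}_{\max} = \max\{0, x_{\max}\}$, and $y^{\mathrm{QI}}_{\max} = \max\{0, y_{\max}\}$.
Here, both supports of $p_X$ and $p_Y$ are non-negative and (3.29) becomes:
$$ \frac{z}{y^{\mathrm{QI}}_{\max}} \le x \le \frac{z}{y^{\mathrm{QI}}_{\min}} \tag{3.30} $$
Consequently, the lower and upper limits of integration ($L_l$ and $L_u$) are given by:
$$ L_l = \max\left\{ x^{\mathrm{QI}}_{\min}, \frac{z}{y^{\mathrm{QI}}_{\max}} \right\} \tag{3.31} $$
$$ L_u = \min\left\{ x^{\mathrm{QI}}_{\max}, \frac{z}{y^{\mathrm{QI}}_{\min}} \right\} \tag{3.32} $$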
Three particular cases can be identified. First, if the curve
$xy = x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min}$, which intersects the lower-right corner of the
partition, is greater than the curve $xy = x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max}$, which
passes through the upper-left corner (Fig. 3.14(a)), then
$x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\min} \le x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} < x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min} < x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\max}$.
In this case, the product pdf is given by:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{x^{\mathrm{QI}}_{\min}}^{z/y^{\mathrm{QI}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\min} \le z \le x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QI}}_{\max}}^{z/y^{\mathrm{QI}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} \le z \le x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QI}}_{\max}}^{x^{\mathrm{QI}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min} \le z \le x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\max} \quad \text{(Region 3)}
\end{cases} \tag{3.33} $$

Fig. 3.14: Relative positions of the {X, Y} partition corners. Opposite corners on distinct xy curves
(a, c) and on the same curve (b).
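In the second case, the partition corners lie on the same curve
$xy = x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} = x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min}$
(Fig. 3.14(b)), for which:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{x^{\mathrm{QI}}_{\min}}^{z/y^{\mathrm{QI}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\min} \le z \le x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QI}}_{\max}}^{x^{\mathrm{QI}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} \le z \le x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\max} \quad \text{(Region 2)}
\end{cases} \tag{3.34} $$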
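The third case occurs for
$x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min} < x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max}$
(Fig. 3.14(c)) and:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{x^{\mathrm{QI}}_{\min}}^{z/y^{\mathrm{QI}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\min} \le z \le x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min} \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QI}}_{\min}}^{x^{\mathrm{QI}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\min} \le z \le x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QI}}_{\max}}^{x^{\mathrm{QI}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QI}}_{\min} y^{\mathrm{QI}}_{\max} \le z \le x^{\mathrm{QI}}_{\max} y^{\mathrm{QI}}_{\max} \quad \text{(Region 3)}
\end{cases} \tag{3.35} $$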
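As a worked example for the first quadrant, chosen here purely for illustration, let $X$ and $Y$ both be uniformly distributed on [1, 2], so that $p_X = p_Y = 1$ on their supports and the opposite corners lie on the same curve ($x_{\min} y_{\max} = x_{\max} y_{\min} = 2$, the second case). Evaluating the two regions of the second case gives:

$$ p_Z(z) = \begin{cases}
\displaystyle\int_{1}^{z} \frac{1}{x}\,dx = \ln z, & 1 \le z \le 2 \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{z/2}^{2} \frac{1}{x}\,dx = \ln \frac{4}{z}, & 2 \le z \le 4 \quad \text{(Region 2)}
\end{cases} $$

The two pieces integrate to $(2 \ln 2 - 1) + (2 - 2 \ln 2) = 1$, so the result is a normalized pdf on $[z_{\min}, z_{\max}] = [1, 4]$.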
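In the second quadrant, bounded by $x^{\mathrm{QII}}_{\min} = \min\{x_{\min}, 0\}$,
$y^{\mathrm{QII}}_{\min} = \max\{0, y_{\min}\}$, $x^{\mathrm{QII}}_{\max} = \min\{x_{\max}, 0\}$, and
$y^{\mathrm{QII}}_{\max} = \max\{0, y_{\max}\}$, the values of $Z$ are negative and (3.29) is
replaced by:
$$ \frac{z}{y^{\mathrm{QII}}_{\min}} \le x \le \frac{z}{y^{\mathrm{QII}}_{\max}} \tag{3.36} $$
The lower and upper integration limits are evaluated as:
$$ L_l = \max\left\{ x^{\mathrm{QII}}_{\min}, \frac{z}{y^{\mathrm{QII}}_{\min}} \right\} \tag{3.37} $$
$$ L_u = \min\left\{ x^{\mathrm{QII}}_{\max}, \frac{z}{y^{\mathrm{QII}}_{\max}} \right\} \tag{3.38} $$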
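If $x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\max} < x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} < x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max} \le x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\min}$
(Fig. 3.14(a)), the pdf of the product is computed in the following way:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{x^{\mathrm{QII}}_{\min}}^{z/y^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\max} \le z \le x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} \quad \text{(Region 3)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QII}}_{\min}}^{z/y^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} \le z \le x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QII}}_{\min}}^{x^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max} \le z \le x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\min} \quad \text{(Region 1)}
\end{cases} \tag{3.39} $$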
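If $x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} = x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max}$
(Fig. 3.14(b)), the pdf is computed as:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{x^{\mathrm{QII}}_{\min}}^{z/y^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\max} \le z \le x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QII}}_{\min}}^{x^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} \le z \le x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\min} \quad \text{(Region 1)}
\end{cases} \tag{3.40} $$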
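Finally, in the case in which
$x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max} < x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min}$
(Fig. 3.14(c)), the distribution is given by:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{x^{\mathrm{QII}}_{\min}}^{z/y^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\max} \le z \le x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max} \quad \text{(Region 3)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QII}}_{\min}}^{x^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\max} \le z \le x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QII}}_{\min}}^{x^{\mathrm{QII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QII}}_{\min} y^{\mathrm{QII}}_{\min} \le z \le x^{\mathrm{QII}}_{\max} y^{\mathrm{QII}}_{\min} \quad \text{(Region 1)}
\end{cases} \tag{3.41} $$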
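In the third quadrant the bounds of the {X, Y} partition are
$x^{\mathrm{QIII}}_{\min} = \min\{x_{\min}, 0\}$, $y^{\mathrm{QIII}}_{\min} = \min\{y_{\min}, 0\}$,
$x^{\mathrm{QIII}}_{\max} = \min\{x_{\max}, 0\}$, and $y^{\mathrm{QIII}}_{\max} = \min\{y_{\max}, 0\}$.
Since $z > 0$, the inequality (3.29) becomes:
$$ \frac{z}{y^{\mathrm{QIII}}_{\max}} \le x \le \frac{z}{y^{\mathrm{QIII}}_{\min}} \tag{3.42} $$
Hence, the integration limits are:
$$ L_l = \max\left\{ x^{\mathrm{QIII}}_{\min}, \frac{z}{y^{\mathrm{QIII}}_{\max}} \right\} \tag{3.43} $$
$$ L_u = \min\left\{ x^{\mathrm{QIII}}_{\max}, \frac{z}{y^{\mathrm{QIII}}_{\min}} \right\} \tag{3.44} $$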
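Here, the distribution is computed if
$x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\max} \le x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} < x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max} < x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\min}$
(Fig. 3.14(a)) as:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{z/y^{\mathrm{QIII}}_{\max}}^{x^{\mathrm{QIII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\max} \le z \le x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QIII}}_{\max}}^{z/y^{\mathrm{QIII}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} \le z \le x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIII}}_{\min}}^{z/y^{\mathrm{QIII}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max} \le z \le x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\min} \quad \text{(Region 3)}
\end{cases} \tag{3.45} $$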
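In the second case
$x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} = x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max}$
(Fig. 3.14(b)) and the distribution becomes:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{z/y^{\mathrm{QIII}}_{\max}}^{x^{\mathrm{QIII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\max} \le z \le x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIII}}_{\min}}^{z/y^{\mathrm{QIII}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} \le z \le x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\min} \quad \text{(Region 2)}
\end{cases} \tag{3.46} $$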
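The third case occurs for
$x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max} < x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min}$
(Fig. 3.14(c)) and the pdf is evaluated as:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{z/y^{\mathrm{QIII}}_{\max}}^{x^{\mathrm{QIII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\max} \le z \le x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max} \quad \text{(Region 1)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIII}}_{\min}}^{x^{\mathrm{QIII}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\max} \le z \le x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIII}}_{\min}}^{z/y^{\mathrm{QIII}}_{\min}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{(-x)}\,dx, & x^{\mathrm{QIII}}_{\max} y^{\mathrm{QIII}}_{\min} \le z \le x^{\mathrm{QIII}}_{\min} y^{\mathrm{QIII}}_{\min} \quad \text{(Region 3)}
\end{cases} \tag{3.47} $$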
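Finally, in the fourth quadrant $x^{\mathrm{QIV}}_{\min} \ge 0$ and $y^{\mathrm{QIV}}_{\max} \le 0$,
hence $z < 0$ and the relationship (3.29) becomes:
$$ \frac{z}{y^{\mathrm{QIV}}_{\min}} \le x \le \frac{z}{y^{\mathrm{QIV}}_{\max}} \tag{3.48} $$
As a consequence, the lower and upper limits of the integral are computed as:
$$ L_l = \max\left\{ x^{\mathrm{QIV}}_{\min}, \frac{z}{y^{\mathrm{QIV}}_{\min}} \right\} \tag{3.49} $$
$$ L_u = \min\left\{ x^{\mathrm{QIV}}_{\max}, \frac{z}{y^{\mathrm{QIV}}_{\max}} \right\} \tag{3.50} $$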
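Hence, the following cases of repartition can be distinguished in QIV. First, if
$x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\min} < x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} < x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min} \le x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\max}$
(Fig. 3.14(a)), the pdf becomes:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{z/y^{\mathrm{QIV}}_{\min}}^{x^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\min} \le z \le x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} \quad \text{(Region 3)} \\[2ex]
\displaystyle\int_{z/y^{\mathrm{QIV}}_{\min}}^{z/y^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} \le z \le x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIV}}_{\min}}^{z/y^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min} \le z \le x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\max} \quad \text{(Region 1)}
\end{cases} \tag{3.51} $$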
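Second, if the opposite corners lie on the same xy curve
($x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} = x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min}$, as in
Fig. 3.14(b)), the product pdf is computed as:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{z/y^{\mathrm{QIV}}_{\min}}^{x^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\min} \le z \le x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIV}}_{\min}}^{z/y^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} \le z \le x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\max} \quad \text{(Region 1)}
\end{cases} \tag{3.52} $$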
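Third, if the opposite corners lie on reversed xy curves
($x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min} < x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max}$, see
Fig. 3.14(c)), the integration limits are as follows:
$$ p_Z(z) = \begin{cases}
\displaystyle\int_{z/y^{\mathrm{QIV}}_{\min}}^{x^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\min} \le z \le x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min} \quad \text{(Region 3)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIV}}_{\min}}^{x^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\min} \le z \le x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} \quad \text{(Region 2)} \\[2ex]
\displaystyle\int_{x^{\mathrm{QIV}}_{\min}}^{z/y^{\mathrm{QIV}}_{\max}} p_X(x)\, p_Y\!\left(\frac{z}{x}\right) \frac{1}{x}\,dx, & x^{\mathrm{QIV}}_{\max} y^{\mathrm{QIV}}_{\max} \le z \le x^{\mathrm{QIV}}_{\min} y^{\mathrm{QIV}}_{\max} \quad \text{(Region 1)}
\end{cases} \tag{3.53} $$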
A particular case occurs for $z = 0$, where the integral is not always defined (particularly
if $x = 0$). Since this case occurs only if $x = 0 \lor y = 0$, the pdf is computed separately,
such that:
$$ \int_{0-\varepsilon}^{0+\varepsilon} p_Z(z)\,dz = \int_{0-\varepsilon}^{0+\varepsilon} p_X(x)\,dx + \int_{0-\varepsilon}^{0+\varepsilon} p_Y(y)\,dy \tag{3.54} $$
in the neighborhood of 0, for an arbitrarily small value $\varepsilon$.
This completes the definition of $p_Z(z)$ for all real values across the four quadrants.
It is important to notice that multiplications in separate quadrants result in overlapping
regions on the z axis. This is especially the case when the joint {X, Y} distribution
spans across all four quadrants. If the distribution span covers only two quadrants (see
Fig. 3.13(a–d)), the values of z from one of them are always negative, while the values
from the other are positive, hence their coverages on the z axis are disjoint. Suppose the
joint distribution {X, Y} spans across all four quadrants, as in Fig. 3.13(e). In this case, the
components from the first and third quadrants result in overlapping positive regions,
given by:
$$ z \in [0, x_{\max} y_{\max}] \quad \text{from QI} \tag{3.55} $$
$$ z \in [0, x_{\min} y_{\min}] \quad \text{from QIII} \tag{3.56} $$
Likewise, the negative components resulting from the second and fourth quadrants are:
$$ z \in [x_{\min} y_{\max}, 0] \quad \text{from QII} \tag{3.57} $$
$$ z \in [x_{\max} y_{\min}, 0] \quad \text{from QIV} \tag{3.58} $$
Since each of the overlapping components contributes to the final distribution of $Z$, the
pdf segments computed for the overlapping branches must be added.

Fig. 3.15: Leakage energy distributions for three slacks, computed using the product operator and
through direct sampling (Monte Carlo). (Axes: leakage energy [nJ], normalized pdf.)
Considering the previous observations, the algorithm for computing the product distribution
is given in Listing 3.1. As an example, Fig. 3.15 shows the result from a performance
macromodel subset which multiplies three slack distributions between processing
tasks with the estimated average leakage power of the core on which the tasks are
running. The average leakage power has been estimated in a discrete form and varies
between 266 and 321 mW for the given core. The results of the implemented product operator
are compared with direct Monte Carlo samplings and multiplications. As shown
in Fig. 3.15, the results are in very good agreement. The implemented operator has also
been tested on variables spanning across multiple quadrants and the results show the
same accuracy level.
3.3.4 Numerical Implementation of other Statistical Operators
The previous sections have presented the development of a series of analytic operators for
the propagation of statistical distributions. These operators offer the important advantage
of both high accuracy and high speed and, in addition, they are implemented for the most
often used operations in the macromodels.

PRODUCTOPERATOR(X, Y)
  Compute zmin, zmax
  Divide [zmin, zmax] into Nb bins
  for i ← 1 to Nb do
      /* Retrieve the bin ranges: */
      zinf ← zmin + (i − 1) · (zmax − zmin) / Nb
      zsup ← zmin + i · (zmax − zmin) / Nb
      /* Compute pZ(zinf): */
      pZ(zinf) ← 0                              /* Reset to zero */
      if zinf = 0 then
          pZ(zinf) ← pX(0) + pY(0)
      else if zinf < 0 then                     /* Check QII and QIV */
          if QII span = true then               /* {X, Y} spans over QII */
              /* Compute integration limits: */
              Ll ← max{xQIImin, zinf / yQIImin}
              Lu ← min{xQIImax, zinf / yQIImax}
              if Ll = Lu then
                  pZ(zinf) += 0                 /* Integral is zero */
              else                              /* Apply Simpson's 3/8 rule: */
                  x0 ← Ll
                  x1 ← Ll + (Lu − Ll) / 3
                  x2 ← Ll + (Lu − Ll) · 2/3
                  x3 ← Lu
                  Compute f(x) = pX(x) · pY(zinf / x) · 1/(−x) in x0, x1, x2, and x3
                  pZ(zinf) += (Lu − Ll) · (f(x0) + 3f(x1) + 3f(x2) + f(x3)) / 8
          if QIV span = true then               /* {X, Y} spans over QIV */
              Get limits and evaluate the integral similarly.
      else                                      /* zinf > 0: check QI and QIII */
          Evaluate pZ(zinf) in QI and QIII.
      Compute pZ(zsup) in a similar way.
      /* Compute the average value of the current bin: */
      pZ[i] ← (pZ(zinf) + pZ(zsup)) / 2
  /* Check the normalization condition for pZ(z): */
  Sb ← Σ_{i=1..Nb} pZ[i]
  for i ← 1 to Nb do
      pZ[i] *= Nb / (Sb · (zmax − zmin))
Listing 3.1: Algorithm for computing the product pdf.
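The pointwise evaluation at the core of Listing 3.1 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions and not the implementation of this work: the input pdfs are assumed to be callables returning zero outside their supports, the case z = 0 is excluded because it requires the separate treatment of (3.54), a partition bound that is exactly zero is handled by the limit of the quotient, and all names (`product_pdf_at`, `_ratio`, `simpson38`) are hypothetical.

```python
import math


def simpson38(f, a, b):
    """Simpson's 3/8 rule on [a, b]; returns 0 for an empty interval."""
    if b <= a:
        return 0.0
    h = (b - a) / 3.0
    return (b - a) / 8.0 * (f(a) + 3.0 * f(a + h) + 3.0 * f(a + 2.0 * h) + f(b))


def _ratio(z, y, at_zero):
    """z / y, replaced by the given limit when the partition bound y is 0."""
    return z / y if y != 0.0 else at_zero


def product_pdf_at(z, p_x, x_min, x_max, p_y, y_min, y_max):
    """Pointwise pdf of Z = XY at z != 0, summed over the relevant quadrants."""
    if z == 0.0:
        raise ValueError("z = 0 needs the separate treatment of Eq. (3.54)")
    f = lambda x: p_x(x) * p_y(z / x) / abs(x)
    xp_lo, xp_hi = max(0.0, x_min), max(0.0, x_max)  # x >= 0: QI, QIV
    xn_lo, xn_hi = min(0.0, x_min), min(0.0, x_max)  # x <= 0: QII, QIII
    yp_lo, yp_hi = max(0.0, y_min), max(0.0, y_max)  # y >= 0: QI, QII
    yn_lo, yn_hi = min(0.0, y_min), min(0.0, y_max)  # y <= 0: QIII, QIV
    total = 0.0
    if z > 0.0:
        if xp_hi > xp_lo and yp_hi > yp_lo:  # QI, limits (3.31)-(3.32)
            lo = max(xp_lo, z / yp_hi)
            hi = min(xp_hi, _ratio(z, yp_lo, math.inf))
            total += simpson38(f, lo, hi)
        if xn_hi > xn_lo and yn_hi > yn_lo:  # QIII, limits (3.43)-(3.44)
            lo = max(xn_lo, _ratio(z, yn_hi, -math.inf))
            hi = min(xn_hi, z / yn_lo)
            total += simpson38(f, lo, hi)
    else:
        if xn_hi > xn_lo and yp_hi > yp_lo:  # QII, limits (3.37)-(3.38)
            lo = max(xn_lo, _ratio(z, yp_lo, -math.inf))
            hi = min(xn_hi, z / yp_hi)
            total += simpson38(f, lo, hi)
        if xp_hi > xp_lo and yn_hi > yn_lo:  # QIV, limits (3.49)-(3.50)
            lo = max(xp_lo, z / yn_lo)
            hi = min(xp_hi, _ratio(z, yn_hi, math.inf))
            total += simpson38(f, lo, hi)
    return total
```

Because the limits are formed with max and min in each quadrant, the region distinctions of the individual cases do not need to be coded explicitly; contributions from quadrants that overlap on the z axis are simply added.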
Fig. 3.16: Fast numerical implementation with adjustable accuracy. (Each input bin $b_i$ is
approximated by a uniform distribution $U(r_{i-1}, r_i)$ from which $N_{sb}$ individual samples are
drawn and copied; the algebraic operator is applied and the resulting $N_s$ samples are
accumulated into the bins of $p_Z(z)$.)
Nevertheless, additional statistical operations are required in performance modeling,
such as exponential, logarithm, n-th root, or even division, for which no general analytical
expressions exist. For instance, the general case of division of two random variables
often results in unknown, mostly Pareto-like (heavy-tailed) distributions. Only a few
very particular cases [106,131] are known, such as the ratio of two normal RVs with zero
mean, which is a Cauchy distribution (also long-tailed, for which parameters like mean
or variance are undefined).
One of the main goals of this work is the development of a general methodology which
permits the application of any algebraic operation on a distribution, in order to easily
include further extensions to the developed models. To achieve this goal, this work extends
the implementation of the previous analytic operators with numerical approximations
of several other algebraic operations. Several adjusting parameters which influence the
accuracy are discussed and the precision of the numerical operators is analyzed.
Let $\Psi(\cdot)$ be an arbitrary algebraic operator (other than the operators implemented in
the previous sections) defined as:
$$ \Psi : \mathcal{P}^n \to \mathcal{P} \tag{3.59} $$
$$ \mathcal{P} = \left\{ p_X(x) : \mathbb{R} \to \mathbb{R}^+_0 \;\middle|\; \int_{-\infty}^{\infty} p_X(x)\,dx = 1 \right\} \tag{3.60} $$
where $n$ is the number of statistical distributions on which it is applied and $\mathcal{P}$ is the
set of real probability density functions.
If $Z = \Psi(p_{X_1}(x_1), \dots, p_{X_n}(x_n))$ is the result of the operator, then the pdf $p_Z(z)$
is estimated from $N_s$ samples obtained numerically. As illustrated in Fig. 3.16, the input
pdf of each RV $X_k$, $k = 1, \dots, n$, is represented in discrete form over $N_b$ bins, where each
bin $b_i$ is approximated with a uniform distribution $U(r_{i-1}, r_i)$. After that, $N_{sb}$ individual
samples are generated from each uniform distribution and copied $f$ times, where $f$ is a
factor dependent on the respective bin area and is computed as:
$$ f = \frac{N_s \cdot \Delta_i \cdot p_{X_k}[i]}{N_{sb}} \tag{3.61} $$
where $p_{X_k}[i]$ is the height of bin $b_i$ for RV $X_k$. In this way, the algebraic operation
described by $\Psi(\cdot)$ must be applied only on the individual $N_{sb}$ samples from each bin. The
resulting $N_s$ samples obtained after applying the operation on all $n$ variables are
accumulated into the discretized pdf $p_Z(z)$, which represents the estimated distribution of the
result.

Fig. 3.17: Accuracy of the implemented statistical operators for several values of Nb and Nsb.
(Panels: Exp, Log, Sqrt, Division; axes: Nb, Nsb, normalized RMSE [%].)
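The sampling scheme can be sketched in Python for the simplest situation of a single input variable. The restriction to a unary operator, the use of a per-sample weight $f / N_s$ in place of physically copying each sample $f$ times, the choice of the output support from the observed sample range, and all names are assumptions made for illustration; how samples of several input variables are paired is not covered by this sketch.

```python
import random


def apply_unary_operator(op, edges, heights, n_sb, n_out_bins, seed=0):
    """Estimate the discretized pdf of Z = op(X) from N_sb samples per input bin."""
    rng = random.Random(seed)
    values, weights = [], []
    for i, b in enumerate(heights):
        lo, hi = edges[i], edges[i + 1]
        # Eq. (3.61) divided by N_s: each individual sample carries the
        # probability mass of its bin shared among the N_sb samples.
        w = b * (hi - lo) / n_sb
        for _ in range(n_sb):
            values.append(op(rng.uniform(lo, hi)))
            weights.append(w)
    z_min, z_max = min(values), max(values)
    width = (z_max - z_min) / n_out_bins  # assumes op is not constant
    mass = [0.0] * n_out_bins
    for v, w in zip(values, weights):
        k = min(int((v - z_min) / width), n_out_bins - 1)
        mass[k] += w
    # Convert accumulated probability mass into bin heights.
    return z_min, z_max, [m / width for m in mass]
```

Only `len(heights) * n_sb` evaluations of the operator are needed, which is the source of the speedup with respect to drawing all samples independently.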
First, it is important to notice the performance improvement with respect to Monte
Carlo sampling. Due to the fact that the samples from each bin are multiplied by the
factor $f$, this technique requires only $N_b \cdot N_{sb}$ samples of each input variable, while pure
Monte Carlo needs $N_s$ samplings. The speedup achieved by this method with respect to
Monte Carlo (MC) is given by the relative computational overhead:
$$ \mathrm{overhead} = \frac{N_b \cdot N_{sb}}{N_s} \cdot 100 \;[\%\ \mathrm{MC}] \tag{3.62} $$
The accuracy of this method is controlled by adjusting the number of bins $N_b$, the number
of individual samples extracted from each bin $N_{sb}$, and the total number of desired
samples $N_s$. In addition, by balancing the ratio of $N_b N_{sb}$ to $N_s$, a good tradeoff between
speed and accuracy can be achieved.
Fig. 3.18: Impact of increasing the number of bins Nb or individual samples Nsb on operator
accuracy. (Panels: (a) increasing Nb, (b) increasing Nsb; vertical axis: normalized RMSE [%].)
The impact of various settings on the accuracy is illustrated in Fig. 3.17 for several
operations. Here, the total number of samples Ns was set to 1 000 000 and each statistical
operator has been applied on 30 random distributions from which the results are
averaged. The obtained pdfs have been compared with the pdfs obtained from 1 000 000
Monte Carlo evaluations (samples from the input distributions on which the algebraic
operations are applied) and the achieved accuracy is evaluated using the normalized RMS
error, computed as:
\[ \mathrm{NRMSE} = \frac{\mathrm{RMSE}}{z_{\max} - z_{\min}} \quad [\%] \tag{3.63} \]
where the RMS error is evaluated from the differences between the bin heights of the
estimated pdfs using the statistical operators (pZ,stat) and Monte Carlo (pZ,MC) as:
\[ \mathrm{RMSE} = \sqrt{ \frac{ \sum_{i=1}^{N_b} \left( p_{Z,\mathrm{stat}}[i] - p_{Z,\mathrm{MC}}[i] \right)^2 }{ N_b } } \tag{3.64} \]
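The two error measures (3.63) and (3.64) can be evaluated on a pair of discretized pdfs as
sketched below. The function name and the example bin heights are hypothetical, and the factor
100 is an assumption derived from the percent unit attached to (3.63).

```python
import math


def nrmse_percent(p_stat, p_mc, z_min, z_max):
    """Normalized RMS error between two pdfs given as bin heights over the same bins."""
    n_bins = len(p_stat)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_stat, p_mc)) / n_bins)  # (3.64)
    return 100.0 * rmse / (z_max - z_min)                                       # (3.63)


# Hypothetical bin heights of an estimated pdf and of a Monte Carlo reference pdf.
p_operator = [0.1, 0.4, 0.4, 0.1]
p_reference = [0.1, 0.5, 0.3, 0.1]
error = nrmse_percent(p_operator, p_reference, z_min=0.0, z_max=4.0)
```

Since the RMS error is divided by the output span, an operator with a very wide output range
yields a small normalized error for the same absolute deviation in the bin heights.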
The higher accuracy of the exponential operator with respect to the other investigated
operations is due to the very large values which result from exponentiation, and, as a
consequence, to the large output span (zmax − zmin) which reduces the normalized error.
In addition, a change in Nb has a relatively higher impact on the NRMSE than a
change in Nsb, because the RMS error is computed over the number of bins.
The results show that good accuracies can already be obtained with values for Nb and
Nsb between 30 and 50. With Nb = 50 and Nsb = 50, the relative computational overhead
achieved by these statistical operators with respect to Monte Carlo is 0.25% (for the
selected number of 1 000 000 samples), as indicated by (3.62).
Further increases in the number of bins or in the number of individual samples
result in only moderate improvements in accuracy, as shown in Fig. 3.18. The most
important gain in accuracy is achieved for values of Nb and Nsb up to the range of 50 to 100,
as shown here for the exponential operator (Fig. 3.18(a)) and the square root operator
(Fig. 3.18(b)).
[Figure 3.19: pdfs pZ(z) over z for the division, log, and sqrt operators; curves: Numerical
Operator, Monte Carlo.]
Fig. 3.19: Pdfs obtained for Nb = 50 and Nsb = 50 compared with Monte Carlo for different
statistical operators.
Fig. 3.19 offers a better image of the relative accuracy achieved by the implemented op-
erators with respect to Monte Carlo, showing the resulting pdfs for Nb = 50 and Nsb = 50
in the case of logarithm, square root, and division operations (the results of the exponen-
tial are orders of magnitude larger). The results show a very good accuracy obtained by
the presented method with the estimated overhead of 0.25% with respect to Monte Carlo
sampling.
3.3.5 Handling Correlations
As discussed in Sec. 2.4.2, topological correlations generated by reconverging paths must
be carefully tracked. The significant impact of correlations on the statistical computations
is illustrated in Fig. 3.20 for the maximum delay computed from multiple paths with dif-
ferent correlations. With increasing correlation, the maximum delay distribution exhibits
an increasing skewness. If a confidence point of e.g. 99% is chosen for evaluating the de-
lay, under the influence of correlations the position of this point may move significantly,
leading to underestimations of the actual delay.
To account for topological correlations, each reconvergent node (Fig. 3.21(a)) must be
identified first. Nodes with multiple inputs are tested as potential sinks for reconverging
paths by recursively collecting the parents of each inbound node in individual lists (see
Fig. 3.21(b)). If the lists are disjoint, then the node is not reconvergent, and the maximum
operation is computed as described in Sec. 3.3.1. Otherwise, if the lists contain at least
one common parent, then the node is a sink for reconverging paths. In this case, a lower
and an upper bound of the statistical distribution are computed. The upper bound is
computed as a maximum operation (Sec. 3.3.1), whereas the lower bound is computed as
[Figure 3.20: normalized pdf versus delay [ns] (0 to 100) for correlations ranging from ρ=0 to
ρ=1; annotations: increasing correlation, confidence point displacement.]
Fig. 3.20: Influence of correlations on statistical result distributions (example for maximum oper-
ator).
the minimum between the CDFs of the inbound distributions:
\[ p^{\mathrm{low}}_Z(z) = \frac{d}{dz}\hat{F}^{\mathrm{low}}_Z(z) = \frac{d}{dz}\left[ \min\left( \hat{F}_{X_1}(z), \hat{F}_{X_2}(z), \ldots \right) \right] \tag{3.65} \]
where Xi are the inbound RVs and the CDFs F̂Xi (xi) are computed according to (3.8).
These bounds provide a first estimation of the influence of topological correlations. A
further refinement of the estimation between the bounds can be employed for increasing
the accuracy, as described by the method proposed in [6].
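The reconvergence test and the two bounds can be sketched as follows. This is an illustrative
sketch under stated assumptions, not the implementation of this work: the graph is given as a
hypothetical dictionary `parents` mapping each node to its inbound nodes, each parent list also
includes the inbound node itself, and the maximum of independent RVs is assumed to be the
product of their CDFs (the operator of Sec. 3.3.1 is not reproduced here).

```python
import math


def ancestors(node, parents):
    """Collect all direct and indirect parents of `node` (iterative traversal)."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def is_reconvergent(node, parents):
    """True if two inbound nodes of `node` share at least one common parent."""
    # Each list also contains the inbound node itself (a choice made here),
    # so that one input feeding another input is detected as well.
    lists = [ancestors(n, parents) | {n} for n in parents.get(node, ())]
    return any(
        lists[i] & lists[j]
        for i in range(len(lists))
        for j in range(i + 1, len(lists))
    )


def max_cdf_bounds(cdfs):
    """Bounding CDFs of max(X1, X2, ...) from CDFs sampled on a common grid."""
    # Lower-bound distribution: pointwise minimum of the CDFs, as in (3.65).
    cdf_low = [min(vals) for vals in zip(*cdfs)]
    # Upper-bound distribution: product of the CDFs (assumed independent max).
    cdf_up = [math.prod(vals) for vals in zip(*cdfs)]
    return cdf_low, cdf_up


def pdf_from_cdf(cdf, step):
    """Forward-difference approximation of d/dz applied to a sampled CDF."""
    return [(b - a) / step for a, b in zip(cdf, cdf[1:])]


# Hypothetical graphs: "a" and "b" share the parent "src" in the first one only.
shared = {"max": ["a", "b"], "a": ["src"], "b": ["src"]}
disjoint = {"max": ["a", "b"], "a": ["x"], "b": ["y"]}
cdf_low, cdf_up = max_cdf_bounds([[0.0, 0.2, 0.6, 0.9, 1.0], [0.0, 0.1, 0.5, 0.8, 1.0]])
pdf_low = pdf_from_cdf(cdf_low, step=1.0)
```

The CDF of the lower-bound distribution lies on or above the CDF of the upper-bound
distribution at every grid point, so a quantile taken from the former is never larger than
the same quantile taken from the latter.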
For numerical operators, topological correlations are tracked in a similar way. After
identifying reconvergent nodes, given the order of the first common parent between two
reconverging paths, a correlation coefficient is estimated. A first-order common parent
generates a total correlation ρ = 1 between its output nodes. After each operation, the
correlation coefficient decreases according to a piecewise-linear model:
\[ \rho(n) = \begin{cases} 1 - \dfrac{n}{N_L}\,(1 - \rho_r), & n \le N_L \\ \rho_r, & n > N_L \end{cases} \tag{3.66} \]
where n is the number of statistical operations after the common parent output, NL is the
number of operations over which the correlation coefficient decreases linearly, and ρr is a
residual correlation in the long reconverging paths. In typical cases NL ranges from 10 to
20 nodes and ρr is between 0.2 and 0.4.
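The piecewise-linear model (3.66) translates directly into a small function. The function name
is hypothetical, and the default values NL = 15 and ρr = 0.3 are simply picked from the middle
of the typical ranges quoted above.

```python
def residual_correlation(n, n_linear=15, rho_residual=0.3):
    """Correlation coefficient after n operations, following (3.66)."""
    if n < 0 or n_linear <= 0:
        raise ValueError("n must be >= 0 and n_linear must be > 0")
    if n <= n_linear:
        return 1.0 - (n / n_linear) * (1.0 - rho_residual)
    return rho_residual


# Correlation directly at the common parent, inside the linear region,
# at the end of the linear region, and on a long reconverging path.
profile = [residual_correlation(n) for n in (0, 5, 15, 40)]
```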
After estimating the correlation coefficient, samples from two correlated random vari-
ables X and Y can be obtained by defining two new variables A and B computed using
[Figure 3.21: (a) performance-macromodel graph built from max and + nodes over the task times
of T1, T2, T3, Tp, Tq, and Tr, with the reconvergent node marked; (b) parent lists for the
inbound nodes, checked for common parents.]
Fig. 3.21: Topological correlations at reconvergent nodes (a) tracked by testing inbound nodes for
common parents (b).
a standard normal RV C as follows [100]:
\[ A = X + \sqrt{\rho \cdot E\{X^2\}} \cdot C \tag{3.67} \]
\[ B = Y + \sqrt{\rho \cdot E\{Y^2\}} \cdot C \tag{3.68} \]
In this way, correlated samples of A and B can be computed from the samples of the
uncorrelated variables X , Y , and C.
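A minimal sampling sketch of (3.67) and (3.68) is given below. The helper names, the Gaussian
inputs, and the sample-based estimates of E{X²} and E{Y²} are assumptions made for the
example, which uses only the Python standard library.

```python
import math
import random


def correlated_pair(xs, ys, rho, rng):
    """Build samples of A and B from uncorrelated X, Y per (3.67) and (3.68)."""
    ex2 = sum(x * x for x in xs) / len(xs)   # sample estimate of E{X^2}
    ey2 = sum(y * y for y in ys) / len(ys)   # sample estimate of E{Y^2}
    a, b = [], []
    for x, y in zip(xs, ys):
        c = rng.gauss(0.0, 1.0)              # shared standard normal sample C
        a.append(x + math.sqrt(rho * ex2) * c)
        b.append(y + math.sqrt(rho * ey2) * c)
    return a, b


def pearson(u, v):
    """Sample correlation coefficient of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    var_u = sum((p - mu) ** 2 for p in u)
    var_v = sum((q - mv) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ys = [rng.gauss(0.0, 1.0) for _ in range(20000)]
a, b = correlated_pair(xs, ys, rho=0.5, rng=rng)
r_xy, r_ab = pearson(xs, ys), pearson(a, b)
```

The shared term C is what couples A and B. Comparing `r_ab` with the target ρ is a quick way
to check how closely the construction reproduces the intended correlation for a given pair of
input distributions.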
In addition to topological correlations, the spatial correlations between intra-die
parameters must be accounted for. In this work, a principal component analysis (PCA) method
is employed to express spatially-correlated RVs in terms of independent standard nor-
mal variables. For instance, if the drain current of a transistor is expressed as a function
of process parameters as Id = f (P1, P2, . . . ), after the PCA decorrelation the expression
becomes:
\[ I_d = f\left( f_1(N_1, N_2, \ldots, N_n), \ldots \right) \tag{3.69} \]
where N1, . . . , Nn are a set of uncorrelated standard Gaussian RVs, representing the
principal components of the decomposition. The employed PCA method is described in detail
in Sec. 4.1.2.
3.3.6 Random Variable Algebra
The statistical operators developed in Sec. 3.3.1–3.3.4 are implemented within a random
variable algebra class. This RV algebra module computes the required statistical
operations on a list of input random variables. Within the framework, the random variables
are represented as objects corresponding to each of the implemented RV types discussed
in Sec. 3.2.
Several settings are available to control the behavior of the algebraic operations, which
include:
• Parameters for the precision of the RV representation, such as the number of bins Nb,
the number of individual (Nsb) and total samples (Ns) for the numerical operators,
and also margins for the estimation of the pdf support from a limited number of
samples;
• Parameters for the embedded random number generator, such as the generator type
(see Sec. 3.2.1), seed, and methods for seeding the generator to an arbitrary time-
dependent value;
• Settings for the estimation of the moments of random variables, such as the number
of samples used in moment estimations.
The algebra module applies the required statistical operation on the list of input RVs
and provides the result in the form of a new random variable which embeds the com-
plete pdf representation of the statistical result. During algebraic operations the module
automatically recognizes particular cases, such as operations with constants, as well as
standard embedded RV types (Gaussian, uniform etc.) which lead to several simplifica-
tions in the applied operations. Through the use of a central RV algebra module, all the
precision and performance tradeoffs can be adjusted efficiently during the synthesis.
3.4 Embedding Technique for Random Variables
Variability-aware performance macromodels are developed in this work by embedding
the RV models discussed in Sec. 3.2 within macromodel data structures. Global perfor-
mance macromodels, as well as smaller circuit-level models, are represented as linked
graphs containing parametric and computational nodes.
As discussed in Sec. 2.4, due to parameter variability considerations, each node in
the graph embeds an RV representation. This random variable describes either a parameter
in the case of parametric nodes (e.g. estimated execution time $T^x_{i,RT_j}$, dynamic power
consumption $P^{RT_j}_{d,i}$, leakage $P_{l,RT_i}$, communication load $L^{i,j}_c$, or process
parameter variations), or the result of a statistical operation in the case of operational
nodes. This section discusses the various cases in which the random variables are embedded
within the performance macromodels developed throughout this thesis.
3.4.1 Variability Sources and RV Leaf Nodes
The leaf nodes in a performance model typically represent parameter values, whereas
internal nodes specify operations. Parametric nodes specify either system-level parameters
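[Figure 3.22: performance-model graph with max and + nodes (start time and end time of task i)
whose parametric leaf nodes (execution times of tasks i and j, dynamic power of task j on
resource type RTk, and Vth0) each embed a pdf.]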
Fig. 3.22: Random variable representations embedded into the leaf nodes of performance models
for variable parameters.
estimated in the profiling step, such as execution times, communication loads, and aver-
age power dissipation, or technology-related parameters, such as the process parameter
variations. In both cases, the parameters exposed to variations are represented as ran-
dom variables. Fig. 3.22 illustrates the representation of parametric leaf nodes as random
variables which embed the complete pdf.
For this purpose, the RV models developed in Sec. 3.2 are employed to represent the
values stored in each PM node. As a consequence, every parametric node embeds a RV
representation of the parameter value, to provide a unified interface between the nodes.
Constant values are represented as particular cases of random variables which return the
same constant value when they are sampled. Further, the statistical operators automati-
cally recognize constant values and perform the operations accordingly (typically shift or
scale the distribution with the constant value).
A typical PM node stores a random variable and a collection of input and output
links. Parametric nodes are typically leaf nodes in a PM and therefore contain only output
links towards other (operational) nodes. On these output links, they provide the stored RV
representation for statistical computations.
3.4.2 Variability Propagation and Estimation of Results
The typical internal nodes in a performance macromodel are operational nodes, which
implement a statistical operation on random variables. Such an operational node exhibits
input links from the random variables on which the operation is to be applied and stores
a random variable for the result of the operation. This resulting RV is then provided on
the output links to the other nodes connected downstream.
The variability descriptions are propagated at pdf level from node to node. They
start at the parametric leaf nodes, where the pdfs have been estimated from simulations,
profiling operations, or technology descriptions. Subsequently, at each operational node, a
[Figure 3.23: an operational node receives the RVs of its inbound PM nodes (parametric or
operational), applies a statistical operation (+, max, exp, ...), and provides the result pdf
pZ(z) to the outbound operational PM nodes.]
Fig. 3.23: Pdf propagation at each operational node in a PM.
statistical operator is applied to the random variables embedded in each of the inbound
nodes. At this point, inbound nodes can represent either leaf nodes or other operational
nodes. After the statistical operation is applied to the inbound RVs, the resulting RV
is stored in the node and provided to the other nodes connected through output links.
Fig. 3.23 illustrates the computation and storage of the result pdfs at each operational
node.
In this way, the complete pdf representation is updated after each operation and fur-
ther propagated from node to node towards the output of the performance macromodel.
When a performance macromodel is evaluated, all the PM nodes must compute and store
their local pdf. For this purpose, the evaluation request is first sent to the output PM node
and from this node it is recursively propagated upstream in the graph until all PM nodes
have evaluated their pdfs, as illustrated in Fig. 3.24. After all PM nodes in the graph have
computed the pdfs, the distribution stored in the output node represents the result of the
performance macromodel.
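The upstream evaluation request can be illustrated with a small recursive sketch. Note the
swapped-in technique: instead of the pdf-level operators of Sec. 3.3, each node here caches a
vector of Monte Carlo samples, which keeps the example short. The class `PMNode`, the helper
`leaf`, and the two-task chain are hypothetical.

```python
import random


class PMNode:
    """Node of a performance macromodel holding a cached sample vector."""

    def __init__(self, name, op=None, inputs=(), sampler=None):
        self.name = name
        self.op = op                  # None for parametric (leaf) nodes
        self.inputs = list(inputs)
        self.sampler = sampler        # draws one value for a leaf node
        self.samples = None           # cached result; None means "not evaluated"

    def evaluate(self, n_samples, rng):
        """Return this node's samples, evaluating upstream nodes on demand."""
        if self.samples is None:
            if self.op is None:
                self.samples = [self.sampler(rng) for _ in range(n_samples)]
            else:
                columns = [n.evaluate(n_samples, rng) for n in self.inputs]
                self.samples = [self.op(vals) for vals in zip(*columns)]
        return self.samples


def leaf(name, mean, std):
    """Parametric node with a Gaussian execution time (assumed distribution)."""
    return PMNode(name, sampler=lambda rng: rng.gauss(mean, std))


# Two tasks executed one after the other (hypothetical chain, values in ns).
start = PMNode("T_start", sampler=lambda rng: 0.0)
t1, t2 = leaf("T1_x", 21, 4), leaf("T2_x", 36, 4)
s1 = PMNode("T1_s", op=max, inputs=[start])
e1 = PMNode("T1_e", op=sum, inputs=[s1, t1])
s2 = PMNode("T2_s", op=max, inputs=[e1])
e2 = PMNode("T2_e", op=sum, inputs=[s2, t2])
delay = PMNode("Delay", op=max, inputs=[e2])

samples = delay.evaluate(5000, random.Random(7))
mean_delay = sum(samples) / len(samples)
```

Only the output node is asked for its result; every upstream node is evaluated exactly once
and then serves its cached value to all nodes connected to its output links.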
3.4.3 Changes and Updates Propagated Downstream
During the design space exploration, several re-mapping and re-scheduling decisions
take place. For instance, if a task k is re-mapped from a resource of type RTi to a different
resource of type RTj, its execution time changes from $T^x_{k,RT_i}$ to $T^x_{k,RT_j}$. As a
consequence, the parametric node containing the pdf of the execution time of task k must be
updated to the new random variable describing $T^x_{k,RT_j}$. Since this parametric node
changed its embedded distribution, all the computed pdfs in the downstream nodes become
invalid and must be reevaluated. Therefore, an update must be triggered together with this
change and propagated downstream to all the involved nodes. This update trigger occurs as
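[Figure 3.24: PM graph of + and max nodes over the execution times of T1, T2, T3, Tp, and Tq;
the evaluation request enters at the output PM node (Delay) and is passed upstream to all
nodes that have not yet been evaluated.]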
Fig. 3.24: Evaluation of a PM propagated upstream from the output node.
[Figure 3.25: PM graph in which an updated pdf at a parametric node triggers a downstream
update through the dependent + and max nodes up to the Delay output.]
Fig. 3.25: Downstream propagation of a pdf update triggered by a change in system configuration.
shown in Fig. 3.25, and all the nodes affected by the propagated update must reset their
evaluations and recompute the pdfs.
Similar updates are triggered by a scheduling change. Since scheduling dependencies
add supplementary links between the PM nodes (see Fig. 2.5), a change in schedule in-
volves deleting and adding the corresponding scheduling links. This changes the list of
inbound nodes for the max nodes, which must consequently update their computed pdf.
After recomputing the affected pdfs, a similar update must be propagated downstream
in the PM.
Communication changes, such as the choice of a different signaling circuit or the
tuning of supply voltage and body bias at the circuit level in the communication segments
(as will be shown in chapter 4), also have an impact on the delay and power consumption
of the communication tasks. Such changes again affect the pdfs stored in the macromodel
nodes, and the corresponding updates must be propagated downstream.
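The downstream reset can be sketched with a validity flag per node. The class `Node`, its
`invalidate` method, and the small graph below are hypothetical and only illustrate the
propagation order; the actual recomputation of the pdfs is omitted.

```python
class Node:
    """PM node reduced to its links and a flag telling whether its pdf is valid."""

    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)
        self.outputs = []
        self.valid = True             # pretend the pdf has already been evaluated
        for n in self.inputs:
            n.outputs.append(self)    # register the output link on the parent

    def invalidate(self):
        """Reset this node and every node reachable through its output links."""
        if not self.valid:
            return                    # already reset, so its downstream is reset too
        self.valid = False
        for n in self.outputs:
            n.invalidate()


# Hypothetical fragment: task 2 feeds task p, whose end time defines the delay.
t2_x = Node("T2_x")
t2_e = Node("T2_e", inputs=[t2_x])
tp_x = Node("Tp_x")
tp_s = Node("Tp_s", inputs=[t2_e])
tp_e = Node("Tp_e", inputs=[tp_s, tp_x])
delay = Node("Delay", inputs=[tp_e])

t2_x.invalidate()                     # task 2 was re-mapped: its pdf changed
stale = [n.name for n in (t2_x, t2_e, tp_x, tp_s, tp_e, delay) if not n.valid]
```

Nodes that do not depend on the changed parameter, such as the execution time of task p in
this fragment, keep their stored pdf and need no recomputation.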
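[Figure 3.26: pdf pX(x) with the confidence point marked: (a) inferior quantile zα,inf with
area α below and 1-α above it; (b) superior quantile zα,sup with area 1-α below and α above it.]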
Fig. 3.26: Inferior quantile (a) and superior quantile (b) used as confidence points for design deci-
sions.
3.4.4 Result Interpretation
Generally, due to the different sources of variation which occur along the design flow,
most of the design parameters are obtained at the end of a synthesis step as random
variables with a given distribution. Usually, design decisions operate with determined
values, such as the evaluation of cost functions and the decision of selecting or rejecting a
configuration with respect to its cost. Hence, it is necessary to extract a single value from
a given distribution which accounts for particular design preferences, typically expressed
as a desired yield level, a minimum performance limit, or a tradeoff between these two.
In this work, cost functions and the resulting design decisions are evaluated from the
statistical distributions by means of quantile functions. The inferior quantile zα,inf of a
distribution is defined for a given value α ≤ 1 by the following relationship:
\[ \int_{-\infty}^{z_{\alpha,\mathrm{inf}}} p_X(x)\,dx = \alpha \tag{3.70} \]
which bounds a fraction equal to α from the pdf (see Fig. 3.26(a)). By denoting with
$F_X(x)$ the cumulative distribution function and with $F_{cX}(x) \triangleq P(X > x) = 1 - F_X(x)$
the complementary cumulative distribution function of an RV X, the inferior quantile is
determined as:
\[ z_{\alpha,\mathrm{inf}} \triangleq z \,\Big|_{\substack{P(X \le z) = \alpha \\ 0 < \alpha < 1}} = F_X^{-1}(\alpha) \tag{3.71} \]
A similar definition can be given for the superior quantile zα,sup, which bounds the supe-
rior part of the distribution, as shown in Fig. 3.26(b):
\[ z_{\alpha,\mathrm{sup}} \triangleq z \,\Big|_{\substack{P(X > z) = \alpha \\ 0 < \alpha < 1}} = F_{cX}^{-1}(\alpha) \tag{3.72} \]
Either the inferior or the superior quantile can be used in design decisions during the
synthesis.
In addition to comparing the costs of two solutions, design constraints must be checked
when a possible solution is found. Since the constraints are constant numbers, the
selected quantile from the result pdf is used to check if the solution meets the constraints.
For instance, if the quantile corresponding to α = 99% is chosen, accepting a solution
means that 99% of the manufactured instances are expected to meet the design constraints.
The parametric yield obtained for the design is therefore reflected by the quantile selected
during the design space exploration.
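The quantile extraction of (3.71) and (3.72) and the subsequent constraint check can be
sketched on a discretized pdf as follows. The function names, the linear interpolation inside
a bin, and the example pdf and constraint value are assumptions made for the illustration.

```python
def inferior_quantile(edges, heights, alpha):
    """z with P(X <= z) = alpha for a pdf given as bin edges and bin heights."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    cumulative = 0.0
    for lo, hi, height in zip(edges[:-1], edges[1:], heights):
        mass = height * (hi - lo)
        if mass > 0.0 and cumulative + mass >= alpha:
            # Interpolate linearly inside the bin that crosses alpha.
            return lo + (alpha - cumulative) / height
        cumulative += mass
    return edges[-1]


def superior_quantile(edges, heights, alpha):
    """z with P(X > z) = alpha, using F_cX(z) = 1 - F_X(z)."""
    return inferior_quantile(edges, heights, 1.0 - alpha)


# Hypothetical delay pdf: uniform between 0 and 10 ns, discretized into 10 bins.
edges = [float(k) for k in range(11)]
heights = [0.1] * 10
z99 = inferior_quantile(edges, heights, 0.99)
z_sup = superior_quantile(edges, heights, 0.01)
meets_constraint = z99 <= 9.95        # hypothetical delay constraint in ns
```

The inferior quantile for α and the superior quantile for 1 - α coincide, so either form
can be used as long as the same convention is applied to all compared solutions.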
3.5 Performance Macromodels for Delay Estimation
This work develops statistical delay macromodels with variable granularity. There are
mainly two macromodel hierarchy levels: system-level PMs, which compute the perfor-
mance parameters for the complete design, and resource-level PMs, which estimate the
characteristics of a single resource element (e.g. a signaling circuit). At system level,
delays are estimated for processing tasks with a relatively coarse granularity and rely on
execution time estimations specified as random variables $T^x_{k,RT_i}$. Communication tasks
are, on the other hand, estimated using the detailed circuit-level models presented in
chapter 4. Since the methodology is presented for both cases, the introduced macromod-
els can be easily extended to achieve the desired granularity for all tasks.
A delay macromodel is a directed acyclic graph consisting of linked PM nodes. As
mentioned in Sec. 3.4.1, the PM nodes represent either parameters (leaf nodes) or statisti-
cal operations (internal nodes). PM nodes which apply a statistical operation are imple-
mented using the statistical operators developed in Sec. 3.3. In addition, each PM node
embeds a random variable representation to describe either the stored parameter or the
result of the statistical operation.
3.5.1 Structure and Properties
The origin of the delay PM consists of the starting time value $T^s_{\mathrm{global}}$. This is usually the
deterministic zero value, but can also be any other timing value, even statistical, at which
the timing computation starts. This latter case is particularly useful e.g. if the modeled
system represents a link in a larger processing chain and if the end time of the previous
link is also estimated statistically.
As illustrated in Fig. 3.27, the execution times of tasks assigned to resources are given
by the parametric nodes which embed the random variables $T^x_i$ for each processing node
i. The earliest starting time $T^s_i$ of a processing node is computed as the statistical
maximum of the end times of the nodes on which it depends. After that, the execution time
is added to this starting time by means of a statistical sum operator to compute the end
time $T^e_i$ of each task.
For each processing node (PN) and communication node (CN) in the extended task
graph, a maximum node, a sum node, and a parametric node for the execution time are
added to the delay macromodel. These three PM nodes constitute a structural element of
the macromodel, and all structural elements are maintained in a list that is in one-to-one
correspondence with the list of processing nodes. Structural elements are interconnected
by fixed data dependencies between the nodes and by temporary scheduling links. When a
change in task scheduling occurs, the scheduling links between structural elements must
also be updated appropriately.
[Figure 3.27: chain of structural elements starting at the global start time; each element
consists of a max node (start time), a + node (end time), and a parametric node (execution
time); elements are linked by data and scheduling dependencies, and a final max node yields
the delay.]
Fig. 3.27: Statistical performance macromodel for delay estimations.
There are a series of particularities introduced by the statistical nature of computa-
tions, such as the reflection of design space explorations in the statistical distributions
computed within the macromodel. When a processing node is assigned to a different
resource, its execution time changes. As explained in Sec. 3.4.3, after updating the
random variable $T^x_i$, a value reset is propagated downstream, to update all the PM nodes
which depend on this parameter. In addition, scheduling updates require the deletion of
scheduling links between old successors in the scheduling list and the insertion of new
links between the nodes of the new sequence. The removal of links between PM nodes, as
well as the insertion of new links, require updates of the statistical computations which
are also propagated downstream in the macromodel. For instance, if the scheduling order
of two PNs/CNs assigned to the same resource is changed, the following changes occur
in the delay PM:
• The scheduling links connecting the nodes are removed;
• The PM nodes (PMNs) where the links pointed and all their downstream PMNs
must be updated;
• The new scheduling links connecting the nodes in their new positions are added;
• The PMNs where the new links point and all their downstream PMNs must be
updated.
Further, if a PN/CN is assigned to a new resource, the following changes must be per-
formed:
• The RV of the $T^x$ input parameter node is updated;
PN      R0          R1          R2          R3          R4
      µ    σ      µ    σ      µ    σ      µ    σ      µ    σ
1    21    4     27    4     16    3     22    3     17    3
2    36    4     41    5     32    4     38    4     32    4
3    51    5     56    5     45    4     51    5     46    4
4    39    4     46    5     36    4     43    4     37    4
5    14    3     20    4     11    3     17    3     12    3
Tab. 3.1: Parameters of the execution times of five PNs on different resources. Values given in
nanoseconds.
• The affected sum node and all its downstream PMNs are reset;
• The modified PN/CN is removed from the scheduling list of the previous resource:
– The scheduling links between the node and its predecessor and successor in
the scheduling list are updated: first both removed, then a new link is added
from the predecessor to the successor;
– Whenever a scheduling link is removed or added, the PMN where the link
pointed/points must be reset together with all its downstream PMNs.
• The PN/CN is inserted into the scheduling list of the new resource:
– A scheduling link is removed and two new links are added to the PM;
– The affected PMNs are reset downstream correspondingly.
3.5.2 Application Examples
Four execution scenarios with five processing nodes have been selected for illustrating
the results of the delay PM. Assuming a set of five processing resources on which the
PNs can be mapped, the execution times of each PN on each resource are described by
statistical distributions with mean and standard deviation values as specified in Tab. 3.1.
In the first execution scenario, the PNs are independent and are mapped to individual
resources, as illustrated by the task graph in Fig. 3.28(a). Since there are no data or re-
source dependencies, the PNs are executed in parallel, leading to the best execution time
from the considered scenarios. The system latency is evaluated as a distribution using the
delay PM and is shown in Fig. 3.29.
The second scenario represents the worst-case execution time, in which all PNs are
mapped to the same resource R2, as depicted by the task graph in Fig. 3.28(b). Here, the
PNs are serialized by the scheduling dependencies, which leads to the largest execution
time, as shown by the distribution plotted in Fig. 3.29.
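The first two scenarios can be reproduced approximately with a short Monte Carlo sketch based
on the values of Tab. 3.1. Several assumptions are made here: the execution times are taken as
independent Gaussian RVs, the start time is zero, the mapping of scenario 1 is read from
Fig. 3.28(a) as PN1 to PN5 on R0 to R4, and plain sampling is used in place of the statistical
operators of Sec. 3.3. The function name is hypothetical.

```python
import random

# (mean, std) of the execution times in ns from Tab. 3.1, indexed as EXEC[pn][resource].
EXEC = {
    1: {"R0": (21, 4), "R1": (27, 4), "R2": (16, 3), "R3": (22, 3), "R4": (17, 3)},
    2: {"R0": (36, 4), "R1": (41, 5), "R2": (32, 4), "R3": (38, 4), "R4": (32, 4)},
    3: {"R0": (51, 5), "R1": (56, 5), "R2": (45, 4), "R3": (51, 5), "R4": (46, 4)},
    4: {"R0": (39, 4), "R1": (46, 5), "R2": (36, 4), "R3": (43, 4), "R4": (37, 4)},
    5: {"R0": (14, 3), "R1": (20, 4), "R2": (11, 3), "R3": (17, 3), "R4": (12, 3)},
}


def latency_samples(mapping, serialized, n_samples, rng):
    """Monte Carlo samples of the system latency for a start time of zero."""
    result = []
    for _ in range(n_samples):
        times = [rng.gauss(*EXEC[pn][res]) for pn, res in mapping.items()]
        # Serialized tasks add up (chain of sum nodes); independent parallel
        # tasks finish when the slowest one finishes (final max node).
        result.append(sum(times) if serialized else max(times))
    return result


rng = random.Random(3)
scenario1 = {1: "R0", 2: "R1", 3: "R2", 4: "R3", 5: "R4"}   # assumed reading of Fig. 3.28(a)
scenario2 = {pn: "R2" for pn in EXEC}                        # all PNs on resource R2
lat1 = latency_samples(scenario1, False, 5000, rng)
lat2 = latency_samples(scenario2, True, 5000, rng)
mean1, mean2 = sum(lat1) / len(lat1), sum(lat2) / len(lat2)
```

For the serialized case the mean latency is close to the sum of the mean execution times on
R2 (16 + 32 + 45 + 36 + 11 = 140 ns), whereas the parallel case is dominated by the slowest
of the five tasks.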
[Figure 3.28: task graphs from Start to End for the four test scenarios, each PN annotated
with its assigned resource: (a) Scenario 1, (b) Scenario 2, (c) Scenario 3, (d) Scenario 4.]
Fig. 3.28: Execution sequences and resource mappings for the four test scenarios.
[Figure 3.29: normalized pdf versus delay [ns] (20 to 180) for Scenarios 1 to 4.]
Fig. 3.29: Statistical delays evaluated using the delay PM.
The next two mixed scenarios, illustrated in Fig. 3.28(c) and (d), add data and
scheduling dependencies and, depending on resource mapping and scheduling, achieve
execution times between the best-case and worst-case scenarios. In the third scenario,
PNs 2, 3, and 4 are assigned to the same resource R3, therefore their execution is serialized
by the additional scheduling dependency. In the fourth scenario, PN3 is assigned to a
different resource, therefore it can run in parallel with PN2 and PN4. This parallel execution
leads to a shorter execution time with respect to the third scenario, which is also reflected
by the results from Fig. 3.29.
3.6 Performance Macromodels for Energy Consumption
Power and energy consumption are commonly employed within the system design lex-
icon as interchangeable notions. Without affecting the generality of these terms, it is to
be noted that across the design flow steps, the two notions may not always be compati-
ble. As discussed in Sec. 2.2.3, the estimation of total energy consumption as performance
metric during the synthesis has a series of advantages over the estimation of power dissi-
pation. First, estimating the energy consumption takes into account the effect of leakage
power dissipations over long periods of idle time. This allows further to optimize the sys-
tem such that long slacks between the tasks are avoided to reduce the amount of leakage
dissipated by the idle resources. Second, the overall reduction of total energy consump-
tion is of primary importance in battery-powered embedded applications. And last, the
decision of mapping a task on a resource with lower dynamic power consumption, but
which requires a relatively long execution time, over assigning the task to a resource with
a higher power level but which requires a very short execution time might be detrimental
under the global system circumstances.
To include all these aspects in the framework in a unified way, energy consumption
has been chosen as the performance metric evaluated by the macromodels. It is also
important to remember that the average power dissipation can be obtained from the
estimated energy through division by the execution time evaluated using the delay PM.
Several approaches in the literature, such as [50,56], use energy as the performance metric
within the context of power-optimized hardware/software co-synthesis.
3.6.1 Dynamic Energy Macromodels
The application profile specification described in Sec. 3.1.2 defined the average dynamic
power consumption $P^{RT_j}_{d,i}$ as a statistical distribution for each node PNi on every
compatible resource type RTj. The average dynamic energy consumption of executing PNi is
approximated by the statistical product between the specified average power and the PN
execution time:
\[ E_{d,i} = P^{RT_j}_{d,i} \cdot T^x_{i,RT_j} \tag{3.73} \]
Hence, the total dynamic energy, evaluated by the PM, is given by:
\[ E_{d,\mathrm{total}} = \sum_{i=1}^{N_{PN}} E_{d,i} \tag{3.74} \]
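Equations (3.73) and (3.74) can be illustrated with a sampling-based sketch. The mapping below
is hypothetical; the (mean, std) pairs are borrowed from Tab. 3.1 and Tab. 3.2, the
distributions are assumed to be independent Gaussians, and plain Monte Carlo sampling stands
in for the statistical product and sum operators.

```python
import random

# Hypothetical mapping; (mean, std) pairs from Tab. 3.1 (ns) and Tab. 3.2 (mW).
TASKS = [
    {"power": (22, 2.9), "time": (21, 4)},   # PN1 on R0
    {"power": (26, 3.2), "time": (32, 4)},   # PN2 on R2
]


def total_dynamic_energy_samples(tasks, n_samples, rng):
    """Samples of the total dynamic energy in pJ (mW times ns)."""
    samples = []
    for _ in range(n_samples):
        total = 0.0
        for task in tasks:
            power = rng.gauss(*task["power"])
            time = rng.gauss(*task["time"])
            total += power * time            # per-task energy, as in (3.73)
        samples.append(total)                # sum over all tasks, as in (3.74)
    return samples


energy = total_dynamic_energy_samples(TASKS, 20000, random.Random(11))
mean_energy = sum(energy) / len(energy)
```

With independent power and time, the mean of each product equals the product of the means,
so the mean total is expected near 22 · 21 + 26 · 32 = 1294 pJ for this mapping.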
The structure of the implemented PM is depicted in Fig. 3.30. Each PN introduces in
the dynamic energy PM a structural element containing a product node which combines
two parametric nodes: the average dynamic power dissipation $P^{RT_j}_{d,i}$ and the execution
time $T^x_i$.
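[Figure 3.30: structural element in which a product node combines the parametric nodes for the
dynamic power and the execution time of task i into Ed,i; a + node sums all Ed,i into Ed,total.]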
Fig. 3.30: Statistical performance macromodel for estimating the dynamic energy consumption.
Note that the macromodel structure depicted in Fig. 3.30 includes only the contribution
of the processing nodes. The detailed modeling of communication nodes and the
corresponding structures in the performance macromodels are presented in chapter 4.
If a PN is assigned to a different resource, the dynamic energy PM is modified as
follows. First, the random variables of the $T^x_i$ and $P^{RT_j}_{d,i}$ nodes are updated to the
values corresponding to the new resource. After that, the value of the affected multiplication
node is reset, as well as the values of all its downstream PMNs.
3.6.2 Leakage Energy Macromodels
The leakage energy consumption of a resource Rk is estimated from the product between
the average leakage power Pl,RTj specified as a distribution in the application profile (see
Sec. 3.1.2) and the total time in which the resource is idle:
\[ E_{l,R_k} = \sum_{i=1}^{N_k-1} \left( T^s_{i+1} - T^e_i \right) P_{l,RT_j} + \left( T^e_{\mathrm{global}} - T^e_{N_k} \right) P_{l,RT_j} \tag{3.75} \]
where Nk is the number of tasks mapped on Rk and $T^e_{\mathrm{global}}$ is the total system latency.
The total idle time is computed as the sum of slacks between the execution of the scheduled
tasks, therefore the start and end times of each processing node must be known.
Since the starting times $T^s_i$ and the end times $T^e_i$ are computed within the delay
macromodel (see Sec. 3.5.1), they can be directly used by connecting links between the output
of the nodes from the delay PM which compute them and the nodes of the leakage energy
PM.
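For one set of start and end times, (3.75) reduces to the arithmetic sketched below. The
function name and the numeric values are hypothetical; in the statistical setting the same
computation would be carried out by the difference, sum, and product operators on random
variables. The optional `t_start_global` argument adds the gap before the first task, which
is not part of (3.75) itself.

```python
def leakage_energy(tasks, t_end_global, p_leak, t_start_global=None):
    """Leakage energy of one resource from its idle time, following (3.75).

    tasks -- list of (start, end) times in scheduling order on this resource
    """
    # Slacks between successively scheduled tasks.
    idle = sum(nxt[0] - cur[1] for cur, nxt in zip(tasks, tasks[1:]))
    # Slack between the last task and the end of the complete schedule.
    idle += t_end_global - tasks[-1][1]
    # Optional gap between the system start and the first task.
    if t_start_global is not None:
        idle += tasks[0][0] - t_start_global
    return idle * p_leak


# Hypothetical schedule in ns with a leakage power of 9 mW; result in pJ.
schedule = [(5.0, 21.0), (30.0, 66.0)]
e_leak = leakage_energy(schedule, t_end_global=100.0, p_leak=9.0)
e_leak_with_gap = leakage_energy(schedule, 100.0, 9.0, t_start_global=0.0)
```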
The leakage energy PM is built starting from a list with all processing resources. For
each resource in the list, a structural element is created containing the following subele-
ments:
• For each slack between the PNs scheduled on the given resource, a pair of links
from the delay PM and one difference node are connected as shown in Fig. 3.31.
This difference node evaluates the slack between two successively scheduled PNs
PNi and PNi+1. The minuend link connects to the output of a PMN from the delay
3.6 PERFORMANCE MACROMODELS FOR ENERGY CONSUMPTION 95
max +
Ti
s
Ti
e
Ti
x
max +
Ti+1
s
Ti+1
e
Ti+1
x
Pl,RTj
E l,i
+ E l,total
Delay
macromodel
S
tr
u
c
tu
ra
l
e
le
m
e
n
t
t
Ti+1
s
Ti
e
slack
Fig. 3.31: Statistical performance macromodel for leakage energy estimation.
PM, which evaluates the start time $T^s_{i+1}$ of $PN_{i+1}$, while the subtrahend link connects
to the end time $T^e_i$ of the previous node $PN_i$.
• A similar set of links and a difference node for computing the slack of the last PN in
the scheduling list ($PN_{N_k}$), where the minuend is the system end time, as evaluated
at the output of the delay PM.
• An additional difference node which evaluates the timing gap between the initial
system start time (as specified by the root of the delay PM) and the start time of the
first PN scheduled on this resource.
• A sum node to add all the slacks, a parametric node, which stores the average leak-
age power of the resource, and a product node which multiplies the power with the
sum of the slacks.
Finally, the PM output is implemented by a sum node, which adds the energy consump-
tions of all active processing resources.
Whenever the scheduling order of two PNs changes, the following modifications must
be performed on the leakage energy PM. First, the complete structural element
corresponding to the resource on which the scheduling order changed must be updated, since
both the PN succession and the slacks have changed. In addition, the PM output node
must be reset whenever a structural element is updated. It is important to note that
although other slacks, from different resources, might change during this schedule switch
(due to data dependencies), the affected structural elements are automatically updated
to the new values through the reset propagated downstream from the delay PM. This
update is facilitated by the direct connections between the two PMs.
(a) Scenario 1 (b) Scenario 2
Fig. 3.32: Delay-optimized resource mapping (a) and mapping with improved energy
consumption (b).
                 R0            R1            R2
                 µ      σ      µ      σ      µ      σ
P_{l,RTj}        22     2.2    9      1.8    43     3.2
P^{RTj}_{d,1}    22     2.9    17     2.4    27     3.2
P^{RTj}_{d,2}    19     1.8    16     2.1    26     3.2
P^{RTj}_{d,3}    26     3.7    21     3.2    32     3.9
P^{RTj}_{d,4}    24     3.6    21     3.1    31     3.9
P^{RTj}_{d,5}    19     2.7    16     2.6    24     3.1
Tab. 3.2: Power parameters for five PNs and three different resources. Values given in milliwatts.
Furthermore, if a PN is reassigned to a new resource, the structural elements corre-
sponding to the previous and to the new resource must be updated. Again, all the slacks
affected by this change are updated automatically through the connections with the delay
PM.
3.6.3 Application Examples
Let the five PNs described by the execution times from Tab. 3.1 be mapped on the re-
sources R0, R1, and R2 according to the task graphs in Fig. 3.32. The average leakage
power of each resource and the average dynamic power required for executing the PNs
are specified using statistical distributions with the mean values and standard deviations
given in Tab. 3.2.
The dynamic and leakage energy consumptions evaluated with the developed PMs
are displayed in Fig. 3.33 for the two scenarios. It can be observed that the first mapping
scenario is optimized for speed, as it uses three different resources for executing
parallel tasks. The higher leakage consumption is caused by the idle time of R1 waiting for
the results from PN1 and by the resource R2, which has the highest leakage power. The
second mapping scenario shown in Fig. 3.32 achieves an improved energy consumption
by mapping the tasks only on R0 and R1. Most of the leakage in this case is due to the
slack between the execution of PN2 on R0 and the total end time $T^e_{global}$ and to the
slack between the initial time $T^s_{global}$ and $T^s_3$ on R1. As shown in Fig. 3.33(a),
the dynamic power is only moderately optimized, mainly by avoiding the power-intensive
execution on R2 (see Tab. 3.2). In addition, the unused resources can be turned off;
therefore, they practically do not contribute to the total energy consumption. A more
radical power reduction could be achieved by mapping all the tasks on R1, however at the
expense of a much higher latency, since the execution of all PNs would be serialized.
Fig. 3.33: Statistical dynamic energy (a) and leakage energy (b) consumptions evaluated using the
energy PMs. Both panels plot the normalized pdf over the energy in nJ for Scenario 1 and
Scenario 2.
3.7 Partitioning, Assignment, and Scheduling Optimization
The estimations obtained from the developed PMs are used for guiding the partitioning of
processing tasks, assignment on resources (mapping), and for scheduling optimizations.
Given the estimated metrics for delay and energy consumption, a global optimization
method is employed for the exploration of the solution space. Within the optimization
context, particularly interesting decisions are changing the task mapping configuration and
trying several scheduling sequences on the resources. A further important aspect is the
exploration of communication resources and optimization of the communication links.
The implemented exploration of the solution space is guided by the evaluation of a
parametrized cost function which describes the design performance. By balancing sev-
eral parameters, the cost function can be tuned to reflect multiple design preferences re-
garding the tradeoff between speed and energy consumption and the desired yield level.
INITIALSCHEDULE()
  for each Ri ∈ SR
    do Start scheduling list Ls,i;
       Get list of assigned PNs LPN,i;
       for each PNj ∈ LPN,i
         do if Ls,i = ∅
              then Append PNj to Ls,i;
              else for each PNk ∈ LPN,i
                     do if !PNk.ISUPSTREAM(PNj)
                          then Insert PNj in Ls,i before PNk;
                   if PNj ∉ Ls,i
                     then Append PNj to Ls,i;
  UPDATEPMS();
Listing 3.2: Algorithm for finding the initial scheduling configuration.
3.7.1 Methods for Solution Space Exploration
Several design variables are searched during the exploration. One of them is the con-
figuration of assigned resources for each PN. After an initial resource mapping, PNs are
reassigned in the optimization process to other compatible resources. Another design
variable is the scheduling order of PNs on the assigned resources. Once an initial schedul-
ing configuration is found, according to the dependencies in the task graph, the order of
tasks in the scheduling lists is changed in the search for a better solution. Furthermore,
additional design characteristics are explored, such as the choice of communication re-
sources with different signaling methods, as well as circuit-level optimizations including
voltage scaling and body biasing, as described in Sec. 4.4.5.
The mapping between processing nodes and compatible resources is defined by the
function $Res_{map} : S_{PN} \to S_R$, such that $Res_{map}(PN_j) = R_i$,
$\forall PN_j \in S_{PN}$, $R_i \in S_{R,j}$, where $S_{PN}$ is the set of processing nodes,
$S_R$ is the set of available resources, and $S_{R,j} \subset S_R$ is the subset of
compatible resources for $PN_j$.
A static non-preemptive scheduling method is employed within this work, in which
each processing resource maintains a list of the scheduled PNs. An initial valid scheduling
configuration must be found as the starting point for explorations in the solution
space. The algorithm which finds the initial scheduling is presented in Listing 3.2. It
schedules each node from the list of assigned nodes before the first node which is not
placed upstream in the task graph (the method PNk.ISUPSTREAM(PNj) checks if PNk is
found upstream of PNj).
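For illustration, a dependency-respecting initial schedule for a single resource can be sketched in Python as follows. This is a simplified variant and not a transcription of Listing 3.2: each node is inserted in front of the first already scheduled node that depends on it, and is appended otherwise. The successor map `succ`, the function names, and the example graph are assumptions made for this sketch.

```python
from typing import Dict, List


def is_upstream(a: str, b: str, succ: Dict[str, List[str]]) -> bool:
    """Return True if node a is found upstream of node b in the task graph."""
    stack, seen = list(succ.get(a, [])), set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, []))
    return False


def initial_schedule(assigned: List[str], succ: Dict[str, List[str]]) -> List[str]:
    """Build a dependency-respecting scheduling list for one resource."""
    sched: List[str] = []
    for pn_j in assigned:
        for pos, pn_k in enumerate(sched):
            # Place pn_j in front of the first scheduled node that depends on it.
            if is_upstream(pn_j, pn_k, succ):
                sched.insert(pos, pn_j)
                break
        else:
            sched.append(pn_j)  # no scheduled node depends on pn_j
    return sched


# Hypothetical task graph: PN1 -> PN2 -> PN4 and PN1 -> PN3.
succ = {"PN1": ["PN2", "PN3"], "PN2": ["PN4"]}
order = initial_schedule(["PN4", "PN3", "PN1", "PN2"], succ)
```

Because a node is always placed before its first scheduled descendant, and every scheduled ancestor already precedes that descendant, the resulting list never places a node ahead of one of its predecessors.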
A nested loop optimization algorithm based on simulated annealing has been im-
plemented in this work for the solution space exploration. Nevertheless, the developed
statistical methodology and the macromodels can be employed in other combinatorial
optimization approaches as well.
3.7.2 Cost Function Evaluation
The cost function $C : P^n \to P$ is defined on the set of real pdfs defined by (3.60) and
embeds the performance metrics which must be optimized in the synthesis. Within this
work, the cost function is implemented as a weighted sum of the delay, dynamic energy,
and leakage energy estimated using the performance macromodels:
$$C(T^e_{global}, E_{d,total}, E_{l,total}) = w_T \cdot T^e_{global} + w_{E_d} \cdot E_{d,total} + w_{E_l} \cdot E_{l,total} \tag{3.76}$$
The weights $w_T$, $w_{E_d}$, and $w_{E_l}$ are adjustable; therefore, the combined cost
function can lead to a design optimized for speed, for energy consumption, or for a
weighted combination of them.
As discussed in Sec. 3.4.4, the cost pdf C() is interpreted with a quantile function.
The extracted quantile employed for design decisions is also adjustable and offers a wide
range for defining the desired parametric yield.
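The effect of interpreting the cost with a quantile rather than with its mean can be illustrated by a small Monte Carlo sketch in Python. The sketch replaces the discretized pdfs of this work by random samples and draws the three metrics independently, which is a simplifying assumption; all names and numeric values are hypothetical.

```python
import random
from typing import List, Sequence


def quantile(samples: Sequence[float], q: float) -> float:
    """Empirical q-quantile (0 < q <= 1) taken from the sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def cost_samples(delay: Sequence[float], e_dyn: Sequence[float],
                 e_leak: Sequence[float], w_t: float, w_ed: float,
                 w_el: float) -> List[float]:
    """Weighted sum of the three metrics, evaluated sample by sample."""
    return [w_t * t + w_ed * ed + w_el * el
            for t, ed, el in zip(delay, e_dyn, e_leak)]


def draw(rng: random.Random, mean: float, sigma: float, n: int) -> List[float]:
    """n normally distributed samples of one normalized metric."""
    return [rng.gauss(mean, sigma) for _ in range(n)]


rng = random.Random(1)
n = 20000
# Candidate A: slightly higher mean, small spread. Candidate B: lower mean, large spread.
cand_a = cost_samples(draw(rng, 1.00, 0.02, n), draw(rng, 1.00, 0.02, n),
                      draw(rng, 1.00, 0.02, n), 1.0, 1.0, 1.0)
cand_b = cost_samples(draw(rng, 0.98, 0.10, n), draw(rng, 0.98, 0.10, n),
                      draw(rng, 0.98, 0.10, n), 1.0, 1.0, 1.0)

mean_a, mean_b = sum(cand_a) / n, sum(cand_b) / n
q99_a, q99_b = quantile(cand_a, 0.99), quantile(cand_b, 0.99)
```

With these assumed distributions, candidate B has the lower mean cost, whereas candidate A has the lower cost at the 99% quantile, so a yield-oriented decision would prefer A.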
3.7.3 Optimization Loop
The simulated annealing-based optimization loop is performed over a combinatorial so-
lution space having the following coordinates:
• The values of Resmap (PNj) , ∀PNj ∈ SPN ;
• The scheduling order of PNs on each processing resource;
• The type of signaling resource, supply voltage value, and body bias, on each com-
munication segment, as presented in chapter 4.
The first iteration performs a uniform random assignment of PNs to the compatible
resources. Any such assignment is valid; however, the achieved performance
is strongly influenced by the data and resource dependencies. The initial scheduling is
performed as presented in Listing 3.2. Starting from this initial configuration, iterative
jumps in the neighborhood of the current solution are performed by changing one
of the above-mentioned coordinates. The type of jump at each iteration is determined by
different adjustable probabilities and typically considers the following aspects:
• Re-assigning a PN to a different resource requires a change in the scheduling lists of
the affected resources;
• A switch between two PNs in the scheduling list of a resource is performed more
often than a PN remapping, to allow for finding the optimal scheduling for a given
resource mapping configuration;
Fig. 3.34: Initial random assignment and scheduling (a) and optimized configuration (b).
• Since PN remapping also changes the communication segments needed between
the resources, a change of the signaling resource on a segment occurs with a higher
probability, to allow finding the optimal communication circuits;
• For a given resource mapping and signaling resource configuration, several supply
voltage and body bias changes are performed.
While remapping a PN to another compatible resource always results in a valid
configuration, only a restricted subset of resources allows a switch in their
scheduling list. More exactly, two PNs in the scheduling list of a resource can be switched
only if neither one of them is placed upstream of the other in the task graph. As a
consequence, before performing the scheduling switch, a list is built with the resources which
allow such a scheduling switch and one of them is selected with a uniform random
distribution.
3.7.4 Optimization Results
To illustrate the results of the assignment and scheduling optimization, a task graph with
16 processing nodes has been mapped on a resource set containing three general purpose
processors, two ASICs, and one memory block. The average execution times of the
processing nodes varied between 10 and 20 ns (mean), the inter-task communication loads
were selected between 5 and 15 Mb (mean), and the inter-resource communication
segments had lengths ranging between 1 and 25 mm. Dynamic power varied in mean value
between 10 and 30 mW and the leakage power was selected between 10 and 40 mW.
The initial mapping and scheduling configuration is shown in Fig. 3.34(a), where the
PNs are assigned randomly to the resources. The communication segments which
connect the resources are first dictated by the data dependencies between the PNs, and their
width corresponds to the distance between the resources: longer buses have a wider
bit width, whereas local buses have only one data wire. It is assumed that the resource
floorplan is fixed.
Fig. 3.35: Delay, dynamic energy, and leakage energy results before and after the optimization.
The panels plot the normalized pdf over the delay in ms, the dynamic energy in mJ, and the
leakage energy in mJ.
The optimization used equal weights for the delay, dynamic energy, and leakage
energy. Further, the simulated annealing used an exponential cooling schedule with factor
0.9 and 100 steps, each with 10 000 inner iterations. At each iteration, remapping a PN to a
different resource occurred with a probability $P_{remap} = 0.33$, while switching the
scheduling order of two PNs was performed with the probability $P_{reschedule} = 0.67$. Balanced
weights have been employed in the cost function, such that improvements in delay and
energy consumption have the same influence on the quality of a solution. Furthermore, a
99% level has been set for the parametric yield by means of quantiles.
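The nested-loop structure with these settings can be sketched in Python as shown below. The sketch is not the implementation used in this work: the Metropolis acceptance rule, the initial temperature, the toy cost (largest resource load, ignoring dependencies and communication), and all names are assumptions made for illustration, and the inner loop is shortened in the usage example.

```python
import math
import random
from typing import Callable, List, Tuple

State = Tuple[List[int], List[int]]  # (resource of each PN, scheduling order)
Move = Callable[[State, random.Random], State]


def anneal(state: State, cost: Callable[[State], float], remap: Move,
           reschedule: Move, rng: random.Random, t0: float = 1.0,
           factor: float = 0.9, steps: int = 100, inner: int = 10000,
           p_remap: float = 0.33) -> Tuple[State, float]:
    """Nested-loop simulated annealing with exponential cooling."""
    current, c_cur = state, cost(state)
    best, c_best = current, c_cur
    temp = t0
    for _ in range(steps):          # outer loop: temperature steps
        for _ in range(inner):      # inner loop: jumps at a fixed temperature
            move = remap if rng.random() < p_remap else reschedule
            cand = move(current, rng)
            c_cand = cost(cand)
            # Accept improvements always, deteriorations with Metropolis probability.
            if c_cand <= c_cur or rng.random() < math.exp((c_cur - c_cand) / temp):
                current, c_cur = cand, c_cand
                if c_cur < c_best:
                    best, c_best = current, c_cur
        temp *= factor              # exponential cooling schedule
    return best, c_best


DURATIONS = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]  # hypothetical execution times
N_RES = 2                                   # hypothetical number of resources


def toy_cost(state: State) -> float:
    """Largest accumulated load over all resources (dependencies ignored)."""
    loads = [0.0] * N_RES
    for pn, res in enumerate(state[0]):
        loads[res] += DURATIONS[pn]
    return max(loads)


def toy_remap(state: State, rng: random.Random) -> State:
    """Reassign one randomly chosen PN to a randomly chosen resource."""
    mapping = list(state[0])
    mapping[rng.randrange(len(mapping))] = rng.randrange(N_RES)
    return mapping, state[1]


def toy_reschedule(state: State, rng: random.Random) -> State:
    """Switch two randomly chosen positions in the scheduling order."""
    order = list(state[1])
    i, j = rng.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]
    return state[0], order


rng = random.Random(0)
start: State = ([0] * len(DURATIONS), list(range(len(DURATIONS))))
best, best_cost = anneal(start, toy_cost, toy_remap, toy_reschedule, rng,
                         steps=100, inner=50)
```

In the toy cost the scheduling order has no influence, so the reschedule move only demonstrates how the two jump types are selected with their probabilities. A realistic cost would be the quantile of the weighted sum evaluated by the performance macromodels.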
Fig. 3.34(b) illustrates the best solution achieved during the optimization. Tasks with
large execution times on the general purpose processors have been mapped on ASICs
wherever compatible. Further, data-dependent tasks with large communication loads have
been mapped preferably on the same resource, to avoid the time-expensive inter-resource
communication. Also, memory-access-intensive tasks have been mapped on the resources
which communicate with the memory block via high-speed buses. Finally, the
optimized scheduling minimized the slacks on the leakage-intensive resources.
The estimated pdfs for the total delay, dynamic energy, and leakage energy are pre-
sented in Fig. 3.35 for the initial and for the optimized configuration. The results show
that the optimization achieved an improvement of 13.2% in delay, 9.8% in dynamic en-
ergy, and 4% in leakage energy at the considered 99% confidence point.
3.8 Summary
Starting from the need of accurately specifying statistical parameter distributions in the
application and architectural profile and of employing them throughout the synthesis,
this chapter developed a complete methodology for the statistical modeling of perfor-
mance metrics.
The first contribution of this chapter is a generalized random variable model, capa-
ble of representing non-standard estimated distributions using discretized pdfs with ad-
justable accuracy. The typical usage, accuracy control, as well as a sampling method have
been presented.
Another important contribution is the development of a propagation method for sta-
tistical distributions across the modeling expressions. Analytic expressions for the most
often used operators, including the detailed derivation of a statistical product operator,
have been presented. In this context, a further important contribution is the develop-
ment of a fast generalized method for implementing statistical operators with a precision
comparable to Monte Carlo at a very small fraction of the execution time.
Further, embedding the random variable model in the system representation has been
explained. The implied particularities, such as the downstream propagation of updates
and the result interpretation using quantile functions, have also been discussed.
Finally, the complete structures of statistical macromodels for delay and energy con-
sumption have been presented and their application has been illustrated using a few ex-
amples. Moreover, the global optimization of resource mapping and scheduling using the
statistical macromodels has been illustrated and analyzed in the context of an application
example proving the efficiency of the developed methodology.
Chapter 4
Technology-Accurate, Variability-Aware
Circuit-Level Models
Contents
4.1 Variability-Aware Transistor Model
4.1.1 BSIM4.3-Based Current Source Model
4.1.2 Modeling Spatially-Correlated Process Parameter Variations
4.1.3 Inclusion of Random Variables and Results Estimation
4.2 Pulsed Current-Mode Signaling Model
4.2.1 Derivation of Current Switching Paths
4.2.2 Equivalent Current-Source Circuit Model
4.2.3 Analytic Model for Delay and Energy Consumption
4.2.4 Performance Evaluation under Voltage Scaling and Body Biasing
4.3 Voltage-Mode Signaling Model
4.3.1 Equivalent Current-Source Circuit Model
4.3.2 Analytic Model for Delay and Energy Consumption
4.3.3 Performance Evaluation under Voltage Scaling and Body Biasing
4.4 Modeling of Communication Segments
4.4.1 Transceiver and Interconnect Model
4.4.2 Floorplan Model using Clusters
4.4.3 Estimation of Communication Circuit Placement on Die
4.4.4 Quick Delay Solution
4.4.5 Implementation of Communication Nodes
4.4.6 Performance Results
4.5 Summary
Optimizations achieved at the level of task mapping and scheduling have been pre-
sented in Sec. 3.7. Nevertheless, more radical optimizations of the communication archi-
tecture can be achieved at the circuit level, as discussed in Sec. 2.6, through the choice of
the signaling method and voltage optimizations in the transceiver circuit, such as supply
voltage scaling and body biasing.
This chapter provides a selection of circuit-level models for two different signaling
methods, developed by identifying the main current paths during circuit operation and
by deriving analytic expressions from equivalent circuit representations. Technology-
related information is included in the circuit models by using transistor-level current ex-
pressions derived from the BSIM4.3 model [167]. In addition, the models support vari-
ability descriptions for every process parameter specified in the BSIM model card using
the random variable models and the extended set of statistical operators developed in
chapter 3. Apart from the selection of a signaling circuit, voltage scaling and body bias-
ing techniques can be applied on the circuit models to show the immediate influence of
these adjustments on the performance metrics.
The content of this chapter is organized as follows. First, Sec. 4.1 develops a statis-
tical current-source transistor model derived from BSIM4 equations and describes the
modeling approach for spatially-correlated parameter variations using grids and correla-
tion decay models. On this basis, Sec. 4.2 derives analytic circuit-level models for delay
and energy consumption for pulsed current-mode signaling circuits and analyzes their
performance at various segment lengths, as well as under voltage scaling and body bias-
ing. Another analytic model is developed for voltage-mode signaling circuits in Sec. 4.3
and the performance of the analyzed signaling methods is compared. In the following,
Sec. 4.4 presents the embedding of the circuit-level models in the top-level performance
macromodels and illustrates the use of both signaling methods to achieve an optimized
communication architecture.
4.1 Variability-Aware Transistor Model
A technology-accurate circuit model should include all the relevant effects exhibited by
state-of-the-art manufacturing processes. In addition, for a good accuracy, parameter
variations must be considered within the model. Considering these observations, this
work derives circuit-level models for the communication structures starting from an ac-
curate statistical transistor model.
The underlying transistor-level model employed in this work is derived from the
BSIM4 transistor model, currently accepted on a very wide scale as the de facto standard in
transistor modeling for very deep sub-micron CMOS technologies. As a consequence, by
using BSIM4 equations, this model includes all the model parameters currently employed
to describe process characteristics. In addition, the statistical analysis methodology
developed in chapter 3 is applied to the model, thus allowing for the propagation of variability
from each model parameter to the final current expression.
Fig. 4.1: Statistical current-source transistor model based on BSIM4.3 equations and parameters.
In this way, a completely statistical transistor-level model is developed, with extended
modeling capabilities and good technology accuracy, which enables the use of parameter
variations for each process parameter specified in the BSIM4 standard.
4.1.1 BSIM4.3-Based Current Source Model
A current source transistor model is employed in this work, as illustrated in Fig. 4.1. The
current equations are derived from the BSIM4.3 transistor model, which is also used in
the SPICE and Spectre [30] model cards for the 90 nm technology used throughout this
work.
According to the BSIM4.3 specification [167], the effective gate voltage Vgse is com-
puted considering the maximum electrical field in the polysilicon gate and the electric
field in the gate oxide as:
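$$V_{gse} = V_{fb} + \Phi_s + \frac{q \varepsilon_{si} N_{gate} T_{ox,e}^2}{\varepsilon_{r,ox}^2} \left( \sqrt{1 + \frac{2 \varepsilon_{r,ox}^2 (V_{gs} - V_{fb} - \Phi_s)}{q \varepsilon_{si} N_{gate} T_{ox,e}^2}} - 1 \right) \tag{4.1}$$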
where Vfb is the flat-band voltage, Ngate is the poly-gate doping concentration, Tox,e is
the electrical gate oxide thickness, and εr,ox is the gate dielectric constant. Φs is the surface
potential and is computed as:
$$\Phi_s = 0.4 + \frac{k_B T}{q} \ln\left( \frac{N_{dep}}{n_i} \right) + \Phi_n \tag{4.2}$$
where Ndep is the channel doping concentration, Φn is the non-uniform vertical doping
effect on the surface potential, and ni denotes the intrinsic carrier concentration of silicon,
which exhibits a non-linear dependence on the parameter measurement temperature.
Further, the total threshold voltage is modeled by the following relationship:
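$$\begin{aligned}
V_{th} = {} & V_{th0} + \left( K_{1,ox} \sqrt{\Phi_s - V_{bseff}} - K_1 \sqrt{\Phi_s} \right) \sqrt{1 + \frac{L_{peb}}{L_{eff}}} - K_{2,ox} V_{bseff} \\
& + K_{1,ox} \left( \sqrt{1 + \frac{L_{pe0}}{L_{eff}}} - 1 \right) \sqrt{\Phi_s} + (K_3 + K_{3b} V_{bseff}) \frac{T_{ox,e} \Phi_s}{W_{eff} + W_0} \\
& - \frac{1}{2} \left[ \frac{D_{vt0,w}}{\cosh\left( D_{vt1,w} \frac{L_{eff} W_{eff}}{l_{tw}} \right) - 1} + \frac{D_{vt0}}{\cosh\left( D_{vt1} \frac{L_{eff}}{l_t} \right) - 1} \right] (V_{bi} - \Phi_s) \\
& - \frac{0.5}{\cosh\left( D_{sub} \frac{L_{eff}}{l_{t0}} \right) - 1} (\eta_0 + \eta_b V_{bseff}) V_{ds}
\end{aligned} \tag{4.3}$$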
where $V_{th0}$ is the long-channel threshold voltage at zero body bias, $K_1$ is the body-effect
coefficient, $L_{peb}$ models the lateral non-uniform doping effect, $L_{pe0}$ is the lateral
non-uniform doping at $V_{bs} = 0$, $K_3$ is the narrow width coefficient, $K_{3b}$ represents the body
effect coefficient of $K_3$, $W_0$ is a narrow width parameter, $D_{vt0,w}$ and $D_{vt1,w}$ are the first
and second coefficients of narrow-width effects, $D_{vt0}$ and $D_{vt1}$ are the first and second
coefficients of short-channel effects, $D_{sub}$ is the coefficient of the DIBL effect on the output
resistance, $\eta_0$ is the DIBL coefficient in the subthreshold region, and $\eta_b$ is the body-bias
coefficient for the subthreshold DIBL effect. The coefficients $K_{1,ox} = K_1 T_{ox,e}/T_{ox,m}$ and
$K_{2,ox} = K_2 T_{ox,e}/T_{ox,m}$ model the dependence of $K_1$ and $K_2$ on the gate oxide
thickness, where $K_2$ is the charge-sharing parameter and $T_{ox,m}$ is the oxide thickness at which
the parameters are extracted. $V_{bi}$ designates the built-in voltage of the source and drain
junctions and is expressed as:
$$V_{bi} = \frac{k_B T}{q} \ln\left( \frac{N_{dep} N_{sd}}{n_i^2} \right) \tag{4.4}$$
where $N_{sd}$ is the doping concentration of the source and drain diffusion regions.
Short-channel and DIBL effects are included by defining the characteristic channel
length as:
$$l_t = \sqrt{\frac{\varepsilon_{si} T_{ox,e} X_{dep}}{\varepsilon_{r,ox}}} \, (1 + D_{vt2} V_{bs}) \tag{4.5}$$
while the characteristic length at zero body bias is given by
$l_{t0} = \sqrt{\varepsilon_{si} T_{ox,e} X_{dep0} / \varepsilon_{r,ox}}$. Hereby, $D_{vt2}$ represents the
body-bias coefficient of short-channel effects, $X_{dep} = \sqrt{2 \varepsilon_{si} (\Phi_s - V_{bs}) / (q N_{dep})}$
is the depletion width, and $X_{dep0} = \sqrt{2 \varepsilon_{si} \Phi_s / (q N_{dep})}$ is the depletion width
at zero body bias. In addition, the characteristic length considering the narrow width effect in
short channels is given by:
$$l_{tw} = \sqrt{\frac{\varepsilon_{si} T_{ox,e} X_{dep}}{\varepsilon_{r,ox}}} \, (1 + D_{vt2w} V_{bs}) \tag{4.6}$$
where $D_{vt2w}$ is the body-bias coefficient of narrow width effects at small channel lengths.
The effective body to source voltage Vbseff limits the body bias to an upper boundary
and is evaluated as:
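$$V_{bseff} = V_{bc} + \frac{1}{2} \left[ (V_{bs} - V_{bc} - \delta) + \sqrt{(V_{bs} - V_{bc} - \delta)^2 - 4 \delta V_{bc}} \right] \tag{4.7}$$
where $\delta$ is a constant equal to $10^{-3}$ V and $V_{bc} = 0.9 \left( \Phi_s - \frac{K_1^2}{4 K_2^2} \right)$
represents the maximum boundary for $V_{bs}$.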
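The smoothing behavior of (4.7) can be checked numerically with a short Python sketch. The parameter values below are hypothetical and were chosen only so that the boundary Vbc is negative, which is assumed here; the function names are likewise illustrative.

```python
import math


def vbs_eff(vbs: float, vbc: float, delta: float = 1e-3) -> float:
    """Effective body-to-source voltage following the form of (4.7)."""
    x = vbs - vbc - delta
    return vbc + 0.5 * (x + math.sqrt(x * x - 4.0 * delta * vbc))


def vbc_bound(phi_s: float, k1: float, k2: float) -> float:
    """Boundary value Vbc = 0.9 * (phi_s - K1^2 / (4 * K2^2))."""
    return 0.9 * (phi_s - k1 * k1 / (4.0 * k2 * k2))


# Hypothetical parameter values (volts), giving a boundary of about -4.86 V.
vbc = vbc_bound(phi_s=0.85, k1=0.5, k2=-0.1)
for vbs in (0.0, -1.0, -4.0, -8.0):
    print(f"Vbs = {vbs:5.1f} V  ->  Vbseff = {vbs_eff(vbs, vbc):8.4f} V")
```

For body biases well inside the boundary, the effective value follows Vbs closely (and equals zero at Vbs = 0), whereas a bias beyond the boundary is limited smoothly to a value just above Vbc.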
The effective channel length and effective channel width are computed as:
$$L_{eff} = L + X_l - 2 \left( L_{int} + \frac{L_l}{L^{L_{ln}}} + \frac{L_w}{W^{L_{wn}}} + \frac{L_{wl}}{L^{L_{ln}} W^{L_{wn}}} \right) \tag{4.8}$$
$$W_{eff} = \frac{W}{N_f} + X_w - 2 \left[ W_{int} + \frac{W_l}{L^{W_{ln}}} + \frac{W_w}{W^{W_{wn}}} + \frac{W_{wl}}{L^{W_{ln}} W^{W_{wn}}} + D_{wg} V_{gsteff} + D_{wb} \left( \sqrt{\Phi_s - V_{bseff}} - \sqrt{\Phi_s} \right) \right] \tag{4.9}$$
where $X_l$ and $X_w$ are the length and width variations due to masking and etching,
respectively, $L_{int}$ and $W_{int}$ are the lateral diffusion and the width reduction for one side,
respectively, $D_{wg}$ and $D_{wb}$ are the dependences of the effective channel width on the gate
bias and body bias, respectively, and the remaining parameters describe the dependence of
the effective length and width on variations in channel dimensions.
An effective Vgse − Vth voltage difference is modeled by the following equation:
$$V_{gsteff} = \frac{n v_t \ln\left\{ 1 + \exp\left[ \frac{m (V_{gse} - V_{th})}{n v_t} \right] \right\}}{m + n C_{ox,e} \sqrt{\frac{2 \Phi_s}{q N_{dep} \varepsilon_{si}}} \exp\left[ -\frac{(1 - m)(V_{gse} - V_{th}) - V_{off}}{n v_t} \right]} \tag{4.10}$$
where $m = 0.5 + \arctan(M_{inv})/\pi$, $M_{inv}$ is a fitting parameter for moderate inversion, $v_t$
is the thermal voltage, $C_{ox,e} = \varepsilon_{r,ox} \varepsilon_0 / T_{ox,e}$, $V_{off}$ is the threshold
voltage offset, and $n$ is the subthreshold swing computed as:
$$n = 1 + N_{fact} \frac{C_{dep}}{C_{ox,e}} + \frac{C_{dsct} + C_{it}}{C_{ox,e}} \tag{4.11}$$
where $N_{fact}$ is the subthreshold swing coefficient, $C_{dep} = \varepsilon_{si}/X_{dep}$, $C_{it}$ is an
interface trap parameter, and $C_{dsct}$ is computed from $C_{dsc}$ (the source/drain and channel
coupling capacitance), $C_{dscd}$ (the drain-bias sensitivity of $C_{dsc}$), and $C_{dscb}$ (the
body-bias sensitivity of $C_{dsc}$).
Finally, the channel current is computed as:
$$I_{ds} = \frac{I_{ds0} N_f}{1 + \frac{R_{ds} I_{ds0}}{V_{dseff}}} \left[ 1 + \frac{1}{C_{clm}} \ln\left( \frac{V_{A,sat} + V_{A,CLM}}{V_{A,sat}} \right) \right] \left( 1 + \frac{V_{ds} - V_{dseff}}{V_{A,DIBL}} \right) \left( 1 + \frac{V_{ds} - V_{dseff}}{V_{A,DITS}} \right) \left( 1 + \frac{V_{ds} - V_{dseff}}{V_{A,SCBE}} \right) \tag{4.12}$$
where Nf is the number of transistor fingers, Ids0 expresses the linear drain current, de-
pendent on the effective mobility model, Rds embeds the bias-dependent drain-source
resistance model, Cclm is the capacitance associated to the channel length modulation,
and the effective drain to source voltage Vdseff is formulated as a smooth transition be-
tween Vds and the saturation voltage Vdsat. The remaining voltage expressions in the equa-
tion (4.12) represent components of the Early voltage, which is defined for the analysis of
the output resistance of the device in saturation. These include the saturation component
VA,sat, the channel length modulation component VA,CLM , the DIBL-induced component
VA,DIBL, the component due to the substrate current-induced body effect VA,SCBE , and the
component describing the drain-induced threshold shift VA,DITS .
Similar to the drain current, expressions for the gate-to-substrate tunneling current
Igb, gate-to-channel current Igc, as well as gate-to-source Igs and gate-to-drain Igd currents
are defined in the BSIM4.3 specification. The total set of bias-dependent device currents
defines the source current model illustrated in Fig. 4.1.
It should be noted that all symbols written in bold in this section represent modeling
or process parameters defined in the BSIM4.3 transistor model. Variations in each of
the mentioned parameters, as well as in the remaining process parameters on which the
device current expressions depend¹, can be expressed as statistical distributions, as
explained in the next section. Hence, the current expressions are statistically evaluated and
the result is stored in pdf form.
4.1.2 Modeling Spatially-Correlated Process Parameter Variations
As explained in the previous section, parameter variations can be described for any model
parameter from the underlying BSIM4 specification. Without loss of generality, the
simulations employed in this work include descriptions of variations in the following process
parameters:
• $X_l$ – the length variation due to masking and etching, described by:
$$X_l = X_{l,nom} + \sigma_{X_l} N(0, 1) \tag{4.13}$$
• $V_{th0}$ – the long-channel threshold voltage at zero body bias, with a variation
component inversely dependent on the square root of the channel area [71, 34]:
$$V_{th0} = V_{th0,nom} + \sigma_{1,V_{th0}} N(0, 1) + \frac{\sigma_{2,V_{th0}}}{\sqrt{W \cdot L}} N(0, 1) \tag{4.14}$$
• $T_{ox,e}$ – the electrical gate equivalent oxide thickness, modeled as:
$$T_{ox,e} = T^{nom}_{ox,e} + \sigma_{T_{ox,e}} N(0, 1) \tag{4.15}$$
• $T_{ox,p}$ – the physical gate equivalent oxide thickness, described by a similar
dependence:
$$T_{ox,p} = T^{nom}_{ox,p} + \sigma_{T_{ox,p}} N(0, 1) \tag{4.16}$$
¹ For illustration purposes, this section presents only a subset of the equations implemented in the
model, as well as a reduced subset of the underlying model parameters.
Fig. 4.2: Die grid for modeling spatially-correlated process variations.
It should be noted that experimental parameter distributions expressed in discretized pdf
form are also supported. Further, the standard deviation values have been computed
from the literature roadmap presented in Tab. 2.2. In addition to process parameters,
intra-die temperature variations with a strong spatial correlation have also been modeled.
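As a brief illustration of (4.13) and (4.14), the following Python sketch draws random samples of Xl and Vth0. All numeric values and function names are hypothetical, and the two N(0, 1) terms of (4.14) are drawn independently, which is an assumption of this sketch.

```python
import math
import random
import statistics


def sample_xl(rng: random.Random, xl_nom: float, sigma_xl: float) -> float:
    """One sample of Xl following the form of (4.13)."""
    return xl_nom + sigma_xl * rng.gauss(0.0, 1.0)


def sample_vth0(rng: random.Random, vth0_nom: float, sigma1: float,
                sigma2: float, width: float, length: float) -> float:
    """One sample of Vth0 following the form of (4.14)."""
    # Area-dependent component; drawn independently of the first component
    # (assumption of this sketch).
    area_term = sigma2 / math.sqrt(width * length) * rng.gauss(0.0, 1.0)
    return vth0_nom + sigma1 * rng.gauss(0.0, 1.0) + area_term


rng = random.Random(42)
# Hypothetical values: Vth0 in V, sigma1 in V, sigma2 in V*um, W and L in um.
samples = [sample_vth0(rng, 0.35, 0.010, 0.003, width=1.0, length=0.09)
           for _ in range(20000)]
mean_vth0 = statistics.mean(samples)
std_vth0 = statistics.pstdev(samples)
xl = sample_xl(rng, xl_nom=0.0, sigma_xl=2e-9)  # hypothetical values in m
```

With these values the area-dependent term contributes 0.003 / 0.3 = 0.010 V, so the combined standard deviation is expected to be close to sqrt(0.010² + 0.010²) ≈ 0.014 V.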
For modeling spatially-correlated parameter variations, the die area $W \cdot H$ is divided
into a grid with custom resolution $Res_W \cdot Res_H$, as illustrated in Fig. 4.2. Each process
parameter has a spatially-correlated variation component which is distributed into
$Res_W \cdot Res_H$ correlated random variables across the chip area. Thus, for each given process
parameter $P_k$, a covariance matrix $\Sigma_{P_k}$ is computed from the spatial correlation model
and has a size of $(Res_W \cdot Res_H)^2$ elements. For each element $\Sigma_{P_k}(r, c)$ of the
covariance matrix, with zero-based indices $r, c = 0, \ldots, Res_W \cdot Res_H - 1$, the indices in the
die grid are computed using the following relationships:
$$row_{index1} = r \ \mathrm{div} \ Res_W \tag{4.17}$$
$$col_{index1} = r \bmod Res_W \tag{4.18}$$
respectively:
$$row_{index2} = c \ \mathrm{div} \ Res_W \tag{4.19}$$
$$col_{index2} = c \bmod Res_W \tag{4.20}$$
where div and mod indicate the integer division and modulo operations. Further, the
center coordinates of the two grid cells corresponding to the two random variables are
obtained as:
$$\begin{cases} x_{index1} = (col_{index1} + 0.5) \cdot W / Res_W \\ y_{index1} = (row_{index1} + 0.5) \cdot H / Res_H \end{cases} \tag{4.21}$$
$$\begin{cases} x_{index2} = (col_{index2} + 0.5) \cdot W / Res_W \\ y_{index2} = (row_{index2} + 0.5) \cdot H / Res_H \end{cases} \tag{4.22}$$
The correlation distance between the two variables is consequently computed from the
two coordinate pairs, as shown in Fig. 4.3.
Fig. 4.3: Computed grid coordinates and correlation distance for the covariance matrix.
Finally, the covariance value is computed using the corresponding correlation model
as:
$$\Sigma_{P_k}(r, c) = \rho_{index1,index2} \cdot \sigma_{index1} \sigma_{index2} \tag{4.23}$$
where $\sigma_{index1}$ and $\sigma_{index2}$ are the estimated standard deviations for the random
variables at the two positions in the grid corresponding to $r$ and $c$, respectively. In the case of
process parameters with a spatial correlation described by a piecewise-linear (PWL) model,
$\rho_{index1,index2}$ is computed as:
$$\rho_{index1,index2} = \begin{cases} 1 - \frac{d_\rho}{d_d} (1 - \rho_r), & d_\rho \le d_d \\ \rho_r, & d_\rho > d_d \end{cases} \tag{4.24}$$
where $d_\rho$ is the correlation distance computed as the Euclidean distance between the
centers of the two grid cells, $d_d$ is the correlation decay distance, and $\rho_r$ corresponds to a
residual correlation which is still present for distances beyond the decay limit and is mainly the
effect of die-to-die parameter variations. Alternatively, other spatial correlation models
can be employed, such as a Gaussian correlation decay:
$$\rho_{index1,index2} = e^{-\left( \frac{d_\rho}{d_d} \right)^2} \tag{4.25}$$
Fig. 4.4: Repartition of the correlation coefficient on a 9 × 6 grid (W = 30 mm, ResW = 9,
H = 20 mm, ResH = 6), as reported to the top-left cell, for a decay distance dd = 15 mm and a
residual correlation ρr = 0.09.
This set of correlated random variables is decomposed into a set of independent RVs
using principal component analysis (PCA). Assuming an initial set of spatially-correlated
RVs:
$$P_k = [P_{k,1}, P_{k,2}, \ldots, P_{k,n}]^T \tag{4.26}$$
where n = ResW · ResH is the number of grid cells, the covariance matrix $\Sigma_{P_k}$ and the mean vector $\bar{P}_k = [E\{P_1\}, \ldots, E\{P_n\}]^T$ are estimated. Then, a new vector of zero-mean random variables is computed as the difference:
$$P^0_k = P_k - \bar{P}_k \tag{4.27}$$
After that, the principal components of the zero-mean vector are expressed as linear combinations:
$$C_{k,i} = v_{i,1} P^0_{k,1} + v_{i,2} P^0_{k,2} + \cdots + v_{i,n} P^0_{k,n} \tag{4.28}$$
where $V_i = [v_{i,1}, \ldots, v_{i,n}]^T$ is the i-th eigenvector of $\Sigma_{P_k}$ in decreasing order of the magnitude of the corresponding eigenvalues λi. Alternatively, the principal components can be expressed in matrix form as:
$$C_k = [V_1, \ldots, V_n]^T \cdot P^0_k \tag{4.29}$$
Further, the initial set of spatially-correlated random variables can be expressed in terms of a new set N of n uncorrelated standard normal variables, with zero mean and unit variance:
$$P_k = \bar{P}_k + D^{\frac{1}{2}} \cdot V^{-1} \cdot N \tag{4.30}$$
where $V = [V_1, V_2, \ldots, V_n]^T$ is the matrix containing the eigenvectors of $\Sigma_{P_k}$, $D = \mathrm{Diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of the eigenvalues of $\Sigma_{P_k}$ in decreasing order, and $D^{\frac{1}{2}} = \mathrm{Diag}\left(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}\right)$.
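The decomposition can be illustrated for the smallest possible case, n = 2 grid cells with equal standard deviation σ and correlation coefficient ρ. For this special case the eigenvectors of the covariance matrix are known in closed form, (1, 1)/√2 and (1, −1)/√2, with eigenvalues σ²(1 + ρ) and σ²(1 − ρ), so no numerical eigensolver is needed. The Python sketch below is only an illustration under these assumptions; the function names and the values of the mean and of σ are hypothetical. The independent standard normal variables are first scaled by the square roots of the eigenvalues and then mapped back through the transposed eigenvector matrix (the inverse of V for orthonormal eigenvectors).

```python
import math
import random

def sample_two_cells(mean, sigma, rho, n_samples, seed=0):
    """Draw correlated samples for two cells from independent standard normals."""
    rng = random.Random(seed)
    lam1 = sigma ** 2 * (1.0 + rho)   # eigenvalue of eigenvector (1, 1)/sqrt(2)
    lam2 = sigma ** 2 * (1.0 - rho)   # eigenvalue of eigenvector (1, -1)/sqrt(2)
    s = 1.0 / math.sqrt(2.0)
    samples = []
    for _ in range(n_samples):
        n1 = rng.gauss(0.0, 1.0)
        n2 = rng.gauss(0.0, 1.0)
        c1 = math.sqrt(lam1) * n1     # scaled principal components
        c2 = math.sqrt(lam2) * n2
        samples.append((mean + s * (c1 + c2), mean + s * (c1 - c2)))
    return samples

def empirical_correlation(samples):
    """Sample correlation coefficient of a list of (p1, p2) pairs."""
    n = len(samples)
    m1 = sum(p1 for p1, _ in samples) / n
    m2 = sum(p2 for _, p2 in samples) / n
    cov = sum((p1 - m1) * (p2 - m2) for p1, p2 in samples) / n
    var1 = sum((p1 - m1) ** 2 for p1, _ in samples) / n
    var2 = sum((p2 - m2) ** 2 for _, p2 in samples) / n
    return cov / math.sqrt(var1 * var2)

# Hypothetical threshold-voltage statistics: mean 0.4 V, sigma 20 mV, rho = 0.43
pairs = sample_two_cells(0.4, 0.02, 0.43, 20000, seed=1)
rho_hat = empirical_correlation(pairs)
```

The empirical correlation of the generated pairs should approach the prescribed ρ as the number of samples grows.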
As an example, consider a die area of 30 mm × 20 mm divided by a 9 × 6 spatial correlation grid, as depicted in Fig. 4.4. The correlation coefficient between the top-left cell (cell 1 in Fig. 4.4) and the other grid cells decays gradually according to the PWL model (4.24). The decay distance has been set to 15 mm and a residual correlation ρr = 0.09 has been assumed.
Fig. 4.5: Spatially-correlated values of the threshold voltage parameter Vth0 from grid cells 2 (a), 3 (b), and 4 (c), plotted with respect to the values from cell 1.
(Axes: Vth0 (cell 1) [V] versus Vth0 (cell 2), Vth0 (cell 3), and Vth0 (cell 4) [V].)
Fig. 4.6: Subthreshold current plot for an NMOS transistor with W = 3 µm, L = 80 nm obtained with the commercial BSIM4 implementation in the Cadence Spectre circuit simulator (a) and with the derived current-source model (b).
(Axes: Id [A], logarithmic from 10^−8 to 10^−3, versus Vgs from 0 to 0.4 V.)
The spatially-correlated threshold voltage distribution for a 90 nm technology is shown
in Fig. 4.5 for the grid cells 2, 3, and 4 from Fig. 4.4, plotted with respect to the val-
ues from cell 1. It can be seen in Fig. 4.5(a), that the threshold voltage values from the
neighbor cell are highly correlated, while the values in cell 3 (Fig. 4.5(b)) still exhibit a
significant amount of correlation, according to the coefficient value of 0.43. Since cell 4
(Fig. 4.5(c)) lies beyond the correlation decay distance, only a very weak residual correla-
tion is present in the values.
Fig. 4.7: Variations in the output and transfer characteristics obtained for an NMOS transistor with W = 3 µm, L = 80 nm obtained from the process parameter variations described in Sec. 4.1.2.
(Panel (a): Ids [A] versus Vds [V] for Vgs = 0.4 V, 0.6 V, 0.8 V, and 1 V. Panel (b): Ids [A] versus Vgs [V] for Vds = 0.2 V, 0.4 V, 0.6 V, 0.8 V, and 1 V.)
4.1.3 Inclusion of Random Variables and Results Estimation
Due to the underlying BSIM4 equations, the derived transistor model captures well the
very deep sub-micron effects. Fig. 4.6 shows a comparison between the subthreshold
current simulated with the commercial implementation of the transistor model in the Ca-
dence Spectre circuit simulator [31] and the current source model derived in Sec. 4.1.1.
Further comparisons over several device sizes and bias ranges show also a good agree-
ment with the commercial implementation.
Variations can be specified for all BSIM4 model parameters in the form of random variable descriptions, using the models developed in Sec. 3.2. In addition, spatially-correlated parameter variations can be specified using the two-dimensional die area grid and the spatial correlation models described in Sec. 4.1.2. As a consequence, the variability descriptions at process parameter level are propagated to the drain current expression using the statistical approach developed in Sec. 3.3.
An example of the variations obtained in the drain current starting from the spatially-
correlated parameter variations described in Sec. 4.1.2 is shown in Fig. 4.7. It can be
observed, that the drain current variability increases with the drain-to-source bias level,
as illustrated in Fig. 4.7(a). At very small Vds values, the drain current is almost constant
with the process variations. On the contrary, Fig. 4.7(b) shows that the variability ampli-
tude is not significantly affected by changes in Vgs.
This aspect is further investigated by evaluating the standard deviation of the drain
current over the Vds and Vgs voltage range. Fig. 4.8(a) shows the distribution of the drain
current in the presence of process and temperature variations, whereas Fig. 4.8(b) plots
the estimated standard deviation with respect to changes in Vds and Vgs.
Fig. 4.8: Drain current distribution (a) and variation of the standard deviation over the bias ranges (b).
(Panel (a): cumulated values versus drain current [mA]. Panel (b): standard deviation [mA] versus Vgs and Vds.)
As previously observed, the results from Fig. 4.8(b) show a strong variation of the drain current variability with the drain-to-source voltage, while changes in Vgs do not significantly affect the variation. This result is explained by the strong dependences on Vds of the factors in the drain current expression (4.12) and by the inner multiplications between Vds and the variability-affected process parameters. First, the threshold voltage Vth has a strong dependence on Xl through Leff. Similarly, the effective voltage difference Vgsteff and the effective mobility µeff depend strongly on the process variations through their dependence on Vth. In addition, the current factor Ids0 and the difference Vds − Vdseff have a linear dependence on Vds and thus multiply the effect of process variations. Finally, it must be mentioned that the Early voltage components VA,DITS and VA,SCBE exhibit a linear dependence on the effective gate length Leff and an exponential dependence on Vds.
4.2 Pulsed Current-Mode Signaling Model
Unlike traditional voltage-mode signaling methods, current-mode signaling techniques
rely on current switching on the interconnect line to represent the “0” and “1” logic levels.
As discussed in Sec. 2.6.1, the major drawback of current-mode signaling is its higher
static power dissipation. To counter the effect of switching large static currents, a pulsed
current-mode (PCM) signaling technique was proposed in [87], which uses sharp current
pulses to modulate the signals before transmission, taking advantage at the same time of
the high-frequency LC propagation through repeaterless interconnect links.
The design of the driver and receiver circuits is illustrated in Fig. 4.9. It can be ob-
served that the PCM technique requires a larger area than a conventional voltage-mode
buffer, due to a higher transistor count and to the integrated capacitances. Nevertheless,
the advantage of achieving near speed-of-light propagation times across long interconnects without the use of repeaters balances this drawback. In the next section we analyze the circuit and identify the main switching current paths. Subsequently, a circuit-level model for the delay and power is developed, based on the current-source transistor model from Sec. 4.1.
Fig. 4.9: Pulsed current-mode signaling driver (a) and receiver circuit (b).
4.2.1 Derivation of Current Switching Paths
The input control stages of the PCM driver are implemented using dynamic logic, as shown in Fig. 4.10, to reduce the delay and area of the circuit. Each dynamic logic stage controls the duration of the current pulses and data transmission cycles by selectively connecting the data input to the inverter stages. Further, the inverter circuits control the M1 and M2 driving transistors, charge and discharge the capacitances Cs, and close the current paths to the ground.
When the clock signals clk and clk1 are at zero, the output of the inverters is pulled down to ground, switching off the transistors M1 and M2, and discharging the capacitors Cs at the same time. The second clock signal clk1 is slightly delayed with respect to clk, such that when clk switches to “1”, M4 connects the inverter input to the drain of M5, which is driven by the data input D, as illustrated in Fig. 4.11.
In the short time left until clk1 switches to “1”, a sharp current pulse is driven from the Vdd supply, through the small resistance R and the interconnect line, and switched to ground by the transistors M1 and NM1, as shown by the current path i in Fig. 4.12.
Fig. 4.10: Transistor-level circuit implementation of the PCM driver.
Fig. 4.11: Operation of the dynamic logic input control stage.
The skew between the clock signals clk and clk1 determines the duration of the current pulse. As soon as clk1 rises to “1”, the current path is maintained only until Cs is charged to Vdd − Vth. At this point, the gate-to-source voltage of M1 is equal to Vth and M1 enters the subthreshold regime. Further, Cs is mainly charged through PM1 up to Vdd, where M1 is completely in cut-off and the current flow on the line is stopped.
Fig. 4.12: Switched current path flowing through transistors M1 and NM1.
The bypass capacitor Cs has a small value of only a few fF; nevertheless, it has the important role of sinking a considerable amount of the high-frequency current pulse. As soon as the current path is closed by M1, the voltage on Cs increases by approx. 200 mV, thus helping decrease the current driven by transistor M1. This reduces both the size requirements for M1 and the overall static power dissipation.
Fig. 4.13 shows the waveforms of the two clock signals and the corresponding differential outputs. A “1” in the data signal is transmitted as a short current pulse while clk is high and clk1 is low. Consequently, the flowing current causes a slight voltage drop in the line from Vdd to the sum of the drain-to-source voltages of M1 and NM1. Similarly, a zero is transmitted through a pulse on the complementary line. When the clock signals switch back to “0”, clk1 must stay low long enough to ensure the complete discharging of Cs. Otherwise, a residual charge would accumulate on the capacitor over several clock cycles and would eventually prevent M1 from switching on, as the gate-to-source voltage would be less than Vth. In addition, for modeling purposes, it is important to note that the main driving transistors M1 and M2 always operate in saturation during the current flow, and in cut-off (or, briefly, in the subthreshold regime) when the current is interrupted.
The behavior of the PCM circuit has been analyzed in detail through several simula-
tions under various conditions (clock frequencies, component values, interconnect length,
supply voltages, and body bias values). The waveforms presented in this section have
been obtained using the values from Tab. 4.1. Here, f is the clock frequency, Vb represents the body bias, and the transient analysis step has been set to 0.2 ps.
Fig. 4.13: Clock synchronization and output signals transmitting current pulses on the differential line.
Parameter              Value
Vdd                    1.0 V
R                      80 Ω
f                      2.0 GHz
Cs                     10 fF
Wire length            300 µm
WM1                    3 µm
WM2                    3 µm
WNM1                   3 µm
WNM2                   3 µm
L                      80 nm
Vb                     0 V
Trans. analysis step   0.2 ps
Tab. 4.1: Example values for the simulations.
The 1 GHz clock signals separated by a delay of 0.35 ns are plotted in Fig. 4.14. On the rising edge of clk, a current starts to flow through one of the differential interconnect lines and the voltage on the Signal line drops slightly, as illustrated by the near-end and far-end waveforms in Fig. 4.14. When the data signal is at “0”, a similar voltage drop occurs on the complementary line $\overline{\mathrm{Signal}}$ between the positive edges of clk and clk1. Note that the voltage at the far end of the line drops to the following value:
$$V_{far-end} = V_{dd} - I_{pulse} \cdot R \tag{4.31}$$
This equation is verified by the results from Fig. 4.14 (Vfar−end = 921.1 mV) and by the current value from Fig. 4.15 (Ipulse = 985.3 µA).
Fig. 4.14: Waveforms of the clock and data signals and the corresponding voltages at the near and far end of the differential line.
(Traces: clk, clk1, D, Signal, and the complementary Signal line, in V versus time from 3.5 ns to 5.25 ns. Annotated values on both lines: line input 885.1 mV, line output 921.1 mV.)
Further, Fig. 4.15 shows the current pulses issued on the two interconnect lines, which have a 0.35 ns duration, corresponding to the delay between the two clock signals. As the current starts to flow, the source voltage of M1 (equal to the voltage on Cs) increases to approx. 200 mV, where the two transistors M1 and NM1 drive the total current on the line. As soon as clk1 switches to “1”, NM1 is turned off, and Cs is charged by both currents through M1 and PM1. Finally, after the falling edge of clk, M1 is turned off and Cs is charged completely to Vdd by the current through PM1. At the same time, the drain voltage of M1 is only influenced by the current flowing through the line. Thus, the drain voltage drops between the rising edges of the two clocks to a value that is well approximated by:
$$V_{d,M1} \approx V_{dd} - I_{pulse} \cdot R - I_{pulse} \cdot R_{line} \tag{4.32}$$
where Rline is the parasitic resistance of the interconnect line. For the 300 µm line in this example (in the considered 90 nm CMOS technology), a parasitic resistance of 36.67 Ω has been estimated, which together with the simulated results from Fig. 4.15 verifies the above equation. The terminal voltages of transistor M1 during the switching period are used next to compute the driving current using the transistor model from Sec. 4.1.
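The consistency between (4.31), (4.32), and the simulated values quoted in the text can be checked with a short calculation. The Python sketch below only reuses numbers stated in this section (Vdd = 1.0 V, R = 80 Ω, Rline = 36.67 Ω, Ipulse = 985.3 µA); the function names are hypothetical.

```python
V_DD = 1.0          # supply voltage [V], Tab. 4.1
R_SUPPLY = 80.0     # resistance between Vdd and the line [Ohm], Tab. 4.1
R_LINE = 36.67      # estimated parasitic line resistance [Ohm]
I_PULSE = 985.3e-6  # simulated pulse current [A], Fig. 4.15

def far_end_voltage(v_dd, i_pulse, r):
    """Far-end line voltage during the pulse, (4.31)."""
    return v_dd - i_pulse * r

def drain_voltage_m1(v_dd, i_pulse, r, r_line):
    """Approximate drain voltage of M1 during the pulse, (4.32)."""
    return v_dd - i_pulse * r - i_pulse * r_line

v_far = far_end_voltage(V_DD, I_PULSE, R_SUPPLY)             # about 0.9212 V
v_drain = drain_voltage_m1(V_DD, I_PULSE, R_SUPPLY, R_LINE)  # about 0.8850 V
```

Both results agree with the simulated 921.1 mV and 885 mV to within a fraction of a millivolt.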
Fig. 4.15: Current pulses on the interconnect lines and the corresponding drain and source voltages for transistor M1.
(Traces: I1 and I2 in mA, Vd and Vs in V, versus time from 3.5 ns to 5.0 ns. Annotated values: i = 985.3 µA, Vd = 885 mV, Vs = 209.7 mV.)
4.2.2 Equivalent Current-Source Circuit Model
Using the current-source transistor-level model developed in Sec. 4.1, the driver circuit and the current-mode line are reduced to the equivalent circuit from Fig. 4.16. The parasitic resistance and inductance of the interconnect line are lumped into the values R and L, while R1 represents the small resistance connecting the current-mode line to the supply voltage. The first capacitance C1 is the sum of the parasitic capacitances connected to the drain of M1 and is evaluated as:
$$C_1 = C_{gd,eq} + C_{db} \tag{4.33}$$
where Cgd,eq is the equivalent gate-to-drain capacitance of M1 and Cdb is the drain-to-bulk capacitance of the reverse-biased pn junction. In the following, we employ equations derived from the BSIM4.3 [167] capacitance models and adapted to compute the parasitic capacitances of the transistors connected to the interconnect line at the driver and receiver sides.
Fig. 4.16: General line model with current-mode driver.
It is important to remember from Sec. 4.2.1 that M1 operates only in the saturation or subthreshold/cut-off regimes. As a consequence, the only component of Cgd is the gate-drain overlap capacitance. Due to the Miller effect considered when connecting the series capacitance to the ground, the equivalent gate-to-drain capacitance is computed as:
$$C_{gd,eq} = 2 \cdot C_{gd} \approx 2 \cdot \mathbf{C_{gdo}} \cdot W_{active} \tag{4.34}$$
where Cgdo (see footnote 2) represents the non-LDD region drain-gate overlap capacitance per unit width. Further, the active width is evaluated from the channel length L and width W using the following relationship:
$$W_{active} = W - 2 \cdot \left( \mathbf{D_{wc}} + \frac{\mathbf{W_{lc}}}{L^{\mathbf{W_{ln}}}} + \frac{\mathbf{W_{wc}}}{W^{\mathbf{W_{wn}}}} + \frac{\mathbf{W_{wlc}}}{L^{\mathbf{W_{ln}}} W^{\mathbf{W_{wn}}}} \right) \tag{4.35}$$
where Dwc, Wlc, Wwc, and Wwlc are BSIM-model channel-width offset parameters.
The drain-to-bulk capacitance Cdb depends on the considered voltage swing on the drain during the current pulse and on the voltage applied to the bulk. The use of averaging factors permits the computation of Cdb from the zero-bias junction capacitances as:
$$C_{db} = K_{eq,bottom} \cdot C_{j0,bottom} + K^{gate}_{eq,sidewall} \cdot C^{gate}_{j0,sidewall} + K^{isolation}_{eq,sidewall} \cdot C^{isolation}_{j0,sidewall} \tag{4.36}$$
Here, the bottom junction capacitance at the drain is evaluated as:
$$C_{j0,bottom} = \mathbf{C_{jd}} \cdot W_{active} \cdot \left( \mathbf{D_{mci}} + \mathbf{D_{mcg}} + \mathbf{L_{int}} + \frac{\mathbf{L_{l}}}{L^{\mathbf{L_{ln}}}} + \frac{\mathbf{L_{w}}}{W^{\mathbf{L_{wn}}}} + \frac{\mathbf{L_{wl}}}{L^{\mathbf{L_{ln}}} W^{\mathbf{L_{wn}}}} \right) \tag{4.37}$$
where Cjd is the bottom junction capacitance per unit area at the drain, Dmci is the distance from the drain contact center to the isolation edge, and Dmcg is the distance from the drain contact center to the gate edge. The sum Dmci + Dmcg represents the exposed (uncovered) drain diffusion length, while the remaining terms estimate the lateral diffusion overlap.
Similarly, the gate and isolation sidewall capacitances are evaluated using the following relationships:
$$C^{gate}_{j0,sidewall} = \mathbf{C_{jswgd}} \cdot W_{active} \tag{4.38}$$
$$C^{isolation}_{j0,sidewall} = \mathbf{C_{jswd}} \left[ W_{active} + 2 \left( \mathbf{D_{mci}} + \mathbf{D_{mcg}} + \mathbf{L_{int}} + \frac{\mathbf{L_{l}}}{L^{\mathbf{L_{ln}}}} + \frac{\mathbf{L_{w}}}{W^{\mathbf{L_{wn}}}} + \frac{\mathbf{L_{wl}}}{L^{\mathbf{L_{ln}}} W^{\mathbf{L_{wn}}}} \right) \right] \tag{4.39}$$
where Cjswgd and Cjswd are the gate-edge and isolation-edge sidewall junction capacitances per unit length, respectively.
Footnote 2: Similar to Sec. 4.1.1, all symbols written in bold represent SPICE-model parameters dependent on the manufacturing process.
The averaging constant of the bottom capacitance is computed from the following
relationship:
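$$K_{eq,bottom} = \frac{-\mathbf{P_{bd}}^{\mathbf{M_{jd}}}}{(V_{high} - V_{low})\,(1 - \mathbf{M_{jd}})} \left[ (\mathbf{P_{bd}} - V_{high})^{1-\mathbf{M_{jd}}} - (\mathbf{P_{bd}} - V_{low})^{1-\mathbf{M_{jd}}} \right] \tag{4.40}$$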
where Pbd is the bottom junction built-in potential at the drain and Mjd is the bulk junction bottom grading coefficient. The voltages Vhigh and Vlow are defined as the maximum and minimum reverse bias voltages during the output transition for which the delay is estimated. If, e.g., the output voltage swings between Voi and Vof, then:
$$V_{high} = \begin{cases} V_b - V_{oi}, & \text{if } |V_b - V_{oi}| > |V_b - V_{of}| \\ V_b - V_{of}, & \text{otherwise} \end{cases} \tag{4.41}$$
$$V_{low} = \begin{cases} V_b - V_{oi}, & \text{if } |V_b - V_{oi}| < |V_b - V_{of}| \\ V_b - V_{of}, & \text{otherwise} \end{cases} \tag{4.42}$$
In this case, the output voltage travels during the transition from Voi = Vdd to the value Vof = Vd,M1 given by (4.32).
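The selection of the bias limits (4.41)-(4.42) and the averaging constant (4.40) can be sketched as follows. The Python code is only an illustration: the function names are hypothetical, and the built-in potential of 0.9 V and the grading coefficient of 0.5 are assumed example values, not parameters of the technology used in this work.

```python
def reverse_bias_limits(v_b, v_oi, v_of):
    """Return (v_high, v_low) according to (4.41) and (4.42)."""
    a = v_b - v_oi
    b = v_b - v_of
    if abs(a) > abs(b):
        return a, b
    return b, a

def k_eq(p_b, m_j, v_high, v_low):
    """Averaging constant of a junction capacitance, (4.40).

    Requires v_high != v_low and m_j != 1.
    """
    bracket = (p_b - v_high) ** (1.0 - m_j) - (p_b - v_low) ** (1.0 - m_j)
    return -(p_b ** m_j) * bracket / ((v_high - v_low) * (1.0 - m_j))

# Output swing from Voi = Vdd = 1.0 V down to Vof = 0.885 V, body bias Vb = 0 V
v_high, v_low = reverse_bias_limits(0.0, 1.0, 0.885)
k_bottom = k_eq(0.9, 0.5, v_high, v_low)  # assumed Pbd = 0.9 V, Mjd = 0.5
```

As a plausibility check, the resulting constant lies between the normalized junction capacitances (1 − V/Pbd)^(−Mjd) evaluated at the two bias limits, as expected for an average over the swing.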
Similarly, the averaging constants of the sidewall capacitances are computed using the following bias-dependent relationships:
$$K^{gate}_{eq,sidewall} = \frac{-\mathbf{P_{bswgd}}^{\mathbf{M_{jswgd}}} \left[ (\mathbf{P_{bswgd}} - V_{high})^{1-\mathbf{M_{jswgd}}} - (\mathbf{P_{bswgd}} - V_{low})^{1-\mathbf{M_{jswgd}}} \right]}{(V_{high} - V_{low})\,(1 - \mathbf{M_{jswgd}})} \tag{4.43}$$
$$K^{isolation}_{eq,sidewall} = \frac{-\mathbf{P_{bswd}}^{\mathbf{M_{jswd}}} \left[ (\mathbf{P_{bswd}} - V_{high})^{1-\mathbf{M_{jswd}}} - (\mathbf{P_{bswd}} - V_{low})^{1-\mathbf{M_{jswd}}} \right]}{(V_{high} - V_{low})\,(1 - \mathbf{M_{jswd}})} \tag{4.44}$$
where Pbswgd and Pbswd are the gate-edge and the isolation-edge sidewall junction built-in potentials, respectively, while Mjswgd and Mjswd represent the gate-edge and isolation-edge sidewall junction capacitance grading coefficients.
The second capacitance C2 in the model from Fig. 4.16 sums the gate capacitance of
the input transistor in the receiver and the parasitic capacitance of the interconnect line.
The gate capacitance is estimated as:
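$$C_g = C_{ovs} + C_{ovd} + C_{gc} \approx (\mathbf{C_{gso}} + \mathbf{C_{gdo}}) \cdot W_{active} + \frac{\varepsilon_{r,ox}\,\varepsilon_0}{T_{ox,e}} \cdot W_{active} \cdot L_{active} \tag{4.45}$$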
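A short numerical sketch of (4.45) is given below in Python. The overlap capacitances per unit width and the electrical oxide thickness are assumed placeholder values chosen only to illustrate the order of magnitude; they are not the parameters of the technology considered in this work. The relative permittivity 3.9 is the usual value for silicon dioxide.

```python
EPS_0 = 8.854e-12  # vacuum permittivity [F/m]

def gate_capacitance(w_active, l_active, t_ox_e, c_gso, c_gdo, eps_r_ox=3.9):
    """Gate capacitance estimate according to (4.45), all quantities in SI units."""
    overlap = (c_gso + c_gdo) * w_active                        # Covs + Covd
    channel = eps_r_ox * EPS_0 / t_ox_e * w_active * l_active   # Cgc
    return overlap + channel

# W = 3 um and L = 80 nm as in Tab. 4.1; Tox,e = 2 nm and
# Cgso = Cgdo = 2e-10 F/m are assumptions for illustration only.
c_g = gate_capacitance(3e-6, 80e-9, 2e-9, 2e-10, 2e-10)
```

With these assumed values the estimate is in the range of a few fF, dominated by the channel term.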
Finally, the wire resistance, inductance, and capacitance values are estimated using the technology-dependent analytic expressions specified by the Predictive Technology Model (PTM) [116].
Fig. 4.17: Line delay definition at 50% swing point (a) and the output voltage of the circuit model (b) used to compute the delay.
4.2.3 Analytic Model for Delay and Energy Consumption
The propagation time over the PCM line is defined when the far-end signal reaches 50%
of its total swing, as illustrated by the waveforms in Fig. 4.17(a). Since the delay is defined
on the falling edge, we are interested to model only the falling part of the characteristic
and to estimate the time when the voltage reaches the 50% point, as shown by the Vo
characteristic in Fig. 4.17(b). In the following, we use the circuit from Fig. 4.16 to compute
the expression of Vo (t) and then we will identify the time t0 at which Vo (t0) reaches half
of the swing.
The swing of the driver during the start of the current flow through the interconnect is bounded by two conditions. At the initial time, the transistor M1 is in cut-off, thus:
$$I_d(0) = 0 \Rightarrow V_o(0) = V_{dd} \tag{4.46}$$
The drain current then rises to its maximum value Ipulse, where it remains until the positive edge of clk1. If we assume the voltage model from Fig. 4.17(b), the second boundary condition is:
$$V_o(\infty) = V_{dd} - R_1 I_{pulse} \tag{4.47}$$
Note that although the voltage on the line increases back to Vdd at the end of the current pulse, only the falling edge is relevant for computing the delay.
Fig. 4.18: Current pulse shape (a) and the model approximation for computing the delay (b).
Similarly, from the current pulse characteristic shown in Fig. 4.18(a), only the rising edge and the maximum value Ipulse contribute to the delay evaluation. Therefore, for the delay model, the current Id can be approximated with the characteristic drawn in Fig. 4.18(b). We further assume that Id switches from 0 to Ipulse according to the following characteristic:
$$I_d(t) = I_{pulse} \left( 1 - e^{-\alpha t} \right) \tag{4.48}$$
which can be adjusted to fit the switching profile of the driving transistor. For instance, the parameter value $\alpha = 10^{10}$ corresponds to a current rise time tr ≈ 50 ps. It is further important to notice that, since M1 operates in saturation, the almost constant current approximation from Fig. 4.18(b) and (4.48) remains valid within the switching time frame.
The current through the resistor R1 is given by $I_{R1} = \frac{V_{dd} - V_o(t)}{R_1}$, while the output voltage can be expressed in terms of the charge on C2 as:
$$V_o(t) = \frac{Q_2(t)}{C_2} = \frac{Q_2(0) - \int_0^t I_{C2}(\tau)\, d\tau}{C_2} \tag{4.49}$$
Note that at the initial time t = 0, both C1 and C2 are charged with the non-zero charges Q1(0) and Q2(0). Since at t = 0 all the currents are zero, the following holds:
$$\frac{Q_1(0)}{C_1} = \frac{Q_2(0)}{C_2} = V_{dd} \tag{4.50}$$
If we neglect the inductive effects in a first approximation, the following relationships can be written for the voltage V1 on the capacitor C1 (see Fig. 4.16):
$$V_1(t) = V_o(t) - R \left[ I_{R1}(t) + I_{C2}(t) \right] \tag{4.51}$$
$$V_1(t) = \frac{Q_1(0) - \int_0^t I_{C1}(\tau)\, d\tau}{C_1} \tag{4.52}$$
In addition, the current sum must be equal to $I_{C1}(t) + I_{C2}(t) + I_{R1}(t) = I_d(t)$, such that the following system of differential equations can be written:
$$R_1 I_{R1}(t) = \frac{1}{C_2} \int_0^t I_{C2}(\tau)\, d\tau \tag{4.53}$$
$$\frac{1}{C_1} \int_0^t I_{C1}(\tau)\, d\tau = \frac{1}{C_2} \int_0^t I_{C2}(\tau)\, d\tau + R \left[ I_{R1}(t) + I_{C2}(t) \right] \tag{4.54}$$
$$I_{C1}(t) = I_d(t) - I_{C2}(t) - I_{R1}(t) \tag{4.55}$$
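After differentiating (4.53) and (4.54) and combining with (4.55), the following second-order nonhomogeneous ordinary differential equation (ODE) is obtained:
$$\frac{d^2 I_{R1}(t)}{dt^2} + \frac{R_1 + R + R_1 \frac{C_2}{C_1}}{R R_1 C_2} \frac{d I_{R1}(t)}{dt} + \frac{1}{R R_1 C_1 C_2} I_{R1}(t) = \frac{I_d(t)}{R R_1 C_1 C_2} \tag{4.56}$$
Replacing $I_{R1}(t) = e^{\lambda t}$, the associated characteristic equation becomes:
$$\lambda^2 + \frac{R_1 + R + R_1 \frac{C_2}{C_1}}{R R_1 C_2} \lambda + \frac{1}{R R_1 C_1 C_2} = 0 \tag{4.57}$$
with the solutions:
$$\lambda_{1,2} = \frac{-\left( R_1 + R + R_1 \frac{C_2}{C_1} \right) \pm \sqrt{\left( R_1 + R + R_1 \frac{C_2}{C_1} \right)^2 - 4 R R_1 \frac{C_2}{C_1}}}{2 R R_1 C_2} \tag{4.58}$$
By applying the method of variation of constants, a particular solution can be written as $I_{R1p}(t) = u_1(t) e^{\lambda_1 t} + u_2(t) e^{\lambda_2 t}$, from which the following equation system results:
$$u_1'(t) = -\frac{I_d(t)}{(\lambda_2 - \lambda_1) R R_1 C_1 C_2} e^{-\lambda_1 t} \tag{4.59}$$
$$u_2'(t) = \frac{I_d(t)}{(\lambda_2 - \lambda_1) R R_1 C_1 C_2} e^{-\lambda_2 t} \tag{4.60}$$
Recalling that $I_d(t) = I_{pulse} \left( 1 - e^{-\alpha t} \right)$, after integration the variation functions become:
$$u_1(t) = -\frac{I_{pulse}}{(\lambda_2 - \lambda_1) R R_1 C_1 C_2} \left[ -\frac{e^{-\lambda_1 t}}{\lambda_1} + \frac{e^{-(\lambda_1 + \alpha) t}}{\lambda_1 + \alpha} \right] \tag{4.61}$$
$$u_2(t) = \frac{I_{pulse}}{(\lambda_2 - \lambda_1) R R_1 C_1 C_2} \left[ -\frac{e^{-\lambda_2 t}}{\lambda_2} + \frac{e^{-(\lambda_2 + \alpha) t}}{\lambda_2 + \alpha} \right] \tag{4.62}$$
Finally, the general solution of the inhomogeneous equation (4.56) becomes:
$$I_{R1}(t) = K_1 e^{\lambda_1 t} + K_2 e^{\lambda_2 t} + \frac{I_{pulse}}{(\lambda_2 - \lambda_1) R R_1 C_1 C_2} \left[ \frac{1}{\lambda_1} - \frac{1}{\lambda_2} + \left( \frac{1}{\lambda_2 + \alpha} - \frac{1}{\lambda_1 + \alpha} \right) e^{-\alpha t} \right] \tag{4.63}$$
From the initial condition IR1(0) = 0, we obtain a first relationship between K1 and K2:
$$K_2 = -K_1 - \frac{I_{pulse}}{R R_1 C_1 C_2} \left[ \frac{1}{\lambda_1 \lambda_2} - \frac{1}{(\lambda_1 + \alpha)(\lambda_2 + \alpha)} \right] \tag{4.64}$$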
The condition IR1(∞) = Ipulse is verified without further constraints. Differentiating (4.53) gives $I_{C2}(t) = R_1 C_2 \frac{d I_{R1}(t)}{dt}$, which, after replacing the derivative of (4.63) and considering the initial condition IC2(0) = 0, gives the second relationship for computing K1 and K2:
$$R_1 C_2 K_1 \lambda_1 + R_1 C_2 K_2 \lambda_2 + \frac{\alpha I_{pulse}}{R C_1 (\lambda_1 + \alpha)(\lambda_2 + \alpha)} = 0 \tag{4.65}$$
After finding the expression of IR1(t), the output voltage is obtained as follows:
$$V_o(t) = V_{dd} - R_1 K_1 e^{\lambda_1 t} - R_1 K_2 e^{\lambda_2 t} - \frac{I_{pulse}}{(\lambda_2 - \lambda_1) R C_1 C_2} \left[ \frac{1}{\lambda_1} - \frac{1}{\lambda_2} + \left( \frac{1}{\lambda_2 + \alpha} - \frac{1}{\lambda_1 + \alpha} \right) e^{-\alpha t} \right] \tag{4.66}$$
which also satisfies the limit conditions Vo(0) = Vdd and Vo(∞) = Vdd − R1 Ipulse.
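An important observation has to be made. Depending on the values of the line and device parasitics, the roots of the characteristic equation can be complex:
$$\lambda_1 = \Re_\lambda + j \Im_\lambda \tag{4.67}$$
$$\lambda_2 = \Re_\lambda - j \Im_\lambda \tag{4.68}$$
As a consequence, the output voltage will also be complex. Thus, for evaluating the delay, only the real part is considered:
$$\Re V_o(t) = e^{\Re_\lambda t} \sin(\Im_\lambda t) \left( \Im_{K_1} - \Im_{K_2} \right) R_1 - e^{\Re_\lambda t} \cos(\Im_\lambda t)\, R_1 \left( \Re_{K_1} + \Re_{K_2} \right) - \frac{I_{pulse}}{R C_1 C_2 \left( \Im_\lambda^2 + \Re_\lambda^2 \right)} + \frac{e^{-\alpha t} I_{pulse}}{R C_1 C_2 \left( \Im_\lambda^2 + (\alpha + \Re_\lambda)^2 \right)} + V_{dd} \tag{4.69}$$
where the constants K1 and K2 are also complex quantities, computed as:
$$\Re_{K_1} = -\frac{\alpha I_{pulse} (\alpha + 2 \Re_\lambda)}{2 R C_1 C_2 R_1 \left( \Im_\lambda^2 + \Re_\lambda^2 \right) \left( \Im_\lambda^2 + (\alpha + \Re_\lambda)^2 \right)} \tag{4.70}$$
$$\Im_{K_1} = -\frac{\alpha I_{pulse} \left( -\Im_\lambda^2 + \Re_\lambda (\alpha + \Re_\lambda) \right)}{2 R C_1 C_2 \Im_\lambda R_1 \left( \Im_\lambda^2 + \Re_\lambda^2 \right) \left( \Im_\lambda^2 + (\alpha + \Re_\lambda)^2 \right)} \tag{4.71}$$
$$\Re_{K_2} = -\Re_{K_1} - \frac{\alpha I_{pulse} (\alpha + 2 \Re_\lambda)}{R C_1 C_2 R_1 \left( \Im_\lambda^2 + \Re_\lambda^2 \right) \left( \Im_\lambda^2 + (\alpha + \Re_\lambda)^2 \right)} \tag{4.72}$$
$$\Im_{K_2} = -\Im_{K_1} \tag{4.73}$$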
If, in addition, the inductive effects are important and must be considered, equation (4.51) is replaced by the following:
$$V_1(t) = V_o(t) - L \left[ \frac{d I_{R1}(t)}{dt} + \frac{d I_{C2}(t)}{dt} \right] - R \left[ I_{R1}(t) + I_{C2}(t) \right] \tag{4.74}$$
This leads to the following equation:
$$L R_1 C_2 \frac{d^3 I_{R1}(t)}{dt^3} + (L + R R_1 C_2) \frac{d^2 I_{R1}(t)}{dt^2} + \left( R + R_1 + \frac{R_1 C_2}{C_1} \right) \frac{d I_{R1}(t)}{dt} + \frac{1}{C_1} I_{R1}(t) = \frac{I_d(t)}{C_1} \tag{4.75}$$
which is a third-order nonhomogeneous ODE. The solution is found in a similar way, with the only particularity of a third-order characteristic equation.
Finally, after obtaining the expression of the output voltage as in (4.66) or (4.69), the propagation delay is obtained as:
$$Delay = t_0 \,\Big|_{V_o(t_0) = V_{dd} - 0.5 \cdot R_1 I_{pulse}} \tag{4.76}$$
which corresponds to the time point at which Vo(t) crosses the 50% swing point. The current Ipulse is evaluated using the voltage-dependent source model from Sec. 4.1. A fast solution of the transcendental equation required by (4.76) is presented in Sec. 4.4.4.
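The second-order solution and the delay definition (4.76) can be illustrated with the following Python sketch. It evaluates Vo(t) from the roots (4.58) and the constants K1 and K2 obtained from (4.64) and (4.65), using an algebraically equivalent compact form of the particular solution in (4.63). Several points are assumptions of this sketch and not part of the model: the function names are hypothetical, the capacitances C1 = 5 fF and C2 = 60 fF are placeholder values, complex arithmetic is used so that the same expression applies to real and complex roots, and a plain bisection replaces the fast solution method of Sec. 4.4.4.

```python
import cmath

def make_output_voltage(v_dd, i_pulse, alpha, r1, r, c1, c2):
    """Return a function vo(t) implementing (4.66) without line inductance."""
    k = r * r1 * c1 * c2
    b = r1 + r + r1 * c2 / c1
    root = cmath.sqrt(b * b - 4.0 * r * r1 * c2 / c1)
    lam1 = (-b + root) / (2.0 * r * r1 * c2)          # roots (4.58)
    lam2 = (-b - root) / (2.0 * r * r1 * c2)
    # Particular solution of (4.63): steady - decay * exp(-alpha * t)
    steady = i_pulse / (k * lam1 * lam2)
    decay = i_pulse / (k * (lam1 + alpha) * (lam2 + alpha))
    a = steady - decay                                 # its value at t = 0
    k1 = (a * lam2 - alpha * decay) / (lam1 - lam2)    # from (4.64) and (4.65)
    k2 = -a - k1                                       # (4.64)

    def vo(t):
        i_r1 = (k1 * cmath.exp(lam1 * t) + k2 * cmath.exp(lam2 * t)
                + steady - decay * cmath.exp(-alpha * t))
        return v_dd - r1 * i_r1.real                   # real part, as in (4.69)

    return vo

def delay_50(vo, v_dd, i_pulse, r1, t_max=2e-9, iterations=60):
    """Bisection for the 50% swing crossing time t0 of (4.76)."""
    target = v_dd - 0.5 * r1 * i_pulse
    lo, hi = 0.0, t_max
    if not (vo(lo) > target > vo(hi)):
        raise ValueError("50% point is not bracketed by [0, t_max]")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if vo(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# R1 = 80 Ohm, R = 36.67 Ohm, Ipulse = 985.3 uA, alpha = 1e10 (values from the
# text); C1 and C2 are assumed placeholders.
vo = make_output_voltage(1.0, 985.3e-6, 1e10, 80.0, 36.67, 5e-15, 60e-15)
t0 = delay_50(vo, 1.0, 985.3e-6, 80.0)
```

The returned function reproduces the limit conditions Vo(0) = Vdd and Vo(∞) = Vdd − R1 Ipulse, which provides a simple sanity check before the crossing time is searched.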
The dynamic energy consumption at the circuit level is estimated during the active
communication period, where the circuit is switching. For this purpose, the total parasitic
capacitance of the circuit is estimated as shown in this section. After that, the dynamic
energy is computed as a function of the communication load (actually the switching ac-
tivity, which is dependent on the communication load), of the total capacitance (circuit
and interconnect), and of the communication segment width:
$$E_d \approx f(L_c)\, C_{total}\, V_{dd}^2\, W_s \tag{4.77}$$
The dynamic energy of each segment is included in the overall system-level macromodel
as explained in Sec. 4.4.5.
Finally, the static energy consumption is estimated from the sum of the leakage cur-
rents in the driver and receiver during the idle time and the static current flow through
the line during the pulse window:
$$E_s \approx \left( I^{driver}_{leakage} + I^{receiver}_{leakage} \right) V_{dd}\, t_{idle} + I_{pulse} V_{dd}\, t_{pulse} \tag{4.78}$$
Both the leakage currents and the signaling current are estimated using the statistical
transistor model. The integration in the macromodel structure is discussed in Sec. 4.4.5
as well.
4.2.4 Performance Evaluation under Voltage Scaling and Body Biasing
The delay plot in Fig. 4.19(a) shows a total delay variation of approx. 430 ps with intercon-
nect length, for lines up to 3 cm long. This relatively small increase confirms the usage
of PCM signaling circuits for global repeaterless lines. It is to be noted, that the driver
itself exhibits a 70 ps delay, which is not dependent on the line length. This fixed delay
represents in fact the lowest delay threshold, beyond which PCM-driven lines show their
speed advantage.
Fig. 4.19: Delay (a) and energy (b) variation with interconnect line length.
(Panel (a): delay [ps] versus line length [cm], with an annotated delay variation of 0.43 ns. Panel (b): static and dynamic energy [J] versus line length [cm].)
A relatively higher static energy consumption is the main downside of current-driven lines, which is also illustrated by the plot from Fig. 4.19(b). Nevertheless, due to the short current pulses, the static energy is limited to a value around 9 pJ. For the evaluations, a pulse current duration equal to the delay on the line and an idle time of 1 ms for leakage evaluations have been assumed. This corresponds to an average power dissipation of 9 nW. It can be noticed that the relative variation with the wire length does not have a significant contribution to the total static energy value. On the other hand, the dynamic energy is strongly dependent on the line capacitance, as indicated by (4.77); therefore, most of its value is determined by the wire length.
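The static energy estimate (4.78) and the conversion of the quoted energy into an average power can be sketched as follows. In this Python illustration the function names are hypothetical and the leakage currents and the pulse duration in the usage line are assumed placeholder values; only the 9 pJ energy, the 1 ms idle time, and the pulse current are taken from the text.

```python
def static_energy(i_leak_driver, i_leak_receiver, v_dd, t_idle, i_pulse, t_pulse):
    """Static energy estimate according to (4.78), all quantities in SI units."""
    leakage = (i_leak_driver + i_leak_receiver) * v_dd * t_idle
    pulse = i_pulse * v_dd * t_pulse
    return leakage + pulse

def average_power(energy, interval):
    """Average power corresponding to an energy spent over a time interval."""
    return energy / interval

# Quoted values: about 9 pJ over an idle time of 1 ms
p_avg = average_power(9e-12, 1e-3)  # 9e-9 W, i.e. 9 nW

# Assumed leakage of 4 nA per side and a 0.5 ns pulse, for illustration only
e_s = static_energy(4e-9, 4e-9, 1.0, 1e-3, 985.3e-6, 0.5e-9)
```

With the assumed numbers, the leakage term accumulated over the idle time dominates the contribution of the single current pulse.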
The line delay is determined by the drive current which charges/discharges the parasitic capacitances. Due to the circuit structure of the PCM line, the drive current is mainly dependent on the voltage drop across the line, from the supply voltage connection at the receiver, to the drain-to-source voltage sum of the driving transistors. Consequently, a change in the body bias of the driver has only a negligible impact on the current value. Voltage scaling, on the other hand, has a stronger influence on the signaling current; however, it also changes the voltage from which the line capacitances are discharged. Nevertheless, due to the very small swing of the output voltage Vdd − Ipulse R1 (R1 is a small resistance, in the order of tens of Ω), the impact of voltage scaling on the overall line delay is also very small. Thus, body biasing and voltage scaling have a noticeable influence only on the small delay of the driver. Evaluations using both the previously developed model and circuit simulations show a change in the driver delay of at most 10 ps with voltage scaling and ±3 ps with body biasing for the investigated range of wire lengths.
The effect of voltage scaling and body biasing on the energy consumption of the PCM
communication segment is shown in Fig. 4.20. It can be seen that the static
energy consumption is highly affected by the body bias, whereas supply voltage scaling
exhibits a relatively smaller influence. As expected, reverse body biasing (negative Vbs)
increases the threshold voltage, which leads to a significant leakage reduction. Forward
body biasing (positive Vbs) has in the case of PCM signaling only the negative effect of
increasing the leakage, since the delay remains unaffected. Note further that
body biasing has a stronger impact on the static energy at higher supply voltages, since
the static current on the line increases linearly with Vdd. As expected from (4.77), body
biasing has no impact on dynamic energy consumption. Thus, voltage scaling is the main
mechanism to control the dynamic energy at circuit level.

Fig. 4.20: Impact of voltage scaling and body bias on the static (a) and dynamic energy
consumption (b) for a 15 mm interconnect line.
[Panels: (a) static energy [J] and (b) dynamic energy [J], each plotted over Vdd [V] and Vbs [V].]
4.3 Voltage-Mode Signaling Model
To evaluate the advantages of the novel PCM signaling technique, a model for the classic
voltage-mode signaling circuit is developed in this section. In addition, the integration
of both models in the design space exploration allows a selection of the PCM circuit only
when it brings a performance advantage over the classic signaling.
4.3.1 Equivalent Current-Source Circuit Model
The buffer circuit used for voltage-mode signaling is shown in Fig. 4.21. Assuming a
symmetric inverter stage, the delay can be estimated, e.g., from the pull-down cycle, where
the NMOS transistor drives the line to ground. In this case, a short current pulse flows through
the line and discharges the gate capacitance of the receiver and the parasitic capacitance
of the line. Accordingly, the equivalent circuit model includes the receiver capacitance,
the line model, and a voltage-controlled current source together with a parasitic output
capacitance for the driving stage, as illustrated in Fig. 4.21.
During the pull-down switch, the drain current Id rises abruptly to a maximum value
corresponding to the full drive of transistor M1 (Vgs = Vds = Vdd), as shown by the
waveforms in Fig. 4.22(a). When Vgs = Vdd, the PMOS at the output of the driver is turned off,
therefore the current sunk by the NMOS originates entirely from the load. After that,
the capacitances on the line start to discharge and the voltage at the receiver gate starts to
decrease. As discussed in Sec. 4.2.3, the propagation time is defined at 50% of the swing,
which for the voltage-mode CMOS buffer occurs at Vdd/2. Thus, an accurate model for the
drain current is required only until Vo = Vdd/2, as indicated in Fig. 4.22.

Fig. 4.21: Voltage-mode buffer circuit and equivalent circuit model.
[Schematic labels: input, output, line (R, L, C), M1, Id, CD, CR.]

Fig. 4.22: Region of interest in the drain current characteristic for computing the delay (a) and
exponential approximation for the delay model (b).
[Panels: (a) input and output (near-end, far-end) waveforms and Id, with the delay and the region
of interest marked; (b) Id versus t, with Id,max, the 50% point, t0, the region of interest, and the
deviation outside the region of interest marked.]
Fig. 4.23: Voltage-mode driver and line model.
[Schematic labels: Id, C1, IC1, V1, R, L (UL = L dI/dt), C2, IC2, Vo.]
Further, the identified region of interest from the drain current characteristic is
approximated as shown in Fig. 4.22(b) with the following exponential function:
\[
I_d(t) =
\begin{cases}
I_{d,max} + \gamma \left[ 1 - e^{\frac{t}{t_0} \ln\left( 1 + \frac{I_{d,max}}{\gamma} \right)} \right], & t < t_0 \\
0, & t \geq t_0
\end{cases}
\tag{4.79}
\]
where Id,max is the maximum drain current of M1 corresponding to Vgs = Vds = Vdd, t0
is the time at which Id = 0 (in this partial approximation), and γ is a curvature fitting
parameter. Typical values for the considered 90 nm technology are γ = 0.5 · 10^−4 and
t0 = 2.5 ns for a 10 mm segment, and t0 = 1.16 ns for a 10 µm segment.
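As a minimal illustration, the approximation (4.79) can be evaluated with a few lines of Python. The function name and the value chosen for Id,max are assumptions made only for this sketch; γ and t0 are the typical values quoted above for a 10 mm segment.

```python
import math

def drain_current(t, id_max, gamma, t0):
    """Exponential drain-current approximation of Eq. (4.79)."""
    if t >= t0:
        return 0.0
    # alpha = (1/t0) * ln(1 + Id,max/gamma), so that Id(t0) = 0
    alpha = math.log(1.0 + id_max / gamma) / t0
    return id_max + gamma * (1.0 - math.exp(alpha * t))

# gamma and t0 as quoted in the text; ID_MAX is a hypothetical value.
ID_MAX = 1.0e-3   # [A], assumed for illustration only
GAMMA = 0.5e-4
T0 = 2.5e-9       # [s]

# Current sampled at 0, t0/4, t0/2, 3*t0/4 and t0
samples = [drain_current(k * T0 / 4, ID_MAX, GAMMA, T0) for k in range(5)]
```

The samples start at Id,max, decrease monotonically, and reach zero at t0, which is the shape sketched in Fig. 4.22(b).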
4.3.2 Analytic Model for Delay and Energy Consumption
The equivalent circuit model derived from the observations made in the previous section
is drawn in Fig. 4.23. Here, the current Id is described by the relationship (4.79) and capac-
itance C1 includes the parasitic output capacitances of the driver. The second capacitor C2
models the sum of the line capacitance and the parasitic gate capacitances of the receiver.
Further, the transistor-level parasitic capacitances are estimated using BSIM4-derived
equations with technology-dependent parameters, as described in Sec. 4.2.2. Similar to
the PCM model, the line parasitics are estimated using the PTM analytic expressions [116].
In the considered case of a pull-down transition, the capacitances are initially charged up to
the supply voltage, such that Q1 (0) = C1Vdd and Q2 (0) = C2Vdd. Consequently, the time
variation of the voltages on capacitances C1 and C2 can be written as:
\[ V_1(t) = V_{dd} - \frac{1}{C_1} \int_0^t I_{C1}(\tau)\, d\tau \tag{4.80} \]
\[ V_o(t) = V_{dd} - \frac{1}{C_2} \int_0^t I_{C2}(\tau)\, d\tau \tag{4.81} \]
The voltage V1 can also be expressed in terms of the output voltage, as:
\[ V_1(t) = V_o(t) - L \frac{dI_{C2}(t)}{dt} - R\, I_{C2}(t) \tag{4.82} \]
which, together with (4.80) and (4.81), gives:
\[ \frac{1}{C_1} \int_0^t I_{C1}(\tau)\, d\tau = \frac{1}{C_2} \int_0^t I_{C2}(\tau)\, d\tau + L \frac{dI_{C2}(t)}{dt} + R\, I_{C2}(t) \tag{4.83} \]
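By differentiating (4.83) and considering that IC1 (t) = Id (t) − IC2 (t), we obtain the following
second-order ODE:
\[ \frac{d^2 I_{C2}(t)}{dt^2} + \frac{R}{L} \cdot \frac{dI_{C2}(t)}{dt} + \frac{C_1 + C_2}{L C_1 C_2}\, I_{C2}(t) = \frac{I_d(t)}{L C_1} \tag{4.84} \]
Replacing $I_{C2}(t) = e^{\lambda t}$ in the homogeneous equation leads to the characteristic equation:
\[ \lambda^2 + \frac{R}{L} \lambda + \frac{C_1 + C_2}{L C_1 C_2} = 0 \tag{4.85} \]
with the solutions:
\[ \lambda_{1,2} = \frac{-\frac{R}{L} \pm \sqrt{\left( \frac{R}{L} \right)^2 - \frac{4 (C_1 + C_2)}{L C_1 C_2}}}{2} \tag{4.86} \]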
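A particular solution of the ODE can be found in the form
$I_{C2p}(t) = u_1(t)\, e^{\lambda_1 t} + u_2(t)\, e^{\lambda_2 t}$
with the variation functions described by:
\[ u_1'(t) = -\frac{I_d(t)}{(\lambda_2 - \lambda_1) L C_1}\, e^{-\lambda_1 t} \tag{4.87} \]
\[ u_2'(t) = \frac{I_d(t)}{(\lambda_2 - \lambda_1) L C_1}\, e^{-\lambda_2 t} \tag{4.88} \]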
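If we re-write the drain current from (4.79) as:
\[ I_d(t) \Big|_{t < t_0} = I_1 - \gamma e^{\alpha t} \tag{4.89} \]
where $I_1 = I_{d,max} + \gamma$ and $\alpha = \frac{1}{t_0} \ln\left( 1 + \frac{I_{d,max}}{\gamma} \right)$, integrating (4.87) and (4.88) gives:
\[ u_1(t) = \frac{I_1}{(\lambda_2 - \lambda_1) L C_1 \lambda_1}\, e^{-\lambda_1 t} + \frac{\gamma}{(\lambda_2 - \lambda_1) L C_1 (\alpha - \lambda_1)}\, e^{(\alpha - \lambda_1) t} \tag{4.90} \]
\[ u_2(t) = -\frac{I_1}{(\lambda_2 - \lambda_1) L C_1 \lambda_2}\, e^{-\lambda_2 t} - \frac{\gamma}{(\lambda_2 - \lambda_1) L C_1 (\alpha - \lambda_2)}\, e^{(\alpha - \lambda_2) t} \tag{4.91} \]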
Finally, the general solution of (4.84) is obtained in the following form:
\[ I_{C2}(t) = K_1 e^{\lambda_1 t} + K_2 e^{\lambda_2 t} + \frac{I_1}{L C_1 \lambda_1 \lambda_2} - \frac{\gamma e^{\alpha t}}{L C_1 (\alpha - \lambda_1)(\alpha - \lambda_2)} \tag{4.92} \]
Recall that (4.92) is valid only for t < t0. Nevertheless, the model must
be accurate only up to the time when Vo reaches half of its full swing, which occurs
before t0, as shown in Fig. 4.22.
Verifying the initial condition IC2 (0) = 0 gives the first equation for computing K1
and K2:
\[ K_1 + K_2 + \frac{I_1}{L C_1 \lambda_1 \lambda_2} - \frac{\gamma}{L C_1 (\alpha - \lambda_1)(\alpha - \lambda_2)} = 0 \tag{4.93} \]
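Recalling that $V_1(t) = V_o(t) - L \frac{dI_{C2}(t)}{dt} - R\, I_{C2}(t)$ and by admitting that V1 (0) = Vo (0) (the
initial condition before the pull-down transition), with IC2 (0) = 0 from above, we obtain:
\[ \frac{dI_{C2}(t)}{dt} \Big|_{t=0} = 0 \tag{4.94} \]
which gives the second equation for finding K1 and K2:
\[ K_1 \lambda_1 + K_2 \lambda_2 - \frac{\gamma \alpha}{L C_1 (\alpha - \lambda_1)(\alpha - \lambda_2)} = 0 \tag{4.95} \]
with the results:
\[ K_1 = \frac{1}{L C_1 (\lambda_1 - \lambda_2)} \left( \frac{\gamma}{\alpha - \lambda_1} + \frac{I_1}{\lambda_1} \right) \tag{4.96} \]
\[ K_2 = \frac{\gamma}{L C_1 (\alpha - \lambda_1)(\alpha - \lambda_2)} - \frac{I_1}{L C_1 \lambda_1 \lambda_2} - K_1 \tag{4.97} \]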
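Finally, the output voltage has the following expression:
\[
\begin{aligned}
V_o(t) = V_{dd} - \frac{1}{C_2} \Bigg[ & \frac{K_1}{\lambda_1} \left( e^{\lambda_1 t} - 1 \right) + \frac{K_2}{\lambda_2} \left( e^{\lambda_2 t} - 1 \right) \\
& - \frac{\gamma \left( e^{\alpha t} - 1 \right)}{\alpha L C_1 (\alpha - \lambda_1)(\alpha - \lambda_2)} + \frac{I_1 t}{L C_1 \lambda_1 \lambda_2} \Bigg]
\end{aligned}
\tag{4.98}
\]
again, valid only for t < t0 and in particular needed only until Vo (t) = Vdd/2. In addition,
the expression for the case when the roots λ1,2 are complex is presented in appendix A.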
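The closed-form result can be checked numerically with a short Python sketch that evaluates (4.86), (4.96), (4.97), and (4.98) directly. The function name and all element values below are hypothetical and chosen only for illustration. Using Python's complex arithmetic keeps the same expression usable when the roots form a complex-conjugate pair; this is a convenience of the sketch, not the formulation of appendix A.

```python
import cmath
import math

def vm_output_voltage(t, vdd, R, L, C1, C2, id_max, gamma, t0):
    """Evaluate Vo(t) from Eq. (4.98); valid for 0 <= t < t0 and distinct roots."""
    if not 0.0 <= t < t0:
        raise ValueError("Eq. (4.98) is only valid for 0 <= t < t0")
    i1 = id_max + gamma
    alpha = math.log(1.0 + id_max / gamma) / t0
    # Roots of the characteristic equation, Eq. (4.86); may be complex.
    disc = cmath.sqrt((R / L) ** 2 - 4.0 * (C1 + C2) / (L * C1 * C2))
    lam1 = (-R / L + disc) / 2.0
    lam2 = (-R / L - disc) / 2.0
    # Coefficients of the particular solution in Eq. (4.92)
    a = gamma / (L * C1 * (alpha - lam1) * (alpha - lam2))
    b = i1 / (L * C1 * lam1 * lam2)
    # Integration constants, Eqs. (4.96) and (4.97)
    k1 = (gamma / (alpha - lam1) + i1 / lam1) / (L * C1 * (lam1 - lam2))
    k2 = a - b - k1
    # Integral of IC2 from 0 to t, i.e. the bracket of Eq. (4.98)
    charge = (k1 / lam1 * (cmath.exp(lam1 * t) - 1.0)
              + k2 / lam2 * (cmath.exp(lam2 * t) - 1.0)
              - a / alpha * (cmath.exp(alpha * t) - 1.0)
              + b * t)
    # The imaginary parts cancel for conjugate roots; keep the real part.
    return vdd - charge.real / C2

# Hypothetical element values for illustration (not taken from the thesis).
PARAMS = dict(vdd=1.0, R=500.0, L=10e-9, C1=20e-15, C2=2e-12,
              id_max=1e-3, gamma=0.5e-4, t0=2.5e-9)
vo_after_500ps = vm_output_voltage(0.5e-9, **PARAMS)
```

With these assumed values the discriminant in (4.86) is negative, so the sketch exercises the complex-root case; a repeated root (zero discriminant) would need separate treatment because of the division by λ1 − λ2.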
Once the analytic expression of the output voltage has been derived, the delay is
evaluated as:
\[ \mathrm{Delay} = t_{50\%} \Big|_{V_o(t_{50\%}) = 0.5 \cdot V_{dd}} \tag{4.99} \]
Note that the voltage-mode buffer has a full swing from Vdd to ground, therefore the
delay is computed at Vdd/2.
The dynamic energy consumption is a function of the switching activity, which is dic-
tated by the communication load divided by the segment width, if we assume one buffer
per segment wire. Therefore, the energy is approximated by the expression (4.77).
Further, the leakage energy consumption is estimated by adding the leakage currents
of the driver and receiver and by multiplying the result with the supply voltage and the
idle time:
\[ E_s \approx \left( I^{driver}_{leakage} + I^{receiver}_{leakage} \right) V_{dd}\, t_{idle} \tag{4.100} \]
Unlike in the case of PCM signaling, there is no significant static current flowing through
the circuit. Again, the leakage currents are evaluated using the statistical transistor model.
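For a single deterministic operating point, (4.100) reduces to a one-line computation. The sketch below uses hypothetical leakage currents; in the actual flow these quantities are statistical distributions produced by the transistor model, which this example does not reproduce.

```python
def leakage_energy(i_leak_driver, i_leak_receiver, vdd, t_idle):
    """Static (leakage) energy of a voltage-mode segment, Eq. (4.100)."""
    return (i_leak_driver + i_leak_receiver) * vdd * t_idle

# Hypothetical leakage currents [A]; Vdd = 1 V and a 1 ms idle time.
e_static = leakage_energy(4e-9, 3e-9, vdd=1.0, t_idle=1e-3)  # about 7 pJ
```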
Fig. 4.24: Delay (a) and static energy (b) comparison between voltage-mode and PCM signaling.
[Panels: (a) delay [ps] and (b) static energy [J] versus line length [cm], each with one curve for
PCM and one for voltage-mode signaling.]
4.3.3 Performance Evaluation under Voltage Scaling and Body Biasing
The classic voltage-mode signaling technique exhibits a stronger increase in delay with
the interconnect length than the previously analyzed PCM method. Fig. 4.24(a)
shows a direct comparison for interconnect lengths varying between 1 µm and 3 cm. It is
also important to notice that voltage-mode signaling actually achieves a shorter
propagation delay for local interconnects. More precisely, up to a line length of approximately
600 µm the voltage-mode signaling achieves smaller delays, which is the direct
consequence of the lower complexity of the two-stage buffer with respect to the PCM driver.
For lines longer than 600 µm, the propagation time is dominated by the interconnect delay
and the current-mode method becomes more efficient.
Fig. 4.24(b) illustrates the static energy difference between the two analyzed signaling
methods. As expected, the PCM segment has a higher static energy consumption due
to the signaling current. Nevertheless, the difference of up to 3 pJ (for line lengths of
3 cm) is still relatively low compared to the gain achieved in delay. Note also, that the
static energy of the voltage-mode line is determined only by the leakage currents flowing
through the driver and receiver buffers in idle mode. Thus, interconnect length has no
impact on the static energy consumption of the voltage-mode circuit.
In a voltage-mode line the delay is primarily determined by the strength of the driver
current, which charges or discharges the line and circuit capacitance. As a direct
consequence, the delay can be significantly influenced through voltage scaling and body
biasing, as shown by the results from Fig. 4.25(a). Here, the delay increases by
approx. 700 ps for a 15 mm interconnect line when the supply voltage is decreased by
0.4 V. In addition, a ±0.4 V body bias varies the delay over a range of 98 ps at the nominal
Vdd, while the impact increases to a variation range of 755 ps at 0.6 V.
Similar to the results from the PCM circuit, body biasing has a much stronger influence
on the static energy consumption than voltage scaling. Again, the effect is more
significant at higher supply voltages and can vary the energy consumption by a factor of
11.

Fig. 4.25: Influence of voltage scaling and body biasing on the delay (a) and static energy
consumption (b) of a 15 mm voltage-mode line.
[Panels: (a) delay [ps] and (b) static energy [J], each plotted over Vdd [V] and Vbs [V].]
4.4 Modeling of Communication Segments
As mentioned in Sec. 2.1.1 and shown in Fig. 2.3, this work uses the concept of commu-
nication nodes (CNs) to model the on-chip communication. Chapter 3 developed a sta-
tistical methodology for representing and propagating the variability across performance
macromodels and discussed in detail the representation of processing nodes (PNs). After
developing in the current chapter a thorough model for the communication circuits, we
are able to present the modeling of communication nodes in the context of the overall
delay and energy macromodels.
The information required for building the CNs includes the following:
• Inter-task communication loads, extracted in the profiling step and specified in the
input configuration file
• Task-to-resource mapping associations, created during the design exploration steps
• Resource scheduling lists, also generated during the optimization loop
Given the resource mapping configuration, the inter-task communication needs are iden-
tified: tasks connected by data dependencies and assigned to different resources require
an inter-resource communication segment. Given the communication needs and the
resource placement information, the synthesis algorithm allocates communication seg-
ments between the resources, using the circuit-level models developed in this chapter.
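Fig. 4.26: Communication segment using the available circuit-level models.
[Block diagram labels: clocked D registers at the sending and receiving side (clk, clk1, clk2),
signal, Cs, choice between the PCM circuit with differential current sensing and the voltage-mode
buffer, line modeled by the PTM R, L, C expressions, supply Vdd, body bias Vb.]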
This section presents the data structures employed to represent the inter-resource com-
munication nodes, shows how the circuit-level models are employed in the CN structures,
and discusses the integration of communication nodes into the system-level macromod-
els.
4.4.1 Transceiver and Interconnect Model
A communication segment is represented within the synthesis framework as an intercon-
nect line model and a transceiver model. To achieve the fast estimations required during
the optimization loop, the transceiver uses the analytic models for signaling circuits de-
veloped in Sec. 4.2.3 and 4.3.2, while the interconnect line is represented using the R, L,
and C analytic expressions of the predictive technology model [116].
In this way, the optimization algorithm can select from the available signaling meth-
ods the one which has the better performance for a given segment. In addition, through
voltage scaling and body biasing at the circuit level, the CN performance can be further
adjusted in fine steps.
Once an optimized communication architecture is found during the synthesis, a pre-
cise validation can be performed through simulation, using the complete transistor-level
circuits of the driver and receiver, and a more accurate model of the interconnect line.
Such a model for the segment lines which is accurate over a wide range of frequencies is
developed in chapter 5.
Fig. 4.27: Floorplan clusters enclosing the on-chip resources (a) and the corresponding floorplan
cluster tree (FCT) (b).
[Panels: (a) resources R1 to R6 enclosed by a local cluster, an intermediate cluster, and the global
(root) cluster; (b) cluster tree with cluster levels 0, 1, and 2 and the leaves R2, R3, R4, R5, R1, R6.]
4.4.2 Floorplan Model using Clusters
Communication parameters, such as segment length, width, and the resulting speed,
are strongly dependent on floorplanning and the exact placement of cores. Since the
floorplanning and placement steps in the hardware/software co-synthesis are beyond
the scope of this thesis, they are abstracted using the following cluster model.
A floorplan cluster encloses the resources placed on the chip within a given distance.
This inter-resource distance determines the level of a cluster, such that resources placed
closer to each other will belong to local (or lower-level) clusters, whereas resources placed
e.g. at the opposite corners of a die will be enclosed by a global (or high-level) cluster.
The concept of floorplan clusters is illustrated in Fig. 4.27(a) for 6 resources placed on
the die. Smaller clusters are associated with local communication segments, whereas the
top-level cluster encloses resources communicating over global lines. Further, a cluster level
is associated with each cluster size and all clusters are organized in a linked data structure
called floorplan cluster tree (FCT). The FCT associated with the three clusters is represented
in Fig. 4.27(b), where level 0 corresponds to the root cluster and level 2 indicates the local
cluster.
Complex MPSoC architectures are typically described using several local and
intermediate clusters. Moreover, the number of cluster levels is not limited, therefore the
granularity of floorplan clusters can be adjusted. In addition, characteristics such as
segment length and width can be associated with each cluster level according to the average
distance between the enclosed resources. Thus, a communication model between two
given resources can quickly estimate the segment length and width by searching their
first common parent in the floorplan cluster tree. The lowest common parent node in the
FCT represents the level of the smallest cluster which encloses the two resources, hence
the length and width of the shortest communication path are employed in performance
estimations.

Fig. 4.28: Estimation of communication circuit location for considering spatially-correlated
parameter variations.
[Diagram labels: resources R1 and R2 on the spatial correlation grid, center connecting line,
placement estimation.]
Another important aspect to be considered when using the FCT model is that a first
routing estimation must be known together with the floorplan information. This is par-
ticularly important, on the one hand, for custom hierarchical bus architectures, where
the main directions of the buses must be specified. On the other hand, for network-on-
chip (NoC) architectures, the communication distance between cores can be estimated,
due to the regular structure of a NoC. In this context, the routing algorithm determines
the traveling distance, therefore the length estimation is strictly coupled with the routing
mechanism. Resource clustering helps in this case, giving a first coarse estimation if an
exact knowledge of the routing algorithm is not available during the synthesis.
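The lookup of the first common parent in the floorplan cluster tree can be sketched as follows. The class and function names are hypothetical and not the data structures of the actual framework; the two clusters and their attributes follow the example values given later in Tab. 4.2 (FC0: L = 5 mm, Ws = 4; FC1: L = 50 µm, Ws = 2).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cluster:
    name: str
    level: int                 # 0 = root (global) cluster, larger = more local
    length_m: float            # segment length associated with this cluster level
    width: int                 # segment width (number of parallel wires)
    parent: Optional["Cluster"] = None

def ancestors(cluster):
    """Return the chain from a cluster up to the root, starting with the cluster itself."""
    chain = []
    while cluster is not None:
        chain.append(cluster)
        cluster = cluster.parent
    return chain

def first_common_cluster(a, b):
    """Lowest common parent of two clusters in the floorplan cluster tree."""
    seen = {id(c) for c in ancestors(a)}
    for c in ancestors(b):
        if id(c) in seen:
            return c
    raise ValueError("clusters do not share a root")

# Two-level example: R1 and R3 sit in the local cluster FC1, R2 only in the root FC0.
fc0 = Cluster("FC0", level=0, length_m=5e-3, width=4)
fc1 = Cluster("FC1", level=1, length_m=50e-6, width=2, parent=fc0)
cluster_of = {"R1": fc1, "R3": fc1, "R2": fc0}

def segment_attributes(res_a, res_b):
    """Length and width of the segment connecting two resources."""
    c = first_common_cluster(cluster_of[res_a], cluster_of[res_b])
    return c.length_m, c.width
```

Walking parent pointers keeps the lookup cost proportional to the tree depth, which matches the intent of a quick estimation inside the optimization loop.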
4.4.3 Estimation of Communication Circuit Placement on Die
The position of the driver and receiver circuits on the die must be known to consider
spatially-correlated process and temperature variations. Nevertheless, the exact location
of communication segments is not known during the synthesis loop, as an optimized
communication architecture is searched at each step in the solution space. Thus, an esti-
mation of the communication circuit position is performed as illustrated in Fig. 4.28.
Given the floorplan specification for the processing resources Ri, the center position of
each resource is determined in coordinates of the correlation grid. After that, the distance
between the centers of the two communicating resources is evaluated from the center
coordinates and the positions of the communication circuits are evaluated at the
intersection between the line connecting the centers of the two resources with the resource
boundaries. In the example from Fig. 4.28, assuming resource R1 of size w1 × h1 with
center coordinates (x1, y1) and resource R2 with size w2 × h2 and center coordinates (x2, y2)
where d is the center-to-center distance, the coordinates of the driver circuit at the R1 side
are evaluated as:
\[
x_{d,1} =
\begin{cases}
x_1 + (x_2 - x_1) \dfrac{h_1}{2 (y_2 - y_1)}, & \text{if } \dfrac{|x_2 - x_1|}{w_1} < \dfrac{d}{\sqrt{w_1^2 + h_1^2}} \\[2ex]
x_1 + \dfrac{w_1}{2} \operatorname{sign}(x_2 - x_1), & \text{otherwise}
\end{cases}
\tag{4.101}
\]
\[
y_{d,1} =
\begin{cases}
y_1 + (y_2 - y_1) \dfrac{w_1}{2 (x_2 - x_1)}, & \text{if } \dfrac{|y_2 - y_1|}{h_1} < \dfrac{d}{\sqrt{w_1^2 + h_1^2}} \\[2ex]
y_1 + \dfrac{h_1}{2} \operatorname{sign}(y_2 - y_1), & \text{otherwise}
\end{cases}
\tag{4.102}
\]
The coordinates of the communication circuit at the R2 side are estimated similarly.

Fig. 4.29: Fast approximation of the delay solution using the bisection method (example shown
for the PCM signaling circuit).
[Plot labels: Vo versus t, with the levels Vdd, Vdd − 0.5 R1 Ipulse, and Vdd − R1 Ipulse, the initial
guess, two bisection steps, and the solution approximation.]
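The boundary-intersection rule of (4.101) and (4.102) can be illustrated with a short Python sketch. It is written in a parametric form (scaling the center-to-center vector until it reaches the nearest side of R1), which is an assumption of this sketch rather than the literal expressions above; in particular, it uses absolute values of the coordinate differences so that the result also holds when R2 lies to the left of or below R1. The function name is hypothetical.

```python
import math

def boundary_point(x1, y1, w1, h1, x2, y2):
    """Point where the line from the center of R1 towards (x2, y2) leaves R1."""
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        raise ValueError("resource centers coincide")
    # Scale factors needed to reach a vertical side (sx) or a horizontal side (sy)
    sx = (w1 / 2) / abs(dx) if dx != 0 else math.inf
    sy = (h1 / 2) / abs(dy) if dy != 0 else math.inf
    s = min(sx, sy)          # the side that is hit first bounds the segment
    return x1 + s * dx, y1 + s * dy

# R1 of size 4 x 2 centered at the origin, R2 centered at (10, 10):
# the connecting line leaves R1 through its top side.
xd, yd = boundary_point(0.0, 0.0, 4.0, 2.0, 10.0, 10.0)
```

For this example the first case of (4.101) and the second case of (4.102) apply, and both give the same point as the sketch.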
4.4.4 Quick Delay Solution
The delay of communication segments is evaluated using the analytic models of signaling
circuits. Note that solving (4.76) and (4.99) for the delay means finding the roots of
transcendental equations. Thus, this work solves (4.76) and (4.99) numerically
using the bisection method [38]. First, the middle of the interval
[0, tinitial] is checked, with an initial guess tinitial large enough to cover all possible delay
values. After checking the middle, the time interval is halved and the method
continues as shown in Fig. 4.29 until the approximation interval is smaller than an adjustable ε.
Then, the considered solution is the middle of this approximation interval, which gives
an approximation error of at most ε/2.
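A minimal Python sketch of this bisection search is given below. The function name and the exponential stand-in waveform are assumptions made for illustration; in the actual flow the waveform would be the analytic output voltage of the selected signaling circuit, and the target level would be Vdd/2 for the voltage-mode buffer.

```python
import math

def bisect_crossing(vo, target, t_initial, eps):
    """Time in [0, t_initial] at which the falling voltage vo(t) crosses target."""
    lo, hi = 0.0, t_initial
    if not (vo(lo) >= target >= vo(hi)):
        raise ValueError("crossing is not bracketed by [0, t_initial]")
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if vo(mid) > target:
            lo = mid        # crossing lies in the upper half of the interval
        else:
            hi = mid        # crossing lies in the lower half of the interval
    return 0.5 * (lo + hi)  # error of at most eps / 2

# Stand-in waveform (hypothetical): simple exponential discharge from Vdd.
VDD, TAU = 1.0, 1e-9
EPS = 1e-13
t50 = bisect_crossing(lambda t: VDD * math.exp(-t / TAU), 0.5 * VDD, 10 * TAU, EPS)
```

For the stand-in waveform the exact crossing time is TAU·ln 2, so the returned value can be compared against it directly.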
4.4.5 Implementation of Communication Nodes
Starting from the formulations developed in Sec. 4.2 and 4.3, a statistical model for
communication circuits has been implemented, having the structure shown in Fig. 4.30. The
structure embeds the models of signaling circuits, as well as the interconnect line model
attached to the segment. The wire attributes are read from the first common floorplan
cluster of the connecting resources and the spatially-dependent process and temperature
variations are read from the variations and correlations models described in Sec. 4.1.2.
Additional inputs are the choice of signaling circuit, as well as the Vdd and Vbs values.
According to the models, the statistical distributions for the segment delay, for the static
current (in the case of PCM signaling), and for the leakage currents are evaluated and
available as outputs. In addition, the total segment capacitance is also available for
dynamic energy evaluations.

Fig. 4.30: Statistical model for signaling resources embedding the analytical formulations from
Sec. 4.2 and 4.3.
[Block diagram: parameter variations (process, temperature) P1, P2, P3, ..., Pn as inputs of the
signaling resource model (signaling circuit and PTM line model); outputs are the delay, the static
current, and the leakage currents.]

Fig. 4.31: Implementation of a communication node in the delay macromodel.
[Diagram labels: T_e^i, L_c^{i,j}, Ws, Delay, division, sum, and max operators, T_s^j, CN structural
element.]
The delay of communication nodes is included in the overall system macromodel as
shown in Fig. 4.31. Here, the total delay of the communication across a given segment is
computed as the communication load $L_c^{i,j}$ (in bits) between two communicating tasks PNi
and PNj multiplied by the delay of transmitting a single bit over the line. For
communication segments with multiple parallel wires, the communication load is first divided by
the segment width Ws. The communication delay computed in this way is then inserted
between the end of the execution of PNi and the start of the execution of PNj.
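For a single deterministic sample, this computation can be sketched in Python as shown below. The function names, the per-bit delay, and the conversion of 1 MB to 8·10^6 bits are assumptions made for illustration; the macromodel itself operates on statistical distributions, which the sketch does not reproduce.

```python
def communication_delay(load_bits, segment_width, bit_delay_s):
    """Delay for moving load_bits over segment_width parallel wires."""
    return (load_bits / segment_width) * bit_delay_s

def successor_start(end_of_predecessor_s, load_bits, segment_width, bit_delay_s):
    """Earliest start of PNj when the transfer begins at the end of PNi."""
    return end_of_predecessor_s + communication_delay(
        load_bits, segment_width, bit_delay_s)

# Hypothetical numbers: 100 MB load (taken as 8e8 bits), two wires, 25 ps per bit.
delay_s = communication_delay(100 * 8e6, 2, 25e-12)   # about 10 ms
start_s = successor_start(1e-6, 100 * 8e6, 2, 25e-12)
```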
Whenever a change occurs in the communication circuit, such as a change in Vdd or
Vbs, or the change of signaling method, the delay random variable must be reevaluated.
As a consequence, all the downstream PM nodes starting with the multiplication node
shown in Fig. 4.31 must reset their evaluation and update the stored pdf to reflect the
change. Further, during task re-mappings to different resources, the structural elements
for the newly-required CNs must be built and inserted into the macromodel, while the
CNs between the tasks which do not need inter-resource communications anymore due
to the different mappings must be removed from the macromodel. Adding and removing
CN structural elements requires also the propagation of updates downstream in the delay
macromodel.

Fig. 4.32: Inclusion of communication nodes in the dynamic energy macromodel.
[Diagram labels: P_{d,i}^{RT_j}, T_i^x, E_{d,i}, E_{d,total}, L_c^{i,j}, C, Vdd^2, signaling resource
model, product and sum operators, CN structural element.]
In the case of dynamic energy, a CN is included in the macromodel as illustrated in
Fig. 4.32. Note that the total capacitance of the segment is computed using the signaling
resource model as the sum of the line capacitance, the output capacitance of the
driver, and the input capacitance of the receiver. Therefore, whenever the signaling
circuit is changed during the optimization loop, the distribution of C is updated. This change
requires an update of the evaluation for the product operator in the CN structural
element and for the output sum operator of the dynamic energy macromodel. Likewise,
a change in Vdd for the current segment (during voltage scaling) triggers a similar update.
Finally, communication nodes are represented within the leakage energy macromodel
by structural elements similar to the one shown in Fig. 4.33. The upper branch of the
structural element evaluates the leakage energy of the driver and receiver circuits using
the leakage currents computed using the transistor-level model. The lower branch is only
present in the structural elements representing PCM signaling circuits and evaluates the
static energy consumption due to the signaling current pulses. Again, changes during
the optimization loop, such as choice of circuit, voltage scaling, and body biasing require
the update of statistical evaluations in the corresponding structural element and in the
output node of the macromodel.
Fig. 4.33: Structural element for modeling the static energy of a communication node within the
leakage energy macromodel (example shown for a PCM signaling circuit).
[Diagram labels: P_{l,RT_j}, E_{l,i}, E_{l,total}, leakage currents, Vdd, sum of all slacks of the
segment, static current, Vdd, delay, sum operators, CN structural element.]
4.4.6 Performance Results
A three-task example with an emphasis on inter-resource communication is used to test
the performance characteristics of the developed circuit-level models. Fig. 4.34(a) shows
the task dependencies and resource mappings for the considered example. Two
communication nodes implement the data transfers between the resources and are synthesized
as two communication segments as shown in Fig. 4.34(c). It is further assumed that
resources R3 and R1 are placed close to each other and thus will be connected by a short
local segment, whereas R2 is located farther away on the chip and will be connected using
a global segment. The floorplanning information is stored in the two-level cluster tree
depicted in Fig. 4.34(b), where the root cluster FC0 corresponds to the long global segment
connecting R3 with R2 and the local cluster FC1 covers the distance between R3 and R1.
Tab. 4.2 shows the values used for this example. In order to emphasize the perfor-
mance costs of communication activities, the execution times and power dissipation lev-
els of the processing resources have been assumed to be relatively small. The opposite
holds for the communication loads, which have been chosen to be large enough such
that the total performance cost is strongly influenced by the communication synthesis.
Further, variations in the communication loads, execution times, and power dissipations
have been specified using normal distributions N (µ, σ). The communication in the local
cluster is synthesized using a segment of length L = 50µm, for which the voltage-mode
(VM) signaling achieves the smallest delay. In contrast, the root cluster requires a segment
of length L = 5mm, which lies in the region where PCM signaling performs better.
Fig. 4.34: Task graph example with emphasis on communication nodes (a), the associated
floorplan cluster tree (b), and the modeled communication segments (c).
[Panels: (a) task graph between Start and End with tasks 1, 2, and 3 on resources R1, R2, and R3
and the communication nodes CN1 and CN2; (b) cluster tree with root FC0 (level 0) and local
cluster FC1 (level 1) over R1, R3, and R2; (c) segment 1 and segment 2 connecting the resources.]

Parameter          Value
L_c^{1,3}          N (100, 2) [MB]
L_c^{2,3}          N (50, 2) [MB]
P_{l,RT_i}         N (5, 1) [µW]
T^x_{i,RT_j}       N (20, 2) [ps]
P_{d,i}^{RT_j}     N (5, 1) [µW]
FC0                L = 5 mm, Ws = 4
FC1                L = 50 µm, Ws = 2
Tab. 4.2: Input values for the communication synthesis of the three-task example.

The statistical evaluations in the macromodel were performed on discretized pdfs
with Nb = 30 bins and a number of Nsb = 500 individual samples from a total sample
count Ns = 10 000 for the numerical operators. The relevant sigma-domain for the
statistical distributions has been set to ±3σ and a 10 × 10 correlation grid was employed for
parameter variations.
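To make the discretization settings concrete, the following Python sketch bins a normal distribution into Nb = 30 equal-width bins over the ±3σ domain. It is only an assumed illustration: the function name is hypothetical, and the renormalization of the truncated tails is a choice of this sketch, not necessarily the scheme used by the statistical operators of Chapter 3.

```python
import math

def discretize_normal(mu, sigma, n_bins=30, n_sigma=3.0):
    """Bin centers and probability masses of N(mu, sigma) on mu +/- n_sigma*sigma."""
    lo = mu - n_sigma * sigma
    width = 2.0 * n_sigma * sigma / n_bins

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    edges = [lo + k * width for k in range(n_bins + 1)]
    masses = [cdf(edges[k + 1]) - cdf(edges[k]) for k in range(n_bins)]
    total = sum(masses)                      # slightly below 1 due to the cut tails
    centers = [edges[k] + 0.5 * width for k in range(n_bins)]
    return centers, [m / total for m in masses]   # renormalized (assumption)

# Communication load of the local segment from Tab. 4.2: N(100, 2) [MB]
centers, masses = discretize_normal(100.0, 2.0)
```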
First, the delay of the synthesized architecture is analyzed with respect to the different
signaling choices. The results plotted in Fig. 4.35 show the delay pdfs evaluated using the
delay macromodel for different signaling circuits on the two communication segments.
Using the PCM signaling method on the first segment is detrimental for the delay,
since the 50 µm length of segment 1 lies below the threshold where PCM signaling starts
to achieve smaller delays, as discussed in Sec. 4.3.3 and shown by Fig. 4.24(a). In addition,
the high communication load $L_c^{1,3}$ amplifies the delay difference between the two
signaling methods. Thus, the two configurations with PCM signaling on segment 1 achieve the
worst delays, as shown by the plots in Fig. 4.35. Using voltage-mode signaling on the first
segment reduces the delay mean by at least 5 ms in this example.
Further, the second segment with a large length is more appropriate for PCM signaling.
Thus, the use of PCM signaling on the second segment reduces the delay on average
by 6 ms and, combined with voltage-mode signaling on the first segment, represents the
optimum configuration for the given example.
Fig. 4.35: Total delay of the synthesized structure from Fig. 4.34(c) with different signaling circuits
on the two communication segments.
[Plot: normalized pdf versus total delay [ms] for the four combinations of voltage-mode (VM) and
PCM signaling on segments 1 and 2.]
The impact of body biasing on the leakage energy distribution is investigated in Fig. 4.36
for the minimum-delay configuration identified from Fig. 4.35. Here, the sweep of Vbs
from −0.4 V to +0.4 V increases the leakage energy by a factor of 1.5. As discussed in
Sec. 4.2.4 and 4.3.3, the leakage energy of both signaling methods is significantly
influenced by body biasing.
To conclude, the modeled signaling circuits are optimally employed in different cases.
For very short communication segments, voltage-mode signaling achieves the lowest
delay, whereas PCM signaling has an advantage in longer communication lines. Voltage
scaling mainly influences the delay of voltage-mode circuits and has less of an
impact on the PCM lines (only the delay of the driver is affected, which represents a small
fraction of the total delay of the line). Body biasing also has a negligible impact on
the delay of PCM lines, while showing an influence on voltage-mode circuits (especially
at lower Vdd values). In contrast, leakage energy is strongly dependent on the body bias,
while the dynamic energy is quadratically influenced by voltage scaling.
4.5 Summary
Different signaling methods and circuit-level techniques can be used in on-chip commu-
nication to optimize the delay or energy consumption. For instance, pulsed current-mode
(PCM) signaling achieves a low delay over long interconnect lines, whereas conventional
voltage-mode signaling has a low static energy consumption. To embed the signaling circuits in the communication synthesis flow, a set of analytic models has been developed in this chapter, starting from process-dependent parameters and circuit-level analysis.
[Figure 4.36 plot: normalized pdf versus total leakage energy [µJ] (axis range 0.1 to 0.6 µJ) for Vbs = -0.4 V and Vbs = +0.4 V.]
Fig. 4.36: Influence of body biasing on the leakage energy of the configuration with voltage-mode signaling on segment 1 and PCM signaling on segment 2.
First, a technology-dependent statistical transistor model has been derived from the
BSIM4 model equations. The developed current-source model supports variability de-
scriptions for all process-dependent parameters and employs the statistical operators de-
veloped in the previous chapter to propagate the parameter distributions throughout the
model expressions. Spatially-correlated intra-die parameter variations are described us-
ing a two-dimensional correlation grid and a correlation decay model. Within this con-
text, the use of principal component analysis (PCA) for the variable decomposition has
also been presented.
Furthermore, the PCM and voltage-mode signaling circuits have been analyzed and
modeled using equivalent circuit representations. Using the current-source transistor
model, an analytic expression has been derived for the voltage at the receiver side and
the delay has been computed at 50% voltage swing. In addition, static and dynamic
energy models have been derived for the signaling circuits based on the current values
and on the parasitic capacitances. All model evaluations are performed statistically us-
ing the underlying current-source transistor model which is dependent on process and
environmental variations. The impact of voltage scaling and body biasing on the circuit
performance has also been analyzed.
Finally, the circuit-level statistical models have been employed for modeling entire
communication segments. The segment parameters, such as length and bitwidth, are derived from a floorplan model using clusters, while the location of transceiver circuits
is approximated from the placement of processing resources. The inclusion of segment
models into the system-level performance macromodels employed in the optimization
has also been presented and the particularities of updates in the signaling method, volt-
age scaling, and body biasing have been discussed. The use of circuit-level models for
communication segments has been demonstrated by means of a two-segment example,
where the delay-optimal signaling configuration and the influence of body biasing on
leakage energy consumption have been discussed.
Chapter 5
Technology-Aware Characterization
Method for On-Chip Segments
Contents
5.1 Wideband Characterization Method
5.1.1 Interconnect Modeling Challenges
5.1.2 Multistep Extrapolated S-Parameter Model
5.2 Parameter Extraction Framework
5.3 Multistep Extrapolation Method
5.3.1 Extraction of the Base Parameter Set
5.3.2 Incremental Extrapolation
5.3.3 Passivity Enforcement
5.4 Experimental Validation
5.5 Summary
Since the precision of the developed communication macromodels is directly determined by the estimation accuracy for propagation delay and energy consumption across the lines, a further refinement of the interconnect model is desirable. In particular, high-frequency signal degradations such as dielectric losses, reflections, and crosstalk become important at high switching speeds and require more complex interconnect models. Therefore, this chapter presents a novel wide-bandwidth interconnect modeling technique, which takes into account the complex stacked dielectric environments and tightly coupled metal layers present in state-of-the-art CMOS processes. Additionally, the presented method is applied to a 90-nm technology and the results are compared with a full-wave field simulator.
Sec. 5.1 briefly enumerates the main challenges of interconnect modeling and introduces our extrapolated S-parameter model. Further, Sec. 5.2 describes the simulation framework employed for extracting the base parameter set used within the model. Next, the parameter extrapolation method is detailed in Sec. 5.3. Finally, the experimental evaluations are presented and discussed in Sec. 5.4.
5.1 Wideband Characterization Method
At gigahertz frequencies, bus data and clock signals in integrated circuits enter the microwave range and the global on-chip interconnects become an increasingly critical bottleneck for the overall system performance [48, 103]. Moreover, numerous vias, crossing lines, and dielectric discontinuities, as well as a high wire packing density, are common attributes of state-of-the-art CMOS processes, but are nevertheless a frequent cause of crosstalk and reflections [54]. In addition, significant signal degradations caused by skin effects and dielectric losses increase with frequency and can no longer be ignored in present interconnect wires. In this respect, an increasingly high percentage of the final circuit performance is dictated by the interconnects [48], although devices and, recently, device parameter variability continue to account for a significant share of the performance. Thus, with increasingly high integration scales, the electrical performance of interconnects must be accurately characterized, modeled, and seamlessly integrated into IC design flows.
There are two common approaches in the analysis, measurement, and description of
interconnect structures, namely in the time domain (using e.g. time-domain reflectometry
or eye diagrams) and in the frequency domain (employing e.g. S-parameters) [103]. Al-
though the two approaches may embed the same information in different forms, for prac-
tical circuit modeling purposes there are other factors, such as simulator support or the
amount of computational overhead vs. accuracy, which decide which method becomes
more appropriate. Furthermore, the analysis of on-chip interconnects is significantly bur-
dened by challenging factors like high losses, scaled aspect ratios, increased number of
wires, and strong non-uniformities in the dielectric stack [86], which contribute to the
difficulty of employing standard measurement and simulation techniques.
In order to enable the accurate characterization of such parasitic effects within a practical workflow, designers need efficient performance estimation models at lower abstraction levels, which are capable of describing arbitrary interconnect structures [54] and are developed to support integration with industry-standard simulation frameworks. For instance, signal integrity analyses in digital communication structures operating at gigahertz frequencies [78] require the interconnect models to be valid over very wide frequency ranges.
Fig. 5.1: Complexity of mutually-coupled inductances in distributed RLCG models.
5.1.1 Interconnect Modeling Challenges
The parasitic effects which affect the densely-packed interconnect wires, such as dielectric
and substrate-induced dispersion, skin effects, and proximity effects, are strongly depen-
dent on frequency [137] and need to be carefully considered by the interconnect models.
Further, the evaluation of self and mutual impedances requires finding the current return
paths for each individual wire, which are also frequency-dependent [109]. In addition,
the actual return paths are difficult to estimate in interconnects, since there is no ground
plane between the metal layers. Finally, the rapidly-switching signals exhibit very nar-
row rise and fall times and therefore contain significant spectral components within wide
frequency ranges.
Interconnect models have traditionally evolved from a simple lumped capacitance, through lumped and distributed RC models, to the state-of-the-art transmission-line distributed RLCG chains. Lumped and distributed RC models neglect inductive effects and fail to model lossy interconnect lines with propagation delays comparable to or larger than the signal rise time [111]. Inductively and capacitively coupled, distributed RLCG models are generally preferred today [91, 137, 53], as they provide a good tradeoff between accuracy and model complexity. Within a distributed RLCG representation, the wires are divided into cascaded segments of circuit elements, extracted to reflect the interconnect response up to the desired significant frequency. These models are designed to capture high-frequency effects, are directly compatible with circuit-level simulators, and are fast to simulate. On the other hand, they rely on coupled mutual inductances between all segments from all chains, as illustrated in Fig. 5.1 for only one inductance of a single chain segment. As a consequence, the modeling complexity increases exponentially with the number of cascaded segments and it becomes extremely hard to compute the value of each mutually coupled inductance pair.
Field solvers are commonly employed for computing accurate capacitance values [112] and for resistance and inductance extractions [88]. Nevertheless, for the fast estimations of arbitrary interconnect structures needed in the early design stages, analytic expressions have been developed for capacitance [166] and self-inductance [145] computation. On the other hand, mutual inductance values can only be estimated for two parallel-running lines of equal wire length [123], hence they are restricted to a small number of cases. Field solver extractions of wire parameters, while very accurate, are not applicable for real-time estimations, due to the time overhead and complex structural setup they imply. Moreover, the extracted models exhibit a decreasing accuracy with increasing frequency and their maximum frequency of validity is specified in terms of the acceptable error [111]. In addition, distributed RLCG models rely on the quasi-TEM (transverse electromagnetic) propagation of electromagnetic waves in transmission lines [53], which does not account for radiation losses and steep discontinuities (vertical vias, wire segment bends) [137] and assumes that the cross-sectional wire dimensions are much smaller than the wavelength at the maximum frequency of interest. Where these assumptions do not hold, more precise electromagnetic analyses might be required.
Full-wave characterization methods [137] are ideal for accurate wide frequency-range modeling purposes, as they rely on a direct discretization of Maxwell’s equations and find a numerical solution at every frequency. Such methods include differential-based approaches, such as frequency-domain finite element solvers [76] or time-domain finite difference solvers [74], and integral-based techniques, such as the method of moments [114] and the partial element equivalent circuits (PEEC) [138, 137]. The discretization and numerical solution of Maxwell’s equations in differential or integral form requires, however, a substantial computational overhead [137]. Hence, while offering the highest accuracy, these methods are not directly applicable to interconnect synthesis applications, unless a method exists to extract fast characterizations of arbitrary interconnect segments.
5.1.2 Multistep Extrapolated S-Parameter Model
As discussed in Sec. 5.1.1, there are two main directions in the high-frequency intercon-
nect modeling: first, the full-wave numerical methods, with high accuracy but restricted
by their large computation time, and second, the transmission-line distributed RLCG cir-
cuit models, with a good accuracy over the frequency range, but with a very complex
tracking of the mutual inductive couplings between the distributed segments. In con-
trast, a third approach with limited complexity is proposed here, which exhibits a mod-
eling performance close to the precision of a field simulator. This method consists of an
incremental extrapolation technique for generating a set of S-parameters for an arbitrary
interconnect segment in a given CMOS process, which relies on a predefined set of mea-
sured parameters obtained either with a vector network analyzer (VNA), or with a field
simulator and a structural model of the silicon environment. The resulting model of the
interconnect segment is an n-port with its associated S-parameter matrix.
[Figure 5.2 flowchart: CMOS process information feeds a structural model and test structures with de-correlated attribute sweeps; a full-wave field simulator or VNA measurements produce the initial S-parameter set; after passivation, a base parameter is selected for an arbitrary segment parameter request (S_p1p2 with n, M_k, [w_i], [s_i], NDF_k-1, NDF_k+1) and passed through variable-length, variable-width, variable-spacing, and variable-NDF extrapolation and correction steps, followed by passivation, to yield the extrapolated parameter S_p1p2.]
Fig. 5.2: Overview of the extrapolated S-parameter modeling workflow.
As illustrated in Fig. 5.2, the proposed methodology starts with an initial set of extracted parameters, which samples the entire range of possible interconnect structures, as seen from a designer’s point of view, in a predetermined way. This initial set explores variations in the metal layer, wire number, wire length, as well as individual wire widths, wire spacings, and neighboring routing configurations in the adjacent metal layers. Within this process, the parameters are extracted for a wide frequency range which extends up to the bandwidth required by the target application. Furthermore, the extracted set can be obtained either from direct measurements on a test chip, or using an
accurate field simulator and a multi-layered representation of the substrate, metal, and dielectric environment for the target technology. In both cases, measurement or computational errors are likely to affect the parameter values, threatening the stability of the interconnect model. Thus, a passivity enforcement criterion is employed in the modeling flow, requiring the real part of the admittance-parameter matrix to be positive definite [158].
[Figure 5.3 plot: |S12| (linear scale, 0 to 1.5), |Z12| [Ω] (logarithmic scale, 10^1 to 10^6), and |Y12| [Ω^-1] (logarithmic scale, 0.01 to 10) versus frequency from 0 to 250 GHz.]
Fig. 5.3: Magnitude plot of the Z12, Y12, and S12 parameters for a single-wire segment.
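The passivity criterion stated above can be illustrated with a minimal sketch for a two-port. It assumes a single real reference impedance and a reciprocal network, so that the real part of Y is a real symmetric matrix whose positive definiteness can be tested with Sylvester's criterion. The function names are hypothetical and this is not the enforcement procedure of Sec. 5.3.3, only a check of the stated condition.

```python
def s_to_y(s, z0=50.0):
    """Convert a 2x2 S-matrix (one real reference impedance z0) to a Y-matrix."""
    (s11, s12), (s21, s22) = s
    # Closed-form inverse of (I + S) for the 2x2 case
    a, b, c, d = 1 + s11, s12, s21, 1 + s22
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    # (I - S)
    m = ((1 - s11, -s12), (-s21, 1 - s22))
    # Y = (1 / z0) * (I - S) * (I + S)^-1
    return tuple(
        tuple(sum(m[i][k] * inv[k][j] for k in range(2)) / z0 for j in range(2))
        for i in range(2)
    )


def real_part_is_positive_definite(y):
    """Sylvester's criterion on the (symmetrized) real part of a 2x2 Y-matrix."""
    g11 = complex(y[0][0]).real
    g22 = complex(y[1][1]).real
    # Assumption: reciprocal network; averaging guards against small asymmetries
    g12 = 0.5 * (complex(y[0][1]).real + complex(y[1][0]).real)
    return g11 > 0 and (g11 * g22 - g12 * g12) > 0


def is_passive(s, z0=50.0):
    """Check the criterion at one frequency point."""
    return real_part_is_positive_definite(s_to_y(s, z0))


if __name__ == "__main__":
    attenuator = ((0.0, 0.5), (0.5, 0.0))   # matched, lossy two-port
    amplifier = ((0.0, 1.2), (1.2, 0.0))    # |S21| > 1, cannot be passive
    print(is_passive(attenuator), is_passive(amplifier))
```

In practice the check would be repeated at every frequency point of the extracted set.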
The initial set of parameters is then used as the basis for an incremental series of extrapolations, directed at the individual wire attributes, such as length, width, spacing, metal layer, and neighboring routing information. The individual extrapolation for each wire attribute is possible since the initial extraction of the base parameters is designed to minimize the correlation between the attributes. Furthermore, the inclusion of common design practices, such as orthogonal routing in neighboring metal layers and shielding of bus segments with VDD and ground (GND), as well as the layout design rules for a specified process, limit the complexity of the initial extraction procedure to a polynomial $O(N^2)$ for a maximum of $N$ minimum-width wires between the power grid shielding lines.
Several n-port parameter sets are available for representing the frequency behavior of
interconnect segments, including impedance (Z), admittance (Y), and scattering (S) pa-
rameters. In Fig. 5.3 the magnitudes of a Z, Y, and S parameter for a 100-µm wire in a
90 nm technology (metal 5) have been plotted. The characteristics show that the Z and
Y parameters vary across several orders of magnitude within the considered frequency
range, whereas the S parameter value remains between 0.8 and 1. If we consider the ex-
trapolation of parameters across the entire frequency range, then the amount of variation
of each parameter within this range will directly affect the extrapolation performance. As a result of this observation, the S-parameter representation has been chosen for the modeling methodology, since it exhibits the least amount of variation across the frequencies of interest.
[Figure 5.4 drawing: layer stack from the P-substrate, N-well, and poly, through BPSG, the copper layers M1 to M7 in FSG dielectrics with vias (CNT, V1 to V3, VIA4 to VIA7) separated by nitride and SOG layers, up to the aluminum pad layer and passivation; vertical scale from 1 to 9 in units of 10^3 nm.]
Fig. 5.4: Cross-section through the structural model of the CMOS process.
5.2 Parameter Extraction Framework
The incremental extrapolation method relies on an initial set of S-parameters, which can be obtained either from direct VNA measurements or with a field solver. An industry-standard 3D full-wave field simulator based on the finite element method [15] has been used in this work to extract the base parameters from an interconnect structural model representing the target technology. Fig. 5.4 depicts a cross-section through the simulated 7-metal-layer (4-2-1) structure for the 90-nm, 1.0-V digital CMOS process employed within this work. The structure includes a total of 7 copper metal layers within a fluorosilicate glass (FSG) dielectric, separated by spin-on-glass (SOG) etch stop layers, silicon nitride dielectric diffusion barriers, and borophosphosilicate glass (BPSG), and finally bounded by a p-type doped silicon substrate at the bottom and an aluminum-pad grid on the top. This stacked material structure effectively models the complex dielectric environment and the inter-wire couplings present in tightly integrated CMOS digital circuits.
For each metal layer, all the wire structures required for the subsequent extrapolations have been simulated across a frequency range from DC up to the maximum significant frequency. The maximum frequency for employing the model in SPICE simulations has been selected as:
\[ F_{max} = F_{knee} \cdot N_{steps} \approx \frac{0.5}{t_{rise,min}} \cdot 5 \tag{5.1} \]
where $F_{knee}$ denotes the knee frequency and $N_{steps}$ is the number of required time steps per rise time ($t_{rise}$) during a SPICE transient simulation. A number of 5 steps together with a rounded upper bound for the knee frequency have been chosen for increased frequency validity. As a consequence, for an arbitrary minimum rise time of e.g. 10 ps, a maximum frequency of 250 GHz is obtained.
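As a quick check of Eq. (5.1), the following minimal sketch evaluates the maximum model frequency for a given minimum rise time. The function name and the use of SI units (seconds, hertz) are assumptions made for illustration.

```python
def max_model_frequency(t_rise_min_s, n_steps=5):
    """Eq. (5.1): F_max = F_knee * N_steps, with F_knee ~ 0.5 / t_rise,min."""
    if t_rise_min_s <= 0:
        raise ValueError("rise time must be positive")
    f_knee = 0.5 / t_rise_min_s  # rounded upper bound for the knee frequency
    return f_knee * n_steps


if __name__ == "__main__":
    # Minimum rise time of 10 ps, as in the example from the text
    print(max_model_frequency(10e-12) / 1e9, "GHz")
```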
The extracted S-matrices usually contain generalized parameters, which are normalized to the impedances of each port. Since the port impedance depends on the attached load or driver, it is more practical to have all the parameters normalized to a single known impedance value. For convenience, the results have been normalized to the standard reference impedance of 50 Ω. In many cases, the solver considers only the dominant mode. If higher-order modes are present in the structure, they should also be included. In such a case, a multi-mode analysis can be performed and the propagation constant γ = α + jβ can be inspected for each mode. Nevertheless, each additional mode at a port adds an additional set of S-parameters. For the structures considered here, however, the results show that a multi-mode analysis is not necessary, and the coupled lines can be accurately modeled with one mode per terminal.
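To illustrate what the renormalization to 50 Ω means, the following minimal sketch treats the simplest case of a one-port reflection coefficient with real reference impedances. It is only an illustration of the principle: the multi-port case handled by the field solver uses the corresponding matrix form, and the function name is hypothetical.

```python
def renormalize_reflection(gamma, z_old, z_new=50.0):
    """Re-reference a one-port reflection coefficient from z_old to z_new.

    Assumes real, positive reference impedances and gamma != 1
    (an ideal open circuit has no finite load impedance).
    """
    # Recover the load impedance seen at the port
    z_load = z_old * (1 + gamma) / (1 - gamma)
    # Express the same load relative to the new reference impedance
    return (z_load - z_new) / (z_load + z_new)


if __name__ == "__main__":
    # A load matched to 75 ohm, viewed in a 50 ohm system
    print(renormalize_reflection(0.0, 75.0, 50.0))
```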
5.3 Multistep Extrapolation Method
5.3.1 Extraction of the Base Parameter Set
The target of the developed extrapolation procedure is to compute a requested parameter $S_{p_1p_2}$ from the available set of extracted results, given the following specifications: the requested frequency $f_k$, the requested ports $p_1$ and $p_2$, and the structural details of the interconnect segment, such as the metal layer $M_k$, the wire length $l$, the set of wire widths $[w_i]$, the set of wire spacings $[s_i]$, and the neighboring routing configurations. In order to extrapolate the requested parameter values, we must have an initial set of extracted S-parameters for every metal layer $M_1, \dots, M_N$, a variable number of wires, variable wire lengths, variable wire widths, variable wire spacings, and variable routings in the neighboring metal layers.
To keep the problem tractable, a set of simplifying assumptions must be considered, which actually reflect common best practices in the design of state-of-the-art high-density digital signal processors. First, the number of parallel-running wires is limited to $n_{max}$ by introducing a power grid consisting of VDD and GND shielding lines, in order to ensure a controlled low-impedance current return path and to limit the inductive-coupling effects [134, 91]. For this purpose, a maximum of six minimum-width signal wires is assumed between every two shielding lines. Second, it is assumed that the routing in neighboring metal layers occurs only in orthogonal directions, to further minimize the inductive coupling, as shown in Fig. 5.5(a).
[Figure 5.5 drawing: (a) metal layers M_i-1, M_i, and M_i+1 routed in alternating orthogonal directions; (b) layer M_i with NDF_i+1 = 0 and with NDF_i+1 ≈ 50%.]
Fig. 5.5: (a) Orthogonal routing directions in adjacent metal layers. (b) NDF values of 0 and 50%, respectively.
[Figure 5.6 drawing: n wires of widths w_1 ... w_n and spacings s_1 ... s_n on metal M_k between GND and VDD shielding lines of width 3 w_min each; total width (6 + n_max) w_min + (n_max + 1) s_min; neighboring layers M_k-1 and M_k+1 with NDF_k-1 and NDF_k+1; the wire length runs perpendicular to the cross-section.]
Fig. 5.6: Structural model of an n-wire interconnect segment.
[Figure 5.7 drawing: n-wire segment with ports 1 ... n on one side and n+1 ... 2n on the other, described by an S-parameter block in MA (Touchstone) format from DC to 5 F_knee, with Z_ref = 50 Ω.]
Fig. 5.7: Associated n-port model for an n-wire segment.
The influence of routed wires in the neighboring layers is considered by introducing a neighboring density factor (NDF). The existence of routed wires in the adjacent metal layers is modeled by a density factor between 100% (i.e. a metal plane or a very thick wire which covers 100% of the considered segment) and 0% (i.e. no routing in the neighboring layer), as illustrated in Fig. 5.5(b) for e.g. 0 and 50%. Given this definition, an NDF value is considered for each side of the metal layer, except for the lowest and the highest metal layers, which have only one neighbor. For instance, a segment placed on metal layer $M_k$ has an NDF corresponding to the neighbors in $M_{k+1}$ given by:
\[ NDF_{k+1} = \frac{1}{l_k} \sum_{j=1}^{n_{k+1}} w_j \tag{5.2} \]
where $n_{k+1}$ is the number of wires, including shielding lines, which cross the segment, $w_j$ is the width of each wire, and $l_k$ is the length of the segment under consideration.
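Eq. (5.2) can be evaluated directly, as in the following minimal sketch. The function name is hypothetical, and the widths and the segment length are assumed to be given in the same length unit.

```python
def neighboring_density_factor(crossing_widths, segment_length):
    """Eq. (5.2): sum of the widths of all wires (including shielding lines)
    crossing the segment in the adjacent layer, divided by the segment length."""
    if segment_length <= 0:
        raise ValueError("segment length must be positive")
    return sum(crossing_widths) / segment_length


if __name__ == "__main__":
    # Three orthogonal wires (10, 10, and 30 um wide) crossing a 100 um segment
    print(neighboring_density_factor([10.0, 10.0, 30.0], 100.0))
    # No routing in the neighboring layer
    print(neighboring_density_factor([], 100.0))
```

A value of 1.0 corresponds to the 100% case of a metal plane covering the entire segment.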
The structural model of an n-wire interconnect segment routed on a given metal layer
Mk is depicted in Fig. 5.6. All wires within the segment have the same length l, but indi-
vidual widths wi and spacings si. Additionally, interconnects with distinct wire lengths
can be modeled by concatenating several n-port segments [53]. The associated n-port
model of the segment is shown in Fig. 5.7.
A complete base of initial parameters for the subsequent extrapolation must cover all metal layers and all numbers of wires in a segment, from 1 to $n_{max}$. As stated before, the number of parallel-running wires is restricted by the presence of a shielding grid, thus the distance between two subsequent GND and VDD lines allows for the routing of at most $n_{max}$ minimum-width, minimum-spacing wires. Furthermore, the minimum wire width $w_{min}$ and wire spacing $s_{min}$ are dictated by the layout design rules for each metal layer.
Since the model applies an incremental sequence of extrapolations for each individual wire characteristic, the initial extracted set must be chosen in such a way as to minimize the correlation between wire attributes. More specifically, we can describe the sequence of extrapolations for a requested $S_{p_1p_2}$ as:
\[ \hat{S}_{p_1p_2} = \sum_{a_i \in \{a_1, a_2, \dots\}} \mathrm{extrap}\left( \hat{S}_{p_1p_2,i}, F_{s_i} \right) \tag{5.3} \]
where $\hat{S}_{p_1p_2}$ is the extrapolated parameter value, $a_1, a_2, \dots$ are the individual wire attributes, $\hat{S}_{p_1p_2,i}$ is the extrapolated contribution of wire attribute $a_i$ to the final parameter value, and $F_{s_i}$ is the sweep function for wire attribute $a_i$. In order to apply the extrapolations individually on each attribute and sum the contributions, the sweep functions $F_{s_i}$ must be orthogonal, i.e. they must not introduce correlations between the attributes during the sweeps.
[Figure 5.8 drawing: attribute space with length, width, and spacing axes, showing a length sweep and a spacing sweep as lines parallel to the respective axes.]
Fig. 5.8: Orthogonal sweeps of the wire attributes, illustrated here for length and spacing (NDF axis not shown).
Orthogonality between the attribute sweeps can be achieved by varying only one attribute at a time, while keeping the other attributes constant. In the attribute space, such sweeps correspond to orthogonal lines, parallel to each of the attribute axes, as exemplified in Fig. 5.8 for wire length and spacing sweeps. An additional orthogonal NDF axis cannot be displayed in Fig. 5.8; however, it only adds a fourth dimension to the attribute space.
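The one-attribute-at-a-time idea can be sketched as follows. The attribute names, neutral values, and sweep ranges are hypothetical placeholders chosen for illustration, not values from the target process.

```python
def one_at_a_time_sweeps(neutral, ranges):
    """Generate sweep points that vary a single attribute per point.

    neutral: dict with the neutral value of every attribute
    ranges:  dict mapping an attribute name to the values to sweep
    """
    points = []
    for name, values in ranges.items():
        for value in values:
            point = dict(neutral)   # all other attributes stay neutral
            point[name] = value     # only this attribute moves
            points.append(point)
    return points


if __name__ == "__main__":
    # Hypothetical values in micrometers
    neutral = {"length": 500.0, "width": 0.14, "spacing": 0.14}
    ranges = {"length": [100.0, 500.0, 1000.0], "spacing": [0.14, 0.28]}
    for p in one_at_a_time_sweeps(neutral, ranges):
        print(p)
```

Each generated point lies on a line through the neutral point that is parallel to one attribute axis, which is the geometric picture described above.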
It should be noted that the individual wire attributes are not completely independent of each other. For instance, the assumption of having a fixed power grid introduces a relatively strong dependence between wire width and spacing. Specifically, the width of a wire cannot be changed without also affecting the spacing to its neighbors. This generates a residual correlation between the attribute sweeps which must be taken into account and corrected afterwards. A controlled weighting of the incremental attribute correction steps is performed during the subsequent extrapolations, which are described in Sec. 5.3.2.
EXTRACTINITIALSET()
  for each metal layer M_k do
    for n ← 1 to n_max do
      /* Length Sweep */
      NDF_{k−1} ← NDF_{k+1} ← 0
      for i ← 1 to n do
        w_i ← w_min
        s_i ← ((n_max − n) / (n + 1)) · w_min + ((n_max + 1) / (n + 1)) · s_min
      for l ← l_min(M_k) to l_max(M_k) do
        EXTRACT-S-PARAMETERS()

      /* Width Sweep */
      l ← l_mean(M_k)
      for i ← 1 to n do
        for w_sweep ← w_min to (n_max − n)(w_min + s_min) + w_min do
          w_i ← w_sweep
          s_i ← ((n_max − n + 1)(w_min + s_min) + s_min − w_sweep) / 2
          for j ← 1 to n, j ≠ i do
            w_j ← w_min
            s_j ← s_min
          if i < n then
            s_{i+1} ← ((n_max − n + 1)(w_min + s_min) + s_min − w_sweep) / 2
          EXTRACT-S-PARAMETERS()

      /* Spacing Sweep */
      for i ← 1 to n do
        w_i ← w_min
      for i ← n downto 1 do
        for s_sweep ← s_min to (n_max − n)(w_min + s_min) + s_min do
          s_i ← s_sweep
          for j ← 1 to n, j ≠ i do
            s_j ← s_min
          if i < n then
            s_{i+1} ← (n_max − n)(w_min + s_min) + 2 s_min − s_sweep
          EXTRACT-S-PARAMETERS()

      /* NDF Sweep */
      for i ← 1 to n do
        w_i ← w_min
        s_i ← ((n_max − n) / (n + 1)) · w_min + ((n_max + 1) / (n + 1)) · s_min
      for NDF_{k−1} ← NDF_min to NDF_max do
        for NDF_{k+1} ← NDF_min to NDF_max do
          EXTRACT-S-PARAMETERS()
Listing 5.1: Extraction of the base parameter set.
Given these observations, the orthogonality of the parameter sweeps is maximized by sweeping each individual wire attribute while keeping the other attributes at a neutral value (i.e. the minimum, or the average value, depending on the attribute). The algorithm applied for the extraction of the base parameter set is given in Listing 5.1. First, the length of the segment is varied across the relevant domain for metal layer $M_k$, i.e. from $l_{min}(M_k)$ to $l_{max}(M_k)$, with all the wires set to the minimum width and equally spaced between the bounding power grid. From this first set of simulations we collect parameter sets which reflect only changes in wire length, while the influence of wire width and spacing is minimized. Next, the length is kept constant at an average value for the given
metal layer ($l_{mean}(M_k)$) and the width of each wire is varied from $w_{min}$ up to the maximum allowed by the minimum spacing to its neighbors, with all the other wires kept at the minimum width and spacing. While doing this, the varying wire is placed exactly in the middle of the distance between its two direct neighbors. This approach minimizes the influence of wire length and spacing on the results obtained from the variable-width sweeps. An illustration of the variable-width sweep procedure is shown in Fig. 5.9(a). After that, a variable-spacing sweep is performed sequentially for every wire, as illustrated in Fig. 5.9(b), with all the wires kept at minimum width. Again, the influence of wire length and width on the results is minimized. During all the previous sweeps, the NDF of both the upper and lower metal layers is set to zero, to avoid any influence of neighboring routed wires on these first results. Finally, the NDF sweeps add the information related to the presence of wires routed in the neighboring layers. During the sweeps, the density factors are varied from a minimum value $NDF_{min}$, which corresponds either to 0 (i.e. no routed wire), or to the value computed from the presence of only the GND and VDD lines, depending on the position of the metal layer. The maximum value $NDF_{max}$ corresponds to the maximum routing density present in the adjacent layer, including the power lines and maximum-width thick wires covering the entire length of the segment. An illustration of such a maximum NDF case is shown in Fig. 5.10, where a single wire extends to the maximum width allowed by the fixed power grid. The EXTRACT-S-PARAMETERS() call designates the extraction of the S-parameter matrices for the given segment in a frequency sweep from 0 (DC) up to $F_{max}$ with a step dictated by the application requirements.
[Figure 5.9 drawing: (a) width sweep of one wire between the GND and VDD lines, with all other wires at w_min and s_min; (b) spacing sweep of one wire, with all wires at w_min; both sweeps are repeated for all wires.]
Fig. 5.9: Variable-width (a) and variable-spacing (b) sweeps during the initial parameter extraction.
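The spacing bookkeeping used by the length, width, and spacing sweeps of Listing 5.1 can be checked with a minimal sketch. It assumes that n wires between the two shielding lines are separated by n + 1 gaps and that the available span is n_max · w_min + (n_max + 1) · s_min, as suggested by Fig. 5.6. The function names and the numeric design-rule values are hypothetical.

```python
import math


def grid_span(n_max, w_min, s_min):
    """Available distance between two shielding lines (assumed from Fig. 5.6)."""
    return n_max * w_min + (n_max + 1) * s_min


def equal_spacing(n, n_max, w_min, s_min):
    """Length and NDF sweeps: n minimum-width wires, equally spaced."""
    return ((n_max - n) * w_min + (n_max + 1) * s_min) / (n + 1)


def width_sweep_side_spacing(n, n_max, w_min, s_min, w_sweep):
    """Width sweep: gap on each side of the swept wire, centered between neighbors."""
    return ((n_max - n + 1) * (w_min + s_min) + s_min - w_sweep) / 2


def spacing_sweep_complement(n, n_max, w_min, s_min, s_sweep):
    """Spacing sweep: gap on the other side of the wire whose spacing is swept."""
    return (n_max - n) * (w_min + s_min) + 2 * s_min - s_sweep


if __name__ == "__main__":
    n, n_max, w_min, s_min = 3, 6, 0.14, 0.14   # hypothetical values in um
    span = grid_span(n_max, w_min, s_min)

    # Length sweep: n wires and n + 1 equal gaps fill the span
    s_eq = equal_spacing(n, n_max, w_min, s_min)
    print(math.isclose(n * w_min + (n + 1) * s_eq, span))

    # Width sweep: one wide wire, two adapted gaps, the remaining gaps at s_min
    w_sweep = 0.30
    side = width_sweep_side_spacing(n, n_max, w_min, s_min, w_sweep)
    total = (n - 1) * w_min + w_sweep + (n - 1) * s_min + 2 * side
    print(math.isclose(total, span))

    # Spacing sweep: the swept gap and its complement share the free space
    s_sweep = 0.25
    comp = spacing_sweep_complement(n, n_max, w_min, s_min, s_sweep)
    total = n * w_min + (n - 1) * s_min + s_sweep + comp
    print(math.isclose(total, span))
```

Under these assumptions, every sweep point keeps the total occupied width equal to the fixed grid span, which is why changing a width or a spacing necessarily changes a neighboring spacing.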
5.3.2 Incremental Extrapolation
The input for the extrapolation procedure consists of the following:
• The request for computing a parameter $S_{p_1p_2}$ for a multi-wire interconnect segment, described by its length, metal layer, individual wire widths and spacings, as well as the NDF information for the adjacent metal layers;
• An initial set of extracted base parameters for the target CMOS process, built as described in Sec. 5.3.1.
The result represents the extrapolated $S_{p_1p_2}$ parameter for the specified segment, at all frequencies from DC to $F_{max}$ (with the same step as the input data), in the standard magnitude-angle (MA) format.
[Figure 5.10 drawing: upper metal layer with GND and VDD power lines and a single signal wire of maximum width n_max w_min + (n_max − 1) s_min, separated from the power lines by s_min on each side.]
Fig. 5.10: Maximum NDF in the upper metal layer, with power grid and maximum-width signal line.
First, the initial set of extracted parameters is parsed in a search for a base parameter
for the extrapolation. This base parameter must be the closest-matching value for the
requested parameter, i.e. a parameter describing an interconnect segment with the closest
attributes to the requested one. To do this, a matching rank is first evaluated and the
parameter with the highest matching rank is then selected. Considering a
requested parameter $S_{p_1p_2}$, the factors which contribute to the matching rank and their
respective weights are as follows:
• The wire length, which contributes to the wire resistance, coupling capacitance, and
coupling inductance, thus having a high weight;
• The widths of the primary wires (connected to the ports p1 or p2), which contribute
mainly to the wire resistance and coupling capacitance, having a high weight;
• The spacings of the primary wires, which mainly affect the coupling capacitance,
with a medium weight;
• The widths of the secondary wires (not attached to the requested ports p1 and p2),
which mainly influence the coupling inductance, with a relatively low weight;
• The spacings of the secondary wires, with a relatively low weight;
• The NDF, which affects only the coupling capacitance, hence with a relatively low
weight.
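The factors listed above lend themselves to a simple weighted-sum score. The following is a minimal sketch only: the numeric weights (3 for high, 2 for medium, 1 for low), the dictionary layout, and the function names are illustrative assumptions, since the text ranks the weights only qualitatively.

```python
# Hypothetical numeric weights; the text only ranks them as high/medium/low.
WEIGHTS = {
    "length": 3,
    "primary_widths": 3,
    "primary_spacings": 2,
    "secondary_widths": 1,
    "secondary_spacings": 1,
    "ndf": 1,
}

def matching_rank(candidate, requested, weights=WEIGHTS):
    """Sum the weights of all attributes on which candidate and request agree."""
    return sum(
        w for key, w in weights.items()
        if key in candidate and key in requested and candidate[key] == requested[key]
    )

def select_base(extracted_set, requested):
    """Return the extracted entry with the highest rank, or None if nothing matches."""
    best = max(extracted_set, key=lambda c: matching_rank(c, requested), default=None)
    if best is None or matching_rank(best, requested) == 0:
        return None
    return best
```

A return value of None corresponds to the case in which no structural attribute can be matched, so that the base parameter has to be extrapolated from one of the sweep sets.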
The matching rank of an extracted parameter is computed as the sum of the individ-
ual weights for the structural details that match with the requested segment. If a closest-
matching parameter is found, then it is used further as the base parameter for the ex-
trapolation. If no structural attributes can be matched with any of the already-extracted
results, then the base parameter must be extrapolated from e.g. the variable-length set.
Concretely, if the wire length for the requested parameter $S_{p_1p_2}$ is $l_r$, then the extrapolated
base parameter is computed as:
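$$
\begin{aligned}
M^{p_1p_2}_{b} &= \mathrm{extrap}\left([l_i],\ \left[M^{p_1p_2}_{l_i}\right],\ l_r,\ \text{'method'}\right)\\
A^{p_1p_2}_{b} &= \mathrm{extrap}\left([l_i],\ \left[A^{p_1p_2}_{l_i}\right],\ l_r,\ \text{'method'}\right)
\end{aligned}
\tag{5.4}
$$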
where $M^{p_1p_2}_{b}$ and $A^{p_1p_2}_{b}$ are the magnitude and angle, respectively, of the base
parameter for the extrapolation of $S_{p_1p_2}$, $[l_i]$ represents the set of wire lengths available
in the initial extracted set, while $M^{p_1p_2}_{l_i}$ and $A^{p_1p_2}_{l_i}$ are the magnitude and angle,
respectively, of $S_{p_1p_2}$ for the segment with wire length $l_i$ from the initial extracted set.
The keyword 'method' designates the desired extrapolation function, which can be based on
linear interpolation, piecewise cubic Hermite polynomials, cubic interpolation, or cubic
spline interpolation with smooth derivatives, to name only a few. The results shown in this
work have been obtained with a cubic spline interpolation method, which proved to offer the
best precision.
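To make the role of the extrap() call more tangible, the sketch below implements only the simplest of the listed options, piecewise-linear interpolation with linear continuation beyond the sampled range. It is a stand-in under stated assumptions, not the cubic spline used for the results in this work; the function signature merely mirrors the argument order of Eq. (5.4), and the sample values in the usage note are invented.

```python
from bisect import bisect_left

def extrap(xs, ys, x, method="linear"):
    """Evaluate the sampled curve (xs, ys) at x, extrapolating beyond the ends.

    Only 'linear' is implemented in this sketch; xs must be strictly increasing.
    """
    if method != "linear":
        raise NotImplementedError("only the 'linear' method is sketched here")
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two samples of equal length")
    # Pick the segment containing x, or the first/last segment for extrapolation.
    k = min(max(bisect_left(xs, x), 1), len(xs) - 1)
    x0, x1, y0, y1 = xs[k - 1], xs[k], ys[k - 1], ys[k]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

For instance, with hypothetical length samples [10, 20, 40] µm and magnitudes [0.99, 0.97, 0.93], a request at 30 µm is interpolated inside the last segment, while a request at 50 µm continues that segment's slope.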
The base parameter represents the very first approximation of the requested $S_{p_1p_2}$
value. Because in most cases the structural attributes of the requested segment do
not coincide with the attributes related to the base parameter, a set of incremental
corrections for each structural element must be further applied, as explained in the following.
Let us first assume that the wire length related to the base parameter is $l_b$. Then, two
parameter values are extrapolated from the length-sweep results, one for $l_b$ and one for the
requested wire length $l_r$:
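$$
\begin{aligned}
M^{p_1p_2}_{l_b} &= \mathrm{extrap}\left([l_i],\ \left[M^{p_1p_2}_{l_i}\right],\ l_b,\ \text{'method'}\right)\\
M^{p_1p_2}_{l_r} &= \mathrm{extrap}\left([l_i],\ \left[M^{p_1p_2}_{l_i}\right],\ l_r,\ \text{'method'}\right)
\end{aligned}
\tag{5.5}
$$
The corresponding angle values $A^{p_1p_2}_{l_b}$ and $A^{p_1p_2}_{l_r}$ are computed in a similar way:
$$
\begin{aligned}
A^{p_1p_2}_{l_b} &= \mathrm{extrap}\left([l_i],\ \left[A^{p_1p_2}_{l_i}\right],\ l_b,\ \text{'method'}\right)\\
A^{p_1p_2}_{l_r} &= \mathrm{extrap}\left([l_i],\ \left[A^{p_1p_2}_{l_i}\right],\ l_r,\ \text{'method'}\right)
\end{aligned}
\tag{5.6}
$$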
Next, two variable-length correction terms $\Delta_l M^{p_1p_2}$ and $\Delta_l A^{p_1p_2}$ are computed
as the following differences:
$$
\begin{aligned}
\Delta_l M^{p_1p_2} &= M^{p_1p_2}_{l_r} - M^{p_1p_2}_{l_b}\\
\Delta_l A^{p_1p_2} &= A^{p_1p_2}_{l_r} - A^{p_1p_2}_{l_b}
\end{aligned}
\tag{5.7}
$$
and the variable-length correction is applied to the base parameter as follows:
$$
\begin{aligned}
M^{p_1p_2}_{b} &= M^{p_1p_2}_{b} + w_c \cdot \Delta_l M^{p_1p_2}\\
A^{p_1p_2}_{b} &= A^{p_1p_2}_{b} + w_c \cdot \Delta_l A^{p_1p_2}
\end{aligned}
\tag{5.8}
$$
where $w_c$ is a correction weighting factor for the parameter inter-correlations and is
therefore data-dependent.
Further, to take into account the influence of wire width and spacing, for each wire in
the segment the following parameter values are extrapolated:
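$$
\begin{aligned}
M^{p_1p_2}_{w_{b,i}} &= \mathrm{extrap}\left([w_{j,i}],\ \left[M^{p_1p_2}_{w_{j,i}}\right],\ w_{b,i},\ \text{'method'}\right)\\
M^{p_1p_2}_{w_{r,i}} &= \mathrm{extrap}\left([w_{j,i}],\ \left[M^{p_1p_2}_{w_{j,i}}\right],\ w_{r,i},\ \text{'method'}\right)
\end{aligned}
\tag{5.9}
$$
$$
\begin{aligned}
M^{p_1p_2}_{s_{b,i}} &= \mathrm{extrap}\left([s_{j,i}],\ \left[M^{p_1p_2}_{s_{j,i}}\right],\ s_{b,i},\ \text{'method'}\right)\\
M^{p_1p_2}_{s_{r,i}} &= \mathrm{extrap}\left([s_{j,i}],\ \left[M^{p_1p_2}_{s_{j,i}}\right],\ s_{r,i},\ \text{'method'}\right)
\end{aligned}
\tag{5.10}
$$
where $w_{b,i}$ and $w_{r,i}$ represent the width of wire $i$ (with $i$ varying from 1 to $n$) for the
base and the requested parameter, respectively, while $s_{b,i}$ and $s_{r,i}$ are the corresponding
spacing values. The sets $[w_{j,i}]$ and $[s_{j,i}]$ contain the width and spacing arrays employed in
the sweeps from Sec. 5.3.1 (see Listing 5.1, lines 14 and 28, respectively). In addition, the angle
components $A^{p_1p_2}_{w_{b,i}}$, $A^{p_1p_2}_{w_{r,i}}$, $A^{p_1p_2}_{s_{b,i}}$, and $A^{p_1p_2}_{s_{r,i}}$ are obtained in a similar way.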
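After computing the correction terms $\Delta_{w_i} M^{p_1p_2}$, $\Delta_{w_i} A^{p_1p_2}$, $\Delta_{s_i} M^{p_1p_2}$, and
$\Delta_{s_i} A^{p_1p_2}$ as the corresponding differences from the previously-extrapolated values, the
variable-width and variable-spacing corrections are applied:
$$
\begin{aligned}
M^{p_1p_2}_{b} &= M^{p_1p_2}_{b} + w_c \sum_{i=1}^{n} \left(\Delta_{w_i} M^{p_1p_2} + \Delta_{s_i} M^{p_1p_2}\right)\\
A^{p_1p_2}_{b} &= A^{p_1p_2}_{b} + w_c \sum_{i=1}^{n} \left(\Delta_{w_i} A^{p_1p_2} + \Delta_{s_i} A^{p_1p_2}\right)
\end{aligned}
\tag{5.11}
$$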
Finally, the variable-NDF correction is computed and applied:
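$$
\begin{aligned}
M^{p_1p_2}_{b} &= M^{p_1p_2}_{b} + w_c \cdot \Delta_{NDF} M^{p_1p_2}\\
A^{p_1p_2}_{b} &= A^{p_1p_2}_{b} + w_c \cdot \Delta_{NDF} A^{p_1p_2}
\end{aligned}
\tag{5.12}
$$
where the correction terms $\Delta_{NDF} M^{p_1p_2}$ and $\Delta_{NDF} A^{p_1p_2}$ represent the differences between
the extrapolated parameters for the base NDF and for the requested NDF values.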
The correction steps applied to the base parameter are incremental and the influences
of various wire attributes on the S-parameters are treated as independent. Although
minimized, a non-zero residual correlation still exists between the individual influences,
especially in the case of wire width variation, which has a significant influence on the
spacing, see e.g. Fig. 5.9(a). Thus, a correction weighting factor $w_c < 1$ is employed, which
accounts for the residual correlation and thereby prevents an overscaling of the final
corrected value. A further correction of the extrapolated parameters is provided in Sec. 5.3.3.
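The chain of corrections in Eqs. (5.7), (5.8), (5.11), and (5.12) amounts to adding weighted differences to the base value. A minimal sketch follows; the helper names are hypothetical, and the value of the weighting factor in the usage note is an assumption, since the text only requires it to be below 1 and data-dependent.

```python
def correction(at_requested, at_base):
    """Correction term: sweep value at the requested attribute minus the value at the base attribute."""
    return at_requested - at_base

def apply_corrections(base, deltas, w_c):
    """Add the w_c-weighted correction terms to a base magnitude or angle value."""
    value = base
    for delta in deltas:
        value += w_c * delta
    return value
```

As an illustration with invented numbers, a base magnitude of 0.95 with length, width, and NDF corrections of -0.02, +0.01, and -0.004 and an assumed weighting factor of 0.8 yields 0.95 + 0.8 · (-0.014) = 0.9388. The same two helpers apply unchanged to the angle component.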
Fig. 5.11: Passivation example for a single-wire interconnect segment (metal 1, l = 10µm,
w = 400 nm, s = 810 nm).
5.3.3 Passivity Enforcement
Both measured and extrapolated S-parameters must exhibit a passive behavior, i.e. the
interconnect model must dissipate active power, as opposed to generating it, at any value
of the input voltage and at any frequency. Here, a passivation enforcement criterion is
employed, based on the correction of the eigenvalues of the admittance matrix [72]. First,
the Y-parameter matrix can be computed from the S-parameter matrix as follows [84]:
$$
\mathbf{Y} = \mathbf{G}_{ref}^{-1} \cdot \mathbf{Z}_{ref}^{-1} \cdot (\mathbf{S} + \mathbf{E})^{-1} \cdot (\mathbf{E} - \mathbf{S}) \cdot \mathbf{G}_{ref}
\tag{5.13}
$$
where $\mathbf{Z}_{ref} = Z_{ref} \cdot \mathbf{E}$ is the reference impedance matrix, $\mathbf{G}_{ref} = \frac{1}{\sqrt{|Z_{ref}|}} \cdot \mathbf{E}$ is the
reference conductance matrix, and $\mathbf{E}$ is the identity matrix. The passivity criterion requires
the real part of the $\mathbf{Y}$ matrix to be positive definite [72], i.e. the eigenvalues of
$\mathrm{Re}\{\mathbf{Y}\}$ to be all positive. This relatively simple technique ensures both the passivity
and the stability of the model. A more detailed discussion on passivity and stability
conditions can be found in [158].
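After setting the negative eigenvalues of $\mathrm{Re}\{\mathbf{Y}\}$ to zero as in [78], the real part is
recomputed as:
$$
\mathrm{Re}\{\mathbf{Y}\} = \mathbf{V} \cdot \mathbf{D}_{corr} \cdot \mathbf{V}^{-1}
\tag{5.14}
$$
where $\mathbf{V}$ contains the eigenvectors of $\mathrm{Re}\{\mathbf{Y}\}$ and $\mathbf{D}_{corr}$ is a diagonal matrix with the
corrected eigenvalues. The S-parameter matrix is recomposed from the corrected admittance
matrix as:
$$
\mathbf{S} = \mathbf{G}_{ref} \cdot (\mathbf{E} - \mathbf{Z}_{ref} \cdot \mathbf{Y}) \cdot (\mathbf{E} + \mathbf{Z}_{ref} \cdot \mathbf{Y})^{-1} \cdot \mathbf{G}_{ref}^{-1}
\tag{5.15}
$$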
Fig. 5.12: RMS error between extrapolated and extracted results for the entire range of tested
interconnect segments.
Fig. 5.11 shows two extracted S-parameters before and after the passivation, for a
10-µm single-wire interconnect segment placed on the M1 layer. It can be observed that the
passivity correction becomes more substantial as the frequency increases, which shows
that measurement and numerical computation errors increase with frequency.
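The eigenvalue correction of Eq. (5.14) can be illustrated for the smallest non-trivial case. The sketch below assumes a two-port, reciprocal segment, so that the real part of Y at one frequency point is a symmetric 2×2 matrix with a closed-form eigendecomposition; the function name and this restriction are illustrative assumptions, and a general n-port version would rely on a numerical eigensolver instead.

```python
import math

def enforce_passivity_2x2(re_y):
    """Clip negative eigenvalues of a symmetric 2x2 matrix Re{Y} and rebuild it.

    re_y is [[a, b], [b, d]] at one frequency point.
    """
    a, b, d = re_y[0][0], re_y[0][1], re_y[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam1, lam2 = mean + radius, mean - radius   # eigenvalues, lam1 >= lam2
    theta = 0.5 * math.atan2(2.0 * b, a - d)    # rotation angle of the eigenvectors
    c, s = math.cos(theta), math.sin(theta)     # V = [[c, -s], [s, c]]
    l1, l2 = max(lam1, 0.0), max(lam2, 0.0)     # D_corr: negative eigenvalues set to zero
    off = (l1 - l2) * c * s
    # V * D_corr * V^-1, with V^-1 = V^T because V is orthogonal here
    return [[l1 * c * c + l2 * s * s, off],
            [off, l1 * s * s + l2 * c * c]]
```

A matrix whose eigenvalues are already non-negative is returned unchanged up to round-off, so the correction only affects frequency points at which passivity is violated.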
5.4 Experimental Validation
In order to assess the overall precision of the extrapolation method, a wide range of in-
terconnect segments in the 90-nm technology have been tested, with up to six wires per
segment and varying from metal 1 up to metal 7. In every case, the evaluation has been
performed on a “difficult”, non-standard segment configuration, with each wire having
an individual width and spacing, randomly assigned with a uniform distribution be-
tween the minimum and maximum values allowed by the design rules.
The RMS error (RMSE) between the extrapolated parameters and the parameters obtained
with the field simulator has been computed from all the $(2n)^2$ S-parameters of each
segment as:
$$
\mathrm{RMSE} = \sqrt{\frac{1}{N_f \cdot 4n^2} \sum_{i,j=1}^{2n} \sum_{f_k=1}^{N_f} \left(\hat{S}^{f_k}_{ij} - S^{f_k}_{ij}\right)^2}
\tag{5.16}
$$
where $\hat{S}_{ij}$ and $S_{ij}$ represent the extrapolated and the extracted parameter values, respectively,
and $f_k$ is the frequency index, indicating the steps from DC up to the maximum frequency
of interest. The results for all the investigated configurations are summarized in Fig. 5.12,
where the angle values have been normalized to 360°. The maximum absolute errors
were $1.8 \cdot 10^{-2}$ in magnitude and 6.8 degrees in angle. The main causes for the exhibited
deviations are given by:
• The residual correlations between the wire attributes, especially width and spacing;
• The non-optimal passivity correction [158,72];
• The precision of the extrapolation method, which is limited by the number of samples
available in the initial set.
Fig. 5.13: Magnitude of extrapolated and extracted parameters for a single-wire interconnect
segment.
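The error metric of Eq. (5.16) can be sketched in a few lines. The nested-list layout indexed by port pair and frequency point is an assumption made for illustration, as is the treatment of each entry as a real number (a magnitude, or an angle normalized to 360°).

```python
import math

def rmse(extrapolated, extracted):
    """RMSE over all S-parameters and frequency points, following Eq. (5.16).

    Both arguments are nested lists indexed [i][j][k]: port i, port j, frequency k.
    """
    total, count = 0.0, 0
    for row_hat, row in zip(extrapolated, extracted):
        for series_hat, series in zip(row_hat, row):
            for s_hat, s in zip(series_hat, series):
                total += (s_hat - s) ** 2
                count += 1
    # count equals N_f * 4 n^2 for a complete 2n-port data set
    return math.sqrt(total / count)
```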
A more detailed example is shown in Fig. 5.13 and 5.14 for a single-wire M1 segment
with l = 10 µm, w = 580 nm, s = 420 nm, and NDF = 35%. The plots show the values for
S11 and S12, while the other two parameters, S22 and S21, are virtually identical with S11,
respectively S12 due to the inherent symmetry of the wire. From Fig. 5.13, one can see
that |S12|, which reflects the power wave transmission from port 2 to port 1, reaches a
maximum of 1 at DC and starts to drop relatively fast as the frequency increases into
the multi-GHz range. This behavior shows the rate of losses in signal power with the
frequency increase, for a direct transmission across the line from port 2 to port 1, and
points out the expected signal integrity issues which influence the interconnect at high
frequencies. The level of signal reflections at port 1 is shown by the plot of |S11|, which in-
dicates that reflections are essentially zero at DC, but increase with the frequency. Again,
the expected signal reflections within the wire are here illustrated and quantified. The
extrapolated model is overall in a good agreement with the directly-extracted parameter
data.
Fig. 5.14: Angle values for the extrapolated and extracted parameters of a single-wire segment.
A further example is depicted in Fig. 5.15 for a three-wire segment in metal 4, with
the attributes presented in Tab. 5.1. Only six parameters have been selected for the plot
from the complete set of 36, in order to maintain a reasonable amount of visible detail.
S14, S25, and S36 represent the direct signal transfer along the three wires, and show
substantial losses at the maximum frequency of interest. S13, S24, and S35 reflect the crosstalk
between wires 3 and 1, 1 and 2, respectively 2 and 3. Here it can be noticed that the
crosstalk also increases significantly with the frequency. Thus, we can clearly observe
that a wide-frequency interconnect model is extremely important to quantify the amount
of performance losses at very high switching speeds. Beyond these observations, a very
good agreement between the extrapolated model and the directly-extracted parameters
can be noticed as well. In order to obtain a more detailed quantitative evaluation of the
modeling performance, the RMS error has been computed for every parameter across the
investigated frequency range. The results are displayed in Fig. 5.16 where the parameters
which are identical due to the wire symmetry have been omitted, i.e. S44 with S11, S41
with S14, S55 with S22, S52 with S25 etc. The measured errors are in line with the previous
evaluations from Fig. 5.12.
Wire   Length   Width    Spacing   NDF3   NDF5
1      2.5 mm   140 nm   200 nm    70%    25%
2      2.5 mm   300 nm   290 nm    70%    25%
3      2.5 mm   300 nm   295 nm    70%    25%
Tab. 5.1: Wire attributes for a three-wire M4 segment.
Next, the extrapolated S-parameter models have been tested within transient circuit-
level simulations. For this purpose, a SPICE-level simulator which supports the direct
modeling of n-port elements using S, Y, or Z-parameter descriptions [30] has been used.
Fig. 5.15: Magnitude plot of six S-parameters for a three-wire M4 segment.
Fig. 5.16: RMS errors between extrapolated and directly-extracted parameters (three-wire
M4 segment).
Within the modeling framework, the extrapolated S-parameters are saved as standard
Touchstone files which are directly supported by the simulator. The circuit configuration
employed for the tests is shown in Fig. 5.17, where each wire of the interconnect model is
driven and terminated independently.
The signal delay has been measured across each wire with both quiet and switching
neighboring lines and the results obtained using the extrapolated S-parameters and
parameters extracted with the field solver were compared. A detailed view of the results
is shown in Fig. 5.18 for three-wire interconnect segments of various lengths placed on
M1, M5, and M7, in each case with one neighbor switching at the same time in the opposite
direction. To better quantify the differences, the RMS and relative errors for every series
of simulations have been measured. First, the RMS errors on each metal layer were
measured for each number of wires per segment and for each sweep of the wire attributes. A
detailed plot of the RMS errors in the case of metal 5 is depicted in Fig. 5.19. Each RMSE
value is computed across the attribute sweep range, across the investigated frequency
range, and across all S-parameters.
Fig. 5.17: Circuit employed for the transient simulations.
Fig. 5.18: Signal propagation delays from three-wire interconnect segments placed on three metal
layers.
Fig. 5.19: Delay RMSE for the transient simulations of interconnect segments on metal 5.
Metal layer   Number of wires per segment
              1       2       3       4        5        6
1             4.28%   7.37%   7.77%   10.01%   10.47%   11.88%
2             5.04%   6.69%   8.19%   10.39%   10.24%   11.46%
3             5.46%   5.56%   6.94%   8.28%    9.51%    11.50%
4             6.93%   5.34%   7.78%   8.83%    9.94%    11.74%
5             6.79%   7.32%   6.21%   9.32%    10.39%   10.51%
6             6.37%   5.78%   7.41%   8.60%    11.27%   12.21%
7             5.56%   7.18%   8.38%   8.97%    9.29%    11.14%
Tab. 5.2: Maximum relative delay error across all considered metal layers and wires per segment.
In addition to the RMSE values, the maximum relative delay error has been evaluated
across all metal layers and numbers of wires per segment. For each metal layer and for
each wire number we have varied individually the segment length, wire widths, wire
spacings, and the NDF, and the maximum error was evaluated as:
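$$
\varepsilon^{max}_{r} = \max\left\{
\max_{l_i}\left\{\left|\frac{\hat{\delta}_{l_i} - \delta_{l_i}}{\delta_{l_i}}\right|\right\},\
\max_{w_{j,i}}\left\{\left|\frac{\hat{\delta}_{w_{j,i}} - \delta_{w_{j,i}}}{\delta_{w_{j,i}}}\right|\right\},\
\max_{s_{j,i}}\left\{\left|\frac{\hat{\delta}_{s_{j,i}} - \delta_{s_{j,i}}}{\delta_{s_{j,i}}}\right|\right\},\
\max_{NDF_i}\left\{\left|\frac{\hat{\delta}_{NDF_i} - \delta_{NDF_i}}{\delta_{NDF_i}}\right|\right\}
\right\}
\tag{5.17}
$$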
where $\hat{\delta}$ is the delay measured with the extrapolated model and $\delta$ is the delay obtained
using the directly-extracted parameters. Note that the attribute variations
during these tests have been selected in such a way that they do not include the same
values found in the initial set, for a better evaluation of the extrapolation performance.
Wire length has been varied between 1 and 500 µm for the local layers (M1 to M4),
between 100 µm and 5 mm for intermediate layers (M5 and M6) and from 3 mm to 5 cm for
the global layer (M7). Wire width and spacing have been varied from the minimum
design rule for each metal layer up to the maximum allowed by the shielding grid and the
number of wires between two successive power lines (see Listing 5.1 lines 14 and 28).
Additionally, the minimum NDF was always zero, while the maximum NDF varied between
85% and 100% depending on the wire length (see Fig. 5.10).
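Eq. (5.17) takes the largest relative delay deviation over all four attribute sweeps. A minimal sketch, assuming each sweep is stored as a list of (extrapolated delay, reference delay) pairs under a hypothetical sweep name:

```python
def max_relative_error(sweeps):
    """Largest |(d_hat - d) / d| over all sweeps.

    sweeps maps a sweep name (e.g. 'length', 'width', 'spacing', 'ndf') to a list
    of (delay_with_extrapolated_model, delay_with_extracted_parameters) pairs.
    """
    return max(
        abs(d_hat - d) / abs(d)
        for pairs in sweeps.values()
        for d_hat, d in pairs
    )
```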
The evaluated maximum relative errors are shown in Tab. 5.2. The maximum error
generally increases with the number of wires per segment, since the total number of
S-parameters increases quadratically with the number of wires. As can be seen, the
developed method achieved during the tests a maximum delay error of 12.21% for six-wire
segments placed on the 6th metal layer.
5.5 Summary
Technology-accurate wide-bandwidth interconnect models are needed for the precise
estimation of signal delays, crosstalk, and energy losses across the complex on-chip
communication structures. Traditional transmission-line distributed models offer a good
accuracy at the expense of limited frequency validity and complex mutual inductance
extractions; they therefore always imply a tradeoff between precision and computational
efficiency. On the other hand, full-wave interconnect analyses provide a high accuracy
at all frequencies, but require extensive numerical computations which cannot be
performed in real time. In addition, the number of possible wire configurations across
various lengths, widths, spacings, and metal layers exponentially increases the complexity of
the modeling problem.
This chapter has introduced a computationally-efficient wide-bandwidth characteri-
zation method for arbitrary interconnect structures, which is based on the incremental
extrapolation of S-parameters. The method defines a set of a priori parameter extractions,
designed to reflect the particularities of a given manufacturing process. This initial set of
parameters can be extracted with high precision within an independent time frame prior
to the application, and represents a database for the subsequent computations. Further, it
has been shown how the initial set can be extracted by means of a full-wave field simula-
tor and a structural model reflecting the technological process. It has also been shown that
the complexity of covering the large set of possible wire attributes can be substantially re-
duced by minimizing the correlations between the segments employed in the initial set.
A further measure to limit the complexity is to consider a fixed power grid and orthogo-
nal routing directions. Moreover, the presence of wire segments in the neighboring metal
layers has been modeled by introducing a density factor which indicates the amount of
coupling capacitance between the metal layers. Next, an incremental extrapolation
procedure is performed in real time for every parameter request, which includes a search for
a best-matching base parameter and a suite of extrapolated corrections applied for ev-
ery wire attribute. A passivation enforcement criterion has been also described, which
ensures that the obtained model is stable and exhibits a passive behavior.
The model has been tested across all metal layers and up to six wires per segment and
the results have been compared with an industry-standard field simulator. The results
show a good agreement with the directly-extracted parameters and lie within $2 \cdot 10^{-2}$ and
7 degrees for magnitude and angle values, respectively. Another suite of extensive tests
has been performed in the time domain, within circuit-level simulations. The results
summarized in Tab. 5.2 show a maximum error of less than 12.5%.
Chapter 6
Methodology Binding
Contents
6.1 Application Profile Example
6.1.1 Description of the SoC Resource Set
6.1.2 Floorplan Cluster Tree
6.1.3 Design Space Exploration Method
6.1.4 Cost Function Settings
6.2 Evaluation of Synthesis Results
6.2.1 Delay-Optimized Architecture
6.2.2 Energy-Optimized Architecture
6.2.3 Accuracy Evaluation
6.3 Summary
The previous three chapters have presented a modeling methodology focused on
variability propagation and technology accuracy for the optimization of on-chip
communication architectures. The optimization is guided by the top-level performance macromodels
for delay and energy consumption, which have been introduced in Sec. 3.5 and 3.6 and
have been extended with circuit-level models for different signaling methods in Sec. 4.4.
In addition, a refinement of the interconnect models for accurate validations has been
developed and presented in chapter 5. A first method for the design space exploration
considering the mapping and scheduling of processing tasks on resources has been pre-
sented in Sec. 3.7 and the inclusion of communication activities in the form of communi-
cation nodes has been discussed in Sec. 4.4.5. Nevertheless, the inclusion of circuit-level
models in the communication nodes enables the use of signaling choice, voltage scaling,
and body biasing as additional exploration techniques in the global optimization loop.
The results of applying the developed methodology to an application scenario are
presented and analyzed in this chapter. First, an application profile is derived for a video
decoding example and the target MPSoC architecture is described, including floorplan
information and process parameter variations. Second, a set of parameters
which guide the design space exploration method toward the desired optimization
criteria is presented. Within this context, the parametrization of the cost function using
quantile functions and weighting factors is discussed. Furthermore, the synthesis results
are presented for a delay-driven optimization and for an energy-oriented exploration and
the accuracy of the modeling method for communication segments is investigated.
Fig. 6.1: Application task graph (a) and the considered SoC architecture (b).
(a) Task graph nodes: 1 Entropy Decoding; 2 to 5 Mem 1 to Mem 4; 6 Motion Vector Prediction;
7 Intra-Block Prediction; 8 iScan/iQuant; 9 Mem 5; 10 iTrans; 11 Prediction; 12 Reconstruction;
13 Mem 6; 14 Loop Filter; 15 Framebuffer Processing.
(b) Resources: MEM1 to MEM4, GPP1 to GPP3, ASIC; W = 15 mm, ResW = 30; H = 10 mm, ResH = 20.
6.1 Application Profile Example
A video decoding application based on the H.264 standard [162] has been selected in this
chapter to illustrate the results of the proposed methodology. The execution flow illus-
trated by the task graph from Fig. 6.1(a) contains typical decoding tasks, such as entropy
decoding, motion-compensated prediction, and the integer quantization and transform
operations, but also memory access operations between the decoding tasks, denoted as
“Mem i”.
Parameter                              GPPi              ASIC              MEMi
                                       µ       σ         µ       σ         µ       σ
$P_{l,RT_i}$ [mW]                      18      167       1.8     16.7      3.6     83.5
$P^{RT_j}_{d,i}$ [mW]                  480     5         100     2         100     2
$T^{x}_{1,RT_j}$ [µs]                  1152    72.3      –       –         –       –
$T^{x}_{2,3,4,5,9,13,RT_j}$ [µs]       –       –         –       –         132     8.7
$T^{x}_{6,RT_j}$ [µs]                  72      4.5       24      1.5       –       –
$T^{x}_{7,RT_j}$ [µs]                  74      4.7       25      1.6       –       –
$T^{x}_{8,RT_j}$ [µs]                  1654    119.75    551     39.9      –       –
$T^{x}_{10,RT_j}$ [µs]                 2636    162.47    879     54.16     –       –
$T^{x}_{11,RT_j}$ [µs]                 146     11.2      49      3.73      –       –
$T^{x}_{12,RT_j}$ [µs]                 1754    126.57    584     42.19     –       –
$T^{x}_{14,RT_j}$ [µs]                 2549    154       849     51.33     –       –
$T^{x}_{15,RT_j}$ [µs]                 1367    94        –       –         –       –
Tab. 6.1: Application profile parameters used for the communication synthesis.
6.1.1 Description of the SoC Resource Set
The target MPSoC architecture for application mapping and communication synthesis
is illustrated in Fig. 6.1(b) and contains a mixture of generic general-purpose processors
(GPP) and memory blocks. In addition, for checking different synthesis alternatives, a
high-speed ASIC block is considered as an alternative for the execution of data-intensive
operations.
The estimated profile parameters for the selected application are presented in Tab. 6.1.
Here, the leakage power $P_{l,RT_i}$ of each resource type has been estimated for the considered
1.0-V, 90-nm CMOS process from a drain current value of 0.18 nA at zero $V_{gs}$, as indicated
by Fig. 4.6. A number of approx. $10^6$ leakage paths has been further assumed for the GPP
cores, $10^3$ paths for the ASIC, and $5 \cdot 10^4$ paths for the memory modules. Note
that the leakage power exhibits a very large variation with the process parameters and
has been modeled with a lognormal distribution (starting from a standard deviation of
0.167 µA of the leakage current).
The dynamic power consumption values $P^{RT_j}_{d,i}$ have been approximated by assuming
an 800 MHz frequency for the GPP cores and an average power consumption of e.g.
0.6 mW/MHz. For the ASIC and memory blocks, a lower average value of 100 mW power
consumption has been assumed.
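As a quick check, the two stated assumptions for the GPP cores reproduce the mean dynamic power listed for them in Tab. 6.1:

```latex
P_{d}^{GPP} \approx 800\,\mathrm{MHz} \times 0.6\,\frac{\mathrm{mW}}{\mathrm{MHz}} = 480\,\mathrm{mW}
```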
Execution times of processing tasks on the GPP cores have been estimated assuming
the decoding of a high-definition resolution image of 1280 × 720 pixels and 4 × 4 integer
transforms. Furthermore, single-cycle integer operations have been assumed on the GPPs
and the ASIC execution has been assumed to be three times faster than the GPP execution.
The execution times of the memory access operations have been averaged from the total
number of required load and store operations. Finally, it has been assumed that PN1 and
PN15 can be only executed on the GPP cores, whereas all the intermediate tasks can be
run on both GPP and the ASIC block.
Fig. 6.2: Floorplan cluster tree for the processing resources (a) and one possible inter-resource
connection in a hierarchical bus architecture (b).
(a) Cluster tree: FC0 (root) with FC1 = {MEM1, MEM2}, FC2 = {MEM3, ASIC}, FC3 = {GPP2, MEM4},
FC4 = {GPP1, GPP3}. (b) Clusters 0 to 4, each with an arbiter (ARB).
Floorplan Cluster   Length   Bitwidth $W_s$
FC0                 20 mm    16
FC1                 1 mm     4
FC2                 100 µm   2
FC3                 100 µm   2
FC4                 1 mm     4
Tab. 6.2: Floorplan cluster parameters.
A silicon die area of 15 × 10 mm² has been further assumed, in a 1.0-V 90-nm CMOS
process, with the processor cores and ASIC block occupying an area of about 4 × 4 mm²
and the memory modules around 50% of the processor core area. For spatially-correlated
process parameter variations, a 30 × 20 grid has been used with a PWL correlation model
as described by equation (4.24) with the parameters $d_d = 6$ mm and $\rho_r = 0.2$.
6.1.2 Floorplan Cluster Tree
Fig. 6.2 depicts the associated floorplan cluster tree for the processing resources consid-
ered in the previous section and assumes local clusters connecting MEM3 with the ASIC
block and MEM4 with GPP2. Further, GPP1 and GPP3 are grouped in a local cluster, as
well as MEM1 with MEM2. Since the exact parameters of the communication segments
can be specified only after routing, a generic cluster description is employed instead in
the design space exploration. The information used to describe the five floorplan clusters
is given in Tab. 6.2.
For the global cluster, which connects all resources together, an average length of
20mm and a segment width of 16 bits have been considered. Clusters FC2 and FC3,
which connect the ASIC and GPP2 with a memory block have been assumed to be only
100µm long in order to balance the choice of signaling circuits, since voltage-mode sig-
naling is more effective for such short lengths. Furthermore, the segment bitwidth has
been assumed to increase with segment length, to improve the performance of global
data transfers.
6.1.3 Design Space Exploration Method
The search for an optimized communication architecture employs the macromodels for
processing and communication nodes described in Sec. 3.5, 3.6, and 4.4.5 for estimating
the performance metrics of possible solutions in the design space. Hereby, the exploration
method is implemented as a nested optimization loop based on simulated annealing. The
results presented in this chapter were obtained using an exponential cooling schedule of
the form $T_{k+1} = 0.9 \cdot T_k$ in 100 steps and 10 000 local iterations for each cooling step.
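To give a feeling for these settings, the cooling schedule can be unrolled. The initial temperature $T_0$ is not specified in the text and is kept symbolic here:

```latex
T_k = 0.9^{k}\, T_0
\quad\Rightarrow\quad
T_{100} = 0.9^{100}\, T_0 \approx 2.66 \cdot 10^{-5}\, T_0,
\qquad
N_{eval} = 100 \times 10\,000 = 10^{6}
```

The temperature thus drops by between four and five orders of magnitude over the run, while about one million candidate solutions are evaluated in total.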
As a starting point for the exploration, an initial architecture is generated by randomly
assigning the processing nodes to compatible resources and creating the initial scheduling
lists dictated by inter-task dependencies. Then, communication nodes are inserted in the
macromodels for the inter-resource segments and the signaling circuit for each segment is
selected randomly from the described PCM and voltage-mode techniques. For this initial
solution, the supply voltage Vdd and the body bias Vbs are set at the nominal values of
1.0V and 0V, respectively.
At each local iteration, the next jump in the solution space is decided using a uniform
random number generator and a set of probabilities for each jump type. In this way, the
next solution is searched by applying one of the following modifications to the current
architecture:
• Remap a PN from one resource to another, with a probability Premap = 0.05;
• Change the scheduling order of two PNs on a resource, with a probability Preschedule =
0.05;
• Change the signaling circuit on a communication segment, with a probability Psig =
0.1;
• Change the supply voltage of a communication segment, with a probability PVdd =
0.4;
• Change the body bias on a communication segment, with a probability PVbs = 0.4.
Supply voltage scaling is performed between 1.0 V and 0.6 V in 100-mV steps, while the
body bias is adjusted from −0.4 V to +0.4 V in 100-mV steps. Note that the jump
probabilities have been selected with a focus on communication architecture optimization.
Consequently, a signaling circuit change is performed twice as often as PN remapping
or rescheduling. In addition, voltage scaling and body biasing are each performed four
times as often as the change of signaling circuits, to provide a better coverage of these
optimization resources at the circuit level.
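The selection of the next jump from a uniform random number can be sketched as a lookup in the cumulative probabilities. The identifiers below are hypothetical; only the probability values and the voltage ranges are taken from the text.

```python
import random

# Jump probabilities as listed above (they sum to 1).
JUMP_PROBABILITIES = {
    "remap": 0.05,
    "reschedule": 0.05,
    "signaling": 0.10,
    "vdd": 0.40,
    "vbs": 0.40,
}
VDD_LEVELS = [v / 10 for v in range(10, 5, -1)]  # 1.0 V down to 0.6 V, 100-mV steps
VBS_LEVELS = [v / 10 for v in range(-4, 5)]      # -0.4 V up to +0.4 V, 100-mV steps

def pick_jump(u, probabilities=JUMP_PROBABILITIES):
    """Map a uniform random number u in [0, 1) to a jump type via cumulative probabilities."""
    cumulative = 0.0
    name = None
    for name, p in probabilities.items():
        cumulative += p
        if u < cumulative:
            return name
    return name  # guards against floating-point round-off when u is close to 1

rng = random.Random(0)          # fixed seed, only to make the sketch repeatable
jump = pick_jump(rng.random())  # one of the five jump types
```

Using a single random number with cumulative thresholds keeps the ratios stated above (1 : 1 : 2 : 8 : 8) without any further bookkeeping.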
6.1.4 Cost Function Settings
Given the global system delay $T^{e}_{global}$ and the global dynamic and leakage energy
consumptions $E_{d,total}$ and $E_{l,total}$, respectively, which are statistically evaluated using the
performance macromodels, the cost function employed for the considered example has the
following form:
$$
C\left(T^{e}_{global}, E_{d,total}, E_{l,total}\right) =
w_T \cdot \frac{Q\left(T^{e}_{global}\right)}{Q\left(T_i\right)}
+ w_{E_d} \cdot \frac{Q\left(E_{d,total}\right)}{Q\left(E_{d,i}\right)}
+ w_{E_l} \cdot \frac{Q\left(E_{l,total}\right)}{Q\left(E_{l,i}\right)}
\tag{6.1}
$$
where $w_T$, $w_{E_d}$, and $w_{E_l}$ represent the weights with which the influence of the individual
performance metrics are considered in the optimization. Different values of the weights
are employed in this chapter to provide an emphasis to either delay minimization or low
energy consumption.
It is important to note that the distributions $T^{e}_{global}$, $E_{d,total}$, and $E_{l,total}$ are strongly
application-dependent and might be separated by several orders of magnitude. To
ensure that the influence of each of them is appropriately considered in the cost function, a
scaling of these values is performed by normalizing them to $T_i$, $E_{d,i}$, and $E_{l,i}$. The latter
represent the performance metrics of the initial solution, as given by the macromodels.
Finally, since the performance metrics are evaluated as statistical distributions, a
quantile function $Q : P \rightarrow \mathbb{R}$ is used to extract the 99% inferior quantile of the performance
metric distributions as explained in Sec. 3.4.4:
$$
Q\left(p_X(x)\right) = z_{99\%,inf} = F^{-1}_{X}(0.99)
\tag{6.2}
$$
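A minimal numerical sketch of Eqs. (6.1) and (6.2) follows. It assumes that each performance distribution is available as a list of samples and uses an empirical quantile; this sample-based representation and the function names are illustrative assumptions, since the macromodels in this work evaluate the distributions statistically.

```python
def quantile_99(samples):
    """Empirical 99% quantile: smallest sample with at least 99% of the data at or below it."""
    ordered = sorted(samples)
    index = (99 * len(ordered) + 99) // 100 - 1  # ceil(0.99 * n) - 1 in integer arithmetic
    return ordered[index]

def cost(t, e_d, e_l, t_init, e_d_init, e_l_init, w_t, w_ed, w_el):
    """Weighted sum of 99% quantiles, each normalized to the initial solution (Eq. 6.1)."""
    return (w_t * quantile_99(t) / quantile_99(t_init)
            + w_ed * quantile_99(e_d) / quantile_99(e_d_init)
            + w_el * quantile_99(e_l) / quantile_99(e_l_init))
```

By construction, the initial solution has a cost equal to the sum of the three weights, so any candidate scoring below that value improves on the starting point for the chosen weighting.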
6.2 Evaluation of Synthesis Results
The application profile and exploration algorithm discussed in the previous section have
been employed to find a communication architecture optimized with respect to differ-
ent targets. Delay-oriented and energy-oriented optimizations achieve different resource
mappings, scheduling lists, and communication circuit characteristics, such as signaling
method and voltage levels. To illustrate the impact of the optimization choices and to
show the benefit of the developed macromodels, the synthesis results are presented and
discussed in each case. In addition, an overall evaluation of the macromodel accuracy is
given for the synthesized communication segments based on a comparison with
circuit-level simulations.
Fig. 6.3: Resource mapping configuration, scheduling sequences, and communication segments
synthesized for minimum delay.
6.2.1 Delay-Optimized Architecture
To set the emphasis on delay minimization, the weighting factors in the cost function (6.1)
have been set to wT = 0.98 and wEd = wEl = 0.01. Using these settings, the optimization
algorithm generated the task-resource mappings and scheduling sequences shown in
Fig. 6.3 with 9 inter-resource communication segments. It is interesting to note that
the first GPP core is unused in this configuration. No additional parallelism can be
exploited by mapping PNs on GPP1, and the additional inter-resource communication
would only increase the overall system latency. Thus, GPP1 can be turned
off to reduce the power consumption. This configuration achieves a total system delay
of 9.391ms, a total leakage energy of 3.506mJ, and a dynamic energy consumption of
2.305mJ. All these performance metrics have been evaluated by extracting the 99% quantile
from the statistical distributions evaluated using the developed macromodels.
The synthesis also generates the scheduling times for the processing nodes, which can
be extracted from the nodes of the delay macromodel. Tab. 6.3 presents the obtained
timing values for the PNs in the considered application, extracted from the statistical
distributions for a desired parametric yield level of 99%. Note that the start and
end times for the PNs are computed considering also the inter-resource communication
delays.
The synthesized communication segments shown in Fig. 6.3 are described in Tab. 6.4.
It can be observed that PCM signaling has been employed on the long segments
connecting the ASIC, GPP2, and GPP3 with the memory blocks MEM1 and MEM2. On the
other hand, the local segments connecting the ASIC with MEM3 and GPP2 with MEM4
use voltage-mode signaling, which is faster than PCM at lengths of 100µm.
The delays on the local segments 2 and 8 are relatively small and typical for the use of
voltage-mode signaling at distances in the range of 100µm. On the remaining segments,
the delay values lie between 367 ps and 381 ps, as expected for the use of PCM signaling.
Small delay variations from segment to segment are mainly the result of process parameter
variations. Considering the leakage power results, PCM signaling circuits exhibit a
significantly higher dissipation. This is mainly due to the larger number of current
flowing paths, namely 8 leakage paths in the driver and 3 in the receiver, compared with 4
leakage paths (two in the driver buffer and two in the receiver buffer) for the
voltage-mode circuits. In addition, the contribution of the pulse signaling current to the
static power dissipation increases the difference between the two circuits. Most variations in
the dynamic power values of the communication segments are due to the variations in
the communication loads. Moreover, longer segments also have a relatively high dynamic
power dissipation, due to the larger total line capacitance.

Processing Node PN_i   Start Time T^s_i [ms]   End Time T^e_i [ms]
PN1                    0.000                   1.320
PN2                    2.021                   2.157
PN3                    1.718                   1.839
PN4                    1.388                   1.508
PN5                    1.729                   1.856
PN6                    3.158                   3.255
PN7                    2.286                   2.373
PN8                    1.576                   2.559
PN9                    3.725                   3.863
PN10                   2.985                   3.829
PN11                   4.021                   4.196
PN12                   4.700                   5.244
PN13                   5.905                   6.055
PN14                   6.799                   7.627
PN15                   8.125                   9.391
Tab. 6.3: Scheduled start and end times for processing nodes, evaluated as 99% inferior quantile
from the statistical distributions.

Comm. Seg.   Delay [ps]   Leak. Power [nW]   Dyn. Power [nW]   Signaling Circuit   Vdd [V]   Vbs [V]
Seg. 1       380          180.345            91.358            PCM                 1.0       0.0
Seg. 2       56           21.1               28.819            VM                  1.0       0.4
Seg. 3       374          225.973            85.721            PCM                 1.0       0.0
Seg. 4       379          137.423            79.855            PCM                 1.0       0.0
Seg. 5       381          112.926            80.205            PCM                 1.0       0.0
Seg. 6       376          223.036            86.331            PCM                 1.0       0.0
Seg. 7       367          113.179            87.953            PCM                 1.0       0.0
Seg. 8       61           23.392             28.701            VM                  1.0       0.4
Seg. 9       373          112.838            91.150            PCM                 1.0       0.0
Tab. 6.4: Parameters of the synthesized communication segments, evaluated as 99% inferior
quantile from the statistical distributions.
It is also important to note that the Vdd and Vbs values chosen by the exploration
algorithm were optimized for delay minimization. A nominal Vdd value ensures the minimum
latency, whereas the positive body bias of 0.4V achieves a delay improvement on
the voltage-mode lines, as also illustrated by the results in Fig. 4.25(b) of Sec. 4.3.3.

Fig. 6.4: Delay-optimized communication architecture synthesized as a shared bus and three
point-to-point links.

Communication Link   Scheduled Activities
ASIC – MEM3          PN12 → PN13, PN13 → PN14
GPP2 – MEM4          PN1 → PN4, PN4 → PN8, PN6 → PN9, PN9 → PN11
Shared Bus           PN1 → PN3, PN1 → PN2, PN3 → PN7, PN2 → PN6, PN7 → PN9,
                     PN5 → PN14, PN14 → PN15
GPP2 – MEM1          PN1 → PN5
Tab. 6.5: Scheduled communication activities on the synthesized architecture from Fig. 6.4.
After identifying the required communication segments depicted in Fig. 6.3, a complete
communication architecture can be synthesized, e.g., as multiple bus connections. It
can be noticed that the communications PN1 → PN2 and PN1 → PN3 are serialized by
the mapping of PN3 and PN2 on the same resource (see Fig. 6.3); therefore, they can share
the same communication link. In contrast, the communication PN1 → PN5 should be
performed in parallel with PN1 → PN3, as indicated by the scheduling times of PN3 and
PN5 in Tab. 6.3; thus, a separate link connecting GPP2 with MEM1 must be provided.
In addition, PN6 starts after PN7 finishes, as indicated by the timing values in Tab. 6.3;
therefore, the communications PN3 → PN7 and PN2 → PN6 can be serialized on the
same bus shared by GPP2, GPP3, and MEM2. Similarly, the communications PN7 → PN9,
PN5 → PN14, and PN14 → PN15 can be serialized on a single communication link. Thus,
the communication segments 1, 4, 5, 6, 7, and 9, which belong to the same floorplan cluster
FC0 and therefore exhibit comparable delays, can be shared by a global bus. The local
segments 2 and 8, which connect the ASIC with MEM3 and GPP2 with MEM4, are
implemented as individual point-to-point links due to their small delay. The identified
parallel connection between GPP2 and MEM1 is implemented as a separate link.

Fig. 6.5: Architecture optimized for minimum energy consumption, requiring only four resources
and three communication segments.
The communication architecture synthesized in this way is presented in Fig. 6.4, where
each communication link is denoted by its maximum delay. Note that the association of
several communication segments into a shared bus was possible due to their shared
floorplan cluster, and that the propagation delay over the bus has been selected as the
maximum delay of the unified communication segments. The corresponding scheduling lists
of the communication tasks are given in Tab. 6.5.
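The rule that a link inherits the maximum delay of its unified segments can be sketched as follows. The segment delays are those of Tab. 6.4 and the shared-bus membership follows the text; the link names are illustrative, and the assignment of segment 3 to the GPP2 – MEM1 link is an inference from the remaining PCM segment, not something the text states explicitly.

```python
# Segment delays in ps, taken from Tab. 6.4.
SEGMENT_DELAY_PS = {
    1: 380, 2: 56, 3: 374, 4: 379, 5: 381, 6: 376, 7: 367, 8: 61, 9: 373,
}

# Link assignment: the shared bus unifies the segments of floorplan cluster FC0.
# Link names are hypothetical; segment 3 -> GPP2-MEM1 is an inferred assumption.
LINKS = {
    "shared_bus": [1, 4, 5, 6, 7, 9],
    "gpp2_mem1": [3],
    "local_seg2": [2],
    "local_seg8": [8],
}


def link_delays(links, segment_delay):
    """Each link is characterized by the maximum delay of its unified segments."""
    return {name: max(segment_delay[s] for s in segs) for name, segs in links.items()}


delays = link_delays(LINKS, SEGMENT_DELAY_PS)
```

With the values of Tab. 6.4, the shared bus is characterized by the 381 ps of segment 5, while the two local links keep their individual voltage-mode delays of 56 ps and 61 ps.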
6.2.2 Energy-Optimized Architecture
A synthesis oriented toward minimizing the energy consumption, with a strong emphasis on
leakage energy, has been performed by setting the weighting factors in the cost function
to the following values:
wT = 0.10 (6.3)
wEl = 0.70 (6.4)
wEd = 0.20 (6.5)
This set of weights amplifies the contribution of leakage energy to the overall system
cost while also considering, to a small extent, improvements in delay and dynamic energy.
Guided by these values, the exploration algorithm found the solution shown in Fig. 6.5,
which illustrates the resource mappings, scheduling lists, and the resulting communication
segments. Note that the 15 PNs have been mapped to only four of the
eight available resources in this energy-optimized configuration. As a consequence, the
remaining two GPP cores and the two unused memory blocks can be turned off, e.g., using
power gating, which significantly reduces the total energy consumption. In addition,
most processing nodes are executed on the ASIC block, which has a much lower power
dissipation than the GPPs (see Tab. 6.1). As a result of this energy-driven optimization,
the system achieves a total leakage energy consumption of only 0.965mJ, a moderate
dynamic energy of 2.505mJ, and a total delay of 10.191ms (all values evaluated as the
99% inferior quantile from their respective distributions). Compared with the previous
delay-optimized architecture, the leakage energy consumption is improved by a factor of
3.6, at the expense of a moderate increase in delay by a factor
of approx. 1.09.
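The quoted factors follow directly from the 99% quantile values of the two architectures. As a worked check, the ratios for leakage energy and delay are given below; the dynamic-energy ratio and the ratio of summed leakage and dynamic energy are derived here from the same numbers and are not stated in the text.

```latex
\begin{align*}
\frac{E_{l}^{\text{delay-opt}}}{E_{l}^{\text{energy-opt}}}
  &= \frac{3.506\,\text{mJ}}{0.965\,\text{mJ}} \approx 3.63 \\
\frac{T^{\text{energy-opt}}}{T^{\text{delay-opt}}}
  &= \frac{10.191\,\text{ms}}{9.391\,\text{ms}} \approx 1.085 \\
\frac{E_{d}^{\text{energy-opt}}}{E_{d}^{\text{delay-opt}}}
  &= \frac{2.505\,\text{mJ}}{2.305\,\text{mJ}} \approx 1.087 \\
\frac{\left(E_{l} + E_{d}\right)^{\text{delay-opt}}}{\left(E_{l} + E_{d}\right)^{\text{energy-opt}}}
  &= \frac{5.811\,\text{mJ}}{3.470\,\text{mJ}} \approx 1.67
\end{align*}
```

The energy-driven weights therefore trade a delay increase of roughly 8.5% and a dynamic-energy increase of roughly 8.7% for the large reduction in leakage energy.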
Comm. Seg.   Delay [ps]   Leak. Power [nW]   Dyn. Power [nW]   Signaling Circuit   Vdd [V]   Vbs [V]
Seg. 1       68           8.402              10.284            VM                  1.0       -0.4
Seg. 2       372          45.005             92.486            PCM                 0.6       -0.4
Seg. 3       67           9.241              41.165            VM                  1.0       -0.4
Tab. 6.6: Parameters of the three synthesized communication segments shown in Fig. 6.5
(evaluated using the 99% inferior quantile from the statistical distributions).
The performance metrics for the synthesized inter-resource communication segments
are presented in Tab. 6.6, together with the signaling circuit parameters. Voltage-mode
signaling is again employed on the short segments connecting the ASIC with MEM3 and
the GPP with MEM4, as this signaling method is both energy-efficient and delay-efficient
at very short interconnect lengths. A reverse body bias of -0.4V significantly decreases
the leakage energy on all segments, as expected from the results presented in
Sec. 4.2.4 and 4.3.3. In addition, as discussed in Sec. 4.2.4, the reverse body bias on the
PCM line does not significantly affect the delay. Thus, the choice of PCM signaling is
optimal for the long communication segment, as it allows the leakage reduction through
body biasing while keeping the delay at a low level. In contrast, voltage-mode signaling
is recommended on the short segments, since it exhibits a lower leakage energy and a lower
delay, and since the delay increase due to the reverse body bias remains relatively small
in the context of the overall system latency.
Scaling the supply voltage on the PCM line to the value of 0.6V improves the dynamic
energy consumption due to the large parasitic capacitance of the global segment. The
leakage energy consumption is less influenced by this scaling (particularly at -0.4V
reverse body bias, see also Fig. 4.20(a)) and the delay of PCM lines, as discussed in
Sec. 4.2.4, remains practically unaffected. It is important to notice that voltage scaling
on the VM lines has been rejected by the exploration algorithm, since it brings a negligible
improvement in the dynamic energy due to the very small line capacitance, at the cost of a
significant delay increase, since the delay of voltage-mode lines is sensitive to voltage
scaling (see Fig. 4.25(a)).
6.2.3 Accuracy Evaluation
Explorations in the solution space use the developed circuit-level models and RLC line
models [116] for fast estimations. The performance metrics evaluated in this way serve
mainly to compare the cost functions of different solutions in the search for an optimized
architecture. Nevertheless, it is useful to investigate the accuracy of this modeling ap-
proach in predicting the actual performance level of the synthesized architecture.
In order to check the modeling accuracy, two test scenarios are defined. First, the
circuit-level modeling precision is tested by comparing the segment models employed in
the optimization with simulations of the signaling circuits using the Cadence Spectre [31]
circuit simulator. The relative delay error, evaluated as:
$$\varepsilon_{d,rel} = \frac{\left|T_{d,model} - T_{d,sim}\right|}{T_{d,sim}} \cdot 100 \; [\%] \qquad (6.6)$$
(where $T_{d,model}$ and $T_{d,sim}$ are the segment delays obtained using the model and circuit
simulation, respectively) is given in the second column of Tab. 6.7 for every communication
segment synthesized in Sec. 6.2.1 and 6.2.2. Note that this first test scenario
uses the same RLC line model from [116] in the segment models and within the circuit
simulations. Thus, the results from Tab. 6.7 using the RLC line model practically reflect
the accuracy of the signaling circuit models. As a result of employing technology-accurate
BSIM4 equations and parameters in the circuit models and due to the direct analytical
solution to the equivalent circuit model, the results show a very good agreement with the
circuit simulations (which employ the commercial BSIM4 implementation and the same
technology parameters).

Comm. Seg.   RLC Line Model   Extrapolated S-Parameter Line Model
Delay-Optimized Architecture
Seg. 1       2.882%           10.618%
Seg. 2       2.806%           4.975%
Seg. 3       2.477%           9.894%
Seg. 4       3.823%           10.551%
Seg. 5       2.872%           10.379%
Seg. 6       2.842%           10.160%
Seg. 7       2.068%           11.082%
Seg. 8       3.185%           4.328%
Seg. 9       3.912%           10.510%
Energy-Optimized Architecture
Seg. 1       3.137%           5.291%
Seg. 2       3.751%           11.572%
Seg. 3       2.943%           4.345%
Tab. 6.7: Relative delay error of the communication circuit models with respect to circuit
simulations.
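The error metric of Eq. (6.6) and the worst-case values of Tab. 6.7 can be checked with a short sketch. The tabulated errors are copied from Tab. 6.7; the function name and the 103 ps / 100 ps example delays are hypothetical and serve only to exercise the formula.

```python
def relative_delay_error(t_model, t_sim):
    """Relative delay error in percent, following Eq. (6.6)."""
    if t_sim <= 0:
        raise ValueError("simulated delay must be positive")
    return abs(t_model - t_sim) / t_sim * 100.0


# Errors in percent from Tab. 6.7: (RLC line model, extrapolated S-parameter model).
TAB_6_7 = {
    "delay-optimized": [
        (2.882, 10.618), (2.806, 4.975), (2.477, 9.894),
        (3.823, 10.551), (2.872, 10.379), (2.842, 10.160),
        (2.068, 11.082), (3.185, 4.328), (3.912, 10.510),
    ],
    "energy-optimized": [
        (3.137, 5.291), (3.751, 11.572), (2.943, 4.345),
    ],
}

all_rows = [row for rows in TAB_6_7.values() for row in rows]
max_rlc = max(rlc for rlc, _ in all_rows)
max_sparam = max(sp for _, sp in all_rows)

# Hypothetical example: a modeled delay of 103 ps against a simulated 100 ps.
example_error = relative_delay_error(103.0, 100.0)
```

The maxima obtained this way, 3.912% for the RLC line model and 11.572% for the extrapolated S-parameter model, stay below 4% and 12%, respectively.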
In the second test scenario, the communication segments are described using the refined
interconnect models developed in chapter 5. Within this context, the wide-bandwidth
interconnect models are employed in a practical application scenario to validate the
synthesis results from the optimization framework. The synthesized segments are simulated
first using the extrapolated S-parameter models, then the results are compared with
simulations of similar S-parameter models extracted with the field solver. The relative error
results are given in the last column of Tab. 6.7 and show an accuracy level similar to the
results from chapter 5 for this practical example. It is important to add that shorter
segments have only two parallel wires (see Fig. 6.3 and Fig. 6.5) and therefore exhibit a lower
relative error. As indicated in Sec. 5.4, the modeling error generally increases with the
number of wires per segment.
6.3 Summary
This chapter has presented the results of applying the developed communication synthe-
sis framework in the context of a practical example. An application scenario has been
chosen and the description of the corresponding profile considering the target MPSoC
architecture has been presented. To include the influence of process parameter varia-
tions, the chip area has been divided into a two-dimensional grid and a spatial correla-
tion model has been applied. In addition, floorplanning information has been considered
using a cluster tree model description.
Further, the exploration method for optimizing the communication architecture has
been discussed and the steps performed in the solution space have been explained. An
individual probability value has been assigned to each step type to increase the relative
frequency of optimizations at the circuit level, such as signaling choice, voltage scaling,
and body biasing. It has also been shown that the relative influence of performance met-
rics on the system cost function has to be normalized and balanced using weighting fac-
tors to achieve different optimization goals.
Afterwards, the synthesis results have been evaluated for two optimization scenarios.
The delay-optimized architecture produced a relatively large number of communication
segments at nominal supply voltage and body bias values. The choice of signaling circuits
and voltages has been discussed and a communication architecture has been synthesized
by unifying similar segments and scheduling communication activities. Additionally, an
energy-optimized communication architecture has been synthesized by changing the cost
weighting factors and the obtained parameters have been analyzed. The optimum choice
of PCM signaling for long communication lines and of voltage-mode signaling for short
ones has been pointed out once more and the choice of voltage values for minimizing
the energy has been discussed. Finally, the modeling accuracy for the synthesized com-
munication segments has been evaluated in comparison with circuit-level simulations
using both the fast lumped RLC line models and the extrapolated interconnect model
developed in this work.
Chapter 7
Conclusions
Contents
7.1 Contributions of the Work
7.2 Directions for Future Work
This thesis has presented a unified methodology for the modeling and optimization of
on-chip communication, with an emphasis on technology accuracy and parameter
variations. By developing a statistical parameter model and a method for the fast
propagation of statistical distributions across algebraic operations, the development of
variability-aware performance macromodels has been enabled. Consequently, optimization
methods for system-level design decisions such as task mapping and scheduling have been
analyzed in the context of variability and the particularities of result interpretation from
statistical distributions have been explored. Technology accuracy in the communica-
tion models has been achieved by developing low-level analytic models for signaling
circuits starting from an accurate statistical transistor-level model and by providing a
wide-bandwidth modeling methodology for interconnection segments. Thus, the opti-
mization of the communication architecture through circuit-level decisions such as the
choice of signaling method, voltage scaling, and body biasing was enabled and analyzed.
7.1 Contributions of the Work
The proposed communication modeling and synthesis methodology has a strong focus
on parameter variability. Thus, a structured method for extracting the application profile
and for specifying the target MPSoC architecture considering parameter variations has
been presented. Within this context, a unified application profile interface has been de-
fined and implemented in the form of entries in a configuration file, which specify data
dependencies, communication loads, SoC resource types, leakage and dynamic power
consumption, task-resource compatibilities, execution times, floorplanning information,
and technology characteristics, such as process parameter values and variations. To en-
able the accurate description of variations in power dissipation, execution times, commu-
nication loads, as well as temperature and process parameters, a discretized distribution
model with adjustable accuracy has been developed. The control of accuracy through the
limits of the approximation interval, the number of pdf bins, and the number of samples, as
well as a method to extract samples from the discretized description, have also been presented.
An adjustable tradeoff between accuracy and speed, as well as the ability to represent
non-standard distributions and profiling data are the main advantages of the proposed
model.
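The idea of a discretized distribution with adjustable accuracy can be sketched as a histogram over a bounded interval. This is only an illustration under stated assumptions: the class name, its methods, and the uniform sampling within a bin are hypothetical choices and do not reproduce the model developed in this work.

```python
import numpy as np


class DiscretizedPdf:
    """Histogram-based distribution over [lo, hi] with a chosen number of bins."""

    def __init__(self, samples, lo, hi, bins):
        counts, edges = np.histogram(samples, bins=bins, range=(lo, hi))
        self.edges = edges
        self.prob = counts / counts.sum()  # probability mass per bin

    def quantile(self, p):
        """Inverse CDF with linear interpolation inside the selected bin."""
        cdf = np.cumsum(self.prob)
        i = min(int(np.searchsorted(cdf, p)), len(self.prob) - 1)
        below = cdf[i - 1] if i > 0 else 0.0
        frac = (p - below) / self.prob[i] if self.prob[i] > 0 else 0.0
        return float(self.edges[i] + frac * (self.edges[i + 1] - self.edges[i]))

    def sample(self, n, rng):
        """Draw a bin by its mass, then a uniform value inside that bin."""
        idx = rng.choice(len(self.prob), size=n, p=self.prob)
        return rng.uniform(self.edges[idx], self.edges[idx + 1])


rng = np.random.default_rng(1)
raw = rng.normal(1.0, 0.1, 10_000)  # hypothetical profiling data
pdf = DiscretizedPdf(raw, lo=0.6, hi=1.4, bins=64)
q99 = pdf.quantile(0.99)
redrawn = pdf.sample(5_000, rng)
```

Widening the interval, increasing the number of bins, or using more samples improves the approximation at the cost of run time, which mirrors the accuracy controls named above.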
A novel method for the propagation of complete distribution functions has been in-
troduced, relying upon the implementation of statistical operators. Unlike previous sta-
tistical analysis approaches which derive variability dependencies as linear sensitivities,
moments, or polynomial approximations, this methodology propagates the entire pdf
description across the model expressions by applying statistical operators. Within this
context, the complex implementation of a statistical product operator has been described
in detail. In addition, a novel technique for the fast numerical implementation of statis-
tical operators with adjustable accuracy has been developed. It has been shown that a
very good accuracy can be obtained using only 0.25% of the execution time required by
Monte Carlo sampling. This technique enabled the implementation of every statistical
operation required by the models, such as exponential, square root, logarithm, division
etc. Based upon this methodology, statistical macromodels for delay and power have
been developed and presented. The inclusion of statistical representations in the opera-
tional nodes of the macromodels has been discussed and the implied re-evaluations and
updates propagated across the tree structures during the optimization-driven changes in
the macromodel structure have been analyzed. In addition, the interpretation of perfor-
mance costs for optimization decisions using quantile functions applied to the statistical
distributions has been presented.
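What it means to propagate a complete distribution through an operation can be illustrated with a brute-force stand-in. The sketch below assumes two independent discretized distributions, enumerates all pairs of bin centers, and re-bins the result; it is not the fast numerical technique developed in this work, and the function names are hypothetical.

```python
import numpy as np


def rebin(values, masses, bins):
    """Collect weighted values into a new discretized distribution."""
    hist, edges = np.histogram(values, bins=bins,
                               range=(float(values.min()), float(values.max())),
                               weights=masses)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, hist / hist.sum()


def propagate(op, x, px, y, py, bins):
    """Distribution of op(X, Y) for independent X and Y given as (centers, masses)."""
    z = op(x[:, None], y[None, :]).ravel()       # all pairwise results
    pz = (px[:, None] * py[None, :]).ravel()     # joint mass under independence
    return rebin(z, pz, bins)


def bell(center, half_width, sigma, n=21):
    """Hypothetical symmetric input distribution on n bin centers."""
    c = np.linspace(center - half_width, center + half_width, n)
    p = np.exp(-0.5 * ((c - center) / sigma) ** 2)
    return c, p / p.sum()


x, px = bell(1.0, 0.1, 0.04)
y, py = bell(2.0, 0.2, 0.08)
z, pz = propagate(np.multiply, x, px, y, py, bins=50)
mean_z = float(np.sum(z * pz))
```

The same `propagate` call accepts other binary operations such as `np.add` or `np.divide`; the quadratic cost of the pair enumeration is what a fast implementation has to avoid.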
An accurate current-source transistor model for technology-accurate estimations has
been derived from BSIM4 equations, in which the modeling expressions are evaluated
statistically, using the developed statistical operators and distribution propagation
technique. In this way, state-of-the-art CMOS process parameters and parameter variations
are included in the model to achieve a high technology accuracy. Spatially-correlated
intra-die parameter variations were described using two-dimensional grids and correla-
tion decay models. In addition, spatial correlations have been modeled using the prin-
cipal component analysis. Based upon the statistical transistor model, equivalent circuit
models have been derived for pulsed current-mode signaling circuits and for traditional
voltage-mode signaling buffers. Using analytic approximations for the signal waveforms
in the regions of interest, the circuit equations have been solved analytically and models
for the delay and energy consumption have been derived. Using these circuit-level
models, the performance of the investigated signaling methods has been analyzed under
the influence of voltage scaling and body biasing. Furthermore, the circuit-level
models have been employed for the representation of communication segments within the
optimization framework and design decisions such as the selection of signaling method,
voltage scaling, and body biasing have been included in the solution space exploration.
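The combination of a two-dimensional grid, a correlation decay model, and principal component analysis can be sketched as follows. The exponential decay with distance, the correlation length, and the function names are assumptions chosen for illustration; the decay model used in this work may differ.

```python
import numpy as np


def grid_correlation(nx, ny, corr_length):
    """Correlation matrix of grid cells, assuming exponential decay with distance."""
    xs, ys = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    pts = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return np.exp(-dist / corr_length)


def pca_samples(corr, sigma, rng, n):
    """Map independent standard normal components to correlated grid variations."""
    eigval, eigvec = np.linalg.eigh(corr)
    eigval = np.clip(eigval, 0.0, None)       # guard against tiny negative values
    transform = eigvec * np.sqrt(eigval)      # scale each principal direction
    z = rng.standard_normal((n, corr.shape[0]))
    return sigma * z @ transform.T


rng = np.random.default_rng(2)
corr = grid_correlation(4, 4, corr_length=2.0)
variations = pca_samples(corr, sigma=1.0, rng=rng, n=20_000)
empirical = np.corrcoef(variations, rowvar=False)
```

Cells that are close on the grid receive strongly correlated parameter values, while the independent principal components keep the sampling itself simple.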
For accurate validations of the synthesized communication segments, a computationally
efficient wide-bandwidth characterization method based on an incremental extrapolation
of S-parameters for arbitrary interconnect structures has been developed and presented.
The characterization method defines a systematic set of a priori parameter
extractions and performs on-demand multistep extrapolations for interconnect segments
with specified wire length, widths, spacings, metal layer, and neighboring routing
information. The experimental evaluations have shown a maximum absolute error of less
than $2 \cdot 10^{-2}$ (magnitude) and 7 degrees (angle) between the developed model and an
industry-standard full-wave field simulator for a 90-nm CMOS process. Circuit-level
simulations with the extrapolated model have shown a maximum signal delay error of less
than 12.5% across multiple metal layers and wire configurations.
Finally, the complete developed methodology has been employed on an application
example and the communication synthesis has been conducted first for speed optimiza-
tion and then for minimum energy consumption. In each case, the synthesized architec-
ture and the parameters of the communication segments have been analyzed in detail.
An evaluation of the modeling accuracy has shown an error level of less than 4% using
lumped RLC line models and less than 12% using the extrapolated interconnect model.
7.2 Directions for Future Work
The proposed performance modeling methodology represents an important backbone for
a variability-aware accurate communication synthesis framework. Several optimization
resources have been enabled by the developed circuit-level models. Nevertheless, several
directions for future enhancements can be identified.
Additional optimization techniques at the circuit level: The described methodology can
be extended to include device sizing, wire width sizing, and wire spacing as optimization
techniques. Since the developed models are already dependent on these parameters, such
extensions can be easily integrated in the current framework.
Complete integration with SPICE-level circuit simulators: A direct synthesis of the
communication segments into SPICE-level netlists is possible, since the signaling circuits
and the interconnect line parasitics are known in the macromodels. This enhancement
would directly enable the automatic validation of the synthesized segments and the
replacement of RLC lumped parameters with the extrapolated n-port segment models.
Simulation results can be read back into the framework and additional refinements can be
applied to the synthesized architecture.
Inclusion of floorplanning algorithms: The current implementation of the synthesis
framework considers limited floorplanning information in the form of cluster trees and
cluster descriptions. A complete floorplanning algorithm could be included in the
framework to optimize the MPSoC floorplan in conjunction with the communication synthesis.
Routing of communication segments: A routing algorithm oriented toward the manufacturing
process could generate a complete wiring structure for the synthesized communication
architecture, which can be further optimized across the metal layers and can adjust
the wire thickness, wire spacing, and routing configurations in the adjacent metal
layers. In order not to interfere with the existing interconnects of the resources on the
chip, a three-dimensional constraint system should be implemented to delimit the allowed
space for routing.
Appendix A
Complex Expression of the Output
Voltage for the Voltage-Mode Signaling
Circuit
Depending on device parameters and interconnect sizes, the roots of the characteristic
equation (4.85) might be complex, such that:
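$$\lambda_1 = \Re_\lambda + j\,\Im_\lambda \qquad (A.1)$$
$$\lambda_2 = \Re_\lambda - j\,\Im_\lambda \qquad (A.2)$$
As a consequence, the output voltage $V_o(t)$ has also a complex expression. In such a case,
the delay model considers only the real part of $V_o$, which after applying the computations
described in Sec. 4.3.2 results in the following form:
$$\begin{aligned}
\Re V_o(t) = V_{dd} + {} & \frac{1}{L \alpha C_1 C_2 \left(\Im_\lambda^2 + (\alpha - \Re_\lambda)^2\right)\left(\Im_\lambda^2 + \Re_\lambda^2\right)}
\Big\{ -t \alpha I_1 \left(\Im_\lambda^2 + (\alpha - \Re_\lambda)^2\right) \\
& + \left(-1 + e^{\alpha t}\right) \gamma \left(\Im_\lambda^2 + \Re_\lambda^2\right)
+ L \alpha C_1 \left(\Im_\lambda^2 + (\alpha - \Re_\lambda)^2\right) \Big[ \Im K_1 \Im_\lambda - \Im K_2 \Im_\lambda \\
& + \Re K_1 \Re_\lambda + \Re K_2 \Re_\lambda
- e^{\Re_\lambda t} \sin\left(\Im_\lambda t\right) \left(\Im_\lambda \left(\Re K_1 + \Re K_2\right) + \left(-\Im K_1 + \Im K_2\right) \Re_\lambda\right) \\
& - e^{\Re_\lambda t} \cos\left(\Im_\lambda t\right) \left(\left(\Im K_1 - \Im K_2\right) \Im_\lambda + \left(\Re K_1 + \Re K_2\right) \Re_\lambda\right)
\Big] \Big\}
\end{aligned} \qquad (A.3)$$
where the constants $K_1$ and $K_2$ have complex expressions given by:
$$\Re K_1 = \frac{\gamma}{2 L C_1 \left(\Im_\lambda^2 + (\alpha + \Re_\lambda)^2\right)} - \frac{I_1}{2 L C_1 \left(\Im_\lambda^2 + \Re_\lambda^2\right)} \qquad (A.4)$$
$$\Im K_1 = -\frac{\alpha \gamma}{2 L C_1 \Im_\lambda \left(\Im_\lambda^2 + (\alpha + \Re_\lambda)^2\right)} + \frac{\gamma \Re_\lambda}{2 L C_1 \Im_\lambda \left(\Im_\lambda^2 + (\alpha - \Re_\lambda)^2\right)} - \frac{I_1 \Re_\lambda}{2 L C_1 \Im_\lambda \left(\Im_\lambda^2 + \Re_\lambda^2\right)} \qquad (A.5)$$
$$\Re K_2 = -\Re K_1 + \frac{\gamma \left(\alpha^2 + \Im_\lambda^2 - 2 \alpha \Re_\lambda + \Re_\lambda^2\right)}{L C_1 \left(\Im_\lambda^2 + (\alpha - \Re_\lambda)^2\right)^2} - \frac{I_1}{L C_1 \left(\Im_\lambda^2 + \Re_\lambda^2\right)} \qquad (A.6)$$
$$\Im K_2 = -\Im K_1 \qquad (A.7)$$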
References
[1] A. AGARWAL, D. BLAAUW, and V. ZOLOTOV. Statistical Timing Analysis for Intra-Die Process
Variations with Spatial Correlations. In Intl. Conf. on Computer-Aided Design (ICCAD), pages
900–907, 2003.
[2] A. AGARWAL, D. BLAAUW, V. ZOLOTOV, S. SUNDARESWARAN, M. ZHAO, K. GALA, and
R. PANDA. Statistical Delay Computation Considering Spatial Correlations. In Asia and
South Pacific Design Automation Conf. (ASP-DAC), pages 271–276, 2003.
[3] A. AGARWAL, D. BLAAUW, V. ZOLOTOV, and S. VRUDHULA. Statistical Timing Analysis using
Bounds and Selective Enumeration. In IEEE/ACM Intl. Workshop on Timing Issues in the
Specification and Synthesis of Digital Systems (TAU), pages 29–36, 2002.
[4] A. AGARWAL, D. BLAAUW, V. ZOLOTOV, and S. VRUDHULA. Computation and Refinement of
Statistical Bounds on Circuit Delay. In Design Automation Conf. (DAC), pages 348–353, 2003.
[5] A. AGARWAL, D. BLAAUW, V. ZOLOTOV, and S. VRUDHULA. Statistical Timing Analysis using
Bounds. In Design Automation and Test in Europe (DATE), pages 62–67, 2003.
[6] A. AGARWAL, V. ZOLOTOV, and D. BLAAUW. Statistical Timing Analysis Using Bounds and
Selective Enumeration. IEEE Trans. on Computer-Aided Design (CAD) of Integrated Circuits
and Systems, 22(9):1243–1260, Sept. 2003.
[7] K. AGARWAL and S. NASSIF. Characterizing Process Variation in Nanometer CMOS. In
Design Automation Conf. (DAC), pages 396–399, 2007.
[8] K. AGARWAL, R. RAO, D. SYLVESTER, and R. BROWN. Parametric Yield Analysis and Opti-
mization in Leakage Dominated Technologies. IEEE Trans. on Very Large Scale Integration
(VLSI) Systems, 15(6):613–623, June 2007.
[9] I. AHSAN, N. ZAMDMER, O. GLUSHCHENKOV, R. LOGAN, E. J. NOWAK, H. KIMURA,
J. ZIMMERMAN, G. BERG, J. HERMAN, E. MACIEJEWSKI, A. CHAN, A. AZUMA, S. DESHPANDE,
B. DIRAHOUI, G. FREEMAN, A. GABOR, M. GRIBELYUK, S. HUANG, M. KUMAR, K. MIYAMOTO,
D. MOCUTA, A. MAHOROWALA, E. LEOBANDUNG, H. UTOMO, and B. WALSH. RTA-Driven Intra-Die
Variations in Stage Delay, and Parametric Sensitivities for 65nm Technology. In IEEE Symp.
on VLSI Technology, pages 170–171, 2006.
[10] C. S. AMIN, N. MENEZES, K. KILLPACK, F. DARTU, U. CHOUDHURY, N. HAKIM, and Y. I. ISMAIL.
Statistical Static Timing Analysis: How Simple Can We Get? In Design Automation Conf.
(DAC), pages 652–657, 2005.
[11] H. ANANTHAN and K. ROY. A Fully Physical Model for Leakage Distribution under Process
Variations in Nanoscale Double-Gate CMOS. In Design Automation Conf. (DAC), pages 413–418,
2006.
[12] M. ANDERS, N. RAI, R. K. KRISHNAMURTHY, and S. BORKAR. A Transition-Encoded Dynamic
Bus Technique for High-Performance Interconnects. IEEE Journal of Solid-State Circuits,
38(5):709–714, May 2003.
[13] A. ANDREI, M. T. SCHMITZ, P. ELES, Z. PENG, and B. M. AL HASHIMI. Quasi-Static Voltage
Scaling for Energy Minimization with Time Constraints. In Design Automation and Test in
Europe (DATE), pages 514–519, 2005.
[14] F. ANGIOLINI, J. CENG, R. LEUPERS, F. FERRARI, C. FERRI, and L. BENINI. An Integrated Open
Framework for Heterogeneous MPSoC Design Space Exploration. In Design Automation and
Test in Europe (DATE), pages 1145–1150, 2006.
[15] ANSOFT CORP. HFSS: 3D Full-wave Electromagnetic Field Simulation.
http://www.ansoft.com/products/hf/hfss, Sept. 2008.
[16] R. BASHIRULLAH, W. LIU, R. CAVIN, and D. EDWARDS. A Hybrid Current/Voltage Mode On-
Chip Signaling Scheme With Adaptive Bandwidth Capability. IEEE Trans. on Very Large
Scale Integration (VLSI) Systems, 12(8):876–880, Aug. 2004.
[17] R. BASHIRULLAH, W. LIU, R. CAVIN, and D. EDWARDS. A 16 Gb/s Adaptive Bandwidth On-
Chip Bus Based on Hybrid Current/Voltage Mode Signaling. IEEE Journal of Solid-State
Circuits, 41(2):461–473, Feb. 2006.
[18] R. BASHIRULLAH, W. LIU, and R. K. CAVIN. Current-Mode Signaling in Deep Submicrometer
Global Interconnects. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 11(3):406–
417, June 2003.
[19] M. BERKELAAR. Statistical Delay Calculation, a Linear Time Method. In Intl. Workshop on
Timing Issues (TAU), pages 15–24, 1997.
[20] K. BERNSTEIN, D. J. FRANK, A. E. GATTIKER, W. HAENSCH, B. L. JI, S. R. NASSIF, E. J. NOWAK,
D. J. PEARSON, and N. J. ROHRER. High-performance CMOS variability in the 65-nm regime
and beyond. IBM J. Res. & Dev., 50(4/5):433–449, July/Sept. 2006.
[21] S. BHARDWAJ, Y. CAO, and S. VRUDHULA. Statistical Leakage Minimization Through Joint
Selection of Gate Sizes, Gate Lengths and Threshold Voltage. In Asia and South Pacific Design
Automation Conf. (ASP-DAC), pages 953–958, 2006.
[22] S. BHARDWAJ, S. VRUDHULA, and D. BLAAUW. τAU: Timing Analysis Under Uncertainty. In
Intl. Conf. on Computer-Aided Design (ICCAD), pages 615–620, 2003.
[23] S. BHARDWAJ, S. VRUDHULA, P. GHANTA, and Y. CAO. Modeling of Intra-Die Process Variations
for Accurate Analysis and Optimization of Nano-Scale Circuits. In Design Automation
Conf. (DAC), pages 791–796, 2006.
[24] D. BLAAUW, K. CHOPRA, A. SRIVASTAVA, and L. SCHEFFER. Statistical Timing Analysis: From
Basic Principles to State of the Art. IEEE Trans. on Computer-Aided Design (CAD) of Integrated
Circuits and Systems, 27(4):589–607, Apr. 2008.
[25] T. BLICKLE, J. TEICH, and L. THIELE. System-Level Synthesis Using Evolutionary Algorithms.
J. Des. Automation Embedded Syst., 3(1):23–58, 1998.
[26] A. BONNOIT, S. HERBERT, and D. M. L. PILEGGI. Integrating Dynamic Voltage/Frequency
Scaling and Adaptive Body Biasing using Test-Time Voltage Selection. In Intl. Symp. on Low
Power Electronics and Design (ISLPED), pages 207–212, 2009.
[27] S. BORKAR, T. KARNIK, S. NARENDRA, J. TSCHANZ, A. KESHAVARZI, and V. DE. Parameter
Variations and Impact on Circuits and Microarchitecture. In Design Automation Conf. (DAC),
pages 338–342, 2003.
[28] K. A. BOWMAN, B. L. AUSTIN, J. C. EBLE, X. TANG, and J. D. MEINDL. A Physical Alpha-Power
Law MOSFET Model. IEEE Journal of Solid-State Circuits, 34(10):1410–1414, Oct. 1999.
[29] C.-P. CHEN and D. F. WONG. Optimal Wire Sizing Function with Fringing Capacitance Consideration.
In Design Automation Conf. (DAC), pages 604–607, 1997.
[30] CADENCE DESIGN SYSTEMS. Virtuoso Spectre Circuit Simulator User Guide. Product Version
5.1.41, July 2004.
[31] Cadence Design Systems, Inc. Virtuoso Spectre Circuit Simulator Reference, Nov. 2004. Product
Version 5.1.41.
[32] A. E. CALDWELL, A. B. KAHNG, S. MANTIK, I. L. MARKOV, and A. ZELIKOVSKY. On Wirelength
Estimations for Row-Based Placement. IEEE Trans. on Computer-Aided Design (CAD)
of Integrated Circuits and Systems, 18(9):1265–1278, 1999.
[33] K. CAO, S. DOBRE, and J. HU. Standard Cell Characterization Considering Lithography Induced
Variations. In Design Automation Conf. (DAC), pages 801–804, 2006.
[34] Y. CAO and L. T. CLARK. Mapping Statistical Process Variations Toward Circuit Performance
Variability: An Analytical Modeling Approach. In Design Automation Conf. (DAC), pages
658–663, 2005.
[35] H. CHANG and S. SAPATNEKAR. Statistical Timing Analysis Considering Spatial Correlations
using a Single Pert-Like Traversal. In Intl. Conf. on Computer-Aided Design (ICCAD), pages
621–625, 2003.
[36] H. CHANG, V. ZOLOTOV, S. NARAYAN, and C. VISWESWARIAH. Parametrized Block-Based Statistical
Timing Analysis with Non-Gaussian Parameters, Nonlinear Delay Functions. In
Design Automation Conf. (DAC), pages 71–76, 2005.
[37] R. T. CHANG, N. TALWALKAR, C. P. YUE, and S. S. WONG. Near Speed-of-Light Signaling
Over On-Chip Electrical Interconnects. IEEE Journal of Solid-State Circuits, 38(5):834–838,
May 2003.
[38] S. C. CHAPRA and R. P. CANALE. Numerical Methods for Engineers. McGraw-Hill Higher
Education, 2006.
[39] K. CHEN and C. HU. Performance and Vdd Scaling in Deep Submicrometer CMOS. IEEE
Journal of Solid-State Circuits, 33(10):1586–1589, 1998.
[40] M. CHEN, W. ZHAO, F. LIU, and Y. CAO. Fast Statistical Circuit Analysis with Finite-Point
Based Transistor Model. In Design Automation and Test in Europe (DATE), pages 1391–1396,
2007.
[41] T. CHEN and J. GREGG. A Low Cost Individual-Well Adaptive Body Bias (IWABB) Scheme
for Leakage Power Reduction and Performance Enhancement in the Presence of Intra-Die
Variations. In Design Automation and Test in Europe (DATE), 2004.
[42] T. CHEN and S. NAFFZIGER. Comparison of Adaptive Body Bias (ABB) and Adaptive Supply
Voltage (ASV) for Improving Delay and Leakage Under the Presence of Process Variation.
IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 11(5):888–899, Oct. 2003.
[43] C. CHO, D. KIM, J. KIM, J.-O. PLOUCHART, and R. TRZCINSKI. Statistical Framework for
Technology-Model-Product Co-Design and Convergence. In Design Automation Conf.
(DAC), pages 503–508, 2007.
[44] S.-W. CHOI, H.-B. LEE, and H.-J. PARK. A Three-Data Differential Signaling Over Four Conductors
With Pre-Emphasis and Equalization: A CMOS Current Mode Implementation.
IEEE Journal of Solid-State Circuits, 41(3):633–641, Mar. 2006.
[45] K. CHOPRA, S. SHAH, A. SRIVASTAVA, D. BLAAUW, and D. SYLVESTER. Parametric Yield Maximization
using Gate Sizing based on Efficient Statistical Power and Delay Gradient Computation.
In Intl. Conf. on Computer-Aided Design (ICCAD), pages 1020–1025, 2005.
[46] K. CHOPRA, N. SHENOY, and D. BLAAUW. Variogram Based Robust Extraction of Process
Variation Model. In Intl. Workshop on Timing Issues (TAU), 2007.
[47] B. CLINE, K. CHOPRA, D. BLAAUW, A. TORRES, and S. SUNDARESWARAN. Transistor-Specific
Delay Modeling for SSTA. In Design Automation and Test in Europe (DATE), pages 592–597,
2008.
[48] J. CONG. An Interconnect-Centric Design Flow for Nanometer Technologies. Proceedings of
the IEEE, 89:505–528, Apr. 2001.
[49] J. CONG and Z. PAN. Interconnect Performance Estimation Models for Design Planning.
IEEE Trans. on Computer-Aided Design (CAD) of Integrated Circuits and Systems, 20(6):739–752,
2001.
[50] B. DAVE, G. LAKSHMINARAYANA, and N. JHA. COSYN: Hardware-Software Co-Synthesis of
Embedded Systems. In Design Automation Conf. (DAC), pages 703–708, 1997.
[51] B. DAVE, G. LAKSHMINARAYANA, and N. JHA. COSYN: Hardware-Software Co-Synthesis
of Heterogeneous Distributed Embedded Systems. IEEE Trans. on Computer-Aided Design
(CAD) of Integrated Circuits and Systems, 7(1):92–104, 1999.
[52] D. J. DELEGANES, M. BARANY, G. GEANNOPOULOS, K. KREITZER, M. MORRISE, D. MILLIRON,
A. P. SINGH, and S. WIJERATNE. Low-Voltage Swing Logic Circuits for a Pentium 4 Processor
Integer Core. IEEE Journal of Solid-State Circuits, 40(1):36–43, Jan. 2005.
[53] A. DEUTSCH, P. W. COTEUS, G. V. KOPCSAY, H. H. SMITH, C. W. SUROVIC, B. L. KRAUTER, D. C.
EDELSTEIN, and P. J. RESTLE. On-Chip Wiring Design Challenges for Gigahertz Operation.
Proceedings of the IEEE, 89(4):529–555, Apr. 2001.
[54] G. V. DEVARAYANADURG and M. SOMA. An interconnect model for arbitrary terminations
based on scattering parameters. Analog Integrated Circuits and Signal Processing, 5:31–45,
Jan. 1994.
[55] A. DEVGAN and C. KASHYAP. Block-Based Static Timing Analysis with Uncertainty. In Intl.
Conf. on Computer-Aided Design (ICCAD), pages 607–614, 2003.
[56] A. DOBOLI. Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design
of Embedded Systems under Power and Latency Constraints. In Design Automation
and Test in Europe (DATE), pages 612–619, Mar. 2001.
[57] A. DOBOLI and P. ELES. Scheduling under Control Dependencies for Heterogeneous Architectures.
In Intl. Conf. on Computer-Aided Design (ICCAD), pages 602–608, 1998.
[58] M. EISELE, J. BERTHOLD, D. SCHMITT-LANDSIEDEL, and R. MAHNKOPF. The Impact of Intra-
Die Device Parameter Variations on Path Delays and on the Design for Yield of Low Voltage
Digital Circuits. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 5(4):360–368, 1997.
[59] P. ELES, A. DOBOLI, P. POP, and Z. PENG. Scheduling with Bus Access Optimization for
Distributed Embedded Systems. IEEE Trans. on Very Large Scale Integration (VLSI) Systems,
8(5):472–491, 2000.
[60] W. C. ELMORE. The Transient Response of Damped Linear Networks with Particular Regard
to Wide-Band Amplifiers. J. Appl. Phys., 19(1):55–63, 1948.
[61] R. ERNST and W. YE. Embedded Program Timing Analysis Based on Path Clustering and
Architecture Classification. In Intl. Conf. on Computer-Aided Design (ICCAD), pages 598–604,
1997.
[62] D. GAJSKI, N. DUTT, C. WU, and Y. LIN. High-Level Synthesis: Introduction to Chip and System
Design. Kluwer Academic Publishers, 1991.
[63] D. GAJSKI, F. VAHID, S. NARAYAN, and J. CONG. Specification and Design of Embedded Systems.
Prentice-Hall, Englewood Cliffs, N.J., 1994.
[64] D. D. GAJSKI and F. VAHID. Specification and Design of Embedded Hardware-Software Systems.
IEEE Design & Test of Computers, 12:53–67, 1995.
[65] M. GALASSI, J. DAVIES, J. THEILER, B. GOUGH, G. JUNGMAN, M. BOOTH, and F. ROSSI. GNU
Scientific Library Reference Manual, 1.11 edition, Feb. 2008.
[66] M. R. GAREY and D. S. JOHNSON. Computers and Intractability: a Guide to the Theory of NP-Completeness.
W. H. Freeman & Co., 1979.
[67] A. GATTIKER, S. NASSIF, R. DINAKAR, and C. LONG. Timing Yield Estimation from Static
Timing Analysis. In Intl. Symp. on Quality Electronic Design (ISQED), pages 437–442, 2001.
[68] P. GHANTA, S. VRUDHULA, R. PANDA, and J. WANG. Stochastic Power Grid Analysis Considering
Process Variations. In Design Automation and Test in Europe (DATE), pages 964–969,
2005.
[69] A. G. GLEN, L. M. LEEMIS, and J. H. DREW. Computing the distribution of the product of two
continuous random variables. Computational Statistics & Data Analysis, 44(3):451–464, Jan.
2004.
[70] J. GONG, D. GAJSKI, and S. NARAYAN. Software Estimation from Executable Specifications.
In Proc. European Design Automation Conf. (EuroDAC), 1995.
[71] J. GU, S. SAPATNEKAR, and C. KIM. Width-Dependent Statistical Leakage Modeling for Random
Dopant Induced Threshold Voltage Shift. In Design Automation Conf. (DAC), pages
87–92, 2007.
[72] B. GUSTAVSEN and A. SEMLYEN. Enforcing Passivity for Admittance Matrices Approximated
by Rational Functions. IEEE Trans. on Power Systems, 16(1):97–104, Feb. 2001.
[73] S. HANSON, B. ZHAI, K. BERNSTEIN, D. BLAAUW, A. BRYANT, L. CHANG, K. K. DAS, W. HAENSCH,
E. J. NOWAK, and D. M. SYLVESTER. Ultralow-Voltage Minimum-Energy CMOS. IBM J.
Res. & Dev., 50(4/5):469–490, July/Sept. 2006.
[74] W. HEINRICH, K. BEILENHOFF, P. MEZZANOTTE, and L. ROSELLI. Optimum Mesh Grading for
Finite-Difference Method. IEEE Trans. on Microwave Theory and Techniques, 44(9):1569–1574,
Sept. 1996.
[75] J. HENKEL. A Low Power Hardware/Software Partitioning Approach for Core-based Embedded
Systems. In Design Automation Conf. (DAC), pages 122–127, 1999.
[76] V. HILL, O. FARLE, and R. DYCZIJ-EDLINGER. A Stabilized Multilevel Vector Finite-Element
Solver for Time-Harmonic Electromagnetic Waves. IEEE Trans. on Microwave Theory and
Techniques, 39(3):1203–1206, May 2003.
[77] C. HOARE. Communicating Sequential Processes. Comm. ACM, 21(8):666–677, 1978.
[78] C.-C. HUANG. Using S parameters for signal integrity analysis. eeDesign (EE Times EDA
News), Feb. 2004.
[79] E. HUMENAY, D. TARJAN, and K. SKADRON. Impact of Process Variations on Multicore Performance
Symmetry. In Design Automation and Test in Europe (DATE), pages 1653–1658, 2007.
[80] INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS, 2006 UPDATE. Lithography.
http://www.itrs.net, Feb.
[81] INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS, 2007 EDITION. Interconnect.
http://www.itrs.net, Feb. 2008.
[82] INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS, 2008 UPDATE. Design.
http://www.itrs.net, May 2009.
[83] N. IZUMI, H. OZAKI, Y. NAKAGAWA, N. KASAI, and T. ARIKADO. Evaluation of Transistor Property
Variations Within Chips on 300-mm Wafers Using a New MOSFET Array Test Structure.
IEEE Trans. on Semiconductor Manufacturing, 17(3):248–254, Aug. 2004.
[84] S. JAHN, M. MARGRAF, V. HABCHI, and R. JACOB. The Qucs Project: Technical Papers.
http://qucs.sourceforge.net/tech/technical.html, Nov. 2008.
[85] J. A. G. JESS, K. KALAFALA, S. R. NAIDU, R. H. J. M. OTTEN, and C. VISWESWARIAH. Statistical
Timing for Parametric Yield Prediction of Digital Integrated Circuits. In Design Automation
Conf. (DAC), pages 932–937, 2003.
[86] D. JIAO, M. MAZUMDER, S. CHAKRAVARTY, C. DAI, M. KOBRINSKY, M. HARMES, and S. LIST.
A Novel Technique for Full-Wave Modeling of Large-Scale Three-Dimensional High-Speed
On/Off-Chip Interconnect Structures. In International Conference on Simulation of Semiconductor
Processes and Devices (SISPAD), pages 39–42, Sept. 2003.
[87] A. P. JOSE, G. PATOUNAKIS, and K. L. SHEPARD. Pulsed Current-Mode Signaling for Nearly
Speed-of-Light Intrachip Communication. IEEE Journal of Solid-State Circuits, 41(4):772–780,
Apr. 2006.
[88] M. KAMON, M. TSUK, and J. WHITE. FastHenry: A Multipole-Accelerated 3D Inductance
Extraction Program. IEEE Trans. on Microwave Theory and Techniques, 42(9):1750–1758, Sept.
1994.
[89] K. KANG, B. PAUL, and K. ROY. Statistical Timing Analysis Using Levelized Covariance
Propagation. In Design Automation and Test in Europe (DATE), pages 764–769, 2005.
[90] J. T. KAO, M. MIYAZAKI, and A. R. CHANDRAKASAN. A 175-mV multiply-accumulate unit
using an adaptive supply voltage and body bias architecture. IEEE Journal of Solid-State
Circuits, 37:1545–1554, Nov. 2002.
[91] H. KAUL, D. SYLVESTER, and D. BLAAUW. Performance Optimization of Critical Nets Through
Active Shielding. IEEE Trans. on Circuits and Systems I: Regular Papers, 51(12):2417–2435, Dec.
2004.
[92] V. KHANDELWAL, A. DAVOODI, and A. SRIVASTAVA. Efficient Statistical Timing Analysis
Through Error Budgeting. In Intl. Conf. on Computer-Aided Design (ICCAD), pages 473–477,
2004.
[93] J. KIM and M. ORSHANSKY. Towards Formal Probabilistic Power-Performance Design Space
Exploration. In Great Lakes Symp. on VLSI (GLSVLSI), pages 229–234, 2006.
[94] K. J. KUHN. Reducing Variation in Advanced Logic Technologies: Approaches to Process
and Design for Manufacturability of Nanoscale CMOS. In IEEE Intl. Electron Devices Meeting
(IEDM), pages 471–474, Dec. 2007.
[95] M. KURIHARA, M. IZAWA, J. TANAKA, K. KAWAI, and N. FUJIWARA. Gate CD Control Considering
Variation of Gate and STI Structure. IEEE Trans. on Semiconductor Manufacturing,
20(3):232–238, Aug. 2007.
[96] J.-Y. LAI, N. SAKA, and J.-H. CHUN. Evolution of Copper-Oxide Damascene Structures in
Chemical Mechanical Polishing. J. of the Electrochemical Society, 149(1):G41–G50, 2002.
[97] J. LE, X. LI, and L. PILEGGI. STAC: Statistical Timing Analysis with Correlation. In Design
Automation Conf. (DAC), pages 343–348, 2004.
[98] Y. S. LI, S. MALIK, and A. WOLFE. Performance Estimation of Embedded Software with
Instruction Cache Modeling. In Intl. Conf. on Computer-Aided Design (ICCAD), pages 380–
387, 1995.
[99] J.-J. LIOU, K.-T. CHENG, S. KUNDU, and A. KRISTIĆ. Fast Statistical Timing Analysis By Probabilistic
Event Propagation. In Design Automation Conf. (DAC), pages 661–666, 2001.
[100] J.-J. LIOU, A. KRISTIĆ, L.-C. WANG, and K.-T. CHENG. False-Path-Aware Statistical Timing
Analysis and Efficient Path Selection for Delay Testing and Timing Validation. In Design
Automation Conf. (DAC), pages 566–569, 2002.
[101] F. LIU. A General Framework for Spatial Correlation Modeling in VLSI Design. In Design
Automation Conf. (DAC), pages 817–822, 2007.
[102] M. LOGHI, F. ANGIOLINI, D. BERTOZZI, L. BENINI, and R. ZAFALON. Analyzing On-Chip Communication
in a MPSoC Environment. In Design Automation and Test in Europe (DATE),
pages 752–757, 2004.
[103] J. LOYER. S-parameters and digital-circuit design. EDN Magazine, Feb. 2003.
[104] J. LUO, S. SINHA, Q. SU, J. KAWA, and C. CHIANG. An IC Manufacturing Yield Model Considering
Intra-Die Variations. In Design Automation Conf. (DAC), pages 749–754, 2006.
[105] M. MANI, A. DEVGAN, and M. ORSHANSKY. An Efficient Algorithm for Statistical Minimization
of Total Power under Timing Yield Constraints. In Design Automation Conf. (DAC),
pages 309–314, 2005.
[106] G. MARSAGLIA. Ratios of Normal Variables and Ratios of Sums of Uniform Variables. Journal
of the American Statistical Association, 60(309):193–204, Mar. 1965.
[107] H. MASUDA, S. OKAWA, and M. AOKI. Approach for Physical Design in Sub-100nm Era. In
Intl. Symp. on Circuits and Systems (ISCAS), volume 6, pages 5934–5937, 2005.
[108] J. W. MCPHERSON. Reliability Challenges for 45nm and Beyond. In Design Automation Conf.
(DAC), pages 176–181, 2006.
[109] A. V. MEZHIBA and E. G. FRIEDMAN. Properties of On-Chip Inductive Current Loops. In
Great Lakes Symp. on VLSI (GLSVLSI), pages 12–17, Apr. 2002.
[110] M. MIYAZAKI, G. ONO, and K. ISHIBASHI. A 1.2-GIPS/W Microprocessor using Speed-
Adaptive Threshold-Voltage CMOS with Forward Bias. IEEE Journal of Solid-State Circuits,
37:210–217, Feb. 2002.
[111] F. MOLL and M. ROCA. Interconnection Noise in VLSI Circuits. Kluwer, Dordrecht, The Netherlands,
2004.
[112] K. NABORS and J. WHITE. FastCap: A Multipole-Accelerated 3D Capacitance Extraction
Program. IEEE Trans. on Computer-Aided Design (CAD) of Integrated Circuits and Systems,
21(11):50–62, Nov. 1991.
[113] S. R. NAIDU. Timing Yield Calculation Using an Impulse-Train Approach. In Asia and South
Pacific Design Automation Conf. (ASP-DAC), pages 219–224, 2002.
[114] K. NAISHADHAM and P. MISRA. Order Recursive Method of Moments (ORMoM) for Iterative
Design Applications. IEEE Trans. on Microwave Theory and Techniques, 44(12):2595–2604, Dec.
1996.
[115] O. S. NAKAGAWA, N. CHANG, S. LIN, and D. SYLVESTER. Circuit Impact and Skew-Corner
Analysis of Stochastic Process Variation in Global Interconnect. In IEEE Intl. Conf. on Interconnect
Technology, pages 230–232, 1999.
[116] NANOSCALE INTEGRATION AND MODELING GROUP ASU. Predictive Technology Model.
http://www.eas.asu.edu/~ptm, Sept. 2008.
[117] S. NARENDRA, A. KESHAVARZI, B. A. BLOECHEL, S. BORKAR, and V. DE. Forward Body Bias for
Microprocessors in 130-nm Technology Generation and Beyond. IEEE Journal of Solid-State
Circuits, 38:696–701, May 2003.
[118] S. R. NASSIF. Design for Variability in DSM technologies. In Proc. IEEE ISQED, pages 451–
454, Mar. 2000.
[119] M. OLIVIERI, G. SCOTTI, and A. TRIFILETTI. A Novel Yield Optimization Technique for Digital
CMOS Circuits Design by Means of Process Parameters Run-Time Estimation and Body
Bias Active Control. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 13:630–638,
May 2005.
[120] M. ORSHANSKY, J. C. CHEN, and C. HU. Direct Sampling Methodology for Statistical Analysis
of Scaled CMOS Technologies. IEEE Trans. on Semiconductor Manufacturing, 12(4):403–408,
Nov. 1999.
[121] M. ORSHANSKY, L. MILOR, P. CHEN, K. KEUTZER, and C. HU. Impact of Spatial Intrachip
Gate Length Variability on the Performance of High-Speed Digital Circuits. IEEE Trans. on
Computer-Aided Design (CAD) of Integrated Circuits and Systems, 21(5):544–553, 2002.
[122] M. ORSHANSKY, L. MILOR, and C. HU. Characterization of Spatial Intrafield Gate CD Variability,
Its Impact on Circuit Performance, and Spatial Mask-Level Correction. IEEE Trans.
on Semiconductor Manufacturing, 17(1):2–11, Feb. 2004.
[123] D. PAMUNUWA. Modelling and Analysis of Interconnects for Deep Submicron Systems-on-Chip.
PhD thesis, Royal Inst. of Technology, Stockholm, Sweden, 2003.
[124] S. PANDEY and R. DRECHSLER. Robust On-Chip Bus Architecture Synthesis for MPSoCs Under
Random Tasks Arrival. In Asia and South Pacific Design Automation Conf. (ASP-DAC),
pages 601–606, Mar. 2008.
[125] S. PANDEY, R. DRECHSLER, T. MURGAN, and M. GLESNER. Process Variations Aware Robust
On-Chip Bus Architecture Synthesis for MPSoCs. In Intl. Symp. on Circuits and Systems
(ISCAS), pages 2989–2992, May 2008.
[126] S. PANDEY and M. GLESNER. Statistical on-chip communication bus synthesis and voltage
scaling under timing yield constraint. In Design Automation Conf. (DAC), pages 663–668,
2006.
[127] S. PANDEY and M. GLESNER. Simultaneous On-Chip Bus Synthesis and Voltage Scaling Under
Random On-Chip Data Traffic. IEEE Trans. on Very Large Scale Integration (VLSI) Systems,
15(10):1111–1124, Oct. 2007.
[128] S. PANDEY, N. UTLU, and M. GLESNER. Tabu Search Based On-Chip Communication Bus
Synthesis for Shared Multi-Bus Based Architecture. In IFIP Intl. Conf. on VLSI-SoC, pages
222–227, 2006.
[129] A. S. PAPA and M. MUTYAM. Power Management of Variation Aware Chip Multiprocessors.
In Great Lakes Symp. on VLSI (GLSVLSI), pages 423–428, 2008.
[130] S. PASRICHA, N. DUTT, E. BOZORGZADEH, and M. BEN-ROMDHANE. Floorplan-Aware Automated
Synthesis of Bus-based Communication Architectures. In Design Automation Conf.
(DAC), pages 565–570, June 2005.
[131] T. PHAM-GIA, N. TURKKAN, and E. MARCHAND. Density of the Ratio of Two Normal Random
Variables and Applications. Communications in Statistics - Theory and Methods, 35(9):1569–
1591, Sept. 2006.
[132] S. PRAKASH and A. PARKER. SOS: Synthesis of application-specific heterogeneous multiprocessor
systems. J. Parallel Distrib. Comput., 16:338–351, 1992.
[133] R. R. RAO, A. DEVGAN, D. BLAAUW, and D. SYLVESTER. Analytical Yield Prediction Considering
Leakage/Performance Correlation. IEEE Trans. on Computer-Aided Design (CAD) of
Integrated Circuits and Systems, 25(9):1685–1695, Sept. 2006.
[134] J. M. RABAEY, A. CHANDRAKASAN, and B. NIKOLIĆ. Digital Integrated Circuits. A Design Perspective.
Prentice Hall, Upper Saddle River, New Jersey, 2nd edition, 2003.
[135] R. R. RAO, D. BLAAUW, and D. SYLVESTER. Modeling and Analysis of Parametric Yield under
Power and Performance Constraints. IEEE Design & Test of Computers, 22(4):376–385, July
2005.
[136] J. RUBINSTEIN, P. PENFIELD JR., and M. A. HOROWITZ. Signal Delay in RC Tree Networks.
IEEE Trans. on Computer-Aided Design (CAD) of Integrated Circuits and Systems, 2(3):202–211,
1983.
[137] A. E. RUEHLI and A. C. CANGELLARIS. Progress in the Methodologies for the Electrical Modeling
of Interconnects and Electronic Packages. Proceedings of the IEEE, 89(5):740–771, May
2001.
[138] A. E. RUEHLI and H. HEEB. Challenges and Advances in Electrical Interconnect Analysis. In
Design Automation Conf. (DAC), pages 460–465, Anaheim, California, June 1992.
[139] T. SAKURAI and A. R. NEWTON. Alpha-Power Law MOSFET Model and its Applications to
CMOS Inverter Delay and Other Formulas. IEEE Journal of Solid-State Circuits, 25(2):584–
594, 1990.
[140] H. SATO, H. KUNITOMO, K. TSUNEMO, K. MORI, and H. MASUDA. Accurate Statistical Process
Variation Analysis for 0.25-µm CMOS with Advanced TCAD Methodology. IEEE Trans. on
Semiconductor Manufacturing, 11(4):575–582, Nov. 1998.
[141] S. SATO. Growing Importance of Fundamental Understanding of the Source of Process
Variations. In IEEE Intl. Conf. on Adv. Thermal Processing of Semiconductors (RTP), pages 5–9,
2006.
[142] L. SCHEFFER. Explicit Computation of Performance as a Function of Process Variation. In
IEEE/ACM Intl. Workshop on Timing Issues in the Specification and Synthesis of Digital Systems
(TAU), pages 1–8, 2002.
[143] L. SCHEFFER. The Count of Monte Carlo. In IEEE/ACM Intl. Workshop on Timing Issues in the
Specification and Synthesis of Digital Systems (TAU), 2004.
[144] H. SHEN and F. PÉTROT. Novel Task Migration Framework on Configurable Heterogeneous
MPSoC Platforms. In Asia and South Pacific Design Automation Conf. (ASP-DAC), pages 733–
738, 2009.
[145] K. L. SHEPARD and T. ZIAN. Return-Limited Inductances: A Practical Approach to On-Chip
Inductance Extraction. IEEE Trans. on Computer-Aided Design (CAD) of Integrated Circuits and
Systems, 19(4):425–436, Apr. 2000.
[146] Y. SHIN and K. CHOI. Power Conscious Fixed Priority Scheduling for Hard Real-Time Sys-
tems. In Design Automation Conf. (DAC), pages 134–139, 1999.
[147] A. K. SINGH, M. MANI, and M. ORSHANSKY. Statistical Technology Mapping for Parametric
Yield. In Intl. Conf. on Computer-Aided Design (ICCAD), pages 510–517, 2005.
[148] A. SRIVASTAVA, R. BAI, D. BLAAUW, and D. SYLVESTER. Modeling and Analysis of Leakage
Power Considering Within-Die Process Variations. In Intl. Symp. on Low Power Electronics
and Design (ISLPED), pages 64–67, 2002.
[149] A. SRIVASTAVA, K. CHOPRA, S. SHAH, D. SYLVESTER, and D. BLAAUW. A Novel Approach to
Perform Gate-Level Yield Analysis and Optimization Considering Correlated Variations in
Power and Performance. IEEE Trans. on Computer-Aided Design (CAD) of Integrated Circuits
and Systems, 27(2):272–285, 2008.
[150] A. SRIVASTAVA, S. SHAH, K. AGARWAL, D. SYLVESTER, D. BLAAUW, and S. DIRECTOR. Accurate
and Efficient Gate-Level Parametric Yield Estimation Considering Correlated Variations in
Leakage Power and Performance. In Design Automation Conf. (DAC), pages 535–540, 2005.
[151] A. SRIVASTAVA, D. SYLVESTER, and D. BLAAUW. Statistical Analysis and Optimization for VLSI:
Timing and Power. Springer, New York, USA, 2005.
[152] C. SVENSSON. Optimum Voltage Swing on On-Chip and Off-Chip Interconnect. IEEE Journal
of Solid-State Circuits, 36(7):1108–1112, July 2001.
[153] S. TASIRAN and A. DEMIR. Smart Monte Carlo for Yield Estimation. In Intl. Workshop on
Timing Issues (TAU), 2006.
[154] R. TEODORESCU, J. NAKANO, A. TIWARI, and J. TORRELLAS. Mitigating Parameter Variation
with Dynamic Fine-Grain Body Biasing. In Intl. Symp. on Microarchitecture (MICRO), pages
27–39, 2007.
[155] N. THEPAYASUWAN and A. DOBOLI. Hardware-Software Co-Design of Resource Constrained
Systems on a Chip. In Intl. Conf. on Distributed Computing Systems Workshops (ICDCSW),
pages 818–823, Mar. 2004.
[156] N. THEPAYASUWAN and A. DOBOLI. Layout Conscious Approach and Bus Architecture Synthesis
for Hardware/Software Codesign of Systems on Chip Optimized for Speed. IEEE
Trans. on Very Large Scale Integration (VLSI) Systems, 13(5):525–538, May 2005.
[157] R. O. TOPALOGLU and A. ORAILOGLU. Forward Discrete Probability Propagation Method for
Device Performance Characterization under Process Variations. In Asia and South Pacific
Design Automation Conf. (ASP-DAC), pages 220–223, 2005.
[158] P. TRIVERIO, S. GRIVET-TALOCIA, M. S. NAKHALA, F. G. CANAVERO, and R. ACHAR. Stability,
Causality, and Passivity in Electrical Interconnect Models. IEEE Trans. on Advanced Packaging,
ing, 30(4):795–808, Nov. 2007.
[159] J. W. TSCHANZ, J. T. KAO, S. T. NARENDRA, R. NAIR, D. A. ANTONIADIS, A. P. CHANDRAKASAN,
and V. DE. Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter
Variations on Microprocessor Frequency and Leakage. IEEE Journal of Solid-State Circuits,
37(11):1396–1402, Nov. 2002.
[160] S. TSUKIYAMA, M. TANAKA, and M. FUKUI. A Statistical Static Timing Analysis Considering
Correlations Between Delays. In Asia and South Pacific Design Automation Conf. (ASP-DAC),
pages 353–358, 2001.
[161] N. TZARTZANIS and W. W. WALKER. Differential Current-Mode Sensing for Efficient On-Chip
Global Signaling. IEEE Journal of Solid-State Circuits, 40(11):2141–2147, Nov. 2005.
[162] E. B. VAN DER TOL, E. G. JASPERS, and R. H. GELDERBLOM. Mapping of H.264 Decoding on a
Multiprocessor Architecture. In Image and Video Communications and Processing, pages 707–
718, May 2003.
[163] J. WANG, P. GHANTA, and S. VRUDHULA. Stochastic Analysis of Interconnect Performance in
the Presence of Process Variations. In Intl. Conf. on Computer-Aided Design (ICCAD), pages
880–886, 2004.
[164] W.-S. WANG and M. ORSHANSKY. Robust Estimation of Parametric Yield under Limited Descriptions
of Uncertainty. In Intl. Conf. on Computer-Aided Design (ICCAD), pages 884–890,
2006.
[165] J. WATTS, N. LU, C. BITTNER, S. GRUNDON, and J. OPPOLD. Modeling FET Variation within a
Chip as a Function of Circuit Design and Layout Choices. In Nanotech Workshop on Compact
Modeling, pages 87–92, 2005.
[166] S.-C. WONG, T. G.-Y. LEE, D.-J. MA, and C.-J. CHAO. An Empirical Three-Dimensional
Crossover Capacitance Model for Multilevel Interconnect VLSI Circuits. IEEE Trans. on
Semiconductor Manufacturing, 13(2):219–227, May 2000.
[167] X. XI, M. DUNGA, J. HE, W. LIU, K. M. CAO, X. JIN, J. J. OU, M. CHAN, A. M. NIKNEJAD, and
C. HU. BSIM4.3.0 MOSFET Model - User’s Manual. University of California, Berkeley, 2003.
[168] J. XIONG, K. TAM, and L. HE. Buffer Insertion Considering Process Variation. In Design
Automation and Test in Europe (DATE), pages 970–975, 2005.
[169] J. XIONG, V. ZOLOTOV, and L. HE. Robust Extraction of Spatial Correlation. IEEE Trans. on
Computer-Aided Design (CAD) of Integrated Circuits and Systems, 26(4):619–631, Apr. 2007.
[170] L. YAN, J. LUO, and N. K. JHA. Combined Dynamic Voltage Scaling and Adaptive Body
Biasing for Heterogeneous Distributed Real-Time Embedded Systems. In Intl. Conf. on
Computer-Aided Design (ICCAD), pages 30–37, 2003.
[171] X. YE, P. LI, and F. LIU. Practical Variation-Aware Interconnect Delay and Slew Analysis for
Statistical Timing Verification. In Intl. Conf. on Computer-Aided Design (ICCAD), 2006.
[172] G. YU, W. DONG, Z. FENG, and P. LI. A Framework for Accounting for Process Model Uncertainty
in Statistical Static Timing Analysis. In Design Automation Conf. (DAC), pages
829–834, 2007.
[173] P. YU, S. X. SHI, and D. Z. PAN. Process Variation Aware OPC with Variational Lithography
Modeling. In Design Automation Conf. (DAC), pages 785–790, 2006.
[174] L. ZHANG, J. SHAO, and C. C. CHEN. Non-Gaussian Statistical Parameter Modeling for SSTA
with Confidence Interval Analysis. In Intl. Symp. on Physical Design (ISPD), pages 33–38,
2006.
[175] M. ZHANG, M. OLBRICH, H. KINZELBACH, D. SEIDER, and E. BARKE. A Fast and Accurate
Monte Carlo Method for Interconnect Variation. In IEEE Intl. Conf. on Integrated Circuit
Design and Technology (ICICDT), 2006.
[176] M. ZHANG, M. OLBRICH, D. SEIDER, M. FRERICHS, H. KINZELBACH, and E. BARKE. CMCal: An
Accurate Analytical Approach for the Analysis of Process Variations with Non-Gaussian
Parameters and Nonlinear Functions. In Design Automation and Test in Europe (DATE), pages
243–248, 2007.
List of Publications
[177] P. B. BACINSCHI and M. GLESNER. Modeling and Design of Organic Transistor Circuits. Workshop
of Analog Integrated Circuits, Kaiserslautern, Germany, Mar. 2006.
[178] P. B. BACINSCHI and M. GLESNER. A Multistep Extrapolated S-Parameter Model for Arbitrary
On-Chip Interconnect Structures. In IFIP/IEEE Intl. Conf. on VLSI-SoC, Florianópolis, Brazil,
Oct. 2009. Extended version accepted as book chapter to be published by Springer.
[179] P. B. BACINSCHI and M. GLESNER. Technology-Accurate Variability-Aware Performance
Macromodels for On-Chip Communication Synthesis. In IFIP/IEEE Intl. Conf. on VLSI-SoC,
PhD Forum, Florianópolis, Brazil, Oct. 2009.
[180] P. B. BACINSCHI and M. GLESNER. Variability-Aware Synthesis of On-Chip Communication
Architectures. Workshop “Design of Future Reliable Systems from Unreliable Fabrics”,
Munich, Germany, July 2009.
[181] P. B. BACINSCHI, T. MURGAN, K. KOCH, and M. GLESNER. Process Variations Robust Design
of Buffers and Schmitt Triggers for a Hierarchical DLL Timing Measurement Framework.
In edaWorkshop, pages 47–52, Hannover, Germany, June 2007.
[182] P. B. BACINSCHI, T. MURGAN, K. KOCH, and M. GLESNER. Variability-Aware Design of CMOS
Schmitt Triggers for On-Chip Timing Measurement Frameworks. Workshop “Analogschaltungen”,
Freiburg, Germany, Mar. 2007.
[183] P. B. BACINSCHI, T. MURGAN, K. KOCH, and M. GLESNER. An Analog On-Chip Adaptive Body
Bias Calibration for Reducing Mismatches in Transistor Pairs. In Design Automation and Test
in Europe (DATE), pages 698–703, Munich, Germany, Mar. 2008.
[184] M. MOMENI, P. B. BACINSCHI, and M. GLESNER. Comparison of Opamp-Based and
Comparator-Based Delta-Sigma Modulation. In Design Automation and Test in Europe
(DATE), pages 688–693, Munich, Germany, Mar. 2008.
[185] T. MURGAN, P. B. BACINSCHI, A. GARCÍA ORTIZ, and M. GLESNER. Partial Bus-Invert Bus
Encoding Schemes for Low-Power DSP Systems Considering Inter-Wire Capacitance. In
Intl. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pages
169–180, Montpellier, France, Sept. 2006.
[186] T. MURGAN, P. B. BACINSCHI, and M. GLESNER. Exploiting the Bit-Level Correlation of DSP
Signals for Low Power Coding Schemes Construction in Capacitively Coupled Buses. In
edaWorkshop, pages 15–20, Hannover, Germany, June 2007.
[187] T. MURGAN, P. B. BACINSCHI, S. PANDEY, A. GARCÍA ORTIZ, and M. GLESNER. On the Necessity
of Combining Coding with Spacing and Shielding for Improving Performance and Power
in Very Deep Sub-Micron Interconnects. In Intl. Workshop on Power and Timing Modeling,
Optimization and Simulation (PATMOS), pages 242–254, Göteborg, Sweden, Sept. 2007.
[188] T. MURGAN, A. GUNTORO, H. HINKELMANN, P. B. BACINSCHI, and M. GLESNER.
Low-Complexity Adaptive Encoding Schemes Based on Partial Bus-Invert for Power Reduction
in Buses Exhibiting Capacitive Coupling. In Intl. Workshop on Reconfigurable Communication
Centric System-on-Chips (ReCoSoC), pages 7–14, Montpellier, France, June 2007.
[189] T. MURGAN, O. MITEA, S. PANDEY, P. B. BACINSCHI, and M. GLESNER. Simultaneous Placement
and Buffer Planning for Reduction of Power Consumption in Interconnects and Repeaters.
In IFIP Intl. Conf. on VLSI-SoC, pages 302–307, Nice, France, Oct. 2006.
Supervised Theses
[190] P. BEDNAROVSCHI. Prognostics and Health Management Systems for Electronics.
Studienarbeit, Technische Universität Darmstadt, May 2010. Co-advised with Andre Guntoro.
[191] E. CHENG. Performance Modeling of Pulsed Current-Mode Signaling Circuits.
Studienarbeit, Technische Universität Darmstadt, July 2008.
[192] M. HOEHL. Scheduling and Mapping of Data Processing Tasks for a MPEG-4 SoC.
Studienarbeit, Technische Universität Darmstadt, Sept. 2008.
[193] Y. HU. Equivalent Cell Models. Master’s thesis, Technische Universität Darmstadt, Oct.
2007. Co-advised with Klaus Koch and Tudor Murgan.
[194] M. K. MEBRAHTU. Accurate In-Situ Measurement of Crosstalk-induced Delay Change and
Waveform Reconstruction. Master’s thesis, Technische Universität Darmstadt, Oct. 2006.
Co-advised with Tudor Murgan.
[195] T. VOLLBERG. Präzise On-Chip Messung der durch Übersprechen erzeugten
Spannungsspitzen. Master’s thesis, Technische Universität Darmstadt, Feb. 2005. Co-advised
with Tudor Murgan.
Curriculum Vitae
Petru Bogdan BACINSCHI
Personal details:
Date of birth: July 24, 1981
Place of birth: Luduș, Romania
Education:
1996 to 2000 “Cantemir Vodă” High School in Bucharest
Qualification: Abitur (university entrance qualification)
2000 to 2005 Student at the Faculty of Electronics, Telecommunications
and Information Technology, ‘Politehnica’ University of
Bucharest
Degree: Diplomingenieur (Dipl.-Ing.)
2005 to 2010 Doctoral candidate at the Institute of Microelectronic Systems,
Technische Universität Darmstadt
Professional experience:
2001 to 2005 Software engineer at Crystal Interactive Systems S.R.L.
in Bucharest
2005 to 2010 Research assistant at the Institute of Microelectronic Systems,
Technische Universität Darmstadt
since August 1, 2010 Hardware architect at Infineon Technologies AG in
Neubiberg
