Variation-aware and adaptive timing optimisation methods in field programmable gate arrays by Guan, Zhenyu
Imperial College London of Science,
Technology and Medicine
Department of Electrical and Electronic Engineering
Variation-aware and adaptive timing
optimisation methods in Field
Programmable Gate Arrays
Zhenyu Guan
Submitted in part fulfilment of the requirements for the degree of
Doctor of Philosophy in Electrical and Electronic Engineering
of Imperial College London.

Abstract
This thesis proposes optimisation methods for improving the timing performance
of digital circuits implemented in Field-Programmable Gate Arrays (FPGAs) with
the knowledge of process variation. With the current trend of transistor scaling,
improvements in fabrication processes alone will not completely solve the problem
of process variability, owing to the physical limitations of the process and
materials. Therefore, higher-level optimisation strategies, such as variation-aware
and adaptive design, are required to alleviate the erosion of overall timing
performance. Three novel optimisation methods, namely variation-aware placement,
routing and retiming, are introduced in this thesis to reduce the impact of process
variation on FPGAs using measured variation maps.
By measuring and mapping real delay variation on FPGAs, traditional delay models
can be replaced with actual delay maps that allow variation-aware design methods
to be applied to produce better-optimised designs on FPGAs. In this thesis, we
propose a new two-stage classification-based placement methodology to alleviate the
impact of delay variability while maintaining practical computational complexity and
execution time. In addition, a variation-aware partial re-routing method is introduced
to improve the timing performance of designs by re-routing a portion of critical
and near-critical paths. Finally, a variation-aware retiming method is proposed to
further enhance timing performance after placement and routing.
The proposed two-stage placement, partial re-routing and retiming methods provide
a 13% timing improvement, comparable to the 19% achieved by full chipwise
optimisation, while running about 20 times faster than full chipwise methods.
Overall, the observed timing improvement and reduction in execution time for MCNC
benchmarks with the proposed optimisation methods demonstrate their effectiveness
and practicality against delay variability in FPGAs.
Acknowledgements
The work in this thesis was carried out under the supervision of Prof. Peter Y. K.
Cheung and Prof. George A. Constantinides. Firstly, I wish to express my gratitude
to my supervisor, Prof. Peter Y. K. Cheung, who was always patient and enthusiastic
in supporting me in this project. During my PhD, Peter offered me invaluable
assistance, support and guidance. I have learned a lot from Peter, not only in
research but also in life. I also want to thank my second supervisor, Prof. George
A. Constantinides, who gave me invaluable advice on my research. In addition, he
patiently helped me to organise my reports, papers and thesis.
I would also like to thank Dr. Justin S. J. Wong and Dr. Sumanta Chaudhuri
for their patience and support with technical problems. Without their help, it
would have been impossible for me to finish my experiments. I would like to thank
the other members of the “variability club”, who gave me brilliant ideas. I thank all
my friends in my research group for their support.
Lastly, I would like to thank my parents and my wife for their unconditional love
and support throughout my PhD study.
Copyright Declaration
The copyright of this thesis rests with the author and is made available under a
Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers
are free to copy, distribute or transmit the thesis on the condition that they
attribute it, that they do not use it for commercial purposes and that they do not
alter, transform or build upon it. For any reuse or redistribution, researchers must
make clear to others the licence terms of this work.
Contents
Abstract
Acknowledgements
1 Introduction
1.1 The Impending Impact of Variability
1.2 The Solutions and Opportunities with Reconfigurability
1.3 Overview
1.4 Statement of Originality
1.5 Publications
2 Background
2.1 Introduction
2.2 Variation in Advanced Electrical Devices
2.2.1 Sources of Variability in Advanced Integrated Circuits
2.2.2 Classification of Process Variation
2.2.3 Timing of FPGAs under Process Variation
2.3 Measurement of Process Variation for FPGAs
2.3.1 At-Speed Testing
2.3.2 Built-in Self-test (BIST)
2.3.3 FPGA Delay Measurement Methods
2.4 Methodologies to Mitigate Process Variation
2.4.1 Worst-case Timing and Guard-banding
2.4.2 Speed-binning
2.4.3 Variation-aware Optimization in CAD
2.4.4 Summary of Existing Methodologies
2.5 Placement, Routing and Retiming Methods
2.5.1 FPGAs Architecture
2.5.2 Placement
2.5.3 Routing
2.5.4 Limitation of VPR Simulation
2.5.5 Retiming
2.6 Summary and Discussion
3 Two-stage Variation-aware Placement
3.1 Introduction
3.2 Two-stage Variation-aware Placement
3.2.1 Algorithm Design
3.2.2 Variation Maps
3.2.3 Two-stage Variation-aware Placement Based on VPR
3.3 Experiment and Results
3.3.1 Experiment Setup
3.3.2 Experiment Flow
3.3.3 Experiments and Results of Two-stage Placement
3.4 Discussion on Quality of Two-stage Placement
3.4.1 Comparison of Run Time Cost between Chipwise and Two-stage Variation-aware Placement
3.4.2 Choosing the Number of Classes
3.4.3 The Effect of FPGA Utilisation on Timing Improvement
3.4.4 Classification Enhancement
3.5 Conclusion
4 Partial re-routing
4.1 Introduction
4.2 Delay and Process Variation Modeling
4.2.1 Delay Model
4.2.2 Generation of Variation Maps
4.3 Variation-aware Routing and Partial re-routing
4.3.1 Variation-aware Cost Function
4.3.2 Variation-aware Timing Analysis
4.3.3 Overhead of Execution Time for Variation-aware Routing
4.3.4 Partial re-routing
4.4 Experiment and Results
4.4.1 Experiment Flow
4.4.2 Experiment Setup
4.4.3 Results of Variation-aware Partial re-routing
4.4.4 Comparison of Run Time Cost between Chipwise routing and Partial re-routing
4.5 Conclusion
5 Variation-aware Retiming
5.1 Introduction
5.2 Concept and Potentials
5.3 Generation of Variation Maps
5.3.1 Measured Variation Maps
5.3.2 Amplification of Variation Maps
5.4 Retiming Algorithm
5.4.1 FF constraints for Modern FPGAs
5.4.2 Formal Description
5.4.3 Variation-Aware Retiming Algorithm
5.5 Experiments and Results
5.5.1 Design of Experiments
5.5.2 Results
5.6 Conclusion
6 Conclusion
6.1 Optimisation for Process Variation
6.2 Future work and Opportunities
Bibliography
Glossary
Appendix
List of Tables
2.1 Summary of variation measurement strategies.
2.2 The boundaries of each bin in Fig. 2.7.
2.3 Comparison between different optimisation strategies.
2.4 Temperature update schedule.
3.1 Results of variation blind, two-stage and chipwise.
List of Figures
2.1 Classification of Variability [1].
2.2 σVth as a function of technology nodes based on predictive technology models [2].
2.3 Extreme device variations will become more typical in the future [3].
2.4 A typical Ring Oscillator (RO) structure using inverters.
2.5 Circuit diagram of the transition probability (TP) measurement method [4].
2.6 Example of delay of 4 inverters with worst-case timing method.
2.7 Example of speed-binning.
2.8 Example of delay of 4 inverters with worst-case timing method.
2.9 The delay probability density of variation-blind and SSTA-driven optimization.
2.10 Relocating regions in an FPGA [5].
2.11 Overview of FPGA island-style architecture.
2.12 Structure of basic logic element (BLE) and logic array block (LAB).
2.13 Overview of FPGA island-style architecture.
2.14 Structures of three switch boxes.
2.15 Example of routing based on maze/Dijkstra algorithm for two nets [6].
2.16 Example of routing based on A* search algorithm [6].
3.1 One practical flowchart of two-stage placement.
3.2 One variation map measured on Cyclone III FPGA.
3.3 Possible curve for the overall process correlation according to Eq. 3.4b (theoretically ideal case).
3.4 Overall process variation correlation curve for 129 variation maps according to Eq. 3.4b (real case).
3.5 Results of classification on variation maps.
3.6 Extraction (a) and amplification (b) of random variation from measurements.
3.7 An example to explain the change of cost due to logic blocks swap.
3.8 Work flows for variation-blind, two-stage placement and full chipwise placement.
3.9 Density of delay of critical path for frisc.
3.10 Run time cost of SSTA chipwise and two-stage placement.
3.11 Number of clusters (k) against critical path delay for frisc.
3.12 FPGA Utilisation Ratio (Ur) against critical path delay for frisc.
3.13 Density of critical path produced by k-means classification against PCA with k-means for frisc.
4.1 Approximated delay model, variation is modeled for active components: Vth variation.
4.2 Coordinate system and region-based structure used by VPR router.
4.3 Example of delay of paths before and after partial re-routing.
4.4 Dashed lines are reserved tracks used for variation-aware re-routing.
4.5 Work flows for variation-blind, partial re-routing and full chipwise routing.
4.6 The delay of critical path for 20 MCNC circuits performing variation-blind, partial re-routing and variation-aware routing.
4.7 Density of delay of critical path for ex5p.
4.8 The timing improvement made by partial re-routing by scaling Crit_T from 0.5 to 0.95.
4.9 Improvement of execution time by varying Crit_T, and the number of FPGAs for the ex5p benchmark.
5.1 Motivational example: circuits with several retiming choices with equivalent logic depth can lead to great improvement through variation-aware retiming.
5.2 Variation map example: one measured variation map on Cyclone III FPGA.
5.3 Example of post P&R retiming.
5.4 The work flow of retiming.
5.5 Variation-blind and aware retiming on (a) sequential and (b) modified combinatorial MCNC circuits.
6.1 The timing improvement made by combined optimisation methods.
Chapter 1
Introduction
1.1 The Impending Impact of Variability
As Complementary Metal-Oxide-Semiconductor (CMOS) technology continues to
scale beyond 45 nm, process variation is increasingly affecting the performance of
electronic circuits. By classifying Very Large-Scale Integration (VLSI) circuits into
different timing performance levels and setting a pessimistic guard band for devices
at each level, the manufacturer can maintain a device’s reliability at specific
operating conditions for the majority of applications [7]. However, as predicted by
the International Technology Roadmap for Semiconductors (ITRS), it has become more
difficult and expensive to maintain uniform electrical characteristics across a chip
in the presence of process variations [8]. Considering the physical limitations of
silicon-based electronics [9], the problem of process variability cannot be completely
solved by improvements in fabrication processes. The gate length error is predicted
to increase from 5.4% at the 65 nm technology node to 14% at the 25 nm node. ITRS
2013 states that “process variations are getting increasingly important in the future
technology”. One of the difficult questions to answer with yield enhancement is “how
and where to stop the process variation” [8]. There is unlikely to be a definitive
answer, but it seems that enhancements through improved fabrication processes can
only go so far before we are forced to take alternative countermeasures.
In recent years, Field-Programmable Gate Arrays (FPGAs) have become one of the
most popular implementation media for digital circuits. Like any advanced VLSI
device, FPGAs face the same process variation issues. At-speed testing
techniques combined with speed-binning have been used to maintain reliability for
FPGAs under delay variability [10, 11]. To guarantee the timing performance
and reliability of FPGA products, either the physical yield of FPGA chips
is compromised through a more rigorous manufacturing test, or a more
conservative timing model that takes variation into account is used. However, it is
impossible to have one universal model that precisely describes the timing of all
components under process variation for a large number of FPGAs, because each of
them may have a unique variation pattern. Therefore, existing commercial FPGA
design tools rely on a worst-case delay model of FPGA components, such that
all possible patterns of process variation are covered for all FPGAs.
Nonetheless, this approach may result in a sub-optimal operating speed, because
certain components on a given chip may be able to operate faster than the
worst-case model assumes.
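The pessimism of a worst-case delay model can be illustrated with a small sketch. It is not taken from any commercial tool: the component names, the three chips and all delay values (in picoseconds) are hypothetical, and a path delay is simply assumed to be the sum of its component delays.

```python
# Hypothetical measured delays (ps) of three components on three FPGAs.
# All names and numbers are illustrative assumptions, not thesis data.
delay_maps = {
    "chip_a": {"lut0": 1000, "lut1": 1300, "lut2": 1100},
    "chip_b": {"lut0": 1200, "lut1": 1000, "lut2": 1000},
    "chip_c": {"lut0": 900, "lut1": 1100, "lut2": 1400},
}


def worst_case_model(maps):
    """Per-component delay that is safe for every chip (maximum over chips)."""
    components = next(iter(maps.values())).keys()
    return {c: max(chip[c] for chip in maps.values()) for c in components}


def path_delay(delays, path):
    """Assumed path delay: the sum of the delays of the components on the path."""
    return sum(delays[c] for c in path)


path = ["lut0", "lut1", "lut2"]
worst_case_delay = path_delay(worst_case_model(delay_maps), path)
per_chip_delay = {name: path_delay(chip, path) for name, chip in delay_maps.items()}

print("worst-case model:", worst_case_delay, "ps")
for name, delay in per_chip_delay.items():
    print(name, "measured:", delay, "ps")
```

In this toy data set no single chip is as slow as the worst-case model predicts, because the slowest instance of each component sits on a different chip. This is the margin that per-chip variation maps allow a design flow to recover.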
1.2 The Solutions and Opportunities with Reconfigurability
Notwithstanding the impact of variation, the unique reconfigurability and flexibility
of FPGAs bring great opportunities for variation-aware methodologies that
not only alleviate the initial process variation problem but also go beyond that by
exploiting variation to improve the overall delay performance of a design. Efficient
methods for characterising delay variation in individual FPGAs have recently been
developed [12], which allow us to obtain a delay variation map for any given FPGA
with relative ease. This gives us the opportunity to devise different variation-aware
design methods using variation maps measured from actual FPGAs. These methods
could be incorporated into Computer-Aided Design (CAD) tools to increase the timing
yield of FPGA designs with relatively low computational overhead in the design
flow.
With modified full chipwise variation-aware CAD tools, the reliability and/or
timing performance of a circuit in an FPGA under process variation can be
drastically improved in terms of operating frequency and throughput. However, such
an approach is not practical because of the huge execution time overhead of
optimising every chip individually. By classifying FPGAs into a small number of
classes based on their variation maps and performing placement optimisation for
each class instead of each chip, two-stage placement can greatly reduce the
execution time while achieving timing improvement similar to that of full chipwise
placement. While variation-aware placement provides the largest speed improvements,
variation-aware routing can yield further gains. Furthermore, by performing
variation-aware routing only on a small portion of critical and near-critical paths,
the re-routing method described in this thesis can speed up the routing process
significantly. Lastly, after variation-aware place and route (P&R), the timing
performance can be further improved by variation-aware retiming with the measured
variation map.
1.3 Overview
Chapter 2 first gives an overview of the impact of process variation on electronic
devices such as FPGAs. The methods used to measure FPGAs, which collect the
process variation information used by variation-aware optimisation, are then
introduced. After that, existing variation-aware optimisation methodologies are
reviewed and critically assessed in terms of delay improvement and execution time.
A brief description is given of the FPGA CAD tool Versatile Place and Route
(VPR), which can be modified to be variation-aware in order to simulate the FPGA
design flow and explore the potential improvement that can be achieved.
Chapter 3 introduces full-chip variation-aware placement and proposes a two-stage
variation-aware placement method that benefits from the optimality of full chipwise
placement but requires only a fraction of the total execution time for a large number
of FPGAs with different variation patterns. By classifying variation maps into a
finite number of classes, each class requires only one variation-aware placement
execution with its median variation map, regardless of how many FPGAs are in the
class, thus saving a significant amount of execution time. The k-means methodology
[13] is used in this chapter to classify variation maps into different classes.
In addition, a trade-off between timing performance and execution time can be made
by varying the number of classes according to the requirements of end-users.
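The classification stage can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: variation maps are assumed to be flattened lists of per-location delays (ps), k-means is seeded with the first k maps for determinism, and the "median variation map" of a class is assumed to be the element-wise median of its members. All data values are made up.

```python
from statistics import median


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_map(maps):
    return [sum(column) / len(maps) for column in zip(*maps)]


def kmeans_labels(maps, k, iterations=20):
    """Plain k-means; returns one class label per variation map."""
    centroids = [list(m) for m in maps[:k]]  # assumption: seed with first k maps
    labels = [0] * len(maps)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(m, centroids[c]))
            for m in maps
        ]
        for c in range(k):
            members = [m for m, label in zip(maps, labels) if label == c]
            if members:
                centroids[c] = mean_map(members)
    return labels


def representative_maps(maps, labels):
    """Element-wise median map of each class (assumed meaning of 'median map')."""
    classes = {}
    for m, label in zip(maps, labels):
        classes.setdefault(label, []).append(m)
    return {
        label: [median(column) for column in zip(*members)]
        for label, members in classes.items()
    }


# Six hypothetical chips, four locations each: some are slow on the right
# half of the die, the others are slow on the left half.
variation_maps = [
    [100, 100, 120, 120],
    [120, 120, 100, 100],
    [102, 98, 121, 119],
    [118, 122, 101, 99],
    [101, 99, 119, 122],
    [121, 119, 99, 102],
]

labels = kmeans_labels(variation_maps, k=2)
representatives = representative_maps(variation_maps, labels)
print(labels)
print(representatives)
```

With two classes, only two variation-aware placement runs (one per representative map) would be needed for the six chips, instead of six chipwise runs. Increasing k moves the result closer to chipwise placement at the cost of more runs.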
Chapter 4 describes a partial re-routing method to reduce the delay of the critical
path of a design using the measured variation map. To apply the partial re-routing
method, a variation-blind routing process is first executed to produce
a basic routing configuration and find the minimum required channel width. A
certain amount of routing resources is reserved during variation-blind routing for
partial re-routing. Later, the critical and near-critical paths are re-routed with
variation maps to reduce the delay of the critical path. This work is implemented by
modifying VPR’s timing-driven router to take variation into account in the timing
analysis and the cost function. By performing variation-aware routing only on the
portion of paths with high criticality, the routing process is faster than
full chipwise routing and achieves better timing performance than variation-blind
routing.
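The selection of paths for partial re-routing can be sketched as a simple threshold test. This is an illustrative assumption rather than the exact rule used in Chapter 4: here the criticality of a path is taken to be its delay divided by the critical path delay, and the threshold plays the role of the Crit_T parameter named in the list of figures. Path names and delays (ps) are hypothetical.

```python
def select_paths_for_rerouting(path_delays, crit_threshold):
    """Return the paths whose criticality reaches the threshold, slowest first.

    Criticality is assumed here to be path delay / critical path delay.
    """
    critical_delay = max(path_delays.values())
    selected = [
        name
        for name, delay in path_delays.items()
        if delay / critical_delay >= crit_threshold
    ]
    return sorted(selected, key=lambda name: path_delays[name], reverse=True)


# Hypothetical path delays from a variation-blind routing run (ps).
path_delays = {"p0": 1000, "p1": 950, "p2": 700, "p3": 400}

high_threshold = select_paths_for_rerouting(path_delays, 0.9)
low_threshold = select_paths_for_rerouting(path_delays, 0.5)
print(high_threshold)
print(low_threshold)
```

A higher threshold re-routes fewer paths and therefore runs faster, while a lower threshold approaches full variation-aware routing.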
Chapter 5 proposes a variation-aware post-P&R retiming method to counteract
process variation in FPGAs. This method is facilitated by the presence of abundant
unused registers in FPGAs and takes into account exact variation maps (measured
on FPGAs), as opposed to Statistical Static Timing Analysis (SSTA), which models
delay variations with statistical distributions. Also, it is applied after P&R and
only requires reprogramming the positions of flip-flops. We show that for typical
designs with several retiming choices, their critical path delays benefit greatly from
this method in the face of variation.
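The benefit of choosing between retiming options with measured delays can be shown on a toy example. The sketch below is a deliberate simplification and not the algorithm of Chapter 5: the circuit is assumed to be a single chain of combinational elements with one register to place, and all delays (ps) are hypothetical.

```python
def period_for_cut(delays, cut):
    """Clock period when the register is placed just before element `cut`."""
    return max(sum(delays[:cut]), sum(delays[cut:]))


def best_register_position(delays):
    """Return (cut, period) minimising the period; the first minimum wins ties."""
    best = None
    for cut in range(1, len(delays)):
        period = period_for_cut(delays, cut)
        if best is None or period < best[1]:
            best = (cut, period)
    return best


# Nominal delays: cuts 2 and 3 look equivalent (both give a 300 ps period).
nominal = [100, 100, 100, 100, 100]
# Hypothetical measured delays of the same elements on one particular chip.
measured = [80, 80, 90, 120, 130]

blind_cut, _ = best_register_position(nominal)
blind_period = period_for_cut(measured, blind_cut)
aware_cut, aware_period = best_register_position(measured)

print("variation-blind: cut", blind_cut, "period", blind_period, "ps")
print("variation-aware: cut", aware_cut, "period", aware_period, "ps")
```

Both candidate positions have equivalent logic depth under the nominal model, so a variation-blind tool has no reason to prefer one over the other. With the measured delays, only one of them balances the two stages on this chip.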
Chapter 6 concludes the thesis and discusses future work and opportunities.
1.4 Statement of Originality
The main contributions presented in this thesis are covered separately in Chapters
3, 4 and 5. Further contributions are described in the introduction of each chapter,
but the key contributions are summarized as follows.
• A CAD work flow that performs variation-aware P&R and retiming to optimise
the timing performance for FPGAs with measured variation maps.
• The development of a two-stage variation-aware placement algorithm with
classification of variation maps to improve timing performance of circuits on
FPGAs. A detailed study and simulation of the trade-off between timing
performance and execution time with a flexible number of variation classes.
This work is described in Chapter 3 and formed the subject of one of our
publications [14].
• The development and validation of the partial re-routing methodology. A
detailed study of re-routing different numbers of critical and near-critical paths.
• A post-P&R variation-aware retiming methodology to alleviate process
variation.
1.5 Publications
In 2012, a paper describing the methods for two-stage variation-aware placement
was published in the proceedings of the International Conference on Field
Programmable Logic and Applications [14]. This work was later developed further
to form part of a paper published in IEEE Design and Test of Computers [15]
in 2013, which gives a clear overview of how our two-stage placement method
contributes to the overall framework of variation-adaptive design to improve timing
yield and reliability. In addition, a paper on variation-aware retiming is pending
publication by the International Conference on Field Programmable Logic
and Applications in 2013; it proposes a method to further improve circuit timing
performance after placement and routing.
Chapter 2
Background
2.1 Introduction
In the last five decades, the performance of advanced VLSI has been improved by
Moore’s-Law-driven technology scaling [16]. As transistor size continues to decrease,
a variety of challenges need to be overcome [3]. One of the challenges faced by
the semiconductor industry is to alleviate the impact of process variation on
timing performance [17, 18, 19]. The effects of variation can be tackled at various
levels in the design flow. For both ASICs and FPGAs, variation-aware optimisation
methodologies are designed to improve yield. The approach taken has significantly
influenced the development of parametric test techniques, circuit design methods
and tools [10, 20]. Compared with ASICs, FPGAs have two distinct advantages
owing to their reconfigurability. Firstly, the actual timing performance of each FPGA
(its variation map) can be measured and characterised by configuring the device with
Built-In Self-Test (BIST) circuits [12, 21]. Secondly, it is possible to make use of the
measured variation maps to improve the timing performance of a specific design
through variation-aware optimisation methods.
In this chapter, the literature on process variation in FPGAs and on relevant
variation-aware optimisation methods and tools for mitigating process variation
is reviewed. This includes:
• An introduction to variability in advanced electronic devices: a discussion of
the importance of process variation.
• The methodology for collecting variation maps in FPGAs: the foundation
of the variation-aware optimisation methods proposed in this PhD project.
• Relevant previous research that attempts to mitigate process variation by
high-level variation-aware optimisation: this identifies the major problems
left unsolved in this field.
• An introduction to FPGA architecture and to algorithms for FPGA placement,
routing and retiming: the simulation and experiment environment used in this
PhD project.
2.2 Variation in Advanced Electrical Devices
As transistor density continues to grow, the minimum feature sizes in semiconductor
devices are approaching scales where it is difficult and expensive to achieve
uniformity in manufacturing. This results in variability in the timing performance of
circuits [22, 23]. Historically, variability has occurred between wafers, lots and dies.
The speed-binning process has been used to compensate for this effect by marking
chips with different speed grades. However, the variation between components of
the same type within a single die is now increasing dramatically due to both
intra-die systematic and stochastic effects [24].
2.2.1 Sources of Variability in Advanced Integrated Circuits
The electrical behaviour of advanced integrated circuits is affected by three main
sources of variation: environmental, physical and long-term aging sources [25]. The
features of each variation source are as follows:
• Environmental factors
– These are variations in the ambient and operational conditions of a circuit,
such as voltage and temperature fluctuations.
– Circuit delay may change dramatically depending on operating conditions.
– It is hard to model the relationship between delay variability and
environmental conditions.
– Correlations between circuit timing and environmental factors are also
application or operation dependent.
• Physical factors during the manufacturing process
– These are variations due to random dopant fluctuation, non-ideal patterning
and implantation in the physical fabrication process.
– Process variations are permanent once the chip has been fabricated.
– They contain spatially correlated variation related to physical distance.
– Variation patterns are unique to each chip because of uncertainty during
fabrication.
• Long-term aging factors
– These are non-uniform timing degradations of nominally identical components.
– They develop over time due to uneven electrical and temperature stresses
induced by circuit operation across each chip.
– They are more predictable than process variation but difficult to control.
Environmental factors are mainly caused by changes in temperature and voltage.
These variations depend on the operating conditions, which are hard to predict with
statistical models. With dynamic reconfigurability and online chip monitoring [26],
the variability caused by environmental factors can be tackled to improve the timing
performance for specific applications. In addition, variation due to circuit aging,
caused by environmental and long-term aging (operational stress) factors, can be
carefully considered during the design phase [27].
The research in this PhD thesis focuses on optimisation methods targeting the second
category of variation sources, i.e. process variation induced by physical factors
during chip fabrication, which cause fluctuations in the values of process parameters
observed after fabrication.
2.2.2 Classification of Process Variation
Process variation is the variation in the attributes of transistors when integrated
circuits are fabricated. There are several classifications of process variations. One
typical overview of process variation is shown in Fig. 2.1. Variability is classified
into the following five categories with different features [1].
Figure 2.1: Classification of Variability [1]: (a) wafer to wafer, (b) inter-die,
(c) intra-die, (d) systematic (spatial correlation), (e) stochastic.
Wafer to wafer variation describes the variation between different wafers. The
source of this variation is the change in machine conditions between the
manufacturing process of each wafer.
Inter-die variation is the variation between dies on the same wafer, which can be
caused by any on-wafer non-uniformity, e.g. in temperature and gas flow. Time
dependence of lithography exposures may also be responsible. Inter-die variation
can be represented by a single value for each die, which represents a shift
from the nominal value. Inter-die variations are easily captured
by corner models, which are of more interest to the industry. Commercial
FPGAs have been assigned different speed grades through the speed-binning
method [28].
Intra-die variation describes the variability within one die. This variation typically
originates from lithography steps, because pattern exposure is performed
die-by-die. It may also be caused by imperfections in reticles, non-idealities
in lens systems, and imperfections within the silicon substrate, doping
process and metal wires. Intra-die variation refers to the non-uniformity of the
electrical characteristics of components of the same type across the chip [29,
30, 31]. Currently, intra-die variation has not been dealt with efficiently by
the industry, since there are no existing practical methods that are applicable
to the conventional way FPGAs are used. For variation-aware design, intra-die
process variation should be modelled or measured accurately. However,
characterising intra-die process variation is typically more challenging than
characterising inter-die process variation.
Systematic/spatial correlation variation describes the deterministic
part of the variation, which is caused by identifiable features such as mask errors
from inaccuracies in the process model, lithographic off-axis focusing errors,
and reticle stepper alignment errors. The timing variation caused by this
kind of variation shares the same pattern across all dies in a wafer, or is
spatially correlated within a die. For example, interlayer dielectric-thickness
variation is systematic and depends on layout density [32].
Historically, systematic process variation has been of interest to semiconductor
manufacturers because it is a strong driver of yield; a variety of strategies have
been proposed to improve yield in each process generation [9].
Stochastic variation is the variation which is independent of distance and
location. The sources of stochastic variation include vibrations during
lithography, wafer unevenness, and non-uniformity in resist thickness.
Fig. 2.2 shows the predicted stochastic process variation across technology
nodes. A significant increase in the variation of Vth, the transistor's threshold
voltage, can be seen as a result of the combination of Random Dopant
Fluctuations (RDF), Line Edge Roughness (LER) and Oxide Thickness
Fluctuations (OTF) [21, 2].
Figure 2.2: σVth as a function of technology nodes based on predictive technology
models [2].
In recent years, stochastic variation has increased dramatically due to the smaller
length of transistors and the reduction in operating voltages [7]. More
importantly, several sources of stochastic variation are intrinsic to the materials
used in fabrication [5]. It is impossible to eliminate process variation by only
improving the fabrication process because of the physical limitations of
semiconductors [33, 34]. The seminal paper by Pelgrom in 1989 proposed a clear
relationship between the area of a metal-oxide-semiconductor field-effect
transistor (MOSFET) device and the local threshold voltage variation [35].
According to Pelgrom's Law, the uncorrelated random Vth variation at a
technology node is proportional to $1/\sqrt{WL}$, where W and L are the width
and length of the transistor. Thus it increases from generation to generation as
transistor dimensions shrink.
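The scaling implied by Pelgrom's Law can be illustrated with a minimal sketch. The matching coefficient `a_vt` and the device dimensions below are hypothetical values chosen only for illustration; they are not taken from [35] or from any measured process.

```python
import math


def sigma_vth(a_vt, width, length):
    """Pelgrom scaling: sigma(Vth) = A_VT / sqrt(W * L).

    a_vt   -- matching coefficient in mV*um (hypothetical value, an assumption)
    width  -- transistor width W in um
    length -- transistor length L in um
    """
    return a_vt / math.sqrt(width * length)


# Hypothetical device: shrinking both W and L by half doubles sigma(Vth).
older = sigma_vth(a_vt=2.0, width=0.2, length=0.1)
newer = sigma_vth(a_vt=2.0, width=0.1, length=0.05)
ratio = newer / older
```

Since the area W·L shrinks by a factor of four between the two hypothetical devices, the random Vth spread grows by a factor of two, which is the trend described above.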
2.2.3 Timing of FPGAs under Process Variation
Impact of Process Variation on FPGAs
In recent years, Field-Programmable Gate Arrays (FPGAs) have become one of the
most popular implementation media for digital circuits. Similar to ASICs, FPGAs
also face the challenges of process variation as the transistor size shrinks into the
deep submicron domain. An at-speed testing technique combined with speed-binning
has been used to improve production yield for FPGA vendors.
Figure 2.3: Extreme device variations will become more typical in the future [3].
(Plot of the relative number of transistors operating at a given Vth against Vth in
mV, for 22 nm and 16 nm technology, with the mean and worst-case values marked.)
Traditionally, the worst-case delay of components across each FPGA is chosen as
the nominal delay by FPGA vendors to guarantee that all FPGA designs work correctly
within the maximum operating frequency defined by the speed-grade. However,
the actual delay of nominally identical components on an FPGA varies over a range
of values because of process variation. The intra-die stochastic Vth variation can be
described by a Probability Density Function (PDF), and is expected to become more
significant [3, 35] as shown in Fig. 2.3. In Fig. 2.3, the target Vth of the transistor
in 22 nm technology is 150 mV (σ/µ = 10%). With more advanced technology in the
future such as 16 nm, the mean value of Vth will decrease to 100 mV as predicted
in [36], but the spread is expected to increase further (σ/µ = 30%) because of
process variation [37]. To guarantee that every FPGA product meets its timing
constraints, vendors must either degrade the nominal operating frequency by setting
a large guard-band, or risk a significant drop in production yield, where a large
portion of chips fail the guard-banded timing target. Therefore, a high-level method
that optimises the performance of FPGAs without requiring improvements in the
fabrication process is necessary; such a method is expected to increase the yield
and timing performance of FPGAs with a relatively small increase in cost.
Region-based Variation Model/Map
A region-based variation model/map is used in this thesis, which is defined as
a set of regions (grid cells) that are superimposed on top of the chip area [32].
All process parameters within the same region are assumed to have the same timing
characteristics. The spatial correlation for process parameters within one region is
always one (perfectly correlated).
The region-based variation model can be adapted to handle more fine-grained
variation by varying the number, size and shape of the regions. One extreme case
(upper bound) of this fine-grained region-based variation model is where every
element, such as a switch or lookup table (LUT), forms its own region on an FPGA.
With this upper bound, the optimisation methodology can make full use of the
variation map to produce an optimal solution. However, such a finest-grained
variation model is impractical for commercial FPGAs because the huge memory and
processing overhead of handling the large amount of data in the variation maps
could result in very long execution times.
The model used in this thesis is based on measurements on commercial FPGAs.
Therefore, the minimum size of a region is limited by the measurement
methodology. Besides that, the nature of the optimisation methods proposed in this
thesis allows coarse-grained variation maps to be used. For example, during the
process of variation-aware placement, only Logic Array Blocks (LABs) are swapped.
Thus, a variation map where each LAB forms a region is sufficient for this purpose.
Delay Variation Model
To appropriately apply variation-aware optimization methods, it is important to
first understand the characteristics of variation and correctly model delay
variability. Let the delay of components on FPGAs be a random variable D
representing process variation, which can be modeled in a standard or canonical
first-order form as follows [38]:
\[ D = a_0 + \sum_{i=1}^{n} a_i \Delta X_i + a_{n+1} \Delta R_a \tag{2.1} \]
where a0 is the mean/nominal delay value of this type of component; ∆Xi represents
the variation of the i-th of n systematic sources of variation Xi from its nominal
value; ai is the sensitivity to the i-th systematic variation; ∆Ra is the variation of
an independent random variable Ra from its mean value; and an+1 is the sensitivity
of the timing quantity to Ra. By scaling the sensitivities, ∆Xi and ∆Ra can be
assumed to be standard Gaussian, N(0, 1).
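As an illustration of the canonical form in Eq. 2.1, the sketch below computes the mean and standard deviation of D and draws random samples from it. It assumes that all ∆Xi and ∆Ra are mutually independent standard Gaussians; the function names and the sensitivity values are hypothetical and are not taken from [38].

```python
import math
import random


def canonical_delay_stats(a0, a_sys, a_rand):
    """Mean and standard deviation of D = a0 + sum(a_i * dX_i) + a_rand * dRa.

    Assumes every dX_i and dRa is an independent N(0, 1) variable, so the
    variances of the individual terms simply add up.
    """
    variance = sum(a * a for a in a_sys) + a_rand * a_rand
    return a0, math.sqrt(variance)


def sample_delay(a0, a_sys, a_rand, rng):
    """Draw one sample of D from the canonical first-order model."""
    delay = a0
    for a in a_sys:
        delay += a * rng.gauss(0.0, 1.0)      # systematic sources dX_i
    delay += a_rand * rng.gauss(0.0, 1.0)     # independent random source dRa
    return delay


# Hypothetical sensitivities (in ps): two systematic sources and one random one.
mean, sigma = canonical_delay_stats(100.0, [3.0, 4.0], 12.0)

rng = random.Random(0)
samples = [sample_delay(100.0, [3.0, 4.0], 12.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

With these hypothetical sensitivities the analytical standard deviation is sqrt(3² + 4² + 12²) = 13 ps, and the sample mean should lie close to the nominal delay a0.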
Although there are numerous sources of variation, the existing literature usually
builds its model of process variation with the Simulation Program with Integrated
Circuit Emphasis (SPICE), considering lithographic effects affecting Leff and
dopant atoms in oxide layers affecting Vth.
However, the variation maps in this thesis are based on measurements on commercial
FPGAs. The delay of each region is directly obtained from measurement.
Let V(x, y) be the measured delay for the region of the variation map with
coordinates x and y. The parameter of intra-die process variation Vs(x, y) is
normalized to the mean of the variation map as follows:
\[ V_s(x, y) = \frac{V(x, y)}{\operatorname{mean}(V(x, y))} \tag{2.2} \]
Assume that each type of element on the FPGA has an intrinsic nominal delay
(Dnominal) which is location independent. The delay of an element is then defined
as in Eq. 2.3. This has the inherent assumption that the measured delay variation is
reflected faithfully and uniformly on all circuit components within a region.
\[ \mathrm{Delay}(x, y) = D_{nominal} \cdot V_s(x, y) \tag{2.3} \]
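Eqs. 2.2 and 2.3 can be sketched in a few lines. The 2 × 2 map and the nominal delay below are hypothetical values used only for illustration, and the map is assumed to be stored as a list of rows indexed as `v[y][x]`.

```python
def normalise_map(v):
    """Eq. 2.2: divide every region delay by the mean delay of the whole map."""
    cells = [delay for row in v for delay in row]
    mean = sum(cells) / len(cells)
    return [[delay / mean for delay in row] for row in v]


def element_delay(d_nominal, vs, x, y):
    """Eq. 2.3: scale the location-independent nominal delay by Vs(x, y)."""
    return d_nominal * vs[y][x]


# Hypothetical measured region delays (ns), one value per region.
measured = [[0.9, 1.0],
            [1.1, 1.0]]
vs = normalise_map(measured)

# Delay of an element with a hypothetical 200 ps nominal delay at region (0, 1).
slow_region_delay = element_delay(200.0, vs, x=0, y=1)
```

Because the map is normalised to its own mean, the values of Vs average to one, so an element placed in an average region keeps its nominal delay.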
2.3 Measurement of Process Variation for FPGAs
The effect of process variation in FPGAs can be observed from aspects of a design
such as propagation delay and power dissipation.
Sensitive analogue probes such as infrared cameras can be used on FPGAs to detect
the variation in power dissipation by mapping the die's surface temperature [39].
However, unlike power and heat measurements, propagation delay can be measured
accurately with advanced measurement methods on FPGAs [12, 21]. With these
techniques, fine-grained variation maps can be obtained from commercial FPGAs,
which can be used for variation-aware optimization methods. Three widely used
measurement methods are introduced and compared in this section.
2.3.1 At-Speed Testing
At-speed testing exercises the circuit under test (CUT) at the actual operating
frequency to discover its real-world timing behaviour [4, 40]. For ASICs,
scan-chain testing is a structural methodology that uses automatic test pattern
generation to detect defects, including network timing defects and heating effects
in the clock network and logic under high switching frequency. By increasing the
operating frequency from low to high until a timing failure appears, the maximum
operating frequency of the CUT can be found.
Most ASICs use at-speed testing for speed-binning purposes. A dedicated test
circuit must be designed to achieve high-frequency testing, which causes overhead
in terms of design time and power consumption. FPGAs are alternative platforms
for at-speed testing because of their reconfigurability. A non-permanent built-in
test circuit can be used on FPGA platforms to characterise the chip quickly and
efficiently.
2.3.2 Built-in Self-test (BIST)
BIST is a mechanism where a test, including vector generation, result analysis and
test control, is built on-chip to observe faults or measure the power/timing
performance of the chip itself. With BIST, the test is independent of the input and
output (I/O) of the CUT. BIST is widely used by ASIC manufacturers to reduce test
cost and increase productivity. In addition, BIST is used as an on-line test to
monitor timing failures occurring in the critical path. For ASICs, the resources and
area used by BIST impose a long-term penalty on the implemented circuit. Moreover,
designing an application-specific BIST to measure the implemented circuit is costly
and time consuming. However, thanks to their reconfigurability, FPGAs do not suffer
from these problems: there are usually unused resources on FPGAs to implement
BIST [41].
2.3.3 FPGA Delay Measurement Methods
In recent years, accurate FPGA measurement methods have been
developed to collect intra-die variation maps. In this section, three main measure-
ment methods are introduced.
Figure 2.4: A typical Ring Oscillator (RO) structure using inverters.
Ring Oscillator Measurement
To apply the Ring Oscillator (RO) measurement method, an odd number of inverters
in a closed-loop chain is used to measure the propagation delay of the signal path
in that loop [42]. The oscillation frequency fosc is governed by the propagation
delay of the entire signal path in the loop, as in Fig. 2.4. Since fosc can be easily
obtained by counting the number of signal transitions in a period with a counter,
the method is widely used to discover the impact of process variation [10] and
degradation on FPGAs [43].
Let Tosc be the oscillation period and fosc be the oscillation frequency. The path
delay in the RO loop (tloop) is given by:
\[ t_{loop} = \frac{T_{osc}}{2} = \frac{1}{2 \cdot f_{osc}} \tag{2.4} \]
A variation map can be obtained for each FPGA by placing an array of ROs across
the chip [23]. However, the smallest unit that can be measured by an RO is limited
by the maximum frequency of the counter in the FPGA. In addition, the requirement
of a combinatorial feedback path prevents ROs from representing typical delay paths
in most FPGA applications. Lastly, significant heat-up caused by continuous
high-frequency loop oscillation affects the measurement accuracy.
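As a minimal sketch of Eq. 2.4, the loop delay can be derived from a counter reading taken over a known time window. The function name, the counting convention (one counted edge per oscillation period by default) and the numbers are assumptions made for illustration, not details of the measurement circuits in [10, 23].

```python
def ro_loop_delay(edge_count, window_s, edges_per_period=1):
    """Loop delay t_loop = 1 / (2 * f_osc) from a counter reading (Eq. 2.4).

    edge_count       -- number of edges counted during the window
    window_s         -- length of the counting window in seconds
    edges_per_period -- 1 if only rising edges are counted, 2 if both edges are
    """
    f_osc = edge_count / (edges_per_period * window_s)
    return 1.0 / (2.0 * f_osc)


# Hypothetical reading: 200 000 rising edges in a 1 ms window (200 MHz).
t_loop = ro_loop_delay(edge_count=200_000, window_s=1e-3)
```

For this hypothetical reading the oscillation period is 5 ns, so the propagation delay around the loop is 2.5 ns.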
Transition Probability Measurement
Figure 2.5: Circuit diagram of the transition probability (TP) measurement
method [4]. (Blocks shown: a Test Vector Generator (TVG) providing test vectors V,
the Launch Register (LR), the Circuit Under Test (CUT), the Sample Register (SR),
a Test Clock Generator (TCG) clocking both registers, and a Transition Activity
Counter (TAC) feeding a Transition Probability Analyser (TPA).)
A transition probability (TP) based method has been proposed to measure FPGAs
and collect variation maps [12]. The transition probabilities TP(yi) are defined as
the probabilities of the primary output nodes yi changing state when the next input
stimulus is applied to the circuit. The basic idea of this method is to measure the
transition probabilities at signal nodes while ramping the clock frequency up. By
detecting changes in the transition probability, it is possible to indirectly derive the
maximum working frequency at which the CUT starts to fail [12, 4].
The top-level implementation of TP measurement is shown in Fig. 2.5. The Launch
Register (LR) and the Sample Register (SR) of the CUT are clocked by a Test Clock
Generator (TCG) which increases the test frequency step by step. A Test Vector
Generator (TVG) provides test vectors V to the CUT. When the CUT works normally,
each output of the CUT exhibits a non-zero steady transition probability TP(yi),
where i is the bit index. The Transition Activity Counter (TAC) processes samples
y on every clock period. The information is analyzed by a Transition Probability
Analyzer (TPA) to find the change in TP, and the point at which it deviates
from the initial steady value reflects the maximum operating frequency of the CUT.
By placing the test circuits across the entire FPGA, a variation map can be obtained
with TP delay measurements. The smallest measurement unit includes 2 LUTs and
the interconnection between them [4].
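The detection step performed by the TPA can be sketched in software as follows. This is only an illustration of the principle under stated assumptions: the sampled bit streams, the tolerance and the function names are hypothetical, and the real method of [12, 4] is implemented in hardware on the FPGA.

```python
def transition_probability(samples):
    """Fraction of consecutive sample pairs in which the bit changed state."""
    transitions = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return transitions / (len(samples) - 1)


def find_fmax(sweep, tolerance=0.05):
    """Return the last frequency before the TP deviates from its steady value.

    sweep     -- list of (frequency_hz, samples) in ascending frequency order
    tolerance -- allowed absolute deviation from the TP at the lowest frequency
                 (a hypothetical threshold chosen for this sketch)
    """
    baseline = transition_probability(sweep[0][1])
    fmax = sweep[0][0]
    for freq, samples in sweep[1:]:
        if abs(transition_probability(samples) - baseline) > tolerance:
            break                      # timing failures have started to appear
        fmax = freq
    return fmax


# Hypothetical sweep: the output toggles every cycle until the CUT starts failing.
sweep = [
    (250e6, [0, 1, 0, 1, 0, 1, 0, 1, 0]),
    (300e6, [0, 1, 0, 1, 0, 1, 0, 1, 0]),
    (350e6, [0, 1, 0, 0, 0, 1, 1, 1, 0]),
]
fmax = find_fmax(sweep)
```

In this hypothetical sweep the transition probability drops from 1.0 to 0.5 at 350 MHz, so 300 MHz is reported as the highest frequency at which the CUT still works.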
GROK-LAB Timing Extraction
Because of the limitations of the FPGA architecture, the key challenge of TP
measurement is that it is not possible to directly measure the delay of each switch
or wire. In addition, the tiny delay of one switch means that the test frequency
required for the measurement is extremely high, which may heat up the chip and
decrease the accuracy of the measurement. The idea of GROK-LAB measurement is to
measure a set of overlapping paths with TP measurement [21] to form a system of
linear equations. The individual delays of the components in those paths can be
obtained if the equations can be solved [44].
A simple example is used to explain the idea of this method. First, three paths are
measured with the TP measurement method, where Path 1 is composed of components
A and B; Path 2, of B and C; and finally, Path 3, of C and A. Assuming the delays of
the paths are 3 ps, 4 ps and 5 ps respectively, a set of linear equations is formed as
follows:
\begin{align}
A + B &= 3\,\text{ps} && \text{(Path 1)} \tag{2.5a} \\
B + C &= 4\,\text{ps} && \text{(Path 2)} \tag{2.5b} \\
C + A &= 5\,\text{ps} && \text{(Path 3)} \tag{2.5c}
\end{align}
By solving these equations, the delays of components A, B and C are found to be
2 ps, 1 ps and 3 ps respectively [21].
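The overlapping-path idea can be sketched as a small linear solve. The following is a minimal illustration using exact Gauss-Jordan elimination on the three path equations of Eq. 2.5; it is not the GROK-LAB implementation of [21], and `solve_linear` is a hypothetical helper written for this example.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with exact fractions.

    Raises StopIteration if the system is singular, i.e. if the measured
    paths do not determine every component delay uniquely.
    """
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]        # normalise pivot row
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


# Each row marks which components (A, B, C) lie on the measured path.
paths = [[1, 1, 0],   # Path 1: A + B
         [0, 1, 1],   # Path 2: B + C
         [1, 0, 1]]   # Path 3: C + A
measured_ps = [3, 4, 5]
delay_a, delay_b, delay_c = solve_linear(paths, measured_ps)
```

The solve only succeeds when the chosen paths give linearly independent equations, which is why the set of overlapping paths has to be selected with care.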
Table 2.1: Summary of variation measurement strategies.
Method      SMU                     Feature
RO          3 LUTs, feedback        feedback loop
TP          2 LUTs, interconnect    arbitrary circuit
GROK-LAB    1 LUT                   detailed variation map
Summary of Variation Measurement Methodologies
One goal of this thesis is to propose optimisation methods that make use of measured
variation maps instead of a variation model, which is insufficient but widely used
by existing variation-aware methodologies. Three existing measurement methods
are summarized in Table 2.1. RO measurement includes a feedback loop in the
measurement unit, which may not reflect the features of most designs. Comparing the
smallest measurement unit (SMU) of those methods, GROK-LAB can obtain the
most detailed variation maps, including the delay of each LUT, and so has the best
potential for variation-aware optimisation. However, the finest-grained variation
maps may cause a huge overhead in execution time, as explained earlier in
Sec. 2.2.3. A region-based variation map is considered sufficient for specific
optimisation processes, e.g. variation-aware placement. Therefore, the variation
maps measured by TP measurement are chosen in this thesis.
2.4 Methodologies to Mitigate Process Variation
In this section, the strategies for variability-adaptive design in FPGAs are
introduced. Firstly, the methods used by industry, such as worst-case timing and
speed-binning, are introduced. The maximum operating frequency defined by
traditional speed-binning can guarantee correct operation of the slowest element;
however, it is conservative and causes any synchronous design to run at a
significantly slower speed than the actual hardware can support.
Several high-level variation-aware optimizations based on Computer-Aided Design
(CAD) tools are discussed later. Compared with manufacturing-level process
control, variation-aware design technology that only involves post-silicon
optimisation is more cost-effective and significantly easier to apply.
2.4.1 Worst-case Timing and Guard-banding
Conventionally, the issue of process variation can be alleviated by several post-silicon
methods. The simplest strategy is to execute an exhaustive measurement on one
type of resource in a die, and then the delay of the slowest one is used as a nominal
delay for all other resources of this type, called the worst-case timing strategy. The
worst-case delays are pessimistic because even the fast elements are constrained to
run with this worst-case delay.
However, the efficiency of exhaustive measurement to find the worst-case delay
is quite low. In addition, there is no single fixed worst-case delay that can
universally represent the timing performance in all cases, because its value is also
affected by the rate of switching activity, temperature and power supply.
Alternatively, by defining the statistical parameters µ and σ, a PDF that describes
the distribution of possible delays can be used to set a guard-band. The delay of one
component (Tgrd) is defined as µ + cσ instead of the worst-case value, where cσ is
the guard-band and c is the guard-band factor. The measurement for calculating the
statistical parameters is not required to be exhaustive, but it must be able to
determine statistically reliable bounds for a certain yield [10].
Static Timing Analysis (STA) can be performed to obtain the guard-banded
delay, Tgrd. The guard-band cost is defined as (Tgrd/Treal) − 1, where Treal is the
real delay of the circuit obtained by measurement. The guard-band cost can be
changed by choosing a different guard-band factor, which varies the guard-band as
a multiple of σ.
Figure 2.6: Example of delay of 4 inverters with worst-case timing method. (Four
inverters in series, each with delay distribution N(µ, σ) and guard-banded delay
Tgrd; the worst-case path delay 4Tgrd is compared with the probability density of
the path delay N(4µ, 2σ), with the 99.7% point and the yield loss marked.)
An example of worst-case timing analysis with guard-bands is illustrated in
Fig. 2.6. In this simple example, the delay of each inverter is assumed to be subject
to a variation following a Gaussian Distribution (GD) N(µ, σ). Based on the theory
of worst-case timing combined with guard-banding, the delay Tgrd = µ + cσ is used
as the nominal delay for all inverters. The delay of the path is then calculated as
the sum of the nominal delays, 4 × Tgrd, while the real delay of this path is another
GD, N(4µ, 2σ), assuming the four inverter delays vary independently. Therefore,
using the worst-case delay (4µ + 4cσ) results in overly pessimistic delay estimates
because, in reality, the delay variations of the components on one path partially
compensate for each other.
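The pessimism in the four-inverter example can be made concrete with a short calculation. The values of µ, σ and c below are hypothetical, the stage delays are assumed to be independent Gaussians, and `guard_band_cost` simply evaluates the ratio (Tgrd/Treal) − 1 defined earlier in this section.

```python
import math


def worst_case_path_delay(mu, sigma, c, n_stages):
    """Sum of per-stage guard-banded delays: n * (mu + c * sigma)."""
    return n_stages * (mu + c * sigma)


def statistical_path_delay(mu, sigma, c, n_stages):
    """Guard-band applied to the path distribution N(n*mu, sqrt(n)*sigma).

    Assumes the stage delays are independent, so their variances add.
    """
    return n_stages * mu + c * math.sqrt(n_stages) * sigma


def guard_band_cost(t_grd, t_real):
    """Guard-band cost (Tgrd / Treal) - 1."""
    return t_grd / t_real - 1.0


# Hypothetical inverter: mu = 100 ps, sigma = 5 ps, guard-band factor c = 3.
worst = worst_case_path_delay(100.0, 5.0, 3.0, 4)      # 4*mu + 4*c*sigma
stat = statistical_path_delay(100.0, 5.0, 3.0, 4)      # 4*mu + 2*c*sigma
extra_margin = guard_band_cost(worst, stat)
```

For these hypothetical numbers the worst-case estimate is 460 ps against 430 ps for the same guard-band factor applied to the path distribution, i.e. roughly 7% of additional margin.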
2.4.2 Speed-binning
FPGAs, along with digital signal processors (DSP), microprocessors and other logic
chips, have long used speed-binning to sort chips into different speed-grades
according to their timing behaviours [45]. By measuring the physical path delay and
maximum operating clock frequency for a given test path, the speed-grade of each
chip can be identified.
The idea of speed-binning is illustrated in Fig. 2.7. Because of process variation,
the timing performance varies considerably when an identical design is applied to a
large number of FPGAs. With the speed-binning methodology, each chip is assigned to
a speed bin based on its worst-case delay, marked as ‘fast’, ‘medium’, ‘slow’ and
‘dead’ as in Table 2.2. Inside each bin, Tbin is the worst-case delay of one type of
component across all chips in this bin, which is used as the nominal delay for this
type of component.
Figure 2.7: Example of speed-binning. (Delay distribution across all FPGAs, divided
at −1σ, 1σ and 3σ into the fast, medium, slow and dead bins, with the delay
distribution in one grade bin shown.)
Table 2.2: The boundaries of each bin in Fig. 2.7.
Bin boundaries (delay)    Bin name
[−∞, µ − σ]               fast
[µ − σ, µ + σ]            medium
[µ + σ, µ + 3σ]           slow
[µ + 3σ, +∞]              dead
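The binning rule of Table 2.2 can be sketched as a simple classifier. The chip delays, µ and σ below are hypothetical, and a delay lying exactly on a boundary is assigned to the faster bin, which is an assumption since the table lists closed intervals on both sides.

```python
def speed_bin(delay, mu, sigma):
    """Assign a chip to a bin from its worst-case delay (Table 2.2)."""
    if delay <= mu - sigma:
        return "fast"
    if delay <= mu + sigma:
        return "medium"
    if delay <= mu + 3 * sigma:
        return "slow"
    return "dead"


def bin_nominal_delays(chip_delays, mu, sigma):
    """Tbin for each bin: the worst-case (largest) delay among its chips."""
    t_bin = {}
    for delay in chip_delays:
        name = speed_bin(delay, mu, sigma)
        t_bin[name] = max(delay, t_bin.get(name, delay))
    return t_bin


# Hypothetical worst-case delays (ns) of six chips, with mu = 100 and sigma = 5.
chips = [92.0, 97.0, 104.0, 108.0, 112.0, 120.0]
t_bin = bin_nominal_delays(chips, mu=100.0, sigma=5.0)
```

Every chip in a bin is then timed with the Tbin of that bin, so a chip at the fast end of a bin still carries the margin of the slowest chip in the same bin.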
The speed-binning method is effective for dealing with inter-die variation by
classifying chips into different bins. However, there are two disadvantages of
speed-binning. Firstly, the speed-binning process must consider all possible extreme
cases to make sure the chips operate correctly within each identified speed-grade.
The unique challenge in FPGAs is that the functionality is unknown, because
different designs with different critical paths can be compiled onto an FPGA. It is
hard to design a measurement circuit that covers all corner cases to obtain a
worst-case timing for every chip. Secondly, assuming the intra-die stochastic process
variation will increase dramatically in the future (a longer tail towards the
higher-delay end of the delay distribution), there is a high probability that a chip
fails timing despite having a large guard-band.
2.4.3 Variation-aware Optimization in CAD
The remaining option is to devise variation-aware technology to alleviate the impact
of process variation on FPGAs. It is possible to compensate for, and even make use
of, the variation captured in measured variation maps by adapting the application
circuits with variation-aware optimization methods [5, 10].
Statistical Static Timing Analysis
Conventionally, STA has been used for digital circuits for over 30 years. However, in
the presence of increasing process variation in the semiconductor industry,
traditional STA combined with the worst-case method cannot handle the problem
efficiently. With STA, the critical path is defined as the path with the worst-case
delay value, obtained by summing up the nominal delays of the components along that
path. However, if significant inter- and/or intra-die variation exists in the target
FPGAs, the delay of the critical paths may vary from device to device based on the
variation map. Statistical static timing analysis (SSTA) has been proposed in recent
years as a promising timing analysis and optimisation method which takes process
variation into account [46, 30].
By replacing the static and deterministic timing of LUT and interconnects with a
probability distribution, the delay of end-to-end signal paths is calculated by SSTA
as a distribution. The same example introduced in Sec. 2.3.1 is shown in Fig. 2.8.
The timing requirement of this path is chosen as 4µ+ 6σ, therefore 99.7% chips can
meeting the timing requirement. The SSTA method would outperform worse-case
2.4. Methodologies to Mitigate Process Variation 27
timing, with only a negligible fraction of chips failing the timing requirement. In
addition, SSTA enables a trade-off between product parametric yield and speed.
[Figure: four cascaded stages, each with delay ~N(µ, σ), driving an output; the probability density of the path delay is ~N(4µ, 2σ), with the 99.7% bound marked.]
Figure 2.8: Example of delay of 4 inverters with worst-case timing method.
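As a numerical check of the four-stage example above, the following Python sketch sums four independent Gaussian stage delays and evaluates how many chips meet a 4µ + 6σ requirement. The values of `mu` and `sigma` are illustrative assumptions, as is the independence of the stages.

```python
import math
from statistics import NormalDist

mu, sigma = 1.0, 0.05   # assumed per-stage mean and standard deviation
n_stages = 4

# Sum of independent Gaussians: the means add and the variances add,
# so the path standard deviation is sqrt(4) * sigma = 2 * sigma.
path = NormalDist(n_stages * mu, math.sqrt(n_stages) * sigma)

# 4*mu + 6*sigma is the path mean plus three path standard deviations.
requirement = n_stages * mu + 6 * sigma

yield_one_sided = path.cdf(requirement)
within_3_sigma = path.cdf(requirement) - path.cdf(2 * path.mean - requirement)

print(f"path delay ~ N({path.mean:.3f}, {path.stdev:.3f})")
print(f"P(delay <= requirement)      = {yield_one_sided:.4f}")
print(f"P(|delay - mean| <= 3 sigma) = {within_3_sigma:.4f}")
```

The 99.7% figure corresponds to the two-sided 3σ interval of the path delay; the one-sided fraction of chips below the requirement is slightly higher (about 99.87%).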
SSTA methods can be path-based or block-based [47]. The path-based method treats
end-to-end signal paths separately. An exhaustive SSTA of all paths in a circuit
is highly accurate but computationally expensive. If only a subset of paths is
analysed, the critical and near-critical paths must be chosen carefully beforehand,
because under significant process variation the true critical path may otherwise
not be analysed. Therefore, path selection is the key issue for the path-based
method [48]. The block-based method, on the other hand, generates the arrival and
required times for each node [49, 50, 38, 51]. The results of this method are
promising, except that a mismatch between the delay model and the actual variation
map is unavoidable because of spatial correlation.
Moreover, the SSTA method can be applied to the FPGA CAD flow, including
packing, placement and routing, by calling SSTA-driven optimization to achieve
a better timing performance. The probabilistic equivalents of ‘max’, ‘min’, ‘add’,
and ‘subtract’ operations are involved in SSTA for handling component delays. In
addition, during this SSTA-driven placement and routing process, the critical and
a proportion of near-critical paths are considered to calculate the cost change and
delay of the critical paths as a distribution [5, 10].
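The probabilistic 'add' and 'max' operations mentioned above can be sketched in Python as follows. This is an illustration only: it assumes independent Gaussian delays and uses Clark's moment-matching formulas for the maximum, a common choice in block-based SSTA that is not necessarily the one used in the cited works. The function names `ssta_add` and `ssta_max` and all delay values are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its pdf and cdf

def ssta_add(a, b):
    """Delay of two independent stages in series: means and variances add."""
    return NormalDist(a.mean + b.mean, math.hypot(a.stdev, b.stdev))

def ssta_max(a, b):
    """Gaussian approximation of max(a, b) for independent a and b,
    obtained by matching the first two moments (Clark's formulas)."""
    theta = math.hypot(a.stdev, b.stdev)
    if theta == 0.0:                      # both delays are deterministic
        return NormalDist(max(a.mean, b.mean), 0.0)
    alpha = (a.mean - b.mean) / theta
    pdf, cdf = _STD.pdf(alpha), _STD.cdf(alpha)
    m1 = a.mean * cdf + b.mean * (1 - cdf) + theta * pdf
    m2 = ((a.mean ** 2 + a.variance) * cdf
          + (b.mean ** 2 + b.variance) * (1 - cdf)
          + (a.mean + b.mean) * theta * pdf)
    return NormalDist(m1, math.sqrt(max(m2 - m1 * m1, 0.0)))

# Two paths converging on one node (illustrative delay values).
lut = NormalDist(1.0, 0.05)
wire = NormalDist(0.5, 0.04)
path_a = ssta_add(lut, wire)
path_b = ssta_add(lut, lut)
arrival = ssta_max(path_a, path_b)
print(arrival)
```

The maximum of two Gaussians is not itself Gaussian, so `ssta_max` is an approximation; it also ignores the spatial correlation between components, which a real flow would have to model.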
Without modifying the CAD work flow, the delay of the critical path of one P&R
implementation (called variation-blind configuration) on a large number of FPGAs
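[Figure: probability densities (PD) of the variation-blind configuration and of SSTA-driven optimization, each with its 99.7% bound; the timing improvement is the gap between the two bounds.]
Figure 2.9: The delay probability density of variation-blind and SSTA-driven optimization.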
spreads as a distribution because of intra-die variation. Theoretically, SSTA-driven
CAD and retiming [52] achieve better timing performance by taking process variation
into account during placement and routing. As illustrated in Fig. 2.9, the
probability density of the critical path delay of the variation-blind method across
many chips is assumed to be Gaussian. With SSTA-driven optimization, the distribution
of the critical path delay can be shifted towards lower delay under process
variation. The timing improvement is defined as the difference between the two
bounds that define the required yield for both distributions.
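The definition of timing improvement above can be illustrated with a short Python sketch. The two Gaussian distributions and the 99.7% yield target are assumed values chosen only for illustration; `delay_bound` is a hypothetical helper.

```python
from statistics import NormalDist

def delay_bound(delay_dist, target_yield):
    """Clock period that a fraction `target_yield` of chips can meet."""
    return delay_dist.inv_cdf(target_yield)

# Assumed critical-path delay distributions across many chips (in ns).
variation_blind = NormalDist(10.0, 0.50)
ssta_driven = NormalDist(9.6, 0.45)
target_yield = 0.997

bound_blind = delay_bound(variation_blind, target_yield)
bound_ssta = delay_bound(ssta_driven, target_yield)
improvement = bound_blind - bound_ssta

print(f"variation-blind bound: {bound_blind:.3f} ns")
print(f"SSTA-driven bound:     {bound_ssta:.3f} ns")
print(f"timing improvement:    {improvement:.3f} ns "
      f"({100 * improvement / bound_blind:.1f}%)")
```

Because the bound depends on both the mean and the spread, an optimisation that narrows the distribution improves timing at a given yield even if the mean delay is unchanged.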
The results of SSTA-driven optimisation reported so far are promising [53, 54, 55,
56, 57]. Nevertheless, these works assume that the same timing variability model
is representative of all devices of a given type. This “one configuration fits all”
approach, though efficient, assumes that the approximated timing model matches
the physical timing of each specific FPGA chip.
Multiple Configurations
To take advantage of the reconfigurability of FPGAs, one variation-aware optimization
is to use multiple implementations of the same circuit design. A single
implementation of a circuit, generated without any knowledge of process variation,
meets the timing requirement only with some probability. Therefore, by selecting an
appropriate solution from a set of functionally equivalent configurations, the delay
of the critical path can be reduced under large stochastic and spatially correlated
process variation. Compared with the “one configuration fits all” approach, the
multiple configuration method can potentially improve timing yield at the expense
of computing and storing extra configurations [58, 59].
This method generates multiple configuration bitstreams for a specific circuit
implementation and stores them in a database. All these configurations are
functionally identical but differ in their placement and routing, which are
generated from the same netlist. Ideally, the critical and near-critical paths use
distinct resources in different configurations, since configurations whose placement
and routing are highly correlated with the initial configuration are unlikely to
show a significant timing gain.
An at-speed test is required to measure the delay of each configuration on the
target FPGA and verify the actual timing gain in terms of the critical path delay.
The configuration with the minimum critical path delay is chosen for this FPGA.
However, the multiple configuration process may fail if none of the configurations
meets the target timing requirement.
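The selection step described above can be sketched in a few lines of Python. The function `select_configuration`, the configuration names and the measured delays are hypothetical; a real flow would obtain the delays from at-speed tests on the target chip.

```python
def select_configuration(measured_delays, target_delay):
    """Pick the configuration with the smallest measured critical-path delay.

    measured_delays maps a configuration name to the critical-path delay
    measured on one specific chip. Returns (name, delay), or None when no
    configuration meets the target on this chip.
    """
    if not measured_delays:
        return None
    name = min(measured_delays, key=measured_delays.get)
    delay = measured_delays[name]
    return (name, delay) if delay <= target_delay else None

# Hypothetical at-speed test results (ns) for one chip.
measured = {"config_0": 10.4, "config_1": 9.7, "config_2": 10.1}

print(select_configuration(measured, target_delay=10.0))  # fastest one passes
print(select_configuration(measured, target_delay=9.5))   # no configuration passes
```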
Multiple configurations provide more choices for one implementation of a circuit
in an attempt to alleviate the impact of process variation and improve timing
yield. Theoretically, this method can outperform SSTA methods when given a large
number of different configurations. However, there are some limitations with this
methodology. Firstly, each configuration must first be generated with a set of valid
placement and routing, and may be stored in some form of memory in an embedded
system. Secondly, it is hard to ensure that the multiple configurations cover all
possible optimized solutions for a specific variation map [5].
Region Relocation
Figure 2.10: Relocating regions in an FPGA [5].
Region relocation also takes advantage of the reconfigurability of FPGAs. Instead of
providing completely different implementations, as multiple configurations do, the
circuit is divided into regions that are then assembled in different ways [60, 61].
This can potentially increase the probability that the implementations
pass the timing requirement.
To apply the region relocation method, the circuit must be divided such that the
critical and/or near-critical paths are encapsulated in one region. In addition,
the design must support relocation of the divided regions, for example by swapping
two regions or shifting a region to an unused location. Ideally, the implemented
circuits are stored as partial bitstreams and relocated by dynamic reconfiguration
[61]. An at-speed test is also required after relocating the regions to evaluate
the delay of the critical path under process variation. Compared with multiple
configurations, this approach reduces the bitstream storage.
The limitation of region relocation is that the interconnect between regions must
be re-routed after the regions are swapped, which adds extra complexity to the
process. In extreme cases, the implementation may not be routable after relocation.
Moreover, although there are several ways to relocate the regions to form different
implementations, the resulting implementations are clearly highly correlated [5]
and may therefore not provide a significant timing improvement, as explained
previously.
Path Reconfiguration
Without relocating a region of the circuit or providing multiple configurations,
path reconfiguration re-connects only the critical path that fails the timing
requirement. It may be possible to reduce the propagation delay by choosing
different components on the critical path. For example, a LUT or switch can be
replaced by unused resources near the critical path. Even coarse alterations such
as rerouting can be used to reduce the delay of the critical path [5].
The main limitation of this method is that serious routing congestion may occur
during the process. Therefore, this method is not always feasible, especially
when the critical path is located at regions where routing resources are already
congested [62].
Full Chipwise Placement and Routing
Full chipwise placement and routing (P&R) is another method to alleviate the impact
of process variation. For a given FPGA chip, the variation map can be measured with
built-in self-test circuits and then used to perform variation-aware placement and
routing optimisation for better timing. Theoretically, full-chip P&R can make full
use of the variation map [53, 63, 64] and achieve better solutions than other
optimization methodologies such as SSTA and multiple configurations.
To apply full chipwise placement and routing, each component is assigned a unique
delay value based on the variation map. The placement and routing algorithms are
modified to use the delay values from the variation map to evaluate timing
performance. For each FPGA, the placement and routing configuration is unique and
is generated on a chip-by-chip basis. Ideally, this method achieves the best
results.
The main limitation of full chipwise placement and routing is its execution-time
overhead. Each P&R run requires frequent memory accesses to timing data from the
variation map, which is extremely expensive if a fine-grained variation map is
used. For example, placing and routing clma, one of the MCNC benchmarks, took
12 hours in our experiment. If this circuit design is applied to 100 FPGAs, the
execution time is about 50 days (100 × 12 hours) with one computer process.
Therefore, full chipwise methodologies with traditional static timing analysis are
not practical when dealing with the large number of permutations of variation maps
because of the run-time overhead.
2.4.4 Summary of Existing Methodologies
Table 2.3: Comparison between different optimisation strategies.
Method         | Target variation    | Overhead | Weakness
Speed-binning  | Inter-die           | N/A      | Ignores intra-die variation
SSTA           | Intra-die/Sto.      | Low      | One config. fits all method
Multiple Conf. | Intra-die/Sys.      | Medium   | Low timing improvement
Path Reconf.   | Intra-die/Sto.      | Medium   | Not guaranteed improvement
Full Chipwise  | Intra-die/Sto./Sys. | High     | “per-chip” config. generation causes huge overhead in execution time
The features of each optimisation method are summarized in Table 2.3. “Target
variation” describes the type of variation against which the method is most
effective. Speed-binning combined with worst-case delay is widely used by industry
to improve production yield; it handles inter-die variation well but not intra-die
variation. The SSTA method gives promising results and is most effective against
stochastic process variation (shown as Sto. in Table 2.3). Although SSTA can also
take systematic variation into account (shown as Sys. in Table 2.3), this “one
config. fits all” method provides a solution that may be far from optimal at a
per-chip level, since each chip has a unique variation map. Multiple configurations
and path reconfiguration (Path Reconf. in Table 2.3), on the other hand, adopt
the best solution from pre-designed or reserved alternatives. However, they cannot
guarantee any target timing improvement because of the correlation between the
alternatives. The full chipwise method is based on variation maps that include the
delay of each element. Theoretically, this methodology can achieve the best
solutions (an upper bound on the improvement). However, with millions of elements
on one FPGA device, the huge run-time overhead of the full chipwise method makes
this strategy impractical.
Based on the optimisation methods presented in the existing literature, the main
goal of this thesis is to propose a method that achieves an improvement similar to
that of the full chipwise method while requiring only a fraction of the execution time.
2.5 Placement, Routing and Retiming Methods
To make the best use of measured variation maps, high-performance variation-aware
CAD tools need to be designed for FPGA implementation. For placement and routing
simulation, Versatile Place and Route (VPR) is widely used in academia to enable
FPGA architecture and timing exploration. In addition, the P&R algorithms in VPR
can be modified to be variation-aware to alleviate the impact of process variation.
2.5.1 FPGAs Architecture
Placement Architecture (Island-style)
[Figure: island-style FPGA. A grid of Logic Array Blocks (LABs) is surrounded by Connection Boxes (CBs) and Switch Boxes (SBs) in horizontal and vertical routing channels of channel width W, with I/O blocks on the periphery. One region is highlighted as “Routing Architecture Example”.]
Figure 2.11: Overview of FPGA island-style architecture.
The term “programmable” or “reconfigurable” indicates the ability of FPGAs to
implement new functions after chip fabrication. The reconfigurability/programmability
of an FPGA is based on the underlying configurable logic blocks (LABs) and
programmable routing interconnect [65].
An island-style FPGA architecture is used in this thesis for placement and routing,
where LABs (clusters of LUTs and registers) look like islands in a sea of routing
resources. A generalized example of this type of FPGA is shown in Fig. 2.11.
Each LAB is a basic component of an FPGA that provides the basic logic and storage
functionality for a target application, ranging from simple logic or an arithmetic
unit to an entire processor. In the island-style architecture, LABs are arranged in
a two-dimensional grid and are interconnected by a programmable routing network. The
Input/Output (I/O) blocks on the periphery of the FPGA chip are connected to the
routing network. The routing network comprises pre-fabricated wiring segments,
including routing tracks and connection boxes (CBs), and programmable switches, i.e.
switch boxes (SBs), organized in horizontal and vertical routing channels.
Usually, the routing network of an FPGA occupies 80-90% of the total chip area. The
routing problem for FPGAs is NP-complete [66], and routing takes the majority
of the execution time in the P&R process.
Logic Array Blocks
Cluster-based logic blocks are used in commercial FPGAs produced by Xilinx [67]
and Altera [68], as shown in Fig. 2.12. A LAB with this type of architecture has a
two-level hierarchy. The overall block is a collection of basic logic elements (BLEs)
as shown in Fig. 2.12(a). One basic logic element is composed of a 4-input LUT and
a register. The complete logic block consists of N interconnected BLEs, as shown in
Fig. 2.12(b) [69].
[Figure: (a) Basic logic element (BLE): a 4-input LUT feeding a D flip-flop, with inputs, clock and output. (b) Logic array block (LAB): N BLEs (#1 to #N) with I cluster inputs, N outputs and a shared clock.]
Figure 2.12: Structure of basic logic element (BLE) and logic array block (LAB).
Two parameters, N and I, are used to describe the clusters. N is the number of BLEs
per cluster, while I is the number of inputs to the cluster. In this thesis, we
assume that multiplexers connect the I cluster inputs to arbitrary BLE inputs.
Through the same multiplexers, the output of any BLE within the cluster can
connect to any BLE input. In addition, the delay of the multiplexers is assumed to
be identical regardless of which pins of which BLEs are connected. All N outputs of
the logic cluster can be connected to the FPGA routing for use by other logic clusters.
A LAB with this structure is defined as fully connected, which is more flexible for
high-level optimization such as retiming.
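One consequence of the fully connected structure can be sketched in Python: deciding whether a group of BLEs fits in one LAB reduces to counting BLEs and distinct external inputs, because nets produced inside the cluster are fed back through the local multiplexers and do not consume cluster inputs. The `BLE` class, the `fits_in_cluster` helper and the net names are hypothetical, and the rule is a sketch under the assumptions stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BLE:
    inputs: frozenset   # names of the nets driving the LUT inputs (at most 4)
    output: str         # name of the net driven by this BLE

def fits_in_cluster(bles, n, i):
    """Check whether `bles` can share one fully connected cluster with
    at most `n` BLEs and `i` external inputs."""
    if len(bles) > n:
        return False
    internal = {ble.output for ble in bles}
    # Nets generated inside the cluster are routed locally, so only the
    # remaining nets need one of the I cluster inputs.
    external = set().union(*(ble.inputs for ble in bles)) - internal
    return len(external) <= i

group = [
    BLE(frozenset({"a", "b", "c", "d"}), "x"),
    BLE(frozenset({"x", "e", "f", "g"}), "y"),  # "x" is produced locally
]

print(fits_in_cluster(group, n=4, i=10))  # 7 external inputs are needed
print(fits_in_cluster(group, n=4, i=6))
```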
Routing Architecture
[Figure: routing architecture detail showing LABs (logic blocks), short and long wire segments, a switch block with programmable routing switches, and programmable connection boxes (connection blocks).]
Figure 2.13: Overview of FPGA island-style architecture.
To perform routing, more information about the FPGA architecture must be specified.
In this section, the parameters relevant to the variation-aware optimisation
methodology are introduced. The region highlighted as “Routing architecture example”
in Fig. 2.11 is enlarged in Fig. 2.13 to illustrate the routing architecture of the
FPGA.
The notation for the FPGA architecture is as follows:
• W : The number of tracks, or wires contained in a channel.
• Fc: The number of wires in each channel to which a logic block pin can connect
(connection box flexibility).
• Fs: The number of wires to which each incoming wire can connect in a switch
block (switch box flexibility).
• Wl: The length of a wiring segment which is the number of logic blocks it
spans.
• long line: A wiring segment that spans the entire width or height of an FPGA.
Switch Box
[Figure: three switch block styles (disjoint, universal and Wilton), each drawn with tracks numbered 0 to 4 on every side.]
Figure 2.14: Structures of three switch boxes.
The switch blocks illustrated in Fig. 2.14 are designed using different
methodologies. FPGAs such as the Xilinx XC4000-series [67] use a switch block style
known as disjoint. The universal switch block is analytically designed to be
independently routable for all two-point nets. In comparison, the Wilton switch
block is an example of ad hoc design with experimental validation. The Wilton block
changes the track number assigned to a net as it turns, so two different global
routes may reach two different tracks at the same destination channel. This forms
two disjoint paths, a feature called the diversity of a network [70].
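To make the contrast concrete, the connectivity of the disjoint style can be sketched in Python: a wire entering on track t can only leave on track t of the other three sides, so Fs = 3 and a net never changes track number. The side names and the helper `disjoint_connections` are hypothetical; the Wilton and universal permutations are deliberately not reproduced here.

```python
SIDES = ("left", "top", "right", "bottom")

def disjoint_connections(side, track, width):
    """Wires reachable from (side, track) in a disjoint switch block
    with `width` tracks per channel."""
    if side not in SIDES:
        raise ValueError(f"unknown side: {side}")
    if not 0 <= track < width:
        raise ValueError(f"track {track} outside channel of width {width}")
    # The track number is preserved on every outgoing side.
    return [(other, track) for other in SIDES if other != side]

W = 5  # channel width used in the figure (tracks 0 to 4)
for t in range(W):
    print(t, disjoint_connections("left", t, W))
```

Because the track number never changes, a disjoint network splits into W independent routing planes, which is the lack of diversity that the Wilton block addresses.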
2.5.2 Placement
Placement algorithms determine the location of the logic blocks of a circuit
on the target FPGA. The optimization goals are to place connected logic blocks
close to each other to minimize the wire length (wire-length-driven) or to balance
wiring density (routability-driven) or to maximize the speed of the circuit (timing-
driven) [71].
Placement Algorithm
The main placement algorithms for FPGAs [72] are:
Quadratic placement uses the squared wire length as the objective function.
The main advantage of this technique is that it significantly reduces the
execution time compared with the simulated annealing placement algorithm.
However, since the squared wire length is the only factor in the objective
function, timing is not taken into account by quadratic placement [73].
Partitioning-based/min-cut placement recursively applies min-cut partitioning
to map the netlist of the circuit onto the FPGA layout region. The circuit is
recursively bi-partitioned, minimizing the number of nets cut between partitions
while keeping highly connected blocks in the same partition. This recursive
process is repeated until each partition contains only a few blocks, so that
highly connected blocks are grouped together and the placement cost decreases.
One disadvantage of this algorithm is that the solution may vary depending on
how the min-cut partitioning is performed [74, 75].
Simulated annealing is a general optimization technique that can be applied to
the FPGA placement problem. In particular, the Versatile Place and Route (VPR)
tool bases its placement algorithm on simulated annealing [71]. This algorithm
is explained in detail in Sec. 2.5.2.
Hybrid placement is a mixture of the Genetic Algorithm and Simulated Annealing
(GASA) optimization techniques for symmetrical FPGAs. This algorithm consists
of two stages. First, it uses a genetic algorithm to optimize the placement
globally; simulated annealing is then used to improve the solution locally. It
is reported to reach the global optimum and to compute faster than a pure
genetic algorithm. However, this optimization technique is complex and can be
difficult to implement [76].
Incremental placement updates the placement of logic elements when given an
incremental netlist change. The incremental placement engine assumes that the
restructuring algorithm provides a list of new logic elements along with a
preferred location for each of them. It then tries to shift non-critical logic
elements in the original placement out of the way to satisfy the preferred
location requests [77].
In this thesis, the variation-aware optimisation method is applied to the simulated
annealing placement algorithm used by VPR, since simulated annealing is widely used
in academic as well as commercial tools [78].
VPR Placement
Simulated annealing mimics the annealing process in which molten metal is cooled
very slowly according to a specific schedule to produce a high-strength metal
lattice [79]. Pseudo-code for a generic simulated annealing-based placement is shown
in Alg. 1. In this process, an initial placement S is created by assigning logic
blocks randomly to available locations. A temperature parameter (T), which controls
the likelihood of accepting moves, is initialized to a very high value.
After initialization, an iterative process is executed to refine the placement and
lower its cost while the temperature is reduced. In each iteration, a random logic
block is moved to a new random location. If the selected location is already
occupied by a logic block, the two logic blocks are swapped. The change in cost
caused by this move is computed. If the cost decreases, the move is always accepted and the block
2.5. Placement, Routing and Retiming Methods 41
is moved. If the cost increases, there is still a chance of the move being accepted,
even though it makes the placement worse. The cost function is used to evaluate the
quality of the placement towards the optimization goal of placement. For example,
a cost function in timing-driven placement aims at producing placement with the
minimum net delay.
Algorithm 1 Simulated annealing-based placement
S = RandomPlacement();
T = InitialTemperature();
Rlimit = InitialRlimit();
while ExitCriterion() == False do
    while InnerLoopCriterion() == False do
        Snew = GenerateViaMove(S, Rlimit);
        ∆C = Cost(Snew) − Cost(S);
        r = random(0, 1);
        if r < e^(−∆C/T) then
            S = Snew;
        end if
    end while
    T = UpdateTemp();
    Rlimit = UpdateRlimit();
end while
The probability of acceptance depends on the current temperature as e^(−∆C/T):
almost all moves are accepted when T is high, but as T decreases the probability of
accepting a move that makes the placement worse drops accordingly. The iterative
process terminates when the exit criterion is met, at which point an optimized
placement has been obtained [71].
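The acceptance rule can be illustrated with a minimal Python sketch. The function names `acceptance_probability` and `accept_move` are hypothetical and are not taken from VPR; the sketch only assumes the e^(−∆C/T) rule stated above.

```python
import math
import random


def acceptance_probability(delta_cost, temperature):
    """Probability of accepting a move with cost change delta_cost at temperature T."""
    if delta_cost <= 0:
        return 1.0  # moves that do not increase the cost are always accepted
    return math.exp(-delta_cost / temperature)


def accept_move(delta_cost, temperature, rng=random):
    """Draw r in [0, 1) and accept the move if r is below the acceptance probability."""
    return rng.random() < acceptance_probability(delta_cost, temperature)
```

At a high temperature a worsening move is almost always accepted, whereas at a low temperature the same move is practically never accepted, which is the behaviour described in the paragraph above.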
The rate at which the temperature drops is controlled by the parameter α, the
fraction of attempted moves that are accepted. Experimental experience indicates that
keeping α close to 0.44 for as long as possible gives the best solutions [80].
The new temperature is calculated from the old temperature as follows:
T_{new} = \gamma \cdot T_{old}    (2.6)
where γ is chosen according to Table 2.4 to achieve the best performance.
Table 2.4: Temperature update schedule.
α                    γ
α > 0.96             0.5
0.8 < α ≤ 0.96       0.9
0.15 < α ≤ 0.8       0.95
α ≤ 0.15             0.8
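The schedule of Table 2.4 together with Eq. 2.6 can be written as a short function. This is a sketch only; the name `update_temperature` is hypothetical and the thresholds are copied from the table.

```python
def update_temperature(t_old, alpha):
    """Return T_new = gamma * T_old, with gamma selected from Table 2.4."""
    if alpha > 0.96:
        gamma = 0.5   # almost every move accepted: cool quickly
    elif alpha > 0.8:
        gamma = 0.9
    elif alpha > 0.15:
        gamma = 0.95  # productive range around alpha = 0.44: cool slowly
    else:
        gamma = 0.8   # very few moves accepted: cool faster again
    return gamma * t_old
```

The slowest cooling (γ = 0.95) is applied in the band that contains α = 0.44, so the annealer spends most of its time where the acceptance rate is most productive.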
Rlimit controls how close together blocks must be to be considered for swapping.
Rlimit is initially set to the size of the entire chip, so the blocks to be swapped can
be far from each other. During the annealing process, Rlimit is adjusted according to
the parameter α as in Eq. 2.7, shrinking gradually to 1 logic block at low temperature.
R_{limit}^{new} = R_{limit}^{old} \cdot (1 - 0.44 + \alpha)    (2.7)
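A small sketch of Eq. 2.7 follows. The clamping to the range from 1 to the chip size is an assumption made here to match the stated start and end values of Rlimit; the function name and the `chip_size` parameter are hypothetical.

```python
def update_rlimit(r_old, alpha, chip_size):
    """Apply Eq. 2.7 and keep the result between 1 block and the chip size."""
    r_new = r_old * (1.0 - 0.44 + alpha)
    # Assumption: Rlimit never exceeds the chip and never drops below 1 logic block.
    return min(max(r_new, 1.0), float(chip_size))
```

When α is above 0.44 the range grows (up to the chip size), and when α falls below 0.44 the range shrinks, which pushes the acceptance rate back towards 0.44.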
The inner loop terminates after a fixed number of moves has been executed. The
number of moves per temperature is given by Eq. 2.8, where N_{blocks} is the total
number of logic blocks plus the number of I/O pads in the circuit, and InnerNum has
a default value of 10.
MovesPerTemperature = InnerNum \cdot (N_{blocks})^{4/3}    (2.8)
The outer loop terminates when:
T < \epsilon \cdot \frac{Cost}{N_{nets}}    (2.9)
where N_{nets} is the number of nets in the circuit and ε is 0.005 by default. The
movement of a logic block always affects at least one net. When the temperature is
a small fraction of the average cost per net, it is unlikely that any move that
results in a cost increase will be accepted, and the annealing process is therefore
terminated.
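The two loop criteria of Eq. 2.8 and Eq. 2.9 can be sketched as follows. The function names are hypothetical, and rounding the move count to an integer is an assumption of this sketch.

```python
def moves_per_temperature(n_blocks, inner_num=10):
    """Eq. 2.8: number of moves attempted at each temperature."""
    return round(inner_num * n_blocks ** (4.0 / 3.0))


def should_exit(temperature, cost, n_nets, epsilon=0.005):
    """Eq. 2.9: stop when T is below epsilon times the average cost per net."""
    return temperature < epsilon * cost / n_nets
```

For example, with a cost of 100 spread over 200 nets the threshold is 0.0025, so the outer loop stops once the temperature falls below that value.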
2.5.3 Routing
Once the locations of the logic blocks have been chosen by placement, a router
determines the connections between logic blocks based on the netlist. The FPGA
architecture shown in Fig. 2.11 is represented as a directed graph [81].
Algorithms for Routing
There are three main algorithms which can be used for FPGA routing:
Figure 2.15: Example of routing based on maze/Dijkstra algorithm for two nets [6].
Panels: (a) Start WE for net 1; (b) Continue WE for net 1; (c) Results of routing for
net 1; (d) Start WE for net 2; (e) Continue WE for net 2 avoiding used resources;
(f) Results of routing for net 1 and 2.
• Maze Routing
This basic routing algorithm is based on Dijkstra's algorithm [82]. The maze
router uses a wavefront expansion (WE) technique that attempts to find the
shortest path between two points while avoiding any used routing
resources [83, 84].
An example of applying the maze router to route two nets is illustrated in
Fig. 2.15. The start stage, middle stage and final result of routing net 1 are
illustrated in Fig. 2.15(a), (b) and (c) respectively. It can be seen that the
wavefront extends in all directions. The resources in the yellow region are
checked for establishing a connection between source and sink (red blocks).
After net 1 is completely routed, the maze/Dijkstra router provides the routing
solution for net 2 without overusing any resources. Two drawbacks of this router
can be observed from this example. Firstly, the wavefront is extended in
unnecessary directions that are unlikely to be used by the final route. Secondly,
the result of one net may block subsequent nets, which means that the
performance of the algorithm is net-ordering dependent.
• A* search Routing
A* routing is an extension of Dijkstra's algorithm which adaptively tunes the
path search method between a breadth-first search (BFS) and a depth-first
search (DFS) [85].
Breadth-first search: A strategy the router can use to search a graph for a
possible connection. The BFS is limited to essentially two operations:
(a) visit and inspect a resource of the graph; (b) gain access to the
resources that neighbor the currently visited resource. The BFS begins at
a source node and calculates the cost of each neighboring resource if it
were used in routing. Then, for each of those neighbor resources in turn,
it inspects their unvisited neighboring nodes, until all nodes have been
visited.
Depth-first search: A strategy the router can use to search a graph.
DFS starts at the source node and explores as far as possible along each
branch before backtracking.
The BFS is an exhaustive search that considers all possible paths but is slow,
while the DFS may not find the minimum cost path but can be faster. With a
weight factor α (distinct from the acceptance rate α used in placement), the
cost function of A* routing evaluates the cost of a potential resource i by
considering both the current cost and the potential cost to the destination:
f_i = (1 - \alpha) \times (f_{i-1} + c_i) + \alpha \times d_i    (2.10)
where c_i is the node cost that indicates the current usage of the node and is
used to penalise nodes occupied by previous routes, f_{i-1} is the total cost
of the previous path, and d_i is the estimated cost of the path from node i to
the destination.
An example of A* routing is illustrated in Fig. 2.16. Compared with maze
routing, the resources checked during the routing process are limited to the
area around the final route. By considering the cost from the current node to
the destination, A* routing speeds up the search.
Figure 2.16: Example of routing based on A* search algorithm [6]. Panels:
(a) Start stage of A* routing; (b) Mid stage of A* routing; (c) Result.
• Pathfinder
The Pathfinder algorithm is based on the maze router, but it is sped up by
routing every connection in an obstacle-free environment and allowing routing
resources to be overused.
In a single iteration of the algorithm, all nets are routed once as if each
were the only connection to be routed. The cost of using every resource is then
calculated according to its demand. The cost function implemented by
Pathfinder is:
f_i = (1 + h_n \cdot h_{fac}) \times (1 + p_n \cdot p_{fac}) + b_{n,n+1}    (2.11)
where b_{n,n+1} is the penalty for bending the wire, p_n is the cost of using a
specific wire, h_n is the history term that keeps track of the usage of the wire
during previous iterations, and h_{fac} and p_{fac} are the respective weighting
factors.
Subsequent iterations rip up and re-route all nets, and the process continues
until no routing resource is overused. Ripping up and re-routing every net
allows the Pathfinder algorithm to minimize the net-ordering problem of maze
routing.
• Incremental routing
Incremental routing is another routing strategy, proposed in [86, 87] as a Fault
Tolerant (FT) technique for FPGAs. It is a process by which some of the signal
nets routed through the programmable interconnect network are ripped up
(unrouted) and rerouted, while the majority of the signal nets are left intact.
This algorithm generates precompiled partial configurations that can be
downloaded when faults occur. The downloaded partial rerouting avoids the
faulty interconnect. Since the precompiled partial configurations are stored on
a net-by-net basis, the required storage space and download time are minimal.
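The difference between maze routing and A* routing can be sketched on a small grid. This is an illustrative toy only and not the VPR router: it assumes a 4-connected grid with unit node cost, uses the Manhattan distance as the estimate d_i, and interprets f_{i-1} + c_i in Eq. 2.10 as the accumulated path cost. The names `route`, `blocked` and `expanded` are hypothetical.

```python
import heapq


def route(grid_w, grid_h, source, sink, blocked=frozenset(), alpha=0.0):
    """Return (path, expanded). alpha = 0 behaves like a maze/Dijkstra router."""
    def estimate(node):  # d_i: Manhattan distance from node to the sink
        return abs(node[0] - sink[0]) + abs(node[1] - sink[1])

    best = {source: 0.0}   # accumulated path cost of each reached node
    parent = {source: None}
    queue = [(alpha * estimate(source), source)]
    done = set()
    expanded = 0           # number of resources inspected by the wavefront
    while queue:
        _, node = heapq.heappop(queue)
        if node in done:
            continue
        done.add(node)
        expanded += 1
        if node == sink:   # trace the route back to the source
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], expanded
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < grid_w and 0 <= nxt[1] < grid_h):
                continue
            if nxt in blocked or nxt in done:
                continue   # used resources are avoided
            g = best[node] + 1.0  # c_i = 1 for every free resource
            if g < best.get(nxt, float("inf")):
                best[nxt] = g
                parent[nxt] = node
                f = (1 - alpha) * g + alpha * estimate(nxt)  # Eq. 2.10
                heapq.heappush(queue, (f, nxt))
    return None, expanded
```

Calling `route` with `alpha=0.0` expands the wavefront in all directions, while a value such as `alpha=0.5` concentrates the search around the final route. Passing the resources used by an earlier net as `blocked` reproduces the net-ordering dependence described for the maze router.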
VPR adopts a combined global and detailed routing algorithm based on the Pathfinder
and A* algorithms. The PathFinder method is used in the negotiated congestion-delay
algorithm [88], a more sophisticated technique in which the congestion-delay
trade-off of each connection is controlled by how timing-critical it is [71]. In
other words, a timing-critical connection is routed along a minimum-delay path even
if that path is congested, while a non-timing-critical connection takes a longer path.
Our experiment is based on the VPR routing algorithm, combined with the idea of
incremental routing, to partially re-route the critical and near-critical paths with
a variation-aware optimisation method.
The pseudo-code for the algorithm is shown in Alg. 2. The criticality of the
connection from the source of net i to one of its sinks j is
Crit(i, j) = 1 - \frac{slack(i, j)}{D_{max}}    (2.12)
where D_{max} is the delay of the circuit's critical path, and slack(i, j) is the
amount of delay that could be added to this connection before it affects the
circuit's critical path. Crit(i, j) is therefore between 0 and 1.
The cost of using routing resource node n as part of connection (i, j) is
Cost(n) = Crit(i, j) \cdot delay(n) + [1 - Crit(i, j)] \cdot [b(n) + h(n)] \cdot p(n)    (2.13)
The first term in Eq. 2.13 is the delay-sensitive term, defined as the criticality
of the connection times the intrinsic delay of the node. The second term is the
congestion-sensitive term. b(n) is the base cost of node n, which is equal to
delay(n). h(n) is the historical congestion of node n; it is increased after every
iteration in which node n is overused. p(n) is the present congestion cost of node n.
It has the value 1 if using the node causes no congestion, and it increases with the
amount of overuse of the node. p(n) is also a function of the number of routing
iterations that have been performed: in early iterations p(n) grows slowly with the
current overuse of node n, but its growth rate increases rapidly in later iterations
to penalise overuse towards the end.
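Eq. 2.12 and Eq. 2.13 translate directly into two small functions. The sketch below uses hypothetical names and leaves the update rules for h(n) and p(n) to the caller, since only their qualitative behaviour is described above.

```python
def criticality(slack, d_max):
    """Eq. 2.12: 1 for a connection on the critical path, 0 for maximal slack."""
    return 1.0 - slack / d_max


def node_cost(crit, delay, base, history, present):
    """Eq. 2.13: blend of the delay term and the congestion term."""
    delay_term = crit * delay
    congestion_term = (1.0 - crit) * (base + history) * present
    return delay_term + congestion_term
```

A fully critical connection (Crit = 1) sees only the node delay and ignores congestion, whereas a connection with Crit = 0 is routed purely to avoid congested nodes.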
Each routing iteration consists of ripping up and re-routing every net in the circuit
once. This process is repeated until the result is free of congestion. In the first
routing iteration, the nets are routed ignoring congestion and overuse of resources.
Consequently, one or more further routing iterations must be executed if overuse
exists at the end of a routing iteration. After each iteration, the cost of overusing
a routing resource is increased so that congestion is resolved iteratively. A timing
analysis is performed after one or several routing iterations to determine the net
delays and compute the criticality of each connection, which controls how much
attention is paid to delay and how much to congestion avoidance for that connection.
The final routing result is produced when a routing iteration completes without any
congestion.
Algorithm 2 Pathfinder routing algorithm
Crit(i, j) = 1 for all nets i and sinks j;
while overused resources exist do
    for each net, i do
        rip-up routing tree RT(i) and update affected p(n) values;
        RT(i) = NetSource(i);
        for each sink, j, of net(i) in decreasing Crit(i, j) order do
            PriorityQueue = RT(i) at PathCost(n) = Crit(i, j) · delay(n) for each
            node n in RT(i);
            while sink(i, j) not found do
                Remove lowest cost node, m, from PriorityQueue;
                for all fanout nodes n of node m do
                    Add n to PriorityQueue at PathCost(n) = Cost(n) + PathCost(m);
                end for
            end while
            for all nodes n in the path from RT(i) to sink(i, j) do
                Update p(n);
                Add n to RT(i);
            end for
        end for
    end for
    Update h(n) for all n;
    Perform timing analysis and update Crit(i, j) for all nets i and sinks j;
end while
2.5.4 Limitation of VPR Simulation
Although VPR is widely used by academics to explore FPGA architectures through
placement and routing simulation, it has certain limitations. Firstly, the version
used at the time of this research (VPR 5) does not take delay variability into
account, nor can such variability be incorporated easily into its existing timing
model. The timing model used by VPR assumes a uniform delay for all nominally
identical components of the same type. The algorithm has to be modified
significantly to become variation-aware before the impact of, and the opportunities
under, process variation can be explored.
Secondly, the performance of placement and routing is inconsistent. When a different
random initial seed is chosen for the placement and routing algorithms, the timing
performance varies over a large range. The improvement made by a variation-aware
optimization method might be masked by this P&R noise, so noise reduction
countermeasures are necessary during the routing process [89].
Finally, with VPR 5, the results of P&R are not compatible with commercial FPGAs.
Therefore, the improvement from variation-aware P&R optimization cannot be verified
by commercial CAD tools or tested on actual FPGA architectures.
2.5.5 Retiming
Retiming [90] has been studied extensively to optimize the timing performance of
sequential circuits in ASICs and FPGAs. Three main retiming methods related to
our work are introduced as follows:
Original retiming (binary search) was proposed by C. E. Leiserson in 1983. It
moves flip-flops within a circuit while preserving its functionality. The
original retiming algorithm is based on a binary search that checks the
feasibility of a range of clock periods [90].
Polynomial-time retiming (without binary search) was proposed by C. Lin in
2006. In contrast to the original retiming method, this method directly checks
the optimality of the current feasible clock period and can thus either reduce
the period or certify its optimality [91].
Constraint driven retiming (CDR) was proposed to enable retiming in modern
FPGAs. This method is based on polynomial-time retiming and retains its
polynomial time complexity, but adds constraints specific to FPGAs. In addition,
CDR has been extended to statistical retiming (sCDR), which handles process
variation by considering statistical timing analysis during the retiming
procedure [52]. The results show that sCDR can achieve a 6.93% timing
improvement on average.
Retiming has been used by industry to improve the timing performance of circuits.
However, its primary drawback is that the state encoding of the circuit is destroyed,
making debugging, testing, and verification substantially more difficult. Some
retiming methods may also require complicated initialization logic so that the
circuit starts in an identical initial state, e.g. when registers are pushed to the
sink of each path. Finally, although the retiming process can be verified [92], the
changes in the circuit's topology may have negative consequences in other logical
and physical synthesis steps that complicate design closure. Moreover, if a full
chipwise variation-aware retiming strategy is adopted, the verification process has
to be applied to every FPGA because each chip has a unique retiming configuration.
2.6 Summary and Discussion
This chapter has reviewed the importance of exploring the impact of process variation
in ASIC devices, and has discussed the challenges and opportunities of alleviating
process variation with high-level optimization methods.
With the size of transistors shrinking towards the deep sub-micron domain, process
variation causes significant and measurable variability in the timing performance of
designs. Improvements in the fabrication process cannot solve or avoid variability
completely, and process variation is predicted to increase dramatically in the
future. Therefore, post-silicon optimization in the CAD flow is a necessity. The
flexibility and reconfigurability of FPGAs make them suitable for measurement with
BIST to collect variation maps, which can be used in a full chipwise optimization
methodology.
Several methodologies, such as SSTA and full chipwise optimization, can deal with
the problem of process variation effectively. However, as described in Sec. 2.4.3,
these methods have clear limitations. New methods that mitigate process variation
and allow a flexible trade-off between run time and timing performance, using
retiming and/or the P&R algorithms in VPR, are therefore of interest.
Chapter 3
Two-stage Variation-aware
Placement
3.1 Introduction
This chapter introduces a two-stage variation-aware and adaptive placement method
which benefits from the optimality of a full chipwise (chip-by-chip) placement but
only requires a fraction of its execution time. This work was published at the Inter-
national Conference On Field Programmable Logic And Applications in 2012 [14].
Published research in variation-aware placement currently relies on statistical
models of variation. In this thesis, however, measured delay variation (variation
maps) is used in the variation-aware placement process. The variation maps are
collected from commercial FPGAs using delay measurement methods with high spatial
and timing resolution, such as the transition probability method introduced in
Sec. 2.3.3.
Theoretically, performing variation-aware placement on an individual device based
on its delay variation, called full chipwise placement, will yield optimal timing
performance for that device. However, this “per chip” strategy is impractical because
the execution time required is O(N), where N is the total number of devices used,
which can be very large. A two-stage variation-aware placement is proposed in
Sec. 3.2.1 to save execution time. The key idea is to classify the variation maps
into a finite number of classes in the first stage, and then apply full chipwise
variation-aware placement only to the median map of each class. In the second stage,
the placement produced in the first stage is loaded onto the other FPGAs whose
variation maps fall into the same class, thus reducing the execution time required.
The advantage of our two-stage variation-aware and adaptive placement is that it
only requires O(k) time where k is the number of classes used by classification, and
k can be chosen to be significantly smaller than N . A tradeoff between timing per-
formance and execution time can be made according to the end-users’ requirement
by choosing a suitable k.
3.2 Two-stage Variation-aware Placement
3.2.1 Algorithm Design
As introduced in Sec. 3.1, applying variation-aware placement to the median maps of
classes rather than to the full set of maps can drastically reduce computation time.
The main advantage of this method is that it avoids treating every FPGA as having
a unique variation pattern. Our experimental results show that the two-stage
placement can achieve timing performance similar to that of full chipwise placement
based on the full set of variation maps. By choosing an appropriate number of
variation map classes, a significant reduction in execution time can be achieved
while maintaining timing performance within design specifications. A summary of
the two-stage variation-aware placement is as follows:
• The first stage:
– Measure the variation maps of N FPGAs using the characterisation
method.
– Classify the N variation maps into k classes using the k-means algorithm.
– Perform full chipwise placement on the median map of each class.
• The second stage:
– For each FPGA (variation map), load the placement corresponding to the
class (Class-ID) identified in the first stage.
– Apply variation-aware timing analysis based on the FPGA's own variation
map to evaluate the delay of the critical path under process variation.
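The two stages listed above can be sketched as follows. The callables `place` and `analyse` are hypothetical stand-ins for a full chipwise placement run and a variation-aware timing analysis; they are not VPR functions, and the dictionary layout is an assumption of this sketch.

```python
def two_stage_placement(chip_maps, class_ids, median_maps, place, analyse):
    """chip_maps: chip -> variation map; class_ids: chip -> Class-ID;
    median_maps: Class-ID -> median map of that class."""
    # Stage 1: one full chipwise placement per class (k runs instead of N).
    placements = {cid: place(median) for cid, median in median_maps.items()}
    # Stage 2: each chip loads the placement of its class and is then timed
    # against its own measured variation map.
    return {chip: analyse(placements[class_ids[chip]], vmap)
            for chip, vmap in chip_maps.items()}
```

The expensive `place` step is called once per class, so its cost grows with k rather than with the number of devices N, while the comparatively cheap timing analysis is still performed per chip.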
Practical Work Flow
Pure full chipwise variation-aware placement is impractical for industry because of
its huge overhead in execution time. With two-stage placement, one possible work flow
that we propose for industrial use is illustrated in Fig. 3.1. In this work flow, the
placement process is divided into two parts:
Variation map measurement (at FPGA manufacturer): The variation maps
can be measured by the manufacturer during the testing process. By analogy
with speed-binning, we name the classification process “variation-binning”,
and a Class-ID is assigned to each chip in the same way as speed grades are
assigned by speed-binning. The number of classes can be chosen based on the
impact of process variation at that technology node. The number of classes (k)
is kept small, just enough to represent the different variation patterns, but
it may be increased adaptively as more patterns are observed or higher
placement optimality is required.
Figure 3.1: One practical flowchart of two-stage placement. (Flowchart labels:
FPGA manufacturer, FPGA end-user, delay characterisation, classification, K median
maps, FPGA pool, median map, Class-ID, variation-aware P&R based on median map.)
Class-specific FPGA placement (at end-users): Using the Class-ID assigned to
each FPGA chip, the corresponding median map is used for variation-aware
placement.
The advantage of this work flow is that the FPGA manufacturer only needs to store
the median variation maps. The FPGA end-user can download the median map matching
the Class-ID assigned to the specific FPGA and use it to perform variation-aware and
adaptive optimisation.
Figure 3.2: One variation map measured on Cyclone III FPGA.
3.2.2 Variation Maps
Variation Map Characterisation
As introduced in Sec. 2.3.3, the transition probability method used in this
experiment to collect the variation map of each FPGA is efficient and easy to apply.
For the Cyclone III EP3C16 FPGA on the DE0 board, it takes only about 20 seconds
to collect the variation map. One of the variation maps measured in this experiment
is shown in Fig. 3.2. The finest component measurable in the Cyclone III FPGAs using
the transition probability method is two adjacent Look Up Tables (LUTs) and the
interconnect between them inside one Logic Array Block (LAB) [12]. However, to avoid
runtime overhead in full chipwise variation-aware placement, region-based variation
maps are used in this project, which means that all resources in one region share
the same variation parameters.
According to the placement algorithm in VPR [71], only LABs are swapped during the
placement process. The locations of LUTs inside one LAB remain the same after
packing. Therefore, for the purpose of demonstrating the improvement achievable by
two-stage variation-aware and adaptive placement with acceptable run time, we assume
that all LUTs inside one LAB have the same timing performance. It is not necessary
to use more fine-grained variation maps for placement in this case.
One of the advantages of using measured variation maps instead of a variation model
is that the measured maps reflect the actual process variation on an FPGA. The
variation-aware optimisation can therefore be tested and validated on commercial
FPGAs.
Analysing Variation Maps
To make full use of the measured variation maps V and to predict process variation in
the future, all measured variation maps are analysed. The variation is divided into
inter-die and intra-die variation, and into systematic and random variation. The
variables representing the measured delays of elements i and j at different locations
of a variation map are V_i and V_j, with variances \sigma_{V_i}^2 and \sigma_{V_j}^2.
The covariance between them is given by
cov(V_i, V_j) \equiv \rho_{i,j} \cdot \sigma_{V_i} \cdot \sigma_{V_j}    (3.1)
where \rho_{i,j} is the overall process correlation between the process parameters of
the two elements i and j.
Let X_g be the inter-die global variation component, X_s the intra-die
spatial-variation component, and X_r the intra-die random-variation component, with
variances \sigma_G^2, \sigma_S^2 and \sigma_R^2 respectively. \rho(v) is the intra-die
spatial-correlation function of the distance v. For two elements located at
(x_i, y_i) and (x_j, y_j), v is defined as
v = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}    (3.2)
If the spatial variation is modeled as a homogeneous and isotropic random field in a
two-dimensional space, then for the parameters of interest at two arbitrary different
points, the covariance is given by [32]
cov(V_i, V_j) = cov(X_g, X_g) + cov(X_{s,i}, X_{s,j})    (3.3a)
              = \sigma_G^2 + \rho(v) \sigma_S^2    (3.3b)
For the parameters of interest at different locations separated by a distance v, the
overall process correlation between them is given by
\rho_v \equiv \frac{cov(V_i, V_j)}{\sigma_{V_i} \sigma_{V_j}}    (3.4a)
       = \frac{\sigma_G^2 + \rho(v) \sigma_S^2}{\sigma_G^2 + \sigma_S^2 + \sigma_R^2}    (3.4b)
The intra-die spatial correlation \rho(v) is a function of the distance v, and so is
the overall correlation \rho_v. As \rho(v) is homogeneous and isotropic, so is \rho_v,
because of the one-to-one correspondence between \rho(v) and \rho_v.
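Eq. 3.4b can be evaluated numerically once a form for \rho(v) is chosen. The sketch below assumes an exponential decay \rho(v) = exp(-v / L) purely for illustration; the thesis does not specify this form, and the names `overall_correlation` and `corr_length` are hypothetical.

```python
import math


def overall_correlation(v, var_g, var_s, var_r, corr_length=5.0):
    """Overall correlation rho_v of Eq. 3.4b for two elements at distance v."""
    if v == 0:
        return 1.0  # a component is perfectly correlated with itself
    # Assumed spatial-correlation function; not taken from the thesis.
    rho_spatial = math.exp(-v / corr_length)
    return (var_g + rho_spatial * var_s) / (var_g + var_s + var_r)
```

With equal variances for the three components, the value just above v = 0 is close to 2/3 and the value at large distances approaches 1/3, which corresponds to the share of inter-die variation in the total variance.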
One simulated curve for the overall correlation \rho_v is shown in Fig. 3.3 as a
function of distance v based on Eq. 3.4b. The total variation is divided into three
parts: part G is the inter-die correlated variation; part S is the intra-die
spatially correlated variation; part R is caused by the purely uncorrelated random
variation (intra-die stochastic variation). From this curve, it can be seen that the
overall process correlation \rho_v settles to a constant value when the distance
becomes large enough. The reason is that even if two components on the same chip are
far from each other, there is still some correlation between them due to their shared
inter-die variation. In addition, there is a steep drop of \rho_v as the distance
increases from zero, which is caused by the purely uncorrelated stochastic variation.
Perfect correlation (\rho_v = 1) only occurs at the same location (v = 0), i.e. for
the same component. The distance that separates stochastic from spatially correlated
variation is found to be 5 in Fig. 3.3 [32].
Figure 3.3: Possible curve for the overall process correlation according to Eq. 3.4b
(theoretically ideal case). Axes: distance between two components (v), 0 to 10,
versus correlation, 0 to 1. The annotated levels
(\sigma_G^2 + \sigma_S^2)/(\sigma_G^2 + \sigma_S^2 + \sigma_R^2) and
\sigma_G^2/(\sigma_G^2 + \sigma_S^2 + \sigma_R^2) delimit parts R, S and G.
A Gaussian lowpass filter is designed to separate the spatially correlated process
variation (low frequency component) from the stochastic process variation (high
frequency component). The transfer function of the 2D Gaussian lowpass filter used in
this project is given by
H_{lp}(u, v) = e^{-D(u,v)^2 / (2 D_0^2)}    (3.5)
where D(u, v) is the distance of the point (u, v) from the origin. The ideal cut-off
frequency of this lowpass filter is the distance D_0 [93]. D(u, v) is given by
D(u, v) = \sqrt{(u - M/2)^2 + (v - N/2)^2}    (3.6)
where N is the number of rows and M is the number of columns. As shown in
Fig. 3.3, the value of D_0 can be chosen as 5 to separate spatially correlated and
stochastic process variation.
Figure 3.4: Overall process variation correlation curve for 129 variation maps
according to Eq. 3.4b (real case). Axes: distance between two components (v), 0 to 35,
versus correlation, 0.6 to 1.
Using actual data from the 129 measured variation maps, Fig. 3.3 is re-plotted to
form the curve in Fig. 3.4. The resulting overall correlation decreases gradually as
the distance increases, and it is not easy to distinguish stochastic from spatially
correlated variation. This shows that spatially correlated variation exists across
the entire chip. Therefore the cut-off frequency is chosen as D_0 = 35 to extract
the spatially correlated process variation (variation pattern) of the variation
maps. The classification method is then based on the pattern of each variation map
after filtering.
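The transfer function of Eq. 3.5 with the distance of Eq. 3.6 can be tabulated with a few lines of Python. This sketch only builds the filter mask H_lp(u, v); applying it to a variation map in the frequency domain is not shown, and the function name is hypothetical.

```python
import math


def gaussian_lowpass(rows, cols, d0):
    """Return H_lp as a list of rows, indexed as h[v][u] (N rows, M columns)."""
    h = []
    for v in range(rows):
        row = []
        for u in range(cols):
            # Squared distance from the centre of the mask, Eq. 3.6.
            d2 = (u - cols / 2) ** 2 + (v - rows / 2) ** 2
            row.append(math.exp(-d2 / (2 * d0 ** 2)))  # Eq. 3.5
        h.append(row)
    return h
```

The mask equals 1 at the centre and falls to e^(-1/2), roughly 0.61, at a distance of D_0 from it, so a larger D_0 such as the value 35 chosen above retains more of the map's spatial detail.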
Classify Variation Maps
To avoid treating each FPGA individually, the k-means method [94, 13] is used to
classify all measured variation maps into a finite number of classes based on the
least-squares error. The k-means method aims to partition N pattern observations
into k classes in which each observation belongs to the class with the closest mean.
With this method, variation maps with similar patterns are grouped into the same
class. For each class, the median map is defined as the variation map with the
smallest difference from the mean map of that class.
In our experiment, the data set V_1, ..., V_N consists of N filtered variation maps,
each a D-dimensional variable V. Our goal is to partition the variation maps into a
fixed number K of classes (clusters). A set of D-dimensional vectors \mu_k, where
k = 1, ..., K, is used to represent the centers of the classes. Binary indicator
variables r_{nk} \in \{0, 1\}, where k = 1, ..., K, describe to which of the K classes
the variation map V_n is assigned. If V_n is assigned to class k then r_{nk} = 1, and
r_{nj} = 0 for j \neq k. The objective function is given by
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| V_n - \mu_k \|^2    (3.7)
which represents the sum of the squares of the distances of each data point to its
assigned vector \mu_k. The values of \{r_{nk}\} and \mu_k are found so as to
minimize J [94].
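The assignment step, the objective of Eq. 3.7 and the selection of a median map can be sketched in plain Python. Variation maps are assumed to be flattened into lists of numbers, and all function names are hypothetical; a complete k-means run would alternate the assignment with a recomputation of the centres until J stops decreasing.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_classes(maps, centres):
    """For each map, return the index k of the closest centre (r_nk = 1)."""
    return [min(range(len(centres)),
                key=lambda k: squared_distance(m, centres[k]))
            for m in maps]


def objective(maps, centres, labels):
    """Eq. 3.7: sum of squared distances of each map to its assigned centre."""
    return sum(squared_distance(m, centres[k]) for m, k in zip(maps, labels))


def median_map_index(maps, centres, labels, k):
    """Index of the measured map in class k that is closest to the class mean."""
    members = [n for n, label in enumerate(labels) if label == k]
    return min(members, key=lambda n: squared_distance(maps[n], centres[k]))
```

The median map is always one of the measured maps, whereas the class mean generally is not, which is why the median is the one used as input to placement.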
The number of classes is flexible and can be adapted to the design requirements; 16
is chosen in our experiment as the initial number of classes for the 129 variation
maps. The impact of the number of classes will be discussed in later sections. The
results of classification based on the filtered variation maps are shown in Fig. 3.5,
where the median maps of the 16 classes are illustrated. It can be seen clearly from
this figure that the median maps of different classes differ significantly from each
other. These median maps are used to create 16 chipwise placement configurations in
the first stage of this experiment, as described in Sec. 3.2.1.
Figure 3.5: Results of classification on variation maps.
Obviously, a larger number of classes can be chosen to obtain more optimal place-
ment at the expense of longer execution time. As mentioned earlier, the number
of classes needed depends on the required timing optimality of the resultant FPGA
configurations. Tests are conducted with benchmark circuits to observe such rela-
tionships. In addition, to explore the impact of different classification methods, we
applied a more complex algorithm using Principal Component Analysis (PCA) [95]
combined with k-means, to test the quality of classification in our two-stage place-
ment experiment.
The filtered variation patterns are only used for classification. Variation-aware and
two-stage placement are based on the amplified variation maps and median maps
respectively.
Figure 3.6: Extraction (a) and amplification (b) of random variation from
measurements. (a) Extraction of random variation from the measured delay map by
subtracting spatially correlated variation (lowpass-filtered map). (b) Amplification
of random variation extracted from measurement to extrapolate random variation in
future FPGA devices.
Amplified Variation Maps
Although our measurements represent actual variation in real FPGA devices, they
are obtained from 65nm FPGAs which may not reflect the degree of variation at
smaller state-of-the-art processes as well as future processes where our variation-
aware placement method will be of most interest. Therefore, to examine the full
potential of our method in the future, the variation maps are amplified to match the
level of variation expected in future technologies. A substantial increase in random
(stochastic) variation in future devices has been predicted [8].
We extracted and amplified the stochastic component of variation from our
measurements for use in our variation-aware placement method. The method used to
extract spatial correlation from a measured variation map is discussed in [32].
A 2D Gaussian low-pass filter is applied to the original delay map to obtain
a low-frequency, spatially correlated variation map [96], which is then
subtracted from the original map to extract the underlying random variation, as
shown in Fig. 3.6(a). In this experiment the σ of the Gaussian low-pass filter is
chosen as 35 by analysing the measured variation maps [32] introduced in
Sec. 3.2.2. This means that any two locations on the variation map within a
radius of 35 LABs of each other are treated as correlated. The random variation
map is then multiplied by a constant A and added back to the correlated map to
emulate variation in future FPGA devices, as shown in Fig. 3.6(b). To obtain the
potential upper bound on the improvement that the variation-aware method can
achieve, we choose A such that the variation in each map is amplified to
σdelay/µdelay = 30% [23]. However, these amplified variation maps have not been
verified against the process variation of future devices. Further work is
needed to measure the most advanced FPGAs and verify our amplified model.
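The extraction and amplification steps above can be sketched as follows. This is a minimal illustration and not the code used in the experiment: the kernel truncation at 3σ, the edge clamping, the zero-mean adjustment of the random component and the closed-form solution for A are all assumptions made here, and the function names are hypothetical.

```python
import math
import statistics

def gaussian_kernel(sigma):
    """1D Gaussian kernel truncated at 3*sigma (assumption) and normalised."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(grid, kernel):
    """Convolve every row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    out = []
    for row in grid:
        n = len(row)
        out.append([sum(w * row[min(max(x + d, 0), n - 1)]
                        for d, w in zip(range(-radius, radius + 1), kernel))
                    for x in range(n)])
    return out

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def lowpass(delay_map, sigma):
    """Separable 2D Gaussian low-pass filter: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(delay_map, kernel)), kernel))

def amplify(delay_map, sigma, target_ratio):
    """Return (A, amplified map) such that std/mean equals target_ratio."""
    flat = [v for row in delay_map for v in row]
    smooth = [v for row in lowpass(delay_map, sigma) for v in row]
    rand = [m - s for m, s in zip(flat, smooth)]
    offset = statistics.fmean(rand)
    rand = [r - offset for r in rand]            # zero-mean random component
    corr = [m - r for m, r in zip(flat, rand)]   # correlated part keeps the mean
    mu = statistics.fmean(flat)
    var_c = statistics.pvariance(corr)
    var_r = statistics.pvariance(rand)
    cov = statistics.fmean([(c - mu) * r for c, r in zip(corr, rand)])
    # Solve var_c + 2*A*cov + A^2*var_r = (target_ratio*mu)^2 for the positive root.
    target_var = (target_ratio * mu) ** 2
    disc = cov * cov - var_r * (var_c - target_var)
    A = (-cov + math.sqrt(disc)) / var_r
    width = len(delay_map[0])
    amplified = [c + A * r for c, r in zip(corr, rand)]
    return A, [amplified[i:i + width] for i in range(0, len(amplified), width)]
```

For a measured map, `amplify(delay_map, 35, 0.30)` would correspond to the settings quoted above. The sketch does not check for negative delays, which would need separate handling if A is large.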
3.2.3 Two-stage Variation-aware Placement Based on VPR
This section introduces the modifications that make the placement algorithm in
VPR variation-aware. Full chipwise placement was first developed in 2006 [54].
One of the main differences between the full chipwise placement in this thesis
and previous work is that amplified variation maps from commercial FPGAs are
used instead of a variation model. In addition, the improvement achieved by
full chipwise placement is used as the reference for two-stage variation-aware
placement.
Implementation of Variation-aware Placement
To explore the impact on timing performance caused by process variation, the
timing-driven placement algorithm is modified in our experiment. In the original
VPR [71], a simulated annealing process is applied to obtain high quality solutions
as introduced in Sec. 2.5.2.
For each net, the net cost is defined as the sum of a timing cost (tdcost) and a
bounding box cost (bbcost). The timing cost is the net delay from the source
block to the sink block of the net, estimated by routing while ignoring
congestion. Without considering process variation, both timing cost and
bounding box cost are simply proportional to the Manhattan distance of the net.
The goal of the timing-driven placement algorithm is to place connected logic
blocks close together with the shortest possible length of interconnect and the
smallest FPGA area, in other words, to minimise the sum of all net costs. In
each iteration, randomly chosen LABs are swapped to new random locations. A cost
change function, shown in Eq. 3.8, is used to assess whether the swap improves
or worsens the timing performance. Let ∆bbcost and ∆tdcost represent the change
in bounding box cost and timing cost arising from this swap. λ is the trade-off
parameter that balances the bounding box and timing costs. pre_bbcost and
pre_tdcost are the bounding box cost and timing cost before the swap. The cost
change ∆c for the net is defined as
\[
\Delta c = (1-\lambda)\times\frac{\Delta\mathit{bbcost}}{\mathit{pre\_bbcost}} + \lambda\times\frac{\Delta\mathit{tdcost}}{\mathit{pre\_tdcost}} \tag{3.8}
\]
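As a minimal sketch, the cost change of Eq. 3.8 can be written as a small function. The acceptance rule shown with it is the standard Metropolis criterion of simulated annealing and is included here as an assumption; the function and parameter names are hypothetical and do not correspond to identifiers in VPR.

```python
import math
import random

def swap_cost_change(delta_bb, pre_bb, delta_td, pre_td, lam):
    """Normalised cost change of one swap, following Eq. 3.8."""
    return (1.0 - lam) * delta_bb / pre_bb + lam * delta_td / pre_td

def accept_swap(delta_c, temperature, rng=random):
    """Metropolis rule (assumed): always accept improvements, otherwise
    accept with probability exp(-delta_c / temperature)."""
    if delta_c <= 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)
```

For a swap that grows a cost of 9 units to 10 units in both the bounding box and the timing term, `swap_cost_change(1, 9, 1, 9, lam)` evaluates to 1/9 for any λ, because both terms are normalised by their previous values.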
To explain the calculation of ∆c for one net, an example is illustrated in Fig. 3.7.
Initially, the timing cost of the net is proportional to D_x = 6 and D_y = 3. After
swapping blocks {6, 2} and {6, 1}, the new cost is proportional to Dnew_x = 6 and
Dnew_y = D_y + delta_y = 4. The placement algorithm of the original VPR assumes
that all LUTs and switches across the chip are identical in terms of timing, therefore
it is not necessary to include the delay of LUT in the cost function.
[Figure 3.7 shows a 6 × 4 grid of logic blocks with coordinates (1,1) to (6,4) and two nets, Net 1 and Net 2, annotated with D_x = 6, D_y = 3, delta_y = 1 and Dnew_y = 4.]
Figure 3.7: An example to explain the change of cost due to logic blocks swap.
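\[
\Delta c = (1-\lambda)\times\frac{\Delta\mathit{bbcost}}{\mathit{pre\_bbcost}} + \lambda\times\frac{\Delta\mathit{tdvar\_net\_cost} + \Delta\mathit{tdLUT\_cost}}{\mathit{pre\_tdcost}} \tag{3.9}
\]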
However, to consider process variation, the cost function has to be modified as
shown in Eq. 3.9. Firstly, instead of using the Manhattan distance, the delay of a net
depends on the locations of its source and sink blocks. Secondly, the delay of each
LUT in the LAB and each switch in the routing network is updated using the variation
maps.
In Eq. 3.9, ∆tdcost is divided into two variation-aware parts,
∆tdvar_net_cost and ∆tdLUT_cost. During the variation-aware placement process, each
net is routed ignoring congestion but using the delays from the variation map, and a
matrix of delays between any two blocks is built. In addition, the LUT delay is
included in the modified function as ∆tdLUT_cost to indicate the change in LUT
delay after a swap. For the same example in Fig. 3.7, ∆tdvar_net_cost is equal to the
net delay from block {1, 4} to block {6, 1} minus the net delay from block {1, 4} to
block {6, 2}. ∆tdLUT_cost is equal to the delay of LUT {6, 1} minus that of LUT
{6, 2}. In this way, the placement process becomes variation-aware.
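The variation-aware cost change of Eq. 3.9 can be sketched as below. The dictionaries `net_delay` (the block-to-block delay matrix) and `lut_delay` (the LUT delay taken from the variation map) are hypothetical data structures chosen for illustration, and each change is taken as the value after the swap minus the value before it, which is an assumption consistent with ∆ denoting the cost change arising from the swap.

```python
def variation_aware_swap_cost(net_delay, lut_delay, source, old_sink, new_sink,
                              delta_bb, pre_bb, pre_td, lam):
    """Cost change of moving the sink block of one net, following Eq. 3.9.

    net_delay[(src, dst)] is the variation-aware delay between two locations,
    lut_delay[loc] is the LUT delay at a location (both hypothetical inputs).
    """
    # Change in variation-aware net delay: new location minus old location.
    d_net = net_delay[(source, new_sink)] - net_delay[(source, old_sink)]
    # Change in LUT delay between the new and the old location.
    d_lut = lut_delay[new_sink] - lut_delay[old_sink]
    return (1.0 - lam) * delta_bb / pre_bb + lam * (d_net + d_lut) / pre_td
```

With illustrative delay values, a move to a location with a slower net but a faster LUT partly cancels out in the timing term, which is the effect the LUT term is meant to capture.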
Although the timing improvement of full chipwise placement is significant, the
methodology has some limitations. To achieve an optimal configuration, it
requires an exhaustive routing search to build a large matrix that stores the
block-to-block delays. For each swap iteration, frequent memory accesses are required
to update the delay of LUTs and switches from the variation maps, resulting in a
long execution time. The worst-case benchmark in our experiment, clma, took 8 hours
with full chipwise placement.
One possible strategy to reduce the execution time of full chipwise variation-aware
placement, which involves a large number of simulated annealing iterations (block
swaps), is to perform the timing analysis once every several iterations instead of
after every iteration. This significantly reduces execution time, but may introduce
inconsistent results that mask the timing gain achieved by our algorithm. The two-stage
strategy introduced in this chapter instead reduces the number of variation maps by
pattern classification, which significantly reduces the total execution time when
optimising for a large number of FPGAs.
Implementation of Two-stage Variation-aware Placement
As introduced in Sec. 3.2.1, the variation maps are classified into a finite number of
classes with the k-means method, so that VPR only has to generate a much smaller set of
full chipwise variation-aware placement configurations, one for each median pattern,
which significantly reduces the placement execution time. For example, if N
FPGAs are used to implement a particular design with k classes of variation map,
then k configurations are generated from the median maps by the two-stage
variation-aware placement, and the average execution time is reduced by a factor of
N/k compared with full chipwise placement for all the FPGA chips. The number
of classes can be chosen adaptively according to the design requirement.
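The two-stage flow can be sketched as follows, assuming each variation map is flattened into a list of delays. The k-means seeding (evenly spaced maps), the fixed iteration count and the element-wise median used for the median map are simplifying assumptions, and `place` is a hypothetical stand-in for one full chipwise variation-aware placement run.

```python
import statistics

def kmeans(maps, k, iterations=20):
    """Plain k-means on flattened maps; evenly spaced maps seed the centroids."""
    centroids = [list(maps[i * len(maps) // k]) for i in range(k)]
    labels = [0] * len(maps)
    for _ in range(iterations):
        # Assign each map to the nearest centroid (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(m, centroids[c])))
                  for m in maps]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [m for m, l in zip(maps, labels) if l == c]
            if members:
                centroids[c] = [statistics.fmean(col) for col in zip(*members)]
    return labels

def median_maps(maps, labels, k):
    """Element-wise median of the maps in each non-empty class."""
    result = {}
    for c in range(k):
        members = [m for m, l in zip(maps, labels) if l == c]
        if members:
            result[c] = [statistics.median(col) for col in zip(*members)]
    return result

def two_stage(maps, k, place):
    """Stage 1: classify and place once per class. Stage 2: look up per chip."""
    labels = kmeans(maps, k)
    medians = median_maps(maps, labels, k)
    configs = {c: place(m) for c, m in medians.items()}  # at most k placements
    return [configs[l] for l in labels]                   # one lookup per FPGA
```

Only the k calls to `place` are expensive; the final lookup is a cheap table access per FPGA, which is where the N/k reduction comes from.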
3.3 Experiment and Results
3.3.1 Experiment Setup
This section introduces the setup and parameters used in our experiment. During
the process of placement, the net delay has to be evaluated through
variation-aware routing without considering congestion. For this reason, it is
possible that the variation-aware placement algorithm would generate un-routable
results. To avoid this problem, for each placement strategy we used 1.1 times the
minimum channel width required by variation-blind placement, which provides 10%
headroom for congestion. In addition, the inherent P&R uncertainty (noise) within
VPR is reduced by using a slower simulated annealing schedule and a router noise
reduction method from [89]. For two-stage placement, the default number of classes
is chosen as 16. The effect of choosing a different number of classes is examined
in Sec. 3.4.2. To ensure fair comparison of results between full chipwise
and two-stage placements across different benchmarks, the Utilisation ratio (Ur) of
every benchmark circuit is set to 80% according to:
\[
U_r = \frac{\text{used blocks}}{\text{total blocks}} \tag{3.10}
\]
Although the MCNC benchmarks are relatively small and may not be representative
of realistic designs, they are widely used by researchers for variation-aware
optimisation based on VPR. To compare our optimisation method fairly with
previous research, the improvement of the full chipwise placement and the two-stage
approach is examined with 20 MCNC benchmark circuits.
[Figure 3.8 is a flow chart in which the 20 MCNC benchmark circuits feed three flows. Variation-blind placement: 1 variation-blind placement, 1 placement configuration, 1 routing and 129 variation-aware timing analyses. Full chipwise placement: 129 variation maps, 129 variation-aware placements, 129 placement configurations, 129 routing and variation-aware timing analyses. Two-stage placement: k-means classification of the 129 variation maps gives a Class-ID for each variation map (FPGA) and 16 median variation maps for the classes; 16 variation-aware placements give 16 placement configurations; the placement is loaded for each variation map based on its Class-ID, followed by 129 routing and variation-aware timing analyses. The flows end in a comparison between the different methods.]
Figure 3.8: Work flows for variation-blind, two-stage placement and full chipwise
placement.
3.3.2 Experiment Flow
Three placement methods, variation-blind, variation-aware full chipwise and
two-stage placement, are tested as shown in Fig. 3.8.
For the variation-blind method, the original placement and noise-reduced routing
are performed to generate one placement and routing configuration. Based on this
configuration, the variation maps are loaded for variation-aware timing analysis
to evaluate the delay of the critical paths.
For variation-aware full chipwise placement, the variation maps are used during the
placement process. In this case, 129 placement configurations are produced,
one for each variation map. Variation-blind noise-reduced routing and
variation-aware timing analysis are performed after placement to evaluate the delay
of the critical paths.
For the two-stage placement method, all variation maps are classified
into 16 classes in the first stage and each variation map is assigned a Class-ID. The
median map of each class is used to perform a variation-aware placement. In the
second stage, the result of the variation-aware placement is loaded for each variation
map according to its Class-ID, and noise-reduced routing and variation-aware
timing analysis are performed. Finally, the results of the different methods are
collected and compared in terms of the delay of the critical paths.
3.3.3 Experiments and Results of Two-stage Placement
Table 3.1: Results of variation blind, two-stage and chipwise.
Name      var blind (ps)  two-stage (ps)  two-stage gain  full chipwise (ps)  full chipwise gain
alu4      4.89            4.70            4.01%           4.55                6.97%
apex2     5.70            5.51            3.33%           5.38                5.61%
apex4     4.97            4.81            3.18%           4.75                4.47%
bigkey    3.73            3.30            11.60%          3.19                14.46%
clma      11.17           10.74           3.90%           10.67               4.51%
des       8.59            8.53            0.71%           8.03                6.50%
diffeq    5.70            5.22            8.32%           5.24                8.06%
dsip      4.16            3.98            4.28%           3.63                12.70%
elliptic  8.33            6.71            19.42%          6.61                20.61%
ex1010    7.30            7.06            3.35%           6.88                5.73%
ex5p      5.33            5.01            6.02%           4.97                6.85%
frisc     10.61           9.33            12.06%          8.59                19.11%
misex3    4.71            4.64            1.43%           4.49                4.72%
pdc       7.79            7.39            5.11%           7.18                7.84%
s298      9.39            8.31            11.53%          7.93                15.56%
s38417    7.57            7.26            4.09%           7.00                7.43%
s38584.1  6.60            5.65            14.49%          5.23                20.83%
seq       5.01            4.68            6.61%           4.57                8.79%
spla      6.99            6.86            1.82%           6.42                8.14%
tseng     6.02            4.98            17.32%          4.45                26.12%
mean      6.73            6.23            7.36%           5.99                11.01%
Figure 3.9: Density of delay of critical path for frisc.
The results of variation-blind, two-stage and full chipwise placement for 20
MCNC benchmarks are shown in Table 3.1. The column var blind (variation blind)
represents the results produced by the unmodified VPR, where timing data from the
variation maps are only used at the end to obtain the critical path delay. The timing
gains of the two-stage and full chipwise methods are calculated using the
variation-blind results as the reference. The mean improvement of the two-stage
method is 7.36%. The mean improvement of full chipwise is about 11%.
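The gain columns can be reproduced approximately with the short sketch below. The formula is an assumption inferred from the statement that the variation-blind results are used as the reference, and values recomputed from the rounded delays in Table 3.1 differ slightly from the tabulated gains.

```python
def timing_gain(blind_delay, optimised_delay):
    """Relative reduction in critical path delay with respect to the
    variation-blind result (assumed definition of the gain columns)."""
    return (blind_delay - optimised_delay) / blind_delay

# frisc row of Table 3.1: 10.61 (variation blind), 9.33 (two-stage),
# 8.59 (full chipwise).
frisc_two_stage = timing_gain(10.61, 9.33)
frisc_full_chipwise = timing_gain(10.61, 8.59)
```

Both recomputed values agree with the tabulated 12.06% and 19.11% to within about 0.1 percentage points; the residual is attributed here to rounding of the tabulated delays.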
The improvement of variation-aware placement varies between benchmarks because
of differences in the length of the critical path, the ratio of logic delay to
interconnect delay and the number of near-critical paths. All 20 MCNC benchmarks
are tested in the following experiments. However, for clarity, the results of a
single benchmark, frisc, are discussed in detail. Its logic delay is twice as much
as its interconnect delay, which allows it to benefit more from the placement
optimisation and hence to show a large timing improvement.
Fig. 3.9 shows the timing distribution (density) of critical path delays for the frisc
circuit over 129 FPGAs (variation maps). Since the logic delay in frisc's critical path
is twice as much as its net delay, it is well suited to demonstrating the effectiveness
of our placement methodology, which is more sensitive to logic delay. Both the
full chipwise and the two-stage method achieve better critical
path timing than variation-blind placement across the 129 variation maps. Another
promising observation is that the distribution with the two-stage method follows
the full chipwise results closely, with only a small shift towards higher path delay.
This shows that we lose little in terms of delay with the quicker two-stage method
when compared with full chipwise placement.
3.4 Discussion on Quality of Two-stage Placement
3.4.1 Comparison of Run Time Cost between Chipwise and
Two-stage Variation-aware Placement
Figure 3.10: Run time cost of SSTA chipwise and two-stage placement.
The absolute run time of our experiment is difficult to measure. Instead, the execution
time is estimated and normalised to show the difference between the placement
optimisation methods for 129 FPGAs in Fig. 3.10. For the SSTA method, one SSTA
placement is applied to all FPGAs. For full chipwise, the run time is proportional to
the number of FPGAs because every FPGA is treated individually. For the two-stage
placement, the run time depends on the number of classes chosen rather than the number
of FPGAs. In this case, 16 full chipwise placements are required, based on
the median map of each class. For the other FPGAs in each class, the time needed
to load the placement is negligible compared with the time used by full chipwise
placement. The run time increases proportionally with the number of
classes, but more classes may also give better timing performance in terms of critical
path delay because the median maps match the actual variation map of each FPGA more
closely.
Although the improvement achieved by two-stage placement is less than that of
full chipwise variation-aware placement, its main advantage over full chipwise is the
much reduced execution time. In this case, the two-stage method only requires 16
variation-aware placements based on the median map of each class, whereas the
full chipwise method has to do 129 individual placements for all variation maps.
The execution time of the classification process using Matlab (about 30 seconds) is
negligible compared with VPR placement (more than 30 minutes). Therefore, the
amount of computation is reduced by a factor of 129/16 which translates into about
8 times speedup.
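The run-time argument above can be checked with a simple model. This is only a sketch: the parameter names are hypothetical, and the 30-minute placement time is used as an illustrative lower bound taken from the figures quoted above.

```python
def placement_time(num_chips, num_classes, t_place, t_classify=0.0, t_load=0.0):
    """Estimated run time of full chipwise and two-stage placement.

    Returns (full chipwise time, two-stage time, speedup), all in the
    unit of t_place. t_load is the per-chip cost of loading a placement.
    """
    full = num_chips * t_place
    two_stage = t_classify + num_classes * t_place + num_chips * t_load
    return full, two_stage, full / two_stage

# 129 FPGAs, 16 classes, 30 minutes per placement, 0.5 minutes to classify.
full, two_stage, speedup = placement_time(129, 16, 30.0, t_classify=0.5)
```

With these numbers the classification overhead changes the speedup only in the third significant figure, so the ratio stays close to N/k = 129/16.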
3.4.2 Choosing the Number of Classes
The potential timing improvement is a function of the number of classes (k). We
are interested in how the critical path timing scales with k, which in turn scales
the total execution time of the placement. To find the relationship between them,
we varied k from 1 to 129 and observed how the critical path delay changes for the
frisc benchmark circuit. If k is 1, the two-stage placement is similar to the SSTA
placement method, which provides a lower bound on the improvement with one
configuration. Conversely, k equal to N gives the upper bound, identical to full
chipwise placement. The results are shown in Fig. 3.11. As expected, the critical path
delay decreases with a higher number of classes. Therefore, FPGA users can select a
suitable number of classes according to their design specifications and placement
execution time constraints by trading off timing performance against execution
time.
Figure 3.11: Number of clusters (k) against critical path delay for frisc.
3.4.3 The Effect of FPGA Utilisation on Timing Improvement
How the FPGA resource utilisation ratio affects the timing gain of the two-stage method
is also of interest, since more unused resources should give the placement process
more freedom to achieve better critical path timing. The results of the frisc
benchmark with different Ur are shown in Fig. 3.12. With variation-blind placement,
there is a clear increase in critical path delay as the FPGA Utilisation ratio (Ur)
approaches 100%. However, the negative effect of increasing Ur is much less apparent
with the full chipwise approach, and our two-stage method lies slightly above it with
better overall timing than the variation-blind approach. One explanation of these
observations is that frisc has critical and near-critical paths that fit well within the
fast region of the FPGAs in most cases, and thus reducing the size of the FPGA to
increase Ur has a relatively small impact on the critical path delay.
Figure 3.12: FPGA Utilisation Ratio (Ur) against critical path delay for frisc.
3.4.4 Classification Enhancement
The k-means method used in the experiment is low in complexity and execution
time, and is easy to apply. To determine whether enhancing the classification with
other algorithms, such as PCA, is beneficial, we combined PCA with the original k-means
method and repeated our experiment with the frisc benchmark. The results are
shown in Fig. 3.13. The difference between the distributions of critical path delay
over 129 variation maps for k-means and PCA with k-means is within the
uncertainty (i.e. noise) of the placement, and is not significant. Therefore, we
conclude that the k-means method alone is sufficient.
Figure 3.13: Density of critical path produced by k-means classification against PCA
with k-means for frisc
3.5 Conclusion
This chapter employed delay variation measurements of 129 Cyclone III FPGA chips
to demonstrate the potential timing improvement of designs using variation-aware
placement in VPR. With full chipwise placement, an average improvement of 11%
in critical path delay was observed, while the two-stage optimisation method
achieved a lower 7% improvement but showed an 8 times speedup in total
placement execution time compared with full chipwise placement. For N FPGAs
and k classes, the speedup relative to full chipwise is expected to be N/k. The
observed timing improvement and reduction in execution time with the two-stage
method demonstrate its effectiveness and practicality against delay variability
in FPGAs. While variability will inevitably impact design timing yield on FPGAs in
the future, our method provides one promising solution that can be easily employed
to alleviate the problem caused by severe process variation.
Chapter 4
Partial re-routing
4.1 Introduction
In the previous chapter, a two-stage variation-aware placement method has been
proposed to improve overall timing performance of circuits with a small increase in
execution time. In this chapter, re-routing of critical and near-critical paths using
variation map information is investigated. Any given design would generally include
different delay requirements. In a fully synchronous circuit, the slowest-path dictates
the overall circuit performance in terms of maximum operational clock frequency.
Therefore it is at least in theory possible to match the requirement of the design by
adaptively optimising the critical path.
Variation-aware routing has been explored by other researchers in the past. For
example, the impact of process variation on FPGA routing architecture has been
studied by Jamieson [97] and Pourhashemi [98]. Lin [55] and Sivaswamy [57] applied
statistical models to improve the timing performance and yield using statistical static
timing analysis (SSTA). What is distinctive about the method proposed in this
chapter is that, instead of using assumed timing models, actual measured variation
maps are used to re-route the critical or near-critical paths. In that way, it is possible
to better match the timing requirements of a design to a given FPGA device. At
the very least, this work can establish an upper bound on the possible improvement
that variation-aware routing can achieve.
The new contributions of this chapter are:
1. Enhancement to the VPR router so that the delay variation information (vari-
ation maps) is used to optimise chip-level routing adaptively;
2. Incorporation of the VPR “noise reduction” method reported in [89] in order
to quantify the potential gain of such variation-aware routing method more
accurately;
3. Further improvement of the VPR router so that only a fraction of the most critical
paths are re-routed based on the variation maps.
4.2 Delay and Process Variation Modeling
4.2.1 Delay Model
A simplified driver model for gates and an Elmore-delay based reduced order model
for interconnects are used in the variation-aware routing process, as illustrated in
Fig. 4.1. This is the model used in VPR [71], except that delay variation due to the
input waveform/slope of transitions (slew rate) is not taken into account. Thus the
net delay is the sum of the lumped delay of the driver and the Elmore delay of the
routing network.
[Figure 4.1 annotates the active components of the model with a Vth that follows N(μ, σ).]
Figure 4.1: Approximated delay model, variation is modeled for active components:
Vth variation.
In CMOS circuits, the main contributing factor of delay variability is variation
in Vth (threshold voltage). Thus in our model we apply the delay variation only to
lumped gate delays. The resistance and capacitance (RC) of the interconnect are
assumed to be constant all over the die. In other words, most of the variation is
caused by transistors either in the logic elements or the routing switches, and not by
the interconnect wires or gate capacitances. Thus in our experiments, each net has a
constant delay (due to the RC effect) and a lumped gate delay (due to routing buffers,
LUTs, etc.) which is subject to process variability.
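The delay model above can be illustrated with a short sketch, assuming the routing network of a connection is approximated by a simple RC ladder from driver to sink. The function names and the `variation_scale` parameter are hypothetical; only the lumped driver delay is scaled, while the RC part is left constant, as described in the text.

```python
def elmore_delay(stages):
    """Elmore delay at the far end of an RC ladder.

    stages is a list of (resistance, capacitance) pairs ordered from the
    driver to the sink; each capacitance sits at the node after its resistance.
    """
    delay = 0.0
    downstream = sum(c for _, c in stages)
    for r, c in stages:
        delay += r * downstream   # r charges all capacitance at or beyond its node
        downstream -= c
    return delay

def net_delay(driver_delay, stages, variation_scale=1.0):
    """Lumped driver delay, scaled by the local variation, plus constant RC delay."""
    return driver_delay * variation_scale + elmore_delay(stages)
```

For two identical unit stages the Elmore delay is 1 × 2 + 1 × 1 = 3, and a driver that is 10% slower than nominal adds 10% of its own delay without changing the RC term.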
4.2.2 Generation of Variation Maps
[Figure 4.2 shows a grid of LABs with coordinates (1,1) to (3,2), surrounded by I/O pads, with horizontal channels Chanx(x, y) and vertical channels Chany(x, y) between the blocks.]
Figure 4.2: Coordinate system and region-based structure used by VPR router.
A coarse-grained region-based variation map is used in this experiment to explore the
improvement that can be made by variation-aware routing. The basic measurement
unit of the variation maps used for routing includes a LAB and the routing resources
on its periphery. Let V(x, y) be the measured delay from the variation map at
coordinates x and y. The parameter of process variation used in the router, Vs(x, y),
is scaled to the mean delay as follows
\[
V_s(x, y) = \frac{V(x, y)}{\mathrm{mean}(V(x, y))} \tag{4.1}
\]
The original VPR router has an internal nominal delay for each circuit component
(Dn). Within the modified variation-aware VPR router, Eq. 4.2 is used to update
the delay of circuit components. Instead of using the nominal delay, the delay of each
component is equal to the product of Dn, the nominal delay for that type
of component, and Vs(x, y), the process parameter for that region.
Thus, the standard deviation of process variation (σ) can be amplified to predict
the variation at a future technology node without having a negative delay.
\[
\mathit{Delay}(x, y) = D_n \cdot V_s(x, y) \tag{4.2}
\]
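Eqs. 4.1 and 4.2 amount to normalising the measured map by its mean and using the result as a per-region multiplier. A minimal sketch is given below; the row-major `[y][x]` indexing and the function names are assumptions made for illustration.

```python
import statistics

def scale_map(measured):
    """Eq. 4.1: divide every measured delay by the mean of the map."""
    mean = statistics.fmean([v for row in measured for v in row])
    return [[v / mean for v in row] for row in measured]

def component_delay(nominal, scaled, x, y):
    """Eq. 4.2: nominal component delay times the scaled variation of its region."""
    return nominal * scaled[y][x]

scaled = scale_map([[90.0, 110.0], [100.0, 100.0]])
slow_region_delay = component_delay(200.0, scaled, 1, 0)  # region 10% slower than mean
```

Because the scaled map has a mean of 1, the average component delay over the chip stays at its nominal value.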
A region-based variation map is efficient to apply but it has the inherent assumption
that the measured delay variation is reflected faithfully and uniformly on all circuit
components within a region, which of course may not be universally valid.
An example of a region-based variation map and coordinate system used for routing
is illustrated in Fig. 4.2. A LAB and adjacent routing resources such as switches on
Chanx (horizontal channel) and Chany (vertical channel) share the same character-
istics of process variation.
4.3 Variation-aware Routing and Partial re-routing
As introduced in Sec. 2.5.3, the timing-driven routing algorithm is an iterative
process with a routing congestion/timing cost model to produce an optimal routing
configuration. In each iteration, all nets are ripped up and re-routed once based
on the historical congestion cost from the previous iteration. Timing analysis is
performed at the end of each iteration to update the delay of the critical path and the
criticality of each source-sink connection. A wave expansion algorithm is used to
connect each net source i to its net sinks j in descending order of their criticality
Crit(i, j) [71]. To make the original routing algorithm variation-aware, two main
modifications are made: one to the cost function and one to the timing analysis.
4.3.1 Variation-aware Cost Function
The cost function of routing is used to calculate the cost when a routing resource
(node) is used by a net. Among all possible routing resources, the resource with the
least cost will be chosen to connect the source-sink pairs in the current iteration.
The original cost function for using node n in the VPR router [71] is given by:
\[
\mathit{Cost}(n) = \mathit{Crit}(i, j) \cdot \mathit{delay}(n) + [1 - \mathit{Crit}(i, j)] \cdot [b(n) + h(n)] \cdot p(n) \tag{4.3}
\]
The first term is the delay sensitivity, which is given by the product of the
criticality Crit(i, j) of the connection and the intrinsic delay delay(n) of the node.
The second term relates to the congestion sensitivity, where b(n) is the base cost of
routing a net through the routing resource node n, h(n) is the historical congestion
of n, which is increased after every routing iteration in which the resource n is
overused, and p(n) is the present congestion cost of n. The modification for
variation-aware routing involves updating the delay value and criticality based on the
variation map. The modified variation-aware cost function Costv(n) for the resource n
is as follows
\[
\mathit{Cost}_v(n) = \mathit{Crit}_v(i, j) \cdot \mathit{Delay}(x, y) + [1 - \mathit{Crit}_v(i, j)] \cdot [b(n) + h(n)] \cdot p(n) \tag{4.4}
\]
where Delay(x, y) is defined in Eq. 4.2. Critv(i, j) of each source-sink pair is updated
during the variation-aware timing analysis.
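The trade-off expressed by Eq. 4.4 can be seen in a small sketch. The function name and the numbers in the example are hypothetical; the delay argument is assumed to be the variation-aware delay from Eq. 4.2.

```python
def node_cost(crit, delay, base, hist, present):
    """Variation-aware cost of using a routing node, following Eq. 4.4.

    crit is the criticality of the connection, delay the variation-aware
    node delay, and base, hist and present the congestion terms b, h and p.
    """
    return crit * delay + (1.0 - crit) * (base + hist) * present

# Two candidate nodes: a fast but congested one and a slow but free one.
fast = dict(delay=0.9, base=1.0, hist=0.0, present=3.0)
slow = dict(delay=1.2, base=1.0, hist=0.0, present=1.0)
critical_prefers_fast = node_cost(0.99, **fast) < node_cost(0.99, **slow)
relaxed_prefers_slow = node_cost(0.10, **slow) < node_cost(0.10, **fast)
```

A highly critical connection is steered towards the fast region of the variation map despite congestion, whereas a connection with low criticality is steered away from congested resources even if they are fast.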
4.3.2 Variation-aware Timing Analysis
The variation-aware timing analysis in routing is used for two purposes.
• To determine the delay of critical path using the variation map at the end of
each iteration of routing;
• To estimate the Critv(i, j) of each source-sink connection based on variation
maps in order to decide which connections must be made via fast paths to
avoid slowing down the circuit.
Modified variation-aware timing is similar to the original timing analysis method
but variation maps are used to update the delay for each component. Critv(i, j)
is the modified criticality considering process variation as shown in Eq. 4.5 which
defines how important each source-sink connection is in terms of its effect on the
circuit’s delay.
\[
\mathit{Crit}_v(i, j) = \max\left( \left[ \mathit{MaxCrit} - \frac{\mathit{slack}_v(i, j)}{D_{v\_max}} \right]^{\eta},\ 0 \right) \tag{4.5}
\]
where Dv_max is the delay of the critical path, and slack_v(i, j) is the amount of
delay, determined by variation-aware timing analysis, that could be added to this
particular connection before it affects the critical path. η and MaxCrit are
parameters that control how a connection's slack impacts the congestion-delay
tradeoff in the cost function. The values of η and MaxCrit are set to 1 and 0.99
respectively, which were found experimentally to give the best timing and
routability [71].
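A minimal sketch of Eq. 4.5 is shown below, with the defaults set to the values of η and MaxCrit quoted above. Clamping the bracketed term before raising it to the power η is an implementation choice made here; it is equivalent to the equation for η > 0 and avoids raising a negative number to a fractional power.

```python
def criticality(slack, d_max, max_crit=0.99, eta=1.0):
    """Variation-aware criticality of a connection, following Eq. 4.5."""
    base = max(max_crit - slack / d_max, 0.0)  # clamp first (assumes eta > 0)
    return base ** eta
```

A connection with zero slack receives the maximum criticality of 0.99, and a connection whose slack equals the critical path delay receives a criticality of 0.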
The variation-aware timing analysis is performed on the timing graph
representing the circuit structure. With a breadth-first traversal of the timing graph,
the minimum required clock period for a circuit can be determined. The traversal
begins at nodes with no incoming edges (primary inputs and register outputs), which
are labelled with a signal arrival time, Tv_ar, of 0 [71]. The arrival time of any other
node i is computed according to
\[
T_{v\_ar}(i) = \max_{\forall j \in \mathit{fanin}(i)} \{ T_{v\_ar}(j) + \mathit{delay}(j, i, x, y) \} \tag{4.6}
\]
where delay(j, i, x, y) is the delay value marked on the edge joining node j to node i
with its region coordinates (x, y). This procedure continues until every node in the
timing graph is labelled. At the end, the largest arrival time is defined as the maximum
delay Dv_max.
To compute the slack, another breadth-first traversal of the timing graph
is performed in the backward direction. Each node with no outgoing edges has its
required time, Tv_re, set to Dv_max. The required time of any other node i is given by
\[
T_{v\_re}(i) = \min_{\forall j \in \mathit{fanout}(i)} \{ T_{v\_re}(j) - \mathit{delay}(i, j, x, y) \} \tag{4.7}
\]
The slack of the connection from node i to node j is then
\[
\mathit{slack}_v(i, j) = T_{v\_re}(j) - T_{v\_ar}(i) - \mathit{delay}(i, j, x, y) \tag{4.8}
\]
A connection with zero slack is on the circuit's critical path, where any increase
in the delay of such a connection will lead to a corresponding increase in the circuit's
delay. A connection with large slack could be routed via slower paths or resources
on the variation maps without affecting the circuit delay. In the presence of process
variation, the path delay is location dependent. The delays of the switches, buffers
and LUTs are updated according to the variation map introduced in Sec. 4.2.2.
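The forward and backward traversals of Eqs. 4.6 to 4.8 can be sketched as follows. The timing graph is assumed to be acyclic and is given as a dictionary of edge delays that already include the variation-map scaling; this representation and the function name are hypothetical, and a topological order is used in place of an explicit breadth-first queue per level.

```python
from collections import defaultdict, deque

def analyse_timing(edges):
    """edges maps (i, j) to the variation-aware delay of the connection i -> j.

    Returns (d_max, slack) where slack maps each connection to its slack.
    """
    fanout, fanin, nodes = defaultdict(list), defaultdict(list), set()
    for (i, j) in edges:
        fanout[i].append(j)
        fanin[j].append(i)
        nodes.update((i, j))
    # Topological order (Kahn's algorithm) so that fan-in is labelled first.
    pending = {n: len(fanin[n]) for n in nodes}
    queue = deque(n for n in nodes if pending[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in fanout[n]:
            pending[m] -= 1
            if pending[m] == 0:
                queue.append(m)
    # Forward traversal, Eq. 4.6: nodes without fan-in start at time 0.
    arrival = {}
    for n in order:
        arrival[n] = max((arrival[p] + edges[(p, n)] for p in fanin[n]),
                         default=0.0)
    d_max = max(arrival.values())
    # Backward traversal, Eq. 4.7: nodes without fan-out are required at d_max.
    required = {}
    for n in reversed(order):
        required[n] = min((required[s] - edges[(n, s)] for s in fanout[n]),
                          default=d_max)
    # Eq. 4.8: slack of every connection.
    slack = {(i, j): required[j] - arrival[i] - d for (i, j), d in edges.items()}
    return d_max, slack
```

For a graph with edges a→c (2.0), b→c (1.0) and c→d (3.0), the critical path a→c→d has zero slack on both connections, while b→c can be slowed by 1.0 before it affects the circuit delay.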
4.3.3 Overhead of Execution Time for Variation-aware Routing
During the process of variation-aware routing, the variation-aware cost function
is called thousands of times to assess, using the delay from the variation maps,
whether a routing resource can be used for the current source-sink connection.
Variation-aware timing analysis is performed at the end of each routing iteration to
calculate Critv(i, j) of all nets. Therefore, chipwise variation-aware routing as
described above is computationally intensive if applied to a complex design on an
advanced FPGA. In addition, full chipwise routing treats every FPGA individually.
Similar to full chipwise placement, the execution time required by full chipwise
routing is O(N), where N is the total number of devices used. While full chipwise
routing is still useful to provide an upper bound on the improvement that one can
expect from such a variation-aware approach, it is not practical to perform for each
individual FPGA chip.
4.3.4 Partial re-routing
Motivation of Partial re-routing
Partial re-routing is proposed in this section as a more practical and adaptive method
that can improve the timing performance of circuits under process variation
with a significantly shorter execution time than full chipwise variation-aware routing.
For most circuits, the critical and near-critical paths, which determine the speed/delay
of the circuit, make up only a small portion of the total number of paths. Therefore,
one reasonable variation-aware optimisation method is to re-route only the critical
and near-critical paths in a circuit based on variation maps. The partial re-routing
procedure is performed on a base routing configuration generated by
variation-blind routing. The critical and near-critical paths are released and re-routed
with variation-aware routing. However, if the FPGA chip is already highly
congested and few routing resources are available, such partial re-routing attempts
may be in vain. To avoid such a fruitless endeavour, two strategies are
employed. First, we can reserve a portion of the resources during the variation-blind
routing process, leaving them unallocated so that they are available for the
variation-aware re-routing phase. Second, we can release a proportion of non-critical
paths during the variation-aware re-routing phase, thus increasing the number of
available routing resources for time-critical routes.
Figure 4.3: Example of delay of paths before and after partial re-routing. (a) The
delay of paths in descending order of criticality before re-routing. (b) The delay of
paths after partial re-routing. [Both panels plot delay of path (ns, 0 to 10) against
path number (1 to 20); panel (a) also shows criticality of path (0.1 to 1) on the right
axis. The Crit_T and Non_Crit_T thresholds separate the critical and near-critical
paths from the non-critical paths.]
An example of our partial re-routing algorithm is illustrated in Fig. 4.3. Twenty paths
are displayed in descending order in terms of delay and marked with path number 1-20.
The criticality of paths, according to Eq. 4.5, is shown on the right vertical axis
of the chart. The critical path has criticality equal to 1. To apply partial
re-routing, two parameters, Crit_T and Non_Crit_T, are chosen adaptively to control
the number of paths that are optimised and released respectively. Crit_T is a
threshold in criticality (top horizontal dotted line in Fig. 4.3(a)) which selects the
paths with higher criticality, which may affect the overall circuit delay, to be
released and re-routed. For example, in Fig. 4.3(a), Crit_T is set to 0.8. Non_Crit_T,
by contrast, is defined as a proportion of the total number of paths (right vertical
dotted line in Fig. 4.3(a)) instead of a criticality. The purpose of Non_Crit_T is to
release a portion of non-critical paths to provide more space for re-routing, while
avoiding releasing more paths than necessary if there is a large number of paths with
low criticality.
In Fig. 4.3(a), Non_Crit_T is set to 25% of the 20 paths and therefore the 5 least
critical paths (16-20) are released and re-routed.
The resultant path delays after partial re-routing are illustrated in Fig. 4.3(b).
Ideally, variation-aware partial re-routing decreases the delays of critical and
near-critical paths while causing some increase in the delays of non-critical paths.
However, there is a possibility that the non-critical paths are far from the critical
paths on the chip. In such cases, even if a large portion of non-critical paths is
released (large Non_Crit_T), the freed resources cannot be used in the partial
re-routing phase. Thus a small amount (e.g. 20%) of routing resources is reserved in
the initial variation-blind routing and used in the partial re-routing, as illustrated
in Fig. 4.4, where the dashed tracks are reserved for variation-aware partial
re-routing.
Figure 4.4: Dashed lines are reserved tracks used for variation-aware re-routing.
[Diagram: a grid of CLBs with switch blocks (SB), connection blocks (CB) and I/O
blocks; the reserved tracks are drawn dashed.]
Partial re-routing Algorithm
The top level of the partial re-routing algorithm is described by the pseudo code in
Alg. 3. Let num_net be the total number of nets in the circuit. The variation-aware
timing analysis introduced in Sec. 4.3.2 is first executed using the variation map.
Crit_q is a queue used to store critical and near-critical paths with criticality
higher than Crit_T (lines 2-6). Non_Crit_q is another queue used to store a number of
non-critical paths chosen according to the parameter Non_Crit_T (lines 7-12). The
resources used by nets in Crit_q and Non_Crit_q during the variation-blind routing
phase are released by the function release_path() and stored in NetSource. In
addition, the reserved routing resources are added to NetSource (lines 13-14). Later,
the released paths in Crit_q and Non_Crit_q are re-routed by the modified routing
function, Var_aware_route(), with the variation-aware cost function and timing
analysis introduced in Sec. 4.3. ExitCriterion on line 15 returns true when either of
two conditions is satisfied: (i) the maximum number of routing iterations is reached;
or (ii) the delay of the critical path has not changed for 3 consecutive routing
iterations and there are no overused resources.
Algorithm 3 Partial re-routing algorithm.
1: (Crit_v(i), num_net) := timing_analysis(variation_map)
2: for each net i ∈ N do
3:   if Crit_v(i) > Crit_T then
4:     Crit_q ← i
5:   end if
6: end for
7: for each net i, in increasing Crit_v(i) order do
8:   if Length of Non_Crit_q < Non_Crit_T × num_net then
9:     Non_Crit_q ← i;
10:    i = i + 1;
11:  end if
12: end for
13: NetSource := release_path(Crit_q, Non_Crit_q)
14: NetSource := NetSource ∪ ResSource
15: while ExitCriterion() == False do
16:   for net m in Crit_q do
17:     Var_aware_route(m, NetSource);
18:   end for
19:   for net n in Non_Crit_q do
20:     Var_aware_route(n, NetSource);
21:   end for
22: end while
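The path selection in lines 1-12 of Alg. 3 can be sketched in Python as follows. This is
a minimal illustration rather than the router's actual code: the function name
select_paths, the dictionary of criticalities and the explicit guard that keeps critical
nets out of the non-critical queue are assumptions made for the example.

```python
def select_paths(crit, crit_t, non_crit_t):
    """Pick the nets to release before variation-aware re-routing.

    crit       -- dict mapping net id -> criticality in [0, 1] (hypothetical input)
    crit_t     -- criticality threshold (Crit_T)
    non_crit_t -- fraction of all nets to release as non-critical (Non_Crit_T)
    """
    # Lines 2-6: nets whose criticality exceeds Crit_T.
    crit_q = [net for net, c in crit.items() if c > crit_t]

    # Lines 7-12: least critical nets first, up to Non_Crit_T x num_net of them.
    budget = non_crit_t * len(crit)
    non_crit_q = []
    for net in sorted(crit, key=crit.get):
        if len(non_crit_q) >= budget:
            break
        if crit[net] <= crit_t:  # assumption: a net is never released twice
            non_crit_q.append(net)
    return crit_q, non_crit_q


# Hypothetical example: 20 nets, net 1 being the critical path.
criticality = {i: (21 - i) / 20 for i in range(1, 21)}
crit_q, non_crit_q = select_paths(criticality, crit_t=0.82, non_crit_t=0.25)
print(crit_q)      # nets above the criticality threshold
print(non_crit_q)  # least critical nets, released to free routing resources
```

Both queues would then be passed to the release and re-routing steps (lines 13-22),
which depend on the router's internal data structures and are not sketched here.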
4.4 Experiment and Results
4.4.1 Experiment Flow
Three routing methods, variation-blind routing, variation-aware routing and partial
re-routing, are tested as shown in Fig. 4.5. At the beginning, an original placement
and routing is executed to produce a placement configuration and to search for the
minimum routing channel width (MCW). This placement configuration is later used by
all three routing methods. For the variation-blind method, an original routing process
with 1.2 times MCW is performed to generate one routing configuration. Based on this
routing configuration, 100 variation maps are loaded for variation-aware timing
analysis to evaluate the delay of the critical path. For variation-aware routing, the
variation maps are used both during the routing process with 1.2 times MCW and during
timing analysis. In this case, 100 routing configurations are produced, one for each
variation map. For the partial re-routing method, a variation-blind routing is first
performed to produce a base routing configuration with 1.0 times MCW. After that, the
critical path and a small number of near-critical paths are identified with Crit_T by
variation-aware timing analysis and then released. At the same time, a certain
percentage (Non_Crit_T) of non-critical paths is released to provide more routing
space. Later, a variation-aware routing is performed to re-route all these paths using
the routing resources provided by the reserved 0.2 times MCW and the released
resources. At the end, all results are collected and compared in terms of the delay of
critical paths. The stages highlighted with dotted lines in Fig. 4.5 show the modified
routing process with the variation-aware routing algorithm introduced in Sec. 4.3.
Figure 4.5: Work flows for variation-blind, partial re-routing and full chipwise
routing. [Flow chart with three branches starting from normal P&R by VPR (placement
configuration and MCW): variation-blind routing with 1.2 MCW (1 routing
configuration); full chipwise variation-aware routing with 1.2 MCW (100 routing
configurations); and variation-blind routing with 1.0 MCW followed by variation-aware
partial re-routing with 0.2 MCW (100 routing configurations). Each branch ends with 100
variation-aware timing analyses using the 100 variation maps, and the results are
compared in terms of delay of critical path.]
4.4.2 Experiment Setup
The variation maps used in this experiment are based on the region-based variation
maps in Sec. 4.2.2. Stochastic variation is predicted to increase dramatically in the
future [21]; therefore, the stochastic part of the process variation in the variation
maps is amplified to make the overall variation σ/µ = 30%. We have tested 20 MCNC
benchmark circuits with variation-aware routing to investigate the potential upper
bound of improvement. After that, the partial re-routing optimisation method is
executed to reduce the actual path delays under process variation. For both routing
methodologies, the same placement configuration is used. For the partial re-routing
method, Crit_T is set to 0.9 and Non_Crit_T is set to 10%, which means that the nets
with criticality higher than 0.9 are re-routed, and at least 10% of the non-critical
paths are released for the partial re-routing process. However, Crit_T and Non_Crit_T
can be chosen adaptively to meet the design requirement.
The research by Rubin [89] demonstrated that the VPR router does not provide
a repeatable delay for the critical path. Such variation in the algorithm, termed
“router noise”, may mask the delay variation due to process variability. The authors
proposed a method called target delay search in order to reduce this algorithmically
induced uncertainty. This noise reduction method is applied in our router.
The architecture used in this experiment is set to have the same logic cluster struc-
ture as the Cyclone III FPGA with 16 LUTs inside one LAB. The variation-blind
routing is executed first to obtain the minimum channel width and the delay of
critical paths under process variation. For partial re-routing, an extra 20% of the
channel width is reserved during the initial variation-blind routing and then used in
variation-aware re-routing for critical and near-critical paths. The maximum num-
ber of routing iterations is set to 100. To fully make use of routing resources for
partial re-routing, the universal switch box is adopted in this experiment.
There are two cases in which the re-routing process may be marked as “failed”. Firstly,
partial re-routing may result in negative improvement because of router noise,
which may not be completely avoided even when applying the noise reduction method [89].
Secondly, a legal routing solution with the specified CW may not be found within 100
iterations. If either of these two cases occurs during the re-routing process, the
original routing is adopted as the final re-routing solution and the variation-aware
re-routing phase provides no improvement.
4.4.3 Results of Variation-aware Partial re-routing
The results of variation-blind routing, partial re-routing and variation-aware routing
are illustrated in Fig. 4.6. Similar to variation-blind placement, variation-blind
routing is used as the reference group. The mean timing improvement provided by
variation-aware routing in terms of delay of critical path is about 6.4% with 95%
timing yield. By comparison, the partial re-routing method with Crit_T = 0.9 and
Non_Crit_T = 10% achieves a 5.2% improvement.
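As an illustration of how such figures can be derived from per-chip results, the sketch
below computes a delay at a given timing yield and a relative improvement. It assumes
that the delay at 95% timing yield is the smallest critical path delay met by at least
95% of the chips; this definition, and the names delay_at_yield and improvement, are
assumptions made for the example and are not taken from the experiment scripts.

```python
import math


def delay_at_yield(delays, timing_yield=0.95):
    """Smallest delay that at least `timing_yield` of the chips can meet."""
    ordered = sorted(delays)
    # Number of chips that must meet the delay; the small offset guards
    # against floating-point round-up in timing_yield * len(ordered).
    k = math.ceil(timing_yield * len(ordered) - 1e-9)
    return ordered[max(k, 1) - 1]


def improvement(reference_delay, optimised_delay):
    """Relative reduction in delay with respect to the reference."""
    return (reference_delay - optimised_delay) / reference_delay


# Hypothetical critical path delays (ns) of one design on 100 chips.
blind = [10.0 + 0.01 * i for i in range(100)]
aware = [9.4 + 0.01 * i for i in range(100)]
gain = improvement(delay_at_yield(blind), delay_at_yield(aware))
print(f"improvement at 95% timing yield: {gain:.1%}")
```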
Figure 4.6: The delay of critical path for 20 MCNC circuits performing variation-blind,
partial re-routing and variation-aware routing. [Bar chart of delay of critical path
for 95% timing yield (ns, 0 to 20) for the benchmarks alu4, apex2, apex4, bigkey, clma,
des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584.1,
seq, spla and tseng.]
A number of interesting observations can be made from Fig. 4.6. Firstly, for some
benchmarks, e.g. ex5p and des, the differences between variation-aware routing (the
upper bound) and partial re-routing are small, which shows that partial re-routing
is effective in dealing with process variation. Secondly, for some benchmark circuits,
e.g. alu4 and apex4, the improvement provided by partial re-routing is modest and
the result is similar to variation-blind routing. One possible reason for this result
is that the critical paths of these benchmark circuits are all located in one region of
the variation map. The delays of routing elements are identical within one region;
therefore, partial re-routing cannot provide any improvement, since choosing fast
resources in other regions may require a longer critical path. In addition, the
improvement of partial re-routing may be affected by the initial variation-blind
routing. Another observation from this figure is that the improvement made by
variation-aware routing (6%) is not as significant as that of variation-aware placement
(11%), because there are more constraints in the variation-aware routing process, such
as the congestion problem and router noise.
The density of the critical path delay for ex5p, which achieves the most improvement,
is shown in Fig. 4.7. It can be seen that both the full chipwise routing and the partial
re-routing method achieve better critical path timing than variation-blind routing
across the 100 variation maps. However, some results provided by the full chipwise
routing are similar to variation-blind routing, which is caused by the noise of the
router even when applying the noise reduction method. The standard deviation of partial
re-routing is smaller than that of variation-aware routing because all partial
re-routing processes are based on the same initial variation-blind routing. All in all,
this result highlights that we can achieve similar timing performance with the quicker
partial re-routing method compared with full chipwise routing.
Figure 4.7: Density of delay of critical path for ex5p. [Density (0 to 0.25) against
delay of critical path (0.85 to 1.2, axis scaled by 10^-8) for variation-blind routing,
partial re-routing and full chipwise routing.]
4.4.4 Comparison of Run Time Cost between Chipwise routing and Partial re-routing
As illustrated in Fig. 4.6, the difference in timing improvement between
variation-aware routing and partial re-routing is small. In this section,
the timing performance and execution time of partial re-routing are examined by
scaling Crit_T from 0.5 to 0.95. Non_Crit_T is fixed to 10% to provide re-routing
space in this case. The benchmark ex5p, which achieved the most improvement from
variation-aware routing and partial re-routing, is used to explore the relationship
between timing improvement and Crit_T.
Figure 4.8: The timing improvement made by partial re-routing by scaling Crit_T
from 0.5 to 0.95. [Delay of critical path (axis labelled ps, 11 to 12.5) against
Crit_T (0.5 to 0.95) for variation-aware routing, partial re-routing and
variation-blind routing.]
The improvement for partial re-routing is shown in Fig. 4.8. Partial re-routing (plot
with circle markers) with Crit_T = 0.5 and Crit_T = 0.95 achieves about 8.7%
and 7.8% improvement respectively. In other words, increasing the number of re-routed
near-critical paths (decreasing Crit_T from 0.95 to 0.5) provides less than 1%
additional improvement in timing performance. Besides that, the timing performance
with Crit_T = 0.9 in Fig. 4.8 appears to be worse than that with Crit_T = 0.95,
which is likely due to the remaining router noise in VPR after noise reduction.
The execution time for partial re-routing is illustrated in Fig. 4.9(a). The execution
time of full chipwise routing is shown as the top dotted line in the plot as a reference.
Compared with full chipwise routing, partial re-routing with Crit_T = 0.95 requires
only about 20% of the execution time. With decreasing Crit_T, more near-critical paths
are re-routed and the execution time for partial re-routing (plot with diamond
markers) increases as expected. Due to the noise of the router, the plot is not
perfectly monotonic. Considering one complete routing procedure, a variation-blind
routing is required to provide a pre-defined routing solution. The execution time
of variation-blind routing is about 120 seconds (lower dotted line). Therefore, for
one full re-routing procedure, the execution time is the sum of variation-blind routing
and partial re-routing (plot with circle markers), which is still less than full
chipwise variation-aware routing when Crit_T is 0.8 or higher.
Moreover, consider the scenario where partial re-routing is applied to multiple FPGAs.
The execution time is dramatically reduced compared with full chipwise
variation-aware routing. The execution time for routing 100 chips is illustrated
in Fig. 4.9(b). Since the initial variation-blind routing configuration is common to
all FPGAs with the same CW and architecture, regardless of their variation maps, it
needs to be generated only once. For full chipwise routing, the total execution time
for 100 variation-aware routings is about 2 × 10^4 seconds, which is over
5.5 hours. By applying partial re-routing, the execution time is reduced to about
0.25 × 10^4 seconds (≈42 minutes). In this case, a speedup of about 8 times is achieved.
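The arithmetic behind these figures can be checked with a short sketch. It models the
total run time as a one-off cost shared by all chips plus a cost paid for every chip;
the function name total_routing_time is an assumption for illustration, and only the
two reported totals are used as data.

```python
def total_routing_time(n_chips, per_chip_s, one_off_s=0.0):
    """One-off cost (shared by all chips) plus a cost paid for every chip."""
    return one_off_s + n_chips * per_chip_s


N_CHIPS = 100
FULL_TOTAL_S = 2.0e4       # reported total for 100 full chipwise routings
PARTIAL_TOTAL_S = 0.25e4   # reported total when partial re-routing is applied

# Full chipwise routing has no shared step, so its cost grows linearly.
full_per_chip_s = FULL_TOTAL_S / N_CHIPS
assert total_routing_time(N_CHIPS, full_per_chip_s) == FULL_TOTAL_S

speedup = FULL_TOTAL_S / PARTIAL_TOTAL_S
print(f"full chipwise:      {FULL_TOTAL_S / 3600:.2f} h")
print(f"partial re-routing: {PARTIAL_TOTAL_S / 60:.1f} min")
print(f"speedup:            {speedup:.1f}x")
```

For partial re-routing, the one-off term corresponds to the single variation-blind
routing, whose cost is amortised over all chips that share the same design.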
4.5 Conclusion
This work employs detailed delay variation information of individual FPGA chips
to drive a timing-driven router in VPR in order to improve the delay of the critical
paths for a given design. The results show that full chipwise routing can achieve
about 6% improvement in terms of delay of critical path on average for 20 MCNC
benchmarks. Partial re-routing with Crit_T = 0.9 can achieve a similar improvement
(5%). For 100 FPGAs, an 8 times speedup in execution time is observed for the
proposed variation-aware partial re-routing method against full chipwise
variation-aware routing. In addition, a tradeoff between timing performance and
execution time can be made adaptively by choosing different values of Crit_T and
Non_Crit_T.
Figure 4.9: Improvement of execution time by varying Crit_T, and the number of
FPGAs for the ex5p benchmark. (a) The execution time used by partial re-routing by
scaling Crit_T from 0.5 to 0.95. (b) The execution time used by partial re-routing for
1 to 100 FPGAs. [Panel (a) plots execution time (s, 0 to 250) against Crit_T for
variation-aware routing, partial re-routing, variation-blind routing and
variation-blind plus partial re-routing; panel (b) plots execution time (s, 0 to
2 × 10^4) against the number of FPGAs (0 to 100) for variation-blind routing,
variation-aware routing and partial re-routing.]
Chapter 5
Variation-aware Retiming
5.1 Introduction
Chapters 3 and 4 showed the improvement made by variation-aware placement and
routing. The results of our experiments showed that the impact of process variation
can be alleviated efficiently by these methods. However, further enhancement of the
timing performance using variation measurements and post place-and-route (P&R)
methods may be possible. Therefore, we proposed a post P&R variation-aware
retiming method based on the measured variation maps, which was accepted as a paper at
the International Conference on Field Programmable Logic and Applications in 2013.
We tackle variation by retiming the initial design, taking into account the variation
maps. This method is facilitated by the availability of a large number of unallocated
registers in a typical FPGA design. Also, it does not affect the configuration
of placement and routing; only the positions of the flip-flops are reprogrammed.
We show that designs having several retiming choices can benefit greatly from this
method in the presence of variation.
We conducted experiments based on the measured variation maps from commercial
FPGAs, and the retiming algorithm is implemented using Matlab.
5.2 Concept and Potentials
In this section we discuss a small example to explain our variation-aware retiming
method. The example circuit is an 8-bit ripple-carry adder with a 3-stage pipeline.
In Fig. 5.1(a) we show the retiming choices with equivalent logic depths for this
configuration. The nominal delay of each path in terms of full-adder (FA) delay δ is
marked on the top. Now if we imagine that the FA delays are subject to variations
with Gaussian distribution N(µ, σ), the delay of each stage (path) can then be
represented using a Random Variable (RV) (e.g. P11, P12 etc.). Since we assume that
the FA delays are independent and identically distributed Gaussian variables, the
delay of a path containing three FAs can be represented as P11 ∼ N(3µ, √3 σ).
The delay distribution of the first retiming choice can be represented using a RV
choice1 = max(P11, P12, P13). Similarly, choice2 = max(P21, P22, P23) and choice3 =
max(P31, P32, P33). These distributions are plotted in Fig. 5.1(b). We see that the
distributions are identical since they have equivalent logic depths. We have used a
sample size of 100,000, with µ = 8 and σ = 0.8, which gives a coefficient of variation
of 10% (σ/µ = 0.1).
Given that we have the measured variation maps of each sample of this adder, let
us imagine a very simple variation-aware retiming method:
• For each sample, evaluate the delay of 3 choices using the measured variation
maps.
• Retain the choice which gives minimal delay and program it into FPGA.
The result of such a variation-aware retiming is plotted in Fig. 5.1(b). If we compare
the 99% quantiles of the distributions of choice1 and min(choice1, choice2,
choice3), we observe a 4% improvement in terms of critical path delay.
Figure 5.1: Motivational example: circuits with several retiming choices with
equivalent logic depth can lead to great improvement through variation-aware retiming.
(a) Retiming choices with equivalent logic depth in an 8-bit ripple carry adder
with 3-stage pipeline:
Retiming Choice 1: P13 = sum(δ7, δ6), P12 = sum(δ5, δ4, δ3), P11 = sum(δ2, δ1, δ0)
Retiming Choice 2: P23 = sum(δ7, δ6, δ5), P22 = sum(δ4, δ3), P21 = sum(δ2, δ1, δ0)
Retiming Choice 3: P33 = sum(δ7, δ6, δ5), P32 = sum(δ4, δ3, δ2), P31 = sum(δ1, δ0)
(b) Effect of variation-aware retiming is shown with the green curve
min(choice1, choice2, choice3). [Density plots (0 to 0.30) of delay (18 to 30) for
choice1, choice2, choice3 and min(choice1, choice2, choice3), with the 99% points
marked; N = 100000, bandwidth = 1.]
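The experiment behind this example can be reproduced approximately with a small Monte
Carlo sketch. It assumes independent Gaussian FA delays with µ = 8 and σ = 0.8 and the
three stage partitions of Fig. 5.1(a); the sample size, the random seed and the simple
quantile estimator are choices made here for illustration, so the resulting
improvement will only be close to, not identical with, the figure quoted above.

```python
import random

MU, SIGMA = 8.0, 0.8

# Stage partitions of the 8 full adders (bit indices 0..7) for each choice.
CHOICES = [
    [(0, 1, 2), (3, 4, 5), (6, 7)],   # choice 1: P11, P12, P13
    [(0, 1, 2), (3, 4), (5, 6, 7)],   # choice 2: P21, P22, P23
    [(0, 1), (2, 3, 4), (5, 6, 7)],   # choice 3: P31, P32, P33
]


def choice_delay(fa_delays, stages):
    """Critical path delay of one retiming choice: the slowest pipeline stage."""
    return max(sum(fa_delays[i] for i in stage) for stage in stages)


def quantile(values, q):
    """Simple empirical quantile (no interpolation)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def simulate(n_samples=20000, seed=1):
    rng = random.Random(seed)
    fixed, best = [], []
    for _ in range(n_samples):
        fa = [rng.gauss(MU, SIGMA) for _ in range(8)]   # one sampled chip
        delays = [choice_delay(fa, stages) for stages in CHOICES]
        fixed.append(delays[0])    # always use choice 1
        best.append(min(delays))   # pick the fastest choice for this chip
    q_fixed = quantile(fixed, 0.99)
    q_best = quantile(best, 0.99)
    return q_fixed, q_best, (q_fixed - q_best) / q_fixed


q_fixed, q_best, gain = simulate()
print(f"99% delay, choice1:     {q_fixed:.2f}")
print(f"99% delay, best choice: {q_best:.2f}")
print(f"improvement:            {gain:.1%}")
```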
Note that the improvement is highly dependent on the number of retiming options.
For example, for an 8-bit adder with a 2-stage pipeline, the unique retiming choice
with equivalent logic depth is the one where both stages have 4 FAs, {4,4}. Although
in the presence of high variation the choice {5,3} can have a lower delay than {4,4}
in some cases, the overall improvement in delay is small. Similarly, for a fully
pipelined adder (8 FFs) there is only one retiming choice with equivalent logic depth,
and hence no improvement through variation-aware retiming. The factors that determine
the number of choices include the scale of process variation, the complexity of the
benchmarks (length of critical path), the number of FFs and the heterogeneity of the
FPGA architecture.
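The number of retiming choices with equivalent logic depth in the adder example can be
counted by enumerating how the FAs are split into pipeline stages. The sketch below is
only an illustration: it assumes that a choice is a split of the FA chain into
consecutive non-empty stages and that "equivalent logic depth" means the longest stage
is as short as possible; the function names are hypothetical.

```python
def compositions(n, k):
    """Yield every way to split n items into k consecutive non-empty groups."""
    if k == 1:
        yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def equivalent_depth_choices(n_fa, n_stages):
    """Splits whose longest stage has the minimum possible number of FAs."""
    min_depth = -(-n_fa // n_stages)   # ceiling division
    return [c for c in compositions(n_fa, n_stages) if max(c) == min_depth]


for stages in (2, 3, 8):
    choices = equivalent_depth_choices(8, stages)
    print(stages, "stages:", len(choices), "choice(s)", choices)
```

For the 8-bit adder this gives three choices for the 3-stage pipeline and a single
choice for both the 2-stage and the fully pipelined case, in line with the discussion
above.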
5.3 Generation of Variation Maps
5.3.1 Measured Variation Maps
Similar to the measurements for variation-aware placement and routing, we measured
real FPGAs to collect variation maps, but used a different method so that the
measured variation maps are suitable for retiming. At present, process variation is
not yet significant enough to worry the industry; e.g. the Cyclone III FPGA (65nm)
has only 1.3% (σdelay/µdelay) variation. Nevertheless, as feature sizes continue to
scale down, this is expected to increase in the future and become a driver for the
industry. As introduced in Sec. 3.2.2, the variation maps of DE0 boards (Cyclone III
FPGAs) are measured using the transition probability (TP) measurement method from
Wong et al. [12]. However, for retiming, more fine-grained delay variation maps are
measured. To take into account Pelgrom’s Law for future technology nodes, we scaled
the variation maps by amplifying the uncorrelated random component of the measured
variation; the scaled maps are used in variation-aware retiming to evaluate the
performance. An example of a cross-chip variation map measured from an Altera
Cyclone III EP3C16 is shown in Fig. 5.2. Fig. 5.2(a) shows the layout of the FPGA
architecture from Quartus and Fig. 5.2(b) is the corresponding variation map from our
measurement. Different from the variation maps for placement in Sec. 3.2.2, a 16 logic
element (LE) logic array block (LAB) is divided into 4 basic measurement units (BMUs),
and the delay of each BMU is collected to form a unit on the variation map. For
practical use, we assume that the resources within a measurement unit, a group of 4
LUTs and FFs, share the same timing characteristics. This process is repeated to build
up a database of 100 variation maps, but the correlated variations will remain roughly
constant.
Figure 5.2: Variation map example: one measured variation map on Cyclone III
FPGA. (a) The layout of measurement circuit on Cyclone III FPGA. (b) Measured
variation map for Cyclone III FPGA.
5.3.2 Amplification of Variation Maps
At the 22nm technology node, the process variation in terms of Vth is predicted to
be 27% (σVth/µVth) [8, 64]. However, the results of our measurement show that at
65nm, as used by Cyclone III, the variation σ/µ of LUT delay is only 1.4% across
100 devices [14]. The timing model used in commercial FPGA tools is pessimistic,
reporting the worst-case operating frequency (fworst) that designs can achieve. It is
understandable that the timing model has to be conservative enough to maintain
reliability for all possible environments and process variations across many FPGAs.
While this approach ensures reliability in commercial devices with delay variability,
much of the potential timing performance of on-chip resources is wasted. Therefore,
it is worthwhile to put effort into full chipwise retiming to obtain the potential
upper-bound improvement that our variation-aware method can achieve. In addition,
process variation is expected to increase dramatically in future devices. In our
experiment, similar to the method used in Sec. 3.2.2, measured variation maps are
amplified to σdelay/µdelay = 30% to emulate the level of process variation in
future technologies, where variation-aware methods will be of most interest [8].
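One way such an amplification could be carried out is sketched below. The sketch is an
assumption-laden illustration and not the procedure of Sec. 3.2.2: it assumes that a
map is available as a flat list of delays together with an estimate of its correlated
(systematic) component, and it scales only the remaining random part by a factor k
chosen so that the overall σ/µ reaches the target while the mean is preserved.

```python
import math
import random
from statistics import fmean, pstdev


def amplify_random_component(delays, systematic, target_cv):
    """Scale the uncorrelated part of a delay map so that sigma/mu == target_cv.

    delays     -- measured delays, one per map location (hypothetical input)
    systematic -- estimate of the correlated component at the same locations
    Returns the amplified delays and the scaling factor k.
    """
    mu = fmean(delays)
    resid = [d - s for d, s in zip(delays, systematic)]
    shift = fmean(resid)
    resid = [r - shift for r in resid]          # zero-mean random part
    base = [s + shift for s in systematic]      # systematic part with mean mu
    var_s = fmean([(b - mu) ** 2 for b in base])
    var_r = fmean([r * r for r in resid])
    cov = fmean([(b - mu) * r for b, r in zip(base, resid)])
    # Solve var_r*k^2 + 2*cov*k + var_s = (target_cv*mu)^2 for the positive k.
    c = var_s - (target_cv * mu) ** 2
    disc = cov * cov - var_r * c
    if var_r == 0 or disc < 0:
        raise ValueError("target variation cannot be reached by scaling")
    k = (-cov + math.sqrt(disc)) / var_r
    return [b + k * r for b, r in zip(base, resid)], k


# Hypothetical map: a smooth gradient plus a small random component.
random.seed(0)
systematic = [100 + 0.5 * i for i in range(50)]
measured = [s + random.gauss(0, 1.3) for s in systematic]
amplified, k = amplify_random_component(measured, systematic, target_cv=0.30)
print(f"k = {k:.1f}, sigma/mu = {pstdev(amplified) / fmean(amplified):.3f}")
```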
5.4 Retiming Algorithm
5.4.1 FF constraints for Modern FPGAs
The research on retiming in this thesis focuses on the island-style FPGA architecture
[71], which is consistent with the commercial Cyclone III FPGAs we measured.
In this type of architecture, a 4-input lookup table and its associated Flip-Flop (FF)
are called a basic logic element (BLE). One logic array block (LAB) consists of 16
BLEs. Current commercial FPGAs allow the combinational output of one LUT to
drive an empty FF in another BLE in the same LAB through local routing. This
feature makes post place and route (P&R) retiming possible on FPGAs with such an
architecture. However, the P&R results of the BLEs are kept unchanged during
the retiming process to avoid destroying what would have been a near-optimum
solution. In this variation-aware retiming algorithm, the FF can move backward
along the critical path inside the LAB, which is called path-based FF constraint
retiming. In this way, our retiming method can be seen as an additional step that
"squeezes" the last drop of performance out of a given design by exploiting the
knowledge about the process variation of a specific device. It is worth noting that
this approach is not even a trade-off, i.e. the performance gain is not at the expense
of some other parameter; it is entirely free (except for the extra computation
required).
5.4.2 Formal Description
Algorithms for optimal retiming in digital circuits are already available in the lit-
erature. An iterative retiming method is proposed in [90], where the clock period
is reduced in successive iterations using a binary search strategy, until no further
reduction is possible. Another incremental retiming method to optimize clock pe-
riod by moving the FF backward is introduced in [91]. For our purposes, we chose
to use the algorithm based on [91], since it is reported to be faster compared to the
method in [90].
Our retiming algorithm operates on a directed graph G(V,E) which has the following
components:
• V: A set of vertices which model Primary Inputs (PIs) and Primary Outputs (POs)
of LUTs/FFs, LABs and other logic blocks. Fig. 5.3(a) shows an
example with vertices labeled from 1 to 9.
• e(u, v): An edge defined by its start vertex u and end vertex v within set
V, where u and v are two consecutive vertices.
• E: The set of all edges e(u, v) between vertices in V.
• d(u, v): The delay of edge e(u, v).
• w(u, v): The number of registers on edge e(u, v).
• wr(u, v): The number of registers on edge e(u, v) after retiming.
• r(v): The number of registers shifted backward over vertex v during retiming,
where r(v) ≥ 0.
• fix(u, v): A boolean value for each e(u, v) which denotes whether the edge is
fixed, i.e. fix(u, v) = 1 if no register can be moved onto that edge.
Figure 5.3: Example of post P&R retiming. (a) The placement and routing
configuration. (b) The directed graph used by retiming based on the placement and
routing configuration. (c) The directed graph after variation-aware retiming. [In (b)
and (c), each vertex 1 to 9 is annotated with its r value and each edge with its wr,
fix and delay d values.]
The various components of a directed graph generated from an example circuit can
be found in Fig. 5.3. Fig. 5.3(a) shows the result of variation-aware P&R for a
simple circuit including only one net going through 2 size-4 LABs. Assuming that the
four highlighted FFs in the left LAB (1,1) are occupied, no more FFs can be shifted
into this LAB. Besides that, it is impossible to move an FF onto the edges e(4, 5) and
e(5, 6) because of global routing. The directed graph parsed from the result of P&R
is illustrated in Fig. 5.3(b). For each edge, fix(u, v) is equal to 1 if FFs cannot be
moved onto this edge. Fig. 5.3(c) shows the directed graph for the optimal solution
after retiming. Based on the constraints of the FPGA architecture, it can be seen that
the optimal retiming solution for this example is to shift F1 backward one step before
L4, with the solution r(8) = 1, wr(8, 9) = 0 and wr(7, 8) = 1. Given such a graph, we
state our retiming problem as: find a suitable rearrangement wr(u, v), ∀e(u, v) ∈ E,
such that
• The clock period (critical path delay) is minimized.
• The functionality remains unchanged, i.e. the number of registers (latency)
on all paths from each PI to each PO remains unchanged.
Eq. 5.1 explains how to update the weight of any edge in the directed graph after
retiming while making sure the functionality is not affected.
$$ w_r(e) = w(e) + r(v) - r(u), \quad \forall e(u, v) \in E \tag{5.1} $$
Let T denote the current clock period. The start time t(v) of vertex v is defined
in Eq. 5.2.
$$t(v) = \max\Bigl(0,\ \max_{\forall e(u,v) \in E} \bigl(t(u) + d(u, v) - w_r(u, v) \cdot T\bigr)\Bigr), \quad \forall u, v \in V \tag{5.2}$$
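One way to evaluate Eq. 5.2 is to start every t(v) at zero and relax the edges repeatedly until nothing changes. The sketch below does this on a hypothetical three-vertex path; the graph, the delays and the function name `start_times` are illustrative assumptions, and the relaxation scheme is not necessarily the one used in the thesis implementation.

```python
def start_times(vertices, edges, T, max_passes=1000):
    """Iterate Eq. 5.2 from t(v) = 0 until no start time changes.

    edges maps (u, v) -> (d, w_r): the edge delay and the retimed register count.
    """
    t = {v: 0.0 for v in vertices}
    for _ in range(max_passes):
        changed = False
        for (u, v), (d, w_r) in edges.items():
            candidate = max(0.0, t[u] + d - w_r * T)
            if candidate > t[v]:
                t[v] = candidate
                changed = True
        if not changed:
            return t
    raise ValueError("start times did not converge; T may be too small for this graph")


# Hypothetical path a -> b -> c with one register on the second edge.
edges = {("a", "b"): (4, 0), ("b", "c"): (9, 1)}
t = start_times(["a", "b", "c"], edges, T=10)
print(t)  # {'a': 0.0, 'b': 4.0, 'c': 3.0}
```

Every start time is at most T = 10 here, so this clock period satisfies the timing constraint for the assumed graph.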
Let C be the set of edges within one LAB and K be the LAB size of the FPGA
architecture (the number of BLEs in one LAB). For the post-P&R variation-aware
retiming in this research, one additional constraint, Eq. 5.3e, is imposed by the
size of the LAB in the FPGA architecture. With this constraint, the number of FFs
used in one LAB cannot exceed the number of FFs that the LAB contains. The problem
statement is summarised in Eq. 5.3.
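$$\begin{aligned}
\text{Goal:}\quad & \text{minimize } T && (5.3a)\\
\text{Constraints:}\quad & r(u) = r(v), \quad \forall e(u, v) \text{ s.t. } \mathrm{fix}(u, v) = 1 && (5.3b)\\
& w_r(u, v) \ge 0, \quad \forall e(u, v) \text{ s.t. } \mathrm{fix}(u, v) = 0 && (5.3c)\\
& T \ge t(v), \quad \forall v \in V && (5.3d)\\
& \sum_{\forall e(u, v) \in C} w_r(u, v) \le K && (5.3e)
\end{aligned}$$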
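The structural constraints 5.3b, 5.3c and 5.3e can be checked for a candidate retiming with a few lines of Python. The sketch below is illustrative only: the four-vertex graph, the single LAB edge set and the function name `violated_constraints` are assumptions, and the timing constraint 5.3d is deliberately left out because it needs the start times of Eq. 5.2.

```python
def violated_constraints(edges, r, lab_edges, K):
    """Return the labels of the constraints in Eq. 5.3 that a retiming r violates.

    edges maps (u, v) -> (w, fix); lab_edges is the edge set C of one LAB.
    """
    # Retimed weights from Eq. 5.1.
    w_r = {(u, v): w + r[v] - r[u] for (u, v), (w, _) in edges.items()}
    violated = set()
    for (u, v), (_, fix) in edges.items():
        if fix == 1 and r[u] != r[v]:
            violated.add("5.3b")
        if fix == 0 and w_r[(u, v)] < 0:
            violated.add("5.3c")
    if sum(w_r[e] for e in lab_edges) > K:
        violated.add("5.3e")
    return violated


# Hypothetical graph: edge (3, 4) uses global routing, so fix = 1.
edges = {(1, 2): (0, 0), (2, 3): (1, 0), (3, 4): (0, 1)}
lab_edges = [(1, 2), (2, 3)]

ok = violated_constraints(edges, {1: 0, 2: 1, 3: 0, 4: 0}, lab_edges, K=4)
bad = violated_constraints(edges, {1: 1, 2: 0, 3: 0, 4: 0}, lab_edges, K=4)
print(ok, bad)
```

The second retiming would require a negative number of registers on edge (1, 2), so it is rejected by 5.3c.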
5.4.3 Variation-Aware Retiming Algorithm
The retiming algorithm proposed in this section is based on the algorithm from [91],
modified to take account of the process variation maps of an actual FPGA (Altera’s
Cyclone III). The main algorithm and its three key functions, Init, r_ADJUST
and t_ADJUST, are introduced in this section.
Main Algorithm
Algorithm 4 Main algorithm of variation-aware retiming
Input: P, R, I, V
Output: T_opt
1: (G, r, t, N_i, T, T_opt) := Init(P, R, I, V)
2: while ¬(crit_cycle_flag) ∧ ¬(m_cycle_flag) do
3:   Q ← {v | v ∈ V with t(v) = T}
4:   if Q = ∅ then
5:     (t, T) := t_ADJUST(G, t, T)
6:   else
7:     (r, t, T, N_i) := r_ADJUST(G, r, t, T, Q, N_i)
8:     if T_opt > T then
9:       T_opt ← T
10:     end if
11:   end if
12:   crit_cycle_flag := UpdateCritFlag(G, r, t)
13:   m_cycle_flag := UpdateMFlag(G, r)
14: end while
15: return T_opt
The main retiming algorithm, shown in Alg. 4, takes four inputs: the result of
placement (P), the result of routing (R), architectural information of the target
FPGA (I) and the variation maps (V). Based on these four inputs, the Init function
establishes the directed graph G(V, E) and initialises the other parameters, defined
in Sec. 5.4.2, that are used by the retiming process. Q is a queue that stores the
critical nodes (t(v) = T) which need to be retimed. Next, the two functions t_ADJUST
and r_ADJUST are used to minimise the clock period T. t_ADJUST searches for a
potentially better clock period by decreasing T without changing the location of
FFs, while r_ADJUST attempts to reach the target T set in t_ADJUST by moving FFs
backward along the critical path as in [91]. In this algorithm, FFs are only shifted
in the reverse direction, and T_opt stores the lowest value of T observed.
t_ADJUST and r_ADJUST are executed alternately until the exit condition is
satisfied. Two flags, crit_cycle_flag and m_cycle_flag, are updated in every
iteration in lines 12 and 13 [91] to determine whether the result is optimal
(m_cycle_flag) and whether the critical path forms a cyclic feedback
(crit_cycle_flag). If the critical path forms a feedback loop, no further
improvement can be made by retiming.
Initialization Function
Algorithm 5 Init function to establish G(V, E) and initialize variables
Input: P, R, I, V maps
Output: G(V, E), r, t, N_i, T, T_opt
1: G(V, E), N_i ← P, R, V maps; K ← I; T_opt ← ∞
2: t(v) ← 0, r(v) ← 0 ∀v ∈ V; T ← 0; Q ← V
3: while Q ≠ ∅ do
4:   u ← dequeue(Q)
5:   for each e(u, v) ∈ E with w(u, v) = 0 do
6:     if t(v) < t(u) + d(u, v) then
7:       t(v) ← t(u) + d(u, v)
8:       if T < t(v) then
9:         T ← t(v)
10:       end if
11:       Q ← Q ∪ {v} if v ∉ Q
12:     end if
13:   end for
14: end while
15: for each e(u, v) ∈ E with w(u, v) > 0 do
16:   if T < (t(u) + d(u, v) − t(v)) / w(u, v) then
17:     T ← (t(u) + d(u, v) − t(v)) / w(u, v)
18:   end if
19: end for
20: return G(V, E), r, t, N_i, T, T_opt
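The timing part of Alg. 5 (lines 2 to 19) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the graph is passed in directly rather than parsed from P&R results and variation maps, the sub-graph of register-free edges is assumed to be acyclic (otherwise the loop would not terminate), and the name `init_timing` and the example delays are hypothetical.

```python
from collections import deque


def init_timing(vertices, edges):
    """Initial r, t and T following lines 2-19 of Alg. 5.

    edges maps (u, v) -> (d, w): the edge delay and the number of registers.
    """
    t = {v: 0.0 for v in vertices}
    r = {v: 0 for v in vertices}
    T = 0.0
    fanout = {v: [] for v in vertices}
    for (u, v), (d, w) in edges.items():
        fanout[u].append((v, d, w))

    queue = deque(vertices)
    queued = set(vertices)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        for v, d, w in fanout[u]:
            # Only register-free edges propagate arrival times (line 5).
            if w == 0 and t[v] < t[u] + d:
                t[v] = t[u] + d
                T = max(T, t[v])
                if v not in queued:
                    queue.append(v)
                    queued.add(v)

    # Registered edges may force a larger clock period (lines 15-19).
    for (u, v), (d, w) in edges.items():
        if w > 0:
            T = max(T, (t[u] + d - t[v]) / w)
    return r, t, T


# Hypothetical path 1 -> 2 -> 3 -> 4 with one register on the last edge.
edges = {(1, 2): (3, 0), (2, 3): (4, 0), (3, 4): (10, 1)}
r, t, T = init_timing([1, 2, 3, 4], edges)
print(t, T)  # {1: 0.0, 2: 3.0, 3: 7.0, 4: 0.0} 17.0
```

In this example the register-free part of the path gives T = 7, but the registered edge (3, 4) carries a further delay of 10 in front of its single register, so the initial clock period is raised to 17.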
The pseudo code of the Init function is given in Alg. 5. Unlike the original
retiming algorithm for ASICs, the purpose of Init is not only to initialise all
variables but also to parse the P&R configuration P and R. Lines 1 and 2 initialise
the variables used by retiming as defined in Sec. 5.4.2. In the timing graph
G(V, E), all flip-flops (FFs) on an edge e(u, v) ∈ E are lined up immediately
before v. For each edge e(u, v) ∈ E with w(u, v) = 0, the arrival time of node v is
calculated in lines 3 to 14. However, Eq. 5.2 may be violated when w(u, v) is
greater than 1. To fix this, we propose to distribute the FFs locally on such edges
and increase T accordingly. If t(v) < t(u) + d(u, v) − w(u, v)T, we increase T to
(t(u) + d(u, v) − t(v))/w(u, v). With this method, T is increased without changing
t, so all constraints are satisfied. Lines 15 to 19 calculate T for the edges
e(u, v) ∈ E with w(u, v) > 0 [91].
The r ADJUST and t ADJUST Functions
As introduced earlier in this section, two functions are used in the main retiming
algorithm. The purpose of t_ADJUST is to set a target T by decreasing the current
clock period, as fully explained in [90], while the purpose of r_ADJUST is to
achieve the target clock period by moving FFs backward. In addition, r_ADJUST is
modified by adding further constraints imposed by the FPGA architecture and by
considering process variation, as in Alg. 6.
The role of r_ADJUST is to shift registers backward along the critical path to
achieve the target T while ensuring that the constraints in Eq. 5.3 are satisfied.
Two conditions are used to terminate the r_ADJUST process. m_cycle_flag is the flag
that indicates whether the retiming process has reached an optimal solution [91].
The nodes whose arrival time equals the current target T are stored in Q, which
represents the critical edges. If Q is empty, the target T is larger than the
arrival time of every node under the current retiming solution, so t_ADJUST must be
called again to set a smaller target T. As in line 6 to 7, FFs are moved backward
when they are on the critical edge. However, this move may violate constraints
5.3b, 5.3c and 5.3e. A First-In-First-Out (FIFO) queue rQ stores the nodes whose r
value has been changed by retiming. To restore the constraints, the code between
lines 6 and 15 moves the FFs further backward. N_i is the variable that records the
current usage of FFs in LAB i. The value of N_i is updated in line 12 as follows.
$$N_i = \begin{cases} N_i - 1, & \text{if } u \text{ is the input of LAB } i \\ N_i + 1, & \text{if } u \text{ is the output of LAB } i \end{cases} \tag{5.4}$$
The process of restoring constraints 5.3b, 5.3c and 5.3e does not stop until all
nodes in the timing graph meet these constraints. Another FIFO queue, tQ, is
employed to facilitate this process. The nodes affected by retiming are put into tQ,
and the arrival time of each node is then updated in lines 16 to 38. For edges
e(y, z) ∈ E with w_r(y, z) = 0, the arrival time t(z) is max(t(z), t(y) + d(y, z)).
For the others, with w_r(y, z) > 0, if t(y) < T then t(z) is updated to
max(t(z), t(y) + d(y, z) − w_r(y, z)T), which ensures that the FFs are locally
evenly distributed with delay T in between. Otherwise, the first FF on e(y, z) is
positioned right after y and t(z) is updated to
max(t(z), d(y, z) − (w_r(y, z) − 1)T).
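The three arrival-time cases described above can be written as a single per-edge update. The sketch below is a hedged illustration of that rule only; the function name `relaxed_arrival` and the numeric values are hypothetical and are not taken from the thesis implementation.

```python
def relaxed_arrival(t_y, t_z, d, w_r, T):
    """New arrival time t(z) after examining one edge e(y, z)."""
    if w_r == 0:
        # No register on the edge: delay simply accumulates.
        return max(t_z, t_y + d)
    if t_y < T:
        # Registers spaced T apart along the edge absorb w_r clock periods.
        return max(t_z, t_y + d - w_r * T)
    # y is already critical: the first register sits right after y.
    return max(t_z, d - (w_r - 1) * T)


print(relaxed_arrival(3, 0, 4, 0, 10))    # 7
print(relaxed_arrival(3, 0, 12, 1, 10))   # 5
print(relaxed_arrival(10, 0, 12, 2, 10))  # 2
```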
5.5 Experiments and Results
5.5.1 Design of Experiments
The experiment flow of our variation-aware retiming is shown in Fig. 5.4. To take
account of process variation, net delays can be updated with variation maps by
applying variation-aware P&R with VPR [14]. P&R configurations and
Algorithm 6 r_ADJUST Algorithm
Input: G, r, t, T, Q, N_i
Output: r, t, T, N_i
1: while (Q ≠ ∅) ∧ ¬(m_cycle_flag) do
2:   v ← dequeue(Q)
3:   if t(v) ≥ T then
4:     r(v) ← r(v) + 1
5:     rQ ← {v}
6:     while rQ ≠ ∅ do
7:       u ← dequeue(rQ)
8:       for each e = (x, u) ∈ E or e = (u, x) ∈ E do
9:         if ((fix(x, u) = 1) ∧ (r(x) ≠ r(u))) ∨ (w_r(e) < 0) ∨ (N_i > K) then
10:           r(x) ← r(x) + 1
11:           rQ ← rQ ∪ {x} if x ∉ rQ
12:           N_i ← UpdateFFInLAB(u)
13:         end if
14:       end for
15:     end while
16:     t(v) ← 0 ∀v ∈ V; tQ ← V
17:     while tQ ≠ ∅ do
18:       y ← dequeue(tQ)
19:       for each e(y, z) ∈ E do
20:         if w_r(y, z) = 0 then
21:           if t(z) < t(y) + d(y, z) then
22:             t(z) ← t(y) + d(y, z)
23:             tQ ← tQ ∪ {z} if z ∉ tQ
24:           end if
25:         else if t(y) < T then
26:           if t(z) < t(y) + d(y, z) − w_r(y, z)T then
27:             t(z) ← t(y) + d(y, z) − w_r(y, z)T
28:             tQ ← tQ ∪ {z} if z ∉ tQ
29:           end if
30:         else if t(z) < d(y, z) − (w_r(y, z) − 1)T then
31:           t(z) ← d(y, z) − (w_r(y, z) − 1)T
32:           tQ ← tQ ∪ {z} if z ∉ tQ
33:         end if
34:       end for
35:       if t(z) ≥ T then
36:         Q ← Q ∪ {u} if z ∉ Q
37:       end if
38:     end while
39:   end if
40:   m_cycle_flag = UpdateMcycle(G, r, t)
41: end while
[Figure 5.4 (flow chart). Box labels: Normal P&R by VPR; P&R timing graph; 100
variation maps; Variation-blind; Full chipwise; P&R timing graph with process
variation; Sequential circuits; Combinatorial circuits; Add N FFs at the end of
each sink; Variation-blind retiming; Variation-aware retiming; 1 retiming
configuration; 100 retiming configurations; Variation-aware timing analysis;
Comparison in terms of delay of critical paths.]
Figure 5.4: The work flow of retiming.
variation maps introduced in Sec. 5.3 are loaded in the retiming process to form
G(V, E). The delay of each LUT is updated during the Init function according to its
co-ordinates (x, y, z), where (x, y) represents the location of the LAB as shown in
Fig. 3.6(a), and z is the index of the LUT inside the LAB. Each LUT therefore has
its own unique delay value in G(V, E).
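The per-LUT delay lookup can be pictured as indexing a variation map with the (x, y, z) co-ordinate of each placed LUT. The sketch below assumes, purely for illustration, that a variation map is stored as a dictionary from co-ordinates to delays in picoseconds; the storage format, the LUT names and the function name `annotate_lut_delays` are hypothetical.

```python
def annotate_lut_delays(placement, variation_map):
    """Give every placed LUT the measured delay at its (x, y, z) location.

    placement maps LUT name -> (x, y, z); variation_map maps (x, y, z) -> delay.
    """
    return {lut: variation_map[xyz] for lut, xyz in placement.items()}


# Hypothetical measured delays (ps) for three LUT positions.
variation_map = {(1, 1, 0): 412.0, (1, 1, 1): 398.5, (2, 1, 0): 430.2}
# Hypothetical placement result: two LUTs share LAB (1, 1), one sits in LAB (2, 1).
placement = {"L1": (1, 1, 0), "L2": (1, 1, 1), "L3": (2, 1, 0)}

lut_delay = annotate_lut_delays(placement, variation_map)
print(lut_delay)  # {'L1': 412.0, 'L2': 398.5, 'L3': 430.2}
```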
We tested 20 MCNC benchmarks in our experiment by applying post-P&R
variation-aware retiming. They are divided into two groups: combinatorial and
sequential logic circuits. For the sequential circuits, retiming is applied directly
without changing the functionality of the circuit. However, to explore the
improvement achievable on more benchmark circuits, the combinatorial circuits are
modified by adding two FFs to the sink of each path. As introduced in Section 5.2,
the number of FFs added may affect the timing improvement provided by full chipwise
variation-aware retiming. In our experiment, two FFs are added to the sinks of
paths to test the potential improvement our proposed optimisation can achieve.
Future work will focus on how to find the number of FFs that achieves the greatest
improvement from full chipwise retiming. As a control group, variation-blind
retiming is defined as retiming applied without any knowledge of variation. The
variation information is, however, loaded for the analysis of results at the end of
retiming to evaluate the delay of the critical path. The process variation
(σ_delay/µ_delay) is amplified to 30% to represent the process variation expected in
the future. The algorithm is written in unoptimised Matlab code to explore the
potential improvement we can achieve without considering execution time.
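One plausible way to amplify a measured map to a chosen σ_delay/µ_delay is to keep the mean fixed and scale every deviation from the mean by a common factor. The sketch below shows that idea only; it is an assumption made for illustration (including the use of the population standard deviation and the name `amplify_variation`), not necessarily the amplification procedure used in this thesis.

```python
from statistics import mean, pstdev


def amplify_variation(delays, target_ratio=0.30):
    """Scale deviations from the mean so that sigma/mu equals target_ratio."""
    mu = mean(delays)
    sigma = pstdev(delays)
    if sigma == 0:
        raise ValueError("a uniform map has no variation to amplify")
    scale = target_ratio * mu / sigma
    return [mu + (d - mu) * scale for d in delays]


# Hypothetical measured delays (ps) with roughly 3.5% variation.
measured = [95.0, 100.0, 105.0, 100.0]
amplified = amplify_variation(measured)
print(pstdev(amplified) / mean(amplified))
```

The mean delay is unchanged by this scaling. With very large amplification factors the slowest or fastest elements could become unrealistic (even negative), so a practical implementation would need to guard against that.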
5.5.2 Results
As can be seen in Fig. 5.5(a), the average improvement in clock period achieved for
the 10 sequential benchmarks is about 7%. The improvement for some benchmarks, such
as bigkey, is much smaller than for others. The reason is that the critical path
forms a cyclic feedback after several retiming iterations, so less improvement is
possible on this type of circuit through retiming. Note that the amount of
improvement depends on the number of positions to which FFs can be shifted during
retiming. Therefore, as described in Section 5.2, benchmark circuits with more
retiming options are expected to show a greater improvement with our method.
For modified combinatorial circuits, we add two FFs to every sink of 10 combinato-
rial MCNC benchmarks. The results are shown in Fig. 5.5(b). The average timing
improvement achieved is about 18% for the 10 benchmarks. The improvement for
the modified combinatorial circuits is more than that of the tested sequential cir-
cuits because the FFs are located at the end of each path. By only moving FFs
backwards, there are more retiming choices than the sequential cases, and hence a
better improvement of timing performance is achieved.
[Figure 5.5(a) (bar chart): clock period (ps) under variation-blind retiming and
variation-aware retiming for the 10 sequential MCNC benchmarks s298, tseng, diffeq,
bigkey, clma, dsip, elliptic, frisc, s38417 and s38584.1.]
(a) Results of sequential circuits.
[Figure 5.5(b) (bar chart): clock period (ps) under variation-blind retiming and
variation-aware retiming for the 10 modified combinatorial MCNC benchmarks ex5p,
alu4, apex4, des, ex1010, seq, pdc, misex3, spla and apex2.]
(b) Results of modified combinatorial circuits.
Figure 5.5: Variation-blind and aware retiming on (a) sequential and (b) modified
combinatorial MCNC circuits.
Although the unoptimised Matlab implementation cannot reflect real-world
performance in terms of execution time, the overhead is expected to come mainly
from memory accesses to the delay values in the variation maps. Given appropriate
optimisations, the impact is expected to be insignificant with the memory bandwidth
and computing power of current and future computer systems.
5.6 Conclusion
This chapter employed detailed delay variation measurements of 100 Cyclone III
FPGA chips to demonstrate the potential timing improvement of designs using
variation-aware retiming. With σ_delay/µ_delay = 30% variation, improvements of
about 7% on average for the 10 sequential MCNC benchmarks and 18% for the 10
modified combinatorial benchmarks (with 2 FFs added at the end of each path) were
observed. While variability will inevitably impact design timing yield on FPGAs in
the future, our method provides one promising solution that can be easily employed
to combat the problem.
It is worth noting that the variation-aware retiming proposed in this chapter does
not trade timing off against other parameters. If applied after variation-aware
P&R, it provides a faster design with no negative consequences such as increased
resource usage, higher power or a change of functionality. Along with SSTA, which
can also be used to improve timing, our method is in a sense “exploiting” the
stochastic variation on a particular device to squeeze the last drop of performance
out of the device. The only price to pay is the additional computational load of
the algorithm, which is expected to be insignificant compared with the load
required by P&R. The drawback of this work is the need to verify the retiming
results. For full chipwise retiming, a specific verification is required for each
retimed configuration, which increases complexity and causes a large overhead.
However, this drawback can be overcome by applying the classification method to the
variation maps and then applying variation-aware retiming based on the median map
of each class.
Chapter 6
Conclusion
6.1 Optimisation for Process Variation
As CMOS technology continues to scale towards the discrete atomic scales of materi-
als used in transistors, process variation has become an increasingly important issue
in VLSI device engineering because the uniformity of electrical characteristics on one
chip gets progressively more difficult to maintain as we approach the absolute phys-
ical scale limit. Speed-binning combined with guard-banding used by the industry
is so pessimistic that it degrades the timing performance of designs on FPGA chips.
The unique advantage of FPGAs is that their hardware can be programmed and
reconfigured post-fabrication, making them suitable for variation-aware
methodologies that could alleviate the impact of process variation. In this thesis,
three variation-aware optimisation methodologies based on measured variation maps
are proposed to provide a complete framework to combat process variation.
Instead of using existing statistical models of process variation, the
variation-aware optimisation methods proposed in this thesis use variation maps
measured from commercial FPGAs. Optimisation methods are examined in three areas:
placement, routing and retiming. The measured variation maps are processed and
amplified to simulate expected variation patterns in future devices, so that the
potential of the methods is thoroughly explored. Full chipwise placement is
introduced in Chapter 3. The results show that full chipwise optimisation can
provide promising improvement under process variation. However, these “per-chip”
methods are not practical because of their huge overhead in execution time.
Therefore, a two-stage method that classifies variation maps into different classes
and performs variation-aware placement on the median map of each class is proposed
to save execution time. The timing improvement made by this two-stage method with
16 classes over 129 FPGAs is about 7%, while execution is 8 times faster than the
full chipwise approach. The advantage of variation-aware placement is that it deals
effectively with spatially correlated process variation, where the goal is to place
the entire design, or the LABs used by the critical path, in the fast region of the
FPGA to achieve timing improvement. Its only major limitation is that it has little
effect on purely stochastic process variation.
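To make the idea of a per-class median map concrete, the sketch below takes the element-wise median over the variation maps assigned to one class. Treating the median map as an element-wise median is an assumption made here for illustration, as are the flat-list map format, the delay values and the function name `median_map`.

```python
from statistics import median


def median_map(class_maps):
    """Element-wise median of equally sized variation maps (flat lists of delays)."""
    return [median(values) for values in zip(*class_maps)]


# Three hypothetical chips in the same class, three map elements each (delays in ps).
class_maps = [
    [100, 110, 95],
    [102, 108, 97],
    [98, 115, 96],
]
print(median_map(class_maps))  # [100, 110, 96]
```

Placement is then run once on this representative map instead of once per chip, which is where the saving in execution time comes from.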
Similar to placement, full chipwise routing is introduced in Chapter 4 to optimise
towards the upper bound of improvement achievable by optimal variation-aware
routing. Since only critical and near-critical paths may affect the timing
performance, a partial re-routing method is proposed that re-routes only a small
proportion of paths with high criticality to save execution time. To provide
routing space for re-routing, some routing resources are reserved in the initial
routing and used in the re-routing process. In addition, a small number of
non-critical paths, defined by the ratio Non_Crit_T, are released to free the
necessary resources. However, both variation-aware routing and partial re-routing
may fail to gain any improvement because of the inherent noise in the router, even
after a noise reduction method is used. The improvement made by partial re-routing
is about 6% with Crit_T = 0.9 (threshold in criticality) and Non_Crit_T = 15%
(threshold of non-critical paths released for re-routing), and an 8 times speedup
for 129 variation maps is observed.
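The role of the two thresholds can be sketched as a simple path selection step. In the Python below, paths whose criticality reaches Crit_T are chosen for re-routing and a fixed percentage of the remaining paths is released; choosing the least critical paths for release, the integer-percentage parameter and the name `select_paths` are assumptions made for illustration rather than details taken from Chapter 4.

```python
def select_paths(criticality, crit_t=0.9, non_crit_percent=15):
    """Split paths into those to re-route and those released to free resources.

    criticality maps path name -> criticality in [0, 1].
    """
    critical = [p for p, c in criticality.items() if c >= crit_t]
    critical_set = set(critical)
    # Assumed policy: release the least critical of the remaining paths first.
    others = sorted((p for p in criticality if p not in critical_set),
                    key=criticality.get)
    released = others[: len(others) * non_crit_percent // 100]
    return critical, released


# Hypothetical design: two critical paths and twenty non-critical ones.
criticality = {"c0": 0.95, "c1": 0.90}
criticality.update({f"n{i}": i / 100 for i in range(20)})

critical, released = select_paths(criticality)
print(critical)  # ['c0', 'c1']
print(released)  # ['n0', 'n1', 'n2']
```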
Compared with variation-aware placement, variation-aware routing is good at dealing
with stochastic process variation. However, the improvement achieved is relatively
small. There are three main causes of the small improvements observed. Firstly,
region-based variation maps are used, in which the routing resources at the same
location share the same variation characteristics. Therefore, swapping routing
resources at the same location (individual switches in a switch box or connection
box) does not improve the timing performance. Secondly, when non-critical paths
are released to provide extra space for variation-aware re-routing, these extra
routing resources may be unusable in some cases because they are too far from the
critical paths. Lastly, the noise of the router may mask the improvement even when
the noise reduction method is applied.
Apart from placement and routing strategies, variation-aware retiming is also of
interest and is examined in Chapter 5. It takes advantage of the largely unused
register resources on most modern FPGA architectures. However, in the case of
combinatorial circuits, extra registers must be inserted to enable retiming,
resulting in higher output latency. In addition, retiming options are restricted in
circuits with sequential feedback, where additional latency could break their
functionality. Nonetheless, for sequential MCNC benchmark circuits with
variation-aware retiming, the improvement achieved is 7% on average in terms of the
delay of the critical path. Also, with modified combinatorial MCNC benchmarks
(increased latency with two FFs inserted at the sink of each path), the improvement
is about 18%.
All experiments in this thesis used region-based variation maps. More fine-grained
variation maps may well yield better results. However, because of the extra
overhead in execution time, more fine-grained variation maps are not practical with
these optimisation methods. For example, the measured region-based variation maps
include 7,232 (26 × 32) elements for Cyclone III FPGAs. If each switch and LUT has
a unique delay value because of process variation, then by simply assuming 10
switches in one switch box (25 × 31 switch boxes in one FPGA), the number of
elements in one variation map would be 1,340,862 (26 × 32 (LUTs) + 25 × 31 × 10
switches), which is about 185 times that of the region-based variation maps and
would be impractical for the proposed optimisation methods.
[Figure 6.1 (bar chart): delay of critical paths for 95% timing yield (ns) for the
MCNC benchmarks alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic,
ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584.1, seq, spla and tseng under
three flows: variation-blind P&R and retiming; per-cluster placement, partial
rerouting and per-device retiming; full chipwise P&R and per-device retiming.]
Figure 6.1: The timing improvement made by combined optimisation methods.
The combined results using all three variation-aware optimisation methods on 129
FPGAs are illustrated in Fig. 6.1. Full chipwise variation-aware P&R and retiming
can achieve a 19% improvement in critical path delay on average, while applying
two-stage placement (16 classes), partial re-routing (Crit_T = 0.9) and
variation-aware retiming yielded a 13% improvement. Overall, the combination of
variation-aware methods achieved about a 20 times speedup in execution time
compared with exhaustive full chipwise optimisation.
6.2 Future work and Opportunities
The work in this thesis has demonstrated that, owing to the features of FPGAs,
post-fabrication optimisation methods are capable of reducing the impact of process
variation using measured variation maps. The results of the optimisation methods
are so far based on VPR simulation only. The next step would be to demonstrate the
variation-aware optimisation methodologies on commercial FPGAs and CAD tools.
Integrating the techniques presented here with industrial tools such as Altera’s
Quartus and Xilinx ISE would be highly useful, but would require extremely close
collaboration with the vendors.
For the optimisation methods proposed in this thesis, the variation maps are
measured and amplified to σ/µ = 30% to predict the upper bound of process variation
at future technology nodes. However, with the development of fabrication techniques
including high-k [99] and 3D FinFETs [100], the variation we predict may be
postponed by a few extra years. Nonetheless, it is worthwhile to build a solid
foundation of tools and methods to overcome and exploit delay variability before
its effects worsen in the near future. Moreover, additional delay variability due
to non-uniform aging (degradation) of components may also be addressed by the
proposed methods, so that the life of an FPGA could be prolonged in terms of
operating speed.
The work in this thesis only concentrated on improvement to timing performance
of FPGA designs. However, power consumption is another important aspect that
is also closely related to process variation. As shown by Mehta et al. [64], power
consumption and delay are the two main aspects influenced by process variation.
Therefore, our variation-aware methods can also be extended to minimise power,
or selectively applied to different paths in a design to achieve optimisation in both
aspects.
As opposed to the region-based variation maps used in this thesis, DeHon proposed
using fine-grained characterisation of basic FPGA components [21] to optimise
timing. While such an approach may yield higher timing gain at the expense of
longer measurement time and greater optimisation complexity, its major drawback is
that it does not scale with the rapid growth in the number and complexity of
components in FPGAs. Sedcole et al. have published the idea of deriving path delays
from measurements of longer paths with clearer statistical analysis [101].
Therefore, an investigation into the most suitable level of measurement granularity
for dealing with process variation, considering both timing performance and
execution time, will form an important part of our future work.
Based on the reconfigurability of FPGAs, the variation-aware optimisation methods
proposed in this thesis can be extended to address aging and other degradation
effects. By applying online monitoring circuits, the degradation in timing
performance can be measured during the life of an FPGA. With up-to-date variation
maps, the methods proposed in this thesis can be applied to improve the timing
performance of the chip.
More realistic benchmarks will be adopted in place of the MCNC benchmarks to test
our proposed optimisation strategies. In addition, future experiments will be based
on accurate, advanced architectural models. So far, the process variation model we
used does not include heterogeneous blocks such as block memory or DSP blocks. As
future work, we will apply our variation-aware methods in the latest version of
Verilog-to-Routing (VTR), which takes hierarchical FPGA architectures with
heterogeneous blocks into account and would allow our methods to be used on
commercial FPGAs.
The variation-aware retiming we proposed is applied on a chip-by-chip basis,
assuming a unique variation map for each FPGA. However, it is impractical to treat
each FPGA individually in large quantities. In addition, this per-chip retiming
approach, when used in combination with variation-aware placement and/or partial
re-routing, is difficult to verify because every device has its own configuration
and hence a different retiming solution. Our future work on retiming is to enhance
the retiming methodology to reduce execution time and solve the verification
problem, e.g. by applying the two-stage classification method to the variation
maps. Moreover, the retiming algorithm written in Matlab can be migrated to a more
efficient language, such as C, to significantly reduce runtime and hence enable
integration of the method into future design flows in FPGA CAD tools.
Ultimately, the tools developed in this thesis may redefine the ways in which FPGA
configurations are generated, optimised and used by future FPGA users. As
variations in FPGAs increase to the point that they can no longer be ignored during
the design flow, manufacturers will have to provide, as they did when speed-grading
was introduced, a similar measure that classifies their FPGA chips into classes of
variation patterns using the method described in Chapters 2 and 3. The main
difference compared with speed-grading, however, is that each chip is assigned a
class-ID that points to a specific variation map stored in the “cloud” (remote
servers) on the vendor’s side.
The cloud could provide the variation maps on demand for optimising designs on
FPGAs with specific class-IDs. For large-scale deployment of a finalised design
on multiple FPGAs, there is even the possibility of off-loading the second stage
of the proposed two-stage optimisation to the cloud, which would produce and store
an optimised configuration for each class. On the client side, users simply need to
connect their FPGAs to the cloud, and the FPGAs will transparently download the
optimal configurations according to their class-IDs.
All in all, it is about how tools can evolve by taking advantage of new concepts and
technologies to overcome the physical limit of this world, and continue the virtuous
circle of technology development.
Bibliography
[1] K. Takeuchi, A. Nishida, and T. Hiramoto, “Random fluctuations in scaled
MOS devices,” in Simulation of Semiconductor Processes and Devices, 2009.
SISPAD ’09. International Conference on, 2009, pp. 1–7.
[2] Y. Ye, S. Gummalla, C.-C. Wang, C. Chakrabarti, and Y. Cao, “Random
variability modeling and its impact on scaled CMOS circuits,” J. Comput.
Electron., vol. 9, no. 3-4, pp. 108–113, Dec. 2010.
[3] S. Borkar, “Designing reliable systems from unreliable components: the chal-
lenges of transistor variability and degradation,” Micro, IEEE, vol. 25, no. 6,
pp. 10–16, 2005.
[4] J. S. J. Wong, “Delay measurement and self characterisation on FPGAs,”
Ph.D. dissertation, Imperial College London, 2011.
[5] P. Sedcole and P. Y. K. Cheung, “Parametric yield modeling and simulations
of FPGA circuits considering within-die delay variations,” ACM Trans.
Reconfigurable Technol. Syst., vol. 1, no. 2, pp. 1936–7406, Jun. 2008.
[6] J. A. Nestor, “CADAPPLETS animations of VLSI CAD algorithms,”
http://workbench.lafayette.edu/~nestorj/cadapplets/, 18-Apr.-2013.
[7] K. J. Kuhn, M. D. Giles, B. Becher, P. Kolar, A. Kornfeld, R. Kotlyar, S. T.
Ma, A. Maheshwari, and S. Mudanai, “Process technology variation,” Electron
Devices, IEEE Transactions on, vol. 58, no. 8, pp. 2197–2208, June 2011.
[8] ITRS, “International technology roadmap for semiconductors,”
http://www.itrs.net/Links/2012ITRS/Home2012.htm, 19-Mar.-2013.
[9] K. J. Kuhn, “Moore’s law past 32nm: Future challenges in device scaling,” in
IWCE ’09. 13th International Workshop on, Feb. 2009, pp. 1–6.
[10] P. Sedcole and P. Y. K. Cheung, “Parametric yield in FPGAs due to within-
die delay variations: A quantitative analysis,” in Proc. 15th ACM/SIGDA Int.
Conf. Symposium on FPGAs, Feb. 2007, pp. 178–187.
[11] A. Kumar and M. Anis, “FPGA design for timing yield under process varia-
tions,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
vol. 18, no. 3, pp. 423–435, June 2010.
[12] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “A transition probability
based delay measurement method for arbitrary circuits on FPGAs,” in Proc.
Int. Conf. Field-Programmable Technology (FPT), Dec. 2008, pp. 105 –112.
[13] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, “Constrained k-means
clustering with background knowledge,” in Proceedings of the Eighteenth
International Conference on Machine Learning, 2001, pp. 577–584.
[14] Z. Guan, J. Wong, S. Chaudhuri, G. Constantinides, and P. Cheung, “A two-
stage variation-aware placement method for FPGAs exploiting variation maps
classification,” in Proc. 22nd Int. Conf. FPL, Aug. 2012, pp. 519–522.
[15] E. Stott, Z. Guan, J. Levine, J. Wong, and P. Cheung, “Variation and relia-
bility in FPGAs,” Design Test, IEEE, vol. PP, no. 99, pp. 1–1, 2013.
[16] G. E. Moore, “Cramming more components onto integrated circuits,” Pro-
ceedings of the IEEE, vol. 86, no. 1, pp. 82–85, Mar. 1998.
[17] W. Shockley, “Problems related to p-n junctions in silicon,” Solid-State Elec-
tronics, vol. 2, no. 1, pp. 35–67, 1961.
[18] K. Kuhn, C. Kenyon, A. Kornfeld, M. Liu, A. Maheshwari, W. kai Shih,
S. Sivakumar, G. Taylor, P. VanDerVoorn, and K. Zawadzki, “Managing
process variation in Intel’s 45nm CMOS technology,” Intel Technology Journal,
vol. 12, no. 2, June 2008.
[19] T. Tuan, A. Lesea, C. Kingsley, and S. Trimberger, “Analysis of within-die
process variation in 65nm FPGAs,” in Quality Electronic Design (ISQED),
2011 12th International Symposium on, 2011, pp. 1–5.
[20] J. Jung and T. Kim, “Timing variation-aware high level synthesis: Current
results and research challenges,” in Circuits and Systems, 2008. APCCAS
2008. IEEE Asia Pacific Conference on, 2008, pp. 1004–1007.
[21] B. Gojman, S. Nalmela, N. Mehta, N. Howarth, and A. DeHon, “GROK-
LAB: generating real on-chip knowledge for intra-cluster delays using timing
extraction,” in Proc. ACM/SIGDA Int. Conf. Symposium on FPGAs, 2013,
pp. 81–90.
[22] S. R. Nassif, “Delay variability: sources, impacts and trends,” in Solid-State
Circuits Conference, 2000. Digest of Technical Papers. ISSCC. 2000 IEEE
International, 2000, pp. 368–369.
[23] P. Sedcole and P. Y. K. Cheung, “Within-die delay variability in 90nm FPGAs
and beyond,” in Proc. IEEE Int. Conf. FPGA, Dec. 2006, pp. 9–104.
[24] S. R. Nassif, “Design for variability in DSM technologies [deep submicron
technologies],” in Quality Electronic Design, 2000. ISQED 2000. Proceedings.
IEEE 2000 First International Symposium on, 2000, pp. 451–454.
[25] P. S. Zuchowski, P. A. Habitz, J. D. Hayes, and J. H. Oppold, “Process and
environmental variation impacts on ASIC timing,” in Computer Aided Design,
2004. ICCAD-2004. IEEE/ACM International Conference on, 2004, pp. 336–
342.
BIBLIOGRAPHY 127
[26] J. M. Levine, E. Stott, G. A. Constantinides, and P. Y. Cheung, “Online
measurement of timing in circuits: For health monitoring and dynamic voltage
& frequency scaling,” in Field-Programmable Custom Computing Machines
(FCCM), 2012 IEEE 20th Annual International Symposium on, Feb. 2012,
pp. 109–116.
[27] E. Stott, P. Sedcole, and P. Cheung, “Fault tolerance and reliability in field-
programmable gate arrays,” Computers Digital Techniques, IET, vol. 4, no. 3,
pp. 196–210, 2010.
[28] A. Srivastava, D. Sylvester, and D. Blaauw, Eds., Statistical Analysis and
Optimization for VLSI: Timing and Power. Springer, 2005.
[29] S. R. Nassif, “Modeling and analysis of manufacturing variations,” in Custom
Integrated Circuits, 2001, IEEE Conference on, 2001, pp. 223–228.
[30] S. G. Duval, “Statistical circuit modeling and optimization,” in Statistical
Metrology, 2000 5th International Workshop on, 2000, pp. 56–63.
[31] K. A. Bowman, A. R. Alameldeen, S. T. Srinivasan, and C. B. Wilkerson, “Im-
pact of die-to-die and within-die parameter variations on the clock frequency
and throughput of multi-core processors,” Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, vol. 17, no. 12, pp. 1679–1690, 2009.
[32] J. Xiong, V. Zolotov, and L. He, “Robust extraction of spatial correlation,”
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transac-
tions on, vol. 26, no. 4, pp. 619–631, 2007.
[33] A. Asenov, S. Kaya, and J. H. Davies, “Intrinsic threshold voltage fluctuations
in decanano MOSFETs due to local oxide thickness variations,” Electron De-
vices, IEEE Transactions on, vol. 49, no. 1, pp. 112–119, 2002.
128 BIBLIOGRAPHY
[34] A. Asenov, S. Kaya, and A. R. Brown, “Intrinsic parameter fluctuations in
decananometer MOSFETs introduced by gate line edge roughness,” Electron
Devices, IEEE Transactions on, vol. 50, no. 5, pp. 1254–1260, 2003.
[35] M. Pelgrom, A. C. J. Duinmaijer, and A. Welbers, “Matching properties of
MOS transistors,” Solid-State Circuits, IEEE Journal of, vol. 24, no. 5, pp.
1433–1439, 1989.
[36] GSS, “Simulation analysis of the Intel 22nm FinFET,” http:
//www.goldstandardsimulations.com/index.php/news/blog search/
simulation-analysis-of-the-intel-22nm-finfet/, 19-Mar.-2013.
[37] H. Iwai, “Roadmap for 22nm and beyond,” Microelectron. Eng., vol. 86, no.
7-9, pp. 1520–1528, July 2009.
[38] C. Visweswariah, K. Ravindran, K. Kalafala, S. G. Walker, S. Narayan, D. K.
Beece, J. Piaget, N. Venkateswaran, and J. G. Hemmet, “First-order incre-
mental block-based statistical timing analysis,” Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on, vol. 25, no. 10, pp.
2170–2180, 2006.
[39] F. J. Mesa-Martinez, M. Brown, J. Nayfach-Battilana, and J. Renau, “Measur-
ing power and temperature from real processors,” in Parallel and Distributed
Processing, 2008. IPDPS 2008. IEEE International Symposium on, Feb. 2008,
pp. 1–5.
[40] V. Vorisek, T. Koch, and H. Fischer, “At-speed testing of SOC ICs,” in Design,
Automation and Test in Europe Conference and Exhibition, 2004. Proceedings,
vol. 3, 2004, pp. 120–125.
[41] C. Stroud, S. Konala, P. Chen, and M. Abramovic, “Built-in self-test of logic
blocks in FPGAs (finally, a free lunch: BIST without overhead!),” in VLSI
Test Symposium, 1996., Proceedings of 14th, 1996, pp. 387–392.
BIBLIOGRAPHY 129
[42] M. Omana, D. Giaffreda, C. Metra, T. Mak, S. Tam, and A. Rahman, “On-
die ring oscillator based measurement scheme for process parameter variations
and clock jitter,” in Defect and Fault Tolerance in VLSI Systems (DFT), 2010
IEEE 25th International Symposium on, 2010, pp. 265–272.
[43] E. A. Stott, J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Degradation in
FPGAs: measurement and modelling,” in Proceedings of the 18th Int. Conf.
Symposium on FPGAs, 2010, pp. 229–238.
[44] B. Gojman, N. Mehta, R. Rubin, and A. DeHon, “Component-specific map-
ping for low-power operation in the presence of variation and aging,” in Low-
Power Variation-Tolerant Design in Nanometer Silicon. Springer, 2011, pp.
381–432.
[45] Y. Lin, M. Hutton, and L. He, “Statistical placement for FPGAs considering,”
IET Computers & Digital Techniques, vol. 1, no. 4, pp. 267–275, 2007.
[46] A. Agarwal, D. blaauw, and V. zolotov, “Statistical timing analysis for intra-
die process variations with spatial correlations,” in Computer Aided Design,
2003. ICCAD-2003. International Conference on, Nov. 2003, pp. 900 – 907.
[47] “Statistical static timing analysis,” http://en.wikipedia.org/wiki/Statistical
static timing analysis, 07-Apr.-2013.
[48] M. Orshansky and K. Keutzer, “A general probabilistic framework for worst
case timing analysis,” in Design Automation Conference, 2002. Proceedings.
39th, 2002, pp. 556–561.
[49] H. Chang, V. Zolotov, S. Narayan, and C. Visweswariah, “Parameterized
block-based statistical timing analysis with non-gaussian parameters, nonlin-
ear delay functions,” in Design Automation Conference, 2005. Proceedings.
42nd, 2005, pp. 71–76.
130 BIBLIOGRAPHY
[50] G. Lucas, C. Dong, and D. Chen, “Variation-aware placement with multi-cycle
statistical timing analysis for FPGAs,” Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, vol. 29, no. 11, pp. 1818–1822,
2010.
[51] S. K. Mehr, A. R. A. Mehr, S. N. Mozaffari, and A. Afzali-Kusha, “A new
block-based SSTA method considering within-die variation,” in Quality Elec-
tronic Design (ASQED), 2010 2nd Asia Symposium on, Feb. 2010, pp. 260–
263.
[52] Y. Hu and L. He, “Retiming for high performance FPGAs considering flipflop
constraints and process variations.” UCLA Engr, Tech. Rep.
[53] Y. Lin, M. Hutton, and L. He, “Placement and timing for FPGAs considering
variations,” in Field Programmable Logic and Applications, 2006. FPL ’06.
International Conference on, Feb. 2006.
[54] L. Cheng, J. Xiong, L. He, and M. Hutton, “FPGA performance optimization
via chipwise placement considering process variations,” in Field Programmable
Logic and Applications, 2006. FPL ’06. International Conference on, 2006, pp.
1–6.
[55] Y. Lin, L. He, and M. Hutton, “Stochastic physical synthesis considering
prerouting interconnect uncertainty and process variation for FPGAs,” Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 2,
pp. 124–133, June 2008.
[56] S. Srinivasan, P. Mangalagiri, Y. Xie, and N. Vijaykrishnan, “FPGA rout-
ing architecture analysis under variations,” in Computer Design, 2007. ICCD
2007. 25th International Conference on, 2007, pp. 152–157.
[57] S. Sivaswamy and K. Bazargan, “Variation-aware routing for FPGAs,” in Proc.
15th ACM/SIGDA Int. Conf. Symposium on FPGAs, Feb. 2007, pp. 71–79.
BIBLIOGRAPHY 131
[58] Y. Matsumoto, M. Hioki, T. Kawanami, H. Koike, T. Tsutsumi, T. Nakagawa,
and T. Sekigawa, “Suppression of intrinsic delay variation in fpgas using mul-
tiple configurations,” ACM Trans. Reconfigurable Technol. Syst., vol. 1, no. 1,
pp. 3:1–3:31, Mar. 2008.
[59] D. Aghamirzaie, S. A. Razavi, M. S. Zamani, and M. Nabiyouni, “Reduction
of process variation effect on FPGAs using multiple configurations,” in VLSI
System on Chip Conference (VLSI-SoC), 2010 18th IEEE/IFIP, 2010, pp.
85–90.
[60] K. Katsuki, M. Kotani, K. Kobayashi, and H. Onodera, “A yield and speed
enhancement scheme under within-die variations on 90nm LUT array,” in
Custom Integrated Circuits Conference, 2005. Proceedings of the IEEE 2005,
2005, pp. 601–604.
[61] P. Sedcole, B. Blodget, T. Becker, J. Anderson, and P. Lysaght, “Modular dy-
namic reconfiguration in Virtex FPGAs,” Computers and Digital Techniques,
IEE Proceedings, vol. 153, no. 3, pp. 157–164, Mar. 2006.
[62] Y. Sugihara, Y. Kume, K. Kobayashi, and H. Onodera, “Performance opti-
mization by track swapping on critical paths utilizing random variations for
FPGAs,” in Field Programmable Logic and Applications, 2008. FPL 2008.
International Conference on, 2008, pp. 503–506.
[63] A. A. M. Bsoul, N. Manjikian, and L. Shang, “Reliability- and process
variation-aware placement for FPGAs,” in Design, Automation Test in Eu-
rope Conference Exhibition (DATE), 2010, 2010, pp. 1809 –1814.
[64] N. Mehta, R. Rubin, and A. DeHon, “Limit study of energy & delay; de-
lay benefits of component-specific routing,” in Proc. ACM/SIGDA Int. Conf.
Symposium on FPGAs, 2012, pp. 97–106.
132 BIBLIOGRAPHY
[65] U. Farooq, Z. Marrakchi, and H. Mehrez, Eds., Tree-based Heterogeneous
FPGA Architectures. Springer Science+Business Media LLC, 2012.
[66] V. G. Gudise and G. K. Venayagamoorthy, “FPGA placement and routing
using particle swarm optimization,” in VLSI, 2004. Proceedings. IEEE Com-
puter society Annual Symposium on, 2004, pp. 307–308.
[67] Xilinx, http://www.xilinx.com/, 18-Apr.-2013.
[68] Altera, http://www.Altera.com/, 18-Apr.-2013.
[69] V. Betz and J. Rose, “Cluster-based logic blocks for FPGAs: area-efficiency
vs. input sharing and size,” in Custom Integrated Circuits Conference, 1997.,
Proceedings of the IEEE 1997, 1997, pp. 551–554.
[70] G. G. Lemieux and D. M. Lewis, “Analytical framework for switch block
design,” in International Conference on Field-Programmable Logic and Appli-
cations, 2002, pp. 122–131.
[71] B. Vaughn, R. Jonathan, and M. Alexander, Eds., Architecture and CAD for
Deep-Submicron FPGAs. Norwell, MA, USA: Kluwer Academic Publishers,
1999.
[72] S.-J. Lee and D. K. Raahemifar, “FPGA placement optimization methodol-
ogy survey,” in Electrical and Computer Engineering, 2008. CCECE 2008.
Canadian Conference on, 2008, pp. 1981–1986.
[73] Y. Xu and M. A. S. Khalid, “QPF: efficient quadratic placement for FPGAs,”
in FPGAs, 2005. International Conference on, Sept. 2005, pp. 555–558.
[74] C. Kim and H. Shin, “Cramming more components onto integrated circuits,”
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transac-
tions on, vol. 15, no. 5, pp. 560–568, Mar. 1996.
BIBLIOGRAPHY 133
[75] J. Cong, M. Romesis, and M. Xie, “Optimality, scalability and stability study
of partitioning and placement algorithms,” in Proceedings of the 2003 inter-
national symposium on Physical design, 2003, pp. 88–94.
[76] M. Yang, A. Almaini, L. Wang, and P. Wang, “FPGA placement using ge-
netic algorithm with simulated annealing,” in ASIC, 2005. ASICON 2005. 6th
International Conference On, vol. 2, 2005, pp. 806–810.
[77] D. P. Singh and S. D. Brown, “Incremental placement for layout-driven op-
timizations on FPGAs,” in Computer Aided Design, 2002. ICCAD 2002.
IEEE/ACM International Conference on, 2002, pp. 752–759.
[78] A. Ludwin and V. Betz, “Efficient and deterministic parallel placement for
FPGAs,” ACM Trans. Des. Autom. Electron. Syst., vol. 16, no. 3, pp. 22:1–
22:23, June 2011.
[79] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated
annealing,” Science, vol. 220, no. 1, pp. 671–680, 1983.
[80] J. Lam and J.-M. Delosmett, “Performance of a new annealing schedule,” in
Design Automation Conference, 1988. Proceedings., 25th ACM/IEEE, 1988,
pp. 306–311.
[81] S. K. Nag and R. A. Rutenbar, “Performance-driven simultaneous place and
route for island-style FPGAs,” in Computer-Aided Design, 1995. ICCAD-95.
Digest of Technical Papers., 1995 IEEE/ACM International Conference on,
1995, pp. 332–338.
[82] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Nu-
merische Mathematik, vol. 1, no. 1, pp. 269–271, 1959.
[83] C. Y. LEE, “An algorithm for path connections and its applications,” Elec-
tronic Computers, IRE Transactions on, vol. EC-10, no. 3, pp. 346–365, 1961.
134 BIBLIOGRAPHY
[84] F. MO, A. Tabbara, and R. K. Brayton, “A force-directed maze router,”
in Computer Aided Design, 2001. ICCAD 2001. IEEE/ACM International
Conference on, 2001, pp. 404–407.
[85] P. E. Hart, N. J. Nilssond, and B. Raphael, “A formal basis for the heuris-
tic determination of minimum cost paths,” Systems Science and Cybernetics,
IEEE Transactions on, vol. 4, no. 2, pp. 100–107, 1968.
[86] J. M. Emmert and D. Bhatia, “Incremental routing in FPGAs,” in ASIC
Conference 1998. Proceedings. Eleventh Annual IEEE International, 1998, pp.
217–221.
[87] J. M. Emmert and J. A. Cheatham, “On-line incremental routing for intercon-
nect fault tolerance in FPGAs minus the router,” in Defect and Fault Tolerance
in VLSI Systems, 2001. Proceedings. 2001 IEEE International Symposium on,
2001, pp. 149–157.
[88] C. Ebeling, L. McMurchie, S. A. Hauck, and S. Burns, “Placement and routing
tools for the triptych FPGA,” Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol. 3, no. 4, pp. 473–482, 1995.
[89] R. Rubin and A. DeHon, “Timing-driven pathfinder pathology and reme-
diation: quantifying and reducing delay noise in VPR-pathfinder,” in Proc.
ACM/SIGDA Int. Conf. Symposium on FPGAs, 2011, pp. 173–176.
[90] C. E. Leiserson, F. M. Rose, and J. B. Saxe, “Optimizing synchronous circuitry
by retiming,” in Proc. 3rd Caltech Conf.: Advanced Research VLSI, 1983.
[91] C. Lin and H. Zhou, “Optimal wire retiming without binary search,”
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transac-
tions on, vol. 25, no. 9, pp. 1577 –1588, Sept. 2006.
BIBLIOGRAPHY 135
[92] X.-Y. Zhu, T. Basten, M. Geilen, and S. Stuijk, “Efficient retiming of multirate
DSP algorithms,” Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on, vol. 31, no. 6, pp. 831–844, Dec. 2012.
[93] “Fourier Filtering 2D,” http://fourier.eng.hmc.edu/e161/lectures/fourier/
node15.html, 18-Apr.-2013.
[94] C. Bishop, Ed., Pattern Recongnition and Machine Learning. Springer Sci-
ence+Business Media LLC, 2006.
[95] J. Shlens, “A tutorial on principal component analysis.” http://www.snl.salk.
edu/∼shlens/pca.pdf, 2009, 21-Mar.-2013.
[96] “Uni. of Utah, Spatial Filtering,” http://www.eng.utah.edu/∼cs4640/slides/
Lecture5.pdf/, 27-Mar.-2013.
[97] P. Jamieson, W. Luk, S. J. Wilton, and G. A. Constantinides, “An energy
and power consumption analysis of fpga routing architectures,” in Field-
Programmable Technology, 2009. FPT 2009. International Conference on,
2009, pp. 324–327.
[98] F. S. Pourhashemi and M. S. Zamani, “Evaluation of fpga routing architectures
under process variation,” in Proceedings of the 21st edition of the great lakes
symposium on Great lakes symposium on VLSI, 2011, pp. 351–354.
[99] Intel, “Intel 45nm high-k silicon technology page,” http://www.intel.
com/content/www/us/en/architecture-and-technology/microarchitecture/
microarchitecture-overview-general.html/, 18-Apr.-2013.
[100] X. Wu, C. P. C. H, S. Zhang, C. Feng, and C. M., “Stacked 3-D Fin-CMOS
technology,” Electron Device Letters, IEEE, vol. 26, no. 6, pp. 416–418, 2005.
136 BIBLIOGRAPHY
[101] P. Sedcole, J. Wong, and P. Y. K. Cheung, “Characterisation of fpga clock
variability,” in Symposium on VLSI, 2008. ISVLSI ’08. IEEE Computer So-
ciety Annual, Feb. 2008, pp. 322–328.
Glossary
BFS - Breadth-First Search
BIST - Built-In Self-Test
BMU - Basic Measurement Unit
CAD - Computer-Aided Design
CB - Connection Box
CMOS - Complementary Metal-Oxide-Semiconductor
DFS - Depth-First Search
FA - Full-Adder
FF - Flip-Flop
FIFO - First-In-First-Out
FPGA - Field-Programmable Gate Array
GASA - Genetic Algorithm and Simulated Annealing
GD - Gaussian Distribution
I/O - Input/Output
ITRS - International Technology Roadmap for Semiconductors
LAB - Logic Array Block
LC - Logic Cell
LER - Line Edge Roughness
LR - Launch Register
LUT - Look Up Table
OTF - Oxide Thickness Fluctuation
PCA - Principal Component Analysis
PDF - Probability Density Function
PI - Primary Input
PO - Primary Output
P&R - Placement and Routing
RC - Resistance and Capacitance
RDF - Random Dopant Fluctuations
RO - Ring Oscillator
RV - Random Variable
SB - Switch Box
SR - Sample Register
SSTA - Statistical Static Timing Analysis
STA - Static Timing Analysis
TAC - Transition Activity Counter
TCG - Test Clock Generator
TP - Transition Probability
TPA - Transition Probability Analyzer
TVG - Test Vector Generator
Ur - Utilisation Ratio
VLSI - Very Large-Scale Integration
VPR - Versatile Place and Route
Appendix
The code and resources used in this thesis are stored on the following server:
svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup
Modified VPR5
VPR 5.0 is used and modified to execute variation-aware optimisation.
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/VPR
The modified VPR accepts the following additional command-line options:
-var aware lookup: VPR will apply variation-aware placement based on the
variation maps whose location is declared in the architecture file.
routing algorithm -var timing driven: VPR will apply variation-aware routing.
routing algorithm -part var timing driven: VPR will apply variation-aware
partial rerouting. This option should be used together with
-near crit threshold x.
-near crit threshold x: VPR will apply partial rerouting with Crit T = x.
-retiming echo: VPR will produce a timing graph which can be used later in the
variation-aware retiming.
Variation Maps
Two groups of region-based variation maps with different granularities were
measured from commercial FPGAs in this thesis and are stored under “MAPS”.
Coarse-grained variation maps: Variation maps in which one LAB and its routing
resources form a region.
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/MAPS/Coarse/
Fine-grained variation maps: Variation maps in which 4 LUTs inside one LAB form
a region.
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/MAPS/Fine/
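As an illustration of how the two granularities relate, the following Python sketch maps a LUT to its region in each kind of map and scales a nominal delay by the value stored for that region. The storage format (a dictionary of per-region delay scaling factors), the fallback factor of 1.0 and all function names are assumptions made for this example; they are not taken from the files in the repository.

```python
# Hypothetical sketch: region lookup for coarse- and fine-grained variation maps.
# The dictionary-based map format and all names here are illustrative assumptions.

LUTS_PER_REGION = 4  # from the text: 4 LUTs inside one LAB form a fine-grained region


def coarse_region(lab_x, lab_y):
    """Coarse-grained map: one LAB and its routing resources form one region."""
    return (lab_x, lab_y)


def fine_region(lab_x, lab_y, lut_index):
    """Fine-grained map: consecutive groups of 4 LUTs inside a LAB form a region."""
    return (lab_x, lab_y, lut_index // LUTS_PER_REGION)


def lookup_delay(variation_map, region, nominal_delay):
    """Scale a nominal delay by the factor stored for the region.

    Regions missing from the map are assumed to have no variation (factor 1.0).
    """
    return nominal_delay * variation_map.get(region, 1.0)


# Usage: LUT 5 of the LAB at (2, 3) falls into the second fine-grained region.
fine_map = {(2, 3, 1): 1.1}
delay = lookup_delay(fine_map, fine_region(2, 3, 5), 100.0)
```

Keeping the region computation separate from the lookup means the same delay evaluation can be run against either group of maps by swapping the region function.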
Experiment Scripts
All experiments introduced in this thesis were executed on the high performance
computer (HPC). The scripts used are written in Python and Perl.
2-stage Placement
The Python code for two-stage placement is stored at
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/Python/TwoStagePlace
exp1.py: Execute normal placement with VPR and check that the experiment runs
on the HPC.
exp2.py: Execute normal placement with VPR, but load the variation maps at the
end to evaluate the delay of the critical path.
exp3.py: Add noise reduction method to VPR.
exp4.py: Test variation-aware placement and 2-stage placement by VPR.
exp6.py: Test variation-aware and 2-stage placement with different FPGA sizes
and different seeds.
exp7.py: Test variation-aware and 2-stage placement with different levels of
process variation (10%, 20% and 30%).
exp8.py: Combine all previous experiments and collect results.
exp9.py: Combine all previous experiments and collect results with different seeds.
exp10.py: Combine all previous experiments and collect results with different
FPGA sizes.
exp11.py: Combine all previous experiments and collect results with different
numbers of clusters.
exp12.py: Combine all previous experiments and collect results with different
classification methods.
Partial Rerouting
The Python code for partial rerouting is stored at
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/Python/PartialReRouting
exp5.py: Run the partial rerouting experiment with VPR.
Matlab code
Matlab is used in this thesis for the classification step of 2-stage placement
and for variation-aware retiming.
Variation-aware Retiming
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/Matlab/Retiming
The Matlab files implementing the algorithm are as follows:
Main.m: Main algorithm for retiming.
Initialization.m: Initialize the timing graph for retiming.
r ADJUST.m: Move FFs to achieve better timing performance.
t ADJUST.m: Search for better possible clock period.
Add FFs.m: Add FFs to the sinks of circuits.
Calculate T.m: Calculate clock period for current timing graph.
Check A Circle Topological Sort.m: Topologically sort the timing graph and
determine whether a cycle exists in it.
Find Critical Path.m: Find the critical path in the current timing graph.
ReadGraphXML.m: Read the results of placement and routing in .xml format.
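The cycle check and the clock period calculation listed above can be sketched in Python as follows. This is a minimal sketch of the standard Leiserson-Saxe view of a timing graph, in which each edge carries a register (FF) count and the clock period is the longest path through edges that carry no register. It is not a transcription of the Matlab files; the graph representation and the function names are assumptions made for this example.

```python
from collections import deque

# Hypothetical sketch: cycle check by topological sort and clock period
# calculation on a timing graph. Edges are (source, sink, number_of_FFs).


def topological_order(nodes, edges):
    """Kahn's algorithm over register-free edges.

    Returns a topological order, or None if the register-free edges form a
    cycle (a combinational loop).
    """
    successors = {v: [] for v in nodes}
    in_degree = {v: 0 for v in nodes}
    for u, v, ffs in edges:
        if ffs == 0:
            successors[u].append(v)
            in_degree[v] += 1
    queue = deque(v for v in nodes if in_degree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)
    return order if len(order) == len(nodes) else None


def clock_period(delay, edges):
    """Longest register-free path delay; delay maps each node to its delay."""
    nodes = list(delay)
    order = topological_order(nodes, edges)
    if order is None:
        raise ValueError("timing graph contains a combinational cycle")
    predecessors = {v: [] for v in nodes}
    for u, v, ffs in edges:
        if ffs == 0:
            predecessors[v].append(u)
    arrival = {}
    for v in order:
        latest_input = max((arrival[u] for u in predecessors[v]), default=0.0)
        arrival[v] = delay[v] + latest_input
    return max(arrival.values())


# Usage: the FF on edge b -> c breaks the loop, so the longest path is c -> a -> b.
delays = {"a": 1.0, "b": 2.0, "c": 3.0}
graph = [("a", "b", 0), ("b", "c", 1), ("c", "a", 0)]
period = clock_period(delays, graph)  # 3.0 + 1.0 + 2.0 = 6.0
```

With measured variation maps, the per-node delays would come from the map rather than from a nominal model. Moving an FF changes the register counts on the edges, after which the period can be recomputed in the same way.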
Variation Maps Classification
Link: svn+ssh://zg08@saba.ee.ic.ac.uk/ProcVarRepo/CodeBackup/Matlab/Classification
Training identify map.m: Classify the variation maps and produce a mean map for
each class (named mean mapx.dat, where x is the cluster number) and a
Class-ID for each variation map (stored in a file called IDX).
K means.m: Apply the k-means method for classification.
Training map PCA.m: Apply PCA for map classification.
Identify new map.m: Classify a map that is not in the database into one of the
classes.
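The classification flow described by these files (cluster the measured maps, keep one mean map per class, then assign an unseen map to the nearest class) can be sketched in Python as follows. This is a plain Lloyd-style k-means written for illustration under stated assumptions: each map is flattened to a list of delay values, the first k maps seed the classes, and distance is squared Euclidean. None of these choices, nor the function names, are taken from the Matlab code.

```python
# Hypothetical sketch: k-means classification of variation maps.
# Each map is a flat list of per-region delay values of equal length.


def squared_distance(a, b):
    """Squared Euclidean distance between two flattened maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def identify_new_map(new_map, mean_maps):
    """Return the index of the class whose mean map is closest to new_map."""
    return min(range(len(mean_maps)),
               key=lambda c: squared_distance(new_map, mean_maps[c]))


def k_means(maps, k, max_iterations=100):
    """Cluster maps into k classes; returns (class IDs, mean maps)."""
    # Assumption: the first k maps are used as the initial class centres.
    mean_maps = [list(m) for m in maps[:k]]
    idx = None
    for _ in range(max_iterations):
        new_idx = [identify_new_map(m, mean_maps) for m in maps]
        if new_idx == idx:
            break  # assignments are stable, so the algorithm has converged
        idx = new_idx
        for c in range(k):
            members = [m for m, i in zip(maps, idx) if i == c]
            if members:  # keep the previous centre if a class becomes empty
                mean_maps[c] = [sum(col) / len(members) for col in zip(*members)]
    return idx, mean_maps


# Usage: four tiny two-region maps that fall into two obvious classes.
training_maps = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [4.8, 5.2]]
class_ids, class_means = k_means(training_maps, 2)
new_class = identify_new_map([5.1, 4.9], class_means)
```

Because a new map is classified only by comparing it with the k mean maps, identifying a chip that was not in the training set costs k distance computations rather than a fresh clustering run.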
