Delay Measurements and Self Characterisation on FPGAs by Wong, Justin S. J.
Imperial College London of Science,
Technology and Medicine
Department of Electrical and Electronic Engineering
Delay Measurements and Self Characterisation
on FPGAs
Justin S. J. Wong
A thesis submitted for the degree of Doctor of Philosophy
of Imperial College London, October 2010
Revision: February 2011
Abstract
This thesis examines new timing measurement methods for self delay characterisation of Field-
Programmable Gate Array (FPGA) components and delay measurement of complex circuits
on FPGAs. Two novel measurement techniques based on analysis of a circuit’s output failure
rate and transition probability are proposed for accurate, precise and efficient measurement of
propagation delays. The transition probability based method is especially attractive, since
it requires no modification to the circuit-under-test and few hardware resources,
making it an ideal method for physical delay analysis of FPGA circuits.
Relentless advances in process technology have led to smaller and denser transistors
in integrated circuits. While FPGA users benefit from this in terms of increased hardware
resources for more complex designs, the actual productivity with FPGAs in terms of timing
performance (operating frequency, latency and throughput) has lagged behind the potential
gains offered by the improved technology, owing to delay variability in FPGA components
and the inaccuracy of timing models used in FPGA timing analysis. The ability to measure
the delay of any arbitrary circuit on an FPGA offers many opportunities for on-chip characterisation
and physical timing analysis, allowing delay variability to be accurately tracked and variation-
aware optimisations to be developed, reducing the productivity gap observed in today’s FPGA
designs.
The measurement techniques are developed into complete self measurement and characterisa-
tion platforms in this thesis, demonstrating their practical uses in actual FPGA hardware for
cross-chip delay characterisation and accurate delay measurement of both complex combina-
torial and sequential circuits, further reinforcing their positions in solving the delay variability
problem in FPGAs.
Acknowledgements
The work in this thesis was carried out under the supervision of Prof. Peter Y. K. Cheung.
It has been a pleasure and joy to work with Peter, who is always passionate and enthusiastic
about the subject. I particularly thank him for his valuable advice and guidance throughout my
study at Imperial College and his help through every critical moment in my PhD study. I
would also like to thank Dr. Pete Sedcole for his patience and support on technical problems,
as well as pointing me in the right direction through every step of my research.
For the research hardware and equipment, I would like to thank Altera and Terasic for
providing both software and hardware support on the FPGA platforms. Without their support,
most of the experiments would not have been possible.
I thank all my friends for their encouragement and support, especially those in my research
group that I have had the privilege to work with.
Lastly, I would like to thank my mother for her unconditional love and support throughout
my PhD study, despite my being thousands of miles away from home.
Dedication
This thesis is dedicated to, and written in memory of, my father, Thomas, who developed my
passion and interest in science, engineering and electronics, and a creative mind that led to the
contributions in this work. It is also dedicated to my mother, Mamie, who taught me the
true meaning and value of knowledge, and guided me in the right direction towards a successful
life.
‘If you know the enemy and know yourself you need not fear the results of a hundred battles.’
Sun Tzu
Contents
Abstract 3
Acknowledgements 5
1 Introduction 29
1.1 Motivations and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3 Statement of Originality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2 Background and Related Work 36
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 The Role of Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.1 Performance and Functional Fault Classification . . . . . . . . . . . . . . 37
2.2.2 Device Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Field-Programmable Gate-Array . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.1 The Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.2 Speed Grading of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Delay Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.1 Background on Process Variation . . . . . . . . . . . . . . . . . . . . . . 43
2.4.2 The Impact of Delay Variability . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.3 Counter Measures Against Delay Variability . . . . . . . . . . . . . . . . 45
2.5 Fundamental Test Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.1 Test Vector Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.2 Test Data Launch, Capture and Analysis . . . . . . . . . . . . . . . . . . 52
2.5.3 At-Speed Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5.4 Built-in Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.5 Design for Testability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6 Existing Test and Measurement Methods . . . . . . . . . . . . . . . . . . . . . . 58
2.6.1 Ring Oscillator Delay Measurement . . . . . . . . . . . . . . . . . . . . . 59
2.6.2 Signature Registers Based Delay Measurement . . . . . . . . . . . . . . . 61
2.6.3 Time-To-Voltage Delay Measurement . . . . . . . . . . . . . . . . . . . . 62
2.6.4 Time-To-Digital Delay Measurement . . . . . . . . . . . . . . . . . . . . 65
2.7 Timing Error Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.7.1 The RAZOR Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.8 Timing Model Based Performance Analysis . . . . . . . . . . . . . . . . . . . . . 71
2.8.1 Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.8.2 Timing Analysis Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.8.3 Discussion on Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . 74
2.9 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3 The Failure Rate Detection Measurement Method 79
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2 Principle of The Measurement Circuit . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.1 Timing Resolution and Accuracy . . . . . . . . . . . . . . . . . . . . . . 81
3.3 Path Delay Measurement Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.1 Timing Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4 Simulation and Modelling of Timing Failure . . . . . . . . . . . . . . . . . . . . 86
3.4.1 Asymmetric Path Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4.2 Clock Jitter and Flip-flop Metastability . . . . . . . . . . . . . . . . . . . 89
3.4.3 Choosing the Timing Failure Reference Point . . . . . . . . . . . . . . . 92
3.5 Test Circuit Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.5.1 Path Specific FRD Test Module . . . . . . . . . . . . . . . . . . . . . . . 93
3.5.2 Clock Frequency Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.5.3 Clock Generator Implementations and Timing Resolution . . . . . . . . . 98
3.5.4 FPGA Delay Characterisation Circuit . . . . . . . . . . . . . . . . . . . . 103
3.6 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.6.1 Cyclone II Measurements and Characterisations . . . . . . . . . . . . . . 106
3.6.2 Cyclone III Measurements and Characterisations . . . . . . . . . . . . . 107
3.6.3 Adder Circuits Cross-Chip Characterisation . . . . . . . . . . . . . . . . 109
3.6.4 Result Precision, Accuracy and Reliability Evaluation . . . . . . . . . . . 110
3.7 Optimised Built-In Self-Test Designs . . . . . . . . . . . . . . . . . . . . . . . . 111
3.7.1 Parallel First-Fail Detector BIST . . . . . . . . . . . . . . . . . . . . . . 112
3.7.2 Binary Search BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.8 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4 The Transition Probability Measurement Method 119
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.2 Principle of the Measurement Method . . . . . . . . . . . . . . . . . . . . . . . . 120
4.2.1 Inspiration and Discovery of the New Technique . . . . . . . . . . . . . . 120
4.2.2 The Concept of Statistical Measurements . . . . . . . . . . . . . . . . . . 121
4.2.3 Transition Probability Profile Based Measurement . . . . . . . . . . . . . 123
4.3 Simulation and Modelling of Statistical Profiles . . . . . . . . . . . . . . . . . . 125
4.3.1 The Contributions of Timing Uncertainties . . . . . . . . . . . . . . . . . 125
4.3.2 Single Path TP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.4 Test Implementations and Results . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.4.1 Adder Critical Path Testing . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.4.2 LFSR Sequential Circuit Testing . . . . . . . . . . . . . . . . . . . . . . 141
4.4.3 Test Time and Optimisations . . . . . . . . . . . . . . . . . . . . . . . . 144
4.5 Practical Usage of the TP Method . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.5.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5 Complex Circuit Testing 152
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.2 FRD Based Embedded Multiplier Testing . . . . . . . . . . . . . . . . . . . . . . 153
5.2.1 Per-path Exhaustive Testing . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.2.2 Embedded Multipliers Delay Characterisation . . . . . . . . . . . . . . . 157
5.3 Measuring Complex Multi-Path Circuits with Transition Probability . . . . . . . 159
5.3.1 Initial TP Test on Embedded Multiplier . . . . . . . . . . . . . . . . . . 160
5.3.2 Detailed Study of Transition Probability’s Characteristics . . . . . . . . . 165
5.4 Self-Optimising Complex Circuit Test Platform . . . . . . . . . . . . . . . . . . 180
5.4.1 Adaptive Input Probability Weighting . . . . . . . . . . . . . . . . . . . 180
5.4.2 Bit-Specific Optimisation for Low Controllability
and Observability Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.4.3 Generating Weighted Probability Test Vectors . . . . . . . . . . . . . . . 184
5.5 Test Platform Implementation and Evaluation on FPGA . . . . . . . . . . . . . 187
5.5.1 Multiplier Test Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.5.2 Butterworth IIR Filter Test Case . . . . . . . . . . . . . . . . . . . . . . 191
5.5.3 Test Circuit Area Estimations . . . . . . . . . . . . . . . . . . . . . . . . 191
5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6 Conclusion 197
6.1 Measurement Resolution and Accuracy . . . . . . . . . . . . . . . . . . . . . . . 197
6.2 Limitations and Future Technology Scalability . . . . . . . . . . . . . . . . . . . 198
6.3 Advancements Over Existing Measurement
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.4 Closing the FPGA Design Productivity Gap . . . . . . . . . . . . . . . . . . . . 200
7 Future Work 202
7.1 Memory Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.2 Transition Probability Based On-line Test . . . . . . . . . . . . . . . . . . . . . 202
7.3 Physical Timing Model Tuning and Extraction . . . . . . . . . . . . . . . . . . . 203
7.4 Design Tool Integration and Measurement Based
Optimisations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
A Glossary 206
B MATLAB Simulation 210
C 17-Level Weighted HP Logic Table 219
Bibliography 220
List of Tables
2.1 Comparison of random vector generators . . . . . . . . . . . . . . . . . . . . . . 51
2.2 Comparison of delay measurement, error detection and timing analysis methods 78
5.1 Delay statistics and variation of multipliers on the Cyclone III . . . . . . . . . . 159
5.2 Example input to output activity map for bit-specific high probability (HP)
weights optimisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
5.3 Example of logic based weighted random sequence generation with different high
probability (HP) and a “Special” weight reserved for toggle signal generation. . 186
5.4 The general resource usage of the TP test platform in terms of the number of
Logic Elements (LEs) or Adaptive Logic Modules (ALMs) [25]. The estimations
are based on the number of input bits n and output bits m, the TP counter bit
width k and the order of magnitude r of the number of HP weight levels. . . . . 194
B.1 Event based representation of clock signals with bounded edge-to-edge jitter
variable τ , where τmin ≤ τn−1 − τn ≤ τmax. . . . . . . . . . . . . . . . . . . . . . 210
B.2 Event based representation of input vectors and normal signals. . . . . . . . . . 210
C.1 Weighted random sequence generation with 17 high probability (HP) levels and
toggle signal. R3, R2, R1 and R0 are the independent uniform random inputs. . 219
List of Figures
2.1 Basic principle of speed-binning. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2 A typical Logic Element (LE) with register feedback. . . . . . . . . . . . . . . . 40
2.3 A typical interconnect and logic clusters structure of FPGAs. . . . . . . . . . . . 41
2.4 An example of a typical FPGA layout (based on the Cyclone III architecture [28]). 41
2.5 (a) Imperfection of photo-resist at microscopic scale. (b) Line edge roughness
of metal interconnects. (c) Random discrete dopants in CMOS transistor. (d)
Non-uniform gate thickness. (Images courtesy of University of Glasgow and
IBM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 An example showing LUT delay in FPGAs under the influence of process vari-
ability and the ineffectiveness of current speed-binning methodology in tackling
a wide spread of within-die component delays. . . . . . . . . . . . . . . . . . . . 46
2.7 Visual comparison of 32-bit wide random sequences from [47] generated by (a)
Linear-Feedback Shift-Register (LFSR), (b) Linear Hybrid Cellular Automata
(LHCA) and (c) phase-shifted LFSR. Each column of pixels represents a 32-bit
binary value, and the columns are presented horizontally from left to right to illustrate the
entire series of random values generated from each method. . . . . . . . . . . . 50
2.8 An example of the Scan-Chain architecture. . . . . . . . . . . . . . . . . . . . . 52
2.9 An example of a 3-bit wide Built-in Logic Observer (BILO) with selectable LFSR
or CRC Signature generation in a Scan-Chain architecture. . . . . . . . . . . . . 53
2.10 (a) A typical Ring Oscillator (RO) structure using inverters. (b) The general
definition of an RO circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.11 The waveform of an RO oscillation with period Tosc. tfall and trise represent the
propagation delay of falling and rising signal transition through the RO loop. . 60
2.12 An example showing how the analogue voltage in a sawtooth signal can be used
to measure the delay of a signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.13 The principle circuit diagram of a time-to-voltage based delay measurement
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.14 The principle circuit diagram of a Delay Lock-Loop (DLL) based delay measure-
ment method, containing a phase-detector (PD) and a voltage controlled Delay
Line (VCDL). Note that the charge pump and filter circuits for the control voltage
generation are omitted in this illustration. . . . . . . . . . . . . . . . . . . . . 64
2.15 A ring oscillator based pulse-to-oscillation modulator. The modulated oscillation
frequency is determined by the delay element td. . . . . . . . . . . . . . . . . . . 65
2.16 An example of a relative delay measurement circuit using a reference path with
known delay. A count up or down signal indicates whether the relative delay is
positive or negative. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.17 A test case with three PUTs without a reference path with known delay. The
circuitry for generating count up or down signal is omitted in the diagram. . . . 67
2.18 A typical Vernier Delay Line (VDL) containing delay elements with delay t1 and
t2. The time difference between the Start and Stop pulses is encoded by the
VDL as a set of Delay Bit Pattern. . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.19 A principle test circuit using VDL. . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.20 Structure of the RAZOR Flip-Flop pipeline proposed in [18] to detect bit level
errors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.21 (a) Flow diagram of a typical timing model driven placement and routing process
of FPGA designs. (b) Flow diagram of a physical delay characterisation and
timing measurement driven placement and routing process of FPGA designs. . . 75
3.1 An example of a failure rate profile. . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2 Basic principle of the Failure Rate Detection (FRD) delay measurement method. 80
3.3 General implementation of the Failure Rate Detection (FRD) test circuit. . . . . 83
3.4 Timing diagram of the test circuit showing 3 cycles of operations. tcomb represents
the combinatorial path delay for each cycle. tclk S is the clock-to-output delay of
the launch register (LR). tg is the propagation delay of the XOR gate. tsetup SR
and thold SR are the setup and hold time of the sample register (SR). tsetup CR
and thold CR are the setup and hold time of the capture register (CR). . . . . . . 84
3.5 Simulated failure profile of a circuit path with ideal flip-flops and clock. (a)
shows the profile of an ideal circuit with perfect rising and falling transition delay
symmetry, and (b) shows the effect of asymmetrical rising and falling transition
where they propagate through the CUT with different delays. . . . . . . . . . . 87
3.6 A timing diagram showing the conditions resulting in 0%, 50% and 100% failure
rate in the profile. trise and tfall are the delay of rising and falling transitions
through the CUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.7 Simulated failure rate profiles of a circuit path showing the effects of clock jitter
and flip-flop metastability. (a) and (b) show the PDFs used in the simulation
to describe the edge-to-edge clock jitter and metastable window; (c) shows and
compares the effects of clock jitter and metastability on the failure profile. . . . . 90
3.8 The Failure Rate Detection (FRD) test module implementation for measuring
specific combinatorial path delay. . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.9 (a) Test Clock Generator (TCG) implemented from Virtex-4 DCMs to test a
Cyclone II FPGA without run-time reconfigurable PLLs; (b) a self-contained
TCG implemented from run-time reconfigurable PLLs in a Cyclone III FPGA.
The Control Signals manage the reconfiguration process of the TCG and provide
the clock synthesis parameters. The locked signal indicates when the test clock
is ready to use. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.10 Dynamic timing resolution in terms of delta T/2 of test clock from 128 to 800MHz
generated from two cascaded DCMs on a Xilinx Virtex-4. . . . . . . . . . . . . 101
3.11 Distribution of step size (delta T/2) of test clock (128 to 800MHz) generated
from two cascaded DCMs on a Xilinx Virtex-4. . . . . . . . . . . . . . . . . . . 101
3.12 Comparison of dynamic timing resolution (delta T/2) of test clock from 128 to
800MHz generated from a single PLL (top) and a two-staged PLL configuration
with a post clock multiplier (bottom) on an Altera Cyclone III. . . . . . . . . . 102
3.13 Comparison of step size (delta T/2) distribution of test clock (128 to 800MHz)
generated from a single PLL (top) and a two-staged PLL configuration with a
post clock multiplier (bottom) on an Altera Cyclone III. . . . . . . . . . . . . . 103
3.14 An FPGA characterisation circuit using an array of FRD test modules. Each
FRD is activated and tested by specific address signals and a timed test enable
signal that lasts for a precise length of time. . . . . . . . . . . . . . . . . . . . . 104
3.15 An FRD module configured to test LUTs as full adders in arithmetic mode and
their dedicated fast carry-chains on an Altera Cyclone III EP3C25. . . . . . . . 104
3.16 A timing diagram showing the test enable signal, test clock frequency stepping
and failure count of several test cycles of the FRD circuit up to N cycles, where
ttest defines the length of each test. tconfig is the reconfiguration time of the TCG
to the next frequency step and tlock is the time needed to lock onto the new clock
frequency. tsettle and tclear are the times reserved for the counter values to settle
and to clear the counter for the next test cycle. tread is the duration of time
where the failure count is valid for reading after each test. . . . . . . . . . . . . 106
3.17 (a) The failure rate profile obtained from a CUT containing 2 LUTs on a 90nm
Altera Cyclone II EP2C35; (b) Delay map generated from FRD array at different
locations in terms of LAB coordinates across the Cyclone II chip. . . . . . . . . 107
3.18 (a) The failure rate profile obtained from a CUT containing 4 LUTs on a 65nm
Altera Cyclone III EP3C25; (b) Delay map generated from FRD array at different
locations in terms of LAB coordinates across the Cyclone III chip. . . . . . . . . 108
3.19 Delay map of (a) 4-bit and (b) 8-bit adders at different locations (LAB coordi-
nates) across the Cyclone III chip. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.20 (a) The scatter plot of half clock period (T/2) at 25% failure rate with exponential
best fit. (b) Histogram of residuals of the delay measurements around
the exponential best fit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.21 Consistency test based on standard deviation of measurements computed across
90 sets of delay map results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.22 Modified first-fail detector (FFD) circuit. . . . . . . . . . . . . . . . . . . . . . 112
3.23 (a) FFD clusters based parallel BIST system schematic; (b) Array of FFD clus-
ters on the Cyclone II EP2C35, each containing 16 FFD blocks. . . . . . . . . . 113
3.24 (a-g) Progressive failure maps of FFDs. . . . . . . . . . . . . . . . . . . . . . . 115
3.25 A binary search based FRD characterisation BIST. . . . . . . . . . . . . . . . . 116
4.1 A typical synchronous circuit with input and output registers, and a combina-
torial stage in between. Registers are driven by a clock of period T . . . . . . . . 121
4.2 Example statistical profiles of an input to a circuit generated from a stationary
process (top) and the output of the circuit that failed after fmax (bottom) in the
clock frequency domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.3 Principle circuit diagram of the transition probability (TP) measurement method. 124
4.4 Comparison of simulated failure rate and statistical profiles of a single path CUT
in ideal conditions, with clock jitter and flip-flop metastability. The plots show:
(a) the failure rate profiles, (b) the transition probability profiles, and (c) the
high probability profiles. Since the FRD method is based on half clock period,
the frequency scale of the failure rate plot is doubled to match the other two
plots for direct comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.5 Simulated TP and HP profiles of a circuit path with identical rising and falling
transition delays under random edge-to-edge clock jitter. The effects of clock
jitter and flip-flop metastability are depicted by their corresponding plots. . . . . 127
4.6 Timing diagram illustrating the combinatorial output of a circuit path and the
corresponding clock edges with jitter. The edge jitter is described by a random
variable τ in terms of timing variation from the expected clock edge. The clock
jitter distribution PDF (τ) is centered around the expected clock edge at T and
is bounded by τmin and τmax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.7 An example plot of the behaviour of low frequency multi-cycle random jitter.
The clock edges vary randomly over multiple cycles, but each edge is correlated
with its neighbouring edges due to the low frequency and gradual timing drift. . 130
4.8 Scatter plots showing the edge-to-edge correlation of clock signals with (a) edge-to-edge
random jitter, and (b) low frequency random jitter where the jitter between
consecutive edges is highly correlated with a correlation coefficient of
0.977. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.9 The histograms show the edge-to-edge relative jitter distributions of (a) edge-to-
edge random jitter, and (b) low frequency random jitter within the same jitter
boundaries of ±15ps. The distribution of (b) does not differ much from (a)
because the clock edge timing variations in both cases are unpredictable between
multiple clock cycles and are uniformly spread within the same boundaries. The
major difference in (b) is the reduced steepness near the boundaries, which may
lead to a slightly smoother TP profile around the failure frequency of the CUT. 131
4.10 Simulated TP (a) and HP (b) profiles of a circuit path with identical rising and
falling edge delay under low frequency clock jitter. The effects of clock jitter and
flip-flop metastability are depicted by their corresponding plots, where the TP
plot with jitter has a reduced sensitivity to timing failure due to edge-to-edge
jitter correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.11 Comparison of low frequency multi-cycle random jitter against edge-to-edge ran-
dom jitter in terms of the TP profile of a circuit path with different rising and
falling transition delays. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.12 Actual TP profile from a CUT on a Cyclone III EP3C25 with similar rising and
falling transition delays. The delays are matched by configuring the LUTs as
inverters to even out the delay difference between rising and falling transitions.
The CUT is driven by simple toggle stimulus, which allowed the individual TP
profile of each transition type to be isolated by measuring at even or odd clock
cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.13 The CUT is a 4-bit Adder implemented on an Altera Cyclone III EP3C25 with
LUTs in arithmetic mode. A and B are the inputs of each Full Adder (FA). C
and S are the carry signals and sum outputs respectively. The carry signals are
transmitted using dedicated carry-chain interconnects. . . . . . . . . . . . . . . 138
4.14 Transition probability D(y) vs. frequency plot from 750 to 1250MHz of the criti-
cal path through the 4-bit Adder CUT on the Cyclone III. Region “a” represents
the initial TP level of normal circuit operation. Regions “b” to “e” are the dis-
tinctive regions caused by progressive failure of signal propagation in the CUT. . 138
4.15 The top plot is the TP profile of the 4-bit Adder vs. clock period steps in de-
scending order. The bottom plot shows the estimated probability density func-
tion (PDF (τ)) of clock jitter derived from the TP profile by taking its derivative.
The TP profile data points are first filtered by a median filter to reduce noise
level in the resulting derivative. τmedian0 and τmedian1 are the median values of
the two jitter distributions corresponding to the negative and positive transitions
of output Z. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.16 The test circuit for testing a 63 bit maximal length linear-feedback shift-register
(LFSR) on the Cyclone III. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.17 The output TP profile of the linear-feedback shift-register (LFSR) from 1.05 to
1.15 GHz. Upper and lower thresholds are used to determine the point of fmax
where the deviation of D(y) occurs. . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.18 Floor plan of the complete test circuit on the Cyclone III EP3C25. Four equal
sized partitions are selected to accommodate the 4 types of logic activity stress
(300MHz, 1Hz, DC1 and DC0). Each LUT test block contains a CUT of 9 LUTs
placed in a logic array block (LAB). CUTs of row and column interconnects
(IC) are created with anchor LUTs and start/end LUTs to control the span and
orientation of the interconnect lines. The entire FPGA contains 18 × 32 LUT
test blocks, 19 × 4 sets of column IC test and 32 sets of row IC test. . . . . . . . 146
4.19 Degradation maps of the Cyclone III in terms of the propagation delay of rising
(fast) and falling (slow) transitions through each LUT test block. Different
rates of degradation are seen between the four stress regions from top to bottom:
300MHz, 1Hz, DC1 and DC0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.20 Degradation measurement in delay of rising (fast) and falling (slow) transitions
in LUT chains under accelerated life stress, grouped by four types of input signal. 149
4.21 Degradation measurement in delay of rising (fast) and falling (slow) transitions
in C4 interconnects under accelerated life stress, grouped by four types of input
signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.22 Degradation measurement in delay of rising (fast) and falling (slow) transitions
in R24 interconnects under accelerated life stress, grouped by four types of input
signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.1 Two different test circuit approaches for multiplier testing. Note that the TCG
is omitted in the diagrams and all clock ports are driven by the same test clock
signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.2 (a) shows the circuit used to test the embedded multiplier exhaustively; (b)
shows the flow chart of the test procedure, where finit is the initial test clock
frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.3 (a) A detailed plot of worst-case delays associated with each output bit of a
specific multiplier. (b) A floor-plan of the multiplier blocks under test on the
Cyclone III chip, showing a total of 66 blocks divided into two columns (Left
and Right). Each block contains two 9x9 multipliers, M0 and M1, resulting in
a total of 132 9x9 multipliers. (c) Plots of worst-case delay at the most
significant bit (MSB) output of all 132 9x9 embedded multipliers on the Cyclone
III.
5.4 The transition probability measurement circuitry for a complex circuit with n
input bits and m output bits. A random vector generator is used to stimulate the
CUT.
5.5 Circuit diagram of the CUT containing a 9x9 hardware multiplier block with
inputs A, B and output Z. The registered output y is processed by the TP
measurement circuitry to produce a TP profile.
5.6 Test results of the multiplier block presented as a colour map of D(y) for all
18 output bits from 400 to 850MHz. Each column represents the result of one
output bit and the colour represents the value of D(y) at a particular frequency
indicated by the vertical axis.
5.7 The D(y) plot of the 18th output bit against test frequency steps. Upper and
lower thresholds are used to determine the point where D(y) begins to deviate
from its steady state value.
5.8 This figure compares the f_max estimations obtained using the TP profile with results
from the exhaustive ground-truth test. Note that the architecture of the Cyclone III
permits at most 16 bits to be connected directly via fast direct-link interconnects
to adjacent logic array blocks (LABs) [28]. Therefore, to avoid inconsistency of
the ground-truth test results caused by usage of slower global interconnects, the
2 least significant output bits are excluded.
5.9 An example of the basic TP profile of a single logic path failing at t_slow and t_fast for
falling and rising transitions respectively.
5.10 A simple 3-stage pipeline circuit consisting of three simple combinatorial paths
(stage A, B and C) with different path delays.
5.11 The TP profile simulation of a simple 3-stage pipeline.
5.12 A TP profile measurement of the 2nd LSB output of a 9x9 embedded multiplier
on the Cyclone III EP3C25. The unusual shape of the TP profile is the result of
individual paths failing at different times. The corresponding paths are isolated
and tested separately to obtain their basic TP profile components for reference.
5.13 Timing diagram showing the activity of output bit (Z) and registered output (Y)
in a multi-input, multi-path circuit with a single output. A certain glitch period occurs
after each clock edge due to variation in propagation delay between different
paths. The position of the clock edge is governed by the jitter distribution PDF(τ),
and the probability of the register capturing the correct value in B′ depends on
both the glitch pattern and the overlapping jitter region.
5.14 Plots evaluating the sensitivity of TP to timing failure in a circuit path.
Maximum sensitivity is achieved when the input vector V has high probability
H(V) = 0.67 when falling transitions fail first, or H(V) = 0.33 when rising
transitions fail first.
5.15 A simple two input arbitrary functional block for testing the sensitivity of
transition probability to timing failure with multiple signal paths.
5.16 The output TP response mapping of all possible input HPs for (a) AND gate, (b)
OR gate and (c) XOR gate. The contour lines on the maps represent levels of the
same output TP, hence any change of input H(A) and H(B) along the contour
lines leads to an unchanging output D(S), possibly blocking failure responses
from the previous logic stages.
5.17 The TP failure sensitivity mapping of all possible input HPs for (a) AND gate,
(b) OR gate and (c) XOR gate. The contour lines represent the level at which
sensitivity is zero. Both positive and negative sensitivity values represent a
measurable change of TP, but in different directions.
5.18 Simulated TP profile of an XOR gate showing (a) sensitivity loss to timing failure
in all paths when using uniformly distributed random inputs, and (b) sensitivity
restored using H(A) = H(B) = 0.87.
5.19 (a) shows the sensitivity loss to failure of the slowest type of transitions in the
XOR and regained sensitivity when both types of transitions have failed; (b)
shows the fully regained sensitivity using H(A) = H(B) = 0.87.
5.20 Block diagram of the self-optimising complex circuit test platform.
5.21 A partial/dynamic reconfiguration approach for 17-level reconfigurable HP weights
and toggle signals using a typical Logic Element (LE) with 4-input LUT and
register.
5.22 Layout of the hardware test platform on a Cyclone III EP3C25 for complex
CUT. An alternative test circuit is included for accuracy evaluation of the TP
test platform.
5.23 Accuracy evaluation of the TP method on a 4x4 LUT based multiplier with
optimised input HP against the exhaustive test method proposed in Section 5.2.1.
5.24 Plot showing the error sensitivity improvement of the TP profile of a single
multiplier output bit using optimised input HP.
5.25 The 4th order Butterworth IIR filter design from Altera [86], where x(n) is the
input and y(n) is the output.
5.26 The TP profiles of all 21 output bits of the Butterworth IIR filter.
5.27 The absolute failure rate of the Butterworth IIR filter between 120 and 270MHz.
6.1 An improved FPGA design flow that allows users to more effectively express the
true potential and timing performance of their designs in real FPGA hardware
while maintaining reliability.
Chapter 1
Introduction
1.1 Motivations and Objectives
Testing and timing measurement in Very Large Scale Integration (VLSI¹) circuits have long
been a reliable way of detecting delay faults, as well as classifying circuits into different timing
performance levels according to physical measurements. This has allowed Application Specific
Integrated Circuits (ASICs) to operate reliably at their optimal speed and provide the highest
possible productivity.
The introduction of the Field-Programmable Gate Array (FPGA) architecture, while providing a
flexible and configurable hardware platform for quick hardware implementation, has prevented
the traditional use of testing to provide specific circuit timing information, because the actual
use of an FPGA by potential end-users cannot be predicted at the manufacturing stage. Although
the underlying structure of an FPGA is defined like that of an ASIC, where the transistors and
interconnects are set at the chip fabrication stage, it is exceedingly difficult to accurately model
the specific timing of each component in all possible modes of operation and under the different
signal loading conditions of practical use [1, 2]. For this reason, FPGA manufacturers often
assign a conservative worst-case model to FPGAs to ensure that all designs operate correctly
at the stated speed.
¹ See Glossary in Appendix A for the detailed definitions of acronyms.
This strategy has, however, proven to be less than optimal and often sacrifices a large
margin of timing yield and timing performance. For example, [3] (Section 5.2.2) shows that the
physical worst-case delay of the 9x9 embedded multipliers in an Altera 65nm Cyclone III with
C8 (slowest) speed grade is just under 1.8ns. However, to our surprise, the timing model in the
Quartus II design tool reports a delay of over 3.7ns, and the Cyclone III’s data sheet [4] reports
an even greater delay of 3.8ns based on the stated 260MHz maximum operating frequency. This
implies that the multipliers are limited to operating at less than half of their physical maximum
speed (frequency), wasting a significant amount of potential throughput performance.
When comparing delay measurements and timing model estimates of look-up-table (LUT)
based circuits, the difference is also large. It is shown in [3] (Section 3.6.3) that the
worst-case critical delay of the slowest LUT based 8-bit adder among an array of identical adder
circuits across the Cyclone III is 1.2ns, whereas the Quartus II design tool once again reports a
significantly higher delay of 1.95ns, over 60% greater than the actual delay. Although the error
is not as high as in the multiplier case, the difference may still significantly impact the speed and
timing yield of any design constrained to operate at its suggested clock frequency. The
smaller error compared with the multiplier case is probably due to more accurate modelling of
components in the much simpler adder circuit, so that less error accumulates in the timing estimation.
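As a rough check of the figures quoted above, the following Python sketch converts the 260MHz data sheet frequency into a clock period and compares the modelled and measured delays. It treats the bounds "just under 1.8ns" and "over 3.7ns" as exact values, which is an assumption made only for illustration; the function names are hypothetical and are not taken from [3] or from any design tool.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


def overestimate_percent(model_ns: float, measured_ns: float) -> float:
    """How much larger the modelled delay is than the measured delay, in percent."""
    return (model_ns - measured_ns) / measured_ns * 100.0


# Data sheet delay implied by the stated 260MHz maximum frequency (about 3.85ns).
datasheet_delay_ns = period_ns(260.0)

# Multiplier: measured worst case (assumed 1.8ns) against the timing model (assumed 3.7ns).
multiplier_speed_fraction = 1.8 / 3.7  # fraction of the physical speed that is usable

# Adder: measured worst case 1.2ns against the timing model 1.95ns.
adder_overestimate = overestimate_percent(1.95, 1.2)

print(f"data sheet delay: {datasheet_delay_ns:.2f} ns")
print(f"multiplier runs at {multiplier_speed_fraction:.0%} of its measured speed")
print(f"adder model overestimates delay by {adder_overestimate:.1f}%")
```

With these assumed inputs the usable fraction for the multiplier falls below one half and the adder overestimate is above 60%, in line with the statements in the text.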
In both cases, the physical delay value is taken from the slowest multiplier and adder circuit
with the worst process variation impact. This shows that the timing model is not
tracking process variation well, even when compared against the worst-case measurements on
the Cyclone III FPGA. It is understandable that the timing model has to be conservative
enough to accommodate all possible patterns of process variability over many FPGAs, not just
for one specific FPGA specimen. However, if the delays of FPGA circuits and components
can be physically measured on a chip-by-chip basis, far more realistic timing constraints can be
used, drastically increasing the productivity of FPGA designs and applications in terms of
operating frequency and throughput.
In addition to the above reasons, the timing model also has to take into account the possible
operating temperature range and the other sources of delay variability, such as variation in
voltage supply and possible timing degradation of components in FPGAs. This further explains
why the timing model has been given such a wide safety margin (guard-band) to ensure that any
design operates reliably across different environmental conditions. Therefore, a chip-wise
delay measurement method would be useful to provide optimal and realistic timing constraints
for FPGA designs while the environmental factors are stable, and also to maintain reliability by
keeping track of delay changes when degradation and/or environmental variations occur.
The use of physical timing measurements would not only dramatically improve the productivity
(operating frequency) of designs on existing FPGAs; there is also room for further improvement
by avoiding the slower components and exploiting the faster components resulting from process
variation. Consider the previous cases with the 9x9 embedded multipliers and 8-bit adders.
Their respective best-case delays given by [3] are 1.64ns and 1.1ns, which provide an extra
speed gain of 9% and 8% by simply choosing the quicker components. The possibility of
timing optimisation using variation-aware placement has already been considered by various
researchers [5, 6, 7, 8, 9, 10], all of which require a delay measurement method that can
precisely map out (characterise) the delay variation of FPGA components and measure the delay
of specific circuits placed onto an FPGA.
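The gain from choosing the quicker components can be reproduced with a short Python sketch. The text does not state how the 9% and 8% figures were derived; the sketch assumes they are the reduction in delay relative to the worst case, and also prints the corresponding increase in clock frequency, which is slightly larger. The function names are hypothetical, and the 1.8ns multiplier figure is the approximate worst-case value quoted earlier.

```python
def delay_reduction_percent(worst_ns: float, best_ns: float) -> float:
    """Reduction in delay, relative to the worst-case delay, in percent."""
    return (worst_ns - best_ns) / worst_ns * 100.0


def frequency_gain_percent(worst_ns: float, best_ns: float) -> float:
    """Increase in maximum clock frequency when the delay drops from worst to best."""
    return (worst_ns / best_ns - 1.0) * 100.0


# (worst-case delay, best-case delay) in ns; the worst-case values are approximate.
cases = {
    "9x9 multiplier": (1.8, 1.64),
    "8-bit adder": (1.2, 1.1),
}

for name, (worst, best) in cases.items():
    print(
        f"{name}: delay reduced by {delay_reduction_percent(worst, best):.1f}%, "
        f"frequency raised by {frequency_gain_percent(worst, best):.1f}%"
    )
```

Under this assumption the delay reductions round to the 9% and 8% quoted in the text.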
The above challenges and opportunities related to delay variability clearly show the need for
an efficient test platform that is capable of measuring fine grained FPGA components, as well
as user defined complex circuits on the FPGA. These requirements form the main subject of
this thesis and the key objectives are as follows.
1) Develop a new delay measurement method that provides accurate and precise results, while
maintaining efficient resource usage and test time.
2) Create a self-testing platform that allows quick and efficient delay characterisation of circuits
on FPGAs.
3) Develop a test platform for testing arbitrary complex circuits on FPGAs for efficient physical
timing analysis of user designs.
1.2 Overview
Chapter 2 describes the general role of testing, and gives an overview of basic test concepts.
A brief description of the FPGA architecture is given and the opportunities for delay
measurement methods in FPGAs are discussed. Existing test, delay measurement and timing
analysis techniques are reviewed in terms of their accuracy, precision and overheads, and their
applicability to FPGAs is discussed.
A measurement method based on failure rate detection is proposed in Chapter 3 for FPGA
delay characterisation and circuit path delay measurement. The achievable timing resolution
is examined in detail and the accuracy of measurements is determined through modelling the
output failure rate of a circuit path. The test method is optimised and developed into a built-in
self-test platform for efficient FPGA self delay characterisation.
In Chapter 4, a more efficient measurement approach based on a circuit’s output statistics is
proposed. The output statistics of a circuit path, in terms of transition probability, are examined
and modelled in detail to allow accurate estimation of circuit delay using purely statistical
information from the output. The transition probability based method is applied to both combinatorial
circuit paths and simple sequential circuits on an Altera Cyclone III EP3C25 to demonstrate
its efficiency, precision and accuracy on real FPGA hardware. In addition, it is applied to the
Cyclone III to measure and monitor the timing degradation of circuits across the FPGA.
Chapter 5 extends the previous methods to perform complex multi-path circuit testing. The
failure rate based method from Chapter 3 is extended into an exhaustive test method to
accurately test embedded multipliers on the Cyclone III FPGA, and to provide an accurate
reference for examining the accuracy of the transition probability based method on the same
embedded multiplier circuit. To further improve the accuracy of the transition probability
method, an in-depth study of transition probability behaviour in different multi-path cases is
carried out, and a novel method to optimise accuracy by controlling the probability distribution
of input test vectors is developed. Lastly, the improved technique is applied and demonstrated
on both complex combinatorial and sequential circuits on the Cyclone III FPGA.
1.3 Statement of Originality
Three main areas of contribution are presented in this thesis; they are covered separately in
Chapters 3, 4 and 5. Further contributions are described in the introduction of each chapter,
but the key contributions are summarised as follows.
– The development of a failure rate based technique for high resolution timing measurement
of any combinatorial circuit path. A detailed study and simulation of the failure process
of circuit paths is carried out to understand and model the relationship between circuit
delay, output failure rate, clock jitter and metastability, leading to a highly accurate
measurement method. The method is optimised and refined into built-in self-test platforms
to efficiently perform delay characterisation of fine grained components in FPGAs for delay
variability analysis. This work is described in Chapter 3 and formed the subject of three
papers [11, 12, 3].
– The discovery of a link between a circuit’s timing failure and its output statistics that led
to a highly flexible and efficient timing measurement method based on observing the
transition probability of the circuit’s output. The effects of circuit delay, clock jitter and
metastability are examined in detail by simulation to model the behaviour of transition
probability when timing failure occurs in a circuit path. A detailed study of the
statistical behaviour of clock jitter, based on simulation and actual test results on an FPGA, is
carried out to further refine the transition probability behaviour model. The
measurement method is tested on actual FPGA hardware and the predicted transition probability
models are validated by the results. The method is also shown to produce accurate
timing measurements on simple sequential circuits. The main concepts and results of this
work have been published in [13]. Practical usage of the transition probability method
is demonstrated on an actual FPGA to measure timing degradation, where the main test
setup and concepts are presented in [14, 15]. The above contributions are described in
Chapter 4.
– The development of delay measurement methods for arbitrary complex circuits. The failure
rate based method is extended to exhaustively test complex multi-path circuits and is
used to validate the accuracy of the more efficient transition probability method with
multi-path circuits. The results and test implementations are published in [13, 3]. The
behaviour and timing failure sensitivity of transition probability in complex multi-path
circuits are studied in detail through the exhaustive test results and simulations to dis-
cover ways to improve timing failure sensitivity and hence measurement accuracy. An
optimisation technique based on controlling the probability distribution of input test vec-
tors is developed to enhance the transition probability method [16] to match the accuracy
of the proposed exhaustive test method. The optimised test method is developed into
a modularised test platform and is demonstrated on both complex combinatorial and
sequential circuits on FPGAs. The work described above is presented in Chapter 5.
1.4 List of Publications
2007 - J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-characterization of combi-
natorial circuit delays in FPGAs,” in Proc. IEEE International Conference on Field-
Programmable Technology, Dec. 2007, pp. 245 – 251.
2008 - J. S. J. Wong, P. Y. K. Cheung, and P. Sedcole, “Combating process variation on
FPGAs with a precise at-speed delay measurement method,” in Proc. International
Conference on Field Programmable Logic and Applications (FPL), Sep. 2008, pp. 703 –
704.
2008 - J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “A transition probability based delay
measurement method for arbitrary circuits on FPGAs,” in Proc. IEEE International
Conference on Field-Programmable Technology, Dec. 2008, pp. 105 – 112.
2008 - P. Sedcole, J. S. J. Wong, and P. Y. K. Cheung, “Characterisation of FPGA clock
variability,” in Proc. IEEE Computer Society Annual Symposium on VLSI (ISVLSI),
Apr. 2008, pp. 322 – 328.
2008 - P. Sedcole, J. S. J. Wong, and P. Y. K. Cheung, “Modelling and compensating for clock
skew variability in FPGAs,” in Proc. IEEE International Conference on Field-
Programmable Technology, Dec. 2008, pp. 217 – 224.
2009 - J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-measurement of combinatorial
circuit delays in FPGAs,” ACM Transactions on Reconfigurable Technology and Systems
(TRETS), vol. 2, no. 2, pp. 1 – 22, 2009.
2010 - E. A. Stott, J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Degradation in FPGAs:
Measurement and modelling,” in Proc. ACM/SIGDA International Symposium on Field
Programmable Gate Arrays - FPGA, Feb. 2010, pp. 229 – 238.
2010 - E. A. Stott, J. S. J. Wong, and P. Y. K. Cheung, “Degradation analysis and mitiga-
tion in FPGAs,” in Proc. International Conference on Field Programmable Logic and
Applications (FPL), Aug. 2010, pp. 428 – 433.
2011 - J. S. J. Wong and P. Y. K. Cheung, “Improved Delay Measurement Method in FPGA
based on Transition Probability,” in Proc. ACM/SIGDA International Symposium on
Field Programmable Gate Arrays - FPGA, Feb. 2011.
Chapter 2
Background and Related Work
2.1 Introduction
This chapter examines the role and importance of testing and the challenges we face today in testing
increasingly complex devices, and reviews existing test and delay measurement methods for
integrated circuits. The idea of testing has existed ever since the first electronic circuit was invented,
and the goal was clear: make sure the circuit functions correctly and runs at the required speed.
Nowadays, the basic goal of testing is still relatively unchanged, but the complexity of circuits
has advanced by many orders of magnitude. The relentless reduction of transistor size has
brought us very high logic density, but has also brought us towards the physical limits of both
the raw materials and the fabrication processes. We have entered the nanometre realm, where
manufacturing devices and components with uniform physical and electrical properties within a chip becomes
more difficult with each size reduction (see Section 2.4.1), and test methods must evolve
to measure and keep track of these unwanted variations accurately.
Although test methods for application specific integrated circuits (ASICs) are a
well-established field, the introduction of field-programmable gate arrays (FPGAs) with configurable
fabric opens many new opportunities for extended use of testing and for exploration of new test
concepts that could be used to combat the challenges imposed by delay variability. Such
advancements in testing would also greatly benefit FPGAs themselves, which are especially vulnerable
to delay variability.
2.2 The Role of Testing
2.2.1 Performance and Functional Fault Classification
In VLSI technology, production tests, both functional and timing, are essential steps in the
manufacturing of integrated circuits. They allow faulty parts to be identified and functional
parts to be sorted into different product grades according to their timing performance
(speed-binning) for optimal productivity and yield. In addition, performance parameters such as
supply voltage, operating temperature and power consumption are often considered during
the speed-binning process, as they are closely related to the timing performance of a device.
Figure 2.1 depicts the basic concept of speed-binning using tests to sort products according to
their specific requirements [17].
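The sorting step of speed-binning can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by any manufacturer: the function name, the use of a single measured delay per part and the example targets are all hypothetical, and only the delay requirement is modelled (voltage, power and temperature are left out).

```python
from typing import Optional, Sequence


def speed_bin(
    measured_delay_ns: float,
    grade_targets_ns: Sequence[float],
    test_margin_ns: float = 0.0,
) -> Optional[int]:
    """Return the 1-based index of the fastest grade the part passes, or None.

    grade_targets_ns is ordered from the fastest grade (smallest allowed delay)
    to the slowest grade. None means the part failed every grade.
    """
    for grade, target_ns in enumerate(grade_targets_ns, start=1):
        # The part must meet the target with the test margin to spare.
        if measured_delay_ns + test_margin_ns <= target_ns:
            return grade
    return None


# Hypothetical grade targets in ns, fastest grade first.
targets = [1.0, 1.5, 2.0]
print(speed_bin(0.90, targets, test_margin_ns=0.05))  # passes grade 1
print(speed_bin(1.48, targets, test_margin_ns=0.05))  # misses grade 2 by the margin
print(speed_bin(2.50, targets, test_margin_ns=0.05))  # fails every grade
```

Note that each test only yields a pass or fail outcome per grade, so the precise delay of the part is never recorded.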
Ideally, an effective and efficient test should provide complete coverage of all the critical
paths/components in the device while maintaining a short test time to reduce the impact
on chip productivity. Existing test methodologies are generally divided into two categories:
(i) Detection of static and functional faults, such as open or stuck-at faults and large delay
faults caused by defective components.
(ii) Measurement of physical circuit delay, maximum operating clock frequency and small delay
faults caused by process variation and/or hardware degradation.
Current speed-binning processes rely on “Pass” or “Fail” results of the device under test
at particular operating requirements to reject faulty devices or determine their speed grades
(Figure 2.1). As a result, most of the existing test methods are oriented heavily towards fault
detection (i), and neglect (ii), where precise timing performance data is measured. For ASICs,
the ability to gather precise and accurate timing measurements allows devices to be tuned to
run at optimal speed, but it plays an even greater part in FPGAs, where the actual hardware
configuration of the user application and design is not defined at manufacturing time, and hence
performance information can only be obtained post-design by timing analysis or physical delay
measurement at the users’ end. Also, with the growing concerns over process variability, the need
to measure its effects is even greater. The potential and implications of using precise
timing measurement for such purposes are discussed in detail in Sections 2.4.2 and 2.4.3.
Figure 2.1: Basic principle of speed-binning. After chip fabrication, parts are tested against
the Grade 1, Grade 2, ..., Grade N targets (delay, voltage, power and temperature requirements,
each with a test margin), ordered from high end (fast) to low end (slow). A part that passes a
test is assigned that grade. A part that fails every test is scrapped, unless its failed components
can be disabled to give a lower hardware specification, in which case it goes on to further
binning of lower-specification parts.
2.2.2 Device Maintenance
During the lifetime of a device, faults and degradation of components may develop, causing
errors and/or a reduction in timing performance. Therefore, it is important to perform regular
maintenance tests to ensure correct operation.
A maintenance test can be designed to run off-line, where the circuit stops its normal operation
while a dedicated test runs, or designed to run on-line, where the test runs concurrently with
the normal circuit operation to detect errors and delay faults on-the-fly. Maintenance tests are
often implemented as Built-in Self-tests (BISTs) to eliminate the need for external test equipment
in self-contained environments (see Section 2.5.4). Such BISTs are ideal for regular user initiated
self-test, Power-on Self-test (POST) or continuous fault monitoring.
The ability to detect faults in real-time with on-line BIST enables a device to automatically
take action against the faults and possibly correct and maintain the functionality of the device.
This idea formed the foundation of hardware fault-tolerant and self-repairing designs that led
to more reliable devices. In fault-tolerant designs, the fault induced errors can be corrected
using certain error correction schemes [18, 19, 20] (see Section 2.7), and in most cases, the
faulty components can be identified and bypassed using built-in hardware redundancy schemes
such as those proposed in [21]. For FPGAs, fault tolerance can be achieved by exploiting
reconfigurability to replace and bypass defective parts [22, 23, 14]. Analogous to the wear levelling
schemes used to prolong the life of flash memory [24], a fault and degradation tolerant scheme may
exploit the replicated, homogeneous structure of FPGAs to minimise degradation in highly
stressed components. Nevertheless, there are still a number of technical challenges to be solved
before such an idea can be realised in real FPGAs; these include accurate measurement and
characterisation of degradation in terms of delay, and of how different types of stress relate to
the degree of degradation.
2.3 Field-Programmable Gate-Array
The FPGA was invented in 1985 to demonstrate that memory elements could be used to build
devices with fine-grained and fully programmable hardware logic functions and flexible
interconnects. Despite the area overhead of using memory cells and the power and performance
disadvantages compared with its ASIC counterparts, it has since become a popular choice for rapid design
prototyping, signal processing applications and full scale system-on-chip platforms.
Figure 2.2: A typical Logic Element (LE) with register feedback. (Labelled parts: inputs,
look-up-table (LUT), register, clock, outputs and feedback.)
2.3.1 The Architecture
Modern FPGA architectures are equipped with an array of Logic Elements (LEs) and
specialised blocks such as memory and multipliers (Figure 2.4), all held together by networks of
programmable interconnects and clock trees. Each LE typically contains a look-up-table (LUT)
and a register (Figure 2.2). More advanced LE designs exist for high-end FPGA products,
where combinations of LUTs, registers and built-in full adder circuits are used [25, 26].
The key component, the LUT, is essentially a multiplexer used to address a set of memory bits
(LUT-Mask) that determines its function. Various LUT designs aimed at improving delay,
reliability and power consumption for commercial use are described in [27]. The majority of
FPGAs today use SRAM based LUT-Masks and configuration bits, which are quick to configure
but lose their contents when power is removed. An alternative provided by Actel solves this problem by using
non-volatile flash memory, but it introduces other problems such as the limited durability of flash
memory cells and slower configuration speed.
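The view of a LUT as a multiplexer addressing its LUT-Mask can be illustrated with a small behavioural model in Python. This is only a sketch: the bit ordering (the first input is the least significant address bit) and the names used are assumptions made here, and real devices may order the mask bits differently.

```python
from typing import Sequence


def lut_output(mask: int, inputs: Sequence[int]) -> int:
    """Evaluate a k-input LUT: the inputs form an address that selects one mask bit."""
    address = 0
    for position, bit in enumerate(inputs):
        # inputs[0] is taken as the least significant address bit (an assumption).
        address |= (bit & 1) << position
    return (mask >> address) & 1


# Example masks for 2-input functions; bit i of the mask is the output for address i.
XOR2_MASK = 0b0110
AND2_MASK = 0b1000

for a in (0, 1):
    for b in (0, 1):
        print(a, b, lut_output(XOR2_MASK, [a, b]), lut_output(AND2_MASK, [a, b]))
```

Changing the function of the LUT only requires writing a different mask, which is what the configuration bits do in an SRAM based FPGA.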
Advanced Interconnect Structure
Figure 2.3 depicts a typical FPGA architecture consisting of three hierarchical levels of
interconnects: global (row and column), direct and local interconnects. To improve the efficiency
of short distance data transfer between LEs with related functions, LEs are often clustered
together to form a larger block known as a Logic Array Block (LAB), where fast local
interconnects are used. The different hierarchical levels of interconnects are intended to provide
optimal connectivity with minimum delay impact, where increases in connectivity and distance
are generally related to higher delays.
Figure 2.3: A typical interconnect and logic clusters structure of FPGAs. (Labelled parts: global
column interconnects, global row interconnects, direct interconnects, local interconnects, switch
blocks, Logic Array Blocks / clusters (LABs) and Logic Elements (LEs).)
Figure 2.4: An example of a typical FPGA layout (based on the Cyclone III architecture [28]).
(Labelled parts: dynamically configurable PLLs, embedded memory blocks, configurable logic
elements, embedded multipliers / DSPs and I/Os.)
The routing of signals in FPGAs is achieved by configurable switch blocks located between
interconnect levels, at line intersections and within each LAB. Other more sophisticated
interconnect schemes exist in commercial FPGAs depending on the vendor and architecture [28, 29],
but the principle of hierarchical interconnect levels is generally followed.
Flexible Clock Generation and Distribution
Most FPGAs nowadays are equipped with multiple reconfigurable phase-locked loop (PLL) or
delay-locked loop (DLL) based clock generators [28, 30, 31] to satisfy the demand for multiple clock
domains, the wide range of clock frequency requirements of different designs, and dynamic clock
throttling for power saving purposes. Figure 2.4 depicts an Altera Cyclone III EP3C25 with
4 runtime reconfigurable PLLs. The clock signals are distributed by highly flexible clock trees
that cover the whole FPGA. Unlike ASIC designs, where a perfectly balanced H-tree structure
can be used, FPGA clock trees are designed to trade off symmetry for higher flexibility, causing
a certain clock skew variability across different locations on the FPGA [32]. Therefore, when
testing physical FPGA designs, it is important to always include clock driven components in
the test path, to ensure that the effect of clock skew variability is taken into account.
2.3.2 Speed Grading of FPGAs
The underlying hardware of FPGAs can essentially be identified as a large ASIC with many
replicated hardware blocks and programmable interconnects, each with their own specific func-
tions. Therefore, the general technique of speed-binning using conventional fault tests as in
Figure 2.1 still applies. FPGAs are usually graded based on the slowest hardware component
at a specific operating condition. For example, if the lowest speed grade requires the
slowest LUT to operate with no more than 1ns delay at 70°C and 1.2V, then any FPGA with
even a single LUT failing the requirement would be considered defective. Despite this already
conservative approach, a test margin or guard-band is often added to the timing requirement
to cover delays of untested logic, delay variability caused by process variation and potential
degradation, and the wide range of external operating conditions that the FPGA may experience
in its lifetime. This implies that the speed grade assigned to an FPGA at production
time is often a very poor indication of its actual timing performance, and the end-user has no
knowledge of the real speed potential of their application on the FPGA unless a physical delay
measurement is conducted.
2.4 Delay Variability
The term delay variability is generally used to describe how the propagation delay of a particular
type of circuit component varies from its expected value due to an imperfect fabrication process
(process variation), aging of the components (degradation) and external environmental factors, such as
supply voltage and temperature. Among them, process variation has attracted the most attention,
because it is predicted to worsen as transistor sizes continue to shrink in the near future
[33, 34].
2.4.1 Background on Process Variation
Process variation refers mainly to the parametric variation of components caused by material and
fabrication process variations. This includes variation of the threshold voltage (VT), switching
delay, output impedance, sub-threshold leakage and gate oxide leakage of transistors [35], and
variation of metal line impedance. Such variation affects individual transistors and hence
the overall performance of CMOS circuits. Some examples of the physical causes of process
variation are illustrated in Figure 2.5; they are related to the physical limitations of
lithography processes and the discrete nature of materials, which is seen only at nanometre scales.
Figure 2.5: (a) Imperfection of photo-resist at microscopic scale. (b) Line edge roughness of
metal interconnects. (c) Random discrete dopants in CMOS transistor. (d) Non-uniform gate
thickness. (Images courtesy of University of Glasgow and IBM)
Measuring Process Variation
The effects of process variation in a real device are reflected in two major observable aspects of
a circuit – propagation delay and power dissipation, the latter in terms of power consumption and heat
dissipation.
Power measurement requires sensitive analogue probes to detect variation in voltage and current,
and it often has poor spatial resolution because measurements can only be made at the
power supply rail/plane level. Temperature variation, on the other hand, can be measured by
on-chip sensors or thermal imaging [36]. However, the spatial resolution of temperature sensors
is limited by the number of sensors used, and each result is only as good as the average
temperature of the components within a certain region around the sensor. A high-resolution infrared
camera provides better spatial resolution [36], but it can only be used on chips with
infrared-transparent packages and heat sinks, and the test equipment is expensive and difficult to calibrate.
Also, the components in the chip must be exercised uniformly for temperature maps to reflect
process variation correctly.
Unlike power and heat measurements, propagation delay can be measured accurately at the
signal path level, giving it a much better spatial resolution. Therefore, provided that the environmental
factors (temperature, supply voltage, etc.) are steady, the effect of process variation can be
measured precisely using a measurement method with high time resolution. There is a large variety
of candidate test methods, but most are incapable of obtaining measurements
at the time resolution required to distinguish the effect of process variability. A selection of
test methods will be explained and compared in detail in Section 2.6.
2.4.2 The Impact of Delay Variability
Delay variability causes the propagation delay of nominally identical components to spread
over a range of values. The spread can be described by a probability density function (PDF).
Figure 2.6 shows an example of how FPGA LUT delays are affected by delay variability at both
the inter-die and within-die levels. It also shows how a component associated with the worst-case
delay can impact the resulting timing yield in a conventional speed-binning process.
The problem becomes even more pronounced when a broad test margin is added to the target
timing requirement to account for delay variability. Xiong et al. from IBM [37] proposed a
statistical-model-based approach to compute minimum test margins. Their model
took inter-die delay variability into account to improve timing yield, but the issue of within-die
delay variability remained unresolved.
2.4.3 Counter Measures Against Delay Variability
In ASICs, countermeasures against process variability can often be applied during the design
and fabrication process since the entire circuit design is known. For example, in [38, 39], several
lithography-aware design methodologies are used to reduce the effect of process variation
in ASIC designs. In comparison, FPGAs have a generic hardware structure for user-specified
designs, so design-specific lithography-based compensation for process variability is not
possible.
[Figure 2.6 labels: LUT delay measurements of a single FPGA sample plotted as Delay against Location (X) and Location (Y); density against LUT delay for the inter-die and within-die cases, annotated with the spread due to inter-die process variation, the spread due to within-die process variation, the speed-bin, the worst-case LUT delay, the test margin (guard-band) and the delay target (advertised delay specification); annotation: LUTs with worst-case delay are rare.]
Figure 2.6: An example showing LUT delay in FPGAs under the influence of process variability
and the ineffectiveness of current speed-binning methodology in tackling a wide spread of within-
die component delays.
Despite these limitations, the flexibility of FPGAs opens up another opportunity that ASICs
cannot match – circuit modules can be mapped to locations with fast components on the FPGA
fabric, avoiding slow components at other locations. Such a strategy allows the user circuit to
utilise resources at the faster end of the within-die delay distribution, as shown in Figure 2.6,
and prevents the slower components from being used, effectively increasing the timing yield
of the FPGA at the application circuit level on a chip-by-chip basis. To achieve such
fine-grained delay-aware placement of hardware, a good set of delay measurement tools capable
of characterising the FPGA, as well as extracting the specific delay of the critical path in the
user design, is required.
This concept of delay-aware placement for FPGAs has been suggested by a number of researchers
[5, 6, 34, 7, 8, 9] and possible implementations of the idea have been proposed. In [5], the idea of
shifting functional blocks and critical paths around, based on delay variability measurements,
is proposed to improve yield. Similarly, [6] proposed a systematic approach to generating chip-wise
placements that take within-die delay variability into account to further optimise yield.
[8] and [7] simultaneously proposed that multiple configurations can be generated
and the best configuration for each chip, with its specific delay variability, can be identified from
delay measurements. In addition, these adaptive placement strategies can also be used to
mitigate delay variability caused by non-uniform timing degradation in FPGAs, where the level
of degradation is measured and the circuit is re-mapped to avoid degraded parts [22, 23, 14].
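The idea of choosing the best of several pre-generated configurations for a given chip can be illustrated with a minimal sketch. The delay map, the candidate configurations and the function names below are hypothetical and chosen only for illustration; they are not data or algorithms taken from the cited works.

```python
# Hypothetical sketch: for one chip, pick the pre-generated configuration
# whose slowest used location is fastest according to measured delays.

def critical_delay(config, delay_map):
    """Worst measured delay (ns) over the locations a configuration uses."""
    return max(delay_map[loc] for loc in config)

def best_configuration(configs, delay_map):
    """Name of the configuration with the smallest worst-case delay."""
    return min(configs, key=lambda name: critical_delay(configs[name], delay_map))

# Assumed per-chip measurements: (x, y) location -> LUT delay in ns.
delay_map = {(0, 0): 0.92, (0, 1): 1.10, (1, 0): 0.95, (1, 1): 0.88}

# Assumed candidate placements of the same circuit.
configs = {
    "A": [(0, 0), (0, 1)],
    "B": [(0, 0), (1, 0)],
    "C": [(0, 0), (1, 1)],
}

chosen = best_configuration(configs, delay_map)
print(chosen, critical_delay(configs[chosen], delay_map))
```

In this toy case configuration "C" is selected because it avoids the two slowest locations; a different chip, with a different delay map, could favour another configuration.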
2.5 Fundamental Test Concepts
There are typically four stages in a test procedure:
Input Generation – generation of input test vectors that stimulate the circuit-under-test
(CUT) in a specific way to exercise as many internal paths as possible.
Input Launch – launching of test vectors at specific time to allow accurate timing tests.
Output Capture – capturing of the CUT’s output with precise timing and temporary storage of
the results for the following output analysis stage.
Output Analysis – analysis of the results to obtain timing measurements, identify timing
faults and the locations of faulty hardware.
The following sub-sections highlight several of the commonly used techniques for the four major
stages of testing and compare their strengths and weaknesses. Other common test techniques,
such as at-speed testing, built-in self-test (BIST) and design for testability (DFT) will also be
discussed.
2.5.1 Test Vector Generation
The purpose of a test vector set is to stimulate the CUT at the input with a particular pattern
of bits and signal transitions such that the output transition patterns reflect the timing and/or
integrity of all the internal paths. The input signal transition pattern is described not by
one binary vector but by pairs of consecutive vectors that differ by a specific number of bits.
A test vector set can be categorised as Single Input Change (SIC) or Multiple Input Change
(MIC), depending on how many bits change between the vectors of each consecutive pair.
There has been a debate over which type of test vector is more effective. According to
Fukuoka et al. [40], the actual propagation delay of CMOS logic gates may be affected by several
inputs changing simultaneously, meaning SIC may reduce the accuracy of delay measurements.
Consider a 2-input CMOS NAND gate consisting of two pull-up pMOS transistors in parallel and
two pull-down nMOS transistors in series. When both inputs fall simultaneously from logic
1 to 0, the two parallel pMOS transistors turn on together, resulting in a stronger instantaneous
pull-up current and a shorter propagation delay. On the other hand, when both inputs rise
simultaneously, the nMOS transistors in series behave in the opposite way, with a weaker initial
pull-down current, resulting in a longer propagation delay than when one input was already at logic 1.
Notwithstanding, the inaccuracy of SIC approaches is relatively small and they are easier to implement
in practice. According to [41] and [42], SIC provides sufficiently robust coverage of paths
in CUTs and exhaustive SIC schemes can be implemented efficiently in hardware [43] with
relatively low resource requirements. In terms of exhaustive test time, MIC quickly becomes
impractical when the required number of simultaneously changing bits increases. Consider an
SIC exhaustive test case with n input bits where each bit is associated with multiple signal
paths. The number of test vectors (NSIC) required to exhaustively test the CUT is given by
[43]:
\[ N_{SIC} = n \cdot 2^{n}, \quad \text{where } n = 1, 2, \ldots \tag{2.1} \]
For the MIC approach with m changing bits, assume that the selected m bits traverse all
possible transition patterns and cover all cases where the number of changing bits is less than
m. The number of exhaustive test vectors (NMIC) is then given by:
\[ N_{MIC} = \frac{n!}{m!\,(n-m)!}\,(2^{m} - 1) \cdot 2^{n}, \quad \text{where } m, n = 1, 2, \ldots \text{ and } m \le n \tag{2.2} \]
As can be seen, the MIC approach requires a larger number of test vectors and it grows at a
significantly greater rate than the SIC approach when m is greater than 1 (note that Eq. 2.2
reduces to Eq. 2.1 when m = 1). Nonetheless, the concept of MIC is rarely used for exhaustive
testing and more often used in random input testing, where the number of test vectors is
arbitrary.
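The growth of the two counts can be checked numerically. The following is a minimal sketch that evaluates Eq. 2.1 and Eq. 2.2 and, as a cross-check, counts the MIC cases by direct enumeration under one reading of the stated assumptions (every choice of m bit positions, every non-empty change pattern on those positions, every starting vector). The function names are illustrative only.

```python
from itertools import combinations
from math import comb

def n_sic(n):
    """Eq. 2.1: exhaustive SIC test vector count for n input bits."""
    return n * 2**n

def n_mic(n, m):
    """Eq. 2.2: exhaustive MIC test vector count with m changing bits."""
    return comb(n, m) * (2**m - 1) * 2**n

def count_mic_by_enumeration(n, m):
    """Count (bit selection, change pattern, start vector) cases one by one."""
    count = 0
    for _positions in combinations(range(n), m):   # which m bits may change
        for _pattern in range(1, 2**m):            # non-empty change patterns
            for _start in range(2**n):             # every starting vector
                count += 1
    return count

for n in (4, 8):
    print(n, n_sic(n), [n_mic(n, m) for m in range(1, 4)])
```

For n = 4 the SIC count is 64, while the MIC count is 64, 288 and 448 for m = 1, 2 and 3 respectively, which shows both the reduction to Eq. 2.1 at m = 1 and the faster growth for larger m.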
Random Test Vector
Since true random number generation is not possible in deterministic digital systems, test
vectors are often generated using a Pseudo Random Number Generator (PRNG). A good PRNG
produces binary sequences that appear random but are deterministic and repeat
after a certain number of iterations, known as the periodicity.
A commonly used PRNG hardware implementation for testing is the Linear-Feedback Shift-Register
(LFSR). It is used for its scalable structure, low resource overhead, high throughput
and optimal periodicity in the case of a maximal-length LFSR. Each LFSR output bit is almost
uniformly distributed between 0 and 1, with Pr(1) ≈ Pr(0) ≈ 0.5.
Figure 2.7: Visual comparison of 32-bit wide random sequences from [47] generated by (a)
Linear-Feedback Shift-Register (LFSR), (b) Linear Hybrid Cellular Automata (LHCA) and (c)
phase-shifted LFSR. Each column of pixels represents a 32-bit binary value, and the columns are
arranged horizontally from left to right to illustrate the entire series of random values generated by
each method.
The behaviour of an LFSR is determined by its characteristic polynomial, which describes every
subsequent state change and the layout of the feedback taps in the actual hardware. A maximal-length
LFSR is achieved when the characteristic polynomial is primitive (and therefore irreducible) [44],
and the periodicity is then given by 2^n − 1, where n is the number of LFSR bits. Pre-calculated
primitive polynomials are well documented, such as the list presented
in [45] for maximal-length LFSRs up to 168 bits wide.
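The dependence of the periodicity on the choice of polynomial can be illustrated with a small software model. The sketch below assumes a 4-bit Fibonacci-style LFSR whose feedback always includes the last stage, so that the state sequence is a closed cycle; the tap sets are illustrative choices, not taken from [45]. Taps at stages 4 and 3 correspond to a primitive degree-4 polynomial (x^4 + x^3 + 1 or its reciprocal x^4 + x + 1, depending on the labelling convention), while taps at stages 4, 3, 2 and 1 correspond to x^4 + x^3 + x^2 + x + 1, which is irreducible but not primitive.

```python
def lfsr_step(state, taps, n):
    """Advance an n-bit Fibonacci LFSR by one clock.
    taps are 1-based stage numbers whose XOR is fed back into stage 1."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & ((1 << n) - 1)

def period(taps, n, seed=1):
    """Number of clocks before the LFSR returns to its seed state."""
    state, steps = lfsr_step(seed, taps, n), 1
    while state != seed:
        state = lfsr_step(state, taps, n)
        steps += 1
    return steps

print(period(taps=(4, 3), n=4))        # primitive polynomial: 2**4 - 1 = 15
print(period(taps=(4, 3, 2, 1), n=4))  # irreducible but not primitive: 5
```

The second case shows that irreducibility alone does not guarantee the maximal period.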
Due to the structure of the LFSR, its raw output bits are essentially identical bit sequences,
distinguished by phase difference only. As a result, the output bits exhibit a high degree of
phase-induced bit value correlation, which may reduce the path coverage of tests [46, 47]. Figure 2.7(a)
shows this correlation between the output bits clearly.
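The phase relationship between the raw output bits can be reproduced with a short model. This is a sketch assuming the same kind of 4-bit Fibonacci LFSR as a stand-in for a real generator; the tap set and seed are arbitrary illustrative choices.

```python
def lfsr_states(taps, n, seed, count):
    """Return `count` successive states of an n-bit Fibonacci LFSR."""
    mask, state, states = (1 << n) - 1, seed, []
    for _ in range(count):
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
    return states

states = lfsr_states(taps=(4, 3), n=4, seed=1, count=30)
bits = [[(s >> i) & 1 for s in states] for i in range(4)]  # one stream per stage

# Each stage simply repeats the previous stage one clock later.
for i in range(3):
    assert bits[i + 1][1:] == bits[i][:-1]

print("".join(map(str, bits[0][:15])))
print("".join(map(str, bits[1][:15])))
```

The two printed streams are the same sequence offset by one clock, which is the bit-to-bit correlation described above.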
An alternative to the LFSR is the Linear Hybrid Cellular Automata (LHCA) implementation.
An LHCA operates as a linear chain of bits, where certain logical “rules” describe the next state
value of each bit according to its neighbouring bits [48]. It has many of the advantages of the
LFSR, such as a scalable structure and high throughput, but with far less phase correlation
(see Figure 2.7(b)). An LHCA can be designed with certain combinations of different rules
to achieve the same periodicity as a maximal-length LFSR of the same number of bits [48].
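A small model makes the role of the rule combination concrete. The sketch below assumes the commonly used pair of linear rules, rule 90 (next = left XOR right) and rule 150 (next = left XOR self XOR right), with null (zero) boundaries. The 4-cell rule vectors are illustrative choices made for this example and are not taken from [48].

```python
def lhca_step(cells, rules):
    """One clock of a null-boundary 90/150 hybrid cellular automaton."""
    n = len(cells)
    nxt = []
    for i, rule in enumerate(rules):
        left = cells[i - 1] if i > 0 else 0
        right = cells[i + 1] if i < n - 1 else 0
        own = cells[i] if rule == 150 else 0   # rule 150 also uses the cell itself
        nxt.append(left ^ own ^ right)
    return nxt

def lhca_period(rules, seed):
    """Number of clocks before the automaton returns to its seed state."""
    state, steps = lhca_step(seed, rules), 1
    while state != seed:
        state = lhca_step(state, rules)
        steps += 1
    return steps

print(lhca_period([90, 150, 90, 150], [1, 0, 0, 0]))  # 15 = 2**4 - 1
print(lhca_period([90, 90, 90, 90], [1, 0, 0, 0]))    # 6
```

With the mixed rule vector the 4-cell automaton visits all 15 non-zero states, matching a maximal-length 4-bit LFSR, whereas the uniform rule 90 vector gives a much shorter cycle.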
Table 2.1: Comparison of random vector generators

Characteristic                | LFSR                                                     | LHCA                                           | Phase-shifted LFSR
------------------------------|----------------------------------------------------------|------------------------------------------------|-----------------------------------------------
Area overhead                 | Low (≪ 1 gate/bit)                                       | Higher (∼ 1 gate/bit)                          | Moderate (< LHCA)
Maximal-length implementation | Easy (well defined primitive characteristic polynomials) | Harder (combination of rules not well defined) | Same as LFSR
Throughput                    | High                                                     | High                                           | High (depends on phase-shifter implementation)
Randomness                    | Low (high phase correlation)                             | High                                           | Highest
Delay fault sensitivity       | Low                                                      | High                                           | High
Rajski et al. [47], however, showed that the LFSR remains a better choice of PRNG provided that
the phase correlation problem is compensated for using a simple phase shifter network consisting of
XOR gates. The phase-shifted LFSR design produces outputs superior to those of the LHCA in terms of
randomness and requires fewer hardware resources than the LHCA. Figure 2.7 visually compares the
behaviour of the LFSR, LHCA and phase-shifted LFSR, where the phase-shifted LFSR gives the best
random behaviour, and Table 2.1 summarises their strengths and weaknesses.
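The principle behind an XOR phase shifter can be demonstrated on a toy example. For a maximal-length LFSR, the XOR of two stages reproduces the same bit sequence at a different phase, so a few XOR gates can move an output far away from its neighbouring stages. The sketch below assumes a 4-bit Fibonacci LFSR and an arbitrary pair of stages; it is not the phase shifter network of [47].

```python
def lfsr_states(taps, n, seed, count):
    """Return `count` successive states of an n-bit Fibonacci LFSR."""
    mask, state, states = (1 << n) - 1, seed, []
    for _ in range(count):
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
    return states

PERIOD = 15  # 2**4 - 1 for the maximal-length 4-bit LFSR used here
states = lfsr_states(taps=(4, 3), n=4, seed=1, count=PERIOD)
stage1 = [s & 1 for s in states]
stage2 = [(s >> 1) & 1 for s in states]
shifted = [a ^ b for a, b in zip(stage1, stage2)]  # one XOR gate as phase shifter

# Find the cyclic advance of stage 1 that reproduces the XOR output.
shift = next(d for d in range(PERIOD)
             if all(shifted[k] == stage1[(k + d) % PERIOD] for k in range(PERIOD)))
print(shift)
```

Here the single XOR gate yields the stage 1 sequence advanced by three clocks, whereas adjacent raw stages are only one clock apart.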
Compared with ad-hoc test patterns, random test vectors give less direct control over the
coverage of specific circuit paths. However, they are ideal for testing “black boxes”, where the
internal structure is not known and exhaustive testing is not feasible. The coverage of internal
paths can often be controlled indirectly by changing the probability distribution of the
random test vector sequences – known as weighted random test patterns [48].
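One simple way to change the bit probability is to combine several unbiased bits with AND gates. The sketch below is an illustration under stated assumptions: Python's random module stands in for the unbiased hardware bit source, and the function name is hypothetical.

```python
import random

def weighted_bit(rng, k):
    """AND k unbiased bits together, giving Pr(1) = 0.5**k."""
    return int(all(rng.getrandbits(1) for _ in range(k)))

rng = random.Random(2010)  # fixed seed so the run is repeatable
samples = 20000
for k in (1, 2, 3):
    ones = sum(weighted_bit(rng, k) for _ in range(samples))
    print(k, ones / samples)  # expected to be close to 0.5, 0.25 and 0.125
```

An input that must be 0 most of the time to sensitise a path could be driven from such a weighted bit, while the remaining inputs keep their unbiased sources.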
FPGA Test Vector Generation
Thanks to the reconfigurability of FPGAs, there are few limitations on which test vector generation
scheme can be used and how it should be implemented. The flexible FPGA hardware
can be exploited to build adaptive test vector generators with customisable test vector patterns
and generation schemes, which can be adapted to test different components and paths in a
circuit. This suggests that a unified application-independent test vector generator design may
be possible in FPGAs for testing a large variety of circuits.
[Figure 2.8 diagram labels: a Circuit Under Test (CUT) placed between two columns of registers (Reg), each column connected as a scan chain; signals Scan Select, Scan In, Scan Out, Normal Inputs and Normal Outputs.]
Figure 2.8: An example of the Scan-Chain architecture.
2.5.2 Test Data Launch, Capture and Analysis
Scan-chain Architecture
For a test to yield accurate results, the input test vectors must be launched at a specific time
and the responses captured at a precise time after the launch. The most widely adopted approach
is the scan-chain architecture with scannable registers, as shown in Figure 2.8.
The obvious advantage of the scan-chain architecture from the hardware design and timing accuracy
standpoint is that it does not require major structural changes to typical pipelined designs,
where combinatorial logic is naturally staged between registers. Moreover, the data shifting
structure of the scan-chain means that it can take advantage of the traditional JTAG boundary scan
structure for testing.
Since the same clock and registers for the normal operation are used for the scan operation,
launch and capture of test data, the CUT can be tested at-speed (see Section 2.5.3) and the
results reflect the realistic timing behaviour of the CUT in normal operation.
The approach is, however, less attractive in terms of test time and efficiency.
For every test vector, the inputs and responses require multiple clock cycles to shift sequentially
in and out of the scan-chain, taking up a significant portion of the test time.
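The cost of shifting can be estimated with a simple cycle count. The sketch below rests on assumptions that are not stated in the text: each vector needs one shift cycle per scan register plus one capture cycle, and shifting a response out overlaps with shifting the next vector in.

```python
def scan_test_cycles(num_vectors, chain_length):
    """Clock cycles for a scan test under the assumptions stated above."""
    return num_vectors * (chain_length + 1) + chain_length

for chain_length in (8, 64, 512):
    cycles = scan_test_cycles(num_vectors=1000, chain_length=chain_length)
    print(chain_length, cycles, f"{1000 / cycles:.1%} of cycles capture data")
```

Under this model, the fraction of cycles that actually exercise the CUT falls roughly in inverse proportion to the chain length.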
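[Figure 2.9 diagram labels: three registers (Reg) with Scan In, Scan Out, Normal Inputs and Normal Outputs; mode control inputs C1 and C0; a mode table listing the settings 0 0, 0 1, 1 0 and 1 1 against the operation modes Scan, Signature Analysis / LFSR, Reset and Normal.]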
Figure 2.9: An example of a 3-bit wide Built-in Logic Observer (BILO) with selectable LFSR
or CRC Signature generation in a Scan-Chain architecture.
The response data are normally stored separately for further analysis to identify timing errors.
The analysis usually involves comparing a set of reference responses – pre-generated
from the functional information and simulation of the CUT – against the actual test responses.
If an error is found, the CUT is considered faulty and fails the test.
Test Response Compaction
Complex designs with a large number of internal states, input bits and output bits often produce
a large amount of output response data, which is expensive both to store and to analyse.
The problem is often tackled by compacting the raw test responses into a shorter,
more condensed bit stream using various data hashing schemes [48, 49], thereby reducing the
storage and computational requirements.
Signature Analysis is one of the most widely used test response compaction schemes for sequential
circuit testing. It is a broad term that covers a series of response
compaction schemes, including simple transition counting, ones counting and cyclic redundancy
check (CRC) based signature generation methods [48].
Among the different schemes, CRC-based Signature Analysis is a popular
choice because the signature generation is described by a primitive characteristic polynomial,
which is equivalent to the description of an LFSR [48]. Hence the same scan-chain structure can
be used as either a built-in pseudo random vector generator or a signature generator, depending
on whether it is launching inputs or capturing outputs. This type of test design is known as a
Built-in Logic Observer (BILO) for its self-contained testing capability. Figure 2.9 depicts an
example of a 3-bit wide BILO, which can be configured to act as an LFSR or to generate CRC
signatures of the CUT’s response.
The CUT passes the test if the final signature pattern matches the expected reference signature
pattern from correct operation. The main danger of using signature analysis with any
compaction scheme is the potential for missing errors when the compaction scheme is insensitive
to certain patterns of errors in the raw circuit response. This is known as aliasing; it
causes false negative results with a certain probability, which must be considered carefully
when choosing the type of compaction scheme for different CUTs.
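Aliasing in a CRC-style signature can be reproduced with a short model. The sketch below assumes a 4-bit serial signature register that keeps the remainder of the response stream divided by x^4 + x + 1; the response bits and error positions are made up for illustration.

```python
def signature(bits, poly_low=0b0011, n=4):
    """Serial signature register: remainder of the bit stream, read as a
    polynomial over GF(2), divided by x**n + poly_low (here x^4 + x + 1)."""
    mask, state = (1 << n) - 1, 0
    for b in bits:
        msb = (state >> (n - 1)) & 1
        state = ((state << 1) & mask) ^ (poly_low if msb else 0) ^ b
    return state

good = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]  # assumed fault-free response

# A single-bit error always changes the signature.
single = good.copy()
single[5] ^= 1

# An error pattern equal to the polynomial itself (1 0 0 1 1) is a multiple
# of the divisor, so it leaves the remainder unchanged: aliasing.
aliased = good.copy()
for offset, e in enumerate([1, 0, 0, 1, 1]):
    aliased[3 + offset] ^= e

print(signature(good), signature(single), signature(aliased))
```

The first and third printed signatures are identical even though three response bits differ, so a test relying on this signature alone would wrongly pass the faulty response.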
2.5.3 At-Speed Testing
The term at-speed testing generally describes a test where the CUT is exercised at the actual
operating clock frequency to discover real-world timing behaviour. There are numerous types
of defects that are only visible under real-time at-speed testing. These include clock network
timing defects, heating effects of the clock network and logic under high switching frequency,
crosstalk and interference between switching signals, and potential issues with power supply noise.
Most ASIC at-speed testing is carried out using scan-chain type test circuits for speed-binning
purposes. Although accurate, it is quite demanding in terms of test equipment cost, test time
and the area overhead of the fast built-in test logic needed in the device to enable high-frequency
testing. For FPGAs, reconfigurability may provide an alternative to the traditional scan-chain
method, allowing a non-permanent built-in hardware test platform that quickly and efficiently
tests a design at full speed but does not contribute to the final resource and power overhead. See
Section 2.5.4 on Built-in Self-test for more detail.
2.5.4 Built-in Testing
For ASIC devices, extensive testing at high clock frequency or “at-speed” generally requires
highly expensive test equipment [50], and with the demand for ever higher operating speeds,
at-speed testing often becomes infeasible with external test equipment due to the timing
and bandwidth limitations of the chip’s I/Os [51]. The solution is to embed extra
logic into the ASIC design to facilitate launching test inputs and capturing test outputs [52].
The Scan-Chain architecture explained in Section 2.5.2 is a widely adopted option. Although
external equipment is still required to control the test, generate test data, and read and analyse
output data to discover timing faults, the speed of the equipment is no longer a limiting factor on
the actual test clock frequency. Therefore, slower and less costly external equipment can be used. This
strategy of using partially built-in test equipment is known as slow-fast-slow testing [52], and it is
a common technique for minimising external equipment cost in at-speed testing. The downside
of this approach is that the total test time is still limited by the slow test vector write and result
read processes, so the improvement in test time is insignificant.
Built-in Self-test (BIST)
BIST takes the idea of test circuit integration further by building the entire test system
on-chip to fully overcome the issues with slow external equipment. In BIST, the entire test
vector generation, result analysis and test control circuitry is built into the chip, enabling
the device to initiate and manage tests independently. Because the speed of the on-chip test
system is no longer limited by slow chip I/Os or external equipment, BIST has
a significantly shorter test time and virtually zero external test equipment cost.
BIST is widely used by ASIC chip manufacturers to reduce cost and increase productivity
in performance and fault tests. Furthermore, due to the independent nature of BIST, it is
especially suited to system-on-chip (SoC) solutions for performing automatic self-test or Power-on
Self-test (POST) of critical functions and on-chip memories [53]. BIST is often used as an on-line
test to monitor faults or errors in critical devices during their operating lifetime. The concept
of on-line BIST is also key to fault-tolerant designs, as described earlier in Section 2.2.2.
Despite the benefits of BIST, the extra test circuitry incurs long-term area, cost, and even
power and timing yield penalties on a per-chip basis for ASICs. It is often
difficult to maintain a good balance between high test coverage and resource overhead
in BIST designs. Moreover, implementing application-specific BIST to obtain optimal test
efficiency and minimal resource overhead is often a costly and time-consuming process.
FPGAs, on the other hand, do not suffer from these problems. Reconfigurability allows
FPGAs to have BIST hardware that does not contribute to extra hardware cost. As stated
by Stroud et al. [54], there is often a significant amount of unused resources on an FPGA with
which to implement BIST.
2.5.5 Design for Testability
The term testability in VLSI defines how easily a circuit can be controlled through its inputs
(controllability), such that errors caused by static or timing faults through all the possible
input-to-output signal paths and states of the circuit can be observed at the output (observability).
The higher the testability, the fewer input vectors are needed to exercise all paths and detect
faults in the circuit, enabling a quicker test. Intuitively, observability may seem to be directly
dependent on controllability. However, this is only the case for combinatorial circuits and
sequential circuits without feedback paths. For a sequential circuit with feedback that resembles
a Finite-State Machine (FSM) structure, the observability depends not only on the control of
its external inputs but also on the circuit’s internal state. An extreme case would be an LFSR,
where observability depends purely upon the circuit’s state, but errors can still be detected
provided that the previous state and the state transition function are known.
A complex circuit with poor testability often requires a full exhaustive test that cycles through
every input vector combination to activate every signal path and observe its output for
errors. Such an exhaustive test is not only time-consuming, but also impacts productivity in the
case of commercial production testing. To overcome this problem, ASICs are often designed
to meet certain testability requirements on top of their functional requirements, to ease the testing
process. This is known as Design For Testability (DFT). The idea of DFT is based on the fact
that a circuit design can be modified to increase its testability without altering its original
function [53]. Moriztz et al. [55] showed that the testability of static faults is affected by simply
changing the topology of CMOS transistor circuits or logic gate networks.
DFT Opportunities in FPGAs
The testability of an FPGA in terms of its generic circuit structure is not particularly high, due
to the sheer number of possible signal paths and the high cost of controlling and observing faults
through every path. A test for the general FPGA structure requires a large number of predefined
test configurations and a long test time to test exhaustively all the paths in the different
components, such as LUTs, registers and the interconnect structure [54, 56, 57, 58, 59].
McCracken et al. proposed a scheme [57] that exploits the self-reconfigurability of FPGAs to
improve test speed, efficiency and coverage of resources, including the complex interconnect
structure of FPGAs. Krasniewski, on the other hand, suggested in [58, 60] that manufacturing
tests targeting the general FPGA structure – application-independent tests – are often ineffective
because their coverage of signal paths is unrealistic compared with the actual signal paths in real
FPGA designs. He proposed an application-dependent strategy for FPGAs in [58, 59] to improve
the testability of FPGA designs. The idea is to transform LUT functions in the user design
such that more paths are exposed and exercised when using random input test vectors. The
approach requires no structural change to the user design, hence it can be easily incorporated
into FPGA design tools. FSM-type sequential circuits can also be tested in the same way using
the method in [61], where the circuit’s feedback path is altered to test the combinatorial (output
and next state) logic and the feedback path separately.
There is, however, one drawback with Krasniewski’s approach. The internal structure of the
user circuit must be known for the LUT function transformation procedure to work, and hence
it is not appropriate for testing circuits containing Intellectual Property (IP) blocks, which
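[Figure 2.10 diagram labels: (a) a closed loop containing an odd number of inverters, producing oscillations; (b) any circuit whose output (Out) is the logical inverse of its input (In), with the output fed back to the input, producing oscillations.]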
Figure 2.10: (a) A typical Ring Oscillator (RO) structure using inverters. (b) The general
definition of an RO circuit.
are “black boxes” with inaccessible internal details. In fact, the use of IP blocks in a design
rules out DFT in general, since internal circuit detail is required to make appropriate
modifications to improve testability.
2.6 Existing Test and Measurement Methods
This section presents an examination of existing test methods and concepts in detail. Both
FPGA specific and FPGA relevant general test methods are examined.
Since the manufacturing tests used by vendors are considered rigorous and highly mature in detecting
static and functional faults, it is safe to assume that qualified devices are free from
such basic faults. Therefore, the test methods presented in this section are mainly oriented
towards post-manufacturing testing, where precise timing performance measurements, small
delay faults and delay variability are of interest.
2.6.1 Ring Oscillator Delay Measurement
A Ring Oscillator (RO) generally refers to an odd number of inverters connected in a closed-loop
chain (Figure 2.10(a)). By definition, however, it can be formed from any circuit exhibiting
logical inversion and a direct output-to-input feedback loop (Figure 2.10(b)). Due to the signal
inversion between the input and output, the signals in an RO oscillate indefinitely and the
oscillation frequency is governed by the propagation delay of the entire signal path in the loop.
The path delay in the RO loop (tloop) is given by:
\[ t_{loop} = \frac{T_{osc}}{2} = \frac{1}{2 \cdot f_{osc}} \tag{2.3} \]
where Tosc is the oscillation period and fosc is the oscillation frequency.
Since the oscillation frequency can be measured easily and accurately by counting the number of
signal transitions in a set period with a simple counter, it is one of the most popular delay
measurement methods for both ASIC and FPGA designs. [62] demonstrated how an RO can be
used to measure physical interconnect delay in FPGAs, and ROs have also been used to measure the
delay of LUTs and interconnects to discover the effect of process variation [34] and degradation [14] in
FPGAs.
The concept of the RO is simple and easy to implement, yet its general use is severely limited by
the structural and functional requirements of an RO. Firstly, it is limited to measuring
combinatorial circuits; sequential circuits involving flip-flops cannot be measured with this
method.
Secondly, an RO measures the delay of a loop, not the delay of an input-to-output path of a
typical CUT that is of interest. The measurements would include the unwanted delay in the
feedback path, reducing accuracy. In addition, most realistic circuits are not naturally compat-
ible with the RO structure (signal inversion and feedback) and forcing a combinatorial feedback
path in a normal circuit may affect its behaviour, yielding inaccurate delay measurements.
Figure 2.11: The waveform of an RO oscillation with period Tosc. tfall and trise represent the
propagation delay of falling and rising signal transition through the RO loop.
Thirdly, the indefinite loop oscillations would cause unrealistic heating of the CUT from dynamic
power dissipation, offsetting the measurements from their normal values. The method is also unsuitable
for circuits with very low propagation delay, where the resulting oscillation frequency would be
too high to be measured by any counter and would cause extreme heating during the measurement.
Lastly, RO frequency measurements cannot distinguish between the delays of the rising transition
(trise) and the falling transition (tfall). Consider Figure 2.11: the resulting oscillation has a
period Tosc that is actually the sum of trise and tfall, and the measured oscillation frequency is
given by:
$$ f_{osc} = \frac{1}{T_{osc}} = \frac{1}{t_{rise} + t_{fall}} \qquad (2.4) $$
Therefore, the actual tloop from RO frequency measurement is given by:
$$ t_{loop} = \frac{1}{2 \cdot f_{osc}} = \frac{t_{rise} + t_{fall}}{2} \qquad (2.5) $$
One can see that the delay derived from the RO frequency is the average of the rising and
falling transition delays, not the worst-case delay. Eq. 2.3 assumes that trise is approximately
equal to tfall, which is not always the case in reality. Although the difference between the
rising and falling edge delays is reflected in the duty cycle of the RO oscillation (Figure 2.11),
it is extremely difficult to measure this difference precisely, especially when fosc is high. The
inability to give the worst-case delay means RO measurements cannot be used to reliably
determine the maximum operating speed of a CUT when trise ≠ tfall.
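As a numerical illustration of Eq. 2.4 and Eq. 2.5, the following Python sketch computes the
loop delay that an RO frequency measurement would report and compares it with the slower of
the two edge delays. The function name and the example delay values are assumptions made
purely for illustration; they are not measurements from this work.

```python
def ro_measurement(t_rise_ns: float, t_fall_ns: float) -> dict:
    """Model what a ring oscillator frequency measurement reports (hypothetical helper)."""
    t_osc_ns = t_rise_ns + t_fall_ns       # Eq. 2.4: one period is a rising plus a falling pass
    f_osc_ghz = 1.0 / t_osc_ns             # 1/ns is GHz
    t_loop_ns = 1.0 / (2.0 * f_osc_ghz)    # Eq. 2.5: delay derived from the measured frequency
    return {
        "f_osc_ghz": f_osc_ghz,
        "t_loop_ns": t_loop_ns,
        "worst_case_ns": max(t_rise_ns, t_fall_ns),
    }


# Illustrative values only: rising edges are assumed to be slower than falling edges.
result = ro_measurement(t_rise_ns=1.5, t_fall_ns=0.5)
print(result)
```

With these assumed values the RO-derived delay is 1.0 ns while the slower edge takes 1.5 ns,
so a clock period chosen from the RO result alone would be too short for the rising transition.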
2.6.2 Signature Registers Based Delay Measurement
The Signature Analysis scan-chain architecture described earlier in Section 2.5.2 can be adapted
to measure actual delay values instead of just pass/fail results. Consider running an at-speed
Signature Analysis for a particular CUT through a range of clock frequencies, where the clock
period decreases in uniform steps. For example, if the clock period changes in the following
steps:
10ns, 8ns, 6ns, 4ns, 2ns,
and we obtain test results of:
pass, pass, pass, fail, fail.
Then, the delay of the CUT can be derived from the pass/fail pattern, and the CUT delay
in this case is estimated to be between 4ns and 6ns. The finer the frequency steps, the more
accurate and precise the timing measurements that can be obtained. This approach was taken
by [63] to measure the delay of register-to-register paths in a specific ASIC design at
resolutions up to 2ns.
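The bracketing step in the example above can be sketched in Python as follows. This is a
minimal illustration that assumes the clock periods are listed in decreasing order and that each
test simply returns "pass" or "fail"; the function name is hypothetical and the sketch does not
model the signature analysis itself.

```python
def delay_bounds(periods_ns, results):
    """Return (lower, upper) bounds on the CUT delay from a clock period sweep.

    periods_ns: clock periods in decreasing order.
    results:    "pass" or "fail" for each period, in the same order.
    """
    last_pass = None
    for period, result in zip(periods_ns, results):
        if result == "pass":
            last_pass = period           # the CUT still meets timing at this period
        else:
            return (period, last_pass)   # first failure: the delay exceeds this period
    return (None, last_pass)             # never failed: only an upper bound is known


bounds = delay_bounds([10, 8, 6, 4, 2], ["pass", "pass", "pass", "fail", "fail"])
print(bounds)  # (4, 6): the delay lies between 4 ns and 6 ns
```

The width of the returned interval equals the step size of the sweep, which is why a finer
sweep gives a tighter estimate at the cost of more test runs.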
Since the essential requirement of this method is to run the at-speed scan/signature test
multiple times at different clock frequencies to achieve more precise timing estimations, it is
suitable for all devices that are equipped with scan-chain registers and signature analysers.
However, for the same reason, it has a major disadvantage in test time. For a complex CUT,
each test requires a significant amount of time to run and to analyse output responses for
errors, and the total test time for running multiple tests quickly becomes impractical even at
a relatively low timing resolution requirement. Thus, this method is not suitable for cases
where the total test time is critical.
Compared to the RO approach, this provides a more realistic measure of the path delay of a
normal register-to-register CUT structure, but timing accuracy and precision are achieved at
the expense of total test time.
2.6.3 Time-To-Voltage Delay Measurement
Accurate timing measurement can also be achieved by measuring delay in terms of analogue
voltage levels in VLSI circuits. Consider the timing diagram in Figure 2.12 where the CUT
output has a propagation delay tdelay relative to the positive clock edge. By sampling the
voltage level of an analogue sawtooth signal at the position of the output signal transition, an
accurate measurement of the circuit path delay in terms of voltage (Vdelay) can be obtained.
The general circuit layout of this method is depicted in Figure 2.13. The sawtooth signal is
generated from the system clock signal and the corresponding voltage level is sampled by the
voltage sampler circuit. The sampled voltage level representing the path-under-test’s (PUT)
delay is then converted to digital values by an analogue-to-digital converter (ADC).
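To make the conversion concrete, the following Python sketch maps an ADC output code back to
a delay value. It assumes an ideal sawtooth that ramps linearly from zero to the ADC full-scale
voltage over exactly one clock period, and an ideal ADC; both are simplifying assumptions for
illustration and are not taken from the implementation in [64]. The function name and the
example numbers are hypothetical.

```python
def adc_code_to_delay_ns(code: int, adc_bits: int, clock_period_ns: float) -> float:
    """Convert an ADC code sampled from a linear sawtooth into a delay (idealised model)."""
    levels = 1 << adc_bits                      # number of ADC quantisation levels
    if not 0 <= code < levels:
        raise ValueError("ADC code out of range")
    # With a linear ramp, the fraction of full scale equals the fraction of the clock period.
    return code / levels * clock_period_ns


# Illustrative values: a 10-bit ADC and a 10 ns clock period.
delay_ns = adc_code_to_delay_ns(code=512, adc_bits=10, clock_period_ns=10.0)
resolution_ns = 10.0 / (1 << 10)
print(delay_ns, resolution_ns)
```

Under these assumptions the timing resolution is simply the clock period divided by the
number of ADC levels, which is why the achievable resolution depends on the ADC.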
Given an accurate voltage sampler circuit and a high resolution ADC, this method can yield
highly precise and accurate delay measurements. Test time is also very short because only a
single signal transition through the CUT is needed to obtain a measurement. [64] presents a
transistor level implementation of this method for measuring critical path delay in VLSI devices
for efficient speed-binning.
Since this method relies on analogue circuitries to generate, sample and interpret specific
voltage levels, it is not applicable to existing FPGA products, and future FPGA architectures
are unlikely to include such analogue circuitries. For ASICs, the test circuit can be
incorporated into a design easily on a per-path basis. However, every PUT has to have its own
sawtooth generator and voltage sampler in close proximity to minimise inaccuracy caused by
wire delay, meaning the test circuitries could incur a significant area overhead in complex
circuits with many spatially separated critical and near-critical paths.
The analogue circuitries (sawtooth generator, voltage sampler and ADC) may require increased
transistor sizes to improve their analogue characteristics, causing even more area overhead.
Furthermore, the behaviour of the analogue circuitries may be susceptible to temperature and
supply voltage fluctuations, process variation and degradation, affecting the measurement's
accuracy.
Lastly, each critical path must be pre-extracted from the circuit design and assigned the
appropriate test circuitries during the design process. Hence, this method is not applicable to
designs with IP blocks with inaccessible internal information.
Figure 2.12: An example showing how the analogue voltage in a sawtooth signal can be used
to measure the delay of a signal.
Figure 2.13: The principle circuit diagram of a time-to-voltage based delay measurement
method.
DLL Based Time-To-Voltage Conversion
A similar but less hardware-restricted approach is proposed by [65]. The measurement method
uses an analogue phase detector (PD) and a Voltage-Controlled Delay Line (VCDL) to achieve a
similar goal. The PD and VCDL are standard components found in Delay-Locked Loops (DLLs)
for clock phase synchronisation, and many existing FPGAs are equipped with DLL based clock
management circuits.
The measurement method works by passing a clock signal through a VCDL into a PUT, where
the delay of the PUT causes a phase shift in the clock signal at its output. The phase-
shifted clock is then compared to the original clock using a PD to generate a control voltage
proportional to the phase difference. The control voltage is in turn fed back to the VCDL
to adjust its delay, forming a closed-loop control system (see Figure 2.14). When the control
loop stabilises, the delay of the VCDL would be reduced to exactly compensate for the phase
shift caused by the PUT's delay, and the control voltage can be used to indicate the amount of
propagation delay induced by the PUT. Similar to the previous approach, an ADC circuit can
be used to translate the control voltage into numerical values, and the actual PUT delay is
given by the difference between the uncompensated and the compensated values.
Figure 2.14: The principle circuit diagram of a Delay-Locked Loop (DLL) based delay
measurement method, containing a phase detector (PD) and a Voltage-Controlled Delay Line
(VCDL). Note that the charge pump and filter circuits for the control voltage generation are
omitted in this illustration.
This approach shares most of the advantages and disadvantages of the initial time-to-voltage
method described previously. The test time is comparable to the initial method but slightly
longer due to the extra stabilisation time needed for the VCDL and PD to lock on to the correct
phase-shift in the loop. Also, unlike the initial method that allows testing of register-to-register
paths, this method is strictly limited to testing pure combinatorial circuits.
The use of a DLL means this method may be suitable for current and future FPGAs. However,
it requires access to the DLL's internal control voltage signal, which is not available in most
cases. Similarly, most FPGAs are not equipped with an ADC to convert the voltage levels to
delay values. One alternative to the ADC is to feed the control voltage directly into a
Voltage-Controlled Oscillator (VCO), a component found in Phase-Locked Loops (PLLs), to
generate a clock with a frequency proportional to the delay value. The frequency can then be
measured by a counter to extract the delay value.
Figure 2.15: A ring oscillator based pulse-to-oscillation modulator. The modulated oscillation
frequency is determined by the delay element td.
2.6.4 Time-To-Digital Delay Measurement
The previous methods based on time-to-voltage conversion could provide good measurement
accuracy and precision, though they all suffer from unrealistic analogue hardware requirements
that are hard to realise even in future FPGA architectures. Measurement methods that utilise
standard hardware in existing FPGA architectures would be much more attractive and more
easily adopted for BIST and delay variability self-measurements. The following subsections
describe several time-to-digital conversion methods: [66] relies on oscillator based time
measurement, while [67, 68] rely on delay line based time-to-digital converters.
Oscillator Based Path Comparison Approach
Abramovici et al. proposed the idea that a path with certain propagation delay can be measured
by comparing its output against a reference signal through a path with known delay [66]. The
time difference between the two signals indicates the PUT’s relative delay, which is used to
calculate the actual path delay.
An example of the test circuit is depicted by Figure 2.16. The measurement is done by com-
paring the output of the PUT and the reference path with an XOR gate to generate a pulse,
where the pulse width represents the amount of delay difference. To transform that pulse width
into a discretely measurable signal, it is passed into a pulse-to-oscillation modulator as shown
in Figure 2.15. The modulator has a ring oscillator like structure and an enable input which
controls the duration of the oscillation according to the width of the input pulse. The number
of oscillations in the modulated pulse is measured by a counter to obtain the relative delay and
hence the propagation delay of the PUT.
Figure 2.16: An example of a relative delay measurement circuit using a reference path with
known delay. A count up or down signal indicates whether the relative delay is positive or
negative.
As can be seen, the timing resolution of this method depends on the oscillation frequency of
the modulated pulse, which is limited by the smallest possible delay (td) in the oscillator's
loop (Figure 2.15), the frequency bandwidth of the underlying circuitry, and the highest
oscillation frequency measurable by the counter circuit. Also, the accuracy of the result relies
heavily on accurate knowledge of the reference path's delay. At first, this may seem to be a
paradoxical requirement: a delay measurement method that requires an accurate delay
measurement in the first place in order to work. However, consider the example in Figure 2.17
with three paths, PUT-A, PUT-B and PUT-C, whose delays tA, tB and tC are to be measured. It
turns out that no pre-measured reference path is needed when the paths are compared in such
a way that they act as mutual reference paths to each other. In this case, three comparison
tests are needed:
PUT-A against PUT-B : Relative delay: tA − tB = t1.
PUT-A against PUT-C : Relative delay: tA − tC = t2.
PUT-A against PUT-B and PUT-C : Relative delay: tA − (tB + tC) = t3.
The three equations representing the measured relative delays (t1, t2 and t3) can be solved to
obtain the actual path delays, which are given by tA = t1 + t2 − t3, tB = t2 − t3 and
tC = t1 − t3.
Although a known reference path is no longer required to obtain the PUTs' delays, one must
be aware that the extra delay introduced by the wire linking PUT-B and PUT-C may not
be negligible. Also, the measured delay of PUT-C (tC) includes the multiplexer's delay at
its input. To isolate the delay of PUT-C without the multiplexer, an extra test using either
PUT-A or PUT-B as reference may be necessary.
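The solution of the three comparison equations can be checked with a short Python sketch.
The path delays used below are made-up integer values chosen only to exercise the algebra;
the sketch ignores the wire and multiplexer delays discussed above, and the function name is
hypothetical.

```python
def solve_mutual_reference(t1, t2, t3):
    """Recover (tA, tB, tC) from the three relative delay measurements.

    t1 = tA - tB, t2 = tA - tC, t3 = tA - (tB + tC)
    """
    t_a = t1 + t2 - t3
    t_b = t2 - t3
    t_c = t1 - t3
    return t_a, t_b, t_c


# Hypothetical true delays, used only to generate consistent measurements.
true_a, true_b, true_c = 5, 3, 1
t1 = true_a - true_b              # PUT-A against PUT-B
t2 = true_a - true_c              # PUT-A against PUT-C
t3 = true_a - (true_b + true_c)   # PUT-A against PUT-B and PUT-C in series

recovered = solve_mutual_reference(t1, t2, t3)
print(recovered)  # (5, 3, 1)
```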
Figure 2.17: A test case with three PUTs without a reference path with known delay. The
circuitry for generating count up or down signal is omitted in the diagram.
Overall, this method provides a practical solution to measure combinatorial path delay at
acceptable accuracy. However, it relies on several assumptions that may not fit well into
FPGAs. The assumptions are:
(a) The interconnect delays between the source input signal and the PUTs are perfectly
balanced and cancel out.
(b) The interconnect delays between the PUTs' outputs and the XOR gate are perfectly balanced
and cancel out.
(c) The XOR gate has the same delay across the two used inputs – this is unlikely to be the
case for FPGAs considering the structure of common LUT designs [27].
(d) The PUTs must be pre-extracted from the CUT, assuming the internal circuit structure
is known and accessible.
This approach is unlikely to yield highly accurate results in FPGAs unless the PUTs and test
circuit are meticulously chosen and implemented to meet assumptions (a) to (c), and it is
certainly not suitable for a CUT with IP blocks where the circuit structure is unknown.
Moreover, since the measurement resolution depends on the physical characteristics of the
oscillator and counter circuit, the precision of results may vary across different VLSI process
technologies, different FPGA architectures, different batches of the same FPGA, or even
different locations on the same FPGA due to process variation.
Figure 2.18: A typical Vernier Delay Line (VDL) containing delay elements with delays t1 and
t2. The time difference between the Start and Stop pulses is encoded by the VDL as a Delay
Bit Pattern.
Vernier Delay Line Based Approaches
A Vernier Delay Line (VDL) is a well established and widely used hardware structure capable
of measuring and encoding the time difference between two signals into digital bits. A VDL
has a regular structure containing chains of delay elements and registers as shown in
Figure 2.18. The Start and Stop signals propagate along the two delay lines (top and bottom)
at different speeds depending on the delay elements. As the Stop signal travels along the top
delay line, each register is triggered at a certain time separated by t1 and stores the
instantaneous input signal from the bottom delay line. The stored bit pattern represents the
time difference between the Start and Stop signals, where the time resolution is given by
t1 − t2, subject to the constraint t1 > t2.
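The encoding principle can be illustrated with a small behavioural model in Python. The
sketch assumes a common VDL arrangement in which the earlier (Start) edge passes through the
slower t1 elements, while the later (Stop) edge passes through the faster t2 elements and clocks
the registers; the exact wiring may differ in detail from Figure 2.18. Ideal delay elements and
registers are assumed, delays are integer picoseconds, and all names and values are
hypothetical.

```python
def vdl_encode(delta_ps: int, t1_ps: int, t2_ps: int, n_stages: int) -> list:
    """Return the bit pattern captured by an idealised Vernier delay line.

    delta_ps: time by which the Stop edge lags the Start edge.
    t1_ps:    per-stage delay of the slow (Start) line, with t1_ps > t2_ps.
    t2_ps:    per-stage delay of the fast (Stop) line.
    """
    if t1_ps <= t2_ps:
        raise ValueError("the VDL requires t1 > t2")
    bits = []
    for stage in range(1, n_stages + 1):
        start_arrival = stage * t1_ps             # Start edge after `stage` slow elements
        stop_arrival = delta_ps + stage * t2_ps   # Stop edge after `stage` fast elements
        # The register captures 1 while the Start edge still arrives no later than Stop.
        bits.append(1 if start_arrival <= stop_arrival else 0)
    return bits


# Illustrative values: 110 ps and 100 ps elements give a 10 ps resolution.
pattern = vdl_encode(delta_ps=35, t1_ps=110, t2_ps=100, n_stages=8)
estimate_ps = sum(pattern) * (110 - 100)
print(pattern, estimate_ps)
```

In this model the Stop edge gains t1 − t2 on the Start edge at every stage, so the number of
ones in the pattern multiplied by t1 − t2 gives the time difference rounded down to the
resolution (30 ps for the assumed 35 ps input).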
The structure of VDL means that it can be implemented directly into FPGAs, exploiting
FPGA’s regular structure with good scalability to measure any range of time. Also, the fact
that the VDL’s time resolution does not depend on the actual delay value of the delay elements,
but the relative difference between t1 and t2 implies that a very high time resolution could be
achieved with relatively slow delay elements in low cost and low power devices. In [67], Jansson
et al. proposed a specialised time-to-digital converter design using a VCDL as the delay line in
a VDL structure to gain very high timing resolution at the level of 10ps.
Figure 2.19: A principle test circuit using VDL.
In [68], the VDL is adopted to measure propagation delays of combinatorial circuits. Figure 2.19
depicts the basic idea of the measurement method, where the input and output of the PUT act
respectively as the Start and Stop signals of the VDL. The key requirement and downside of
this method is that the interconnects for the Start and Stop signals must be balanced in such
a way that their delays cancel out exactly when the signals reach the VDL. This is particularly
problematic for FPGA testing, since it is nearly impossible to control the placement and
routing of interconnects so that two different signals use physically identical hardware. To
make things worse, two nominally identical paths may still differ in delay due to process
variability.
In summary, VDL yields highly precise and accurate timing measurements between two signals.
However, when it comes to measuring delay of circuits, it is limited by the timing requirement of
the delivery of the Start and Stop signals, making it much less attractive, especially in FPGAs
where interconnect resources are predefined.
2.7 Timing Error Detection
Timing errors do not indicate timing information directly, but they are the direct result of a
timing violation of a signal through a circuit path. Hence, it may be possible to obtain a
circuit's propagation delay if the timing conditions under which the circuit transitions from
error-free operation to producing errors can be found. Such a relationship between error rate
and timing conditions may provide a highly accurate way of measuring delays.
Figure 2.20: Structure of the RAZOR Flip-Flop pipeline proposed in [18] to detect bit level
errors.
2.7.1 The RAZOR Architecture
Ernst et al. [18] proposed RAZOR, a self error detecting and correcting register design for
resilient low-voltage processors that is capable of producing error-rate information
(Figure 2.20). An error is detected by comparing the main flip-flop output against a reference
signal from a shadow latch clocked by a slightly delayed clock signal. The error signal
controls a multiplexer that automatically corrects the value in the main flip-flop on the next
clock cycle to achieve self error recovery.
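The detection principle can be illustrated with a highly simplified behavioural model in
Python. The sketch assumes an ideal flip-flop and shadow latch with no setup time, hold time
or metastability, and a single data transition per cycle; it is not a model of the circuit in [18],
and the function and parameter names are hypothetical.

```python
def razor_sample(path_delay_ns, clock_period_ns, shadow_delay_ns, old_bit, new_bit):
    """Return (main, shadow, error) for one clock cycle of an idealised RAZOR register."""
    # The main flip-flop samples at the clock edge; late data means the old value is kept.
    main = new_bit if path_delay_ns <= clock_period_ns else old_bit
    # The shadow latch samples slightly later, giving late data a chance to arrive.
    shadow = new_bit if path_delay_ns <= clock_period_ns + shadow_delay_ns else old_bit
    return main, shadow, main != shadow


# Illustrative values: a 4 ns clock with a 1 ns shadow clock delay.
on_time = razor_sample(3.0, 4.0, 1.0, old_bit=0, new_bit=1)
late = razor_sample(4.5, 4.0, 1.0, old_bit=0, new_bit=1)
print(on_time, late)  # (1, 1, False) (0, 1, True)
```

In this simplified model an error is flagged only when the path delay falls between the clock
period and the clock period plus the shadow clock delay, so the error flag marks the point at
which timing is first violated.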
The basic idea of this RAZOR Flip-Flop pipeline was later adopted by Intel Corp. in [19, 20]
to create processor designs that tolerate errors, delay variability and degradation.
Although RAZOR does not perform explicit delay measurement, its error detection scheme
accurately detects timing violations in the combinatorial path between the RAZOR Flip-Flops.
This kind of timing precision and accuracy is difficult to match by any of the timing
measurement methods described previously. The challenge is how the excellent timing accuracy
and precision of this simple error detection scheme could be exploited to obtain actual timing
measurements.
2.8 Timing Model Based Performance Analysis
Timing analysis can be seen as a way to measure a circuit's performance without the actual
hardware and measurement circuits. Ideally, given a perfect timing analysis method and
perfect models, the results should be identical to the actual physical measurements. However,
timing analysis in reality is far from ideal, and results are rarely close to the physical
measurements due to inaccurate timing models and other unpredictable factors such as process
variations.
This raises the question of whether designers should continue to rely heavily on timing analysis
or seek physical test data whenever possible. For ASICs, physical testing is not a practical
option due to the high hardware fabrication cost and the area overhead of on-chip delay
measurement circuits. However, for FPGAs, designs can be quickly mapped to hardware and
verified by temporary test circuits or BIST with no extra hardware cost, making physical
testing a feasible option.
The following subsections explain the basic idea behind timing models and commonly used
timing analysis techniques. The potential of using physical timing measurements on top of
existing timing models is discussed at the end.
2.8.1 Timing Models
A timing model is generally a set of mathematical approximations, derived from low-level
analogue device models, that describe the timing behaviour and propagation delay of signals in
digital devices in terms of basic device parameters, such as supply voltage and temperature.
The low-level analogue device models obtained from experimental observations of fundamental
circuit components such as interconnects, transistors and simple logic gates can be used
directly to accurately predict the propagation delay of signals in a circuit [1]. However, these
device models often involve non-linear equations that must be solved numerically and
iteratively over many time steps, making them very time consuming and inefficient to use [1, 2].
In practice, a digital circuit is assumed to operate in discrete voltage levels, with fixed circuit
conditions and parameters, meaning that the analogue device models can be approximated and
simplified to form a static timing model that describes the delay of components [2]. For FPGAs,
static timing models that describe the high level timing characteristics of components, such as
I/O buffers, interconnects, switch boxes, LUTs, registers and any other embedded components
are often used.
Modelling Delay Variability
Process variability cannot be statically modelled because its effect is usually non-deterministic.
The delay of components can be, at best, modelled as random variables with certain stochastic
distributions (see Statistical Static Timing Analysis in Section 2.8.2). Random fluctuations of
temperature and voltage could also affect the delay in a similar way, but are relatively easy to
minimise and control compared to process variation.
2.8.2 Timing Analysis Methods
Nowadays, most ASIC and FPGA design tools are equipped with sophisticated timing analysers
that give users quick feedback on the timing of critical paths in their designs. This is
extremely useful in aiding the design and optimisation process under given timing and
operating condition requirements. However, the results are only as good as what the critical
path extraction algorithm and the timing model can provide, leading to four major
requirements/assumptions that are necessary to give accurate timing estimations for modern
VLSI circuits:
(a) Accurate timing model that describes the timing behaviour of the underlying hardware.
(b) Accurate identification of critical paths in a design.
(c) Inclusion of the variations of environmental conditions in the timing model, such as oper-
ating temperature and supply voltage.
(d) Inclusion of the effects of process variation and degradation in the timing model.
Static Timing Analysis
Static Timing Analysis (STA) works by tracing delay of components along different paths in
a circuit to compute delay of every path [2]. Unlike a physical test, STA does not depend on
input test vectors and hence has coverage of all paths.
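The path tracing idea can be sketched in Python as a longest-path computation over a directed
acyclic graph of components. This is a minimal illustration with one deterministic delay per
connection, not the algorithm of any particular timing analyser; the circuit, the node names
and the delay values are hypothetical. It relies on `graphlib` from the Python standard library
(Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def arrival_times(edges):
    """Return the latest signal arrival time at every node of a combinatorial circuit.

    edges: dict mapping (source, destination) to the delay of that connection.
    """
    predecessors = {}
    for (src, dst), delay in edges.items():
        predecessors.setdefault(dst, {})[src] = delay
        predecessors.setdefault(src, {})
    arrival = {}
    order = TopologicalSorter({n: set(p) for n, p in predecessors.items()}).static_order()
    for node in order:  # every predecessor is visited before the node itself
        arrival[node] = max(
            (arrival[p] + d for p, d in predecessors[node].items()), default=0
        )
    return arrival


# Hypothetical circuit with two paths from "in" to "out" (delays in arbitrary units).
edges = {("in", "lut1"): 2, ("in", "lut2"): 3, ("lut1", "out"): 4, ("lut2", "out"): 1}
arrival = arrival_times(edges)
print(arrival["out"])  # 6: the path through lut1 is the critical path
```

Because every path is covered by the graph traversal, no input test vectors are needed, which
mirrors the coverage property of STA described above.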
STA is a popular choice in timing analysis tools because it is fast and generally yields accurate
timing estimations. The problem with STA is that the timing model assumes a deterministic
delay value for each type of hardware resource at certain environmental conditions, meaning
that the unexpected delay variations described by (c) and (d) are not properly handled in most
cases. The usual workaround is to consider the worst-case timing model under the worst-case
environmental conditions, process variations and degradation to cover the possible delay
variability. Although this strategy guarantees error-free operation at the estimated speed,
it is highly conservative, and the gap between the actual speed and the estimated speed is
expected to grow as process variation worsens in future devices.
Many up-to-date FPGA design tools still rely on this principle to ensure that their STA
based timing analysers properly cover the increasing delay variability. As a result, the
reported timing performance of a design is highly conservative and differs from the actual
timing performance by a large margin.
Statistical Static Timing Analysis
To better account for delay variability, a variant of STA called Statistical Static Timing
Analysis (SSTA) is used. Instead of treating the delay of logic as fixed values, SSTA treats
delays as random variables, each obeying a certain stochastic distribution described by the
characteristic of the expected delay variability. SSTA gives timing estimation in the form of
probability density functions (PDFs), showing the spread and probability of delay values of
circuit paths. A detailed explanation of the SSTA techniques concerning process variability
can be found in [69, 70].
Although SSTA manages to take delay variability into account, its timing results in the form
of PDFs are usually not particularly meaningful to FPGA users. Thus, a point on the PDF with
high enough confidence, usually the 3-σ upper bound, is taken as the delay of the circuit.
This normally yields more realistic results than traditional STA. However, it is still likely to
be far from the physical timing of the design, especially when overly wide probability
distributions are used to cover the large spread of possible within-die and between-die delay
variabilities.
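As an illustration of the 3-σ upper bound, the following Python sketch combines the delays of
components along a single path. It assumes that each component delay is an independent
Gaussian random variable, so that means and variances add; real SSTA tools must also handle
correlated delays and the maximum over converging paths, which are not modelled here. The
function name and the numbers are hypothetical.

```python
import math


def path_delay_3sigma(components):
    """Return (mean, sigma, 3-sigma upper bound) of a path delay.

    components: list of (mean_ns, sigma_ns) pairs for independent Gaussian delays.
    """
    mean = sum(m for m, _ in components)
    sigma = math.sqrt(sum(s * s for _, s in components))  # variances add when independent
    return mean, sigma, mean + 3.0 * sigma


# Hypothetical two-component path.
components = [(1.0, 0.3), (2.0, 0.4)]
mean, sigma, bound = path_delay_3sigma(components)

# Corner-based estimate: every component is taken at its own 3-sigma value.
corner = sum(m + 3.0 * s for m, s in components)
print(mean, sigma, bound, corner)
```

With these assumed values the statistical bound is approximately 4.5 ns, compared with
approximately 5.1 ns when every component is taken at its own 3-σ corner, which illustrates
why SSTA tends to be less pessimistic than worst-case STA.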
Dynamic Timing Analysis
Dynamic timing analysis (DTA) is generally referred to as timing simulation, because a design
is simulated with actual input vectors to obtain timing information, much like testing a
physical circuit. The simulation is done with a complete timing model description of the
design's circuit layout, such that realistic timing behaviour of the circuit can be observed.
Like STA or SSTA, the timing model needs to take delay variability into account by considering
worst-case delays or by treating delays as PDFs.
The downside of DTA is that the simulation process is significantly more computationally
intensive and time consuming than STA or SSTA. On top of that, the accuracy of DTA is
highly dependent on the input stimulus used. Nonetheless, DTA has unique features that
STA and SSTA fail to provide, such as support for fully asynchronous designs, better
analysis of designs with multiple clock domains, and the ability to discover functional flaws in
a design. In most cases, DTA is used in conjunction with STA or SSTA to provide a complete
verification of a design.
2.8.3 Discussion on Timing Analysis
In a typical FPGA design flow, the timing model is an essential component in timing-driven
placement and routing of designs onto FPGA architectures (Figure 2.21(a)). However, the
increase in unpredictable delay variability means that timing models are becoming less
accurate, and the resulting FPGA configuration may no longer be optimal.
Figure 2.21: (a) Flow diagram of a typical timing model driven placement and routing process of
FPGA designs. (b) Flow diagram of a physical delay characterisation and timing measurement
driven placement and routing process of FPGA designs.
As suggested earlier in Section 2.4.3, a physical delay characterisation and timing measurement
based placement and routing process could be used to help mitigate the negative effects of
variability. This could dramatically transform the way that placement and routing are done,
like the depiction in Figure 2.21(b). Since it does not rely on approximated delay models, it
could potentially yield more optimal placement and routing results, and designs with better
timing performance.
STA, SSTA and DTA timing analysers are likely to remain important tools in the FPGA design
process because they give users efficient and quick feedback on timing. However, the timing
models used should be constantly tuned by physical timing measurements to keep track
of delay variability. In addition, a design should always be tested by a physical measurement
method when possible to observe its true timing performance.
It is important to note that a place and route algorithm does not need to base all timing
information on physical measurements. It may very well be sufficient to have most of the
placement and routing done on a tuned timing model, leaving only a few critical paths to be
done using physical measurements, thereby resulting in a hybrid strategy that takes advantage
of both the efficiency of timing models and the accuracy of physical measurements.
2.9 Summary and Discussion
This chapter has explored the importance of testing in ASIC and FPGA devices, and reviewed
the common testing and delay measurement methodologies, as well as the challenges
and opportunities with testing posed by the increasing delay variability in VLSI circuits.
The flexible and reconfigurable structure of FPGAs makes them suitable for a wide range of
applications, but at the same time places them in a vulnerable position against delay
variability. Interestingly, this flexibility also allows better delay measurement schemes to
monitor delay variability, which provides a wide range of options to counter the problem. The
solutions reviewed in Section 2.4.3 revolve around the ability to measure physical delays and
perform adaptive placement and routing of circuits to mitigate the effect of delay variability
caused by process variation and degradation. This not only shows a promising direction in
solving the problem, but also shows that further circuit performance and reliability
enhancement may be possible through the same principles.
Several delay measurement methods based on ring oscillator, signature registers, time-to-voltage
and time-to-digital are reviewed in detail. In addition, an error detection scheme developed
for low-voltage processors (RAZOR) is examined for its potential in providing highly accurate
delay measurements. Timing models and timing analysis methods are also examined to show
how timing models are becoming less capable in tracking delay variability and emphasise the
importance of physical delay measurement in FPGAs.
The delay measurement methods are compared and summarised in Table 2.2. Although the
RAZOR error detection scheme is technically not a delay measurement method, it is included
to show what a measurement method might look like if the same principle were applied. Model
based timing analysis is also included in Table 2.2 to show the advantages of physical delay
measurements, as well as the aspects in which measurement methods need further improvement
to match the versatility of timing analysis.
Table 2.2: Comparison of delay measurement, error detection and timing analysis methods

                Ring            Signature       Time-to-        Time-to-        RAZOR           Timing
                Oscillators     Registers       Voltage         digital                         Analysis

Accuracy        Medium (a)      High            Medium (b)      Low (c)         High (d)        Low (e)

Resolution      High            Potentially     Depends on      Medium (g)      N/A             Usually
                                high (f)        ADC                                             high

Variability     Yes             Yes             Yes             Yes (h)         Yes             No
Aware

Area Overhead   Low             Moderate        High            Moderate        Moderate        N/A

Real Design     Low             High            Moderate        Moderate        High            Highest
Applicability

FPGA            High            High            Low             Moderate        High            High
Compatibility

CUT/PUT Type    Combinatorial   Register-       Combinatorial   Combinatorial   Register-       Any
                with feedback   to-register                                     to-register

IP Block        No              Yes (i)         No              No              Yes (i)         Yes (j)
Support

Test time       Short           Long            Short           Short           N/A             Usually
                                                                                                short (k)

(a) Result offset by feedback delay and self-heating from high switching frequency.
(b) Depends on accuracy of voltage sampler.
(c) Offset by unbalanced interconnect delays.
(d) No actual delay measurement but accurately detects errors when timing is violated.
(e) Depends on accuracy of timing model.
(f) Given that fine clock frequency steps are used.
(g) The Vernier Delay Line based approach could yield high resolution results if delay elements
    are matched carefully.
(h) Accuracy may be affected by delay variability in the test circuit.
(i) Given that registers can be modified within an IP block.
(j) Requires design tool support for IP block timing information.
(k) May take a long time if timing simulation is used.
Chapter 3
The Failure Rate Detection Measurement Method
3.1 Introduction
In this chapter, a new delay measurement methodology, known as the Failure Rate Detection
(FRD) method, suitable for existing FPGAs and general use is proposed [11, 12, 3]. The
new method is aimed at addressing the issues with existing methods presented in the previous
chapter, as well as providing a standard self-measurement framework for characterising process
variability of FPGAs in terms of propagation delay.
When measuring the delay of a combinatorial circuit path directly, it is often impossible to
isolate the unwanted delay introduced by the interconnects used to deliver signals to, and
capture signals from, the circuit-under-test (CUT). All the aforementioned methods that rely on
direct measurement of time suffer from this problem. The methods that use registers to synchronise
the delivery and capture of test signals are immune to this problem, but since the signals are
re-timed to match a reference clock, it is difficult to directly measure the delay caused by the
CUT.
Figure 3.1: An example of a failure rate profile. [Plot of failure rate against clock frequency,
with the labels Normal Operation, Failure Characteristic and fmax.]
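Figure 3.2: Basic principle of the Failure Rate Detection (FRD) delay measurement method.
[Block diagram: the Test Stimuli Generator (TSG) supplies stimuli S to the circuit-under-test
(CUT), a combinatorial circuit placed between two registers (R) clocked by the Test Clock
Generator (TCG); the output D feeds the Error Detection Circuit (EDC), whose Error signal is
passed to the Error Histogram Accumulator (EHA).]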
The new idea introduced in this work is based on the realisation that timing can be measured
indirectly by observing the circuit’s error rate or timing failure rate behaviour. Inspired by
the accurate error detection ability of the RAZOR architecture [18], the timing failure rate
of a CUT can be measured against an increasing clock frequency (decreasing clock period) to
obtain a failure rate profile. An example of a failure rate profile is shown in Figure 3.1. Such
a profile indicates when the failure rate begins to deviate from zero, and hence the
maximum operating frequency. A detailed observation of the relationship between the failure
rate and clock frequency may also reveal the failure mechanism and characteristics of the CUT.
3.2 Principle of The Measurement Circuit
The measurement circuit proposed in this work performs indirect delay measurement by stepping
the system clock frequency while gathering timing failure statistics of the CUT to estimate
its maximum operating frequency. The basic principle of the test circuit based on Failure Rate
Detection (FRD) is depicted in Figure 3.2.
The circuit-under-test (CUT) consists of a combinatorial logic circuit and its associated local or
global interconnect, placed between two pipeline registers. The CUT is driven by a flexible test
clock generator (TCG) capable of fine frequency adjustment from a predefined lower bound to
an upper bound. While the TCG steps the clock frequency towards the upper bound, a Test
Stimuli Generator (TSG) exercises the CUT with stimuli S such that each signal transition
arrives at the output D after tdelay, the propagation delay to be measured.
As the clock frequency steps up and the time available for propagation between the CUT’s
registers drops below tdelay, the wrong data from the previous cycle is sampled and the error is
detected by the Error Detection Circuit (EDC). By performing multiple test trials per frequency
step and counting the proportion of failed trials using an Error Histogram Accumulator (EHA), the
failure rate of the CUT at each frequency step can be deduced. The resulting failure rate profile,
as illustrated by the example in Figure 3.1, allows an accurate estimation of the CUT’s maximum
operating frequency and propagation delay.
3.2.1 Timing Resolution and Accuracy
Since the timing measurement is obtained indirectly from the failure rate profile in the clock
frequency domain, the timing resolution of the results depends on how finely the frequency of the
TCG can be stepped. Most modern FPGAs contain flexible and runtime reconfigurable on-chip
clock generation resources capable of fine clock frequency adjustment down to the kHz range.
This is especially true for FPGAs with PLL based clock generators, which rely on an analogue
control loop to precisely lock onto the desired frequency value defined by the pre-scale,
feedback and post-scale counters in the PLL. The on-chip TCG allows any combinatorial circuit
and path delay to be measured precisely without the need for external circuitry.
On top of that, a moderate frequency step size is sufficient to yield high timing resolution. The
following case shows the general relationship between timing resolution and frequency step size.
Consider a CUT that works at frequency f, but fails at frequency f + ∆f. The delay of the
circuit is between t1 = 1/f and t2 = 1/(f + ∆f). Hence, the delay time resolution is [11]:
$$\Delta t = t_1 - t_2 = f^{-1} - f^{-1}\left(1 + \frac{\Delta f}{f}\right)^{-1} \approx \frac{\Delta f}{f^2} \tag{3.1}$$
For example, suppose f = 500MHz and ∆f = 0.25MHz, the resolution of the delay measure-
ment achieved is 1ps.
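The figures in this example can be checked numerically. The following Python sketch (the function name `delay_resolution` is illustrative and not part of the original work) evaluates both the exact difference t1 − t2 and the first-order approximation of Eq. 3.1:

```python
def delay_resolution(f_hz, df_hz):
    """Return (exact, approximate) delay resolution in seconds."""
    exact = 1.0 / f_hz - 1.0 / (f_hz + df_hz)   # t1 - t2
    approx = df_hz / f_hz ** 2                  # first-order form of Eq. 3.1
    return exact, approx


exact, approx = delay_resolution(500e6, 0.25e6)
print(f"exact  = {exact * 1e12:.4f} ps")
print(f"approx = {approx * 1e12:.4f} ps")
```

For frequency steps that are small relative to f the two values agree closely, which is why a moderate step size already gives picosecond-level resolution at several hundred MHz.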
The accuracy of the timing measurement is related to the edge-to-edge clock jitter from
the TCG. For PLL based TCG, clock jitter is mostly related to the Phase Frequency Detector
(PFD), the bandwidth of the low-pass filter in the control loop and any other noise source [71].
The proposed statistical approach derives the CUT’s maximum operating frequency from the
failure rate of multiple test trials instead of the spot frequency related to a single timing failure.
Therefore, given that the clock jitter has a predictable distribution and the failure rate profile is
interpreted correctly, the effect of clock jitter can be isolated to obtain highly accurate results.
This will be demonstrated in more detail later in Sections 3.4 and 3.6.
3.3 Path Delay Measurement Circuit
A specific implementation of the FRD test circuit for measuring path delay is shown in
Figure 3.3. The launch register (LR) and the sample register (SR) are clocked on opposite phases
of the test clock. This implies that the stimuli S must propagate across the CUT within half the
clock period (T/2) to ensure the correct value is sampled by the SR. The EDC compares the
delayed signal D and the output of the sample register Q with an XOR gate and latches any error
E with the capture register CR on the rising edge of the test clock to produce a late signal L.
This in turn causes a toggle flip-flop to generate a transition, signalling an error to the EHA
circuit with the signal Error.
The output toggle flip-flop serves a number of useful purposes. Firstly, it acts as a synchronous
to asynchronous interface circuit, allowing the EHA to be implemented as an asynchronous
counter to avoid synchronisation problems caused by clock skew between the CUT and the error
counter. Secondly, the asynchronous communication reduces global clock network overhead and
load, as well as enabling the EHA to be placed some distance away from the CUT, minimising
the effect of localised heating on the CUT’s propagation delay caused by the switching activities
in the EHA.
Figure 3.3: General implementation of the Failure Rate Detection (FRD) test circuit.
[Circuit diagram: the Test Stimuli Generator (TSG) drives the launch register (LR), which
supplies S to the combinatorial circuit of the CUT; the output D is sampled by the sample
register (SR) to give Q. In the Error Detection Circuit (EDC), D and Q are compared to give E,
which is latched by the capture register (CR) to give L; L drives a T flip-flop whose output
Error goes to the EHA. All registers are clocked by the Test Clock Generator (TCG).]
3.3.1 Timing Considerations
Figure 3.4 illustrates the operation of the FRD test circuit over three test clock cycles. The
clock period T with inherent jitter is defined by:
$$T = T_0 + t_j \tag{3.2}$$
where T0 is the nominal period and tj describes the behaviour of the clock jitter. The effect of
clock jitter on delay measurements will be explained later in Sections 3.4 and 3.6.
It is important to note here that the maximum operating frequency of the CUT is governed
by three factors: (i) the pure combinatorial delay of LUTs and interconnects in the CUT
(tcomb), (ii) the launch register’s clock-to-output delay (tclk_S) and (iii) the clock skew between
the launch and sample registers (tskew), which will add or subtract slack from the signal path.
Combining these factors, we have the term:
$$t_{\mathrm{delay}} = t_{\mathrm{comb}} + t_{\mathrm{clk\_S}} - t_{\mathrm{skew}} \tag{3.3}$$
which is the “apparent” delay of the CUT governing its maximum operating frequency (fmax).
Figure 3.4: Timing diagram of the test circuit showing 3 cycles of operations. tcomb represents
the combinatorial path delay for each cycle. tclk_S is the clock-to-output delay of the launch
register (LR). tg is the propagation delay of the XOR gate. tsetup_SR and thold_SR are the setup
and hold time of the sample register (SR). tsetup_CR and thold_CR are the setup and hold time of
the capture register (CR). [Waveforms shown: Clock, S, D, Q, E (D xor Q) and L over cycles 1
to 3, with the valid, transition and invalid regions marked.]
The proposed measurement circuit is capable of extracting tdelay as a precise performance
indicator based on the following timing constraints. In cycle 1, the CUT operates without timing
error. This error-free condition (valid region) occurs when:
$$t_{\mathrm{hold\_CR}} < t_{\mathrm{delay}} < \frac{T}{2} - t_{\mathrm{setup\_SR}} \tag{3.4}$$
where tdelay is the apparent delay to be measured, T is the clock period, tsetup_SR is the setup
time of the sample register SR, and thold_CR is the hold time of the capture register CR.
Furthermore, in order for the late signal L to be interpreted correctly, the following must hold:
$$t_{\mathrm{delay}} < T - (t_{\mathrm{setup\_CR}} + t_g) \tag{3.5}$$
where tsetup_CR is the setup time of the capture register CR and tg is the propagation delay of
the XOR gate.
Cycle 2 depicts the condition where a timing error occurs (invalid region). This happens when
the following condition is satisfied:
$$\frac{T}{2} + t_{\mathrm{hold\_SR}} < t_{\mathrm{delay}} < T - (t_{\mathrm{setup\_CR}} + t_g) \tag{3.6}$$
Here the error signal E is sampled high at the beginning of cycle 3 and a late signal L lasting
for one clock period is produced one cycle after the timing error occurs. A counter in the EHA
then accumulates the total late signal count CLate over a certain period with a total expected
transition count of Ctrans. The failure rate (FR) is simply the ratio CLate/Ctrans. The goal is
to observe FR and identify when T/2 ≈ tdelay and use T/2 as the delay estimate.
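The three regions defined by Eqs. 3.4 to 3.6 can be summarised in a short Python sketch. The function names and the register timing values below are hypothetical and chosen only to illustrate the inequalities; they are not measured values from any device.

```python
def classify_region(t_delay, period, t_setup_sr, t_hold_sr,
                    t_setup_cr, t_hold_cr, t_g):
    """Classify a transition against Eqs. 3.4 to 3.6 (all times in ps)."""
    # Outside the lower bound of Eq. 3.4 or the bound of Eq. 3.5 the late
    # signal L cannot be interpreted correctly.
    if t_delay <= t_hold_cr or t_delay >= period - (t_setup_cr + t_g):
        return "out of range"
    if t_delay < period / 2 - t_setup_sr:
        return "valid"          # Eq. 3.4: correct value sampled
    if t_delay > period / 2 + t_hold_sr:
        return "invalid"        # Eq. 3.6: timing error detected
    return "transition"         # inside the setup/hold window of the SR


def failure_rate(c_late, c_trans):
    """Failure rate FR = C_Late / C_trans."""
    return c_late / c_trans


# Hypothetical register timings, for illustration only.
timing = dict(t_setup_sr=50.0, t_hold_sr=20.0,
              t_setup_cr=50.0, t_hold_cr=20.0, t_g=30.0)
for t_delay in (600.0, 690.0, 800.0):
    print(t_delay, classify_region(t_delay, 1400.0, **timing))
print(failure_rate(500, 2000))
```

With a 1400 ps period the three example delays fall in the valid, transition and invalid regions respectively, mirroring the regions marked in Figure 3.4.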
The transition region (valid → invalid) represents setup or hold time violation of the sample
register by the signal D. Such a condition could in theory result in the register entering a
metastable state, although in practice metastable events are rare. Moreover, given that the
metastability cannot propagate beyond the next register and is equally likely to be resolved to
a 1 or 0, the rare metastable events are unlikely to affect the observed results. The effect of
metastability is explored in Section 3.4 through simulations.
As shown in Eq. 3.3, the measured apparent delay tdelay includes the skew between the launch
and sample registers. While this may seem to be a limitation at first, it is actually beneficial.
When estimating the speed at which a synchronous circuit can operate, it is necessary that
both the signal path delay and the clock skew are included in the estimation. If the CUT
represents the complete signal path of a user circuit, the clock skew in the measurements will
be exactly that of the user circuit. On the other hand, the delay of a user circuit path can be
estimated by summing measurements of individual parts of the signal path. In this case,
clock skew in the estimation will match the actual clock skew only if the launch register is the
same as the capture register of the previous part of the path.
For pure clock skew measurements, the proposed FRD measurement method has been modified
by Sedcole et al. in [72, 32] to isolate the value of clock skew from the apparent delay of any
register-to-register path.
3.4 Simulation and Modelling of Timing Failure
In this section, the failure behaviour of a CUT containing a single path is explored through
MATLAB simulations to predict the effect of uncertainties such as clock jitter and metasta-
bility of flip-flops. The basic simulation code and data structures representing the signals and
clock are presented in Appendix B. The simulated failure rate profile is examined to deduce
how a non-ideal failure behaviour can be interpreted to obtain accurate timing and frequency
measurements.
The simulation emulates the CUT at the clock event and signal transition level to obtain
realistic results that resemble the actual digital circuit behaviour. The CUT is assumed to
be a single path containing a delay element with predefined delay values, placed between two
registers as in Figure 3.3. The input is stimulated by a test vector that toggles every clock
cycle. During a simulation, the clock period is decremented linearly at 1.0ps per step and the
subsequent failure rate over 2000 clock cycles (trials) for each period/frequency step is recorded
to construct the failure rate profile.
3.4.1 Asymmetric Path Delay
In Figure 3.5 (a), the path delay is set to 650ps and all signals and components are ideal. The
simulated failure rate profile shows an expected jump from zero failure rate to 100% failure rate
when timing is violated at around 770MHz, corresponding to a delay and half clock period (T/2)
of 650ps. The path delay can be easily measured from the position of the single step change of
failure rate. This type of single-step behaviour is, however, less likely to occur in reality due
to asymmetrical behaviour between rising and falling signal transitions in real circuits, causing
the two types of edges to have different delays.
Figure 3.5: Simulated failure profile of a circuit path with ideal flip-flops and clock. (a) shows
the profile of an ideal circuit with perfect rising and falling transition delay symmetry, and (b)
shows the effect of asymmetrical rising and falling transitions, where they propagate through the
CUT with different delays. [Panels: (a) Failure Rate Profile, Ideal Symmetric Delay; (b) Failure
Rate Profile, Ideal Asymmetric Delay, annotated with Slow Edge Failing and Fast Edge Failing.
Axes: failure rate (%), 0 to 100, against frequency (MHz), 700 to 850.]
Figure 3.6: A timing diagram showing the conditions resulting in 0%, 50% and 100% failure
rate in the profile. trise and tfall are the delay of rising and falling transitions through the CUT.
[Waveforms shown: Clock, S and three cases of D over two cycles: 0% failure (both edges in
time), 50% failure (one edge late, Case 1 and Case 2) and 100% failure (both edges late), with
the valid, transition and invalid regions marked.]
A more accurate simulation taking asymmetry into account is shown in Figure 3.5 (b), where
the path delay is set to 625ps for the rising transition and 675ps for the falling transition. The
failure process now involves two independent steps. The failure rate first goes from zero to
50% when the slower falling transition fails, and then from 50% to 100% when the faster rising
transition fails. The 50% plateau between the two steps represents the intermediate region
where only the faster transition can propagate through within the available half clock period
(T/2), so that half of the transitions from the toggling input fail. The possible cases are
illustrated by the
timing diagram in Figure 3.6. Clearly, a circuit must function with both types of transitions.
Therefore, the zero to 50% change should be used to indicate the maximum operating frequency
or the worst-case delay of a circuit.
Since there are no timing uncertainties in the simulated circuit, the timing failure occurs as a
well defined discrete step change in the failure rate profile, making it relatively easy to determine
the exact failure frequency or delay.
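The ideal two-step behaviour can be reproduced with a few lines of Python. This is a closed-form sketch rather than the event-level MATLAB simulation of Appendix B, and it assumes ideal registers with zero setup and hold time; the function name is illustrative.

```python
def ideal_failure_rate(freq_mhz, t_rise_ps, t_fall_ps):
    """Failure rate (%) of an ideal CUT whose input toggles every cycle."""
    half_period_ps = 1e6 / (2.0 * freq_mhz)     # T/2 in ps
    # Rising and falling edges alternate, so each accounts for half the trials.
    late_edges = sum(1 for t in (t_rise_ps, t_fall_ps) if t > half_period_ps)
    return 50.0 * late_edges


# Delays of Figure 3.5 (b): 625 ps rising, 675 ps falling.
for freq in (700, 770, 850):
    print(freq, ideal_failure_rate(freq, 625.0, 675.0))
```

Setting both delays to 650 ps collapses the two steps into the single step of the symmetric case.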
3.4.2 Clock Jitter and Flip-flop Metastability
In real circuits, we cannot assume an ideal clock source and ideal flip-flops. Therefore, the
timing uncertainties caused by them must be taken into account to interpret the failure rate
profile correctly for timing measurements. The main sources of timing uncertainty in the test
method are clock jitter and flip-flop metastability, which affect the way that the failure rate
changes in the failure rate profile when the circuit fails. Since the CUT’s timing is constrained
by the single clock cycle edge-to-edge period, the clock jitter can be modelled by a probability
distribution that describes the relative edge-to-edge variation between consecutive clock edges.
Similarly, the effect of metastability can be modelled by assuming the output of a flip-flop always
resolves to a certain value according to a characteristic PDF, known as a metastable window. An
example using predefined PDFs of the clock jitter distribution and flip-flop metastability window
is simulated to demonstrate their effects.
In the simulation, the edge-to-edge clock jitter is set to have a uniform distribution between
±15ps (Figure 3.7 (a)) and the metastable window is bounded by ±10ps relative to the actual clock
edge position (Figure 3.7 (b)), which defines the probability that a flip-flop fails to register an
input transition when it occurs close to a clock edge. As will be demonstrated later in
Chapter 4 Section 4.4.1, the bounded uniform jitter distribution is a good estimate of the actual
clock jitter behaviour found in real circuits. The resulting failure rate profile with the
uncertainties is shown in Figure 3.7 (c) along with the ideal response for comparison. As can be
seen from the plot, clock jitter causes the failure rate to change gradually with a slope defined
by the jitter distribution rather than a discrete step change.
For any jitter distribution PDFJitter as a function of relative time (τ) to the nearest clock edge,
the failure rate (FR) from zero to 50% caused by the slower transition can be defined as a
function of clock period (T) by the following expression:
$$\mathrm{FR}_{\mathrm{Jitter}}(T) = 100\% \times \frac{1}{2} \int_{-\infty}^{t - \frac{T}{2}} \mathrm{PDF}_{\mathrm{Jitter}}(\tau)\, d\tau \tag{3.7}$$
where t is the worst-case propagation delay of the slower signal transition.
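For the bounded uniform jitter used in the simulation, the integral in Eq. 3.7 reduces to the cumulative distribution of a rectangle function. The Python sketch below evaluates it directly; the function names are illustrative and the ±15ps half width is taken from the simulation settings above.

```python
def uniform_cdf(x, half_width):
    """CDF of a uniform distribution on [-half_width, +half_width]."""
    if x <= -half_width:
        return 0.0
    if x >= half_width:
        return 1.0
    return (x + half_width) / (2.0 * half_width)


def fr_jitter(period_ps, t_ps, half_width_ps=15.0):
    """Eq. 3.7 for uniform jitter: failure rate (%) of the slower transition."""
    return 100.0 * 0.5 * uniform_cdf(t_ps - period_ps / 2.0, half_width_ps)


# Slower transition of 675 ps: the rate rises linearly from 0% to 50%.
for period in (1380.0, 1365.0, 1350.0, 1335.0, 1320.0):
    print(period, fr_jitter(period, 675.0))
```

The rate is exactly half of its final value (25%) when T/2 equals t, because a symmetric PDF has half of its area on either side of the nominal clock edge.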
Figure 3.7: Simulated failure rate profiles of a circuit path showing the effects of clock jitter
and flip-flop metastability. (a) and (b) show the PDFs used in the simulation to describe the
edge-to-edge clock jitter and metastable window; (c) shows and compares the effect of clock
jitter and metastability on the failure profile. [Panels: (a) Clock Jitter Distribution,
probability density against time from expected clock edge (ps), −15 to 15; (b) Flip-flop
Metastable Window, probability density against time from actual clock edge (ps), −10 to 10;
(c) Failure Rate Profiles, failure rate (%) against frequency (MHz), 700 to 850, with curves
Ideal, With Clock Jitter, and With Clock Jitter + Metastability, and the 25% Failure Point marked.]
For the example in Figure 3.7, the uniform rectangle function of the jitter PDF causes the
failure rate to increase linearly when the circuit fails, obeying the previous expression (Eq. 3.7).
Similarly, the effect of metastability can be expressed in terms of failure rate with the metastable
window PDFMetastable(τ):
$$\mathrm{FR}_{\mathrm{Metastable}}(T) = 100\% \times \frac{1}{2} \int_{-\infty}^{t - \frac{T}{2}} \mathrm{PDF}_{\mathrm{Metastable}}(\tau)\, d\tau \tag{3.8}$$
The addition of metastability to the simulation has the effect of smoothing out the failure rate
change at the beginning and end of the failure process. This can be seen clearly in Figure 3.7,
where the failure rate changes at a lower rate at the beginning, then crosses the 25% point at
approximately the same gradient as the case without metastability, and levels off to 100% at a
lower rate again. Note that the metastable window is assigned a significant width relative to
the jitter distribution to clearly show its effect in the example. The actual width of the
metastable window in real circuits should be much smaller relative to the jitter distribution,
and should therefore cause a smaller smoothing effect on the failure rate profile.
Finally, the failure rate of a circuit with both metastability and clock jitter can be defined by
the integral of the convolution of their PDFs and it is given by:
$$\mathrm{FR}(T) = 100\% \times \frac{1}{2} \int_{-\infty}^{t - \frac{T}{2}} \left[ \mathrm{PDF}_{\mathrm{Jitter}}(\tau) * \mathrm{PDF}_{\mathrm{Metastable}}(\tau) \right] d\tau \tag{3.9}$$
In reality, the metastable window of flip-flops in modern VLSI circuits and FPGAs, defined
by their physical setup and hold time, is often very narrow relative to the jitter distribution,
thus it is sufficient in most cases to approximate the failure rate of a circuit with FRJitter alone
(Eq. 3.7). This assumption will be evaluated through actual measurements in Section 3.6 by
looking for any pronounced smoothing at the beginning and end of failure slope observations.
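Eq. 3.9 can be approximated numerically by sampling both PDFs on a common grid. The Python sketch below assumes a 1ps grid, the ±15ps and ±10ps uniform windows of the earlier simulation, and a half-weight convention for the sample that lies exactly on the integration limit; these choices and all function names are illustrative only.

```python
def uniform_pmf(half_width_ps):
    """Uniform distribution sampled on a 1 ps grid from -hw to +hw."""
    n = 2 * half_width_ps + 1
    return [1.0 / n] * n


def convolve(a, b):
    """Discrete convolution of two sampled distributions."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def failure_rate_pct(period_ps, t_ps, pmf, half_width_ps):
    """Eq. 3.9: half of the cumulative mass below tau = t - T/2, in percent."""
    limit = t_ps - period_ps / 2.0
    total = 0.0
    for k, p in enumerate(pmf):
        tau = k - half_width_ps          # grid position of this sample
        if tau < limit:
            total += p
        elif tau == limit:
            total += 0.5 * p             # sample on the limit counts half
    return 100.0 * 0.5 * total


combined = convolve(uniform_pmf(15), uniform_pmf(10))   # support: -25 .. +25 ps
for period in (1400.0, 1370.0, 1350.0, 1330.0, 1300.0):
    print(period, round(failure_rate_pct(period, 675.0, combined, 25), 3))
```

The combined distribution is wider than either window alone and has tapered ends, which corresponds to the smoothing at the start and end of the failure slope described above.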
3.4.3 Choosing the Timing Failure Reference Point
In order to obtain consistent timing measurements, a clear reference point on the failure rate
profile is needed to define when the CUT fails. The previous simulation and failure rate models
show a well behaved failure rate characteristic when a realistic circuit path fails. According
to Figure 3.7 (c) and Eq. 3.9, the failure rate slope always crosses the 25% point when the
CUT’s delay (tdelay) violates the nominal half clock period (T/2), i.e., when tdelay = T/2. This
relationship holds true as long as the jitter PDF is approximately symmetrical, with its median
value centered about the expected clock edge position, where median ≈ mean. Moreover, the
effect of metastability is generally small in real circuits, and it only smooths out the failure rate
change without affecting the centre point around 25%. For these reasons, the 25% point is the
ideal timing failure reference point for obtaining consistent and accurate nominal CUT delay,
which is immune to both the effect of clock jitter and metastability. Using the 25% reference
point, the nominal CUT delay can be accurately estimated by the following expression:
$$t_{\mathrm{delay}} = \frac{T_{25\%}}{2} = \frac{1}{2 \times f_{25\%}} \tag{3.10}$$
where T25% and f25% are the test clock period and frequency at 25% failure rate respectively.
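In practice the failure rate is only known at discrete frequency steps, so the 25% point has to be located between two steps. The sketch below does this by linear interpolation in frequency, which is an assumption of this example rather than a procedure stated in the text, and then applies Eq. 3.10. The profile data are made up for illustration.

```python
def delay_from_profile(freqs_mhz, failure_pct, reference_pct=25.0):
    """Estimate t_delay (ps) from the first upward crossing of the reference."""
    for i in range(len(freqs_mhz) - 1):
        r0, r1 = failure_pct[i], failure_pct[i + 1]
        if r0 < reference_pct <= r1:
            f0, f1 = freqs_mhz[i], freqs_mhz[i + 1]
            # Linear interpolation between the two neighbouring steps.
            f_ref = f0 + (reference_pct - r0) * (f1 - f0) / (r1 - r0)
            return 1e6 / (2.0 * f_ref)           # Eq. 3.10, result in ps
    return None                                  # reference level never crossed


# Hypothetical measured profile (frequency in MHz, failure rate in %).
freqs = [730.0, 735.0, 740.0, 745.0, 750.0]
rates = [0.0, 10.0, 20.0, 30.0, 50.0]
print(delay_from_profile(freqs, rates))
```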
The 25% point, however, should only be used to determine the nominal maximum operating
frequency or delay, not the absolute point where failure begins to occur. Since clock jitter
and metastability are non-deterministic and are modelled probabilistically, finding an absolute
point before failure occurs is often not possible. Nonetheless, a safety guard-band providing
sufficiently low probability of timing failure can be estimated from the distribution of clock
jitter and metastability superimposed in the failure rate profile around the 25% point.
By differentiating the failure rate profile of a CUT with respect to the clock period, the process
of Eq. 3.9 can be reversed to obtain the convolved PDF of the clock jitter and metastability.
This could help calculate the appropriate guard band around the 25% point, as well as allowing
designers to better understand and model clock related timing uncertainties in their designs.
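As a sketch of this reversal: differentiating Eq. 3.9 with respect to T gives dFR/dT = −(100%/4) × PDF(t − T/2), so the combined PDF can be estimated as −(4/100%) × dFR/dT. The Python example below applies central finite differences to a synthetic zero-to-50% slope generated from uniform ±15ps jitter; the function names and the synthetic data are assumptions made for illustration.

```python
def recover_pdf(periods_ps, failure_pct):
    """Return [(T, density in 1/ps)] from central differences of the profile."""
    samples = []
    for i in range(1, len(periods_ps) - 1):
        d_fr = (failure_pct[i + 1] - failure_pct[i - 1]) / 100.0
        d_t = periods_ps[i + 1] - periods_ps[i - 1]
        samples.append((periods_ps[i], -4.0 * d_fr / d_t))
    return samples


def synthetic_fr(period_ps, t_ps=675.0, half_width_ps=15.0):
    """Zero-to-50% slope for uniform jitter (Eq. 3.7), used as test data."""
    x = t_ps - period_ps / 2.0
    cdf = min(1.0, max(0.0, (x + half_width_ps) / (2.0 * half_width_ps)))
    return 50.0 * cdf


periods = [1300.0 + 2.0 * k for k in range(51)]      # 1300 ps .. 1400 ps
profile = [synthetic_fr(T) for T in periods]
pdf = dict(recover_pdf(periods, profile))
print(pdf[1350.0], 1.0 / 30.0)   # recovered density vs. the true 1/30 per ps
```

With measured data the differences would amplify sampling noise, so some smoothing of the profile would probably be needed before differentiating.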
Figure 3.8: The Failure Rate Detection (FRD) test module implementation for measuring
specific combinatorial path delay. [Circuit diagram: a toggle flip-flop with a Test Enable input
acts as both TSG and launch register, driving S into the combinatorial path of the path under
test (PUT); the output D is captured by the sample register (SR) to give Q, and the EDC output
goes to the EHA. The clock is supplied from the TCG.]
3.5 Test Circuit Implementation
The measurement technique explained in the previous sections is implemented and tested on
real FPGAs to verify its accuracy and confirm the expected behaviour from the simulations.
The specific implementation of the main FRD test circuit and the major components including
the TSG and TCG are explained in the following subsections.
3.5.1 Path Specific FRD Test Module
The FRD test circuit is modularised and simplified to test a single circuit path, so that the
TSG and the launch flip-flop can be reduced to a single toggle flip-flop to supply a stimulus at
half the clock’s frequency. The circuit diagram of the FRD module is shown in Figure 3.8. The
simple FRD module allows an easy delay characterisation of FPGAs with a large number of
nominally identical circuit paths, which could be used to analyse the effect of process variation
and degradation. The initial implementations of the EDC and EHA explained in Figure 3.3
and Section 3.3 are used to detect and count errors in the path under test.
3.5.2 Clock Frequency Control
The main challenges we encountered during the implementation of the TCG were how to control
the clock generators on FPGAs to provide stable, accurate and fine frequency steps, as well as
quick switching between frequency steps to minimise the total test duration.
Most built-in clock synthesisers, whether they are PLL or DLL based, define their
output clocks using one or more frequency multiplier and divider parameters. For example, the
DLL based DCM modules in a Xilinx Virtex-4 FPGA [73] define the clock frequency fout using:
$$f_{out} = f_{ref}\,\frac{M}{D} \tag{3.11}$$
where fref is the reference clock, M is the multiplier and D is the divider. Both M and D are
5-bit integers with a valid range of 1 to 32.
For PLL based run-time reconfigurable clock generators such as those provided by Altera’s
Cyclone III or newer [74] and the Stratix FPGA families [75], the clock frequency fout is defined
in a similar way by:
$$f_{VCO} = f_{ref}\,\frac{M}{N} \tag{3.12}$$
$$f_{out} = \frac{f_{VCO}}{C} \tag{3.13}$$
where M and N are the multiplier and divider defining the frequency of the voltage controlled
oscillator (VCO) and C is the post clock divider. The parameters are 9-bit integers representing
a range of 1 to 512. The frequency of the intermediate clock signal fVCO must lie within the
VCO hardware specifications (fVCOmin and fVCOmax) to ensure reliable operation.
Parameter Generation for Optimal Frequency Resolution
To obtain the best possible frequency resolution from a clock generator, the clock multiplication
and division parameters must be varied in such a way that they produce the smallest possible
increment in frequency each time. Yet, the process of finding these combinations of parameters
is non-trivial because they have limited ranges and there is no consistent relationship that
links the current combination to the next combination of the smallest frequency increment. On
top of that, there are many different combinations that could result in the same frequency
and these duplicates should be avoided.
To attack this problem, a constrained search algorithm is used to discover all possible
combinations of parameters for any frequency synthesiser, including the DCM or PLL based clock
generators described earlier. Taking the previous PLL example (Eq. 3.12 and 3.13) with
parameters M, N and C, the following pseudo code (Listing 3.1) would generate a full list of
parameters (Par_list) that provides the highest possible frequency resolution.
Listing 3.1: Clock generator parameters generation pseudo code.
    M_min := Ceil(F_lower_bound / F_ref);    // Get minimum clock multiplier
    for M := M_min to 512 do                 // Loop clock multiplier
    begin
      for N := 1 to M do                     // Loop clock divider, assuming N <= M
      begin
        F_VCO := F_ref * (M / N);
        // Check F_VCO against VCO range and user defined boundaries
        if ((F_VCO < F_VCO_min) or (F_VCO < F_lower_bound)) then break
        else if (F_VCO <= F_VCO_max) then
          for C := 1 to 512 do               // Loop post clock divider
          begin
            F_out := (F_VCO / C);
            if (F_out < F_lower_bound) then break
            else if (F_out <= F_upper_bound) then
              Par_list.Add(F_out, M, N, C);  // Add new parameters to list
          end;
      end;
    end;
    Sort_Frequency(Par_list);     // Sort parameter list in ascending F_out
    Remove_Duplicates(Par_list);  // Remove entries with duplicated F_out
The parameter generation process is constrained by the reference frequency (F_ref), the user
defined frequency boundaries (F_lower_bound and F_upper_bound), and the VCO’s hardware
frequency limits (F_VCO_min and F_VCO_max). The algorithm works by searching through
the parameter space permitted by the constraints and then sorting the results in ascending order
according to their output frequency (F_out). Any duplicated frequency entries are removed to
give a list of unique frequencies and parameter combinations.
The proposed constrained search algorithm can be used to obtain a complete set of
parameters for any clock generator to produce optimal frequency resolution. Although the process
is partially exhaustive and the execution time may grow rapidly with the number and range of
parameters, the generation process is only required once for each clock generator architecture.
The generated list is then stored for unlimited reuse and does not contribute to the actual
circuit testing time.
For reference, the above code took less than 0.1 second to execute on a 1.86GHz Intel Core2
Duo E6300 processor to generate an optimal parameter list for frequencies between 50 and
1000MHz, where F_ref is set to 50MHz, and F_VCO_min and F_VCO_max are set to 350 and
1300MHz respectively.
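The same search can be written as a runnable Python sketch. It follows the loop structure of Listing 3.1, but the function name, the `max_count` argument and the duplicate handling are choices made for this example: duplicates are detected exactly from the reduced ratio M/(N×C) instead of by comparing floating-point frequencies, and the first combination found for each frequency is the one kept.

```python
from math import ceil, gcd


def generate_parameters(f_ref, f_lower, f_upper, f_vco_min, f_vco_max,
                        max_count=512):
    """Return a sorted list of (f_out, M, N, C) with unique f_out values."""
    entries = []
    seen = set()                                 # reduced ratios M / (N * C)
    m_min = max(1, ceil(f_lower / f_ref))
    for m in range(m_min, max_count + 1):        # clock multiplier
        for n in range(1, m + 1):                # clock divider, N <= M
            f_vco = f_ref * m / n
            if f_vco < f_vco_min or f_vco < f_lower:
                break                            # f_vco only falls as N grows
            if f_vco > f_vco_max:
                continue
            for c in range(1, max_count + 1):    # post clock divider
                f_out = f_vco / c
                if f_out < f_lower:
                    break                        # f_out only falls as C grows
                if f_out <= f_upper:
                    g = gcd(m, n * c)
                    key = (m // g, (n * c) // g)
                    if key not in seen:          # skip duplicated frequencies
                        seen.add(key)
                        entries.append((f_out, m, n, c))
    entries.sort()                               # ascending f_out
    return entries


# Settings quoted in the text: 50 MHz reference, 50 to 1000 MHz output,
# VCO range 350 to 1300 MHz (all values in MHz).
par_list = generate_parameters(50.0, 50.0, 1000.0, 350.0, 1300.0)
print(len(par_list), par_list[:3])
```

The smallest gap between neighbouring entries of the returned list gives the finest frequency step available within the chosen bounds.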
Frequency Switching Efficiency
Most DLL or PLL based clock generators provide a lock signal that indicates when a clock
signal has stabilised and is ready to use after each frequency switch. The time it takes an
output clock to lock (lock time) is usually used to indicate a clock generator’s performance.
When a clock frequency is synthesised, the lock time is usually related to the values of the
parameters used. Since one frequency value can often be produced by more than one parameter
combination, it may be possible to choose the combination that results in the quickest lock time.
For the PLLs in the Cyclone III FPGA architecture, we found that the best lock time for each
frequency is achieved when M and N are relatively prime, where the minimum possible VCO
frequency is achieved. Thus, the “Remove Duplicates” procedure in the generation code is
implemented to discard all frequency entries whose M and N values are not relatively prime.
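A minimal Python sketch of this selection rule is given below. The coprime filter follows the text; choosing the lowest VCO frequency among the remaining candidates is an additional tie-break assumed for this example, and the candidate values are hypothetical.

```python
from math import gcd


def prefer_coprime(candidates):
    """Keep only (M, N, C) combinations in which M and N are relatively prime."""
    return [(m, n, c) for (m, n, c) in candidates if gcd(m, n) == 1]


# Three hypothetical ways of producing 100 MHz from a 50 MHz reference.
candidates = [(8, 1, 4), (16, 2, 4), (16, 1, 8)]
coprime = prefer_coprime(candidates)
# Assumed tie-break: the candidate with the lowest VCO frequency (lowest M/N).
best = min(coprime, key=lambda p: p[0] / p[1])
print(coprime, best)
```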
Resolution Enhancement by Clock Generator Cascading
In the case where very fine frequency resolution is needed, it is possible to cascade two or more
clock generators together to effectively increase the number of parameters in the frequency
synthesis process. For example, when two Virtex-4 DCMs are chained together, the output
frequency fout is given by:
\[ f_{out} = f_{ref} \cdot \frac{M_A}{D_A} \cdot \frac{M_B}{D_B} \qquad (3.14) \]
where M_A and D_A are the parameters of the first DCM and M_B and D_B are the parameters of
the second DCM. This provides a significantly higher frequency resolution than one DCM,
considering that each parameter only has a small range of 1 to 32.
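As a rough illustration of the gain, the following Python sketch counts the distinct frequency ratios f_out/f_ref available from one stage and from two cascaded stages. It assumes every parameter may take any integer value from 1 to 32, as stated above, and ignores the operating frequency limits of a real DCM, so the counts are upper bounds rather than device figures.

```python
from fractions import Fraction

PARAM_RANGE = range(1, 33)  # each parameter spans 1 to 32, as stated in the text


def single_stage_ratios():
    """Distinct values of M / D for one clock generator."""
    return {Fraction(m, d) for m in PARAM_RANGE for d in PARAM_RANGE}


def cascaded_ratios():
    """Distinct values of (M_A * M_B) / (D_A * D_B) for two stages (Eq. 3.14)."""
    products = {a * b for a in PARAM_RANGE for b in PARAM_RANGE}
    return {Fraction(p, q) for p in products for q in products}


one = single_stage_ratios()
two = cascaded_ratios()
print(len(one), len(two))
```

Every single-stage ratio remains available in the cascade (set M_B = D_B = 1), so the cascaded set is a strict superset.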
There is, however, one major drawback with cascading. Clock jitter from each clock generation
stage would accumulate along the chain, causing the final output to exhibit a higher level of
jitter than a single stage. For Xilinx’s DLL based DCMs, the accumulated jitter (Jitter_out) of
two cascaded DCMs is given by [73]:
\[ Jitter_{out} = \sqrt{Jitter_A^2 + Jitter_B^2} \qquad (3.15) \]
where Jitter_A and Jitter_B represent the individual jitter of the two DCMs.
PLL based clock generators, on the other hand, are more robust against jitter accumulation,
thanks to the low-pass filter within the PLL loop. The filter rejects noise (jitter) at frequencies
beyond its bandwidth, hence reducing the amount of jitter accumulated through
PLL cascading. Generally, the lower the filter bandwidth, the better the jitter rejection [75].
However, a certain bandwidth must be retained for the PLL to lock properly onto its target
frequency and operate with reasonable lock time.
Although cascading may increase the jitter spread, causing a CUT to appear to start failing at
a lower frequency in its failure rate profile, the 25% reference point (according to Section 3.4.3)
should remain stationary as long as the jitter distribution is symmetrical, and the proposed FRD
measurement method should benefit from the better timing resolution without compromising
accuracy.
3.5.3 Clock Generator Implementations and Timing Resolution
Two different TCG implementations, using external and internal clock generators, were proposed
for the FRD test method. The first implementation targets FPGAs without run-time reconfigurable
clock generators, such as the Altera Cyclone II EP2C35 we used as our initial FPGA
test candidate. It relies on an external run-time reconfigurable clock source to supply the variable
test clock. The second implementation targets any FPGA with more advanced run-time
reconfigurable clock generators, which enable a self-contained TCG circuitry and higher test
clock quality. All the Xilinx FPGAs equipped with DCMs and Altera FPGAs with enhanced
PLLs [76] are applicable to this implementation. For this, we chose the Altera Cyclone III
EP3C25 as our second test candidate.
The bounds of the required test clock frequency range (f_begin and f_end) depend on the delay
of the CUT and are given by the following inequalities:
\[ f_{begin} < \frac{1}{2(t_{delay\,max} + t_{setup\,S})} \qquad (3.16) \]
\[ f_{end} > \frac{1}{2(t_{delay\,min} - t_{hold\,S})} \qquad (3.17) \]
where tdelay max and tdelay min are the apparent delays of the slowest and fastest CUT respec-
tively. According to the timing models of the Cyclone II and Cyclone III reported by Altera’s
design tools, the required test clock frequency range is estimated to be 128 to 800MHz.
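The two inequalities can be evaluated directly, as in the Python sketch below. The delay, setup and hold values in the example are placeholders chosen for illustration only; they are not the timing-model figures used to obtain the 128 to 800MHz estimate.

```python
def f_begin_limit_hz(t_delay_max_s, t_setup_s):
    """Upper limit on the starting test frequency (Eq. 3.16)."""
    return 1.0 / (2.0 * (t_delay_max_s + t_setup_s))


def f_end_limit_hz(t_delay_min_s, t_hold_s):
    """Lower limit on the final test frequency (Eq. 3.17)."""
    return 1.0 / (2.0 * (t_delay_min_s - t_hold_s))


# Placeholder timing values (hypothetical, in seconds).
f_begin_max = f_begin_limit_hz(t_delay_max_s=3.5e-9, t_setup_s=0.1e-9)
f_end_min = f_end_limit_hz(t_delay_min_s=0.8e-9, t_hold_s=0.1e-9)
print(f"f_begin < {f_begin_max / 1e6:.1f} MHz, f_end > {f_end_min / 1e6:.1f} MHz")
```

The sweep must start below the first limit, so that the slowest CUT still passes, and end above the second, so that the fastest CUT has failed.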
Figure 3.9: (a) Test Clock Generator (TCG) implemented from Virtex-4 DCMs to test a Cyclone II
FPGA without run-time reconfigurable PLLs; (b) a self-contained TCG implemented
from run-time reconfigurable PLLs in a Cyclone III FPGA. The Control Signals manage the
reconfiguration process of the TCG and provide the clock synthesis parameters. The locked
signal indicates when the test clock is ready to use. [Block diagrams: (a) DCM based; (b) PLL based.]
External DCM Based TCG
In order to test an Altera Cyclone II (EP2C35) without run-time reconfigurable clock gen-
erators, we used the DCMs in a Xilinx Virtex-4 to supply the run-time reconfigurable clock
through an external clock pin to the Cyclone II. The setup is depicted in Figure 3.9(a). Two
DCMs are cascaded in the Virtex-4 to improve timing (clock period) resolution and the output
frequency is divided by 4 for more reliable transmission through the external clock connection.
The PLL on the Cyclone II serves three main purposes: (a) it restores the divided-by-4 clock
frequency; (b) it multiplies the limited range of output frequencies from the Virtex-4 DCMs up to
the target range of 128 to 800MHz; and (c) it helps to reject the noise introduced by the external
clock connection and the jitter accumulated in the cascaded DCMs of the Virtex-4.
Due to the constraints imposed by the DCM parameters on the Virtex-4, the possible frequencies
produced by the TCG are discrete and are not equally spaced. The timing resolution of the
generated test clock is shown in Figure 3.10 in terms of the step size of half clock period
(delta T/2) between consecutive frequency steps, which is equivalent to the timing resolution of
measurements obtained from an FRD test using this clock generator. The worst-case resolution
exceeded 4ps at the lower frequency end but remained well within 1.5ps beyond 267MHz. The
dynamic resolution is concentrated mostly between 0 and 1ps with only a few outliers at lower
resolutions. This can be seen clearly in Figure 3.11, where the step size of half clock period is
most densely distributed below 0.5ps.
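The step size plotted in these figures can be computed from any sorted list of achievable test frequencies, as in the following Python sketch. The frequency lists in the example are hypothetical and are not the TCG parameter table.

```python
def half_period_ps(freq_mhz):
    """Half clock period T/2 in picoseconds for a frequency given in MHz."""
    return 1e6 / (2.0 * freq_mhz)


def half_period_steps_ps(freqs_mhz):
    """Step size (delta T/2) between consecutive frequency steps, in ps."""
    fs = sorted(set(freqs_mhz))
    halves = [half_period_ps(f) for f in fs]
    # T/2 shrinks as frequency rises, so each difference is positive.
    return [halves[i] - halves[i + 1] for i in range(len(halves) - 1)]


# The same 1 MHz frequency step gives a much finer timing step at high frequency.
low_end = half_period_steps_ps([200.0, 201.0])
high_end = half_period_steps_ps([800.0, 801.0])
print(low_end, high_end)
```

This reflects why the worst-case resolution occurs at the lower end of the frequency range: for a fixed frequency increment, delta T/2 scales with 1/f².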
Figure 3.10: Dynamic timing resolution in terms of delta T/2 of test clock from 128 to 800MHz
generated from two cascaded DCMs on a Xilinx Virtex-4.
Figure 3.11: Distribution of step size (delta T/2) of test clock (128 to 800MHz) generated
from two cascaded DCMs on a Xilinx Virtex-4. [Plot: “Distribution of Step Size: Cascaded
DCMs”, density versus step size, delta T/2 (ps).]
Figure 3.12: Comparison of dynamic timing resolution (delta T/2) of test clock from 128 to
800MHz generated from a single PLL (top) and a two-staged PLL configuration with a post
clock multiplier (bottom) on an Altera Cyclone III.
Built-in PLL Based TCG
A more advanced 65nm Altera Cyclone III FPGA (EP3C25) was later introduced to replace
the less desirable Cyclone II + Virtex-4 setup for the FRD testing. Its flexible run-time
reconfigurable PLLs allow the whole TCG circuitry to be implemented on-chip using only the high
quality on-chip clock network. Figure 3.9(b) illustrates the improved on-chip two-staged PLL
setup, where PLL 1 performs the initial frequency synthesis with variable parameters from the
control logic and PLL 2 multiplies the frequency by a fixed value to reach a higher frequency
range.
This two-staged configuration provides a significantly better timing resolution than a single PLL
by exploiting the much higher frequency resolution at the lower frequency range of a single PLL
with the given parameter constraints in Eq. 3.12 and 3.13. In this case, a low frequency clock
ranging from 7.23 to 47.06MHz is generated by the first PLL and then multiplied by a factor of
17 in PLL 2 to reach the target range of 128 to 800MHz test clock frequency. The advantage of
this approach is illustrated clearly by Figure 3.12, where the two-staged approach is compared
against a single PLL. The worst-case resolution is improved from over 6.5ps to under 1.2ps and
the overall half clock period step sizes are significantly smaller, with the best-case resolution
reaching 0.002ps. It can also be seen in Figure 3.13 that the distribution of half clock period
step size of the two-staged approach is sharply concentrated below 0.1ps and
peaks at approximately 0.006ps, substantially outperforming the single PLL solution.
Figure 3.13: Comparison of step size (delta T/2) distribution of test clock (128 to 800MHz)
generated from a single PLL (top) and a two-staged PLL configuration with a post clock
multiplier (bottom) on an Altera Cyclone III. [Plots: “Distribution of Step Size: Single PLL”
and “Distribution of Step Size: With Post Clock Multiplier (PLL 2)”, density versus step size,
delta T/2 (ps).]
3.5.4 FPGA Delay Characterisation Circuit
Delay characterisation was carried out on both the Cyclone II EP2C35 and the Cyclone III
EP3C25. They share a similar architecture but are fabricated on different process dimensions,
where the Cyclone II is 90nm and the Cyclone III is 65nm.
Figure 3.14: An FPGA characterisation circuit using an array of FRD test modules. Each FRD
is activated and tested by specific address signals and a timed test enable signal that lasts for a
precise length of time. [Block diagram labels: array of FRD modules, Row Enable Decoder,
Column Enable Decoder, Row MUX, Column Select MUX, Error Histogram Accumulator (EHA),
Failure Rate Profile, Test Clock Generator (TCG) to Global Clock Network, Timed Test Enable,
Row Address, Column Address, Error, Clear.]
Figure 3.15: An FRD module configured to test LUTs as full adders in arithmetic mode and
their dedicated fast carry-chains on an Altera Cyclone III EP3C25. [Block diagram labels: TSG,
CUT (4-bit Adder), EDC, TCG, Toggle flip-flop, Launch Register, Sample Register, Full Adder
(FA) - LUT (Arithmetic mode), Carry-chain, Test Enable, To EHA.]
The FRD circuit characterises combinatorial paths consisting of local interconnect and a chain of
LUTs. The FRD circuits are placed at different locations on the FPGA in an array configuration
as shown in Figure 3.14. The LUTs in the combinatorial path-under-test (PUT)
are configured as buffers, inverters and full-adders to investigate the effect of different LUT
configurations on their delay and the benefit of dedicated carry-chain interconnects when LUTs
are in arithmetic mode (full-adder). The full-adder test path is shown in Figure 3.15. In addition
to the individual path delay, an overall spatial delay variability can be observed from the
characterisation results, in the form of a delay map.
Lastly, the characterisation proceeds by selecting one FRD location at a time with appropriate
row and column address signals and the test is carried out according to the timing diagram
in Figure 3.16. Each FRD location is tested with a selected range of frequency steps and the
corresponding failure count is recorded to build a failure rate profile. The test duration (t_test)
is defined by a fixed number of clock cycles on a fixed frequency reference clock, in this case
a 50MHz clock. Therefore, the higher the test clock frequency, the greater the number of samples
(test trials) counted by the failure counter in the EHA. The failure rate (FR) is given by:
\[ FR = \frac{Failure_{count}}{f_{test}} \qquad (3.18) \]
where ftest is the test clock frequency. This approach has an advantage on measurement ac-
curacy because it allows the failure rate measurement to better reject noise as the test clock
frequency increases, while test time remains constant. Moreover, it does not require a counter
synchronised to the variable test clock to maintain the same number of count samples at dif-
ferent frequencies, making the test circuit more robust at high test clock frequencies.
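The normalisation can be sketched in Python as follows. The sketch makes two assumptions that are not stated in the text: one test trial is taken per test clock cycle, and the test duration is written out explicitly as a number of reference clock cycles (`ref_cycles`, a hypothetical parameter). With a test duration of one second the result coincides numerically with Eq. 3.18.

```python
F_REF_HZ = 50e6  # fixed reference clock frequency quoted in the text


def samples_in_test(f_test_hz, ref_cycles):
    """Number of trials in one test: test clock cycles elapsed during t_test.

    Assumes one trial per test clock cycle; t_test = ref_cycles / F_REF_HZ.
    """
    t_test_s = ref_cycles / F_REF_HZ
    return f_test_hz * t_test_s


def failure_rate(failure_count, f_test_hz, ref_cycles):
    """Fraction of trials that failed during one fixed-duration test."""
    return failure_count / samples_in_test(f_test_hz, ref_cycles)


# Example: a 1 s test (50e6 reference cycles) at 400 MHz with 1e8 failures.
fr = failure_rate(failure_count=100_000_000, f_test_hz=400e6, ref_cycles=50_000_000)
print(fr)
```

Because the window is fixed in time, the sample count grows with the test frequency, which is the noise-rejection benefit described above.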
3.6 Test Results
The delay measurement and characterisation results of both the 90nm Cyclone II EP2C35 and
65nm Cyclone III EP3C25 using the previously explained FRD test circuits are presented in
this section.
Figure 3.16: A timing diagram showing the test enable signal, test clock frequency stepping
and failure count of several test cycles of the FRD circuit up to N cycles, where t_test defines the
length of each test. t_config is the reconfiguration time of the TCG to the next frequency step
and t_lock is the time needed to lock onto the new clock frequency. t_settle and t_clear are the times
reserved for the counter values to settle and to clear the counter for the next test cycle. t_read is
the duration of time where the failure count is valid for reading after each test. [Timing diagram
signals: Test Clock (freq 1, freq 2, ..., freq N), Test Enable, Failure Count (Count 1, Count 2,
..., Count N), over Test Cycle 1 to Test Cycle N.]
3.6.1 Cyclone II Measurements and Characterisations
Figure 3.17(a) shows the failure rate profile plot of a typical CUT containing 2 cascaded LUTs
configured as inverters on the Cyclone II FPGA. In this case, the signal path includes the
two LUTs in series, the relevant path through the launch and sample registers, as well as the
local interconnect link between them. The plot shows that the CUT fails in two distinct steps
caused by asymmetric delays of rising and falling transitions as predicted in Section 3.4.1. The
intermediate step at 50% (region B) represents the range of frequencies where one type of
transition has failed but the faster transitions still propagate through the CUT correctly.
The failure rate transitions between the three regions (region A, B and C) are not discrete steps,
as would be expected if failure always occurred at a spot frequency. The gradual slopes of the
transitions match the effects of clock jitter and metastability predicted earlier in Section 3.4.2.
As expected, the slope is mostly linear, and metastability shows hardly any significant
effect. Using the 25% failure frequency (f25%) at approximately 535MHz as a reference, the
nominal worst-case delay calculated from Eq. 3.10 is approximately 935ps for this particular
CUT.
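The quoted figure can be checked with a short Python sketch. Eq. 3.10 is not reproduced in this section, so the sketch assumes that the delay estimate is simply the half clock period T/2 at the 25% failure frequency, consistent with the “Estimated Delay: T/2” axis of the delay maps; any correction terms in Eq. 3.10 are omitted.

```python
def half_period_ps(freq_mhz):
    """Half clock period T/2 in picoseconds for a frequency in MHz."""
    return 1e6 / (2.0 * freq_mhz)


f25_mhz = 535.0  # 25% failure frequency read from the profile
delay_ps = half_period_ps(f25_mhz)
print(round(delay_ps, 1))  # 934.6, i.e. approximately 935ps
```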
Figure 3.17: (a) The failure rate profile obtained from a CUT containing 2 LUTs on a 90nm
Altera Cyclone II EP2C35; (b) Delay map generated from FRD array at different locations in
terms of LAB coordinates across the Cyclone II chip. [Plots: (a) “Cyclone II - Failure Rate
Profile (2 LUTs: Inverters)”, with Region A, Region B, Region C and f25% marked; (b) “Cyclone II
Delay Map (2 LUTs: Inverters)”, estimated delay T/2 (ps) versus LAB column (X) and LAB row (Y).]
Figure 3.17(b) shows the delay map surface plot of CUTs on the Cyclone II using the
FRD characterisation circuit. Similar to the finding in [34], there is an observable correlated
systematic delay variation on the plot in the form of an overall surface curvature. On top of
the systematic variation, a clear stochastic delay variation can be seen superimposed on the
surface, resulting in the apparent “roughness”. Correlated systematic and random stochastic
variation can be separated from the delay map using methods proposed in [34, 77, 78].
Unlike the ring oscillator method in [34], LUTs are characterised in pairs instead of chains of
5 LUTs. This allows the random stochastic variations to be observed more clearly, which are
normally obscured by the averaging effect of summation of delays in a long chain of LUTs.
3.6.2 Cyclone III Measurements and Characterisations
A single CUT on the Cyclone III consisting of 4 LUTs configured as buffers is first tested to
analyse the characteristics of its failure rate profile (Figure 3.18(a)). The FRD test circuit is
mostly identical to the one used on the Cyclone II, except that the TCG is now placed on-chip.
Also, to investigate the type of signal transition (rising or falling) that is responsible for the
worst-case delay in this particular test case, a small modification is made to the error detection
circuit to allow separate failure rate measurement of the two transition types. The modification
relies on a toggle signal that enables the error capture register at even or odd cycles to isolate
the errors of one type of transition at a time.
Figure 3.18: (a) The failure rate profile obtained from a CUT containing 4 LUTs on a 65nm
Altera Cyclone III EP3C25; (b) Delay map generated from FRD array at different locations in
terms of LAB coordinates across the Cyclone III chip. [Plots: (a) “Cyclone III - Failure Rate
Profile (4 LUTs: Buffers)”, failure rate (%) versus frequency (MHz), showing (i) the initial
profile, (ii) falling transitions and (iii) rising transitions; (b) “Cyclone III Delay Map (4 LUTs:
Buffers)”, estimated delay T/2 (ps) versus LAB column (X) and LAB row (Y).]
Figure 3.18(a) contains three plots, which represent the normal (initial) failure rate profile
(i), the isolated failure rate profiles of the slower falling transitions (ii) and the faster rising
transitions (iii). The quicker propagation speed of rising transitions is possibly caused by the
asymmetric charge/discharge time of transistors in the CUT. Thanks to the higher quality
clock signal from the local TCG, the failure transition slopes are less noisy and better defined
compared to the Cyclone II’s results. The slight curvature of the failure transition slopes is
mostly due to the non-linear relationship between clock frequency and clock period, causing
the linear relationship between the failure rate change and clock period to appear distorted in
the frequency domain. Similar to the Cyclone II results, the failure rate profiles are hardly
affected by metastability, which could smooth out the beginning and end of the failure slopes.
The cross-chip delay characterisation of the Cyclone III is shown in Figure 3.18(b), where a
delay map similar to the Cyclone II’s result is obtained. Again, the existence of correlated
systematic variation and stochastic variation can be seen clearly in the delay map as an overall
slope/curvature and the apparent roughness of the surface plot.
Figure 3.19: Delay map of (a) 4-bit and (b) 8-bit adders at different locations (LAB coordinates)
across the Cyclone III chip. [Plots: (a) “Cyclone III Delay Map (4 LUTs: Adder/carry chain)”
and (b) “Cyclone III Delay Map (8 LUTs: Adder/carry chain)”, estimated delay (ps) versus LAB
column (X) and LAB row (Y).]
3.6.3 Adder Circuits Cross-Chip Characterisation
The Cyclone III’s full adder implementation using LUTs in arithmetic mode and fast carry-chain
interconnects is also tested by the FRD characterisation circuit. The layout of an individual
test circuit with the adder chain was depicted earlier in Figure 3.15. This test circuit demonstrates
that the measurement method can be applied to logic and routing that are not part of generic
FPGA architectures. It yields a delay map that represents the speed of adders at locations
across the Cyclone III chip with certain delay variability.
It can be seen in Figure 3.19(a) that the overall delay of the CUTs using carry-chains is
approximately 400ps lower than the previous results in Figure 3.18(b) where slower local
interconnects are used. While all the CUTs are configured to use the same placement within
each LAB, the surface plot shows an increase in delay variability. This suggests that although
carry-chains may reduce the average delay, the use of carry-chains with LUTs in arithmetic
mode can potentially increase the worst-case delay variation of the CUTs. Figure 3.19(b) shows
a similar delay map of CUTs with longer 8-bit adders. The 8-bit adder did not result in twice
the delay because the bulk of the delay came from the first and last stage of the adder, where
slower local interconnect is used at the input of the first stage, and both stages involve a longer
path through the LUT compared to the carry-in to carry-out paths of the intermediate stages.
This clearly shows the advantage of using LUTs in arithmetic mode with carry-chain interconnects
in this FPGA architecture. For the same reasons, the delay variability within the
first adder stage, which is the same between the 4-bit and 8-bit cases, would have a significant
influence in both Figure 3.19(a) and (b), resulting in a similar delay variability pattern that
is most apparent at locations with particularly high delay (peaks) and low delay (troughs).
Figure 3.20: (a) The scatter plot of half clock period (T/2) at 25% failure rate with exponential
best fit. (b) Histogram of residuals of the delay measurements around the exponential best fit.
[Plots: (a) “Scatter plot of T/2 at 25% failure rate”, half clock period T/2 (ps) versus test
number; (b) “Probability distribution of measurements at 25% failure rate”, probability density
versus timing measurement residuals (ps).]
3.6.4 Result Precision, Accuracy and Reliability Evaluation
The precision of measurements in terms of timing resolution is related to the test clock frequency
step size generated by the TCG, which is shown in Section 3.5.3 to be mostly within 1ps for
both the Cyclone II and Cyclone III TCG implementations.
The reliability and consistency of measurements are sensitive to uncertainties such as voltage
supply fluctuation and temperature variation mostly due to self-heating of the CUT during
tests. Their effects are evaluated through 500 delay measurements of the same circuit over
time and the results are shown in Figure 3.20(a). The best-fit curve shows the effect of self-heating,
which settles after 120 measurements. Each test lasts for approximately 720ms; therefore,
the local temperature reaches equilibrium at 720ms × 120 = 86.4s. The random scattering
around the best-fit curve is depicted in Figure 3.20(b) as a histogram. This random error is
approximately Gaussian with a small standard deviation of 0.61ps.
Figure 3.21: Consistency test based on standard deviation of measurements computed across
90 sets of delay map results.
Also, to verify that the delay map measurements from the delay characterisations are not
reflecting random noise but true stochastic delay variation of the CUTs, a consistency map
(Figure 3.21) is created by computing the standard deviation for each specific chip location
across 90 delay maps from 90 repeated tests. The surface plot is relatively flat with an average
standard deviation value of 0.7ps across all locations, hence the inconsistency caused by random
noise is within ±2.1ps (3-sigma).
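The consistency map computation can be sketched in Python as below. The sketch assumes each delay map is a two-dimensional list of delays in ps and uses the sample standard deviation; the text does not state whether the sample or population form was used, and the tiny maps in the example are made-up values.

```python
from statistics import mean, stdev


def consistency_map(delay_maps):
    """Per-location standard deviation across repeated delay maps (ps)."""
    rows, cols = len(delay_maps[0]), len(delay_maps[0][0])
    return [
        [stdev([m[r][c] for m in delay_maps]) for c in range(cols)]
        for r in range(rows)
    ]


def three_sigma_bound(std_map):
    """Three times the average standard deviation over all locations."""
    return 3 * mean(v for row in std_map for v in row)


# Three made-up 1 x 2 delay maps (ps) standing in for the 90 repeated tests.
maps = [[[900.0, 950.0]], [[902.0, 950.0]], [[904.0, 950.0]]]
std_map = consistency_map(maps)
print(std_map, three_sigma_bound([[0.7, 0.7]]))
```

With an average standard deviation of 0.7ps, the 3-sigma bound evaluates to 2.1ps, matching the figure quoted above.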
3.7 Optimised Built-In Self-Test Designs
This section proposes two BIST designs with different optimisation techniques to achieve quick
power-on self delay measurement and characterisation on FPGAs. Although the Cyclone II
without a fully on-chip TCG was used to test the BIST designs, the results are intended to
show the speed and efficiency advantages of the BIST in general, as the BISTs are applicable
to any FPGA with or without an on-chip TCG.
Figure 3.22: Modified first-fail detector (FFD) circuit. [Block diagram labels: TSG, Circuit
Under Test (CUT), First-Fail EDC, TCG, Toggle flip-flop, Launch Register, Sample Register,
Capture Register, Status Register, Test Enable, Reset, Status.]
3.7.1 Parallel First-Fail Detector BIST
The method described in the last section allows delay across any combinatorial circuit to be
measured with high accuracy. However, the use of a failure rate profile, although accurate, is
relatively slow because each FRD is enabled in turn and multiple measurements are made at
each frequency in order to build the failure counts into a complete profile plot. For the purpose
of providing an FPGA delay map for placement optimisations on a chip-by-chip basis via quick
power-up testing, an efficient parallel self-characterisation method is needed.
Detailed Circuit
The FRD circuit described in Section 3.5.1 is modified to the first-fail detection (FFD) circuit
as shown in Figure 3.22. The EDC now provides a sticky status output which goes high when
it first encounters a timing error and remains high until it is reset. Since the FFD measurement
records the first failure point, the estimated delay of the CUT will be more conservative than
that obtained from the complete failure rate profile. If necessary, this can be compensated for
by initially calibrating the first point of failure with the 25% failure point on a small number
of CUTs, and then allowing for the appropriate margin in the FFD measurements.
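The sticky behaviour of the status output can be modelled with a short behavioural sketch in Python. This is not the circuit of Figure 3.22: the class and function names are hypothetical, and resetting the detector before each frequency step is an assumption made for the sweep.

```python
class FirstFailDetector:
    """Behavioural model of a sticky error status bit."""

    def __init__(self):
        self.status = False

    def sample(self, timing_error):
        # Once set, the status stays high until reset() is called.
        self.status = self.status or bool(timing_error)
        return self.status

    def reset(self):
        self.status = False


def first_fail_step(errors_per_step):
    """Index of the first frequency step in which any timing error occurred.

    errors_per_step holds, for each step of an ascending frequency sweep,
    the per-cycle error flags observed during that step.
    """
    ffd = FirstFailDetector()
    for step, cycle_errors in enumerate(errors_per_step):
        ffd.reset()  # assumed: detector is cleared before each step
        for err in cycle_errors:
            ffd.sample(err)
        if ffd.status:
            return step
    return None


sweep = [[False] * 4, [False, True, False, False], [True] * 4]
print(first_fail_step(sweep))  # 1
```

A single error in a step is enough to flag it, which is why the FFD estimate is more conservative than the 25% point of the full profile.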
Figure 3.23: (a) FFD clusters based parallel BIST system schematic; (b) Array of FFD clusters
on the Cyclone II EP2C35, each containing 16 FFD blocks. [Diagram labels: (a) FFD clusters,
Row Enable Decoder, Column Enable Decoder, Row MUX, Column Select MUX, TCG, TCG
Parameters Look-up-table, Self-Test Control FSM, Storage Control and Address Decoder, Result
Data Storage, Multi-bit Parallel Results; (b) clusters containing 16 FFDs, and control, decoder
and mux circuitry, plotted against LAB X and LAB Y.]
Multiple FFDs (16 in this case) are grouped into clusters, which are then arranged into an
array as shown in Figure 3.23(a). The entire self-characterisation sequence is controlled by the
on-chip test control FSM. The status of all the FFDs from each cluster at each test frequency
step forms a status word (16 bit) which is written to the on-chip memory (block RAM).
For the Cyclone II EP2C35 FPGA on the DE2 Board, the device is organised into 52 sec-
tors, each containing 16 FFDs. The floor plan of the BIST circuit on the device is shown in
Figure 3.23(b). Part of the device is devoted to the implementation of the FSM and decoder
circuits. Characterising this part of the device is easily achieved by relocating the controlling
circuits elsewhere on the device.
With these parallel FFD circuits, self-characterisation is achieved in approximately 3 seconds,
and the entire characterisation data, i.e., the map of frequencies at which each FFD begins to
fail, is stored in on-chip block RAM. The upper bound of the test frequency range is adaptive for
each sector; therefore, test time may be shorter depending on the actual device under test.
For optimal storage requirement, the characterisation results are stored as an integer frequency
index instead of the actual frequency value. Each frequency index points to an output frequency
from the TCG that caused the FFD at the particular location to give a failure status. The
BIST requires 13kbit of storage space based on a frequency index size of 16 bit for each FFD
location with 52 sectors and 16 FFDs per sector — 52× 16× 16bit = 13kbit.
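The storage figure can be reproduced with the arithmetic below, written as a small Python sketch. The helper `min_index_bits` is an illustrative addition (not from the text) showing the smallest index width needed for a given number of frequency steps.

```python
SECTORS = 52          # sectors on the EP2C35, from the text
FFDS_PER_SECTOR = 16  # FFDs per sector, from the text
INDEX_BITS = 16       # frequency index width, from the text

total_bits = SECTORS * FFDS_PER_SECTOR * INDEX_BITS
total_kbit = total_bits / 1024


def min_index_bits(n_steps):
    """Smallest index width able to address n_steps frequency entries."""
    return max(1, (n_steps - 1).bit_length())


print(total_bits, total_kbit)  # 13312 13.0
```

Storing an index rather than a frequency value means the width depends only on the length of the TCG frequency table.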
Results
Figure 3.24 shows progressively how failure occurs as the test frequency is gradually increased.
With the hierarchical organisation of the FFDs into clusters, it is possible to detect worst-
case delay of fine-grain FFDs (involving two LUTs) as demonstrated in Figure 3.24 (a)-(d).
In addition, a more coarse-grain cluster based characterisation can be obtained (Figure 3.24
(e)-(g)).
As can be seen in the failure maps, at 515MHz no timing failure is detected anywhere on the
device. Beyond that frequency, failure starts to occur from the right side of the device and
gradually moves left. At 560MHz, all FFD clusters on the device have failed. It can be seen
clearly that the failure pattern follows an overall trend caused by the correlated systematic
variation but at the same time contains some randomness caused by the stochastic variations.
Figure 3.24: (a-g) Progressive failure maps of FFDs.
3.7.2 Binary Search BIST
An alternative to the parallel FFD test method would be to optimise the process of searching for
the 25% point in failure rate profile along the clock frequency domain. This has the advantage of
providing more accurate delay measurement than the FFD BIST method described previously,
since the 25% point defines a consistent nominal circuit delay based on the center of clock jitter
and is largely unaffected by flip-flop metastability (see Section 3.4.3).
To optimise the frequency search process, we adopted the binary search algorithm, as it requires
relatively little resources to implement on hardware for BIST and it is able to yield the frequency
at the point of 25% failure rate with only log2N tests where N is the total number of frequency
steps within the test range. The total test time (Ttotal) is given by:
\[ T_{total} = l \times T_{test} \times \log_2 N \qquad (3.19) \]
where l is the number of FPGA test locations, and Ttest is the test time per frequency step.
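The search can be sketched in Python as follows. This is a software illustration rather than the hardware search circuit: it assumes the failure rate is non-decreasing with the frequency index, and the callback `failure_rate_at`, the synthetic profile and the per-step test time in the example are all hypothetical.

```python
import math


def find_25_percent_index(failure_rate_at, n_steps, target=0.25):
    """Lowest frequency index whose failure rate reaches the target.

    failure_rate_at(i) runs one test at frequency index i and returns the
    measured failure rate; it is assumed non-decreasing in i.
    Returns (index, number_of_tests_run).
    """
    lo, hi = 0, n_steps - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if failure_rate_at(mid) >= target:
            hi = mid          # 25% point is at mid or below
        else:
            lo = mid + 1      # 25% point is above mid
    return lo, tests


def total_test_time_s(locations, t_test_s, n_steps):
    """Total test time according to Eq. 3.19."""
    return locations * t_test_s * math.log2(n_steps)


# Synthetic profile: failures ramp linearly from index 500 to index 900.
profile = lambda i: min(1.0, max(0.0, (i - 500) / 400))
index, tests = find_25_percent_index(profile, n_steps=1024)
print(index, tests)  # 600 10
```

For 1024 frequency steps the search needs 10 tests per location instead of up to 1024 for a linear sweep, which is the source of the log2N factor in Eq. 3.19.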
Figure 3.25: A binary search based FRD characterisation BIST. [Block diagram labels: FRD
elements, Row Enable Decoder, Column Enable Decoder, Row MUX, Column Select MUX, TCG,
TCG Parameters Look-up-table, Self-Test Control FSM, Binary search circuit (Found, Higher,
Lower, Index), Counters with Count/4 comparison (Get 25% Count Reference, Compare Failure
Count), Storage Address Decoder, Result Data Storage.]
Test Circuit Implementation and Evaluation
The block diagram of the BIST circuit is illustrated in Figure 3.25. It is implemented on
the Cyclone II in a similar way to the previous parallel FFD BIST circuit, except that each
location is tested independently and the self-test control FSM is modified to include the binary
search algorithm. The BIST has a slightly higher control logic overhead than the parallel FFD
approach, but the storage requirement of results per test location is exactly the same as before.
The test time of the BIST on the Cyclone II with a 26 × 35 test array is 2.5 seconds. The test
results obtained are essentially identical to the initial delay map result shown in Figure 3.17(b).
Despite the fact that locations are tested sequentially, the total test time is shorter than that of
the FFD BIST design due to the much smaller frequency search space (log2N) per location. The
BIST is repeated 90 times to check its reliability and consistency, and the results show high
repeatability, with a standard deviation of 0.70ps over all tests.
3.8 Summary and Discussion
This chapter presented a technique to measure actual propagation delays of combinatorial
circuit paths on FPGAs. The delay measurement method is applicable to circuit elements as
well as complete signal paths, and provides a delay time resolution that increases with frequency.
This implies that the method will track the advancement of technology — as operating speed
increases, so will the timing resolution. With the proposed clock generation methods, a very
high timing resolution of approximately 1ps or better is achieved.
The measurement method is demonstrated on the Cyclone II EP2C35 and Cyclone III EP3C25
FPGAs, whose cross-chip circuit delays were successfully characterised. Although a Virtex-4
FPGA was used to provide the necessary clock generation on the Cyclone II, it was no
longer necessary for the Cyclone III because of its on-chip run-time reconfigurable PLLs. Most
existing FPGAs, including the Stratix and Virtex families, are capable of the required on-chip
test clock generation, hence they could natively support the FRD method.
Two efficient BIST solutions, based on parallel first-fail detection (FFD) and on a binary search
algorithm, were proposed for quick power-on self measurement and delay characterisation on
FPGAs. Both BISTs were able to yield a cross-chip delay measurement on the Cyclone II
EP2C35 within 3 seconds, but the binary search approach allows a higher measurement accuracy
at the expense of slightly higher control logic overhead.
While we have demonstrated that the FRD method can measure delays in computational paths
such as a series of LUTs with local interconnects and full adders with carry-chains, it is not
without limitations. The main weakness of the current FRD implementation is that it is
restricted to measuring delays in known combinatorial paths in circuits. To test an arbitrary block
with unknown critical paths, an exhaustive test approach may be needed to apply the FRD
method. Nevertheless, in cases where it is desirable to accurately measure the delay of a given
combinatorial circuit path, this method offers both precision and simplicity, particularly in
reconfigurable architectures where the test circuit can be inserted around the path-under-test
with ease.
Another important question on FPGA delay characterisation is which circuit paths out of the
complex FPGA architecture should be selected for testing to best represent the timing performance
of real user circuits on an FPGA. Although the answer is beyond the scope of this work
on delay measurement methods, critical paths in synchronous circuit designs are typically defined
by the combinatorial paths between register stages, which are well suited to the proposed
FRD method. A representative set of FPGA delay characterisation data would not only help
to model process variability, but also help to develop various design, placement and routing
techniques in the FPGA design flow to mitigate its effect or even exploit variability to gain
performance.
Chapter 4
The Transition Probability
Measurement Method
4.1 Introduction
The failure rate profile method described in the previous chapter enables precise and accurate
measurement of path delays in circuits, which opened the door to a number of new and interesting
opportunities in delay characterisation and strategies against delay variability in FPGAs.
However, it has several undesirable requirements that limit its accuracy and its use on complex
circuits.
Firstly, it requires a set of dedicated error detection circuitry per output bit of a circuit to
generate individual failure rate profiles, incurring a substantial hardware overhead on circuits
with a large number of outputs. Secondly, the error detection circuitry requires access to both the
raw combinatorial output and the registered output signals simultaneously from the CUT, which
is not always possible in circuits with output access restrictions. Furthermore, the access to the
combinatorial output may change the signal load on the CUT, affecting its delay behaviour and
offsetting the timing measurements. Lastly, the proposed FRD implementation uses a negative
edge triggered output/sample register to allow simplified error detection circuitry. Yet, this
might affect the measurement accuracy if the test clock does not exhibit an exact 50% duty
cycle. Moreover, the half clock period timing constraint imposed by this design means that the
CUT does not fail at-speed, but only at half the maximum clock speed, which prevents the
measurements from reflecting the effects of the test clock at full operating frequency, such as
switching heat and noise.
To overcome these shortcomings of the FRD approach, an improved measurement technique
based on statistical measurement of signals is proposed in this chapter. The method relies on
measuring a type of signal statistic, called transition probability [13], to detect timing failures
rather than using actual hardware error detection circuitry. As a result, measurements are
achieved non-invasively through the circuit’s normal I/Os and cause virtually no hardware
overhead at the CUT’s output. This implies that the combinatorial outputs of the CUT are
not stressed by any external load during a test, ensuring more accurate and realistic results.
With the improved flexibility, the new method is not only capable of measuring combinatorial
paths, like the FRD circuit, at high accuracy and efficiency, but it can also measure delays of
sequential circuits and complex combinatorial circuits of arbitrary granularity, or even circuit
modules with unknown internal structure.
The main concepts and implementation of the transition probability (TP) measurement method
for circuit paths are explained in this chapter, and its use on complex and sequential circuits will
be evaluated and demonstrated in detail in the next chapter.
4.2 Principle of the Measurement Method
4.2.1 Inspiration and Discovery of the New Technique
The first inspiration for the new measurement technique came when I was trying to implement
the previously described FRD method on a complex multi-path circuit. As expected, I had little
success with it due to the limitations of the FRD method. However, in one of the implementations,
I mistakenly bypassed the output comparison circuit for direct error detection and ended up
counting the number of signal transitions rather than errors. To our surprise, the transition
Figure 4.1: A typical synchronous circuit with input and output registers (R), and a combinatorial
stage in between. The input vector V drives the combinatorial circuit, whose output z is sampled
as S. Registers are driven by a clock of period T .
count plot of the circuit behaved in almost the same way as the expected failure rate profile,
showing a very clear response when timing failure occurs. This important discovery initiated our
research on a more flexible and efficient measurement technique based on pure output statistical
measurements and led to the development of the transition probability measurement method.
4.2.2 The Concept of Statistical Measurements
Consider a typical synchronous circuit with a combinatorial stage and output register (Figure 4.1).
The output signal from the register can be seen as a series of discrete time samples
S(k) of the preceding combinatorial output, where k = 1, 2, . . . . Since the output sample rate
follows the clock frequency driving the register, two types of relative statistical measurements
over a finite number of clock cycles N can be observed:
(I) The High Probability (HP) or H(S), where H(S) = P{S(k) = 1}. It represents the fraction
of samples in which S is high over N clock cycles. It is a first order statistical
measurement of S and its value lies within the range of 0 to 1.
(II) The Transition Probability (TP) or D(S), which is defined as the probability that S
changes state between consecutive samples, i.e., the average number of signal transitions
in S per cycle over N clock cycles. It is given by:
\[ D(S) = P\{S(k+1) \neq S(k)\} = P\{S(k)=0\}\,P\{S(k+1)=1\} + P\{S(k)=1\}\,P\{S(k+1)=0\} \tag{4.1} \]
D(S) is a second order statistical measurement of S and it also has a range of 0 to 1.
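The two statistics can be computed directly from a recorded sample sequence. The following is a minimal Python sketch with hypothetical function names; it normalises the transition count by the number of consecutive sample pairs (N − 1), which is an assumption made here for a finite list and approaches the per-cycle definition for large N.

```python
def high_probability(samples):
    """H(S): fraction of samples that are logic high (first order statistic)."""
    return sum(samples) / len(samples)


def transition_probability(samples):
    """D(S): fraction of consecutive sample pairs whose values differ."""
    transitions = sum(1 for a, b in zip(samples, samples[1:]) if a != b)
    return transitions / (len(samples) - 1)


toggle = [k % 2 for k in range(1000)]   # output that changes on every cycle
stuck = [1] * 1000                      # output held high

print(high_probability(toggle), transition_probability(toggle))
print(high_probability(stuck), transition_probability(stuck))
```

The toggling sequence has half of its samples high and a transition in every pair, while the stuck sequence is always high and never changes, which illustrates that the two statistics carry different information about the same signal.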
Now consider the combinatorial circuit depicted in Figure 4.1 with one output z. A sequence of
input vectors V (k), k = 1, 2, . . ., is applied to the inputs of the circuit, and results in a sequence
of output samples S(k) with a certain high probability H(S) and transition probability D(S).
It was shown in [79] that the probability of an output evaluating to 1 is equal to the sum of
the probabilities of each of the disjoint cubes in the cover evaluating to 1. If the values of the input
vectors V are chosen randomly or follow a fixed pattern (i.e., the vectors form a stationary
process), then the output z will also form a stationary process, so its probability of evaluating
to 1 is constant. Therefore, the high probability H(S) and the transition probability D(S) of the output
samples will be stationary (unchanging).
Anything disrupting the stationary process within the circuit, such as a timing violation, would
cause the probabilities to change and hence indicate a timing failure. The idea is illustrated by
the example plots in Figure 4.2. Such a disruption due to a timing violation can be explained by
the following case:
In Figure 4.1, the register captures a sample S(k) of the output z after time T , one clock
cycle after applying the input V (k). If the clock frequency is low enough, then the circuit
operates without faults: S(k) = z(k), and the probabilities H(S) and D(S) remain stationary.
However, because of propagation delays in the circuit, the output z will only change some time
after the inputs are applied. If the test clock frequency is increased, at some point the timing
constraint imposed by the clock will be breached, and the register will begin to sample the value
of z from the previous cycle, such that S(k) = z(k−1), for some of the time. This causes H(S)
and D(S) to deviate from their normal stationary values.
The Transition Probability (TP) is used in the majority of the tests in this work
because it is more sensitive to timing errors than High Probability (HP) in specific cases due to
its second order nature (see Section 4.3). It is also easier to detect, allowing a more robust
detection process, and requires less hardware than HP (see Section 4.4).
Figure 4.2: Example statistical profiles of an input to a circuit generated from a stationary
process (top) and the output of the circuit that failed after fmax (bottom) in the clock frequency
domain. [Plots: Input Statistics vs. Clock Freq. (stationary throughout); Output Statistics vs.
Clock Freq. (stationary up to fmax, CUT failed beyond fmax).]
4.2.3 Transition Probability Profile Based Measurement
The delay is measured in a similar way to the FRD method, but instead of building a failure
rate profile from direct error detection in the CUT, we observe the change of output transition
probabilities due to timing errors while ramping up the clock frequency. A transition probability
profile that shows the failure behaviour of the CUT is built from the measurements, and the
maximum operating frequency or circuit delay can be estimated from it. An example transition
probability profile is illustrated by the bottom plot in Figure 4.2.
This test technique relies on two features: (a) the ability to gradually and finely sweep the test
clock frequency fclk; and (b) the transition probability changes when the CUT begins to fail.
The clock generation process (a) has been thoroughly implemented during the development of
the FRD test method in Sections 3.5.2 and 3.5.3, where a resolution better than 1 ps was achieved.
For (b), the idea will be evaluated and simulated in Section 4.3, to explore how the transition
probability D(y) responds to timing failure in a CUT.
The top level implementation of the transition probability (TP) measurement circuit is depicted
in Figure 4.3. The Circuit-Under-Test (CUT) represents any circuit with input V and output y.
Figure 4.3: Principle circuit diagram of the transition probability (TP) measurement method.
[Block diagram: the Test Vector Generator (TVG) supplies vectors V through the Launch Register (LR)
to the Circuit Under Test (CUT); the CUT output Z is captured by the Sample Register (SR) as y and
passed to the Transition Activity Counter (TAC) and the Transition Probability Analyser (TPA), which
turn the output statistics into circuit timing measurements. LR and SR are clocked by the Test Clock
Generator (TCG).]
The Launch Register (LR) and the Sample Register (SR) at the beginning and end of the CUT
are clocked by a Test Clock Generator (TCG) which steps through a range of test frequencies.
The achievable timing resolution in terms of clock period using the TCG is the same as the
FRD case and it can be estimated by Eq. 3.1 derived earlier in Section 3.2.1.
A Test Vector Generator (TVG) provides test vectors V to the CUT such that during normal
operation, each output bit of the CUT exhibits a non-zero but steady transition probability
D(yi), where i is the bit index.
The Transition Activity Counter (TAC) processes samples (y) from the sample register on every
clock cycle for a length of K test clock periods (cycles) and accumulates the number of signal
transitions. The obtained statistical information (transition count) is then processed further
by the Transition Probability Analyser (TPA) to calculate a normalised value in terms of
average signal transitions per test clock cycle. This value is essentially the probability of a signal
transition per cycle (D(y)), which lies within the range of 0 to 1. For a given length of sampling
time over K test clock cycles, D(y) is given by:
\[ D(y) = \frac{\text{signal transition count}}{K} \tag{4.2} \]
The D(y) for each frequency step is collected to build a TP profile which is analysed by the
TPA to deduce the maximum operating frequency and propagation delay of the CUT.
4.3 Simulation and Modelling of Statistical Profiles
To explore the behaviour of the statistical profiles (TP and HP), a number of logic event
based simulations of the kind described earlier in Section 3.4 are conducted on a single path CUT with
predefined propagation delays and uniform clock jitter distribution. The CUT is stimulated
by a toggle input and is driven by a test clock that sweeps through a range of frequencies in
discrete steps of 1.0 MHz. The output statistics are recorded over 2000 clock cycles (trials) for
each frequency step to construct their TP and HP profiles against frequency. The corresponding
MATLAB code is included in Appendix B.
The simulation results of a CUT with asymmetrical (different) rising and falling transition
delays are shown in Figure 4.4, where a set of failure rate profiles based on the FRD method
proposed in the previous chapter is included for comparison against the simulated TP and HP
profiles. Both the TP and HP plots show a nearly identical response to the failure rate profile of
the CUT, where the 25% failure rate reference point is represented by D(y) = 0.5 in the TP
case and H(y) = 0.75 in the HP case. Among the 3 plots, the TP profile shows the largest
change, 0.5, at this point, which distinguishes it from the rest in terms of timing
failure sensitivity.
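The correspondence between the 25% failure point, D(y) = 0.5 and H(y) = 0.75 can be reproduced with a few lines of simulation. The sketch below is in Python rather than the MATLAB of Appendix B and is only an illustration under simplifying assumptions: the output ideally toggles every cycle, only the slow (falling) transition can fail, each failure is independent with probability `p_fall_fail` (a hypothetical parameter), and a failed capture returns the value from the previous cycle.

```python
import random


def simulate_toggle_path(p_fall_fail, n_cycles, seed=0):
    """Return (HP, TP) of the sampled output of a toggling path.

    Falling transitions are missed with probability p_fall_fail;
    rising transitions are assumed never to fail.
    """
    rng = random.Random(seed)
    samples = []
    for k in range(n_cycles):
        z = k % 2                          # ideal output value in cycle k
        falling = (z == 0)                 # previous ideal value was 1
        if falling and rng.random() < p_fall_fail:
            samples.append(1 - z)          # timing failure: old value is captured
        else:
            samples.append(z)
    hp = sum(samples) / n_cycles
    tp = sum(a != b for a, b in zip(samples, samples[1:])) / (n_cycles - 1)
    return hp, tp


for p in (0.0, 0.5, 1.0):
    hp, tp = simulate_toggle_path(p, 20000)
    # Half of all cycles are falling transitions, so the overall failure rate is p / 2.
    print(f"failure rate {p / 2:.0%}: HP = {hp:.3f}, TP = {tp:.3f}")
```

With p = 0.5 the overall failure rate is 25% and the simulated values settle near HP = 0.75 and TP = 0.5; TP moves by about 0.5 from its initial value of 1, whereas HP moves by only about 0.25 from 0.5.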
4.3.1 The Contributions of Timing Uncertainties
The main concern with using TP or HP, compared to failure rate, arises when the delays of the
rising and falling transitions are approximately symmetrical (equal), causing the region with
D(y) = 0 between the two failure slopes to diminish. However, the existence of the main
timing uncertainties, clock jitter and metastability, interestingly prevents the TP profile from
completely losing sensitivity to timing failure in such cases. Their effects are illustrated in
Figure 4.4: Comparison of simulated failure rate and statistical profiles of a single path CUT
in ideal conditions, with clock jitter, and with clock jitter plus flip-flop metastability. The plots
show, against frequency (700 to 850 MHz): (a) the failure rate profiles (%), (b) the transition
probability profiles, and (c) the high probability profiles. The 25% failure point corresponds to
D(y) = 0.5 and H(y) = 0.75. Since the FRD method is based on half clock period, the frequency
scale of the failure rate plot is doubled to match the other two plots for direct comparison.
Figure 4.5: Simulated (a) TP and (b) HP profiles, against frequency (730 to 810 MHz), of a circuit
path with identical rising and falling transition delays under random edge-to-edge clock jitter.
The effects of clock jitter and flip-flop metastability are depicted by their corresponding plots
(ideal, with clock jitter, and with clock jitter plus metastability).
Figure 4.5, which clearly shows the positive effect of random clock jitter. When jitter is absent,
the TP profile has absolutely no sensitivity to timing failure, whereas the cases with jitter and
metastability produce easily distinguishable TP responses. Note that metastability in real
circuits should have a much smaller effect on the TP profile. This was shown to be the case
in the previous chapter in Section 3.6. However, to demonstrate its effect more clearly, the
simulations were carried out with the exaggerated metastable window described previously in
Section 3.4.2, which also defines the metastable window.
Another interesting observation from Figure 4.5 is that the HP profile is hardly sensitive to
timing failure, showing only a tiny noise-like fluctuation in the profile during the expected
period of timing failure. This clearly shows the advantage of using the second order TP as the
indicator of timing failure, where timing errors cause a decrease in signal transitions but no
Figure 4.6: Timing diagram illustrating the combinatorial output of a circuit path and the
corresponding clock edges with jitter. The edge jitter is described by a random variable τ in
terms of timing variation from the expected clock edge. The clock jitter distribution PDF (τ)
is centered around the expected clock edge at T and is bounded by τmin and τmax. [Diagram labels:
test clock edge at T and jittered edge at T + τ, lying between T + τmin and T + τmax; combinatorial
output transitions tslow (fall) and tfast (rise).]
overall change in the average number of logic high cycles (HP).
Simulating Realistic Jitter
Clearly, clock jitter is the key to the success of the TP method. Yet, the behaviour of jitter
could vary between different clock sources, and jitter could be induced by different processes
[71]. Therefore, it is important for us to thoroughly understand how these differences could
affect the resultant TP profile.
The main concept of jitter is illustrated in Figure 4.6, where jitter is described by a random
variable τ relative to the expected clock edge at time T . Since we are only interested in the
timing constraint on the CUT’s combinatorial output imposed by a single clock period and the jitter
between two consecutive clock edges (T+τ), we can model the overall jitter as pure edge-to-edge
jitter.
According to [71], there are mainly two types of jitter that could affect the apparent edge-to-
edge jitter experienced by the CUT:
I. Edge-to-edge random jitter — it causes independent random phase variation of each clock
edge.
II. Low frequency multiple-cycle random jitter — it causes a random but gradual phase drift
over multiple clock cycles. As a result, the jitter between consecutive edges appears to
be highly correlated.
Figure 4.7 shows an example of low frequency multi-cycle random jitter. This kind of jitter
behaviour can be simulated by generating one low frequency random value for every
N clock cycles, then computing the jitter values of the intermediate cycles using any smooth
interpolation method, such as cubic interpolation. The correlation of the simulated jitter
between consecutive edges can be seen clearly in Figure 4.8 (b) compared against an independent
edge-to-edge random jitter (Figure 4.8 (a)). As can be seen in Figure 4.9, the low frequency
random jitter has a similar but slightly rounded distribution compared to the distribution of
the independent random jitter.
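The generation procedure described above can be sketched in a few lines. The following Python illustration is not the MATLAB code of Appendix B: it assumes uniformly distributed random knots, uses Catmull-Rom splines as one possible cubic interpolation, and the names `spacing` and `amplitude_ps` are hypothetical. Note that a Catmull-Rom spline can slightly overshoot the knot range.

```python
import random


def catmull_rom(p0, p1, p2, p3, t):
    """Cubic (Catmull-Rom) interpolation between p1 and p2 for 0 <= t < 1."""
    return 0.5 * (2 * p1 + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)


def low_frequency_jitter(n_cycles, spacing, amplitude_ps, rng):
    """One random knot every `spacing` cycles, smoothly interpolated in between."""
    n_knots = n_cycles // spacing + 4
    knots = [rng.uniform(-amplitude_ps, amplitude_ps) for _ in range(n_knots)]
    jitter = []
    for k in range(n_cycles):
        i, r = divmod(k, spacing)
        jitter.append(catmull_rom(knots[i], knots[i + 1], knots[i + 2],
                                  knots[i + 3], r / spacing))
    return jitter


def lag1_correlation(x):
    """Pearson correlation between the jitter of consecutive edges."""
    a, b = x[:-1], x[1:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


rng = random.Random(7)
independent = [rng.uniform(-15.0, 15.0) for _ in range(5000)]
correlated = low_frequency_jitter(5000, spacing=10, amplitude_ps=15.0, rng=rng)
print(lag1_correlation(independent), lag1_correlation(correlated))
```

The independent sequence gives a lag-1 correlation close to zero, while the interpolated sequence gives a value close to one; the exact figure depends on the knot spacing and the interpolation chosen.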
The major difference between the two types of jitter can be observed when they are applied
to the previous scenario with symmetrical rising and falling transition delays. It can be seen
in Figure 4.10 that the edge-to-edge correlation of the low frequency multi-cycle jitter causes a
significantly smaller TP response during the period of timing failure. The reduced TP sensitivity
can be explained by the following example. Consider the combinatorial output in Figure 4.6
with tslow = tfast = T . The general expression of TP under such a condition is given by:
\[ TP = P(Fall_{fail} \cap Rise_{fail}) + P(\overline{Fall_{fail}} \cap \overline{Rise_{fail}}) \tag{4.3} \]
Figure 4.7: An example plot of the behaviour of low frequency multi-cycle random jitter
(jitter in ps against clock cycles, showing the individual edge jitter and the underlying low
frequency multi-cycle random jitter). The clock edges vary randomly over multiple cycles, but each
edge is correlated with its neighbouring edges due to the low frequency and gradual timing drift.
Figure 4.8: Scatter plots showing the edge-to-edge correlation of clock signals with (a) edge-to-edge
random jitter, and (b) low frequency random jitter, where the jitter between consecutive
edges is highly correlated with a correlation coefficient of 0.977.
Figure 4.9: The histograms (density against jitter in ps) show the edge-to-edge relative jitter
distributions of (a) edge-to-edge random jitter, and (b) low frequency multi-cycle random jitter
within the same jitter boundaries of ±15ps.
The distribution of (b) does not differ a lot from (a) because the clock edge timing variation
in both cases are unpredictable between multiple clock cycles and are uniformly spread within
the same boundaries. The major difference in (b) is the reduced steepness near the boundaries,
which may lead to a slightly smoother TP profile around the failure frequency of the CUT.
Figure 4.10: Simulated TP (a) and HP (b) profiles, against frequency (730 to 810 MHz), of a circuit
path with identical rising and falling edge delays under low frequency clock jitter. The effects of
clock jitter and flip-flop metastability are depicted by their corresponding plots (ideal, with
clock jitter, and with clock jitter plus metastability), where the TP plot with jitter has a
reduced sensitivity to timing failure due to edge-to-edge jitter correlation.
Figure 4.11: Comparison of low frequency multi-cycle random jitter against edge-to-edge random
jitter in terms of the TP profile (transition probability against frequency, 700 to 850 MHz, with
the D(y) = 0.5 level marked) of a circuit path with different rising and falling transition
delays.
When there is no jitter correlation, the probability that one type of transition fails is 0.5 regardless
of the outcome of the previous transition, hence the TP is given by:
\[ TP = P(Fall_{fail})\,P(Rise_{fail}) + P(\overline{Fall_{fail}})\,P(\overline{Rise_{fail}}) = 0.5^2 + (1-0.5)^2 = 0.5 \tag{4.4} \]
which agrees with the simulation results in Figure 4.5. However, when jitter correlation exists,
the probability that a transition fails becomes dependent on the outcome of the previous
transition. This increases the probability that both transitions fail at the same time,
$P(Fall_{fail} \cap Rise_{fail})$, and that both do not fail at the same time,
$P(\overline{Fall_{fail}} \cap \overline{Rise_{fail}})$, causing
an overall increase of TP towards its initial value of 1.0 and reducing its sensitivity to timing
errors (Figure 4.10). In this case, metastability has the positive effect of improving sensitivity,
since it introduces uncertainty to the output that is always independent between clock edges
regardless of jitter correlation.
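The effect of correlation on TP for symmetrical delays can be checked numerically. The sketch below is a simplified Python illustration, not the event-based simulation used for the figures: it assumes a toggling output, a failure probability of 0.5 per edge, and models correlation with a hypothetical `persistence` parameter, the probability that an edge simply repeats the pass/fail outcome of the previous edge.

```python
import random


def toggle_tp(fail_flags):
    """TP of a toggling output; fail_flags[k] marks a missed capture in cycle k."""
    # A missed capture returns the previous value, i.e. the complement of the ideal one.
    samples = [(k % 2) ^ int(f) for k, f in enumerate(fail_flags)]
    return sum(a != b for a, b in zip(samples, samples[1:])) / (len(samples) - 1)


def failure_sequence(n, persistence, rng):
    """Pass/fail outcomes with P(fail) = 0.5 and adjustable edge-to-edge correlation."""
    flags = [rng.random() < 0.5]
    for _ in range(n - 1):
        if rng.random() < persistence:
            flags.append(flags[-1])            # outcome carried over from previous edge
        else:
            flags.append(rng.random() < 0.5)   # fresh, independent outcome
    return flags


rng = random.Random(1)
for persistence in (0.0, 0.9, 1.0):
    tp = toggle_tp(failure_sequence(20000, persistence, rng))
    print(f"persistence {persistence}: TP = {tp:.3f}")
```

A transition is observed whenever two consecutive edges both fail or both succeed, so independent outcomes give a TP near 0.5, while fully persistent outcomes leave TP at 1.0 and the timing failure becomes invisible.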
Note that when the rising and falling transition delays are asymmetrical with tslow ≫ tfast, the
probability that a fast transition fails given a failed slow transition is always zero. This breaks
any probability correlation between the transition pair, and the TP profile is unaffected by any
degree of jitter correlation. This effect is observed clearly in Figure 4.11, where the TP profiles
from both independent and correlated jitter are approximately identical.
4.3.2 Single Path TP Model
Based on the simulation results, the following models describing both the independent and
correlated jitter cases are derived. For a single path stimulated by a toggle signal and driven by
a clock signal with independent edge-to-edge random jitter, the behaviour of the TP profile as a
function of clock period (T ) is given by the following approximation:
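\[ TP_{indep}(T) \approx \frac{1}{2} + 2\left(\frac{1}{2} - \int_{-\infty}^{t_{fast}-T} PDF_{Jitter}(\tau)\,d\tau\right) \times \left(\frac{1}{2} - \int_{-\infty}^{t_{slow}-T} PDF_{Jitter}(\tau)\,d\tau\right) \tag{4.5} \]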
where tfast and tslow represent the propagation delays of the rising and falling transitions.
On the other hand, when the jitter between consecutive clock edges is perfectly correlated,
the TP profile is given by:
\[ TP_{corr}(T) \approx 1 - \int_{-\infty}^{t_{slow}-T} PDF_{Jitter}(\tau)\,d\tau + \int_{-\infty}^{t_{fast}-T} PDF_{Jitter}(\tau)\,d\tau \tag{4.6} \]
where tfast ≤ tslow.
Note that both TPindep and TPcorr give identical results in normal cases when the failures
caused by tfast and tslow do not overlap (see Figure 4.11). Yet, when the two types of delay
do overlap, the jitter correlation causes their respective changes of TP to cancel each other out,
reducing the magnitude of change in the TP profile.
Clock signals in real systems are likely to exhibit both independent and correlated jitter.
Therefore, a combination of TPindep and TPcorr can be used:
\[ TP(T) = (1-k) \times TP_{indep}(T) + k \times TP_{corr}(T) \tag{4.7} \]
where k defines the correlation factor ranging from 0 to 1. In reality, it is highly unlikely to
have perfect edge-to-edge jitter correlation, i.e., k = 1. Therefore, the TP profile should always
show a measurable amount of change, even if tfast and tslow are exactly identical. The effect of
metastability is not included in the approximations because it is expected to have a negligible
effect on the TP profile in most realistic cases.
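The single path models of Eqs. 4.5 to 4.7 can be evaluated numerically once a jitter distribution is chosen. The sketch below is a Python illustration that assumes a uniform jitter distribution of hypothetical half-width `half_width` (in ps), so that the integrals reduce to the uniform cumulative distribution function; the delay values used are illustrative only.

```python
def uniform_cdf(x, half_width):
    """Integral of a uniform jitter PDF on [-half_width, +half_width] up to x."""
    if x <= -half_width:
        return 0.0
    if x >= half_width:
        return 1.0
    return (x + half_width) / (2.0 * half_width)


def tp_indep(T, t_fast, t_slow, half_width):
    """Eq. 4.5: independent edge-to-edge jitter."""
    f_fast = uniform_cdf(t_fast - T, half_width)
    f_slow = uniform_cdf(t_slow - T, half_width)
    return 0.5 + 2.0 * (0.5 - f_fast) * (0.5 - f_slow)


def tp_corr(T, t_fast, t_slow, half_width):
    """Eq. 4.6: perfectly correlated jitter (requires t_fast <= t_slow)."""
    f_fast = uniform_cdf(t_fast - T, half_width)
    f_slow = uniform_cdf(t_slow - T, half_width)
    return 1.0 - f_slow + f_fast


def tp_mixed(T, t_fast, t_slow, half_width, k):
    """Eq. 4.7: weighted combination with correlation factor k in [0, 1]."""
    return ((1.0 - k) * tp_indep(T, t_fast, t_slow, half_width)
            + k * tp_corr(T, t_fast, t_slow, half_width))


# Symmetrical delays, clock period equal to the delay (all values in ps).
print(tp_indep(1000.0, 1000.0, 1000.0, 15.0))       # 0.5
print(tp_corr(1000.0, 1000.0, 1000.0, 15.0))        # 1.0
# Well separated delays: both models agree between the two failure points.
print(tp_indep(1000.0, 900.0, 1100.0, 15.0), tp_corr(1000.0, 900.0, 1100.0, 15.0))
```

For symmetrical delays the independent model drops to 0.5 while the perfectly correlated model stays at 1.0, and for well separated delays both models give 0 between the two failure points, consistent with the behaviour described for Figures 4.5, 4.10 and 4.11.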
Edge-to-edge Jitter Correlation in the Cyclone III FPGA
A preliminary test was carried out on an Altera Cyclone III to identify possible edge-to-edge
jitter correlation in its clock signal. The CUT was implemented with a series of 9 inverters such
that the path has approximately symmetrical rising and falling transition delays. Figure 4.12
depicts the TP profiles, where the individual TP profile of each transition type is isolated for
comparison. The overall TP profile shows a deviation of less than 0.2 from its initial value of 1
during the failure of both transition types, which is clear evidence of correlated multi-cycle
jitter in the clock signal.
Figure 4.12: Actual TP profile from a CUT on a Cyclone III EP3C25 with similar rising and
falling transition delays (9 LUTs configured as inverters; transition probability against
frequency, 240 to 265 MHz, showing the normal TP profile, the TP of falling transitions and the
TP of rising transitions). The delays are matched by configuring the LUTs as inverters to even
out the delay difference between rising and falling transitions. The CUT is driven by simple
toggle stimulus, which allowed the individual TP profile of each transition type to be isolated
by measuring at even or odd clock cycles.
4.4 Test Implementations and Results
Two basic tests, based on 1) an isolated critical path in an adder, and 2) a linear feedback
shift-register (LFSR) of pure sequential circuit structure, were carried out to evaluate the TP
test method in a 65nm Altera Cyclone III EP3C25 FPGA. The first test was intended to characterise
the TP profile of the critical path in a practical combinatorial circuit as well as to obtain a
general picture of the physical jitter distribution in FPGAs driven by PLL based clock sources.
The second test explores the test method’s potential for measuring sequential circuits where
controllability of the circuit’s internal state is limited.
4.4.1 Adder Critical Path Testing
The test circuit is depicted in Figure 4.13, where the CUT is implemented as a simple 4-bit
adder using 4 LUTs operating in arithmetic mode [28] and dedicated carry-chain interconnects.
Since the test circuit only needs to toggle one input bit to exercise the critical path of the
adder, the TVG and Launch register are simplified to a single toggle flip-flop that generates the
required test vector. The generated signal toggles every clock cycle and has a stationary TP of
1.0.
The TCG is implemented based on the two-stage on-chip PLL clock generation approach presented
in Section 3.5.3 of the previous chapter. The generated test clock frequency (ftest) is
given by:
\[ f_{test} = f_{ref} \times \frac{M}{N \times C} \times k \tag{4.8} \]
where M , N and C are the loop-counter, pre-counter and post-counter values respectively of
the first PLL stage, and k is the constant frequency multiplication factor applied by the second
PLL stage to enhance both frequency resolution and range (see Chapter 3, Section 3.5.3 for
details). The TCG in this case is able to support clock frequencies reliably up to the PLL’s
maximum VCO frequency of 1300MHz specified in [4]. Since the full clock period is used as the
timing constraint, as opposed to the half clock period in the previous FRD method, the overall
timing resolution is halved. However, this allows CUTs to be tested truly at-speed and the
timing resolution is still sufficiently high, remaining largely within 1ps.
The Transition Activity Counter (TAC) is implemented as an asynchronous counter. During a
test, the TAC counts the number of transitions over a given test period for each clock frequency
step, and a TP profile plot is produced and analysed by the Transition Probability Analyser
(TPA). The asynchronous nature of the TAC ensures reliable transition counting regardless of
the interconnect delay in between and allows it to be placed some distance away from the
CUT, eliminating its influence on the CUT’s delay due to localised heating.
Figure 4.13: The CUT is a 4-bit Adder implemented on an Altera Cyclone III EP3C25 with
LUTs in arithmetic mode. A and B are the inputs of each Full Adder (FA). C and S are the
carry signals and sum outputs respectively. The carry signals are transmitted using dedicated
carry-chain interconnects. [Block diagram: the TVG (toggle flip-flop acting as Launch Register)
drives the 4-bit adder (4 LUTs); the output Z is captured by the Sample Register (SR) as y and
counted by the TAC (asynchronous counter). The TP Profile Builder passes the transition count to
the Transition Probability Analyser (TPA), which produces the frequency/delay measurements and
sets the Test Enable and TCG parameters of the Test Clock Generator (TCG).]
Figure 4.14: Transition probability D(y) vs. frequency plot from 750 to 1250MHz of the critical
path through the 4-bit Adder CUT on the Cyclone III. Region “a” represents the initial TP
level of normal circuit operation. Regions “b” to “e” are the distinctive regions caused by
progressive failure of signal propagation in the CUT. [Plot annotations: (a) stationary initial
state; (b) −ve transition timing violation (clock jitter region); (c) −ve transitions failed;
(d) +ve transition timing violation (clock jitter region); (e) −ve and +ve transitions failed;
centre of the first jitter distribution at 972.3 MHz, where D(y) = 0.5.]
4.4. Test Implementations and Results 139
Results and Clock Jitter Analysis
Figure 4.14 shows the relationship between the transition probability D(y) and the test clock
frequency of the circuit described in Figure 4.13. The results show five distinctive regions
as the CUT fails progressively in stages. The measured TP profile matches the simulated TP
behaviour of a single path obtained in the previous section, where the slopes in the TP profile
in regions (b) and (d) of Figure 4.14 are the result of clock jitter. The jitter distributions
affecting the falling and rising transitions (regions (b) and (d) respectively) can be extracted
from the TP profile by plotting its gradient against clock period. The results are presented in
Figure 4.15, where a plot of the TP profile against clock period is also provided for reference.
The extracted jitter distributions corresponding to the two transition types are described by the
random variable τ and form the probability density function PDF(τ). The distributions have well
defined boundaries marked by τmin and τmax relative to the expected clock edge positions defined
by τmedian0 and τmedian1. In this case, the width of both jitter distributions is approximately
100ps, within the specified range of jitter in [4]. By using the median value of the jitter
distribution as reference, which is also defined by the intersection of the TP profile with
D(y) = 0.5, the nominal worst-case delay tdelay and the maximum operating frequency fmax can be
accurately determined. For this particular case: tdelay = 1029ps and fmax = 972.3MHz.
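The extraction steps described above can be sketched in software as follows. This is only an illustration under stated assumptions: the TP profile is synthetic (a single linear failure slope produced by a uniform 100 ps jitter window centred on 1029 ps), the 3-point median filter width is arbitrary, and all function names are hypothetical.

```python
from statistics import median


def median_filter(values, window=3):
    """Sliding median; the window is truncated at the two ends."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def jitter_pdf(periods_ps, tp):
    """Magnitude of the TP gradient w.r.t. clock period, at segment midpoints."""
    mids, density = [], []
    for p0, p1, d0, d1 in zip(periods_ps, periods_ps[1:], tp, tp[1:]):
        mids.append((p0 + p1) / 2)
        density.append(abs((d1 - d0) / (p1 - p0)))
    return mids, density


def crossing_period(periods_ps, tp, level=0.5):
    """First period at which the TP profile crosses `level` (linear interpolation)."""
    for p0, p1, d0, d1 in zip(periods_ps, periods_ps[1:], tp, tp[1:]):
        if d0 != d1 and (d0 - level) * (d1 - level) <= 0:
            return p0 + (level - d0) * (p1 - p0) / (d1 - d0)
    return None


# Synthetic profile (assumed): TP ramps linearly from 1 to 0 over a 100 ps
# window centred on 1029 ps, i.e. a uniform jitter distribution.
periods = list(range(1100, 955, -5))                      # descending, in ps
profile = [min(1.0, max(0.0, (p - 979) / 100)) for p in periods]

smooth = median_filter(profile)
mids, pdf = jitter_pdf(periods, smooth)
t_delay_ps = crossing_period(periods, smooth)             # median of the jitter distribution
f_max_mhz = 1e6 / t_delay_ps
print(t_delay_ps, f_max_mhz, max(pdf))
```

For a uniform jitter window of width W, the gradient of the failure slope is 1/W, so a 100 ps window gives a density of about 0.01 per ps in this sketch.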
The two jitter distributions centered at approximately 972MHz and 1125MHz are almost identical
in range and shape, meaning that jitter behaviour is largely independent of clock frequency
in the Cyclone III. While the shape of the distributions is close to, but not exactly, uniform,
the slight inclination of the distributions has very little effect on the TP profiles,
so the two failure slopes are approximately linear.
Measurement Accuracy
According to the models in Section 4.3, the TP response corresponding to timing failure matches
exactly with the absolute failure rate response of a single path. Therefore, the TP method is
expected to have the same level of measurement accuracy as the FRD method proposed in
[Figure 4.15 plots. Top: transition probability (0 to 1.0) vs. period (ps, from 1150 down to 700), with D(y) = 0.5 and the centre of the jitter distribution marked at 1029 ps. Bottom: jitter probability density PDF(τ) (0 to 0.015) vs. period (ps), with τmedian0 and τmedian1 marked together with the bounds τmedian + τmin and τmedian + τmax of each distribution.]
Figure 4.15: The top plot is the TP profile of the 4-bit Adder vs. clock period steps in descending
order. The bottom plot shows the estimated probability density function (PDF (τ)) of clock
jitter derived from the TP profile by taking its derivative. The TP profile data points are first
filtered by a median filter to reduce noise level in the resulting derivative. τmedian0 and τmedian1
are the median values of the two jitter distributions corresponding to the negative and positive
transitions of output Z.
[Figure 4.16 diagram. Labelled blocks: TCG; CUT consisting of a 63 bit linear-feedback shift-register (LFSR) built from registers R (bits 62 down to 0); output y to the TAC (asynchronous counter).]
Figure 4.16: The test circuit for testing a 63 bit maximal length linear-feedback shift-register
(LFSR) on the Cyclone III.
the previous chapter. Furthermore, unlike the FRD method, the TP method does not require
direct access to the CUT’s combinatorial output, which preserves the exact fan-out and load
of the CUT and allows an even more accurate measure of the actual circuit delay. Finally, the
ability to test the CUT at the full clock rate (at-speed), as opposed to the half clock rate of
the FRD method, ensures a more realistic test environment and more realistic timing measurements.
4.4.2 LFSR Sequential Circuit Testing
For sequential circuits, the test method should work equally well as long as the circuit exhibits
an observable stationary output transition probability in normal operation. In this section we
demonstrate this with an example sequential circuit on the Cyclone III.
An LFSR was chosen as the test candidate because it not only satisfies the basic requirement of
the TP test method, a stationary output TP, but also encapsulates typical sequential circuit
structure and characteristics such as feedback loops and finite state machine behaviour. The
actual test circuit is depicted in Figure 4.16, where the LFSR is implemented as a 63 bit maximal
length LFSR. The test circuit does not require any TVG, launch register or sample register
because their functions are already covered by the LFSR itself. The LFSR is driven by the
test clock from the TCG and one output bit is selected for processing by the TAC and TPA to
produce its TP profile and fmax measurements.
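To illustrate why an LFSR needs no separate test vector generator, the following sketch models a 63 bit LFSR in software and counts the transitions of one output bit during error-free operation. The tap positions (stages 63 and 62, a commonly tabulated maximal-length choice), the seed and the observed bit are assumptions; the taps used in the actual test circuit are not stated here.

```python
MASK63 = (1 << 63) - 1


def lfsr63_step(state):
    """Advance a 63-bit Fibonacci LFSR by one clock (assumed taps: stages 63 and 62)."""
    feedback = ((state >> 62) ^ (state >> 61)) & 1
    return ((state << 1) | feedback) & MASK63


def output_transition_probability(seed, cycles):
    """Clock the LFSR `cycles` times and return (TP of bit 0, final state)."""
    state = seed & MASK63
    if state == 0:
        raise ValueError("the all-zero state is a lock-up state")
    prev = state & 1
    transitions = 0
    for _ in range(cycles):
        state = lfsr63_step(state)
        bit = state & 1
        transitions += (bit != prev)
        prev = bit
    return transitions / cycles, state


SEED = 0x2545F4914F6CDD1D  # arbitrary non-zero seed (assumption)
tp, final_state = output_transition_probability(SEED, 20000)
print(tp)
```

With a well-mixed seed the printed value is expected to sit close to the stationary level seen in the error-free part of Figure 4.17, although the exact figure depends on the seed and the number of cycles.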
[Figure 4.17 plot: TP profile of the LFSR output, transition probability D(y) (0.4 to 0.52) vs. frequency (MHz, 1050 to 1150), with the lower threshold, the upper threshold and the fmax estimation at 1124.4 MHz marked.]
Figure 4.17: The output TP profile of the linear-feedback shift-register (LFSR) from 1.05 to 1.15
GHz. Upper and lower thresholds are used to determine the point of fmax where the deviation
of D(y) occurs.
Results Analysis
The plot in Figure 4.17 shows a very distinctive change in the TP profile as the LFSR fails. The
sharp change is caused by the positive feedback of internal signal errors where a small change
in transition probability leads to a bigger change on the next feedback cycle, causing the entire
LFSR to fail catastrophically. The sharp change allows the fmax to be defined with ease. Since
the TP response is no longer related to the failure of specific signal transitions in a path but
the collective error accumulated in the circuit, the general reference point at D(y) = 0.5 cannot
be used. Instead, the point of fmax can be defined using a “thresholding” method with a set of
lower and upper thresholds.
The thresholds can be determined using the following method:
(i) The TP profile is low-pass filtered to reduce any fluctuations due to the imperfect statistical
measurements from a finite number of TP count samples of the output. In this particular
case, we have chosen a Wiener filter [80] to remove the noise components where the
noise power is estimated from the TP profile during the initial error-free stage (1050 to
1100MHz). However, any other form of low-pass filter could have been used to achieve
similar results.
(ii) A steady state mean value (µ) and standard deviation value (σ) are estimated over the
low frequency region (1050 to 1100MHz) where the CUT is expected to work correctly.
Using µ and σ, the upper and lower bound thresholds are determined according to:
\[ \mathrm{Threshold}_{upper} = \mu + g\sigma \qquad (4.9) \]
\[ \mathrm{Threshold}_{lower} = \mu - g\sigma \qquad (4.10) \]
where g is a constant chosen to set the threshold bounds such that the algorithm is
guarded from detecting random fluctuations in the TP profile before the actual failure
frequency of the CUT. In this case, g = 3 is used, which gives the thresholds a 3-sigma
margin.
(iii) The frequency at which the TP profile first falls outside of the threshold bounds, as shown
in Figure 4.17, is taken as a good estimate of fmax. The LFSR case has an estimated fmax
of 1124.4MHz and the delay of the most critical stage is given by: tdelay = 1/fmax = 889ps.
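Steps (ii) and (iii) can be sketched as follows. The sketch assumes the profile has already been low-pass filtered as in step (i); the TP data are synthetic (a small ripple around 0.5 followed by a drop at 1124.5 MHz), and the function name is hypothetical.

```python
from statistics import mean, stdev


def estimate_fmax(freqs_mhz, tp, ref_lo=1050.0, ref_hi=1100.0, g=3.0):
    """Return (fmax, lower, upper) using mu +/- g*sigma bounds from the reference region."""
    ref = [d for f, d in zip(freqs_mhz, tp) if ref_lo <= f <= ref_hi]
    mu, sigma = mean(ref), stdev(ref)
    upper, lower = mu + g * sigma, mu - g * sigma
    for f, d in zip(freqs_mhz, tp):
        if not (lower <= d <= upper):
            return f, lower, upper      # first point outside the bounds
    return None, lower, upper           # no failure seen in the sweep


# Synthetic, already-filtered profile (assumed): small ripple around 0.5,
# then a drop once the circuit starts to fail at 1124.5 MHz.
freqs = [1050.0 + 0.5 * i for i in range(201)]
profile = [0.5 + 0.001 * ((i % 3) - 1) if f < 1124.5 else 0.45
           for i, f in enumerate(freqs)]

f_max, lower, upper = estimate_fmax(freqs, profile)
t_delay_ps = 1e6 / f_max               # period in ps for a frequency in MHz
print(f_max, lower, upper, t_delay_ps)
```

A larger g makes the detector more tolerant of noise in the error-free region at the cost of reacting slightly later to a gradual deviation.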
The TP test method should be applicable to any sequential circuit with similar structure and
behaviour. These include circuits from simple counters to complex infinite impulse response
(IIR) filters that are commonly used in digital signal processing.
4.4.3 Test Time and Optimisations
General Test Time
The time required to complete a test of any type of CUT in general depends on three factors:
the transition counting period (Tcount) of the TAC, the predefined test clock frequency range
from the start frequency (fbegin) to the end frequency (fend), and the frequency resolution (∆f).
The total test time (Ttest) is generalised to:
\[ T_{test} = T_{count} \times \frac{f_{end} - f_{begin}}{\Delta f} \qquad (4.11) \]
For example, in the LFSR test case (Figure 4.17), Tcount = 1 ms, fbegin = 1050MHz,
fend = 1150MHz and ∆f = 0.5MHz, giving 200 frequency steps and a total test time of 0.2 second.
Since Tcount is defined as a fixed time period, the number of samples (K) taken at each frequency
step increases with test clock frequency (ftest) where:
K = Tcount × ftest (4.12)
Thus, the increasing K has the positive effect of increasing the statistical accuracy of the D(y)
measurements as ftest increases while maintaining a static test time.
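Equations 4.11 and 4.12 can be checked numerically with the LFSR sweep parameters quoted above. The function names are illustrative; the values are those stated in the text.

```python
def total_test_time(t_count_s, f_begin_hz, f_end_hz, delta_f_hz):
    """Eq. 4.11: counting period times the number of frequency steps."""
    return t_count_s * (f_end_hz - f_begin_hz) / delta_f_hz


def samples_per_step(t_count_s, f_test_hz):
    """Eq. 4.12: number of clock cycles (samples) observed at one frequency step."""
    return t_count_s * f_test_hz


t_test = total_test_time(1e-3, 1050e6, 1150e6, 0.5e6)   # LFSR sweep of Figure 4.17
k_first = samples_per_step(1e-3, 1050e6)                # samples at the first step
k_last = samples_per_step(1e-3, 1150e6)                 # samples at the last step
print(t_test, k_first, k_last)
```

With a fixed 1 ms counting period, the number of samples per step grows from about 1.05 million at 1050MHz to about 1.15 million at 1150MHz.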
Test Time Optimisation
Since we are only interested in the worst case fmax of a circuit, it is not necessary to step
through the entire range of predefined test frequencies beyond the point of fmax. The test
can terminate once the value of fmax is found. Therefore, the total test time can be defined
adaptively by:
\[ T_{test} = T_{count} \times \frac{f_{max} - f_{begin}}{\Delta f} \qquad (4.13) \]
Assuming the same fbegin is used, the test time of the LFSR case can be reduced by about 25%,
to 0.15 second, with the adaptive frequency range.
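A minimal sketch of such an early-terminating sweep is given below. The measurement callback, the fixed threshold bounds and the failure frequency are hypothetical stand-ins for the TAC/TPA hardware; in practice the bounds would come from the error-free region of the profile.

```python
def adaptive_sweep(measure_tp, f_begin, f_end, delta_f, lower, upper):
    """Step the test clock upwards and stop at the first TP value outside the bounds.

    Returns (f_max, steps_taken); f_max is None if no failure is found.
    """
    n_steps = int(round((f_end - f_begin) / delta_f))
    for i in range(n_steps + 1):
        f = f_begin + i * delta_f
        if not (lower <= measure_tp(f) <= upper):
            return f, i + 1
    return None, n_steps + 1


def fake_measure_tp(f_mhz):
    """Hypothetical stand-in for one TAC measurement: fails at or above 1124.4 MHz."""
    return 0.5 if f_mhz < 1124.4 else 0.45


T_COUNT_S = 1e-3
f_max, steps = adaptive_sweep(fake_measure_tp, 1050.0, 1150.0, 0.5, 0.49, 0.51)
t_test = steps * T_COUNT_S
print(f_max, steps, t_test)
```

In this sketch the sweep stops after 150 of the 201 possible frequency points, which corresponds to the 0.15 second figure quoted above.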
4.5. Practical Usage of the TP Method 145
4.5 Practical Usage of the TP Method
Our first application of the TP test method was to measure the aging or degradation of
FPGA circuits in terms of the change in propagation delay over their operating lifetime [14, 15].
We found that the TP method is highly suitable for this purpose because: 1) the test circuits
themselves are not timing critical in the test and can tolerate a very high degree of delay
degradation before the measurements are affected; 2) unlike conventional ring oscillator (RO)
based approaches, it gives a clear indication of the worst-case delay and the delay of each
individual signal transition type (rise or fall), as opposed to the average of the
best-case and worst-case transition delays obtained with ROs (see Section 2.6.1).
The test was intended to explore the physical effect of temperature, core supply voltage and
logic activity stresses on the rate of degradation in FPGAs and identify possible connections
between known degradation mechanisms/models in semi-conductor devices. The degradation
modelling details are included in [14].
4.5.1 Test Setup
The test is carried out on an Altera Cyclone III EP3C25. The chip is divided into four regions,
where four different types of logic activity stresses were tested. They are (a) high frequency
switching at 300MHz, (b) low frequency switching at 1Hz, (c) static high (DC1), and (d) static
low (DC0). See [14, 15] for more detail on the logic activity stresses selection. The floor plan
of the FPGA is depicted in Figure 4.18 with the four regions and their stresses labelled.
Three types of resources were tested on the FPGA: LUT test blocks containing a series of 9
LUTs configured as buffers, column interconnects spanning 4 LABs (C4), and row interconnects
spanning 24 LABs (R24). The placement and location of the three types of test blocks (CUTs)
are depicted in Figure 4.18.
The environmental stresses (temperature, core voltage) are applied to the FPGA using an
external heater plate and a customised core voltage controller circuit. The stress temperature
[Figure 4.18 floor plan. Labelled regions and elements: High Frequency Stress (300 MHz); Low Frequency Stress (1 Hz); Static High Stress (DC 1); Static Low Stress (DC 0); Row Interconnects with Row IC Start/End and Row IC Anchors; Column Interconnects with Column IC Start/End and Column IC Anchor; LUT test Blocks; Control Circuit and TACs.]
Figure 4.18: Floor plan of the complete test circuit on the Cyclone III EP3C25. Four equal
sized partitions are selected to accommodate the 4 types of logic activity stress (300MHz, 1Hz,
DC1 and DC0). Each LUT test block contains a CUT of 9 LUTs placed in a logic array
block (LAB). CUTs of row and column interconnects (IC) are created with anchor LUTs and
start/end LUTs to control the span and orientation of the interconnect lines. The entire FPGA
contains 18 × 32 LUT test blocks, 19 × 4 sets of column IC tests and 32 sets of row IC tests.
and core voltage are set to 125 °C and 1.8 V respectively. The degree of degradation acceleration
caused by these environmental stresses is discussed in [15].
The TP test procedure is initiated automatically by a PC connected to the FPGA and is
repeated at a regular time interval to build a series of delay maps of the CUTs over a period
of 4 weeks. The environmental stresses are removed before each test to ensure realistic
measurements of the FPGA at normal operating temperature and core voltage (35 °C and
1.2 V).
4.5.2 Results
In Figure 4.19, a noticeable amount of initial delay variability can be observed in the initial
delay map across the FPGA. An interesting trend in the delay measurements over
the 4-week stress period (693 hours) can also be seen in the delay maps. A clear
degradation-induced delay variability between the four regions with different logic activity
stresses emerged after the first few days, where DC0 has the highest impact on the falling
transition delays and DC1 has the highest impact on the rising transition delays. As can be seen
in Figure 4.20 and the delay maps, the rate of degradation is highest at the beginning and
gradually reduces over time.
The delay plots in Figures 4.20, 4.21 and 4.22 show the average delay of the CUTs with LUTs,
C4 interconnects and R24 interconnects within each of the four logic activity stress regions
(300MHz, 1Hz, DC1 and DC0). Thanks to the ability to measure the delay of falling and
rising transitions separately with the TP method, the asymmetrical impact of degradation on
the CUTs’ delay was observed, which allowed us to identify the specific degradation mechanisms
[14, 15] that caused the respective delay changes. The analysis in [14] revealed that
the worst-case degradation with DC0 is caused mostly by negative bias temperature instability
(NBTI) in the underlying transistors.
Figure 4.19: Degradation maps of the Cyclone III in terms of the propagation delay of rising
(fast) and falling (slow) transitions through each LUT test block. Different rate of degradation
is seen between the four stress regions from top to bottom: 300MHz, 1Hz, DC1 and DC0.
[Figure 4.20 plots: LUT Degradation (Slow Transitions) and LUT Degradation (Fast Transitions); delay (ns, 3.4 to 4) vs. time elapsed (hours, 0 to 600), with one curve each for DC 0, DC 1, 1Hz and 300MHz.]
Figure 4.20: Degradation measurement in delay of rising (fast) and falling (slow) transitions in
LUT chains under accelerated life stress, grouped by four types of input signal.
[Figure 4.21 plots: C4 Interconnect Degradation (Slow Transitions) and C4 Interconnect Degradation (Fast Transitions); delay (ns, 1.9 to 2.15) vs. time elapsed (hours, 0 to 600), with one curve each for DC 0, DC 1, 1Hz and 300MHz.]
Figure 4.21: Degradation measurement in delay of rising (fast) and falling (slow) transitions in
C4 interconnects under accelerated life stress, grouped by four types of input signal.
[Figure 4.22 plots: R24 Interconnect Degradation (Slow Transitions) and R24 Interconnect Degradation (Fast Transitions); delay (ns, 3 to 3.4) vs. time elapsed (hours, 0 to 600), with one curve each for DC 0, DC 1, 1Hz and 300MHz.]
Figure 4.22: Degradation measurement in delay of rising (fast) and falling (slow) transitions in
R24 interconnects under accelerated life stress, grouped by four types of input signal.
4.6 Summary and Discussion
In this chapter, an improved delay measurement method based on transition probability (TP)
is proposed. This statistically based measurement technique unifies the testing of combinatorial
paths and sequential circuits with a single at-speed testing methodology. The test method
requires no modifications to the CUT and results in little hardware overhead. Since the CUT
output load and fan-out are completely preserved during a test, very accurate and realistic
timing measurements can be obtained.
Through simulations, the behaviour of TP through a circuit path is modelled with clock jitter
taken into account. The statistical behaviour of jitter in terms of edge-to-edge correlation is also
simulated and modelled to give an accurate prediction of TP response in real circuits. Using
a simple TP test circuit, we were able to confirm the predicted TP response with edge-to-edge
jitter correlation in a Cyclone III EP3C25 FPGA.
4.6. Summary and Discussion 151
The TP test method has been demonstrated on both combinatorial paths and sequential circuits
on the Cyclone III and has successfully measured the worst-case critical path delay in a 4-bit
adder and a 63-bit LFSR. Using the TP profile obtained from the adder test, the actual jitter
distribution in the clock signal of the Cyclone III is extracted and is shown to have a relatively
uniform shape. The effectiveness of the TP method is also demonstrated in the LFSR test case,
where the worst-case path delay can be measured by detecting deviation in the TP profile
with simple threshold bounds.
Moreover, the TP method provides a practical, efficient and reliable way of measuring a circuit’s
timing degradation in FPGAs over time that is not matched by most other existing
measurement methods, such as ring oscillators. The particular test case has demonstrated the
general advantage of the TP method in practical use for precise timing measurement, and has
allowed us to identify the physical degradation mechanism responsible for the observed delay
change under the specific environmental and logic activity stresses [14].
While we have shown the effectiveness of the TP method in measuring the delay of circuit paths
and an LFSR circuit, its potential should not be limited to known critical paths and simple
sequential circuits. According to the general characteristic of TP stated in Section 4.2.2 and
[79], the same principle should be applicable to more complex circuits containing multiple paths.
However, a better understanding of TP and a more extensive study of its behaviour in multi-path
cases are necessary to ensure the reliability and accuracy of the measurement method in such cases.
Chapter 5
Complex Circuit Testing
5.1 Introduction
So far, the basic concepts and uses of both the failure rate detector (FRD) and transition
probability (TP) based methods have been demonstrated on FPGAs in the previous chapters.
The greatest remaining challenge is how they can be improved for practical use in general
circuits whose physical implementations and structures are not necessarily known to the
users. The first criterion in achieving this goal is to ensure that the measurement accuracy of
the methods is adequate for practical use, and to discover ways to improve accuracy in circuits
with low controllability and/or observability that are difficult to test.
The FRD method has shown good measurement accuracy, but it suffers from the limitation that
only isolated paths can be tested. Despite this limitation, it can be used in a per-path exhaustive
testing manner to provide accurate timing measurement of complex multi-path circuits [3]. This
will be demonstrated in the early sections of this chapter.
In contrast to the FRD method, the TP method shows promising potential for measuring
multi-path combinatorial and sequential circuits with much higher efficiency, but its accuracy
in such cases is not known. To overcome this problem, the FRD per-path exhaustive test
procedure described above is used to evaluate the general accuracy of the TP method on a
complex 9x9 multiplier circuit [13].
For the rest of this chapter, an extensive study and modelling of TP behaviour is carried out
to isolate the main factors that affect the measurement accuracy in the initial test, and most
importantly, a technique that can be used to optimise general measurement accuracy of the TP
method is developed through the improved understanding of TP.
Lastly, the optimised TP method in the form of a modularised test platform is tested on a
number of practical complex combinatorial and sequential circuits on FPGA to evaluate its
accuracy and efficiency.
5.2 FRD Based Embedded Multiplier Testing
The FRD test method described in Chapter 3 is suitable for testing any combinatorial cir-
cuit paths as long as the un-registered combinatorial output and the registered output can
be accessed for comparison to detect timing failure. However, in the case of the embedded
multipliers in the Cyclone III and probably most of the other FPGA architectures, the inter-
nal output registers of the embedded multiplier and the direct combinatorial output cannot
be accessed simultaneously [28]. Although it is possible to bypass the internal output regis-
ters and use regular flip-flops to register the output, the measured delay would then include
the external interconnect delay, and fail to show the true performance of the multiplier when
used conventionally with its internal registers. To overcome this limitation, two approaches are
proposed:
Parallel Reference Signal (PRS) Approach - Two extra multipliers are used in parallel
to the CUT to generate the test reference signal, see Figure 5.1(a). The multipliers
are activated alternately and each operates at half of the test clock frequency. This
guarantees correct reference signal generation within the propagation delay of the CUT
for every clock cycle, enabling the CUT to be tested correctly at full clock rate.
154 Chapter 5. Complex Circuit Testing
Sequential Reference Signal (SRS) Approach - The same multiplier in the CUT is used
to generate the reference signal sequentially to check itself for error, see Figure 5.1(b).
The multiplier is driven by a test stimulus S that changes every 2 test clock cycles. Since
any timing error will be registered at the output (D) on the first cycle but recover to
the correct value on the second cycle, D can be used as a reference signal to detect error
shifted into the second register (ESR) at cycle 2.
Both approaches have their advantages and limitations. The PRS-Approach performs output
comparison every clock cycle as opposed to every two cycles in the SRS-Approach, therefore it
is twice as efficient in terms of output samples comparison. For a given number of test samples,
PRS-Approach requires only half the test time at the cost of higher complexity and resource
usage.
Although both approaches are valid and have their strengths, we decided to adopt the SRS-
Approach for its low resource usage over test efficiency. It has other advantages, namely less
complex circuitry and less self-heating during test, which also contribute to better reliability
and measurement accuracy.
5.2.1 Per-path Exhaustive Testing
The goal of exhaustive testing is to obtain delay information of a circuit with a known function
but unknown internal structure. The embedded multiplier is chosen for this example because
unlike circuits with known and obvious critical path, the details of critical paths within the
multiplier are not disclosed by Altera [28]. Therefore it is necessary to exercise every path from
the inputs to the outputs to uncover the critical path and hence obtain the worst-case delay of
each multiplier output. We achieved this by using the test circuit in Figure 5.2(a).
The multiplier is tested exhaustively by exercising one signalling path at a time with a toggling
signal and searching for the maximum operational frequency fmax of that particular path. The
process is repeated for all possible paths from every input bit to every output bit to obtain an
absolute worst-case fmax associated with each output bit. For a circuit with N input bits and
[Figure 5.1 diagrams. (a) Parallel Reference Signal (PRS) Approach, labelled blocks: TSG; Circuit Under Test (CUT) with inputs A and B and output Z; registers R with enable (EN) inputs; T flip-flop; reference output signal; EDC; output to EHA. (b) Sequential Reference Signal (SRS) Approach, labelled blocks: TSG with enable signal (EN); T flip-flop with Test Enable; CUT containing the embedded multiplier with 9-bit inputs A and B, internal launch register (LR), internal sample register (SR) and 18-bit output Z; external sample register (ESR); signals S, Q, D and E; EDC; output to EHA.]
Figure 5.1: Two different test circuit approaches for multiplier testing. Note that the TCG is
omitted in the diagrams and all clock ports are driven by the same test clock signal.
[Figure 5.2 diagrams. (a) Test circuit, labelled blocks: TSG containing a 17 bit test counter, a T flip-flop and a circular shift by n (toggle bit select n); CUT with 9-bit inputs A and B, internal launch register (LR), internal sample register (SR) and 18-bit output Z; output bit select (m); external sample register (ESR); EDC; output to EHA; Test Enable and enable signal (EN). (b) Flow chart, labelled steps: START; m = 0; n = 0; fmax = finit; Reset Counter; Test (n, m) @ fmax; Found Error; Search for new fmax; Count < 2^(N−1) − 1; Increment Counter; n < N − 1; n = n + 1; stores fmax of the m-th output bit; m < M − 1; m = m + 1; END. Legend: N: number of input bits; M: number of output bits; n: index of input bit under test; m: index of output bit under test; Count: Test Counter output.]
Figure 5.2: (a) shows the circuit used to test the embedded multiplier exhaustively; (b) shows
the flow chart of the test procedure, where finit is the initial test clock frequency.
M output bits, the total number of test vector combinations is given by N × M × 2^(N−1). The
detailed test procedure is presented in Figure 5.2(b) as a flow chart. In this case, the actual
test procedure is optimised to reduce test time by searching for a new fmax only when a failure
is detected. Once the fmax of each output bit is acquired, the worst-case delay (tdelay) can be
computed as 1/fmax.
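The flow chart can be sketched in software as nested loops over the output bit m, the toggling input bit n and the test counter value. This is a simplified illustration, not the actual test controller: the `path_fails` callback, the toy delay model, the fixed step-down of fmax by ∆f and the reset of fmax to finit for each output bit are all assumptions.

```python
def total_test_vectors(n_inputs, n_outputs):
    """N x M x 2^(N-1): one toggling input, all patterns on the other N-1 inputs."""
    return n_inputs * n_outputs * 2 ** (n_inputs - 1)


def exhaustive_fmax(path_fails, n_inputs, n_outputs, f_init, delta_f):
    """Worst-case fmax per output bit; fmax is lowered only when a failure is seen."""
    results = []
    for m in range(n_outputs):                          # output bit under test
        f_max = f_init
        for n in range(n_inputs):                       # toggling input bit
            for count in range(2 ** (n_inputs - 1)):    # test counter value
                while f_max > delta_f and path_fails(n, m, count, f_max):
                    f_max -= delta_f                    # search for a new fmax
        results.append(f_max)
    return results


def toy_path_fails(n, m, count, f_mhz):
    """Hypothetical CUT model: a path fails when the clock period is shorter than its delay."""
    delay_ns = 1.0 + 0.1 * n + 0.2 * m + 0.01 * count
    return 1000.0 / f_mhz < delay_ns


f_max_per_output = exhaustive_fmax(toy_path_fails, 3, 2, 1000.0, 10.0)
t_delay_ns = [1000.0 / f for f in f_max_per_output]    # tdelay = 1/fmax
print(f_max_per_output, t_delay_ns)
print(total_test_vectors(18, 18))                       # 18 input bits, 18 output bits
```

Because fmax is only ever lowered, most test vectors are checked once at the current fmax and pass, which is the test time saving described above. For 18 input bits and 18 output bits the formula gives 18 × 18 × 2^17 = 42,467,328 vector combinations.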
The proposed per-path exhaustive test does have a known limitation: it does not test the
situation where two or more inputs to the CUT are changing simultaneously. Such cases can
increase the switching times in CMOS gates and therefore increase the propagation delay [40].
We assume this effect to be relatively small. See Section 2.5.1 on Multiple Input Change (MIC).
5.2.2 Embedded Multipliers Delay Characterisation
The delay results are presented in Figure 5.3, showing the delays within a single multiplier and
of every multiplier across the Cyclone III FPGA, laid out as depicted in Figure 5.3(b). It can be
seen in Figure 5.3(a) that the worst-case delay path lies between the input and the most
significant bit (MSB) output of the multiplier. Hence the MSB output of every 9x9 multiplier
across the chip is tested to obtain the worst-case delay plots in Figure 5.3(c). The statistics of
the results are summarised in Table 5.1 in four categories according to their locations. The
worst-case delay of the multipliers ranges from 1.642ns to 1.791ns and the 3-σ variation between
all the multipliers on the Cyclone III EP3C25 specimen is ±5.96%. There is a distinctive
systematic delay variation between M0 and M1 multipliers, where M0 has a lower delay in general.
This suggests that M0 and M1 hardware are constructed and/or routed differently in each block.
A slight systematic delay variation can also be observed between multiplier columns, where
multipliers in the right column have lower mean delays in Table 5.1. An interesting trend is
observed from Table 5.1, where multiplier categories with lower average delay actually exhibit a
higher degree of variability. For each category of multiplier (M0, M1) in each column (left,
right), we can observe its delay variation individually from the plots in Figure 5.3(c). Although
a slight trend can be observed within each plot, suggesting that there is a certain amount of
systematic variation, the delay variation is mostly dominated by random stochastic variation.
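[Figure 5.3 panels: (a) worst-case delay per output bit of a single multiplier; (b) floor plan of the multiplier columns (Left Col and Right Col, Y locations 0 to 32), each multiplier block containing M0 and M1; (c) worst-case delay plots for all multipliers.]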
Figure 5.3: (a) A detailed plot of worst-case delays associated with each output bit of a specific
multiplier. (b) A floor-plan of the multiplier blocks under test on the Cyclone III Chip, showing
a total of 66 blocks divided into two columns (Left and Right). Each block contains two 9x9
multipliers: M0 and M1, resulting in a total number of 132 9x9 multipliers. (c) Plots of worst-
case delay at the most significant bit (MSB) output of all 132 9x9 embedded multipliers on the
Cyclone III.
5.3. Measuring Complex Multi-Path Circuits with Transition Probability 159
Table 5.1: Delay statistics and variation of multipliers on the Cyclone III
Resources        Max (ns)   Min (ns)   Mean (ns)   3-σ variation (%)
Left Col,  M0    1.733      1.657      1.693       ±3.234
Left Col,  M1    1.791      1.721      1.752       ±2.736
Right Col, M0    1.728      1.642      1.681       ±3.677
Right Col, M1    1.773      1.704      1.735       ±2.813
All multipliers (cross-chip) 3-σ variation: ±5.959%
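[Figure 5.4 diagram. Labelled blocks: Uniform Random Vector Generator (output V, n bits); Circuit Under Test (CUT) with launch register (LR), arbitrary complex circuit and sample register (SR); TCG; m-bit output y; TP Measurement Circuitry producing the TP profile.]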
Figure 5.4: The transition probability measurement circuitry for complex circuit with n input
bits and m output bits. A random vector generator is used to stimulate the CUT.
Exhaustive testing in general may require a long test time due to the large theoretical search
space of critical paths. However, many of the input combinations actually cause no state change
at the output bit under test, allowing the test to be optimised. This applies to most existing
complex combinatorial circuits, including multipliers. Therefore, in the case where the function
of the circuit is known and hence we can predict if a particular input stimulus will cause a
change at the output, the test time can be greatly reduced by excluding all the unnecessary
input combinations.
5.3 Measuring Complex Multi-Path Circuits with Transition Probability
The general TP measurement circuitry described in the previous chapter can be adapted to
test complex multi-path circuits by using a pseudo random test vector generator to stimulate
the CUT (Figure 5.4). Since the vector generation process is stationary, the statistics of the
resultant random test vectors are also stationary. Apart from transition probability D(y), the
random vectors’ statistics can be quantified by the High Probability (HP), or H(y), of signal y
introduced in the last chapter, where H(y) = 0 and H(y) = 1 imply a stuck-at low and a stuck-at
high respectively. In the case of independent identically distributed random bit sequences, the
values of TP and HP are linked by a simple quadratic relationship:
D(y) ≈ 2×H(y)× (1−H(y)) (5.1)
When defining or quantifying the statistics of random input sequences, the use of HP is
preferred because it represents a unique random bit pattern. A TP value, on the other hand, can
result in two different HP solutions of the quadratic Eq. 5.1 with two opposite bit patterns,
causing unnecessary confusion. The only exception, where TP points to a unique random bit
pattern, is at its maximum, D(y) = 0.5.
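The ambiguity can be made concrete with a short numerical sketch of Eq. 5.1 and its inverse. For independent bits, a transition is either a 0 followed by a 1 or a 1 followed by a 0, each with probability H(y)(1 − H(y)); solving 2H(1 − H) = D for H gives H = (1 ± sqrt(1 − 2D))/2. The function names are illustrative.

```python
import math


def tp_from_hp(h):
    """Eq. 5.1 for i.i.d. bits: probability that two consecutive bits differ."""
    return 2.0 * h * (1.0 - h)


def hp_from_tp(d):
    """Both HP solutions of Eq. 5.1 for a given TP (0 <= d <= 0.5)."""
    if not 0.0 <= d <= 0.5:
        raise ValueError("TP of an i.i.d. bit sequence lies in [0, 0.5]")
    root = math.sqrt(1.0 - 2.0 * d)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


print(tp_from_hp(0.2), tp_from_hp(0.8))   # the same TP from two opposite bit patterns
print(hp_from_tp(0.32))                   # two HP solutions
print(hp_from_tp(0.5))                    # single solution at the TP maximum
```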
5.3.1 Initial TP Test on Embedded Multiplier
The example presented in the previous chapter shows that the TP test method can be used
to test known critical paths of a combinatorial circuit. However, it can also be used to test
arbitrary circuit blocks (so-called “black box testing”) where the internal structure and
critical paths are not known. To illustrate this, we chose the embedded multiplier blocks on the
Cyclone III FPGA as an example. The embedded multiplier block is a good candidate to verify the
test method because its internal structure is not published and no information on the criticality
of paths from different input bits to output bits is available, i.e., it is difficult for
conventional methods to produce accurate results without some means of exhaustive testing.
Test Setup and Procedure
The CUT containing a 9x9 embedded multiplier block on the Cyclone III is depicted in Fig-
ure 5.5, where the input bits are stimulated by a sequence of uniformly distributed random
input vectors V to exercise the paths within the multiplier.
[Figure 5.5 diagram labels: Circuit Under Test (CUT); input vector V (18 bits); Launch Register (LR); Embedded Multiplier with 9-bit inputs A, B and 18-bit output Z; Sample Register (SR); output y to TP Measurement Circuitry; Test Clock]
Figure 5.5: Circuit diagram of the CUT containing a 9x9 hardware multiplier block with inputs
A, B and output Z. The registered output y is processed by the TP measurement circuitry to
produce a TP profile.
For each frequency step, all input bits of the CUT are exercised by the random inputs for
K clock cycles while the TP measurement circuitry gathers output statistics for each output
bit individually. The sampling period, expressed as a number of test clock cycles (K), can be
varied to trade test coverage against test time when necessary. A TP profile in the
test clock frequency (or clock period) domain is obtained for each individual output bit so as to
determine the maximum operating frequency (from input to output) of every output bit.
The maximum operating frequency fmax (hence the worst-case delay) for the entire multiplier
is found by taking the lowest value from the results.
Results Analysis
Figure 5.6 shows the overall results for the 9x9 multiplier block, which has 18 output bits.
Each column represents the TP profile of one output bit over the frequency range of 400 to
850MHz. The actual TP values are represented by different colours. The TP profile for each
output bit shows a different fmax at which deviation from the stationary value starts, with the
most significant output (bit 18) failing at the lowest frequency, as expected.
Unlike the single test path case in the previous chapter, it is no longer appropriate to use the
mid-point (D(y) = 0.5) of the TP profile as the threshold to estimate the nominal value of fmax
under the influence of clock jitter. This is because the interactions of the internal signals are
complex and the effect of clock jitter on the overall D(y) profile is no longer predictable. This
issue is investigated further in Section 5.3.2. Like the LFSR test case presented earlier
[Figure 5.6 plot: colour map with Output Bit Number (1 to 18) on the horizontal axis, Frequency (MHz, 400 to 850) on the vertical axis, and a colour scale for Transition Probability D(y) from 0.26 to 0.46]
Figure 5.6: Test results of the multiplier block are presented as a colour map of D(y) for all 18
output bits from 400 to 850MHz. Each column represents the result of one output bit and the
colour represents the value of D(y) at a particular frequency indicated by the vertical axis.
[Figure 5.7 plot: TP Profile of MSB Output with Lower Threshold and Upper Threshold; horizontal axis Frequency (MHz), 450 to 700; vertical axis Transition Probability D(y), 0.25 to 0.33; annotation: fmax estimation 494.5 MHz]
Figure 5.7: The D(y) plot of the 18th output bit against test frequency steps. Upper and lower
thresholds are used to determine the point where D(y) begins to deviate from its steady
state value.
in the previous chapter, the fmax or worst-case delay of the multiplier has to be determined by
a set of upper and lower thresholds (see Section 4.4.2 on the proposed thresholding method).
The TP profile and thresholds for the 18th output bit are shown in Figure 5.7.
The frequency at which the TP profile first falls outside the threshold bounds is taken as a
good estimate of fmax. In this particular case, the fmax for the 18th output bit is 494.5MHz
and the worst-case delay is given by 1/fmax = 2.02ns. The estimated fmax values for all output
bits of the multiplier are presented in Figure 5.8.
With the TP test time optimisation proposed in Section 4.4.3 in the previous chapter, the total
time required for the entire test procedure is approximately 0.2 seconds.
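The threshold-crossing step can be sketched in Python as follows. This is only an illustration under stated assumptions: the band is taken as the mean of the first few low-frequency samples plus or minus a fixed margin, which is not necessarily how the thresholds of Section 4.4.2 are derived, and the function names, `n_baseline` and `margin` are hypothetical.

```python
def estimate_fmax(freqs_mhz, tp_values, n_baseline=5, margin=0.01):
    """Return the first frequency whose TP leaves the threshold band, or None.

    Assumes freqs_mhz is sorted in ascending order and that the first
    n_baseline samples lie well below fmax (steady-state region).
    """
    baseline = sum(tp_values[:n_baseline]) / n_baseline
    lower, upper = baseline - margin, baseline + margin  # assumed thresholds
    for freq, tp in zip(freqs_mhz, tp_values):
        if tp < lower or tp > upper:
            return freq
    return None

def worst_case_delay_ns(fmax_mhz):
    """Worst-case delay in ns is the reciprocal of fmax (given in MHz)."""
    return 1e3 / fmax_mhz

def overall_fmax(per_bit_fmax_mhz):
    """The fmax of the whole block is the lowest per-bit estimate."""
    return min(per_bit_fmax_mhz)
```

With the measured value quoted above, `worst_case_delay_ns(494.5)` evaluates to approximately 2.02 ns.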
Results Verification
The results in the previous section gave a close estimate of the fmax of embedded
multipliers based on sets of threshold bounds. However, it is not clear how accurate and
reliable these estimates are, and a definitive “ground-truth” test is needed to verify their accuracy.
Using the accurate FRD per-path exhaustive test method proposed earlier in Section 5.2.1
as the ground-truth test, a set of accurate timing measurements can be obtained to verify the
TP measurement method’s accuracy.
Figure 5.8 shows the fmax values obtained by both the TP method and the FRD per-path
exhaustive test method for the multiplier outputs. It can be seen that the two methods
track each other remarkably well over all the output bits and differ from each other by less
than 12% in the estimated fmax values. This result is significant because the two
methods are totally different in approach: one provides exhaustive test results on all possible
paths, while the other relies purely on TP statistics. The fact that the TP method produced
similar results with a significantly shorter test time than the exhaustive test shows that
it could lead to a highly efficient test method for general complex circuits with reasonable
accuracy.
[Figure 5.8 plot: Frequency (MHz, 400 to 800) against Output Bit Number (1 to 18) for two series: Exhaustive Test Results and fmax Estimation using TP profile]
Figure 5.8: This figure compares the fmax estimations obtained using the TP profile with the results
from the exhaustive ground-truth test. Note that the architecture of the Cyclone III permits at
most 16 bits to be connected directly via fast direct-link interconnects to adjacent logic array
blocks (LABs) [28]. Therefore, to avoid inconsistency in the ground-truth test results caused
by the use of slower global interconnects, the two least significant output bits are excluded.
The remaining questions are what caused the 12% gap in the measurements and how it could
be closed to provide even better accuracy that matches the much more time-consuming
exhaustive test method. To answer them, a much deeper understanding of the behaviour
and characteristics of TP in complex multi-path circuits is needed.
5.3.2 Detailed Study of Transition Probability’s Characteristics
Basic TP Model Under Uniform Random Input
Figure 5.9 depicts a simulated TP profile of a single logic path with a uniformly distributed
random input sequence. See Appendix B for the corresponding MATLAB code. The falling
and rising transitions are assigned different propagation delay values tslow and tfast. The gradual
failure slopes are caused mainly by the stochastic behaviour of clock jitter [13], which can
be described by a random variable τ, the time relative to the expected clock edge,
with a specific probability density function PDFJitter(τ). Assuming each clock edge has
independent random jitter, the behaviour of the TP profile as a function of clock period (T ) is
given by the following approximation:
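\[
\mathrm{TP}_{\mathrm{indep}}(T) \approx \frac{1}{2}\left[\frac{3}{4} + \left(\frac{1}{2} - \int_{-\infty}^{t_{\mathrm{fast}}-T} \mathrm{PDF}_{\mathrm{Jitter}}(\tau)\,d\tau\right) \times \left(\frac{1}{2} - \int_{-\infty}^{t_{\mathrm{slow}}-T} \mathrm{PDF}_{\mathrm{Jitter}}(\tau)\,d\tau\right)\right]
\tag{5.2}
\]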
where tfast ≤ tslow.
Similar to the toggle signal case described in Section 4.3.2 in the previous chapter, the behaviour
of the resultant TP profile also depends on the degree of jitter correlation between consecutive
clock edges, which affects the timing failure interaction between the rising and falling transitions
through the CUT.
[Figure 5.9 plot: TP Profile Simulation of Single Logic Path; horizontal axis Clock Period (ps), 750 to 950; vertical axis Transition Probability, 0.2 to 0.55; markers at tslow and tfast]
Figure 5.9: An example of a basic TP profile of a single logic path failing at tslow and tfast for
falling and rising transitions respectively.
In the case of complete edge-to-edge jitter correlation, the TP behaviour with uniform random
input is given by:
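\[
\mathrm{TP}_{\mathrm{corr}}(T) \approx \frac{1}{2}\left[1 - \frac{1}{2}\left(\int_{-\infty}^{t_{\mathrm{slow}}-T} \mathrm{PDF}_{\mathrm{Jitter}}(\tau)\,d\tau - \int_{-\infty}^{t_{\mathrm{fast}}-T} \mathrm{PDF}_{\mathrm{Jitter}}(\tau)\,d\tau\right)\right]
\tag{5.3}
\]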
where tfast ≤ tslow.
Note that both TPindep and TPcorr give identical results in normal cases, when the failures caused
by tfast and tslow do not overlap (Figure 5.9). When the two failure regions do overlap, however, the
jitter correlation causes their respective changes in TP to cancel each other out, reducing the
magnitude of change in the TP profile.
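The behaviour of Eq. 5.2 and Eq. 5.3 can be explored numerically with the short Python sketch below. It assumes a zero-mean Gaussian jitter distribution with standard deviation `sigma`, which is only one possible choice of PDFJitter(τ); the delay values used afterwards are hypothetical and are not taken from the simulation in Appendix B.

```python
import math

def jitter_cdf(x, sigma):
    """Integral of an assumed zero-mean Gaussian jitter PDF from -inf to x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def tp_indep(period, t_fast, t_slow, sigma):
    """Eq. 5.2: independent jitter on every clock edge."""
    a = jitter_cdf(t_fast - period, sigma)
    b = jitter_cdf(t_slow - period, sigma)
    return 0.5 * (0.75 + (0.5 - a) * (0.5 - b))

def tp_corr(period, t_fast, t_slow, sigma):
    """Eq. 5.3: fully correlated edge-to-edge jitter."""
    a = jitter_cdf(t_fast - period, sigma)
    b = jitter_cdf(t_slow - period, sigma)
    return 0.5 * (1.0 - 0.5 * (b - a))

# Hypothetical delays in ps: well-separated failures give identical profiles.
separated = [(tp_indep(T, 800, 900, 5), tp_corr(T, 800, 900, 5))
             for T in (1000, 850, 700)]
# Coinciding failures: the correlated case shows a smaller change in TP.
overlap = (tp_indep(850, 850, 850, 5), tp_corr(850, 850, 850, 5))
```

In the separated case both expressions give 0.5, 0.25 and 0.5 at the three periods. In the fully overlapping case the independent model drops to 0.375 while the correlated model stays at 0.5.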
Eq. 4.7 from Section 4.3.2 can be used to combine Eq. 5.2 and 5.3 from the two extreme cases
to form a more realistic TP estimation, since real clock signals are likely to exhibit both
correlated and uncorrelated edge-to-edge jitter.
[Figure 5.10 diagram labels: 3-Stage Pipeline; input V, launch register (LR), Stage A (delay tA-fall, tA-rise), register R, Stage B (delay tB-fall, tB-rise), register R, Stage C (delay tC-fall, tC-rise), sample register (SR), output S]
Figure 5.10: A simple 3-stage pipeline circuit consisting of three simple combinatorial paths (stages
A, B and C) with different path delays.
[Figure 5.11 plot: TP Profile Simulation of 3-Stage Pipeline; horizontal axis Period (ps), 700 to 1400; vertical axis Transition Probability, 0 to 0.5; markers at tA-fall, tA-rise, tB-fall, tB-rise, tC-fall and tC-rise]
Figure 5.11: The TP profile simulation of a simple 3-stage pipeline.
TP Model of Simple Sequential Paths
When a series of combinatorial paths are connected by registers in a pipeline arrangement,
the TP profile of the entire circuit can be expressed in terms of the TP profiles of each
individual path. An example of such a pipeline circuit containing three stages is presented in
Figure 5.10, where the three stages (A, B and C) are associated with their respective propagation
delays: tA-fall, tA-rise, tB-fall, tB-rise and tC-fall, tC-rise. Figure 5.11 depicts the TP profile of
the circuit. It can be seen that the overall TP profile has a simple multiplicative relationship
with the individual TP profiles of the three stages. When the range between the rise and fall
delays of a path does not overlap with the delays of other paths (stage C in this case), the
change in TP response due to its timing failure is independent of the other stages and simply
propagates directly to the output, maintaining the usual form of a single-path TP profile (Figure 5.9).
For the simple 3-stage pipeline circuit in Figure 5.10, the TP profile TPseq can be modelled by
the following multiplicative expression:
\[
\mathrm{TP}_{\mathrm{seq}} \approx \frac{1}{2}\left(2\,\mathrm{TP}_A \times 2\,\mathrm{TP}_B \times 2\,\mathrm{TP}_C\right)
\tag{5.4}
\]
where TPA, TPB and TPC are the individual TP profiles of the paths in stages A, B and C
respectively, obtained using Eq. 5.2, 5.3 and 4.7.
Similarly, for a simple pipeline containing N stages, the general TP profile can be expressed by:
\[
\mathrm{TP}_{\mathrm{seq}\text{-}N} \approx 2^{N-1} \prod_{i=1}^{N} \mathrm{TP}_i
\tag{5.5}
\]
where TPi represents the TP profile of the i-th stage combinatorial path.
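As a quick numerical check, Eq. 5.5 can be written as a one-line Python function. The function name and the stage values used afterwards are illustrative only; each entry stands for the TP of one stage at a given clock period.

```python
import math

def tp_sequential(stage_tps):
    """Eq. 5.5: overall TP of an N-stage pipeline from its per-stage TP values."""
    n = len(stage_tps)
    return 2 ** (n - 1) * math.prod(stage_tps)

healthy = tp_sequential([0.5, 0.5, 0.5])       # no stage failing
one_failed = tp_sequential([0.5, 0.25, 0.5])   # one stage at TP = 0.25
```

When every stage operates at its normal level of 0.5 the output TP is also 0.5. A single stage dropping to 0.25 appears unchanged at the output, which matches the observation that an isolated failure propagates directly.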
It is clear that in the case of simple sequential paths, the failure of the worst-case path would
always yield an easily distinguishable TP response no matter how the failures of the other paths
affect the overall TP profile. This further explains why the LFSR test case in the previous
chapter yielded such a sharp change in the TP profile despite the high number of paths in the long
chain of 63 registers in the LFSR.
Analysis of Complex Multi-Path TP Profile
The previously described models are useful for predicting the TP profile of a single failing path or
of simple sequential paths. However, they do not scale to more
complex circuits containing multiple interacting combinatorial paths. Figure 5.12 depicts the
TP profile of the second LSB output of a 9x9 embedded multiplier on the Cyclone III EP3C25.
As can be seen, the observed output TP profile is related to the basic TP profiles of the
individual paths. While the TP profile may appear to be a direct combination of the basic TP
profile components of the paths, it is actually not possible to recreate the exact overall TP
profile using the basic single-path TP profiles alone. The main reason for this is that the failure
processes of the paths are interrelated in a manner that is difficult to predict.
[Figure 5.12 plot: Output TP Profile and TP profiles of individual paths; horizontal axis Frequency (MHz), 750 to 850; vertical axis Transition Probability, 0.2 to 0.55]
Figure 5.12: A TP profile measurement of the 2nd LSB output of a 9x9 embedded multiplier
on the Cyclone III EP3C25. The unusual shape of the TP profile is the result of individual
paths failing at different times. The corresponding paths are isolated and tested separately to
obtain their basic TP profile components for reference.
[Figure 5.13 timing diagram labels: Test Clock with period T and jittered edge at T+τ, between T+τmin and T+τmax; Clock Jitter Distribution PDF(τ); Combinatorial Output (Z) with values A and B separated by a Glitch Period bounded by the Fastest Path Delay and the Slowest Path Delay; Register Output (Y) with values A′ and B′]
Figure 5.13: Timing diagram showing the activity of the output bit (Z) and the registered output (Y)
in a circuit with multiple inputs and paths converging on a single output. A glitch period occurs
after each clock edge due to the variation in propagation delay between different paths. The position
of the clock edge is governed by the jitter distribution PDF(τ), and the probability of the register
capturing the correct value in B′ depends on both the glitch pattern and the overlapping jitter region.
Consider the timing illustration in Figure 5.13, where a circuit with multiple internal paths is
stimulated by random vectors. The probability that an input transition through a particular
path is observable at the output depends on the input pattern and the state of the other paths,
which means each path could contribute differently to the observed TP profile. Such behaviour
is only predictable if the exact circuit implementation, structure and layout are known.
Although each active path may produce a signal transition some time after the clock edge,
their different arrival times result in a “glitch period” containing a series of unwanted transition
activities. These glitch activities are unpredictable, especially when random input vectors are
used. When the glitch period coincides with the next clock edge, whose position
is itself unpredictable due to clock jitter, the actual value captured by the register (B′) is
not deterministic, and hence the resultant transition probability cannot be determined with
certainty. Also, the rapid transitions in the glitch period could cause undesirable metastability
problems in the output register [81], further increasing the unpredictability of the output value.
For these reasons, the direct approach of modelling the TP profile based on specific paths quickly
becomes impractical as circuit complexity grows. For FPGA designs, a mere change of placement and
routing could produce a layout with a completely different TP profile. A precise
model of the TP profile can only be obtained if a perfect physical model of the circuit is available,
with precise information on signal propagation, signal interaction and clock jitter behaviour, so that
the exact glitch pattern is known and the registered output value is predictable. However, if
such a perfect physical model existed, a delay measurement method would not be necessary in the
first place. A better strategy is to consider the timing error sensitivity of TP rather
than its exact profile, and to find an effective way to control this sensitivity in
complex circuits, so that good measurement accuracy is achieved.
Controlling Sensitivity of TP to Timing Failure
The timing error sensitivity of TP for a circuit is defined as the difference between the normal
operating level of the output TP and the level of TP after the slowest type of transition (rise
or fall) through the worst-case path has failed. The larger the difference, the more likely it is
that errors are detected, and hence the better the sensitivity to timing failure. The ability to control the
sensitivity of TP to timing failure allows the test method to produce more reliable results,
avoiding inaccuracy caused by sensitivity loss.
There are three typical cases where sensitivity could be affected:
(i) Sensitivity dilution – a logic block with a large number of inputs converging to one output
suffers from a reduced observable TP failure response. This problem can be easily observed
in an N-input AND gate, where errors can only propagate through when all inputs are
high, so the TP sensitivity decreases as N increases.
(ii) Sensitivity blocking – in a circuit with multiple combinatorial stages separated by pipeline
registers, the change in TP profiles due to timing failure of one stage could be blocked by
its following stage(s) under certain conditions, causing it to be invisible at the output.
(iii) Failure blind spot – when a logic block with N inputs is supplied with inputs S_N with a
certain H(S_N), the failure of specific internal paths may not cause any observable change
in the output TP profile.
The problem of diluted sensitivity (i) is unavoidable in most cases, especially with random test
vectors. Yet the sensitivity is only reduced and never completely lost, meaning that it can be
overcome by taking a higher number of transition count samples to form a TP profile with less
residual noise from the random inputs (see Eq. 4.2 in Chapter 4, Section 4.2.3). This approach,
however, increases the total test time and does not solve the problems in cases (ii) and (iii),
where complete loss of sensitivity is possible.
To provide a general solution for the three cases while maintaining short test time, we propose
a method that can improve TP sensitivity by controlling the statistics of the random input
vector in terms of high probability (HP).
As shown in Figure 5.14, the sensitivity to rising or falling transition failure in a single path can be
improved by adjusting the HP of input vector V . The usual choice of uniformly distributed
random test vectors, where H(V ) = 0.5, does not actually provide the best sensitivity to errors.
Instead, maximum sensitivity is achieved when H(V ) is 0.33 or 0.67, depending on
whether the rising or falling transitions fail first.
This unusual asymmetrical phenomenon can be explained and modelled probabilistically through
the following cases.
Consider 3 cycles of input vector sequences V (k), k = 1, 2, 3. If the falling transitions fail to
propagate within 1 cycle, a transition is only detected at the output register on the 4th cycle
when V has a sequence of 0 → 0 → 1 or 1 → 0 → 0. Therefore, the output TP of the failed
path in terms of V is given by the probability of the two sequences occurring:
\[
\begin{aligned}
\mathrm{TP}_{\text{fall failed}} &= 2 \times P(V=1) \times P(V=0) \times P(V=0) \\
&= 2 \times P(V=1) \times (1 - P(V=1))^2 \\
&= 2H(V)(1-H(V))^2
\end{aligned}
\tag{5.6}
\]
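Eq. 5.6 can be checked with a small seeded Monte Carlo sketch in Python. The failure model is an assumption made for illustration: a falling transition that arrives one cycle late is represented by the register holding the old 1 for one extra cycle, i.e. the captured value is V(k) OR V(k-1). The function names are hypothetical and this is not the MATLAB simulation of Appendix B.

```python
import random

def tp_fall_failed(hp):
    """Eq. 5.6: TP of a single path whose falling transitions have failed."""
    return 2.0 * hp * (1.0 - hp) ** 2

def simulate_tp_fall_failed(hp, n_cycles=200_000, seed=1):
    """Estimate the output TP by simulation under the assumed late-fall model."""
    rng = random.Random(seed)
    v = [1 if rng.random() < hp else 0 for _ in range(n_cycles)]
    # A late falling edge leaves the previous 1 in the register for one cycle.
    y = [v[k] | v[k - 1] for k in range(1, n_cycles)]
    transitions = sum(y[k] != y[k - 1] for k in range(1, len(y)))
    return transitions / (len(y) - 1)

estimate_uniform = simulate_tp_fall_failed(0.5)
estimate_weighted = simulate_tp_fall_failed(0.67)
```

For uniformly distributed inputs the estimate should settle near 0.25, in line with Eq. 5.6 evaluated at H(V) = 0.5.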
[Figure 5.14 plots: (a) Single Path TP Response with Input H(V) from 0.0 to 1.0: Transition Probability (0 to 0.5) against Input H(V), showing the Normal TP response, the Failed TP response (falling edge fail first) and the Failed TP response (rising edge fail first); (b) TP Sensitivity Plot: TP Failure Sensitivity (0 to 0.3) against Input H(V), with a Zero Sensitivity Line and Max. Sensitivity at Input H(V) = 0.33 (rising edge) and Input H(V) = 0.67 (falling edge)]
Figure 5.14: Plots evaluating the sensitivity of TP to timing failure in a circuit path. Maximum
sensitivity is achieved when the input vector V has high probability H(V ) = 0.67 when falling
transitions fail first, or H(V ) = 0.33 when rising transitions fail first.
In the same way, when the rising transitions fail, a transition is only detected when V is
0→ 1→ 1 or 1→ 1→ 0. This produces a similar probability expression:
\[
\begin{aligned}
\mathrm{TP}_{\text{rise failed}} &= 2 \times P(V=1) \times P(V=1) \times P(V=0) \\
&= 2 \times P(V=1)^2 \times (1 - P(V=1)) \\
&= 2H(V)^2(1-H(V))
\end{aligned}
\tag{5.7}
\]
By subtracting these failed TP responses from the normal TP response (TPnormal) which is
given earlier by Eq. 5.1, the TP sensitivity of both falling and rising transitions can be derived:
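\[
\begin{aligned}
\mathrm{Sensitivity}_{\mathrm{fall}} &= \mathrm{TP}_{\mathrm{normal}} - \mathrm{TP}_{\text{fall failed}} \\
&= 2H(V)^2(1-H(V))
\end{aligned}
\tag{5.8}
\]
\[
\mathrm{Sensitivity}_{\mathrm{rise}} = 2H(V)(1-H(V))^2
\tag{5.9}
\]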
These expressions describe exactly the sensitivity behaviour observed in Figure 5.14 and the
corresponding HPs of their maximums (peaks), computed through solving their derivatives,
give exactly the observed optimal HP values: 0.33 (1/3) and 0.67 (2/3) for rising and falling
transitions respectively.
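The location of the two peaks can be confirmed numerically with the Python sketch below, which evaluates Eq. 5.8 and Eq. 5.9 on a grid of HP values. The grid search is only an illustrative substitute for solving the derivatives analytically, and the function names are hypothetical.

```python
def sensitivity_fall(hp):
    """Eq. 5.8: TP sensitivity when falling transitions fail first."""
    return 2.0 * hp * hp * (1.0 - hp)

def sensitivity_rise(hp):
    """Eq. 5.9: TP sensitivity when rising transitions fail first."""
    return 2.0 * hp * (1.0 - hp) ** 2

def argmax_on_grid(func, steps=3000):
    """Return the grid point in [0, 1] at which func is largest."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=func)

best_hp_fall = argmax_on_grid(sensitivity_fall)   # close to 2/3
best_hp_rise = argmax_on_grid(sensitivity_rise)   # close to 1/3
peak_value = sensitivity_fall(2.0 / 3.0)          # 8/27, about 0.296
uniform_value = sensitivity_fall(0.5)             # 0.25 for H(V) = 0.5
```

The peak sensitivity of 8/27 (about 0.296) is higher than the 0.25 obtained with uniformly distributed inputs, which quantifies the benefit of weighting the input HP.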
This asymmetrical sensitivity to different transition types means that uniformly distributed
random vectors are not necessarily the optimal choice when the CUT is known to have one
type of transition failing at a significantly lower clock frequency than the other. This usually
happens in CMOS circuits where the pull-up and pull-down transistors are sized differently
or when extra pull-up or pull-down transistors are added to improve signal strength. Uniformly
distributed random vectors are advantageous only when the CUT has exactly matched
rising and falling transition delays or when their failure order is not known in advance.
TP Response and Sensitivity Mapping of Logic Circuits
To further understand how varying the input HP can improve the cases with potential sensitivity
loss (sensitivity blocking and failure blind spot), we carried out a series of sensitivity simulations
on a 2-input logic block. The layout of the block is depicted in Figure 5.15, where it has two
internal paths, each with its corresponding rising and falling transition delays (tA-fall, tA-rise and
[Figure 5.15 diagram labels: inputs A and B, each with a launch register (LR); Arbitrary 2-Input Function with path delays tA-fall, tA-rise and tB-fall, tB-rise; sample register (SR); output S]
Figure 5.15: A simple arbitrary two-input functional block for testing the sensitivity of transition
probability to timing failure with multiple signal paths.
tB-fall, tB-rise). The idea is to stimulate both inputs of the circuit with random vectors A and
B of varying H(A) and H(B) to create extensive two-dimensional mappings of TP response
and sensitivity, and identify possible sensitivity issues. The MATLAB code for simulating the
2-input logic functions is included in Appendix B.
The first issue we encountered is sensitivity blocking, which occurs in circuits with multiple
pipeline stages of functional blocks. Figure 5.16 demonstrates how certain failure responses
from the preceding logic stage could be blocked by simple logic functions. For a circuit with
multiple pipeline stages, it is important for the variation of TP caused by the failure of
early stages to propagate all the way through to the external output, so that it can be detected.
This process can, however, be blocked by logic stages if the input statistics H(A) and H(B)
change in a specific way that follows a contour line in the TP response maps. Each of the
lines represents a constant level of output TP. Thus, H(A) and H(B) changing along these
lines would yield no output TP change, effectively blocking any timing failure information from
reaching the output. In this case an XOR gate poses the greatest problem, because there is a
wide and flat region about the centre where variation of H(A) and H(B) would not produce
any change at the output. The obvious solution to this problem is to adjust the input
HPs such that the observed TP response blocking does not happen.
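The blocking effect can be reproduced with a closed-form Python sketch of the TP response of the three gates. It assumes that A and B are mutually independent, i.i.d. in time and free of timing failure, so that the output is also i.i.d. and Eq. 5.1 applies to it. The function names are illustrative.

```python
def output_hp(gate, ha, hb):
    """HP of the gate output for independent inputs with HPs ha and hb."""
    if gate == "AND":
        return ha * hb
    if gate == "OR":
        return ha + hb - ha * hb
    if gate == "XOR":
        return ha * (1.0 - hb) + hb * (1.0 - ha)
    raise ValueError("unknown gate: " + gate)

def output_tp(gate, ha, hb):
    """TP response D(S) of the gate output, using Eq. 5.1 on the output HP."""
    hs = output_hp(gate, ha, hb)
    return 2.0 * hs * (1.0 - hs)

# XOR: with H(A) = 0.5 the output TP is 0.5 for every value of H(B).
xor_row = [output_tp("XOR", 0.5, hb / 10.0) for hb in range(11)]
# AND: two different input settings that sit on the same TP contour.
and_pair = (output_tp("AND", 0.8, 0.25), output_tp("AND", 0.5, 0.4))
```

Under these assumptions, any change in H(B) produced by an upstream failure is invisible at the XOR output while H(A) stays at 0.5, and for the AND gate any change that keeps the product of H(A) and H(B) constant is equally invisible.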
Another serious issue with TP sensitivity is when a timing failure in a circuit leads to no change
of TP with specific input HPs — the failure blind spot. Such cases could be demonstrated in
[Figure 5.16 plots: three contour maps, panels (a), (b) and (c)]
Figure 5.16: The output TP response mapping of all possible input HPs for (a) AND gate, (b)
OR gate and (c) XOR gate. The contour lines on the maps represent levels of the same output
TP, hence any change of input H(A) and H(B) along the contour lines leads to an unchanging
output D(S), possibly blocking failure responses from the previous logic stages.
[Figure 5.17 plots: three contour maps, panels (a), (b) and (c)]
Figure 5.17: The TP failure sensitivity mapping of all possible input HPs for (a) AND gate,
(b) OR gate and (c) XOR gate. The contour lines represent the level at which sensitivity is
zero. Both positive and negative sensitivity values represent a measurable change of TP, but
in different directions.
2-input functions and they are depicted in Figure 5.17. In the three cases of 2-input functions
(AND, OR and XOR), the falling transition from input A is set to have the worst-case
delay and hence it fails first in the simulation. The deviation of TP caused by path A failing
is recorded for all possible input HPs at A and B to form a sensitivity map for each case. The
level where sensitivity is zero is marked by the contour lines. Therefore, any H(A) and H(B)
values that fall on or near these lines will result in an undetectable TP response. Clearly, for the AND
and OR functions, the blind spots with zero sensitivity are rare and can be avoided relatively
easily. On the other hand, XOR has a wide zero-sensitivity region across the middle where H(A) = 0.5.
Such a region should be avoided by adjusting the values of H(A) and H(B). For linked input
HP values where H(A) = H(B), H(A) = H(B) = 0.87 gives approximately the best sensitivity.
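The linked optimum can be reproduced with an exact enumeration in Python. The sketch rests on an assumed failure model that is not taken from Appendix B: a late falling edge on input A is represented by the gate seeing A(k) OR A(k-1), and both inputs are independent i.i.d. bit streams. The function names are illustrative.

```python
from itertools import product

GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def output_tp(gate, ha, hb, a_fall_failed=False):
    """Exact output TP, enumerating input bits over consecutive cycles."""
    func = GATES[gate]
    tp = 0.0
    for a0, a1, a2, b1, b2 in product((0, 1), repeat=5):
        prob = 1.0
        for bit, hp in ((a0, ha), (a1, ha), (a2, ha), (b1, hb), (b2, hb)):
            prob *= hp if bit else 1.0 - hp
        if a_fall_failed:
            # Assumed model: a late fall on A holds the old 1 for one cycle.
            s1, s2 = func(a1 | a0, b1), func(a2 | a1, b2)
        else:
            s1, s2 = func(a1, b1), func(a2, b2)
        if s1 != s2:
            tp += prob
    return tp

def sensitivity(gate, ha, hb):
    """Normal TP minus TP after the falling transition of path A has failed."""
    return output_tp(gate, ha, hb) - output_tp(gate, ha, hb, a_fall_failed=True)

def best_linked_hp(gate, steps=100):
    """Grid search for the linked HP, H(A) = H(B), with the largest |sensitivity|."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda h: abs(sensitivity(gate, h, h)))
```

Under this model `best_linked_hp("XOR")` returns 0.87 on a 0.01 grid, and the XOR sensitivity is zero at H(A) = H(B) = 0.5, in fact whenever the non-failing input is uniformly random.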
[Figure 5.18 plots: D(S) against Clock Period (ps), 700 to 1100, with markers at tA-fall, tA-rise, tB-fall and tB-rise; panel (a) TP profile of XOR, H(A) = H(B) = 0.5; panel (b) TP profile of XOR, H(A) = H(B) = 0.87]
Figure 5.18: Simulated TP profile of an XOR gate showing (a) sensitivity loss to timing failure
in all paths when using uniformly distributed random inputs, and (b) sensitivity restored using
H(A) = H(B) = 0.87.
The effect of the XOR’s blind spot is demonstrated in Figure 5.18 and Figure 5.19 in terms of the
TP profile, where cases with different path delay orders are shown. In Figure 5.18(a), the error
sensitivity is completely lost due to the uniform input HPs, whereas in Figure 5.19(a), the TP
profile shows a certain change when tB-fall is violated, but still misses the failure of the
worst-case path (tA-fall). In both cases, the sensitivity is restored and improved dramatically when
H(A) = H(B) = 0.87 is used.
[Figure 5.19 plots: D(S) against Clock Period (ps), 700 to 1100, with markers at tA-fall, tA-rise, tB-fall and tB-rise; panel (a) TP profile of XOR, H(A) = H(B) = 0.5; panel (b) TP profile of XOR, H(A) = H(B) = 0.87]
Figure 5.19: (a) shows the sensitivity loss to failure of the slowest type of transition in the
XOR and the regained sensitivity when both types of transition have failed; (b) shows the fully
regained sensitivity using H(A) = H(B) = 0.87.
[Figure 5.20 block diagram labels: Uniform Random Vector Generator; Probability Weighting Circuit with High Probability (HP) Weights; Circuit Response Tester (CRT); TCG; Test Select; Circuit Under Test (CUT) with LR, Arbitrary Complex Circuit and SR, input V (n bits) and output y (m bits); TP Measurement Circuitry; TP profile; TP profile Analyser; Fmax; Circuit Response Analyser (CRA); Normal Transition Detection Path; Glitch Detection Path (Non-permanent)]
Figure 5.20: Block diagram of the self-optimising complex circuit test platform.
5.4 Self-Optimising Complex Circuit Test Platform
The deeper understanding of how TP behaves in complex circuits led us to create a more
accurate self-optimising test platform that allows any FPGA user to easily measure their
design’s timing. The block diagram of the complete complex circuit test platform is depicted
in Figure 5.20. The test circuit automatically optimises its random input vectors with specific
probability weights to improve the TP’s sensitivity to timing errors in the CUT.
5.4.1 Adaptive Input Probability Weighting
The circuit response tester (CRT) stimulates the CUT by toggling one input bit at a time while
cycling the remaining bits with a counter every two clock cycles. Each count forms a pair
of input patterns that differ only in the single toggled bit, giving a set of exhaustive single input
change (SIC) test vectors. This approach effectively exercises every path in the combinatorial
logic blocks in the CUT that have full input access. The output pattern is analysed by the circuit
response analyser (CRA). Input pattern pairs from the CRT that lead to actual activity at a
specific output bit are recorded and marked as “effective”. Since a significant number of input
patterns are likely to produce no output transitions, the refined “effective” input patterns would
form a series with distinctive average HP values for each input bit when applied in sequence.
Such HP values are then applied to the probability weighting circuit as HP weights to generate
weighted random sequences with specific HP. The HP-optimised random vectors are likely to
exercise the internal paths of the CUT more thoroughly than uniformly distributed random
vectors, because they are probabilistically similar to the “effective” input patterns that exercised
every path in the exposed combinatorial parts of the CUT.
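The CRT and CRA steps can be sketched in software as follows. This Python sketch is a simplified model rather than the hardware implementation: it treats the CUT as a purely combinatorial single-output function, represents each SIC pair as the toggled bit going from 0 to 1, and averages each input bit over all patterns of the effective pairs. All names are hypothetical.

```python
def sic_pairs(n_inputs):
    """Yield (before, after) input tuples that differ only in one toggled bit."""
    for toggle in range(n_inputs):
        for count in range(2 ** (n_inputs - 1)):
            # The counter value supplies the non-toggling bits.
            others = [(count >> i) & 1 for i in range(n_inputs - 1)]
            before = others[:toggle] + [0] + others[toggle:]
            after = others[:toggle] + [1] + others[toggle:]
            yield tuple(before), tuple(after)

def effective_hp(circuit, n_inputs):
    """Average HP of each input bit over the SIC pairs that change the output."""
    totals = [0] * n_inputs
    n_patterns = 0
    for before, after in sic_pairs(n_inputs):
        if circuit(before) != circuit(after):   # pair marked as "effective"
            for pattern in (before, after):
                totals = [t + bit for t, bit in zip(totals, pattern)]
                n_patterns += 1
    if n_patterns == 0:
        return None
    return [t / n_patterns for t in totals]

and_weights = effective_hp(lambda v: v[0] & v[1], 2)
xor_weights = effective_hp(lambda v: v[0] ^ v[1], 2)
```

For a 2-input AND gate only the pairs with the other input high are effective, giving an average HP of 0.75 for both inputs. For a 2-input XOR every pair is effective and the plain average is 0.5 for both inputs.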
This approach may appear to neglect sequential feedback in circuits, where combinatorial
blocks with feedback inputs may not be directly controllable from the proposed input sequences.
However, it is the very nature of feedback in sequential circuits that allows the TP test method to
maintain high timing error sensitivity: errors accumulate through the feedback paths and
cause a significant change in the output TP response.
Through monitoring the raw output of complex combinatorial circuits on FPGAs, we also
found that many of the critical and near-critical paths are actually unobservable under normal
circumstances, since they may produce momentary glitches that are not usually observable
at the registered output, except when the clock edge falls exactly within the glitch period.
The fact that the output register may filter out potential activity of critical paths means
that it is sometimes necessary to observe the CUT’s activity before the output register. Such a
path for detecting glitches is illustrated in Figure 5.20 as the “Glitch Detection Path”, through which the
CRA can detect glitch activity to create more effective HP weights for further TP sensitivity
improvement. The glitch detection path is only needed once at the beginning to generate the
HP weights; it can therefore be removed afterwards to avoid its effect on the CUT’s fan-out,
load and propagation delay. In most cases, however, the usual registered outputs are adequate
to generate a set of good HP weights for accurate timing measurements.
HP Weight Estimation for the 2-input XOR Case
For the earlier 2-input XOR example, the effective input vectors are: ∗0, 0∗, ∗1 and 1∗, where
∗ represents the input bit being toggled by the CRT. Assuming a toggling bit has an HP of
0.5, one could argue that the overall average HP of the four vector pairs equates to 0.5 for both
input bits, which was shown to give zero sensitivity in Figures 5.18 and 5.19. However, due to
the asymmetrical sensitivity of delay paths described earlier by Eq. 5.8 and Eq. 5.9, the overall
average of the effective vectors may not form appropriate HP weights. Instead, the vectors
should be divided into groups according to the pattern of the non-toggling bits, and the average
HP weights of each group computed separately. In this case, they form two groups: {∗0, 0∗} and {∗1, 1∗}.
Assuming falling transitions are slower, optimal sensitivity is achieved when the ∗ bits are
assigned an HP of 0.66, and the resultant HP weights of both input bits for the 1st and 2nd
groups are:
1st Group: HP V 0 = HP V 1 = (0.66 + 0)/2 = 0.33
2nd Group: HP V 0 = HP V 1 = (0.66 + 1)/2 = 0.83
Although either of the HP weights (0.33 or 0.83) would yield measurable TP responses when
applied to both inputs of the XOR, the HP weight from the 2nd group (0.83) would give better
sensitive, since HP weights greater than 0.5 favour higher failure sensitivity for the slower falling
transitions in this particular case. The estimated HP weight agrees with the best HP weight
(0.87) observed earlier in Figure 5.17(c), assuming the same HP weight is used for both inputs.
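The grouping and averaging step for the XOR case can be sketched in a few lines of Python. This is only an illustration, not the thesis's implementation: the function name `group_hp_weights`, the string encoding of vectors and the choice of grouping key (the sorted values of the non-toggling bits) are assumptions made for the example.

```python
from collections import defaultdict

TOGGLE = "*"  # marks the bit toggled by the CRT (encoding assumed for this sketch)

def group_hp_weights(vectors, toggle_hp):
    """Average HP per input position, grouped by the non-toggling bit pattern.

    vectors   : strings such as "*0" or "1*", one character per input bit
    toggle_hp : HP value assigned to a toggling ("*") bit
    Returns {fixed-bit pattern: [average HP for each bit position]}.
    """
    groups = defaultdict(list)
    for vec in vectors:
        # Group key (an assumption): the values of the bits held constant.
        key = "".join(sorted(ch for ch in vec if ch != TOGGLE))
        groups[key].append([toggle_hp if ch == TOGGLE else float(ch) for ch in vec])
    return {
        key: [sum(col) / len(col) for col in zip(*rows)]
        for key, rows in groups.items()
    }

# The four effective XOR vectors, with the toggling bit assigned an HP of 0.66.
weights = group_hp_weights(["*0", "0*", "*1", "1*"], toggle_hp=0.66)
```

With these inputs, the group with the constant bit at 0 averages to 0.33 per input and the group with the constant bit at 1 averages to 0.83, matching the two values worked out above.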
5.4.2 Bit-Specific Optimisation for Low Controllability
and Observability Circuits
Occasionally, a complex design may contain a critical path that is associated purely with one
specific input bit and/or output bit, making the general HP weight optimisation approach that
targets all input and output bits at once less effective. To improve TP sensitivity for such cases,
input and/or output bit-specific HP weight patterns can be used. The HP weight generation
procedure is similar to the general case, but instead of marking the effective input pattern for
all input and output bits, the actual indices of the toggling input bit and the active output bits
are recorded for every effective input pattern detected. Thus, a table of input vectors associated
with each input bit to output bit activity can be built to obtain HP weights targeting a specific
input bit and/or output bit. Table 5.2 shows an example of an input to output activity map
of a circuit with 3 input bits and 2 output bits.
Table 5.2: Example input to output activity map for bit-specific high probability (HP) weights
optimisation.
Vector Input Vectors: V 2, V 1, V 0 Output Active
pairs (∗ = toggle bit) Bit1 Bit0
0 0 0 ∗ √
1 0 1 ∗
2 1 0 ∗ √ √
3 1 1 ∗ √
4 0 ∗ 0 √ √
5 0 ∗ 1 √
6 1 ∗ 0 √ √
7 1 ∗ 1 √
8 ∗ 0 0 √ √
9 ∗ 0 1 √
10 ∗ 1 0 √ √
11 ∗ 1 1 √
For example, the input vectors with the 1st input bit toggling (vector pairs 0-3) can be used to
generate a specific set of HP weights optimised for the 1st input bit (V0). Since the toggling bit
has one transition every clock cycle (D(y) = 1.0) and its activity cannot be emulated by any
HP weight, a special case in the probability weighting circuit is necessary to directly
produce a toggle signal. By including direct toggle signal generation, the TP profile can be
made much more sensitive than using random vectors alone, because the corresponding paths
will be exercised much more frequently. In this case, the HP values optimised for V0 would be
given by the 3 effective vector pairs (00∗, 10∗, 11∗) as follows:
$HP_{V0}$ = Toggle
$HP_{V1} = (0 + 0 + 1)/3 = 0.33$
$HP_{V2} = (0 + 1 + 1)/3 = 0.67$
It is also possible to narrow the HP weights down further to specific input to output bit
combinations. For example, the optimal HP values for V0 to output Bit0 are given by:
$HP_{V0,\mathrm{Bit0}}$ = Toggle
$HP_{V1,\mathrm{Bit0}} = (0 + 0)/2 = 0.0$
$HP_{V2,\mathrm{Bit0}} = (0 + 1)/2 = 0.5$
and the optimal HP values for V0 to output Bit1 are:
$HP_{V0,\mathrm{Bit1}}$ = Toggle
$HP_{V1,\mathrm{Bit1}} = (0 + 1)/2 = 0.5$
$HP_{V2,\mathrm{Bit1}} = (1 + 1)/2 = 1.0$
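The bit-specific averaging above can be expressed as a small Python sketch. The function name `bit_specific_hp` and the data layout are hypothetical, and only vector pairs 0 to 3 (V0 toggling) are encoded, with their active output bits taken from the worked values in the text: 00∗ drives Bit0, 10∗ drives both bits, 11∗ drives Bit1 and 01∗ drives neither.

```python
def bit_specific_hp(activity_map, toggle_index, output_bit=None):
    """HP weights targeting one toggling input bit and, optionally, one output bit.

    activity_map : list of (vector, active_outputs); vector is a string ordered
                   V2 V1 V0 with '*' marking the toggling bit, active_outputs is
                   the set of output bit indices that showed activity
    toggle_index : string position of the targeted toggling bit
    output_bit   : restrict to vectors that activate this output (None = any)
    Returns a list ordered like the vector string, or None if no vector qualifies.
    """
    rows = [
        vec for vec, active in activity_map
        if vec[toggle_index] == "*"
        and (active if output_bit is None else output_bit in active)
    ]
    if not rows:
        return None
    weights = []
    for pos in range(len(rows[0])):
        if pos == toggle_index:
            weights.append("Toggle")  # needs the dedicated toggle generator
        else:
            weights.append(sum(int(v[pos]) for v in rows) / len(rows))
    return weights

# Vector pairs 0-3 (V0 toggling); active outputs as implied by the worked values.
V0_ROWS = [("00*", {0}), ("01*", set()), ("10*", {0, 1}), ("11*", {1})]

any_output = bit_specific_hp(V0_ROWS, toggle_index=2)               # [V2, V1, V0]
to_bit0 = bit_specific_hp(V0_ROWS, toggle_index=2, output_bit=0)
to_bit1 = bit_specific_hp(V0_ROWS, toggle_index=2, output_bit=1)
```

The returned lists are ordered V2, V1, V0, so `to_bit0` corresponds to HP values of 0.5 for V2 and 0.0 for V1, with V0 driven by the toggle signal.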
As can be seen, the more specific the HP weights used, the higher the potential TP sensitivity
against failure, but at the expense of a higher number of separate tests and hence a longer test
time. The test time increase is related to the number of input bits N and output bits M, and
the worst-case test time, when every input to output combination is tested separately, is given
by:
$\text{Total Test Time} = \text{Single Test Time} \times M \times N$ (5.10)
Although the worst-case test time may be increased by a factor of the product of the number
of input bits and output bits, the total test time required is still significantly less than that of
an exhaustive test. Also, it is highly unlikely that the full combination of HP weights has to
be tested, given that the high level function of the CUT is known.
5.4.3 Generating Weighted Probability Test Vectors
Random sequences with weighted HP can be generated from a combination of independent
uniformly distributed random bit streams through simple boolean logic [48]. Consider the
following basic relationships (rules) linking the boolean combination of random bit streams
and the probability distribution of the resultant output:
(I) The AND/product rule:
For two independent random bit streams A and B:
$H(A \cdot B) = H(A) \times H(B)$ (5.11)
where $A \cdot B$ denotes the boolean AND of A and B. For example, when $H(A) =
H(B) = 0.5$, $H(A \cdot B) = 0.5 \times 0.5 = 0.25$. This rule allows bit streams with HP less than
0.5 to be generated.
(II) The NOT/inversion rule:
For any random bit stream C:
$H(\overline{C}) = 1 - H(C)$ (5.12)
This rule can be used to produce HP greater than 0.5. For example, let $H(A) = H(B) =
0.5$; then $H(\overline{A \cdot B}) = 1 - H(A)H(B) = 1 - 0.5 \times 0.5 = 0.75$. Note that when C is uniformly
distributed ($H(C) = 0.5$), boolean inversion returns a bit stream with exactly the
same HP, where
$H(\overline{C}) = H(C)$ (5.13)
(III) The logic interchange rule:
By De Morgan’s theorem, the HP of two uniformly distributed random bit streams A and
B satisfies the following identity
$H(\overline{A \cdot B}) \equiv H(\overline{A} + \overline{B})$ (5.14)
where the boolean AND operation is interchanged with an OR operation.
However, since $H(A) = H(B) = 0.5$ also satisfies Eq. 5.13, the inversion of the individual
uniform random bit streams can be omitted. Therefore,
$H(\overline{A \cdot B}) \equiv H(A + B)$ (5.15)
This implies that the previous example for generating HP = 0.75 can be simplified to a
single boolean OR operation, where $H(A + B) = 1 - 0.5 \times 0.5 = 0.75$.
Using these basic rules, a wide range of HP values between 0 and 1 can be generated using only
AND and OR logic. Table 5.3 shows an example of 9-level HP weighting using 3 independent
uniform random bit streams: R0, R1 and R2. A “Special” weight is also included in the table
for toggle signal generation if bit-specific sensitivity optimisation is needed. With r independent
uniform random inputs, the achievable number of HP levels is given by $2^r + 1$. An example
boolean logic implementation table for a higher-resolution 17-level HP weighting circuit using
4 independent random bit streams is included in Appendix C, Table C.1, for reference.
Table 5.3: Example of logic based weighted random sequence generation with different high
probability (HP) and a “Special” weight reserved for toggle signal generation.
Weight | HP | Boolean Logic / Hardware
1 | 0 | GND
2 | 0.125 | R2 · R1 · R0
3 | 0.25 | R2 · R1
4 | 0.375 | R2 · (R1 + R0)
5 | 0.5 | R2
6 | 0.625 | R2 + R1 · R0
7 | 0.75 | R2 + R1
8 | 0.875 | R2 + R1 + R0
9 | 1.0 | VCC
Special | 0.5 (Toggle) | Toggle Flip-flop
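The HP values in Table 5.3 can be checked by enumerating all eight equally likely combinations of R2, R1 and R0 and counting how often each boolean expression evaluates to 1. The Python sketch below does this for the nine regular weights; the dictionary and function names are illustrative only, and the toggle flip-flop is left out because it is not a combinational function of the random inputs.

```python
from itertools import product

# Boolean expressions from Table 5.3 (r2, r1, r0 take the values 0 or 1).
WEIGHT_LOGIC = {
    1: lambda r2, r1, r0: 0,                 # GND
    2: lambda r2, r1, r0: r2 & r1 & r0,
    3: lambda r2, r1, r0: r2 & r1,
    4: lambda r2, r1, r0: r2 & (r1 | r0),
    5: lambda r2, r1, r0: r2,
    6: lambda r2, r1, r0: r2 | (r1 & r0),
    7: lambda r2, r1, r0: r2 | r1,
    8: lambda r2, r1, r0: r2 | r1 | r0,
    9: lambda r2, r1, r0: 1,                 # VCC
}

def exact_hp(logic, n_inputs=3):
    """Exact HP of a boolean function of independent uniform random bits."""
    combos = list(product((0, 1), repeat=n_inputs))
    return sum(logic(*bits) for bits in combos) / len(combos)

hp_levels = {weight: exact_hp(fn) for weight, fn in WEIGHT_LOGIC.items()}
```

Each weight w comes out at (w - 1)/8, which gives the nine evenly spaced levels from 0 to 1 listed in the table.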
Exploiting Partial and Dynamic Reconfigurability
For FPGAs with partial/dynamic reconfigurability, the HP weight can be modified easily
through changing the configuration bits and the LUT’s function on the fly [82, 83]. Figure 5.21
shows the basic concept of a partial/dynamic reconfiguration based HP weighting circuit on
a typical FPGA logic element with 4-input LUT and register. Otherwise, if partial/dynamic
reconfiguration is not available, the same controllable HP can be implemented directly with
several LUTs, at the expense of slightly higher area. For the Cyclone III EP3C25 without such
capability, a weighted random bit stream with 17 HP levels based on Table C.1 in Appendix C
can be implemented using three 4-input LUTs.
[Figure 5.21 diagram labels: uniform random inputs R0 to R3; look-up table (LUT); register; LUT mask for weighted HP logic; configuration memory bits; select toggle signal generation; partial/dynamic reconfiguration bit stream (HP weights); output with weighted HP/toggle signal.]
Figure 5.21: A partial/dynamic reconfiguration approach for 17 levels reconfigurable HP weights
and toggle signals using a typical Logic Element (LE) with 4-input LUT and register.
5.5 Test Platform Implementation and Evaluation on
FPGA
The proposed complex circuit test platform is implemented on a Cyclone III EP3C25 FPGA
to evaluate its accuracy and efficiency. Figure 5.22 depicts the hardware layout of the test
circuit on the FPGA. In this particular case, the TP measurement circuitries and CRA are
placed next to the CUT for a more compact representation. However, there is no limitation
on where these circuitries should be placed, because they are completely asynchronous from the
CUT and do not suffer from timing issues if placed at a remote location. The random vector
generator is implemented as an LFSR and is followed by a 17-level HP weighting circuit.
The test procedure contains two phases. First, the circuit’s response is analysed by the CRT
and CRA to generate the optimised HP weights; these are then used to conduct the TP test
to obtain the circuit’s maximum operating frequency or worst-case delay measurements. The response
analysis phase is only required once for each design. In some cases, it can be skipped completely
if the optimised HP weights can be obtained through analysis or simulation of the CUT during
the design process.
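As a software illustration of the LFSR-based random vector generator, the following Python sketch models a 16-bit Fibonacci LFSR. The thesis does not state the LFSR width or feedback polynomial, so the 16-bit register and the taps (16, 14, 13, 11) are assumptions chosen purely because that configuration is a commonly used maximal-length one; the function names are also hypothetical.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one clock."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_bits(seed, count):
    """Yield `count` pseudo-random bits (the LSB of each successive state)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")  # all-zero state never changes
    for _ in range(count):
        state = lfsr16_step(state)
        yield state & 1

def lfsr16_period(seed=0xACE1):
    """Count the clocks needed for the register to return to its seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps
```

Several such bit streams, taken from independent generators or well-separated taps, would play the role of the uniform inputs R0 to R3 feeding the HP weighting logic.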
[Figure 5.22 layout labels: TP circuitries; Circuit Response Checker; CUT; Random Vector Generator; Probability Weighting Circuit; Circuit Response Tester; Alternative Test Circuit (for accuracy evaluation); CUT output storage for direct error checking.]
Figure 5.22: Layout of the hardware test platform on a Cyclone III EP3C25 for complex CUT.
An alternative test circuit is included for accuracy evaluation of the TP test platform.
The test platform is evaluated with two types of CUT: a 4x4 LUT based multiplier and a
Butterworth IIR filter. The layout in Figure 5.22 is taken from the Butterworth filter case. The
test candidates were chosen such that both combinatorial and sequential circuits are evaluated.
Since practical FPGA applications in general consist mostly of LUT based functions, the LUT
based multiplier test gives a clear indication of how well the test method performs in
general.
5.5.1 Multiplier Test Case
The LUT based multiplier is tested with both the proposed TP method and the full exhaustive
test method proposed in Section 5.2.1, the latter giving an absolute measurement reference for accuracy
comparison. For the TP test, random inputs with and without optimised HP weighting are
tested to assess their effectiveness. The placement and routing of the CUT are kept
identical in both tests.
Results
The maximum operating frequencies (fmax) of the multiplier measured from the TP profiles
using threshold bounds are shown in Figure 5.23. The results with optimised input HP track
the exhaustive test results very closely and are accurate to within 1% of them. The apparent
accuracy difference between the normal and optimised HP results is not very large in this
case because the test using uniformly distributed random inputs is already very close to the
exhaustive test references. Nonetheless, a clear improvement can still be seen between the two.
We suspect that the good accuracy from the uniform random inputs is due to a high degree of
glitch activity at the combinatorial output of the multiplier. Referring back to Section 5.3.2
and Figure 5.13, it can be seen that when the glitch period is violated by the clock edge and
jitter region, the registered output becomes highly unpredictable. Although it is not possible
to predict the exact TP response, the increased uncertainty may have caused a more distinctive
TP deviation from its normal value and thus increased the TP sensitivity. It has been shown
in [84] and [85] that complex combinatorial circuits in FPGAs generally produce a high level of
glitch activity at their outputs, which could in fact favour the TP measurement method.
In Figure 5.24, the TP profiles taken from the same output bit using uniform HP and optimised
HP are compared. The TP profile with optimised HP shows a significantly larger TP response,
resulting in higher timing error sensitivity and better measurement accuracy using thresholds.
The optimised TP method may produce slightly more conservative results than the exhaustive
test because it is not possible to take the effect of clock jitter into account with the complex TP
behaviour produced by multiple paths failing, whereas the exhaustive test method examines
each path individually and is able to produce a nominal fmax according to the expected clock
edge position at the centre of the jitter distribution. See Sections 3.4.3 and 5.2.1 for more
information.
For this particular multiplier test case, the total test time of the test platform
is under 3 seconds, assuming the optimal input HP weights are pre-extracted or
pre-computed.
[Figure 5.23 plot: “Accuracy Evaluation of the Optimised TP Method”; x-axis: Output Bit # (0 to 7); y-axis: Frequency (MHz), 175 to 425; series: Exhaustive Test, Optimised Input HP, Input HP = 0.5.]
Figure 5.23: Accuracy evaluation of the TP method on a 4x4 LUT based multiplier with
optimised input HP against the exhaustive test method proposed in Section 5.2.1.
[Figure 5.24 plot: “Sensitivity Improvement with Optimised Input HP”; x-axis: Freq (MHz), 125 to 625; y-axis: Transition Probability; series: TP Profile (HP = 0.5), TP Profile (Optimised HP); annotated frequencies: 431 MHz and 441 MHz.]
Figure 5.24: Plot showing the error sensitivity improvement of the TP profile of a single
multiplier output bit using optimised input HP.
5.5.2 Butterworth IIR Filter Test Case
The Butterworth IIR filter (Figure 5.25) is implemented with multiple 18x18 embedded
multipliers, adders, feedback paths and register stages on the Cyclone III EP3C25. The complexity
of this CUT resembles that of most practical FPGA designs, so it gives a good indication
of the TP test platform’s measurement accuracy in a realistic situation.
To evaluate the actual measurement accuracy, a test method based on absolute output
comparison is used to gather the filter’s output at a series of finely spaced increments of clock
frequency and compare it against a set of pre-calculated reference results to identify any
timing errors. The layout of the extra test circuit is depicted in Figure 5.22 as “Alternative
Test Circuit”. Note that this alternative test circuit is built purely for the purpose of accuracy
evaluation; its area overhead and test time are far too high for practical use.
Results
The results from the optimised TP test platform in the form of TP profiles (Figure 5.26) gave
a maximum operating frequency (fmax) measurement of 159.44 MHz, which is within 1% of the
absolute fmax obtained from the comparison based method (Figure 5.27). This reference fmax
is derived from the point where errors start to occur in the failure rate plot.
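As a quick check of the 1% figure, using the reference fmax of 158.14 MHz annotated in Figure 5.27, the relative difference between the two measurements works out as follows:

```latex
\[
\frac{\lvert 159.44 - 158.14 \rvert}{158.14}
  = \frac{1.30}{158.14}
  \approx 0.0082
  = 0.82\%
\]
```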
The total test time in this case is similar to that of the previous multiplier test case, at
approximately 3 seconds per test. This is mainly because the test time is determined by
the range of the frequency sweep, and a relatively short frequency range was sufficient to obtain the
results in both cases.
5.5.3 Test Circuit Area Estimations
[Figure 5.25 diagram labels: input x(n); output y(n); first biquad; second biquad; DSP block; D Q registers; coefficients b11, b12, b21, b22; internal signals w1(n), w1(n-1), w1(n-2), w2(n), w2(n-1), w2(n-2).]
Figure 5.25: The 4th order Butterworth IIR filter design from Altera [86], where x(n) is the
input and y(n) is the output.
[Figure 5.26 plot: “TP Profiles of the Butterworth IIR Filter”; x-axis: Frequency (MHz), 120 to 260; y-axis: Transition Probability, 0.25 to 0.55; annotation: fmax = 159.44 MHz.]
Figure 5.26: The TP profiles of all 21 output bits of the Butterworth IIR filter.
[Figure 5.27 plot: “Absolute Failure Rate of the Butterworth IIR Filter”; x-axis: Frequency (MHz), 120 to 260; y-axis: Failure Rate (%), 0 to 100; annotation: fmax = 158.14 MHz.]
Figure 5.27: The absolute failure rate of the Butterworth IIR Filter between 120 and 270MHz.
The resource usage of the test platform depends on four key factors: the number of input
bits (n), the number of output bits (m), the TP counter bit width (k) and the order of HP
weight levels (r), where $2^r + 1$ gives the actual number of HP levels. The detailed resource
estimation for each test circuit component is presented in Table 5.4. The general resource
usage is quantified in terms of the number of Logic Elements (LEs), each containing a 4-input LUT
and register, or the number of Adaptive Logic Modules (ALMs), each with a fracturable 8-input Adaptive
LUT (ALUT) and two registers, as found in more advanced FPGA architectures [25].
The test circuit can be optimised for low area overhead by sharing the same TP counter among
different output bits using an m-to-1 multiplexer, or it can be optimised for short test time
using multiple TP counters in parallel. Table 5.4 shows the two extreme cases with full TP
counter sharing or fully parallelised TP counters. The resource usage of the multiplexer in
the sharing case is estimated based on the actual optimised synthesis results from Altera’s
Quartus II design tool, where the resource usage in terms of LE and ALM are approximated
by $\mathrm{ceil}\left(\frac{3m}{4} + 1\right)$ and $\mathrm{ceil}\left(\frac{m}{3}\right)$ respectively.
Table 5.4: The general resource usage of the TP test platform in terms of the number of Logic
Elements (LEs) or Adaptive Logic Modules (ALMs) [25]. The estimations are based on the number of
input bits n and output bits m, the TP counter bit width k and the order of magnitude r of
the number of HP weight levels.
Component | 4-LUT, Reg (LE) | ALUT, 2 Reg (ALM)
Random Vector Generator | ∼ r × n | ∼ r × n/2
Circuit Response Tester | ∼ n | ∼ n/2
HP Weighting Circuit: Partial/Dynamic Reconfigurable | n × ceil((r − 1)/3) | n × ceil((r − 1)/5)
HP Weighting Circuit: Direct Logic Implementation | n × ceil(2r/3) | n × ceil(2r/5)
k-bit TP Counter: Shared (low area) by MUX | ∼ ceil(3m/4 + 1) + k | ∼ ceil(m/3) + k/2
k-bit TP Counter: Parallel (short test time) | m × k | m × k/2
It can also be seen that partial/dynamic reconfigurability significantly reduces the resource
usage in the HP weighting circuit, by over a factor of 2. Its only drawback is the
potential penalty on HP weight switching speed, which is not critical to the actual testing
process in most cases.
The resource usage of the circuit response analyser (CRA) is not included in the table because
its function can be covered by the same counter hardware that performs the TP counting at
the output of the CUT.
Using the previous 4x4 multiplier test case with 8 inputs and 8 outputs as an example, the Cyclone
III EP3C25 would require the following number of LEs in the shared and parallel cases with
n = m = 8, k = 20 and r = 4 (17 HP levels):
$\text{Resources-shared} \approx r \times n + n + n \times \mathrm{ceil}\left(\frac{2r}{3}\right) + \mathrm{ceil}\left(\frac{3m}{4} + 1\right) + k \approx 91 \text{ LEs}$ (5.16)
$\text{Resources-parallel} \approx r \times n + n + n \times \mathrm{ceil}\left(\frac{2r}{3}\right) + m \times k \approx 224 \text{ LEs}$ (5.17)
For the Cyclone III EP3C25 with 24624 LEs, the resource usage is equivalent to approximately 0.37% and
0.9% of the entire FPGA for the shared and parallel cases respectively.
If the same CUT is tested on a more advanced architecture with ALMs and partial/dynamic
reconfigurability, such as the 28nm Stratix V from Altera [83], the resource usage
would fall even further to:
$\text{Resources-shared} \approx \frac{r \times n}{2} + \frac{n}{2} + n \times \mathrm{ceil}\left(\frac{r - 1}{5}\right) + \mathrm{ceil}\left(\frac{m}{3}\right) + \frac{k}{2} \approx 41 \text{ ALMs}$ (5.18)
$\text{Resources-parallel} \approx \frac{r \times n}{2} + \frac{n}{2} + n \times \mathrm{ceil}\left(\frac{r - 1}{5}\right) + \frac{m \times k}{2} \approx 108 \text{ ALMs}$ (5.19)
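The four estimates in Eq. 5.16 to 5.19 can be reproduced with a short Python sketch built from the entries of Table 5.4. The function names `le_estimate` and `alm_estimate` are hypothetical, and rounding the ALM total up to a whole module is an assumption of this sketch rather than something stated in the table.

```python
from math import ceil

def le_estimate(n, m, k, r, shared=True, reconfigurable=False):
    """Approximate Logic Element count of the TP test platform (Table 5.4)."""
    vector_gen = r * n
    response_tester = n
    hp_weighting = n * ceil((r - 1) / 3) if reconfigurable else n * ceil(2 * r / 3)
    tp_counter = ceil(3 * m / 4 + 1) + k if shared else m * k
    return vector_gen + response_tester + hp_weighting + tp_counter

def alm_estimate(n, m, k, r, shared=True, reconfigurable=True):
    """Approximate Adaptive Logic Module count of the TP test platform."""
    vector_gen = r * n / 2
    response_tester = n / 2
    hp_weighting = n * ceil((r - 1) / 5) if reconfigurable else n * ceil(2 * r / 5)
    tp_counter = ceil(m / 3) + k / 2 if shared else m * k / 2
    # Assumption: round the total up to a whole number of ALMs.
    return ceil(vector_gen + response_tester + hp_weighting + tp_counter)

# 4x4 multiplier test case: n = m = 8, k = 20, r = 4 (17 HP levels).
le_shared = le_estimate(8, 8, 20, 4, shared=True)
le_parallel = le_estimate(8, 8, 20, 4, shared=False)
alm_shared = alm_estimate(8, 8, 20, 4, shared=True)
alm_parallel = alm_estimate(8, 8, 20, 4, shared=False)
```

With these parameters the sketch returns 91 and 224 LEs for the direct-logic Cyclone III case, and 41 and 108 ALMs for the reconfigurable ALM case, in agreement with the figures quoted above.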
5.6 Summary and Discussion
In this chapter, we have proposed an FRD based per-path exhaustive test method capable of
producing reliable and accurate results for complex multi-path combinatorial circuits. It was
used to evaluate the accuracy of the more efficient TP method, which was found to be accurate
to within 12% for the embedded 9x9 multiplier on the Cyclone III EP3C25. To overcome this margin
of inaccuracy, an in-depth study of the TP behaviour in complex circuits was carried out and
a number of strategies based on test vector probability weighting were proposed to optimise
the sensitivity of the TP test method against timing failure.
Using a 4x4 LUT based multiplier and a 4th order Butterworth IIR filter as CUTs, we have
demonstrated that the optimised TP test method can provide highly accurate delay and
frequency measurements in both complex combinatorial and sequential circuits. The actual
accuracy was shown to be within 1% of the absolute timing measurements obtained from the
much more time consuming and area expensive FRD exhaustive test and the direct comparison
reference method.
The proposed technique to automatically optimise random input test vectors in terms of high
probability (HP) weights enables a large variety of complex circuits to benefit from the elegant
TP test method with highly precise and accurate timing results. Moreover, the test circuit is
highly area efficient: its overhead is not directly proportional to the CUT’s complexity but to
the number of input and output bits, and it is attributable mainly to input vector generation.
The TP circuitries can also be shared among different outputs or circuits to achieve further
area reduction at the cost of longer test time. Otherwise, multiple TP counters could be used
in parallel for very short test time.
The main limitation of the test method is that the actual response in a TP profile cannot be
reliably predicted for complex circuits due to glitches and clock jitter uncertainties. That means
there could be a certain degree of unpredictability in the measurement’s accuracy. However,
given the good accuracy achieved in the test cases, such unpredictability could easily be covered
by a relatively small guard band with minimal impact on the optimality of the results.
Chapter 6
Conclusion
6.1 Measurement Resolution and Accuracy
The timing resolution of measurements from both the failure rate detector (FRD) and transition
probability (TP) methods depends on the clock generator used. For PLL based clock
generators such as the two-stage PLL implementation (Section 3.5.3) used in the Altera Cyclone
III, the achievable timing resolution in terms of half clock period between 128 and 800 MHz is
shown to range from 1.16 ps down to as fine as 0.002 ps, but it is mostly concentrated within 0.1 ps
and peaks at approximately 0.006 ps (see Figure 3.13 in Section 3.5.3). For the TP method,
which relies on the full clock period, the actual timing resolution is coarser by a factor of two, but
it remains sufficiently precise and is largely within the level of 0.2 ps.
Both the FRD and TP methods can produce highly accurate results for path specific
measurements, since they can take advantage of the reference point provided by the centre (∼median) of
the clock jitter distribution, giving an accurate nominal delay based on the expected clock edge
position. The same applies to the per-path FRD exhaustive test method, where specific paths
are isolated and tested individually. However, when testing complex multi-path circuits with
the TP test platform, the behaviour of jitter cannot be isolated from the complex TP responses,
and hence a set of threshold bounds is used to determine the circuit delay. This method is
less accurate, but with the help of optimised test vector probability weights, the accuracy is
shown to be within 1% for both complex combinatorial and sequential circuits.
6.2 Limitations and Future Technology Scalability
There are no significant limitations with the optimised TP test platform, except that the
achievable accuracy may vary between different circuits due to the unpredictable effects of
output glitches and clock jitter on the TP failure sensitivity (Section 5.3.2). As explained
earlier in Section 5.6, the accuracy variation is expected to be relatively small and can be
overcome using a small guard-band.
Although the FRD and TP methods require a runtime reconfigurable PLL or DCM based clock
generator to work, this is no longer a limitation because most existing Altera and Xilinx FPGAs
are already equipped with such hardware, and it is expected to be a basic standard for future
FPGA architectures.
Neither method has any specific technology dependency that could impact its operation and/or
efficiency in future process technologies and FPGA architectures. In fact, the measurement
methods would automatically scale with any advancements in clock generators, clock networks
and operating frequency of circuits in future FPGAs, because all measurements are
obtained relative to the system clock. Any improvement that leads to more precise control of
clock frequency and better clock signal quality in terms of jitter distribution would lead to
better measurement resolution and accuracy in future FPGAs.
6.3 Advancements Over Existing Measurement
Methods
As stand-alone timing measurement tools, the measurement techniques and test platform
proposed in this thesis can be used in a wide range of applications and digital VLSI hardware, not
limited to FPGA platforms. However, FPGAs will be the first to benefit from them due to their inherent
flexibility, where any user could easily include the modularised test platform to efficiently and
accurately test their designs without incurring any permanent hardware overhead. The
temporary resources used by the test platform are shown to scale only with the number of inputs
and outputs of the circuit-under-test, rather than its actual internal complexity. Therefore, in most
modern commercial FPGAs, including low cost products, the resource usage would amount
to at most a few percent of the entire chip, allowing complex designs that use a large portion of
FPGA resources to also benefit from the test platform.
Of the two measurement techniques, the TP timing measurement method is suitable
for FPGAs as well as ASICs, since it is non-invasive and does not require any internal circuit
modifications in the circuit-under-test. Furthermore, the beauty of the TP method is that
it gives more than just the worst-case delay of a circuit: it is able to measure the delay
of rising and falling transitions separately. Such information is useful for design level timing
optimisations, where signals can be deliberately inverted between combinatorial nodes to even
out and reduce the impact of slow transitions on the overall worst-case propagation delay. For
both ASICs and FPGAs, it can also be used for a quick power-on self-test (POST), where specific
timing conditions (health) of the hardware can be accurately monitored to inform users of
impending hardware degradation or failure.
In terms of test hardware flexibility, measurement accuracy, precision and consistency, both
the FRD and TP methods surpass existing measurement methods, such as ring oscillator and
at-speed scan tests, that are commonly used in both FPGAs and ASICs. The excellent spatial
and timing resolution of the test methods demonstrated on FPGAs in this thesis shows that they
can be used to accurately map out the delays of individual components, clearly distinguishing
any delay pattern across the chip caused by process variation. This puts the test methods
in an ideal position for hardware security schemes known as Physical Unclonable Functions
(PUFs) [87], which take advantage of the intrinsic delay variation in FPGAs as a unique hardware
identification. The precise and consistent measurements from the FRD and TP methods can
be exploited to implement more reliable and effective PUFs for FPGAs.
6.4 Closing the FPGA Design Productivity Gap
From the FPGA manufacturer and timing model standpoint, it is becoming harder and harder
to reliably predict the optimal timing performance of user designs due to process variation
and other environmental variations. The continuous increase of chip size, logic density and
transistor scaling have only made the situation worse by forcing more and more conservative
timing models in an attempt to retain reliability under the effect of timing variability. While
this strategy has succeeded in maintaining reliability so far, the gap between the physical timing
performance of our technology and the timing model is continuously growing, causing our
productivity with FPGAs to lag behind the actual advancements in hardware technology. This
thesis presented a highly flexible and practical timing measurement method that can be used to
bridge the performance gap between theoretical timing models and physical timing performance
of FPGA designs.
The main implication of an efficient timing measurement technique in the FPGA design process
is that it can assist or even replace the less optimal timing model based analysis process in the
traditional FPGA design flow. The improved design flow based on physical timing measurement
is illustrated in Figure 6.1. Given the highly efficient measurement method proposed in this
thesis, where the test time of a practical complex circuit is in the order of a few seconds, the
overall FPGA design compilation time should not increase significantly. The
main requirement is that the actual FPGA hardware has to be available during the compilation
process, but in exchange, the user gets a design that operates at its optimum speed
in a specific environment (temperature and voltage supply) and does not need to worry about
potential reliability issues caused by process variation.
As demonstrated earlier, the delays of components across an FPGA, including complex
embedded blocks, can be accurately characterised through efficient built-in self-tests, creating a
chip-wise delay map of the FPGA. Such a delay map can be used to assist the placement and
routing process by providing a physical timing model, or used to tune existing timing models to
provide more accurate estimation of circuit delays. As shown in Figure 6.1, the FPGA delay
characterisation step is only required once at the beginning for each specific FPGA hardware.
[Figure 6.1 flow diagram labels: User Functional, Operational and Environmental Requirements; Design Entry (HDL); Design Synthesis; FPGA Delay Characterisation; Timing Model Tuning; Physical Delay-Aware Place and Route; Physical Timing Analysis; Physical fmax; Timing Failed; Design Level Timing Optimisation; Place and Route Adjustments; FPGA Device Programming; FPGA Configuration.]
Figure 6.1: An improved FPGA design flow that allows users to more effectively express the true
potential and timing performance of their designs in real FPGA hardware while maintaining
reliability.
Chapter 7
Future Work
7.1 Memory Testing
One aspect that the proposed timing measurement methods did not explicitly cover is the
testing of circuits consisting mostly of memory elements, such as the embedded SRAM based
Block RAM in FPGAs. Although one can argue that the TP test method should be able to
handle FPGA designs using memory blocks as long as the memory content is randomised to have
stationary statistics, we cannot guarantee that the TP method alone could effectively detect
timing errors associated with all memory operations (address, read and write) in all possible
cases. Therefore, as future work, extensive testing of the TP method on different types of
memory oriented designs is necessary to validate the above assumption. Nonetheless, memory
testing is one of the most mature areas in the VLSI testing community, so it
is possible that existing memory test techniques could be used in conjunction with the TP
measurement method to provide a solid test platform for memory oriented designs.
7.2 Transition Probability Based On-line Test
The TP measurement method can also be used to perform on-line tests in specific cases, as
long as the circuit-under-test produces outputs with stationary statistics. Random number
generator circuits, such as LFSRs, are ideal examples, since their output TP is always stationary
under normal operation. Another area where the idea may be applicable is the class of
circuits for data encryption. One of the key goals of data encryption is to make the output
appear as random as possible, to prevent data sniffing through statistical analysis. The
output TP over a certain period of time should therefore be relatively stationary in normal
operation, making it possible to detect timing failures by observing TP responses.
To validate the above hypothesis for data encryption circuits, and possibly other circuits
with similar behaviour, further testing will be required to observe and analyse actual failure
behaviour in such cases. This would give us an idea of the achievable measurement accuracy
and reveal situations in which TP is influenced by normal circuit operation, thus allowing us to
track the changes in TP intelligently to avoid false positive results.
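As a rough illustration of the on-line test idea, the MATLAB sketch below monitors the TP of an LFSR output bit over fixed-length windows and flags windows whose TP drifts away from a reference window. It is a minimal sketch under stated assumptions, not the measurement platform described in this thesis: the LFSR polynomial, the window length, the fault model (every third capture in the second half of the run re-captures the previous value) and the 0.08 threshold are all hypothetical choices made for the example.

```matlab
% Illustrative sketch only: windowed TP monitoring of an LFSR output bit.
N      = 20000;           % simulated clock cycles (assumed)
winLen = 2000;            % cycles per TP observation window (assumed)
state  = uint16(44257);   % non-zero seed, 0xACE1
s      = zeros(1, N);     % ideal output bit per cycle
for n = 1:N
    % 16-bit Fibonacci LFSR, polynomial x^16 + x^14 + x^13 + x^11 + 1
    fb = bitxor(bitxor(bitget(state, 1), bitget(state, 3)), ...
                bitxor(bitget(state, 4), bitget(state, 6)));
    state = bitor(bitshift(state, -1), bitshift(fb, 15));
    s(n)  = double(bitget(state, 1));
end

% Assumed fault model: in the second half of the run, every third capture
% is late, so the register re-captures the previous cycle's value.
q = s;
failCycles = (N/2 + 3):3:N;
q(failCycles) = s(failCycles - 1);

% TP per window = fraction of cycles in which the captured output toggles
trans = double(q(2:end) ~= q(1:end-1));
nWin  = floor(numel(trans) / winLen);
tp    = mean(reshape(trans(1:nWin*winLen), winLen, nWin), 1);

tpRef   = tp(1);                    % first window taken as the fault-free reference
flagged = abs(tp - tpRef) > 0.08;   % assumed detection threshold
disp([tp; double(flagged)]);
```

In this sketch a late capture suppresses output transitions, so the TP of the affected windows falls well below the reference. A practical on-line test would also have to allow for legitimate changes in TP during normal operation, which is the open question raised above.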
7.3 Physical Timing Model Tuning and Extraction
As previously discussed in the thesis, timing models should be constantly tuned to match the
physical condition of an FPGA and to track any delay variability due to process variation.
Since it would be infeasible to independently measure every hardware component and
interconnect at the finest level of granularity in an FPGA, a detailed
study of how the delays of components within close proximity are correlated is necessary to
predict the delay of parts that cannot be measured independently, assuming process variation
is not perfectly random at small scale.
Given that the spatial correlation of delay can be modelled accurately, the FPGA characteri-
sation process can be optimised to test the minimum number of components at the maximum
acceptable spatial granularity with minimal test time, while maintaining sufficient information
to construct a close approximation of the physical timing model. With the spatial correlation
model, the delay characterisation test can also be used to tune existing timing models with only
a small number of physical measurements, given that the delay variability has a high spatial
correlation between FPGA components.
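The MATLAB sketch below illustrates the principle on synthetic data: a delay map is "measured" at only every fourth test site in each direction and the remaining sites are estimated by interpolation. It is a hedged example rather than a proposed characterisation procedure. The grid size, the smooth delay function standing in for systematic variation, and the choice of bilinear interpolation via `interp2` are all assumptions made for illustration.

```matlab
% Illustrative sketch only: estimate a full delay map from sparse measurements.
[X, Y] = meshgrid(1:33, 1:33);            % 33x33 grid of hypothetical test sites
D = 600 + 0.5*X + 0.3*Y + 5*sin(X/8);     % assumed smooth (systematic) delay map, ps

c  = 1:4:33;                              % measure only every 4th site in each direction
Dm = D(c, c);                             % the "measured" subset
De = interp2(X(c, c), Y(c, c), Dm, X, Y, 'linear');   % estimate the unmeasured sites

maxErr   = max(abs(De(:) - D(:)));        % worst-case estimation error, ps
measured = numel(Dm) / numel(D);          % fraction of sites actually measured
fprintf('Measured %.1f%% of sites, max error %.3f ps\n', 100*measured, maxErr);
```

The estimate is good here only because the synthetic map is smooth. A spatially uncorrelated random delay component could not be recovered by interpolation, which is why the correlation study described above is needed before the test granularity can be relaxed.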
7.4 Design Tool Integration and Measurement Based
Optimisations
The generalised test platform and the flexibility in the placement location of the TP measurement
circuits allow FPGA users to easily apply the test platform to their circuit designs for
accurate and efficient physical delay measurements. Such a test platform could be integrated into
conventional FPGA design tools, either within the design flow or as an optional tool, to give
users immediate knowledge of their circuit’s timing performance under the actual FPGA
hardware and physical conditions. This allows users to efficiently design FPGA circuits that
are optimised and tailored to their specific FPGA timing and conditions to operate reliably at
the optimal speed.
While this approach is ideal for personal use, where the target FPGA is accessible to the
user, it poses a great challenge for batch design methodologies that rely on creating a single
FPGA configuration to run on multiple FPGAs which are not physically accessible to the
designer. A configuration optimised by the design tool using physical delay measurements of
the one FPGA specimen available to the designer is then of little value, since the
FPGAs in the end products are likely to have different delay variability patterns.
There are two main ways around this problem [5, 6, 7, 8, 9, 10]:
Late place and route optimisation (late binding) - this relies on the equipment that
performs the configuration of individual FPGAs to optimise the placement and routing of
the design based on physical delay measurements of the specific FPGA. Although this
process is generally costly in both time and equipment, the end result is expected to be
highly optimal.
Multiple configurations - the design tool predicts the common delay variability patterns of
the target FPGA and creates multiple configurations, each optimised for one predicted case.
Since the file size of a single FPGA configuration is generally very small when compressed
(normally within a megabyte for moderately sized FPGAs and designs), a package
containing several configurations would remain feasible in terms of storage requirement. The
configuration equipment would still need to run a short delay characterisation test to
identify the best configuration to use, but the overall time can be significantly reduced
by not having to perform a specific place and route optimisation. The end results are
expected to be less optimal than late binding, since it is not possible to take any of the
stochastic random delay variability into account.
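The selection step in the multiple configurations option could be sketched in MATLAB as follows. This is a deliberately simplified illustration, not a proposed algorithm: the 2x2 regional delay scaling, the per-region path delays and the three candidate configurations are hypothetical numbers, and each configuration is assumed to have a single fixed critical path whose delay scales linearly with the measured regional delay.

```matlab
% Illustrative sketch only: pick the best of several pre-built configurations.
% Hypothetical result of a short characterisation test: delay scaling per
% chip region relative to nominal (1.0), for a die split into 2x2 regions.
regionScale = [1.08 1.02;
               0.97 0.93];

% Assumed nominal critical-path delay (ps) that each configuration spends in
% each region. Rows: configurations; columns: regions in the column-major
% order of regionScale(:).
pathDelay = [400 100 300 200;
             100 400 200 300;
             250 250 250 250];

estDelay = pathDelay * regionScale(:);   % estimated critical-path delay per configuration, ps
[bestDelay, bestCfg] = min(estDelay);    % configuration with the shortest estimated delay
fmaxMHz = 1e6 / bestDelay;               % period in ps -> frequency in MHz
fprintf('Use configuration %d: %.1f ps (%.1f MHz)\n', bestCfg, bestDelay, fmaxMHz);
```

In practice the critical path of a configuration may change with the variability pattern, so a real selection step would need to evaluate several near-critical paths per configuration rather than one.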
The two options above can be used in combination to reduce time and equipment cost, while
maintaining optimality close to that of the first option. For example, the process can be divided
into two stages. In the first stage, the design tool provides an initial set of configurations
optimised for several typical systematic delay variability patterns. In the second stage, before
the actual configuration process, a small-scale place and route optimisation is performed on the
best configuration to take random delay variability into account. The optimisation in the second
stage would require much less computational time because it only considers minor placement and
routing adjustments to the original configuration. Although this idea is built upon two basic
principles, multiple configurations and late binding, more work is needed in both the delay
measurement/characterisation aspect and the place and route optimisation algorithms to create a
practical strategy for general use.
Appendix A
Glossary
ALM - Adaptive Logic Module. The basic logic modules in the Altera Stratix FPGA family from
Stratix II onwards. Each contains an ALUT and its corresponding output registers (see ALUT).
ALUT - Adaptive LUT (see LUT), a LUT architecture that can be split into several smaller
independent LUTs or act as one combined LUT with a larger number of inputs. In Altera’s
terminology, an ALUT also includes the output registers that follow it.
ASIC - Application Specific Integrated Circuit. An integrated circuit designed to have
specialised functions for specific applications.
At-Speed Testing - Tests that are performed at the circuit’s actual operating speed (clock
frequency).
BIST - Built-In Self-Test, an on-chip test platform that performs self-testing without the need for
external equipment.
C4 Interconnect - FPGA column interconnects spanning 4 LABs (see LAB).
Clock jitter - Timing/phase variation of the edges of a clock signal from their expected positions.
CMOS - Complementary Metal-Oxide-Semiconductor is a technology and design principle for con-
structing integrated circuits.
CRA - Circuit Response Analyser, a circuit used to analyse the output response of a circuit-under-
test.
CRC - Cyclic Redundancy Check, a hash function designed to detect errors in binary data.
CRT - Circuit Response Tester, a circuit used to stimulate a circuit-under-test to analyse its be-
haviour.
CUT - The Circuit-Under-Test in a test procedure.
DCM - Digital Clock Manager is a DLL-based module included in most Xilinx FPGAs for clock
phase and frequency control.
DLL - Delay-locked loop, a first-order closed-loop control circuit that can control and maintain a
certain phase (delay) relationship between two clock signals. It is frequently used in clock
management circuits.
EDC - Error Detection Circuit, a circuit for detecting timing errors at the output of a
circuit-under-test.
EHA - Error Histogram Accumulator. A circuit that accumulates the number of timing failures from
a circuit-under-test and produces an error count histogram profile of the failure process.
FPGA - Field-Programmable Gate Array. An integrated circuit with flexible and programmable
hardware, designed to be configured by end-users in the field for specific applications.
FRD - Failure Rate Detector. The general term for a delay measurement circuit based on timing
failure detection.
Guard-band - A safety margin given to timing models to accommodate possible delay variability.
HDL - General term for Hardware Description Language. The VHDL and Verilog standards are
commonly used for FPGA designs.
HP - High Probability. The probability that a logical high occurs in a synchronous signal on the
next clock cycle. It is also denoted by H(S) to represent the high probability of signal S.
IP Block - Intellectual Property Block, a term commonly used to describe a circuit with protected
and encrypted structural and implementation information.
LAB - Logic Array Block, a general term for a cluster of LEs or ALMs in Altera’s FPGA architectures
(See LE and ALM).
LE - Logic Element is the general term used to describe a LUT and register pair in most of
Altera’s FPGA architectures.
LUT - Look-Up-Table. A LUT with n inputs and one output can implement any n-variable Boolean
function from a stored truth table.
Metastable (flip-flop) - A rare and momentary state of a flip-flop where its output is neither logical
0 nor 1, caused mostly by violation of its setup and hold time requirements. The output normally
settles to a valid state within a short period of time.
MIC test vectors - Multiple Input Change test vectors are input stimuli that differ by multiple
bits between test patterns.
On-line testing - Testing that runs in the background without interrupting a circuit’s functions and
operations.
PLL - Phase-locked loop, a second-order closed-loop control circuit that can control and maintain a
certain frequency and phase relationship between two clock signals. It is frequently used in clock
generation and frequency synthesis circuits.
POST - Power-On Self-Test, a test procedure carried out during hardware startup to analyse hard-
ware integrity and/or performance.
PRS-Approach - Parallel Reference Signal Approach. Test approach using extra copies of the
circuit-under-test in parallel to create a reference signal for error detection.
PUT - The Path-Under-Test of a circuit in a test procedure.
R24 Interconnect - FPGA row interconnects spanning 24 LABs (see LAB).
RAZOR - The term used to describe the pipeline-level timing error detection and correction scheme
for microprocessors proposed by Ernst et al. [18].
RO - Ring Oscillator. A circuit loop that oscillates at a frequency inversely proportional to its
internal path/loop delay, usually formed by an odd number of inverter stages with a direct output
to input feedback.
SIC test vectors - Single Input Change test vectors are input stimuli that differ by only one bit
between test patterns.
SRS-Approach - Sequential Reference Signal Approach. Test approach using an extra output
register to create a reference signal sequentially every 2 clock cycles for error detection.
TAC - Transition Activity Counter is the circuit used to obtain the output statistical response of a
circuit in terms of transition probability (see TP).
TCG - Test Clock Generator is the circuit used to generate a variable test clock.
Timing-yield - The yield or proportion of ASIC or FPGA circuits that satisfy a target timing
constraint over a specific number of hardware samples produced or configured.
TP - Transition Probability. The probability that a signal transition occurs in a synchronous signal
on the next clock cycle. It is also denoted by D(S) to represent the transition probability of
signal S, which came from the term “Derivative”, implying the rate of signal transitions per
clock cycle.
TPA - Transition Probability Analyser. The circuit block responsible for analysing the TP response
from a circuit and producing the corresponding timing estimations (see TP).
TSG - Test Stimuli Generator, a circuit that produces a series of bit streams to exercise a
circuit-under-test.
TVG - Test Vector Generator. Same as TSG.
VCDL - Voltage Controlled Delay Line. A circuit path with variable propagation delay controlled
by a reference voltage.
VCO - Voltage Controlled Oscillator. A circuit with variable oscillation frequency controlled by a
reference voltage.
VDL - Vernier Delay Line is a well established and widely used hardware structure capable of mea-
suring and encoding time difference between two signals into digital bits.
VLSI - Very Large Scale Integration, a term generally used to indicate a silicon device with a large
number of integrated logic gates.
Appendix B
MATLAB Simulation
Table B.1: Event-based representation of clock signals with bounded edge-to-edge jitter variable
τ, where $\tau_{\min} \le \tau_{n-1} - \tau_n \le \tau_{\max}$.
Clock Data Structure | Clock Period | Edge-0 Time | Edge-1 Time | Edge-2 Time | ... | Edge-n Time
(+ve Edge Triggered) | T            | 0 + τ_0     | T + τ_1     | 2T + τ_2    | ... | nT + τ_n
Table B.2: Event-based representation of input vectors and normal signals.
Signal Event Time           | t_0                   | t_1       | t_2       | ... | t_n
Logic Value/Transition Type | 0 (Start logic value) | +1 (rise) | −1 (fall) | ... | End logic value
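To make the representation in Table B.2 concrete, the short MATLAB sketch below builds a small event vector by hand and reads back the logic level at a few query times. The event times and query times are made-up values for illustration. The two-row layout (times in row 1; start value, transition types and end value in row 2) follows the table and the way ToggleGen fills its output in Listing B.3.

```matlab
% Illustrative sketch only: read back the logic level of an event-based signal.
% Row 1: event times (ps). Row 2: start value, +1 (rise) / -1 (fall), end value.
vec = [0 100 250 400 1000;
       0   1  -1   1    1];

% Logic level before the first transition and after each transition
levels = [vec(2, 1), vec(2, 1) + cumsum(vec(2, 2:end-1))];

tq  = [50 150 300 500];                            % hypothetical query times (ps)
idx = sum(bsxfun(@le, vec(1, 2:end-1)', tq), 1);   % transitions at or before each query time
val = levels(idx + 1);                             % logic level at each query time
disp(val);
```

The last entry of `levels` should always equal the stored end logic value, which gives a cheap consistency check on any signal produced by the functions in this appendix.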
Listing B.1: Top level MATLAB code for clock event and signal transition based simulations.
%% Circuit parameters
CyclesN   = 2000;  % Number of test cycles (trials)
JitterWin = 30;    % Jitter window width (-15 to +15 ps)
MetaWin   = 20;    % Metastable window width (-10 to +10 ps)
ClkSkew   = 0;     % Clock skew, assumed to be 0 in this simulation
tCQ       = 0;     % Clock to Q (register out) delay, assumed to be 0

PathDelays = [625    % Rising transition delay (ps)
              675];  % Falling transition delay (ps)

% PathDelaysA = [700     % Rising transition delay (ps)
%                900];   % Falling transition delay (ps)
% PathDelaysB = [800     % Rising transition delay (ps)
%                1000];  % Falling transition delay (ps)

%% Initialise frequency sweep data
MaxDelay = max(PathDelays(:));
MinDelay = min(PathDelays(:));
Tstart   = MaxDelay + 150;   % Start clock period (descending order)
Tend     = MinDelay - 150;   % End clock period (descending order)
dT       = 1.0;              % Clock period step size (ps)
T        = Tstart:-dT:Tend;  % Clock period list (ps)
FreqList = 1000000 ./ T;     % Clock frequency list (MHz)
StepsN   = length(T);        % Number of period/frequency steps

%% Initialise statistical profiles
FR = zeros(1, StepsN);  % Failure rate profile (for Chapter 3)
TP = zeros(1, StepsN);  % Transition probability profile (for Chapters 4, 5)
HP = zeros(1, StepsN);  % High probability profile (for Chapters 4, 5)

%% Clock generation parameters
InterpNum  = 0;        % Jitter interpolation/correlation control, 0 = no correlation
InterpType = 'cubic';  % Multi-cycle jitter interpolation type

%% Test loop
parfor k = 1:StepsN  % Multithreaded parallel for-loop
    % Clock generation with edge-to-edge jitter
    jClk = ClockGen(CyclesN, T(k), JitterWin, InterpNum, InterpType);
    V = ToggleGen(jClk, tCQ);  % Synchronous toggle signal generation
    % V = RandVecGen(jClk, HP_weight, tCQ);  % (for Chapters 4, 5)
    % Synchronous random vector generation with specific high probability (HP) weights

    z = Func_BUF(V, PathDelays);  % Combinatorial path (buffer) with output z
    % z = Func_AND(VA, VB, PathDelaysA, PathDelaysB);
    % Combinatorial function (AND) with output z (for Chapters 4, 5)
    % z = Func_OR(VA, VB, PathDelaysA, PathDelaysB);
    % Combinatorial function (OR) with output z (for Chapters 4, 5)
    % z = Func_XOR(VA, VB, PathDelaysA, PathDelaysB);
    % Combinatorial function (XOR) with output z (for Chapters 4, 5)

    % Register with output Q and built-in analyser for failure rate (FR) and
    % output statistics (TP, HP). The output order follows the function
    % signature in Listing B.9: [QV, PTrans, PHigh, FRate].
    [Q, TP(k), HP(k), FR(k)] = Register_Analyser(jClk, 1, MetaWin, ClkSkew, z, tCQ, V);
end
Listing B.2: MATLAB code for the clock generation function “ClockGen”.
function [clk] = ClockGen(cycles, period, JitterRelMax, InterpNum, InterpType)
% InterpNum defines the number of interpolated samples
% 0 = no interpolation and uncorrelated random edge-to-edge jitter
InterpNum = InterpNum + 1;
StepNum = ceil(cycles / InterpNum);
sNum = 1 + InterpNum * StepNum;
% Simulate correlated multi-cycle jitter by interpolation
KeyNum = StepNum + 1;  % Add the extra key point at the end
KeyRand = JitterRelMax * (rand(1, KeyNum) - 0.5);  % Generate random jitter
x = 1:InterpNum:sNum;
xi = 1:sNum;
iRand = interp1(x, KeyRand, xi, InterpType);  % Interpolated jitter values

% Create bounded edge-to-edge jitter
j = cumsum(iRand(1:cycles+1));
clk = [period, (0:period:(cycles*period)) + j];
% The 1st element is the clock period, the rest are the clock +ve edge timing positions
Listing B.3: MATLAB code for the toggle signal generation function “ToggleGen”.
function [vec] = ToggleGen(clk, tCQ)
vec = zeros(2, length(clk)-1);
vec(1, 1:end) = clk(2:end) + tCQ;
vec(2, 2:2:end-1) = 1;
vec(2, 3:2:end-1) = -1;
vec(2, end) = (vec(2, end-1) == 1);
Listing B.4: MATLAB code for the random vectors generation function “RandVecGen”.
function [vec] = RandVecGen(clk, PHigh, tCQ)  % PHigh: output high probability
Period = clk(1);
cycles = length(clk) - 1;
R = (rand(1, cycles) < PHigh);       % Random logic value for each cycle
RR = xor(R(1:end-1), R(2:end));      % Cycles in which the value changes
RCount = sum(RR) + 2 + R(1);

vec = ones(2, RCount);
TimeList = clk(2:end) + tCQ;
TimeList2 = TimeList(RR);            % Times of the transitions
vec(1:2, 1) = 0;
if R(1)
    vec(1, 2) = tCQ;
    vec(1, 3:end-1) = TimeList2;
    vec(2, 3:2:end-1) = -1;
else
    vec(1, 2:end-1) = TimeList2;
    vec(2, 3:2:end-1) = -1;
end
vec(1, end) = TimeList(end) + Period;
vec(2, end) = (vec(2, end-1) == 1);
Listing B.5: MATLAB code for the buffer circuit “Func_BUF”.
function [vec] = Func_BUF(A, delay)
% rise delay: delay(1), fall delay: delay(2)
vec = A;
if length(A) > 2
    if A(2, 2) == 1
        vec(1, 2:2:end-1) = vec(1, 2:2:end-1) + delay(1);
        vec(1, 3:2:end-1) = vec(1, 3:2:end-1) + delay(2);
    else
        vec(1, 2:2:end-1) = vec(1, 2:2:end-1) + delay(2);
        vec(1, 3:2:end-1) = vec(1, 3:2:end-1) + delay(1);
    end
end

if A(2, end)
    vec(1, end) = vec(1, end) + delay(1);
else
    vec(1, end) = vec(1, end) + delay(2);
end
Listing B.6: MATLAB code for the AND boolean function “Func_AND”.
function [vec] = Func_AND(A, B, Adelays, Bdelays)
if (nargin >= 4)
    A = Func_BUF(A, Adelays);  % path A delays (Func_BUF, Listing B.5)
    B = Func_BUF(B, Bdelays);  % path B delays
end
M = sortrows([B(:, 1:end-1) A(:, 1:end-1)]', 1)';
sInit = (A(2, 1) & B(2, 1));
Sel = (cumsum(M(2, :)) == 2);
R = M(1, Sel);
F = M(1, [false Sel(1:end-1)]);

% Remove zero width glitches
if Sel(end)
    dup = (R(1:end-1) ~= F);
    R = R([dup true]);
    F = F(dup);
else
    dup = (R ~= F);
    R = R(dup);
    F = F(dup);
end

vec = ones(2, (length(F) + length(R) + 2 - sInit));
vec(1, 1) = min(A(1, 1), B(1, 1));
vec(2, 1) = sInit;
if sInit
    vec(1, 2:2:end-1) = F;
    vec(1, 3:2:end-1) = R(2:end);
    vec(2, 2:2:end-1) = -1;
else
    vec(1, 2:2:end-1) = R;
    vec(1, 3:2:end-1) = F;
    vec(2, 3:2:end-1) = -1;
end
vec(1, end) = max(A(1, end), B(1, end));
vec(2, end) = (vec(2, end-1) == 1);
Listing B.7: MATLAB code for the OR boolean function “Func_OR”.
function [vec] = Func_OR(A, B, Adelays, Bdelays)
if (nargin >= 4)
    A = Func_BUF(A, Adelays);  % path A delays (Func_BUF, Listing B.5)
    B = Func_BUF(B, Bdelays);  % path B delays
end
M = sortrows([B(:, 1:end-1) A(:, 1:end-1)]', 1)';
sInit = (A(2, 1) | B(2, 1));
Sel = (cumsum(M(2, :)) == 0);
F = M(1, Sel);
R = M(1, [false Sel(1:end-1)]);

% Remove zero width glitches
if Sel(end)
    dup = (F(1:end-1) ~= R);
    F = F([dup true]);
    R = R(dup);
else
    dup = (R ~= F);
    R = R(dup);
    F = F(dup);
end

vec = ones(2, (length(F) + length(R) + 1 + sInit));
vec(1, 1) = min(A(1, 1), B(1, 1));
vec(2, 1) = sInit;
if sInit
    vec(1, 2:2:end-1) = F;
    vec(1, 3:2:end-1) = R;
    vec(2, 2:2:end-1) = -1;
else
    vec(1, 2:2:end-1) = R;
    vec(1, 3:2:end-1) = F(2:end);
    vec(2, 3:2:end-1) = -1;
end
vec(1, end) = max(A(1, end), B(1, end));
vec(2, end) = (vec(2, end-1) == 1);
Listing B.8: MATLAB code for the XOR boolean function “Func_XOR”.
function [vec] = Func_XOR(A, B, Adelays, Bdelays)
if (nargin >= 4)
    A = Func_BUF(A, Adelays);  % path A delays
    B = Func_BUF(B, Bdelays);  % path B delays
end
M = sortrows([B(:, 1:end-1) A(:, 1:end-1)]', 1)';
sInit = xor(A(2, 1), B(2, 1));
Sel = (cumsum(M(2, :)) == 1);
R = M(1, Sel);
F = M(1, [false Sel(1:end-1)]);

% Remove zero width glitches
if Sel(end)
    % Low-High-Low glitches
    dup = (R(1:end-1) ~= F);
    R = R([dup true]);
    F = F(dup);
    % High-Low-High glitches
    if ~isempty(F)
        dup = (R(2:end) ~= F);
        R = R([true dup]);
        F = F(dup);
    end
else
    % Low-High-Low glitches
    dup = (R ~= F);
    R = R(dup);
    F = F(dup);
    % High-Low-High glitches
    if ~isempty(R)
        dup = (R(2:end) ~= F(1:end-1));
        R = R([true dup]);
        F = F([dup true]);
    end
end
vec = ones(2, (length(F) + length(R) + 2 - sInit));
vec(1, 1) = min(A(1, 1), B(1, 1));
vec(2, 1) = sInit;
if sInit
    vec(1, 2:2:end-1) = F;
    vec(1, 3:2:end-1) = R(2:end);
    vec(2, 2:2:end-1) = -1;
else
    vec(1, 2:2:end-1) = R;
    vec(1, 3:2:end-1) = F;
    vec(2, 3:2:end-1) = -1;
end
vec(1, end) = max(A(1, end), B(1, end));
vec(2, end) = (vec(2, end-1) == 1);
Listing B.9: MATLAB code for the register function with built-in timing failure and output
statistics analyser “Register_Analyser”.
function [QV, PTrans, PHigh, FRate] = Register_Analyser(pclk, Type, metaW, skew, SV, tCQ, RefV)
% QV: register output, PTrans: transition probability, PHigh: high probability, FRate: failure rate
lenClk = length(pclk);
lenS = size(SV, 2);  % Number of signal events (columns of SV)
cycles = (lenClk - 3);
period = pclk(1);
if (Type < 0)  % negative edge triggered
    shift = period / 2;  % assuming 50% duty-cycle
else  % positive edge triggered
    shift = 0;
end
if lenS <= 2  % Special cases for empty, DC signal or signal with a single transition
    if SV(2, 1)
        QV = [[0; 0] [shift + skew; 1] [pclk(end) + shift + skew; 1]];
        PTrans = 1 / cycles;  % not exactly accurate but does not matter
        PHigh = 1;
    else
        QV = [[0; 0] [pclk(end) + shift + skew; 0]];
        PTrans = 0;
        PHigh = 0;
    end
    FRate = 1.0;
    return;
end
pclk(2:end) = pclk(2:end) + (shift + skew + metaW/2);
SS = [SV(:, 2:end-1); zeros(1, lenS-2)];  % Add extra row of 0s to identify it as signal edges
B = [pclk(2:end-1); ones(2, lenClk-2)];   % Add extra rows of 1s to identify it as clock edges
% Sort signal and clock edges in ascending timing order
CombB = sortrows([SS(:, 1:end) B(:, 1:end)]', 1)';

% Remove left signal edges
CombBB = CombB(:, [((CombB(3, 2:end) == 1) | (CombB(3, 1:end-1) == 1)) true]);

% Select the 2nd clock edge following a signal edge
Sel = [false false (CombBB(3, 1:end-2) == 0)] & (CombBB(3, 1:end) ~= 0);
Follow = CombBB(:, Sel);
Follow(2:3, :) = CombBB(1:2, [Sel(3:end) false false]);

% Select and add signal edges data to their following clock edge
sSelect = (CombBB(3, 1:end-1) == 0);
CombBS = [CombBB(1, [false sSelect]); CombBB(1:2, [sSelect false])];

% Add the 2nd clock edge back to the array in correct timing order
CombBS = sortrows([CombBS Follow]', 1)';
if (metaW)
    % Select signal edges that entered the metastable window (metaW)
    FailMask = ((CombBS(1, :) - CombBS(2, :)) < metaW);
    Temp = CombBS(1:3, FailMask);

    % Set output transitions based on probability P
    % P = distance from signal edge to max clock edge divided by the width of the metastable window
    % 1: No failure, 0: Complete failure
    P = (Temp(1, :) - Temp(2, :)) ./ metaW;
    % Actual failure of clock edge determined by "rand" compared against P.
    % clock edge <= P: Pass
    % clock edge > P: Fail

    FSel = (rand(1, length(P)) > P);  % Identify failed signal edges
    Temp(3, FSel) = -Temp(3, FSel);
    CombBS(3, FailMask) = Temp(3, :);
end

% Keep first entry if signal is initially High or first signal is a rising edge
% (Register is assumed to always be initialised to LOW before time 0)
% Remove duplicated edges with same edge type (falling, rising)
DupMask = [((CombBS(3, 1) == 1) | SV(2, 1)) (CombBS(3, 2:end) ~= CombBS(3, 1:end-1))];
CombBD = CombBS([1 3], DupMask);
C = 1 + round((CombBD(1, :) - (skew + metaW/2)) / period);
CombBD(1, :) = CombBD(1, :) + tCQ - metaW/2;
CombC = [(C - 1) .* period; CombBD(2, :)];

if ~isempty(CombBD(1, :))
    if CombBD(2, 1) == 1
        QV = [[0; 0] CombBD [pclk(end); (CombBD(2, end) == 1)]];
        CV = [[0; 0] CombC [pclk(end); (CombC(2, end) == 1)]];
        if CombBD(2, end) == 1
            PHigh = (sum([C(2:2:end) cycles+1]) - sum(C(1:2:end))) / (cycles - C(1) + 1);
        else
            PHigh = (sum(C(2:2:end)) - sum(C(1:2:end))) / (cycles - C(1) + 2);
        end
    else
        QV = [[0; 0] [shift; 1] CombBD [pclk(end); (CombBD(2, end) == 1)]];
        CV = [[0; 0] [shift; 1] CombC [pclk(end); (CombC(2, end) == 1)]];
        if CombBD(2, end) == 1
            PHigh = (sum([C(3:2:end) cycles+1]) - sum(C(2:2:end))) / (cycles - C(1));
        else
            PHigh = (sum(C(3:2:end)) - sum(C(2:2:end))) / (cycles - C(1) + 1);
        end
    end
    PTrans = length(CombBD(1, :)) / (cycles - C(1) + 2);
else
    QV = [[0; 0] [pclk(end); 0]];
    CV = QV;
    PTrans = 0;
    PHigh = 0;
end

% Obtain failure rate
% If the register is set to trigger on positive edge (Type = 1),
% the list of clock periods for plotting FRate should be scaled by 2x to simulate
% the FRD test circuit using inverted clock and half-clock period as timing constraint
RefV = Func_BUF(RefV, [period period] + skew - shift);
Compare = Func_XOR(RefV, CV);
if sum(Compare(2, :) == 1) == 0
    FRate = 0;
elseif length(Compare(1, :)) > 2
    FRate = sum(Compare(1, 3:2:end) - Compare(1, 2:2:end-1));
    if FRate < 1e-4
        FRate = 0;
    else
        FRate = FRate / (RefV(1, end) - RefV(1, 1) - period);
    end
else
    FRate = 0;
end
Appendix C
17-Level Weighted HP Logic Table
Table C.1: Weighted random sequence generation with 17 high probability (HP) levels and
toggle signal. R3, R2, R1 and R0 are the independent uniform random inputs.
Weight  | HP           | Boolean Logic / Hardware
1       | 0            | GND
2       | 0.0625       | R3 · R2 · R1 · R0
3       | 0.125        | R3 · R2 · R1
4       | 0.1875       | R3 · R2 · (R1 + R0)
5       | 0.25         | R3 · R2
6       | 0.3125       | R3 · (R2 + R1 · R0)
7       | 0.375        | R3 · (R2 + R1)
8       | 0.4375       | R3 · (R2 + R1 + R0)
9       | 0.5          | R3
10      | 0.5625       | R3 + R2 · R1 · R0
11      | 0.625        | R3 + R2 · R1
12      | 0.6875       | R3 + R2 · (R1 + R0)
13      | 0.75         | R3 + R2
14      | 0.8125       | R3 + R2 + R1 · R0
15      | 0.875        | R3 + R2 + R1
16      | 0.9375       | R3 + R2 + R1 + R0
17      | 1.0          | VCC
Special | 0.5 (Toggle) | Toggle Flip-flop
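The HP values in Table C.1 can be checked by enumerating all 16 equally likely combinations of the four random inputs and counting how often each Boolean expression evaluates to 1. The MATLAB sketch below does this. It is a verification aid written for this appendix rather than part of the test hardware, and the variable names `levels` and `hp` are illustrative.

```matlab
% Illustrative sketch only: confirm the HP of each weighted-logic level by
% enumerating all 16 combinations of the four independent random inputs.
[R3, R2, R1, R0] = ndgrid([0 1]);
R3 = (R3 == 1); R2 = (R2 == 1); R1 = (R1 == 1); R0 = (R0 == 1);

levels = { ...
    false(size(R3)), ...          % weight 1:  GND
    R3 & R2 & R1 & R0, ...        % weight 2
    R3 & R2 & R1, ...             % weight 3
    R3 & R2 & (R1 | R0), ...      % weight 4
    R3 & R2, ...                  % weight 5
    R3 & (R2 | (R1 & R0)), ...    % weight 6
    R3 & (R2 | R1), ...           % weight 7
    R3 & (R2 | R1 | R0), ...      % weight 8
    R3, ...                       % weight 9
    R3 | (R2 & R1 & R0), ...      % weight 10
    R3 | (R2 & R1), ...           % weight 11
    R3 | (R2 & (R1 | R0)), ...    % weight 12
    R3 | R2, ...                  % weight 13
    R3 | R2 | (R1 & R0), ...      % weight 14
    R3 | R2 | R1, ...             % weight 15
    R3 | R2 | R1 | R0, ...        % weight 16
    true(size(R3))};              % weight 17: VCC

% HP = fraction of the 16 input combinations for which the output is high
hp = cellfun(@(x) mean(double(x(:))), levels);
disp(hp);
```

Each level adds exactly one more input combination to the set that drives the output high, which is why the 17 levels step through HP in increments of 1/16.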
Bibliography
[1] S. S. Sapatnekar, Timing. Springer, 2004.
[2] J. Bhasker and R. Chadha, Static Timing Analysis for Nanometer Designs: A Practical Approach.
Springer, 2009.
[3] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-measurement of combinatorial
circuit delays in FPGAs,” ACM Transactions on Reconfigurable Technology and Systems
(TRETS), vol. 2, no. 2, pp. 1 – 22, 2009.
[4] Cyclone III Device Handbook, Volume 2, Altera Corp., 2008.
[5] K. Katsuki, M. Kotani, K. Kobayashi, and H. Onodera, “A yield and speed enhancement
scheme under within-die variations on 90nm LUT array,” in Proc. IEEE Custom Integrated
Circuits Conference, Sept. 2005, pp. 601 – 604.
[6] L. Cheng, J. Xiong, L. He, and M. Hutton, “FPGA performance optimization via chip-
wise placement considering process variations,” in Proc. International Conference on Field
Programmable Logic and Applications (FPL), Aug. 2006, pp. 44 – 49.
[7] P. Sedcole and P. Y. K. Cheung, “Parametric yield in FPGAs due to within-die delay
variations: A quantitative analysis,” in Proc. 15th ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, Feb. 2007, pp. 178 – 187.
[8] Y. Matsumoto, M. Hioki, T. Kawanami, T. Tsutsumi, T. Nakagawa, T. Sekigawa, and
H. Koike, “Performance and yield enhancement of FPGAs with within-die variation using
multiple configurations,” in Proc. ACM/SIGDA International Symposium on Field
Programmable Gate Arrays - FPGA, Feb. 2007, pp. 169 – 177.
[9] P. Sedcole and P. Y. K. Cheung, “Parametric yield modelling and simulations of FPGA cir-
cuits considering within-die delay variations,” ACM Transactions on Reconfigurable Tech-
nology and Systems, vol. 1, no. 2, 2008.
[10] P. Sedcole, E. Stott, and P. Cheung, “Compensating for variability in FPGAs by re-
mapping and re-placement,” in Proc. International Conference on Field Programmable
Logic and Applications (FPL), Aug. 2009, pp. 613 – 616.
[11] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-characterization of combinatorial
circuit delays in FPGAs,” in Proc. IEEE International Conference on Field-Programmable
Technology, Dec. 2007, pp. 245 – 251.
[12] J. S. J. Wong, P. Y. Cheung, and P. Sedcole, “Combating process variation on FPGAs
with a precise at-speed delay measurement method,” in Proc. International Conference on
Field Programmable Logic and Applications (FPL), Sept. 2008, pp. 703 – 704.
[13] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “A Transition Probability based de-
lay measurement method for arbitrary circuits on FPGAs,” in Proc. IEEE International
Conference on Field-Programmable Technology, Dec. 2008, pp. 105 – 112.
[14] E. A. Stott, J. S. J. Wong, P. Sedcole, and P. Y. Cheung, “Degradation in FPGAs: Mea-
surement and modelling,” in Proc. ACM/SIGDA International Symposium on Field Pro-
grammable Gate Arrays - FPGA, Feb. 2010, pp. 229 – 238.
[15] E. A. Stott, J. S. J. Wong, and P. Y. K. Cheung, “Degradation analysis and mitigation in
FPGAs,” in Proc. International Conference on Field Programmable Logic and Applications
(FPL), Aug. 2010, pp. 428 – 433.
[16] J. S. J. Wong and P. Y. K. Cheung, “Improved Delay Measurement Method in FPGA
based on Transition Probability,” in Proc. ACM/SIGDA International Symposium on Field
Programmable Gate Arrays - FPGA, Feb. 2011.
[17] R. Rajsuman, System-On-A-Chip: Design and Test. Artech House Publishers, 2000.
[18] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin,
K. Flautner, and T. Mudge, “Razor: A low-power pipeline based on circuit-level timing
speculation,” in Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture,
Dec. 2003, pp. 7 – 18.
[19] M. Zhang, T. Mak, J. Tschanz, K. S. Kim, N. Seifert, and D. Lu, “Design for resilience to
soft errors and variations,” in Proc. IOLTS 2007 13th IEEE International On-Line Testing
Symposium, Jul. 2007, pp. 23 – 28.
[20] J. Tschanz, K. Bowman, C. Wilkerson, S.-L. Lu, and T. Karnik, “Resilient circuits: en-
abling energy-efficient performance and reliability,” in Proc. of the 2009 IEEE/ACM In-
ternational Conference on Computer-Aided Design (ICCAD 2009), Nov. 2009, pp. 71 –
73.
[21] V. P. Nelson, “Fault-tolerant computing: Fundamental concepts,” Computer, vol. 23, pp.
19–25, Jul. 1990.
[22] J. Emmert, “Partial reconfiguration of FPGA mapped designs with applications for fault
tolerance and yield enhancement,” in Int. Workshop on Field Programmable Logic and
Applications, Sept. 1997, pp. 141–150.
[23] E. A. Stott, P. Sedcole, and P. Y. Cheung, “Fault tolerant methods for reliability in
FPGAs,” in Proc. International Conference on Field Programmable Logic and Applications
(FPL), Sept. 2008, pp. 415–420.
[24] S.-W. Han, “Flash memory wear leveling system and method,” U.S. Patent 6 016 275, Jan.
18, 2000.
[25] FPGA Architecture White Paper, Altera Corp., 2006.
[26] Advantages of the Virtex-5 FPGA 6-Input LUT Architecture, Xilinx Inc., 2007.
[27] T. Pi and P. J. Crotty, “FPGA lookup table with transmission gate structure for reliable
low-voltage operation,” U.S. Patent 6 667 635 B1, Dec. 23, 2003.
[28] Cyclone III Device Handbook, Volume 1, Altera Corp., 2008.
[29] MultiTrack Interconnect in Stratix III Devices, Altera Corp., 2009.
[30] Stratix III Device Handbook, Altera Corp., 2009.
[31] Virtex-5 User Guide v3.0, Xilinx Inc., 2007.
[32] P. Sedcole, J. S. J. Wong, and P. Y. K. Cheung, “Modelling and compensating for clock
skew variability in FPGAs,” in 2008 International Conference on Field-Programmable
Technology, Dec. 2008, pp. 217 – 224.
[33] S. Nassif, “Delay variability: sources, impacts and trends,” in Proc. IEEE International
Solid-State Circuits Conference, 2000.
[34] P. Sedcole and P. Y. K. Cheung, “Within-die delay variability in 90nm FPGAs and be-
yond,” in Proc. IEEE International Conference on Field-Programmable Technology, Dec.
2006, pp. 97–104.
[35] B. P. Wong, Nano-CMOS circuit and physical design. John Wiley and Sons, 2005.
[36] F. Mesa-Martinez, M. Brown, J. Nayfach-Battilana, and J. Renau, “Measuring power and
temperature from real processors,” in Proc. IEEE International Parallel and Distributed
Processing Symposium, Apr. 2008, pp. 1–5.
[37] J. Xiong, V. Zolotov, C. Visweswariah, and P. A. Habitz, “Optimal margin computation
for at-speed test,” in Proc. Design, Automation and Test in Europe, DATE, Mar. 2008,
pp. 622 – 627.
[38] K. Cao and J. Hu, “ASIC design flow considering lithography-induced effects,” IET Circuits,
Devices and Systems, vol. 2, no. 1, pp. 23 – 29, 2008.
[39] K. Kuhn, C. Kenyon, A. Kornfeld, M. Liu, A. Maheshwari, W.-K. Shih, S. Sivakumar,
G. Taylor, P. VanDerVoorn, and K. Zawadzki, “Managing process variation in Intel’s 45nm
CMOS technology,” Intel Technology Journal, vol. 12, no. 2, pp. 93 – 109, Jun. 2008.
[40] T. Fukuoka, A. Tsuchiya, and H. Onodera, “Statistical gate delay model for multiple
input switching,” IEICE Transactions on Fundamentals of Electronics, Communications
and Computer Sciences, vol. E92-A, no. 12, pp. 3070 – 3078, Dec. 2009.
[41] A. Virazel, R. David, P. Girard, C. Landrault, and S. Pravossoudovitch, “Delay fault
testing: Choosing between random SIC and random MIC test sequences,” Journal of
Electronic Testing: Theory and Applications (JETTA), vol. 17, no. 3-4, pp. 233 – 241,
Jun. 2001.
[42] R. David, P. Girard, C. Landrault, S. Pravossoudovitch, and A. Virazel, “Hardware gener-
ation of random single input change test sequences,” Journal of Electronic Testing: Theory
and Applications, vol. 18, no. 2, pp. 145 – 157, Apr. 2002.
[43] J. Voyiatzis, A. Paschalis, D. Nikolos, and C. Halatsis, “An efficient built-in self test
method for robust path delay fault testing,” Journal of Electronic Testing: Theory and
Applications, vol. 8, no. 2, pp. 219 – 222, Apr. 1996.
[44] E. W. Weisstein, “Primitive polynomial.” [Online]. Available:
http://mathworld.wolfram.com/PrimitivePolynomial.html
[45] Linear Feedback Shift Registers in Virtex Devices, v1.3, Xilinx Inc., 2007.
[46] S. Zhang, R. Byrne, J. Muzio, and D. Miller, “Why cellular automata are better than
LFSRs as built-in self-test generators for sequential-type faults,” in IEEE International
Symposium on Circuits and Systems, vol. 1, May 1994, pp. 69 – 72.
[47] J. Rajski, G. Mrugalski, and J. Tyszer, “Comparative study of CA-based PRPGs and
LFSRs with phase shifters,” in Proceedings 17th IEEE VLSI Test Symposium, Apr. 1999,
pp. 236 – 245.
[48] L.-T. Wang, C.-W. Wu, and X. Wen, VLSI test principles and architectures:
design for testability, ser. The Morgan Kaufmann series in systems on silicon. Academic
Press, 2006.
[49] L. Li and K. Chakrabarty, “Test data compression using dictionaries with fixed-length
indices [SOC testing],” in Proceedings 21st IEEE VLSI Test Symposium, Apr. 2003, pp.
219 – 224.
[50] W. Wang and S. Gupta, “Weighted random robust path delay testing of synthesized mul-
tilevel circuits,” in Proc. 12th IEEE VLSI Test Symposium, Apr. 1994, pp. 291 – 297.
[51] I. Polian and B. Becker, “Scalable delay fault BIST for use with low-cost ATE,” Journal of
Electronic Testing: Theory and Applications, vol. 20, no. 2, pp. 181 – 197, Apr. 2004.
[52] A. Krstic and K.-T. Cheng, Delay fault testing for VLSI circuits, ser. Frontiers in
electronic testing. Springer, 1998, vol. 14.
[53] L. T. Wang, C. E. Stroud, and N. A. Touba, System-on-chip test architectures: nanometer
design for testability, ser. Systems on Silicon Series. Morgan Kaufmann, 2008.
[54] C. Stroud, S. Konala, P. Chen, and M. Abramovici, “Built-in self-test of logic blocks in
FPGAs (finally, a free lunch: BIST without overhead!),” in Proceedings. 14th IEEE VLSI
Test Symposium, Apr. 1996, pp. 387 – 392.
[55] P. Moritz and L. Thorsen, “CMOS circuit testability,” IEEE Journal of Solid-State Circuits,
vol. 21, no. 2, pp. 306 – 309, Apr. 1986.
[56] S. McCracken and Z. Zilic, “FPGA test time reduction through a novel interconnect test-
ing scheme,” in 10th ACM International Symposium on Field-Programmable Gate Arrays
(FPGA), Feb. 2002, pp. 136 – 144.
[57] ——, “Design for testability of FPGA blocks,” in Proc. 5th International Symposium on
Quality Electronic Design, Mar. 2004, pp. 86 – 91.
[58] A. Krasniewski, “Evaluation of delay fault testability of LUT functions for improved effi-
ciency of FPGA testing,” in Proceedings Euromicro Symposium on Digital Systems Design,
Sept. 2001, pp. 310 – 317.
[59] ——, “Evaluation of delay fault testability of LUTs for the enhancement of application-
dependent testing of FPGAs,” Journal of Systems Architecture, vol. 49, no. 4-6, pp. 283 –
296, Sept. 2003.
[60] ——, “On the set of target path delay faults in sequential subcircuits of LUT-based FP-
GAs,” in Proc. International Conference on Field Programmable Logic and Applications
(FPL), Sept. 2002, pp. 596 – 606.
[61] ——, “Exploiting reconfigurability for effective testing of delay faults in sequential sub-
circuits of LUT-based FPGAs,” in Proc. International Conference on Field Programmable
Logic and Applications (FPL), Sept. 2002, pp. 616 – 626.
[62] M. Ruffoni and A. Bogliolo, “Direct measures of path delays on commercial FPGA chips,”
in Proceedings - 6th IEEE Workshop on Signal Propagation on Interconnects, SPI, May
2002, pp. 157 – 159.
[63] K. Katoh, T. Tanabe, H. Zahidul, K. Namba, and H. Ito, “A delay measurement technique
using signature registers,” in Proc. 18th Asian Test Symposium (ATS 2009), Nov. 2009,
pp. 157 – 162.
[64] A. Raychowdhury, S. Ghosh, and K. Roy, “A novel on-chip delay measurement hardware
for efficient speed-binning,” in Proceedings - 11th IEEE International On-Line Testing
Symposium, IOLTS 2005, Jul. 2005, pp. 287 – 292.
[65] T. Matsumoto, “High-resolution on-chip propagation delay detector for measuring within-
chip variation,” in International Conference on Integrated Circuit Design and Technology,
May 2005, pp. 217 – 220.
[66] M. Abramovici and C. Stroud, “BIST-based delay-fault testing in FPGAs,” Journal of
Electronic Testing: Theory and Applications, vol. 19, no. 5, pp. 549 – 558, Oct. 2003.
[67] J.-P. Jansson, A. Mantyniemi, and J. Kostamovaara, “A CMOS time-to-digital converter
with better than 10 ps single-shot precision,” IEEE Journal of Solid-State Circuits, vol. 41,
no. 6, pp. 1286 – 1296, Jun. 2006.
[68] S. Pei, H. Li, and X. Li, “A low overhead on-chip path delay measurement circuit,” in
Proc. 18th Asian Test Symposium (ATS 2009), Nov. 2009, pp. 145 – 150.
[69] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer, “Statistical timing analysis: from
basic principles to state of the art,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 27, no. 4, pp. 589 – 607, Apr. 2008.
[70] I. Nitta, T. Shibuya, and K. Homma, “Statistical static timing analysis technology,” Fujitsu
Scientific and Technical Journal, vol. 43, no. 4, pp. 516 – 523, Oct. 2007.
[71] K. Kundert, Predicting the Phase Noise and Jitter of PLL-Based Frequency Synthesizers,
4th ed., Designer’s Guide Consulting Inc., Aug. 2006.
[72] P. Sedcole, J. S. J. Wong, and P. Y. K. Cheung, “Characterisation of FPGA clock variability,”
in Proc. IEEE Computer Society Annual Symposium on VLSI (ISVLSI ’08),
Apr. 2008, pp. 322 – 328.
[73] Virtex-4 User Guide v2.2, Xilinx Inc., 2007.
[74] Clock Networks and PLLs in the Cyclone III Device Family, Altera Corp., 2009.
[75] General-Purpose PLLs in Stratix & Stratix GX Devices, Altera Corp., 2005.
[76] “PLL clock management features in Altera FPGAs.” [Online]. Available:
ftp://ftp.altera.com/outgoing/download/bsdl/PLLs%20 Features in Altera FPGAs.xls
[77] J. Xiong, V. Zolotov, and L. He, “Robust extraction of spatial correlation,” in ISPD ’06:
Proceedings of the 2006 international symposium on Physical design, 2006, pp. 2–9.
[78] B. Hargreaves, H. Hult, and S. Reda, “Within-die process variations: How accurately
can they be statistically modeled?” in Proceedings of the Asia and South Pacific Design
Automation Conference, ASP-DAC, 2008, pp. 524 – 530.
[79] A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation of average switching activity
in combinational and sequential circuits,” in Proc. 29th ACM/IEEE Design Automation
Conference, Jun. 1992, pp. 253–259.
[80] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT
Press, 1949.
[81] Understanding Metastability in FPGAs, Altera Corp., 2009.
[82] Difference-Based Partial Reconfiguration, Xilinx Inc., 2007.
[83] Increasing Design Functionality with Partial and Dynamic Reconfiguration in 28-nm FP-
GAs, Altera Corp., 2010.
[84] S. J. E. Wilton, S.-S. Ang, and W. Luk, “The impact of pipelining on energy per operation in
field programmable gate arrays,” in Proc. 14th International Conference on Field Programmable
Logic and Applications (FPL 2004), Aug. 2004, pp. 719 – 728.
[85] J. Lamoureux, G. G. Lemieux, and S. J. E. Wilton, “Glitchless: An active glitch minimization
technique for FPGAs,” in ACM/SIGDA International Symposium on Field Programmable
Gate Arrays - FPGA, Feb. 2007, pp. 156 – 165.
[86] Implementing High Performance DSP Functions in Stratix & Stratix GX Devices, Altera
Corp., 2004.
[87] A. Maiti and P. Schaumont, “Improving the quality of a physical unclonable function
using configurable ring oscillators,” in Proc. 19th International Conference on Field Pro-
grammable Logic and Applications (FPL), Aug. 2009, pp. 703 – 707.
