Timing Measurement Platform for Arbitrary Black-Box Circuits Based on Transition Probability by Wong, JSJ & Cheung, PYK
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1
Timing Measurement Platform for Arbitrary
Black-Box Circuits Based on Transition Probability
Justin S. J. Wong and Peter Y. K. Cheung
Abstract— The key aspects of a good on-chip timing mea-
surement platform are high measurement resolution, accuracy,
and low area overhead. A measurement method based on
transition probability (TP) has shown promising characteristics
in all these areas. In this paper, the TP measurement method
is examined through simulation to understand its apparent
effectiveness and accuracy in measuring complex circuits. Timing
uncertainties and logic glitch activities are considered in detail,
and the effect of varying input vectors’ probability distributions
is analyzed to enable further accuracy improvements. Using a
field-programmable gate array, the method is implemented and
demonstrated as a modular on-chip test platform for testing
complex arbitrary circuits. Practical circuits found in typical
modular designs, including fixed/floating-point arithmetic and
filter circuits, are chosen to evaluate the test platform. The
resolution of the timing measurements ranges from 0.3 to 8.0 ps,
and the measurement errors against reference measurements are
found to be within 3.6%. The test platform can be applied to
VLSI designs with minor area overhead, and provides designers
with precise and accurate physical timing information of circuits.
Index Terms— Reliability, testing, timing measurement.
I. INTRODUCTION
DESIGN with VLSI circuits, as both application-specific
integrated circuits (ASICs) and programmable architectures,
e.g., field-programmable gate arrays (FPGAs), is
facing a dilemma. While the need for greater operating speed
encourages designers to push ever closer toward the absolute
physical timing limit of hardware circuits, the increase in
circuit complexity and density decreases timing predictability
due to process variations [1], [2]. This forces the use of wider
timing margins to ensure reliability, but this also sacrifices
potential timing performance. The most obvious solution is to
measure the actual circuit timing chip by chip, down to specific
parts or components, such that deterministic timing informa-
tion of critical circuit paths/components is available to allow
much narrower timing margins while maintaining reliability.
The use of intellectual property (IP) and embedded blocks is,
however, problematic, since it is not possible to isolate/extract
their actual critical paths due to encrypted/undisclosed circuit
information. There are many existing timing measurement
methods, but most of them are neither practical nor applicable
to “black-box” circuits. For ring-oscillator-based methods [3],
Manuscript received June 25, 2012; revised October 5, 2012; accepted
November 12, 2012. This work was supported in part by the U.K. Engi-
neering and Physical Sciences Research Council (EPSRC) under Grant
1052EP/H013784/1 and Grant EP/C549481/1.
The authors are with Imperial College London, London SW7 2AZ, U.K.
(e-mail: justin.s.wong@imperial.ac.uk; p.cheung@imperial.ac.uk).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2012.2230280
extra interconnects for completing the feedback path impact
measurement accuracy and prohibit registers in the circuit.
Similarly, methods relying on time-to-digital conversion using
a Vernier delay line [4], an analog-based delay converter [5], and
a delay-locked loop [6] also fail to support registered paths, and
the extra interconnects needed to route the inputs and outputs
to and from the test circuit cause inaccuracy. Scan-chain-based
methods for pass/fail analysis at specific clock frequency can
be repeated over multiple frequency steps to measure delay [7].
However, the test time is impractical for high timing resolution
measurements, since capturing and analyzing output samples
for many frequency steps through the scan chain is time con-
suming. We introduced a method that allows high-resolution
delay measurement of a register-to-register path through failure
rate detection (FRD) [8]. However, it is limited to testing one
path at a time and cannot directly support multipath circuits
such as state machines and pipelined circuits.
We proposed a measurement method in [9] that was shown
to give relatively accurate measurements for multipath com-
plex circuits [10], providing an attractive solution to the prob-
lem. The method is based on measuring the output transition
probability (TP) of a circuit under test (CUT). It is able to
achieve the same measurement accuracy and resolution as
FRD for isolated path but requires fewer hardware resources
[9], [10]. Test time is also significantly shorter than scan-chain-
based methods for complex circuits at high timing resolution.
The remaining question is what mechanisms are behind the TP
method that govern the apparent high measurement accuracy,
and how it can be improved further, as well as how well it
performs over a wider range of practical circuits as a universal
test platform.
This paper proceeds as follows. Section II examines TP and
related measurement methods in detail. Then, in Sections III
and IV, TP is simulated and modeled to further understand
its characteristics. In Section V, the mechanisms that allow
TP measurement to infer delay of a circuit are examined and
the possible causes of measurement inaccuracy are identi-
fied. Actual measurements of corner case logic functions on
FPGA are analyzed in Section VI along with investigation of
measurement accuracy. Next, the detailed implementation and
test procedure of the complete TP test platform are described
in Section VII. The test platform is demonstrated on FPGA
with practical designs (arithmetic, filter, and state machine) in
Section VIII along with accuracy and consistency evaluations.
Finally, the test time and resource usage of the test platform
are estimated in Section IX. Section X concludes this paper.
II. FAILURE RATE, TRANSITION PROBABILITY, AND
HIGH-PROBABILITY METHODS
The FRD method introduced earlier was inspired by a
register-level timing error detection/correction mechanism
1063–8210/$31.00 © 2013 IEEE
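[Figure 1: block diagram. (a) Input V, launch register R, combinatorial circuit with output z, sample register R with output S, clock period T. (b) Failure rate detector (FRD): at-speed reference generator, comparator, and error counter, producing a failure rate profile (failure rate versus clock frequency, rising from 0 at fmax).]
Fig. 1. (a) Typical combinatorial circuit between input and output registers,
followed by (b) a failure rate detector (FRD) circuit.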
(RAZOR) proposed by [11], which can be adapted to infer
path delay [12]. The FRD method measures the timing failure
rate of a circuit path while stepping up the system clock
frequency, and infers the path delay from the point at which
timing failures begin to occur. Failure rate is measured by
comparing the registered output of a path under test (PUT)
against a correct reference signal using a hardware comparator
[Fig. 1(b)]. The biggest drawback of FRD is the need for
an at-speed reference generator circuit which must operate
correctly beyond the maximum speed of the PUT, and often
utilizes an extensive amount of hardware resources [13]. We
were able to efficiently generate a reference signal using the
preregistered output of the PUT [8]. However, the technique
is applicable to testing only one isolated combinatorial path
at a time.
Similar to the FRD method, the TP method [9], [10] mea-
sures circuit delay by detecting timing failures. However, the
failures are inferred indirectly from statistical observation of
the CUTs’ outputs instead of using a hardware comparator and
a reference generator. This significantly reduces the hardware
resource usage, and it also enables the method to be used on
multipath multistage pipelines and sequential circuits. In conjunction
with the TP method, we also introduce a high-probability
(HP) based method, which behaves similarly to TP
and enables cross-comparison and analysis of the TP method.
A. Definitions of TP/HP and Measurement Concepts
Consider a typical synchronous circuit with a combinatorial
stage and output register [Fig. 1(a)]. The output signal from the
register can be seen as a series of discrete time samples S(k) of
the preceding combinatorial output, where k = 1, 2, . . . Since
the output sample rate obeys the clock frequency driving the
register, two types of relative statistical measurement over N
clock cycles can be observed.
1) The HP or H (S), where H (S) = P{S(k) = 1}. It
represents the ratio of the number of samples whose S
is high over N clock cycles. It is a first-order statistical
measurement of S and its value lies within the range 0–1.
2) The TP or D(S), which is the probability that S
changes state between consecutive samples, i.e., the
average number of signal transitions in S per cycle over
N clock cycles. It is given by
$$
D(S) = P\{S(k+1) \neq S(k)\}
     = P\{S(k) = 0\}\,P\{S(k+1) = 1\} + P\{S(k) = 1\}\,P\{S(k+1) = 0\}. \tag{1}
$$
D(S) is a second-order statistical measurement of S.
When S contains random binary samples, D(S) obeys the
following quadratic relationship with a maximum of 0.5:
$$
D(S) \approx 2\,H(S)\,\bigl(1 - H(S)\bigr). \tag{2}
$$
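Relationship (2) can be checked numerically. The following Python sketch is illustrative only: it assumes independent random binary samples, and the helper names `hp_tp` and `random_bits` are hypothetical, not part of the measurement platform. It estimates H(S) and D(S) from a sample stream and compares D(S) with 2H(S)(1 − H(S)).

```python
import random


def hp_tp(samples):
    """Estimate (H, D) from a list of 0/1 samples.

    H is the fraction of samples that are high.
    D is the fraction of consecutive sample pairs that differ.
    """
    n = len(samples)
    h = sum(samples) / n
    d = sum(a != b for a, b in zip(samples, samples[1:])) / (n - 1)
    return h, d


def random_bits(p_high, n, seed=0):
    """Independent random bits that are high with probability p_high (assumed model)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_high else 0 for _ in range(n)]


if __name__ == "__main__":
    for p in (0.25, 0.5, 0.75):
        h, d = hp_tp(random_bits(p, 200_000))
        print(f"p={p:.2f}  H={h:.3f}  D={d:.3f}  2H(1-H)={2 * h * (1 - h):.3f}")
```

A deterministic toggling stream such as 0, 1, 0, 1 gives H = 0.5 but D = 1.0, which shows that (2) holds only for random samples and not for arbitrary stationary patterns.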
It was shown in [14] that the probability of an output
of a Boolean function evaluating to 1 is equal to the sum
of the probabilities of each of the disjoint cubes in the
[Figure 2: sketched profiles of statistics versus clock frequency. (a) Input statistics: stationary. (b) Output statistics: stationary up to fmax, after which the CUT has failed.]
Fig. 2. Example statistical profiles of (a) an input to a circuit generated from
a stationary process and (b) the output of the circuit that failed after fmax in
the clock frequency domain.
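[Figure 3: block diagram. A test vector generator (TVG) drives input V through the launch register (LR) into the circuit under test (CUT); output y is captured by the sample register (SR) and passed to the TP or HP measurement circuitry. A test clock generator (TCG) clocks the registers. The resulting TP or HP profile is processed by the probability profile analyser (PPA) to give circuit timing measurements.]
Fig. 3. Circuit diagram showing the principle of measurement of the TP
and HP.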
cover evaluating to 1. If the input vectors of a circuit are
chosen randomly or follow a fixed sequential pattern, i.e., the
vectors form a stationary process, then the probability of its
output(s) evaluating to 1 will be stationary as well. Therefore,
H (S) and D(S) of the output samples will be stationary
(unchanging). Any timing violations in the circuit disrupting
the stationary process would cause the probabilities to change
and hence indicate a timing failure. The idea is illustrated by
the example plots in Fig. 2. Such disruption due to timing
violation can be explained through the following example.
In Fig. 1(a), the output register captures a sample S(k) of the
output z after time T , one clock cycle after applying the input
V (k). If the clock frequency is low enough, then the circuit
operates without faults: S(k) = z(k), and the probabilities
H (S) and D(S) remain stationary. However, if the test
clock frequency is increased step by step, at some point the
clock period will breach the timing constraint imposed by the
propagation delay of z, and the register will begin to sample
the z value from the previous cycle, such that S(k) = z(k −1).
This disrupts the stationary process and causes H (S) and D(S)
(HP and TP) to deviate from their normal stationary values.
The HP or TP value for each frequency step is collected to
plot a profile that shows the failure behavior of the CUT over
a range of test frequencies (Fig. 2), and is used to estimate
the maximum operating frequency ( fmax) or circuit delay.
This test method relies on two features: 1) the ability to
sweep the test clock frequency fclk in fine steps and 2) the
ability to infer circuit delay from TP and/or HP measurements
assuming they reflect timing failures in the circuit as frequency
is swept from low to high. The clock generation and sweeping
process for feature 1) has been implemented in [8], [13],
and [15] using phase-locked loops (PLLs) and/or digital clock
managers (DCMs) [16]. Feature 2) is evaluated and
simulated in the following sections to understand how TP and
HP respond to timing failure in real circuits.
B. Measurement Circuit
The top-level implementation of the measurement circuit
is depicted in Fig. 3. The CUT represents combinatorial or
sequential circuits with input V and output y. The launch
register (LR) and the sample register (SR) at the beginning
and end of the CUT are clocked by a test clock generator
(TCG) which steps through a range of test frequencies. The
minimum achievable timing resolution (Δt) in terms of the
[Figure 4: block diagram. N samples from the CUT feed an asynchronous transition counter; the transition count is passed to the probability profile collector (PPC), which plots the probabilities for each clock frequency to form the TP (or HP) profile. In the alternative HP measurement path, a toggle flip-flop (TFF) clocked by the test clock converts high cycles to transitions before the counter.]
Fig. 4. Measurement circuit for the TP profile (or HP profile, using TFF path).
start clock frequency (f) and frequency step size (Δf) is
expressed in [8] as
$$
\Delta t \approx \frac{\Delta f}{f^{2}}. \tag{3}
$$
If f begins at 200 MHz and increments in steps of 0.1 MHz
(Δf), the minimum timing resolution achieved is 2.5 ps.
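The arithmetic behind this example can be reproduced directly. The Python sketch below is illustrative; the function names are hypothetical. It evaluates the approximation in (3) and compares it with the exact difference between the clock periods at f and at f plus one step.

```python
def timing_resolution(f_hz, df_hz):
    """Approximate timing resolution from (3): delta_t ~= delta_f / f**2."""
    return df_hz / f_hz ** 2


def exact_period_step(f_hz, df_hz):
    """Exact change in clock period when the frequency rises from f to f + df."""
    return 1.0 / f_hz - 1.0 / (f_hz + df_hz)


if __name__ == "__main__":
    f, df = 200e6, 0.1e6  # 200 MHz start frequency, 0.1 MHz step
    print(f"approximate: {timing_resolution(f, df) * 1e12:.4f} ps")
    print(f"exact:       {exact_period_step(f, df) * 1e12:.4f} ps")
```

Because the step in period shrinks with the square of the frequency, the same frequency step gives a finer timing resolution at higher test frequencies.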
A test vector generator (TVG) provides test vectors V to
the CUT such that during normal operation each output bit of
the CUT exhibits a nonzero and steady output TP/HP. The
measurement circuitry at the output of the CUT generates
either TP or HP profile and it is analyzed by the proba-
bility profile analyzer (PPA) to produce delay measurement
of the CUT. The measurement circuits for generating TP
or HP profile are depicted in Fig. 4. The asynchronous
transition counter is implemented as a ripple counter that
uses its input as a clock and processes N samples from the
CUT’s output register over N test clock cycles for each
frequency step. The count is stored by the probability profile
collector (PPC) for all test frequency steps to create a profile.
In practice, as long as N is kept constant across all frequency
steps, circuit delay can be inferred directly by the PPA through
detecting the change in transition count, without the need for
dividing by N to obtain absolute TP values. Also, it is only
necessary to store the most recent count in the PPC for change
detection. Full TP/HP profiles are collected only for illustration
and to gain insight into the circuit failure process.
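The change detection described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the PPA implementation: the function name `first_deviation`, the comparison against the previous step only, and the `tolerance` parameter are all assumptions made for the example. It works on raw transition counts, which is valid as long as N is the same for every frequency step.

```python
def first_deviation(freqs_mhz, counts, tolerance):
    """Return the first test frequency whose transition count differs from the
    previous step's count by more than `tolerance`, or None if no such step exists.

    Only the most recent count is kept, mirroring the observation that a full
    profile does not need to be stored for change detection.
    """
    previous = counts[0]
    for freq, count in zip(freqs_mhz[1:], counts[1:]):
        if abs(count - previous) > tolerance:
            return freq
        previous = count
    return None


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not measured data.
    freqs = [200.0, 200.1, 200.2, 200.3]
    counts = [1000, 1002, 999, 940]
    print(first_deviation(freqs, counts, tolerance=20))
```

The tolerance would have to be chosen above the normal statistical fluctuation of the count for the chosen N; how that is done in practice is not specified here.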
For HP, the CUT’s output is first fed through a toggle
flip-flop (TFF) to translate each sample at logical high into
a signal transition for the asynchronous transition counter to
count the number of high cycles and allow the PPC to produce
an HP profile. Notice that the TFF must be synchronized to
the test clock. Therefore, to prevent clock skew related errors,
it is important to place it close to the output registers of the
CUT where the relative clock skew is small. For this reason,
the purely asynchronous TP measurement circuit is preferred
due to the greater robustness, freedom of placement location,
and lower resource usage.
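The role of the TFF can be illustrated with a small behavioral model in Python. This is a sketch under simple assumptions (ideal flip-flop, no clock skew), and the function names are hypothetical. The TFF toggles its output once for every sample that is high, so the number of transitions seen by the transition counter equals the number of high cycles.

```python
def tff_output(samples, initial=0):
    """Toggle flip-flop: the output toggles on each clock cycle whose input sample is high."""
    q = initial
    out = []
    for s in samples:
        if s:
            q ^= 1
        out.append(q)
    return out


def count_transitions(signal, initial=0):
    """Count the changes of state in a signal, starting from a known initial level."""
    prev = initial
    count = 0
    for value in signal:
        if value != prev:
            count += 1
        prev = value
    return count


def high_probability(samples):
    """HP estimated as (transitions at the TFF output) / N."""
    return count_transitions(tff_output(samples)) / len(samples)


if __name__ == "__main__":
    s = [1, 0, 1, 1, 0, 0, 1, 0]
    print(count_transitions(tff_output(s)), sum(s), high_probability(s))
```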
III. CHARACTERISTICS OF TP AND HP PROFILES
To further understand the characteristics of the statistical
profiles of TP and HP, the circuit in Fig. 3 is simulated with a
single register-to-register path. Signal transitions are simulated
as timing events, taking into account their interaction and
propagation along the path. The propagation delays of the
path for rising and falling transitions are set to two distinctive
values, and the registers are driven by a clock with 30-ps-
wide uniform clock jitter distribution around each expected
clock edge (see Fig. 7). The path is stimulated by an input that
toggles every clock cycle, and the clock frequency increases
in 1.0-MHz steps. TP and HP are recorded over 2000 clock
cycles (trials) for each frequency step to construct their profiles
against frequency. 2000 cycles are sufficient because only one
path is being tested. For actual circuits with many interacting
[Figure 5: four panels plotted against frequency, 700 to 850 MHz. (a) Failure rate (%), with the 25% failure point marked. (b) Transition probability, with TP = 0.5 marked. (c) High probability, with HP = 0.75 marked. (d) High probability with swapped transition types, with HP = 0.25 marked. Each panel marks where the slow (falling) and fast (rising) transitions fail. Curves: ideal; with clock jitter; with clock jitter + metastability.]
Fig. 5. Comparison of simulated failure rate and statistical profiles of a
single path in ideal conditions, with clock jitter and flip-flop metastability.
(a) Failure rate profiles. (b) TP profiles. (c) HP profiles. (d) HP profiles with
swapped transition types for the same slow and fast delay values.
paths, more test cycles are necessary (see Section VI). Flip-flop
metastability is accounted for by a metastable window (20 ps
wide) defined by symmetrical setup and hold times. The
probability of resolving to the previous cycle’s output is set to
be linearly proportional to where the violation occurred, i.e.,
the probability varies linearly from 0 to 1 depending on where
the violation occurs between the start and end of the window.
The simulated TP and HP profiles of a path with falling
and rising transition delays of 1350 and 1250 ps (∼740 and
800 MHz) are shown in Fig. 5, where a reference failure rate
profile from the FRD method is included. In Fig. 5(a), the first
change in failure rate around 25% reflects the failure of the
slower signal transitions (falling transitions in this case). The
second change around 75% shows the failure of the quicker
rising transitions, leading to 100% failure toward the end. The
nominal maximum operating frequency ( fmax) of the path can
be deduced from the mid-point of the first failure slope at
25%. The TP and HP plots [Fig. 5(b) and (c)] infer timing
failures in a similar way to the failure rate profile. Both TP
and HP respond to the failure of the falling/rising transitions
around TP = 0.5 and HP = 0.75, respectively, with similar
slopes. By taking their mid-point, the same nominal fmax can
be obtained accurately.
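The plateau values of these profiles can be reproduced with a far simpler model than the event-based simulation used for Fig. 5. The Python sketch below is illustrative only and rests on stated assumptions: a toggling input, a register that captures the new value when the transition delay fits within the jittered period T + τ and the previous value otherwise, uniform jitter 30 ps wide, and no metastability. The function name `simulate_path` and its parameters are hypothetical.

```python
import random


def simulate_path(freq_mhz, t_rise_ps, t_fall_ps, jitter_ps=30.0, cycles=2000, seed=0):
    """Return (TP, HP) of the sampled output of one path at one test frequency.

    Assumed model: the input toggles every cycle; cycle k has an effective
    period T + tau_k with tau_k uniform in [-jitter/2, +jitter/2]; a transition
    slower than that period is missed and the old value is captured instead.
    """
    rng = random.Random(seed)
    period_ps = 1e6 / freq_mhz          # clock period in ps for a frequency in MHz
    samples = []
    z_prev = 0
    for _ in range(cycles):
        z_new = z_prev ^ 1              # toggling stimulus
        delay = t_rise_ps if z_new == 1 else t_fall_ps
        tau = rng.uniform(-jitter_ps / 2, jitter_ps / 2)
        captured = z_new if delay <= period_ps + tau else z_prev
        samples.append(captured)
        z_prev = z_new
    hp = sum(samples) / cycles
    tp = sum(a != b for a, b in zip(samples, samples[1:])) / (cycles - 1)
    return tp, hp


if __name__ == "__main__":
    for f in range(700, 851, 10):
        tp, hp = simulate_path(f, t_rise_ps=1250.0, t_fall_ps=1350.0)
        print(f"{f} MHz  TP={tp:.2f}  HP={hp:.2f}")
```

Under these assumptions, TP starts at 1.0, falls to 0 once only the falling transitions fail, and returns to 1.0 once both transition types fail, while HP moves from 0.5 to 1.0 and back to 0.5. Swapping the two delay values sends HP to 0 instead of 1.0 on the plateau, which corresponds to the flipped profile of Fig. 5(d).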
One interesting aspect about HP is that it allows us to
identify the transition type (rise or fall) corresponding to the
path delay. If the transition types are swapped for the same
slow and fast delay values, both the failure rate and TP profiles
will remain exactly the same, but with HP [Fig. 5(d)], the
profile is flipped upside down around HP = 0.5.
A. Contributions of Timing Uncertainties
The main concern with TP or HP arises when the delays of the
rising and falling transitions are symmetrical (equal), potentially
causing the two failure slopes to overlap and cancel each
other out. However, timing uncertainties (clock jitter and
metastability) prevent the TP profile
from completely losing sensitivity to timing failure.
Their effects are illustrated in Fig. 6 with both delays at
1300 ps (∼770 MHz), which shows the positive effect
of random clock jitter. When jitter is absent, the TP profile has
no sensitivity to timing failure, whereas the cases
with jitter and metastability produce easily distinguishable TP
responses.
Fig. 6 shows that the HP profile is insensitive to timing
failure, showing only noise-like fluctuation during the expected
period of timing failure. This clearly shows the advantage of
using the second-order TP as the indicator of timing failure
[Figure 6: (a) transition probability and (b) high probability versus frequency, 740 to 800 MHz. Curves: ideal; with clock jitter; with clock jitter + metastability.]
Fig. 6. Simulated (a) TP and (b) HP profiles of a circuit path with identical
rising and falling transition delays under random edge-to-edge clock jitter.
The effect of clock jitter and flip-flop metastability is depicted by their
corresponding plots.
[Figure 7: timing diagram. The test clock edge nominally falls at T and actually falls at T + τ, within the clock jitter distribution PDF(τ) bounded by T + τmin and T + τmax. The combinatorial output is shown with its tslow (fall) and tfast (rise) transitions.]
Fig. 7. Timing diagram illustrating the combinatorial output of a circuit path
and the corresponding clock edges with jitter. The edge jitter is described by a
random variable τ in terms of timing variation from the expected clock edge.
The clock jitter distribution PDF(τ) is centered around the expected clock
edge at T and is bounded by τmin and τmax.
in this case. The reason for the sensitivity loss with HP will
be discussed further in Section V-A.
1) Effect of Random and Correlated Jitter: Clearly, clock
jitter leads to the success of the TP method. Yet, the behavior
of jitter could vary between different clock sources, and jitter
could be induced by different processes [17]. Therefore, it
is important to thoroughly understand how these differences
could affect the resultant TP profile.
The main concept of jitter is illustrated in Fig. 7, where jitter
is described by a random variable τ relative to the expected
clock edge at time T . Since the CUT’s combinatorial output
must settle within one clock cycle (assuming no multicycle
transfer), we can model the clock period variation between two
consecutive clock edges by T + τ as pure edge-to-edge jitter,
assuming a jitter-free initial clock edge. According to [17],
there are two main types of jitter that could affect the apparent
edge-to-edge jitter experienced by a CUT.
1) Edge-to-edge random jitter—independent random phase
variation between each clock edge.
2) Low-frequency multiple-cycle random jitter—random
but gradual phase drift over multiple clock cycles, which
causes high degree of edge-to-edge correlation.
Type 2 jitter can be simulated by interpolating a low-frequency
random drift to compute intermediate jitter values
at higher frequency for each clock cycle, with edge-to-edge
correlation. In comparison to uncorrelated jitter, Fig. 8 shows
that the edge-to-edge correlation of the low-frequency multi-
cycle jitter causes a significantly smaller TP response during
the period of timing failure. The reduced TP sensitivity is
explained as follows. Consider the combinatorial output in
Fig. 7, with the delays coinciding exactly with the clock edges
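(tslow = tfast = T). The general expression of TP under such
a condition is given by
$$
TP = P(\mathrm{Fall}_{\mathrm{fail}} \cap \mathrm{Rise}_{\mathrm{fail}})
   + P(\overline{\mathrm{Fall}_{\mathrm{fail}}} \cap \overline{\mathrm{Rise}_{\mathrm{fail}}}). \tag{4}
$$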
Without jitter correlation, the probability that one type of
transitions fails is 0.5, independent of the outcome of the
[Figure 8: (a) transition probability and (b) high probability versus frequency, 740 to 800 MHz. Curves: ideal; with clock jitter; with clock jitter + metastability.]
Fig. 8. Simulated (a) TP and (b) HP profiles of a circuit path with identical
rising and falling edge delay under low frequency clock jitter. The effects
of clock jitter and flip-flop metastability are depicted by their corresponding
plots, where the TP plot with jitter has a reduced sensitivity to timing failure
due to edge-to-edge jitter correlation.
[Figure 9: "Edge-to-Edge vs. Low Frequency Random Jitter". Transition probability versus frequency, 700 to 850 MHz, with D(y) = 0.5 marked. Curves: edge-to-edge random jitter; low-frequency random jitter.]
Fig. 9. Comparison between TP profiles with low-frequency multicycle
random jitter and edge-to-edge random jitter in a circuit path with different
rising and falling transition delays.
previous transition, hence $TP = 0.5^{2} + (1 - 0.5)^{2} = 0.5$,
which agrees with the TP value in Fig. 6 at ∼770 MHz.
However, when jitter correlation exists, the probability that
a transition fails becomes dependent on the outcome of the
previous transition. This increases both the probability that the two
transitions fail together, $P(\mathrm{Fall}_{\mathrm{fail}} \cap \mathrm{Rise}_{\mathrm{fail}})$,
and the probability that neither fails, $P(\overline{\mathrm{Fall}_{\mathrm{fail}}} \cap \overline{\mathrm{Rise}_{\mathrm{fail}}})$,
which results in a smaller TP change from its initial value of 1.0
and lower sensitivity to timing errors (Fig. 8). The TP
deviation caused by metastability becomes more apparent in
this case and helps maintain timing failure sensitivity, since
it introduces uncertainty that is always independent of clock
edges regardless of jitter correlation.
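The effect of jitter correlation on (4) can be illustrated with a short Monte Carlo experiment in Python. The sketch below is an assumption-laden illustration, not the simulator used for Fig. 8: both delays equal the nominal period, a cycle is taken to fail when its jitter value τ is negative, metastability is ignored, and low-frequency jitter is modeled by linear interpolation between random anchor values placed every `span` cycles. All function names and the choice of `span` are hypothetical.

```python
import random


def tp_symmetric(jitter):
    """TP from (4) for a toggling path whose rise and fall delays both equal T.

    Cycle k fails when tau_k < 0. The sampled output changes state only when two
    consecutive cycles both fail or both pass, so TP is the fraction of
    consecutive cycle pairs with the same outcome.
    """
    fails = [tau < 0 for tau in jitter]
    same = sum(a == b for a, b in zip(fails, fails[1:]))
    return same / (len(fails) - 1)


def edge_to_edge_jitter(n, width_ps, rng):
    """Independent uniform jitter value for every cycle."""
    return [rng.uniform(-width_ps / 2, width_ps / 2) for _ in range(n)]


def low_frequency_jitter(n, width_ps, span, rng):
    """Slow drift: linear interpolation between random anchors every `span` cycles."""
    anchors = [rng.uniform(-width_ps / 2, width_ps / 2) for _ in range(n // span + 2)]
    out = []
    for k in range(n):
        i, step = divmod(k, span)
        out.append(anchors[i] + (anchors[i + 1] - anchors[i]) * step / span)
    return out


rng = random.Random(1)
tp_uncorrelated = tp_symmetric(edge_to_edge_jitter(50_000, 30.0, rng))
tp_correlated = tp_symmetric(low_frequency_jitter(50_000, 30.0, 50, rng))

if __name__ == "__main__":
    print(f"edge-to-edge jitter:  TP = {tp_uncorrelated:.3f}")
    print(f"low-frequency jitter: TP = {tp_correlated:.3f}")
```

With independent jitter the estimate stays close to 0.5, whereas the slowly drifting jitter keeps consecutive outcomes aligned and leaves TP close to its initial value of 1.0, which is the reduced sensitivity described above.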
Note that, when the rising and falling transition delays are
asymmetrical with tslow ≠ tfast, the probability that they both
fail at the same frequency is zero. Hence, the TP profile is
unaffected by jitter correlation (Fig. 9).
A simple test is carried out on an Altera Cyclone III
FPGA [15] to identify possible edge-to-edge jitter correlation
in its clock signal. The CUT is implemented with a series
of nine inverters such that the path has approximately sym-
metrical rising and falling transition delays, and it is driven
by toggle stimulus. Fig. 10 depicts the TP profiles, where
the individual TP profile of each transition type is isolated
by taking separate measurements at even or odd clock cycles.
The small deviation in the overall TP profile clearly shows the
existence of correlated multicycle jitter in the clock signal.
IV. TP/HP PROFILES FOR MULTIPATH CIRCUITS
The general TP/HP measurement circuitry described in the
previous subsection can be adapted to test more complex cir-
cuits with multiple inputs and paths by using a pseudo-random
vector generator (RVG) to stimulate the CUT (Fig. 11). Since
the vector generation process is stationary, the statistics (TP
and HP) of the resultant random test vectors are also stationary.
We quantify the statistics of the random input (V ) in terms of
[Figure 10: "TP Profiles of Similar Rising and Falling Transition Delays - 9 LUTs: Inverter". Transition probability versus frequency, 240 to 265 MHz. Curves: overall TP profile; TP of falling transitions; TP of rising transitions.]
Fig. 10. TP profile measurements showing clock jitter correlation on the
Cyclone III EP3C25 FPGA.
[Figure 11: block diagram. A random vector generator (RVG) drives input V through the launch register (LR) into the CUT (an arbitrary complex circuit); the sample register (SR) feeds the TP/HP measurement circuitry, which produces TP or HP profiles. The registers are clocked by the TCG.]
Fig. 11. TP/HP measurement circuitry for complex circuit with n input and
m output bits using uniform random input vectors. The CUT may contain
multiple pipelined combinatorial paths and sequential logic.
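[Figure 12: pipeline diagram. Input V passes from the launch register (LR) through stage A (delays tA-fall, tA-rise), register R, stage B (delays tB-fall, tB-rise), register R, and stage C (delays tC-fall, tC-rise) to the sample register (SR), giving output S.]
Fig. 12. Simple three-stage pipeline circuit consisting of three simple
combinatorial paths (stages A, B, and C) with different path delays.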
HP, which can be varied linearly to obtain specific random
bit patterns. TP values, on the other hand, could each refer to
two HP values of different random bit patterns, since TP is
a quadratic function of HP for random patterns (2) bounded
by TP ≤ 0.5. For example, when TP = 0.375, (2) has two
HP solutions at 0.25 and 0.75. The only exception is when TP
reaches its maximum at TP = HP = 0.5.
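The two HP solutions follow from solving (2) as a quadratic in H, which gives H = (1 ± sqrt(1 − 2TP))/2. A minimal Python sketch of this inversion is shown below; the function name `hp_from_tp` is hypothetical, and the relation is assumed to hold exactly rather than approximately.

```python
import math


def hp_from_tp(tp):
    """Return the two HP values (low, high) that satisfy TP = 2*H*(1 - H).

    Valid only for random patterns, for which 0 <= TP <= 0.5.
    """
    if not 0.0 <= tp <= 0.5:
        raise ValueError("TP of a random pattern must lie in [0, 0.5]")
    root = math.sqrt(1.0 - 2.0 * tp)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


if __name__ == "__main__":
    print(hp_from_tp(0.375))
    print(hp_from_tp(0.5))
```

The two roots are symmetric about HP = 0.5 and merge into a single solution when TP reaches 0.5.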
Through simulation, we obtained the TP and HP output
profiles of a single path stimulated by uniform random input
as shown in Fig. 13(a) and (b). The TP profile has the same
characteristic shape as the one observed in Fig. 5(b) with the
toggling input, where the midpoints of the two slopes represent
the nominal propagation delay of the path’s falling and rising
transitions (tslow and tfast). This simulated TP behavior is also
observed on Altera Cyclone III FPGA in [10] and [15].
A. TP Model of Sequential Paths
When a series of combinatorial paths are connected in series
by registers in a pipeline arrangement, the TP profile of the
entire circuit can be expressed from the TP profiles of each
individual path. An example of such pipeline circuit containing
three stages is presented in Fig. 12, where the three stages
(A, B, and C) are associated with their respective propagation
delays: tA-fall, tA-rise, tB-fall, tB-rise and tC-fall, tC-rise. Fig. 13(c)
depicts the TP profile of the circuit. When the range between
the rise and fall delays of a path does not overlap with the
delays of other paths (stage C in this case), the change in
TP response due to timing failure is independent and simply
[Figure 13: (a) transition probability and (b) high probability of a single path versus clock period, 1400 down to 700 ps, with tslow and tfast marked. (c) Transition probability and (d) high probability of the three-stage pipeline versus clock period, 1400 down to 1000 ps, with tA-fall, tA-rise, tB-fall, tB-rise, tC-fall, and tC-rise marked.]
Fig. 13. (a), (b) Simulated TP and HP profiles of a single path and (c), (d)
3-stage pipeline depicted in Fig. 12, driven by uniform pseudo-random input
stimulus. tslow and tfast represent the propagation delays of falling and rising
transitions of a single path, respectively, and tA-fall, tA-rise, tB-fall, tB-rise,
tC-fall, and tC-rise represent the delays of falling and rising transitions of
sequential paths A, B, and C, respectively.
propagates directly to the output, maintaining the usual form of
a single-path TP profile [Fig. 13(a)]. The overall TP profile is
formed by a multiplicative process of the individual TP profiles
from each pipeline stage, and for a three-stage pipeline circuit
(Fig. 12), the TP profile is
$$
TP_{\mathrm{seq}} \approx \tfrac{1}{2}\,\bigl(2\,TP_A \times 2\,TP_B \times 2\,TP_C\bigr) \tag{5}
$$
where $TP_A$, $TP_B$, and $TP_C$ are the TP profiles of the paths
in stages A, B, and C, respectively.
Similarly, for a simple pipeline containing N stages, the
general TP profile is expressed by
$$
TP_{\mathrm{seq}\text{-}N} \approx 2^{N-1} \prod_{i=1}^{N} TP_i \tag{6}
$$
where $TP_i$ represents the TP profile of the ith-stage combinatorial
path, and $TP_i \le 0.5$.
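As a quick numerical illustration of (6), the Python sketch below combines per-stage TP values at a single test frequency. The function name `tp_pipeline` is hypothetical, and the stage values in the usage example are made up for illustration.

```python
from math import prod


def tp_pipeline(stage_tps):
    """Overall TP of a simple N-stage pipeline from (6): 2**(N-1) * product of TP_i.

    Each TP_i is the TP of one stage at the same test frequency and is
    expected to satisfy TP_i <= 0.5.
    """
    n = len(stage_tps)
    return 2 ** (n - 1) * prod(stage_tps)


if __name__ == "__main__":
    print(tp_pipeline([0.5, 0.5, 0.5]))   # all stages at their stationary value
    print(tp_pipeline([0.5, 0.5, 0.25]))  # one stage has started to fail
```

When every stage sits at its stationary value of 0.5, the overall TP is also 0.5, and a drop in any single stage scales the overall TP by the same factor. For N = 3 the expression reduces to (5).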
It is clear that, in the case of simple sequential paths, the
failure of the worst case path would always yield an easily
distinguishable TP response no matter how the failures of the
other paths affect the overall TP profile. In addition, a
similar behavior for HP is observed in Fig. 13(d), where the
HP profile is approximately 1 − TPseq.
B. Analysis of Complex Multipath TP Profile
The previously described models are useful for predicting
the TP profile of a failing path or simple sequential paths.
However, they do not scale to more
complex circuits containing multiple interacting combinatorial
paths. Fig. 14 depicts the TP profile of the second LSB output
of a 9 × 9 embedded multiplier on the Cyclone III EP3C25
FPGA. As can be seen, the observed output TP profile (shown
as the dotted line) is produced by the timing failure of each
individual path. While the TP profile may appear to be a direct
combination of the basic TP profile components of these paths,
it is actually not possible to predict the exact overall TP profile
using the basic single-path TP profiles alone. The main reason
for this is that the failure processes of the paths are interrelated
with each other in a difficult-to-predict manner.
Consider the timing illustration in Fig. 15, where a circuit
with multiple internal paths is stimulated by random vectors.
The probability that an input transition through a particular
path is observable at the output depends on the input pattern
and the state of the other paths, which means each path could
contribute differently to the observed TP profile. Such behavior
is predictable only if the exact circuit implementation, structure,
and layout are known.
[Fig. 14 plot: transition probability versus frequency (MHz), showing the output TP profile and the TP profiles of individual paths.]
Fig. 14. TP profile measurement of the second LSB output of a 9 × 9
embedded multiplier on the Cyclone III EP3C25 FPGA [10], [15]. The
unusual shape of the TP profile is the result of individual paths failing
at different frequencies. The corresponding paths are isolated and tested
separately to obtain their basic TP profile components for reference.
[Fig. 15 timing diagram: test clock with period T and clock jitter distribution PDF(τ) spanning T + τ_min to T + τ_max; combinatorial output (Z) with a glitch period between the fastest and slowest path delays; register output (Y).]
Fig. 15. Timing diagram showing the activity of output bit (Z) and registered
output (Y) in a circuit with multiple inputs/paths and a single output. A glitch period
occurs after each clock edge due to variation in propagation delay between
different paths. The position of the clock edge is governed by the jitter distribution
PDF(τ), and the probability of the register capturing the correct value in
B′ depends on both the glitch pattern and the overlapping jitter region.
Although each active path may produce a signal transition
some time after the clock edge, their different arrival times
result in a “glitch period” containing a series of unwanted
transition activities. These glitch activities are unpredictable
especially with random input vectors. When the glitch period
coincides with the next clock edge, where the clock edge
position itself is unpredictable due to clock jitter, the actual
value captured by the register (B′) is not deterministic, and
hence the resultant TP cannot be determined with certainty.
Also, the rapid transitions in the glitch period could cause
metastability in the output register [18],
further increasing the unpredictability of the output value.
For these reasons, the direct approach of modeling the TP
profile based on specific paths quickly becomes impractical
as circuit complexity grows. A mere change in the placement and routing
of a design could produce a layout with a completely different
TP profile. A precise model of the TP
profile can be obtained only if a perfect physical model of
the circuit is available, with precise information on signal
propagation, interaction, and clock jitter behavior, such that
the exact glitch pattern is known and the registered output
value is predictable. If such a perfect physical model existed,
though, a delay measurement method would not be necessary
in the first place. A better direction is to consider the
timing failure sensitivity of TP rather than its exact profile, and
to deduce ways to improve its sensitivity (measurement accuracy)
in different designs.
C. Relationship Between TP/HP Sensitivity and Input
Probability Distribution
The results presented so far are based on uniform random
input patterns, with approximately the same number of high
and low cycles (HP = 0.5). However, it was shown in [10],
[15], and [19] that using weighted random input patterns
with biased probability distributions (HP other than 0.5) could
improve coverage of paths that are rarely exercised, and thus
improve TP/HP sensitivity.
A simple example would be an N-input AND gate, where the
output goes high only when all input bits are simultaneously
high. This implies that the probability that the AND gate
produces a rising transition is particularly low, especially when
N is large. Assuming all random input bits are independent
and uniformly distributed at HP = 0.5, the probability of such
a transition occurring is given by $(0.5)^N$. It is clear that any
increase in the input HP from 0.5 would increase the transition
probability through the critical path of the AND gate, resulting
in better TP/HP sensitivity. The effect of weighted random
input in terms of measurement accuracy will be tested and
analyzed in detail in the following section.
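The AND-gate argument can be made concrete with a short calculation. This is a sketch only: `and_output_high_prob` follows the $(\mathrm{HP})^N$ expression in the text, while `and_rising_transition_prob` adds the assumption (not stated in the paper) that consecutive input vectors are independent, so that a rising output transition needs a low output followed by a high output. Both function names are illustrative.

```python
def and_output_high_prob(hp_weight: float, n_inputs: int) -> float:
    """Probability that an N-input AND output is high for independent inputs."""
    if not 0.0 <= hp_weight <= 1.0:
        raise ValueError("hp_weight must lie in [0, 1]")
    return hp_weight ** n_inputs


def and_rising_transition_prob(hp_weight: float, n_inputs: int) -> float:
    """Probability of a 0 -> 1 output transition between two consecutive cycles.

    Assumption: the two consecutive input vectors are independent.
    """
    p_high = and_output_high_prob(hp_weight, n_inputs)
    return (1.0 - p_high) * p_high


if __name__ == "__main__":
    # Compare the uniform weight with a few biased weights for a 15-input AND.
    for weight in (0.5, 0.625, 0.75, 0.9375):
        print(weight, and_rising_transition_prob(weight, 15))
```

Under these assumptions the rising-transition probability for N = 15 grows monotonically as the weight is raised from 0.5 to 0.9375, which is the range of biased weights used later in the paper.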
V. ASSUMPTIONS AND CAUSE OF TP/HP INACCURACY
The accuracy of TP or HP measurements relies on two
assumptions: 1) the test vectors' coverage is adequate, i.e., N
random test vectors successfully traverse to the correct circuit
states (in the case of state machines) and exercise the critical
paths in a CUT at least once, and 2) TP and HP respond
to timing failures, given that assumption 1 is true. In the
following sections, we analyze the mechanisms by which
TP and HP respond to timing failures to assess assumption 2,
then examine both assumptions through Monte Carlo simulations
and actual FPGA measurements of specific corner cases to
assess the measurement accuracy of TP in practice.
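Assumption 1 can be given a rough quantitative form. The sketch below assumes that each random vector exercises the critical path independently with a fixed probability `q`; this independence model and the function names are assumptions made for illustration and are not taken from the paper, whose own analysis relies on Monte Carlo simulation.

```python
import math


def coverage_probability(q: float, n_vectors: int) -> float:
    """Probability that at least one of n independent vectors exercises the path."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return 1.0 - (1.0 - q) ** n_vectors


def vectors_for_confidence(q: float, confidence: float) -> int:
    """Smallest n such that coverage_probability(q, n) >= confidence."""
    if not 0.0 < q <= 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("need 0 < q <= 1 and 0 < confidence < 1")
    if q == 1.0:
        return 1
    # Solve 1 - (1 - q)**n >= confidence for n.
    return math.ceil(math.log1p(-confidence) / math.log1p(-q))
```

For example, a path exercised by half of all vectors needs `vectors_for_confidence(0.5, 0.99)` vectors to be covered with 99% confidence, whereas a path exercised with probability $(0.5)^{15}$ needs several orders of magnitude more.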
A. Link Between Timing Failure and TP/HP Deviations
The cause of deviation in TP and/or HP can be understood
through the cases in Fig. 16. As shown in Fig. 16(a) and (b),
TP responds to timing errors only if the pulse width t is one
clock cycle long [case 1(a) for rising transition and case 2(b)
for falling transition], whereas HP continues to respond to
timing errors even if t lasts for two or more clock cycles
(not illustrated in Fig. 16). Therefore, as long as the input
vectors are able to exercise the critical path at least once,
the HP method could provide accurate detection of the path
delay. The only exception for HP in Fig. 16 is when the
failures of (a) rising and (b) falling transitions create equal
and opposite changes in HP in cases 1 and
2. Due to random clock jitter, it is possible for all three
types of failure in Fig. 16(a)–(c) to coexist, given a large
enough number of test clock cycles and provided that both rising and falling
transitions fall within the clock jitter distribution (see Fig. 7).
Thus, in the case where the rising and falling transitions have
exactly the same delay and both (a) and (b) occur for exactly
the same number of cycles, the changes in HP could cancel
each other out, resulting in little or no overall observable
change.
[Fig. 16 timing diagram: expected and captured outputs over clock cycles 0 to 3 for case 1 and case 2, each annotated with TP and HP; panels (a) rising transition failed, (b) falling transition failed, (c) both transitions failed, (d) and (e) glitch on the critical path before the register, sampled by the register, where the expected transition in (e) is non-critical (fast).]
Fig. 16. Timing diagram showing how timing failures of different transition
types and glitches can result in different TP/HP deviations. Both (a) and (b)
cause TP and HP to deviate from their original values, but in (c), both TP and
HP remain unchanged. In (d) and (e), glitches created along the critical path
cause TP/HP deviations that reflect the worst case delay, even when (d) the expected
output is stationary or (e) the transition delay is noncritical.
Such cases of HP cancelation were observed earlier in
simulations in Figs. 6 and 8, showing only a tiny devia-
tion due to random clock jitter and metastability responses.
Nonetheless, it is highly unlikely that delays between rising
and falling transitions are exactly matched in real circuits
due to process variations. Also, perfect cancelation is vir-
tually impossible due to the fact that jitter and metastabil-
ity cannot be fully eradicated in practice. The probability
that the number of failed cycles of both transition types
matches exactly is extremely low. TP measurement, on the
other hand, does not suffer from such a cancelation problem,
since TP only changes in either case 1(a) or case 2(b) as
shown in Fig. 16 and both cases cause TP to increase.
This agrees with the significant TP deviation observed earlier
in Figs. 6 and 8.
In cases 1(b) and 2(a) of Fig. 16, TP
appears to be less effective in reflecting timing failures
than HP. However, when glitches are considered [Fig. 16(d)
and (e)], the picture changes dramatically. It is well known
that glitches account for a significant portion of
switching activities in combinatorial circuits, especially in
FPGAs. Both Wilton et al. [20] and Lamoureux et al. [21]
have shown that FPGA logic generates many glitches,
and a large proportion of switching activities in general
ASIC combinatorial logic is also due to glitches [22]. As
depicted in Fig. 16(d), the critical glitches (glitches created
and propagated along the critical path) cause both TP and
HP to deviate when falling and/or rising glitch transitions
violate timing in cases 1 and 2(d). Also, in Fig. 16(e), with
non-timing-critical transitions, the critical glitches could
help maintain TP sensitivity to the critical path delay,
where TP deviations would reflect the timing failure of
the critical glitches instead of the transitions with lower
delays.
[Fig. 17 block diagrams: (a) chain of 4-LUTs (2-input logic) with inputs 0 to 14 and a single critical path between LR and SR; (b) tree of 4-LUTs with inputs 0 to 14, near-critical paths, and a delay line on the critical path between LR and SR.]
Fig. 17. Block diagrams showing two implementations of a basic 15-input
logic circuit on the Cyclone III FPGA with two different tree structures, using
only two inputs of each 4-LUT in (a) and all inputs in (b), except for the bottom
LUT, which has only three inputs. The delay line in (b) is added to reduce its
maximum operating frequency to suit the measurement circuitry and to allow
a more direct comparison with (a).
VI. CORNER CASE ANALYSIS ON FPGA
As explained in Section IV-C, the best random input HP
weight to use for testing a CUT depends on its logic function,
and in Section V-A glitch activity is shown to play a significant
role in improving TP sensitivity. To examine the effect of
glitches and varying input HP weight, two logic functions—
AND and OR—are selected for testing. Although they are basic
functions, they represent the corner cases that define the worst
case accuracy bound of the TP method for any logic function
with the same number of inputs. In addition, XOR is tested to
confirm whether measurement accuracy is indeed bounded by
the two basic logical corner cases.
An Altera Cyclone III EP3C25 FPGA is used to implement
the three logic functions with 15 inputs. They are given two
structures, degenerate binary tree and balanced tree, as shown
in Fig. 17. This allows the difference between a circuit with
one distinctive critical path [Fig. 17(a)] and four (near) critical
paths [Fig. 17(b)] to be observed and compared independently.
The two structures are implemented with four-input look-
up-tables (4-LUTs) and we switch between the three logic
functions by setting their LUT masks (SRAM configurations)
while maintaining exactly the same placement and routing for
each test. The preliminary test circuit is based on Fig. 20,
with a programmable weighted RVG (WRVG) capable of
generating random inputs with HP weights from 0 to 1 in
steps of 0.0625 (see Section VII). The glitch activity at the
combinatorial output in each case is measured directly using
the same asynchronous transition counter used for TP.
It is also worth exploring the average number of input test
samples (N) needed to detect TP deviation (timing failure)
at different input HP weights and critical glitch activity, such
that correct/accurate timing measurements are obtained. For
this, the TP responses of AND and OR of the two structures in
Fig. 17 are observed through Monte Carlo simulations. Their
inputs are stimulated by random vectors with HP weights from
0 to 1 in steps of 0.0625, and critical glitches are injected into
cycles that are noncritical or stationary [Fig. 16(d) and (e)], at a rate
expressed as a percentage of the total number of such cycles. The
simulation runs through $10^9$ input samples, and the average
distance between samples that cause TP deviation and timing
failure of the critical path(s) is recorded for each HP weight.
The results are shown in Fig. 19 for critical glitch activities of
0.01% and 25%.
A. FPGA Test and Simulation Results
The FPGA measurements of the 15-input AND, OR, and
XOR are presented in Fig. 18 in terms of their maximum
operating frequency (fmax). The CUTs are tested through $2^{24}$
($\approx 10^{7.2}$) samples for each HP weight to cover most of the
sample sizes observed in Fig. 19 while maintaining practical
test duration. The results from the TP and HP methods are plotted
together with baseline references to illustrate their accuracy.
The references (nominal and actual fmax) are obtained by
isolating and exercising only the (near) critical path(s) in
the three cases. The isolated paths are exercised by toggle
inputs, and their fmax values are obtained through analysis of their
TP profiles [see Section III and Fig. 5(b)], where the nominal
fmax is based on the nominal clock period at the center of the
clock jitter distribution [9], and the actual fmax is based on
the frequency point immediately before TP deviation (timing
failure) occurs. The nominal fmax is useful because it indicates
when the nominal clock period exactly matches the critical
path delay. However, the actual fmax forms the absolute
baseline where no timing failure is detected.
[Fig. 18 plots: panels (a)–(c) and (e)–(g) show frequency (MHz) versus input HP weight, with TP fmax and HP fmax plotted against the nominal and actual fmax references; panels (d) and (h) show glitch transition count (log scale) versus input HP weight for 15-AND, 15-OR, and 15-XOR.]
Fig. 18. fmax (Fmax) measurement of 15-input AND, OR, and XOR gates with (a)–(c) degenerate binary tree and (e)–(g) balanced tree structures with $2^{24}$
input samples. (d) and (h) show the glitch count measurements of the logic gates in the two types of tree. The test frequency step size is 0.1 MHz (≈ 1.6 ps
time resolution) and test time per circuit is approximately 30 s. Data points are omitted at HP weights less than 0.3125 for AND and greater than 0.6875 for
OR because their TP/HP profiles showed no failure responses within the test frequency range.
[Fig. 19 plots: panels (a)–(d) show sample size (N) (log scale) versus input HP weight for 15-AND rise-critical (AND↑), 15-AND fall-critical (AND↓), 15-OR rise-critical (OR↑), and 15-OR fall-critical (OR↓).]
Fig. 19. Monte Carlo simulation of the average sample sizes needed to detect TP deviation on a 15-AND and 15-OR as (a) and (c) degenerate tree
(one path) and (b) and (d) balanced tree (four paths) from HP weight of 0.0625 to 0.9375. The simulations are conducted with (a) and (b) 0.01% and (c) and
(d) 25% of critical glitches.
As can be seen, both the TP and HP measurements are
remarkably accurate, lying mostly within the worst case nominal
and actual fmax bounds, except in the AND cases when the
input HP weight is 0.5 or less [Fig. 18(a) and (e)]. The
outliers in both cases are caused by the
TP/HP methods losing sensitivity to the most critical path and
responding instead to the timing failure of less critical paths or signal
transition types (rise/fall). For the AND in degenerate binary
tree form [Fig. 18(a)], the second path associated with input
bit1 is responsible, whereas in balanced tree form [Fig. 18(e)],
the failure of the quicker rising transitions is detected instead.
Apart from the outliers, the measurements remain accurate
to within 1.5% of the actual fmax in all cases, and are bounded
by the nominal fmax [except in Fig. 18(c) at 0.9375 HP
weight, where the measurement is slightly above it].
The OR cases did not suffer from the inaccuracy seen in
the AND cases but failed to give any measurements (no TP
deviation) when input HP is greater than 0.6875. This is
because OR’s delay is governed by the slower falling signal
transitions, and according to Fig. 19(a) and (b), the number
of test samples required to detect TP deviation caused by
failure of falling transitions (OR↓) is much lower than that
of AND (AND↓), i.e., OR↓ and AND↓ are not symmetrical.
Therefore, OR’s measurements are generally more accurate
than AND’s in this case. Also, the jump in inaccuracy
for AND [Fig. 18(e)], which is not observed in the OR case,
can be explained by the simulation results. For example, in
Fig. 19(a), if $10^8$ input samples are used, then it is likely that
the failure of rising transitions (AND↑) is detected by TP
for any HP weight above 0.3125, but for AND↓, an
HP weight above 0.5 is required to detect any TP deviation within $10^8$
samples. Therefore, the unwanted measurement of AND↑ is
more likely to appear in the gap between 0.3125 and 0.5 instead
of the worst case AND↓ results, which matches the observation
in Fig. 18(e) between HP weights of 0.3125 and 0.5.
[Fig. 20 block diagram: TCG; weighted random vector generator (WRVG) with high probability (HP) weight inputs; circuit under test (CUT, an arbitrary complex circuit) between LR and SR; TP/HP measurement circuitry; profile analyser producing Fmax.]
Fig. 20. Block diagram of the programmable weighted random test platform.
Given the results from the Monte Carlo simulations and
the three test cases, the major factors that affect measurement
accuracy are the number of input test samples (N), the
input HP weight, and the level of critical glitches in the
circuit. The plots in both Figs. 18 and 19 suggest that using
uniform random input stimulus (HP = 0.5) would produce
relatively accurate results when glitch activity is high. The only
exception is seen in the balanced AND tree, which required an
HP weight of 0.5625 or more to obtain accurate measure-
ments. For the particular test cases on the Cyclone III FPGA
architecture, we can see that a fixed input HP weight of 0.625
with the TP method would guarantee measurement accuracy
to within 1% of the actual fmax in all cases. Overall, we
can conclude that in any combinatorial circuit where falling
transitions dominate propagation delay, accuracy would mostly
be bounded by AND logic. Otherwise, if rising transitions
dominate, the opposite would be true and accuracy would
be bounded by OR logic. XOR logic is expected to provide
very good measurement accuracy in both cases due to its
high glitch activities [Fig. 18(d) and (h)]. In fact, the glitches
from XOR would improve the measurement accuracy of logic
following it. They create intermediate input patterns capable
of sensitizing critical paths that are normally impossible or
unlikely to occur with external input vectors, thus improving
TP sensitivity. This is supported by Fig. 19(c) and (d), where
the number of test samples needed to detect TP deviation is
significantly reduced at high glitch activities. Also, glitches are
likely to increase with each extra logic stage in a combinatorial
circuit [22], allowing the TP method to scale with circuit
depth/complexity. This hypothesis will be tested on practical
complex circuits on FPGA in the following sections.
VII. TEST PLATFORM IMPLEMENTATION
AND TEST PROCEDURES
The general structure of the test platform is depicted in
Fig. 20. It consists of a TCG, a WRVG, a CUT, and
the TP/HP profile measurement and analysis circuitry. The
WRVG generates one random vector per clock cycle, which is
launched into the CUT by the LR, while the CUT output is captured
by the SR for TP/HP measurements. The WRVG (Fig. 21) is
implemented using a 32-b maximal-length ring generator [23],
[24] followed by a phase-shifter network to ensure interbit
phase independence [25], [26], and logic-based probability
bias circuitry to generate weighted random vectors.
A. Generation of Weighted Random Test Vectors
Random sequences with arbitrary HP weight can be gener-
ated from a combination of independent uniformly distributed
random bit streams through simple Boolean logic [19]. Consider
the following identities linking the HP of the Boolean
combination of random bit streams A and B in terms of $H(A)$
and $H(B)$:
$$H(A \cdot B) \equiv H(A)\,H(B) \tag{7}$$
$$H(A + B) \equiv 1 - (1 - H(A))(1 - H(B)). \tag{8}$$
[Fig. 21 circuit diagram: ring generator, phase-shifter network, and probability bias circuit with HP weight control, producing the output vector V (k bits).]
Fig. 21. Circuit diagram of the WRVG. It consists of a 32-b maximal-
length ring generator [23], [24] and an XOR-gate-based phase-shifter network
to remove phase correlation between output bits [25], [26]. The probability
bias circuit enables control of HP of the random bits according to the HP
weight control.
Using (7) and (8), the HP of a bitstream can be shifted above
or below 0.5 when combined with one or more independent
uniform random bit streams through basic AND and OR logic
functions. A wide range of HP values between 0 and 1 can
be generated using this approach, and an example is shown in
Table I giving 17 levels of HP weights using four independent
uniform random bit streams (R0, R1, R2, and R3). In general,
the achievable number of HP levels is given by $2^r + 1$, where
r is the number of independent uniform random inputs, and
the HP weight resolution is given by $1/2^r$. An extra state
(Toggle) for toggling inputs is included in Table I such that
single input change (SIC) exhaustive testing, which targets one
specific input at a time, can be done for comparison purposes.
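The weights listed in Table I can be verified by brute force. The sketch below enumerates all 16 combinations of four independent, uniformly distributed bits and counts how often each Boolean expression from Table I evaluates high; the helper names (`hp_and`, `hp_or`, `high_probability`, `TABLE_I`) are illustrative and not part of the paper.

```python
from fractions import Fraction
from itertools import product


def hp_and(h_a: Fraction, h_b: Fraction) -> Fraction:
    """Eq. (7): HP of A AND B for independent streams."""
    return h_a * h_b


def hp_or(h_a: Fraction, h_b: Fraction) -> Fraction:
    """Eq. (8): HP of A OR B for independent streams."""
    return 1 - (1 - h_a) * (1 - h_b)


# (HP weight, Boolean logic) pairs copied from Table I; arguments are r3, r2, r1, r0.
TABLE_I = [
    (Fraction(0, 16), lambda r3, r2, r1, r0: False),                     # GND
    (Fraction(1, 16), lambda r3, r2, r1, r0: r3 and r2 and r1 and r0),
    (Fraction(2, 16), lambda r3, r2, r1, r0: r3 and r2 and r1),
    (Fraction(3, 16), lambda r3, r2, r1, r0: r3 and r2 and (r1 or r0)),
    (Fraction(4, 16), lambda r3, r2, r1, r0: r3 and r2),
    (Fraction(5, 16), lambda r3, r2, r1, r0: r3 and (r2 or (r1 and r0))),
    (Fraction(6, 16), lambda r3, r2, r1, r0: r3 and (r2 or r1)),
    (Fraction(7, 16), lambda r3, r2, r1, r0: r3 and (r2 or r1 or r0)),
    (Fraction(8, 16), lambda r3, r2, r1, r0: r3),
    (Fraction(9, 16), lambda r3, r2, r1, r0: r3 or (r2 and r1 and r0)),
    (Fraction(10, 16), lambda r3, r2, r1, r0: r3 or (r2 and r1)),
    (Fraction(11, 16), lambda r3, r2, r1, r0: r3 or (r2 and (r1 or r0))),
    (Fraction(12, 16), lambda r3, r2, r1, r0: r3 or r2),
    (Fraction(13, 16), lambda r3, r2, r1, r0: r3 or r2 or (r1 and r0)),
    (Fraction(14, 16), lambda r3, r2, r1, r0: r3 or r2 or r1),
    (Fraction(15, 16), lambda r3, r2, r1, r0: r3 or r2 or r1 or r0),
    (Fraction(16, 16), lambda r3, r2, r1, r0: True),                     # VCC
]


def high_probability(func) -> Fraction:
    """Exact HP of func over four independent uniform random bits."""
    hits = sum(bool(func(*bits)) for bits in product((False, True), repeat=4))
    return Fraction(hits, 16)


# Every row of Table I yields exactly its listed weight.
assert all(high_probability(func) == weight for weight, func in TABLE_I)
```

The same values follow from (7) and (8) directly; for instance, `hp_or(Fraction(1, 2), hp_and(Fraction(1, 2), Fraction(1, 2)))` gives 5/8 = 0.625 for the expression R3 + R2 · R1.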
For FPGAs with partial/dynamic reconfigurability, the HP
weight-shifting logic can be implemented easily with a typical
LUT through changing the configuration SRAM bits (LUT
mask) on the fly [27], [28]. Fig. 22 shows an example of
a partial/dynamic reconfiguration-based HP weight control
circuit on a typical FPGA logic element (LE) with a four-
input LUT and register, which is capable of implementing
the 17 HP levels in Table I with only one LE per bit. Also,
by exploiting the register feedback path available on most
FPGA architectures, the LE can be reconfigured into a TFF
for toggle signals. Unfortunately, partial reconfigurability is
unavailable on the Cyclone III FPGA we used; thus a logic-
based implementation requiring three LUTs is used.
B. Test Procedure
In the previous TP-based tests [9], [10], the WRVG and
CUT are not reset between frequency steps, and the inherent
noise from the random vectors masks out small TP devia-
tions, leading to poor measurement accuracy. In this paper,
major accuracy improvement is achieved through testing each
frequency step for exactly the same number of input vectors,
and resetting the WRVG and CUT after each frequency step.
This results in identical input sequences between frequency
steps with completely stationary TP output while no timing
failures are occurring, and allows the slightest deviation in
TP to be detected. Failure from as little as one clock cycle
can be detected to provide significantly more accurate delay
inferences.
TABLE I
LOGIC FOR 17 HP LEVELS AND TOGGLE SIGNAL. R3, R2, R1, R0 ARE
INDEPENDENT UNIFORM RANDOM INPUTS
HP Weights (w)       Boolean Logic/Hardware
0                    GND
0.0625               R3 · R2 · R1 · R0
0.125                R3 · R2 · R1
0.1875               R3 · R2 · (R1 + R0)
0.25                 R3 · R2
0.3125               R3 · (R2 + R1 · R0)
0.375                R3 · (R2 + R1)
0.4375               R3 · (R2 + R1 + R0)
0.5                  R3
0.5625               R3 + R2 · R1 · R0
0.625                R3 + R2 · R1
0.6875               R3 + R2 · (R1 + R0)
0.75                 R3 + R2
0.8125               R3 + R2 + R1 · R0
0.875                R3 + R2 + R1
0.9375               R3 + R2 + R1 + R0
1.0                  VCC
Toggle (HP = 0.5)    Toggle Flip-flop
[Fig. 22 diagram: logic element with a look-up table (LUT) fed by uniform random inputs R3 to R0 and a register; the configuration memory bits (LUT mask controlling HP weight/toggle, and a toggle signal select bit) are written by the partial reconfiguration bit stream (HP weight); the output carries the weighted HP or toggle signal.]
Fig. 22. Partial/dynamic reconfiguration approach for 17 levels of reconfig-
urable HP weights and toggle signal using a typical LE with four-input LUT
and register.
The test procedure is divided into two stages, where a
quick forward frequency sweep is first done to obtain an
approximate failure frequency of the CUT, and the second
stage performs a refined search for a more precise fmax.
This strategy maximizes measurement precision while main-
taining a short overall test time. In stage 1, the test frequency
is incremented at a coarse 2-MHz step size, starting from a
predefined low frequency (from Altera’s timing analysis) that
guarantees correct circuit operation. The HP weight of the
input test vectors is kept at 0.5 across all input bits, and the
test stage terminates as soon as any TP deviation is found in
any of the output bits. Next, the frequency steps are refined
to 0.1 MHz (≈ 1.6 ps time resolution at 250 MHz) in stage
2, and the test range is narrowed down by setting the start
frequency to the failure frequency detected in the first stage.
A backward frequency sweep is performed, and the fmax is
defined as the frequency at which the TP of all bits settles
to its original steady value. Tests are carried out at room
temperature between 20 °C and 25 °C, and each frequency
step is tested through $2^{24}$ input and output samples for both
stages. A 24-bit counter is used to collect the full range of
transition counts from the CUT.
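The two-stage procedure can be sketched in software as follows. This is an illustration under stated assumptions rather than the platform's control code: `tp_deviates` is a hypothetical callback that runs one frequency step on the hardware and reports whether any output bit shows a TP deviation, frequencies are held as integers in kHz to avoid floating-point drift, and the start frequency is assumed to be fault-free. The helper `time_resolution_ps` converts a frequency step into the corresponding change in clock period.

```python
from typing import Callable


def two_stage_fmax_search(
    tp_deviates: Callable[[int], bool],
    start_khz: int,
    coarse_step_khz: int = 2000,   # stage 1: 2-MHz forward steps
    fine_step_khz: int = 100,      # stage 2: 0.1-MHz backward steps
    limit_khz: int = 1_000_000,    # hypothetical safety limit for the sweep
) -> int:
    """Return the highest tested frequency (kHz) that shows no TP deviation."""
    freq = start_khz
    # Stage 1: coarse forward sweep until the first TP deviation is seen.
    while not tp_deviates(freq):
        freq += coarse_step_khz
        if freq > limit_khz:
            raise RuntimeError("no TP deviation found below the frequency limit")
    # Stage 2: fine backward sweep from the failure frequency until TP settles.
    while freq > start_khz and tp_deviates(freq):
        freq -= fine_step_khz
    return freq


def time_resolution_ps(freq_mhz: float, step_mhz: float) -> float:
    """Change in clock period (ps) when the frequency rises by one step."""
    return 1e6 / freq_mhz - 1e6 / (freq_mhz + step_mhz)


if __name__ == "__main__":
    # Synthetic CUT that starts failing above 250.3 MHz.
    fmax_khz = two_stage_fmax_search(lambda f: f > 250_300, start_khz=200_000)
    print(fmax_khz, time_resolution_ps(250.0, 0.1))
```

On real hardware the deviation test is subject to jitter, so the synthetic threshold function used in the example is only a stand-in for the measured TP comparison.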
[Fig. 23 diagram: circuit under test (CUT) between LR and SR, with normal inputs and normal outputs; the isolated critical path runs from the start register (Toggle-FF) through look-up tables (LUTs) to the end register, whose output replaces one output bit going to the TP/HP measurement circuitry.]
Fig. 23. Critical path isolation for reference delay measurement. The register-
to-register critical path is identified by Quartus II timing analysis and is
logically isolated by modifying LUT masks in the path. The start register is
converted to a Toggle-FF, such that the path is exercised independently. The
end register is connected to the output SR, replacing one normal output bit.
Placement and routing remain the same except for the extra local Toggle-FF
feedback and the output link to the SR.
1) Uniform HP Weight Optimization: After the initial test
procedure, a quick test can be performed to find the optimal
HP weight for all input bits. The test with refined frequency
range and resolution in stage 2 is repeated for different HP
weights from 0.0625 to 0.9375 to search for the minimum
fmax that indicates the optimal HP weight for a particular
CUT. Since we only have to sweep through a narrow range
of frequencies for each HP weight, the total test time remains
relatively short.
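The weight search described above amounts to taking the minimum over a small set of measurements. A minimal sketch, assuming a hypothetical callback `measure_fmax` that runs the stage-2 sweep at a given HP weight and returns the measured fmax in MHz, or `None` when no TP deviation is observed in the test range:

```python
from typing import Callable, Optional, Tuple

# The 15 biased weights from 0.0625 to 0.9375 in steps of 0.0625.
HP_WEIGHTS = [k / 16 for k in range(1, 16)]


def optimal_hp_weight(
    measure_fmax: Callable[[float], Optional[float]],
) -> Tuple[float, float]:
    """Return (weight, fmax) for the weight that yields the lowest measured fmax."""
    results = []
    for weight in HP_WEIGHTS:
        fmax = measure_fmax(weight)
        if fmax is not None:          # skip weights that gave no failure response
            results.append((weight, fmax))
    if not results:
        raise ValueError("no HP weight produced a measurable fmax")
    return min(results, key=lambda pair: pair[1])
```

The lowest fmax is chosen because it corresponds to the weight at which the slowest exercised path is detected.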
VIII. EVALUATION OF THE TEST PLATFORM ON
PRACTICAL FPGA DESIGNS
A. Test Setup
The test platform, test procedure, and optimization
described in the previous sections are demonstrated on the
Cyclone III FPGA with eight different practical designs. The
CUTs and their resource usage are summarized in Table II,
and they are placed between the launch and capture registers in
the test platform in Fig. 20. To obtain reference measurements
for accuracy evaluation, we considered two approaches: 1) a
post-placement-and-routing critical-path-isolation method,
and 2) an FRD-based exhaustive test [13].
For approach 1, we used the timing analysis tool in Altera’s
Quartus II to identify the most critical path in the CUT. The
path is then logically isolated from the rest of the circuit
by LUT-mask modifications, and the register at the start
is converted to a TFF such that the path is independently
exercised by a stimulus with TP = 1.0 (see
Fig. 23). Placement and routing are kept unchanged along the
critical path to preserve its original propagation delay, and
the basic TP/HP methods described in Section II are used
for obtaining the measurements. The output TP will initially
be 1.0 but follows a failure profile similar to Fig. 5(b). To
examine the effect of surrounding switching activities on the
isolated critical path, we performed the tests in two different
conditions: low activity, i.e., all normal inputs of the CUT
driven by 0; and high activity, i.e., all normal inputs driven by
uniform random patterns. Note that this method is applicable
only to circuits where critical paths can be identified and
modified, and cannot be used on IP blocks/designs where
internal structure is concealed by encryption and/or locked
from modifications. This is why we included approach 2 for
testing a 9 × 9 embedded multiplier (DSP block) on the
Cyclone III, which has a fixed and unknown internal structure
but a small enough number of input bits, for FRD-based
exhaustive testing within manageable test time.
TABLE II
MEASUREMENT RESULTS OF EIGHT CUTS AND THEIR ACCURACIES WITH THE TP METHOD
fmax measurements (columns TP through Exhaustive Test [13]) are in MHz.
Circuit Under Test (CUT)   LUT    Reg.   DSP*   Latency    TP       HP       Crit. path  Crit. path  Exhaustive  TP fmax Accuracy  Opt. HP  TP fmax Consistency
                                                (cycles)                     low act.    high act.   Test [13]   (Error %)#        weight   (3-σ %)
9×9 Embedded multiplier    0      0      1      2          553.43   553.43   N/A         N/A         547.92      +1.01!            0.3125   ±0.68
FP32 Multiplier            418    326    7      5          205.06   204.75   206.06      201.23      N/A         +1.90             0.5000   ±0.30
FP32 Adder                 851    375    0      7          173.82   173.82   176.53      171.91      N/A         +1.11             0.5000   ±0.43
FP32 Divider               2233   2589   0      33         232.18   232.18   238.05      224.11      N/A         +3.60             0.8750   ±0.33
FP32 Square-root           762    1215   0      28         283.52   283.52   289.07      284.10      N/A         −0.20             0.5000   ±0.49
Butterworth IIR Filter     991    272    0      4          111.63   111.63   117.65      111.83      N/A         −0.18             0.5000   ±0.60
FIR Filter (15-taps)       361    196    30     3          164.67   164.06   172.81      165.58      N/A         −0.55             0.5000   ±0.67
ITC’99 Benchmark (b12)     372    142    0      N/A        317.89   317.89   315.05      313.93      N/A         +1.26             0.5000   ±0.16
* A DSP block on the Altera Cyclone III FPGA is essentially a hardware fixed-point 9 × 9 embedded multiplier with its dedicated pipeline registers.
# The % difference of fmax between TP and critical path under high activity (Crit. path high act.). A negative value implies the TP result is pessimistic.
! Error computed from exhaustive test measurement instead of measurement of isolated critical path.
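The error column of Table II can be reproduced from the fmax columns. The sketch below assumes the error is the signed percentage difference between the TP fmax and the reference fmax (critical path under high activity, or the exhaustive test for the embedded multiplier), as the table footnotes describe; the function and variable names are illustrative.

```python
def tp_error_percent(tp_fmax_mhz: float, ref_fmax_mhz: float) -> float:
    """Signed percentage difference of the TP fmax relative to the reference fmax."""
    return 100.0 * (tp_fmax_mhz - ref_fmax_mhz) / ref_fmax_mhz


# (CUT, TP fmax, reference fmax, error % reported in Table II)
TABLE_II = [
    ("9x9 Embedded multiplier", 553.43, 547.92, 1.01),   # reference: exhaustive test
    ("FP32 Multiplier", 205.06, 201.23, 1.90),
    ("FP32 Adder", 173.82, 171.91, 1.11),
    ("FP32 Divider", 232.18, 224.11, 3.60),
    ("FP32 Square-root", 283.52, 284.10, -0.20),
    ("Butterworth IIR Filter", 111.63, 111.83, -0.18),
    ("FIR Filter (15-taps)", 164.67, 165.58, -0.55),
    ("ITC'99 Benchmark (b12)", 317.89, 313.93, 1.26),
]

if __name__ == "__main__":
    for name, tp_fmax, ref_fmax, reported in TABLE_II:
        print(f"{name}: computed {tp_error_percent(tp_fmax, ref_fmax):+.2f}%, "
              f"reported {reported:+.2f}%")
```

A negative result means the TP measurement is below the reference, which matches the footnote that negative values are pessimistic.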
One concern with approach 1 is that the actual order of
criticality of the near-critical and most critical paths on a
real FPGA could differ from the timing analysis results because
of process variation. According to [2], within-die delay
variability consists mainly of stochastic (random) and spatially
correlated components, which respectively accounted for ±3.54%
(3-sigma) and up to 3.66% of delay variation on Cyclone II
FPGAs, whose architecture is similar to that of the Cyclone III.
The effect of the stochastic component is expected to be small
for long critical paths consisting of n LUT+IC (interconnect)
segments, since the total delay is the sum of the delays of all
segments and its variation relative to its mean decreases with n.
Assuming that the segments are nominally identical and that
their delays can be represented by independent, normally
distributed random variables, the standard deviation of a
segment delay (σseg) relative to its mean (μseg), i.e., the
coefficient of variation (Cv-seg), is σseg/μseg. For a path with
n segments, the coefficient of variation becomes
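$$C_{v\text{-path}} \approx \frac{\sqrt{n \cdot \sigma_{\text{seg}}^{2}}}{n \cdot \mu_{\text{seg}}} \approx \frac{1}{\sqrt{n}}\, C_{v\text{-seg}}. \qquad (9)$$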
Hence, paths with large n are expected to have much smaller
stochastic variation relative to a single LUT + IC segment.
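The 1/√n scaling in (9) can be checked numerically. The following is a minimal Monte Carlo sketch, assuming independent and identically distributed Gaussian segment delays as in the derivation; the function name `path_cv` and the values of `mu_seg`, `sigma_seg`, and `trials` are illustrative choices, not values from the measurements.

```python
import random
import statistics


def path_cv(n_segments, mu_seg=1.0, sigma_seg=0.05, trials=20000, seed=1):
    """Estimate the coefficient of variation of an n-segment path delay.

    Each segment delay is drawn independently from N(mu_seg, sigma_seg^2)
    (the assumption behind (9)); the path delay is the sum of its segments.
    """
    rng = random.Random(seed)
    delays = [
        sum(rng.gauss(mu_seg, sigma_seg) for _ in range(n_segments))
        for _ in range(trials)
    ]
    return statistics.pstdev(delays) / statistics.fmean(delays)


cv_seg = 0.05 / 1.0  # per-segment coefficient of variation used above
for n in (1, 4, 16):
    predicted = cv_seg / n ** 0.5  # right-hand side of (9)
    print(n, round(path_cv(n), 4), round(predicted, 4))
```

The simulated and predicted values should agree closely for each n, with the path-level variation shrinking as more segments are summed.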
To tackle the spatially correlated variation as well as any
remaining stochastic variation, we isolated the paths with delays
within 4% of the worst case critical path delay reported by the
timing analysis tool, and selected the one with the highest
measured delay as the reference. Note that we did not allow the use
of DSP blocks in the Butterworth IIR filter, because they would
form part of the critical path, preventing us from isolating it for
reference measurements. The FP32 multiplier and FIR filter,
on the other hand, do not suffer from this problem, so their
critical paths can be isolated while DSP blocks are in use.
1) Limitation With Accuracy Evaluation: It is difficult to
define and obtain an absolute “ground truth” reference for
accuracy comparison, since the TP and reference tests exercise
a CUT differently, and the slightest differences in switching
activity, temperature, and voltage supply can offset the results
between them. The inclusion of the high and low switching
activity tests with the isolated critical path was an attempt to
alleviate this problem, assuming that local temperature and
voltage variations are linked to the CUT’s dynamic power
consumption and, therefore, related to its switching activity.
This gives us a set of upper and lower bounds to evaluate
the accuracy of our TP results. Although not an absolute
comparison of timing, it allows us to determine whether a
TP measurement falls within the bounds, as well as its error
from the worst case lower bound.
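As an illustration of this evaluation, the sketch below computes the error of a TP measurement relative to the high-activity reference (the worst case lower bound) and checks whether it falls between the two references. The fmax values are taken from Table II; the function names are hypothetical and introduced only for this example.

```python
def tp_error_percent(tp_fmax, ref_fmax):
    """Percentage error of the TP fmax relative to a reference fmax."""
    return (tp_fmax - ref_fmax) / ref_fmax * 100.0


def within_bounds(tp_fmax, high_act_fmax, low_act_fmax):
    """True if the TP fmax lies between the high- and low-activity references."""
    return high_act_fmax <= tp_fmax <= low_act_fmax


# (TP fmax, crit. path low act., crit. path high act.) in MHz, from Table II.
cuts = {
    "FP32 Multiplier": (205.06, 206.06, 201.23),
    "FP32 Adder": (173.82, 176.53, 171.91),
    "FP32 Divider": (232.18, 238.05, 224.11),
    "FP32 Square-root": (283.52, 289.07, 284.10),
}

for name, (tp, low_act, high_act) in cuts.items():
    err = tp_error_percent(tp, high_act)
    print(f"{name}: {err:+.2f}%, within bounds: {within_bounds(tp, high_act, low_act)}")
```

A negative error, as for the square-root circuit, means the TP result lies below the high-activity reference and is therefore pessimistic.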
2) Consistency Evaluation: Consistency (or repeatability)
of the results is evaluated by repeating the TP test 300 times
for each CUT. The first 50 tests are done to achieve thermal
equilibrium in the FPGA, and thus their results are discarded.
The remaining 250 results are used to calculate measurement
consistency in terms of 3-sigma variation of the fmax in
percentage of the mean value. Only stage 2 of the test
procedure described in Section VII-B is used, and each test
takes approximately 5 s to run. The consistency results for the
CUTs are presented in Table II.
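The consistency figure can be computed along the following lines. This is a sketch only: the function name is hypothetical, the use of the sample (rather than population) standard deviation is an assumption, and the input data in the usage note is synthetic.

```python
import statistics


def consistency_3sigma_percent(fmax_runs, warmup=50):
    """3-sigma spread of fmax as a percentage of the mean, after warm-up.

    The first `warmup` runs are discarded, mirroring the thermal settling
    period described in the text.
    """
    steady = fmax_runs[warmup:]
    mean = statistics.fmean(steady)
    sigma = statistics.stdev(steady)  # sample standard deviation (assumed)
    return 3.0 * sigma / mean * 100.0


# Synthetic example: 50 warm-up runs followed by 250 steady-state runs (MHz).
runs = [205.0] * 50 + [199.9, 200.1] * 125
print(f"±{consistency_3sigma_percent(runs):.2f}%")
```

With 300 runs and the default warm-up of 50, the statistic is computed over the remaining 250 results, as in the procedure above.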
3) Limitation With Complex Circuits: To accurately test
complex state machines, the input vector sequences must drive
the CUT into the specific states that activate the most
critical path and expose timing failures at the outputs. This
reveals a limitation of pseudo-random test vectors: they
provide general patterns suitable for black-boxes, but are
inefficient at producing the exact vector sequences needed for
testing complex state machines. The number of test samples
(N) can be raised to increase the probability of covering
the right sequence, but there is an upper limit on N
beyond which test time becomes impractical. This is an inherent
limitation of all random-vector-based test methods and is
not specific to the TP method. Note that the proposed weighted
random optimization is also effective with state machines, and
should improve accuracy while maintaining practical test time
in most cases. Generally, it is difficult to obtain the optimal
N for black-boxes, since the critical paths’ delay distribution
and their observability are unknown. Nonetheless, statistical
analysis or Monte Carlo simulation (Fig. 19) of known corner
cases of designs could serve as a guideline for N values.
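The trade-off between N and coverage can be illustrated with a simplified model. The sketch assumes that each test sample independently activates the critical sequence with a fixed probability p; this is an idealization (samples applied to a state machine are generally not independent), and p is unknown for a real black-box, so the numbers are purely illustrative.

```python
import math


def coverage_probability(p, n_samples):
    """Probability that at least one of n_samples independent samples hits."""
    return 1.0 - (1.0 - p) ** n_samples


def samples_for_confidence(p, confidence):
    """Smallest N for which coverage_probability(p, N) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


p = 1e-6  # hypothetical per-sample activation probability
n_needed = samples_for_confidence(p, 0.99)
print(n_needed, coverage_probability(p, n_needed))
```

Under this model the required N grows roughly in proportion to 1/p, which is why rarely activated sequences quickly push the test time beyond practical limits.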
B. Test Results
The TP and HP fmax measurements of the eight CUTs
are presented in Table II. For the FP32 adder, divider, and
the two filter circuits, the TP and HP profiles of both the
isolated critical path and the original CUT responses are shown
in Figs. 24–27 to observe their failure characteristics. The
fmax measurements marked by vertical lines on the plots are
determined by the first point of deviation in their TP/HP
profiles (some of the initial deviations are smaller than the
plots’ resolution and thus may not be visible).
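The "first point of deviation" rule can be sketched as follows. This is not the platform's implementation: the function name, the use of the lowest-frequency TP value as the failure-free baseline, and the `tolerance` parameter (to absorb measurement noise) are assumptions made for illustration.

```python
def first_deviation(freqs_mhz, tp_values, tolerance=0.0):
    """Return the first frequency at which the TP profile leaves its baseline.

    freqs_mhz is an ascending frequency sweep and tp_values holds the TP
    measured at each step. The TP at the lowest frequency is taken as the
    failure-free baseline (an assumption of this sketch).
    """
    baseline = tp_values[0]
    for freq, tp in zip(freqs_mhz, tp_values):
        if abs(tp - baseline) > tolerance:
            return freq  # onset of timing failure
    return None  # no deviation observed within the sweep


# Synthetic sweep: the profile is flat, then starts to fall at 170.3 MHz.
freqs = [170.0, 170.1, 170.2, 170.3, 170.4]
tp = [0.50, 0.50, 0.50, 0.48, 0.30]
print(first_deviation(freqs, tp, tolerance=0.005))
```

Because the frequency step is small, taking fmax at the first deviating step or at the step just before it changes the result by at most one step size.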
[Fig. 24 plots: TP and HP (0 to 1) versus frequency (160 to 200 MHz). Isolated critical path: path Fmax 176.5 MHz with primary inputs at static 0 (low activity) and 171.9 MHz with uniform random primary inputs (high activity); HP decreasing ⇒ rising transitions failing. Original CUT outputs (normalised TP/HP): TP Fmax 173.8 MHz, HP Fmax 173.8 MHz.]
Fig. 24. TP/HP profiles and fmax (Fmax) inferences of the FP32 adder.
(a) TP profiles: isolated critical path. (b) HP profiles: isolated critical path.
(c) TP profiles: original CUT outputs. (d) HP profiles: original CUT outputs.
[Fig. 25 plots: TP and HP (0 to 1) versus frequency (220 to 260 MHz). Isolated critical path: path Fmax 238.0 MHz with primary inputs at static 0 (low activity) and 224.1 MHz with uniform random primary inputs (high activity); HP decreasing ⇒ rising transitions failing. Original CUT outputs (normalised TP/HP): TP Fmax 232.2 MHz, HP Fmax 232.2 MHz.]
Fig. 25. TP/HP profiles and fmax (Fmax) inferences of the FP32 divider.
(a) TP profiles: isolated critical path. (b) HP profiles: isolated critical path.
(c) TP profiles: original CUT outputs. (d) HP profiles: original CUT outputs.
As can be seen from Table II, the TP measurements of the
basic floating-point units (multiplier, adder, and divider) all lie
within the upper and lower bounds defined by the high and
low switching activity cases of the isolated critical path. The
lowest TP accuracy is observed in the FP32 divider (Fig. 25)
with 3.6% error from the worst case lower bound. This is not
surprising, as the FP32 divider has the highest resource usage
and pipeline depth (latency) amongst the CUTs, and hence is
expected to have the lowest TP sensitivity to timing failures.
Despite being the worst case observed, the error in terms of
frequency (or delay) remains relatively small.
Also, in Fig. 25(a), the timing of the isolated critical path
in the FP32 divider is hugely impacted by high surrounding
switching activity, with a significant fmax reduction from
238.05 to 224.11 MHz. The shape of the TP profile suggests
that the path delay varies with the test clock frequency
under high switching activity, resulting in failure slopes with
unpredictable gradient as opposed to the almost linear slopes
in the low switching activity case. The exact cause of the
differences is unclear, but it could be due to signal crosstalk,
increased localized heating from dynamic power dissipation,
and/or voltage drop due to increased load on the power supply.
[Fig. 26 plots: TP and HP (0 to 1) versus frequency (100 to 140 MHz). Isolated critical path: path Fmax 117.6 MHz with primary inputs at static 0 (low activity) and 111.8 MHz with uniform random primary inputs (high activity); HP increasing ⇒ falling transitions failing. Original CUT outputs (normalised TP/HP): TP Fmax 111.6 MHz, HP Fmax 111.6 MHz.]
Fig. 26. TP/HP profiles and fmax (Fmax) inferences of a Butterworth IIR
filter design from Altera [29]. (a) TP profiles: isolated critical path. (b) HP
profiles: isolated critical path. (c) TP profiles: original CUT outputs. (d) HP
profiles: original CUT outputs.
[Fig. 27 plots: TP and HP (0 to 1) versus frequency (155 to 195 MHz). Isolated critical path: path Fmax 172.8 MHz with primary inputs at static 0 (low activity) and 165.6 MHz with uniform random primary inputs (high activity); HP increasing ⇒ falling transitions failing. Original CUT outputs (normalised TP/HP): TP Fmax 164.7 MHz, HP Fmax 164.1 MHz.]
Fig. 27. TP/HP profiles and fmax (Fmax) inferences of a low-pass FIR filter
design with 15 taps (fixed coefficients). (a) TP profiles: isolated critical path.
(b) HP profiles: isolated critical path. (c) TP profiles: original CUT outputs.
(d) HP profiles: original CUT outputs.
For the IIR and FIR filters in Figs. 26 and 27, the results
appear to be highly accurate, yet intriguing. The TP method
yielded slightly more pessimistic fmax measurements than
the isolated critical paths even in the high activity case,
showing small but negative error percentages in Table II.
Interestingly, the FP32 square-root circuit also showed similar
results, and the one similarity that all three CUTs share is
that they all contain a significant portion of adder trees in
their implementations. Moreover, the critical and near-critical
paths in them all traverse the adder trees. Recall from
Section VI-A [Fig. 18(c) and (g)] that the XOR trees yielded
exceptional measurement accuracy due to their high levels of
glitches, and, most importantly, the sum logic in an adder is
essentially built from XORs. The critical paths in the three
CUTs are thus likely to have high levels of glitches during the
TP test, yielding more accurate measurements. An isolated
critical path, on the other hand, would have no glitches at all
and would be toggled only once every clock cycle by the TFF,
resulting in a slightly more optimistic (higher) fmax than the
TP method.

TABLE III
GENERAL RESOURCE USAGE OF THE TP TEST PLATFORM

| Component | Implementation | 4-LUT + Register (LE)* | ALUT + 2 Registers (ALM)* |
|---|---|---|---|
| 32-b Ring Generator + Phase Shifter | | ∼ 32 + r × n | ∼ (32 + r × n)/2 |
| HP weight control circuit | Partial/dynamic reconfigurable | n × ceil((r − 1)/3) | n × ceil((r − 1)/5) |
| HP weight control circuit | Direct logic implementation | n × ceil(2r/3) | n × ceil(2r/5) |
| k-bit TP counter | Shared by MUX (low area) | ∼ ceil(3m/4 + 1) + k | ∼ ceil(m/3) + k/2 |
| k-bit TP counter | Parallel (short test time) | m × k | m × k/2 |

* Resource usage is quantified in the number of LEs or ALMs [30]. The estimations are based on the number of input bits n, output bits m,
the TP counter bit width k, and the order r, which defines the number of HP weight levels by 2^r + 1.
In terms of state-machine testing, both the IIR filter and the
ITC’99 Benchmark (b12) fall within this category because they
contain sequential feedback. Although the TP result of the
ITC’99 circuit is slightly more optimistic than the low switching
activity reference, the worst case error remains at a reasonable
1.26%.
Another interesting observation in Table II is that TP and HP
almost always have the same fmax results. This agrees with
the measurements of the corner cases in Section VI, where
the accuracy of TP is approximately the same as that of HP,
especially when the level of glitch activity is high. The optimal
HP weights we obtained from the tests for each CUT (Table II)
turned out to be mostly at 0.5. Only two circuits (9 × 9
embedded multiplier and FP32 divider) benefited from having
nonuniform HP weights at 0.3125 and 0.875, respectively. This
is in line with the corner case results in Section VI-A, which
suggest that most circuits would yield good measurement
accuracy at 0.5 HP weight on the Cyclone III FPGA unless
their critical paths are dominated by many-input AND trees.
Finally, all the CUTs showed remarkable test consistency
over 250 consecutive tests. Their 3-sigma variations are all
within ±0.7%, and the worst case is observed on the 9 × 9
embedded multiplier at ±0.68%. The timing resolution of
the tests ranges from 0.3 to 8.0 ps, since the frequency steps
are fixed and a higher fmax results in better resolution [see (3)].
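The quoted resolution range can be reproduced with a short calculation. The sketch assumes that the timing resolution is the change in clock period between two adjacent frequency steps, which is how (3) is interpreted here, and that the 0.1 MHz step size given for the FP32 divider in Section IX applies to all CUTs; both are assumptions of this example.

```python
def timing_resolution_ps(fmax_mhz, step_mhz=0.1):
    """Clock-period change (in ps) between adjacent frequency steps at fmax."""
    period_ps = 1e6 / fmax_mhz                    # a 1 MHz clock has a 1e6 ps period
    next_period_ps = 1e6 / (fmax_mhz + step_mhz)  # period one frequency step higher
    return period_ps - next_period_ps


# Highest and lowest TP fmax values in Table II.
print(f"{timing_resolution_ps(553.43):.1f} ps")  # 9 x 9 embedded multiplier
print(f"{timing_resolution_ps(111.63):.1f} ps")  # Butterworth IIR filter
```

Since the period change is approximately Δf/f², the fastest CUT gives the finest resolution and the slowest CUT the coarsest.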
IX. TEST TIME AND AREA ESTIMATION
The test time of the TP method (Ttest) with N samples and
Nfreq frequency steps is expressed by

$$T_{\text{test}} = N \times \sum_{i=1}^{N_{\text{freq}}} \frac{1}{f_i} \approx N \times \frac{N_{\text{freq}}}{f_1} \qquad (10)$$

where fi is the clock frequency at the ith step. Taking the
FP32 divider case as an example, with 250 steps of test clock
frequency from 225 to 250 MHz, a 0.1 MHz step size, and 2^24
input vectors per frequency step, the approximate test time
is 2^24 × 250/225 MHz = 18.6 s. In comparison, scan-chain-
based methods with scan clock frequency fscan require extra
test time (N × Nfreq × m/fscan) to scan out m bits of output
for all samples and frequency steps on top of Ttest in (10).
Therefore, given fscan = 10 MHz, a scan-chain-based test
for the FP32 divider takes 18.6 s + 2^24 × 250 × 32/10 MHz =
13440.4 s, which is over 720 times slower than the TP method.
For the FRD-based SIC exhaustive test used earlier, the test
time per path is described by Ttest. Therefore, the test of a
combinatorial circuit with n inputs and m outputs would be
m × n × 2^(n−1) times [15] slower than using the TP method.
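The arithmetic of the FP32 divider example can be retraced with a few lines of code. The sketch uses the approximation in (10) and reads the per-step sample count as 2^24, which reproduces the 18.6 s quoted above; the function names are hypothetical.

```python
def tp_test_time_s(n_samples, n_freq, f_start_hz):
    """Approximate TP test time from (10), using the lowest test frequency."""
    return n_samples * n_freq / f_start_hz


def scan_chain_test_time_s(n_samples, n_freq, f_start_hz, m_bits, f_scan_hz):
    """TP-style test time plus the time to scan out m bits per sample."""
    scan_out = n_samples * n_freq * m_bits / f_scan_hz
    return tp_test_time_s(n_samples, n_freq, f_start_hz) + scan_out


t_tp = tp_test_time_s(2**24, 250, 225e6)
t_scan = scan_chain_test_time_s(2**24, 250, 225e6, 32, 10e6)
print(f"{t_tp:.1f} s, {t_scan:.1f} s, ratio {t_scan / t_tp:.0f}")
# prints "18.6 s, 13440.4 s, ratio 721"
```

The ratio depends only on m and the two clock frequencies (1 + m × f1/fscan), so a wider output or a slower scan clock widens the gap further.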
The resource usage of the test platform depends on four
factors: the number of input bits (n), the number of output
bits (m), the TP counter bit width (k), and the order of HP
weight levels (r), where 2^r + 1 gives the actual number of HP
levels. The detailed resource estimation for each test circuit
component is presented in Table III. The general resource
usage is quantified in terms of the number of LEs, each
containing a four-input LUT and a register, or ALMs, each with
a fracturable eight-input adaptive LUT (ALUT) and two registers,
as found in more advanced FPGA architectures [30].
The test circuit can be optimized for low area overhead
by sharing the same TP counter among output bits using an
m-to-1 multiplexer, or optimized for short test time using
multiple TP counters in parallel. Table III shows the two
extreme cases with fully shared or parallelized TP counters.
The resource usage of the multiplexer in the sharing case
is estimated based on the actual optimized synthesis results
from Altera’s Quartus II design tool. For example, the test
platform used for the previous test cases with 17 HP weight
levels (r = 4), n = 2 × 32, m = 32, and 24-b TP counters
(k = 24) requires 1248 LEs (parallel counters) or 529 LEs
(shared counters) to implement on the Cyclone III FPGA.
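The LE totals in this example can be retraced from the Table III estimates. The sketch below reads the table entries as 32 + r × n for the ring generator and phase shifter, n × ceil(2r/3) for the direct-logic HP weight control, and ceil(3m/4 + 1) + k or m × k for the shared or parallel TP counters; this reading is an assumption, adopted because it reproduces the quoted 1248 and 529 LEs. The function name is hypothetical.

```python
import math


def platform_le_estimate(n, m, k, r, shared_counter):
    """Estimated LE count of the TP test platform (direct-logic HP weights)."""
    ring_generator = 32 + r * n                # 32-b ring generator + phase shifter
    hp_weight_ctrl = n * math.ceil(2 * r / 3)  # direct logic implementation
    if shared_counter:
        # One k-bit counter shared through an m-to-1 multiplexer.
        tp_counter = math.ceil(3 * m / 4 + 1) + k
    else:
        # One k-bit counter per output bit.
        tp_counter = m * k
    return ring_generator + hp_weight_ctrl + tp_counter


print(platform_le_estimate(n=64, m=32, k=24, r=4, shared_counter=False))  # 1248
print(platform_le_estimate(n=64, m=32, k=24, r=4, shared_counter=True))   # 529
```

The parallel option is dominated by the m × k counter term, which is why sharing a single counter cuts the overhead by more than half in this configuration.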
X. CONCLUSION
This paper presented a detailed analysis and practical
demonstration of the TP method. While TP is as accurate as
HP measurements, its test circuit is more robust and flexible,
and requires less measurement hardware. Therefore, TP is the
ideal method for a self-contained timing measurement plat-
form. By analyzing the mechanisms that cause TP deviations
when timing failures occur in circuits, timing uncertainties
such as clock jitter and flip-flop metastability were found
to contribute to the method’s effectiveness and accuracy.
Through simulation and measurements of corner cases, the
effect of logic glitches and random input vectors with different
HP weights was examined. This provides insight into how
further accuracy improvement is possible via tuning input HP
weights for testing different designs, as well as the type of
circuits that are more suitable for the TP method. Although
a universal measurement platform for any black-box circuit
design is difficult to achieve, and evaluating accuracies for
all circuit structures is impossible, the TP method has shown
good measurement accuracy across common functional circuit
modules (arithmetic, filter, and state-machine circuits) that are
found in most complex modular designs. Accuracy remains
largely within 2%, except for the FP32 divider which is the
largest circuit tested, and gave a worst case error of 3.6%.
Further accuracy improvement may also be possible via bit-
specific input HP weight tuning using heuristics. However, the
current results show sufficient accuracy, and a small timing
guard band (timing margin) can be added to the measurements
to ensure reliability.
The TP method provides an attractive solution for measuring and
characterizing the timing of designs, especially for signal
processing and arithmetically intensive tasks, allowing them to
run with minimal timing safety margin (closest to fmax) while
maintaining reliability. It is also especially suited to
programmable architectures, such as FPGAs, where the output of
virtually any register can be routed to the TP test circuit
(transition counter). Moreover, transition counters can be built
in a compact form at the transistor level and included as
dedicated hardware in any VLSI design to provide an accurate
yet compact built-in test platform. One interesting direction
would be to incorporate the TP measurement platform in
design tools such that the resultant hardware design in ASIC,
FPGA, or other programmable architectures can be efficiently
tested for its timing performance. Such tools would ease the
burden of design for testability on designers, significantly
reduce design time, and shorten the testing and prototyping
cycles.
ACKNOWLEDGMENT
The authors would like to thank Terasic Technologies, Inc.
for providing USB BLASTER interface support on Altera’s
FPGAs.
REFERENCES
[1] B. P. Wong, Nano-CMOS Circuit and Physical Design. New York: John
Wiley & Sons, 2005.
[2] P. Sedcole and P. Y. K. Cheung, “Within-die delay variability in 90 nm
FPGAs and beyond,” in Proc. IEEE Int. Conf. Field-Program. Technol.,
Dec. 2006, pp. 97–104.
[3] M. Ruffoni and A. Bogliolo, “Direct measures of path delays on
commercial FPGA chips,” in Proc. 6th IEEE Workshop Signal Propag.
Interconn., May 2002, pp. 157–159.
[4] S. Pei, H. Li, and X. Li, “A low overhead on-chip path delay measure-
ment circuit,” in Proc. 18th Asian Test Symp., Nov. 2009, pp. 145–150.
[5] A. Raychowdhury, S. Ghosh, and K. Roy, “A novel on-chip delay
measurement hardware for efficient speed-binning,” in Proc. IEEE 11th
Int. On-Line Test. Symp., Jul. 2005, pp. 287–292.
[6] T. Matsumoto, “High-resolution on-chip propagation delay detector for
measuring within-chip variation,” in Proc. Int. Conf. Integr. Circuit
Design Technol., May 2005, pp. 217–220.
[7] K. Katoh, T. Tanabe, H. Zahidul, K. Namba, and H. Ito, “A delay
measurement technique using signature registers,” in Proc. 18th Asian
Test Symp., Nov. 2009, pp. 157–162.
[8] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-characterization
of combinatorial circuit delays in FPGAs,” in Proc. IEEE Int. Conf.
Field-Program. Technol., Dec. 2007, pp. 245–251.
[9] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “A transition probability
based delay measurement method for arbitrary circuits on FPGAs,” in
Proc. IEEE Int. Conf. Field-Program. Technol., Dec. 2008, pp. 105–112.
[10] J. S. J. Wong and P. Y. K. Cheung, “Improved delay measurement
method in FPGA based on transition probability,” in Proc. ACM/SIGDA
Int. Symp. Field Program. Gate Arrays, Feb. 2011, pp. 163–172.
[11] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler,
D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A low-power
pipeline based on circuit-level timing speculation,” in Proc. 36th Annu.
IEEE/ACM Int. Symp. Microarchitect., Dec. 2003, pp. 7–18.
[12] R. Ye, F. Yuan, and Q. Xu, “Online clock skew tuning for timing
speculation,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design,
Nov. 2011, pp. 442–447.
[13] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-measurement of
combinatorial circuit delays in FPGAs,” ACM Trans. Reconfig. Technol.
Syst., vol. 2, no. 2, pp. 1–22, 2009.
[14] A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation of average
switching activity in combinational and sequential circuits,” in Proc. 29th
ACM/IEEE Design Autom. Conf., Jun. 1992, pp. 253–259.
[15] J. S. J. Wong, “Delay measurement and self characterisation on FPGAs,”
Ph.D. dissertation, Imperial College London, London, U.K., 2011.
[16] Digital Clock Manager (DCM) Module, v1.9, Xilinx Inc., 2009.
[17] K. Kundert, Predicting the Phase Noise and Jitter of PLL-Based
Frequency Synthesizers. 4th ed., Designer’s Guide Consulting Inc., Aug.
2006.
[18] Understanding Metastability in FPGAs, Altera Corp., 2009.
[19] L.-T. Wang, C.-W. Wu, and X. Wen, “VLSI test principles and
architectures: Design for testability,” in The Morgan Kaufmann Series
in Systems on Silicon. New York, CA: Academic Press, 2006.
[20] J. Wilton, S.-S. Ang, and W. Luk, “The impact of pipelining on energy
per operation in field programmable gate arrays,” in Proc. Int. Conf.
Field Program. Logic Appl., Aug. 2004, pp. 719–728.
[21] J. Lamoureux, G. G. Lemieux, and S. J. E. Wilton, “Glitchless: An
active glitch minimization technique for,” in Proc. ACM/SIGDA Int.
Symp. Field Program. Gate Arrays, Feb. 2007, pp. 156–165.
[22] A. Raghunathan, S. Dey, and N. Jha, “Glitch analysis and reduction in
register transfer level power optimization,” in Proc. 33rd Design Autom.
Conf., Jun. 1996, pp. 331–336.
[23] G. Mrugalski, J. Rajski, and J. Tyszer, “Ring generators — new devices
for embedded test applications,” IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., vol. 23, no. 9, pp. 1306–1320, Sep. 2004.
[24] G. Mrugalski, M. Mukherjee, J. Rajski, and J. Tyszer, “High perfor-
mance dense ring generators,” IEEE Trans. Comput., vol. 55, no. 1,
pp. 83–87, Jan. 2006.
[25] J. Rajski, N. Tamarapalli, and J. Tyszer, “Automated synthesis of large
phase shifters for built-in self-test,” in Proc. Int. Test Conf., Oct. 1998,
pp. 1047–1056.
[26] J. Rajski, G. Mrugalski, and J. Tyszer, “Comparative study of CA-based
PRPGs and LFSRs with phase shifters,” in Proc. 17th IEEE VLSI Test
Symp., Apr. 1999, pp. 236–245.
[27] Difference-Based Partial Reconfiguration, Xilinx Inc., 2007.
[28] Increasing Design Functionality with Partial and Dynamic Reconfigu-
ration in 28-nm FPGAs, Altera Corp., 2010.
[29] Implementing High Performance DSP Functions in Stratix and Stratix
GX Devices, Altera Corp., 2004.
[30] FPGA Architecture White Paper, Altera Corp., 2006.
Justin S. J. Wong received the M.Eng. and Ph.D. degrees in electrical and
electronic engineering from Imperial College London, London, U.K., in 2006
and 2011, respectively.
He is currently a Research Associate with the Circuits and Systems
Group, Imperial College London. His current research interests include testing,
timing measurement, circuit reliability, delay variability characterization, and
variation-aware designs.
Peter Y. K. Cheung is a Professor of digital systems and the Head
of the Department of Electrical and Electronic Engineering, Imperial Col-
lege London, London, U.K. Detailed biography and research interests:
http://www.ee.ic.ac.uk/pcheung/
