Abstract-The key aspects of a good on-chip timing measurement platform are high measurement resolution, accuracy, and low area overhead. A measurement method based on transition probability (TP) has shown promising characteristics in all these areas. In this paper, the TP measurement method is examined through simulation to understand its apparent effectiveness and accuracy in measuring complex circuits. Timing uncertainties and logic glitch activities are considered in detail, and the effect of varying input vectors' probability distributions is analyzed to enable further accuracy improvements. Using a field-programmable gate array, the method is implemented and demonstrated as a modular on-chip test platform for testing complex arbitrary circuits. Practical circuits found in typical modular designs, including fixed/floating-point arithmetic and filter circuits, are chosen to evaluate the test platform. The resolution of the timing measurements ranges from 0.3 to 8.0 ps, and the measurement errors against reference measurements are found to be within 3.6%. The test platform can be applied to VLSI designs with minor area overhead, and provides designers with precise and accurate physical timing information of circuits.
I. INTRODUCTION

DESIGN with VLSI circuits as both application-specific integrated circuits (ASICs) and programmable architectures, e.g., field-programmable gate arrays (FPGAs), is facing a dilemma. While the need for greater operating speed encourages designers to push ever closer toward the absolute physical timing limit of hardware circuits, the increase in circuit complexity and density decreases timing predictability due to process variations [1], [2]. This forces the use of wider timing margins to ensure reliability, which sacrifices potential timing performance. The most obvious solution is to measure the actual circuit timing chip by chip, down to specific parts or components, such that deterministic timing information of critical circuit paths/components is available to allow much narrower timing margins while maintaining reliability. The use of intellectual property (IP) and embedded blocks is, however, problematic, since it is not possible to isolate/extract their actual critical paths due to encrypted/undisclosed circuit information. There are many existing timing measurement methods, but most of them are neither practical nor applicable to "black-box" circuits. For ring-oscillator-based methods [3], the extra interconnects for completing the feedback path impact measurement accuracy and prohibit registers in the circuit. Similarly, methods relying on time-to-digital conversion using a Vernier delay line [4], an analog-based delay converter [5], and a delay-locked loop [6] also fail to support registered paths, and the extra interconnects needed to route the inputs and outputs to and from the test circuit cause inaccuracy. Scan-chain-based methods for pass/fail analysis at a specific clock frequency can be repeated over multiple frequency steps to measure delay [7]. However, the test time is impractical for high timing resolution measurements, since capturing and analyzing output samples for many frequency steps through the scan chain is time consuming.
We introduced a method that allows high-resolution delay measurement of register-to-register paths through failure rate detection (FRD) [8]. However, it is limited to testing one path at a time and cannot directly support multipath circuits such as state machines and pipelined circuits. We proposed a measurement method in [9] that was shown to give relatively accurate measurements for multipath complex circuits [10], providing an attractive solution to the problem. The method is based on measuring the output transition probability (TP) of a circuit under test (CUT). It is able to achieve the same measurement accuracy and resolution as FRD for an isolated path but requires fewer hardware resources [9], [10]. Test time is also significantly shorter than that of scan-chain-based methods for complex circuits at high timing resolution. The remaining questions are what mechanisms behind the TP method govern its apparent high measurement accuracy, how it can be improved further, and how well it performs over a wider range of practical circuits as a universal test platform.
This paper proceeds as follows. Section II examines TP and related measurement methods in detail. Then, in Sections III and IV, TP is simulated and modeled to further understand its characteristics. In Section V, the mechanisms that allow TP measurement to infer the delay of a circuit are examined and the possible causes of measurement inaccuracy are identified. Actual measurements of corner case logic functions on FPGA are analyzed in Section VI along with an investigation of measurement accuracy. Next, the detailed implementation and test procedure of the complete TP test platform are described in Section VII. The test platform is demonstrated on FPGA with practical designs (arithmetic, filter, and state machine) in Section VIII along with accuracy and consistency evaluations. Finally, the test time and resource usage of the test platform are estimated in Section IX. Section X concludes this paper.
II. FAILURE RATE, TRANSITION PROBABILITY, AND HIGH-PROBABILITY METHODS
The FRD method introduced earlier was inspired by a register-level timing error detection/correction mechanism (RAZOR) proposed by [11] , which can be adapted to infer path delay [12] . The FRD method measures the timing failure rate of a circuit path while stepping up the system clock frequency, and infers the path delay from the point at which timing failures begin to occur. Failure rate is measured by comparing the registered output of a path under test (PUT) against a correct reference signal using a hardware comparator [ Fig. 1(b) ]. The biggest drawback of FRD is the need of an at-speed reference generator circuit which must operate correctly beyond the maximum speed of the PUT, and often utilizes an extensive amount of hardware resources [13] . We were able to efficiently generate a reference signal using the preregistered output of the PUT [8] . However, the technique is applicable to testing only one isolated combinatorial path at a time.
Similar to the FRD method, the TP method [9] , [10] measures circuit delay by detecting timing failures. However, the failures are inferred indirectly from statistical observation of the CUTs' outputs instead of using a hardware comparator and a reference generator. This significantly reduces the hardware resource usage, and it also enables the method to be used on multipath multistage pipelines and sequential circuits. In conjunction with the TP method, we also introduce a high probability (HP) based method which has similar behavior as TP but enables cross-comparison and analysis of the TP method.
A. Definitions of TP/HP and Measurement Concepts
Consider a typical synchronous circuit with a combinatorial stage and output register [ Fig. 1(a) ]. The output signal from the register can be seen as a series of discrete time samples S(k) of the preceding combinatorial output, where k = 1, 2, . . . Since the output sample rate obeys the clock frequency driving the register, two types of relative statistical measurement over N clock cycles can be observed.
1) The HP, or H(S), where H(S) = P{S(k) = 1}. It represents the fraction of samples in which S is high over N clock cycles. It is a first-order statistical measurement of S, and its value lies within the range 0-1.
2) The TP, or D(S), which is the probability that S changes state between consecutive samples, i.e., the average number of signal transitions in S per cycle over N clock cycles. It is given by
$$D(S) = P\{S(k) \neq S(k-1)\} \tag{1}$$
D(S) is a second-order statistical measurement of S. When S contains random binary samples, D(S) obeys the following quadratic relationship with a maximum of 0.5:
$$D(S) = 2H(S)\,[1 - H(S)] \tag{2}$$
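As an illustration of the two statistics, the following minimal Python sketch estimates H(S) and D(S) from a stream of samples and compares the measured TP of random bits with the quadratic form 2H(1 - H). That form is consistent with the two HP solutions (0.25 and 0.75) quoted later for TP = 0.375. The function names, sample count, and seed are assumptions made for this example and are not part of the measurement platform.

```python
import random

def high_probability(samples):
    """H(S): fraction of samples that are logic high."""
    return sum(samples) / len(samples)

def transition_probability(samples):
    """D(S): fraction of consecutive sample pairs whose values differ."""
    changes = sum(a != b for a, b in zip(samples, samples[1:]))
    return changes / (len(samples) - 1)

def random_bits(n, p_high, seed=0):
    """Independent random bits, each high with probability p_high (assumed model)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_high else 0 for _ in range(n)]

if __name__ == "__main__":
    for p in (0.25, 0.5, 0.75):
        s = random_bits(200_000, p)
        h = high_probability(s)
        d = transition_probability(s)
        # Measured TP should sit close to the quadratic prediction.
        print(f"p={p:.2f}  H={h:.3f}  D={d:.3f}  2H(1-H)={2 * h * (1 - h):.3f}")
```

A toggling sequence gives D(S) = 1 even though H(S) = 0.5, which is why the quadratic relationship applies only to random samples.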
It was shown in [14] that the probability of an output of a Boolean function evaluating to 1 is equal to the sum of the probabilities of each of the disjoint cubes in the cover evaluating to 1. If the input vectors of a circuit are chosen randomly or follow a fixed sequential pattern, i.e., the vectors form a stationary process, then the probability of its output(s) evaluating to 1 will be stationary as well. Therefore, H (S) and D(S) of the output samples will be stationary (unchanging). Any timing violations in the circuit disrupting the stationary process would cause the probabilities to change and hence indicate a timing failure. The idea is illustrated by the example plots in Fig. 2 . Such disruption due to timing violation can be explained through the following example.
In Fig. 1(a), the output register captures a sample S(k) of the output z after time T, one clock cycle after applying the input V(k). If the clock frequency is low enough, then the circuit operates without faults: S(k) = z(k), and the probabilities H(S) and D(S) remain stationary. However, if the test clock frequency is increased step by step, at some point the clock period will breach the timing constraint imposed by the propagation delay of z, and the register will begin to sample the z value from the previous cycle, such that S(k) = z(k − 1). This disrupts the stationary process and causes H(S) and D(S) (HP and TP) to deviate from their normal stationary values. The HP or TP value for each frequency step is collected to plot a profile that shows the failure behavior of the CUT over a range of test frequencies (Fig. 2), and is used to estimate the maximum operating frequency ($f_{\max}$) or circuit delay.
This test method relies on two features: 1) the ability to sweep the test clock frequency f clk in fine steps and 2) the ability to infer circuit delay from TP and/or HP measurements assuming they reflect timing failures in the circuit as frequency is swept from low to high. The clock generation and sweeping process for 1 has been thoroughly implemented in [8] , [13] , and [15] using phase-locked loops (PLLs) and/or digital clock managers (DCMs) [16] . For 2, the idea will be evaluated and simulated in the following sections to understand how TP and HP respond to timing failure in real circuits.
B. Measurement Circuit
The top-level implementation of the measurement circuit is depicted in Fig. 3. The CUT represents combinatorial or sequential circuits with input V and output y. The launch register (LR) and the sample register (SR) at the beginning and end of the CUT are clocked by a test clock generator (TCG) which steps through a range of test frequencies. The minimum achievable timing resolution ($\Delta t$) in terms of the start clock frequency ($f$) and frequency step size ($\Delta f$) is expressed in [8] as
$$\Delta t = \frac{1}{f} - \frac{1}{f + \Delta f} \tag{3}$$
If $f$ begins at 200 MHz and increments in steps of 0.1 MHz ($\Delta f$), the minimum timing resolution achieved is 2.5 ps.
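The arithmetic behind this figure can be checked with a short sketch. It assumes the resolution is simply the difference between two adjacent clock periods, which reproduces the 2.5 ps quoted above. The function name is hypothetical.

```python
def timing_resolution_ps(f_start_hz, f_step_hz):
    """Difference (in ps) between the clock periods at f_start and f_start + f_step."""
    period_before = 1.0 / f_start_hz
    period_after = 1.0 / (f_start_hz + f_step_hz)
    return (period_before - period_after) * 1e12

if __name__ == "__main__":
    # 200 MHz start frequency, 0.1 MHz step: roughly 2.5 ps.
    print(f"{timing_resolution_ps(200e6, 0.1e6):.3f} ps")
    # The same step at a higher start frequency gives a finer resolution.
    print(f"{timing_resolution_ps(400e6, 0.1e6):.3f} ps")
```

Because the period difference scales approximately with the step size divided by the square of the frequency, a fixed frequency step yields finer timing resolution as the sweep moves to higher frequencies.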
A test vector generator (TVG) provides test vectors V to the CUT such that during normal operation each output bit of the CUT exhibits a nonzero and steady output TP/HP. The measurement circuitry at the output of the CUT generates either TP or HP profile and it is analyzed by the probability profile analyzer (PPA) to produce delay measurement of the CUT. The measurement circuits for generating TP or HP profile are depicted in Fig. 4 . The asynchronous transition counter is implemented as a ripple counter which takes input as a clock and processes N samples from the CUT's output register over N test clock cycles for each frequency step. The count is stored by the probability profile collector (PPC) for all test frequency steps to create a profile. In practice, as long as N is kept constant across all frequency steps, circuit delay can be inferred directly by the PPA through detecting the change in transition count, without the need for dividing by N to obtain absolute TP values. Also, it is only necessary to store the most recent count in the PPC for change detection. Full TP/HP profiles are collected only for illustration purpose and to gain insight in circuit failure process.
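The change detection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tolerance threshold, the function name, and the example counts are hypothetical, and the actual PPA decision rule is not specified in this section.

```python
def find_deviation(freqs_mhz, counts, tolerance):
    """Return (last_good_freq, first_deviating_freq), or None if no step deviates.

    Only the previous step's count is kept, mirroring the remark that storing
    the most recent count is sufficient for change detection. N is assumed to
    be the same for every frequency step, so raw counts are compared directly.
    """
    previous_freq = freqs_mhz[0]
    previous_count = counts[0]
    for freq, count in zip(freqs_mhz[1:], counts[1:]):
        if abs(count - previous_count) > tolerance:
            return previous_freq, freq
        previous_freq, previous_count = freq, count
    return None

if __name__ == "__main__":
    freqs = [700, 701, 702, 703]          # hypothetical sweep, MHz
    counts = [1000, 1002, 999, 940]       # hypothetical transition counts
    print(find_deviation(freqs, counts, tolerance=10))
```

The tolerance absorbs the statistical fluctuation of the count between steps; choosing it is a trade-off between false alarms and detection delay.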
For HP, the CUT's output is first fed through a toggle flip-flop (TFF) to translate each sample at logical high into a signal transition for the asynchronous transition counter to count the number of high cycles and allow the PPC to produce an HP profile. Notice that the TFF must be synchronized to the test clock. Therefore, to prevent clock skew related errors, it is important to place it close to the output registers of the CUT where the relative clock skew is small. For this reason, the purely asynchronous TP measurement circuit is preferred due to the greater robustness, freedom of placement location, and lower resource usage.
III. CHARACTERISTICS OF TP AND HP PROFILES
To further understand the characteristics of the statistical profiles of TP and HP, the circuit in Fig. 3 is simulated with a single register-to-register path. Signal transitions are simulated as timing events, taking into account their interaction and propagation along the path. The propagation delays of the path for rising and falling transitions are set to two distinct values, and the registers are driven by a clock with a 30-ps-wide uniform clock jitter distribution around each expected clock edge (see Fig. 7). The path is stimulated by an input that toggles every clock cycle, and the clock frequency increases in 1.0-MHz steps. TP and HP are recorded over 2000 clock cycles (trials) for each frequency step to construct their profiles against frequency. 2000 cycles are sufficient because only one path is being tested. For actual circuits with many interacting paths, more test cycles are necessary (see Section VI). Flip-flop metastability is accounted for by a metastable window (20 ps wide) defined by symmetrical setup and hold times. The probability of resolving to the previous cycle's output is set to be linearly proportional to where the violation occurred, i.e., the probability varies linearly from 0 to 1 depending on where the violation occurs between the start and end of the window.
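The linear metastability model can be written as a small function. The sketch below assumes the 20-ps window is centered on the clock edge and that a data transition arriving at the start of the window always resolves to the new value; the sign convention and the function name are assumptions for illustration.

```python
def prob_capture_previous(arrival_minus_edge_ps, window_ps=20.0):
    """Probability that the register resolves to the previous cycle's value.

    arrival_minus_edge_ps: data arrival time minus clock edge time (ps).
    The probability ramps linearly from 0 at the start of the window
    (-window/2) to 1 at the end of the window (+window/2).
    """
    fraction = (arrival_minus_edge_ps + window_ps / 2.0) / window_ps
    return min(1.0, max(0.0, fraction))

if __name__ == "__main__":
    for offset in (-15.0, -10.0, 0.0, 5.0, 10.0, 15.0):
        print(f"offset={offset:+.0f} ps  P(previous)={prob_capture_previous(offset):.2f}")
```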
The simulated TP and HP profiles of a path with falling and rising transition delays at 1350 and 1250 ps (∼740 and 800 MHz) are shown in Fig. 5, where a reference failure rate profile from the FRD method is included. In Fig. 5(a), the first change in failure rate around 25% reflects the failure of the slower signal transitions (falling transitions in this case). The second change around 75% shows the failure of the quicker rising transitions, leading to 100% failure toward the end. The nominal maximum operating frequency ($f_{\max}$) of the path can be deduced from the midpoint of the first failure slope at 25%, where the TP and HP plots [Fig. 5(b) and (c)] infer timing failures in a similar way to the failure rate profile. Both TP and HP respond to the failure of the falling/rising transitions around TP = 0.5 and HP = 0.75, respectively, with similar slopes. By taking their midpoint, the same nominal $f_{\max}$ can be obtained accurately.
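A simplified version of this single-path experiment can be sketched in Python. It is not the event-driven simulator used for Fig. 5: metastability is ignored, jitter is drawn independently and uniformly for every cycle, and a transition that arrives after the capture edge simply leaves the previous value in the register. All names and defaults are assumptions for illustration.

```python
import random

def simulate_tp_hp(freq_mhz, t_rise_ps=1250.0, t_fall_ps=1350.0,
                   jitter_ps=30.0, cycles=2000, seed=0):
    """Return (TP, HP) of the sampled output for a toggling input."""
    rng = random.Random(seed)
    period_ps = 1e6 / freq_mhz            # e.g. 800 MHz -> 1250 ps
    samples = []
    z_prev = 0
    for _ in range(cycles):
        z = 1 - z_prev                    # toggle stimulus
        delay = t_rise_ps if z == 1 else t_fall_ps
        capture = period_ps + rng.uniform(-jitter_ps / 2, jitter_ps / 2)
        # On a timing failure the register keeps the previous cycle's value.
        samples.append(z if delay <= capture else z_prev)
        z_prev = z
    hp = sum(samples) / cycles
    tp = sum(a != b for a, b in zip(samples, samples[1:])) / (cycles - 1)
    return tp, hp

if __name__ == "__main__":
    for f in range(700, 861, 10):
        tp, hp = simulate_tp_hp(f)
        print(f"{f} MHz  TP={tp:.2f}  HP={hp:.2f}")
```

Under these assumptions the sweep shows TP falling from 1.0 through 0.5 and HP rising from 0.5 through 0.75 as the slower falling transitions begin to fail, and both returning to their initial values once the rising transitions fail as well.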
One interesting aspect of HP is that it allows us to identify the transition type (rise or fall) corresponding to the path delay. If the transition types are swapped for the same slow and fast delay values, both the failure rate and TP profiles will remain exactly the same, but with HP [Fig. 5(d)], the profile is flipped upside down around HP = 0.5.
A. Contributions of Timing Uncertainties
The main concern with TP or HP arises when the delays of the rising and falling transitions are symmetrical (equal), potentially causing the two failure slopes to overlap and cancel each other out. However, the existence of timing uncertainties (clock jitter and metastability) prevents the TP profile from completely losing sensitivity to timing failure. Their effects are illustrated in Fig. 6 with both delays at 1300 ps (∼770 MHz), which clearly shows the positive effect of random clock jitter. When jitter is absent, the TP profile has absolutely no sensitivity to timing failure, whereas the cases with jitter and metastability produce easily distinguishable TP responses. HP, by contrast, loses most of its sensitivity in this case. The reason for the sensitivity loss with HP will be discussed further in Section V-A.
1) Effect of Random and Correlated Jitter:
Clearly, clock jitter leads to the success of the TP method. Yet, the behavior of jitter could vary between different clock sources, and jitter could be induced by different processes [17] . Therefore, it is important to thoroughly understand how these differences could affect the resultant TP profile.
The main concept of jitter is illustrated in Fig. 7 , where jitter is described by a random variable τ relative to the expected clock edge at time T . Since the CUT's combinatorial output must settle within one clock cycle (assuming no multicycle transfer), we can model the clock period variation between two consecutive clock edges by T + τ as pure edge-to-edge jitter, assuming a jitter-free initial clock edge. According to [17] , there are two main types of jitter that could affect the apparent edge-to-edge jitter experienced by a CUT.
1) Edge-to-edge random jitter: independent random phase variation between each clock edge.
2) Low-frequency multiple-cycle random jitter: random but gradual phase drift over multiple clock cycles, which causes a high degree of edge-to-edge correlation.
Type 2 jitter can be simulated by interpolating a low-frequency random drift to compute intermediate jitter values at higher frequency for each clock cycle with edge-to-edge correlation. In comparison to uncorrelated jitter, Fig. 8 shows that the edge-to-edge correlation of the low-frequency multicycle jitter causes a significantly smaller TP response during the period of timing failure. The reduced TP sensitivity is explained as follows. Consider the combinatorial output in Fig. 7, with the delays coinciding exactly with the clock edges ($t_{\text{slow}} = t_{\text{fast}} = T$). The general expression of TP under such a condition is given by
$$TP = P(\text{Fall}_{\text{fail}} \cap \text{Rise}_{\text{fail}}) + P(\overline{\text{Fall}_{\text{fail}}} \cap \overline{\text{Rise}_{\text{fail}}}) \tag{4}$$
Without jitter correlation, the probability that one type of transition fails is 0.5, independent of the outcome of the previous transition; hence $TP = 0.5^2 + (1 - 0.5)^2 = 0.5$, which agrees with the TP value in Fig. 6 at ∼770 MHz. However, when jitter correlation exists, the probability that a transition fails becomes dependent on the outcome of the previous transition. This increases the probability of both transitions failing at the same time, $P(\text{Fall}_{\text{fail}} \cap \text{Rise}_{\text{fail}})$, and of both not failing at the same time, $P(\overline{\text{Fall}_{\text{fail}}} \cap \overline{\text{Rise}_{\text{fail}}})$, which causes a smaller TP change from its initial value of 1.0 and lower sensitivity to timing errors (Fig. 8). The TP deviation caused by metastability becomes more apparent in this case and helps maintain timing failure sensitivity, since it introduces uncertainty that is always independent of clock edges regardless of jitter correlation.
[Figure: Comparison between TP profiles with low-frequency multicycle random jitter and edge-to-edge random jitter in a circuit path with different rising and falling transition delays.]
Note that, when the rising and falling transition delays are asymmetrical with $t_{\text{slow}} \neq t_{\text{fast}}$, the probability that they both fail at the same frequency is zero. Hence, the TP profile is unaffected by jitter correlation (Fig. 9).
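The effect of jitter correlation in the symmetrical case can be reproduced with a small Monte Carlo sketch. It assumes both delays equal the nominal period, so a cycle fails whenever its jitter value is negative, and it uses the fact that, for a toggling input, a transition is observed exactly when two consecutive cycles either both fail or both pass. The drift model (linear interpolation between random anchor points) and all names are assumptions for illustration.

```python
import random

def tp_from_jitter(jitter):
    """TP for a toggling input when a cycle fails if its jitter is negative."""
    fails = [tau < 0 for tau in jitter]
    same = sum(a == b for a, b in zip(fails, fails[1:]))
    return same / (len(fails) - 1)

def independent_jitter(n, width_ps, rng):
    """Edge-to-edge random jitter: a fresh uniform value every cycle."""
    return [rng.uniform(-width_ps / 2, width_ps / 2) for _ in range(n)]

def drifting_jitter(n, width_ps, rng, span=50):
    """Low-frequency jitter: interpolate between random anchors every `span` cycles."""
    anchors = [rng.uniform(-width_ps / 2, width_ps / 2)
               for _ in range(n // span + 2)]
    out = []
    for k in range(n):
        i, step = divmod(k, span)
        out.append(anchors[i] + (anchors[i + 1] - anchors[i]) * step / span)
    return out

if __name__ == "__main__":
    rng = random.Random(1)
    print("independent:", round(tp_from_jitter(independent_jitter(20000, 30.0, rng)), 3))
    print("correlated: ", round(tp_from_jitter(drifting_jitter(20000, 30.0, rng)), 3))
```

With independent jitter the TP settles near 0.5, whereas the slowly drifting jitter keeps consecutive cycles on the same side of the clock edge and the TP stays close to its initial value of 1.0.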
A simple test is carried out on an Altera Cyclone III FPGA [15] to identify possible edge-to-edge jitter correlation in its clock signal. The CUT is implemented with a series of nine inverters such that the path has approximately symmetrical rising and falling transition delays, and it is driven by toggle stimulus. Fig. 10 depicts the TP profiles, where the individual TP profile of each transition type is isolated by taking separate measurements at even or odd clock cycles. The small deviation in the overall TP profile clearly shows the existence of correlated multicycle jitter in the clock signal.
IV. TP/HP PROFILES FOR MULTIPATH CIRCUITS
The general TP/HP measurement circuitry described in the previous subsection can be adapted to test more complex circuits with multiple inputs and paths by using a pseudo-random vector generator (RVG) to stimulate the CUT (Fig. 11). Since the vector generation process is stationary, the statistics (TP and HP) of the resultant random test vectors are also stationary. We quantify the statistics of the random input (V) in terms of HP, which can be varied linearly to obtain specific random bit patterns. TP values, on the other hand, could each refer to two HP values of different random bit patterns, since TP is a quadratic function of HP for random patterns (2), bounded by TP ≤ 0.5. For example, when TP = 0.375, (2) has two HP solutions at 0.25 and 0.75. The only exception is when TP reaches its maximum at TP = HP = 0.5.
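The two HP solutions follow from solving the quadratic TP = 2·HP·(1 − HP) for HP, which gives HP = (1 ± sqrt(1 − 2·TP))/2. A minimal sketch, with a hypothetical function name:

```python
import math

def hp_solutions(tp):
    """Return the two HP values of a random bit pattern that yield the given TP."""
    if not 0.0 <= tp <= 0.5:
        raise ValueError("TP of a random pattern must lie between 0 and 0.5")
    root = math.sqrt(1.0 - 2.0 * tp)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0

if __name__ == "__main__":
    print(hp_solutions(0.375))  # (0.25, 0.75)
    print(hp_solutions(0.5))    # (0.5, 0.5): the single solution at the maximum
```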
Through simulation, we obtained the TP and HP output profiles of a single path stimulated by uniform random input as shown in Fig. 13(a) and (b). The TP profile has the same characteristic shape as the one observed in Fig. 5(b) with the toggling input, where the midpoints of the two slopes represent the nominal propagation delays of the path's falling and rising transitions ($t_{\text{slow}}$ and $t_{\text{fast}}$). This simulated TP behavior is also observed on the Altera Cyclone III FPGA in [10] and [15].
A. TP Model of Sequential Paths
When a series of combinatorial paths are connected in series by registers in a pipeline arrangement, the TP profile of the entire circuit can be expressed in terms of the TP profiles of the individual paths. An example of such a pipeline circuit containing three stages is presented in Fig. 12, where the three stages (A, B, and C) are associated with their respective propagation delays: $t_{A\text{-fall}}$, $t_{A\text{-rise}}$, $t_{B\text{-fall}}$, $t_{B\text{-rise}}$ and $t_{C\text{-fall}}$, $t_{C\text{-rise}}$. Fig. 13(c) depicts the TP profile of the circuit. When the range between the rise and fall delays of a path does not overlap with the delays of other paths (stage C in this case), the change in TP response due to timing failure is independent and simply propagates directly to the output, maintaining the usual form of a single-path TP profile [Fig. 13(a)]. The overall TP profile is formed by a multiplicative process of the individual TP profiles from each pipeline stage, and for a three-stage pipeline circuit (Fig. 12), the TP profile is
where $TP_A$, $TP_B$, and $TP_C$ are the TP profiles of the paths in stages A, B, and C, respectively. Similarly, for a simple pipeline containing N stages, the general TP profile is expressed by
where $TP_i$ represents the TP profile of the $i$th-stage combinatorial path, and $TP_i \le 0.5$. It is clear that, in the case of simple sequential paths, the failure of the worst case path would always yield an easily distinguishable TP response no matter how the failures of the other paths affect the overall TP profile. In addition, a similar behavior for HP is observed in Fig. 13(d), where the HP profile is approximately $1 - TP_{\text{seq}}$.
B. Analysis of Complex Multipath TP Profile
The previously described models are useful for predicting the TP profile of a failing path or simple sequential paths. Yet, the problem with them is that they are not scalable to more complex circuits containing multiple interacting combinatorial paths. Fig. 14 depicts the TP profile of the second LSB output of a 9 × 9 embedded multiplier on the Cyclone III EP3C25 FPGA. As can be seen, the observed output TP profile (shown as the dotted line) is produced by the timing failure of each individual path. While the TP profile may appear to be a direct combination of the basic TP profile components of these paths, it is actually not possible to predict the exact overall TP profile using the basic single-path TP profiles alone. The main reason for this is that the failure processes of the paths are interrelated with each other in a difficult-to-predict manner.
Consider the timing illustration in Fig. 15, where a circuit with multiple internal paths is stimulated by random vectors. The probability that an input transition through a particular path is observable at the output depends on the input pattern and the state of the other paths, which means each path could contribute differently to the observed TP profile. Such behavior is predictable only if the exact circuit implementation, structure, and layout are known.
Fig. 14. TP profile measurement of the second LSB output of a 9 × 9 embedded multiplier on the Cyclone III EP3C25 FPGA [10], [15]. The unusual shape of the TP profile is the result of individual paths failing at different frequencies. The corresponding paths are isolated and tested separately to obtain their basic TP profile components for reference.
Although each active path may produce a signal transition some time after the clock edge, their different arrival times result in a "glitch period" containing a series of unwanted transition activities. These glitch activities are unpredictable especially with random input vectors. When the glitch period coincides with the next clock edge, where the clock edge position itself is unpredictable due to clock jitter, the actual value captured by the register (B ) is not deterministic, and hence the resultant TP cannot be determined with certainty. Also, the rapid transitions in the glitch period could cause undesirable metastability problem in the output register [18] , further increasing the unpredictability of the output value.
For these reasons, the direct approach of modeling the TP profile based on specific paths quickly becomes impractical as complexity grows. A mere change in the placement and routing of a design could produce a layout with a completely different TP profile. A precise model of the TP profile can be obtained only if a perfect physical model of the circuit is available, with precise information on signal propagation, interaction, and clock jitter behavior, such that the exact glitch pattern is known and the registered output value is predictable. If such a perfect physical model existed, though, a delay measurement method would not be necessary in the first place. A better direction is to consider the timing failure sensitivity of TP rather than its exact profile, and to deduce ways to improve its sensitivity (measurement accuracy) in different designs.
C. Relationship Between TP/HP Sensitivity and Input Probability Distribution
The results presented so far are based on uniform random input patterns, with approximately the same number of high and low cycles (HP = 0.5). However, it was shown in [10] , [15] , and [19] that using weighted random input patterns with biased probability distributions (HP other than 0.5) could improve coverage of paths that are rarely exercised, and thus improve TP/HP sensitivity.
A simple example would be an N-input AND gate, where the output goes high only when all input bits are simultaneously at high. This implies that the probability that the AND gate produces a rising transition is particularly low, especially when N is large. Assuming all random input bits are independent and uniformly distributed at HP = 0.5, the probability of such transition occurring is given by (0.5) N . It is clear that any increase in the input HP from 0.5 would increase the transition probability through the critical path of the AND gate, resulting in better TP/HP sensitivity. The effect of weighted random input in terms of measurement accuracy will be tested and analyzed in detail in the following section.
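The effect of the input weight on an N-input AND gate can be tabulated with a short sketch. It assumes independent input bits and independent vectors from cycle to cycle, so the output is high with probability p^N, and a rising transition (a low output followed by a high one) occurs with probability (1 − p^N)·p^N, which is close to p^N when p^N is small. The function names are hypothetical.

```python
def and_output_hp(n_inputs, input_hp):
    """Probability that an n-input AND gate outputs 1 for independent inputs."""
    return input_hp ** n_inputs

def rising_transition_probability(output_hp):
    """Probability of a 0 -> 1 output change between two independent cycles."""
    return (1.0 - output_hp) * output_hp

if __name__ == "__main__":
    n = 15
    for weight in (0.5, 0.75, 0.9375):
        hp = and_output_hp(n, weight)
        rise = rising_transition_probability(hp)
        # 1 / rise is the average number of cycles between rising transitions.
        print(f"input HP={weight:<6}  output HP={hp:.3e}  "
              f"P(rise)={rise:.3e}  cycles per rise={1.0 / rise:.1f}")
```

For 15 inputs at HP = 0.5, a rising transition is expected only about once every 2^15 cycles, which illustrates why biased input weights reduce the number of samples needed to exercise the critical path.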
V. ASSUMPTIONS AND CAUSE OF TP/HP INACCURACY
The accuracy of TP or HP measurements relies on two assumptions: 1) the test vectors' coverage is adequate, i.e., N random test vectors successfully traverse to the correct circuit states (in the case of state machines) and exercise the critical paths in a CUT at least once, and 2) TP and HP respond to timing failures, given that assumption 1 is true. In the following sections, we will analyze the mechanisms by which TP and HP respond to timing failures to assess assumption 2, then examine both 1 and 2 through Monte Carlo simulations and actual FPGA measurements of specific corner cases to assess TP's measurement accuracy in practice.
A. Link Between Timing Failure and TP/HP Deviations
The cause of deviation in TP and/or HP can be understood through the cases in Fig. 16. As shown in Fig. 16(a) and (b), TP responds to timing errors only if the pulse width t is one clock cycle long [case 1(a) for rising transition and case 2(b) for falling transition], whereas HP continues to respond to timing errors even if t lasts for two or more clock cycles (not illustrated in Fig. 16). Therefore, as long as the input vectors are able to exercise the critical path at least once, the HP method could provide accurate detection of the path delay. The only exception for HP in Fig. 16 is when the failures of (a) rising and (b) falling transitions create an equal and opposite amount of change in HP in both cases 1 and 2. Due to random clock jitter, it is possible for all three types of failure in Fig. 16(a)-(c) to coexist, provided that the number of test clock cycles is large enough and both rising and falling transitions fall within the clock jitter distribution (see Fig. 7). Thus, in the case where the rising and falling transitions have exactly the same delay and both (a) and (b) occur for exactly the same number of cycles, the changes in HP could cancel each other out, resulting in little or no overall observable change. Such cases of HP cancelation were observed earlier in simulations in Figs. 6 and 8, showing only a tiny deviation due to random clock jitter and metastability responses. Nonetheless, it is highly unlikely that the delays of rising and falling transitions are exactly matched in real circuits due to process variations. Also, perfect cancelation is virtually impossible because jitter and metastability cannot be fully eradicated in practice. The probability that the number of failed cycles of both transition types matches exactly is extremely low. TP measurement, on the other hand, does not suffer from such a cancelation problem, since TP only changes in either case 1(a) or case 2(b) as shown in Fig. 16 and both cases cause TP to increase.
This agrees with the significant TP deviation observed earlier in Figs. 6 and 8.
In cases 1(b) and 2(a) of Fig. 16, it is true that TP appears to be less effective in reflecting timing failures than HP. However, when glitches are considered [Fig. 16(d) and (e)], the picture changes dramatically. It is a known fact that glitches account for a significant portion of switching activities in combinatorial circuits, especially in FPGAs. Both Wilton et al. [20] and Lamoureux et al. [21] have shown that FPGA logic generates a large number of glitches, and a great proportion of switching activities in general ASIC combinatorial logic are also due to glitches [22]. As depicted in Fig. 16(d), the critical glitches (glitches created and propagated along the critical path) cause both TP and HP to deviate when falling and/or rising glitch transitions violate timing in cases 1 and 2(d). Also, in Fig. 16(e), with nontiming critical transitions, the critical glitches could help maintain TP sensitivity to the critical path delay, where TP deviations would reflect the timing failure of the critical glitches instead of the transitions with lower delays.
VI. CORNER CASE ANALYSIS ON FPGA
As explained in Section IV-C, the best random input HP weight to use for testing a CUT depends on its logic function, and in Section V-A glitch activity is shown to play a significant role in improving TP sensitivity. To examine the effect of glitches and varying input HP weight, two logic functions, AND and OR, are selected for testing. Although they are basic functions, they represent the corner cases that define the worst case accuracy bound of the TP method for any logic function with the same number of inputs. In addition, XOR is tested to confirm whether measurement accuracy is indeed bounded by the two basic logical corner cases.
An Altera Cyclone III EP3C25 FPGA is used to implement the three logic functions with 15 inputs. They are given two structures, degenerate binary tree and balanced tree, as shown in Fig. 17 . This allows the difference between a circuit with one distinctive critical path [ Fig. 17(a) ] and four (near) critical paths [ Fig. 17(b) ] to be observed and compared independently. The two structures are implemented with four-input lookup-tables (4-LUTs) and we switch between the three logic functions by setting their LUT masks (SRAM configurations) while maintaining exactly the same placement and routing for each test. The preliminary test circuit is based on Fig. 20 , with a programmable weighted RVG (WRVG) capable of generating random inputs with HP weights from 0 to 1 in steps of 0.0625, see Section VII. The glitch activity at the combinatorial output of each case is measured directly using the same asynchronous transition counter for TP.
It is also worth exploring the average number of input test samples (N) needed to detect TP deviation (timing failure) at different input HP weights and critical glitch activity, such that correct/accurate timing measurements are obtained. For this, the TP responses of AND and OR of the two structures in Fig. 17 are observed through Monte Carlo simulations. Their inputs are stimulated by random vectors with HP weights from 0 to 1 in steps of 0.0625, and critical glitches are injected into the cycles that are noncritical or stationary [Fig. 16(d) and (e)], expressed as a percentage of the total number of such cycles. The simulation runs through $10^{9}$ input samples, and the average distance between samples that cause TP deviation and timing failure of the critical path(s) is recorded for each HP weight. The results are shown in Fig. 19 with 0.01% and 25% of critical glitch activities.
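To give a feel for why the required N depends so strongly on the HP weight, the following is a simplified analytic sketch, not the Monte Carlo simulation described above. It assumes an ideal n-input AND gate, statistically independent inputs and independent consecutive vectors, and it ignores glitches and path sensitization entirely. The function name `expected_samples_between_rises` is hypothetical.

```python
def expected_samples_between_rises(n_inputs, hp_weight):
    """Mean number of random vectors between 0->1 output transitions
    of an ideal n-input AND gate (simplified model, see assumptions)."""
    p_one = hp_weight ** n_inputs      # P(all inputs are 1) = P(output is 1)
    p_rise = (1.0 - p_one) * p_one     # output is 0 in one cycle, 1 in the next
    if p_rise == 0.0:
        return float("inf")            # output never rises at HP = 0 or HP = 1
    return 1.0 / p_rise


# Sweep the same 17 HP weights used in the text (0 to 1 in steps of 0.0625).
weights = [k / 16 for k in range(17)]
table = {w: expected_samples_between_rises(15, w) for w in weights}

for w in weights:
    print(f"HP = {w:6.4f}  mean samples between rises = {table[w]:.3g}")
```

Under this model the mean spacing grows very quickly as the HP weight drops, which is the qualitative reason a fixed sample budget can miss rare transitions at low HP weights.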
A. FPGA Test and Simulation Results
The FPGA measurements of the 15-input AND, OR, and XOR are presented in Fig. 18 in terms of their maximum operating frequency (f max). The CUTs are tested through $2^{24}$ ($\approx 10^{7.2}$) samples for each HP weight to cover most of the ... [Fig. 5(b)], where nominal f max is based on the nominal clock period at the center of clock jitter distribution [9], and the actual f max is based on the frequency point immediately before TP deviation (timing failure) occurs. The nominal f max is useful because it indicates when the nominal clock period is exactly matching the critical path delay. However, the actual f max forms the absolute baseline where no timing failure is detected.
As can be seen, both the TP and HP measurements are remarkably accurate, lying mostly within the worst case nominal and actual f max bounds, except in the AND cases when input HP weight is at 0.5 or less [Fig. 18(a) and (e)]. It turns out that the outliers in both cases are caused by the TP/HP methods losing sensitivity to the most critical path and responding instead to the timing failure of less critical paths or signal transition types (rise/fall). For the AND in degenerate binary tree form [Fig. 18(a)], the second path associated with input bit1 is responsible, whereas in balanced tree form [Fig. 18(e)], the failure of the quicker rising transitions is detected instead. Apart from the outliers, the measurements remained accurate within 1.5% of the actual f max in all cases, and are bounded within the nominal f max [except in Fig. 18(c) at 0.9375 HP weight where it is slightly above].
The OR cases did not suffer from the inaccuracy seen in the AND cases but failed to give any measurements (no TP deviation) when input HP is greater than 0.6875. This is because OR's delay is governed by the slower falling signal transitions, and according to Fig. 19(a) and (b), the number of test samples required to detect TP deviation caused by failure of falling transitions (OR↓) is much lower than that of AND (AND↓), i.e., OR↓ and AND↓ are not symmetrical. Therefore, OR's measurements are generally more accurate than AND's in this given case. Also, the jump in inaccuracy for AND [Fig. 18(e)], which is not observed in the OR case, can be explained by the simulation results. For example, in Fig. 19(a), if $10^{8}$ input samples are used, then it is likely that the failure of rising transitions (AND↑) is detected by TP for any HP weights above 0.3125, but for AND↓, it requires HP weight above 0.5 to detect any TP deviations within $10^{8}$ samples. Therefore, the unwanted measurement of AND↑ is more likely to appear in the gap between 0.3125 and 0.5 instead of the worst case AND↓ results, which reflects the observation in Fig. 18(e) between 0.3125 and 0.5 of HP weight.
Given the results from the Monte Carlo simulations and the three test cases, the major factors that affect measurement accuracy are the number of input test samples (N), the input HP weight, and the level of critical glitches in the circuit. The plots in both Figs. 18 and 19 suggest that using uniform random input stimulus (HP = 0.5) would produce relatively accurate results when glitch activity is high. The only exception is seen in the balanced AND tree, which required an HP weight of 0.5625 or more to obtain accurate measurements. For the particular test cases on the Cyclone III FPGA architecture, we can see that a fixed input HP weight of 0.625 with the TP method would guarantee measurement accuracy to within 1% of the actual f max in all cases.
Overall, we can conclude that in any combinatorial circuits where falling transitions dominate propagation delay, accuracy would mostly be bounded by AND logic. Otherwise, if rising transitions dominate, the opposite would be true and accuracy would be bounded by OR logic. XOR logic is expected to provide very good measurement accuracy in both cases due to its high glitch activities [ Fig. 18(d) and (h) ]. In fact, the glitches from XOR would improve the measurement accuracy of logic following it. They create intermediate input patterns capable of sensitizing critical paths that are normally impossible or unlikely to occur with external input vectors, thus improving TP sensitivity. This is supported by Fig. 19(c) and (d) , where the number of test samples needed to detect TP deviation is significantly reduced at high glitch activities. Also, glitches are likely to increase with each extra logic stage in a combinatorial circuit [22] , allowing the TP method to scale with circuit depth/complexity. This hypothesis will be tested on practical complex circuits on FPGA in the following sections.
VII. TEST PLATFORM IMPLEMENTATION AND TEST PROCEDURES
The general structure of the test platform is depicted in Fig. 20, which consists of a TCG, a WRVG, a CUT, and the TP/HP profile measurement and analysis circuitries. The WRVG generates one random vector per clock cycle, which is launched into the CUT by the LR, while the CUT output is captured by the SR for TP/HP measurements. The WRVG (Fig. 21) is implemented using a 32-b maximal-length ring generator [23], [24] followed by a phase-shifter network to ensure interbit phase independence [25], [26], and a logic-based probability bias circuit to generate weighted random vectors.
A. Generation of Weighted Random Test Vectors
Figure caption (partially recovered): ... [23], [24] and an XOR-gate-based phase-shifter network to remove phase correlation between output bits [25], [26]. The probability bias circuit enables control of HP of the random bits according to the HP weight control.
Random sequences with arbitrary HP weight can be generated from a combination of independent uniformly distributed random bit streams through simple Boolean logic [19]. Consider the following identities linking the HP of the Boolean combination of random bit streams A and B in terms of H(A) and H(B):
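For statistically independent bit streams A and B, the standard probability identities for the AND and OR combinations take the form below. This is a reconstruction from elementary probability, and the assignment of the labels (7) and (8) to the AND and OR cases, in that order, is an assumption.

```latex
\begin{align}
H(A \cdot B) &= H(A)\,H(B) \tag{7} \\
H(A + B)     &= H(A) + H(B) - H(A)\,H(B) \tag{8}
\end{align}
```

With $H(B) = 0.5$, the AND form halves $H(A)$, whereas the OR form maps $H(A)$ to $(1 + H(A))/2$, moving it halfway toward 1.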
Using (7) and (8), the HP of a bitstream can be shifted above or below 0.5 when combined with one or more independent uniform random bit streams through basic AND and OR logic functions. A wide range of HP values between 0 and 1 can be generated using this approach, and an example is shown in Table I giving 17 levels of HP weights using four independent uniform random bit streams (R0, R1, R2, and R3). In general, the achievable number of HP levels is given by $2^{r} + 1$, where r is the number of independent uniform random inputs, and the HP weight resolution is given by $1/2^{r}$. An extra state (Toggle) for toggling inputs is included in Table I such that single input change (SIC) exhaustive testing, which targets one specific input at a time, can be done for comparison purposes. For FPGAs with partial/dynamic reconfigurability, the HP weight-shifting logic can be implemented easily with a typical LUT through changing the configuration SRAM bits (LUT mask) on the fly [27], [28]. Fig. 22 shows an example of a partial/dynamic reconfiguration-based HP weight control circuit on a typical FPGA logic element (LE) with a four-input LUT and register, which is capable of implementing the 17 HP levels in Table I with only one LE per bit. Also, by exploiting the register feedback path available on most FPGA architectures, the LE can be reconfigured into a TFF for toggle signals. Unfortunately, partial reconfigurability is unavailable on the Cyclone III FPGA we used; thus a logic-based implementation requiring three LUTs is used.
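The weight-shifting idea can be sketched in a few lines of Python. This is one possible AND/OR chain that reaches all $2^{r} + 1$ levels; it is not claimed to be the exact mapping of Table I. It assumes independent uniform bits, for which AND halves the HP and OR maps an HP of h to (1 + h)/2. The helper names `weighted_bit` and `exact_hp` are hypothetical.

```python
from itertools import product


def weighted_bit(k, r, bits):
    """Combine r uniform random bits into one bit with HP = k / 2**r.

    The binary digits of k, taken LSB first, select OR (digit 1) or
    AND (digit 0) at each stage of the chain.
    """
    if k == 2 ** r:
        return 1                       # HP = 1: constant logic one
    x = 0                              # HP = 0: constant logic zero
    for i in range(r):
        if (k >> i) & 1:
            x = x | bits[i]            # OR: h -> (1 + h) / 2
        else:
            x = x & bits[i]            # AND: h -> h / 2
    return x


def exact_hp(k, r):
    """Exact HP of level k, by enumerating every combination of the r bits."""
    ones = sum(weighted_bit(k, r, bits) for bits in product((0, 1), repeat=r))
    return ones / 2 ** r


# Four uniform bit streams (R0..R3) give 17 levels in steps of 1/16.
levels = [exact_hp(k, 4) for k in range(17)]
print(levels)
```

Enumerating all 16 input combinations, rather than sampling, keeps the check exact and confirms that each level is hit with resolution $1/2^{r}$.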
B. Test Procedure
In the previous TP-based tests [9], [10], the WRVG and CUT are not reset between frequency steps, and the inherent noise from the random vectors masks out small TP deviations, leading to poor measurement accuracy. In this paper, major accuracy improvement is achieved through testing each frequency step for exactly the same number of input vectors, and resetting the WRVG and CUT after each frequency step. This results in identical input sequences between frequency steps with completely stationary TP output while no timing failures are occurring, and allows the slightest deviation in TP to be detected. Failure from as little as one clock cycle ... can be detected to provide significantly more accurate delay inferences.
The test procedure is divided into two stages, where a quick forward frequency sweep is first done to obtain an approximate failure frequency of the CUT, and the second stage performs a refined search for a more precise f max. This strategy maximizes measurement precision while maintaining a short overall test time. In stage 1, the test frequency is incremented at a coarse 2-MHz step size, starting from a predefined low frequency (from Altera's timing analysis) that guarantees correct circuit operation. The HP weight of the input test vectors is kept at 0.5 across all input bits, and the test stage terminates as soon as any TP deviation is found in any of the output bits. Next, the frequency steps are refined to 0.1 MHz (≈ 1.6 ps time resolution at 250 MHz) in stage 2, and the test range is narrowed down by setting the start frequency to the failure frequency detected in the first stage. A backward frequency sweep is performed and the f max is defined by the frequency at which the TP of all bits settles to its original steady value. Tests are carried out at room temperature between 20 °C and 25 °C, and each frequency step is tested through $2^{24}$ input and output samples for both stages.
A 24-bit counter is used to collect the full range of transition counts from the CUT.
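The two-stage search described in this subsection can be sketched as follows. This is a minimal illustration, not the platform's control code. `tp_deviates` is a hypothetical callable standing in for one full frequency step (reset the WRVG and CUT, apply the fixed number of vectors, compare TP against its steady value). Frequencies are kept as integers in kHz to avoid floating-point drift, and the frequency limit is an arbitrary assumption.

```python
def find_fmax_khz(tp_deviates, f_start_khz, coarse_khz=2000, fine_khz=100,
                  f_limit_khz=1_000_000):
    """Return the highest tested frequency (kHz) with no TP deviation."""
    # Stage 1: coarse forward sweep until the first TP deviation appears.
    f = f_start_khz
    while not tp_deviates(f):
        f += coarse_khz
        if f > f_limit_khz:
            raise ValueError("no TP deviation found below the frequency limit")
    f_fail = f

    # Stage 2: fine backward sweep from the failure frequency until the
    # TP is back at its steady (failure-free) value.
    f = f_fail
    while tp_deviates(f):
        f -= fine_khz
        if f < f_start_khz:
            raise ValueError("TP deviates even at the start frequency")
    return f


# Toy CUT model (an assumption): timing fails above 231.37 MHz.
fmax = find_fmax_khz(lambda f_khz: f_khz > 231_370, f_start_khz=200_000)
print(fmax / 1000, "MHz")
```

With the toy model, stage 1 stops at 232 MHz and stage 2 steps back in 0.1-MHz increments to the first frequency at which no deviation is seen.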
1) Uniform HP Weight Optimization:
After the initial test procedure, a quick test can be performed to find the optimal HP weight for all input bits. The test with refined frequency range and resolution in stage 2 is repeated for different HP weights from 0.0625 to 0.9375 to search for the minimum f max that indicates the optimal HP weight for a particular CUT. Since we only have to sweep through a narrow range of frequencies for each HP weight, the total test time remains relatively short.
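The weight search amounts to repeating the stage-2 measurement per HP weight and keeping the weight with the lowest f max. A minimal sketch follows, where `measure_fmax` is a hypothetical callable that runs the refined sweep at one HP weight and returns the measured f max.

```python
def optimal_hp_weight(measure_fmax, r=4):
    """Return (best_weight, lowest_fmax) over the weights 1/2**r ... 1 - 1/2**r."""
    weights = [k / 2 ** r for k in range(1, 2 ** r)]   # 0.0625 ... 0.9375 for r = 4
    results = {w: measure_fmax(w) for w in weights}
    best = min(results, key=results.get)               # the minimum f max is the most pessimistic
    return best, results[best]


# Toy response (an assumption): f max in MHz is lowest at HP = 0.625.
best_w, best_f = optimal_hp_weight(lambda w: 230.0 + 10.0 * abs(w - 0.625))
print(best_w, best_f)
```

Choosing the minimum rather than the maximum reflects the reasoning in the text: the lowest f max corresponds to the HP weight that exposes the slowest path.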
VIII. EVALUATION OF THE TEST PLATFORM ON PRACTICAL FPGA DESIGNS
A. Test Setup
The test platform, test procedure, and optimization described in the previous sections are demonstrated on the Cyclone III FPGA with eight different practical designs. The CUTs and their resource usage are summarized in Table II , and they are placed between the launch and capture registers in the test platform in Fig. 20 . To obtain reference measurements for accuracy evaluation, we considered two approaches: 1) a post-placement-and-routing critical-path-isolation method, and 2) an FRD-based exhaustive test [13] .
For approach 1, we used the timing analysis tool in Altera's Quartus II to identify the most critical path in the CUT. The path is then logically isolated from the rest of the circuit by LUT-mask modifications, and the register at the start is converted to a TFF such that the path is independently exercised by a stimulus with TP = 1.0 (see Fig. 23). Placement and routing are kept unchanged along the critical path to preserve its original propagation delay, and the basic TP/HP methods described in Section II are used for obtaining the measurements. The output TP will initially be 1.0 but follows a failure profile similar to Fig. 5(b). To examine the effect of surrounding switching activities on the isolated critical path, we performed the tests in two different conditions: low activity, i.e., all normal inputs of the CUT driven by 0; and high activity, i.e., all normal inputs driven by uniform random patterns. Note that this method is applicable only to circuits where critical paths can be identified and modified, and cannot be used on IP blocks/designs where internal structure is concealed by encryption and/or locked from modifications. This is why we included approach 2 for testing a 9 × 9 embedded multiplier (DSP block) on the Cyclone III, which has a fixed and unknown internal structure but a small enough number of input bits for FRD-based exhaustive testing within manageable test time.
One concern with approach 1 is that the actual order of criticality of the near and most critical paths on a real FPGA could differ from the timing analysis results due to process variation. According to [2], the within-die delay variability consists mainly of stochastic (random) and spatially correlated components, which respectively accounted for ±3.54% 3-sigma and up to 3.66% of delay variation on the Cyclone II FPGAs, whose architecture is similar to that of the Cyclone III.
The effect of the stochastic component is expected to be small for long critical paths consisting of n LUT+IC (interconnect) segments, since the total delay is the sum of delays from all segments and the variation relative to its mean decreases with n. Assuming that the segments are nominally identical and their delays can be represented by normally distributed independent random variables, the standard deviation of a segment (σ seg) relative to its mean (μ seg), i.e., the coefficient of variation (C v-seg), is σ seg /μ seg. Now, for a path with n segments, the coefficient of variation becomes
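Under these assumptions, the sum of n independent segment delays has mean $n\mu_{seg}$ and standard deviation $\sqrt{n}\,\sigma_{seg}$, so the ratio works out as follows. This is a reconstruction from the stated assumptions, and the subscript "path" is notation assumed here.

```latex
C_{v\text{-path}} = \frac{\sqrt{n}\,\sigma_{seg}}{n\,\mu_{seg}}
                  = \frac{C_{v\text{-seg}}}{\sqrt{n}}
```

For example, a path of n = 16 segments would show one quarter of the relative stochastic variation of a single segment.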
Hence, paths with large n are expected to have much smaller stochastic variations relative to a single LUT+IC segment. To tackle the spatially correlated variation as well as any remaining stochastic variation, we isolated the paths with delay within 4% of the worst case critical path delay reported by the timing analysis tool, and selected the one with the highest measured delay as the reference. Note that we did not allow the use of DSP blocks in the Butterworth IIR filter, because they would form part of the critical path, preventing us from isolating it for reference measurements. The FP32 multiplier and FIR filter, on the other hand, do not suffer from such a problem. Thus, their critical paths can be isolated while DSP blocks are in use.
1) Limitation With Accuracy Evaluation:
It is difficult to define and obtain an absolute "ground truth" reference for accuracy comparison, since the TP and reference tests exercise a CUT differently, and the slightest differences in switching activity, temperature, and voltage supply can offset the results between them. The inclusion of the high and low switching activity tests with the isolated critical path was an attempt to alleviate this problem, assuming that local temperature and voltage variations are linked to the CUT's dynamic power consumption and, therefore, related to its switching activity. This gives us a set of upper and lower bounds to evaluate the accuracy of our TP results. Although not an absolute comparison of timing, it allows us to determine whether a TP measurement falls within the bounds, as well as its error from the worst case lower bound.
2) Consistency Evaluation:
Consistency (or repeatability) of results is evaluated by repeating the TP test 300 times for each CUT. The first 50 tests are done to achieve thermal equilibrium in the FPGA, and thus their results are discarded. The remaining 250 results are used to calculate measurement consistency in terms of the 3-sigma variation of the f max as a percentage of the mean value. Only stage 2 of the test procedure described in Section VII-B is used, and each test takes approximately 5 s to run. The consistency results for the CUTs are presented in Table II.
3) Limitation With Complex Circuits:
To accurately test complex state machines, the input vector sequences must allow the CUT to traverse to specific states that activate the most critical path and expose timing failures at the outputs. This reveals the limitation of pseudo-random test vectors, which on one hand provide general patterns suitable for black-boxes but on the other are inefficient in providing exact vector sequences for testing complex state machines. The number of test samples (N) can be raised to increase the probability of covering the right sequence, but there is an upper limit on N beyond which test time becomes impractical. This is an inherent limitation of all random-vector-based test methods and is not specific to the TP method. Note that the proposed weighted random optimization is also effective with state machines, and should improve accuracy while maintaining practical test time in most cases. Generally, it is difficult to obtain the optimal N for black-boxes, since the critical paths' delay distribution and their observability are unknown. Nonetheless, statistical analysis or Monte Carlo simulation (Fig. 19) of known corner cases of designs could serve as a guideline for N values.
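The consistency figure defined under Consistency Evaluation above (3-sigma variation as a percentage of the mean, after discarding the warm-up runs) can be computed as in the sketch below. The use of the sample standard deviation is an assumption, since the text does not say which estimator was used, and the function name `three_sigma_percent` is hypothetical.

```python
import statistics


def three_sigma_percent(fmax_runs, warmup=50):
    """3-sigma spread of repeated f max results, in percent of their mean."""
    kept = fmax_runs[warmup:]            # drop the thermal warm-up runs
    mean = statistics.fmean(kept)
    sigma = statistics.stdev(kept)       # sample standard deviation (assumed)
    return 100.0 * 3.0 * sigma / mean


# Toy data (an assumption): 50 warm-up runs followed by 250 retained runs.
runs = [0.0] * 50 + [100.0, 102.0] * 125
print(round(three_sigma_percent(runs), 4))
```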
B. Test Results
The TP and HP f max measurements of the eight CUTs are presented in Table II. As can be seen from Table II, the TP measurements of the basic floating-point units (multiplier, adder, and divider) all lie within the upper and lower bounds defined by the high and low switching activity cases of the isolated critical path. The lowest TP accuracy is observed in the FP32 divider (Fig. 25) with 3.6% error from the worst case lower bound. This is not surprising, as the FP32 divider has the highest resource usage and pipeline depth (latency) amongst the CUTs, and hence is expected to have the lowest TP sensitivity to timing failures. Despite being the worst case observed, the error in terms of frequency (or delay) remains relatively small. Also, in Fig. 25(a), the timing of the isolated critical path in the FP32 divider is hugely impacted by high surrounding switching activity, with a significant f max reduction from 238.05 to 224.11 MHz. The shape of the TP profile suggests that the path delay varies with the test clock frequency under high switching activity, resulting in failure slopes with unpredictable gradient as opposed to the almost linear slopes in the low switching activity case. The exact cause of the differences is unclear, but it could be due to signal crosstalk, increased localized heating from dynamic power dissipation, and/or voltage drop due to increased load on the power supply.
For the IIR and FIR filters in Figs. 26 and 27, the results appear to be highly accurate, yet intriguing. The TP method yielded slightly more pessimistic f max measurements than the isolated critical paths even in the high activity case, showing small but negative error percentages in Table II. Interestingly, the FP32 square-root circuit also showed similar results, and the one similarity that all three CUTs share is that they all contain a significant portion of adder trees in their implementations. Moreover, the critical and near-critical paths in them all traverse the adder trees. Recall from Section VI-A, Fig. 18(c) and (g), that the XOR trees yielded exceptional measurement accuracy due to their high levels of glitches, and, most importantly, the sum logic in an adder is essentially built from XORs. The critical paths in the three CUTs are thus likely to have high levels of glitches during the TP test, yielding more accurate measurements. An isolated critical path, on the other hand, would have no glitches at all and only toggle once every clock cycle by the TFF, resulting in a slightly more optimistic (higher) f max than the TP method.
In terms of state-machine testing, both the IIR filter and the ITC'99 Benchmark (b12) fall within this category, as they contain sequential feedback. Although the TP result of the ITC'99 circuit is slightly more optimistic than the low switching activity reference, the worst case error remains at a reasonable 1.26%.
Another interesting observation in Table II is that TP and HP almost always have the same f max results. This agrees with the measurements of the corner cases in Section VI, where the accuracy of TP is approximately the same as that of HP, especially when the level of glitch activity is high. The optimal HP weights we obtained from the tests for each CUT (Table II) turned out to be mostly at 0.5. Only two circuits (9 × 9 embedded multiplier and FP32 divider) benefited from having nonuniform HP weights at 0.3125 and 0.875, respectively. This is in line with the corner case results in Section VI-A, which suggest that most circuits would yield good measurement accuracy at 0.5 HP weight on the Cyclone III FPGA unless their critical paths are dominated by many-input AND trees.
Finally, all the CUTs showed remarkable test consistency over 250 consecutive tests. Their 3-sigma variations are all within ±0.7%, and the worst case is observed on the 9 × 9 embedded multiplier at ±0.68%. The timing resolution for the tests ranges from 0.3 to 8.0 ps, since the frequency steps are fixed and higher f max results in better resolution [see (3)].
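The dependence of timing resolution on f max can be illustrated numerically. The sketch below takes the resolution to be the difference between the clock periods of two adjacent frequency steps, which is an assumption about how the quantity is defined rather than a restatement of (3). It does reproduce the ≈ 1.6 ps quoted in Section VII-B for a 0.1-MHz step at 250 MHz.

```python
def time_resolution_ps(f_mhz, step_mhz=0.1):
    """Clock period change (ps) between f and f + step, both in MHz."""
    period_ps = 1e6 / f_mhz                    # 1 / MHz = microseconds = 1e6 ps
    next_period_ps = 1e6 / (f_mhz + step_mhz)
    return period_ps - next_period_ps


for f in (125.0, 250.0, 500.0):
    print(f"{f:6.1f} MHz -> {time_resolution_ps(f):.3f} ps per 0.1-MHz step")
```

Because the period change scales roughly as the step size divided by the square of the frequency, faster circuits get finer resolution from the same fixed step.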
IX. TEST TIME AND AREA ESTIMATION
The test time of the TP method (T test) with N samples and N freq frequency steps is expressed by
$$T_{test} = \sum_{i=1}^{N_{freq}} \frac{N}{f_i} \qquad (10)$$
where f i is the clock frequency at the i th step. ... (10). Therefore, given f scan = 10 MHz, a scan-chain-based test for the FP32 divider takes 18.6 s + $2^{24}$ × 250 × 32/10 MHz = 13440.4 s, which is over 720 times slower than the TP method. For the FRD-based SIC exhaustive test used earlier, the test time per path is described by T test. Therefore, the test of a combinatorial circuit with n inputs and m outputs would be m × n × $2^{n-1}$ times [15] slower than using the TP method.
The resource usage of the test platform depends on four factors: the number of input bits (n), the number of output bits (m), the TP counter bit width (k), and the order of HP weight levels (r), where $2^{r} + 1$ gives the actual number of HP levels. The detailed resource estimation for each test circuit component is presented in Table III. The general resource usage is quantified in terms of the number of LEs containing a four-input LUT and register, or ALMs with a fracturable eight-input adaptive LUT (ALUT) and two registers, as found in more advanced FPGA architectures [30].
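The test-time arithmetic quoted above can be checked with a short script. It reads the TP test time as the sum of N / f_i over the frequency steps, and the scan-chain overhead as N × (number of steps) × (bits shifted out per sample) / f scan, which is how the numeric expression in the text appears to be built. Both readings are assumptions, as is the flat 225-MHz sweep used to approximate the 18.6-s TP test.

```python
def tp_test_time_s(n_samples, freqs_hz):
    """TP test time: N clock cycles at each frequency step."""
    return sum(n_samples / f for f in freqs_hz)


def scan_readout_time_s(n_samples, n_steps, bits_per_sample, f_scan_hz):
    """Extra time to shift every captured sample out through a scan chain."""
    return n_samples * n_steps * bits_per_sample / f_scan_hz


n = 2 ** 24
t_tp = tp_test_time_s(n, [225e6] * 250)              # hypothetical flat sweep
t_scan_extra = scan_readout_time_s(n, 250, 32, 10e6)
slowdown = (18.6 + t_scan_extra) / 18.6              # 18.6 s is the figure from the text

print(f"TP test time     ~ {t_tp:.2f} s")
print(f"scan-chain total ~ {18.6 + t_scan_extra:.1f} s")
print(f"slowdown         ~ {slowdown:.0f}x")
```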
The test circuit can be optimized for low area overhead by sharing the same TP counter among output bits using an m-to-1 multiplexer, or optimized for short test time using multiple TP counters in parallel. Table III shows the two extreme cases with fully shared or parallelized TP counters. The resource usage of the multiplexer in the sharing case is estimated based on the actual optimized synthesis results from Altera's Quartus II design tool. For example, the test platform used for the previous test cases with 17 HP weight levels (r = 4), n = 2 × 32, m = 32, and 24-b TP counters (k = 24) requires 1248 LEs (parallel counters) or 529 LEs (shared counters) to implement on the Cyclone III FPGA.
X. CONCLUSION
This paper presented a detailed analysis and practical demonstration of the TP method. While TP is as accurate as HP measurements, its test circuit is more robust and flexible, and requires less measurement hardware. Therefore, TP is the ideal method for a self-contained timing measurement platform. By analyzing the mechanisms that cause TP deviations when timing failures occur in circuits, timing uncertainties such as clock jitter and flip-flop metastability were found to contribute to the method's effectiveness and accuracy. Through simulation and measurements of corner cases, the effect of logic glitches and random input vectors with different HP weights was examined. This provides insight into how further accuracy improvement is possible via tuning input HP weights for testing different designs, as well as the type of circuits that are more suitable for the TP method. Although a universal measurement platform for any black-box circuit design is difficult to achieve, and evaluating accuracies for all circuit structures is impossible, the TP method has shown good measurement accuracy across common functional circuit modules (arithmetic, filter, and state-machine circuits) that are found in most complex modular designs. Accuracy remains largely within 2%, except for the FP32 divider which is the largest circuit tested, and gave a worst case error of 3.6%.
Further accuracy improvement may also be possible via bit-specific input HP weight tuning using heuristics. However, the current results show sufficient accuracy, and a small timing guard band (timing margin) can be added to the measurements to ensure reliability.
The TP method provides an attractive solution for measuring and characterizing the timing of designs, especially for signal processing and arithmetically intensive tasks, allowing them to run with minimal timing safety margin (closest to f max) while maintaining reliability. Also, it is especially suited to programmable architectures, such as FPGAs, where the output of virtually any register can be routed to the TP test circuit (transition counter). Moreover, transition counters can be built in a compact form at the transistor level and included as dedicated hardware in any VLSI design to provide an accurate yet compact built-in test platform. One interesting direction would be to incorporate the TP measurement platform in design tools such that the resultant hardware design in ASIC, FPGA, or other programmable architectures can be efficiently tested for its timing performance. Such tools would ease the burden of design for testability on designers, significantly reduce design time, and shorten the testing and prototyping cycles.
