We present a general technique for measuring the propagation delay of the internal wires of FPGA chips. The measurement is based on comparing the operating frequencies of two ring oscillators that differ only in the structure under test, which is either included in the loop or left out. Experimental results are presented for a device of the Xilinx XC4000 family.
Differential delay measures
The period T of a ring oscillator (RO) is twice the propagation delay of the loop. In principle, inserting a new element (such as a wire, a non-inverting component, or an arbitrary non-inverting path) in the loop of an RO adds a contribution d to the propagation delay of the loop, causing the oscillation period to become T'=T+2d. Hence, the actual delay of the additional element can be obtained from the measurements of T' and T by computing d=(T'-T)/2.
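The relation d=(T'-T)/2 can be illustrated with a minimal sketch. The function name and the numeric periods below are hypothetical and are not taken from the measurements; only the formula comes from the text.

```python
def sut_delay(period_with_sut: float, period_reference: float) -> float:
    """Return d = (T' - T) / 2, in the same time unit as the inputs.

    The factor 2 reflects that a full oscillation period covers two
    traversals of the loop, so the added element contributes 2d to T'.
    """
    return (period_with_sut - period_reference) / 2.0


# Hypothetical periods in ns (illustrative values only).
T_reference = 40.0   # RO without the structure under test
T_with_sut = 41.5    # same RO with the structure under test in the loop

print(sut_delay(T_with_sut, T_reference))  # 0.75
```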
Measuring internal propagation delays reduces to two subtasks: i) implementing ring oscillators that differ only in the structure under test, and ii) measuring the period of a square waveform. We address the two tasks in the following subsections.
Figure 1 shows the schematic of an RO including a structure under test (SUT) composed of a net and a non-inverting CMOS driver (represented in bold lines). The propagation delay of the SUT is measured from the difference between the period of the ring oscillator of Fig. 1 and the period of the same RO with inverter 2 directly driven by inverter 1. Notice, however, that adding the SUT between inverters 1 and 2 changes the overall oscillation period not only through the propagation delay of the SUT, but also through the different topology of the net (n1) driven by inverter 1. In fact, net n1 in Fig. 1 is different from a net directly connecting inverters 1 and 2.
Implementation of the ring oscillators
The key issue with the proposed methodology is the implementation of two ROs that differ only in the SUT. We addressed this issue as shown in Fig. 2, where the shaded square blocks represent configurable logic blocks (CLBs). In the reference RO (Fig. 2.a) each CLB implements a single inverter, contiguous inverters are mapped on contiguous CLBs lying on the same column, and the same routing strategy is used to connect each CLB pair.
Since in most device families each CLB may implement two or more logic functions, the non-inverting driver of the SUT may be implemented in the same CLB used to implement inverter 2. The modified CLB is shown in Fig. 2.b. This implementation minimizes the undesired side effects of inserting the SUT. In fact, net n1 driven by inverter 1 needs only a marginal change to reach a different input pin of the same CLB, while the implementation of inverter 2 and its load does not change at all. This strategy can be used to add to the RO arbitrary non-inverting paths with unknown propagation delays, possibly including many different wire segments and logic blocks.
Measurement technique
Measuring a period is straightforward. It can be done either by external instruments or by internal circuitry. In the first case, the square waveform generated by the ring oscillator needs to be made available at an output pin in order to be monitored by an oscilloscope. In the second case, a reference clock signal has to be provided to the device. Two internal counters can then be used to compare the unknown square waveform with the reference clock signal. When the counter driven by the external clock reaches a given value N, the counter driven by the internal waveform is stopped and its count Nx is compared with N. The unknown period Tx is then computed as Tx=T N/Nx, where T here denotes the period of the reference clock.
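The counter-based computation can be sketched as follows. The function names, the 25 ns reference period, and the counter values are hypothetical; the only relation taken from the text is Tx=T N/Nx. The second helper reflects the usual one-count uncertainty of a gated counter, which is a general property of this kind of measurement rather than a figure reported here.

```python
def period_from_counts(t_ref: float, n_ref: int, n_x: int) -> float:
    """Return Tx = T * N / Nx.

    t_ref : period T of the reference clock
    n_ref : value N reached by the counter driven by the reference clock
    n_x   : count Nx of the counter driven by the unknown waveform
    """
    if n_x <= 0:
        raise ValueError("the unknown waveform must be counted at least once")
    return t_ref * n_ref / n_x


def relative_quantization_error(n_x: int) -> float:
    """Relative uncertainty caused by a one-count error in Nx."""
    return 1.0 / n_x


# Hypothetical example: 25 ns reference clock, gate of 100000 reference cycles.
t_x = period_from_counts(t_ref=25.0, n_ref=100_000, n_x=62_500)
print(t_x)                                  # 40.0 (ns)
print(relative_quantization_error(62_500))  # 1.6e-05
```

A longer gate (larger N) increases Nx and therefore reduces the one-count uncertainty.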
Internal measurements are attractive since they require no external equipment other than a stable clock signal, which is usually available in any system. Moreover, in a synchronous system timing constraints are referred to the clock cycle, so that using the clock cycle as a reference for delay measurements is a natural choice.
For our experiments, however, we used an external oscilloscope since we were mainly interested in investigating the performance of internal lines, rather than in verifying timing constraints of a given design.
Time periods of tens of ns were measured with a standard deviation of about 0.1ns. In order to improve accuracy, we implemented on the same FPGA a 16-stage frequency divider providing well-shaped square waveforms with periods of a few ms, which were measured with a standard deviation of tens of ns. The actual period was then obtained by dividing the measured period by 2^16, thus obtaining a standard deviation of less than 1ps.
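The arithmetic behind the accuracy gain can be checked with a short sketch. The 2 ms divided period and the 50 ns standard deviation are hypothetical values chosen within the ranges quoted above ("a few ms" and "tens of ns"); the helper names are illustrative.

```python
DIVIDER_STAGES = 16  # 16-stage divider: the period is multiplied by 2**16


def undivided_period(divided_period_s: float, stages: int = DIVIDER_STAGES) -> float:
    """Recover the RO period from the period measured after the divider."""
    return divided_period_s / (2 ** stages)


def undivided_sigma(divided_sigma_s: float, stages: int = DIVIDER_STAGES) -> float:
    """Scale the measurement standard deviation by the same factor."""
    return divided_sigma_s / (2 ** stages)


# Hypothetical readings on the divided waveform.
measured_period_s = 2.0e-3   # 2 ms
measured_sigma_s = 50e-9     # 50 ns

print(undivided_period(measured_period_s))  # about 3.05e-08 s, i.e. tens of ns
print(undivided_sigma(measured_sigma_s))    # about 7.6e-13 s, i.e. below 1 ps
```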
Experimental results
We ran our experiments on a device of the XC4000X family [5] from Xilinx (namely, the XC4005XL). Ring oscillators and SUTs were implemented using the FPGA Editor [4] distributed with the Xilinx Foundation Series 2.1i [3], which enables accurate control of placement and routing and provides single-path delay estimates.
We performed systematic experiments (based on the SUT template shown in Fig. 3) to study the dependence of propagation delays on the type and length of wire segments, on the number of connectors and switches in the path, and on the number and position of fanout branches adding concentrated loads to the path. The propagation delay of each SUT was measured with 1ps accuracy as described in the previous section. Measurements were then compared with delay estimates provided by the FPGA Editor. The same structure was implemented in different locations to measure intra-chip variance, and each measurement was performed at different temperatures to evaluate the temperature coefficient (TC). Experimental results are reported in the following subsections.
Figure 4 shows the implementation of an 11-stage RO used to evaluate the incremental propagation delay due to single wire segments of a given type. Gray boxes represent CLBs. The input pin provides the EN signal that feeds the OR (i.e., the input NOR and the first inverter of Figure 1) implemented by the first CLB. All other CLBs implement a single inverter each. The non-inverting SUT is connected to the CLB that implements the sixth inverter in the loop.
Effects of wire types and length
We consider wire segments (denoted by 's' in Figure 4) that cover the distance between two CLB columns or rows. The SUT of Fig. 4.a contains 2 segments of long wires, while the SUT of Fig. 4.b contains 4 (i.e., 2 more) segments of the same wires. The incremental delay due to the additional segments is obtained by comparing the time periods of the two ROs. We implemented parameterized SUTs including from 2 to 12 segments of each type of wire, and we obtained the propagation delays from the corresponding oscillation periods.
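One possible way to turn a series of such measurements into a per-segment figure is a least-squares slope of delay against segment count. This is a sketch under assumptions: the text does not state how the incremental delay was extracted, the linear model need not hold for every wire type, and the delay values below are synthetic placeholders, not measured data.

```python
def incremental_delay_per_segment(segment_counts, delays_ns):
    """Least-squares slope of delay versus number of segments (ns/segment)."""
    n = len(segment_counts)
    if n < 2 or n != len(delays_ns):
        raise ValueError("need at least two matching (count, delay) pairs")
    mean_x = sum(segment_counts) / n
    mean_y = sum(delays_ns) / n
    sxx = sum((x - mean_x) ** 2 for x in segment_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(segment_counts, delays_ns))
    return sxy / sxx


# Synthetic example: SUTs with 2 to 12 segments and a made-up linear trend.
counts = [2, 4, 6, 8, 10, 12]
delays = [0.5 + 0.05 * k for k in counts]  # placeholder values in ns

print(incremental_delay_per_segment(counts, delays))
```

With only two SUTs, the slope reduces to the difference of the two delays divided by the number of added segments.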
For global lines, the measured incremental delay added by a new wire segment is about 0.053ns when no buffers are inserted, while it is about 0.720ns for buffered segments. This can be explained by considering a long wire with a buffer every 4 segments. When a buffer is included in the SUT, the propagation delay it introduces is mainly due to the load capacitance it has to drive, which depends on the length and size of all driven segments, even those not included in the SUT. On the other hand, when the SUT is extended to include wire segments whose driving buffers are already part of the SUT, the incremental delay is only due to the additional flight time of the signals.
Similar results were obtained for local lines (single/double/quad), with the only difference that the incremental contribution of passive wire segments was larger (above 0.1ns) and the capacitive contribution of the entire wire was lower than that of global lines.
Effects of fanout branches
In general, fanout branches act as additional loads that increase the inertia of signals propagating along the main path of the SUT.
The effects of fanout branches were measured by adding branches of different type and length in different positions of the SUT, as schematically depicted in Fig. 3. Three kinds of behavior were observed. The incremental delay introduced by unbuffered branches (such as segments of long and local wires) depends on their number, length and load. The incremental delay introduced by buffered branches (such as pin wires) is independent of their length and load. The incremental delay is null if the branching point is right after a buffer that is already driven by the SUT.
Intra-chip variations
Measurement stability and repeatability were tested by repeating each measurement 20 times. The measured standard deviation was below the value of 1ps computed in Section 2.2 from the data sheet of the measurement equipment.
Intra-chip variations, measured by implementing the same SUT in different locations, were within 5%. Further experiments will be performed to evaluate inter-chip variation.
Device-specific model fitting
We compared measured results with the net-specific delay estimates provided by FPGA Editor [4]. Surprisingly, measured delays were much lower than the corresponding estimates (below 50% of them). This is because CAD tools provide over-conservative post-layout estimates in order to account for worst-case parameters and operating conditions. On the other hand, the estimates were consistent with the measurements, in that the ratio between measured and estimated delays was almost the same for all experiments. This suggests introducing simple fitting coefficients to adapt the general delay models to a specific target device. For our device, we obtained a scaling coefficient SC=0.43.
To take into account operating conditions, a temperature coefficient (TC=0.2%) was also characterized by performing delay measures at temperatures ranging from 20°C to 70°C.
The estimates obtained by applying the scaling factor and the temperature coefficient to the delay models of FPGA Editor provided a root mean square error lower than 3% against measurements.
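The fitting step can be sketched as below. Several points are assumptions rather than the procedure used here: the scaling coefficient is fitted by least squares through the origin, the temperature correction is modeled as a linear factor with TC read as a fractional change per degree Celsius around a reference temperature, and all names, default values, and delay values are hypothetical.

```python
import math


def fit_scaling_coefficient(estimates, measurements):
    """Least-squares SC such that measurement is approximately SC * estimate."""
    num = sum(e * m for e, m in zip(estimates, measurements))
    den = sum(e * e for e in estimates)
    return num / den


def corrected_estimate(estimate, sc, tc_per_degc=0.0,
                       temp_c=25.0, ref_temp_c=25.0):
    """Apply SC and an assumed linear temperature correction to an estimate."""
    return sc * estimate * (1.0 + tc_per_degc * (temp_c - ref_temp_c))


def rms_relative_error(predictions, measurements):
    """Root mean square of the relative errors (p - m) / m."""
    squared = [((p - m) / m) ** 2 for p, m in zip(predictions, measurements)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic example: tool estimates (ns) and made-up measurements.
estimates = [1.0, 2.0, 4.0]
measurements = [0.44, 0.85, 1.73]  # placeholder values, not measured data

sc = fit_scaling_coefficient(estimates, measurements)
predictions = [corrected_estimate(e, sc) for e in estimates]
print(sc, rms_relative_error(predictions, measurements))
```

Fitting through the origin keeps the model to a single multiplicative coefficient, in line with the observation that the ratio between measured and estimated delays is nearly constant.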
Conclusions
Path delays within commercial FPGAs can be measured with 1ps resolution by implementing two ring oscillators that differ only in the path of interest and by comparing their operating frequencies.
We systematically applied this approach to study the delay introduced by the different types of wire segments available on a Xilinx XC4005XL device. In general, wiring delay is dominated by the effect of the wiring capacitance on the transition time of the driving stage.
Measured delays were used to validate the estimates provided by Xilinx FPGA Editor. Since the delay estimates proved to be systematically over-conservative, we characterized a scaling coefficient and a temperature coefficient to adapt the conservative estimates to the target device and operating conditions. The correction introduced by the two coefficients reduces the average estimate error from above 100% to below 3%, thus enabling more aggressive design optimizations.
