Abstract-The high cost of qualifying library standard cells on silicon wafers limits the number of test circuits that can be placed on a test chip. This paper proposes a technique for sharing common load circuits among test circuits to reduce silicon area. With load sharing enabled, the number of transistors needed for the common load is reduced significantly. Results show up to 80% reduction in silicon area due to load area reduction.
I. INTRODUCTION
Correlating the electrical parameter models of library standard cells with actual silicon data is essential for delivering a quality standard cell library [1]. Early qualification of the standard cells' characterized electrical models (i.e. timing, power, noise, etc.) on silicon wafer has become compulsory to guarantee that Application Specific Integrated Circuit (ASIC) designs function correctly in silicon under all process, voltage and temperature (PVT) conditions [2, 3]. Ideally, each individual library cell needs to be qualified to ensure layout quality (i.e. freedom from design rule violations), functionality, and the accuracy of the various models describing the cell, including timing and power, as well as the applicability of the cell in ASIC synthesis and place-and-route tools [4, 5]. This validation needs to be done post-silicon to ensure that the silicon matches the cell models provided in the library [6, 7].
Benchmark circuits in the form of digital blocks or Test Element Group (TEG) circuits are usually designed (or synthesized, placed and routed in the case of digital blocks), followed by pre- and post-silicon measurement at different PVT corners [8]. A TEG circuit is a special circuit structure designed to validate specific library cells, namely combinational, sequential and complex cells including memories and I/O. TEG circuits are favorable for validating the functional and electrical models of individual library cells because of the accuracy of the measurement data and the ability to test the entire set of gates available in the library. Three types of TEG circuits used for library cell qualification are the ring oscillator, dummy path and delay chain circuits [4, 5, 9]. In general, the output of these circuit structures is connected to the I/O pins for timing delay measurement.
A ring oscillator circuit is composed of an odd number of devices under test (DUTs) with inverter-like function, so that the circuit's output oscillates between two voltage levels representing true and false. This circuit is commonly used to electrically characterize standard cells [10] [11] [12]. A delay path circuit is composed of two paths, one with a chain of cells and the other without (i.e. interconnect only); the difference in delay between the two paths is used to calculate the individual cell delay. The main drawback of both the ring oscillator and delay path circuits is that the delay measurement is influenced by the external test environment because of the direct connection to the I/O pins, and that a high-end oscilloscope is required to measure these fast signals [13] [14] [15] [16]. The delay chain circuit is a slightly modified version of the delay path circuit and uses the principle of measuring pulse width to calculate the cell delay in the chain [17] [18] [19]. This method gives accurate cell delay and power measurements because it is less influenced by the external test environment, and it relaxes the requirement for a high-end oscilloscope.
Validating libraries with a large number of cells (i.e. various cell types and cell drive strengths) using TEG circuits requires a substantial number of TEG circuits to be designed. With the requirement to validate the effect of fanout/load capacitance on each device under test (DUT) in the TEG circuits, an even larger silicon area is required to implement the fanout circuits. Note that a similar range of load capacitance values is defined in the library for each cell; thus, in most cases the same load circuits are designed for all individual TEG circuits to reduce design effort.
In this paper, an area-efficient delay chain circuit architecture is proposed to qualify library standard cells. The proposed architecture shares a common fanout/load between the TEG circuits, thus reducing the number of transistors implementing the load. Section II covers the design and implementation of the proposed method. Section III shows the results and analysis of the total area reduction. Finally, Section IV concludes the paper.
II. LOAD SHARING DESIGN ARCHITECTURE

A. Design Implementation
Fig. 1 shows a typical delay chain circuit, which consists of a chain of N basic circuit sets, an XOR gate and a series of buffers. Each basic set contains a DUT (i.e. an individual library cell such as INV, NAND, NOR) followed by an inverting buffer (usually a minimum-size inverter) to avoid signal alternation. The fanout/load for the DUT is inserted between the DUT and its buffer; it can be a single load or a configurable load built from pure capacitance or library cells. In general, the chain output is fed to an XOR gate along with a direct connection from the circuit input to generate a pulse whose width corresponds to the total delay of the chain with N basic circuit sets. In most cases, a similar load design is applied to the other delay chains because the library defines a fixed range of loads and reuse reduces design effort, which makes the load common to the delay chains. Thus, by sharing the load among different chains, the number of loads can be reduced, leading to a reduction in total test chip area.
Fig. 2 shows the architecture of the multi-delay chain with the load sharing technique. The DUT of each delay chain is connected to the load through a switch, allowing a single chain to access the load at a time. The control signals of the switches in the same delay chain are tied together, while those of different chains are connected either to individual I/O pins or to a decoder that minimizes the number of I/O pins. In this example, a configurable load with control signal sel_L is employed.
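As an illustration of the pulse-generation principle of the delay chain and of the one-chain-at-a-time access to the shared load, the following is a minimal behavioral sketch. It is an idealized model with one sample per picosecond; the function names `xor_pulse` and `one_hot_select` and all numeric values are hypothetical and are not taken from the test chip.

```python
# Illustrative sketch only: an idealized, integer-picosecond model of the
# XOR pulse generator and of a one-hot chain-select decoder.
# All names and numbers are hypothetical, not taken from the paper.

def xor_pulse(t_edge_ps, chain_delay_ps, t_end_ps):
    """Sample the XOR output once per picosecond for a single input edge.

    The XOR compares the direct input with the chain output, which is
    modeled as the same edge delayed by chain_delay_ps.
    """
    samples = []
    for t in range(t_end_ps):
        direct = t >= t_edge_ps                     # direct connection from the input
        delayed = t >= t_edge_ps + chain_delay_ps   # output of the N-set chain
        samples.append(direct != delayed)           # XOR of the two signals
    return samples


def one_hot_select(chain_index, n_chains):
    """Decoder model: close the switches of exactly one chain."""
    if not 0 <= chain_index < n_chains:
        raise ValueError("chain_index out of range")
    return [i == chain_index for i in range(n_chains)]


pulse = xor_pulse(t_edge_ps=100, chain_delay_ps=750, t_end_ps=2000)
pulse_width_ps = sum(pulse)       # number of 1 ps samples where the XOR is high
switches = one_hot_select(1, 4)   # only chain 1 is connected to the shared load
```

In this idealized model the pulse width equals the chain delay, which is the quantity measured at the I/O pin; a decoder of this kind needs only enough select pins to encode the chain index.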
Note that the load seen by the DUT increases with the number of chains sharing the same load, because of the additional wiring that connects the load to the multiple delay chains and the additional switches. Two wire segments are distinguished: WS_1, which connects the DUT, switch and inverting buffer, and WS_2, which links the switch of a specific chain in the same basic circuit set to the common load. The delay chains are arranged such that the distance between two switches in the same basic circuit set, and from switch to load, is the same length, WS_2. The total wire that links all the switches in the same basic circuit set to the load is named the switch-load network, SNL, while WS_3 is the wire segment that connects this network to the common load. Thus, the wire capacitance, C_W, seen by the DUT in any chain of an N_C-delay-chain architecture can be given by:
where C_M is the metal capacitance per mm length. The total load capacitance, C_T, seen by the DUT of any chain, considering the load capacitance C_L, the capacitance of the switch C_S and the input capacitance of the following buffer C_B, can be described by:
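To make the dependence of the load on the number of chains concrete, the sketch below evaluates one possible form of the two capacitance expressions. The exact expressions are not available here, so the form used is an assumption: the switch-load network is taken to contribute N_C segments of length WS_2, and a single switch capacitance is counted. The function names and numeric values are hypothetical.

```python
# Sketch under stated assumptions; not the paper's exact equations.
# Assumption 1: the switch-load network adds n_chains segments of length ws2.
# Assumption 2: only one switch capacitance c_s is seen by the DUT.

def wire_capacitance(c_m, ws1, ws2, ws3, n_chains):
    """Wire capacitance seen by a DUT; c_m in pF/mm, lengths in mm."""
    total_length_mm = ws1 + n_chains * ws2 + ws3
    return c_m * total_length_mm


def total_load_capacitance(c_l, c_s, c_b, c_w):
    """Total load: load cell, switch, buffer input and wire capacitance (pF)."""
    return c_l + c_s + c_b + c_w


# Hypothetical values, chosen only to show the trend with n_chains.
c_w_2 = wire_capacitance(c_m=0.2, ws1=0.01, ws2=0.02, ws3=0.03, n_chains=2)
c_w_4 = wire_capacitance(c_m=0.2, ws1=0.01, ws2=0.02, ws3=0.03, n_chains=4)
c_t_2 = total_load_capacitance(c_l=0.05, c_s=0.002, c_b=0.003, c_w=c_w_2)
```

Under these assumptions the wire term grows linearly with the number of chains, which is why the extra parasitic load has to be accounted for when comparing measured delays against library values.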
To measure the delay of each cell, the pulse width, D_T, which corresponds to the total delay of N basic circuit sets, is measured at the I/O pins. The delay of each basic circuit set consists of the DUT delay, D_DUT, and the inverting buffer delay, D_IBUFF. Thus, the individual DUT cell delay can be calculated using Eq. 3.
Output pulses are generated at both the rising and falling input transitions. For a positive unate DUT, a rising input transition corresponds to the cell rise delay, while for a negative unate DUT it results in the cell fall delay.
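The delay extraction described above can be sketched as follows, assuming that the measured pulse width is the sum of N identical set delays, i.e. D_T = N (D_DUT + D_IBUFF), and that the inverting buffer delay is known separately. The function name and the numbers are hypothetical.

```python
# Sketch assuming D_T = N * (D_DUT + D_IBUFF) with identical sets.
# d_ibuff is assumed to be known separately; values are hypothetical.

def dut_delay(pulse_width, n_sets, d_ibuff):
    """Return the per-cell DUT delay, in the same time unit as the inputs."""
    if n_sets <= 0:
        raise ValueError("n_sets must be positive")
    per_set_delay = pulse_width / n_sets   # delay of one DUT + buffer pair
    return per_set_delay - d_ibuff


# Example: a 7500 ps pulse over N = 100 sets with a 30 ps inverting buffer.
d_dut_ps = dut_delay(pulse_width=7500.0, n_sets=100, d_ibuff=30.0)
```

Because the pulse width is divided by N, a measurement uncertainty on the pulse contributes only 1/N of its value to the per-cell delay.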
B. Area Estimation
The total area of a single delay chain, A_S-DC (i.e. Fig. 1), can be given as:
where A_Set, A_XOR and A_OBUFF are the areas of the basic circuit set, the XOR gate and the output buffer driving the I/O pad respectively. A_Set can be given as:
where A_DUT, A_L and A_IBUFF are the areas of the DUT, the load and the inverting buffer respectively.
Conventionally, for a multi-delay chain built from N_C single delay chains, the total area can be given as:
With the proposed load sharing delay chain, the total area is given by:
where A_SW is the area of the switch.
In the load sharing technique for multi-delay chains, all chains are controlled by one decoder; relative to the number of sets N, its size is negligible. Apart from the overhead incurred by the implemented switches, the load does not scale up with the number of chains, thus an area saving can be achieved. The area saving, A_S, is given by:
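The area bookkeeping of this subsection can be sketched as below. The expressions are reconstructed from the definitions in the text and are therefore assumptions: a single chain is taken as N sets plus one XOR gate and one output buffer, the conventional design as N_C such chains, and the shared design as N_C chains with switches plus one load per set, with the decoder neglected. The saving is expressed here as a fraction of the conventional area, which is also an assumption. All area values are hypothetical unit areas.

```python
# Sketch of the area model, reconstructed from the definitions in the text.
# The exact equations are assumptions; areas are hypothetical unit areas.

def area_set(a_dut, a_load, a_ibuff):
    """Area of one basic circuit set: DUT, its load and the inverting buffer."""
    return a_dut + a_load + a_ibuff


def area_single_chain(n_sets, a_dut, a_load, a_ibuff, a_xor, a_obuff):
    """Area of one delay chain with its own load in every set."""
    return n_sets * area_set(a_dut, a_load, a_ibuff) + a_xor + a_obuff


def area_conventional(n_chains, n_sets, a_dut, a_load, a_ibuff, a_xor, a_obuff):
    """n_chains independent single delay chains."""
    return n_chains * area_single_chain(n_sets, a_dut, a_load, a_ibuff,
                                        a_xor, a_obuff)


def area_shared(n_chains, n_sets, a_dut, a_load, a_ibuff, a_sw, a_xor, a_obuff):
    """Load-sharing design: one switch per set per chain, one load per set.

    The decoder area is neglected, as stated in the text.
    """
    per_chain = n_sets * (a_dut + a_sw + a_ibuff) + a_xor + a_obuff
    return n_chains * per_chain + n_sets * a_load


def area_saving(conventional, shared):
    """Saving as a fraction of the conventional area (assumed definition)."""
    return 1.0 - shared / conventional


params = dict(n_sets=100, a_dut=1, a_load=8, a_ibuff=1, a_xor=3, a_obuff=50)
conv = area_conventional(n_chains=2, **params)
shar = area_shared(n_chains=2, a_sw=1, **params)
saving = area_saving(conv, shar)
```

With these unit areas the conventional design occupies 2106 units and the shared design 1506 units, so the saving comes entirely from instantiating the load once per set instead of once per set per chain.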
III. RESULTS AND DISCUSSION
A. Timing Validation Results
Multiple test chips have been designed, based on the standard single delay chain and on the proposed load sharing technique for multi-delay chains. Two chains, N_C = 2, were selected in the design since our aim is to validate the timing data (cell rise delay and cell fall delay) of library cells based on the commercial Silterra 130nm process standard cell library. Each delay chain was constructed with N = 100 basic circuit sets, each consisting of a DUT attached to an inverting buffer, followed by an XOR gate and connected to output buffers. DUTs considered in the experiment include INVX1, INVX2, NAND2X1 and NOR2X1. Cascaded output buffers are designed with drive strengths of 64X and 256X to drive the output PAD. To determine load-dependent delays, a configurable load with eight values (ranging from a gate size of 1X to 31X) is attached to the DUT in each basic circuit set, with a set of control signals to enable and disable the individual loads. SPICE simulation and real-silicon measurements under different PVTs (Process: Slow, Typical, Fast; Voltage: 1.35V, 1.5V, 1.65V; Temperature: 0°C, 25°C, 125°C) have been applied to all the chains.
Fig. 3 shows the SPICE simulation based on the parasitic-extracted netlist for the 2-delay-chain technique. The DUTs are the INVX2 and INVX1 gates used in delay chain 1 and delay chain 2 respectively. Signal sel_sw is initialized to 0V and set to V_DD at 1.6ns, connecting delay chain 1 to the configurable load, followed by chain 2. It is clear that the pulse width varies, indicating the delay for the different load values driven by the DUT, while an unchanged pulse width indicates that the delay chain has been disconnected from the configurable load.
Fig. 4 shows the silicon measurement results for the INVX1 chain driving the largest load (i.e. gate size of 31X), where the output pulses reflect the cell fall and cell rise delay corresponding to the input rise and fall transitions respectively. Note that the voltage swing is limited to 220mV due to impedance mismatch in the measurement instrument used. The maximum error observed was 11.50% at a load of 0.05pF. On the other hand, the minimum error occurred at the slow library with a value of 0.03% at a load of 0.08pF.
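The error figures quoted above compare a measured delay against a reference delay. A minimal sketch of that comparison is given below, assuming the error is the absolute relative difference with the library value as reference. The delay values in the two dictionaries are hypothetical placeholders, not measured data.

```python
# Sketch of a per-load error comparison. The error definition is an
# assumption, and all delay values below are hypothetical placeholders.

def percent_error(measured, reference):
    """Absolute relative error of measured against reference, in percent."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(measured - reference) / abs(reference) * 100.0


# Delay in ps keyed by load capacitance in pF (hypothetical numbers).
library_delay = {0.02: 60.0, 0.05: 100.0, 0.08: 150.0}
silicon_delay = {0.02: 61.0, 0.05: 110.0, 0.08: 150.3}

errors = {load: percent_error(silicon_delay[load], library_delay[load])
          for load in library_delay}
worst_load = max(errors, key=errors.get)   # load with the largest error
best_load = min(errors, key=errors.get)    # load with the smallest error
```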
B. Area Analysis and Discussion
In the load sharing technique, the area reduction depends on the size and drive strength of the DUT, the cells used as output load, the number of chains sharing the same load, N_C, and the number of sets, N. Fig. 6 shows the result of applying equation (9) to the case where the DUT is the smallest cell (i.e. INVX1) and the number of basic circuit sets is N = 100. By varying the size and drive strength of the cells used as output load and the number of chains N_C, the area was reduced by up to 83%. In this case, load cells that are small compared with the DUT proved inefficient. Hence, the technique is effective only for large output loads (i.e. loads larger than 8X in gate size). On the other hand, the technique becomes more efficient as more chains share the same load.
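The two trends noted above, that small loads are inefficient and that more chains help, can be illustrated with a per-set simplification of the area comparison. The model is an assumption: the XOR gate, output buffers and decoder are ignored, a conventional set costs A_DUT + A_L + A_IBUFF per chain, and a shared set costs A_DUT + A_SW + A_IBUFF per chain plus one A_L. The unit areas are hypothetical and are not expected to reproduce the reported 83%.

```python
# Per-set sketch of the saving trend. The model and the unit areas are
# assumptions for illustration; they are not the data behind Fig. 6.

def per_set_saving(n_chains, a_load, a_dut=1.0, a_ibuff=1.0, a_sw=1.0):
    """Fractional area saving of one shared set versus n_chains private sets."""
    conventional = n_chains * (a_dut + a_load + a_ibuff)
    shared = n_chains * (a_dut + a_sw + a_ibuff) + a_load
    return 1.0 - shared / conventional


def break_even_load(n_chains, a_sw=1.0):
    """Load area at which sharing neither saves nor costs area.

    Follows from n_chains * a_sw + a_load = n_chains * a_load.
    """
    if n_chains < 2:
        raise ValueError("sharing needs at least two chains")
    return a_sw * n_chains / (n_chains - 1)


# Sweep the number of chains and the load area (hypothetical unit areas).
table = {(n_c, load): per_set_saving(n_c, load)
         for n_c in (2, 4, 8)
         for load in (1.0, 8.0, 31.0)}
```

In this simplified model a load no larger than the switch gives a negative saving, and the saving grows with both the load area and the number of chains, which matches the qualitative behaviour described in the text.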
IV. CONCLUSION
The effectiveness of the proposed load sharing technique has been evaluated in SPICE simulation and on silicon with several test chips. The functionality and timing delay of the library-cell DUTs driving different loads at different PVTs were validated against the information provided in the standard cell library and show good correlation. The accuracy, with a maximum error of 11.50%, is acceptable compared with that of the conventional single delay chain technique. The total area reduction of the load sharing technique relative to the conventional single delay chain approach is a function of the number of chains and the load size, and can reach up to 83%.
