programmable power-gating devices. The first method improves the maximum operating frequency of designs implemented with multiple power-gating domains by adjusting the strength of power-gating devices, domain by domain, which is followed by scaling global supply voltage for higher operating frequency. Our experimental results showed that the proposed method improved the maximum operating frequency by 5%-21% for 2-, 4-, 8-, and 16-core processors.
The second method recovers discarded dies due to excessive active leakage power; a necessary amount of active leakage power is reduced by the strength of power-gating devices until the dies are brought back into the acceptable operating region. To demonstrate the effectiveness of the method, we examined two different design scenarios: 1) ASIC-type fixed leakage power and 2) processor-type variable leakage power constraints. Our experiments demonstrated that about 88% and 98% of discarded dies could be recovered by the proposed methods in two design scenarios, respectively. Shin, "Frequency and yield optimization using power gates in power-constrained designs,"
I. INTRODUCTION
Functional test is commonly used in industry to target defects that are not detected by structural tests [1]. An advantage of functional test is that it avoids overtesting since it is performed in normal functional mode. In contrast, structural test is accompanied by some degree of yield loss [2].
Given a large pool of functional test sequences (for example, design verification sequences), it is necessary to develop an efficient method to select a subset of sequences for manufacturing testing. Since functional test sequences are much longer than structural tests, it is time-consuming to grade functional test sequences using traditional gate-level fault simulation methods.
The evaluation of functional test sequences is a daunting task if we consider the sheer number of cycles one may have to simulate to evaluate the fault coverage (assuming we know what fault model we are going to grade against). For example, consider a functional test sequence that is equivalent to one second of runtime on a processor with a 3 GHz clock frequency. This means that we have to simulate 3 billion cycles to evaluate the fault coverage. Even for stuck-at or transition fault models, simulation of the order of a billion cycles on a small fault sample on a reasonably large server farm can take months to complete. Furthermore, for system-level tests that are often created to catch circuit marginality and timing (speed path) errors, if we try to grade these tests on delay fault models, the time taken would be orders of magnitude more than the time needed to grade on transition or stuck-at fault models.
To quickly estimate the quality of functional tests, a high-level coverage metric for estimating the gate-level coverage of functional tests is proposed in [3]. However, this approach requires considerable time and resources for the extraction of coverage objects. In particular, experienced engineers and manual techniques are needed to extract the best coverage objects. In [4], another coverage metric is proposed to estimate the gate-level fault coverage of functional test sequences. However, this method does not take the observability of flip-flops into consideration. Another drawback of this method is that it is impractical for large designs since it requires gate-level logic simulation. In this paper, we propose output deviations as a metric at register-transfer level (RTL) to grade functional test sequences [5]. The deviation metric at the gate level has been used in [6] to select effective test patterns from a large repository of n-detect test patterns. It has also been used in [7] to select appropriate linear feedback shift register (LFSR) seeds for LFSR-reseeding-based test compression.
II. OUTPUT DEVIATIONS AT RTL: PRELIMINARIES
First, we define the concept of transition count (TC) for a register. Typically, there is dataflow between registers when an instruction is executed and the dataflow affects the values of registers. For any given bit of a register, if the dataflow causes a change from 0 to 1 (1 to 0), we record that there is a $0 \to 1$ ($1 \to 0$) transition. If the dataflow makes no change to this bit of the register, we record that there is a $0 \to 0$ transition or a $1 \to 1$ transition, as the case may be.
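The four transition types can be tallied directly from a trace of register values. The following is a minimal sketch, assuming the register contents are available as one integer per instruction; the function name `transition_counts` and the 4-bit trace are hypothetical and chosen only for illustration.

```python
from collections import Counter


def transition_counts(values, width):
    """Count per-bit transitions between consecutive register values.

    values: successive integer contents of one register (hypothetical trace).
    width:  register width in bits.
    Returns a Counter keyed by "00", "01", "10", "11", where "01" means 0 -> 1.
    """
    counts = Counter({"00": 0, "01": 0, "10": 0, "11": 0})
    for prev, curr in zip(values, values[1:]):
        for bit in range(width):
            before = (prev >> bit) & 1
            after = (curr >> bit) & 1
            counts[f"{before}{after}"] += 1
    return counts


# Hypothetical 4-bit register trace: three dataflow events, four bits each.
trace = [0b0000, 0b0101, 0b0111, 0b0010]
tc = transition_counts(trace, width=4)
print(dict(tc))
```

Each dataflow event contributes exactly one transition per bit, so the four counts always sum to (number of events) x (register width).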
After a transition occurs, the value of the bit of a register can be correct or faulty (due to an error). When an instruction is executed, there may be several transitions in the bits of a register. Therefore, an error may be manifested after the instruction is executed. We define the output deviation for instruction $I_j$, $\Delta(j)$, as the probability that an error is produced when $I_j$ is executed.
Similarly, since a functional test sequence is composed of several instructions, we define the output deviation for a functional test sequence to be the probability that an error is observed when this functional test sequence is executed. For example, suppose a functional test sequence, labeled $T_1$, is composed of instructions $I_1, I_2, \ldots, I_N$, and let $\Delta(j)$ be the deviation for $I_j$, as defined above. The output deviation for $T_1$, $\Delta^*(T_1)$, is defined as $\Delta^*(T_1) = \sum_{j=1}^{N} \Delta(j)$. This corresponds to the probability that an error is produced when $T_1$ is executed.
III. DEVIATION CALCULATION AT RTL
We consider three contributors to output deviations. The first is the TC of registers. The higher the TC for a functional test sequence, the more likely it is that this functional test sequence will detect defects. The second contributor is the observability of a register. The TC of a register will have little impact on defect detection if its observability is so low that transitions cannot be propagated to primary outputs. The third contributor is the amount of logic connected to a register.
A. Observability Vector
The output of each register is assigned an observability value. The observability vector for a design at RTL is composed of the observability values of all its registers. Let us consider the calculation of the observability vector for the Parwan processor [8]; see Fig. 1(a).
From the instruction-set architecture and RTL description of Parwan, we extract the dataflow diagram to represent all possible functional paths. Fig. 1(b) shows the dataflow graph of Parwan. Each node represents a register. The IN and OUT nodes represent memory. A directed edge between registers represents a possible functional path between registers. For example, there is an edge between the AC node and the OUT node. This edge indicates that there exists a possible functional path from register AC to memory.
From the dataflow diagram, we can calculate the observability vector. The primary output OUT has the highest observability since it is directly observable. Using a sequential-depth-like measure for observability, we define the observability value of OUT to be 0, written as $OUT_{obs} = 0$. For every other register node, we define its observability parameter as 1 plus the minimum of the observability parameters of all its fanout nodes.
For example, the fanout nodes of register AC are OUT, SR, and AC itself. Thus the observability parameter of register AC is 1 plus the minimal observability parameter among OUT, SR, and AC. That is, the observability parameter of register AC is 1. In the same way, we can obtain the observability parameters for MAR, PC, IR, and SR. We define the observability value of a register as the reciprocal of its observability parameter. Finally, we obtain the observability vector for Parwan. It is simply $(1/AC_{obs}, 1/IR_{obs}, 1/PC_{obs}, 1/MAR_{obs}, 1/SR_{obs})$, i.e., $(1, 0.5, 1, 1, 0.5)$.
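The recursive definition of the observability parameter can be evaluated by simple relaxation over the dataflow graph. The sketch below is illustrative only: the fanout list of AC is taken from the text, whereas the edges for IR, PC, MAR, and SR are assumptions chosen so that the result matches the vector stated above; they are not taken from Fig. 1(b).

```python
def observability_parameters(fanout, output="OUT"):
    """obs(OUT) = 0; obs(r) = 1 + min over fanout nodes of r.

    fanout: dict mapping each register to the list of its fanout nodes.
    Iterates until no parameter can be lowered further.
    """
    obs = {output: 0}
    changed = True
    while changed:
        changed = False
        for reg, succs in fanout.items():
            # A self-loop can never yield the minimum, so it is skipped.
            known = [obs[s] for s in succs if s in obs and s != reg]
            if not known:
                continue
            candidate = 1 + min(known)
            if candidate < obs.get(reg, float("inf")):
                obs[reg] = candidate
                changed = True
    return obs


fanout = {
    "AC": ["OUT", "SR", "AC"],   # stated in the text
    "IR": ["PC", "MAR"],         # assumed edges (hypothetical)
    "PC": ["OUT", "MAR", "PC"],  # assumed edges (hypothetical)
    "MAR": ["OUT"],              # assumed edge (hypothetical)
    "SR": ["AC", "SR"],          # assumed edges (hypothetical)
}
obs = observability_parameters(fanout)
vector = [1 / obs[r] for r in ("AC", "IR", "PC", "MAR", "SR")]
print(vector)
```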
B. Weight Vector
Each register is assigned a weight value, representing the relative sizes of its input cone and fanout cone. The weight vector for a design is composed of the weight values of all its registers. Obviously, if a register has a large input cone and a large fanout cone, it will affect and be affected by many lines and gates. Thus it is expected to contribute more to defect detection. In order to accurately extract this information, we need gate-level information to calculate the weight vector. We only need to report the number of stuck-at faults for each component based on the gate-level netlist. This can be easily implemented without gate-level logic simulation by an automatic test pattern generation (ATPG) tool or a design analysis tool. There are 248 stuck-at faults in AC, 136 in IR, 936 in PC, 202 in MAR, 96 in SR, 1460 in ALU, and 464 in SHU. Based on the RTL description of the design, we can determine the fanin cone and fanout cone of a register. For example, the AC register is connected to three components: AC, SHU, and ALU. Given a set of registers $\{R_i\}$, $i = 1, 2, \ldots, n$, let $f_i$ be the total number of stuck-at faults in components connected to register $R_i$. Let $f_{max} = \max\{f_1, \ldots, f_n\}$. We define the weight of register $R_i$ as $f_i/f_{max}$ to normalize the size of gate-level logic. Table I shows the numbers of faults affecting registers and weights of registers.
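The normalization $f_i/f_{max}$ can be sketched as follows. The per-component fault counts and the connectivity of AC come from the text; the connectivity assumed for IR, PC, MAR, and SR is hypothetical and is used only to make the example complete, so the resulting weights for those registers should not be read as the values in Table I.

```python
# Stuck-at fault counts per component, as reported in the text.
faults = {"AC": 248, "IR": 136, "PC": 936, "MAR": 202,
          "SR": 96, "ALU": 1460, "SHU": 464}

# Components connected to each register.
connected = {
    "AC": ["AC", "SHU", "ALU"],   # stated in the text
    "IR": ["IR", "ALU"],          # assumed (hypothetical)
    "PC": ["PC", "ALU"],          # assumed (hypothetical)
    "MAR": ["MAR", "ALU"],        # assumed (hypothetical)
    "SR": ["SR", "ALU", "SHU"],   # assumed (hypothetical)
}


def weight_vector(faults, connected):
    """Return (f, weights): f[r] sums the faults of connected components,
    and weights[r] = f[r] / max(f)."""
    f = {reg: sum(faults[c] for c in comps) for reg, comps in connected.items()}
    f_max = max(f.values())
    return f, {reg: value / f_max for reg, value in f.items()}


f, weights = weight_vector(faults, connected)
print(f["AC"], weights)
```

For AC, the only register whose connectivity is given in the text, $f_{AC} = 248 + 464 + 1460 = 2172$.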
C. Threshold Value
The higher the TC is for a functional test sequence, the more likely it is that this functional test sequence will detect defects. However, suppose that the TC is already large enough (greater than a threshold value) after the given functional test sequence is executed for a given number of clock cycles. As the functional test sequence is applied for more clock cycles, TC keeps increasing. However, higher values of TC make no significant contribution to the detection of new defects if they generate only a few new transitions in the registers. The threshold value is used to model what value of TC for a register is useful for defect detection. Each register is assigned a threshold value, representing an upper limit on the TC that is useful for defect detection. We assume that all registers are assigned identical threshold values.
The threshold value can be obtained by learning from a few of the given functional test sequences. We only need to run logic simulation at RTL for these test sequences. Suppose that we have $k$ registers in the design. Our goal is to learn from $m$ functional test sequences $T_1, T_2, \ldots, T_m$. The parameter $m$ is user-defined, and it can be chosen based on how much pre-processing can be carried out in a limited amount of time. We use the following steps to obtain the threshold.
• Run RTL Verilog simulation for the selected m functional test sequences. Record the new transition counts on all registers for each clock cycle during the execution of the test sequence;
• For each of the m functional test sequences, plot the graph for cumulative new transition count (CNTC). The x-axis represents the number of clock cycles. The y-axis represents the CNTC, summed over all registers.
• From the above graph, find the "critical point" on the x-axis, after which the CNTC curve flattens out. Formally speaking, the critical point is the x-axis value at which the CNTC reaches 98% of the maximum CNTC value. Denote the critical points for the $m$ test sequences as the set $\{cp_1, \ldots, cp_m\}$.
• Obtain the aggregated TC for all registers up to the critical point. Denote the aggregated TC for the $m$ test sequences by the vector $(th_1, th_2, \ldots, th_m)$. Given $k$ registers in the design, we set the threshold value $th$ to be the average $(th_1/k + \cdots + th_m/k)/m$.
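The steps above can be sketched as follows, assuming that RTL simulation has already produced, for each learning sequence, the per-cycle new transition count and the per-cycle total TC, both summed over all registers. The function names and the numeric traces are hypothetical, and interpreting "aggregated TC" as the sum of per-cycle TC up to the critical point is an assumption of this sketch.

```python
def critical_point(new_tc_per_cycle, fraction=0.98):
    """Return the first clock cycle (1-based) at which the cumulative new
    transition count (CNTC) reaches `fraction` of its maximum value."""
    cntc, total = [], 0
    for new_tc in new_tc_per_cycle:
        total += new_tc
        cntc.append(total)
    target = fraction * cntc[-1]
    for cycle, value in enumerate(cntc, start=1):
        if value >= target:
            return cycle
    return len(cntc)


def learn_threshold(sequences, k):
    """sequences: list of (new_tc_per_cycle, tc_per_cycle) pairs, one per
    learning sequence; k: number of registers in the design.
    Returns th = (th_1/k + ... + th_m/k) / m."""
    per_sequence = []
    for new_tc, tc in sequences:
        cp = critical_point(new_tc)
        per_sequence.append(sum(tc[:cp]) / k)
    return sum(per_sequence) / len(per_sequence)


# Two hypothetical learning sequences for a design with k = 4 registers.
seq1 = ([10, 8, 5, 1, 0, 0], [12, 12, 12, 12, 12, 12])
seq2 = ([20, 5, 0, 0], [16, 16, 16, 16])
th = learn_threshold([seq1, seq2], k=4)
print(critical_point(seq1[0]), critical_point(seq2[0]), th)
```

In this toy data the CNTC curves flatten at cycles 4 and 2, giving $th_1 = 48$ and $th_2 = 32$, and hence $th = (48/4 + 32/4)/2 = 10$.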
D. Calculation of Output Deviations
The output deviations can be calculated for functional test sequences using the TCs, the observability vector, the weight vector and the threshold value. First, we define the countable TCs for registers. Here th is the threshold value for the design.
Suppose TS is composed of $q$ instructions $I_1, I_2, \ldots, I_q$. For each instruction $I_j$, suppose the countable $0 \to 0$ TC of register $R_i$ ($1 \le i \le k$) for instruction $I_j$ is $t^*_{ij00}$, the countable $0 \to 1$ TC of register $R_i$ for $I_j$ is $t^*_{ij01}$, the countable $1 \to 0$ TC of register $R_i$ for $I_j$ is $t^*_{ij10}$ [...] $\{S_{i00} + S_{i01} + S_{i10} + S_{i11}\}$. (3)
Note that $S_{i00}$ represents the aggregated contributions of the countable $0 \to 0$ TC of register $R_i$ for all the instructions in TS. The parameters $S_{i01}$, $S_{i10}$, and $S_{i11}$ are defined in a similar way.
Let 
IV. EXPERIMENTAL RESULTS
We next evaluate the efficiency of deviation-based test-sequence grading by performing experiments on the Biquad filter core [9] and the Scheduler module of the Illinois Verilog Model (IVM) [10]. A synthesized gate-level design for the Biquad filter consists of 160 147 gates and 1116 flip-flops. IVM employs a microarchitecture that is similar in complexity to the Alpha 21264 [11]. It has most of the features of modern microprocessors, featuring superscalar operation, dynamic scheduling, and out-of-order execution. A synthesized gate-level design for it consists of 375 061 gates and 8590 flip-flops.
A. Results for Biquad Filter
1) Experimental Setup:
All experiments were performed on a 64-bit Linux server with 4 GB memory. Design Compiler (DC) from Synopsys was used to extract the gate-level information for calculating the weight vector. Synopsys Verilog Compiler (VCS) was used to run Verilog simulation and compute the deviations. A commercial tool was used to run gate-level fault simulation, and MATLAB was used to obtain Kendall's correlation coefficient [12]. A coefficient of 1 indicates perfect correlation while a coefficient of 0 indicates no correlation.
We adopt six test sequences, labeled $TS_1, TS_2, \ldots, TS_6$, respectively. Each sequence consists of 500 clock cycles of vectors. $TS_1$, $TS_2$, and $TS_3$ are generated according to the functionality of the Biquad filter. The data input in $TS_1$, $TS_2$, and $TS_3$ is composed of samples from a sinusoid signal with 1 kHz frequency, a cosine signal with 15 kHz frequency, and white noise. In $TS_4$, $TS_5$, and $TS_6$, the data input is randomly generated.
2) Correlation Between Output Deviations and Gate-Level Fault Coverage:
For these six functional test sequences, we derived Kendall's correlation coefficient between their RTL output deviations and their gate-level fault coverage. The stuck-at fault model is used, as well as the transition fault model, the bridging-coverage estimate (BCE) [13], and the modified BCE measure (BCE+). First we calculate the deviations for the six functional test sequences according to the method described in Section III. For each threshold value, we obtain the deviation values for the six functional test sequences and record them as a vector. For example, for the threshold value $TH_1$, the deviations are recorded as $DEV_1(dev_1, dev_2, \ldots, dev_6)$.
We next obtain test patterns in extended-VCD (VCDE) format for the six functional test sequences by running Verilog simulation. Using the VCDE patterns, gate-level fault coverage is obtained by running fault simulation for the six functional test sequences. The coverage values are recorded as a vector: $COV(cov_1, cov_2, \ldots, cov_6)$.
In order to evaluate the effectiveness of the deviation-based functional test-grading method, we calculate Kendall's correlation coefficient between $DEV_1$ and $COV$. Fig. 2 shows the correlation between deviations and stuck-at/transition fault coverage, as well as between deviations and the BCE/BCE+ metrics, for different threshold values. For stuck-at faults and bridging faults, we see that the coefficients are very close to 1 for all three threshold values. For transition faults, the correlation is lower, but still significant. The results demonstrate that the deviation-based method is effective for grading functional test sequences. The total CPU time for deviation calculation and test-sequence grading is less than 2 h. The CPU time for gate-level stuck-at (transition) fault simulation is 12 (19) h, and the CPU time for computing the gate-level BCE (BCE+) measure is 15 (18) h.
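Kendall's coefficient compares how two vectors rank the same items, which is why it suits test-sequence grading. The sketch below is a plain-Python version of the tie-free form (concordant minus discordant pairs, divided by the number of pairs); it is not the MATLAB routine used in the experiments, and the deviation and coverage values are hypothetical placeholders rather than measured data.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's rank correlation for two equal-length vectors without ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical deviation and coverage vectors for six test sequences.
dev = [0.82, 0.75, 0.91, 0.40, 0.55, 0.47]
cov = [71.2, 69.8, 74.5, 60.1, 65.3, 66.0]
tau = kendall_tau(dev, cov)
print(round(tau, 4))
```

In this toy data only one of the 15 pairs (the fifth and sixth sequences) is ranked differently by the two vectors, so the coefficient is $(14 - 1)/15 \approx 0.87$.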
3) Cumulative Gate-Level Fault Coverage (Ramp-Up): We next evaluate the effectiveness of output deviations by comparing the cumulative gate-level fault coverage of several reordered functional test sequences. For the Biquad filter, traditional stuck-at fault coverage as well as BCE+ metric are considered.
These reordered sets are obtained in four ways: 1) baseline order $TS_1, TS_2, \ldots, TS_6$; 2) random ordering $TS_1, TS_3, TS_2, TS_6, TS_4, TS_5$; 3) random ordering $TS_3, TS_5, TS_1, TS_2, TS_4, TS_6$; 4) the descending order of output deviations. In the deviation-based method, test sequences with higher deviations are ranked first. Fig. 3 shows the cumulative stuck-at fault coverage for the four reordered functional test sequences following the above four orders. Similar results are obtained for the BCE+ metric [5]. The deviation-based method results in the steepest curves for cumulative stuck-at coverage and the BCE+ metric.
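The ramp-up comparison can be illustrated with a small sketch that orders sequences by descending deviation and accumulates the set of detected faults. All names and numbers here are hypothetical: in practice the per-sequence detected-fault sets would come from gate-level fault simulation.

```python
def cumulative_coverage(order, detected, total_faults):
    """Return the cumulative fault coverage after each sequence in `order`.

    detected: dict mapping sequence name -> set of detected fault IDs.
    """
    seen = set()
    curve = []
    for name in order:
        seen |= detected[name]
        curve.append(len(seen) / total_faults)
    return curve


# Hypothetical deviations and detected-fault sets for three sequences.
deviations = {"TS1": 0.6, "TS2": 0.3, "TS3": 0.9}
detected = {
    "TS1": {1, 2, 3, 4},
    "TS2": {3, 4, 5},
    "TS3": {1, 2, 3, 4, 6, 7},
}
total_faults = 10

baseline = ["TS1", "TS2", "TS3"]
by_deviation = sorted(deviations, key=deviations.get, reverse=True)

baseline_curve = cumulative_coverage(baseline, detected, total_faults)
deviation_curve = cumulative_coverage(by_deviation, detected, total_faults)
print(by_deviation, baseline_curve, deviation_curve)
```

Both orders end at the same final coverage; the ordering only changes how quickly that coverage is reached, which is what the steepness of the curves in Fig. 3 captures.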
B. Results for Scheduler Module
The experimental setup and procedure are similar to those for the Biquad filter: 1) obtain the RT-level deviations by considering TCs, observability vector, weight vector, and threshold value; 2) obtain the gate-level fault coverage; 3) calculate the correlation coefficient and compare the coverage ramp-up curves.
1) Scheduler Module:
The Scheduler dynamically schedules instructions, and it is a key component of the IVM architecture. It contains an array of up to 32 instructions waiting to be issued and can issue 6 instructions in each clock cycle.
2) Experimental Setup: The experimental setup is the same as that for the Biquad filter. All experiments are based on 10 functional test sequences. Six of them are composed of instruction sequences, referred to as $T_0$, $T_1$, $T_2$, $T_3$, $T_4$, and $T_5$. The other four, labeled $T_6$, $T_7$, $T_8$, and $T_9$, are obtained indirectly by sequential ATPG. First, cycle-based gate-level stuck-at ATPG is carried out. From the generated patterns, stimuli are extracted and composed into a functional test sequence, labeled $T_6$. Next we perform "NOT", "XNOR", and "XOR" bit-operations on $T_6$ separately and obtain new functional test sequences $T_7$, $T_8$, and $T_9$. $T_7$ is obtained by inverting each bit of $T_6$; $T_8$ is obtained by performing the "XNOR" operation between the adjacent bits of $T_6$; $T_9$ is obtained by performing the "XOR" operation between the adjacent bits of $T_6$. The above bit-operations do not affect the clock and reset signals in the design.
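The derivation of $T_7$, $T_8$, and $T_9$ from $T_6$ can be sketched as below. The text does not specify how "adjacent bits" are paired at the end of a vector, so this sketch assumes that bit $i$ is combined with bit $i+1$ of the same input vector, wrapping around at the end; that pairing, the function name `derive_sequences`, and the representation of clock and reset as a set of protected bit positions are all assumptions.

```python
def derive_sequences(t6, protected=frozenset()):
    """Derive NOT, XNOR, and XOR variants of a test sequence.

    t6:        list of input vectors, each a list of 0/1 bits.
    protected: bit positions (e.g., clock and reset) that are left unchanged.
    Adjacent-bit pairing with wrap-around is an assumption of this sketch.
    """
    def apply(vector, op):
        n = len(vector)
        return [
            bit if i in protected else op(bit, vector[(i + 1) % n])
            for i, bit in enumerate(vector)
        ]

    t7 = [apply(v, lambda a, b: 1 - a) for v in t6]        # NOT
    t8 = [apply(v, lambda a, b: 1 - (a ^ b)) for v in t6]  # XNOR
    t9 = [apply(v, lambda a, b: a ^ b) for v in t6]        # XOR
    return t7, t8, t9


# One hypothetical 5-bit input vector; position 0 stands in for a clock pin.
t6 = [[1, 0, 1, 1, 0]]
t7, t8, t9 = derive_sequences(t6, protected={0})
print(t7, t8, t9)
```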
3) RTL Deviations: Using the deviation-calculation method in Section III, we calculate the RTL deviations for the Scheduler module by considering the parameter TC, weight vector, observability vector and threshold value.
We obtain the threshold value according to the description in Section III. We randomly select four test sequences, $T_0$, $T_1$, $T_6$, and $T_7$, for learning the threshold value. For each of them, we run Verilog simulation and generate the graph of cumulative new transition counts. Given the information of the TCs, the weight vector, the observability vector, and the threshold value, RTL deviations can be calculated. The total CPU time for deviation calculation and test-sequence grading is less than 8 h. The CPU time for gate-level stuck-at (transition) fault simulation is 110 (175) h, and the CPU time for computing the gate-level BCE (BCE+) measure is 120 (125) h.
4) Correlation Between Output Deviations and Gate-Level Fault Coverages:
We obtain the stuck-at and transition fault coverages for the functional test sequences by running fault simulation at the gate level. The correlations between these gate-level fault coverage measures and RTL deviations are computed, and Kendall's correlation coefficients are shown in Fig. 4. As in the case of the Biquad filter, the correlation coefficients are close to 1 for all three threshold values. This demonstrates that the RTL deviations are a good predictor of the gate-level fault coverage.
5) Cumulative Gate-Level Fault Coverage (Ramp-Up):
We evaluate the cumulative gate-level fault coverage of several reordered functional test sequences. The results are presented in [5] , [14] . We find that the deviation-based method results in the steepest cumulative stuck-at, transition and bridging fault coverage curves.
C. Results for Different Gate-Level Implementations
There can be different gate-level implementations for the same RTL design. For the Scheduler module, all the experiments reported thus far were carried out using a specific gate-level netlist, which we call imp1. We next evaluate whether the proposed RT-level deviation-based functional test grading method is sensitive to gate-level implementation. We obtain two other gate-level netlists, namely imp2 and imp3, for the given Scheduler module. Note that imp1 and imp2 adopt the FREEPDK library in synthesis while imp3 uses the GSCLib3.0 library. To calculate RT-level output deviations for imp2 and imp3, we only need to recalculate the weight vector since the transition counts, observability vector, and threshold value remain unchanged for the same functional test and the same RT-level design.
The Kendall's correlation coefficient values are shown in Fig. 5 . Additional results are reported in [5] . As in the case of imp1, the correlation coefficients are close to 1 for all the CL vectors for imp2 and imp3. These results highlight the fact that RTL deviations are a good predictor of gate-level defect coverage for different gate-level implementations of an RTL design. We therefore conclude that RTL deviations can be used to grade functional test sequences for different gate-level implementations. 
D. Stratified Sampling for Threshold-Value Setting
Here we address the problem of threshold-value setting from a survey sampling perspective [15] . The given N functional test sequences constitute the population for the survey. The CNTC is the characteristic under study. The threshold value is estimated by simulating the design using a number of samples drawn from the population, a procedure referred to as sampling. Stratified sampling [15] has been widely used for surveys because of its efficiency. We partition the given functional test sequences into subgroups according to their sources. For the Scheduler module, we partition its 10 functional test sequences into two subgroups: G1 composed of six functional test sequences coming from Intel and G2 composed of four functional test sequences from ATPG.
If the sample is taken randomly from each stratum, the procedure is known as stratified random sampling. We use stratified random sampling for the Scheduler module. Table II shows the correlation results for the Scheduler module when different stratified random sampling choices are used. We can see that by setting the threshold value using stratified sampling in the calculation of RTL deviations, we consistently obtain good correlation results.
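Stratified random sampling of the learning sequences can be sketched as follows. The two strata mirror the subgroups G1 and G2 described above; the per-stratum sample sizes, the seed, and the function name `stratified_sample` are assumptions made for illustration and do not reproduce the specific sampling choices of Table II.

```python
import random


def stratified_sample(strata, sizes, seed=0):
    """Draw a simple random sample of the requested size from every stratum.

    strata: dict mapping stratum name -> list of test-sequence names.
    sizes:  dict mapping stratum name -> number of sequences to draw.
    """
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))
    return sample


strata = {
    "G1": ["T0", "T1", "T2", "T3", "T4", "T5"],  # instruction-based sequences
    "G2": ["T6", "T7", "T8", "T9"],              # ATPG-derived sequences
}
sizes = {"G1": 2, "G2": 2}  # assumed sample sizes (hypothetical)
learning_set = stratified_sample(strata, sizes)
print(learning_set)
```

The selected sequences would then be simulated at RTL to learn the threshold value, so that both sources of test sequences are represented in the learning set.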
