Abstract-Timing analysis and timing closure are critical steps in digital circuit design. To ensure an error-free design, timing constraints are usually set based upon the longest path delay from static timing analysis. However, a circuit can have dramatically different internal activity depending on the input workload. The path with the longest delay may not be active for certain input workloads, which enables timing speculation for increased performance. This paper describes an approach to identify the greatest contributors of timing errors and mitigate those errors by replacing certain standard cells in the design. We evaluated our mitigation for several benchmark designs and demonstrated an error-free performance gain of up to 37%. The entire design flow uses Synopsys electronic design automation tools and customized scripts, which can be adapted for other designs.
I. INTRODUCTION
With the scaling of fabrication technologies, transistors are more sensitive to environmental conditions, within-die variations [1], [2], and even input workload variations [3]. The demands for both performance and reliability [4], [5] are equally persistent. In the traditional design, the process corners are analyzed to determine the worst-case delay, and then designers include additional timing margin as a guard band to ensure error-free operation. However, analysis of the process corners can be extremely pessimistic for the actual circuit, leaving a large amount of slack in both timing and energy [6].
Better-than-worst-case (BTWC) design, introduced by Intel engineer Bob Colwell, emphasizes operating for the average case. Timing errors are allowed to occur up to a certain, well-balanced threshold in order to avoid overdesign as much as possible. An error detection and correction (EDAC) module is required in BTWC design. Because error correction carries a penalty, a properly selected operating frequency is crucial in order to have a positive gain in overall performance [7].
In this paper, we aim to create a convenient way to obtain the typical circuit activity under a given workload. By determining the largest error-contributors in the design, it is possible to reduce timing errors in a systematic way. This paper contributes the following.
1) An all-clock-frequency error-estimation method, which is based on statistical analysis of the settling time for all outputs at each cycle.
2) An offline error-checking method that does not require a test bench to compare models operating in parallel.
3) Identification of the largest error-contributing outputs based on the circuit activity for a given workload.
4) Identification of the high-impact cells on the fan-in cone of the error-contributing outputs.
5) Resynthesis of the circuit with low-V t cells to decrease the propagation delay in order to reduce most timing errors [8].
Our methods are evaluated with selected ISCAS85 benchmark circuits, which represent different classes of functionality. The error-free performance gain for each circuit was the following: 1) 33.9% for c432; 2) 25.63% for c880; 3) 22.21% for c1908; and 4) 37.71% for c6288. The potential speedup could be higher if the design can tolerate an acceptable error rate greater than zero. We selected a 30% clock increase as an example comparison point to show the achievement of our error-rate reduction. The leakage power was also examined to quantify the cost-benefit tradeoff for the different benchmark circuits.
The rest of this paper is organized as follows. Section II gives a brief background review of methods for dynamic activity analysis and for error checking. Section III introduces the overall design flow and the details of the proposed error-checking and error-estimation methods; it also describes the dual-threshold error-reduction approach. In Section IV, the results are discussed for the selected benchmark circuits. The summary is provided in Section V.
II. BACKGROUND
The key idea of BTWC design is to improve performance (e.g., speed or power) by breaking the traditional circuit design boundary to allow certain errors to occur during normal operation, while preserving correct operation by adding EDAC to the design. The approaches to BTWC design can be categorized into three main groups: 1) aggressive runtime adjustment of clock frequency or supply voltage coupled with circuitry [9]-[13]; 2) logic circuitry based on the circuit's dynamic behavior [14]-[16]; and 3) error detection using approximate logic circuits for sensitive paths [17], [18].
Because two primary outputs (POs) with the same static path delay can have dramatically different path settling behavior under the same input workload, the error-reduction method should be applied selectively to the error-contributing paths/cells. Several works obtain the circuit's typical internal activity, e.g., BlueShift [14], DynaTune [15], the timed ternary decision diagram [19], and the common case promotion method [16]. All of these methods are based on a timed characteristic function (TCF) and a binary decision diagram (BDD). However, the nature of the TCF and BDD makes the calculation too complex to apply to large circuits.
The achievements presented in this paper are as follows. 1) Reduce errors efficiently for timing speculation without major modifications to the original circuitry by considering the impact of input workload variance on the circuit's activity. 2) Create a universal design/simulation flow with commercial tools that enables accurate error estimation for a range of operating clock frequencies, which provides the insight for a BTWC designer to maximize performance gain. 3) Implement an offline error-checking method that allows designers to perform the desired cost-benefit analysis.
III. METHODOLOGY
This section discusses the details of our methodology. Section III-A gives the overview of the design flow. Section III-B describes the offline error-checking method. Section III-C describes the error-estimation method developed for this paper. Section III-D explains the identification of the error-contributing outputs and the selection of key cells for error reduction under a given input workload.
A. Experimental Setup and General Work Flow
The whole design flow uses the Synopsys design tool suite (Design Compiler, IC Compiler, PrimeTime, and VCS) with the Synopsys 32-nm library [20] and customized Python scripts. Four benchmark circuits from ISCAS85 were used to represent four different types of functions, shown in Table I. The gate-level netlist is simulated with back-annotated timing information (.sdf file) in VCS. The value change dump (VCD) file records the switching activity for all nodes. The customized scripts process the data to obtain the timing error information; the scripts also modify the netlist accordingly. Then, the design is resynthesized and goes through place and route as a regular step of the design flow. This method integrates with the standard electronic design automation (EDA) flow and could be applied to any design. Fig. 1 shows the flow chart of the design flow. The yellow steps represent the main contribution described in this paper.
B. Error Checking Method
As discussed in Section II, many kinds of error detection circuits exist, which require special circuitry for parallel comparison. The proposed error-checking module was developed to perform statistical analysis of errors after simulation in order to inform our mitigation approach. Fig. 2 compares the general structure of two previous error-checking methods with the one used in this paper. The regular error-checking method requires either the golden circuit or the delay element in the simulation to detect the errors, and the test bench needs to be modified accordingly for every design under test. The proposed method uses customized Python scripts to extract information from the VCD files of the golden circuit and the tested circuits. The VCD file is the essential raw data for this work: the error detection method, the error estimation method, and the cell activity analysis are all based on it. The proposed error detection method does not need a special test bench. Error detection and analysis can be performed on the desired nodes directly from the simulation results. To detect errors, the settled value at every cycle of the desired nodes is extracted from both the tested and the golden VCD file; this information is compared cycle-by-cycle using scripts. The saved VCD file can be reused for different types of analysis and comparison. This setup structure and the scripts are suitable for multiple types of circuits.
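The cycle-by-cycle comparison described above can be illustrated with a minimal Python sketch. It is not the scripts used in this work: it assumes the transitions of one node have already been extracted from each VCD file as sorted (time, value) pairs, and the function names `settled_values` and `mismatched_cycles` are hypothetical.

```python
from bisect import bisect_right

def settled_values(transitions, clock_period, num_cycles, initial="x"):
    """Return the value a node holds at the end of each clock cycle.

    transitions: list of (time, value) pairs sorted by time, assumed to be
    pre-extracted from a VCD file for a single node.
    """
    times = [t for t, _ in transitions]
    values = []
    for cycle in range(num_cycles):
        sample_time = (cycle + 1) * clock_period
        # Index of the last transition at or before the sampling edge.
        idx = bisect_right(times, sample_time) - 1
        values.append(transitions[idx][1] if idx >= 0 else initial)
    return values

def mismatched_cycles(golden, tested):
    """Return the cycle indices where the tested value differs from golden."""
    return [c for c, (g, t) in enumerate(zip(golden, tested)) if g != t]

# Golden run at a 2.0 ns period, tested run at a shortened 1.5 ns period.
golden = settled_values([(0.5, "1"), (2.9, "0"), (4.4, "1")], 2.0, 3)
tested = settled_values([(0.5, "1"), (3.1, "0"), (3.4, "1")], 1.5, 3)
errors = mismatched_cycles(golden, tested)
```

In this made-up example the tested node switches at 3.1 ns, just after the second sampling edge at 3.0 ns, so the second cycle is flagged as an error. Because both value lists are indexed by cycle rather than by absolute time, runs at different clock periods can be compared without a dedicated test bench.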
C. Identification of Error-Contributing Outputs and Error Estimation
Whether an error is observed depends on the settling time relative to the required operating clock period. The circuit outputs can be separated into three groups.
1) All paths to the output are shorter than the operating clock period; these paths are classified as error-free.
2) The worst-case propagation delay to the output is longer than the operating clock period; these paths are categorized as error-possible.
3) The subset of group 2) that is highly likely to settle after the required operating clock period for the given workload; these paths are the error-contributors.
The output's activity level is not accurate enough to identify the error-contributors, because only the switching activity that occurs after the required clock period creates errors. Therefore, an analysis of the settling probability is necessary for the error-possible outputs. Fig. 3 is an example showing that neither static propagation delay nor average active cycles is a reliable error indicator. Fig. 3 compares the activity level and the actual error count of each output when the clock period is 1.7 ns. The outputs are listed in ascending order of static propagation delay from top to bottom. Output N2899 has the longest static propagation delay of 2.2 ns (not shown). N2886 is the most active output, but N2891 is the largest error-contributing output. Fig. 4 shows the settling probability (zoomed view from probability 0.99 to 1) of each output. When operating at 1.7 ns, N2891 is the most unsettled output. The settling probability curve of each output is the cumulative distribution function of the output's settling histogram. Because an output may not switch every cycle, for some POs, there may be a large number of cycles that stabilize at 0 ns. The error-contributing outputs can be identified based on the settling probability. The error count can be estimated either by multiplying the probability that the output has not yet settled (one minus the settling probability at the clock period) by the number of clock cycles, or by summing the histogram bins that represent data switching after the required clock period.
Fig. 3. Benchmark C1908 outputs with the average active cycles out of 10 000 cycles (using 100 simulation trials), and the total error counts for 1 million cycles using a clock period of 1.7 ns.
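As a hedged illustration of the three output groups and the error-count estimate, the following Python sketch works on a list of per-cycle settling times for one output. The function names, the `min_errors` threshold used to decide what counts as "highly likely", and the sample numbers are assumptions made for the example, not values from this work.

```python
def settling_probability(times, clock_period):
    """Empirical CDF value: fraction of cycles settled by clock_period."""
    return sum(1 for t in times if t <= clock_period) / len(times)

def estimated_errors(times, clock_period):
    """Number of cycles in which the output settles after clock_period."""
    return sum(1 for t in times if t > clock_period)

def classify(static_delay, times, clock_period, min_errors=1):
    """Assign an output to one of the three groups described in the text.

    min_errors is an assumed threshold for calling an error-possible
    output an error-contributor.
    """
    if static_delay <= clock_period:
        return "error-free"
    if estimated_errors(times, clock_period) >= min_errors:
        return "error-contributor"
    return "error-possible"

# Per-cycle settling times in picoseconds (illustrative values only).
times = [0, 1200, 1650, 1800, 2100]
prob = settling_probability(times, 1700)
count = estimated_errors(times, 1700)
# The count also follows from the unsettled probability times the cycles.
count_from_prob = round((1 - prob) * len(times))
group = classify(2200, times, 1700)
```

Both ways of obtaining the error count agree here, which mirrors the two equivalent calculations described in the text. Evaluating `estimated_errors` at a different `clock_period` reuses the same settling-time data, so no new simulation is needed for each candidate clock.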
The settling time histogram was plotted based on the simulation VCD file of the benchmark circuit at the original clock period. The VCD file contains switching activity and switching timestamps for all nodes; we use a customized Python script to extract each output's settling timestamp for every cycle from the golden VCD file. The settling time histogram of the desired outputs can be plotted with an appropriate bin size. (We used 50 ps because it is the average propagation delay of a NAND gate for the Synopsys 32-nm library.)
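The extraction and binning step can be sketched as follows. This is a minimal sketch, assuming the transition timestamps of one output have already been read from the golden VCD file as integers in a single time unit; `settling_times` and `settling_histogram` are hypothetical helper names, and the bin size is left as a parameter.

```python
from collections import Counter

def settling_times(transition_times, clock_period, num_cycles):
    """Per-cycle settling time: offset of the last transition in each cycle.

    A cycle with no transition is recorded as settling at 0, so inactive
    cycles pile up in the first histogram bin.
    """
    last = [0] * num_cycles
    for t in transition_times:
        cycle = t // clock_period
        if 0 <= cycle < num_cycles:
            last[cycle] = max(last[cycle], t - cycle * clock_period)
    return last

def settling_histogram(times, bin_size):
    """Count settling times per bin; bin k covers [k*bin_size, (k+1)*bin_size)."""
    return Counter(t // bin_size for t in times)

# Timestamps in picoseconds for one output, golden clock period of 2200 ps.
stamps = [300, 1900, 2250, 4100, 4750]
per_cycle = settling_times(stamps, 2200, 4)
hist = settling_histogram(per_cycle, 100)
```

The fourth cycle has no transition and is recorded as settling at 0, which is why some outputs show a large first bin. Integer timestamps are used here to avoid floating-point rounding at cycle boundaries.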
As shown in Fig. 5, when the clock period is 1.7 ns, outputs N2891, N2811, and N2892 have not reached a probability of 1, and they are the error-contributing outputs for a 1.7 ns clock period. The error-count estimation for N2891, N2811, and N2892 when the clock period is 1.7 ns is 13, 4, and 1, respectively. The error-count estimation from Fig. 5 matches the simulation results in Fig. 3. Based on the simulation VCD file with the original clock period, we can estimate the errors for all possible clock frequencies accurately.
Figure caption: Full and zoomed view (clock period of 1.7-2.2 ns) of error-contributing outputs for C1908, showing the settling time histogram and the settling time probability within one million cycles.
D. Dual-Threshold Voltage Approach for Error Reduction
We use a dual-threshold voltage approach on selected cells to improve the propagation delay of the error-contributing outputs. Error-contributing outputs are identified with the error-estimation method described in Section III-C. However, multiple paths feed into each error-contributing output. Replacing all cells of the fan-in cone is impractical because low-V t cells increase leakage power. Therefore, the cells to replace must be chosen so that performance improves in a cost-effective manner.
The steps to implement the error reduction method are as follows.
1) List all error-possible outputs.
2) Identify the error-contributing outputs using the error-estimation method.
3) List cells on the error-contributing output's fan-in paths.
4) Identify the convergence point of paths to the output that start to share the same path.
5) Replace cells after the convergence point with low-V t cells.
6) Perform cell activity analysis as described in [8] on the remaining cells and replace the highly active cells accordingly. (The activity analysis method is also based on the VCD file and customized Python scripts to extract the switching activity of selected cells.)
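Steps 3) to 6) can be sketched on a toy netlist as follows. This is only one possible reading, not the procedure of this work: the netlist is assumed to be a dictionary mapping each cell to the cells or primary inputs driving it, the convergence point is interpreted as the first cell, walking back from the output, that is driven by more than one cell, and the names `fanin_cone`, `shared_tail`, `select_cells`, and the `min_toggles` threshold are all hypothetical.

```python
def fanin_cone(drivers, output):
    """All cells reachable backward from the cell driving `output`.

    drivers: dict mapping each cell to the names driving its inputs;
    primary inputs have no entry and are therefore skipped.
    """
    cone, stack = set(), [output]
    while stack:
        cell = stack.pop()
        if cell in cone or cell not in drivers:
            continue
        cone.add(cell)
        stack.extend(drivers[cell])
    return cone

def shared_tail(drivers, output):
    """Cells from the output back to the assumed convergence point."""
    tail, cell = [], output
    while cell in drivers:
        tail.append(cell)
        cell_drivers = [d for d in drivers[cell] if d in drivers]
        if len(cell_drivers) != 1:
            break  # several cells converge here, or only inputs remain
        cell = cell_drivers[0]
    return tail

def select_cells(drivers, output, toggles, min_toggles):
    """Shared-tail cells plus highly active cells in the rest of the cone."""
    tail = shared_tail(drivers, output)
    rest = fanin_cone(drivers, output) - set(tail)
    active = sorted(c for c in rest if toggles.get(c, 0) >= min_toggles)
    return tail, active

# Toy netlist: U1 and U2 converge at U3, which feeds U4 and then U5.
drivers = {"U5": ["U4"], "U4": ["U3", "b"], "U3": ["U1", "U2"],
           "U1": ["a", "b"], "U2": ["c", "d"]}
tail, active = select_cells(drivers, "U5", {"U1": 900, "U2": 40}, 500)
```

In the toy netlist, every path to the output passes through U3, U4, and U5, so those cells are candidates for low-V t replacement regardless of activity, while U1 is added only because its toggle count exceeds the assumed threshold. A real flow would read connectivity from the gate-level netlist and toggle counts from the VCD file.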
IV. SIMULATION RESULTS AND DISCUSSION
This section presents the results of the experiments on ISCAS85 benchmark circuits. With the method described previously, tedious clock-sweep simulations are not necessary to characterize the error trends. However, to verify the accuracy of the error-estimation method, the simulation clock period is swept from the traditional critical path delay down to 70% of that value.
To compare the error reduction results, each circuit has three versions of implementation. Our results show that the error reduction will be more significant if the static critical path output is not identified as an error-contributing output. Otherwise, the advantage of SCR will be diminished because of the overlap of the replaced cells. Sometimes, the error reduction merits may be hindered by the circuit's logic structure. For example, in the C6288 multiplier circuit, the paths to all other outputs are subsets of the path to output N6288. Fig. 6 shows the error reduction results as compared to the baseline circuit for both FPR and SCR when using 70% of the original clock period. Designers could select the speculation clock based on the ability of the EDAC module. The error-free timing speculation clock is also examined. Fig. 7 shows the speedup improvement of SCR for an error-free implementation. Table II shows the total error counts and the improvement from the FPR method to the SCR method. Table III lists the low-V t cell usage, and Table IV shows the leakage power comparison.
V. SUMMARY
This paper presented an offline error-checking method and an error-estimation method for all clock periods without tedious simulations. We described an approach for timing speculation that is aware of the impact of the input workload on the circuit's dynamic activity, and then used that information to implement error reduction by resynthesizing the circuit with low-threshold voltage cells for the high-impact cells on the fan-in cone of identified error-contributing outputs. This error-reduction approach removes a majority of errors while keeping the usage of low-V t cells to a minimum. This paper demonstrated improved performance and error reduction for the given input workloads. The entire design flow was based on commercial EDA tools augmented with customized scripts.
